Meta AI recently published pre-print research showing off a radical new “Megabyte” framework for building generative pre-trained transformer (GPT) systems.
Dubbed “promising” by OpenAI’s Andrej Karpathy, former director of artificial intelligence at Tesla, the new architecture is designed to process large volumes of data — such as images, novels and video files — without the use of a process known as tokenization.
Karpathy wrote on Twitter: "Promising. Everyone should hope that we can throw away tokenization in LLMs. Doing so naively creates (byte-level) sequences that are too long, so the devil is in the details. Tokenization means that LLMs are not actually fully end-to-end. There is a whole separate stage with… https://t.co/t240ZPxPm7"
Tokenization is a preprocessing step comparable to file compression. To handle large amounts of data, GPT models convert raw bytes into tokens. The transformer then processes those tokens and generates output tokens, which are decoded back into human-readable text.
Tokenization lets an AI system represent long strings of text as shorter sequences of numbers. The sentence "my favorite color is red," for example, would be converted by OpenAI's ChatGPT into the token sequence "3666, 4004, 3124, 318, 2266, 13" before processing.
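To see tokenization in action, the short sketch below uses OpenAI's open-source tiktoken library. The choice of library and of the "cl100k_base" encoding are assumptions for illustration; the article does not say which tokenizer produced the IDs quoted above, and different encodings produce different IDs.

# Illustrative only: token IDs depend on the encoding, so the output
# here may not match the IDs quoted in the article.
import tiktoken

text = "my favorite color is red"

enc = tiktoken.get_encoding("cl100k_base")   # encoding used by GPT-3.5/GPT-4-era models
tokens = enc.encode(text)

print(tokens)                                # a short list of integer token IDs
print(enc.decode(tokens))                    # round-trips back to the original text
print(len(tokens), "tokens for", len(text.encode("utf-8")), "bytes of input")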
Unfortunately, even with tokenization, the amount of data that current state-of-the-art systems can process still has a hard limit. For GPT-3.5, the limit is slightly over 4,000 tokens, or about 3,000 words, whereas GPT-4 maxes out at around 32,000 tokens, or about 24,000 words.
Meta’s new Megabyte system ditches tokenization in favor of a novel multilayer prediction architecture capable of modeling sequences of more than 1 million bytes end to end.
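According to the pre-print, Megabyte splits a byte sequence into fixed-size patches: a large global model attends across patch representations, while a smaller local model predicts the individual bytes inside each patch. The sketch below is a minimal, simplified illustration of that patch-based idea in PyTorch; the layer sizes, module names and the omission of causal masking are assumptions made for brevity, not Meta's reference implementation.

# A minimal, simplified sketch of Megabyte-style patch-based byte modeling.
# Dimensions, layer counts and the lack of causal masking are illustrative
# assumptions; this is not Meta's reference implementation.
import torch
import torch.nn as nn

class MegabyteSketch(nn.Module):
    def __init__(self, patch_size=8, d_local=256, d_global=512, vocab=256):
        super().__init__()
        self.patch_size = patch_size
        self.byte_embed = nn.Embedding(vocab, d_local)            # one embedding per byte value
        self.patch_proj = nn.Linear(patch_size * d_local, d_global)
        # Global model: attends across patches, providing long-range context cheaply.
        self.global_model = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_global, nhead=8, batch_first=True), num_layers=4)
        self.global_to_local = nn.Linear(d_global, d_local)
        # Local model: predicts the bytes inside each patch.
        self.local_model = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_local, nhead=4, batch_first=True), num_layers=2)
        self.to_logits = nn.Linear(d_local, vocab)

    def forward(self, byte_ids):                  # byte_ids: (batch, length), length % patch_size == 0
        b, t = byte_ids.shape
        p = self.patch_size
        x = self.byte_embed(byte_ids)             # (b, t, d_local)
        patches = x.view(b, t // p, p * x.size(-1))
        g = self.global_model(self.patch_proj(patches))           # (b, t // p, d_global)
        # Broadcast each patch's global state back down to its bytes.
        local_in = x + self.global_to_local(g).repeat_interleave(p, dim=1)
        h = self.local_model(local_in.reshape(b * (t // p), p, -1))
        return self.to_logits(h).view(b, t, -1)   # next-byte logits over all 256 byte values

model = MegabyteSketch()
logits = model(torch.randint(0, 256, (2, 64)))    # 64 raw bytes in, (2, 64, 256) logits out

Because the local model only ever attends within a patch, the expensive global attention runs over far fewer positions than the raw byte length, which is what makes million-byte sequences tractable.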
Most standard English-language text encodings represent each character with a single 8-bit byte, so a system that models 1 million bytes directly can handle roughly 1 million characters of English text without any tokenization step.
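A quick check illustrates the one-byte-per-character point; the snippet below is only an illustration using Python's built-in UTF-8 encoding.

text = "my favorite color is red"
raw = text.encode("utf-8")      # UTF-8 stores basic English characters as one byte each
print(len(text), len(raw))      # the two counts match: one byte per character
print(list(raw)[:5])            # the first few raw byte values a byte-level model would see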