Meta AI recently published pre-print research showing off a radical new “Megabyte” framework for building generative pre-trained transformer (GPT) systems.
Dubbed “promising” by OpenAI’s Andrej Karpathy, former director of artificial intelligence at Tesla, the new architecture is designed to process large volumes of data — such as images, novels and video files — without the use of a process known as tokenization.
Karpathy wrote on Twitter: "Promising. Everyone should hope that we can throw away tokenization in LLMs. Doing so naively creates (byte-level) sequences that are too long, so the devil is in the details. Tokenization means that LLMs are not actually fully end-to-end. There is a whole separate stage with… https://t.co/t240ZPxPm7"
Tokenization is a preprocessing step comparable to file compression. To handle large amounts of data, GPT models convert raw bytes into tokens. The transformer then processes those tokens and generates output tokens, which are decoded back into human-readable text.
Tokenization lets an AI system represent long strings of text as shorter sequences of numbers. The sentence "my favorite color is red," for example, would be converted by OpenAI's ChatGPT into the token sequence "3666, 4004, 3124, 318, 2266, 13" before processing.
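To see tokenization in action, the short sketch below uses OpenAI's open-source tiktoken library. The choice of library and of the "cl100k_base" encoding are assumptions for illustration; the article does not say which tokenizer produced the IDs quoted above, and different encodings produce different IDs.

# Illustrative only: token IDs depend on the encoding, so the output
# here may not match the IDs quoted in the article.
import tiktoken

text = "my favorite color is red"

enc = tiktoken.get_encoding("cl100k_base")   # encoding used by GPT-3.5/GPT-4-era models
tokens = enc.encode(text)

print(tokens)                                # a short list of integer token IDs
print(enc.decode(tokens))                    # round-trips back to the original text
print(len(tokens), "tokens for", len(text.encode("utf-8")), "bytes of input")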
Unfortunately, even with tokenization, the amount of data that current state-of-the-art systems can process still has a hard limit. For GPT-3.5, the limit is slightly over 4,000 tokens, or about 3,000 words, whereas GPT-4 maxes out at around 32,000 tokens, or about 24,000 words.
Meta’s new Megabyte system ditches tokenization in favor of a novel multilayer prediction architecture capable of modeling sequences of more than 1 million bytes end to end.
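According to the pre-print, Megabyte splits a byte sequence into fixed-size patches: a large global model attends across patch representations, while a smaller local model predicts the individual bytes inside each patch. The sketch below is a minimal, simplified illustration of that patch-based idea in PyTorch; the layer sizes, module names and the omission of causal masking are assumptions made for brevity, not Meta's reference implementation.

# A minimal, simplified sketch of Megabyte-style patch-based byte modeling.
# Dimensions, layer counts and the lack of causal masking are illustrative
# assumptions; this is not Meta's reference implementation.
import torch
import torch.nn as nn

class MegabyteSketch(nn.Module):
    def __init__(self, patch_size=8, d_local=256, d_global=512, vocab=256):
        super().__init__()
        self.patch_size = patch_size
        self.byte_embed = nn.Embedding(vocab, d_local)            # one embedding per byte value
        self.patch_proj = nn.Linear(patch_size * d_local, d_global)
        # Global model: attends across patches, providing long-range context cheaply.
        self.global_model = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_global, nhead=8, batch_first=True), num_layers=4)
        self.global_to_local = nn.Linear(d_global, d_local)
        # Local model: predicts the bytes inside each patch.
        self.local_model = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_local, nhead=4, batch_first=True), num_layers=2)
        self.to_logits = nn.Linear(d_local, vocab)

    def forward(self, byte_ids):                  # byte_ids: (batch, length), length % patch_size == 0
        b, t = byte_ids.shape
        p = self.patch_size
        x = self.byte_embed(byte_ids)             # (b, t, d_local)
        patches = x.view(b, t // p, p * x.size(-1))
        g = self.global_model(self.patch_proj(patches))           # (b, t // p, d_global)
        # Broadcast each patch's global state back down to its bytes.
        local_in = x + self.global_to_local(g).repeat_interleave(p, dim=1)
        h = self.local_model(local_in.reshape(b * (t // p), p, -1))
        return self.to_logits(h).view(b, t, -1)   # next-byte logits over all 256 byte values

model = MegabyteSketch()
logits = model(torch.randint(0, 256, (2, 64)))    # 64 raw bytes in, (2, 64, 256) logits out

Because the local model only ever attends within a patch, the expensive global attention runs over far fewer positions than the raw byte length, which is what makes million-byte sequences tractable.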
Most standard English-language text encodings represent each character with a single 8-bit byte, so a system that models 1 million bytes directly can handle roughly 1 million characters of English text without any tokenization step.
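A quick check illustrates the one-byte-per-character point; the snippet below is only an illustration using Python's built-in UTF-8 encoding.

text = "my favorite color is red"
raw = text.encode("utf-8")      # UTF-8 stores basic English characters as one byte each
print(len(text), len(raw))      # the two counts match: one byte per character
print(list(raw)[:5])            # the first few raw byte values a byte-level model would see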