The Challenges of Tokenization in Generative AI Models

Understanding the Token-Based Internal Environments of Generative AI Models

Generative AI models, such as Google's Gemma and OpenAI's GPT-4o, are built on an architecture called the transformer. But these models do not process text the way humans do. For technical and practical reasons, they instead work with smaller units of text called tokens, and tokenization is the process of breaking text down into these bite-sized pieces.

Tokens can be whole words, syllables, or even individual characters. Tokenization lets transformers take in more information before hitting an upper limit known as the context window, but it can also introduce biases and quirks. Odd spacing in the input, for example, can derail a transformer's understanding of a sentence. And because differently cased text is often mapped to entirely different tokens, a model may fail to recognize that "Hello" and "HELLO" mean the same thing.
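A minimal sketch can make the spacing and casing quirks concrete. The vocabulary and greedy longest-match scheme below are hypothetical toys, not any real tokenizer, but real BPE tokenizers behave similarly: "Hello", "hello", and " hello" often map to three distinct token IDs.

```python
# Toy lookup tokenizer with a hypothetical vocabulary, illustrating how
# case and leading spaces produce entirely different tokens.
VOCAB = {
    "Hello": 0, "hello": 1, " hello": 2,
    "world": 3, " world": 4, "!": 5,
}

def tokenize(text: str) -> list[int]:
    """Greedy longest-match tokenization over the toy vocabulary."""
    ids = []
    while text:
        for end in range(len(text), 0, -1):
            piece = text[:end]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                text = text[end:]
                break
        else:
            raise ValueError(f"no token for {text!r}")
    return ids

print(tokenize("Hello world!"))  # [0, 4, 5]
print(tokenize("hello world!"))  # [1, 4, 5] -- different first token
```

The two inputs differ only in the case of one letter, yet they share no common prefix at the token level, which is one way a model's "understanding" can fracture along capitalization.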

Sheridan Feucht, a PhD student studying large language model interpretability, explains that there is no perfect tokenizer due to the “fuzziness” of defining what a word should be for a language model. This fuzziness becomes even more problematic in languages other than English. Many tokenization methods assume that a space indicates a new word, but languages like Chinese, Japanese, Korean, Thai, and Khmer do not use spaces to separate words. As a result, transformers take longer to complete tasks in non-English languages, and users of less token-efficient languages may pay more for usage.
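The space-as-word-boundary assumption is easy to demonstrate. The snippet below uses naive whitespace splitting as a stand-in for space-based tokenization; it works for English but produces a single undivided chunk for a Chinese sentence.

```python
# Naive whitespace splitting works for English but fails for languages
# that do not delimit words with spaces, such as Chinese or Japanese.
english = "The weather is nice today"
chinese = "今天天气很好"  # same sentence: "The weather is nice today"

print(english.split())  # ['The', 'weather', 'is', 'nice', 'today']
print(chinese.split())  # ['今天天气很好'] -- one undivided chunk
```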

Tokenizers also struggle with logographic systems of writing like Chinese and agglutinative languages like Turkish. Each character in logographic systems is treated as a distinct token, while each meaningful word element in agglutinative languages becomes a token. This leads to high token counts and increased complexity in processing these languages.
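For the agglutinative case, a toy segmentation (the morpheme split below is illustrative, not the output of any real tokenizer) shows how a single Turkish word can decompose into several meaning-bearing tokens:

```python
# Hypothetical morpheme-level split of one Turkish word.
# "evlerinizden" = "from your houses":
#   ev (house) + ler (plural) + iniz (your) + den (from)
turkish_word = "evlerinizden"
morpheme_tokens = ["ev", "ler", "iniz", "den"]

assert "".join(morpheme_tokens) == turkish_word
print(len(morpheme_tokens))  # 4 tokens for a single word
```

Where English might spend one token on "houses", the Turkish equivalent of "from your houses" packs four token-worthy elements into one word, driving up token counts.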

Additionally, tokenization might explain why today's models struggle with math-related tasks. Digits are not tokenized consistently; a tokenizer might treat a number like "380" as a single token but split "381" into "38" and "1", obscuring the relationships between digits that arithmetic depends on. And because models see tokens rather than individual characters, they also have trouble solving anagrams or reversing words.
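The word-reversal failure follows directly from the token view. In the sketch below, the two-token split of "strawberry" is hypothetical, but it shows why reversing at the token level does not produce a character-level reversal.

```python
# Why token-level models struggle to reverse words: reversing the
# characters and reversing the token sequence give different strings.
word = "strawberry"
tokens = ["straw", "berry"]  # how a BPE tokenizer might chunk the word

char_reversal = word[::-1]                  # what the user asked for
token_reversal = "".join(reversed(tokens))  # what token order yields

print(char_reversal)   # 'yrrebwarts'
print(token_reversal)  # 'berrystraw' -- not a character reversal
```

A model that never sees individual letters has to infer the character makeup of each token indirectly, which is why character-manipulation tasks trip it up.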

While tokenization presents challenges, there are potential solutions. "Byte-level" state space models like MambaByte work directly with raw bytes instead of tokens, which makes them robust to noise such as swapped characters, odd spacing, and capitalization changes. These models show promise on language-analyzing tasks and may offer an alternative to tokenization, but they are still in the early research stages.
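The appeal of the byte-level approach is easy to see: every input, in any script, reduces to the same 256-symbol alphabet, so nothing is ever out-of-vocabulary. A quick sketch:

```python
# Byte-level models consume raw UTF-8 bytes rather than vocabulary
# tokens, so any text maps onto the same fixed 256-symbol alphabet.
for text in ["Hello", "hello", "héllo", "你好"]:
    raw = list(text.encode("utf-8"))
    print(text, "->", raw)

# "Hello" -> [72, 101, 108, 108, 111]
# Non-ASCII characters simply span multiple bytes; nothing is
# out-of-vocabulary at the byte level.
```

The trade-off, which is part of why these models remain research prototypes, is that byte sequences are much longer than token sequences for the same text, making processing more expensive.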

In conclusion, tokenization is a crucial stage in generative AI models but also a source of limitations and problems. Researchers are exploring new model architectures, like MambaByte, to overcome these challenges and improve the performance of AI models in understanding and processing text.