The dominance of transformers in AI may soon be challenged as researchers search for new architectures that can overcome the technology's limitations. Transformers, which underpin OpenAI's video generator Sora as well as text-generating models like Claude and Gemini, are not especially efficient at processing and analyzing vast amounts of data, which drives up power consumption and infrastructure demands. A promising alternative known as test-time training (TTT) has now been proposed by researchers at Stanford, UC San Diego, UC Berkeley, and Meta, who claim TTT models can process far more data than transformers while consuming far less compute.
A foundational component of transformers is the “hidden state,” which is essentially a long list of data. As a transformer processes information, it adds entries to this hidden state to remember what it has just processed. But the hidden state is also what hobbles transformers: to say even a single word about a book it has just read, the model must scan through its entire lookup table, which is computationally demanding. The TTT approach instead replaces the hidden state with a machine learning model of its own, a model nested within a model. Rather than growing a list, this internal model encodes the data it processes into representative variables called weights, which is what makes TTT models so performant: no matter how much data a TTT model processes, the size of its internal model stays the same.
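To make that contrast concrete, here is a minimal, hypothetical sketch in Python/NumPy: a transformer-style cache that grows with every token it sees, versus a TTT-style state that folds each token into a fixed-size weight matrix via a gradient step. The class names, the toy reconstruction loss, and the learning rate are illustrative assumptions for clarity, not the researchers' actual design.

```python
# Toy illustration of a growing transformer-style cache vs. a fixed-size
# TTT-style state. Not the published TTT implementation; names and the
# simple self-supervised loss are assumptions made for this sketch.
import numpy as np

DIM = 16  # toy embedding size

class GrowingCache:
    """Transformer-like memory: every processed token adds an entry,
    and producing an output means scanning the whole list."""
    def __init__(self):
        self.entries = []  # grows linearly with the sequence

    def update(self, token_vec):
        self.entries.append(token_vec)

    def read(self, query_vec):
        # Attention-style weighted sum over *all* stored entries:
        # cost grows with everything processed so far.
        keys = np.stack(self.entries)            # (n, DIM)
        scores = keys @ query_vec                # (n,)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ keys                    # (DIM,)

class TTTState:
    """TTT-like memory: a small model (here a single weight matrix) whose
    size never changes; each token triggers a gradient step that folds the
    new information into the weights."""
    def __init__(self, lr=0.1):
        self.W = np.zeros((DIM, DIM))            # fixed-size "hidden state"
        self.lr = lr

    def update(self, token_vec):
        # Self-supervised step: nudge W so that W @ token better reconstructs
        # the token (a stand-in for the paper's inner-loop objective).
        pred = self.W @ token_vec
        grad = np.outer(pred - token_vec, token_vec)
        self.W -= self.lr * grad

    def read(self, query_vec):
        return self.W @ query_vec                # constant cost per read

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(10_000, DIM))      # a long "document"

    cache, ttt = GrowingCache(), TTTState()
    for t in tokens:
        cache.update(t)
        ttt.update(t)

    print("cache entries stored:", len(cache.entries))  # 10000, and growing
    print("TTT state size:", ttt.W.size, "(fixed)")     # 256, regardless of length
```

The point of the sketch is only the scaling behavior: the cache's read cost and memory grow with the sequence, while the TTT-style state stays the same size however much it has seen.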
The potential of TTT models is significant: they could efficiently process billions of pieces of data, from words and images to audio recordings and videos. Yu Sun, a Stanford researcher who contributed to the TTT work, envisions a future where TTT models can process long videos approaching the visual experience of a human lifetime. Still, TTT models are not yet a drop-in replacement for transformers. The researchers have so far built only two small models for study, which makes it hard to compare the approach against larger transformer implementations.
Despite this uncertainty, the growing body of research into transformer alternatives signals a recognition that a breakthrough is needed. Mistral, an AI startup, recently released a model called Codestral Mamba built on another alternative architecture, the state space model (SSM). SSMs, like TTT models, appear to be more computationally efficient than transformers and can scale to larger amounts of data. AI21 Labs and Cartesia are also exploring SSMs, suggesting generative AI may be starting to shift toward these alternatives.
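The efficiency claim for SSMs comes from the same basic property: the sequence is processed through a recurrence with a fixed-size state, so cost grows linearly with length rather than quadratically as in attention. Below is a rough, hypothetical sketch of that recurrence; it is not Codestral Mamba's actual architecture, which adds learned, input-dependent parameters and other machinery on top.

```python
# Rough sketch of a discrete linear state-space recurrence:
#   h_t = A @ h_{t-1} + B @ x_t
#   y_t = C @ h_t
# The matrices here are random placeholders; real SSMs learn them (and make
# parts of them input-dependent), but the key property is the same: the state
# h has a fixed size, so each token costs the same to process.
import numpy as np

def ssm_scan(x, A, B, C):
    """Run the recurrence over a sequence x of shape (seq_len, input_dim)."""
    state_dim = A.shape[0]
    h = np.zeros(state_dim)          # fixed-size state, independent of seq_len
    outputs = []
    for x_t in x:                    # one pass over the sequence: O(seq_len)
        h = A @ h + B @ x_t
        outputs.append(C @ h)
    return np.stack(outputs)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq_len, input_dim, state_dim = 10_000, 8, 32
    x = rng.normal(size=(seq_len, input_dim))
    A = rng.normal(size=(state_dim, state_dim)) * 0.01  # kept small for stability
    B = rng.normal(size=(state_dim, input_dim))
    C = rng.normal(size=(input_dim, state_dim))
    y = ssm_scan(x, A, B, C)
    print(y.shape)                   # (10000, 8); per-step memory stayed constant
```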
If these efforts are successful, it could lead to more accessible and widespread generative AI. However, it remains to be seen whether TTT models, SSMs, or any other emerging architecture will ultimately surpass transformers. The ongoing research and development in this field highlight the importance of finding more efficient and powerful AI architectures to meet the demands of processing ever-increasing amounts of data.