Transfusion: A Unified Approach to Multi-Modal Learning

Multi-modal models that can process both text and images are an important area of research in artificial intelligence. However, training these models presents a unique challenge. Language models deal with discrete values like words and tokens, while image generation models must handle continuous pixel values. Existing approaches to address this challenge involve different tradeoffs.

One approach involves using separate architectures for language and image processing, often pre-training each component individually. This method struggles to learn the complex interactions between different modalities, especially when processing documents where images and text are interleaved. Another approach involves quantizing images into discrete values, effectively converting them into a sequence of tokens similar to text. While this enables the use of language models for image processing, it results in the loss of information contained in the continuous pixel values.

In a new research paper, scientists from Meta and the University of Southern California introduce Transfusion, a recipe for training a single model that handles both discrete and continuous modalities without quantization or separate modules. The core idea behind Transfusion is to train one model with two objectives: language modeling for text and diffusion for images. The researchers show that both modalities can be fully integrated, with no information loss, by training a single model to both predict discrete text tokens and diffuse continuous images.
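To make the two objectives concrete, here is a minimal PyTorch sketch of such a combined training loss: next-token cross-entropy on text positions and a noise-prediction (diffusion-style) mean-squared error on image-latent positions. The TinyTransfusion module, the simplified noising schedule, and the lambda_img weight are illustrative assumptions, not the paper's actual architecture, diffusion schedule, or attention masking.

```python
# Toy sketch of a Transfusion-style combined objective. Assumptions (not from
# the paper): TinyTransfusion, lambda_img, the linear noising schedule, and the
# absence of the mixed causal/bidirectional attention mask.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyTransfusion(nn.Module):
    """Toy stand-in for the shared transformer backbone with two output heads."""
    def __init__(self, vocab_size=1000, d_model=64, latent_dim=16):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)   # discrete text tokens -> embeddings
        self.patch_in = nn.Linear(latent_dim, d_model)       # continuous image latents -> embeddings
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)        # predicts the next text token
        self.patch_out = nn.Linear(d_model, latent_dim)      # predicts the noise on image latents

    def forward(self, text_ids, noisy_latents):
        # One mixed sequence: text embeddings followed by (noised) image-latent embeddings.
        x = torch.cat([self.token_emb(text_ids), self.patch_in(noisy_latents)], dim=1)
        h = self.backbone(x)
        n_text = text_ids.shape[1]
        return self.lm_head(h[:, :n_text]), self.patch_out(h[:, n_text:])

def transfusion_loss(model, text_ids, clean_latents, lambda_img=1.0):
    """Language-modeling loss on text positions + diffusion loss on image positions."""
    noise = torch.randn_like(clean_latents)
    t = torch.rand(clean_latents.shape[0], 1, 1)              # toy per-sample timestep
    noisy_latents = (1 - t) * clean_latents + t * noise       # simplified noising step
    logits, noise_pred = model(text_ids, noisy_latents)
    # Next-token prediction: each position predicts the following text token.
    lm_loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.shape[-1]), text_ids[:, 1:].reshape(-1)
    )
    diff_loss = F.mse_loss(noise_pred, noise)                 # noise-prediction objective
    return lm_loss + lambda_img * diff_loss

# Toy usage: a batch of 2 sequences, each with 8 text tokens and 4 image patches.
model = TinyTransfusion()
loss = transfusion_loss(model, torch.randint(0, 1000, (2, 8)), torch.randn(2, 4, 16))
loss.backward()
```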

Transfusion uses a unified architecture and vocabulary to process mixed-modality inputs. The model includes lightweight modality-specific components that convert text tokens and image patches into the appropriate representations before they are processed by the transformer. To improve the representation of image data, Transfusion uses a variational autoencoder (VAE), a neural network that learns to represent complex data, such as images, in a lower-dimensional continuous space. In Transfusion, the VAE encodes each 8×8 patch of an image into a list of continuous values.
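As a rough illustration of this input pipeline, the sketch below uses a placeholder VAE encoder to turn each 8×8 image patch into a short vector of continuous values and flattens those vectors into a sequence that can sit alongside discrete text tokens. The ToyVAEEncoder, the chosen dimensions, and the special begin/end-of-image markers are assumptions for demonstration, not the paper's actual VAE or projection layers.

```python
# Toy sketch of packing text tokens and continuous image patches into one
# sequence. ToyVAEEncoder and the BOI/EOI ids are illustrative assumptions.
import torch
import torch.nn as nn

class ToyVAEEncoder(nn.Module):
    """Placeholder for a pretrained VAE encoder that maps each 8x8 image patch
    to a small vector of continuous latent values (here via one strided conv)."""
    def __init__(self, in_ch=3, latent_ch=4):
        super().__init__()
        self.to_latent = nn.Conv2d(in_ch, latent_ch, kernel_size=8, stride=8)

    def forward(self, image):
        return self.to_latent(image)                          # (B, latent_ch, H/8, W/8)

def latent_to_patch_sequence(latent):
    """Flatten the latent map so each spatial position (one 8x8 image patch)
    becomes a single continuous vector in the transformer's input sequence."""
    b, c, h, w = latent.shape
    return latent.permute(0, 2, 3, 1).reshape(b, h * w, c)    # (B, num_patches, C)

# Toy usage: a 64x64 RGB image surrounded by discrete text token ids.
vae = ToyVAEEncoder()
patches = latent_to_patch_sequence(vae(torch.randn(1, 3, 64, 64)))  # (1, 64, 4)
text_before = torch.tensor([[5, 42, 7]])   # token ids preceding the image
text_after = torch.tensor([[9, 3]])        # token ids following the image
BOI, EOI = 1, 2                            # hypothetical begin/end-of-image marker ids

# A lightweight linear layer would project the patch vectors to the model
# dimension; the embedded text and projected patches are then concatenated in
# document order (text_before, BOI, patches, EOI, text_after) before entering
# the shared transformer.
print(patches.shape)  # torch.Size([1, 64, 4])
```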

The researchers trained a 7-billion-parameter model based on Transfusion and evaluated it on a variety of standard uni-modal and cross-modal benchmarks. They compared its performance to an equally sized model based on Chameleon, the current prominent open-science method for training native mixed-modal models. Transfusion consistently outperformed Chameleon across all modalities. It achieved better results with less computational cost in text-to-image generation and matched Chameleon’s performance with only 21.8% of the computational resources in image-to-text generation.

Transfusion also showed better performance on text-only benchmarks, suggesting that training on quantized image tokens can negatively impact text performance. The researchers ran separate experiments on image generation and compared Transfusion with other models. Transfusion outperformed other popular models such as DALL-E 2 and Stable Diffusion XL while also being able to generate text.

Transfusion opens up new opportunities for multi-modal learning and interesting new use cases. It could unlock applications that give users finer control in interactive sessions, such as interactive editing of images and videos. Overall, Transfusion offers a unified approach to multi-modal learning, addressing the challenge of handling both discrete and continuous modalities in a single model without information loss.