Revolutionizing Multimodal AI with Deep Fusion Technology
Introduction to CogVLM and Deep Fusion
A team of researchers has introduced a groundbreaking model that significantly redefines the standards of multimodal AI, outpacing nearly all existing competitors. This model, named CogVLM, brings forth a novel concept known as Deep Fusion, aimed at addressing the prevalent "shallow alignment problem" faced by Multimodal Large Language Models (MLLMs).
If successful, CogVLM could emerge as a pivotal research work, inspiring a new generation of MLLMs—those built on deep fusion principles. The model showcases remarkable abilities, such as solving mathematical problems from images, among other impressive functionalities.
Understanding the Shallow Alignment Problem
To grasp the significance of this advancement, it's essential to first understand the shallow alignment problem.
Building a Large Language Model (LLM) is no simple feat. It requires vast datasets, a team of top-tier researchers, and a powerful GPU infrastructure. In other words, it demands substantial financial resources and expertise. Moreover, commercializing such a model involves ensuring it understands the nuances of what to articulate or avoid. This necessitates a cadre of human annotators to engage in Reinforcement Learning from Human Feedback (RLHF), which further escalates the financial burden.
The costs are staggering; for instance, the initial development of GPT-4 reportedly amounted to around $100 million, while LLaMA's 65-billion-parameter model cost an estimated $5 million to train over just 21 days. Constructing an MLLM is a far more complex endeavor still. Beyond the standard procedure, one must also train an image encoder to process images and ensure its outputs work harmoniously with the language model. Thus, unless you belong to the elite group of leading tech companies, the so-called "magnificent seven" (Alphabet, Amazon, Apple, Meta, Microsoft, Nvidia, and Tesla), or have substantial backing, it's nearly impossible to train these models.
For open-source researchers primarily funded by universities, resourcefulness in design is crucial.
Innovative Grafting Techniques
To circumvent the need to build a model entirely from scratch, the prevalent approach for creating MLLMs is known as grafting. This involves connecting a pre-trained image encoder to an LLM via an adapter, typically a Q-Former or an MLP layer. The rationale is straightforward: since both components are already pre-trained, the only remaining task is to align them so they can communicate effectively.
Once the image encoder processes the visual input, its output is projected into the embedding space of the LLM. Frontier AI models convert various forms of data (images, text, audio, etc.) into vectors called embeddings, which are essential since machines operate on numerical data. Embeddings also make it possible to compute the similarity between concepts, with closely related concepts sharing similar vector representations, a property OpenAI describes as 'relatedness.'
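To make the grafting idea concrete, here is a minimal PyTorch sketch of such an MLP adapter. The dimensions (1024-dim patch features, 4096-dim LLM embeddings), the two-layer design, and the names used are illustrative assumptions, not the exact configuration of any particular model.

```python
import torch
import torch.nn as nn

class MLPAdapter(nn.Module):
    """Projects image-encoder patch features into the LLM's embedding space."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):  # hypothetical sizes
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        return self.proj(patch_features)  # (batch, num_patches, llm_dim)

# The projected "visual tokens" are simply concatenated with the embedded
# text tokens before being fed to the (otherwise unchanged) LLM.
adapter = MLPAdapter()
image_tokens = adapter(torch.randn(1, 256, 1024))  # e.g. 256 patches from the encoder
text_tokens = torch.randn(1, 32, 4096)             # embedded prompt tokens
llm_input = torch.cat([image_tokens, text_tokens], dim=1)
```

In this scheme only the adapter is trained; the image encoder and the LLM keep their pre-trained weights, which is exactly what makes grafting affordable.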
However, a significant challenge arises here. The LLM's weights were never trained to interpret images, so crucial visual features such as color and spatial relationships can be overlooked. For example, if the model is asked "What object is on the right side of the image?", the image and the text are processed separately. The image encoder may identify the objects accurately, but because the LLM's weights were trained only on textual features, much of the image-specific information is lost once the visual embeddings are projected into the text space.
Although one could fine-tune the LLM with image data, this can lead to catastrophic forgetting in natural language processing tasks, as evidenced by the PaLM-E model (Driess et al., 2023). Fortunately, Deep Fusion resolves this issue.
Deep Fusion: A Paradigm Shift
In essence, CogVLM addresses the shallow alignment problem by introducing a visual expert module. On the vision side of the architecture, the process remains familiar: an MLP adapter transforms the image encoder's outputs to match the dimensions of the word embeddings.
The language side, however, reveals the key advancement: the researchers have effectively duplicated the attention weights of the LLM. The attention mechanism is fundamental to Transformers; it lets the tokens in a sequence communicate with one another and enables the model to grasp the relationships between them.
The innovative approach allows for the original weights—trained on extensive text data—to remain untouched while new weights are trained specifically on image features. This means the model can now leverage distinct image characteristics while still operating within the established framework of text. The results from both attention mechanisms are then combined, producing a unified representation that fuses both visual and textual information.
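Below is a minimal PyTorch sketch of what such a visual expert attention layer could look like: the original text projections stay frozen, a duplicated set of projections handles the image tokens, and a single joint attention pass fuses the two. The class name, dimensions, and masking scheme are illustrative assumptions, not CogVLM's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualExpertAttention(nn.Module):
    """Attention layer with a duplicated set of projection weights for image tokens.

    Text tokens use the original (frozen) language-model weights; image tokens
    use the trainable 'visual expert' copy. All tokens still attend to each
    other in one attention pass, which is what fuses the two modalities.
    """
    def __init__(self, dim: int = 4096, num_heads: int = 32):  # hypothetical sizes
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Original language weights, kept frozen.
        self.qkv_text = nn.Linear(dim, 3 * dim)
        self.out_text = nn.Linear(dim, dim)
        for p in list(self.qkv_text.parameters()) + list(self.out_text.parameters()):
            p.requires_grad = False
        # Visual-expert copy, trained on image features only.
        self.qkv_image = nn.Linear(dim, 3 * dim)
        self.out_image = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, image_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); image_mask: (seq,) bool, True where the token is an image patch
        b, s, d = x.shape
        # Route each token through the projection weights of its own modality.
        qkv = torch.where(image_mask[None, :, None], self.qkv_image(x), self.qkv_text(x))
        q, k, v = qkv.chunk(3, dim=-1)
        split = lambda t: t.view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        # Joint attention over the full (image + text) sequence.
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(b, s, d)
        return torch.where(image_mask[None, :, None], self.out_image(out), self.out_text(out))

# Example: 256 image tokens followed by 32 text tokens, as in the adapter sketch above.
layer = VisualExpertAttention()
tokens = torch.randn(1, 288, 4096)
mask = torch.tensor([True] * 256 + [False] * 32)
fused = layer(tokens, mask)  # (1, 288, 4096)
```

The important design choice is that the frozen text weights are never updated, so the language abilities of the base LLM are preserved while the duplicated weights learn to handle visual features.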
For instance, in the earlier example, the LLM within CogVLM can now address queries like "What is the object on the right-hand side of the image?" with ease. This integration mirrors how humans naturally combine visual and textual information, enhancing the model's overall comprehension of the input.
Additionally, the architecture is computationally efficient: although the attention parameters are duplicated, the number of FLOPs remains constant, because each token is processed only by the weights of its own modality.
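To see why the FLOP count stays flat, here is a back-of-the-envelope sketch (hypothetical layer size): the parameter count roughly doubles, but each token only ever passes through one copy of the weights.

```python
# Hypothetical size for illustration only.
dim = 4096
params_per_layer_base   = dim * dim       # one projection matrix in the original LLM
params_per_layer_fused  = 2 * dim * dim   # frozen text copy + trainable visual-expert copy

# Each token is routed through exactly one of the two copies,
# so the multiply-accumulates spent per token do not change:
flops_per_token_base  = dim * dim
flops_per_token_fused = dim * dim         # identical, despite ~2x parameters
assert flops_per_token_base == flops_per_token_fused
```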
Outstanding Performance Metrics
CogVLM has demonstrated exceptional performance compared to other models of similar size, outperforming all previous state-of-the-art models except PaLI-X-55B, which is roughly three times its size; remarkably, CogVLM still surpasses even that model in numerous scenarios, establishing itself as a leading contender alongside the likes of GPT-4V.
The model excels in executing sophisticated tasks, such as intricate counting and advanced image analysis.
Conclusion: A New Era of Multimodality
In summary, CogVLM is not just another incremental advancement in AI; it signifies a groundbreaking shift in multimodal development. It offers a computationally efficient framework that empowers smaller models to tackle complex cross-modal tasks where traditional MLLMs often falter. Given CogVLM's capabilities, it is poised to attract global attention, and the deep fusion architecture may soon become synonymous with multimodality. For further details, refer to the research paper linked here.
This video discusses CogVLM, RoboVQA, GLaMM, and more, showcasing their capabilities in multimodal AI.
In this video, Ekaterina Sirazitdinova demystifies Multimodal Generative AI, providing insights into its future.