Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts


Harbin Institute of Technology, Shenzhen


Abstract

The Mixture of Experts (MoE) architecture has emerged as a promising approach for scaling up Large Language Models (LLMs), facilitating more efficient training and inference. In light of this potential, our work extends the MoE architecture to develop Uni-MoE, a unified Multimodal LLM designed to process a wide array of modalities, including audio, speech, images, text, and video. Specifically, our methodology enriches the transformer architecture of LLMs by incorporating multimodal experts, comprising: (1) a shared self-attention mechanism for all modalities, (2) modality-specific experts derived from feed-forward networks, and (3) a sparse routing mechanism for token-level expert allocation. We evaluate our instruction-tuned, MoE-based MLLM on a comprehensive set of multimodal datasets. The results underscore Uni-MoE's three principal benefits: (a) superior performance compared to existing dense multimodal models with single-expert configurations across various benchmarks; (b) enhanced multi-expert collaboration and generalization through the pre-training of modality-specific experts, rather than applying a generic MoE model directly to heterogeneous data; and (c) exceptional ability in handling complex cross-modality tasks such as the High School English Listening Test and video comprehension, demonstrating effective complexity management and reduced bias on mixed multimodal datasets. Our findings suggest the significant potential of MoE frameworks in the advancement of MLLMs and encourage further research in this domain. The code is available at https://github.com/HITsz-TMG/UMOE-Scaling-Unified-Multimodal-LLMs.

Architecture



Figure 1: Architecture of Uni-MoE. By connecting the LLM with multimodal encoders, Uni-MoE achieves unified multimodal understanding. It employs the MoE architecture to deliver stable and strong performance on any combination of multimodal inputs.
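
To make the architecture concrete, below is a minimal PyTorch-style sketch of a single Uni-MoE-style transformer layer as described in the abstract and Figure 1: a self-attention block shared across modalities, a set of modality-specific FFN experts, and a sparse router that assigns each token to its top-k experts. Module names, hidden sizes, and the top-k value are illustrative assumptions, not the released implementation.

# Minimal sketch of a Uni-MoE-style transformer layer (illustrative only).
# Hidden sizes, number of experts, and the top-k value are placeholder assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpertFFN(nn.Module):
    """One modality-specific feed-forward expert."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class UniMoELayer(nn.Module):
    """Shared self-attention + sparse mixture of modality-specific FFN experts."""
    def __init__(self, d_model=4096, d_ff=11008, n_heads=32, n_experts=4, top_k=2):
        super().__init__()
        # Self-attention is shared by all modalities.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # One FFN expert per modality (e.g. image, audio, speech, video).
        self.experts = nn.ModuleList(
            [ExpertFFN(d_model, d_ff) for _ in range(n_experts)]
        )
        # Sparse router producing token-level expert scores.
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) soft tokens coming from any modality connector.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        h = self.norm2(x)
        gate = F.softmax(self.router(h), dim=-1)                # (B, S, n_experts)
        weights, idx = gate.topk(self.top_k, dim=-1)            # keep top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize the kept scores
        moe_out = torch.zeros_like(h)
        for e, expert in enumerate(self.experts):
            hit = (idx == e)                                    # (B, S, top_k): tokens routed to expert e
            if hit.any():
                w = (weights * hit).sum(dim=-1, keepdim=True)   # per-token weight for expert e
                moe_out = moe_out + w * expert(h)
        return x + moe_out

For clarity, every expert processes every token in this sketch and the routing weights only rescale their outputs; efficient MoE implementations instead dispatch only the routed tokens to each expert.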


  • Cross-Modality Alignment. In the initial stage, we aim to establish connectivity between the different modalities and language. We achieve this by constructing connectors that translate the various modal data into soft tokens within the language space; the primary training objective is minimizing the generative cross-entropy loss. As illustrated in the upper section of Figure 1, the LLM is optimized to generate descriptions for inputs across modalities, and only the connectors are subject to training. This approach ensures seamless integration of all modalities within a unified language framework, facilitating mutual comprehension by the LLM (a minimal connector sketch follows this list).
  • Training Single-Modality Experts. This stage concentrates on developing single-modality experts through dedicated training on specific cross-modal data. The goal is to refine each expert's proficiency within its respective domain, thereby enhancing the overall performance of the MoE system on diverse multimodal data. While maintaining the generative loss as the focal training metric, we tailor each FFN to align more closely with the characteristics of the targeted modality.
  • Tuning Uni-MoE. The concluding stage integrates the expert weights tuned in Stage 2 into the MoE layers, after which we jointly fine-tune the MLLM on mixed multimodal instruction data (see the stage-wise training sketch after this list). The progression of the training process, as reflected by the loss curves, is depicted in Figure 3 of the paper. Comparative analysis between MoE configurations reveals that experts refined during Stage 2 converge more quickly and remain more stable on mixed-modality datasets. Furthermore, in scenarios involving complex mixed multimodal data spanning video, audio, images, and text, our model with four modality-specific experts exhibits lower loss variability and more consistent training than its two-expert counterpart.
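
As a concrete illustration of the first stage, the sketch below shows how a connector could project features from a frozen modality encoder (e.g. CLIP, Whisper, or BEATs features) into soft tokens in the LLM's embedding space, with only the connector left trainable under the generative loss. The two-layer MLP shape and the function names are assumptions for illustration, not the exact released connectors.

# Illustrative Stage-1 connector: projects frozen-encoder features into soft
# tokens in the LLM embedding space. Shapes and names are assumptions.
import torch
import torch.nn as nn


class Connector(nn.Module):
    """Small MLP mapping modality encoder features into the language space."""
    def __init__(self, enc_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(enc_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, n_patches_or_frames, enc_dim) -> (batch, n, llm_dim) soft tokens
        return self.proj(feats)


def stage1_trainable_params(encoder: nn.Module, connector: Connector, llm: nn.Module):
    """Stage 1: freeze the encoder and the LLM; only the connector is trained
    with the generative (next-token cross-entropy) loss on caption-style data."""
    for p in encoder.parameters():
        p.requires_grad_(False)
    for p in llm.parameters():
        p.requires_grad_(False)
    return list(connector.parameters())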

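The stage-wise training sketch referenced above is summarized below: each stage freezes the whole model and then unfreezes only the modules that stage is meant to train. The attribute names (connectors, experts, moe_layers) are assumptions about how such a model might be organized, not the repository's actual API.

# Hedged sketch of the stage-wise trainable-parameter schedule. The attribute
# names (connectors, experts, moe_layers) are assumptions, not the repo's API.
import torch.nn as nn


def select_trainable(model: nn.Module, stage: int):
    """Freeze everything, then unfreeze only what the given stage trains."""
    for p in model.parameters():
        p.requires_grad_(False)
    if stage == 1:
        # Cross-modality alignment: only the modality connectors learn.
        modules = [model.connectors]
    elif stage == 2:
        # Train each modality-specific expert (its FFN) on its own cross-modal data.
        modules = [model.experts]
    else:
        # Stage 3: the MoE layers (router + experts initialized from the Stage-2
        # expert weights) are fine-tuned jointly on mixed multimodal instructions.
        modules = [model.moe_layers]
    params = []
    for m in modules:
        for p in m.parameters():
            p.requires_grad_(True)
            params.append(p)
    return params

In all three stages the objective remains the generative loss; what changes is chiefly the data mixture and which parameters the optimizer updates.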


Demonstrations

• Example-1: Text + Image

User: Describe the image in detail.

Uni-MoE: The image features a large leopard walking across a grassy area, close to a body of water. The leopard is positioned in the center of the scene, with its reflection visible in the water. The water appears to be a lake or a pond, providing a serene backdrop for the leopard's stroll.

• Example-2: Text + Audio

User: Where does this sound take place?

Uni-MoE: Concert.

User: Name all the instruments you can hear.

Uni-MoE: Accordion, drums, guitars, trumpets, violins, and cellos.

• Example-3: Speech Instruction + Image

User (speech, transcription): What religion is represented at the top of the tower?

Uni-MoE: The top of the clock tower is decorated with a cross, which represents Christianity.

User (speech, transcription): What is powering the bike?

Uni-MoE: The bikes are being powered by the dogs, who are pulling the bicycles through the grass.

• Example-4: Long Speech Understanding

User: What is the topic of the article?

Uni-MoE: The importance of learning English.

• Example-5: Text + Video

User: Describe this video in detail.

Uni-MoE: The video shows a group of people enjoying a day out on a river. They are seen riding down the river on inflatable tubes, taking in the beautiful scenery and having a great time. The group is seen drinking beer and enjoying each other's company while floating down the river. The video captures the essence of a perfect day outdoors with friends, surrounded by nature.

User: What can you see and hear from the video?

Uni-MoE: The video shows a group of people playing instruments in a large orchestra, with a conductor leading them. The conductor is wearing a suit and tie, and the orchestra is dressed in formal attire. The video also shows the audience watching the performance, and the orchestra is playing a piece of music.

User: What can you hear in the audio?

Uni-MoE: The audio in the video features the sound of the orchestra playing the music, with the conductor leading the way. The woman playing the flute is also featured in the audio, and her performance is highlighted.


BibTeX


@misc{li2024unimoe,
  title={Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts},
  author={Yunxin Li and Shenyuan Jiang and Baotian Hu and Longyue Wang and Wanqi Zhong and Wenhan Luo and Lin Ma and Min Zhang},
  year={2024},
  eprint={2405.11273},
  archivePrefix={arXiv},
  primaryClass={cs.AI}
}

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the LLaMA team for giving us access to their models, as well as the open-source projects Vicuna, BEATs, and Whisper.

Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. Their use is also restricted to purposes that follow the license agreements of CLIP, BEATs, Whisper, LLaMA, and Vicuna. The dataset and models trained on it should not be used outside of research purposes.