Researchers from UC Berkeley, Microsoft Azure AI, Zoom, and UNC-Chapel Hill developed CoDi-2, a Multimodal Large Language Model (MLLM), to address the problem of understanding and following complex multimodal instructions and to excel at subject-driven image generation, vision transformation, and audio editing. The model represents a significant step toward a comprehensive multimodal foundation model.

CoDi-2 extends the capabilities of its predecessor, CoDi, by excelling in tasks like subject-driven image generation and audio editing. The model’s architecture includes encoders and decoders for audio and vision inputs. Training incorporates pixel loss from diffusion models alongside token loss. CoDi-2 showcases remarkable zero-shot and few-shot abilities in tasks like style adaptation and subject-driven generation. 

CoDi-2 addresses challenges in multimodal generation, emphasizing zero-shot fine-grained control, modality-interleaved instruction following, and multi-round multimodal chat. Utilizing an LLM as its brain, CoDi-2 aligns modalities with language during encoding and generation. This approach enables the model to understand complex instructions and produce coherent multimodal outputs. 
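As a rough illustration of what aligning modalities with language can look like on the input side, the sketch below maps image or audio encoder features into the same vector space as text token embeddings and then interleaves them in instruction order. This is a minimal sketch under stated assumptions, not the CoDi-2 implementation: the linear projection, the function names `project` and `build_interleaved_sequence`, and the toy dimensions are all hypothetical.

```python
def project(features, weight):
    """Linear map (assumed form of alignment): weight has d_out rows of d_in values."""
    return [sum(w * f for w, f in zip(row, features)) for row in weight]


def build_interleaved_sequence(segments, text_embed, projections):
    """Turn an interleaved instruction into one sequence of LLM-space vectors.

    segments: list of (kind, payload) pairs in instruction order.
      - ("text", [token, ...]) looks each token up in text_embed.
      - ("image" | "audio", feature_vector) is projected with projections[kind].
    """
    sequence = []
    for kind, payload in segments:
        if kind == "text":
            sequence.extend(text_embed[token] for token in payload)
        else:
            sequence.append(project(payload, projections[kind]))
    return sequence


# Toy example: 2-dimensional LLM space, 3-dimensional image encoder features.
text_embed = {"restyle": [1.0, 0.0], "this": [0.0, 1.0]}
projections = {"image": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]}
segments = [("text", ["restyle", "this"]), ("image", [0.5, 0.25, 9.0])]

sequence = build_interleaved_sequence(segments, text_embed, projections)
```

Here `sequence` holds three vectors of the same width, two from text tokens and one from the image, so a single language model could in principle attend over all of them in order.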

The CoDi-2 architecture incorporates encoders and decoders for audio and vision inputs within a multimodal large language model. Trained on a diverse generation dataset, CoDi-2 combines pixel loss from diffusion models with token loss during training. In zero-shot evaluation, it outperforms prior models in subject-driven image generation, vision transformation, and audio editing, and it generalizes competitively to new, unseen tasks.
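To make the idea of combining a token loss with a diffusion pixel loss concrete, here is a minimal sketch in plain Python. It assumes the token loss is a mean next-token cross-entropy and the pixel loss is the usual noise-prediction mean squared error used by diffusion models; the article does not give the exact formulation, and the `pixel_weight` parameter and all function names are hypothetical.

```python
import math


def token_cross_entropy(logits, target_ids):
    """Mean cross-entropy over a sequence; logits is one list of scores per position."""
    total = 0.0
    for row, target in zip(logits, target_ids):
        peak = max(row)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[target]
    return total / len(target_ids)


def diffusion_mse(predicted_noise, true_noise):
    """Mean squared error between predicted and true noise (assumed pixel loss)."""
    squared = [(p - t) ** 2 for p, t in zip(predicted_noise, true_noise)]
    return sum(squared) / len(squared)


def combined_loss(logits, target_ids, predicted_noise, true_noise, pixel_weight=1.0):
    """Token loss plus a weighted pixel loss; the weighting scheme is an assumption."""
    token_loss = token_cross_entropy(logits, target_ids)
    pixel_loss = diffusion_mse(predicted_noise, true_noise)
    return token_loss + pixel_weight * pixel_loss


# Toy example: one position with two equally likely tokens, two noise values.
loss = combined_loss(
    logits=[[0.0, 0.0]],
    target_ids=[0],
    predicted_noise=[1.0, 0.0],
    true_noise=[0.0, 0.0],
)
```

In this toy case the token term is ln 2 (two equally likely tokens) and the pixel term is 0.5, so `loss` is their sum. In a real system both terms would be computed on batched tensors and backpropagated jointly.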

CoDi-2 exhibits extensive zero-shot capabilities in multimodal generation, excelling in in-context learning, reasoning, and any-to-any modality generation through multi-round interactive conversation. The evaluation results demonstrate highly competitive zero-shot performance and robust generalization to new, unseen tasks. On audio manipulation tasks, CoDi-2 outperforms prior models at adding, dropping, and replacing elements within audio tracks, as indicated by the lowest scores across all metrics. These results highlight the significance of in-context learning, concept learning, editing, and fine-grained control in advancing high-fidelity multimodal generation.

In conclusion, CoDi-2 is an advanced AI system that excels in various tasks, including following complex instructions, learning in context, reasoning, chatting, and editing across different input-output modes. Its ability to adapt to different styles and generate content based on various subject matters and its proficiency in manipulating audio make it a major breakthrough in multimodal foundation modeling. CoDi-2 represents an impressive exploration of creating a comprehensive system that can handle many tasks, even those for which it has yet to be trained.

Future work on CoDi-2 aims to enhance its multimodal generation capabilities by refining in-context learning, expanding conversational abilities, and supporting additional modalities. It also aims to improve image and audio fidelity using techniques such as diffusion models. Future research may further involve evaluating and comparing CoDi-2 with other models to understand its strengths and limitations.




Hello, My name is Adnan Hassan. I am a consulting intern at Marktechpost and soon to be a management trainee at American Express. I am currently pursuing a dual degree at the Indian Institute of Technology, Kharagpur. I am passionate about technology and want to create new products that make a difference.



