Digestly

Feb 25, 2025

Magma: A foundation model for multimodal AI Agents | Microsoft Research Forum

Magma is introduced as a foundation model for multimodal AI agents, designed to perceive, reason, and act in both digital and physical environments. Unlike previous multimodal models, Magma aims to bridge the gap between understanding inputs and interacting with the world: it processes multimodal inputs such as images and videos and predicts actions to achieve real-world goals. The model uses two pretraining techniques, Set-of-Mark and Trace-of-Mark, to leverage large-scale image and video data without human labels; the former grounds actions spatially while the latter captures object motions. Magma's pretraining follows a unified objective similar to that of large language models, strengthening its action grounding and planning capabilities. Evaluations show superior performance on spatial grounding, UI navigation, and robot manipulation tasks, outperforming models such as GPT-4V, and Magma achieves state-of-the-art results on robotics and UI navigation benchmarks with limited data. Its development involved collaboration across Microsoft Research and external partners, and the code and model are publicly available for experimentation.

Key Points:

  • Magma is a multimodal AI model that perceives, reasons, and acts in digital and physical environments.
  • It uses Set-of-Mark and Trace-of-Mark techniques for pretraining with large-scale image and video data.
  • Magma outperforms existing models in spatial grounding, UI navigation, and robot manipulation tasks.
  • The model achieves state-of-the-art results in benchmarks using limited pretraining data.
  • Magma's code and model are available for public use and experimentation.

Details:

1. 🎤 Introduction to Magma: Agentic Foundation Model

  • Magma is an agentic foundation model: a generalist model designed to perceive its environment, reason about it, and take actions to achieve defined goals.
  • The model understands multimodal inputs, including visual and textual data, enabling diverse applications.
  • Magma can predict actions for real-world objectives in both digital and physical environments.
  • An example application of Magma includes autonomous navigation systems, where it processes environmental data to make informed decisions.
  • In digital markets, Magma can optimize ad placements by analyzing user interactions and predicting engagement outcomes.
  • Magma's versatility is demonstrated in its ability to adapt to various sectors, from robotics to personalized digital services.

2. 🔍 Evolution of Multimodal Models

  • Early vision-language models used BERT-style architectures with under 1 billion parameters, trained on limited image datasets, which yielded only basic multimodal capabilities.
  • OpenAI's CLIP model scaled multimodal training to hundreds of millions of image-text pairs, offering superior performance and setting a new standard in the field.
  • Microsoft's Florence model showcased strong open vocabulary and zero-shot recognition across diverse visual domains, achieving impressive results despite its relatively smaller size.
  • Recent integrations of multimodal vision models like CLIP with large language models such as GPT have propelled advancements in multimodal capabilities.
  • The development of multimodal chatbots such as GPT-4o marks a significant leap, enabling systems that can see, talk, and reason, and showcasing the practical applications of this evolved multimodal technology.

3. 🤖 Bridging the Interaction Gap in AI Models

  • Existing multimodal models are adept at understanding the world but lack interaction capabilities, both virtual and physical.
  • These models remain disconnected from direct interaction with the world because sensor inputs are detached from the large foundation models themselves.
  • A gap remains between AI and humans in executing simple tasks like web navigation and manipulation.
  • Magma was developed as a foundation model aiming to close this gap by enabling multimodal agents to understand and interact with the environment.
  • Magma strives to be a comprehensive model that not only interprets visual and textual inputs but also predicts actions to achieve real-world goals.

4. 🛠️ Pretraining and Techniques for Magma

  • The model processes images, videos, and task prompts to generate textual, spatial, and action outputs across various tasks, leveraging human instructional videos for pretraining.
  • Two primary techniques are introduced: Set-of-Mark, which focuses on spatial grounding in images, and Trace-of-Mark, which captures the motions of foreground objects in videos and robotics data (a toy sketch of both appears after this list).
  • Pretraining utilized around 20 million samples, including images, video, and robotics data, each contributing to different training goals.
  • The unified pretraining objective, akin to that of large language models, requires the model to predict verbal, spatial, and action outputs in textual form, enhancing action grounding and planning.
  • Model performance improved significantly as pretraining data increased, and the pretrained model generalizes well across tasks given the same image input.
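
As a rough illustration of how these two pretraining signals could be constructed, the sketch below overlays numbered marks on an image (Set-of-Mark) and serializes per-mark point tracks from future video frames into textual prediction targets (Trace-of-Mark). All helper names, the drawing details, and the target format are assumptions for illustration only, not the released Magma pretraining code.

```python
# Illustrative sketch only: hypothetical helpers for the Set-of-Mark /
# Trace-of-Mark idea, not Magma's released pretraining pipeline.
from dataclasses import dataclass
from typing import List, Tuple

from PIL import Image, ImageDraw


@dataclass
class Mark:
    """A numbered mark anchored to a candidate region (e.g., a UI element or object)."""
    index: int
    center: Tuple[int, int]  # (x, y) pixel coordinates


def overlay_set_of_mark(image: Image.Image, marks: List[Mark]) -> Image.Image:
    """Set-of-Mark: draw numeric labels on the image so actions can be
    grounded by referring to a mark index instead of raw coordinates."""
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    for mark in marks:
        x, y = mark.center
        draw.ellipse((x - 10, y - 10, x + 10, y + 10), outline="red", width=2)
        draw.text((x - 4, y - 7), str(mark.index), fill="red")
    return annotated


def trace_of_mark_targets(tracks: List[List[Tuple[int, int]]]) -> List[str]:
    """Trace-of-Mark: turn per-mark point tracks across future video frames
    into textual targets the model learns to predict, a rough analogue of
    learning object motion without human action labels."""
    targets = []
    for idx, track in enumerate(tracks, start=1):
        path = " -> ".join(f"({x},{y})" for x, y in track)
        targets.append(f"mark {idx}: {path}")
    return targets


if __name__ == "__main__":
    frame = Image.new("RGB", (320, 240), "white")
    marks = [Mark(1, (80, 60)), Mark(2, (200, 150))]
    annotated = overlay_set_of_mark(frame, marks)   # annotated image fed to the model
    tracks = [[(80, 60), (90, 62), (102, 65)],      # mark 1 drifts to the right
              [(200, 150), (200, 151), (201, 151)]] # mark 2 stays nearly static
    print(trace_of_mark_targets(tracks))            # text targets for pretraining
```

The marks and traces themselves can be produced automatically (for example, by region proposal and point tracking), which is what would allow this kind of supervision to scale to unlabeled images and videos as described above.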

5. 📊 Evaluation and Performance of Magma

  • The Magma model was evaluated in a zero-shot manner on three tasks: spatial grounding, digital UI navigation, and physical robot manipulation, outperforming methods including GPT-4V.
  • Magma is the first model capable of performing all three agentic tasks simultaneously.
  • When configured for robot manipulation, Magma nearly doubles performance in simulated environments while using the same robot data as OpenVLA.
  • The proposed techniques effectively leverage unlabeled image and video data for agentic pretraining.
  • Fine-tuned for real-world robot manipulation and UI navigation, Magma shows superior performance on both seen and unseen tasks compared to OpenVLA.
  • On the realistic UI navigation benchmark Mind2Web, using only image data, Magma achieves a state-of-the-art success rate.

6. 📈 Conclusion and Future Directions

  • Developed the first agentic foundation model, Magma, capable of understanding multimodal input and taking action in both digital and physical environments.
  • Proposed two techniques, Set-of-Mark and Trace-of-Mark, to leverage large amounts of images and videos without human labels for model pretraining, addressing the challenge of limited pretraining data.
  • Produced a highly compatible foundation model suitable for a wide range of multimodal tasks, including understanding and action prediction.
  • Released the code and model for public access, encouraging experimentation and further development (a hedged loading sketch follows this list).
  • The work was a collaborative effort involving the Deep Learning group at Microsoft Research and many external collaborators.
  • Future research directions include enhancing model's adaptability to dynamic environments and expanding its real-world applications, particularly in autonomous systems and robotics.
  • Potential impact includes revolutionizing fields such as autonomous vehicles, robotics, and digital assistants by providing more intelligent and adaptable solutions.
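
Because the model is publicly released, a minimal loading sketch is included below. The Hugging Face checkpoint id (assumed here to be microsoft/Magma-8B) and the use of trust_remote_code are assumptions; consult the official release for the exact identifier and recommended loading code.

```python
# Minimal sketch, assuming the released checkpoint is hosted on Hugging Face
# under an id such as "microsoft/Magma-8B" (verify against the official release).
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Magma-8B"  # assumed id for illustration

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# From here, an image or video plus a task prompt would be run through the
# processor and model.generate(...) to obtain textual, spatial, or action outputs.
```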