Motivation. AI agents empowered by Large Language Models (LLMs) have shown advanced automation capabilities in executing complex tasks. With the rapid advancement of LLMs' reasoning abilities, AI agents have proven successful in diverse applications such as coding. Recently, Multimodal Foundation Models (MFMs) have emerged as a cutting-edge frontier in artificial intelligence, demonstrating remarkable potential to enhance the cognitive and perceptual capabilities of AI agents. Recent breakthroughs reveal that models integrating visual, textual, and auditory inputs enable fine-grained cross-modal reasoning. These systems excel at parsing complex scenarios, generating context-aware descriptions, and tackling tasks that require synergistic perception and language understanding, capabilities critical to robotics, human-computer interaction, and beyond. Here, we propose empowering AI agents to operate in multimodal scenarios by leveraging MFMs' advanced multimodal reasoning capabilities.
Background & Application. Recent AI agents have revealed their potential in solving more complex problems under multimodal settings. For example, OS-Copilot interacts with a computer operating system much like a human user, performing tasks such as web browsing, coding, and using third-party applications. Another trending AI agent application is the AI Scientist Agent for research automation, such as designing and running experiments. Nevertheless, existing AI Scientist Agents typically rely on textual input, and informative multimodal signals are often overlooked. Lastly, Embodied Agents represent a growing area in both research and application, where AI agents interact with the physical (or simulated) world through sensors and robotic arms. Driven by potential applications and deployment scenarios, we define four categories of multimodal AI agents: Digital Agents, Virtual Agents, Wearable Agents, and Physical Agents. Digital Agents are software-based agents that operate exclusively within digital environments, whereas Virtual Agents exist within virtual or simulated environments, such as VR platforms. Wearable Agents are integrated into wearable devices to provide real-time assistance, and Physical Agents primarily interact with the physical world. Each category leverages multimodal reasoning, that is, the integration and interpretation of multiple input modalities, to perceive, understand, and respond effectively within its respective environment.
Challenges. Multimodal reasoning faces unique theoretical and technical challenges compared to unimodal approaches. First, integrating heterogeneous data (text, images, video) demands innovations in model architectures, training paradigms, and evaluation frameworks. The exponential growth in computational complexity due to multimodal data raises concerns about resource efficiency and real-time responsiveness.
Second, applying these models to frontier domains—such as multi-agent collaboration, scientific discovery, and embodied intelligence systems—requires overcoming bottlenecks in cross-modal semantic understanding and knowledge transfer. Crucially, the interpretability and robustness of multimodal reasoning systems remain unresolved foundational issues, directly impacting the deployment of reliable real-world applications (e.g., scientific AI agents, bio-inspired robotics). Addressing these challenges necessitates breakthroughs in cross-modal representation alignment, dynamic attention mechanisms, and uncertainty modeling of multi-source information.
About this Workshop. All discussions within the scope of multimodal reasoning are welcome. This workshop aims to bring together researchers from various backgrounds to study the next generation of multimodal reasoning systems. To foster an inclusive space for dialogue and debate, we invite speakers and panelists from diverse backgrounds and areas of expertise. Our roster includes both renowned researchers and emerging investigators who have driven promising advances in the field.
Call for Papers
This workshop primarily focuses on the advancement of MFM-based Agents from the perspective of enhanced multimodal perception and reasoning. To foster an inclusive environment for discussion and debate, we welcome speakers and panelists from diverse backgrounds and expertise. Our lineup features distinguished researchers alongside emerging investigators who have made significant contributions to the field. Spotlight and poster sessions will highlight new ideas, key challenges, and retrospective insights related to the workshop’s themes.
Relevant topics include, but are not limited to:
- Multimodal Alignment: semantically consistent alignment across vision, language, and audio modalities.
- Multimodal Foundation Model Training: novel and efficient training paradigms for multimodal systems.
- Multimodal Reasoning: quantifying and enhancing the causal reasoning capabilities of multimodal foundation models.
- Multimodal Reasoning Evaluation: novel metrics for evaluating cross-modal reasoning in open-ended scenarios.
- Multimodal Perception: enhancing and balancing reliance on visual, audio, and other modalities in reasoning.
- Reducing computational demands introduced by highly redundant modalities, such as images and videos.
- The influence of multimodal signals on the behavior of LM-MM agents.
Submission:
Proceedings Track: Submit to OpenReview
Submission Guidelines:
- Paper Formatting: Papers are limited to eight pages, including figures and tables, in the ICCV style.
- Double Blind Review: ICCV reviewing is double blind, in that authors do not know the names of the area chairs or reviewers for their papers, and the area chairs/reviewers cannot, beyond a reasonable doubt, infer the names of the authors from the submission and the additional material.
Archival Policy: Submissions will be indexed and appear in archival proceedings.
Key Dates:
- Paper Submission Open: May 21st 2025, 23:59 AoE Time
- Paper Submission Deadline: June 24th 2025, 23:59 AoE Time
- Acceptance Notification: July 11th 2025, 23:59 AoE Time
- Camera-Ready Deadline: August 18th 2025, 23:59 AoE Time
- Workshop Date: October 19th-20th 2025
Non-Proceedings Track: Submit to OpenReview
Submission Guidelines:
- Paper Formatting: Papers must be between four and eight pages in length, including figures and tables, and formatted in the ICCV style.
- Double Blind Review: ICCV reviewing is double blind, in that authors do not know the names of the area chairs or reviewers for their papers, and the area chairs/reviewers cannot, beyond a reasonable doubt, infer the names of the authors from the submission and the additional material.
Archival Policy: Submissions will not be indexed or appear in archival proceedings.
Key Dates:
- Paper Submission Open: May 21st 2025, 23:59 AoE Time
- Paper Submission Deadline: July 24th 2025, 23:59 AoE Time
- Acceptance Notification: August 7th 2025 (Flexible), 23:59 AoE Time
- Camera-Ready Deadline: August 30th 2025, 23:59 AoE Time
- Workshop Date: October 19th-20th 2025
Review Guide
Thank you for your interest in the MMRAgI ICCV 2025 workshop. Your expertise and dedication contribute greatly to the success of this event.
Review:
- Confidentiality: All review assignments and the content of the papers you review should be kept confidential. Do not share these materials or discuss them with others unless they are also reviewers for the same paper.
- Conflict of Interest: If you recognize a conflict of interest with any paper you are assigned to review, please notify the program chairs immediately.
- Length Requirement: We recommend that submissions be kept within 4 pages (excluding references).
- Review Criteria:
- (1) Relevance: Does the paper align with the theme of the workshop, i.e., multimodal reasoning and MFM-based agent systems?
- (2) Originality: Does the paper present new ideas or results, or does it significantly build upon previous work?
- (3) Technical Soundness: Both position papers and methodology papers are acceptable. Is the position or methodology sound and properly explained? Are the claims supported by theoretical analysis or experimental results?
- (4) Clarity: Is the paper well-written and well-structured? Is it easy for readers to understand the problem, the approach, and the results?
- (5) Impact: If the results are applied, do they have the potential to contribute to the advancement of multimodal reasoning and MFM-based agent systems?
Speakers and Panelists

Alexander Toshev
Research scientist and manager, Apple.
Ranjay Krishna
Assistant Professor, University of Washington.
Ani Kembhavi
Director of Science Strategy at Wayve.
Lijuan Wang
Principal Research Manager, Microsoft Cloud & AI.
Guy Van den Broeck
Professor, University of California, Los Angeles.
Kristen Grauman
Professor, University of Texas at Austin.
Hannaneh Hajishirzi
Associate Professor, University of Washington.
Lucy Shi
Ph.D., Stanford University.
Workshop Schedule
Time | Session | Duration | Details |
---|---|---|---|
08:50 am - 09:00 am | Opening Remarks | 10 min | Welcome and Introduction to the Workshop |
09:00 am - 09:30 am | Invited Talk 1 | 30 min | Invited speaker presentation |
09:30 am - 10:00 am | Invited Talk 2 | 30 min | Invited speaker presentation |
10:00 am - 10:30 am | Contributed Opinion Talk 1 | 30 min | Community member presentation |
10:30 am - 10:45 am | Break | 15 min | Refreshments |
10:45 am - 11:00 am | Best Paper Talk 1 | 15 min | Top rated paper presentation |
11:00 am - 11:15 am | Best Paper Talk 2 | 15 min | Top rated paper presentation |
11:15 am - 01:05 pm | Poster Session & Lunch | 110 min | Networking and food |
01:05 pm - 01:35 pm | Invited Talk 3 | 30 min | Invited speaker presentation |
01:35 pm - 02:05 pm | Invited Talk 4 | 30 min | Invited speaker presentation |
02:05 pm - 02:35 pm | Contributed Opinion Talk 2 | 30 min | Community member presentation |
02:35 pm - 03:05 pm | Invited Talk 5 | 30 min | Invited speaker presentation |
03:05 pm - 03:20 pm | Break | 15 min | Refreshments |
03:20 pm - 03:50 pm | Invited Talk 6 | 30 min | Invited speaker presentation |
03:50 pm - 04:50 pm | Panel Discussion | 60 min | Interactive session with panelists |
04:50 pm - 05:20 pm | Breakout Rooms Discussion | 30 min | Small group discussions |
05:20 pm - 05:30 pm | Closing Remarks | 10 min | Concluding the workshop |
Organization
Workshop Organizers

Zhenfei Yin
Ph.D., USYD. Visiting Researcher, Oxford.
Naji Khosravan
Senior Applied Science Manager, Zillow Group.
Yin Wang
Senior Applied Scientist, Zillow Group.
Roozbeh Mottaghi
Senior Applied Scientist Manager, Meta FAIR.
Iro Armeni
Assistant Professor, Stanford University.
Zhuqiang Lu
Ph.D. student, University of Sydney.
Annie S. Chen
Ph.D., Stanford University.
Yufang Liu
Ph.D., East China Normal University.
Zixian Ma
Ph.D., the University of Washington.
Mahtab Bigverdi
Ph.D., the University of Washington.
Amita Kamath
Ph.D., the University of Washington and the University of California, Los Angeles.
Chen Feng
Institute Associate Professor, New York University.
Lei Bai
Research Scientist, Shanghai AI Laboratory.
Gordon Wetzstein
Associate Professor, Stanford University.