Motivation. AI agents empowered by Large Language Models (LLMs) have shown advanced automation capabilities in executing complex tasks. With the rapid advancement of LLMs' reasoning abilities, AI agents have proven successful in diverse applications such as coding. Recently, Multimodal Foundation Models (MFMs) have emerged as a cutting-edge frontier in artificial intelligence, demonstrating remarkable potential to enhance the cognitive and perceptual capabilities of AI agents. Recent breakthroughs reveal that models integrating visual, textual, and auditory inputs enable fine-grained cross-modal reasoning. These systems excel at parsing complex scenarios, generating context-aware descriptions, and tackling tasks that require synergistic perception and language understanding, capabilities critical to robotics, human-computer interaction, and beyond. Here, we propose empowering AI agents to operate in multimodal scenarios by leveraging MFMs' advanced multimodal reasoning capabilities.

Background & Application. Recent AI agents have revealed their potential for solving more complex problems in multimodal settings. For example, OS-Copilot can interact with a computer operating system much like a human user, performing tasks such as web browsing, coding, and using third-party applications. Another trending application is the AI Scientist Agent for research automation, such as designing and running experiments. Nevertheless, existing AI Scientist Agents typically rely on textual input, and informative multimodal signals are often overlooked. Lastly, Embodied Agents represent a growing area in both research and application, where AI agents interact with the physical (or simulated) world through sensors and robotic arms. Driven by potential applications and deployment scenarios, we define four categories of multimodal AI agents: Digital Agents, Virtual Agents, Wearable Agents, and Physical Agents. Digital Agents are software-based agents that operate exclusively within digital environments, whereas Virtual Agents exist within virtual or simulated environments, such as VR platforms. Wearable Agents are integrated into wearable devices to provide real-time assistance, while Physical Agents primarily interact with the physical world. Each category leverages multimodal reasoning, the integration and interpretation of multiple input modalities, to perceive, understand, and respond effectively within its respective environment.

Challenges. Multimodal reasoning faces unique theoretical and technical challenges compared to unimodal approaches. First, integrating heterogeneous data (text, images, video) demands innovations in model architectures, training paradigms, and evaluation frameworks. The sharp increase in computational cost introduced by multimodal data raises concerns about resource efficiency and real-time responsiveness.

Second, applying these models to frontier domains—such as multi-agent collaboration, scientific discovery, and embodied intelligence systems—requires overcoming bottlenecks in cross-modal semantic understanding and knowledge transfer. Crucially, the interpretability and robustness of multimodal reasoning systems remain unresolved foundational issues, directly impacting the deployment of reliable real-world applications (e.g., scientific AI agents, bio-inspired robotics). Addressing these challenges necessitates breakthroughs in cross-modal representation alignment, dynamic attention mechanisms, and uncertainty modeling of multi-source information.

About this Workshop. All discussions under the scope of multimodal reasoning are welcome. This workshop aims to bring together researchers with various backgrounds to study the next generation of multimodal reasoning systems. To foster an inclusive space for dialogue and debate, we invite speakers and panelists from diverse backgrounds and areas of expertise. Our roster includes both renowned researchers and emerging investigators who have driven promising advances in the field.

Call for Papers

This workshop primarily focuses on the advancement of MFM-based Agents from the perspective of enhanced multimodal perception and reasoning. To foster an inclusive environment for discussion and debate, we welcome speakers and panelists from diverse backgrounds and expertise. Our lineup features distinguished researchers alongside emerging investigators who have made significant contributions to the field. Spotlight and poster sessions will highlight new ideas, key challenges, and retrospective insights related to the workshop’s themes.

Relevant topics include, but are not limited to:

  • Multimodal Alignment: semantically consistent alignment across vision, language, and audio modalities.
  • Multimodal Foundation Model Training: novel and efficient training paradigms for multimodal systems.
  • Multimodal Reasoning: quantifying and enhancing the causal reasoning capabilities of multimodal foundation models.
  • Multimodal Reasoning Evaluation: novel metrics for evaluating cross-modal reasoning in open-ended scenarios.
  • Multimodal Perception: enhancing and balancing reliance on visual, audio, and other modalities in reasoning.
  • Reducing computational demands introduced by highly redundant modalities, such as images and videos.
  • The influence of multimodal signals on the behavior of LM-MM Agents.

Submission:

Proceedings Track: Submit to OpenReview

Submission Guideline:

  • Paper Formatting: Papers are limited to eight pages, including figures and tables, in the ICCV style.
  • Double Blind Review: ICCV reviewing is double blind, in that authors do not know the names of the area chairs or reviewers for their papers, and the area chairs/reviewers cannot, beyond a reasonable doubt, infer the names of the authors from the submission and the additional material.

Archival Policy (to be discussed): Submissions will be indexed or have archival proceedings.

Key Dates:

  • Paper Submission Open: May 21st 2025, 23:59 AoE Time
  • Paper Submission Deadline: June 24th 2025, 23:59 AoE Time
  • Acceptance Notification: July 11th 2025, 23:59 AoE Time
  • Camera-Ready Deadline: August 18th 2025, 23:59 AoE Time
  • Workshop Date: October 19-20, 2025

Non-Proceedings Track: Submit to OpenReview

Submission Guideline:

  • Paper Formatting: Papers must be between four and eight pages in length, including figures and tables, and formatted in the ICCV style.
  • Double Blind Review: ICCV reviewing is double blind, in that authors do not know the names of the area chairs or reviewers for their papers, and the area chairs/reviewers cannot, beyond a reasonable doubt, infer the names of the authors from the submission and the additional material.

Archival Policy (to be discussed): Submissions will not be indexed or have archival proceedings.

Key Dates:

  • Paper Submission Open: May 21st 2025, 23:59 AoE Time
  • Paper Submission Deadline: July 24th 2025, 23:59 AoE Time
  • Acceptance Notification: August 7th 2025 (Flexible), 23:59 AoE Time
  • Camera-Ready Deadline: August 30th 2025, 23:59 AoE Time
  • Workshop Date: October 19-20, 2025

Review Guide

Thank you for your interest in the MMRAgI ICCV 2025 workshop. Your expertise and dedication contribute greatly to the success of this event.

Review:

  • Confidentiality: All review assignments and the content of the papers you review should be kept confidential. Do not share these materials or discuss them with others unless they are also reviewers for the same paper.
  • Conflict of Interest: If you recognize a conflict of interest with any paper you are assigned to review, please notify the program chairs immediately.
  • Length Requirement: We recommend that submissions be kept within 4 pages (excluding references).
  • Review Criteria:
    • (1) Relevance: Does the paper align with the theme of the workshop, i.e., multimodal reasoning and MFM-based agents?
    • (2) Originality: Does the paper present new ideas or results, or does it significantly build upon previous work?
    • (3) Technical Soundness: Both position papers and methodology papers are acceptable. Is the opinion or methodology correct and properly explained? Are the claims supported by theoretical analysis or experimental results?
    • (4) Clarity: Is the paper well-written and well-structured? Is it easy for readers to understand the problem, the approach, and the results?
    • (5) Impact: If the results are applied, do they have the potential to advance multimodal reasoning and MFM-based agent systems?

Speakers and panelists

Alexander Toshev

Research scientist and manager, Apple.

Ranjay Krishna

Assistant Professor, University of Washington.

Ani Kembhavi

Director of Science Strategy, Wayve.

Lijuan Wang

Principal Research Manager, Microsoft Cloud & AI.

Guy Van den Broeck

Professor, University of California, Los Angeles.

Kristen Grauman

Professor, University of Texas at Austin.

Hannaneh Hajishirzi

Associate Professor, University of Washington.

Lucy Shi

Ph.D., Stanford University.

Workshop Schedule

Time | Session | Duration | Details
08:50 am - 09:00 am | Opening Remarks | 10 min | Welcome and introduction to the workshop
09:00 am - 09:30 am | Invited Talk 1 | 30 min | Invited speaker presentation
09:30 am - 10:00 am | Invited Talk 2 | 30 min | Invited speaker presentation
10:00 am - 10:30 am | Contributed Opinion Talk 1 | 30 min | Community member presentation
10:30 am - 10:45 am | Break | 15 min | Refreshments
10:45 am - 11:00 am | Best Paper Talk 1 | 15 min | Top-rated paper presentation
11:00 am - 11:15 am | Best Paper Talk 2 | 15 min | Top-rated paper presentation
11:15 am - 01:05 pm | Poster Session & Lunch | 110 min | Networking and food
01:05 pm - 01:35 pm | Invited Talk 3 | 30 min | Invited speaker presentation
01:35 pm - 02:05 pm | Invited Talk 4 | 30 min | Invited speaker presentation
02:05 pm - 02:35 pm | Contributed Opinion Talk 2 | 30 min | Community member presentation
02:35 pm - 03:05 pm | Invited Talk 5 | 30 min | Invited speaker presentation
03:05 pm - 03:20 pm | Break | 15 min | Refreshments
03:20 pm - 03:50 pm | Invited Talk 6 | 30 min | Invited speaker presentation
03:50 pm - 04:50 pm | Panel Discussion | 60 min | Interactive session with panelists
04:50 pm - 05:20 pm | Breakout Rooms Discussion | 30 min | Small group discussions
05:20 pm - 05:30 pm | Closing Remarks | 10 min | Concluding the workshop

Organization

Workshop Organizers

Zhenfei Yin

Ph.D., University of Sydney. Visiting Researcher, University of Oxford.

Naji Khosravan

Senior Applied Science Manager, Zillow Group.

Yin Wang

Senior Applied Scientist, Zillow Group.

Roozbeh Mottaghi

Senior Applied Scientist Manager, Meta FAIR.

Iro Armeni

Assistant Professor, Stanford University.

Zhuqiang Lu

Ph.D. student, University of Sydney.

Annie S. Chen

Ph.D., Stanford University.

Yufang Liu

Ph.D., East China Normal University.

Zixian Ma

Ph.D., University of Washington.

Mahtab Bigverdi

Ph.D., University of Washington.

Amita Kamath

Ph.D., University of Washington and University of California, Los Angeles.

Chen Feng

Institute Associate Professor, New York University.

Lei Bai

Research Scientist, Shanghai AI Laboratory.

Gordon Wetzstein

Associate Professor, Stanford University.

Philip Torr

Professor, University of Oxford.