Motivation. AI agents empowered by Large Language Models (LLMs) have shown advanced automation capabilities in executing complex tasks. With the rapid advancement of LLMs' reasoning abilities, AI agents have proven successful in diverse applications such as coding. Recently, Multimodal Foundation Models (MFMs) have emerged as a cutting-edge frontier in artificial intelligence, demonstrating remarkable potential to enhance the cognitive and perceptual capabilities of AI agents. Recent breakthroughs reveal that models integrating visual, textual, and auditory inputs enable fine-grained cross-modal reasoning. These systems excel at parsing complex scenarios, generating context-aware descriptions, and tackling tasks that require synergistic perception and language understanding, capabilities critical to robotics, human-computer interaction, and beyond. Here, we propose empowering AI agents to operate in multimodal scenarios by leveraging MFMs' advanced multimodal reasoning capabilities.

Background & Application. Recent AI agents have revealed their potential for solving more complex problems in multimodal settings. For example, OS-Copilot can interact with a computer operating system much like a human user, performing tasks such as web browsing, coding, and using third-party applications. Another trending application is the AI Scientist Agent for research automation, such as designing and running experiments. Nevertheless, existing AI Scientist Agents typically rely on textual input, and informative multimodal signals are often overlooked. Lastly, Embodied Agents represent a growing area in both research and application, where AI agents interact with the physical (or simulated) world through sensors and robotic arms. Driven by potential applications and deployment scenarios, we define four categories of multimodal AI agents: Digital Agents, Virtual Agents, Wearable Agents, and Physical Agents. Digital Agents are software-based agents that operate exclusively within digital environments, whereas Virtual Agents exist within virtual or simulated environments, such as VR platforms. Wearable Agents are integrated into wearable devices to provide real-time assistance, while Physical Agents primarily interact with the physical world. Each category leverages multimodal reasoning, the integration and interpretation of multiple input modalities, to perceive, understand, and respond effectively within its respective environment.

Challenges. Multimodal reasoning faces unique theoretical and technical challenges compared to unimodal approaches. First, integrating heterogeneous data (text, images, video) demands innovations in model architectures, training paradigms, and evaluation frameworks. The sharp increase in computational cost introduced by multimodal data raises concerns about resource efficiency and real-time responsiveness.

Second, applying these models to frontier domains—such as multi-agent collaboration, scientific discovery, and embodied intelligence systems—requires overcoming bottlenecks in cross-modal semantic understanding and knowledge transfer. Crucially, the interpretability and robustness of multimodal reasoning systems remain unresolved foundational issues, directly impacting the deployment of reliable real-world applications (e.g., scientific AI agents, bio-inspired robotics). Addressing these challenges necessitates breakthroughs in cross-modal representation alignment, dynamic attention mechanisms, and uncertainty modeling of multi-source information.

About this Workshop. All discussions under the scope of multimodal reasoning are welcome. This workshop aims to bring together researchers with various backgrounds to study the next generation of multimodal reasoning systems. To foster an inclusive space for dialogue and debate, we invite speakers and panelists from diverse backgrounds and areas of expertise. Our roster includes both renowned researchers and emerging investigators who have driven promising advances in the field.

Call for Papers

This workshop primarily focuses on the advancement of MFM-based Agents from the perspective of enhanced multimodal perception and reasoning. To foster an inclusive environment for discussion and debate, we welcome speakers and panelists from diverse backgrounds and expertise. Our lineup features distinguished researchers alongside emerging investigators who have made significant contributions to the field. Spotlight and poster sessions will highlight new ideas, key challenges, and retrospective insights related to the workshop’s themes.

Relevant topics include, but are not limited to:

  • Multimodal Alignment: semantically consistent alignment across vision, language, and audio modalities.
  • Multimodal Foundation Model Training: novel and efficient training paradigms for multimodal systems.
  • Multimodal Reasoning: quantifying and enhancing the causal reasoning capabilities of multimodal foundation models.
  • Multimodal Reasoning Evaluation: novel metrics for evaluating cross-modal reasoning in open-ended scenarios.
  • Multimodal Perception: enhancing and balancing reliance on visual, audio, and other modalities in reasoning.
  • Reducing computational demands introduced by highly redundant modalities, such as images and videos.
  • The influence of multimodal signals on the behavior of LM-MM Agents.

Submission:

Proceedings Track: Submit to OpenReview

Submission Guideline:

  • Paper Formatting: Papers are limited to eight pages, including figures and tables, in the ICCV style.
  • Double Blind Review: ICCV reviewing is double blind, in that authors do not know the names of the area chairs or reviewers for their papers, and the area chairs/reviewers cannot, beyond a reasonable doubt, infer the names of the authors from the submission and the additional material.

Archival Policy (to be discussed): Submissions will be indexed or have archival proceedings.

Key Dates:

  • Paper Submission Open: May 21st 2025, 23:59 AoE Time
  • Paper Submission Deadline: June 24th 2025, 23:59 AoE Time
  • Acceptance Notification: July 11th 2025, 23:59 AoE Time
  • Camera-Ready Deadline: August 18th 2025, 23:59 AoE Time
  • Workshop Date: October 19-20, 2025

Non-Proceedings Track: Submit to OpenReview

Submission Guideline:

  • Paper Formatting: Papers must be between four and eight pages in length, including figures and tables, and formatted in the ICCV style.
  • Double Blind Review: ICCV reviewing is double blind, in that authors do not know the names of the area chairs or reviewers for their papers, and the area chairs/reviewers cannot, beyond a reasonable doubt, infer the names of the authors from the submission and the additional material.

Archival Policy (to be discussed): Submissions will not be indexed or have archival proceedings.

Key Dates:

  • Paper Submission Open: May 21st 2025, 23:59 AoE Time
  • Paper Submission Deadline: July 24th 2025, 23:59 AoE Time
  • Acceptance Notification: August 7th 2025 (Flexible), 23:59 AoE Time
  • Camera-Ready Deadline: August 30th 2025, 23:59 AoE Time
  • Workshop Date: October 19-20, 2025

Review Guide

Thank you for your interest in the MMRAgI ICCV 2025 workshop. Your expertise and dedication contribute greatly to the success of this event.

Review:

  • Confidentiality: All review assignments and the content of the papers you review should be kept confidential. Do not share these materials or discuss them with others unless they are also reviewers for the same paper.
  • Conflict of Interest: If you recognize a conflict of interest with any paper you are assigned to review, please notify the program chairs immediately.
  • Length Requirement: We recommend that submissions be kept within 4 pages (excluding references).
  • Review Criteria:
    • (1) Relevance: Does the paper align with the theme of the workshop, i.e., multimodal reasoning and MFM-based agents?
    • (2) Originality: Does the paper present new ideas or results, or does it significantly build upon previous work?
    • (3) Technical Soundness: Both position papers and methodology papers are acceptable. Is the opinion or methodology correct and properly explained? Are the claims supported by theoretical analysis or experimental results?
    • (4) Clarity: Is the paper well-written and well-structured? Is it easy for readers to understand the problem, the approach, and the results?
    • (5) Impact: If the results are applied, do they have the potential to advance multimodal reasoning and MFM-based agent systems?

Speakers and panelists

Alexander Toshev

Research scientist and manager, Apple.

Ranjay Krishna

Assistant Professor, University of Washington.

Ani Kembhavi

Director of Science Strategy, Wayve.

Lijuan Wang

Principal Research Manager, Microsoft Cloud & AI.

Guy Van den Broeck

Professor, University of California, Los Angeles.

Kristen Grauman

Professor, University of Texas at Austin.

Hannaneh Hajishirzi

Associate Professor, University of Washington.

Lucy Shi

Ph.D., Stanford University.

Workshop Schedule

Time | Session | Duration | Details
08:50 am - 09:00 am | Opening Remarks | 10 min | Welcome and introduction to the workshop
09:00 am - 09:30 am | Invited Talk 1 | 30 min | Invited speaker presentation
09:30 am - 10:00 am | Invited Talk 2 | 30 min | Invited speaker presentation
10:00 am - 10:30 am | Contributed Opinion Talk 1 | 30 min | Community member presentation
10:30 am - 10:45 am | Break | 15 min | Refreshments
10:45 am - 11:00 am | Best Paper Talk 1 | 15 min | Top-rated paper presentation
11:00 am - 11:15 am | Best Paper Talk 2 | 15 min | Top-rated paper presentation
11:15 am - 01:05 pm | Poster Session & Lunch | 110 min | Networking and food
01:05 pm - 01:35 pm | Invited Talk 3 | 30 min | Invited speaker presentation
01:35 pm - 02:05 pm | Invited Talk 4 | 30 min | Invited speaker presentation
02:05 pm - 02:35 pm | Contributed Opinion Talk 2 | 30 min | Community member presentation
02:35 pm - 03:05 pm | Invited Talk 5 | 30 min | Invited speaker presentation
03:05 pm - 03:20 pm | Break | 15 min | Refreshments
03:20 pm - 03:50 pm | Invited Talk 6 | 30 min | Invited speaker presentation
03:50 pm - 04:50 pm | Panel Discussion | 60 min | Interactive session with panelists
04:50 pm - 05:20 pm | Breakout Rooms Discussion | 30 min | Small group discussions
05:20 pm - 05:30 pm | Closing Remarks | 10 min | Concluding the workshop

Organization

Workshop Organizers

Zhenfei Yin

Ph.D., University of Sydney. Visiting Researcher, University of Oxford.

Naji Khosravan

Senior Applied Science Manager, Zillow Group.

Yin Wang

Senior Applied Scientist, Zillow Group.

Roozbeh Mottaghi

Senior Applied Scientist Manager, Meta FAIR.

Iro Armeni

Assistant Professor, Stanford University.

Zhuqiang Lu

Ph.D. student, University of Sydney.

Annie S. Chen

Ph.D., Stanford University.

Yufang Liu

Ph.D., East China Normal University.

Zixian Ma

Ph.D., University of Washington.

Mahtab Bigverdi

Ph.D., University of Washington.

Amita Kamath

Ph.D., University of Washington and University of California, Los Angeles.

Chen Feng

Institute Associate Professor, New York University.

Lei Bai

Research Scientist, Shanghai AI Laboratory.

Gordon Wetzstein

Associate Professor, Stanford University.

Philip Torr

Professor, University of Oxford.