Invited Speakers

Ranjay Krishna

University of Washington

Ziwei Liu

Nanyang Technological University

Yilun Du

Harvard University

Aishwarya Agrawal

University of Montreal

Tentative Schedule

June 3, 2026  |  1:00 PM – 6:00 PM  |  Room 111

Time Event
13:00–13:10 Opening
13:10–13:40 Invited Talk 1 – Ranjay Krishna
13:40–13:55 Oral 1: Uncertainty-Guided Data Curation for 3D Object Detection
13:55–14:10 Oral 2: ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding
14:10–14:40 Invited Talk 2 – Ziwei Liu
14:40–14:55 Oral 3: Vero: An Open Reinforcement Learning Recipe for Visual Reasoning
14:55–15:45 Poster Session + Coffee Break
15:45–16:15 Invited Talk 3 – Aishwarya Agrawal
16:15–16:30 Oral 4: See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models
16:30–16:45 Oral 5: SVHalluc: Benchmarking Speech–Vision Hallucination in Audio-Visual Large Language Models
16:45–17:15 Invited Talk 4 – Yilun Du
17:15–17:30 Oral 6: Thinking in Dynamics: How Multimodal Large Language Models Perceive, Track, and Reason Dynamics in Physical 4D World
17:30–17:40 Competition Announcement
17:40–17:50 Challenge Task 1 Winner Talk
17:50–18:00 Challenge Task 2 Winner Talk
18:00–18:05 Concluding Remarks
18:05 Adjourn

DataMFM Challenge

The DataMFM Challenge focuses on multimodal document understanding at the intersection of vision, language, and structured reasoning. It currently covers two complementary components: Document Parsing and Chart Understanding, based on newly prepared challenge datasets built from OmniDocBench and ChartNet.
Scope: Document Parsing + Chart Understanding.
Timeline: Apr 27 Release · May 11 Submission Opening · May 29 Submission Deadline · Jun 03 Workshop.
Challenge Portal: DataMFM Challenge Portal

Call for Papers

We invite submissions on any topic related to Data for Multimodal Foundation Models (DataMFM), including, but not limited to:
  • Data collection, generation, and curation for multimodal foundation models
  • Data quality improvement, filtering, and pruning for scalable and efficient multimodal training
  • Data recipes and mixture design for balancing scale, quality, diversity, and coverage
  • Synthetic–real hybrid datasets and multimodal data augmentation for robust model development
  • Benchmark renewal, creation, and evaluation design for trustworthy multimodal applications
  • Detection and mitigation of dataset contamination in training and evaluation
  • Cross-modal alignment and grounding across text, image, audio, and video modalities
  • Fairness, bias reduction, and inclusive representation in multimodal datasets
  • Data provenance, documentation, licensing, and governance for trustworthy dataset lifecycles
  • Metrics and frameworks for assessing multimodal data quality, diversity, and contamination
  • Bridging modality gaps between text-rich and vision-centric domains
  • Agentic synthetic data generation and self-improving data pipelines driven by multimodal or VLA models
  • Building sustainable, transparent, and community-driven multimodal data ecosystems for next-generation foundation models
Submission Guidelines:
The workshop accepts submissions in three tracks:
(1) Full-length Papers (Archival, Proceedings Track): Up to 8 pages, excluding references; double-blind review; accepted papers will appear in the CVPR 2026 Workshop Proceedings.
(2) Short Papers / Extended Abstracts (Non-archival): Up to 4 pages, excluding references; double-blind review; intended for work-in-progress, datasets, benchmarks, and early-stage ideas.
(3) CVPR 2026 Accepted Papers (Non-archival, Non-anonymous): Papers accepted to the main CVPR 2026 conference; presented at the workshop but not included in the workshop proceedings.
Submission Sites:
Proceedings Track: https://openreview.net/group?id=thecvf.com/CVPR/2026/Workshop/DataMFM_Proceedings_Track
Non-archival Track: https://openreview.net/group?id=thecvf.com/CVPR/2026/Workshop/DataMFM_Non-archival
All submissions should use the CVPR 2026 paper template.

Accepted Papers

Proceedings Track

  • DataMFM-1   VLA-AD: Agentic Vision-Language Foundation Models for Context-Aware Anomaly Detection
  • DataMFM-3   Scalable Parallel Prompting for Complex AV Video Captioning
  • DataMFM-5   Adversarial Feedback from Segmentation Network to Siamese Diffusion for Improving Tumor Segmentation
  • DataMFM-8   AdGaze-3500: Evaluating Large Multimodal Models' Ability to Predict Human Attention to Ads
  • DataMFM-9   TimeCausality: Evaluating the Causal Ability in Time Dimension for Vision Language Models
  • DataMFM-10   Uncertainty-Guided Data Curation for 3D Object Detection
  • DataMFM-11   Longitudinal Multimodal Modeling for Alzheimer's Disease with Pre-trained Brain Latent Diffusion and Mixture-of-Experts Fusion
  • DataMFM-12   Learning Multimodal Priors with Shared Vector Quantization for Incomplete Multimodal Diagnosis
  • DataMFM-13   VLM Reality Check: A Causal Counterfactual Benchmark for Diagnosing Cognitive Biases in Vision-Language Models
  • DataMFM-19   Cinéaste: A Fine-grained Contextual Movie Question Answering Benchmark with Automated Data Curation

Non-archival Track

  • DataMFM-2   Vero: An Open Reinforcement Learning Recipe for Visual Reasoning
  • DataMFM-3   CodeV: Code with Images for Faithful Visual Reasoning via Tool-Aware Policy Optimization
  • DataMFM-4   M3Grounder: Mask-Based Multi-Span and Multi-Granular Grounding for Document QA
  • DataMFM-5   Entropy-Guided Prototype Selection for Data-Efficient k-NN: CIFAR-10 Deep Features and MNIST Pixels
  • DataMFM-6   See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models
  • DataMFM-7   Chasing Ghosts: A Simulation-to-Real Olfactory Navigation Stack with Optional Vision Augmentation
  • DataMFM-8   Thinking in Dynamics: How Multimodal Large Language Models Perceive, Track, and Reason Dynamics in Physical 4D World
  • DataMFM-9   Aligning What Vision-Language Models See and Perceive with Adaptive Information Flow
  • DataMFM-10   Evaluating Multimodal Embeddings for Board Game Knowledge Representation
  • DataMFM-11   MV-Fashion: Towards Enabling Virtual Try-On and Size Estimation with Multi-View Paired Data
  • DataMFM-12   Failure Modes for Deep Learning-Based Online Mapping: How to Measure and Address Them
  • DataMFM-13   AVRobustBench: Benchmarking the Robustness of Audio-Visual Recognition Models at Test-Time
  • DataMFM-14   HandX: Scaling Bimanual Motion and Interaction Generation
  • DataMFM-15   FigEx2: Visual-Conditioned Panel Detection and Captioning for Scientific Compound Figures
  • DataMFM-16   A Dataset for Dynamic Human Preferences for Vision Language Models
  • DataMFM-17   SVHalluc: Benchmarking Speech–Vision Hallucination in Audio-Visual Large Language Models
  • DataMFM-18   Video2Reaction: Training Foundation Video Models to Predict Audience Reaction
  • DataMFM-20   MAVEN: Agentic Multi-Scale Video Annotation Pipeline for Structured Synthetic Data Generation
  • DataMFM-21   TimeBlind: A Spatio-Temporal Compositionality Benchmark for Video LLMs
  • DataMFM-22   OctoMed: Data Recipes for State-of-the-Art Multimodal Medical Reasoning
  • DataMFM-23   CRIT: Graph-Based Automatic Data Synthesis to Enhance Cross-Modal Multi-Hop Reasoning
  • DataMFM-24   ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding
  • DataMFM-25   Multimodal Distribution Matching for Vision-Language Dataset Distillation
  • DataMFM-26   RDFace: A Benchmark Dataset for Rare Disease Facial Image Analysis under Extreme Data Scarcity and Phenotype-Aware Synthetic Generation
  • DataMFM-27   Rethinking Dataset Distillation: Hard Truths about Soft Labels
  • DataMFM-28   SAGE: Scalable Agentic Grounded Evaluation for Crop Disease Diagnosis
  • DataMFM-29   VQA-DISAGREE: Multi-Model Disagreement as an Annotation-Free Difficulty Signal for VQA Benchmarks
  • DataMFM-30   VISTA: Dense Multi-Label Classroom Coding with Vision-Language Models
  • DataMFM-31   CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space

Poster Printing: Please follow the CVPR 2026 instructions for printing your posters: https://cvpr.thecvf.com/Conferences/2026/PosterPrinting

Important Dates

Event  |  Date
Paper submission deadline  |  March 10, 2026 (archival); April 13, 2026 (non-archival)
Notification of acceptance  |  March 25, 2026 (archival); April 21, 2026 (non-archival)
Camera-ready submission deadline  |  April 7, 2026 (archival)
Workshop date  |  June 3, 2026, 1:00 PM – 6:00 PM, Room 111

Challenge Organizers

Xiaolong Luo

Harvard University

Simeng Han

Stanford University

Longtian Ye

2077AI Foundation

Minglai Yang

2077AI Foundation

Henry Zhang

University of California, Berkeley

Liam Liu

2077AI Foundation

Organizers

Pengyuan Li

MIT-IBM Watson AI Lab

Zexue He

Stanford University

Zihan Wang

Abaka AI

Xuan (Ruby) Zhang

2077AI Foundation

Wenhu Chen

University of Waterloo

Manling Li

Northwestern University

Rogerio Feris

MIT-IBM Watson AI Lab

Sponsors