MiroEval
A holistic benchmark for deep research systems that evaluates not just what they write, but whether it's factually grounded and how they research — featuring adaptive rubrics, agentic fact verification, multimodal tasks, and process-level auditing.
Bridging the Evaluation Gap for Deep Research
Recent progress in deep research systems has been impressive, but evaluation still lags behind real user needs. Existing benchmarks assess final reports with fixed rubrics, overlooking the research process; most offer limited multimodal coverage, rely on synthetic tasks, and cannot be refreshed as knowledge evolves. MiroEval addresses these gaps with a benchmark of 100 tasks (70 text-only, 30 multimodal) grounded in real user needs and constructed via a dual-path pipeline that supports periodic updates. The evaluation suite assesses systems along three complementary dimensions: adaptive synthesis quality evaluation with task-specific rubrics, agentic factuality verification via active retrieval and reasoning over web sources and multimodal attachments, and process-centric evaluation that audits how systems search, reason, and refine throughout their investigation. Evaluation across 13 systems yields three principal findings: the three dimensions capture complementary aspects of capability with each revealing distinct strengths and weaknesses; process quality reliably predicts overall outcome while exposing weaknesses invisible to output-level metrics; and multimodal tasks pose substantially greater challenges, with most systems declining by 3 to 10 points. Validated by a human study achieving 92.0% precision, MiroEval provides a holistic diagnostic tool for the next generation of deep research agents.
Query Distribution Overview — (a) Domain × Task sunburst, (b) Domain distribution, (c) Task type distribution across 100 benchmark queries.
Dual-Path Query Construction
100 queries assembled via two complementary paths — grounded in real user needs, with rigorous human verification and support for temporal refresh.
65 User-Derived Queries
Inspired by real user query patterns from closed internal testing of MiroMind, covering both text-only and multimodal interactions. Rather than anonymizing original queries, the pipeline produces entirely new benchmark queries that preserve the topic distribution, complexity profile, and modality coverage of the original population — no original user query appears in any form. Queries span three difficulty tiers from basic retrieval to multi-step reasoning and contradiction detection, yielding 35 text-only and 30 multimodal tasks with diverse attachments (images, PDFs, spreadsheets, slides).
35 Trend-Grounded Queries
Generated through a fully automated pipeline grounded in real-time web trends across 12 topics. Three-stage quality filtering progressively retains only queries that expose the limitations of parametric knowledge — ensuring each task genuinely requires deep investigation with external sources. The pipeline supports on-demand re-execution with the latest web trends, preventing the benchmark from becoming stale and reducing the risk of overfitting to known tasks.
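The progressive three-stage retention can be sketched as a chain of filters. This is a hypothetical illustration of the idea, not MiroEval's actual pipeline code; the stage names and predicates are assumptions.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Query:
    text: str
    requires_search: bool    # cannot be answered from parametric knowledge alone
    is_deep_research: bool   # needs multi-step investigation, not a single lookup
    is_well_formed: bool     # unambiguous, answerable task statement

# Illustrative stages: each one discards queries that fail its predicate,
# so only queries surviving every stage enter the benchmark.
STAGES: list[Callable[[Query], bool]] = [
    lambda q: q.is_well_formed,     # stage 1: structural validity
    lambda q: q.requires_search,    # stage 2: exposes parametric-knowledge limits
    lambda q: q.is_deep_research,   # stage 3: genuinely requires deep investigation
]

def three_stage_filter(candidates: list[Query]) -> list[Query]:
    """Progressively retain only queries that pass every stage."""
    kept = candidates
    for stage in STAGES:
        kept = [q for q in kept if stage(q)]
    return kept
```

Because the pipeline re-runs against fresh web trends, the same filter chain can regenerate the trend-grounded portion of the benchmark on demand.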
Overview of the dual-path query construction pipeline — user-derived curation (left) and automated trend-grounded generation (right).
Quality Verification
Three graduate-level annotators independently assess each query on validity (legitimate deep-research task) and non-triviality (requires web search). Both paths support periodic re-execution for temporal refresh — preventing benchmark staleness and reducing overfitting risk.
| Metric | User-Derived | Auto-Generated | Aggregated |
|---|---|---|---|
| Fleiss' κ (validity) | 0.83 | 0.79 | 0.81 |
| Fleiss' κ (non-triviality) | 0.78 | 0.74 | 0.76 |
| Majority-vote precision | 94.0% | 90.0% | 92.0% |
| Unanimous agreement | 86.0% | 82.0% | — |
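The agreement numbers above are Fleiss' κ over three annotators' binary judgments. A minimal stdlib implementation of the statistic (the rating data here is illustrative, not the benchmark's actual annotations):

```python
def fleiss_kappa(ratings: list[list[int]]) -> float:
    """Fleiss' kappa. ratings[i][j] = number of raters assigning item i to category j."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])  # raters per item (3 annotators here)
    n_cats = len(ratings[0])
    # Mean per-item agreement
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ) / n_items
    # Chance agreement from marginal category proportions
    p_e = sum(
        (sum(row[j] for row in ratings) / (n_items * n_raters)) ** 2
        for j in range(n_cats)
    )
    return (p_bar - p_e) / (1 - p_e)

# Four items rated [valid, invalid] by 3 annotators (illustrative data)
kappa = fleiss_kappa([[3, 0], [3, 0], [2, 1], [0, 3]])  # -> 0.625
```

Values in the 0.74–0.83 range, as reported above, indicate substantial inter-annotator agreement.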
Three-Layered Evaluation Pipeline
MiroEval assesses deep research systems across three complementary dimensions, capturing both the quality of the final output and the research process itself.
Synthesis Quality
Comprehensive Adaptive Synthesis Quality Evaluation that dynamically generates task-specific dimensions, criteria, and weights. Supports both text-only and attachment-augmented queries with grounding criteria.
Agentic Factuality
An evaluation agent decomposes reports into verifiable statements, retrieves evidence from web and attachments via dual-source RAG, and assesses consistency with four-label classification.
Process-Centric
Audits the research trajectory by decomposing it into atomic units, evaluating intrinsic quality across five dimensions, and measuring alignment between process findings and the final report.
Overview of the MiroEval evaluation pipeline — integrating Synthesis Quality, Agentic Factuality, and Process-Centric evaluation.
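The adaptive synthesis score reduces to a weighted average over task-specific rubric dimensions. A minimal sketch, assuming a simple weighted mean on a 0–100 scale; the dimension names mirror the reported sub-metrics, but these particular weights are hypothetical (MiroEval generates dimensions, criteria, and weights per query):

```python
def synthesis_score(dim_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-dimension rubric scores on a 0-100 scale."""
    total_w = sum(weights.values())
    return sum(dim_scores[d] * w for d, w in weights.items()) / total_w

# Illustrative per-dimension scores and hypothetical task-specific weights
scores = {"coverage": 80.6, "insight": 80.3, "instruction": 84.7,
          "clarity": 81.0, "specificity": 70.0}
weights = {"coverage": 0.25, "insight": 0.25, "instruction": 0.20,
           "clarity": 0.15, "specificity": 0.15}
overall = synthesis_score(scores, weights)  # -> 79.815
```

The point of generating weights per task is that, for example, a data-heavy query can upweight specificity while an open-ended survey query upweights coverage and insight.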
What We Discovered
Rankings Shift Across Dimensions
Kimi-K2.5 achieves the highest Synthesis score among non-MiroThinker systems at 75.7, yet its Factuality ranks near the bottom. Manus obtains the lowest Synthesis score at 55.4, yet its Factuality surpasses several stronger-synthesis systems.
Process Predicts Outcome
The top three systems on Process (MiroThinker-H1, OpenAI, MiroThinker-1.7) are also the top three on overall outcome, and the weakest process system (Doubao) also produces a near-bottom outcome. The Pearson correlation between Process and combined Outcome reaches 0.88.
Multimodal Tasks Are Harder
Overall scores drop by 3 to 10 points for most systems when moving from Text-Only to MultiModal. MiroThinker-H1 proves the most resilient with only a 3.0-point decline.
Leaderboard
Performance comparison of leading deep research systems across Synthesis Quality, Factuality, and Process dimensions.
| # | Model | Synthesis (Text) | Factuality (Text) | Process (Text) | Overall (Text) | Synthesis (MM) | Factuality (MM) | Process (MM) | Overall (MM) | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | MiroThinker-H1 | 76.7 | 81.1 | 74.7 | 77.5 | 71.5 | 78.5 | 73.5 | 74.5 | 76.6 |
| 2 | OpenAI Deep Research | 73.8 | 83.3 | 73.1 | 76.7 | 66.7 | 77.0 | 66.8 | 70.2 | 74.8 |
| 3 | MiroThinker-1.7 | 74.3 | 79.4 | 72.7 | 75.5 | 69.0 | 78.4 | 67.4 | 71.6 | 74.3 |
| 4 | MiroThinker-1.7-mini | 74.0 | 76.2 | 68.5 | 72.9 | – | – | – | – | – |
| 5 | Gemini-3.1-Pro Deep Research | 71.2 | 71.3 | 67.1 | 69.9 | 66.4 | 73.7 | 64.1 | 68.1 | 69.3 |
| 6 | Kimi-K2.5 Deep Research | 75.7 | 65.4 | 64.2 | 68.4 | – | – | – | – | – |
| 7 | Claude-Opus-4.6 Research | 67.3 | 69.8 | 66.0 | 67.7 | 62.5 | 70.7 | 65.9 | 66.4 | 67.3 |
| 8 | MiniMax-M2.5 Research | 63.3 | 71.8 | 67.1 | 67.4 | 56.7 | 71.0 | 62.2 | 63.3 | 66.2 |
| 9 | ChatGLM Agent | 63.2 | 68.6 | 65.6 | 65.8 | 61.6 | 71.6 | 57.7 | 63.6 | 65.1 |
| 10 | Qwen-3.5-Plus Deep Research | 60.0 | 73.1 | 61.1 | 64.7 | 44.6 | 69.9 | 53.8 | 56.1 | 62.1 |
| 11 | Manus-1.6-Max Wide Research | 55.4 | 72.6 | 64.2 | 64.0 | 54.3 | 70.0 | 61.8 | 62.0 | 63.4 |
| 12 | Doubao Deep Research | 64.2 | 64.9 | 53.1 | 60.7 | – | – | – | – | – |
| 13 | Grok Deep Research | 58.7 | 63.7 | 58.3 | 60.2 | 56.3 | 71.5 | 53.9 | 60.5 | 60.3 |
Performance comparison of models with MiroEval. Text-Only comprises 70 tasks; MultiModal comprises 30 tasks. Systems are ranked by Text-Only Overall score. "–" indicates the system does not support multimodal deep research or data is unavailable.
Detailed score heatmap across Synthesis, Factuality, and Process sub-metrics for Text-Only (70 tasks) and Multimodal (30 tasks) settings.
| Model | Cov. (Text) | Insight (Text) | Instr. (Text) | Clarity (Text) | Spec. (Text) | Cov. (MM) | Insight (MM) | Instr. (MM) | Clarity (MM) | Spec. (MM) | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MiroThinker-H1 | 80.6 | 80.3 | 84.7 | 81.0 | 70.0 | 72.7 | 76.0 | 78.6 | 78.3 | 59.5 | 76.7 |
| Kimi-K2.5 Deep Research | 80.4 | 79.8 | 78.6 | 76.3 | 71.7 | – | – | – | – | – | 75.7 |
| MiroThinker-1.7 | 79.2 | 74.7 | 84.7 | 80.1 | 68.4 | 72.6 | 69.2 | 78.6 | 75.1 | 53.6 | 74.3 |
| MiroThinker-1.7-mini | 78.8 | 75.0 | 84.3 | 78.7 | 68.1 | – | – | – | – | – | 74.0 |
| OpenAI Deep Research | 78.2 | 74.3 | 81.6 | 77.1 | 69.1 | 70.6 | 63.9 | 74.8 | 70.5 | 54.2 | 73.8 |
| Gemini-3.1-Pro Deep Research | 77.4 | 76.6 | 80.0 | 70.1 | 64.9 | 72.4 | 70.8 | 72.4 | 62.5 | 50.1 | 71.2 |
| Claude-Opus-4.6 Research | 73.3 | 72.0 | 73.5 | 71.2 | 61.1 | 68.9 | 66.8 | 62.8 | 59.3 | 50.0 | 67.3 |
| Doubao Deep Research | 72.9 | 62.7 | 74.6 | 67.2 | 58.2 | – | – | – | – | – | 64.2 |
| MiniMax-M2.5 Research | 69.8 | 62.7 | 74.2 | 70.6 | 56.7 | 63.1 | 53.3 | 69.1 | 62.0 | 39.2 | 63.3 |
| ChatGLM Agent | 69.9 | 62.8 | 74.5 | 67.5 | 57.1 | 67.1 | 60.2 | 71.7 | 65.4 | 45.1 | 63.2 |
| Qwen-3.5-Plus Deep Research | 64.0 | 64.7 | 69.9 | 67.8 | 52.6 | 46.8 | 46.3 | 52.9 | 52.6 | 30.1 | 60.0 |
| Grok Deep Research | 67.3 | 56.3 | 74.9 | 64.7 | 51.1 | 61.8 | 52.5 | 68.9 | 60.4 | 40.5 | 58.7 |
| Manus-1.6-Max Wide Research | 61.2 | 54.8 | 67.9 | 65.6 | 48.1 | 58.7 | 50.2 | 65.0 | 61.2 | 40.4 | 55.4 |
Synthesis quality sub-metrics across five dimensions: Coverage, Insight, Instruction-following, Clarity, and Query Specification. Text-Only comprises 70 tasks; MultiModal comprises 30 tasks. "–" indicates data is unavailable.
| Model | Right (Text) | Wrong (Text) | Conf. (Text) | Unk. (Text) | Ratio (Text) | Right (MM) | Wrong (MM) | Conf. (MM) | Unk. (MM) | Ratio (MM) | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| OpenAI Deep Research | 3335 | 170 | – | 496 | 83.3 | 1062 | 100 | 36 | 157 | 77.0 | 83.3 |
| MiroThinker-H1 | 3746 | 161 | – | 673 | 81.1 | 1316 | 82 | 56 | 238 | 78.5 | 81.1 |
| MiroThinker-1.7 | 3334 | 181 | – | 670 | 79.4 | 1306 | 103 | 63 | 235 | 78.4 | 79.4 |
| MiroThinker-1.7-mini | 3397 | 246 | – | 802 | 76.2 | – | – | – | – | – | 76.2 |
| Qwen-3.5-Plus Deep Research | 1706 | 244 | – | 380 | 73.1 | 576 | 99 | 19 | 101 | 69.9 | 73.1 |
| Manus-1.6-Max Wide Research | 1972 | 191 | – | 459 | 72.6 | 681 | 81 | 32 | 134 | 70.0 | 72.6 |
| MiniMax-M2.5 Research | 3872 | 486 | – | 921 | 71.8 | 1255 | 184 | 59 | 255 | 71.0 | 71.8 |
| Gemini-3.1-Pro Deep Research | 4039 | 526 | – | 1068 | 71.3 | 1502 | 158 | 94 | 302 | 73.7 | 71.3 |
| Claude-Opus-4.6 Research | 2838 | 338 | – | 910 | 69.8 | 964 | 84 | 44 | 243 | 70.7 | 69.8 |
| ChatGLM Agent | 4096 | 580 | – | 981 | 68.6 | 1038 | 144 | 46 | 215 | 71.6 | 68.6 |
| Kimi-K2.5 Deep Research | 3702 | 595 | – | 1256 | 65.4 | – | – | – | – | – | 65.4 |
| Doubao Deep Research | 3890 | 780 | – | 1393 | 64.9 | – | – | – | – | – | 64.9 |
| Grok Deep Research | 1924 | 368 | – | 699 | 63.7 | 734 | 104 | 37 | 163 | 71.5 | 63.7 |
Factuality evaluation. Measured by atomic claim counts (Right, Wrong, Conflict, Unknown) and average per-task right ratio (scaled to [0, 100]). Ranked by Text-Only Ratio. Text-Only comprises 70 tasks; MultiModal comprises 30 tasks. "–" indicates data is unavailable.
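The Ratio column follows from the caption's definition: each atomic claim receives one of four labels, and each task's score is the share of "right" claims, averaged across tasks and scaled to [0, 100]. A sketch of that aggregation (label strings are assumptions):

```python
def right_ratio(labels: list[str]) -> float:
    """Fraction of claims labeled 'right' among all claims extracted for one task."""
    return labels.count("right") / len(labels) if labels else 0.0

def factuality_score(per_task_labels: list[list[str]]) -> float:
    """Average per-task right ratio, scaled to [0, 100]."""
    return 100 * sum(right_ratio(t) for t in per_task_labels) / len(per_task_labels)
```

Because the score averages per-task ratios rather than pooling all claims, a system cannot buy accuracy by padding easy tasks with many safe claims.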
| Model | Brdth (Text) | Depth (Text) | Refin (Text) | Critl (Text) | Effic (Text) | Intr. Avg (Text) | F→R (Text) | R→P (Text) | Contr (Text) | Align. Avg (Text) | Brdth (MM) | Depth (MM) | Refin (MM) | Critl (MM) | Effic (MM) | Intr. Avg (MM) | F→R (MM) | R→P (MM) | Contr (MM) | Align. Avg (MM) | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MiroThinker-H1 | 74.9 | 64.9 | 72.2 | 69.1 | 71.0 | 70.4 | 87.0 | 63.3 | 86.4 | 78.9 | 68.6 | 63.1 | 73.4 | 71.0 | 64.1 | 68.1 | 86.6 | 63.4 | 86.9 | 79.0 | 74.7 |
| OpenAI Deep Research | 77.4 | 67.3 | 76.7 | 74.7 | 63.7 | 72.0 | 83.6 | 59.0 | 79.9 | 74.1 | 65.5 | 62.1 | 73.8 | 70.0 | 54.5 | 65.2 | 77.2 | 56.2 | 72.1 | 68.5 | 73.1 |
| MiroThinker-1.7 | 74.4 | 64.4 | 75.7 | 71.6 | 64.6 | 70.1 | 83.7 | 59.4 | 82.5 | 75.2 | 65.0 | 57.0 | 72.0 | 63.0 | 57.7 | 62.9 | 80.7 | 58.7 | 76.0 | 71.8 | 72.7 |
| MiroThinker-1.7-mini | 75.5 | 56.3 | 71.3 | 70.9 | 59.0 | 66.6 | 79.7 | 56.3 | 75.2 | 70.4 | – | – | – | – | – | – | – | – | – | – | 68.5 |
| Gemini-3.1-Pro Deep Research | 75.4 | 66.6 | 75.9 | 64.1 | 59.0 | 68.2 | 72.9 | 50.6 | 74.4 | 66.0 | 69.7 | 65.3 | 71.0 | 58.3 | 47.0 | 62.3 | 75.7 | 49.0 | 73.0 | 65.9 | 67.1 |
| MiniMax-M2.5 Research | 71.9 | 62.2 | 70.1 | 62.5 | 63.5 | 66.0 | 77.4 | 53.0 | 74.3 | 68.3 | 51.0 | 59.0 | 65.0 | 43.7 | 63.0 | 56.3 | 77.0 | 52.0 | 75.0 | 68.0 | 67.1 |
| Claude-Opus-4.6 Research | 79.1 | 58.8 | 67.2 | 56.7 | 62.2 | 64.8 | 81.0 | 47.1 | 73.5 | 67.2 | 75.2 | 60.7 | 69.6 | 59.3 | 60.0 | 65.0 | 78.9 | 49.3 | 72.6 | 66.9 | 66.0 |
| ChatGLM Agent | 76.2 | 59.4 | 67.1 | 59.3 | 59.0 | 64.2 | 77.1 | 51.4 | 72.3 | 67.0 | 52.7 | 52.3 | 55.7 | 44.7 | 54.3 | 51.9 | 73.0 | 47.0 | 70.7 | 63.6 | 65.6 |
| Manus-1.6-Max Wide Research | 62.8 | 58.4 | 60.6 | 53.5 | 68.8 | 60.8 | 75.1 | 51.3 | 76.3 | 67.6 | 52.4 | 57.2 | 60.7 | 43.4 | 65.9 | 55.9 | 74.5 | 54.5 | 74.1 | 67.7 | 64.2 |
| Kimi-K2.5 Deep Research | 77.5 | 59.4 | 71.0 | 67.6 | 53.5 | 65.8 | 70.7 | 46.8 | 70.4 | 62.6 | – | – | – | – | – | – | – | – | – | – | 64.2 |
| Qwen-3.5-Plus Deep Research | 74.4 | 64.1 | 75.0 | 74.1 | 63.2 | 70.2 | 59.6 | 39.7 | 56.9 | 52.1 | 57.0 | 51.3 | 58.7 | 57.7 | 51.3 | 55.2 | 61.7 | 39.3 | 56.3 | 52.4 | 61.1 |
| Grok Deep Research | 50.9 | 49.4 | 61.0 | 54.6 | 64.7 | 56.1 | 74.6 | 42.2 | 64.6 | 60.4 | 41.9 | 44.3 | 52.4 | 42.4 | 59.5 | 48.1 | 72.4 | 41.4 | 65.2 | 59.7 | 58.3 |
| Doubao Deep Research | 59.3 | 41.6 | 59.6 | 55.7 | 53.3 | 53.9 | 65.7 | 36.8 | 54.2 | 52.2 | – | – | – | – | – | – | – | – | – | – | 53.1 |
Process evaluation results. Intrinsic metrics: Search Breadth, Analytical Depth, Progressive Refinement, Critical Thinking, and Efficiency. Alignment metrics: Findings→Report coverage, Report→Process traceability, and Contradiction detection. Text-Only comprises 70 tasks; MultiModal comprises 30 tasks. "–" indicates data is unavailable.
Rankings Shift Substantially Across Dimensions
The three dimensions capture fundamentally different aspects of system capability. Kimi-K2.5 posts the highest Synthesis score among non-MiroThinker systems (75.7) yet ranks near the bottom on Factuality; Manus shows the opposite pattern, pairing the lowest Synthesis score (55.4) with competitive Factuality. A polished report does not guarantee factual grounding, nor vice versa.
Dimension-Level Rank Shifts — Bump chart showing how system rankings change across Synthesis, Factuality, and Process dimensions.
Specificity is the Bottleneck, Insight is the Differentiator
Specificity is the universal bottleneck, trailing Coverage by 10–14 points across all systems — systems find relevant topics but struggle with granular, evidence-grounded details. Insight shows the widest spread (54.8–80.3), making it the most discriminative sub-metric for differentiating capability.
Sub-dimension Score Spread — Specificity is consistently the lowest sub-metric across systems, while Insight exhibits the widest variance.
Factual Claims: A Precision–Volume Trade-off
A fundamental tension exists between claim volume and accuracy. ChatGLM and Gemini produce 4,000+ correct claims but with 500+ errors (Ratio ~70%); OpenAI generates fewer but achieves the highest Ratio at 83.3. The MiroThinker series balances both: H1 produces the most claims among top systems (3,746) with the fewest errors (161).
Statement Volume vs. Factual Precision — the trade-off between broad claim coverage and strict factual discipline, with the MiroThinker series achieving a distinctive balance.
Process Quality Predicts Overall Outcome
The top three Process systems are also the top three on overall outcome; the weakest (Doubao at 53.1) produces a near-bottom result. The Pearson correlation between Process and combined Outcome reaches r = 0.88, confirming process quality as a reliable predictor.
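The r = 0.88 figure is a standard Pearson correlation between per-system Process scores and combined Outcome scores. A minimal stdlib implementation, for readers who want to recompute it from the leaderboard columns:

```python
import math

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Feeding in the Process and Overall columns for the evaluated systems yields the reported correlation.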
Process Quality vs. Overall Outcome — strong positive correlation (r = 0.88) across all systems in the Text-Only setting.
The Traceability Gap: Reports Outrun the Process
Process evaluation exposes a gap invisible to output-level metrics. F→R scores are generally above 70 (findings reach the report), but R→P falls below 55 for most systems — even MiroThinker-H1 achieves only 63.3. A substantial portion of report content cannot be traced back to the documented research process.
Process–Report Alignment Gap — F→R (blue) vs. R→P (orange). The gap reveals reports routinely contain content not originating from the documented research process.
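One way to read the two alignment metrics is as directional coverage between the set of atomic process findings and the set of report claims. This sketch uses exact set intersection purely for illustration; in MiroEval the matching is performed by the evaluation agent, not string equality:

```python
def f_to_r(findings: set[str], report: set[str]) -> float:
    """Findings -> Report: share of process findings that reach the report."""
    return len(findings & report) / len(findings) if findings else 0.0

def r_to_p(report: set[str], findings: set[str]) -> float:
    """Report -> Process: share of report claims traceable to the process."""
    return len(report & findings) / len(report) if report else 0.0

findings = {"a", "b", "c", "d"}   # documented during research
report = {"a", "b", "e"}          # asserted in the final report
```

A high F→R with a low R→P is exactly the traceability gap: claim "e" above appears in the report with no documented origin in the research process.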
Multimodal Tasks Pose Greater Challenges
Overall scores drop by 3 to 10 points for most systems when moving from Text-Only to MultiModal. MiroThinker-H1 proves the most resilient with a decline of only 3.0 points, while Qwen-3.5-Plus suffers the largest drop at 8.6 points.
Text-Only vs. Multimodal — Dumbbell chart comparing overall scores across both settings. The gap indicates performance degradation on multimodal tasks.
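The per-system drops can be recomputed directly from the leaderboard's two Overall columns (Text-Only minus MultiModal), shown here for three representative systems:

```python
overall = {  # (text_only, multimodal) Overall scores from the leaderboard
    "MiroThinker-H1": (77.5, 74.5),
    "OpenAI Deep Research": (76.7, 70.2),
    "Qwen-3.5-Plus Deep Research": (64.7, 56.1),
}
drops = {name: round(t - m, 1) for name, (t, m) in overall.items()}
# MiroThinker-H1 drops 3.0 points; Qwen-3.5-Plus drops 8.6, the largest.
```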