MiroEval

A holistic benchmark for deep research systems that evaluates not just what they write, but whether it's factually grounded and how they research — featuring adaptive rubrics, agentic fact verification, multimodal tasks, and process-level auditing.

Search · Analysis · Writing · Evaluation

Bridging the Evaluation Gap for Deep Research

Recent progress in deep research systems has been impressive, but evaluation still lags behind real user needs. Existing benchmarks assess final reports with fixed rubrics, overlooking the research process; most offer limited multimodal coverage, rely on synthetic tasks, and cannot be refreshed as knowledge evolves. MiroEval addresses these gaps with a benchmark of 100 tasks (70 text-only, 30 multimodal) grounded in real user needs and constructed via a dual-path pipeline that supports periodic updates. The evaluation suite assesses systems along three complementary dimensions: adaptive synthesis quality evaluation with task-specific rubrics, agentic factuality verification via active retrieval and reasoning over web sources and multimodal attachments, and process-centric evaluation that audits how systems search, reason, and refine throughout their investigation. Evaluation across 13 systems yields three principal findings: the three dimensions capture complementary aspects of capability with each revealing distinct strengths and weaknesses; process quality reliably predicts overall outcome while exposing weaknesses invisible to output-level metrics; and multimodal tasks pose substantially greater challenges, with most systems declining by 3 to 10 points. Validated by a human study achieving 92.0% precision, MiroEval provides a holistic diagnostic tool for the next generation of deep research agents.

100 Research Tasks (from real-world interactions)
13 Systems Evaluated (leading deep-research agents)
12 Domain Categories (medicine, law, finance & more)
3 Evaluation Dimensions (report, factuality & process)

Query Distribution Overview — (a) Domain × Task sunburst, (b) Domain distribution, (c) Task type distribution across 100 benchmark queries.

Dual-Path Query Construction

100 queries assembled via two complementary paths — grounded in real user needs, with rigorous human verification and support for temporal refresh.

Path 1

65 User-Derived Queries

Inspired by real user query patterns from closed internal testing of MiroMind, covering both text-only and multimodal interactions. Rather than anonymizing original queries, the pipeline produces entirely new benchmark queries that preserve the topic distribution, complexity profile, and modality coverage of the original population — no original user query appears in any form. Queries span three difficulty tiers from basic retrieval to multi-step reasoning and contradiction detection, yielding 35 text-only and 30 multimodal tasks with diverse attachments (images, PDFs, spreadsheets, slides).
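A minimal sketch of how such distribution-preserving synthesis could work; the profile sampler, the prompt, and the `llm` callable are hypothetical stand-ins for the internal pipeline, not MiroEval's actual implementation:

```python
# Hypothetical sketch: synthesize fresh queries that match the (topic,
# difficulty, modality) distribution of the original user population without
# reusing any original query. All names here are illustrative.
import random

def sample_profiles(user_queries, n):
    """Sample joint (topic, difficulty, modality) profiles so the new
    benchmark mirrors the original population's distribution."""
    profiles = [(q["topic"], q["difficulty"], q["modality"]) for q in user_queries]
    return random.choices(profiles, k=n)

def synthesize_query(profile, llm):
    topic, difficulty, modality = profile
    prompt = (f"Write a brand-new deep-research task on {topic}, "
              f"difficulty tier: {difficulty}, modality: {modality}. "
              f"Invent the scenario from scratch; do not restate any known query.")
    return llm(prompt)

# benchmark = [synthesize_query(p, llm) for p in sample_profiles(user_queries, 65)]
```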

Path 2

35 Trend-Grounded Queries

Generated through a fully automated pipeline grounded in real-time web trends across 12 topics. Three-stage quality filtering progressively retains only queries that expose the limitations of parametric knowledge — ensuring each task genuinely requires deep investigation with external sources. The pipeline supports on-demand re-execution with the latest web trends, preventing the benchmark from becoming stale and reducing the risk of overfitting to known tasks.
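One way the three-stage filter could be structured; every helper below is a hypothetical reconstruction of the pipeline described above, not the paper's exact criteria:

```python
# Illustrative sketch of the three-stage quality filter.
def passes_quality_filters(query, closed_book_llm, judge):
    # Stage 1: validity -- a well-formed, legitimate deep-research task.
    if not judge.is_valid_research_task(query):
        return False
    # Stage 2: beyond parametric knowledge -- a closed-book model (no web
    # access) should fail to answer it correctly.
    draft = closed_book_llm.answer(query)
    if judge.is_correct(query, draft):
        return False
    # Stage 3: genuinely requires multi-source deep investigation.
    return judge.needs_multi_source_research(query)

# fresh_tasks = [q for q in trend_candidates
#                if passes_quality_filters(q, llm, judge)]
```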

Query Construction Pipeline

Overview of the dual-path query construction pipeline — user-derived curation (left) and automated trend-grounded generation (right).

Quality Verification

Three graduate-level annotators independently assess each query on validity (legitimate deep-research task) and non-triviality (requires web search). Both paths support periodic re-execution for temporal refresh — preventing benchmark staleness and reducing overfitting risk.

| Metric | User-Derived | Auto-Generated | Aggregated |
| --- | --- | --- | --- |
| Fleiss' κ (validity) | 0.83 | 0.79 | 0.81 |
| Fleiss' κ (non-triviality) | 0.78 | 0.74 | 0.76 |
| Majority-vote precision | 94.0% | 90.0% | 92.0% |
| Unanimous agreement | 86.0% | 82.0% | – |
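For reference, the κ values above can be computed directly from per-item category counts with the standard Fleiss formula; a minimal, self-contained sketch (two categories shown, e.g. valid vs. not valid, three annotators per item):

```python
# Fleiss' kappa. `ratings` is an N x k matrix: ratings[i][j] is the number
# of annotators who assigned item i to category j.
def fleiss_kappa(ratings):
    N = len(ratings)
    n = sum(ratings[0])                 # raters per item (constant across items)
    k = len(ratings[0])
    # Marginal proportion of assignments falling in each category.
    p = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    # Per-item observed agreement.
    P = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    P_bar = sum(P) / N                  # mean observed agreement
    P_e = sum(pj * pj for pj in p)      # chance agreement
    return (P_bar - P_e) / (1 - P_e)

# e.g. fleiss_kappa([[3, 0], [2, 1], [3, 0]]) -> kappa in [-1, 1]
```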

Three-Layered Evaluation Pipeline

MiroEval assesses deep research systems across three complementary dimensions, capturing both the quality of the final output and the research process itself.

Synthesis Quality

Comprehensive Adaptive Synthesis Quality Evaluation dynamically generates task-specific dimensions, criteria, and weights, and supports both text-only and attachment-augmented queries with grounding criteria (a scoring sketch follows the dimension tags below).

Coverage · Insight · Instruction-following · Clarity · Specificity
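Once the rubric is generated, scoring reduces to a weighted sum. A minimal sketch, where the dimensions, criteria, and weights are generated per task and this particular instance is purely illustrative:

```python
# Minimal sketch of adaptive rubric scoring (numbers are illustrative only).
rubric = [
    {"dimension": "Coverage",              "weight": 0.25, "score": 82.0},
    {"dimension": "Insight",               "weight": 0.20, "score": 74.5},
    {"dimension": "Instruction-following", "weight": 0.20, "score": 88.0},
    {"dimension": "Clarity",               "weight": 0.15, "score": 80.0},
    {"dimension": "Specificity",           "weight": 0.20, "score": 65.0},
]
assert abs(sum(d["weight"] for d in rubric) - 1.0) < 1e-9  # weights normalized
synthesis_score = sum(d["weight"] * d["score"] for d in rubric)  # 0-100 scale
```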

Agentic Factuality

An evaluation agent decomposes reports into verifiable statements, retrieves evidence from web and attachments via dual-source RAG, and assesses consistency with four-label classification.

RIGHT · WRONG · CONFLICT · UNKNOWN
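A sketch of this verification loop; every helper (`extract_atomic_claims`, `web.search`, `attachments.retrieve`, `classify`) is a hypothetical stand-in for the evaluation agent's tools:

```python
LABELS = ("RIGHT", "WRONG", "CONFLICT", "UNKNOWN")

def verify_report(report, web, attachments, judge):
    """Decompose a report into atomic claims and label each one against
    evidence retrieved from both the open web and the task's attachments
    (dual-source RAG)."""
    verdicts = []
    for claim in judge.extract_atomic_claims(report):
        evidence = web.search(claim) + attachments.retrieve(claim)
        # Four-label classification; UNKNOWN when evidence is absent or unusable.
        label = judge.classify(claim, evidence) if evidence else "UNKNOWN"
        verdicts.append((claim, label))
    right = sum(1 for _, label in verdicts if label == "RIGHT")
    # Per-task right ratio on a 0-100 scale (all labels in the denominator
    # here; the paper's exact denominator is an assumption).
    return verdicts, 100.0 * right / max(len(verdicts), 1)
```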

Process-Centric

Audits the research trajectory by decomposing it into atomic units, evaluating intrinsic quality across five dimensions, and measuring alignment between process findings and the final report.

Breadth · Depth · Refinement · Critical Thinking · Efficiency
F→R · R→P · Contradiction
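The two alignment directions reduce to entailment checks between process findings and report claims; a minimal sketch with a hypothetical LLM-backed `entails` predicate:

```python
def alignment_scores(findings, report_claims, entails):
    """F->R: share of process findings that surface in the final report.
    R->P: share of report claims traceable to a documented finding.
    `entails(premise, hypothesis)` is a hypothetical boolean predicate."""
    f_to_r = sum(any(entails(f, c) for c in report_claims) for f in findings)
    r_to_p = sum(any(entails(f, c) for f in findings) for c in report_claims)
    return (100.0 * f_to_r / len(findings),
            100.0 * r_to_p / len(report_claims))
```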
MiroEval Evaluation Pipeline

Overview of the MiroEval evaluation pipeline — integrating Synthesis Quality, Agentic Factuality, and Process-Centric evaluation.

What We Discovered

Finding 01

Rankings Shift Across Dimensions

Kimi-K2.5 achieves the highest Synthesis score among non-MiroThinker systems at 75.7, yet its Factuality ranks near the bottom. Manus obtains the lowest Synthesis score at 55.4, yet its Factuality surpasses several stronger-synthesis systems.

Finding 02

Process Predicts Outcome

The top three systems on Process (MiroThinker-H1, OpenAI, MiroThinker-1.7) are also the top three on overall outcome, and the weakest process system (Doubao) also produces a near-bottom outcome. The Pearson correlation between Process and combined Outcome reaches 0.88.

Finding 03

Multimodal Tasks Are Harder

Overall scores drop by 3 to 10 points for most systems when moving from Text-Only to MultiModal. MiroThinker-H1 proves the most resilient with only a 3.0-point decline.


Leaderboard

Performance comparison of leading deep research systems across Synthesis Quality, Factuality, and Process dimensions.

| # | Model | TO Syn. | TO Fact. | TO Proc. | TO Overall | MM Syn. | MM Fact. | MM Proc. | MM Overall | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | MiroThinker-H1 | 76.7 | 81.1 | 74.7 | 77.5 | 71.5 | 78.5 | 73.5 | 74.5 | 76.6 |
| 2 | OpenAI Deep Research | 73.8 | 83.3 | 73.1 | 76.7 | 66.7 | 77.0 | 66.8 | 70.2 | 74.8 |
| 3 | MiroThinker-1.7 | 74.3 | 79.4 | 72.7 | 75.5 | 69.0 | 78.4 | 67.4 | 71.6 | 74.3 |
| 4 | MiroThinker-1.7-mini | 74.0 | 76.2 | 68.5 | 72.9 | – | – | – | – | – |
| 5 | Gemini-3.1-Pro Deep Research | 71.2 | 71.3 | 67.1 | 69.9 | 66.4 | 73.7 | 64.1 | 68.1 | 69.3 |
| 6 | Kimi-K2.5 Deep Research | 75.7 | 65.4 | 64.2 | 68.4 | – | – | – | – | – |
| 7 | Claude-Opus-4.6 Research | 67.3 | 69.8 | 66.0 | 67.7 | 62.5 | 70.7 | 65.9 | 66.4 | 67.3 |
| 8 | MiniMax-M2.5 Research | 63.3 | 71.8 | 67.1 | 67.4 | 56.7 | 71.0 | 62.2 | 63.3 | 66.2 |
| 9 | ChatGLM Agent | 63.2 | 68.6 | 65.6 | 65.8 | 61.6 | 71.6 | 57.7 | 63.6 | 65.1 |
| 10 | Qwen-3.5-Plus Deep Research | 60.0 | 73.1 | 61.1 | 64.7 | 44.6 | 69.9 | 53.8 | 56.1 | 62.1 |
| 11 | Manus-1.6-Max Wide Research | 55.4 | 72.6 | 64.2 | 64.0 | 54.3 | 70.0 | 61.8 | 62.0 | 63.4 |
| 12 | Doubao Deep Research | 64.2 | 64.9 | 53.1 | 60.7 | – | – | – | – | – |
| 13 | Grok Deep Research | 58.7 | 63.7 | 58.3 | 60.2 | 56.3 | 71.5 | 53.9 | 60.5 | 60.3 |

Performance comparison of models with MiroEval. Text-Only comprises 70 tasks; MultiModal comprises 30 tasks. Systems are ranked by Text-Only Overall score. "–" indicates the system does not support multimodal deep research or data is unavailable.

Detailed score heatmap across Synthesis, Factuality, and Process sub-metrics for Text-Only (70 tasks) and Multimodal (30 tasks) settings.

| Model | TO Cov. | TO Insight | TO Instr. | TO Clarity | TO Spec. | MM Cov. | MM Insight | MM Instr. | MM Clarity | MM Spec. | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MiroThinker-H1 | 80.6 | 80.3 | 84.7 | 81.0 | 70.0 | 72.7 | 76.0 | 78.6 | 78.3 | 59.5 | 76.7 |
| Kimi-K2.5 Deep Research | 80.4 | 79.8 | 78.6 | 76.3 | 71.7 | – | – | – | – | – | 75.7 |
| MiroThinker-1.7 | 79.2 | 74.7 | 84.7 | 80.1 | 68.4 | 72.6 | 69.2 | 78.6 | 75.1 | 53.6 | 74.3 |
| MiroThinker-1.7-mini | 78.8 | 75.0 | 84.3 | 78.7 | 68.1 | – | – | – | – | – | 74.0 |
| OpenAI Deep Research | 78.2 | 74.3 | 81.6 | 77.1 | 69.1 | 70.6 | 63.9 | 74.8 | 70.5 | 54.2 | 73.8 |
| Gemini-3.1-Pro Deep Research | 77.4 | 76.6 | 80.0 | 70.1 | 64.9 | 72.4 | 70.8 | 72.4 | 62.5 | 50.1 | 71.2 |
| Claude-Opus-4.6 Research | 73.3 | 72.0 | 73.5 | 71.2 | 61.1 | 68.9 | 66.8 | 62.8 | 59.3 | 50.0 | 67.3 |
| Doubao Deep Research | 72.9 | 62.7 | 74.6 | 67.2 | 58.2 | – | – | – | – | – | 64.2 |
| MiniMax-M2.5 Research | 69.8 | 62.7 | 74.2 | 70.6 | 56.7 | 63.1 | 53.3 | 69.1 | 62.0 | 39.2 | 63.3 |
| ChatGLM Agent | 69.9 | 62.8 | 74.5 | 67.5 | 57.1 | 67.1 | 60.2 | 71.7 | 65.4 | 45.1 | 63.2 |
| Qwen-3.5-Plus Deep Research | 64.0 | 64.7 | 69.9 | 67.8 | 52.6 | 46.8 | 46.3 | 52.9 | 52.6 | 30.1 | 60.0 |
| Grok Deep Research | 67.3 | 56.3 | 74.9 | 64.7 | 51.1 | 61.8 | 52.5 | 68.9 | 60.4 | 40.5 | 58.7 |
| Manus-1.6-Max Wide Research | 61.2 | 54.8 | 67.9 | 65.6 | 48.1 | 58.7 | 50.2 | 65.0 | 61.2 | 40.4 | 55.4 |

Synthesis quality sub-metrics across five dimensions: Coverage, Insight, Instruction-following, Clarity, and Specificity. Text-Only comprises 70 tasks; MultiModal comprises 30 tasks. "–" indicates data is unavailable.

| Model | TO Right | TO Wrong | TO Conf. | TO Unk. | TO Ratio | MM Right | MM Wrong | MM Conf. | MM Unk. | MM Ratio | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OpenAI Deep Research | 3335 | 170 | 49 | 6 | 83.3 | 1062 | 100 | 36 | 157 | 77.0 | 83.3 |
| MiroThinker-H1 | 3746 | 161 | 67 | 3 | 81.1 | 1316 | 82 | 56 | 238 | 78.5 | 81.1 |
| MiroThinker-1.7 | 3334 | 181 | 67 | 0 | 79.4 | 1306 | 103 | 63 | 235 | 78.4 | 79.4 |
| MiroThinker-1.7-mini | 3397 | 246 | 80 | 2 | 76.2 | – | – | – | – | – | 76.2 |
| Qwen-3.5-Plus Deep Research | 1706 | 244 | 38 | 0 | 73.1 | 576 | 99 | 19 | 101 | 69.9 | 73.1 |
| Manus-1.6-Max Wide Research | 1972 | 191 | 45 | 9 | 72.6 | 681 | 81 | 32 | 134 | 70.0 | 72.6 |
| MiniMax-M2.5 Research | 3872 | 486 | 92 | 1 | 71.8 | 1255 | 184 | 59 | 255 | 71.0 | 71.8 |
| Gemini-3.1-Pro Deep Research | 4039 | 526 | 106 | 8 | 71.3 | 1502 | 158 | 94 | 302 | 73.7 | 71.3 |
| Claude-Opus-4.6 Research | 2838 | 338 | 91 | 0 | 69.8 | 964 | 84 | 44 | 243 | 70.7 | 69.8 |
| ChatGLM Agent | 4096 | 580 | 98 | 1 | 68.6 | 1038 | 144 | 46 | 215 | 71.6 | 68.6 |
| Kimi-K2.5 Deep Research | 3702 | 595 | 125 | 6 | 65.4 | – | – | – | – | – | 65.4 |
| Doubao Deep Research | 3890 | 780 | 139 | 3 | 64.9 | – | – | – | – | – | 64.9 |
| Grok Deep Research | 1924 | 368 | 69 | 9 | 63.7 | 734 | 104 | 37 | 163 | 71.5 | 63.7 |

Factuality evaluation. Measured by atomic claim counts (Right, Wrong, Conflict, Unknown) and average per-task right ratio (scaled to [0, 100]). Ranked by Text-Only Ratio. Text-Only comprises 70 tasks; MultiModal comprises 30 tasks. "–" indicates data is unavailable.
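The per-task right ratio admits a simple formalization; whether UNKNOWN claims enter the denominator is an assumption here, not stated by the table:

```latex
% Per-task right ratio, averaged over N tasks and scaled to [0, 100].
% Assumption: all four labels count toward each task's denominator.
\mathrm{Ratio} \;=\; \frac{100}{N}\sum_{t=1}^{N}
  \frac{R_t}{R_t + W_t + C_t + U_t}
```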

| Model | TO Brdth | TO Depth | TO Refin | TO Critl | TO Effic | TO Intr. Avg | TO F→R | TO R→P | TO Contr | TO Align. Avg | MM Brdth | MM Depth | MM Refin | MM Critl | MM Effic | MM Intr. Avg | MM F→R | MM R→P | MM Contr | MM Align. Avg | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MiroThinker-H1 | 74.9 | 64.9 | 72.2 | 69.1 | 71.0 | 70.4 | 87.0 | 63.3 | 86.4 | 78.9 | 68.6 | 63.1 | 73.4 | 71.0 | 64.1 | 68.1 | 86.6 | 63.4 | 86.9 | 79.0 | 74.7 |
| OpenAI Deep Research | 77.4 | 67.3 | 76.7 | 74.7 | 63.7 | 72.0 | 83.6 | 59.0 | 79.9 | 74.1 | 65.5 | 62.1 | 73.8 | 70.0 | 54.5 | 65.2 | 77.2 | 56.2 | 72.1 | 68.5 | 73.1 |
| MiroThinker-1.7 | 74.4 | 64.4 | 75.7 | 71.6 | 64.6 | 70.1 | 83.7 | 59.4 | 82.5 | 75.2 | 65.0 | 57.0 | 72.0 | 63.0 | 57.7 | 62.9 | 80.7 | 58.7 | 76.0 | 71.8 | 72.7 |
| MiroThinker-1.7-mini | 75.5 | 56.3 | 71.3 | 70.9 | 59.0 | 66.6 | 79.7 | 56.3 | 75.2 | 70.4 | – | – | – | – | – | – | – | – | – | – | 68.5 |
| Gemini-3.1-Pro Deep Research | 75.4 | 66.6 | 75.9 | 64.1 | 59.0 | 68.2 | 72.9 | 50.6 | 74.4 | 66.0 | 69.7 | 65.3 | 71.0 | 58.3 | 47.0 | 62.3 | 75.7 | 49.0 | 73.0 | 65.9 | 67.1 |
| MiniMax-M2.5 Research | 71.9 | 62.2 | 70.1 | 62.5 | 63.5 | 66.0 | 77.4 | 53.0 | 74.3 | 68.3 | 51.0 | 59.0 | 65.0 | 43.7 | 63.0 | 56.3 | 77.0 | 52.0 | 75.0 | 68.0 | 67.1 |
| Claude-Opus-4.6 Research | 79.1 | 58.8 | 67.2 | 56.7 | 62.2 | 64.8 | 81.0 | 47.1 | 73.5 | 67.2 | 75.2 | 60.7 | 69.6 | 59.3 | 60.0 | 65.0 | 78.9 | 49.3 | 72.6 | 66.9 | 66.0 |
| ChatGLM Agent | 76.2 | 59.4 | 67.1 | 59.3 | 59.0 | 64.2 | 77.1 | 51.4 | 72.3 | 67.0 | 52.7 | 52.3 | 55.7 | 44.7 | 54.3 | 51.9 | 73.0 | 47.0 | 70.7 | 63.6 | 65.6 |
| Manus-1.6-Max Wide Research | 62.8 | 58.4 | 60.6 | 53.5 | 68.8 | 60.8 | 75.1 | 51.3 | 76.3 | 67.6 | 52.4 | 57.2 | 60.7 | 43.4 | 65.9 | 55.9 | 74.5 | 54.5 | 74.1 | 67.7 | 64.2 |
| Kimi-K2.5 Deep Research | 77.5 | 59.4 | 71.0 | 67.6 | 53.5 | 65.8 | 70.7 | 46.8 | 70.4 | 62.6 | – | – | – | – | – | – | – | – | – | – | 64.2 |
| Qwen-3.5-Plus Deep Research | 74.4 | 64.1 | 75.0 | 74.1 | 63.2 | 70.2 | 59.6 | 39.7 | 56.9 | 52.1 | 57.0 | 51.3 | 58.7 | 57.7 | 51.3 | 55.2 | 61.7 | 39.3 | 56.3 | 52.4 | 61.1 |
| Grok Deep Research | 50.9 | 49.4 | 61.0 | 54.6 | 64.7 | 56.1 | 74.6 | 42.2 | 64.6 | 60.4 | 41.9 | 44.3 | 52.4 | 42.4 | 59.5 | 48.1 | 72.4 | 41.4 | 65.2 | 59.7 | 58.3 |
| Doubao Deep Research | 59.3 | 41.6 | 59.6 | 55.7 | 53.3 | 53.9 | 65.7 | 36.8 | 54.2 | 52.2 | – | – | – | – | – | – | – | – | – | – | 53.1 |

Process evaluation results. Intrinsic metrics: Search Breadth, Analytical Depth, Progressive Refinement, Critical Thinking, and Efficiency. Alignment metrics: Findings→Report coverage, Report→Process traceability, and Contradiction detection. Text-Only comprises 70 tasks; MultiModal comprises 30 tasks. "–" indicates data is unavailable.

Finding 01

Rankings Shift Substantially Across Dimensions

The three dimensions capture fundamentally different aspects of system capability. Kimi-K2.5 posts the highest Synthesis score among non-MiroThinker systems at 75.7 yet ranks near-bottom on Factuality; Manus shows the opposite pattern at 55.4 Synthesis but competitive Factuality. A polished report does not guarantee factual grounding, nor vice versa.

Dimension-Level Rank Shifts — Bump chart showing how system rankings change across Synthesis, Factuality, and Process dimensions.

Finding 02

Specificity is the Bottleneck, Insight is the Differentiator

Specificity is the universal bottleneck, trailing Coverage by roughly 9 to 16 points across systems: they find relevant topics but struggle with granular, evidence-grounded details. Insight shows the widest spread (54.8–80.3), making it the most discriminative sub-metric for differentiating capability.

Sub-dimension Score Spread — Specificity is consistently the lowest sub-metric across systems, while Insight exhibits the widest variance.

Finding 03

Factual Claims: A Precision–Volume Trade-off

A fundamental tension exists between claim volume and accuracy. ChatGLM and Gemini produce 4,000+ correct claims but with 500+ errors (Ratio ~70%); OpenAI generates fewer but achieves the highest Ratio at 83.3. The MiroThinker series balances both: H1 produces the most claims among top systems (3,746) with the fewest errors (161).

Statement Volume vs. Factual Precision — the trade-off between broad claim coverage and strict factual discipline, with the MiroThinker series achieving a distinctive balance.

Finding 04

Process Quality Predicts Overall Outcome

The top three Process systems are also the top three on overall outcome; the weakest (Doubao at 53.1) produces a near-bottom result. The Pearson correlation between Process and combined Outcome reaches r = 0.88, confirming process quality as a reliable predictor.

Process Quality vs. Overall Outcome — strong positive correlation (r = 0.88) across all systems in the Text-Only setting.
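For reproducibility, Pearson's r is straightforward to compute from the leaderboard columns; a minimal sketch (how the paper aggregates Synthesis and Factuality into "combined Outcome" is an assumption, not specified here):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (sqrt(sum((x - mx) ** 2 for x in xs)) *
                  sqrt(sum((y - my) ** 2 for y in ys)))

# e.g. pearson(process_scores, outcome_scores) across the 13 systems
```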

Finding 05

The Traceability Gap: Reports Outrun the Process

Process evaluation exposes a gap invisible to output-level metrics. F→R scores are generally above 70 (findings reach the report), but R→P falls below 55 for most systems — even MiroThinker-H1 achieves only 63.3. A substantial portion of report content cannot be traced back to the documented research process.

Process–Report Alignment Gap — F→R (blue) vs. R→P (orange). The gap reveals reports routinely contain content not originating from the documented research process.

Finding 06

Multimodal Tasks Pose Greater Challenges

Overall scores drop by 3 to 10 points for most systems when moving from Text-Only to MultiModal. MiroThinker-H1 proves the most resilient with a decline of only 3.0 points, while Qwen-3.5-Plus suffers the largest drop.

Text-Only vs. Multimodal — Dumbbell chart comparing overall scores across both settings. The gap indicates performance degradation on multimodal tasks.