Study Overview
We present an inter-annotator agreement (IAA) study for Multidimensional Quality Metrics (MQM) annotation of machine translation output. Two professional linguists independently annotated English-to-Italian translations from two neural MT systems using the MQM error typology. The source documents were drawn from the WMT 2025 Human Evaluation dataset, specifically selecting segments without prior MQM annotations.
Our analysis reveals a Kendall's tau correlation of 0.317 for segment-level MQM scores — 2.6× higher than the typical 0.12 reported in WMT shared tasks. While annotators achieved 100% agreement on identifying segments containing errors, significant differences emerged in error density and category preferences.
Dataset
MT Systems
EuroLLM-22B: multilingual large language model
Qwen3-235B: large-scale multilingual model
Source Material
WMT 2025 Human Evaluation dataset (WMT 2025 General Machine Translation Shared Task), social media domain
Document ID: 114294867111841563
Source data: WMT 2025 repository
Source documents were obtained from the WMT 2025 Human Evaluation dataset, selecting documents that had not received prior MQM or other quality annotations. The social media domain was chosen for its high-context nature — requiring annotators to interpret informal register, cultural references, and pragmatic nuance.
Annotation Setup
Two professional linguists, independent double-blind annotation
Annotator Profiles
MQM Categories
Severity Weights
MQM Score = −Σᵢ (weightᵢ × errorᵢ), i.e., the negated sum of severity-weighted errors. Lower (more negative) = worse quality.
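As a minimal sketch of how a segment score is derived from annotated errors (the severity weights below are illustrative placeholders, not necessarily the exact weights used in this study):

```python
# Minimal sketch of segment-level MQM scoring.
# NOTE: the weights below are illustrative placeholders, not necessarily
# the exact severity weights applied in this study.
SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0, "critical": 10.0}

def mqm_score(errors):
    """errors: list of dicts such as {"category": "Style", "severity": "minor"}.
    Returns the negated weighted error sum; lower (more negative) = worse quality."""
    return -sum(SEVERITY_WEIGHTS[e["severity"]] for e in errors)

# Two minor errors and one major error -> -(1 + 1 + 5) = -7.0
print(mqm_score([
    {"category": "Style", "severity": "minor"},
    {"category": "Accuracy/Mistranslation", "severity": "minor"},
    {"category": "Fluency/Grammar", "severity": "major"},
]))
```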
Annotation Statistics
3.2× error density gap between annotators
| Category | A-5BFF0F0F errors (% of total) | A-7A8BCDCD errors (% of total) |
|---|---|---|
| Fluency/Grammar | 7 (17%) | 54 (40%) |
| Accuracy/Mistranslation | 11 (26%) | 28 (21%) |
| Style | 15 (36%) | 22 (16%) |
| Accuracy/Untranslated | 1 (2%) | 20 (15%) |
| Terminology | 5 (12%) | 0 (0%) |
| Other | 3 (7%) | 10 (8%) |
Severity Distribution
Severity distribution by annotator: A-5BFF0F0F vs. A-7A8BCDCD (shares of Minor, Major, and Critical errors).
Key Difference
A-5BFF0F0F marked 21% of errors as Major/Critical versus only 2% for A-7A8BCDCD — a fundamental difference in severity threshold interpretation. A-7A8BCDCD found 3.2× more errors overall but assigned almost exclusively Minor severity.
Segment-Level Agreement
Do annotators assign similar quality scores to segments?
| Segment | A-5BFF0F0F (MQM) | A-7A8BCDCD (MQM) | Δ (abs.) |
|---|---|---|---|
| auto_0 | -8.0 | -11.0 | 3.0 |
| auto_1 | -16.0 | -15.0 | 1.0 |
| auto_2 | -9.0 | -19.0 | 10.0 |
| auto_3 | -3.0 | -13.0 | 10.0 |
| auto_4 | -8.0 | -15.0 | 7.0 |
| auto_5 | -2.0 | -14.0 | 12.0 |
| auto_6 | -8.0 | -14.0 | 6.0 |
| auto_7 | -9.0 | -9.0 | 0.0 |
| auto_8 | -7.0 | -13.0 | 6.0 |
| auto_9 | -28.0 | -19.0 | 9.0 |
| Mean | -9.8 | -14.2 | 6.4 |
2.6× Higher Than WMT
Our Kendall's tau of 0.317 substantially exceeds the ~0.12 typically reported for MQM in WMT evaluations. Contributing factors include domain consistency (single social media document), language pair clarity (EN→IT has well-defined error patterns), full document context for both annotators, and professional linguist expertise.
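As an illustration, the segment-level correlation can be recomputed from the score table above with scipy. The ten segments shown are only an excerpt of the study data, so this will not necessarily reproduce the study-level 0.317:

```python
from scipy.stats import kendalltau

# Segment-level MQM scores for segments auto_0 .. auto_9 (from the table above).
scores_a = [-8.0, -16.0, -9.0, -3.0, -8.0, -2.0, -8.0, -9.0, -7.0, -28.0]          # A-5BFF0F0F
scores_b = [-11.0, -15.0, -19.0, -13.0, -15.0, -14.0, -14.0, -9.0, -13.0, -19.0]   # A-7A8BCDCD

tau, p_value = kendalltau(scores_a, scores_b)
print(f"Kendall's tau = {tau:.3f} (p = {p_value:.3f})")
```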
Span-Level Agreement
Do annotators identify the same error locations?
Span Overlap Breakdown
Spans considered matching at ≥30% overlap threshold (standard practice for span-based evaluation).
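A minimal sketch of such an overlap check, using character-offset spans and normalizing by the shorter span (whether the study normalizes the same way is an assumption):

```python
def overlap_ratio(span_a, span_b):
    """Fraction of the shorter span covered by the other. Spans are (start, end) offsets."""
    overlap = max(0, min(span_a[1], span_b[1]) - max(span_a[0], span_b[0]))
    shorter = min(span_a[1] - span_a[0], span_b[1] - span_b[0])
    return overlap / shorter if shorter > 0 else 0.0

def spans_match(span_a, span_b, threshold=0.30):
    """Two error spans count as the same error if they overlap by at least the threshold."""
    return overlap_ratio(span_a, span_b) >= threshold

# (10, 25) and (20, 40) share 5 characters; 5 / 15 ≈ 0.33 >= 0.30 -> match
print(spans_match((10, 25), (20, 40)))  # True
```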
Category & Severity Agreement
On the 21 spans where both annotators identified an error
Severity Is Easier to Agree On
When annotators agree on error location, they agree on severity more often (71%) than category (48%). This suggests that while error severity is relatively intuitive (how bad is it?), the choice between overlapping categories like Style vs. Grammar is more subjective and depends on individual annotator training and mental models.
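A sketch of how these two agreement rates can be computed over matched span pairs (the field names here are hypothetical, not the study's actual data schema):

```python
def agreement_rates(matched_pairs):
    """matched_pairs: list of dicts with hypothetical keys
    'category_a', 'category_b', 'severity_a', 'severity_b' (one dict per matched span).
    Returns (category agreement, severity agreement) as fractions."""
    n = len(matched_pairs)
    cat = sum(p["category_a"] == p["category_b"] for p in matched_pairs) / n
    sev = sum(p["severity_a"] == p["severity_b"] for p in matched_pairs) / n
    return cat, sev

# On the 21 matched spans, 48% category / 71% severity agreement corresponds to
# roughly 10 and 15 matching labels, respectively.
```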
Error Analysis
Category preferences diverge substantially between annotators
Category focus by annotator: A-5BFF0F0F vs. A-7A8BCDCD.
Different Mental Models
A-5BFF0F0F focused on Style (36%) and Terminology (12%) — prioritizing fluent, domain-appropriate language. A-7A8BCDCD emphasized Fluency/Grammar (40%) and Untranslated content (15%) — taking a more granular, error-counting approach. Neither is "wrong" — they reflect complementary perspectives on translation quality that, combined, produce a richer quality signal than either alone.
Key Findings
2.6× Higher Agreement Than WMT
Kendall's tau of 0.317 vs. the typical ~0.12 in WMT shared tasks. Domain consistency, language pair clarity, and professional expertise all contribute to stronger agreement.
3.2× Error Density Variance
One annotator found 42 errors, the other 134 — reflecting different thresholds for what constitutes an "error" and different levels of annotation granularity.
Category Agreement Below 50%
Only 48% category agreement on matched spans. Overlapping categories (Style vs. Grammar, Terminology vs. Mistranslation) create systematic disagreement even when annotators spot the same issue.
Multiple Annotators Essential
Single-annotator MQM provides unreliable estimates. Each annotator brings different expertise and catches different errors — only a multi-annotator setup produces the comprehensive quality signal needed for reliable MT evaluation.
Downloads & Data
Annotation Files
| MT System | Annotation Data / MQM Report |
|---|---|
| EuroLLM-22B | |
| EuroLLM-22B | |
| Qwen3-235B | |
| Qwen3-235B | |
Need professional annotation or MT evaluation?
We produce MQM-annotated datasets and MT quality evaluations on demand — any language pair, any model, any domain. Alconost's linguist network covers 100+ languages with native-speaker evaluators ready to deploy. From structured dataset annotation to full MT benchmarking studies — we handle the entire process from project setup to delivery.