Case Study

MQM Tool // Inter-Annotator Agreement in MQM Annotation

Two professional linguists independently annotated English-to-Italian translations from two neural MT systems using the MQM framework — achieving 2.6× higher agreement than typical WMT benchmarks.

2 Annotators · 2 MT Systems · 10 Segments · 176 Annotations

Study Overview

We present an inter-annotator agreement (IAA) study for Multidimensional Quality Metrics (MQM) annotation of machine translation output. Two professional linguists independently annotated English-to-Italian translations from two neural MT systems using the MQM error typology. The source documents were drawn from the WMT 2025 Human Evaluation dataset, specifically selecting segments without prior MQM annotations.

Our analysis reveals a Kendall's tau correlation of 0.317 for segment-level MQM scores — 2.6× higher than the typical 0.12 reported in WMT shared tasks. While annotators achieved 100% agreement on identifying segments containing errors, substantial differences emerged in error density and category preferences.

Dataset

Source: WMT 2025
Language Pair: EN → IT
Domain: Social Media
Segments: 10
Source Words: ~1,630

MT Systems

EuroLLM-22B: multilingual large language model
Qwen3-235B: large-scale multilingual model

20 translation instances (10 segments × 2 systems)

Annotation Setup

Annotators: 2 Professional Linguists
Method: Double-Blind MQM
Framework: MQM Error Typology
Total Annotations: 176

Source Material

WMT 2025 Human Evaluation dataset, social media domain

WMT 2025 General Machine Translation Shared Task

Document ID: 114294867111841563 · Social Media Domain · WMT 2025 Repo

Source documents were obtained from the WMT 2025 Human Evaluation dataset, selecting documents that had not received prior MQM or other quality annotations. The social media domain was chosen for its high-context nature — requiring annotators to interpret informal register, cultural references, and pragmatic nuance.

10 Segments · ~1,630 Source Words · Social Media Domain · EN → IT

Annotation Setup

Two professional linguists, independent double-blind annotation

Annotator Profiles

A1 (A-5BFF0F0F): Professional · Native Italian speaker
A2 (A-7A8BCDCD): Professional · Native Italian speaker

MQM Categories

Accuracy (Mistranslation, Omission, Addition, Untranslated)
Fluency (Grammar, Spelling, Punctuation, Inconsistency)
Terminology
Style

Severity Weights

Critical: 25
Major: 5
Minor: 1
Punctuation: 0.1

MQM Score = −∑(weight_i × error_i). Lower (more negative) = worse quality.
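For concreteness, here is a minimal Python sketch of this scoring rule. It is illustrative only, not the tool's implementation: it assumes annotations arrive as (category, severity) pairs per segment and that the 0.1 weight applies to minor punctuation errors.

```python
# Minimal sketch of segment-level MQM scoring (illustrative; not the tool's code).
# Assumes each annotation is a (category, severity) pair for a single segment,
# and that the 0.1 weight applies to minor Fluency/Punctuation errors.

SEVERITY_WEIGHTS = {"critical": 25.0, "major": 5.0, "minor": 1.0}
PUNCTUATION_WEIGHT = 0.1

def mqm_score(annotations):
    """Negative weighted error sum for one segment: lower = worse quality."""
    total = 0.0
    for category, severity in annotations:
        if severity == "minor" and category == "fluency/punctuation":
            total += PUNCTUATION_WEIGHT
        else:
            total += SEVERITY_WEIGHTS[severity]
    return -total

# Two minor errors and one major error -> -(1 + 1 + 5) = -7.0
print(mqm_score([("fluency/grammar", "minor"),
                 ("style", "minor"),
                 ("accuracy/mistranslation", "major")]))
```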

Annotation Statistics

3.2× error density gap between annotators

A-5BFF0F0F errors: 42 (1.5h · 28/hr)
A-7A8BCDCD errors: 134 (3.5h · 43/hr)
Error density gap: 3.2× (42 vs 134 errors)
Segments with errors: 10/10 (100% detection)
Category A-5BFF0F0F A-7A8BCDCD
Fluency/Grammar 7 (17%) 54 (40%)
Accuracy/Mistranslation 11 (26%) 28 (21%)
Style 15 (36%) 22 (16%)
Accuracy/Untranslated 1 (2%) 20 (15%)
Terminology 5 (12%) 0 (0%)
Other 3 (7%) 10 (8%)

Severity Distribution

A-5BFF0F0F

Minor 33 (79%)
Major 8 (19%)
Critical 1 (2%)

A-7A8BCDCD

Minor 132 (98%)
Major 2 (2%)
Critical 0 (0%)

Key Difference

A-5BFF0F0F marked 21% of errors as Major/Critical versus only 2% for A-7A8BCDCD — a fundamental difference in severity threshold interpretation. A-7A8BCDCD found 3.2× more errors overall but assigned almost exclusively Minor severity.

Segment-Level Agreement

Do annotators assign similar quality scores to segments?

Kendall's τ: 0.317 (p = 0.229)
Pearson r: 0.530 (p = 0.115)
Spearman ρ: 0.458 (p = 0.183)
WMT benchmark (typical τ): ~0.12
Segment A-5BFF0F0F A-7A8BCDCD Δ
auto_0 -8.0 -11.0 3.0
auto_1 -16.0 -15.0 1.0
auto_2 -9.0 -19.0 10.0
auto_3 -3.0 -13.0 10.0
auto_4 -8.0 -15.0 7.0
auto_5 -2.0 -14.0 12.0
auto_6 -8.0 -14.0 6.0
auto_7 -9.0 -9.0 0.0
auto_8 -7.0 -13.0 6.0
auto_9 -28.0 -19.0 9.0
Mean -9.8 -14.2 6.4
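The correlation figures above can be recomputed directly from this table with SciPy's standard correlation functions; note that SciPy's default Kendall implementation is tau-b, which accounts for the tied scores in these columns.

```python
# Recomputing the segment-level agreement statistics from the table above
# (sketch; requires scipy).
from scipy.stats import kendalltau, pearsonr, spearmanr

a1 = [-8, -16, -9, -3, -8, -2, -8, -9, -7, -28]         # A-5BFF0F0F
a2 = [-11, -15, -19, -13, -15, -14, -14, -9, -13, -19]  # A-7A8BCDCD

tau, p_tau = kendalltau(a1, a2)   # tau-b, handles tied scores
r, p_r = pearsonr(a1, a2)
rho, p_rho = spearmanr(a1, a2)

print(f"Kendall's tau = {tau:.3f} (p = {p_tau:.3f})")   # ~0.317
print(f"Pearson r     = {r:.3f} (p = {p_r:.3f})")       # ~0.530
print(f"Spearman rho  = {rho:.3f} (p = {p_rho:.3f})")   # ~0.458
```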

2.6× Higher Than WMT

Our Kendall's tau of 0.317 substantially exceeds the ~0.12 typically reported for MQM in WMT evaluations. Contributing factors include domain consistency (single social media document), language pair clarity (EN→IT has well-defined error patterns), full document context for both annotators, and professional linguist expertise.

Span-Level Agreement

Do annotators identify the same error locations?

A1 spans matched: 50% (21 of 42)
A2 spans matched: 15.7% (21 of 134)
Jaccard index: 13.5% (21 of 155 unique spans)
Unique spans identified: 155

Span Overlap Breakdown

Both annotators 21 spans (13.5%)
A-5BFF0F0F only 21 spans (13.5%)
A-7A8BCDCD only 113 spans (72.9%)

Spans considered matching at ≥30% overlap threshold (standard practice for span-based evaluation).
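As a sketch of how span matching and the Jaccard figure can be computed, under two assumptions not specified above: spans are (start, end) character offsets, and the overlap ratio is measured against the shorter of the two spans.

```python
# Sketch of span matching and Jaccard computation (illustrative assumptions:
# spans are (start, end) character offsets; overlap is measured against the
# shorter of the two spans).

def overlap_ratio(a, b):
    """Fraction of the shorter span covered by the overlap of spans a and b."""
    overlap = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    shorter = min(a[1] - a[0], b[1] - b[0])
    return overlap / shorter if shorter > 0 else 0.0

def match_spans(spans_a, spans_b, threshold=0.30):
    """Greedily pair spans from the two annotators at the overlap threshold."""
    matches, used_b = [], set()
    for i, a in enumerate(spans_a):
        for j, b in enumerate(spans_b):
            if j not in used_b and overlap_ratio(a, b) >= threshold:
                matches.append((i, j))
                used_b.add(j)
                break
    return matches

def jaccard(n_a, n_b, n_matched):
    """Matched spans over the union of unique spans."""
    return n_matched / (n_a + n_b - n_matched)

# Example: these two spans overlap by 5 of the shorter span's 15 chars (33%) -> match
print(match_spans([(10, 25)], [(20, 40)]))   # -> [(0, 0)]
print(f"{jaccard(42, 134, 21):.3f}")         # -> 0.135, as reported above
```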

Category & Severity Agreement

On the 21 spans where both annotators identified an error

Category agreement: 48% (10 of 21 matched spans)
Severity agreement: 71% (15 of 21 matched spans)
Both match: 38% (8 of 21 matched spans)

Severity Is Easier to Agree On

When annotators agree on error location, they agree on severity more often (71%) than category (48%). This suggests that while error severity is relatively intuitive (how bad is it?), the choice between overlapping categories like Style vs. Grammar is more subjective and depends on individual annotator training and mental models.
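Once spans are paired, the agreement figures above reduce to counting over the matched pairs. A minimal sketch follows; the pair structure and field names are illustrative, not the study's actual data format.

```python
# Category / severity / joint agreement over matched span pairs (sketch).
from dataclasses import dataclass

@dataclass
class MatchedPair:
    category_a: str
    severity_a: str
    category_b: str
    severity_b: str

def agreement_rates(pairs):
    """Return (category, severity, both) agreement rates over matched pairs."""
    n = len(pairs)
    cat = sum(p.category_a == p.category_b for p in pairs) / n
    sev = sum(p.severity_a == p.severity_b for p in pairs) / n
    both = sum(p.category_a == p.category_b and
               p.severity_a == p.severity_b for p in pairs) / n
    return cat, sev, both

# In this study: 10/21 categories (48%), 15/21 severities (71%), 8/21 both (38%).
```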

Error Analysis

Category preferences diverge substantially between annotators

A-5BFF0F0F — Category Focus

Style: 15 (36%)
Accuracy/Mistranslation: 11 (26%)
Fluency/Grammar: 7 (17%)
Terminology: 5 (12%)
Other: 3 (7%)
Accuracy/Untranslated: 1 (2%)

A-7A8BCDCD — Category Focus

Fluency/Grammar: 54 (40%)
Accuracy/Mistranslation: 28 (21%)
Style: 22 (16%)
Accuracy/Untranslated: 20 (15%)
Other: 10 (8%)
Terminology: 0 (0%)

Different Mental Models

A-5BFF0F0F focused on Style (36%) and Terminology (12%) — prioritizing fluent, domain-appropriate language. A-7A8BCDCD emphasized Fluency/Grammar (40%) and Untranslated content (15%) — taking a more granular, error-counting approach. Neither is "wrong" — they reflect complementary perspectives on translation quality that, combined, produce a richer quality signal than either alone.

Key Findings

1

2.6× Higher Agreement Than WMT

Kendall's tau of 0.317 vs. the typical ~0.12 in WMT shared tasks. Domain consistency, language pair clarity, and professional expertise all contribute to stronger agreement.

2

3.2× Error Density Variance

One annotator found 42 errors, the other 134 — reflecting different thresholds for what constitutes an "error" and different levels of annotation granularity.

3

Category Agreement Below 50%

Only 48% category agreement on matched spans. Overlapping categories (Style vs. Grammar, Terminology vs. Mistranslation) create systematic disagreement even when annotators spot the same issue.

4

Multiple Annotators Essential

Single-annotator MQM provides unreliable estimates. Each annotator brings different expertise and catches different errors — only a multi-annotator setup produces the comprehensive quality signal needed for reliable MT evaluation.

Downloads & Data

Annotation Files

EuroLLM-22B: Annotation Data · MQM Report
Qwen3-235B: Annotation Data · MQM Report

Need professional annotation or MT evaluation?

We produce MQM-annotated datasets and MT quality evaluations on demand — any language pair, any model, any domain. Alconost's linguist network covers 100+ languages with native-speaker evaluators ready to deploy. From structured dataset annotation to full MT benchmarking studies — we handle the entire process from project setup to delivery.

Try the MQM Tool

This tool uses the Multidimensional Quality Metrics (MQM) framework, licensed under CC BY 4.0 by The MQM Council.