Case Study

MQM Tool // TranslateGemma Quality Evaluation

45 professional linguists evaluated Google's TranslateGemma-12B across 16 languages in 6 days — identifying 1,169 errors and producing a publicly available MQM dataset.

45 linguists · 16 languages · 34 hours total effort · 1,169 annotations

Study Overview

Google's TranslateGemma-12B is a promising open-weight translation model — compact enough to run locally on a single GPU (~24GB VRAM in bfloat16), with support for 55 languages. We wanted to understand how it performs on real-world technical content — the kind of domain where machine translation often needs human review.

We took 7 segments from an academic ASR research paper, translated them into 16 languages, and had Alconost linguists evaluate each translation using the MQM (Multidimensional Quality Metrics) framework — the same methodology used in WMT human evaluation campaigns.

Model & Deployment

Model google/translategemma-12b-it
Parameters 12B (bfloat16)
Platform HF Inference Endpoints
Hardware NVIDIA A100 80GB
Decoding Greedy (deterministic)

Translation Approach

Supported (12 languages): structured chat template with source_lang_code / target_lang_code fields.

Unsupported (4 languages): Belarusian, Hmong, Arabic (MSA), Arabic (Morocco). Custom fallback prompt, e.g.: "Translate to Belarusian (беларуская мова): Output only the translation."
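For reference, a minimal sketch of how these two request styles might be sent to the endpoint. The endpoint URL, token, and exact payload shape (including where the source_lang_code / target_lang_code fields go) are assumptions for illustration, not the study's actual client code; check the TranslateGemma model card before reusing it.

```python
import requests

# Hypothetical values -- replace with your own endpoint and access token.
ENDPOINT_URL = "https://YOUR-ENDPOINT.endpoints.huggingface.cloud"
HF_TOKEN = "hf_..."

def build_payload(segment: str, lang_name: str, lang_code: str | None) -> dict:
    """Build one translation request.

    Supported languages use the structured source_lang_code / target_lang_code
    fields described above (field placement here is an assumption); unsupported
    languages fall back to the plain instruction prompt used in the study.
    """
    if lang_code is not None:  # officially supported language
        return {
            "inputs": segment,
            "parameters": {"do_sample": False},  # greedy, deterministic decoding
            "source_lang_code": "en",
            "target_lang_code": lang_code,
        }
    prompt = f"Translate to {lang_name}: Output only the translation.\n\n{segment}"
    return {"inputs": prompt, "parameters": {"do_sample": False}}

def translate(segment: str, lang_name: str, lang_code: str | None = None) -> str:
    resp = requests.post(
        ENDPOINT_URL,
        headers={"Authorization": f"Bearer {HF_TOKEN}"},
        json=build_payload(segment, lang_name, lang_code),
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()[0]["generated_text"]

# Example: one segment into German (supported) and Belarusian (unsupported).
# translate("Multilingual ASR models...", "German", "de")
# translate("Multilingual ASR models...", "Belarusian (беларуская мова)")
```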

Study Design

Source Material Academic ASR paper
Segments per Language 7
Evaluators per Language 3 Alconost linguists
Total Projects 48
Completion 322 / 336 (95.8%)

Source Material

7 segments from an academic paper on multilingual speech recognition

Selective Invocation for Multilingual ASR: A Cost-effective Approach Adapting to Speech Recognition Difficulty

Xue et al., 2025 · Interspeech 2025 · arXiv:2505.16168

"Multilingual automatic speech recognition (ASR) models have gained significant attention for their ability to recognize multiple languages using a single model. Recent advances have led to impressive performance in various languages through large-scale supervised or self-supervised pre-training. For example, Whisper is trained on 680,000 hours of weakly multilingual data..."

"Motivated by these limitations, we propose an alternative strategy that selectively invokes models based on the complexity of the input speech..."

7 segments · ~615 source words · Domain: ASR / NLP · Source language: English

Evaluation Pipeline

From zero to 1,169 annotations in 6 days

01 · Translate (Model)
7 source segments translated into 16 languages via TranslateGemma on a private HuggingFace Inference Endpoint (A100 80GB, greedy decoding).

02 · Create Projects via API (API)
48 MQM annotation projects created programmatically: source/target pairs uploaded, metadata set (system_id, doc_id, language codes), unique project URLs generated for each linguist.

03 · Annotate (Human)
45 Alconost linguists annotated independently in the browser-based MQM tool, marking error spans, selecting categories and severities, and writing comments. 3 evaluators per language, blind.

04 · Export & Analyze (Data)
Annotation data exported in JSONL/TSV. Quality scores, inter-annotator agreement, and reports generated from the structured data.

API-Driven Workflow

The entire project lifecycle was managed through the MQM Tool REST API: batch project creation with POST /import (uploading source/target pairs as TSV/JSONL, setting metadata like system_id, doc_id, language codes), real-time progress tracking with GET /projects/:id (completion rates, error counts), and structured data export with GET /projects/:id/export?format=jsonl. No manual file handling — the API handles the full pipeline from segment upload to annotated dataset delivery.
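As an illustration, here is a minimal sketch of that lifecycle against the endpoints named above (POST /import, GET /projects/:id, GET /projects/:id/export). The base URL, auth header, request fields beyond those listed in the text, and response field names are assumptions, not the tool's documented client.

```python
import json
import requests

# Hypothetical base URL and API key -- adjust to your MQM Tool deployment.
BASE_URL = "https://mqm.example.com/api"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

def create_project(tsv_rows: str, target_lang: str) -> str:
    """Batch-create one annotation project from source/target pairs (TSV)."""
    resp = requests.post(
        f"{BASE_URL}/import",
        headers=HEADERS,
        json={
            "format": "tsv",
            "data": tsv_rows,                      # "source\ttarget" per segment
            "system_id": "translategemma-12b-it",  # metadata used in the study
            "doc_id": "asr-paper-2505.16168",      # illustrative doc_id
            "source_lang": "en",
            "target_lang": target_lang,
        },
    )
    resp.raise_for_status()
    return resp.json()["project_id"]  # assumed response field name

def project_status(project_id: str) -> dict:
    """Completion rate and error counts for one project."""
    resp = requests.get(f"{BASE_URL}/projects/{project_id}", headers=HEADERS)
    resp.raise_for_status()
    return resp.json()

def export_annotations(project_id: str) -> list[dict]:
    """Structured export of the finished annotations as JSONL."""
    resp = requests.get(
        f"{BASE_URL}/projects/{project_id}/export",
        headers=HEADERS,
        params={"format": "jsonl"},
    )
    resp.raise_for_status()
    return [json.loads(line) for line in resp.text.splitlines() if line]
```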

Annotation Team

45 linguists drawn from Alconost's vetted multilingual workforce of 2,000+ professionals across 100+ languages

45 linguists · 34h total effort · ~2 min per annotation · 6-day campaign

Median Annotation Time per Linguist

Hmong: 92 min · 1 linguist
Ukrainian: 78 min · 3 linguists
Belarusian: 77 min · 3 linguists
Japanese: 67 min · 3 linguists
Russian: 67 min · 3 linguists
Arabic (Egypt): 50 min · 3 linguists
Arabic (Saudi Arabia): 47 min · 3 linguists
Korean: 43 min · 3 linguists
Arabic (MSA): 33 min · 3 linguists
French: 31 min · 3 linguists
Polish: 28 min · 3 linguists
German: 22 min · 3 linguists
Arabic (Morocco): 21 min · 3 linguists
Portuguese (Brazil): 17 min · 3 linguists
Italian: 15 min · 2 linguists
Portuguese (Portugal): 15 min · 3 linguists

Note: Hmong had a single linguist (rare language, limited availability).

Throughput range: 15.8–71.8 errors/hour across evaluators. The 4.5x variance reflects individual style and language complexity.

Median throughput: 30.5 errors/hour, roughly 2 minutes per error annotation, including reading the source, marking spans, categorizing, and writing comments.

Avg evaluation time: ~35 min per evaluator (7 segments). Range: 2 min (Italian) to 4.5 hrs (Ukrainian).

Campaign Timeline

From project kickoff to full dataset in 8 days

The entire campaign was coordinated by a dedicated Alconost vendor manager who onboarded 45 linguists, distributed project links, tracked deadlines, and handled linguist questions. Progress was monitored in real time using the MQM Tool API — a CLI script queried project data and export endpoints for each of the 48 projects, flagging stalled evaluators and surfacing completion rates daily.
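A sketch of what such a monitoring script could look like, reusing the project endpoint from the API section above. The project IDs, completion_rate, and error_count field names are assumed for illustration, not the API's documented schema.

```python
import requests

BASE_URL = "https://mqm.example.com/api"           # hypothetical, as above
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}
PROJECT_IDS = ["proj-de-1", "proj-de-2"]           # ... the 48 project IDs

def daily_report(stall_threshold: float = 0.5) -> None:
    """Print completion per project and flag evaluators who appear stalled."""
    for pid in PROJECT_IDS:
        resp = requests.get(f"{BASE_URL}/projects/{pid}", headers=HEADERS)
        resp.raise_for_status()
        info = resp.json()
        completion = info.get("completion_rate", 0.0)  # assumed field name
        errors = info.get("error_count", 0)            # assumed field name
        flag = "  << STALLED?" if completion < stall_threshold else ""
        print(f"{pid}: {completion:.0%} complete, {errors} errors{flag}")

if __name__ == "__main__":
    daily_report()
```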

Annotations per Day

Jan 27: 121 · Jan 28: 377 · Jan 29: 71 · Jan 30: 124 · Jan 31: 38 · Feb 1: 166 · Feb 2: 194 · Feb 4: 78

Inter-Annotator Agreement

Why multiple annotators matter — and what low agreement actually tells us

With 3 evaluators per language working independently, we measured how often they agree — and the answer is: not very often. This isn't a flaw in our process. It's a fundamental property of translation quality assessment. Different linguists notice different errors, weight them differently, and bring different expertise. That's exactly why a single annotator is never enough for reliable MQM data.

Avg Kendall's τ (segment ranking): 0.165
Span overlap: 16.5% (annotators mark different text)
Category agreement: 8% (Jaccard index)
Strong agreement: 1 of 15 languages (only Italian, τ = 0.716)
Language Kendall's τ Agreement
Italian 0.716 Strong
Ukrainian 0.429 Moderate
Japanese 0.429 Moderate
Portuguese (BR) 0.400 Moderate
Korean 0.400 Moderate
Russian 0.365 Moderate
Arabic (MSA) 0.175 Weak
Arabic (Saudi) 0.111 Weak
Arabic (Egypt) 0.048 Weak
Arabic (Morocco) 0.039 Weak
Polish 0.035 Weak
Belarusian -0.048 Weak
Portuguese (PT) -0.154 Weak
French -0.206 Weak
German -0.270 Weak
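For reference, a minimal sketch of how a segment-ranking τ like the values in the table above can be computed for one language, assuming each annotator's errors have first been reduced to a per-segment MQM penalty. The numbers below are placeholders, not data from the study.

```python
from itertools import combinations
from scipy.stats import kendalltau

# Per-segment MQM penalties for the 7 segments, one list per annotator
# (placeholder values for illustration only).
annotators = {
    "A": [3, 0, 7, 1, 12, 2, 5],
    "B": [1, 0, 9, 2, 10, 4, 3],
    "C": [4, 1, 6, 0, 15, 2, 2],
}

# Average pairwise Kendall's tau over the segment rankings.
taus = []
for (name_a, pa), (name_b, pb) in combinations(annotators.items(), 2):
    tau, _ = kendalltau(pa, pb)
    taus.append(tau)
    print(f"{name_a} vs {name_b}: tau = {tau:.3f}")

print(f"language-level tau = {sum(taus) / len(taus):.3f}")
```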

The Takeaway

Low agreement doesn't mean the annotations are wrong — it means translation quality is inherently multidimensional. One linguist flags a terminology issue, another catches an awkward phrasing, a third spots a subtle omission. Each perspective adds signal. This is why production-grade MQM evaluation requires multiple independent annotators — and why scaling this to 45 linguists across 16 languages is exactly the kind of operation we're built for.

Quality Rankings

MQM scores across 16 languages — lower is better

# Language MQM Quality Verdict
1 German 48 98.9% FAIL
2 Arabic (Morocco) 65 98.7% FAIL
3 Polish 69 98.5% FAIL
4 Italian 77 98.4% FAIL
5 Arabic (Egypt) 87 98.2% FAIL
6 French 95 98.2% FAIL
7 Portuguese (Brazil) 118 97.7% FAIL
8 Arabic (MSA) 130 97.3% FAIL
9 Arabic (Saudi Arabia) 142 97.0% FAIL
10 Portuguese (Portugal) 174 96.6% FAIL
11 Japanese 344 84.8% FAIL
12 Russian 353 92.4% FAIL
13 Korean 409 90.0% FAIL
14 Belarusian 481 89.3% FAIL
15 Ukrainian 568 87.5% FAIL
16 Hmong 1,129 44.0% FAIL
Quality Score = (1 − penalty ÷ tokens) × 100, token-weighted via XLM-R tokenizer. Pass threshold: 99.0%.

0 of 16 languages pass at the 99% threshold

Using the Alconost MQM Tool's token-weighted quality score with the standard 99% pass threshold, none of the 16 languages achieve a passing grade; even German, the best performer, reaches only 98.94%. This doesn't mean TranslateGemma is bad: it's capable, especially for a 12B-parameter model running locally. It means that technical content demands human review. The model does the heavy lifting; the linguist catches what the model misses.
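A minimal sketch of the scoring formula above, using the severity weights listed in the error breakdown below (critical 25, major 5, minor 1). The error counts and token total in the example are placeholders; in the study, token counts come from the XLM-R tokenizer.

```python
# Severity weights from the study: critical 25, major 5, minor 1 point(s).
WEIGHTS = {"critical": 25, "major": 5, "minor": 1}
PASS_THRESHOLD = 99.0  # percent

def quality_score(error_counts: dict[str, int], token_count: int) -> float:
    """Token-weighted MQM quality score: (1 - penalty / tokens) * 100."""
    penalty = sum(WEIGHTS[sev] * n for sev, n in error_counts.items())
    return (1 - penalty / token_count) * 100

# Placeholder numbers (not a language from the study): 1 critical, 4 major,
# and 20 minor errors over 4,000 target tokens (XLM-R tokenization).
score = quality_score({"critical": 1, "major": 4, "minor": 20}, token_count=4000)
print(f"{score:.2f}%  ->  {'PASS' if score >= PASS_THRESHOLD else 'FAIL'}")
# 1*25 + 4*5 + 20*1 = 65 penalty points; (1 - 65/4000) * 100 = 98.38% -> FAIL
```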

Error Breakdown

Critical: 82 (7.0%) · 25 pts each
Major: 288 (24.6%) · 5 pts each
Minor: 769 (65.8%) · 1 pt each


Top Error Categories

Accuracy/Mistranslation: 286
Terminology: 139
Fluency/Grammar: 130
Style: 128
Fluency/Inconsistency: 124
Accuracy/Omission: 97
Accuracy/Addition: 76
Fluency/Punctuation: 50

MQM Score by Language

German 48 · Arabic (Morocco) 65 · Polish 69 · Italian 77 · Arabic (Egypt) 87 · French 95 · Portuguese (Brazil) 118 · Arabic (MSA) 130 · Arabic (Saudi Arabia) 142 · Portuguese (Portugal) 174 · Japanese 344 · Russian 353 · Korean 409 · Belarusian 481 · Ukrainian 568 · Hmong 1,129

Lower = better quality.

Key Findings

1 · Strong on Supported European Languages
German (48), Polish (69), Italian (77), and French (95) all scored well, which is impressive for a 12B model running locally on technical academic content. With human post-editing, these translations are production-usable.

2 · Technical Content Is Hard
Academic ASR terminology is challenging for any translator, human or machine. Accuracy/Mistranslation (286 errors, 24.5%) and Terminology (139, 11.9%) dominate, which is expected for this domain and reinforces the need for expert review.

3 · Language Support Matters
Officially supported languages averaged ~2.18 MQM per segment vs ~13.67 for unsupported ones. This highlights the importance of checking language support before deploying any MT model, and of having human evaluation to quantify the gap.

4 · Arabic Variants Diverge
The four Arabic variants showed wide variation: Morocco (65), Egypt (87), MSA (130), Saudi Arabia (142). Arabic (Morocco) is unsupported yet scored 2nd overall, showing that language proximity can compensate for missing explicit support.

5 · MT + Human Review = Production Quality
TranslateGemma is a compelling model for local MT deployment: fast, open-weight, and performant for supported languages. But as with any LLM-based translation, specialized domains benefit from human review. MQM annotation identifies exactly where and how MT falls short, enabling targeted post-editing that's faster and cheaper than translating from scratch. The model does the heavy lifting; the linguist ensures quality.

Human vs Automatic Metrics

How do state-of-the-art automatic metrics compare to our linguists' assessments?

We ran two leading automatic MT evaluation metrics — MetricX (Google) and COMET-Kiwi (Unbabel) — on the same translations, in QE (quality estimation) mode without reference translations. Both are neural metrics used in WMT evaluation campaigns. We compared their language rankings against our human MQM scores to measure how well machines can approximate expert judgment.

Metric Pearson r Correlation
MetricX-24 XXL 0.882 Strongest
COMET-Kiwi XL 0.841 Strong
COMET-Kiwi (base) 0.841 Strong
MetricX-24 XL 0.798 Strong
COMET-Kiwi XXL 0.796 Strong
MetricX (Vertex AI) 0.250 Weak
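For reference, a minimal sketch of how a correlation like those above can be computed, assuming one automatic metric score and one human MQM quality score per language. The values below are placeholders, not the study's data.

```python
from scipy.stats import pearsonr

# One entry per language: human MQM quality score vs. an automatic metric's
# system-level score (placeholder values for illustration only).
human_quality = [99.0, 97.5, 92.0, 88.0, 75.0, 50.0]
metric_score  = [0.91, 0.89, 0.78, 0.74, 0.66, 0.35]

r, p_value = pearsonr(human_quality, metric_score)
print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
```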

Best automatic metric: MetricX-24 XXL, Pearson r = 0.882 with human MQM. Google's largest neural MT evaluation model, run on a HuggingFace Inference Endpoint.

Surprise finding: MetricX via Vertex AI underperforms at r = 0.250 (3.5x worse). Same metric family, but the API version shows dramatically weaker correlation with human judgment.

What This Means

The best automatic metrics correlate strongly with our human annotations (r=0.88) — which validates that the Alconost linguists' MQM scores are consistent and reliable. But automatic metrics still miss nuance: they struggle with mid-range languages and can't provide the error spans, categories, corrections, and explanations that human annotators produce. Automatic metrics tell you how good a translation is; human MQM tells you what's wrong and how to fix it.

Cost Analysis

What does a 16-language MQM evaluation cost?

$971 total cost · $0.91 per error found · $3.41 per segment · $28.65 avg hourly rate

Cost by Region

Eastern Europe (Ukrainian, Russian, Polish, Belarusian): $272 (28%) · $18–25/hr
East Asia (Japanese, Korean): $219 (22.6%) · $35–38/hr
Arabic (Egypt, Saudi Arabia, Morocco, MSA): $187 (19.3%) · $25–30/hr
Western Europe (German, French, Italian): $145 (14.9%) · $35–40/hr
Portuguese (Brazil, Portugal): $79 (8.1%) · $28–32/hr
Low-resource (Hmong): $69 (7.1%) · $45/hr

Rates reflect 2024–2025 market research from Upwork, ProZ, and regional salary data. Rare language expertise (Hmong) commands a premium. Eastern Europe offers the best cost-efficiency for Slavic languages. The total cost of under $1,000 for a 16-language, 45-linguist MQM evaluation demonstrates that professional human annotation is accessible — not just for large enterprises.

Downloads & Data

Sample MQM Reports

Generated by the MQM Tool — one report per language, showing error breakdowns, quality scores, and annotations

Language · MQM Score · Quality Score (annotation data and MQM report available per language)
German · 48 · 98.94%
Italian · 77 · 98.44%
Japanese · 344 · 84.84%
Korean · 409 · 90.04%
Hmong · 1,129 · 44.00%

Need professional annotation or MT evaluation?

We produce MQM-annotated datasets and MT quality evaluations on demand — any language pair, any model, any domain. Alconost's linguist network covers 100+ languages with native-speaker evaluators ready to deploy. From structured dataset annotation to full MT benchmarking studies — we handle the entire process from project setup to delivery.

Try the MQM Tool

This tool uses the Multidimensional Quality Metrics (MQM) framework, licensed under CC BY 4.0 by The MQM Council.