Study Overview
Google's TranslateGemma-12B is a promising open-weight translation model — compact enough to run locally on a single GPU (~24GB VRAM in bfloat16), with support for 55 languages. We wanted to understand how it performs on real-world technical content — the kind of domain where machine translation often needs human review.
We took 7 segments from an academic ASR research paper, translated them into 16 languages, and had Alconost linguists evaluate each translation using the MQM (Multidimensional Quality Metrics) framework — the same methodology used in WMT human evaluation campaigns.
Model & Deployment
Translation Approach
Structured chat template with source_lang_code / target_lang_code fields
Custom prompt (used for Belarusian, Hmong, Arabic (MSA), and Arabic (Morocco)): "Translate to Belarusian (беларуская мова): Output only the translation."
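As a rough sketch of how these two prompting modes could be sent to the deployment described below, the snippet posts prompts to a private HuggingFace Inference Endpoint with greedy decoding. The endpoint URL is a placeholder, and the structured template string only illustrates where the source_lang_code / target_lang_code fields go; TranslateGemma's actual chat template is not reproduced here.

```python
import os
import requests

# Assumed values -- replace with your own Inference Endpoint URL and token.
ENDPOINT_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"
HEADERS = {
    "Authorization": f"Bearer {os.environ.get('HF_TOKEN', '<HF_TOKEN>')}",
    "Content-Type": "application/json",
}

def call_endpoint(prompt: str) -> str:
    """POST one prompt to the endpoint with greedy decoding, as in the study."""
    payload = {"inputs": prompt,
               "parameters": {"do_sample": False, "max_new_tokens": 512}}
    resp = requests.post(ENDPOINT_URL, headers=HEADERS, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()[0]["generated_text"]

def structured_prompt(text: str, source_lang_code: str, target_lang_code: str) -> str:
    """Supported languages: language codes go into a structured template.
    The tags below are placeholders only -- TranslateGemma's real chat
    template is applied by its tokenizer and is not reproduced here."""
    return (f"<source_lang_code>{source_lang_code}</source_lang_code>"
            f"<target_lang_code>{target_lang_code}</target_lang_code>\n{text}")

def custom_prompt(text: str, language_label: str) -> str:
    """Unsupported or dialectal targets: plain instruction prompt."""
    return f"Translate to {language_label}: Output only the translation.\n\n{text}"

# Example calls (segment text is illustrative):
# call_endpoint(structured_prompt("Multilingual ASR models ...", "en", "de"))
# call_endpoint(custom_prompt("Multilingual ASR models ...",
#                             "Belarusian (беларуская мова)"))
```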
Study Design
Source Material
7 segments from an academic paper on multilingual speech recognition
Selective Invocation for Multilingual ASR: A Cost-effective Approach Adapting to Speech Recognition Difficulty
Xue et al., 2025 · Interspeech 2025 · arXiv:2505.16168
"Multilingual automatic speech recognition (ASR) models have gained significant attention for their ability to recognize multiple languages using a single model. Recent advances have led to impressive performance in various languages through large-scale supervised or self-supervised pre-training. For example, Whisper is trained on 680,000 hours of weakly multilingual data..."
"Motivated by these limitations, we propose an alternative strategy that selectively invokes models based on the complexity of the input speech..."
Evaluation Pipeline
From zero to 1,169 annotations in 6 days
Translate
7 source segments translated into 16 languages via TranslateGemma on a private HuggingFace Inference Endpoint (A100 80GB, greedy decoding).
Create Projects via API
48 MQM annotation projects created programmatically — source/target pairs uploaded, metadata set (system_id, doc_id, language codes), unique project URLs generated for each linguist.
Annotate
45 Alconost linguists annotated independently in the browser-based MQM tool — marking error spans, selecting categories and severities, writing comments. 3 evaluators per language, blind.
Export & Analyze
Annotation data exported in JSONL/TSV. Quality scores, inter-annotator agreement, and reports generated from the structured data.
API-Driven Workflow
The entire project lifecycle was managed through the MQM Tool REST API: batch project creation with POST /import (uploading source/target pairs as TSV/JSONL, setting metadata like system_id, doc_id, language codes), real-time progress tracking with GET /projects/:id (completion rates, error counts), and structured data export with GET /projects/:id/export?format=jsonl. No manual file handling — the API handles the full pipeline from segment upload to annotated dataset delivery.
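As an illustration of that lifecycle, here is a minimal Python sketch against the three endpoints named above. The base URL, authentication, response fields, and any payload fields beyond system_id, doc_id, and the language codes are assumptions rather than the tool's documented schema.

```python
import json
import requests

# Assumed base URL and API key -- replace with your MQM Tool credentials.
BASE_URL = "https://mqm.example.com/api"
HEADERS = {"Authorization": "Bearer <API_KEY>"}

def create_project(segments, system_id, doc_id, src_lang, tgt_lang):
    """Batch-create one annotation project from source/target pairs."""
    payload = {
        "segments": [{"source": s, "target": t} for s, t in segments],
        "system_id": system_id,
        "doc_id": doc_id,
        "source_lang": src_lang,
        "target_lang": tgt_lang,
    }
    resp = requests.post(f"{BASE_URL}/import", headers=HEADERS, json=payload)
    resp.raise_for_status()
    return resp.json()["project_id"]          # assumed response field

def project_status(project_id):
    """Real-time progress: completion rate and error counts."""
    resp = requests.get(f"{BASE_URL}/projects/{project_id}", headers=HEADERS)
    resp.raise_for_status()
    return resp.json()

def export_annotations(project_id):
    """Structured export of finished annotations as JSONL records."""
    resp = requests.get(f"{BASE_URL}/projects/{project_id}/export",
                        headers=HEADERS, params={"format": "jsonl"})
    resp.raise_for_status()
    return [json.loads(line) for line in resp.text.splitlines() if line]
```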
Annotation Team
45 linguists drawn from Alconost's vetted multilingual workforce of 2,000+ professionals across 100+ languages
Median Annotation Time per Linguist
Throughput Range
Median Throughput
Avg Evaluation Time
Campaign Timeline
From project kickoff to full dataset in 8 days
The entire campaign was coordinated by a dedicated Alconost vendor manager who onboarded 45 linguists, distributed project links, tracked deadlines, and handled linguist questions. Progress was monitored in real time using the MQM Tool API — a CLI script queried project data and export endpoints for each of the 48 projects, flagging stalled evaluators and surfacing completion rates daily.
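A daily monitoring pass of the kind described here can be a short script along these lines; the completion_rate field name and the 10% "stalled" threshold are assumptions, not the tool's actual schema or policy.

```python
from datetime import date
import requests

BASE_URL = "https://mqm.example.com/api"        # assumed, as in the sketch above
HEADERS = {"Authorization": "Bearer <API_KEY>"}
STALL_DELTA = 0.10   # assumed policy: flag projects that gained <10% in a day

def daily_report(project_ids, yesterday):
    """Print completion rates for every project and flag stalled ones.

    yesterday: dict of project_id -> completion rate from the previous run.
    """
    today = {}
    for pid in project_ids:
        resp = requests.get(f"{BASE_URL}/projects/{pid}", headers=HEADERS)
        resp.raise_for_status()
        rate = resp.json().get("completion_rate", 0.0)   # assumed field name
        delta = rate - yesterday.get(pid, 0.0)
        flag = "  <-- stalled" if rate < 1.0 and delta < STALL_DELTA else ""
        print(f"{date.today()}  project {pid}: {rate:.0%} complete{flag}")
        today[pid] = rate
    return today   # persist and feed back into tomorrow's run
```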
Annotations per Day
Inter-Annotator Agreement
Why multiple annotators matter — and what low agreement actually tells us
With 3 evaluators per language working independently, we measured how often they agree — and the answer is: not very often. This isn't a flaw in our process. It's a fundamental property of translation quality assessment. Different linguists notice different errors, weight them differently, and bring different expertise. That's exactly why a single annotator is never enough for reliable MQM data.
| Language | Kendall's τ | Agreement |
|---|---|---|
| Italian | 0.716 | Strong |
| Ukrainian | 0.429 | Moderate |
| Japanese | 0.429 | Moderate |
| Portuguese (BR) | 0.400 | Moderate |
| Korean | 0.400 | Moderate |
| Russian | 0.365 | Moderate |
| Arabic (MSA) | 0.175 | Weak |
| Arabic (Saudi) | 0.111 | Weak |
| Arabic (Egypt) | 0.048 | Weak |
| Arabic (Morocco) | 0.039 | Weak |
| Polish | 0.035 | Weak |
| Belarusian | -0.048 | Weak |
| Portuguese (PT) | -0.154 | Weak |
| French | -0.206 | Weak |
| German | -0.270 | Weak |
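For reference, an agreement value like those in the table above can be computed by correlating each pair of annotators' per-segment MQM penalties with Kendall's τ and averaging the pairwise values. Whether the study aggregated pairwise τ exactly this way is an assumption, and the numbers below are invented.

```python
from itertools import combinations
from scipy.stats import kendalltau

def language_agreement(scores_by_annotator):
    """Average pairwise Kendall's tau over annotators' per-segment scores.

    scores_by_annotator: one list per annotator, each holding that
    annotator's MQM penalty for every segment, in the same order.
    """
    taus = []
    for a, b in combinations(scores_by_annotator, 2):
        tau, _ = kendalltau(a, b)
        taus.append(tau)
    return sum(taus) / len(taus)

# Invented numbers for illustration only (3 annotators x 7 segments):
example = [
    [0, 5, 1, 10, 0, 2, 1],
    [1, 5, 0,  8, 0, 3, 1],
    [0, 6, 2, 12, 1, 2, 0],
]
print(f"Average pairwise Kendall's tau: {language_agreement(example):.3f}")
```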
The Takeaway
Low agreement doesn't mean the annotations are wrong — it means translation quality is inherently multidimensional. One linguist flags a terminology issue, another catches an awkward phrasing, a third spots a subtle omission. Each perspective adds signal. This is why production-grade MQM evaluation requires multiple independent annotators — and why scaling this to 45 linguists across 16 languages is exactly the kind of operation we're built for.
Quality Rankings
MQM scores across 16 languages — lower is better
| # | Language | MQM | Quality | Verdict |
|---|---|---|---|---|
| 1 | German | 48 | 98.9% | FAIL |
| 2 | Arabic (Morocco) | 65 | 98.7% | FAIL |
| 3 | Polish | 69 | 98.5% | FAIL |
| 4 | Italian | 77 | 98.4% | FAIL |
| 5 | Arabic (Egypt) | 87 | 98.2% | FAIL |
| 6 | French | 95 | 98.2% | FAIL |
| 7 | Portuguese (Brazil) | 118 | 97.7% | FAIL |
| 8 | Arabic (MSA) | 130 | 97.3% | FAIL |
| 9 | Arabic (Saudi Arabia) | 142 | 97.0% | FAIL |
| 10 | Portuguese (Portugal) | 174 | 96.6% | FAIL |
| 11 | Japanese | 344 | 84.8% | FAIL |
| 12 | Russian | 353 | 92.4% | FAIL |
| 13 | Korean | 409 | 90.0% | FAIL |
| 14 | Belarusian | 481 | 89.3% | FAIL |
| 15 | Ukrainian | 568 | 87.5% | FAIL |
| 16 | Hmong | 1,129 | 44.0% | FAIL |
0 of 16 languages pass at the 99% threshold
Using the Alconost MQM Tool's token-weighted quality score with the standard 99% pass threshold, none of the 16 languages achieves a passing grade — even German, the best performer, reaches only 98.94%. This doesn't mean TranslateGemma is bad; it's capable, especially for a 12B-parameter model running locally. It means that technical content demands human review. The model does the heavy lifting; the linguist catches what the model misses.
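The Tool's exact scoring formula isn't reproduced here; the sketch below only shows the general shape of a token-weighted MQM quality score under assumed severity weights (minor = 1, major = 5, critical = 25) and the 99% pass threshold.

```python
# Assumed severity weights and formula: one plausible reading of a
# "token-weighted" MQM score scales each error by its span length and
# normalises by the segment token count. The MQM Tool's actual formula
# may differ.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 25}
PASS_THRESHOLD = 0.99

def quality_score(errors, token_count):
    """errors: list of (severity, span_token_count) pairs for one language."""
    penalty = sum(SEVERITY_WEIGHTS[sev] * max(span, 1) for sev, span in errors)
    return max(0.0, 1.0 - penalty / token_count)

# Invented example: three errors across 800 source tokens.
errors = [("minor", 2), ("major", 1), ("minor", 2)]
score = quality_score(errors, token_count=800)
print(f"{score:.2%} -> {'PASS' if score >= PASS_THRESHOLD else 'FAIL'}")
```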
Error Breakdown
Severity Distribution (Critical / Major / Minor)
Top Error Categories
MQM Score by Language
Key Findings
Strong on Supported European Languages
German (48), Polish (69), Italian (77), and French (95) all scored well — impressive for a 12B model running locally on technical academic content. With human post-editing, these translations are production-usable.
Technical Content Is Hard
Academic ASR terminology is challenging for any translator — human or machine. Accuracy/Mistranslation (286 errors, 24.5%) and Terminology (139, 11.9%) dominate, which is expected for this domain and reinforces the need for expert review.
Language Support Matters
Officially supported languages averaged ~2.18 MQM per segment vs ~13.67 for unsupported ones. This highlights the importance of checking language support before deploying any MT model, and of having human evaluation to quantify the gap.
Arabic Variants Diverge
Four Arabic variants showed wide variation: Morocco (65), Egypt (87), MSA (130), Saudi Arabia (142). Arabic (Morocco) is unsupported yet scored 2nd overall — showing that language proximity can compensate for missing explicit support.
MT + Human Review = Production Quality
TranslateGemma is a compelling model for local MT deployment — fast, open-weight, and performant for supported languages. But as with any LLM-based translation, specialized domains benefit from human review. MQM annotation identifies exactly where and how MT falls short, enabling targeted post-editing that's faster and cheaper than translating from scratch. The model does the heavy lifting; the linguist ensures quality.
Human vs Automatic Metrics
How do state-of-the-art automatic metrics compare to our linguists' assessments?
We ran two leading automatic MT evaluation metrics — MetricX (Google) and COMET-Kiwi (Unbabel) — on the same translations, in QE (quality estimation) mode without reference translations. Both are neural metrics used in WMT evaluation campaigns. We compared their language rankings against our human MQM scores to measure how well machines can approximate expert judgment.
| Metric | Pearson r | Correlation |
|---|---|---|
| MetricX-24 XXL | 0.882 | Strongest |
| COMET-Kiwi XL | 0.841 | Strong |
| COMET-Kiwi (base) | 0.841 | Strong |
| MetricX-24 XL | 0.798 | Strong |
| COMET-Kiwi XXL | 0.796 | Strong |
| MetricX (Vertex AI) | 0.250 | Weak |
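As a rough illustration of the two steps involved, the sketch below scores translations with the base COMET-Kiwi model in reference-free (QE) mode and then correlates per-language averages with the human MQM scores. The checkpoint shown is Unbabel's public wmt22-cometkiwi-da (gated on HuggingFace, so a login is required); the numbers are invented, and the exact model sizes and aggregation used in the study are not reproduced here. MetricX would be run analogously from its own repository.

```python
# pip install unbabel-comet scipy
from comet import download_model, load_from_checkpoint
from scipy.stats import pearsonr

# 1) Reference-free (QE) scoring with the base COMET-Kiwi model.
ckpt = download_model("Unbabel/wmt22-cometkiwi-da")
kiwi = load_from_checkpoint(ckpt)
data = [
    {"src": "Multilingual ASR models have gained significant attention ...",
     "mt":  "Mehrsprachige ASR-Modelle haben große Aufmerksamkeit erlangt ..."},
    # ... one dict per translated segment; no reference needed in QE mode
]
qe = kiwi.predict(data, batch_size=8, gpus=0)
print(qe.system_score)            # corpus-level QE score

# 2) Correlating per-language metric averages with human MQM scores.
# Invented numbers (not the study's data). COMET-Kiwi is higher-is-better
# while MQM penalties are lower-is-better, so a metric that tracks the
# humans well shows a strong negative r here (report |r|).
human_mqm  = [48, 77, 95, 344, 409, 1129]          # e.g. de, it, fr, ja, ko, hmn
kiwi_score = [0.84, 0.82, 0.80, 0.62, 0.58, 0.31]  # hypothetical per-language means
r, p = pearsonr(human_mqm, kiwi_score)
print(f"Pearson r = {r:.3f} (p = {p:.3g})")
```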
Best Automatic Metric
Surprise Finding
What This Means
The best automatic metrics correlate strongly with our human annotations (r=0.88) — which validates that the Alconost linguists' MQM scores are consistent and reliable. But automatic metrics still miss nuance: they struggle with mid-range languages and can't provide the error spans, categories, corrections, and explanations that human annotators produce. Automatic metrics tell you how good a translation is; human MQM tells you what's wrong and how to fix it.
Cost Analysis
What does a 16-language MQM evaluation cost?
Cost by Region
Rates reflect 2024–2025 market research from Upwork, ProZ, and regional salary data. Rare language expertise (Hmong) commands a premium. Eastern Europe offers the best cost-efficiency for Slavic languages. The total cost of under $1,000 for a 16-language, 45-linguist MQM evaluation demonstrates that professional human annotation is accessible — not just for large enterprises.
Downloads & Data
Sample MQM Reports
Generated by the MQM Tool — one report per language, showing error breakdowns, quality scores, and annotations
| Language | MQM Score | Quality | Annotation Data / MQM Report |
|---|---|---|---|
| German | 48 | 98.94% | |
| Italian | 77 | 98.44% | |
| Japanese | 344 | 84.84% | |
| Korean | 409 | 90.04% | |
| Hmong | 1,129 | 44.00% | |
Need professional annotation or MT evaluation?
We produce MQM-annotated datasets and MT quality evaluations on demand — any language pair, any model, any domain. Alconost's linguist network covers 100+ languages with native-speaker evaluators ready to deploy. From structured dataset annotation to full MT benchmarking studies — we handle the entire process from project setup to delivery.