MQM-based translation quality annotation for human and machine translation evaluation.
Upload translation outputs, annotate errors according to MQM guidelines, and export structured annotations for system-level and segment-level analysis.
A service by Alconost, a localization provider with over 20 years of experience.
Aligned with WMT evaluation practices and inspired by open MQM annotation tools such as Anthea and Marot.
MQM (Multidimensional Quality Metrics) is an error-based framework for translation quality evaluation widely used in WMT Shared Tasks and industry benchmarks.
Instead of holistic scores, MQM evaluates translations by annotating explicit error categories and severities (e.g. Accuracy, Terminology, Fluency), enabling fine-grained, reproducible, and comparable evaluation.
MQM is used for both system-level benchmarking and segment-level error analysis.
We convert MQM annotations into a single Quality Score (%) by normalizing error penalties by translation length using XLM-R SentencePiece tokens. This ensures fair and consistent scoring across all languages, including CJK and languages without whitespace.
By default, a project passes if the Quality Score is 99.0% or higher. Both the Pass/Fail threshold and Error Weights are fully adjustable.
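As an illustration, the scoring described above can be sketched in a few lines of Python. The severity weights below (Minor = 1, Major = 5, Critical = 25) follow common WMT MQM practice and are assumptions, as is the two-decimal rounding; the tool's actual weights are configurable, and token counts come from the XLM-R SentencePiece tokenizer rather than the plain count used here.

```python
# Sketch of an MQM-style Quality Score: sum severity-weighted error
# penalties, normalize by the token length of the translation, and
# express the result as a percentage (clamped at 0).
DEFAULT_WEIGHTS = {"Minor": 1.0, "Major": 5.0, "Critical": 25.0}

def quality_score(severities, num_tokens, weights=DEFAULT_WEIGHTS):
    """severities: one entry per annotated error; num_tokens: length of
    the target in tokens (XLM-R SentencePiece tokens in the real tool)."""
    penalty = sum(weights[s] for s in severities)
    score = max(0.0, 100.0 * (1.0 - penalty / num_tokens))
    return round(score, 2)

# Two Minor errors over a 200-token translation:
score = quality_score(["Minor", "Minor"], num_tokens=200)
passed = score >= 99.0  # default Pass/Fail threshold
```

With these assumed weights, two Minor errors in 200 tokens yield exactly the default 99.0% threshold, so the project would just pass.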
Gain insights into annotator performance and translation quality distributions.
Example from Case Study: EuroLLM-22B (EN-IT)
Simple tabular data, one error per row. Mark error spans with `<v>…</v>` tags in the target column.

| source | target | src_lang | tgt_lang | category | severity | comment |
|---|---|---|---|---|---|---|
| Cat | Gato | en | es | | | |
| Dog | `<v>Dug</v>` | en | es | Fluency | Minor | Typo |
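The `<v>…</v>` convention (also used by Anthea/Marot-style tools) marks the exact error span inside the target. A minimal sketch of parsing such a span into character offsets (the helper name is ours, not part of the tool):

```python
import re

def extract_error_span(target):
    """Return (clean_target, start, end) for the first <v>...</v>-marked
    error span, or (target, None, None) if no span is marked."""
    m = re.search(r"<v>(.*?)</v>", target, flags=re.S)
    if not m:
        return target, None, None
    # Remove the markup; offsets refer to the cleaned string.
    clean = target[:m.start()] + m.group(1) + target[m.end():]
    return clean, m.start(), m.start() + len(m.group(1))
```

For the sample row above, `extract_error_span("<v>Dug</v>")` yields `("Dug", 0, 3)`, matching the `start`/`end` offsets used in the JSONL format.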
Supported fields: `source`, `target` (alias: `translation`), `segment_id`, `system_id`, `doc_id`, `src_lang`, `tgt_lang`, `context`, `annotator_id`, `correction`, `category`, `severity`, `comment`, `timestamp` (time of the error or "No error" labelling event).

Standard line-delimited JSON. Ideal for nested annotations and metadata.
```json
{
  "source": "Dog",
  "target": "Dug",
  "src_lang": "en",
  "tgt_lang": "es",
  "correction": "Perro",
  "timestamp": 1234567890,
  "annotations": [
    {
      "start": 0, "end": 3,
      "category": "Fluency",
      "severity": "Minor",
      "comment": "Typo",
      "timestamp": 1234567890
    }
  ]
}
```

Supported fields: `source`, `target`, `segment_id`, `system_id`, `doc_id`, `src_lang`, `tgt_lang`, `context`, `annotator_id`, `correction`, `timestamp` (time of the "No error" labelling event), and `annotations` (each with `start`, `end`, `category`, `severity`, `comment`, and `timestamp`: time of the error labelling event).
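An exported JSONL file can be consumed with a short Python sketch like the following (helper names are ours; the `no-error` category marks segments confirmed as error-free):

```python
import json

def load_segments(lines):
    """Parse line-delimited JSON; `lines` is any iterable of strings
    (e.g. an open file handle). Blank lines are skipped."""
    for line in lines:
        if line.strip():
            yield json.loads(line)

def count_errors(segments):
    """Tally annotated errors by (category, severity), skipping
    segments explicitly marked 'no-error'."""
    counts = {}
    for seg in segments:
        for ann in seg.get("annotations", []):
            if ann.get("category") == "no-error":
                continue
            key = (ann["category"], ann["severity"])
            counts[key] = counts.get(key, 0) + 1
    return counts
```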
To explicitly mark a segment as valid (containing no errors), provide an annotation with the category `no-error`. This ensures the segment is counted as "checked and correct" rather than merely "skipped" or "pending".
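Following the JSONL schema above, a segment explicitly confirmed as error-free might look like this (a sketch; the field selection shown is illustrative):

```json
{"source": "Cat", "target": "Gato", "src_lang": "en", "tgt_lang": "es", "annotations": [{"category": "no-error", "timestamp": 1234567890}]}
```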
You can provide additional context for linguists using the `context` field. This is ideal for passing glossary terms, reference links, style guide rules, or any other metadata that helps annotators make better decisions. The field is displayed prominently in the annotation interface.
Automate your localization quality workflow with our simple REST API. OpenAPI Spec
Programmatically create projects and upload content. Supports TSV, CSV, JSONL, and raw JSON.
Retrieve annotated data and metrics in standard formats (JSONL, TSV, CSV).
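As an illustration, here is a minimal Python sketch of the upload path using only the standard library. The base URL, the `/projects` route, the payload fields, and the bearer-token auth are all placeholders rather than the real contract; consult the OpenAPI spec for the actual endpoints and schemas.

```python
import json
import urllib.request

API_BASE = "https://example.com/api"  # placeholder base URL

def tsv_rows(records):
    """Serialize dicts into the TSV upload format shown above."""
    cols = ["source", "target", "src_lang", "tgt_lang",
            "category", "severity", "comment"]
    lines = ["\t".join(cols)]
    for r in records:
        lines.append("\t".join(str(r.get(c, "")) for c in cols))
    return "\n".join(lines)

def create_project(token, name, tsv):
    """POST a new project; '/projects' is a hypothetical route."""
    req = urllib.request.Request(
        f"{API_BASE}/projects",
        data=json.dumps({"name": name, "content": tsv}).encode(),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```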
Unsure about translation quality from AI or vendors?
Have a professional linguist complete a blind MQM error annotation and get an objective quality verdict.
Annotation is performed by linguists with a 20+ year track record in translation quality, across 120 languages.
Building the next SOTA translation model or metric?
Access high-quality, human-verified MQM gold standards for robust evaluation and RLHF.
Ethically Sourced Data. We create custom datasets from scratch or use open licenses. We never resell client data.
This inter-annotator agreement study analyzes independent, double-blind annotations of EuroLLM-22B and Qwen3-235B outputs by two professional linguists using the MQM framework. Using a high-context document (ID: 114294867111841563) from the WMT 2025 General Machine Translation Shared Task, it focuses on English → Italian translations in the Social Media domain (10 segments, approx. 1,630 words). The labeled datasets used in this study are part of the Alconost MQM Translation Gold Dataset.
| Model Source | Annotation Data / MQM Report |
|---|---|
| EuroLLM-22B | |
| EuroLLM-22B | |
| Qwen3-235B | |
| Qwen3-235B | |
45 Alconost linguists evaluated Google's TranslateGemma-12B — a compact open-weight translation model — across 16 target languages (12 officially supported, 4 unsupported). Source material: 7 segments from an academic ASR paper, translated and annotated using the MQM framework with 3 independent evaluators per language. The API-driven workflow managed 48 annotation projects from creation to export in 6 days. Moroccan Arabic ranked 2nd overall despite being unsupported, outperforming 10 officially supported languages — suggesting language proximity matters more than explicit model support. Automatic metrics (MetricX-24 XXL) correlated at r=0.88 with human MQM scores, strong enough to triage quality before expensive human review. All annotation data is part of the Alconost MQM Translation Gold Dataset.
| Language | MQM | Annotation Data / MQM Report |
|---|---|---|
| German | 48 | |
| Italian | 77 | |
| Japanese | 344 | |
| Korean | 409 | |
| Hmong | 1129 | |
We offer Private and Source Code licenses for enterprises with strict security requirements. Deploy on your own infrastructure, integrate with your internal tools, or embed in your platform.
Alconost is a global localization company, providing multilingual localization, translation, and language quality services to international technology companies for over 20 years.
The company’s core business is language services at scale, including translation, localization QA, and multilingual content operations across 120+ languages. Alconost works with a range of global digital platforms and SaaS companies, including long-term engagements with large technology clients.
Over time, and in response to client needs, Alconost has expanded its capabilities into AI-related services, including multilingual data labeling, machine translation and LLM evaluation, and human-in-the-loop quality assurance. These services are developed as an extension of Alconost’s localization and linguistic quality expertise, not as a standalone AI-only offering.
This tool uses the Multidimensional Quality Metrics (MQM) framework, licensed under CC BY 4.0 by The MQM Council.