Free MQM Annotation Tool by Alconost

MQM-based translation quality annotation for human and machine translation evaluation.

Upload translation outputs, annotate errors according to MQM guidelines, and export structured annotations for system-level and segment-level analysis.

A service by Alconost, a localization provider with over 20 years of experience.

Aligned with WMT evaluation practices and inspired by open MQM annotation tools such as Anthea and Marot.

What is MQM?

MQM (Multidimensional Quality Metrics) is an error-based framework for translation quality evaluation widely used in WMT Shared Tasks and industry benchmarks.

Instead of holistic scores, MQM evaluates translations by annotating explicit error categories and severities (e.g. Accuracy, Terminology, Fluency), enabling fine-grained, reproducible, and comparable evaluation.

MQM is used for:

  • Human translation evaluation
  • Machine translation evaluation (including neural and LLM-based MT)
  • System comparison and ranking
  • Error analysis and regression testing

MQM Taxonomy Reference

Error Categories

Accuracy / Addition
Accuracy / Omission
Accuracy / Mistranslation
Accuracy / Source error
Accuracy / Untranslated
Fluency / Punctuation
Fluency / Spelling
Fluency / Grammar
Fluency / Register
Fluency / Inconsistency
Fluency / Character encoding
Terminology
Style
Locale convention
Audience appropriateness
Design and markup
Other

Severities

  • Minor
  • Major
  • Critical

Read more on the official MQM website

Reporting & Analytics

Project Scoring

We convert MQM annotations into a single Quality Score (%) by normalizing error penalties by translation length, measured in XLM-R SentencePiece tokens. This keeps scoring fair and consistent across languages, including CJK and other languages that do not use whitespace word boundaries.

How the score is calculated:

Total Penalty = Σ (Error count × Error weight)
Quality Score (%) = (1 − Total Penalty ÷ Total token count) × 100

Default Weights:
  • Critical: 25
  • Major: 5
  • Minor: 1

By default, a project passes if the Quality Score is 99.0% or higher. Both the Pass/Fail threshold and Error Weights are fully adjustable.
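The formulas above can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and it counts tokens by whitespace splitting for simplicity, whereas the tool itself uses XLM-R SentencePiece tokens.

```python
# Default severity weights and pass threshold from the documentation above.
SEVERITY_WEIGHTS = {"Critical": 25, "Major": 5, "Minor": 1}

def quality_score(error_severities, total_tokens, weights=SEVERITY_WEIGHTS):
    """error_severities: one severity label per annotated error."""
    total_penalty = sum(weights[sev] for sev in error_severities)
    return (1 - total_penalty / total_tokens) * 100

def passes(score, threshold=99.0):
    """Default pass criterion; both threshold and weights are adjustable."""
    return score >= threshold

# Example: 2 Minor + 1 Major error over 1,000 tokens -> penalty 7 -> 99.3%
score = quality_score(["Minor", "Minor", "Major"], total_tokens=1000)
```

With the default weights, a single Critical error in a 1,000-token project (penalty 25, score 97.5%) is already enough to fail the 99% threshold.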

Insights

Gain insights into annotator performance and translation quality distributions.

Download Sample Report (PDF)

Example from Case Study: EuroLLM-22B (EN→IT)

Data Format Specifications

CSV / TSV Layout

Simple tabular data, one error per row. Mark each error by wrapping the affected span in <v>…</v> tags in the target column.

Example.tsv
source	target	src_lang	tgt_lang	category	severity	comment
Cat	Gato	en	es			
Dog	<v>Dug</v>	en	es	Fluency	Minor	Typo
  • Required: source, target (alias: translation)
  • Optional:
    • segment_id
    • system_id
    • doc_id
    • src_lang
    • tgt_lang
    • context
    • annotator_id
    • correction
    • category
    • severity
    • comment
    • timestamp: time of the error or "No error" labelling event
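As an illustration, the one-error-per-row layout can be produced with Python's csv module and a tab delimiter. This is a sketch: the column names follow the spec above, everything else is sample data.

```python
import csv
import io

fields = ["source", "target", "src_lang", "tgt_lang", "category", "severity", "comment"]
rows = [
    {"source": "Cat", "target": "Gato", "src_lang": "en", "tgt_lang": "es",
     "category": "", "severity": "", "comment": ""},
    # The <v>...</v> markers delimit the error span inside the target text.
    {"source": "Dog", "target": "<v>Dug</v>", "src_lang": "en", "tgt_lang": "es",
     "category": "Fluency", "severity": "Minor", "comment": "Typo"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fields, delimiter="\t", lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
tsv_text = buf.getvalue()  # ready to save as .tsv and upload
```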

JSONL Structure

Standard line-delimited JSON. Ideal for nested annotations and metadata.

Example.jsonl
{
  "source": "Dog",
  "target": "Dug", 
  "src_lang": "en",
  "tgt_lang": "es",
  "correction": "Perro",
  "timestamp": 1234567890,
  "annotations": [
    {
      "start": 0, "end": 3,
      "category": "Fluency",
      "severity": "Minor",
      "comment": "Typo",
      "timestamp": 1234567890
    }
  ]
}
  • Required: source, target
  • Optional:
    • segment_id
    • system_id
    • doc_id
    • src_lang
    • tgt_lang
    • context
    • annotator_id
    • correction
    • timestamp: time of the "No error" labelling event
    • annotations (start, end, category, severity, comment, timestamp: time of the error labelling event)
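A minimal sketch of writing and reading one segment in this structure, using the standard json module; the values mirror the example record above.

```python
import json

segment = {
    "source": "Dog",
    "target": "Dug",
    "src_lang": "en",
    "tgt_lang": "es",
    "correction": "Perro",
    "annotations": [
        {"start": 0, "end": 3, "category": "Fluency",
         "severity": "Minor", "comment": "Typo"},
    ],
}

line = json.dumps(segment, ensure_ascii=False)  # one JSON object per JSONL line
parsed = json.loads(line)

# The (start, end) offsets index into the target string.
ann = parsed["annotations"][0]
span = parsed["target"][ann["start"]:ann["end"]]
```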

Marking No Errors

To explicitly mark a segment as valid (containing no errors), provide an annotation with the category no-error. This ensures the segment is counted as "checked and correct" rather than just "skipped" or "pending".

Using the Context Field

You can provide additional context for linguists using the context field. This is ideal for passing glossary terms, reference links, style guide rules, or any other metadata that helps annotators make better decisions. This field is displayed prominently in the annotation interface.

API Integration

Automate your localization quality workflow with our simple REST API. OpenAPI Spec

Automated Import

Programmatically create projects and upload content. Supports TSV, CSV, JSONL, and raw JSON.

# Example: Text Import
curl -X POST "https://alconost.mt/mqm-tool/api/import" \
  -H "Authorization: Bearer my_token" \
  -F "file=@data.csv"
# Example: JSON Payload
curl -X POST "https://alconost.mt/mqm-tool/api/import" \
  -H "Authorization: Bearer my_token" \
  -H "Content-Type: application/json" \
  -d '{"name": "My Project", "segments": [{"source": "src", "target": "mt", "system_id": "v1", "doc_id": "d1", ...}]}'
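The same JSON-payload call can be built from the Python standard library. This sketch mirrors the curl example: the endpoint URL comes from the docs, `my_token` is a placeholder, and the request is constructed but not sent.

```python
import json
import urllib.request

API_TOKEN = "my_token"  # placeholder: substitute your real API token

payload = {
    "name": "My Project",
    "segments": [
        {"source": "src", "target": "mt", "system_id": "v1", "doc_id": "d1"},
    ],
}

req = urllib.request.Request(
    "https://alconost.mt/mqm-tool/api/import",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {API_TOKEN}",
        "Content-Type": "application/json",
    },
    method="POST",
)
# urllib.request.urlopen(req)  # uncomment to actually send the request
```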

Automated Export

Retrieve annotated data and metrics in standard formats (JSONL, TSV, CSV).

# Example: Get Results
curl "https://alconost.mt/mqm-tool/api/projects/{id}/export?format=jsonl" \
  -H "Authorization: Bearer my_token"
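Once retrieved, the JSONL export can be processed line by line. A sketch on inline sample data, assuming the record shape documented in the JSONL section above:

```python
import json

# Sample lines standing in for a `format=jsonl` export response.
export_lines = [
    '{"source": "Dog", "target": "Dug", "annotations":'
    ' [{"start": 0, "end": 3, "category": "Fluency", "severity": "Minor"}]}',
    '{"source": "Cat", "target": "Gato", "annotations": []}',
]

# Count annotated errors per severity across all exported segments.
counts = {}
for raw in export_lines:
    for ann in json.loads(raw).get("annotations", []):
        counts[ann["severity"]] = counts.get(ann["severity"], 0) + 1
```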

Expert MQM Annotation

Human-in-the-Loop Quality Assessment

Unsure about translation quality from AI or vendors?

Have a professional linguist complete a blind MQM error annotation and get an objective quality verdict.

  • Professional Linguists (blind résumé included)
  • Multiple Linguists & IAA calculation
  • Detailed Reports with Pass/Fail verdict
    See examples: PDF TSV JSON
  • Optional Corrections & Error Explanations

Annotation performed by Alconost linguists with a 20+ year track record in translation quality, covering 120 languages.

MQM Golden Datasets

For LLM Benchmarking & Fine-tuning

Building the next SOTA translation model or metric?

Access high-quality, human-verified MQM gold standards for robust evaluation and RLHF.

  • Gold Standard Data for Translation Evaluation
  • Benchmarks for Quality Metric Development
  • Training data for Model Fine-tuning (RLHF)
  • Custom Domain Datasets available on request
🤗 Demo Dataset on Hugging Face: alconost/mqm-translation-gold

Ethically Sourced Data. We create custom datasets from scratch or use open licenses. We never resell client data.

Case Study 1: Inter-Annotator Agreement

EN→IT MQM Annotation Analysis

WMT 2025 · 2 Annotators · Social Domain

This inter-annotator agreement study analyzes independent, double-blind annotations of EuroLLM-22B and Qwen3-235B outputs by two professional linguists using the MQM framework. Using a high-context document (ID: 114294867111841563) from the WMT 2025 General Machine Translation Shared Task, it focuses on English → Italian translations in the Social Media domain (10 segments, approx. 1,630 words). The labeled datasets used in this study are part of the Alconost MQM Translation Gold Dataset.

Kendall's τ: 0.317
Jaccard Index: 13.5%
Total Errors: 176
Severity Agreement: 71.4%

Download Resources

Model · Source · Annotation Data / MQM Report
EuroLLM-22B
Qwen3-235B

Case Study 2: TranslateGemma Quality Evaluation

EN→16 Languages · 46 Linguists · MQM Annotation

16 Languages · 46 Evaluators · Academic Domain

46 Alconost linguists evaluated Google's TranslateGemma-12B — a compact open-weight translation model — across 16 target languages (12 officially supported, 4 unsupported). Source material: 7 segments from an academic ASR paper, translated and annotated using the MQM framework with 3 independent evaluators per language. The API-driven workflow managed 48 annotation projects from creation to export in 6 days.

Moroccan Arabic ranked 2nd overall despite being unsupported, outperforming 10 officially supported languages — suggesting language proximity matters more than explicit model support. Automatic metrics (MetricX-24 XXL) correlated at r=0.88 with human MQM scores, strong enough to triage quality before expensive human review. All annotation data is part of the Alconost MQM Translation Gold Dataset.

Total Errors Found: 1,169
Total Effort: 34h
Total Cost: $971
Pass at 99%: 0/16

Sample MQM Reports

Language · MQM Annotation Data / MQM Report
German · 48
Italian · 77
Japanese · 344
Korean · 409
Hmong · 1129

Need Privacy & Control?

We offer Private and Source Code licenses for enterprises with strict security requirements. Deploy on your own infrastructure, integrate with your internal tools, or embed in your platform.

On-premise · White-label & Customization · Source Code

About Alconost

Alconost is a global localization company, providing multilingual localization, translation, and language quality services to international technology companies for over 20 years.

The company’s core business is language services at scale, including translation, localization QA, and multilingual content operations across 120+ languages. Alconost works with a range of global digital platforms and SaaS companies, including long-term engagements with large technology clients.

Over time, and in response to client needs, Alconost has expanded its capabilities into AI-related services, including multilingual data labeling, machine translation and LLM evaluation, and human-in-the-loop quality assurance. These services are developed as an extension of Alconost’s localization and linguistic quality expertise, not as a standalone AI-only offering.

This tool uses the Multidimensional Quality Metrics (MQM) framework, licensed under CC BY 4.0 by The MQM Council.