Free MQM Annotation Tool by Alconost

MQM-based translation quality annotation for human and machine translation evaluation.

Upload translation outputs, annotate errors according to MQM guidelines, and export structured annotations for system-level and segment-level analysis.

A service by Alconost, a localization provider with over 20 years of experience.

Aligned with WMT evaluation practices and inspired by open MQM annotation tools such as Anthea and Marot.

What is MQM?

MQM (Multidimensional Quality Metrics) is an error-based framework for translation quality evaluation widely used in WMT Shared Tasks and industry benchmarks.

Instead of holistic scores, MQM evaluates translations by annotating explicit error categories and severities (e.g. Accuracy, Terminology, Fluency), enabling fine-grained, reproducible, and comparable evaluation.

MQM is used for:

  • Human translation evaluation
  • Machine translation evaluation (including neural and LLM-based MT)
  • System comparison and ranking
  • Error analysis and regression testing

MQM Taxonomy Reference

Error Categories

Accuracy / Addition
Accuracy / Omission
Accuracy / Mistranslation
Accuracy / Source error
Accuracy / Untranslated
Fluency / Punctuation
Fluency / Spelling
Fluency / Grammar
Fluency / Register
Fluency / Inconsistency
Fluency / Character encoding
Terminology
Style
Locale convention
Audience appropriateness
Design and markup
Other

Severities

  • Minor
  • Major
  • Critical

Read more on the official MQM website

Reporting & Analytics

Project Scoring

We convert MQM annotations into a single Quality Score (%) by normalizing error penalties by translation length, measured in XLM-R SentencePiece tokens. This keeps scoring fair and consistent across languages, including CJK and other languages that do not use whitespace word boundaries.

How the score is calculated:

Total Penalty = Σ (Error count × Error weight)
Quality Score (%) = (1 − Total Penalty ÷ Total token count) × 100

Default Weights:
  • Critical: 25
  • Major: 5
  • Minor: 1

By default, a project passes if the Quality Score is 99.0% or higher. Both the Pass/Fail threshold and Error Weights are fully adjustable.
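The formulas above can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and it counts tokens by whitespace splitting for simplicity, whereas the tool itself uses XLM-R SentencePiece tokens.

```python
# Default severity weights and pass threshold from the documentation above.
SEVERITY_WEIGHTS = {"Critical": 25, "Major": 5, "Minor": 1}

def quality_score(error_severities, total_tokens, weights=SEVERITY_WEIGHTS):
    """error_severities: one severity label per annotated error."""
    total_penalty = sum(weights[sev] for sev in error_severities)
    return (1 - total_penalty / total_tokens) * 100

def passes(score, threshold=99.0):
    """Default pass criterion; both threshold and weights are adjustable."""
    return score >= threshold

# Example: 2 Minor + 1 Major error over 1,000 tokens -> penalty 7 -> 99.3%
score = quality_score(["Minor", "Minor", "Major"], total_tokens=1000)
```

With the default weights, a single Critical error in a 1,000-token project (penalty 25, score 97.5%) is already enough to fail the 99% threshold.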

Insights

Gain insights into annotator performance and translation quality distributions.

Download Sample Report (PDF)

Example from Case Study: EuroLLM-22B (EN→IT)

Data Format Specifications

CSV / TSV Layout

Simple tabular data, one error per row. Mark each error by wrapping the affected span in <v>…</v> tags in the target column.

Example.tsv
source	target	src_lang	tgt_lang	category	severity	comment
Cat	Gato	en	es			
Dog	<v>Dug</v>	en	es	Fluency	Minor	Typo
  • Required: source, target (alias: translation)
  • Optional:
    • segment_id
    • system_id
    • doc_id
    • src_lang
    • tgt_lang
    • context
    • annotator_id
    • correction
    • category
    • severity
    • comment
    • timestamp: time of the error or "No error" labelling event
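As an illustration, the one-error-per-row layout can be produced with Python's csv module and a tab delimiter. This is a sketch: the column names follow the spec above, everything else is sample data.

```python
import csv
import io

fields = ["source", "target", "src_lang", "tgt_lang", "category", "severity", "comment"]
rows = [
    {"source": "Cat", "target": "Gato", "src_lang": "en", "tgt_lang": "es",
     "category": "", "severity": "", "comment": ""},
    # The <v>...</v> markers delimit the error span inside the target text.
    {"source": "Dog", "target": "<v>Dug</v>", "src_lang": "en", "tgt_lang": "es",
     "category": "Fluency", "severity": "Minor", "comment": "Typo"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fields, delimiter="\t", lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
tsv_text = buf.getvalue()  # ready to save as .tsv and upload
```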

JSONL Structure

Standard line-delimited JSON. Ideal for nested annotations and metadata.

Example.jsonl
{
  "source": "Dog",
  "target": "Dug", 
  "src_lang": "en",
  "tgt_lang": "es",
  "correction": "Perro",
  "timestamp": 1234567890,
  "annotations": [
    {
      "start": 0, "end": 3,
      "category": "Fluency",
      "severity": "Minor",
      "comment": "Typo",
      "timestamp": 1234567890
    }
  ]
}
  • Required: source, target
  • Optional:
    • segment_id
    • system_id
    • doc_id
    • src_lang
    • tgt_lang
    • context
    • annotator_id
    • correction
    • timestamp: time of the "No error" labelling event
    • annotations (start, end, category, severity, comment, timestamp: time of the error labelling event)
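A minimal sketch of writing and reading one segment in this structure, using the standard json module; the values mirror the example record above.

```python
import json

segment = {
    "source": "Dog",
    "target": "Dug",
    "src_lang": "en",
    "tgt_lang": "es",
    "correction": "Perro",
    "annotations": [
        {"start": 0, "end": 3, "category": "Fluency",
         "severity": "Minor", "comment": "Typo"},
    ],
}

line = json.dumps(segment, ensure_ascii=False)  # one JSON object per JSONL line
parsed = json.loads(line)

# The (start, end) offsets index into the target string.
ann = parsed["annotations"][0]
span = parsed["target"][ann["start"]:ann["end"]]
```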

Marking No Errors

To explicitly mark a segment as valid (containing no errors), provide an annotation with the category no-error. This ensures the segment is counted as "checked and correct" rather than just "skipped" or "pending".

Using the Context Field

You can provide additional context for linguists using the context field. This is ideal for passing glossary terms, reference links, style guide rules, or any other metadata that helps annotators make better decisions. This field is displayed prominently in the annotation interface.

API Integration

Automate your localization quality workflow with our simple REST API. OpenAPI Spec

Automated Import

Programmatically create projects and upload content. Supports TSV, CSV, JSONL, and raw JSON.

# Example: Text Import
curl -X POST "https://alconost.mt/mqm-tool/api/import" \
  -H "Authorization: Bearer my_token" \
  -F "file=@data.csv"
# Example: JSON Payload
curl -X POST "https://alconost.mt/mqm-tool/api/import" \
  -H "Authorization: Bearer my_token" \
  -H "Content-Type: application/json" \
  -d '{"name": "My Project", "segments": [{"source": "src", "target": "mt", "system_id": "v1", "doc_id": "d1", ...}]}'
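The same JSON-payload call can be built from the Python standard library. This sketch mirrors the curl example: the endpoint URL comes from the docs, `my_token` is a placeholder, and the request is constructed but not sent.

```python
import json
import urllib.request

API_TOKEN = "my_token"  # placeholder: substitute your real API token

payload = {
    "name": "My Project",
    "segments": [
        {"source": "src", "target": "mt", "system_id": "v1", "doc_id": "d1"},
    ],
}

req = urllib.request.Request(
    "https://alconost.mt/mqm-tool/api/import",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {API_TOKEN}",
        "Content-Type": "application/json",
    },
    method="POST",
)
# urllib.request.urlopen(req)  # uncomment to actually send the request
```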

Automated Export

Retrieve annotated data and metrics in standard formats (JSONL, TSV, CSV).

# Example: Get Results
curl "https://alconost.mt/mqm-tool/api/projects/{id}/export?format=jsonl" \
  -H "Authorization: Bearer my_token"
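Once retrieved, the JSONL export can be processed line by line. A sketch on inline sample data, assuming the record shape documented in the JSONL section above:

```python
import json

# Sample lines standing in for a `format=jsonl` export response.
export_lines = [
    '{"source": "Dog", "target": "Dug", "annotations":'
    ' [{"start": 0, "end": 3, "category": "Fluency", "severity": "Minor"}]}',
    '{"source": "Cat", "target": "Gato", "annotations": []}',
]

# Count annotated errors per severity across all exported segments.
counts = {}
for raw in export_lines:
    for ann in json.loads(raw).get("annotations", []):
        counts[ann["severity"]] = counts.get(ann["severity"], 0) + 1
```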

Expert MQM Annotation

Human-in-the-Loop Quality Assessment

Unsure about translation quality from AI or vendors?

Have a professional linguist complete a blind MQM error annotation and get an objective quality verdict.

  • Professional Linguists (blind résumé included)
  • Multiple Linguists & IAA calculation
  • Detailed Reports with Pass/Fail verdict
    See examples: PDF TSV JSON
  • Optional Corrections & Error Explanations

Annotation performed by Alconost linguists with a 20+ year track record in translation quality, covering 120 languages.

MQM Golden Datasets

For LLM Benchmarking & Fine-tuning

Building the next SOTA translation model or metric?

Access high-quality, human-verified MQM gold standards for robust evaluation and RLHF.

  • Gold Standard Data for Translation Evaluation
  • Benchmarks for Quality Metric Development
  • Training data for Model Fine-tuning (RLHF)
  • Custom Domain Datasets available on request
🤗 Demo Dataset on Hugging Face: alconost/mqm-translation-gold

Ethically Sourced Data. We create custom datasets from scratch or use open licenses. We never resell client data.

Case Study 1: Inter-Annotator Agreement

EN→IT MQM Annotation Analysis

WMT 2025 · 2 Annotators · Social Domain

This inter-annotator agreement study analyzes independent, double-blind annotations of EuroLLM-22B and Qwen3-235B outputs by two professional linguists using the MQM framework. Using a high-context document (ID: 114294867111841563) from the WMT 2025 General Machine Translation Shared Task, it focuses on English → Italian translations in the Social Media domain (10 segments, approx. 1,630 words). The labeled datasets used in this study are part of the Alconost MQM Translation Gold Dataset.

Kendall's τ: 0.317
Jaccard Index: 13.5%
Total Errors: 176
Severity Agreement: 71.4%

Download Resources

Model · Source · Annotation Data / MQM Report
EuroLLM-22B
Qwen3-235B

Case Study 2: TranslateGemma Quality Evaluation

EN→16 Languages · 46 Linguists · MQM Annotation

16 Languages · 46 Evaluators · Academic Domain

46 Alconost linguists evaluated Google's TranslateGemma-12B — a compact open-weight translation model — across 16 target languages (12 officially supported, 4 unsupported). Source material: 7 segments from an academic ASR paper, translated and annotated using the MQM framework with 3 independent evaluators per language. The API-driven workflow managed 48 annotation projects from creation to export in 6 days.

Moroccan Arabic ranked 2nd overall despite being unsupported, outperforming 10 officially supported languages — suggesting language proximity matters more than explicit model support. Automatic metrics (MetricX-24 XXL) correlated at r=0.88 with human MQM scores, strong enough to triage quality before expensive human review. All annotation data is part of the Alconost MQM Translation Gold Dataset.

Total Errors Found: 1,169
Total Effort: 34h
Total Cost: $971
Pass at 99%: 0/16

Sample MQM Reports

Language · MQM Annotation Data / MQM Report
German · 48
Italian · 77
Japanese · 344
Korean · 409
Hmong · 1129

Need Privacy & Control?

We offer Private and Source Code licenses for enterprises with strict security requirements. Deploy on your own infrastructure, integrate with your internal tools, or embed in your platform.

On-premise · White-label & Customization · Source Code

About Alconost

Alconost is a global localization company, providing multilingual localization, translation, and language quality services to international technology companies for over 20 years.

The company’s core business is language services at scale, including translation, localization QA, and multilingual content operations across 120+ languages. Alconost works with a range of global digital platforms and SaaS companies, including long-term engagements with large technology clients.

Over time, and in response to client needs, Alconost has expanded its capabilities into AI-related services, including multilingual data labeling, machine translation and LLM evaluation, and human-in-the-loop quality assurance. These services are developed as an extension of Alconost’s localization and linguistic quality expertise, not as a standalone AI-only offering.

This tool uses the Multidimensional Quality Metrics (MQM) framework, licensed under CC BY 4.0 by The MQM Council.