OKA/Empirical Evaluation of AI-Assisted Translation
Infodata
- Author: 7804j
- Publisher: Open Knowledge Association (OKA)
- Sponsor: Wikimedia CH (grant)
- Publication date: January 2026
- Data collection timeframe: November 2025 - January 2026
Executive Summary
This report presents findings from a systematic research project evaluating the use of Large Language Models (LLMs) to facilitate high-quality Wikipedia article translations. Supported by Wikimedia CH, this study serves as a formalized complement to OKA's existing work, which has facilitated the publication of over 12,000 articles via LLM-assisted workflows.
While previous efforts focused on broad content production, this project introduced rigorous data collection to analyze error root causes, severity levels, and the precise human workload required to move a draft from AI output to publishable Mainspace quality. The study demonstrates that while LLMs provide substantial benefits over traditional tools—particularly in retaining and formatting Wikicode—human oversight remains the critical safeguard for encyclopedic accuracy and stylistic neutrality.
Key figures and findings:
- 119 articles were published across 10 language pairs, requiring 1,068 hours of human labor (incl. data collection)
- 27% of AI-generated content required editing before publication
- 74% of corrections addressed AI-introduced errors; 26% fixed pre-existing source issues
- All models performed well, each with distinct strengths: Claude and ChatGPT ranked highest for prose quality; Grok excelled at structural "tamability"; Gemini tends to over-summarize content; DeepSeek struggles with formatting.
- Critical error rate of ~6% makes blind publication unsuitable without human review
- Median correction density: 21.1 improvements per 1,000 words across published entries

Disclaimer: The quantitative data presented in this report, including error counts and time-tracking statistics, should be considered approximations. Due to variations in translator workflows, linguistic complexities across 10 language pairs, and individual interpretations of categorization guidelines, confidence intervals for these metrics are relatively broad. However, these variables do not affect the directional consistency of the findings, which identify clear patterns in the structural and qualitative performance of the evaluated AI models.
Research Design
The research was designed to test the efficacy of LLMs in a realistic, grant-funded editing environment through two complementary studies.
Participants and Incentives
The project involved a cohort of 10–15 full-time editors. Participants received the standard OKA stipend to cover their cost of living, plus a 5% participation bonus. This financial support allowed editors to dedicate significant time not just to translation, but to the meticulous logging of data required for this study. Each editor played two roles: translator and peer reviewer.
Study A: Production Workflow Analysis
The primary study tracked the real-world publication of 119 articles across 10 translation directions, including English, Spanish, Portuguese, French, Russian, Italian, and Polish.
- Workflow: Editors generated full article drafts using standardized prompts and published them to the Draft namespace. They then proceeded to "wikify," correct, and verify the text until it met standards for Mainspace publication. A detailed workflow can be found here.
- Data Collection: Participants logged time-on-task for each workflow phase and categorized every edit made to drafts according to predefined error taxonomies.
- Counting Methodology: To avoid data inflation, repeated systemic errors (e.g., a model consistently failing to translate a specific term) were counted as a single error unit per article. Omissions of text were strictly categorized as Severity 3 (Critical). Editors were not blinded to model identity, as familiarity with model behavior enabled iterative prompting and better understanding of tool-specific limitations.
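The deduplication rule described above (one error unit per systemic issue per article) can be sketched in a few lines. This is an illustrative Python sketch only; the field names and sample entries are invented for the example and are not the project's actual logging schema.

```python
# Hypothetical log entries: (article, category, severity, origin) tuples.
# Values are illustrative, not drawn from the study's dataset.
edits = [
    ("Article_A", "Linking", 1, "LLM"),
    ("Article_A", "Linking", 1, "LLM"),    # same systemic issue, same article
    ("Article_A", "Wiki markup", 2, "LLM"),
    ("Article_B", "Linking", 1, "Source"),
]

# Count repeated systemic errors as a single error unit per
# (article, category, origin) combination, as in Study A's counting rule.
units = {(article, category, origin) for article, category, _sev, origin in edits}
print(len(units))  # 3 distinct error units from 4 logged edits
```

The choice of deduplication key is the judgment call the methodology hinges on: a coarser key under-counts genuinely distinct issues, a finer one re-inflates systemic failures.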
The raw data collected can be found here.
Study B: Comparative Model Analysis
A secondary study focused on deep-dive analysis of 3 medium-length articles by senior editors to compare model performance head-to-head. Specific paragraphs were translated by six models simultaneously (Grok, Gemini, Claude, DeepSeek, DeepL, ChatGPT).
Editors ranked paragraph outputs (1 to 6) based on prose naturalness and readability. Separately, they also assessed overall formatting quality (Wikicode, templates, references). Participants from Study A were also asked to share written reflections on their personal user experience with the various tools.
Results: Production Workflow (Study A)
The project logged a total of 1,068 hours of human labor across 119 articles, processing approximately 403,485 words.
Workload Distribution
| Phase | Total Hours | % of Effort |
|---|---|---|
| Translation (Initial Prompting) | 46.58h | 4.4% |
| Formatting (Human Fixes) | 318.66h | 29.8% |
| Text Revision (Translator) | 324.85h | 30.4% |
| Text Revision (Reviewer) | 115.90h | 10.9% |
| Analysis & Logging | 262.08h | 24.5% |
The total human revision time (Translator + Reviewer) accounts for roughly 41% of the workload.
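As a sanity check, the table's effort shares can be recomputed from the raw hours. This is a minimal Python sketch using the figures reported above:

```python
# Phase hours as reported in the workload table.
hours = {
    "Translation (Initial Prompting)": 46.58,
    "Formatting (Human Fixes)": 318.66,
    "Text Revision (Translator)": 324.85,
    "Text Revision (Reviewer)": 115.90,
    "Analysis & Logging": 262.08,
}

total = sum(hours.values())  # ~1,068h, matching the total reported
shares = {phase: round(100 * h / total, 1) for phase, h in hours.items()}

# Human revision share (Translator + Reviewer), as cited in the text.
revision_share = shares["Text Revision (Translator)"] + shares["Text Revision (Reviewer)"]
print(round(revision_share, 1))  # 41.3
```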
This distribution highlights a fundamental shift in the editorial role: AI acts as a drafting engine, while human editors function primarily as code verifiers, fact-checkers, and style regulators.
Edit Magnitude (Levenshtein Distance)
For the 103 articles where character tracking was finalized, the total Levenshtein distance was 538,936 against a total character count of 1,997,629. This indicates that approximately 27% of the original AI-generated content was edited or modified by humans before publication. This substantial revision rate reflects the gap between raw AI output and the requirements for publication on Wikipedia.
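The edit-rate metric above divides the character-level edit distance between the AI draft and the published text by the draft's character count. A minimal Python sketch with a toy example (the strings below are invented for illustration):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

# Toy edit rate: distance between AI draft and published text,
# divided by the draft's character count (the study's aggregate was ~27%).
draft = "The citty was founded in 1807."
published = "The city was founded in 1807."
rate = levenshtein(draft, published) / len(draft)
print(f"{rate:.1%}")  # 3.3%
```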
Error Origin
The project recorded 8,530 improvements across published articles. A key finding is the distinction between errors caused by AI and errors present in source texts:
- LLM-Introduced Errors (73.75%): 6,291 corrections addressed issues introduced by the AI, including hallucinations, broken code, missing sentences, and mistranslations.
- Source Improvements (26.25%): Approximately 2,239 improvements fixed issues present in the original source article (e.g., dead links, missing templates, factual inaccuracies).
This highlights that translation is an active maintenance process that improves the overall quality of the global knowledge graph. When examining severity tiers 2 and 3 specifically, only 13% of issues were classified as originating from the source, suggesting that AI-introduced issues are more likely to be of higher severity.
Error Severity
Issues were assigned a severity rating:
- Minor (84.34%): Small issue that does not affect meaning, only polish or readability (e.g., punctuation, style, rewording for flow, missing reference).
- Major (10.07%): Affects clarity or accuracy, but meaning is still mostly understandable (e.g., wrong word choice, unidiomatic expression, ambiguous syntax).
- Critical (5.59%): The issue changes or distorts the meaning, introduces a factual error, or misrepresents information (e.g., wrong number/date/name, mistranslation, reference in the wrong place, missing reference or paragraphs). Sentence or paragraph omissions represent a significant share of Severity 3 issues, but can easily be spotted by translators.
The severity distribution suggests that LLM output is generally safe as a starting draft but not safe to publish without review. Most corrections are surface-level and fixable through systematic cleanup. However, the presence of meaning-changing issues (approximately 6–7% of all corrections) requires expert attention and mandatory human review.
Error Taxonomy
The issues were split as follows:
- Linking Issues (28.76%): This was the single most frequent error type. AI models struggle to map the "knowledge graph" of Wikipedia, frequently creating redlinks for topics that exist under different names in the target language. This percentage would be significantly higher if issues weren't deduplicated (e.g., some large articles had over 50 excessive red links, which this study counted as a single issue).
- Prose & Style (17.54%): Edits focused on removing "translationese" and ensuring NPOV.
- Wiki Markup (12.23%): General syntax failures, such as broken infoboxes or tables, or invented parameters. Excludes markup issues on references.
- References & Attribution (10.54%): While this accounts for ~10% of errors, it is technically complex. The AI rarely "hallucinates" the content of a reference, but it frequently breaks the syntax (e.g., dropping parameters, or using the wrong tags), or misses references altogether. This percentage would be significantly higher if issues weren't deduplicated: when the model gets something wrong, it often applies to all references in the page.
- Structure & Formatting - Categories (6.07%)
- Grammar & Syntax (4.55%)
- Factual accuracy (2.24%)
- Other Structure & Formatting issues (4.85%)
- Other issues (2.54%): Most commonly used for "omissions", where the LLM failed to translate part of the text.
Fortunately, the most common issues are also the easiest to spot: problems with wiki syntax and links usually lead to visible errors, and reference issues are relatively easy to identify. Despite the frequent struggle of LLMs with wiki syntax, all translators report that their formatting work was significantly simplified since they started using LLMs.
In 6.15% of issues (or 2.3% of source issues and 7.5% of LLM issues), editors reported that the meaning was altered (factual errors or mistranslations).
Results: Comparative Model Analysis (Study B)
Based on paragraph-by-paragraph analysis of 3 articles, editors ranked models by text quality.
| Model | Average Rank (1=Best) |
|---|---|
| ChatGPT | 2.46 |
| DeepL | 2.66 |
| Claude | 2.69 |
| DeepSeek | 2.78 |
| Gemini | 3.17 |
| Grok | 4.09 |
While quantitative rankings provide one dimension of evaluation, qualitative feedback from the broader set of OKA editors revealed important nuances about model behavior and suitability for different use cases.
Claude: The Balanced Stylist
Claude strikes a superior balance between stylistic prose and structural adherence. It was praised for natural tone and requiring less linguistic revision than other models.
In my language pair... its quality is on par with Grok's, if not better. It makes fewer grammatical errors, though some constructions still sound unnatural.
— OKA editor
- Strengths: Natural prose, good NPOV adherence, balanced performance across languages
- Weaknesses: Occasional unnatural constructions, moderate structural errors
- Current OKA Status: Primary recommendation for OKA editors
ChatGPT: Strong All-Rounder
ChatGPT ranked highest in prose quality while maintaining acceptable structural performance.
- Strengths: Excellent prose naturalness, good readability
- Weaknesses: Limited specific feedback collected; requires further comparative analysis
- Current OKA Status: Primary recommendation for OKA editors, but requires further investigation
DeepSeek: The Promising Contender
DeepSeek performed surprisingly well, particularly in Portuguese translation, showing strong potential as a balanced alternative.
- Strengths: Good structural adherence, strong performance in certain language pairs
- Weaknesses: Struggled with some Wikipedia-specific templates (e.g., adding the non-existent "urlmorto" parameter); requires further analysis
- Current OKA Status: Not yet recommended to OKA editors, but shows potential
Grok: The "Tamable" Workhorse
While Grok scored lowest on raw text flow, qualitative feedback revealed a strong preference for it in specific structural contexts. Editors noted that Grok is the most responsive to structural commands, making it useful for template-heavy articles despite requiring more prose polishing.
Grok returns information with fewer errors and responds better to commands. Furthermore, when Grok makes a mistake and I ask it to correct the error, Grok does not usually make the same mistakes again.
— OKA editor
- Strengths: Highly responsive to iterative prompting, excellent for template-heavy articles, good for experienced editors who can guide the model
- Weaknesses: Requires more prose polishing, lower initial text quality
- Current OKA Status: Secondary recommendation, particularly for experienced OKA editors who require more control over the model
Gemini: High Risk of Omission
Multiple editors reported a critical flaw where Gemini summarizes or omits content, making it dangerous for faithful translation.
I really like the text quality, but it tends to summarize the text too much and even remove references, especially if you put several paragraphs to translate at once.
— OKA editor
I still haven't been able to find Gemini's pattern of errors, and it seems to me that it suppresses more information than Grok.
— OKA editor
- Strengths: Good prose quality when content is preserved
- Weaknesses: Frequent summarization/omission of content (critical failure for encyclopedic translation), unpredictable error patterns
- Current OKA Status: Not recommended for primary use.
DeepL: The Traditional Baseline
DeepL provides high-quality text but lacks Wikicode handling capabilities.
- Strengths: Excellent prose quality, reliable translations (though only marginally better than general-purpose LLMs)
- Weaknesses: Inability to handle Wikicode markup requires significantly more human time for "wikification" compared to LLMs
- Current OKA Status: Recommended for "spot translations" of a particular sentence or paragraph
Key Findings
The findings of this study support maintaining and strengthening OKA's editorial standards, and suggest the following recommendations for the broader Wikimedia community.
Critical Findings
- Human Review is Mandatory: With a critical (Severity 3) error rate of approximately 6%, blind publication is not an option. Independent second-human review remains essential for quality assurance.
- Translation as Maintenance: The 26.25% of improvements addressing source article issues demonstrates that AI-assisted translation can serve a dual function: not only creating content in new languages but also improving the quality of the global knowledge graph.
- Substantial Editing Required: The 27% edit rate (by Levenshtein distance) confirms that AI output requires significant human refinement to meet Wikipedia standards.
- Article Length as Workload Predictor: Article length strongly predicts translator workload (Pearson r = 0.88), though wikicode density, formatting complexity, and domain specificity can amplify effort beyond what word count alone would predict.
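The length-to-workload correlation above is a standard Pearson coefficient. A minimal Python sketch with invented illustrative data (the word counts and hours below are not from the study's dataset):

```python
from math import sqrt

def pearson(xs: list[float], ys: list[float]) -> float:
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative data only: article length (words) vs. total editor hours.
words = [1200, 2500, 4000, 5600, 8000]
hours = [3.0, 6.5, 8.0, 13.0, 17.5]
r = pearson(words, hours)
print(round(r, 2))  # a high positive r, in the spirit of the study's r = 0.88
```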
Workflow Optimization Opportunities
Since nearly 40% of errors relate to links and references, tooling that automatically verifies link targets or validates citation syntax could substantially reduce human workload. Potential tool development areas include:
- Automated link target verification across language editions
- Citation template syntax validators
- Pre-publication quality checks for common structural errors
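A citation-syntax validator of the kind proposed above could be as simple as checking template parameters against a whitelist. The sketch below is a hypothetical Python illustration: the parameter list is a tiny invented subset, not the actual set of `{{cite web}}` parameters, and the "urlmorto" example echoes the invented parameter observed in Study B.

```python
import re

# Illustrative subset of citation-template parameters (NOT a complete or
# authoritative list of {{cite web}} parameters).
KNOWN_PARAMS = {"url", "title", "access-date", "archive-url", "language"}

def check_cite_params(wikitext: str) -> list[str]:
    """Return unknown parameter names found in {{cite ...}} templates."""
    unknown = []
    for template in re.findall(r"\{\{cite [^}]*\}\}", wikitext, flags=re.IGNORECASE):
        for param in re.findall(r"\|\s*([\w-]+)\s*=", template):
            if param.lower() not in KNOWN_PARAMS:
                unknown.append(param)
    return unknown

# "urlmorto" is an invented parameter of the kind DeepSeek produced in Study B.
sample = "{{cite web |url=https://example.org |title=Example |urlmorto=yes}}"
print(check_cite_params(sample))  # ['urlmorto']
```

A production check would need the full per-template parameter schemas and handling for nested templates, but the flag-unknown-parameters pattern is the core of it.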
Model Selection Strategy
Based on empirical findings:
- Claude and ChatGPT are the primary recommendation for general drafting due to superior prose quality and balanced performance (though the latter requires further investigation).
- Grok remains valuable for experienced editors handling complex, template-heavy articles due to its "tamability" and responsiveness to iterative prompting
- DeepSeek shows promise as a balanced contender and warrants further testing, particularly for Portuguese translations
- Gemini should be avoided for encyclopedic translation until content omission issues are resolved
- DeepL provides excellent prose but requires significantly more wikification effort
Limitations
This study acknowledges several limitations that should inform interpretation of results:
- Peer review: This study was not peer-reviewed by individuals outside of OKA.
- Data Diversity: Not all language pairs were represented equally in the dataset. 16 articles did not receive final peer review, resulting in partial data for those entries. Selection of input articles was not random, as they were selected by the translators themselves. Three different models were used for data collection in Study A, introducing noise; however, overall findings remain applicable across all three models.
- Subjectivity: Despite clear guidelines, categorization of errors remains inherently subjective and dependent on individual editor judgment.
- Translator Bias: Participants were experienced OKA members. Their efficiency and error-spotting rates may not represent the average volunteer editor's performance. Even within OKA, there are substantial variations from one editor to another in terms of speed and ability to spot and correct issues.
- Non-Blind Testing: Editors were aware of which models they were using, which may have influenced their assessment and prompting strategies.
- Statistical significance: Several research questions would require a larger scale of data for definitive conclusions (e.g., for robust comparison of various models across all dimensions).
- Subset of languages only: The study covered only a small subset of languages, focused on those with large speaker populations and the Latin alphabet. This was done because previous OKA tests have shown that LLM performance is poorer for languages with fewer speakers or with non-Latin scripts (e.g., Arabic, Chinese).
Future Research Directions
[edit]Future iterations could address these limitations through:
- More balanced representation across language pairs
- Clearer categorization guidelines with extensive examples
- Dedicated category for "Omissions" to improve data consistency, and clearer guidelines for how to deduplicate errors and assess severity (e.g., with examples)
- Larger sample of editors with varying experience levels
- Partial blinding in comparative studies where feasible
- Expanded sample size for model comparison to achieve statistical significance
Appendices
Error Category Definitions
- Meaning & Accuracy: changes that affect correctness or interpretation:
- Factual accuracy: Correcting factual errors, mistranslations that distort meaning.
- Example: “Portugal was invaded in 1940” → corrected to “1807.”
- Terminology / technical precision: Fixing wrong or imprecise domain-specific terms, or substituting culturally appropriate terms.
- Example: “Ministerio del Interior” → “Ministry of Home Affairs.”
- Language Quality: improvements to grammar, clarity, or style:
- Grammar & syntax: Word order, agreement, tense, punctuation.
- Example: “the policies is effective” → “the policies are effective.”
- Prose & style / readability: Flow, conciseness, idiomatic phrasing, unnatural tone.
- Example: “It was in the year of 1990 that…” → “In 1990, …”
- Structure & Formatting (S&F): Use for organizational or Wikipedia-formatting fixes. Sub-categories:
- Linking issues
- Poorly sourced content (e.g., removing content without references or with hard-to-find references)
- Reordering complete sentences or paragraphs for coherence
- Redundant text
- Headings (e.g., adjustment, harmonizing, new sections – such as "See also")
- Category (e.g., correction, addition)
- Wiki markup (e.g., templates, italics, indentations)
- References & Attribution: for citation-related edits. Examples:
- Finding references for unsourced material
- Fixing broken links
- Moving citation to correct sentence
- Removing unreliable or misplaced sources