Research:Bridging the Gap Between Wikipedians and Scientists with Terminology-Aware Translation: A Case Study in Turkish
Introduction
This project addresses the gap between the escalating volume of English-to-Turkish Wikipedia translations and the insufficient number of contributors, particularly in technical domains. Our focus is twofold:
- bridging academic/expert and Wikipedia communities,
- creating datasets and developing NLP models for terminology identification, terminology linking, and terminology-aware translation.
The project has five main contributions:
- (Dataset) Terminology-Rich Machine Translation Corpus. We combine cleaned English–Turkish paragraph pairs from Wikipedia’s Content Translation dumps with domain-filtered abstracts from the Turkish National Thesis Center, producing 3,300 sentence pairs evenly split across Mathematics, Physics, and Computer Science. Rigorous filtering (from 468,254 raw paragraphs to 105,506 after basic cleaning, and finally to 439 manually inspected candidates) ensures high lexical density and domain fidelity. The resulting corpus contains 10,157 expert-validated term links.
- (Dataset) Terminology Annotated Machine Translation Dataset. We design a 30-page guideline, an online quiz, and real-time quality gates in Label Studio to identify, link, and correct terminological translations in the corpus. 43 trained annotators achieve substantial agreement (Fleiss κ ≈ 0.71 for English and 0.67 for Turkish term detection) while earning above-market wages, demonstrating that fair, large-scale annotation is feasible even for specialized tasks. The final corpus along with the annotations is publicly available here.
- (Models) Baseline Evaluation and Insights. Four prompting strategies over six large language models reveal that (i) giving models access to the terimler.org database and (ii) supplying parallel EN-TR context both substantially improve performance, yet even the strongest baseline (o3-mini) underperforms humans by a wide margin, suggesting substantial room for improvement.
- (Community) Wikipedia Technical Translation Guideline. We develop a guideline for Turkish Wikipedians on how to produce higher-quality translations of technical documents that contain a large number of technical terms. We perform a user study in which we survey Wikipedians using the System Usability Scale (SUS) and find high SUS scores. The guideline can be accessed here.
- (Community) Expert-Wikipedian Email Group. After thorough discussions between the academics and Turkish Wikipedians, we establish an email group to bridge these two communities. The guideline contains detailed instructions on how to reach out to experts in case of difficulties while translating technical terms. There are currently 35 members in the mail group and 19 active, public discussions. The group can be accessed here (note that one must join the group before accessing it through this link; otherwise a "Content unavailable" error is shown).
All resources are released and will be maintained under this repo.
Related Work
The Terminology Problem in Machine Translation
Neural machine-translation (NMT) systems achieve impressive average BLEU scores, yet they still fail conspicuously on specialized terms, brand names and formulaic expressions. Early efforts attempted to constrain the decoder so that specific target words must appear in the output—for example with multi-stack search over finite-state acceptors [1]. Industrial deployments such as SAP’s localization pipeline later validated that explicit constraints can halve post-editing effort in real-world workflows, but at the price of more brittle beam search and lower fluency when many terms are injected at once [2].
From Constrained Decoding to Prompting and Reinforcement Learning
Since 2023, research has broadened beyond hard constraints toward translate-then-refine paradigms. In the WMT-23 Terminology Shared Task, top systems first produced an unconstrained draft and then asked a large language model (LLM) to repair term errors, yielding the best terminology F-scores without sacrificing adequacy [3]. Follow-up analyses showed that dictionary access or peeking at the reference still boosts scores, but gains plateau when evaluation metrics do not explicitly reward term faithfulness, prompting the community to call for more term-centric metrics [4].
More recently, Li et al. proposed TAT-R1, which couples word-alignment-based rewards with reinforcement learning; the model significantly improves COMET on terminology stress-tests without degrading general quality [5]. Parallel work explores retrieval-augmented generation (RAG) pipelines that pass glossary snippets to the LLM at inference time [6], as well as Chain-of-Dictionary prompting that feeds multi-lingual dictionary chains to stimulate the model’s latent terminology knowledge [7]. Collectively, these studies suggest that future systems will blend soft guidance (prompts, retrieval) with post-editing rather than rely on rigid constraints alone.
Shared Tasks and Benchmarks
Shared tasks have been instrumental in crystallising the research frontier. Besides the WMT-23 track, WMT plans a broadened Terminology Task for 2025 covering more domains and language pairs, explicitly evaluating how systems exploit external dictionaries [8]. Outside WMT, researchers have released robustness suites that stress different constraint lengths and densities [9], and domain-specific evaluations in medical MT have shown that even state-of-the-art transformers still mistranslate up to 18% of critical terms when training data are scarce [10].
Corpora for English-Turkish Technical Translation
English–Turkish remains low-resource compared with Indo-European pairs. Large web-crawled sets such as MaCoCu-tr-en 2.0 supply over 1.6 M aligned sentence pairs, but they are heterogeneous and rarely annotated for technical terms [11]. Earlier domain corpora focus on news or general language; none link term pairs or grade agreement among annotators. To our knowledge, there is no publicly available EN-TR corpus that (i) deliberately maximizes term density across STEM fields and (ii) provides gold / silver / bronze confidence tiers. The dataset introduced in our study therefore fills a demonstrable gap.
Annotation Methodologies
High-quality terminology benchmarks demand rigorous annotation. Prior projects often relied on a handful of experts and achieved only moderate inter-annotator agreement (κ≈0.55) when tagging terms in Chinese–English MT [12]. Our pipeline builds on these lessons by incorporating a 30-page guideline, a qualification quiz and live quality gates, echoing best practices from large multilingual terminology datasets such as GIST [13]. The result is substantial agreement for both English (κ=0.71) and Turkish (κ=0.67) term detection, surpassing many earlier efforts while paying annotators above-market wages.
Large Language Models and Low-Resource Terminology
Context-aware prompting studies report that commercial LLMs already match or exceed fine-tuned MT baselines on average BLEU, yet continue to lag behind human performance on term-specific metrics, especially for low-resource directions [14] [15]. Dictionary-centric fine-tuning and synthetic data generation have proven effective for other language pairs [16], but have not been explored for Turkish. By releasing 10,157 expert-validated term links and a reproducible evaluation suite, our work enables systematic investigation of glossary-guided prompting and other methods for terminology translation in a genuinely low-resource setting.
Existing solutions either (a) evaluate terminology translation on high-resource language pairs, (b) rely on heterogeneous, low-quality data, or (c) lack transparent annotation protocols. Our contribution—the first terminology-rich EN-TR corpus with graded term confidence and strong agreement—creates a much-needed test-bed. Moreover, the baseline experiments reported here complement the latest trends in LLM research, demonstrating that even the strongest baseline, o3-mini, still trails the human upper bound, thus motivating future research on glossary-aware adaptation strategies for Turkish scientific discourse.
Terminology-Rich Machine Translation Corpus
The Wikipedia Content Translation tool simplifies the translation process by automating repetitive tasks such as copying text, creating links, and categorizing articles. It leverages resources like bilingual dictionaries and machine translation, enabling translators to focus on crafting fluent, high-quality text. Translated paragraph pairs are published weekly in Wikimedia dumps.
Data Cleaning: The dataset undergoes multiple cleaning steps for high-quality sentence extraction. Empty entries, duplicates, and pairs with identical source and target content are removed first. Paragraphs with insufficient or inconsistent lengths are then excluded, followed by the elimination of content containing symbols like ↑, &, or displaystyle, which often signify references or embedded equations in Wikipedia pages. These steps reduce the dataset from 468,254 to 105,506 paragraphs.
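As an illustration of this cleaning stage, the sketch below filters raw paragraph pairs; the minimum word count, the length-ratio bound, and all function names are assumptions made for the example, not the project's exact implementation.

```python
import re

# Symbols that often signal references or embedded equations in Wikipedia dumps.
ARTIFACT_PATTERN = re.compile(r"[↑&]|displaystyle")

def keep_pair(source: str, target: str,
              min_words: int = 5, max_len_ratio: float = 2.0) -> bool:
    """Return True if an English-Turkish paragraph pair survives basic cleaning."""
    # Drop empty content and pairs whose source and target are identical.
    if not source.strip() or not target.strip() or source.strip() == target.strip():
        return False
    src_len, tgt_len = len(source.split()), len(target.split())
    # Drop paragraphs with insufficient or inconsistent lengths (thresholds assumed).
    if src_len < min_words or tgt_len < min_words:
        return False
    if src_len / tgt_len > max_len_ratio or tgt_len / src_len > max_len_ratio:
        return False
    # Drop content containing reference or equation artifacts.
    return not (ARTIFACT_PATTERN.search(source) or ARTIFACT_PATTERN.search(target))

def clean(pairs):
    """Deduplicate raw (source, target) pairs and apply the filters above."""
    seen, kept = set(), []
    for src, tgt in pairs:
        key = (src.strip(), tgt.strip())
        if key not in seen and keep_pair(src, tgt):
            seen.add(key)
            kept.append((src, tgt))
    return kept
```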
Domain Selection: For high-quality annotations, the dataset is restricted to paragraphs from three STEM domains: Mathematics, Physics, and Computer Science. Fields such as Chemistry and Biology are excluded due to limited domain expertise, ensuring accuracy in the selected domains. Classification into these domains relies on a vocabulary of 2,520 terms sourced from relevant Wikipedia pages. Physics terms are drawn from the Glossary of physics, Mathematics terms from the Glossary of areas of mathematics and Glossary of calculus, and Computer Science terms from the Glossary of computer science, Glossary of artificial intelligence, Machine learning, Deep learning, and Natural language processing pages. To construct a terminology-rich corpus, we first retain only paragraphs where the number of unique terms exceeds three, reducing the dataset to 1,562 paragraphs.
Next, we align English-Turkish sentences within the paragraphs using Stanza and filter out paragraphs where the number of source sentences does not match the number of target sentences, resulting in 507 paragraphs. We then use the GPT-4o model to determine the domains of the paragraphs, excluding those related to other scientific fields such as Chemistry and Biology. This step further reduces the dataset to 439 paragraphs. Finally, we manually review all 439 paragraphs and eliminate instances where the paragraphs are poorly translated. After this final step, we obtain 303 paragraphs containing a total of 1,185 sentences.
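The following sketch shows how the term-density filter and the Stanza-based sentence alignment could be combined; the glossary is passed in as a plain set of lowercased terms, and the substring matching strategy is a simplifying assumption.

```python
import stanza

# One-time model downloads: stanza.download("en"); stanza.download("tr")
nlp_en = stanza.Pipeline(lang="en", processors="tokenize")
nlp_tr = stanza.Pipeline(lang="tr", processors="tokenize")

def unique_terms(text: str, glossary: set) -> set:
    """Glossary terms (drawn from the Wikipedia glossaries) occurring in the text."""
    lowered = text.lower()
    return {term for term in glossary if term in lowered}

def align_paragraph(en_par: str, tr_par: str, glossary: set):
    """Keep a paragraph only if it is term-rich and its sentence counts match."""
    if len(unique_terms(en_par, glossary)) <= 3:   # require more than three unique terms
        return None
    en_sents = [s.text for s in nlp_en(en_par).sentences]
    tr_sents = [s.text for s in nlp_tr(tr_par).sentences]
    if len(en_sents) != len(tr_sents):
        return None  # discard pairs whose sentence counts disagree
    return list(zip(en_sents, tr_sents))
```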
In addition to Wikipedia content, we also make use of thesis abstracts, which are readily available. The Turkish National Thesis Center [17], managed by the Turkish Council of Higher Education (YÖK), is the official repository for graduate theses from Turkish universities. This platform contains over 700,000 theses, providing a rich source of academic texts. To ensure dataset credibility, we use theses from the top six Turkish universities, ranked by the Times Higher Education and QS World University Rankings: Koç University, Middle East Technical University, Istanbul Technical University, Bilkent University, Boğaziçi University, and Sabancı University. These institutions represent the highest academic standards in Turkey, adding to the dataset’s reliability. We select abstracts exclusively from theses in the Mathematics, Physics, and Computer Science departments. From the six universities mentioned, we compile 287 abstracts comprising 2,115 sentences. Since the theses are submitted to the Turkish National Thesis Center in PDF format, OCR-related typos occasionally occur in the abstracts. To address this, we use the GPT-4o model to correct these typos. Finally, we combine sentences from both datasets, resulting in a total of 3,300 sentences, evenly distributed across the three domains: Mathematics, Physics, and Computer Science, with each domain contributing 1,100 sentences.
Terminology Annotated Machine Translation Dataset
Next, we annotate the corpus with term boundaries, links to a terminology database, and corrected translations, i.e., translations that are faithful to the expert-curated terminology database and carry the correct morphological inflections.
To create a labelled and linked parallel corpus, we first identify a suitable annotation tool. After extensive research, we select Label Studio's Academic Program due to its free access, online functionality, and user-friendly interface. Annotators work directly online without requiring downloads, ensuring an efficient workflow. A sample user interface is given below:

Annotation Task
Since the annotators are tasked with multiple sub-tasks that depend on each other, we first design an annotation pipeline. The pipeline begins by identifying English terms in the source sentence, which are labeled as terms. Corresponding Turkish terms are then identified in the target sentence and labeled as terms as well. For each English-Turkish term pair, a relation is established to link the terms. These relations are validated using the terimler.org terminology database, where correctly translated terms are marked as CORRECT_TRANSLATION, and incorrect ones are flagged with updated metadata. If an English term does not have a match in the database, its Turkish counterpart is manually evaluated and labeled as either correct or incorrect. These steps are summarized in the Annotation Pipeline figure shown on the right.
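A minimal sketch of the per-term decision logic in this pipeline is shown below. The tiny in-memory glossary stands in for a terimler.org lookup, and any label name other than CORRECT_TRANSLATION is an assumption for illustration.

```python
from dataclasses import dataclass

# Tiny illustrative stand-in for a terimler.org lookup.
TERIMLER_DB = {
    "random": {"rastgele", "rasgele"},
    "eigenvalue": {"özdeğer"},
}

@dataclass
class TermLink:
    en_term: str
    tr_term: str
    label: str         # "CORRECT_TRANSLATION" or a flag for an incorrect one
    in_database: bool   # whether the English term was found in the database

def link_terms(en_term: str, tr_term: str, manual_ok: bool = False) -> TermLink:
    """Validate one English-Turkish term pair, mirroring the annotation pipeline."""
    accepted = TERIMLER_DB.get(en_term.lower(), set())
    if accepted:
        correct = tr_term.lower() in accepted
        return TermLink(en_term, tr_term,
                        "CORRECT_TRANSLATION" if correct else "INCORRECT_TRANSLATION",
                        in_database=True)
    # No database match: fall back to the annotator's manual judgement.
    return TermLink(en_term, tr_term,
                    "CORRECT_TRANSLATION" if manual_ok else "INCORRECT_TRANSLATION",
                    in_database=False)

print(link_terms("random", "rasgele"))  # matches the database, hence CORRECT_TRANSLATION
```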

We prepare a comprehensive annotation guideline consisting of 30 pages and 50 screenshots. This detailed approach ensures clarity and minimizes any potential confusion for the annotators. The guideline covers the following key aspects: the goals of annotation, the usage of terimler.org (a terminology database), details about the Label Studio interface, and step-by-step instructions for the annotation pipeline. It also includes a section on handling special cases during annotation. The guideline is provided here.
Annotators
[edit]We use various channels to identify 229 potential annotators, including university faculty mailing lists, WhatsApp, Telegram, and Discord groups associated with universities, LinkedIn, Twitter, and science platforms such as fizikhaber.com and Türk Fizik Takvimi.
To ensure the quality of annotations, we place special emphasis on training annotators and eliminating low-performing ones. To do so, we conduct an Annotation Webinar that introduces the project and covers key topics, including the project overview, the role of annotators, the scoring system, the annotation guidelines, and the project timeline, followed by a Q&A session to address participants' questions.
After the webinar, we identify the list of interested participants and administer a quiz to eliminate those who are likely to score low. Initially, 160 participants confirmed their interest in taking the quiz to proceed further. The quiz consists of 30 sentence pairs, organized into 10 instances of three sentences each. On average, it takes 4 hours (±1 hour) to complete, including the time required to read the Annotation Guideline for the first time. After becoming familiar with the guidelines, participants typically take about 4 minutes per sentence. Sentences are grouped into three-sentence instances rather than presented in isolation in order to preserve context: each paragraph, whether sourced from Wikipedia or abstracts, is divided into three-sentence chunks, with sentences kept in sequence to maintain continuity. If a paragraph contains a number of sentences not divisible by three, the remaining sentence(s) are excluded. Out of the 160 participants who received the quiz, 84 completed all 30 sentences. The mean English Term Detection F1 score is 0.69, and participants with an F1 score below 0.7 are eliminated, excluding 35 participants.

Among the 43 annotators, 29 are undergraduates and 14 are graduates, ensuring a balanced mix of educational backgrounds. The majority (38) are in the 18–25 age group, with 4 aged 26–40 and one aged 41–65. In terms of academic specialization, 18 annotators are from Computer Science, 8 from Mathematics, and 6 from Physics, forming the core STEM expertise. Additional representation includes Electrical and Electronics Engineering (5), Mechanical Engineering (2), and 4 annotators from other STEM-related fields, further diversifying the annotator pool.
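For reference, quiz performance can be scored with a span-level F1 of the kind sketched below; the exact-span matching criterion is an assumption, since the project does not specify how partial matches are treated.

```python
def term_f1(predicted: set, gold: set) -> float:
    """Span-level F1 for term detection; spans are (start, end) character offsets."""
    if not predicted and not gold:
        return 1.0
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Annotators with an English term-detection F1 below this threshold are excluded.
PASS_THRESHOLD = 0.7
print(term_f1({(0, 6), (10, 18)}, {(0, 6), (20, 28)}))  # 0.5
```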
Finally, for continuous training of the annotators, an automatically generated Quiz Evaluation Report is sent to participants who successfully complete the quiz to help them avoid repeating mistakes during labeling, as shown below.

Annotation Results
At least two out of three annotators agreed on 8,470 terms across the 3,300 parallel sentences. For the remaining 2,411 terms with low agreement, we perform a manual review and remove 724 wrongly identified terms. In total, the final dataset contains 10,157 annotated terms.
We calculate Fleiss' kappa to assess inter-annotator agreement for English and Turkish term detection. For English term detection, the distribution of Fleiss' kappa scores across paragraphs shows a mean agreement of 0.715, indicating substantial agreement among annotators. Similarly, for Turkish term detection, the mean Fleiss' kappa score is 0.674, reflecting moderate to substantial agreement. Next, we calculate the average task scores for each step of the annotation. English term detection achieves an F1 score of 0.84, while Turkish term detection scores 0.82. Turkish translation labeling achieves an exact-match score of 0.85, translation correction scores 0.68, and term linking maintains strong performance with a score of 0.86. These scores highlight the improved performance of annotators after the quiz, demonstrating the effectiveness of the Quiz Evaluation Report and the quality control mechanisms applied during the annotation process.
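Fleiss' kappa can be computed from scratch as below; framing term detection as a per-token binary decision by three annotators is an assumption made to keep the example small.

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss' kappa for an items x categories matrix of rating counts
    (each row sums to the number of raters, here three)."""
    n_items, _ = counts.shape
    n_raters = counts[0].sum()
    p_j = counts.sum(axis=0) / (n_items * n_raters)              # category proportions
    P_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    return (P_bar - P_e) / (1 - P_e)

# Toy example: three annotators deciding, per token, term vs. non-term.
toy = np.array([[3, 0], [2, 1], [0, 3], [1, 2]])
print(round(fleiss_kappa(toy), 3))  # 0.333
```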
Improving the Terminology Database
Since no database is ever complete or perfect, we ask users to comment on missing or incorrect entries on terimler.org for various cases, including incorrect meanings, potential errors, spelling mistakes, and missing synonyms. Participants provided 2,100 comments in total. However, since all sentences are annotated three times, there is significant overlap in similar comments for the same sentence. Additionally, there is overlap across sentences for commonly repeated terms. For example, in terimler.org, the term "random" is translated as "rasgele," whereas, according to the Turkish Language Association (TDK), it should be "rastgele." Since the word "random" appears many times across the 3,300 sentences, there is substantial repetition in comments for such terms. After cleaning the 2,100 comments to remove duplicates, we reduce them to 294 unique entries for terimler.org, including 214 entries suggesting synonyms (SYNONYM), 37 identifying typos (TYPO), 22 highlighting potential errors in meaning (UNCERTAIN), and 21 pointing out definite errors requiring correction (WRONG).
Annotation Cost
The payment per sentence is 20 TL, and the average time to annotate one sentence is approximately 4 minutes. Each of the 3,300 unique sentences is annotated by three annotators, so a total of 9,900 sentence annotations are produced, resulting in a total annotation cost of 198,000 TL. Annotators earn an average hourly rate of 300 TL, significantly higher than Turkey’s minimum net hourly wage of 75.56 TL. The number of sentences annotated per annotator varies, with a mean of 230.23 sentences. Additionally, bonus payments totaling 15,000 TL are awarded to 18 annotators for providing high-quality comments that identify errors in terimler.org.
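The figures above can be verified with a quick back-of-the-envelope computation:

```python
per_sentence_tl = 20          # payment per sentence
minutes_per_sentence = 4      # average annotation time
annotations = 3_300 * 3       # each sentence is annotated by three annotators

total_cost_tl = annotations * per_sentence_tl                  # 198,000 TL
hourly_rate_tl = per_sentence_tl * 60 / minutes_per_sentence   # 300 TL per hour
print(total_cost_tl, hourly_rate_tl)
```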
Methods
Given the strong recent performance of large language models on terminology-aware translation, we evaluate six state-of-the-art LLMs: o3-mini, DeepSeek-R1, GPT-4o, GPT-4o-mini, Gemini 1.5 Pro, and Gemini 1.5 Flash.
We consider both post-editing and translation-from-scratch scenarios. In the post-editing scenarios, we assume that a high-quality translation is already provided by a modern translation tool, but the term translations are not consistent with the database. In the translation-from-scratch scenarios, we ask the models to translate while staying faithful to the database we provide. In more detail:
- Scenario 1 - With terimler.org access: In this method, we aim to design a pipeline that closely mimics how human annotators work. We provide English–Turkish parallel sentences along with access to the terimler.org database. Similar to the annotation pipeline, we design an agentic pipeline in which the LLM performs three tasks in sequence: term detection, term linking, and translation correction (a minimal sketch is given after this list).
- Scenario 2 - Without terimler.org access: In this method, we provide English–Turkish parallel sentences but do not allow access to terimler.org. Therefore, the Term Linking phase is omitted. The Term Detection phase remains identical to that in Scenario 1, while the Translation Correction phase is slightly modified.
- Scenario 3 - English-only sentences with terimler.org access: To measure the end-to-end performance of the systems, this scenario follows the same procedure as Scenario 1, but only English sentences are provided. The models must therefore first generate the Turkish translations before proceeding through the remaining stages.
- Scenario 4 - English-only sentences without terimler.org access: Similar to Scenario 2, but only English sentences are provided. Consequently, the models must first generate the Turkish translations before performing term detection and translation correction.
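Below is a minimal sketch of the Scenario 1 pipeline. The call_llm wrapper, the glossary_lookup helper, and the prompts are illustrative placeholders, not the exact prompts used in the experiments.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around whichever chat model is being evaluated
    (o3-mini, GPT-4o, Gemini, ...); replace with the provider's client call."""
    raise NotImplementedError

def scenario_1(en_sentence: str, tr_sentence: str, glossary_lookup) -> str:
    """Post-editing with terimler.org access, run as three chained steps."""
    # 1) Term detection on the English source sentence.
    terms = call_llm(
        "List the technical terms in this English sentence, one per line:\n" + en_sentence
    ).splitlines()
    # 2) Term linking: look up accepted Turkish equivalents for each detected term.
    links = {term: glossary_lookup(term) for term in terms}
    # 3) Translation correction: revise the Turkish sentence to follow the glossary.
    return call_llm(
        "Revise the Turkish translation so that every term follows the glossary.\n"
        f"English: {en_sentence}\nTurkish: {tr_sentence}\nGlossary: {links}"
    )
```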
The results for each scenario are given below. The best-performing model across all metrics and scenarios is o3-mini. Notably, the human upper-bound scores remain significantly higher than those of all models. Experiments also demonstrate that access to terimler.org significantly boosts model performance. Additionally, providing the EN-TR pair as a whole is more effective than generating the Turkish translation from scratch, suggesting that using models for post-editing is the more viable long-term solution.




Bridging the Two Communities
Our main goal is to develop strategies for integrating domain experts with Wikipedians; additionally, we aim to recruit domain experts as contributors and to train existing and new Wikipedians to translate technical content more accurately, thereby fostering sustained contributions and improving the overall quality of technical Turkish Wikipedia articles. The main events and outcomes are summarized below:
- [Done] Reach out to domain experts (physics, mathematics, and informatics) and to Wikipedians who create, translate, or are interested in creating and translating STEM-related Wikipedia articles, and gain insight into their editing/translating habits, profiles, and willingness to contribute to a joint community.
- [Done] Seminar I: An online seminar introducing the project and interview results, with a panel to establish effective feedback channels between Wikipedians and researchers.
- [Done] Panel I: An online discussion panel with experts and Wikipedians on how to collaborate in order to improve the scientific and technical content of Turkish Wikipedia and to evaluate findings from earlier project phases.
- [Done] Demo I: A 1-hour-long online collaborative translation event with Wikipedians and experts to evaluate the effectiveness of collaborative work.
- [Done] Demo II: An 8-day offline translation event in which Wikipedians used the guideline and communicated with experts through the joint mail group, evaluating the usability of the guideline and the effectiveness of the mail group.
- [Done] Seminar II: A follow-up online seminar discussing project progress and user study outcomes, focusing on designing effective training for translating technical terms.
- [Done] Design and evaluate training materials for editors to more accurately translate the technical terms.
Community Channels
The community channels for expert outreach consisted of societies, associations, foundations, e-mail groups of experts, announcement groups for experts, academic e-mails, Discord groups, and popular-science social organizations. These channels were identified through internet searches and the advice of the experts we contacted. The community channels for Wikipedian outreach consisted of the Turkish Wikipedia Village Pump, the Wikimedia Community User Group Turkey (WMTR) Telegram group and Instagram account, and the Turkish Wikipedia translation group. These channels were likewise identified through internet searches and the advice of the experts we contacted, especially Dr. Bülent Sankur, the lead of terimler.org, and Başak Tosun and Zafer Batık, leaders and founding members of Wikimedia Community User Group Turkey. Each community channel, along with the number and type (profession, involvement with Wikipedia, etc.) of the people it reaches and the events for which it was used, was recorded for community tracking and future reference.
Community Engagement
To keep the community engaged, we create a LinkedIn page and a YouTube channel. The LinkedIn page is used to announce seminars and updates, and we follow a regular posting schedule to increase public and community engagement. The seminars and meetings are recorded with the consent of the participants and uploaded to the YouTube channel. To inform the Wikipedia community, a member of the project's community staff regularly participated in the biweekly VikiSalı (WikiTuesday) meetings to update and engage with the WMTR community.
Initial reach out (July-August 2024)
We conducted a survey targeting two groups: Wikipedians and subject-matter experts. The aim was to understand both groups' experiences, translation habits, language proficiency, and willingness to collaborate on translations of technical Wikipedia articles. Two 10-minute surveys, one for Wikipedians and one for experts, hosted on Google Forms, were shared between July 24, 2024, and August 5, 2024. We accessed a list of English-to-Turkish translating Wikipedians from Turkish Wikipedia (77 Wikipedians) and sent direct messages through the Wikipedia user chat. Additionally, the survey link was shared on the Turkish Wikipedia Village Pump ("Köy Çeşmesi") (905 Wikipedians) and the Turkish Wikipedia Telegram channel (319 Wikipedians), and it was also published on the Wikimedia Turkey (@wikimediaturkey) Instagram page, which has 1,603 followers. Through these efforts, we reached an estimated 1,300+ Wikipedians out of 2,581 active Turkish Wikipedia users. To reach experts, we contacted email groups consisting of experts from various fields, totaling approximately 8,100 people. We posted announcements on Türk Fizik Takvimi (Turkish Physics Calendar), whose e-mail group has 5,800 subscribers, and on Turkmath, both platforms through which relevant announcements are published to reach academics and researchers in Physics and Mathematics. We contacted Türk Fizik Derneği (Turkish Physics Association) and Akademik Bilişim Vakfı (Academic Informatics Foundation) and asked them to share our announcement with their groups and networks. We published the announcement on Fizikhaber.com, a popular-science news platform with over 12,000 subscribers, which likely includes a significant number of experts. We also asked our own networks to share the announcement in their research and study groups, which we estimate reached over 200 more experts. In total, we reached an estimated 9,000+ experts (excluding Fizikhaber.com). Finally, 29 Wikipedians and 182 experts filled out their respective forms. For more detailed results, please access the results document here.
Some of our key findings from the initial reach-out phase are as follows. Most Wikipedians had high levels of English proficiency and had already contributed significantly to Turkish Wikipedia. They predominantly used automated translation tools, but many reported that these tools performed poorly on technical content, often forcing them to rewrite much of the text manually. 26 out of 29 Wikipedians expressed willingness to participate in a platform connecting them with experts to consult and improve the quality of technical articles they write or translate.
The majority of experts (139 out of 182) have expertise in engineering, physics, and computer science, the disciplines within the scope of this project, and most of these 139 experts had over 10 years of experience in their fields. While many experts occasionally consulted Wikipedia, they acknowledged that Turkish Wikipedia lacks high-quality technical content. In addition, 32 experts had worked to improve Turkish Wikipedia resources and 51 experts had worked on improving general Turkish technical resources (on Turkish Wikipedia or elsewhere). In the end, 123 experts were deemed likely to contribute to the project, showing strong support for collaborating with Wikipedians to improve technical translations.
The results suggested a significant level of interest from experts and less interest from Wikipedians in forming a collaborative community that could substantially enhance the quality of technical translations on Turkish Wikipedia.
Seminar I (October 2024)
We held Seminar I on October 19, 2024, to introduce the project and present annotation and survey results, as well as to kick off engagement between the Wikipedians and the experts. The seminar was announced through the e-mail list of the 123 experts and 26 Wikipedians who showed further interest in the project (they were also asked to share the event with anyone they thought might be interested), the e-mails of experts and students who took part in annotation, the Village Pump, the LinkedIn page, and NLP and translation student and research groups. In addition, 15 engineering and science faculty offices across Turkey were contacted for the announcement. In total, 34 people registered and attended the seminar. The channels through which participants learned about the seminar are as follows:

At the end of the seminar, a discussion session between experts and Wikipedians resulted in a consensus on the following: creating a shared terminology pool, building a stronger bridge between Wikipedians and domain experts, and enhancing communication channels to foster greater interaction between the two communities.
Panel I (December 2024)
On December 2, 2024, a panel was held to discuss how experts and academics can collaborate with volunteers to improve the scientific and technical content of Turkish Wikipedia and to evaluate findings from earlier project phases. The panel was announced through the initial e-mail list of 123 experts and 26 Wikipedians, as well as to the people who showed further interest in the project after Seminar I. Participation in this event was kept more limited in order to shape the communication channel. The event was moderated by Prof. Dr. Bülent Sankur (Boğaziçi University), and the panelists were Hasan Ongan (owner of The OPS Journal), Assistant Professor Süha Tuna, an anonymous Wikipedian, and Başak Tosun (founding member of WMTR). The session was attended by 21 participants. Key topics covered were:
- Emphasis on the importance of scientific Turkish,
- Information on Turkish Wikipedia and volunteers
- Information on Wikipedia translation processes, tools and Turkish Wikipedia translation group,
- Discussion of the need for a shared communication channel, and agreement on the value of expert contributions to Wikipedia. Some experts also showed interest in contributing to Wikipedia and asked Wikipedians for guidance in the future. It was decided to establish a shared email group for asynchronous communication going forward.
As a result of the discussions, it was decided to establish a shared mailing list to facilitate ongoing communication between Wikipedians and experts. An email group was chosen as the preferred communication channel, as it respects the anonymity of Wikipedians and aligns with the communication habits of experts, many of whom regularly check their email throughout the day.
Demo I (February 2025)
On February 27, 2025, an online collaborative translation event was organized to bring together experts and the Wikipedia community with the goal of producing more accurate and efficient translations. The event aimed to enable real-time collaboration on technical translations and to observe its impact. We announced the event at a VikiSalı meeting and invited a total of 25 individuals consisting of attendees of Seminar I and Panel I and additional individuals who expressed interest in participating through personal e-mails and Telegram. 12 experts and 4 Wikipedians confirmed participation. The areas of expertise of the 12 experts were identified through their publications, and they were divided into 4 groups with the topics Thermodynamics, Machine Learning and Statistics, Electromagnetism, and Water Resources. The 4 Wikipedians were each allocated to one of these groups. Technical English Wikipedia articles without Turkish counterparts were then assigned to these groups for translation. 11 people joined the event (8 experts and 3 Wikipedians), along with a backup Wikipedian, so that the 4 groups could still each include a Wikipedian. Following an introduction and orientation, a 40-minute joint translation session was held, during which Wikipedians consulted experts in real time on technical terms and expressions. This marked the first direct translation collaboration between the two groups. After the event, a dedicated evaluation survey was sent to participants to assess the impact of collaborative translation on quality and to better understand how mutual assistance influenced both the translation process and the content. 5 experts and 4 Wikipedians filled out the survey.
All survey participants reported that working together improved translation quality, with most describing the effect as significant. Notably, all respondents expressed the highest level of willingness to engage in similar collaborative efforts in the future. Wikipedian answers highlighted that expert input was especially valuable for translating uncertain or unfamiliar terminology, while expert answers indicated appreciation of the editorial and contextual support provided by Wikipedians. This mutual assistance not only enhanced translation quality but also eased the overall translation process for both groups. Additionally, several participants suggested establishing a dedicated communication channel to facilitate future collaboration, signaling a strong interest in sustained, structured interaction from both parties.
Translation Guideline for Wikipedians and the Joint E-Mail Group
Drawing from the conclusions of previous events and survey responses, a Google Group was established to facilitate sustained communication and collaboration between experts and Wikipedians. The group included the most active participants from Panel I and Demo I. In parallel with the formation of the group, a comprehensive guideline was developed to support both communities in producing accurate and high-quality translations of scientific and technical content and sustain effective communication.
The guideline is designed for Wikipedia contributors and domain experts who wish to create technical content on Turkish Wikipedia or provide support in translating technical and academic terms and texts. Its main objectives are to ensure accuracy and consistency in translation, standardize terminology usage, and facilitate support from both parties in cases of ambiguity. It aims to enhance translation quality by offering practical solutions to the common challenges faced by contributors, providing them with necessary resources/databases, while also guiding experts who wish to contribute to Turkish Wikipedia or support other editors' translation processes.
Key components of the guideline
- General Principles: In addition to standard Wikipedia editing rules, the guideline emphasizes the importance of using verified terminology, consulting academic sources, and seeking expert input when necessary.
- Translation Protocols: Detailed steps are provided for verifying terminology using trusted sources such as terimler.org, field-specific glossaries (e.g., for physics, mathematics, informatics), and academic databases (e.g., DergiPark, Ulusal Tez Merkezi). When standard sources fall short, translators are encouraged to use machine translation tools carefully and comparatively, and to consult experts through the mailing list.
- Expert Support Request Templates: Contributors are instructed on how to request expert help by email for three types of issues: (1) unclear terminology, (2) ambiguous sentences or paragraphs, and (3) entire texts that require holistic clarification. Each case is supported with structured email templates to ensure clear and effective communication.
- Response and Communication Protocols: Experts are expected to respond to emails within 48 hours. Public replies are encouraged to support transparency and collective learning, while one-on-one replies are acceptable for privacy-sensitive cases.
- Guidance for Experts: Experts are encouraged to join the mail group and offer feedback to Wikipedia contributors on terminology, policy questions, or formatting issues. They are also provided with guidelines on how to collaborate with Wikipedians, as well as resources on best practices and policies of Wikipedia if they want to contribute as Wikipedia editors.
- Frequently Asked Questions: The guideline addresses common challenges, such as handling unfamiliar or ambiguous terminology, navigating stylistic conflicts between plain Turkish and technical precision, and managing communication breakdowns between contributors and experts.
Demo II (May 2025)
To evaluate the effectiveness and usability of the developed translation guideline and expert support mechanism, we organized a second translation event, Demo II, between May 20–28, 2025. Unlike Demo I, this session was conducted asynchronously, with support provided via email over an extended period, allowing participants more flexibility and time to get used to communicating via the mail group. To reach a wider and more diverse audience, the event was promoted through multiple channels, including a Telegram group, the VikiSalı community meeting, the Village Pump discussion forum, and private messages to members of the English-to-Turkish translation group on Wikipedia. Through these efforts, the event announcement reached approximately 1,000 individuals, and 10 Wikipedians (7 experienced, 3 new) were recruited for the translation event. As a prerequisite, interested participants were asked to complete a simple six-question quiz based on the guideline to test whether they had read it. A minimum score of 5 out of 6 was required to participate. All but two respondents scored full marks, with the remaining two scoring 5.
Each of the 10 Wikipedia volunteers was assigned a highly technical English Wikipedia article containing terms that are ambiguous or under-defined in Turkish. They were asked to translate the articles using the guideline and the support tools provided. After the end of the translation period, each participant was given a survey measuring the usability and effectiveness of the guideline using the System Usability Scale (SUS) and was invited to provide comments and feedback on the guideline. Additionally, domain experts were invited to review the final translations for accuracy, clarity, and terminology use, allowing us to assess the end-to-end effectiveness of the support mechanism and guideline in real translation scenarios.
Conclusion
The Demo II translation event demonstrated that the structured guideline and expert collaboration improved the quality of technical translations on Turkish Wikipedia and enhanced the translation experience of the Wikipedians while they translated challenging, technical English articles.
Collaboration between Wikipedians and experts was active and timely, with 31 discussion threads initiated during the translation period and most receiving responses within a day. This highlighted the practicality and responsiveness of the email-based support system.
The guideline scored 83.75 on the System Usability Scale (SUS). This high SUS score (Grade A, the highest possible grade) indicates that the guideline is clear and user-friendly. Feedback from both new and experienced Wikipedians also highlighted that the guideline and collaborative translation significantly eased their translation process and improved their output quality.
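For context, the standard SUS scoring procedure is reproduced below; the example answers are invented for illustration and are not actual survey responses.

```python
def sus_score(responses: list) -> float:
    """Standard SUS scoring for one respondent's ten 1-5 Likert answers:
    odd-numbered items contribute (score - 1), even-numbered items (5 - score),
    and the total is scaled by 2.5 onto a 0-100 range."""
    assert len(responses) == 10
    contributions = [
        (r - 1) if i % 2 == 0 else (5 - r)  # i is 0-based, so even i = odd-numbered item
        for i, r in enumerate(responses)
    ]
    return sum(contributions) * 2.5

print(sus_score([5, 2, 4, 1, 5, 2, 4, 2, 5, 1]))  # 87.5 for this invented respondent
```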
Expert evaluations confirmed the overall success of the translations, with near-excellent scores in accuracy (9/10), fluency (8.9/10), and terminological correctness (8.8/10). Additionally, the comparable clarity ratings of the original and translated texts suggest that the Turkish translations retained the comprehensibility of the source material almost perfectly, even in highly technical contexts and even though the evaluating experts are more accustomed to reading technical content in English.
In conclusion, Demo II validated that a combination of a well-designed translation guideline and expert-supported collaboration can enhance both the translation experience of volunteer translators and the quality of their outputs. These findings provide a strong foundation for scaling the approach and re-implementing it across other Wikipedia language communities and technical domains. For more detailed results please access the results document here.
Seminar II (May 2025)
We concluded the project with Seminar II on May 28, 2025, where we presented the project’s outcomes, shared the results of the user studies, and introduced the guideline and the mail group to the expert and Wikipedia communities. The presentation topics were mainly the deliverables of the project: the dataset, the annotation, the models, the community and support network (guideline and email group), and future work.
The announcement was made through the Village Pump, the Turkish Wikipedia Telegram group, the WikiTuesday meetings, and the Wikipedian and expert mail groups.
The presentation covered all major deliverables of the project: the construction of an EN-TR terminology dataset, the annotation workflow and quality controls, the baseline evaluation of large language models, as well as the expert and Wikipedian surveys, the announcement mechanisms, and the development of community mechanisms such as the expert–Wikipedian email group and the translation guideline. The results and findings of the user studies (Demo I and Demo II) and previous meetings (Seminar I and Panel I) were also shared.
Dissemination
- Wiki Workshop 2024: We presented our plan at Wiki Workshop 2024 on June 20, 2024. The extended abstract is available here, and the video of the presentation can be viewed here.
- Annotation Webinar: We hosted an Annotation Webinar on August 17, 2024, to explain the project in detail to potential annotators. The presentation file is available on Wikimedia Commons here, and a photo taken during the webinar can be found here.
- Wikimedia CEE Meeting 2024: We shared our progress, including annotation and survey results, at the Wikimedia CEE Meeting 2024 on September 21, 2024. More details about the presentation are available here, and a photo taken during the presentation is on Wikimedia Commons here.
- Seminar I: We held Seminar I on October 19, 2024, to introduce the project and present annotation and survey results. The presentation file can be accessed on Wikimedia Commons here, and a photo taken during the seminar is available here.
- Panel I: We organized a panel bringing together 25 domain experts (such as scientists and engineers) and Wikipedians on December 2, 2024. The session focused on evaluating participants’ areas of expertise, their experience with translation, and their involvement with Wikipedia. Specific challenges in term-focused translation processes were discussed, highlighting issues faced by both sides. The conversation also explored ways to improve the efficiency of these processes and considered approaches to building a sustainable channel for ongoing communication between expert academics and the Wikipedia community. A photo taken during the panel is available here; recording of the panel can be accessed here (in Turkish).
- Demo I: On February 25, 2025, we hosted a small-scale wiki-marathon focused on fostering collaborative translation between Wikipedians and domain experts. Participants were grouped according to their areas of specialization, with each group consisting of 3–4 experts and one experienced Wikipedian. This setup was intentional, allowing Wikipedians to compare the translation process with and without expert input. Each group received a selection of scientific English Wikipedia articles—without existing Turkish translations—relevant to their expertise, and they were tasked with translating as much content as possible within one hour. Following the session, a survey was distributed to participants. The results indicated that the collaborative translation experience was overwhelmingly positive for both sides, with the majority expressing interest in continuing such partnerships. A photo taken during the demo is available here.
- Demo II: Between May 20–28, 2025, we organized an asynchronous collaborative translation event to test the usability and impact of the translation guideline and expert support system. 10 Wikipedians translated complex technical articles using the guideline, with expert assistance provided via a mailing list. The collaboration resulted in 31 discussion threads, with most expert responses delivered within a day. Expert reviews of the translations showed high scores in accuracy, fluency, and terminological correctness. The guideline received a System Usability Scale score of 83.75 (Grade A), confirming its effectiveness and clarity.
- Seminar II: On May 28, 2025, we presented the final outcomes of the project, including datasets, models, community structures (guideline and mailing list), and results from user studies (Demo I and II). The seminar was promoted via the Village Pump, WikiTuesday meetings, and expert/Wikipedian mail groups. Presentation topics included dataset creation, annotation workflow, LLM evaluation, expert-Wikipedian collaboration, and future plans for scaling and sustaining the initiative.
Conclusion and Future Work
We introduced the first terminologically annotated English–Turkish corpus that spans Mathematics, Physics, and Computer Science, pairs 3,300 sentences, and links more than 10,000 technical terms with expert oversight. A carefully engineered annotation workflow, combining automatic filters, tiered quizzes, and live quality controls, produced high inter-annotator agreement while remaining cost-effective and compensating annotators fairly.
Baseline experiments confirm the importance of both external terminological resources and bilingual context: models granted terimler.org access and full EN-TR input consistently outperform variants that lack one of these ingredients. Nevertheless, the performance gap between even the best model and human annotators underscores that terminology translation remains far from solved.
Our dataset and evaluation suite lay a reproducible foundation for future work. Promising directions include fine-tuning or instruction-tuning language models on the released corpus, integrating retrieval-augmented generation with richer Turkish glossaries, and extending the annotation protocol to additional domains (e.g., Biology, Chemistry) or other low-resource languages. We hope this effort accelerates progress toward machine translation systems that respect the precision and nuance demanded by scientific discourse.
From the community perspective, we identified and reached out to thousands of experts and hundreds of Wikipedians. Through seminars, social media, weekly gatherings, and emails, we established an active email group with 35 members from both communities. Through live translation events with the participation of both sides, we jointly designed a translation guideline and demonstrated its usefulness through SUS scores and expert evaluation of the translated articles.
References
- ↑ Eva Hasler, Adrià De Gispert, Gonzalo Iglesias, Bill Byrne (2018). Neural Machine Translation Decoding with Terminology Constraints. arXiv preprint arXiv:1805.03750.
- ↑ Miriam Exel, Bianka Buschbeck, Lauritz Brandt, Simona Doneva (2020). Terminology-Constrained Neural Machine Translation at SAP. EAMT.
- ↑ Nikolay Bogoychev, Pinzhen Chen (2023). Terminology-Aware Translation with Constrained Decoding and Large Language Model Prompting. WMT.
- ↑ Kirill Semenov, Vilém Zouhar, Tom Kocmi, Dongdong Zhang, Wangchunshu Zhou, Yuchen Eleanor Jiang (2023). Findings of the WMT 2023 Shared Task on Machine Translation with Terminologies. WMT.
- ↑ Zheng Li, Mao Zheng, Mingyang Song, Wenjie Yang (2025). TAT-R1: Terminology-Aware Translation with Reinforcement Learning and Word Alignment. arXiv preprint arXiv:2505.21172.
- ↑ Jiarui Liu, Iman Ouzzani, Wenkai Li, Lechen Zhang, Tianyue Ou, Houda Bouamor, Zhijing Jin, Mona Diab (2024). Towards Global AI Inclusivity: A Large-Scale Multilingual Terminology Dataset (GIST). arXiv preprint arXiv:2412.18367.
- ↑ Hongyuan Lu, Haoran Yang, Haoyang Huang, Dongdong Zhang, Wai Lam, Furu Wei (2024). Chain-of-Dictionary Prompting Elicits Translation in Large Language Models. EMNLP 2024.
- ↑ https://www2.statmt.org/wmt25/terminology.html
- ↑ Huaao Zhang, Qiang Wang, Bo Qin, Zelin Shi, Haibo Wang, Ming Chen (2023). Understanding and Improving the Robustness of Terminology Constraints in Neural Machine Translation. ACL 2023.
- ↑ https://www.researchgate.net/publication/358240882_Terminological_Quality_Evaluation_in_Turkish_to_English_Corpus-Based_Machine_Translation_in_Medical_Domain
- ↑ https://www.clarin.si/repository/xmlui/handle/11356/1816
- ↑ Kirill Semenov, Vilém Zouhar, Tom Kocmi, Dongdong Zhang, Wangchunshu Zhou, Yuchen Eleanor Jiang (2023). Findings of the WMT 2023 Shared Task on Machine Translation with Terminologies. WMT.
- ↑ Jiarui Liu, Iman Ouzzani, Wenkai Li, Lechen Zhang, Tianyue Ou, Houda Bouamor, Zhijing Jin, Mona Diab (2024). Towards Global AI Inclusivity: A Large-Scale Multilingual Terminology Dataset (GIST). arXiv preprint arXiv:2412.18367.
- ↑ Ramakrishna Appicharla, Baban Gain, Santanu Pal, Asif Ekbal (2025). Beyond the Sentence: A Survey on Context-Aware Machine Translation with Large Language Models. arXiv preprint arXiv:2506.07583.
- ↑ Yan Huang, Wei Liu (2024). Evaluating the Translation Performance of Large Language Models Based on Euas-20. arXiv preprint arXiv:2408.03119.
- ↑ Yongjing Yin, Jiali Zeng, Yafu Li, Fandong Meng, Yue Zhang (2024). LexMatcher: Dictionary-centric Data Curation for LLM-based Machine Translation. EMNLP Findings.
- ↑ https://tez.yok.gov.tr/UlusalTezMerkezi