Research:Bridging the Gap Between Wikipedians and Scientists with Terminology-Aware Translation: A Case Study in Turkish
Introduction
This project addresses the gap between the escalating volume of English-to-Turkish Wikipedia translations and the insufficient number of contributors, particularly in technical domains. Our focus is twofold:
- bridging academic/expert and Wikipedia communities,
- creating datasets and developing NLP models for terminology identification, terminology linking, and terminology-aware translation.
The project has five main contributions:
- (Dataset) Terminology-Rich Machine Translation Corpus. We combine cleaned English–Turkish paragraph pairs from Wikipedia’s Content Translation dumps with domain-filtered abstracts from the Turkish National Thesis Center, producing 3,300 sentence pairs evenly split across Mathematics, Physics, and Computer Science. Rigorous filtering (from 468,254 raw paragraphs to 105,506 after basic cleaning, and finally to 439 manually inspected candidates) ensures high lexical density and domain fidelity. The resulting corpus contains 10,157 expert-validated term links.
- (Dataset) Terminology Annotated Machine Translation Dataset. We design a 30-page guideline, an online quiz, and real-time quality gates in Label Studio to identify, link, and correct terminological translations in the corpus. 43 trained annotators achieve substantial agreement (Fleiss κ ≈ 0.71 for English and 0.67 for Turkish term detection) while earning above-market wages, demonstrating that fair, large-scale annotation is feasible even for specialized tasks. The final corpus along with the annotations is publicly available here.
- (Models) Baseline Evaluation and Insights. Four prompting strategies over six large language models reveal that (i) giving models access to the terimler.org database and (ii) supplying parallel EN-TR context both substantially improve performance, yet even the strongest baseline (o3-mini) underperforms humans by a wide margin, suggesting substantial room for improvement.
- (Community) Wikipedia Technical Translation Guideline. We develop a guideline for Turkish Wikipedians on how to produce higher-quality translations of technical documents that contain a large number of technical terms. We perform a user study in which we survey Wikipedians using the System Usability Scale (SUS) and find high SUS scores. The guideline can be accessed here.
- (Community) Expert-Wikipedian Email Group. After thorough discussions between the academics and Turkish Wikipedians, we establish an email group to bridge these two communities. The guideline contains detailed instructions on how to reach out to experts in case of difficulties while translating technical terms. There are currently 35 members in the mail group and 19 active, public discussions. The group can be accessed here (note that one must join the group before accessing it through this link; otherwise a "Content unavailable" error is shown).
All resources are released and will be maintained under this repo.
Related Work
The Terminology Problem in Machine Translation
Neural machine-translation (NMT) systems achieve impressive average BLEU scores, yet they still fail conspicuously on specialized terms, brand names and formulaic expressions. Early efforts attempted to constrain the decoder so that specific target words must appear in the output—for example with multi-stack search over finite-state acceptors [1]. Industrial deployments such as SAP’s localization pipeline later validated that explicit constraints can halve post-editing effort in real-world workflows, but at the price of more brittle beam search and lower fluency when many terms are injected at once [2].
From Constrained Decoding to Prompting and Reinforcement Learning
Since 2023, research has broadened beyond hard constraints toward translate-then-refine paradigms. In the WMT-23 Terminology Shared Task, top systems first produced an unconstrained draft and then asked a large language model (LLM) to repair term errors, yielding the best terminology F-scores without sacrificing adequacy [3]. Follow-up analyses showed that dictionary access or peeking at the reference still boosts scores, but gains plateau when evaluation metrics do not explicitly reward term faithfulness, prompting the community to call for more term-centric metrics [4].
More recently, Li et al. proposed TAT-R1, which couples word-alignment-based rewards with reinforcement learning; the model significantly improves COMET on terminology stress-tests without degrading general quality [5]. Parallel work explores retrieval-augmented generation (RAG) pipelines that pass glossary snippets to the LLM at inference time [6], as well as Chain-of-Dictionary prompting that feeds multi-lingual dictionary chains to stimulate the model’s latent terminology knowledge [7]. Collectively, these studies suggest that future systems will blend soft guidance (prompts, retrieval) with post-editing rather than rely on rigid constraints alone.
Shared Tasks and Benchmarks
Shared tasks have been instrumental in crystallising the research frontier. Besides the WMT-23 track, WMT plans a broadened Terminology Task for 2025 covering more domains and language pairs, explicitly evaluating how systems exploit external dictionaries [8]. Outside WMT, researchers have released robustness suites that stress different constraint lengths and densities [9], and domain-specific evaluations in medical MT have shown that even state-of-the-art transformers still mistranslate up to 18% of critical terms when training data are scarce [10].
Corpora for English-Turkish Technical Translation
English–Turkish remains low-resource compared with Indo-European pairs. Large web-crawled sets such as MaCoCu-tr-en 2.0 supply over 1.6 M aligned sentence pairs, but they are heterogeneous and rarely annotated for technical terms [11]. Earlier domain corpora focus on news or general language; none link term pairs or grade agreement among annotators. To our knowledge, there is no publicly available EN-TR corpus that (i) deliberately maximizes term density across STEM fields and (ii) provides gold / silver / bronze confidence tiers. The dataset introduced in our study therefore fills a demonstrable gap.
Annotation Methodologies
High-quality terminology benchmarks demand rigorous annotation. Prior projects often relied on a handful of experts and achieved only moderate inter-annotator agreement (κ≈0.55) when tagging terms in Chinese–English MT [12]. Our pipeline builds on these lessons by incorporating a 30-page guideline, a qualification quiz and live quality gates, echoing best practices from large multilingual terminology datasets such as GIST [13]. The result is substantial agreement for both English (κ=0.71) and Turkish (κ=0.67) term detection, surpassing many earlier efforts while paying annotators above-market wages.
Large Language Models and Low-Resource Terminology
Context-aware prompting studies report that commercial LLMs already match or exceed fine-tuned MT baselines on average BLEU, yet continue to lag behind human performance on term-specific metrics, especially for low-resource directions [14] [15]. Dictionary-centric fine-tuning and synthetic data generation have proven effective for other language pairs [16], but have not been explored for Turkish. By releasing 10,157 expert-validated term links and a reproducible evaluation suite, our work enables systematic investigation of glossary-guided prompting and other methods for terminology translation in a genuinely low-resource setting.
Existing solutions either (a) evaluate terminology translation on high-resource language pairs, (b) rely on heterogeneous, low-quality data, or (c) lack transparent annotation protocols. Our contribution—the first terminology-rich EN-TR corpus with graded term confidence and strong agreement—creates a much-needed test-bed. Moreover, the baseline experiments reported here complement the latest trends in LLM research, demonstrating that even the strongest baseline, o3-mini, still trails the human upper bound, thus motivating future research on glossary-aware adaptation strategies for Turkish scientific discourse.
Terminology-Rich Machine Translation Corpus
The Wikipedia Content Translation tool simplifies the translation process by automating repetitive tasks such as copying text, creating links, and categorizing articles. It leverages resources like bilingual dictionaries and machine translation, enabling translators to focus on crafting fluent, high-quality text. Translated paragraph pairs are published weekly in Wikimedia dumps.
Data Cleaning: The dataset undergoes multiple cleaning steps for high-quality sentence extraction. Empty entries, duplicates, and pairs with identical source and target content are removed first. Paragraphs with insufficient or inconsistent lengths are then excluded, followed by the elimination of content containing symbols like ↑, &, or displaystyle, which often signify references or embedded equations in Wikipedia pages. These steps reduce the dataset from 468,254 to 105,506 paragraphs.
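As an illustration of this cleaning stage, the sketch below filters raw paragraph pairs; the minimum word count, the length-ratio bound, and all function names are assumptions made for the example, not the project's exact implementation.

```python
import re

# Symbols that often signal references or embedded equations in Wikipedia dumps.
ARTIFACT_PATTERN = re.compile(r"[↑&]|displaystyle")

def keep_pair(source: str, target: str,
              min_words: int = 5, max_len_ratio: float = 2.0) -> bool:
    """Return True if an English-Turkish paragraph pair survives basic cleaning."""
    # Drop empty content and pairs whose source and target are identical.
    if not source.strip() or not target.strip() or source.strip() == target.strip():
        return False
    src_len, tgt_len = len(source.split()), len(target.split())
    # Drop paragraphs with insufficient or inconsistent lengths (thresholds assumed).
    if src_len < min_words or tgt_len < min_words:
        return False
    if src_len / tgt_len > max_len_ratio or tgt_len / src_len > max_len_ratio:
        return False
    # Drop content containing reference or equation artifacts.
    return not (ARTIFACT_PATTERN.search(source) or ARTIFACT_PATTERN.search(target))

def clean(pairs):
    """Deduplicate raw (source, target) pairs and apply the filters above."""
    seen, kept = set(), []
    for src, tgt in pairs:
        key = (src.strip(), tgt.strip())
        if key not in seen and keep_pair(src, tgt):
            seen.add(key)
            kept.append((src, tgt))
    return kept
```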
Domain Selection: For high-quality annotations, the dataset is restricted to paragraphs from three STEM domains: Mathematics, Physics, and Computer Science. Fields such as Chemistry and Biology are excluded due to limited domain expertise, ensuring accuracy in the selected domains. Classification into these domains relies on a vocabulary of 2,520 terms sourced from relevant Wikipedia pages. Physics terms are drawn from the Glossary of physics, Mathematics terms from the Glossary of areas of mathematics and Glossary of calculus, and Computer Science terms from the Glossary of computer science, Glossary of artificial intelligence, Machine learning, Deep learning, and Natural language processing pages. To construct a terminology-rich corpus, we first retain only paragraphs where the number of unique terms exceeds three, reducing the dataset to 1,562 paragraphs.
Next, we align English-Turkish sentences within the paragraphs using Stanza and filter out paragraphs where the number of source sentences does not match the number of target sentences, resulting in 507 paragraphs. We then use the GPT-4o model to determine the domains of the paragraphs, excluding those related to other scientific fields such as Chemistry and Biology. This step further reduces the dataset to 439 paragraphs. Finally, we manually review all 439 paragraphs and eliminate instances where the paragraphs are poorly translated. After this final step, we obtain 303 paragraphs containing a total of 1,185 sentences.
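The following sketch shows how the term-density filter and the Stanza-based sentence alignment could be combined; the glossary is passed in as a plain set of lowercased terms, and the substring matching strategy is a simplifying assumption.

```python
import stanza

# One-time model downloads: stanza.download("en"); stanza.download("tr")
nlp_en = stanza.Pipeline(lang="en", processors="tokenize")
nlp_tr = stanza.Pipeline(lang="tr", processors="tokenize")

def unique_terms(text: str, glossary: set) -> set:
    """Glossary terms (drawn from the Wikipedia glossaries) occurring in the text."""
    lowered = text.lower()
    return {term for term in glossary if term in lowered}

def align_paragraph(en_par: str, tr_par: str, glossary: set):
    """Keep a paragraph only if it is term-rich and its sentence counts match."""
    if len(unique_terms(en_par, glossary)) <= 3:   # require more than three unique terms
        return None
    en_sents = [s.text for s in nlp_en(en_par).sentences]
    tr_sents = [s.text for s in nlp_tr(tr_par).sentences]
    if len(en_sents) != len(tr_sents):
        return None  # discard pairs whose sentence counts disagree
    return list(zip(en_sents, tr_sents))
```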
In addition to Wikipedia content, we also make use of thesis abstracts, which are readily available. The Turkish National Thesis Center [17], managed by the Turkish Council of Higher Education (YÖK), is the official repository for graduate theses from Turkish universities. This platform contains over 700,000 theses, providing a rich source of academic texts. To ensure dataset credibility, we use theses from the top six Turkish universities, ranked by the Times Higher Education and QS World University Rankings: Koç University, Middle East Technical University, Istanbul Technical University, Bilkent University, Boğaziçi University, and Sabancı University. These institutions represent the highest academic standards in Turkey, adding to the dataset’s reliability. We select abstracts exclusively from theses in the Mathematics, Physics, and Computer Science departments. From the six universities mentioned, we compile 287 abstracts comprising 2,115 sentences. Since the theses are submitted to the Turkish National Thesis Center in PDF format, OCR-related typos occasionally occur in the abstracts. To address this, we use the GPT-4o model to correct these typos. Finally, we combine sentences from both datasets, resulting in a total of 3,300 sentences, evenly distributed across the three domains: Mathematics, Physics, and Computer Science, with each domain contributing 1,100 sentences.
Terminology Annotated Machine Translation Dataset
Next, we annotate the corpus with term boundaries, links to a terminology database, and corrected translations, i.e., translations that are faithful to the expert-curated terminology database and carry the correct morphological inflections.
To create a labelled and linked parallel corpus, we first identify a suitable annotation tool. After extensive research, we select Label Studio's Academic Program due to its free access, online functionality, and user-friendly interface. Annotators work directly online without requiring downloads, ensuring an efficient workflow. A sample user interface is given below:

Annotation Task
Since the annotators are tasked with multiple sub-tasks that depend on each other, we first design an annotation pipeline. The pipeline begins by identifying English terms in the source sentence, which are labeled as terms. Corresponding Turkish terms are then identified in the target sentence and labeled as terms as well. For each English-Turkish term pair, a relation is established to link the terms. These relations are validated using the terimler.org terminology database, where correctly translated terms are marked as CORRECT_TRANSLATION, and incorrect ones are flagged with updated metadata. If an English term does not have a match in the database, its Turkish counterpart is manually evaluated and labeled as either correct or incorrect. These steps are summarized in the Annotation Pipeline figure shown on the right.
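A minimal sketch of the per-term decision logic in this pipeline is shown below. The tiny in-memory glossary stands in for a terimler.org lookup, and any label name other than CORRECT_TRANSLATION is an assumption for illustration.

```python
from dataclasses import dataclass

# Tiny illustrative stand-in for a terimler.org lookup.
TERIMLER_DB = {
    "random": {"rastgele", "rasgele"},
    "eigenvalue": {"özdeğer"},
}

@dataclass
class TermLink:
    en_term: str
    tr_term: str
    label: str         # "CORRECT_TRANSLATION" or a flag for an incorrect one
    in_database: bool   # whether the English term was found in the database

def link_terms(en_term: str, tr_term: str, manual_ok: bool = False) -> TermLink:
    """Validate one English-Turkish term pair, mirroring the annotation pipeline."""
    accepted = TERIMLER_DB.get(en_term.lower(), set())
    if accepted:
        correct = tr_term.lower() in accepted
        return TermLink(en_term, tr_term,
                        "CORRECT_TRANSLATION" if correct else "INCORRECT_TRANSLATION",
                        in_database=True)
    # No database match: fall back to the annotator's manual judgement.
    return TermLink(en_term, tr_term,
                    "CORRECT_TRANSLATION" if manual_ok else "INCORRECT_TRANSLATION",
                    in_database=False)

print(link_terms("random", "rasgele"))  # matches the database, hence CORRECT_TRANSLATION
```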

We prepare a comprehensive annotation guideline consisting of 30 pages and 50 screenshots. This detailed approach ensures clarity and minimizes any potential confusion for the annotators. The guideline covers the following key aspects: the goals of annotation, the usage of terimler.org (a terminology database), details about the Label Studio interface, and step-by-step instructions for the annotation pipeline. It also includes a section on handling special cases during annotation. The guideline is provided here.
Annotators
[edit]We use various channels to identify 229 potential annotators, including university faculty mailing lists, WhatsApp, Telegram, and Discord groups associated with universities, LinkedIn, Twitter, and science platforms such as fizikhaber.com and Türk Fizik Takvimi.
To ensure the quality of annotations, we place special emphasis on training annotators and eliminating low-performing ones. To do so, we conduct an Annotation Webinar that introduces the project and covers key topics, including the project overview, the role of annotators, the scoring system, the annotation guidelines, and the project timeline, followed by a Q&A session to address participants' questions.
After the webinar, we identify the list of interested participants and administer a quiz to eliminate those who are likely to score low. Initially, 160 participants confirmed their interest in taking the quiz to proceed further. The quiz consists of 30 sentence pairs, organized into 10 instances of three sentences each. On average, it takes 4 hours (±1 hour) to complete, including the time required to read the Annotation Guideline for the first time. After becoming familiar with the guidelines, participants typically take about 4 minutes per sentence. Sentences are grouped into three-sentence instances rather than presented in isolation in order to preserve context: each paragraph, whether sourced from Wikipedia or abstracts, is divided into three-sentence chunks, with sentences kept in sequence to maintain continuity. If a paragraph contains a number of sentences not divisible by three, the remaining sentence(s) are excluded. Out of the 160 participants who received the quiz, 84 completed all 30 sentences. The mean English Term Detection F1 score is 0.69, and participants with an F1 score below 0.7 are eliminated, excluding 35 participants.

Among the 43 annotators, 29 are undergraduates and 14 are graduates, ensuring a balanced mix of educational backgrounds. The majority (38) are in the 18–25 age group, with 4 aged 26–40 and one aged 41–65. In terms of academic specialization, 18 annotators are from Computer Science, 8 from Mathematics, and 6 from Physics, forming the core STEM expertise. Additional representation includes Electrical and Electronics Engineering (5), Mechanical Engineering (2), and 4 annotators from other STEM-related fields, further diversifying the annotator pool.
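For reference, quiz performance can be scored with a span-level F1 of the kind sketched below; the exact-span matching criterion is an assumption, since the project does not specify how partial matches are treated.

```python
def term_f1(predicted: set, gold: set) -> float:
    """Span-level F1 for term detection; spans are (start, end) character offsets."""
    if not predicted and not gold:
        return 1.0
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Annotators with an English term-detection F1 below this threshold are excluded.
PASS_THRESHOLD = 0.7
print(term_f1({(0, 6), (10, 18)}, {(0, 6), (20, 28)}))  # 0.5
```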
Finally, for continuous training of the annotators, an automatically generated Quiz Evaluation Report is sent to participants who successfully complete the quiz to help them avoid repeating mistakes during labeling, as shown below.

Annotation Results
At least two out of three annotators agreed on 8,470 terms across the 3,300 parallel sentences. For the remaining 2,411 terms with low agreement, we perform a manual review and remove 724 wrongly identified terms. In total, the final dataset contains 10,157 annotated terms.
We calculate Fleiss' kappa to assess inter-annotator agreement for English and Turkish term detection. For English term detection, the distribution of Fleiss' kappa scores across paragraphs shows a mean agreement of 0.715, indicating substantial agreement among annotators. Similarly, for Turkish term detection, the mean Fleiss' kappa score is 0.674, reflecting moderate to substantial agreement. Next, we calculate the average task scores for each step of the annotation. English term detection achieves an F1 score of 0.84, while Turkish term detection scores 0.82. Turkish translation labeling achieves an exact-match score of 0.85, translation correction scores 0.68, and term linking maintains strong performance with a score of 0.86. These scores highlight the improved performance of annotators after the quiz, demonstrating the effectiveness of the Quiz Evaluation Report and the quality control mechanisms applied during the annotation process.
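Fleiss' kappa can be computed from scratch as below; framing term detection as a per-token binary decision by three annotators is an assumption made to keep the example small.

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss' kappa for an items x categories matrix of rating counts
    (each row sums to the number of raters, here three)."""
    n_items, _ = counts.shape
    n_raters = counts[0].sum()
    p_j = counts.sum(axis=0) / (n_items * n_raters)              # category proportions
    P_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    return (P_bar - P_e) / (1 - P_e)

# Toy example: three annotators deciding, per token, term vs. non-term.
toy = np.array([[3, 0], [2, 1], [0, 3], [1, 2]])
print(round(fleiss_kappa(toy), 3))  # 0.333
```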
Improving the Terminology Database
Since no database is ever complete or perfect, we ask users to comment on missing or incorrect entries on terimler.org for various cases, including incorrect meanings, potential errors, spelling mistakes, and missing synonyms. Participants provided 2,100 comments in total. However, since all sentences are annotated three times, there is significant overlap in similar comments for the same sentence. Additionally, there is overlap across sentences for commonly repeated terms. For example, in terimler.org, the term "random" is translated as "rasgele," whereas, according to the Turkish Language Association (TDK), it should be "rastgele." Since the word "random" appears many times across the 3,300 sentences, there is substantial repetition in comments for such terms. After cleaning the 2,100 comments to remove duplicates, we reduce them to 294 unique entries for terimler.org, including 214 entries suggesting synonyms (SYNONYM), 37 identifying typos (TYPO), 22 highlighting potential errors in meaning (UNCERTAIN), and 21 pointing out definite errors requiring correction (WRONG).
Annotation Cost
The payment per sentence is 20 TL, and the average time to annotate one sentence is approximately 4 minutes. Each of the 3,300 unique sentences is annotated by three annotators, so a total of 9,900 sentence annotations are produced, resulting in a total annotation cost of 198,000 TL. Annotators earn an average hourly rate of 300 TL, significantly higher than Turkey’s minimum net hourly wage of 75.56 TL. The number of sentences annotated per annotator varies, with a mean of 230.23 sentences. Additionally, bonus payments totaling 15,000 TL are awarded to 18 annotators for providing high-quality comments that identify errors in terimler.org.
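The figures above can be verified with a quick back-of-the-envelope computation:

```python
per_sentence_tl = 20          # payment per sentence
minutes_per_sentence = 4      # average annotation time
annotations = 3_300 * 3       # each sentence is annotated by three annotators

total_cost_tl = annotations * per_sentence_tl                  # 198,000 TL
hourly_rate_tl = per_sentence_tl * 60 / minutes_per_sentence   # 300 TL per hour
print(total_cost_tl, hourly_rate_tl)
```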
Methods
Given the strong recent performance of large language models on terminology-aware translation, we evaluate six state-of-the-art LLMs: o3-mini, DeepSeek-R1, GPT-4o, GPT-4o-mini, Gemini 1.5 Pro, and Gemini 1.5 Flash.
We consider both post-editing and translation-from-scratch scenarios. In the post-editing scenarios, we assume that a high-quality translation is already provided by a modern translation tool, but the term translations are not consistent with the database. In the translation-from-scratch scenarios, we ask the models to translate while staying faithful to the database we provide. In more detail:
- Scenario 1 - With terimler.org access: In this method, we aim to design a pipeline that closely mimics how human annotators work. We provide English–Turkish parallel sentences along with access to the terimler.org database. Similar to the annotation pipeline, we design an agentic pipeline in which the LLM performs three tasks in sequence: term detection, term linking, and translation correction (a minimal sketch is given after this list).
- Scenario 2 - Without terimler.org access: In this method, we provide English–Turkish parallel sentences but do not allow access to terimler.org. Therefore, the Term Linking phase is omitted. The Term Detection phase remains identical to that in Scenario 1, while the Translation Correction phase is slightly modified.
- Scenario 3 - English-only sentences with terimler.org access: To measure the end-to-end performance of the systems, this scenario follows the same procedure as Scenario 1, but only English sentences are provided. The models must therefore first generate the Turkish translations before proceeding through the remaining stages.
- Scenario 4 - English-only sentences without terimler.org access: Similar to Scenario 2, but only English sentences are provided. Consequently, the models must first generate the Turkish translations before performing term detection and translation correction.
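Below is a minimal sketch of the Scenario 1 pipeline. The call_llm wrapper, the glossary_lookup helper, and the prompts are illustrative placeholders, not the exact prompts used in the experiments.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around whichever chat model is being evaluated
    (o3-mini, GPT-4o, Gemini, ...); replace with the provider's client call."""
    raise NotImplementedError

def scenario_1(en_sentence: str, tr_sentence: str, glossary_lookup) -> str:
    """Post-editing with terimler.org access, run as three chained steps."""
    # 1) Term detection on the English source sentence.
    terms = call_llm(
        "List the technical terms in this English sentence, one per line:\n" + en_sentence
    ).splitlines()
    # 2) Term linking: look up accepted Turkish equivalents for each detected term.
    links = {term: glossary_lookup(term) for term in terms}
    # 3) Translation correction: revise the Turkish sentence to follow the glossary.
    return call_llm(
        "Revise the Turkish translation so that every term follows the glossary.\n"
        f"English: {en_sentence}\nTurkish: {tr_sentence}\nGlossary: {links}"
    )
```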
The results for each scenario are given below. The best-performing model across all metrics and scenarios is o3-mini. Notably, the human upper-bound scores remain significantly higher than those of all models. Experiments also demonstrate that access to terimler.org significantly boosts model performance. Additionally, providing the EN-TR pair as a whole is more effective than generating the Turkish translation from scratch, suggesting that using models for post-editing is the more viable long-term solution.




Bridging the Two Communities
Our main goal is to develop strategies for integrating domain experts with Wikipedians; additionally, we aim to recruit domain experts as contributors and to train existing and new Wikipedians to translate technical content more accurately, thereby fostering sustained contributions and improving the overall quality of technical Turkish Wikipedia articles. The main events and outcomes are summarized below:
- [Done] Reach out to domain experts (physics, mathematics, and informatics) and to Wikipedians who create, translate, or are interested in creating and translating STEM-related Wikipedia articles, and gain insight into their editing/translating habits, profiles, and willingness to contribute to a joint community.
- [Done] Seminar I: An online seminar introducing the project and interview results, with a panel to establish effective feedback channels between Wikipedians and researchers.
- [Done] Panel I: An online discussion panel with experts and Wikipedians on how to collaborate in order to improve the scientific and technical content of Turkish Wikipedia and to evaluate findings from earlier project phases.
- [Done] Demo I: A 1-hour-long online collaborative translation event with Wikipedians and experts to evaluate the effectiveness of collaborative work.
- [Done] Demo II: An 8-day offline translation event in which Wikipedians used the guideline and communicated with experts through the joint mail group, evaluating the usability of the guideline and the effectiveness of the mail group.
- [Done] Seminar II: A follow-up online seminar discussing project progress and user study outcomes, focusing on designing effective training for translating technical terms.
- [Done] Design and evaluate training materials for editors to more accurately translate the technical terms.
Community Channels
The community channels for expert outreach consisted of societies, associations, foundations, e-mail groups of experts, announcement groups for experts, academic e-mails, Discord groups, and popular-science social organizations. These channels were identified through internet searches and the advice of the experts we contacted. The community channels for Wikipedian outreach consisted of the Turkish Wikipedia Village Pump, the Wikimedia Community User Group Turkey (WMTR) Telegram group and Instagram account, and the Turkish Wikipedia translation group. These channels were likewise identified through internet searches and the advice of the experts we contacted, especially Dr. Bülent Sankur, the lead of terimler.org, and Başak Tosun and Zafer Batık, leaders and founding members of Wikimedia Community User Group Turkey. Each community channel, along with the number and type (profession, involvement with Wikipedia, etc.) of the people it reaches and the events for which it was used, was recorded for community tracking and future reference.
Community Engagement
To keep the community engaged, we create a LinkedIn page and a YouTube channel. The LinkedIn page is used to announce seminars and updates, and we follow a regular posting schedule to increase public and community engagement. The seminars and meetings are recorded with the consent of the participants and uploaded to the YouTube channel. To inform the Wikipedia community, a member of the project's community staff regularly participated in the biweekly VikiSalı (WikiTuesday) meetings to update and engage with the WMTR community.
Initial reach out (July-August 2024)
We conducted a survey targeting two groups: Wikipedians and subject-matter experts. The aim was to understand both groups' experiences, translation habits, language proficiency, and willingness to collaborate on translations of technical Wikipedia articles. Two 10-minute surveys, one for Wikipedians and one for experts, hosted on Google Forms, were shared between July 24, 2024, and August 5, 2024. We accessed a list of English-to-Turkish translating Wikipedians from Turkish Wikipedia (77 Wikipedians) and sent direct messages through the Wikipedia user chat. Additionally, the survey link was shared on the Turkish Wikipedia Village Pump ("Köy Çeşmesi") (905 Wikipedians) and the Turkish Wikipedia Telegram channel (319 Wikipedians), and it was also published on the Wikimedia Turkey (@wikimediaturkey) Instagram page, which has 1,603 followers. Through these efforts, we reached an estimated 1,300+ Wikipedians out of 2,581 active Turkish Wikipedia users. To reach experts, we contacted email groups consisting of experts from various fields, totaling approximately 8,100 people. We posted announcements on Türk Fizik Takvimi (Turkish Physics Calendar), whose e-mail group has 5,800 subscribers, and on Turkmath, both platforms through which relevant announcements are published to reach academics and researchers in Physics and Mathematics. We contacted Türk Fizik Derneği (Turkish Physics Association) and Akademik Bilişim Vakfı (Academic Informatics Foundation) and asked them to share our announcement with their groups and networks. We published the announcement on Fizikhaber.com, a popular-science news platform with over 12,000 subscribers, which likely includes a significant number of experts. We also asked our own networks to share the announcement in their research and study groups, which we estimate reached over 200 more experts. In total, we reached an estimated 9,000+ experts (excluding Fizikhaber.com). Finally, 29 Wikipedians and 182 experts filled out their respective forms. For more detailed results, please access the results document here.
Some of our key findings from the initial reach-out phase are as follows. Most Wikipedians had high levels of English proficiency and had already contributed significantly to Turkish Wikipedia. They predominantly used automated translation tools, but many reported that these tools performed poorly on technical content, often forcing them to rewrite much of the text manually. 26 out of 29 Wikipedians expressed willingness to participate in a platform connecting them with experts to consult and improve the quality of technical articles they write or translate.
The majority of experts (139 out of 182) have expertise in engineering, physics, and computer science, the disciplines within the scope of this project, and most of these 139 experts had over 10 years of experience in their fields. While many experts occasionally consulted Wikipedia, they acknowledged that Turkish Wikipedia lacks high-quality technical content. In addition, 32 experts had worked to improve Turkish Wikipedia resources and 51 experts had worked on improving general Turkish technical resources (on Turkish Wikipedia or elsewhere). In the end, 123 experts were deemed likely to contribute to the project, showing strong support for collaborating with Wikipedians to improve technical translations.
The results suggested a significant level of interest from experts and less interest from Wikipedians in forming a collaborative community that could substantially enhance the quality of technical translations on Turkish Wikipedia.
Seminar I (October 2024)
We held Seminar I on October 19, 2024, to introduce the project and present annotation and survey results, as well as to kick off engagement between the Wikipedians and the experts. The seminar was announced through the e-mail list of the 123 experts and 26 Wikipedians who showed further interest in the project (they were also asked to share the event with anyone they thought might be interested), the e-mails of experts and students who took part in annotation, the Village Pump, the LinkedIn page, and NLP and translation student and research groups. In addition, 15 engineering and science faculty offices across Turkey were contacted for the announcement. In total, 34 people registered and attended the seminar. The channels through which participants learned about the seminar are as follows:

At the end of the seminar, a discussion session between experts and Wikipedians resulted in a consensus on the following: creating a shared terminology pool, building a stronger bridge between Wikipedians and domain experts, and enhancing communication channels to foster greater interaction between the two communities.
Panel I (December 2024)
On December 2, 2024, a panel was held to discuss how experts and academics can collaborate with volunteers to improve the scientific and technical content of Turkish Wikipedia and to evaluate findings from earlier project phases. The panel was announced through the initial e-mail list of 123 experts and 26 Wikipedians, as well as to the people who showed further interest in the project after Seminar I. Participation in this event was kept more limited in order to shape the communication channel. The event was moderated by Prof. Dr. Bülent Sankur (Boğaziçi University), and the panelists were Hasan Ongan (owner of The OPS Journal), Assistant Professor Süha Tuna, an anonymous Wikipedian, and Başak Tosun (founding member of WMTR). The session was attended by 21 participants. Key topics covered were:
- Emphasis on the importance of scientific Turkish,
- Information on Turkish Wikipedia and volunteers
- Information on Wikipedia translation processes, tools and Turkish Wikipedia translation group,
- Discussion of the need for a shared communication channel, and agreement on the value of expert contributions to Wikipedia. Some experts also showed interest in contributing to Wikipedia and asked Wikipedians for guidance in the future. It was decided to establish a shared email group for asynchronous communication going forward.
As a result of the discussions, it was decided to establish a shared mailing list to facilitate ongoing communication between Wikipedians and experts. An email group was chosen as the preferred communication channel, as it respects the anonymity of Wikipedians and aligns with the communication habits of experts, many of whom regularly check their email throughout the day.
Demo I (February 2025)
On February 27, 2025, an online collaborative translation event was organized to bring together experts and the Wikipedia community with the goal of producing more accurate and efficient translations. The event aimed to enable real-time collaboration on technical translations and to observe its impact. We announced the event at a VikiSalı meeting and invited a total of 25 individuals consisting of attendees of Seminar I and Panel I and additional individuals who expressed interest in participating through personal e-mails and Telegram. 12 experts and 4 Wikipedians confirmed participation. The areas of expertise of the 12 experts were identified through their publications, and they were divided into 4 groups with the topics Thermodynamics, Machine Learning and Statistics, Electromagnetism, and Water Resources. The 4 Wikipedians were each allocated to one of these groups. Technical English Wikipedia articles without Turkish counterparts were then assigned to these groups for translation. 11 people joined the event (8 experts and 3 Wikipedians), along with a backup Wikipedian, so that the 4 groups could still each include a Wikipedian. Following an introduction and orientation, a 40-minute joint translation session was held, during which Wikipedians consulted experts in real time on technical terms and expressions. This marked the first direct translation collaboration between the two groups. After the event, a dedicated evaluation survey was sent to participants to assess the impact of collaborative translation on quality and to better understand how mutual assistance influenced both the translation process and the content. 5 experts and 4 Wikipedians filled out the survey.
All survey participants reported that working together improved translation quality, with most describing the effect as significant. Notably, all respondents expressed the highest level of willingness to engage in similar collaborative efforts in the future. Wikipedian answers highlighted that expert input was especially valuable for translating uncertain or unfamiliar terminology, while expert answers indicated appreciation of the editorial and contextual support provided by Wikipedians. This mutual assistance not only enhanced translation quality but also eased the overall translation process for both groups. Additionally, several participants suggested establishing a dedicated communication channel to facilitate future collaboration, signaling a strong interest in sustained, structured interaction from both parties.
Translation Guideline for Wikipedians and the Joint E-Mail Group
Drawing from the conclusions of previous events and survey responses, a Google Group was established to facilitate sustained communication and collaboration between experts and Wikipedians. The group included the most active participants from Panel I and Demo I. In parallel with the formation of the group, a comprehensive guideline was developed to support both communities in producing accurate and high-quality translations of scientific and technical content and sustain effective communication.
The guideline is designed for Wikipedia contributors and domain experts who wish to create technical content on Turkish Wikipedia or provide support in translating technical and academic terms and texts. Its main objectives are to ensure accuracy and consistency in translation, standardize terminology usage, and facilitate support from both parties in cases of ambiguity. It aims to enhance translation quality by offering practical solutions to the common challenges faced by contributors, providing them with necessary resources/databases, while also guiding experts who wish to contribute to Turkish Wikipedia or support other editors' translation processes.
Key components of the guideline
- General Principles: In addition to standard Wikipedia editing rules, the guideline emphasizes the importance of using verified terminology, consulting academic sources, and seeking expert input when necessary.
- Translation Protocols: Detailed steps are provided for verifying terminology using trusted sources such as terimler.org, field-specific glossaries (e.g., for physics, mathematics, informatics), and academic databases (e.g., DergiPark, Ulusal Tez Merkezi). When standard sources fall short, translators are encouraged to use machine translation tools carefully and comparatively, and to consult experts through the mailing list.
- Expert Support Request Templates: Contributors are instructed on how to request expert help by email for three types of issues: (1) unclear terminology, (2) ambiguous sentences or paragraphs, and (3) entire texts that require holistic clarification. Each case is supported with structured email templates to ensure clear and effective communication.
- Response and Communication Protocols: Experts are expected to respond to emails within 48 hours. Public replies are encouraged to support transparency and collective learning, while one-on-one replies are acceptable for privacy-sensitive cases.
- Guidance for Experts: Experts are encouraged to join the mail group and offer feedback to Wikipedia contributors on terminology, policy questions, or formatting issues. They are also provided with guidelines on how to collaborate with Wikipedians, as well as resources on best practices and policies of Wikipedia if they want to contribute as Wikipedia editors.
- Frequently Asked Questions: The guideline addresses common challenges, such as handling unfamiliar or ambiguous terminology, navigating stylistic conflicts between plain Turkish and technical precision, and managing communication breakdowns between contributors and experts.
Demo II (May 2025)
To evaluate the effectiveness and usability of the developed translation guideline and expert support mechanism, we organized a second translation event, Demo II, between May 20–28, 2025. Unlike Demo I, this session was conducted asynchronously, with support provided via email over an extended period, allowing participants more flexibility and time to get used to communicating via the mail group. To reach a wider and more diverse audience, the event was promoted through multiple channels, including a Telegram group, the VikiSalı community meeting, the Village Pump discussion forum, and private messages to members of the English-to-Turkish translation group on Wikipedia. Through these efforts, the event announcement reached approximately 1,000 individuals, and 10 Wikipedians (7 experienced, 3 new) were recruited for the translation event. As a prerequisite, interested participants were asked to complete a simple six-question quiz based on the guideline to test whether they had read it. A minimum score of 5 out of 6 was required to participate. All but two respondents scored full marks, with the remaining two scoring 5.
Each of the 10 Wikipedia volunteers was assigned a highly technical English Wikipedia article containing terms that are ambiguous or under-defined in Turkish. They were asked to translate the articles using the guideline and the support tools provided. After the end of the translation period, each participant was given a survey measuring the usability and effectiveness of the guideline using the System Usability Scale (SUS) and was invited to provide comments and feedback on the guideline. Additionally, domain experts were invited to review the final translations for accuracy, clarity, and terminology use, allowing us to assess the end-to-end effectiveness of the support mechanism and guideline in real translation scenarios.
Conclusion
The Demo II translation event demonstrated that the structured guideline and expert collaboration improved the quality of technical translations on Turkish Wikipedia and enhanced the translation experience of the Wikipedians while they translated challenging, technical English articles.
Collaboration between Wikipedians and experts was active and timely, with 31 discussion threads initiated during the translation period and most receiving responses within a day. This highlighted the practicality and responsiveness of the email-based support system.
The guideline scored 83.75 on the System Usability Scale (SUS). This high SUS score (Grade A, the highest possible grade) indicates that the guideline is clear and user-friendly. Feedback from both new and experienced Wikipedians also highlighted that the guideline and collaborative translation significantly eased their translation process and improved their output quality.
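For context, the standard SUS scoring procedure is reproduced below; the example answers are invented for illustration and are not actual survey responses.

```python
def sus_score(responses: list) -> float:
    """Standard SUS scoring for one respondent's ten 1-5 Likert answers:
    odd-numbered items contribute (score - 1), even-numbered items (5 - score),
    and the total is scaled by 2.5 onto a 0-100 range."""
    assert len(responses) == 10
    contributions = [
        (r - 1) if i % 2 == 0 else (5 - r)  # i is 0-based, so even i = odd-numbered item
        for i, r in enumerate(responses)
    ]
    return sum(contributions) * 2.5

print(sus_score([5, 2, 4, 1, 5, 2, 4, 2, 5, 1]))  # 87.5 for this invented respondent
```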
Expert evaluations confirmed the overall success of the translations, with near-excellent scores in accuracy (9/10), fluency (8.9/10), and terminological correctness (8.8/10). Additionally, the comparable clarity ratings of the original and translated texts suggest that the Turkish translations retained the comprehensibility of the source material almost perfectly, even in highly technical contexts and even though the evaluating experts are more accustomed to reading technical content in English.
In conclusion, Demo II validated that a combination of a well-designed translation guideline and expert-supported collaboration can enhance both the translation experience of volunteer translators and the quality of their outputs. These findings provide a strong foundation for scaling the approach and re-implementing it across other Wikipedia language communities and technical domains. For more detailed results please access the results document here.
Seminar II (May 2025)
We concluded the project with Seminar II on May 28, 2025, where we presented the project’s outcomes, shared the results of the user studies, and introduced the guideline and the mail group to the expert and Wikipedia communities. The presentation topics were mainly the deliverables of the project: the dataset, the annotation, the models, the community and support network (guideline and email group), and future work.
The announcement was made through the Village Pump, the Turkish Wikipedia Telegram group, the WikiTuesday meetings, and the Wikipedian and expert mail groups.
The presentation covered all major deliverables of the project: the construction of an EN-TR terminology dataset, the annotation workflow and quality controls, the baseline evaluation of large language models, as well as the expert and Wikipedian surveys, the announcement mechanisms, and the development of community mechanisms such as the expert–Wikipedian email group and the translation guideline. The results and findings of the user studies (Demo I and Demo II) and previous meetings (Seminar I and Panel I) were also shared.
Dissemination
- Wiki Workshop 2024: We presented our plan at Wiki Workshop 2024 on June 20, 2024. The extended abstract is available here, and the video of the presentation can be viewed here.
- Annotation Webinar: We hosted an Annotation Webinar on August 17, 2024, to explain the project in detail to potential annotators. The presentation file is available on Wikimedia Commons here, and a photo taken during the webinar can be found here.
- Wikimedia CEE Meeting 2024: We shared our progress, including annotation and survey results, at the Wikimedia CEE Meeting 2024 on September 21, 2024. More details about the presentation are available here, and a photo taken during the presentation is on Wikimedia Commons here.
- Seminar I: We held Seminar I on October 19, 2024, to introduce the project and present annotation and survey results. The presentation file can be accessed on Wikimedia Commons here, and a photo taken during the seminar is available here.
- Panel I: We organized a panel bringing together 25 domain experts (such as scientists and engineers) and Wikipedians on December 2, 2024. The session focused on evaluating participants’ areas of expertise, their experience with translation, and their involvement with Wikipedia. Specific challenges in term-focused translation processes were discussed, highlighting issues faced by both sides. The conversation also explored ways to improve the efficiency of these processes and considered approaches to building a sustainable channel for ongoing communication between expert academics and the Wikipedia community. A photo taken during the panel is available here; recording of the panel can be accessed here (in Turkish).
- Demo I: On February 25, 2025, we hosted a small-scale wiki-marathon focused on fostering collaborative translation between Wikipedians and domain experts. Participants were grouped according to their areas of specialization, with each group consisting of 3–4 experts and one experienced Wikipedian. This setup was intentional, allowing Wikipedians to compare the translation process with and without expert input. Each group received a selection of scientific English Wikipedia articles—without existing Turkish translations—relevant to their expertise, and they were tasked with translating as much content as possible within one hour. Following the session, a survey was distributed to participants. The results indicated that the collaborative translation experience was overwhelmingly positive for both sides, with the majority expressing interest in continuing such partnerships. A photo taken during the demo is available here.
- Demo II: Between May 20–28, 2025, we organized an asynchronous collaborative translation event to test the usability and impact of the translation guideline and expert support system. 10 Wikipedians translated complex technical articles using the guideline, with expert assistance provided via a mailing list. The collaboration resulted in 31 discussion threads, with most expert responses delivered within a day. Expert reviews of the translations showed high scores in accuracy, fluency, and terminological correctness. The guideline received a System Usability Scale score of 83.75 (Grade A), confirming its effectiveness and clarity.
- Seminar II: On May 28, 2025, we presented the final outcomes of the project, including datasets, models, community structures (guideline and mailing list), and results from user studies (Demo I and II). The seminar was promoted via the Village Pump, WikiTuesday meetings, and expert/Wikipedian mail groups. Presentation topics included dataset creation, annotation workflow, LLM evaluation, expert-Wikipedian collaboration, and future plans for scaling and sustaining the initiative.
Conclusion and Future Work
We introduced the first terminologically annotated English–Turkish corpus that spans Mathematics, Physics, and Computer Science, pairs 3,300 sentences, and links more than 10,000 technical terms with expert oversight. A carefully engineered annotation workflow, combining automatic filters, tiered quizzes, and live quality controls, produced high inter-annotator agreement while remaining cost-effective and compensating annotators fairly.
Baseline experiments confirm the importance of both external terminological resources and bilingual context: models granted terimler.org access and full EN-TR input consistently outperform variants that lack one of these ingredients. Nevertheless, the performance gap between even the best model and human annotators underscores that terminology translation remains far from solved.
Our dataset and evaluation suite lay a reproducible foundation for future work. Promising directions include fine-tuning or instruction-tuning language models on the released corpus, integrating retrieval-augmented generation with richer Turkish glossaries, and extending the annotation protocol to additional domains (e.g., Biology, Chemistry) or other low-resource languages. We hope this effort accelerates progress toward machine translation systems that respect the precision and nuance demanded by scientific discourse.
From the community perspective, we identified and reached out to thousands of experts and hundreds of Wikipedians. Through seminars, social media, weekly gatherings, and emails, we established an active email group with 35 members from both communities. Through live translation events with the participation of both sides, we jointly designed a translation guideline and demonstrated its usefulness through SUS scores and expert evaluation of the translated articles.
References
- ↑ Eva Hasler, Adrià De Gispert, Gonzalo Iglesias, Bill Byrne (2018). Neural Machine Translation Decoding with Terminology Constraints. arXiv preprint arXiv:1805.03750.
- ↑ Miriam Exel, Bianka Buschbeck, Lauritz Brandt, Simona Doneva (2020). Terminology-Constrained Neural Machine Translation at SAP. EAMT.
- ↑ Nikolay Bogoychev, Pinzhen Chen (2023). Terminology-Aware Translation with Constrained Decoding and Large Language Model Prompting. WMT.
- ↑ Kirill Semenov, Vilém Zouhar, Tom Kocmi, Dongdong Zhang, Wangchunshu Zhou, Yuchen Eleanor Jiang (2023). Findings of the WMT 2023 Shared Task on Machine Translation with Terminologies. WMT.
- ↑ Zheng Li, Mao Zheng, Mingyang Song, Wenjie Yang (2025). TAT-R1: Terminology-Aware Translation with Reinforcement Learning and Word Alignment. arXiv preprint arXiv:2505.21172.
- ↑ Jiarui Liu, Iman Ouzzani, Wenkai Li, Lechen Zhang, Tianyue Ou, Houda Bouamor, Zhijing Jin, Mona Diab (2024). Towards Global AI Inclusivity: A Large-Scale Multilingual Terminology Dataset (GIST). arXiv preprint arXiv:2412.18367.
- ↑ Hongyuan Lu, Haoran Yang, Haoyang Huang, Dongdong Zhang, Wai Lam, Furu Wei (2024). Chain-of-Dictionary Prompting Elicits Translation in Large Language Models. EMNLP 2024.
- ↑ https://www2.statmt.org/wmt25/terminology.html
- ↑ Huaao Zhang, Qiang Wang, Bo Qin, Zelin Shi, Haibo Wang, Ming Chen (2023). Understanding and Improving the Robustness of Terminology Constraints in Neural Machine Translation. ACL 2023.
- ↑ https://www.researchgate.net/publication/358240882_Terminological_Quality_Evaluation_in_Turkish_to_English_Corpus-Based_Machine_Translation_in_Medical_Domain
- ↑ https://www.clarin.si/repository/xmlui/handle/11356/1816
- ↑ Kirill Semenov, Vilém Zouhar, Tom Kocmi, Dongdong Zhang, Wangchunshu Zhou, Yuchen Eleanor Jiang (2023). Findings of the WMT 2023 Shared Task on Machine Translation with Terminologies. WMT.
- ↑ Jiarui Liu, Iman Ouzzani, Wenkai Li, Lechen Zhang, Tianyue Ou, Houda Bouamor, Zhijing Jin, Mona Diab (2024). Towards Global AI Inclusivity: A Large-Scale Multilingual Terminology Dataset (GIST). arXiv preprint arXiv:2412.18367.
- ↑ Ramakrishna Appicharla, Baban Gain, Santanu Pal, Asif Ekbal (2025). Beyond the Sentence: A Survey on Context-Aware Machine Translation with Large Language Models. arXiv preprint arXiv:2506.07583.
- ↑ Yan Huang, Wei Liu (2024). Evaluating the Translation Performance of Large Language Models Based on Euas-20. arXiv preprint arXiv:2408.03119.
- ↑ Yongjing Yin, Jiali Zeng, Yafu Li, Fandong Meng, Yue Zhang (2024). LexMatcher: Dictionary-centric Data Curation for LLM-based Machine Translation. EMNLP Findings.
- ↑ https://tez.yok.gov.tr/UlusalTezMerkezi