Research:Bridging the Gap Between Wikipedians and Scientists with Terminology-Aware Translation: A Case Study in Turkish

From Meta, a Wikimedia project coordination wiki
Contact
GGLab, Koç University
Duration:  2024-06 – 2025-06
Grant ID: G-RS-2402-15231

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.


This project addresses the gap between the growing volume of English-to-Turkish Wikipedia translations and the insufficient number of contributors, particularly in technical domains. Leveraging domain expertise from a collaborative terminology dictionary effort, we plan to implement a pipeline system to enhance translation quality. Our focus is on bridging the academic and Wikipedia communities, creating datasets, and developing NLP models for terminology identification, terminology linking, and terminology-aware translation. The aim is to foster sustained contributions from both communities and improve the overall quality of Turkish Wikipedia articles.

Methods

Data Curation

We will rely on two resources: the contenttranslation dumps and the thesis abstracts published on the Council of Higher Education (CoHE) website. Contenttranslation is dumped bi-weekly and contains around 400,000 entries, whereas CoHE contains the abstracts of all theses submitted to any Turkish university from 2006 until today, amounting to more than one million sentences. Since the translations in both resources are mostly provided as long paragraphs, we will first align the sentences with highly accurate existing tools such as SentAlign and VecAlign. Then we will perform simple preprocessing, e.g., filtering out URLs, special characters, overly short or long sentences, and content consisting mostly of numbers. Next, we aim to produce 3,000 English-Turkish parallel sentences containing the following: i) English text annotated with the technical terms, ii) links to the correct terminology entries in the database, and iii) edited translations that use the correct Turkish terms.
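
The preprocessing step described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual pipeline: the function names (`clean_sentence`, `keep_pair`, `preprocess`) and all thresholds (`min_tokens`, `max_tokens`, `max_digit_ratio`) are hypothetical, and special-character filtering is left out for brevity.

```python
import re

# Matches http(s) URLs and bare "www." links (a simple, assumed pattern).
URL_RE = re.compile(r"https?://\S+|www\.\S+")


def clean_sentence(text):
    """Remove URLs and collapse runs of whitespace."""
    text = URL_RE.sub(" ", text)
    return " ".join(text.split())


def keep_pair(src, tgt, min_tokens=3, max_tokens=100, max_digit_ratio=0.5):
    """Return True if both sides pass the length and digit-ratio filters.

    All thresholds are illustrative assumptions, not values from the project.
    """
    for sentence in (src, tgt):
        tokens = sentence.split()
        if not (min_tokens <= len(tokens) <= max_tokens):
            return False
        chars = [c for c in sentence if not c.isspace()]
        if not chars:
            return False
        digit_ratio = sum(c.isdigit() for c in chars) / len(chars)
        if digit_ratio > max_digit_ratio:
            return False
    return True


def preprocess(pairs):
    """Clean every aligned (English, Turkish) pair and drop the ones that fail."""
    cleaned = ((clean_sentence(s), clean_sentence(t)) for s, t in pairs)
    return [(s, t) for s, t in cleaned if keep_pair(s, t)]
```

Filtering on both sides of a pair keeps the corpus parallel: if either the English or the Turkish sentence fails a check, the whole pair is dropped.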

Term Identification

To develop a flexible system, we treat term identification as a separate task that can later be integrated with any terminology database. We first plan to split the curated dataset into development and test splits and explore the capacity of existing state-of-the-art methods on the test split. If the performance is below the desired threshold, we will use the development split to further fine-tune a small pretrained model for the span detection task. Since terminology spans will also be detected in the Turkish text, we will explore building both multilingual and monolingual identification models. We will calculate the F1 score considering only exact matches on the lemmatized phrases (e.g., "cats" matches "cat").
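
As a rough illustration of the exact-match F1 on lemmatized phrases, the sketch below compares the set of gold terms with the set of predicted terms for one sentence. The lemmatizer here (`toy_lemmatize`) is a deliberately naive placeholder introduced for this example; a real evaluation would plug in a proper English or Turkish lemmatizer, and the function names are assumptions.

```python
def toy_lemmatize(phrase):
    """Placeholder lemmatizer (assumption): lowercase and strip a plural 's'."""
    words = []
    for word in phrase.lower().split():
        if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
            word = word[:-1]
        words.append(word)
    return " ".join(words)


def exact_match_f1(gold, predicted, lemmatize=toy_lemmatize):
    """Return (precision, recall, F1) over lemmatized term phrases.

    A predicted term counts as correct only if its lemmatized form is
    identical to the lemmatized form of some gold term.
    """
    gold_set = {lemmatize(g) for g in gold}
    pred_set = {lemmatize(p) for p in predicted}
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Note that this sketch compares phrases as sets and ignores span offsets, so a term that occurs twice in a sentence is counted once.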

Term Linking

Using the development split of the curated dataset, we will approach the problem as a retrieval task, using efficient tools like FAISS to index the dictionary and retrieve entries for each contextualized term. To detect missing terms and inform the domain experts, our model will also handle cases where a term is not linkable, simply by querying the database beforehand. We will use R@n (the percentage of cases in which the ground-truth term is among the top n retrieved entries) to evaluate the performance of the models.
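
The R@n metric can be computed from ranked candidate lists as in the minimal sketch below. It assumes that some retriever (for example a FAISS index, which is not shown here) has already produced a ranked list of dictionary entry IDs for each term; the function name and the ID format are hypothetical.

```python
def recall_at_n(ranked_candidates, gold_ids, n):
    """Fraction of queries whose gold entry ID is among the top-n candidates.

    ranked_candidates: one list of entry IDs per query, best match first.
    gold_ids: the ground-truth entry ID for each query, in the same order.
    """
    if len(ranked_candidates) != len(gold_ids):
        raise ValueError("ranked_candidates and gold_ids must have equal length")
    if not gold_ids:
        return 0.0
    hits = sum(
        1
        for candidates, gold in zip(ranked_candidates, gold_ids)
        if gold in candidates[:n]
    )
    return hits / len(gold_ids)
```

Reporting the metric for several values of n (for instance 1, 5 and 10) shows how much a reranking step or a human editor choosing from a short list could still recover.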

Terminology-Aware Translation

We will explore two common approaches: i) post-processing the output, i.e., rewriting the content with lexical constraints (as given in Stage I), and ii) incorporating constraints into the training of the machine translation model. Evaluation is commonly performed in two steps: i) general translation accuracy and ii) terminology consistency. Although human evaluation is also necessary for NLG tasks, the commonly used automatic metrics for MT are BLEU, chrF, BERTScore and COMET. Terminology consistency evaluation is still an active field of research; however, the most common metric is exact-match term accuracy [1]. We plan to report all of these metrics and conduct a small user study to rank the outputs generated by different models.
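
A simplified version of exact-match term accuracy is sketched below: the share of required target-language terms that appear verbatim in the system output. This is an illustrative approximation rather than the exact formulation in [1]. The function name is hypothetical, matching is done on lowercased surface strings, and inflected Turkish forms or the dotted/dotless i distinction are not handled.

```python
def term_accuracy(hypotheses, required_terms):
    """Fraction of required target terms found verbatim in the hypotheses.

    hypotheses: list of translated sentences (system output).
    required_terms: for each sentence, the list of target-language terms
        that the terminology database says should appear in it.
    """
    total = 0
    matched = 0
    for hypothesis, terms in zip(hypotheses, required_terms):
        hypothesis_lower = hypothesis.lower()
        for term in terms:
            total += 1
            # Simple substring match on lowercased text (an assumption).
            if term.lower() in hypothesis_lower:
                matched += 1
    return matched / total if total else 0.0
```

Because Turkish is agglutinative, a surface-level match like this will undercount correct uses of a term that carry suffixes, which is one reason to lemmatize before matching in practice.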

Building Communication Channel

We will work with two distinct communities that do not normally engage with each other: the academics, researchers and students who voluntarily contribute to the terminology database, and the Wikimedia Community User Group Turkey. We first plan to connect with the Wikimedia group, with a particular focus on editors who create STEM content. We will conduct semi-structured interviews with them to gain insights into their profiles and editing habits. Similarly, we will interview members of the academic community about their willingness to contribute. In this initial phase, we will reach as many people as possible through the maintainers' immediate networks, mailing lists, our collaborators from both communities, and the student and university organizations that we have access to. The expected outcome is to identify potential new Wikipedia contributors with domain expertise, as well as existing contributors who are open to feedback or have domain expertise.

Seminar I. We will organize an online seminar within the first three months to introduce the project and present the interview results. During the seminar, we will hold a panel discussion on the best strategies for bridging the two communities (e.g., how to give feedback to and use feedback from one another) to make informed decisions for our research. The expected outcome is an established communication channel, either online (e.g., Discord, Slack) or offline (e.g., a feedback button, email, etc.).

User study I. Using the previously established communication channel, we will recruit around 10 active participants from different groups (both existing and newly recruited contributors) to perform a small-scale user study. They will be shown a source text in which terms are highlighted and hyperlinked by the developed models. We will ask them to translate the content in a contrastive setting (one version showing the links, the other without them) and evaluate the output against the gold standard(s).

User study II. We will conduct another user study with the same participants to gain insights into the generation models. We will show them the output of our rewriting and weakly supervised models along with that of modern tools (e.g., Google Translate, ChatGPT) and ask them to rank those outputs. Then we will compare the ranks of our proposed models against the others. The expected outcome of the user studies is a quantitative measure of the potential usability of the developed models.
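
One simple way to turn the participants' rankings into a quantitative measure is the mean rank per system, sketched below. The aggregation choice, the function name and the system labels in the usage note are assumptions made for illustration; the study may use a different statistic.

```python
from collections import defaultdict


def mean_ranks(rankings):
    """Average rank position of each system (1 = best) over all rankings.

    rankings: list of lists, each ordering system names from best to worst
        as judged by one participant for one source text.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for ranking in rankings:
        for position, system in enumerate(ranking, start=1):
            totals[system] += position
            counts[system] += 1
    return {system: totals[system] / counts[system] for system in totals}
```

For example, `mean_ranks([["ours", "mt_a", "mt_b"], ["mt_a", "ours", "mt_b"]])` gives 1.5 for both `ours` and `mt_a` and 3.0 for `mt_b`; a lower mean rank indicates a system that participants preferred more often.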

Seminar II. We will organize another online seminar to discuss the project's progress and the user study results. We will hold a panel discussion on how to design training materials that help editors translate technical terms more accurately using the developed models.

Timeline

  • June 1: Project Kickoff
  • June 1 – July 30: Preparation for Annotation (Corpus Preprocessing, Evaluation Design, Annotation Guidelines)
  • July 31: Annotation Announcement
  • August 17 – August 20: Selection and Training of Annotators
  • August 25 – September 1: Annotation
  • August 5 – September 5: Wikipedians and Domain Experts Survey
  • September 1 – October 31: Annotated High-Quality Terminology-Aware Translation Dataset
  • October 19: Seminar I
  • November 1 – December 30: Term Identification and Term Linking Models
  • January 1 – March 30: Translation Models
  • April 1 – April 30: User Evaluation of Proposed Models
  • May 1 – May 30: Writing and Dissemination
  • May 30: Seminar II


Dissemination

  • Wiki Workshop 2024: We presented our plan at Wiki Workshop 2024 on June 20, 2024. The extended abstract is available here, and the video of the presentation can be viewed here.
  • Annotation Webinar: We hosted an Annotation Webinar on August 17, 2024, to explain the project in detail to potential annotators. The presentation file is available on Wikimedia Commons here, and a photo taken during the webinar can be found here.
  • Wikimedia CEE Meeting 2024: We shared our progress, including annotation and survey results, at the Wikimedia CEE Meeting 2024 on September 21, 2024. More details about the presentation are available here, and a photo taken during the presentation is on Wikimedia Commons here.
  • Seminar I: We held Seminar I on October 19, 2024, to introduce the project and present annotation and survey results. The presentation file can be accessed on Wikimedia Commons here, and a photo taken during the seminar is available here.

References

  1. Anastasopoulos, A., Besacier, L., Cross, J., Gallé, M., Koehn, P., & Nikoulina, V. (2021). On the evaluation of machine translation for terminology consistency. arXiv preprint arXiv:2106.11891.