Research:Wiki Content Translation Tool Translation Project
This project was undertaken for the 2019 Summer Outreachy internship program. Initial Outreachy submission: https://paws-public.wmflabs.org/paws-public/User:Doriszhou1224/Qualitative%20T218003.ipynb It is an exploratory research project that sought to investigate the translation habits and patterns of Wikipedia editors, particularly their use of the CX translation tool. I am very grateful for the mentorship of Isaac Johnson and Jonathon Morgan.
Wikipedia has articles available in over 280 languages. Each language contains a unique corpus of articles, of varying quality. In order to make information more accessible, the Content Translation tool, abbreviated CX, was created to make the translation process easier. This project focused on observing the editing patterns and habits of users after they create an article for translation using the CX tool.
A portion of this project included figuring out what questions were meaningful to investigate. As such, the project was split into several phases, with each phase informing the next phase. The main approach that was taken can be divided into three main steps.
- Literature review
- Understanding the data (Qualitative Analysis)
- Attempt at organizing the data
Phase 1: Literature Review
The goal of this stage was to gain a more well-rounded understanding of prior research on the Wikipedia community and on machine learning techniques for language analysis. The notes and the articles that I read can be found in this PAWS notebook: https://paws-public.wmflabs.org/paws-public/User:Doriszhou1224/Notes.ipynb Of particular interest were the creation of the CX tool itself and the richness of Wikipedia articles across different languages, including their relationships with each other and with English Wikipedia.
Phase 2: Qualitative Analysis
The next step was to sift through the available data, e.g., going through dump files and looking at the edit history of translated articles. The goal of this step was to learn whether editors have preferences when they translate articles. The main questions asked were: are certain topics chosen more frequently? Are there editing patterns, for example, did editors prefer to add the main content in the first few edits, or did they prefer to add information slowly in later edits? Because I am familiar with Mandarin and French, I concentrated on articles that were translated after 2016 using the CX translation tool from English to Chinese and from English to French. More importantly, the goal of this stage was to articulate more precise questions about the translation habits of editors and thus find a clearer direction for this research project. This PAWS notebook has more details: https://paws-public.wmflabs.org/paws/user/Doriszhou1224/notebooks/DataNotebook1.ipynb This document contains notes on selected articles from the above notebook: https://docs.google.com/spreadsheets/d/1Z96LSDSv1A5u0IELzg04Dcoz9ik137VmklX-CbwAzoA/edit?usp=sharing
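One way to make the question about early versus later edits concrete is to ask how much of an article's final size is already present after the first few revisions. The following is a minimal sketch of that idea, not the method used in the notebook: the function name `early_edit_share`, the cutoff `first_n`, and the revision sizes are all made up for illustration.

```python
def early_edit_share(sizes, first_n=3):
    """Fraction of the final article size already reached after `first_n` revisions.

    `sizes` lists the article size in bytes after each revision, oldest first.
    The cutoff `first_n` is an arbitrary illustrative choice.
    """
    if not sizes:
        raise ValueError("need at least one revision size")
    final_size = sizes[-1]
    if final_size == 0:
        return 0.0
    # If the article has fewer than `first_n` revisions, use the last one.
    size_after_early = sizes[min(first_n, len(sizes)) - 1]
    return size_after_early / final_size


# Made-up revision sizes (bytes) for one hypothetical translated article.
sizes = [4000, 4600, 4800, 4850, 5000]
print(early_edit_share(sizes, first_n=3))  # 0.96
```

A value close to 1 suggests the bulk of the content arrived in the first edits. The ratio can exceed 1 if later edits removed text, so it is only a rough indicator.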
Phase 3: Organizing the Data
From the previous stage, I gained a bigger-picture understanding of the editing landscape for translated articles. The main questions formulated were: are certain sections translated more frequently than others? What are the differences between early edits and later edits? Are articles with content related to the culture of the target language translated more frequently, or are they of higher quality? To answer the first question, we wrote a section title comparison function that sought to take in a large list (such as > 1000) of English-French article title pairs and output the section titles that were translated into French. To answer the latter two questions, I hand-picked collections of articles translated from English to Chinese and from English to French and recorded notes and observations.
- Contrary to my initial Outreachy submission, after looking at significantly more French translated articles, I found it was not necessarily true that an article more related to French culture will have a better translation effort. What gets translated and how well depends largely on the editors' interests. As such, articles with content that editors deem to fall under a Wiki Project are more frequently translated.
- However, it does appear that content related to a culture is more likely to be translated into that culture's language; it just may not be of higher quality.
- Although it remains true that biographies are the most common category of translated articles for French, upon further inspection, I realized that not all biographical articles existed as standalone articles; many were part of a "batch" of articles about people who share common traits.
- The same editor who creates the translation often does the bulk of the article writing, and the form of the article is usually set in the first few edits. Later editors add or fix links and categories and correct grammar and typos.
- The quality and completeness of the article depend in large part on the one or two "main" editors, defined as people who have made more than three edits on the article and have made content changes (so more than fixing links or adding categories).
- I frequently observed that both French and Chinese translated articles come in "batches", namely, at least two articles with similar content that were created very close in time to each other and share common editors.
- It is more common than not for translated articles to share the same format as their English source article. As for why, further analysis would need to be done. One conjecture is that it may be due to the CX Tool, because the tool makes it very easy to keep the same structure of the source article. If there are differences in organization, these changes happen in later edits by different editors, but the same editor may also make structural changes.
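The definition of a "main" editor given above (more than three edits on the article, including content changes) can be sketched in a few lines. This is only an illustration under stated assumptions: the dictionary keys `user` and `content_change` are hypothetical, and whether an edit counts as a content change is assumed to have been labeled by hand, as it was in the qualitative notes.

```python
from collections import Counter


def main_editors(revisions, min_edits=3):
    """Return editors with more than `min_edits` edits who changed content at least once.

    Each revision is a dict with two hypothetical keys:
      "user": the editor's name
      "content_change": True if the edit changed article text, False if it
                        only touched links, categories, or similar
    """
    edit_counts = Counter(rev["user"] for rev in revisions)
    content_editors = {rev["user"] for rev in revisions if rev["content_change"]}
    return sorted(
        user
        for user, count in edit_counts.items()
        if count > min_edits and user in content_editors
    )


# Made-up edit history for one hypothetical translated article.
history = (
    [{"user": "EditorA", "content_change": True}] * 4      # translator, 4 content edits
    + [{"user": "EditorB", "content_change": False}]       # one link fix
    + [{"user": "EditorC", "content_change": False}] * 5   # many category-only edits
)
print(main_editors(history))  # ['EditorA']
```

EditorC is excluded despite having many edits, because none of them changed content.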
A remark: it has been very difficult to come up with a coherent and consistent definition of a "successful" translation. At a minimum, it seems that we want the translated article to be complete and to contain all the information in the source article. However, does this necessarily qualify it as a truly successful translation? For instance, for current events that change rapidly, such as government elections, a useful article should be updated as new events occur. However, this would be difficult to detect or to cross-validate against the source article, especially if the source itself is not updated fast enough. For the purposes of this project, I have decided to use completeness as the main criterion for deciding whether a translated article is of good quality.
For the section title comparison function, we used a number of methods to match each English section title with the corresponding French section title. We measured the "rank" of the section title: 1 for a main section, 2 for a subsection, etc. We also measured the "line number" of the section title: if all the section titles for a given article were put in a list, this measure would be the index of the title in that list. Our methods were not perfect and do not always find the translated French section title, even when it exists. Thus, this function most likely underestimates the number of matches between the English and French section titles. However, running this function on 500 article pairs shows that the section titles and their positions in the source and translated articles are quite similar overall. No section titles change consistently. Translated French articles tend to follow the outline of the English source article.
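The "rank" and "line number" measures can be illustrated with a short sketch. It is not the project's actual function: it assumes the input is raw wikitext in which `== Title ==` marks a main section, and the helper names `section_outline` and `pair_by_position` are hypothetical. Pairing headings purely by position is a deliberately naive stand-in for the matching methods mentioned above.

```python
import re

# Wikitext headings: "== Title ==" is a main section, "=== Title ===" a subsection.
HEADING_RE = re.compile(r"^(={2,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)


def section_outline(wikitext):
    """Return (title, rank, index) for each heading found in `wikitext`.

    rank:  1 for a main section, 2 for a subsection, and so on
    index: position of the heading in the list of all headings ("line number")
    """
    outline = []
    for index, match in enumerate(HEADING_RE.finditer(wikitext)):
        rank = len(match.group(1)) - 1  # "==" -> 1, "===" -> 2
        outline.append((match.group(2), rank, index))
    return outline


def pair_by_position(source_outline, target_outline):
    """Pair headings that sit at the same index and have the same rank.

    This positional pairing is an assumption made for illustration only.
    """
    return [
        (src_title, tgt_title)
        for (src_title, src_rank, _), (tgt_title, tgt_rank, _)
        in zip(source_outline, target_outline)
        if src_rank == tgt_rank
    ]


# Made-up wikitext for a hypothetical English-French article pair.
english = "Intro.\n== History ==\n=== Early years ===\n== References ==\n"
french = "Intro.\n== Histoire ==\n=== Premières années ===\n== Références ==\n"

print(section_outline(english))
print(pair_by_position(section_outline(english), section_outline(french)))
```

Positional pairing breaks down as soon as the translated article adds, drops, or reorders a section, which is one reason a matching function of this kind tends to underestimate the true number of matches.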
Other: Experimentation with Semantics
Out of interest, I attempted to adapt the code of this project: https://github.com/diyiy/Wiki_Semantic_Intention, "Identifying Edit Intentions from Revisions in Wikipedia", so that it could also predict the edit intention for French article edits. This was not successful, as the code always returns "1" for "fact-update" for all edits. However, for anyone who may be interested, I am providing the folder with the changed code: https://drive.google.com/drive/folders/1zxwe7FbsY0_bEmQhEd2gaTjmk_kHfOUU?usp=sharing
- Additional work could be done to more rigorously find patterns that are common to translated articles. A better method or algorithm for matching the English titles to French titles would be helpful, as would a more coherent framework for defining a successful translation.
- Exploration of the articles that were found to be organized differently from, or to be of higher quality than, the source article. I could not find enough such articles to extrapolate any reasons, for they came from a variety of content backgrounds, e.g. autobiography and geography. I considered such articles to be outliers, but they may be more common.
- It would be interesting to see what articles are translated into other languages without the CX tool.
- Laxstrom, Niklas; P. Giner; S. Thottingal (Jun 2015). "Content Translation: Computer-Assisted Translation Tool for Wikipedia Articles". arXiv:1506.01914.