Research:Can Machine Translation Improve Knowledge Equity? A Large-scale Study of Wikipedias across more than 300 language editions

Contact

Kai Zhu

Bocconi University

Duration: August 2022 – December 2023

This page documents a completed research project.

Introduction

This research examines how advanced machine translation (MT) technologies affect multilingual knowledge dissemination, focusing on the integration of Google Translate into Wikipedia's Content Translation tool in 2019. The study explores how this integration influences content creation across Wikipedia's language editions and what role it plays in addressing the digital language divide. By analyzing data from hundreds of language editions, the research aims to show how machine translation affects the quantity and quality of content on one of the world's largest knowledge platforms.

Literature Review

Extensive prior research has documented knowledge gaps on Wikipedia along dimensions such as gender, geography, and topic (see Halavais & Lackaff, 2008 ^[1]; Graham et al., 2014 ^[2]; Wagner et al., 2015 ^[3], 2016 ^[4]). Organized editing campaigns can improve specific articles but often do not increase overall visibility (see Zhu et al., 2020 ^[5]; Langrock & González-Bailón, 2020 ^[6]). Relatively few studies have analyzed linguistic knowledge gaps (see Miquel-Ribé & Laniado, 2020 ^[7]). Bridging knowledge gaps across languages is a strategic priority for Wikipedia (see Redi et al., 2020 ^[8]).

Methods and Data

The study uses a natural experiment design, treating the 2019 integration of Google Translate into Wikipedia's Content Translation tool as the intervention. Data were collected from Wikipedia's language editions and cover millions of articles. The analysis measures changes in article creation rates, translation volumes, and article deletion ratios before and after the integration. Statistical techniques, including difference-in-differences (DID) analysis, were used to isolate the effect of the integration from other factors that influence Wikipedia content creation and quality.
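To make the DID logic concrete, the following is a minimal sketch of a two-group, two-period estimator. It is not the study's actual specification, which is described in the full manuscript. The function name `did_estimate`, the "treated"/"control" labels, and all numbers in `toy_rows` are hypothetical and made up for illustration.

```python
from statistics import mean


def did_estimate(rows):
    """Two-by-two difference-in-differences on (group, period, outcome) rows.

    group is "treated" or "control"; period is "pre" or "post".
    Returns (treated post - treated pre) - (control post - control pre).
    """
    cells = {}
    for group, period, outcome in rows:
        cells.setdefault((group, period), []).append(outcome)
    avg = {key: mean(values) for key, values in cells.items()}
    treated_change = avg[("treated", "post")] - avg[("treated", "pre")]
    control_change = avg[("control", "post")] - avg[("control", "pre")]
    return treated_change - control_change


# Hypothetical monthly translation counts; NOT data from the study.
toy_rows = [
    ("treated", "pre", 100), ("treated", "pre", 120),
    ("treated", "post", 190), ("treated", "post", 210),
    ("control", "pre", 80), ("control", "pre", 100),
    ("control", "post", 100), ("control", "post", 120),
]

effect = did_estimate(toy_rows)
print(effect)
```

In this toy data the treated group rises by 90 and the comparison group by 20, so the estimate is 70. Subtracting the comparison group's change removes any trend shared by both groups, which is the step that separates the intervention's effect from background growth.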

Results

The study finds that integrating Google Translate into Wikipedia's Content Translation system had a significant impact. Translation volume increased by an average of 71 additional articles per language each month, driven primarily by heightened activity and efficiency among editors and by an expanded editor base. The analysis shows a growing reliance on English as the primary source language and an increase in the diversity of target languages. However, the gains vary across languages: high- and medium-resource languages benefit more than lower-resource ones.

The impact of machine translation also varies across knowledge domains. Culture and Geography experienced the most substantial increases. There is evidence of enhanced representation in biography and geography articles, indicating progress in narrowing gender and geographic gaps. However, gaps persist because of pre-existing content imbalances, which shows both the capacity of machine translation to promote diversity and its limitations.

The study underscores the potential benefits and challenges of integrating AI technologies like machine translation in decentralized knowledge ecosystems. It emphasizes the gains in efficiency and accessibility, illustrating technology's role in breaking down systemic barriers and enhancing representation. Yet, it also draws attention to ongoing challenges in content translation, rooted in existing social and economic inequalities, underscoring the need for addressing both technical and social aspects to maximize the potential of such technologies.

More Information

The full manuscript of the study is available on SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4708614 ^[9]

[1] Halavais, A., & Lackaff, D. (2008). An analysis of topical coverage of Wikipedia. Journal of computer-mediated communication, 13(2), 429-440.

[2] Graham, M., Hogan, B., Straumann, R. K., & Medhat, A. (2014). Uneven geographies of user-generated information: Patterns of increasing informational poverty. Annals of the Association of American Geographers, 104(4), 746-764.

[3] Wagner, C., Garcia, D., Jadidi, M., & Strohmaier, M. (2015). It’s a man's Wikipedia? Assessing gender inequality in an online encyclopedia. Ninth International AAAI Conference on Web and Social Media. https://www.aaai.org/ocs/index.php/ICWSM/ICWSM15/paper/viewPaper/10585

[4] Wagner, C., Graells-Garrido, E., Garcia, D., & Menczer, F. (2016). Women through the glass ceiling: gender asymmetries in Wikipedia. EPJ Data Science, 5(1), 1–24. https://doi.org/10.1140/epjds/s13688-016-0066-4

[5] Zhu, K., Walker, D., & Muchnik, L. (2020). Content Growth and Attention Contagion in Information Networks: Addressing Information Poverty on Wikipedia. Information Systems Research, 31(2), 491–509. https://doi.org/10.1287/isre.2019.0899

[6] Langrock, I., & González-Bailón, S. (2020). The Gender Divide in Wikipedia: A Computational Approach to Assessing the Impact of Two Feminist Interventions. https://doi.org/10.2139/ssrn.3739176

[7] Miquel-Ribé, M., & Laniado, D. (2020, August). The Wikipedia diversity observatory: A project to identify and bridge content gaps in Wikipedia. In Proceedings of the 16th International Symposium on Open Collaboration (pp. 1-4).

[8] Redi, M., Gerlach, M., Johnson, I., Morgan, J., & Zia, L. (2020). A taxonomy of knowledge gaps for wikimedia projects (second draft). arXiv preprint arXiv:2008.12314.

[9] Zhu, K., & Walker, D. (2024). The Promise and Pitfalls of AI Technology in Bridging Digital Language Divide: Insights from Machine Translation on Wikipedia. Available at SSRN. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4708614
