Research:Lost Without Translation: Estimation and Implications of Invisible Languages on Wikipedia

Created

14:27, 30 October 2024 (UTC)

Contact

Saurabh Khanna

University of Amsterdam

Collaborators

Olga Eisele

University of Amsterdam

Paula Helm

University of Amsterdam

Katrin Schulz

University of Amsterdam

John Willinsky

Public Knowledge Project

Ariel Koren

Respond Crisis Translation

Duration: 2024-10 – 2026-12

Keywords: knowledge gaps, invisible languages, multilingualism

Research:Projects

This page documents a proposed research project.
Information may be incomplete and may change before the project starts.

The Lost Without Translation research project addresses linguistic inequality in the digital age. Despite the vast amount of information available online, systemic biases favor dominant languages and render content in underrepresented languages effectively invisible. This invisibility limits access to information, exacerbates socio-economic disparities, and threatens cultural diversity.

By collaborating with the Wikimedia Foundation, we seek to develop metrics for assessing language gaps on Wikimedia platforms, create tools to enhance the visibility of underrepresented languages, and promote inclusive digital practices. The output of this project will provide tangible benefits to the Wikimedia community in the form of data, software, and web services, including an open-source interactive dashboard that visualizes the state of languages across Wikimedia projects. Our research will deepen the understanding of how linguistic inequalities manifest on Wikimedia platforms and inform strategies to promote linguistic diversity and equity.

Methods

Our research methodology combines theoretical development, empirical analysis, and practical interventions:

Data Collection and Analysis: We will collect multilingual datasets from Wikimedia projects, focusing on language editions, content volume, and user engagement metrics. Advanced machine learning techniques, including multimodal embeddings, will be employed to process and analyze textual, auditory, and visual data. This will help us identify patterns of linguistic underrepresentation and content invisibility.

Metric Development: In collaboration with the Wikimedia Foundation, we will develop metrics to assess the state of languages on Wikimedia platforms, understand the dynamics of the Wikimedia Incubator, and measure knowledge gaps. These metrics will provide insights into how different languages are represented and accessed on Wikimedia projects.

Interactive Dashboard Creation: We will develop an open-source, publicly accessible dashboard that dynamically evaluates and visualizes invisible languages across Wikimedia projects. The dashboard will display real-time data on language usage, content availability, and user engagement, serving as a tool for researchers, policymakers, and the Wikimedia community.

Community Engagement: While our project does not involve recruiting Wikimedia editors for surveys or interviews, we will engage with the Wikimedia community through discussion pages and forums to gather feedback on our tools and findings. This collaborative approach ensures that our work aligns with community needs and values.

Note: Because we are not conducting surveys or interviews with Wikimedia editors, consent forms and recruitment procedures do not apply to this project.
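To make the data collection and metric steps above more concrete, the following is a minimal sketch, not the project's actual pipeline. It assumes article counts are read from the MediaWiki Action API (`action=query&meta=siteinfo&siprop=statistics`) and uses a Gini coefficient over per-edition article counts as one possible inequality measure. The function names, the choice of the Gini coefficient, and the sample numbers are illustrative assumptions, not taken from the project.

```python
from urllib.parse import urlencode


def siteinfo_url(lang: str) -> str:
    """Build a MediaWiki Action API URL that requests site statistics
    for one Wikipedia language edition (e.g. "sw" for Swahili)."""
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "statistics",
        "format": "json",
    }
    return f"https://{lang}.wikipedia.org/w/api.php?{urlencode(params)}"


def article_count(payload: dict) -> int:
    """Extract the article count from a parsed siteinfo statistics response."""
    return int(payload["query"]["statistics"]["articles"])


def gini(values) -> float:
    """Gini coefficient of non-negative values.

    0.0 means content is spread evenly across editions; values close to
    1.0 mean content is concentrated in a few editions.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending-sorted values (ranks start at 1).
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n


if __name__ == "__main__":
    # Placeholder responses shaped like the API output; the counts are
    # hypothetical toy numbers, NOT real Wikipedia statistics.
    sample_payloads = {
        "aa": {"query": {"statistics": {"articles": 1000}}},
        "bb": {"query": {"statistics": {"articles": 50}}},
        "cc": {"query": {"statistics": {"articles": 5}}},
    }
    counts = [article_count(p) for p in sample_payloads.values()]
    print(siteinfo_url("sw"))
    print(f"Gini over placeholder article counts: {gini(counts):.3f}")
```

In practice each URL would be fetched with an HTTP client that respects Wikimedia's API etiquette (a descriptive User-Agent and rate limiting). A raw article count is only one signal; a metric that also normalizes by speaker population would need an external data source that this sketch does not cover.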

Timeline

Months 1-6: Literature review and theoretical framework development; initial data collection from Wikimedia projects.

Months 7-12: Development of metrics for assessing language gaps; preliminary data analysis.

Months 13-18: Creation and testing of the interactive dashboard; iterative improvements based on community feedback.

Months 19-24: Final data analysis; preparation of reports and scholarly publications; dissemination of findings to the Wikimedia community and broader audiences.

Policy, Ethics and Human Subjects Research

We are committed to conducting our research ethically and in compliance with Wikimedia policies:

Non-Disruption of Wikipedians' Work: We will not interfere with the activities of Wikipedia editors or disrupt the normal functioning of Wikimedia projects. Our data collection will utilize publicly available information without engaging directly with individual contributors.

Data Privacy and Anonymity: All data used will be aggregated and anonymized, ensuring that no personal or sensitive information is disclosed. We will adhere to Wikimedia's privacy policies and guidelines on data use.

Ethical Approval: Because our study analyzes publicly available data and does not involve human subjects research requiring consent, institutional ethical approval is not necessary. We will nonetheless apply the ethical safeguards described in this section throughout the project.

Compliance with Wikimedia Policies: We will follow all relevant Wikimedia policies, including those related to research, data use, and community engagement.

Results

To be completed when the project concludes.

We anticipate that our project will uncover significant insights into the extent and impact of linguistic invisibility on Wikimedia platforms. The findings will inform strategies to enhance the visibility of underrepresented languages, contributing to more equitable access to knowledge. Preliminary data and results will be shared with the Wikimedia community as they become available, fostering collaboration and feedback.

Resources

Lost in Translation Homepage

The (In)visible Lab Homepage

GitHub repository with open-source code
