Project Grants This project is funded by a Project Grant

Report accepted
This midpoint report for a Project Grant approved in FY 2018-19 has been reviewed and accepted by the Wikimedia Foundation.
  To read the approved grant submission describing the plan for this project, please visit Grants:Project/WCDO/Culture Gap Monthly Monitoring.
  • You may still review or add to the discussion about this report on its talk page.
  • You are welcome to email projectgrants(_AT_) at any time if you have questions or concerns about this report.


This project reaches its midpoint with a more mature approach to the problem of cultural diversity in Wikipedia. It aims to collect both the articles of local content related to each language edition (named CCC) and those that have not been created yet (Missing CCC).

The first phase is dedicated to creating more visualizations and tools; the second to disseminating results across communities, academia and the general reader; and the third to researching Wikipedia project gaps in order to propose more languages.

During this first half of the project, different tasks have been accomplished in each of the phases. In the first phase, we created new types of article lists and wrote the code to analyze editors. In the second phase, we published two papers, one at an academic conference and one in a book dedicated to Wikipedia, and participated in two Wikimedia conferences. In the third phase, we obtained data on all the 7000 languages in the world in order to study which are missing from Wikipedia and would be a priority to engage. Likewise, we updated Wikidata.

Methods and activities

This project is organized into 3 phases. Even though it makes sense to think of the three phases as sequential, they can be tackled independently. Given that we had to face some technical difficulties and, at the same time, some new opportunities for dissemination appeared, the development order of the phases has been altered. We are pleased to report that about half of the tasks in Phases 1 and 2 have been completed and that Phase 3 is entirely finished. The code developed for Phases 1 and 3 is available on GitHub.

Below is an overview of the most relevant methods and activities carried out during this half of the project:

Phase 1: Development (Visualizations and Tools)

  • We had to rewrite almost half of the code to use the dumps instead of the database replicas to obtain the data for our datasets, and the solution is now much more reliable. During May and June we expected the problems with the replicas (which worked fine last year) to disappear. Since they did not, we needed a new approach. It does not, however, solve the issue for revision data (number of edits and editors).
  • We created new Top CCC article lists on several topics (folk, monuments, earth, music creations and organizations, sports and teams, food, paintings, GLAM, books, clothing and fashion, and industry). This was not planned for the project, but we considered it worthwhile, as it helps many Wikimedia initiatives to think about cultural diversity.
  • We created the interface to retrieve Missing CCC articles (those about the local content of a language that do not exist in that language edition but do exist in bigger ones).
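The Missing CCC definition above can be illustrated with a minimal sketch. It is not the project's actual code: the function name `missing_ccc`, the in-memory dictionaries and the item identifiers are all hypothetical, and a real run would read these sets from the CCC databases.

```python
from typing import Dict, Set


def missing_ccc(
    ccc_items: Set[str],
    articles_by_language: Dict[str, Set[str]],
    target: str,
) -> Dict[str, Set[str]]:
    """Map each CCC item missing from `target` to the editions that do cover it."""
    existing = articles_by_language.get(target, set())
    missing: Dict[str, Set[str]] = {}
    for item in ccc_items - existing:
        covered_in = {
            lang
            for lang, items in articles_by_language.items()
            if lang != target and item in items
        }
        if covered_in:  # keep only items that already exist in another edition
            missing[item] = covered_in
    return missing


# Hypothetical identifiers standing in for Wikidata items (assumption).
ccc_of_ca = {"item_a", "item_b", "item_c"}
articles = {
    "ca": {"item_a"},
    "en": {"item_a", "item_b"},
    "es": {"item_b"},
}
result = missing_ccc(ccc_of_ca, articles, "ca")
```

Here `item_b` is reported as missing because it belongs to the local content of the target language and exists in other editions, while `item_c` is left out because no edition covers it yet.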

Phase 2: Dissemination across communities, Academia and general reader

  • We presented the WCDO project at the Seminario DigiDoc abril 2019 (Universitat Pompeu Fabra, Barcelona, Catalonia) as “Wikipedia Cultural Diversity Observatory: un caso de aplicación práctica del análisis de datos para mejorar la diversidad cultural en la Wikipedia” (slides in Commons).
  • We released the CCC Dataset publicly for the research community and presented it at the ICWSM conference, Munich, June 11-13 (Program). Reference: Miquel-Ribé, M., & Laniado, D. (2019). Wikipedia Cultural Diversity Dataset: A Complete Cartography for 300 Language Editions. Proceedings of the 13th International AAAI Conference on Web and Social Media (pdf). ICWSM. ACM.
  • Marc has been contributing to the Diversity Working Group with recommendations aimed at expanding the horizons of the observatory and addressing other issues in the diversity area.
  • We studied different ways to promote cultural diversity across the Wikimedia movement and wrote a document about a “Cultural Diversity Maturity Model” for communities. This became a poster.

We measured the language gap in geolocated articles to evaluate the impact of Wikimania 2018 on the creation of geolocated articles in Africa. It was presented in the Wikimania Diversity Track and will be published as an online visualization.

Phase 3: Research: Wikipedia project gaps / Language strategy planning

  • We located several databases (e.g. Ethnologue, WALS) covering all the world's languages and studied how languages overlap in the territories where they are spoken, in order to detect languages with a marginalized status.
  • We created a language territories database (languages_territories.db) extending the file Wikipedia_language_territories_mapping_quality.csv and other files. It covers the more than 6 thousand languages spoken in the world and includes their computed overlap within the same territories.
  • We identified the languages that should be a priority as new Wikipedia projects. These are a) languages of territories none of whose languages has a Wikipedia language edition, b) those with more speakers and c) languages with a high social status. This was explained in the presentations related to languages (Celtic Knot and Wikimania).
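The overlap computation described above can be sketched with a self-join over language-territory pairs. This is only an illustration under stated assumptions: the table name `language_territory`, its two columns and the sample rows are hypothetical and do not reflect the actual schema of languages_territories.db.

```python
import sqlite3


def build_example_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical language-territory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE language_territory ("
        "language TEXT, territory TEXT, PRIMARY KEY (language, territory))"
    )
    # Illustrative rows only (assumption), not taken from the project database.
    conn.executemany(
        "INSERT INTO language_territory VALUES (?, ?)",
        [
            ("ca", "ES-CT"), ("es", "ES-CT"),
            ("ca", "ES-VC"), ("es", "ES-VC"),
            ("eu", "ES-PV"), ("es", "ES-PV"),
        ],
    )
    return conn


def coexisting_language_pairs(conn: sqlite3.Connection):
    """Return (language_a, language_b, shared_territories) for languages that coexist."""
    query = """
        SELECT a.language AS lang_a, b.language AS lang_b, COUNT(*) AS shared
        FROM language_territory AS a
        JOIN language_territory AS b
          ON a.territory = b.territory AND a.language < b.language
        GROUP BY a.language, b.language
        ORDER BY shared DESC, lang_a, lang_b
    """
    return conn.execute(query).fetchall()


pairs = coexisting_language_pairs(build_example_db())
```

The condition `a.language < b.language` keeps each pair once and excludes a language paired with itself; the count of shared territories gives a rough measure of coexistence.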

Midpoint outcomes

Some of the outcomes have already been mentioned as activities. We are particularly satisfied with the following:

  • Miquel Ribé, M. (2019). The Sum of Human Knowledge? Not in One Wikipedia Language Edition. Wikipedia @ 20. Retrieved from
  • Created a new version of the Language-Territories Mapping and Languages Pairs Database (including all the 7000 languages in the world).
  • We used Wikidata QuickStatements to complement existing languages and territories (countries and subregions) in Wikidata with the following missing information (properties and values): 566 instance of language, 48 instance of macrolanguage, 496 dead languages, 70 ISO 639-3 codes, 4 Glottolog codes, 134 WALS codes, 2626 language coordinates (WALS database), 226 native labels (there cannot be more, as the property uses ISO 639-2 as a constraint), 8668 language-country pairs, 14770 language-subregion pairs, 108 country native names, 258 country official names, 726 language used countries population status, 886 countries official language, 229 territories native labels, 156 territories official names, 44715 territories language used languages with language status, and 1372 territories official language. As a result, Wikidata is now much more complete in terms of languages, their characteristics and especially their relations with territories.
  • Presentations in Wikimedia conferences Celtic Knot and Wikimania and Academic venues (Seminario Digidoc, ICWSM), among others.
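As an illustration of the QuickStatements batches mentioned above, the following sketch renders statements in the tab-separated format that QuickStatements accepts (item, property, value, with string values in double quotes). The helper name `to_quickstatements` and the sample rows are hypothetical; the rows target the Wikidata sandbox item (Q4115189) and use P31 (instance of), Q34770 (language) and P220 (ISO 639-3 code) purely as examples, not as statements the project actually uploaded.

```python
from typing import Iterable, List, Tuple

# (item, property, value, value_is_string)
Row = Tuple[str, str, str, bool]


def to_quickstatements(rows: Iterable[Row]) -> List[str]:
    """Render one tab-separated QuickStatements line per statement."""
    lines = []
    for item, prop, value, is_string in rows:
        # String values are quoted; item values such as Q34770 are not.
        rendered = f'"{value}"' if is_string else value
        lines.append("\t".join((item, prop, rendered)))
    return lines


# Example rows against the sandbox item (assumption, for illustration only).
rows = [
    ("Q4115189", "P31", "Q34770", False),
    ("Q4115189", "P220", "zxx", True),
]
batch = "\n".join(to_quickstatements(rows))
```

The resulting `batch` string could be pasted into the QuickStatements tool; any real batch should be checked against existing statements first to avoid duplicates.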


Our finances are on track and are available at:


What are the challenges

  • It is a major challenge to handle the technical development of such a big project, especially accessing, retrieving and processing almost all the available data from the 300 Wikipedia language editions. This means facing different sorts of technical bottlenecks (RAM, disk, etc.), as data are retrieved from different sources (Wikidata dump, Wikipedia dumps, MySQL wikireplicas, etc.). We need data to create topical selections of articles and data to rank articles according to relevance. Last November, the processes to do this took 13 days. During these months we tried to make them shorter and more agile by re-coding and re-running them in parallel, without success, because the conditions changed (the wikireplicas do not perform as well as before). Therefore, we had to work around the problem with static dumps, which are much more reliable, without losing efficiency in the process. In most cases it worked, but we are still missing data from the revision tables.
  • These kinds of problems prevent progress in other lines of work (visualizations, writing, etc.) and are very frustrating because they put the development of many other parts of the project at risk, yet there is no other way to tackle them. Hopefully, during the second half of the project, we will be able to get back on track and create the remaining visualizations. Between these problems, automation and new data, every month is a serious challenge. This was possibly underestimated during the project planning (the number of specific goals was very ambitious, as many community members pointed out). We will try to achieve all these goals while staying aware of the external technical difficulties and of the need to be flexible in case other opportunities for development or dissemination (publications, conferences, etc.) with a potential impact appear and deserve our time.
  • It is always a challenge to understand data. For this reason, Phase 3 of the project was tackled before the other phases. We wanted to understand all the data available to the project in order to have a more accurate idea of its limits; most of the other data was already known because of long-term involvement in the Wikimedia movement. Phase 3 data comprised the data on all the 7000 languages and the data on overlaps between languages in territories (coexistence), which is essential to find gaps in the Wikipedias of minoritized languages about their local context that are covered by bigger language editions. One approach that became very useful was the use of Tableau (data visualization software). It makes it possible to create fast visualizations to examine the relationships between different variables and, once learned, to build valuable dashboards, some of which were used in presentations. Understanding the data is key to knowing what is there and what can be proposed as a solution.
  • It has also been a challenge to keep in touch with communities and people with valuable information. However, it is essential to prioritize actions. In this case, events such as Wikimedia Celtic Knot or Wikimania have been useful to work at the discourse level (simplify, simplify and simplify the way to explain cultural diversity in Wikipedia and its possible solutions). Conversations with editors are very useful to understand their mindset much better (their daily concerns and needs) and to adapt the results or even create new lists of articles (as was the case for GLAM, among others). By contrast, keeping in touch with researchers is much easier, as some of us maintain a long-term collaboration: we usually have a call or in-person meeting on a weekly basis to share technical and research problems (bottlenecks, content selection criteria, machine learning methods, among others) and to write with the aim of publishing the results or any new discourse that can help raise awareness of cultural diversity in Wikipedia.
  • Another challenge has been to get involved in the Wikimedia 2030 Strategy Process in order to contribute to diversity. Even though Marc was not an initial member of the Diversity Working Group, he was invited right after the scoping document to write recommendations based on community feedback and his research. This was initially an opportunity to disseminate the project results and ideas. Later, it became useful for thinking about different ways in which data (based on diversity in content, community and also on barriers) could illuminate the path to more diversity in the movement. The strategy process has been useful for finding ways to have an impact on diversity, and also an inspiration to think about other lines of work for this project.
  • Finally, it is important to say that at the current stage of the project, in which a substantial number of "diversity datasets", visualizations and research talks for communities have already been produced, it is hard to keep performing all functions simultaneously. It is a challenge to switch roles when required and accomplish the different goals. Even though it is stimulating, there is always the perception that some of the tasks could be carried out by more specialized roles once they are defined at a sufficient level of precision. We have already defined and understood quite well the limits of a quantitative approach to helping with cultural diversity in Wikipedia. Therefore, for further stages of the project's development, after this grant, it would be necessary to look for a new, more collaborative approach. For the moment, it is important to stick to the top-priority outcome, to be flexible enough to incorporate new things and to continue disseminating to the communities and academia.

What is working well / Next steps

The current outcomes are an example of what is working well. Outlined below are some of the very next steps/outcomes the project intends to accomplish in its second half.

  • Attend the WikiArabia conference (October) and provide an in-depth analysis of their language edition in terms of cultural diversity (both representation of their context and coverage of others).
  • Update the CCC databases and datasets with a shorter iteration that computes the metrics/selections only for the new articles. The current iteration takes 15 days, which is not sustainable.
  • Create visualizations using Dash for the statistics that have been computed and stored in the stats.db database.
  • Promote the project to all the existing interlanguage collaboration initiatives. In particular, we are looking forward to contributing to the GLOW project with Missing CCC (Missing Local Content) articles, as we agreed in different conversations at Wikimania and afterwards.
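The shorter iteration proposed above, computing metrics only for new articles, could look roughly like the following sketch. It is an assumption about how such an update might work, not the project's implementation: `incremental_update`, the page identifiers and the stand-in metric are hypothetical, and the sketch deliberately does not refresh metrics of existing articles that may have changed.

```python
from typing import Callable, Dict, Iterable, Set, Tuple


def incremental_update(
    previous: Dict[int, float],
    current_page_ids: Iterable[int],
    compute_metric: Callable[[int], float],
) -> Tuple[Dict[int, float], Set[int]]:
    """Reuse stored metrics and compute them only for pages not seen before."""
    current = set(current_page_ids)
    new_ids = current - set(previous)
    # Carry over stored values, dropping pages that no longer exist.
    updated = {pid: value for pid, value in previous.items() if pid in current}
    for pid in sorted(new_ids):
        updated[pid] = compute_metric(pid)  # the expensive step runs only here
    return updated, new_ids


computed = []  # records which pages triggered the expensive computation


def example_metric(page_id: int) -> float:
    """Stand-in for an expensive relevance metric (assumption)."""
    computed.append(page_id)
    return float(page_id)


stored = {1: 0.5, 2: 0.7}
updated, new_ids = incremental_update(stored, [2, 3], example_metric)
```

In this example only page 3 is processed, page 2 keeps its stored value and page 1 is dropped because it is no longer present.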

In general, it is always important to keep in mind:

  • Document all the different methods employed, scripts and overall project details.
  • Prepare next improvements for the Cultural Diversity Observatory based on different applications or new types of data.
  • Find other ways to make the outcomes have a higher impact – disseminate across communities and the movement, e.g. in the strategy process.

Grantee reflection

This project aims to raise awareness of language gaps based on cultural context content. We have conceptualized the problem of cultural diversity as both a problem of representation (content that is not yet in Wikipedia) and of sharing (content that needs to be translated across languages). Finding all the possible ways to address it is a very exciting quest – and for a researcher, the natural tendency is to stick to data and work in this area in more depth. However, we believe that raising awareness and at the same time providing tools is the best way to improve Wikipedia. Bridging these two steps is our goal.

Some of the recommendations written for Strategy 2030 (WG Diversity) are tightly related to this project. Conversely, this project could be expanded in the future to help with many more types of diversity than cultural (e.g. gender, sexual orientation, etc.). Likewise, it would be important to raise awareness among editors so that they interpret notability more flexibly, based on contextual data. For example, we believe that showing data on the gaps and on the average available sources for a type of content can help them make more informed decisions.

We believe the potential impact of this project is very large. The different challenges mentioned will probably remain the main obstacles. It is important to be very aware of them and find viable short-term solutions, but at the same time to start thinking about future collaborations to overcome them. We do not want to miss the chance to thank all the people who helped the project and offered their time for discussion or any other kind of contribution, starting with the collaborators mentioned on the page, but also WMF members. We look forward to meeting you again and continuing these conversations.