Grants:Project/WCDO/Culture Gap Monthly Monitoring/Timeline

Timeline for WCDO

Timeline                      Date
Publish the Midpoint Report   15 August 2019
Publish the Final Report      30 January 2020

Monthly updates

March 2019

  • We debugged the process of collecting Cultural Context Content (CCC).
  • We participated in the DHASA (Digital Humanities Association of Southern Africa) edit-a-thon, organized by DNdubane_(WMF) at the University of Pretoria, with a short online presentation of the Wikipedia Cultural Diversity Observatory project (March 27th).
  • We started creating the lists of Top CCC articles on several topics (folk, monuments, earth, music creations and organizations, sports and teams, food, paintings, GLAM, books, clothing and fashion, and industry).
  • We adapted the project Meta site for the new phase.
  • We located several databases (e.g. Ethnologue, WALS) covering all the world's languages and studied how the languages overlap in the territories where they are spoken, in order to detect marginalized languages.
  • We prepared the organizational documents, spreadsheets, and code needed to tackle the new research and development phase of the project.

April 2019

  • We finished creating the lists of Top CCC articles on several topics (folk, monuments, earth, music creations and organizations, sports and teams, food, paintings, GLAM, books, clothing and fashion, and industry).
  • We created a language territories database (languages_territories.db) extending the file Wikipedia_language_territories_mapping_quality.csv and other files. It covers the more than 6,000 languages spoken in the world and records where they overlap in the same territories.
  • We started writing a paper about editor participation in Cultural Context Content, to explain how important representing the local context is for a Wikipedia to function well.
  • We studied different ways to promote cultural diversity within the Wikimedia movement and wrote a document about a “Cultural Diversity Maturity Model” for communities.
  • We presented the WCDO project at the Seminario DigiDoc abril 2019 (Universitat Pompeu Fabra, Barcelona, Catalonia) as “Wikipedia Cultural Diversity Observatory: un caso de aplicación práctica del análisis de datos para mejorar la diversidad cultural en la Wikipedia” (slides in Commons).
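
The territorial overlap computed for the language territories database can be pictured with a minimal sketch. It is an illustration only, not the project's code: the function name `territory_overlaps`, the input shape (a mapping from language code to a set of territory codes), and the sample data are all assumptions, and the real schema of languages_territories.db is not described here.

```python
from collections import defaultdict
from itertools import combinations


def territory_overlaps(language_territories):
    """Map each pair of languages to the set of territories they share.

    `language_territories` is a hypothetical input shape:
    {language_code: {territory_code, ...}}.
    """
    # Invert the mapping: which languages are spoken in each territory?
    by_territory = defaultdict(set)
    for language, territories in language_territories.items():
        for territory in territories:
            by_territory[territory].add(language)

    # Every pair of languages sharing a territory overlaps there.
    overlaps = defaultdict(set)
    for territory, languages in by_territory.items():
        for pair in combinations(sorted(languages), 2):
            overlaps[pair].add(territory)
    return dict(overlaps)


# Illustrative sample only (language codes mapped to territory codes).
sample = {
    "ca": {"ES", "AD", "FR"},
    "es": {"ES", "MX"},
    "lg": {"UG"},
    "en": {"UG", "GB"},
}
overlaps = territory_overlaps(sample)
```

With this sample, Catalan and Spanish overlap in one territory, as do English and Luganda. Deciding which language in an overlapping pair is marginalized would need further attributes, such as social status and number of speakers, which this sketch leaves out.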

May 2019

June 2019

  • We released the CCC dataset publicly for the research community and presented it at the ICWSM conference in Munich, June 11-13 (Program). Reference: Miquel-Ribé, M., & Laniado, D. (2019). Wikipedia Cultural Diversity Dataset: A Complete Cartography for 300 Language Editions. Proceedings of the 13th International AAAI Conference on Web and Social Media (pdf). ICWSM. ACM.
  • We analyzed the world's languages by geographical extent, social status, and number of speakers in order to determine both their coexistence within a territory and situations of language marginalization.
  • We hit a performance bottleneck with the MySQL database replicas and had to rewrite the functions in several ways to make them work.
  • Marc has been contributing to the Diversity Working Group with recommendations aimed at expanding the horizons of the observatory and addressing other issues in the diversity area.
  • We received feedback and made extensive changes and edits to the chapter for “Wikipedia@20” and participated in the reviewing process of other chapters.

July 2019

  • We attended the Wikimedia conference Celtic Knot and presented “Languages Matter to Cultural Diversity: Finding Missing Languages and Bridging the Gaps in Minority Languages” (slides).
  • We designed the database and main code for the monthly analysis.
  • We contacted the Wikitech and Analytics teams to consult them about the bottleneck and started rewriting the code of the whole WCDO framework in order to use the SQL dumps (those corresponding to the replica tables).
  • We finished computing the “Missing CCC” dataset: for every language, it contains the articles that should exist because they belong to its local context but that exist only in a language of higher status (e.g. articles on Uganda that do not exist in Luganda Wikipedia but exist in English Wikipedia).
  • Marc has been contributing to the Diversity Working Group by writing recommendations and joining the weekly calls, and has met some members of the WG in Rome.
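
The idea behind the “Missing CCC” dataset can be sketched as a set difference. This is a minimal illustration under stated assumptions, not the project's implementation: the function `missing_ccc`, its parameters, and the placeholder item identifiers are hypothetical, and the real pipeline works on far larger tables.

```python
def missing_ccc(ccc_items, existing_by_language, target, higher_status):
    """Return CCC items absent from `target` but present in `higher_status`.

    All names and data shapes are hypothetical:
    - ccc_items: items belonging to the target language's local context
    - existing_by_language: {language_code: set of items with an article}
    """
    in_target = existing_by_language.get(target, set())
    in_higher = existing_by_language.get(higher_status, set())
    # Keep what the target edition lacks, restricted to what the
    # higher-status edition already covers.
    return (ccc_items - in_target) & in_higher


# Placeholder identifiers, not real article data.
local_context = {"item_a", "item_b", "item_c"}
existing = {
    "lg": {"item_a"},
    "en": {"item_a", "item_b", "item_d"},
}
gap = missing_ccc(local_context, existing, target="lg", higher_status="en")
```

Here `gap` holds only `item_b`: `item_a` already exists in the target edition, and `item_c` exists in neither edition, so it is not a gap relative to the higher-status language.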

August 2019

September 2019

  • We have generated a new version of the CCC dataset, although it took three times longer than before and some features are still unavailable (those based on the revision table: number of editors, number of edits, etc.).
  • Together with Alex Stinton and Satdeep, we have explored the possibility of uploading the Top CCC lists to Wikidata by creating new properties.
  • We made some specific analyses for the Arabic and Egyptian Arabic Wikipedias in order to prepare the presentation for the WikiArabia conference.
  • We started coding some new data visualizations (topical analyses) that are not available yet.

October 2019

November 2019

  • We created some dashboards to test both the generated stats and the different types of graphs available in Plotly (not uploaded yet).
  • We re-coded some of the functions in order to use dumps and avoid the replicas.
  • We started writing/outlining an article in order to disseminate results in a specialized journal.

December 2019

  • We created some dashboards to test both the generated stats and the different types of graphs available in Plotly (not uploaded yet).
  • We created a plan for the next potential phases of the Observatory.
  • We created a summary of the project for the Knowledge Equity Advent calendar initiative from Wikimedia Deutschland.

Is your final report due but you need more time?

Extension request

New end date

As we explained in the midpoint report (section What are the challenges), we have had to re-code an important part of the original code because of a change in how the MySQL replicas behave (lower performance). This led us to look for different approaches using both the replicas and the XML dumps. However, processing the dumps for the 300 languages has not been successful because the processing time is too high. The performance of the replicas has not improved either, and none of the approaches using them was fast enough to retrieve and process the data (they took on the scale of weeks, which made the results totally invalid).

These issues are of vital importance for the project. First, they prevent progress on other fundamental parts of the project (visualizations and analysis) that depend on having the data. Second, without these data processes automated, it is not possible to have the visualizations show up-to-date results. To fix this, we have contacted several WMF departments to request access to certain services that would solve the problem or offer a much better solution to it. Even though it was not possible to obtain access, we can try a new approach that involves a new dataset that is being released in a few weeks. In order to do this and finish the project with the desired functionalities, we will need to extend it by three more months. This will set the final deadline for the report to the end of March. Please do not hesitate to ask me for further detail on how I would spend these extra months or on any other aspect of the project.

@Marcmiquel: Hi Marc, thanks for this extension request, and for detailing the obstacles you encountered in obtaining access to specific services. This extension request is approved, with a final report due one month later on 30 April 2020. I JethroBT (WMF) (talk) 15:58, 18 December 2019 (UTC)

Extension request

New end date

Extend the project from Cultural Diversity Observatory to Diversity Observatory

The project has been completed, and the tasks designed for 2019 have resulted in different dashboards, datasets, publications, and talks, available on the project site. During conversations at Wikimedia events and within the Wikimedia Strategy 2030 process, several people showed interest in having extended dashboards, tools, and data for some missing types of diversity: ethnic groups, LGBT+ groups, and religious groups, among others.

Following the same approach, we propose to expand the project so that the Cultural Observatory becomes the Diversity Observatory (here a conceptualization for all types of diversity). The tasks and goals proposed for this extension are aimed at expanding the datasets, dashboards and improving the entire framework and website (see here).

@Marcmiquel: This extension of your timeline and additional tasks are formally approved! Your new end date is 10 November 2020, and a final report will be due on 10 December 2020. If you do need more time, especially as the review process for this extension took much longer than expected, please feel free to request additional time. I JethroBT (WMF) (talk) 20:46, 14 July 2020 (UTC)