Wikipedia Cultural Diversity Observatory

From Meta, a Wikimedia project coordination wiki



A project to provide data with strategic value and resources to organize and fight for more cultural diversity within Wikipedia

Browse other cultures and places. Find the gaps you care about!

The Wikipedia Cultural Diversity Observatory (WCDO) is a space to study Wikipedia's intercultural coverage and fight knowledge inequality. To do so, it aims to raise awareness of Wikipedia’s current state of cultural diversity by providing datasets, visualizations and statistics, as well as pointing out solutions and tools.


This project’s vision is to align the movement to achieve cultural diversity in the content of the different projects.

This project's mission is to create a joint space for researchers and activists to study and fight against the cultural knowledge gaps and promote knowledge equity. Hence, we provide strategically valuable data and resources to organize and take action.

This project is especially motivated by the Africa knowledge gap (see this interview).


These are the three main outcome goals we are working on to increase the cultural diversity within the Wikimedia projects:

Main outcome goals:

  1. Every Wikipedia language edition ensures a minimal representation of its own territories’ cultural context (from geography to biographies, traditions, language and others).
  2. Every Wikipedia language edition ensures a minimal coverage of every other language's cultural context content.
  3. Every Wikipedian has information about marginalized languages without a Wikipedia, so they can help the speakers of those languages create one and start representing their cultural context.

In order to reach these goals, we detail some other more specific goals in community engagement and research and development activities of the project.

Community engagement goals:

  • Every Wikipedia language community is aware of the knowledge inequalities in the entire Wikipedia project.
  • Every Wikipedia language community is aware of the importance of representing its own culture so that users of the other language editions can import it and learn from it.
  • Every Wikipedia event and community-organized contest considers dedicating sections and activities to mitigating the cultural knowledge gaps and the inequalities derived from them.

Research and development goals:

  • Every Wikipedian has access to some data visualization tools in order to browse the gaps and create new valuable articles.
  • Every Wikipedian has access to some statistical analysis on the extent of the gaps and understands the priorities in order to bridge or cover them.
  • Every Wikipedian has access to some data on the world's languages without a Wikipedia in order to spread awareness of their importance and help in creating one.


“The sum of human knowledge” is not in a single language but in the existing cultural diversity from every territory and language in the world. Wikipedia aims at gathering it.

In order for each Wikipedia language edition to have content representing something close to the world's existing cultural diversity, we have to work on very different aspects and align all the Wikimedia movement stakeholders to facilitate the creation of articles that reflect this cultural diversity.

We see this as a two-step process or two sequential processes: representation and sharing.

For each language, the process of representation implies creating content that relates to the editors' geographical and cultural context. The process of sharing, in contrast, implies understanding where the gaps are, both in one's own language and in the others, in order to exchange each other's cultural context content and increase all languages' cultural diversity.

In order to facilitate cultural context representation, we propose:

  1. Create, collect, process, and present the sorts of metrics which describe creation and usage statistics of cultural content on Wikimedia projects.
  2. Understand the situation of all the world's marginalized languages that could become Wikipedia language editions, and consequently, the potential content about their cultural context they would bring to the entire Wikipedia project.

In order to facilitate each language to share (import and export) cultural context content, we propose:

  1. Ideate and develop tools that prioritize and allow finding the most valuable content (popular and relevant) that might be essential to create across projects.
  2. Provide training to organizations and individuals in these tools so that they can help mitigate the knowledge gaps and increase the cultural diversity in Wikimedia projects.

Current Outcomes

As an observatory, this project's outcomes bridge the gap between research and activism rather than focusing on content creation itself. This portal provides results updated on a monthly basis; however, most of the visualizations are located or better depicted at an external website created with Plotly and hosted on Toolforge.

Even though some results are repeated on both sites, those at the external website are preferable because they allow better user interaction with the data. For example, the tables in the external List of Wikipedias by Cultural Context Content offer a filtering feature that is not available in the List of Wikipedias by Cultural Context Content on this wiki.

This project is continually developing research questions, concepts, visualizations and tools. DISCLAIMER: It is currently in beta 1, so if you find a bug, we would be pleased to receive a report by e-mail.

WCDO's main concepts are Cultural Context Content, Culture Gap and Top CCC articles lists:

Cultural Context Content (CCC)

Cultural Context Content (CCC) (methodology) is the group of articles in a Wikipedia language edition that relate to the editors' geographical and cultural context (places, traditions, language, politics, agriculture, biographies, events, etc.) (Figure 1). You can see this YouTube video explaining its creation and use.

Figure 1. By Cultural Context Content articles (CCC) I refer to the articles on a different range of topics, all related to the editors’ context, occurring in each Wikipedia language edition

In order to create any CCC, it is necessary to establish a language-territories mapping; in other words, to pin down the territories where the language is spoken natively or has official legal status.

Cultural Context Content is collected as a group of datasets, which are released on a monthly basis. These datasets are used to compute and depict several statistics on the state of knowledge equality and cross-cultural coverage.

For example, it is possible to consult the extent of CCC in each Wikipedia language edition (List of Wikipedias by Cultural Context Content) or even the number of articles from a particular territory in one language edition's CCC (List of Language Territories by Cultural Context Content).

Culture Gap

The culture gap occurs when a Wikipedia language edition does not cover articles that belong to another language edition's CCC. Around 50% of the articles that are missing across language editions (the language gap) are due to the culture gap.

In order to compute the culture gap and other statistics, WCDO proposes calculating the intersections between different sets of articles (e.g. the articles common to all articles from the English language edition and the articles from the Japanese CCC). Intersections show both the absolute number of shared articles and their extent (the relative importance) in each of the two sets.
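The intersection idea can be illustrated with a minimal Python sketch. It assumes that each article is identified by a language-independent key (here toy Wikidata-style IDs, which are invented for illustration); the function and variable names are hypothetical and are not taken from the WCDO code.

```python
def intersection_stats(set_a, set_b):
    """Return the size of the intersection and its share of each set."""
    common = set_a & set_b
    return {
        "common": len(common),
        # Relative extent of the shared articles within each set.
        "share_of_a": len(common) / len(set_a) if set_a else 0.0,
        "share_of_b": len(common) / len(set_b) if set_b else 0.0,
    }


# Toy data (invented IDs): all articles in one edition vs. another edition's CCC.
english_all = {"Q1", "Q2", "Q3", "Q4"}
japanese_ccc = {"Q3", "Q4", "Q5", "Q6", "Q7"}

stats = intersection_stats(english_all, japanese_ccc)

# Articles in the Japanese CCC that the English edition does not cover.
culture_gap = japanese_ccc - english_all
```

With these toy sets, two articles are shared, which is half of the first set and 40% of the second; the three remaining Japanese CCC articles form the gap.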

In these two tables it is possible to see the culture gap in two different ways: first, the spread of a language's CCC across the rest of the Wikipedia language editions and, second, the coverage of all the other languages' CCC.

Top CCC articles lists

Wikipedia language editions should not be a replica of each other and the gap may never be completely closed. However, a minimal coverage of all other languages should be a goal on the agenda of each Wikipedia edition to create more multicultural (and complete) encyclopaedias.

Figure 2. Top CCC articles lists are different selections of articles from CCC (such as gender, geolocation, etc.) ranked according to a particular feature (number of pageviews, number of editors contributing to it, etc.). This is a useful way to find relevant articles to bridge the gap.

Top CCC articles lists can help provide content for this minimal cultural coverage. Inspired by the Vital articles lists, the Top CCC articles lists present the most relevant articles in terms of different metrics (e.g. number of editors or pageviews) and specific content types (e.g. geolocated articles or women) from a language's or a country's cultural context.

The currently generated Top CCC articles lists are:

  • list of CCC articles with the most editors (Editors),
  • list of CCC articles with featured article distinction, most bytes and most references (weights: 0.8, 0.1 and 0.1 respectively) (Featured),
  • list of CCC articles with geolocation with most links coming from CCC,
  • list of CCC articles with keywords on title with most bytes (Bytes),
  • list of CCC articles categorized in Wikidata as women with most edits (Women),
  • list of CCC articles categorized in Wikidata as men with most edits (Men),
  • list of CCC articles created during the first three years and with most edits (First 3Y.),
  • list of CCC articles created during the last year and with most edits (Last Y.),
  • list of CCC articles with most pageviews during the last month (Pageviews),
  • list of CCC articles with most edits in talk pages (Discussions).
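As an illustration of a weighted ranking such as the one used for the Featured list, here is a minimal Python sketch. Only the weights (0.8, 0.1 and 0.1) come from the description above; how each feature is scaled before weighting is not stated, so dividing bytes and references by the maximum in the candidate set is an assumption, and the `Article` record and `rank_featured` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Article:
    title: str
    featured: bool
    num_bytes: int
    num_references: int


# Weights for featured distinction, bytes and references, as stated in the text.
WEIGHTS = (0.8, 0.1, 0.1)


def rank_featured(articles):
    """Sort articles by a weighted score, highest first.

    Assumption: bytes and references are scaled by the maximum value
    among the candidates so that every feature lies in [0, 1].
    """
    max_bytes = max((a.num_bytes for a in articles), default=0) or 1
    max_refs = max((a.num_references for a in articles), default=0) or 1

    def score(a):
        return (
            WEIGHTS[0] * float(a.featured)
            + WEIGHTS[1] * a.num_bytes / max_bytes
            + WEIGHTS[2] * a.num_references / max_refs
        )

    return sorted(articles, key=score, reverse=True)


# Toy data, invented for illustration.
candidates = [
    Article("A", featured=True, num_bytes=1000, num_references=5),
    Article("B", featured=False, num_bytes=50000, num_references=100),
    Article("C", featured=True, num_bytes=2000, num_references=10),
]
ranking = [a.title for a in rank_featured(candidates)]
```

Under this scaling, the featured distinction dominates: a long, well-referenced article without the distinction still ranks below short featured ones.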

On this page, you can consult the list for a particular country or language CCC, generated on a monthly basis from the latest CCC dataset. You need to specify the list parameter (editors, featured, geolocated, keywords, women, men, created_first_three_years, created_last_year, pageviews or discussions), the language target parameter (as lang_target and the language wikicode), the language origin (as lang_origin and the language wikicode) and, optionally, to limit the scope of the selection, the country origin parameter as part of the CCC (as country_origin and the country ISO 3166 code). If no country is selected, the default is 'all'.
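A query of this kind could be assembled as in the following Python sketch. The parameter names lang_target, lang_origin and country_origin and the list values come from the description above; the base URL is a placeholder (the real address is not reproduced here), and the query key `list` and the function name `top_ccc_url` are assumptions.

```python
from urllib.parse import urlencode

# Placeholder only: substitute the actual address of the Top CCC lists page.
BASE_URL = "https://example.org/top_ccc_articles/"

VALID_LISTS = {
    "editors", "featured", "geolocated", "keywords", "women", "men",
    "created_first_three_years", "created_last_year", "pageviews",
    "discussions",
}


def top_ccc_url(list_name, lang_target, lang_origin,
                country_origin="all", base_url=BASE_URL):
    """Build a query URL for one Top CCC list (hypothetical helper)."""
    if list_name not in VALID_LISTS:
        raise ValueError(f"unknown list: {list_name!r}")
    params = {
        "list": list_name,          # assumed key name for the list parameter
        "lang_target": lang_target,  # language wikicode, e.g. "it"
        "lang_origin": lang_origin,  # language wikicode, e.g. "es"
        "country_origin": country_origin,  # ISO 3166 code or "all"
    }
    return f"{base_url}?{urlencode(params)}"


# Editors list, language origin Spanish, language target Italian, no country.
url_all = top_ccc_url("editors", lang_target="it", lang_origin="es")

# Women list limited to one country of origin.
url_women = top_ccc_url("women", lang_target="it", lang_origin="es",
                        country_origin="es")
```

Validating the list name up front keeps typos from silently producing an empty result page.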

One possible URL with Top CCC list by number of editors, language origin Spanish, language target Italian and no country would be:

A similar list but limited to a specific country and to women, would be:

The generated table includes several metrics and shows the article's availability in the top right column, with the current title (if it exists) or a possible title generated by a translator or taken from a Wikidata label.

Another way to browse the lists is to examine how well a language edition covers the other language editions' Top CCC articles lists (centered around countries, as Countries Top CCC article lists), or how well one particular language edition's Top CCC lists are spread across the rest of the language editions.

In this case, it is necessary to specify the language covering or spreading the lists with the lang parameter. This is an example using Catalan Wikipedia:

  • Languages Top CCC articles spread from Catalan Wikipedia.

  • Languages Top CCC articles coverage by Catalan Wikipedia.

  • Countries Top CCC articles coverage by Catalan Wikipedia.

Future outcomes

Figure 3. CCC Datasets are a necessary map and a starting point to fight for cultural diversity in each Wikipedia (video explaining them).

The previous outcomes of the project allowed us to lay the project's foundations, to create the WCDO website for data visualization and the Cultural Context Content (CCC) Datasets, and to disseminate them across communities and academia (journal paper).

As a quick reminder, Cultural Context Content is the group of articles in a Wikipedia language edition that relate to the editors' geographical and cultural context (places, traditions, language, politics, agriculture, biographies, events, etc.). The CCC datasets act as a cartography and are fundamental for showing the gaps and suggesting further solutions such as lists of articles (Figure 3).

In this new phase of the project, in order to address Wikipedia’s lack of cultural diversity, we propose two particular types of solutions:

  • We want to use the CCC datasets to monitor the gaps on a monthly basis (showing the creation of articles for specific kinds of content to show whether and where editors are really bridging the gap), along with many other lists, solutions and improvements based on the feedback gathered at past Wikimedia events and from local communities. Likewise, we want to create a multilingual editors dashboard for finding potential collaborators. Editors should be able to query lists or visualizations showing editors from their own or other language editions, along with their cultural context interests.

Figure 4. These are the final functionalities we propose for the observatory. They all support the idea of helping editors "browse other cultures and places, and find the gaps they care about".

Hence, in this phase the project mainly shifts from data analytics and machine learning to community engagement and data visualization.

  • We also want to provide strategic data to detect potential new Wikipedia language editions. Following our previous research and the language-territories mapping, we want to create an initial database of all the languages in a status of marginalization (similar to the language-territories mapping), in order to see their potential (in number of speakers and literacy) to become Wikipedia language editions, and to select content related to their cultural contexts that exists in other Wikipedia language editions (of other languages coexisting in these same territories) that should be created in their own native language in order to decolonize this knowledge.

This way, the solutions proposed are based on developing key resources for working on Wikipedia’s cultural diversity to the advantage of any language community or global initiative. In Figure 4 there is an overview of the different functionalities the observatory will have by the end of this phase.

Tools and research papers

We also want to provide a short overview of the other tools and research papers that are useful for understanding and detecting cultural differences between language editions and possibly bridging the gaps.

Activities / Get Involved

The project has been presented at different venues as a concept and in its beta phases. It does need dissemination in order to reach all the possible Wikimedia events and activities where it could provide some value.

This page is the central hub for the research and technical documentation and, at the same time, it directs readers to the visualizations.

If you want to collaborate, get involved. In case you want to code some extra visualizations, you can find the project's code here: github page.