Wikipedia Cultural Diversity Observatory




A project to provide data with strategic value and resources to organize and fight for more cultural diversity within Wikipedia

Browse other cultures and places. Find the gaps you care about!

The Wikipedia Cultural Diversity Observatory (WCDO) is a space to study Wikipedia's diversity coverage, discuss the strategic needs and propose solutions to improve it.

To do so, it aims to raise awareness of Wikipedia's current state of diversity by providing datasets, visualizations and statistics, as well as by pointing out solutions and tools.


This project's vision is to align the movement to achieve cultural diversity in the content of the different projects.

This project's mission is to create a joint space for researchers and activists to study and fight against the cultural knowledge gaps and promote knowledge equity. Hence, we provide strategically valuable data and resources to organize and take action.


These are the three main outcome goals we are working on to increase the cultural diversity within the Wikimedia projects:

Main outcome goals:

  1. Every Wikipedia language edition ensures a minimal representation of the cultural context of its own territories (from geography to biographies, traditions, language and others).
  2. Every Wikipedia language edition ensures a minimal coverage of every other language's cultural context content.
  3. Every Wikipedian has information about marginalized languages without a Wikipedia, so they can help the speakers of those languages create one and start representing their cultural context.

In order to reach these goals, we detail more specific goals for the project's community engagement and research and development activities.

Community engagement goals:

  • Every Wikipedia language community is aware of the knowledge inequalities in the entire Wikipedia project.
  • Every Wikipedia language community is aware of the importance of representing its own culture so that users of the other language editions can import and learn from it.
  • Every Wikipedia event and community-organized contest considers dedicating sections and activities to mitigating the cultural knowledge gaps and the inequalities derived from them.

Research and development goals:

  • Every Wikipedian has access to some data visualization tools in order to browse the gaps and create new valuable articles.
  • Every Wikipedian has access to some statistical analysis on the extent of the gaps and understands the priorities in order to bridge or cover them.
  • Every Wikipedian has access to some data on the world's languages without a Wikipedia in order to spread awareness of their importance and help create one.

Framing the problem

“The sum of human knowledge” is not in a single language but in the existing cultural diversity from every territory and language in the world. Wikipedia aims at gathering it.

For each Wikipedia language edition to have content that comes close to representing the world's existing cultural diversity, we have to work on very different aspects and align all the Wikimedia movement stakeholders to facilitate the creation of articles that reflect cultural diversity.

We see this as a two-step process or two sequential processes: representation and sharing.

For each language, the process of representation implies creating content that relates to the geographical and cultural context of the editors. The process of sharing, in turn, implies understanding where the gaps are, both in one's own language and in the others, in order to exchange cultural context content and increase the cultural diversity of all languages.

In order to facilitate cultural context representation, we propose:

  1. Create, collect, process, and present different sorts of metrics and tools to describe creation and usage of cultural content on Wikimedia projects.
  2. Understand the situation of all the world's languages that could become Wikipedia language editions, and consequently, the potential content about their cultural context they would bring to the entire Wikipedia project.

In order to help each language share (import and export) the cultural context content of all languages, we propose:

  1. Ideate and develop tools that prioritize and help find the most valuable content (popular and relevant) that might be essential to create across projects.
  2. Provide training to organizations and individuals in these tools so that they can help mitigate the knowledge gaps and increase the cultural diversity in Wikimedia projects.

Not all languages are in the same position to achieve good coverage of the world's cultural diversity. Usually languages represent their own cultural context first, and only later build the capacity and maturity to create articles about every other language's cultural context. It is possible to compare and discuss the maturity level of a language edition in terms of content cultural diversity according to several aspects we discuss in this preliminary model.

Cultural diversity tools

As an observatory, the outcomes of this project bridge the gap between research and activism more than focusing on the content creation itself. This portal itself provides results. Most of the visualizations are located or better depicted at an external website created with Plotly and hosted in Toolforge.

Even though some results are repeated on both sites, those at the external website are preferable as they allow better user interaction with the data. For example, the tables in the external version of List of Wikipedias by Cultural Context Content offer a filtering feature that is not available in the version of the list on this wiki.

This project is continually developing research questions, concepts, visualizations and tools.

WCDO's main concepts are Cultural Context Content, Culture Gap, Top CCC articles and Missing CCC articles:

Cultural Context Content (CCC) aka Local Content

Figure 1. By Cultural Context Content articles (CCC) I refer to the articles on a different range of topics, all related to the editors’ context, occurring in each Wikipedia language edition
Figure 2. CCC Datasets are a necessary map and a starting point to fight for cultural diversity in each Wikipedia (video explaining them).

Cultural Context Content (CCC) (methodology) is the group of articles in a Wikipedia language edition that relates to the editors' geographical and cultural context (places, traditions, language, politics, agriculture, biographies, events, etc.) (Figure 1). You can see this YouTube video explaining its creation and use.

In order to create any CCC it is necessary to establish a language territories mapping, in other words, to pin down the territories where the language is spoken natively or has official legal status.

Cultural Context Content is collected as a group of datasets (Figure 2), which are released on a monthly basis. These datasets are used to compute and depict several statistics on the state of knowledge equality and cross-cultural coverage.

For example, it is possible to consult the extent of CCC in each Wikipedia language edition (List of Wikipedias by Cultural Context Content) or even the number of articles from a particular territory in one language edition's CCC (List of Language Territories by Cultural Context Content).

Culture Gap

The culture gap occurs when a Wikipedia language edition does not cover articles that belong to another language edition's CCC. Around 50% of the articles that do not exist across language editions (the language gap) are missing because of the culture gap.

In order to compute the culture gap and other statistics, WCDO proposes calculating the intersections between different sets of articles (e.g. the articles common to the entire English language edition and the Japanese CCC). Using intersections makes it possible to see both the absolute number of shared articles and their extent (the relative importance) in each of the two sets.

In these two tables it is possible to see the culture gap in two different ways: first, the spread of a language's CCC across the rest of the Wikipedia language editions, and second, the coverage of all the other languages' CCC.
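
The intersection idea can be illustrated with a minimal Python sketch. It assumes each set of articles is represented as a set of language-independent identifiers (the `Q…` values below are made up for illustration and are not real data), and the function name `intersection_extent` is hypothetical rather than part of the WCDO code.

```python
def intersection_extent(set_a, set_b):
    """Return the number of shared items and their share of each set."""
    common = set_a & set_b
    return {
        "common": len(common),
        # Relative importance of the shared items within each set
        "share_of_a": len(common) / len(set_a) if set_a else 0.0,
        "share_of_b": len(common) / len(set_b) if set_b else 0.0,
    }


# Made-up identifiers, for illustration only
english_articles = {"Q1", "Q2", "Q3", "Q4", "Q5", "Q6", "Q7", "Q8"}
japanese_ccc = {"Q3", "Q4", "Q9", "Q10"}

result = intersection_extent(english_articles, japanese_ccc)
print(result)
```

In this toy case, `share_of_b` would be the portion of the Japanese CCC covered by the English edition, and one minus that value would be the part still missing.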

Top CCC articles lists

Wikipedia language editions should not be a replica of each other and the gap may never be completely closed. However, a minimal coverage of all other languages should be a goal on the agenda of each Wikipedia edition to create more multicultural (and complete) encyclopaedias.

Figure 3. Top CCC articles lists are different selections of articles from CCC (such as gender, geolocation, etc.) ranked according to a particular feature (number of pageviews, number of editors contributing to it, etc.). This is a useful way to find some relevant articles to bridge the gap.

Top CCC articles lists can help in providing content for this minimal cultural coverage. Inspired by the Vital articles lists, the Top CCC articles present the most relevant articles in terms of different metrics (e.g. number of editors or pageviews) and specific content types (e.g. geolocated articles or women) from a language's cultural context or a country's cultural context.

The currently generated Top CCC articles lists are:

  • list of CCC articles with most number of editors (Editors)
  • list of CCC articles with featured article distinction (Featured), most bytes and references (weights: 0.8, 0.1 and 0.1 respectively)
  • list of CCC articles with geolocation with most links coming from CCC
  • list of CCC articles with keywords on title with most bytes (Bytes)
  • list of CCC articles categorized in Wikidata as women with most edits (Women)
  • list of CCC articles categorized in Wikidata as men with most edits (Men)
  • list of CCC articles created during the first three years and with most edits (First 3Y.)
  • list of CCC articles created during the last year and with most edits (Last Y.)
  • list of CCC articles with most pageviews during the last month (Pageviews)
  • list of CCC articles with most edits in talk pages (Discussions)
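
How several features are combined into one ranking is not detailed on this page. As a rough sketch only, the weights 0.8, 0.1 and 0.1 mentioned for the Featured list could be applied to features normalized by their maximum value. The normalization, the field names and the sample articles below are all assumptions made for illustration, not the observatory's actual method or data.

```python
# Weights taken from the text; everything else is an assumption
WEIGHTS = {"featured": 0.8, "bytes": 0.1, "references": 0.1}


def rank_articles(articles, weights=WEIGHTS):
    """Sort articles by a weighted sum of max-normalized features."""
    if not articles:
        return []
    # Divide each feature by its maximum so the weights are comparable;
    # fall back to 1 when a feature is zero everywhere
    maxima = {f: max(a[f] for a in articles) or 1 for f in weights}

    def score(article):
        return sum(w * article[f] / maxima[f] for f, w in weights.items())

    return sorted(articles, key=score, reverse=True)


# Made-up sample data
articles = [
    {"title": "A", "featured": 0, "bytes": 90000, "references": 200},
    {"title": "B", "featured": 1, "bytes": 30000, "references": 50},
    {"title": "C", "featured": 0, "bytes": 10000, "references": 10},
]

ranked = [a["title"] for a in rank_articles(articles)]
print(ranked)
```

With such a high weight on the featured distinction, a shorter featured article would rank above a longer non-featured one, which is the behaviour the weights suggest.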

In this page, you can consult the list from a particular country or language CCC generated on a monthly basis from the latest CCC dataset. You need to specify the list parameter (editors, featured, geolocated, keywords, women, men, created_first_three_years, created_last_year, pageviews and discussions), the language target parameter (as lang_target and the language wikicode), the language origin (as lang_origin and the language wikicode), and, optionally to limit the scope of the selection, the country origin parameter as part of the CCC (as country_origin and the country ISO3166 code). In case no country is selected, the default is 'all'.
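
The parameters described above can be assembled into a query string. The following is a minimal Python sketch: the parameter names (`list`, `lang_target`, `lang_origin`, `country_origin`) and list values come from the description above, while `BASE_URL` is a placeholder and not the tool's real address, the function name `top_ccc_url` is hypothetical, and the upper-case country code is an assumption.

```python
from urllib.parse import urlencode

# Placeholder address, NOT the real location of the tool
BASE_URL = "https://example.org/top_ccc_articles/"

VALID_LISTS = {
    "editors", "featured", "geolocated", "keywords", "women", "men",
    "created_first_three_years", "created_last_year", "pageviews",
    "discussions",
}


def top_ccc_url(list_name, lang_target, lang_origin, country_origin=None,
                base_url=BASE_URL):
    """Build a Top CCC list URL from the documented parameters."""
    if list_name not in VALID_LISTS:
        raise ValueError(f"unknown list: {list_name!r}")
    params = {
        "list": list_name,
        "lang_target": lang_target,
        "lang_origin": lang_origin,
    }
    # When no country is given, the parameter is left out and the
    # documented default ('all') is assumed to apply
    if country_origin is not None:
        params["country_origin"] = country_origin
    return f"{base_url}?{urlencode(params)}"


# Editors list, language origin Spanish, language target Italian
url_editors = top_ccc_url("editors", "it", "es")
# Same languages, limited to women and to one country (code case assumed)
url_women = top_ccc_url("women", "it", "es", country_origin="ES")
print(url_editors)
print(url_women)
```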

One possible URL with Top CCC list by number of editors, language origin Spanish, language target Italian and no country would be:

A similar list but limited to a specific country and to women, would be:

The generated table includes several metrics, and the rightmost column shows the article's availability: either the current title (in case it exists) or one possible title generated by a translator or taken from a Wikidata label.

Another way to browse the lists is by examining how well a language edition covers the other language editions' Top CCC articles lists (centered around countries, as Countries Top CCC article lists), or how well one particular language edition's Top CCC lists are spread across the rest of the language editions.

In this case, it is necessary to specify the language covering or spreading the lists with the lang parameter. This is an example using Catalan Wikipedia:

  • Languages Top CCC articles spread from Catalan Wikipedia.

  • Languages Top CCC articles coverage by Catalan Wikipedia.

  • Countries Top CCC articles coverage by Catalan Wikipedia.

Missing CCC articles

Normally Wikipedia language editions tend to cover their own cultural context (from territories to all the cultural expressions) much better than others. However, in around 150 languages the cultural context content is below 10% of the content, which is a sign that it is likely underrepresented. In this case, it is very possible that larger Wikipedia language editions have articles that are missing in their CCC. Sometimes these languages are English, French, Russian and Spanish, which are the languages that usually coexist with other languages with a Wikipedia (only 48 Wikipedia language editions are of languages that do not coexist with other languages in one territory).

In order to improve the representation of local content in these underdeveloped Wikipedias, we proposed the creation of a tool named "Missing CCC articles". This allows us to query articles that should exist in one language's CCC but have not been created yet and exist instead in other languages. Additionally, we can also query articles from a language's CCC that are longer in another language edition.

It is possible to query any list by changing the URL parameters or by using the following menus. You first need to select the target language (where you would like to improve local content representation). Additionally, if you want to target a specific part of a language context, you can select the target country and target region; they are optional and allow you to filter for a specific area. For instance, for Target language French, whose language context encompasses several countries, Target country and Target region could be France and Québec.
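
The kind of query described above can be sketched in Python as a set difference. This is an illustration under stated assumptions, not the tool's implementation: items are assumed to be language-independent identifiers related to the target language's territories, the identifiers and language codes in the sample are made up, and the function name `missing_ccc_articles` is hypothetical.

```python
def missing_ccc_articles(target_ccc_items, target_edition_items, other_editions):
    """Find CCC items with no article in the target edition but with
    articles in at least one other language edition.

    other_editions maps a language code to the set of items it covers.
    """
    missing = target_ccc_items - target_edition_items
    found = {}
    for item in sorted(missing):
        langs = sorted(
            lang for lang, items in other_editions.items() if item in items
        )
        if langs:  # keep only items that exist somewhere else
            found[item] = langs
    return found


# Made-up sample data
luganda_ccc_items = {"Q1", "Q2", "Q3", "Q4"}   # items tied to the territory
luganda_edition = {"Q1"}                        # items that already have an article
others = {"en": {"Q1", "Q2"}, "fr": {"Q2", "Q3"}}

result = missing_ccc_articles(luganda_ccc_items, luganda_edition, others)
print(result)
```

Items such as `Q4` in this sample, which have no article in any edition, are left out because there is nothing to import or translate from.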

One possible URL with a query for Luganda CCC about Uganda and Geolocated content that is found in any other language edition would be:

Disclaimer: This tool is still in alpha phase and may contain some bugs. Your feedback can be useful.

New tools (work in progress)

Figure 4. These are the final functionalities we propose for the observatory. They all support the idea of helping editors "browse other cultures and places, and find the gaps they care about".

Currently we want to use the CCC datasets to monitor the gaps on a continual basis (showing the creation of articles for specific kinds of content, to see whether and where editors are really bridging the gap), along with many other lists, solutions and improvements based on all the feedback gathered in past Wikimedia events and from local communities (Figure 4). Likewise, we want to create a multilingual editors dashboard in which to find potential collaborators. An editor must be able to query lists or visualizations to see editors from other language editions according to their cultural context interests.

Other diversity tools and research papers

We also want to provide a short overview of other tools and research papers created outside this project that are useful to understand and detect cultural differences between language editions, and possibly to bridge the gaps or work on other diversity problems like the content gender gap.

Dissemination timeline

These are the most recent actions we have taken to raise awareness of the cultural diversity problem in Wikipedia. They consist of the dissemination of research results, concepts and tools:

Strategic discussions

This project also aims to stimulate debate on the different types of diversity. Some of the Wikimedia 2030 Strategy process discussions in the Diversity Working group are directed at improving diversity in content and in the current communities.

These are some of the recommendations that are related to the project:






You can always contact us and engage in discussions. We believe the Wikimedia movement needs more discussions on diversity in order to encourage the necessary changes to become more inclusive and improve the content coverage.

Activities / Get involved

The Observatory does need dissemination in order to reach all the possible Wikimedia events and activities where it could provide some value. If you want to collaborate, get involved. Leave your username and send us an e-mail at

The project welcomes all kinds of contributors who want to participate in discussions, do some research on diversity or simply fight the gaps. There are at least seven different types of activities or profiles from which the project would benefit.

  • Data retriever frames the problem of diversity and extracts the necessary data to study and throw light on it.
  • Researcher/data analyst studies the data on a problem of diversity, extracts conclusions and communicates them through visualizations.
  • Communicator explains the conclusions in order to raise awareness in specific communities or movement-wide.
  • Strategist proposes some top priority goals, mechanisms or principles based on research and the state of the communities.
  • Creator proposes or creates a tool based on the data or any conclusion in order to improve any diversity-related problem.
  • Developer/designer works on developing and refining the tools in order to make them as usable as possible.
  • Program manager organizes programs with the communities including activities and using the tools in order to solve the problems.

The observatory is primarily involved in the first five activities. However, any of the seven can make a contribution and benefit from the work done. Likewise, work on diversity that originated in other spaces can be disseminated here or trigger new tools. Getting involved can be useful in order to find a meeting point or a place to start working on diversity.

In case you want to code some extra visualizations, you can find the project's code here: github page.