Wikipedia Cultural Diversity Observatory/Sets intersections and increments

Proportion of CCC, CCC GL and CCC KW in 40 Wikipedia language editions (2016)[1].

The main goal of WCDO is to raise awareness of Wikipedia's existing cultural diversity and of the knowledge inequalities based on cultural context content (CCC). It therefore relies heavily on an analytical perspective in order to provide statistics.

Almost every analytical question on cultural diversity can be framed as a matter of proportion ("How many articles are created to bridge this gap...?", "How big is the gap between...?", "What is the extent of...?"). To be able to answer a wide range of such questions, the project has built a flexible abstraction. Once defined, it is possible to compute the statistics (see


The abstraction is based on sets, intersections and increments.


To study groups of articles, their characteristics, and the relationships of the articles with one another and with other groups, each group is treated as a set. For instance, the set corresponding to a prior version of CCC accounted for 23.57% on average across 40 Wikipedia language editions, while CCC GL accounted for 5.04% and CCC KW for 1.14%.

Defining sets also makes it possible to track changes over time.

Sets can be defined by the kind of content, by its characteristics, or as the aggregation of other sets. Some of the sets currently in use are:

  • CCC Content type:

Language edition CCC, language edition CCC geolocated articles, language edition CCC articles with a keyword in the title, CCC articles of all language editions, etcetera.

  • Location:

All geolocated articles in a language edition, all Wikidata items of geolocated articles, geolocated articles for a specific country, geolocated articles for a specific continent, etcetera.

  • Gender:

All articles about people in a language edition, all Wikidata qitems about people, all articles about men in a language edition, all articles about women in a language edition, etcetera.

  • Article characteristics:

Articles without interlanguage links in a language edition, articles without interlanguage links across all Wikipedia language editions, etcetera.

  • Period of time:

Articles created during the past month, pageviews aggregated over a group of articles during the past month, etcetera.

  • General:

All articles in all Wikipedia language editions, all Wikidata qitems with namespace zero, etcetera.
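As an illustration only, the kinds of sets listed above could be represented as plain Python sets of article identifiers. The record fields (`ccc`, `geolocated`, `interlanguage_links`), the sample data, and the helper `define_set` are hypothetical and are not taken from the WCDO code base. Treating "aggregation" as a set union is likewise an assumption.

```python
# Hypothetical article records; every field name and value is illustrative.
ARTICLES = [
    {"id": 1, "ccc": True,  "geolocated": True,  "interlanguage_links": 12},
    {"id": 2, "ccc": True,  "geolocated": False, "interlanguage_links": 0},
    {"id": 3, "ccc": False, "geolocated": True,  "interlanguage_links": 3},
    {"id": 4, "ccc": False, "geolocated": False, "interlanguage_links": 0},
]


def define_set(records, predicate):
    """Return the ids of the records that satisfy the predicate."""
    return {record["id"] for record in records if predicate(record)}


# Sets defined by kind of content or by article characteristics.
ccc = define_set(ARTICLES, lambda r: r["ccc"])
geolocated = define_set(ARTICLES, lambda r: r["geolocated"])
no_interlanguage_links = define_set(
    ARTICLES, lambda r: r["interlanguage_links"] == 0
)

# A set defined as the aggregation of other sets (assumed here to be a union).
ccc_or_geolocated = ccc | geolocated

print(sorted(ccc), sorted(geolocated), sorted(ccc_or_geolocated))
```

Keeping each set as a collection of identifiers means any two sets can later be compared directly, whatever criterion was used to define them.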

Intersections and Increments

With these sets, it is possible to compute the intersection of any two sets, both as the absolute number of articles they share and as a percentage of each set, and then the increment of these values relative to previous periods (monthly, quarterly, half-yearly, yearly, five-yearly). This is especially useful for assessing the impact of different projects on the different knowledge gaps.
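The computation described above can be sketched as follows. This is a minimal example under stated assumptions, not the project's implementation: the function names `intersection_stats` and `increments` and the sample identifiers are hypothetical, and the increment is assumed to be a simple difference between the current and the previous period.

```python
def intersection_stats(set_a, set_b):
    """Shared articles in absolute number and as a percentage of each set."""
    shared = len(set_a & set_b)
    return {
        "shared": shared,
        "pct_of_a": 100.0 * shared / len(set_a) if set_a else 0.0,
        "pct_of_b": 100.0 * shared / len(set_b) if set_b else 0.0,
    }


def increments(current, previous):
    """Difference of each value between two periods (assumed definition)."""
    return {key: current[key] - previous[key] for key in current}


# Hypothetical snapshots of two sets of article ids, one month apart.
ccc_previous, women_previous = {1, 2, 3}, {3, 4}
ccc_current, women_current = {1, 2, 3, 5, 6}, {3, 4, 5, 6}

previous_stats = intersection_stats(ccc_previous, women_previous)
current_stats = intersection_stats(ccc_current, women_current)
monthly_increment = increments(current_stats, previous_stats)

print(current_stats)
print(monthly_increment)
```

In this made-up example the two sets share 3 articles in the current period, 2 more than in the previous one, which is the kind of figure that could indicate whether a project is narrowing a given gap.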


This file contains an extensive list of sets and intersections, along with the specific question each one answers. It is used to keep track of the statistics currently computed.


  1. Miquel-Ribé, M., & Laniado, D. (2016). Cultural identities in Wikipedias. In Proceedings of the 7th 2016 International Conference on Social Media & Society (p. 24). ACM.