Wikipedia Diversity Observatory/Sets intersections and increments

The Wikipedia Diversity Observatory takes an analytical approach: it provides statistics to raise awareness of existing content diversity and gaps. This page explains how the coverage of topics is computed using sets of articles and the intersections between them.

Stats Database

The stats database file (stats.db) contains all the data used to produce the visualizations.

Because the intersections cover all sorts of categories (e.g. cultures, geographical entities, gender, etc.) and are computed over time (on a monthly basis, both as increments and as accumulated values), the file can be considered a history log of content diversity and gaps in Wikipedia language editions.
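
As a minimal sketch of how the file could be explored, the snippet below lists the tables in a database file without assuming anything about their names or columns. It assumes stats.db is a SQLite database, which the .db extension suggests but this page does not state; list_tables is a hypothetical helper name, not part of the project.

```python
import sqlite3
from contextlib import closing


def list_tables(db_path):
    """Return the names of all tables in a SQLite file, sorted alphabetically.

    Assumption: the file at db_path is a SQLite database.
    """
    # closing() guarantees the connection is released even if the query fails.
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    # Each row is a 1-tuple holding the table name.
    return [name for (name,) in rows]
```

Calling list_tables("stats.db") on a downloaded copy would show which tables exist before any query is written against them.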

Stats Concept

Almost every analytical question on cultural diversity can be answered as a matter of proportion ("How many articles are created to bridge this gap...?", "How big is the gap between...?", "What is the extent of...?"). Therefore, to answer a wide range of questions, the project has built a flexible abstraction. Once defined, it is possible to compute the statistics (see

The creation of stats is based on sets, intersections and increments.


Proportion of CCC, CCC GL and CCC KW in 40 Wikipedia language editions (2016)[1].

To study groups of articles, their characteristics, and the relationships among the articles themselves and with other groups, the groups are treated as sets. For instance, the set from a prior version of CCC accounted for 23.57% on average across 40 Wikipedia language editions, while CCC GL accounted for 5.04% and CCC KW for 1.14%.

Defining sets makes it possible to track changes over time later on.

Sets can be defined by the kind of content, by its characteristics, or as the aggregation of other sets. Some of the sets currently in use are:

  • CCC Content type:

A language edition's CCC, its geolocated CCC articles, its CCC articles with a keyword in the title, the CCC articles of all language editions, etcetera.

  • Location:

All geolocated articles in a language edition, all geolocated Wikidata items, geolocated articles for a specific country, geolocated articles for a specific continent, etcetera.

  • Gender:

All articles about people in a language edition, all Wikidata qitems for people, all articles about men in a language edition, all articles about women in a language edition, etcetera.

  • Article characteristics:

Articles in a language edition without interlanguage links, articles in all Wikipedia language editions without interlanguage links, etcetera.

  • Period of time:

Articles created during the past month, pageviews aggregated over a group of articles during the last month, etcetera.

  • General:

All articles in all Wikipedia language editions, all Wikidata qitems in namespace zero, etcetera.
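
The kinds of definition listed above (by content type, by characteristic, or by aggregating other sets) can be sketched with plain Python sets. This is only an illustration: the Article record, its fields, the make_set helper and the four toy articles are all hypothetical and do not reflect the project's actual schema or data.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Set


@dataclass(frozen=True)
class Article:
    """Hypothetical article record; the field names are assumptions."""
    qitem: str                 # Wikidata identifier
    lang: str                  # language edition code
    ccc: bool                  # belongs to the language edition's CCC
    geolocated: bool           # has coordinates
    gender: Optional[str]      # "male", "female", or None for non-person articles
    interlanguage_links: int   # number of interlanguage links


# Toy data, invented for this example only.
ARTICLES = [
    Article("Q1", "ca", True, True, None, 12),
    Article("Q2", "ca", True, False, "female", 0),
    Article("Q3", "ca", False, True, None, 30),
    Article("Q4", "ca", False, False, "male", 5),
]


def make_set(articles: Iterable[Article],
             predicate: Callable[[Article], bool]) -> Set[str]:
    """Define a set as the qitems of the articles that satisfy a predicate."""
    return {a.qitem for a in articles if predicate(a)}


# Sets defined by content type, location, gender and article characteristics.
ca_all = make_set(ARTICLES, lambda a: a.lang == "ca")
ca_ccc = make_set(ARTICLES, lambda a: a.lang == "ca" and a.ccc)
ca_geolocated = make_set(ARTICLES, lambda a: a.lang == "ca" and a.geolocated)
ca_women = make_set(ARTICLES, lambda a: a.lang == "ca" and a.gender == "female")
ca_no_ill = make_set(
    ARTICLES, lambda a: a.lang == "ca" and a.interlanguage_links == 0
)

# A set defined as the aggregation (union) of other sets.
ca_ccc_or_geolocated = ca_ccc | ca_geolocated
```

Storing only identifiers in each set keeps later intersections cheap and makes the same set comparable across periods.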

Intersections and Increments

With these sets it is possible to compute the intersection of two sets, both as the absolute number of articles they share and as a percentage of each set, and then the increment of these values relative to previous periods (monthly, quarterly, half-yearly, yearly, five-yearly). This is especially useful for assessing the impact of different projects on different knowledge gaps.
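
The computation described above can be sketched as follows. The function names, the dictionary keys and the toy sets are hypothetical, and the increment is taken here as a simple difference between two periods (percentage points for the percentage values); the page does not specify whether the project uses a difference or a relative change.

```python
def intersection_stats(set_a, set_b):
    """Return the shared article count and its percentage of each set."""
    shared = len(set_a & set_b)
    return {
        "shared": shared,
        # Guard against empty sets to avoid dividing by zero.
        "pct_a": 100.0 * shared / len(set_a) if set_a else 0.0,
        "pct_b": 100.0 * shared / len(set_b) if set_b else 0.0,
    }


def increment(current, previous):
    """Difference of each statistic between two periods (assumed definition)."""
    return {key: current[key] - previous[key] for key in current}


# Toy sets of qitems for two consecutive months, invented for this example.
ccc_prev, women_prev = {"Q1", "Q2"}, {"Q2", "Q5"}
ccc_now, women_now = {"Q1", "Q2", "Q6", "Q7"}, {"Q2", "Q5", "Q6"}

prev = intersection_stats(ccc_prev, women_prev)
now = intersection_stats(ccc_now, women_now)
delta = increment(now, prev)
```

In this toy case the number of shared articles grows by one, the share of the first set stays flat, and the share of the second set rises, which is the kind of signal that would indicate whether a gap is narrowing.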

Excel of Intersections/Questions

This Excel file (sets_intersections.xlsx) contains an extensive list of sets and intersections, along with the specific question each one answers. It is used to keep track of the statistics currently computed.


  1. Miquel-Ribé, M., & Laniado, D. (2016). Cultural identities in wikipedias. In Proceedings of the 7th 2016 International Conference on Social Media & Society (p. 24). ACM.