Wikipedia Diversity Observatory/Languages

To improve cultural diversity in Wikipedia content, we need both to encourage each Wikipedia language edition to represent its own cultural context and to encourage all language editions to share articles with one another. In this way we improve the coverage of gaps across languages.

Regarding representation, there is another type of gap: some concepts or points of view from a given context are not represented because their native language does not have a Wikipedia. Therefore, looking for potential new Wikipedias and extending the list of editions is a way to improve the overall project’s cultural diversity.

Considering that there are about 7,000 languages in the world according to Ethnologue (SIL), we need strategies to prioritize new Wikipedias. Here we present two criteria for increasing Wikipedia's cultural diversity by encouraging new language editions.


  • Languages with characteristics that make them more *likely* to have a sustainable Wikipedia

Many factors explain the development of a language. Two very important ones are the number of speakers and the EGIDS (Expanded Graded Intergenerational Disruption Scale), which describes the language's status, from full development and use to endangerment. The EGIDS consists of 13 levels, with each higher number on the scale representing a greater level of disruption to the intergenerational transmission of the language. Therefore, a language with a high status (for example, status code 1) and a large number of speakers could be a candidate for a new Wikipedia.

In this figure we can observe several languages of India and Bangladesh with between 10 and 15 million speakers and status 3, which is Wider Communication (the language is used in work and mass media without official status). South Ndebele, a language of South Africa, has status 1, which means national recognition (education, work, mass media, and government), but it is spoken by only 2 million people. It is difficult to predict which language is more likely to get a Wikipedia, especially since other contextual and socioeconomic factors also play a role. However, taking into account all the languages in the world without a Wikipedia, we see that in general the number of speakers is rather small and not many languages with a high status are still left. This explains the current difficulty in engaging new language communities in creating a Wikipedia.
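The first criterion can be sketched as a simple filter and sort. This is only an illustration of the idea, not the Observatory's own method: the thresholds (`max_egids`, `min_speakers`), the `Language` class and the two placeholder entries are assumptions made for the example. Only the South Ndebele figures come from the text above.

```python
from dataclasses import dataclass

# The 13 EGIDS levels, ordered from least to most disruption.
EGIDS_LEVELS = ["0", "1", "2", "3", "4", "5", "6a", "6b", "7", "8a", "8b", "9", "10"]


@dataclass
class Language:
    name: str
    egids: str      # EGIDS level as a string, because of levels such as "6a"
    speakers: int


def rank_candidates(languages, max_egids="3", min_speakers=1_000_000):
    """Keep languages at or above a status cutoff and speaker floor,
    then sort by status first and by number of speakers second.

    Both thresholds are hypothetical defaults chosen for illustration.
    """
    cutoff = EGIDS_LEVELS.index(max_egids)
    eligible = [
        lang for lang in languages
        if EGIDS_LEVELS.index(lang.egids) <= cutoff
        and lang.speakers >= min_speakers
    ]
    return sorted(
        eligible,
        key=lambda lang: (EGIDS_LEVELS.index(lang.egids), -lang.speakers),
    )


candidates = [
    Language("South Ndebele", "1", 2_000_000),   # figures quoted in the text
    Language("Placeholder A", "3", 12_000_000),  # hypothetical entry
    Language("Placeholder B", "6b", 50_000),     # hypothetical entry
]

ranked = rank_candidates(candidates)
for lang in ranked:
    print(lang.name, lang.egids, lang.speakers)
```

Sorting by status before speakers is one possible choice among many. As noted above, contextual and socioeconomic factors are not captured by these two variables, so such a ranking can only be a starting point.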

  • Languages from territories whose indigenous languages do not yet have a Wikipedia

The three hundred languages with a Wikipedia language edition are spread across the world, on all five continents. However, this does not imply that every country or region is covered by a Wikipedia language edition in an indigenous language. Very often a country or a region is covered only by a language of high status (e.g. English or Russian), and the local language does not have a Wikipedia. This creates a knowledge gap and, in turn, less cultural diversity in Wikipedia.

In other cases, both languages have a language edition and contribute to representing the same context from different points of view. After all, the coexistence of languages in the same territory is frequent (of the 300 Wikipedia languages, 252 coexist with another language), and this is also the case for the languages that remain without a Wikipedia language edition.

In the following map we can see the countries with areas (provinces or regions) in which none of the indigenous languages has a Wikipedia language edition. In fact, the map largely depicts formerly colonized countries in Africa, the Americas and Australia.

We can also zoom in on these countries to see the cultural richness that is not being represented in Wikipedia. In the following figure we see a treemap of the five countries with the largest number of indigenous languages. Countries are depicted in different colors, and their regions as inner squares sized according to the number of indigenous languages without a Wikipedia. In particular, we see that Cameroon has the largest number of indigenous languages, and its North region contains 209. In the United States, California is the state with the most, at 61, and in Brazil it is Amazonas, with 55.

Encouraging speakers of languages from these countries and regions to start a Wikipedia seems a strategic action for increasing the project's cultural diversity. In fact, this map presents the territories that offer the most opportunities to represent and add new knowledge. However, one must again take into account that, if this has not happened so far, factors such as the social status of the language, its standardization or the speakers' social situation may be important barriers preventing it. This is where language planning and development require important contextual information.


If you want to examine the data to find potential Wikipedias or valuable gaps, you can either query Wikidata or download the SQLite3 file (diversity_groups.db) provided by the Cultural Diversity Observatory project, which covers all the world's languages. This database contains the language-territory mapping we use to create the Cultural Context Content datasets.
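Since the table layout of diversity_groups.db is not described on this page, a reasonable first step is to list the tables and columns the file contains. The following is a minimal sketch using Python's standard `sqlite3` module. The helper name `describe_database` is made up for this example, and nothing is assumed about the actual table or column names.

```python
import sqlite3


def describe_database(connection):
    """Return {table_name: [column names]} for every table in the database."""
    table_rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    schema = {}
    for (table,) in table_rows:
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        columns = connection.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [column[1] for column in columns]
    return schema
```

To use it, open the downloaded file with `sqlite3.connect("diversity_groups.db")` (the path depends on where you saved it), pass the connection to `describe_database`, and close the connection afterwards. Note that `sqlite3.connect` silently creates an empty database if the path does not exist, so an empty result usually means the path is wrong.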