Wikidata Languages Landscape

This Wikidata Languages Landscape dashboard was developed by WMDE in response to the Wikidata Languages Landscape Phabricator task and as part of the preparations for WikidataCon 2019, whose main topic was languages and Wikidata.

The goal of this dashboard is to provide insights into the ways languages are organized and used in Wikidata and across the Wikimedia projects that reuse Wikidata.

While it makes use of the Wikidata Concepts Monitor (WDCM) reuse statistics, this dashboard is not, strictly speaking, a part of the WDCM analytical system, which produces and analyzes the essential Wikidata reuse statistics. Familiarity with the WDCM system will help an interested user understand how the Wikidata Languages Landscape works, but no prior knowledge of WDCM is required.

Clusters of Reused Wikidata languages as obtained from clustering across their Jaccard distances.

Introduction

The WD Languages Landscape dashboard relies on different data sources to provide a comprehensive picture of how different languages are used in Wikidata and - via the entities that they refer to - how they are mapped across the universe of Wikimedia projects:

the copy of the Wikidata JSON dump in HDFS,
various datasets obtained directly from the WDQS via SPARQL queries, and
datasets on Wikidata entity reuse statistics obtained from the Wikidata Concepts Monitor.

In addition, the List of All Wikimedia Language Codes - as maintained at Wikidata and periodically updated by a bot - is used to verify the language codes obtained from these data sources.

The goal of this dashboard is to provide insights into the various aspects of language use in Wikidata. We study the Wikidata language items, and then the entities that any of the Wikidata languages refers to, to infer the total reuse of a language across the Wikimedia projects. We derive similarity metrics from a language x language matrix of the overlap in their labels across almost 60 million Wikidata items. We study the way Wikidata languages are represented in its ontology and compare this representation to the similarity across languages inferred from empirical data on how their labels overlap across the Wikidata entities. We also look at the UNESCO and Ethnologue language status categories and compare the status of a language against various indicators of its use and reuse in Wikidata and the Wikimedia projects.

Several means of data visualization are employed in this dashboard, relying on {plotly}, {ggplot2}, {igraph}, and {visNetwork} in R to visualize the complex relationships discovered in our study of the Wikidata languages. Apache Spark with PySpark is used for ETL purposes. All computation and modeling is done in R; the dashboard itself is developed in {shiny} and deployed on an open-source RStudio Shiny Server instance in Cloud VPS.

N.B. All presented results are relative to the latest version of the Wikidata JSON dump processed and copied to the WMF Data Lake (see: Phabricator).

Features

The dashboard is organized into several tabs that can be accessed from its left-hand navigation panel.

Ontology Tab. Each node in this graph represents either a particular Wikidata language item or a Wikidata class that encompasses different languages in the ontology. The relations between languages and language classes in Wikidata are organized through different properties: P31 (instance of), P279 (subclass of), and P361 (part of). The relational structure of languages in the ontology is not always systematic (e.g. sometimes a language is linked to a language class or to another language by both P279 (subclass of) and P361 (part of)). The Language/Class tab in this Dashboard can help you inspect these relationships more closely and decide whether a change in the ontology needs to be introduced.
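The kind of inconsistency mentioned above can be checked mechanically. The following is a minimal sketch, not the Dashboard's own code: it assumes the relations are available as (subject, property, object) triples, and the identifiers used (Q_lang_a, Q_family_x, and so on) are hypothetical placeholders rather than real Wikidata items.

```python
from collections import defaultdict

# Hypothetical (subject, property, object) triples; the Q-ids are placeholders,
# not real Wikidata identifiers.
triples = [
    ("Q_lang_a", "P31", "Q_language"),
    ("Q_lang_a", "P279", "Q_family_x"),
    ("Q_lang_a", "P361", "Q_family_x"),
    ("Q_lang_b", "P279", "Q_family_x"),
    ("Q_lang_c", "P361", "Q_family_y"),
]


def mixed_relations(triples, props=("P279", "P361")):
    """Return the (subject, object) pairs that are linked by every property in props."""
    wanted = set(props)
    seen = defaultdict(set)
    for subj, prop, obj in triples:
        if prop in wanted:
            seen[(subj, obj)].add(prop)
    # Keep only the pairs for which all of the requested properties were found.
    return {pair: sorted(found) for pair, found in seen.items() if found == wanted}


flagged = mixed_relations(triples)
print(flagged)
```

Pairs returned by mixed_relations are candidates for manual review; whether a given combination of P279 and P361 is actually an error remains an editorial decision.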

The Fruchterman-Reingold algorithm in {igraph} is used to visualize the network. Note. This Dashboard focuses on the language items that are used for at least one item in Wikidata and whose items are reused in our wikis. All properties used across this Dashboard are obtained by searching Wikidata starting from the set of language items thus defined. This implies that the depiction of the language ontology presented here is not necessarily complete.

Language/Class Tab. This tab introduces a visual browser for Wikidata languages and related language classes. Upon selecting a particular entity (language or language class) from the drop-down menu on the left, the Dashboard will generate a graph of its immediate relational context, taking into account any of the P31 (instance of), P279 (subclass of), and P361 (part of) properties. While most of the organization of languages in Wikidata makes use of the P31 (instance of) and P279 (subclass of) properties, P361 (part of) is also used for some languages, and not always in a consistent way. By inspecting the Wikidata languages of interest in this visual browser you can decide whether their properties are consistently structured and, if you find it necessary or desirable, introduce a change in the Wikidata ontology later on. Hovering over a particular node will reveal the details: (a) the respective Wikidata item, (b) the number of labels for that language in Wikidata, (c) the percent of the items that have a label in that language and are also reused across the wikis, the WDCM (Wikidata Concepts Monitor) reuse statistic, (d) the number of sitelinks for the respective language's item, and (e) the UNESCO/Ethnologue language status for the respective language (if the respective data are present in Wikidata).

Label Sharing Tab. Each bubble in the graphs represents a Wikidata language. We first look at each of the ~60M Wikidata items to see which languages label them. Then we look at the languages, pairwise, and determine the similarity of the way they are used in Wikidata by assessing the items that both languages in a pair refer to (the language overlap) and the items which one of them refers to but the other does not (the mismatch). From these data we compute a similarity index between any two Wikidata languages.
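The overlap/mismatch computation can be illustrated on a toy example. This is a minimal sketch that assumes the similarity index is the Jaccard similarity (overlap divided by overlap plus mismatch), in line with the Jaccard distances mentioned in the figure caption above; the item identifiers, the language sets, and the function names are hypothetical, and the production pipeline works on far larger data.

```python
from itertools import combinations

# Hypothetical toy data: item -> set of language codes that label that item.
item_labels = {
    "Q1": {"en", "de", "fr"},
    "Q2": {"en", "de"},
    "Q3": {"en", "fr"},
    "Q4": {"sr"},
}


def items_per_language(item_labels):
    """Invert the mapping: language code -> set of items labelled in that language."""
    by_lang = {}
    for item, langs in item_labels.items():
        for lang in langs:
            by_lang.setdefault(lang, set()).add(item)
    return by_lang


def jaccard_similarity(a, b):
    """Overlap (items labelled in both) divided by overlap plus mismatch (the union)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


by_lang = items_per_language(item_labels)
similarity = {
    (x, y): jaccard_similarity(by_lang[x], by_lang[y])
    for x, y in combinations(sorted(by_lang), 2)
}

for pair, score in similarity.items():
    print(pair, round(score, 3))
```

Under this assumption, the corresponding Jaccard distance is simply 1 minus the similarity.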

Two visualizations of the similarity data are provided in this tab. The Static/Clusters graph represents each Wikidata language by its Wikimedia code, and each language points towards the language to which it is most similar in the sense described above. A clustering algorithm in {igraph} is used to group the languages according to their overall similarity, and the cluster boundaries are overlaid on the graph.
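The "each language points towards its most similar language" construction amounts to a nearest-neighbour graph. The sketch below shows only that step, under stated assumptions: the similarity values are made up, the function name is hypothetical, and ties are resolved in favour of the pair that sorts first. It does not reproduce the {igraph} clustering.

```python
# Hypothetical pairwise similarities (symmetric), keyed by unordered language pair.
similarity = {
    frozenset({"de", "en"}): 0.67,
    frozenset({"de", "fr"}): 0.33,
    frozenset({"en", "fr"}): 0.60,
    frozenset({"sr", "hr"}): 0.80,
    frozenset({"sr", "en"}): 0.10,
    frozenset({"hr", "en"}): 0.05,
}


def nearest_neighbour_edges(similarity):
    """Map each language to the language it is most similar to."""
    best = {}
    # Iterate in a fixed order so that ties are resolved deterministically.
    for pair in sorted(similarity, key=sorted):
        score = similarity[pair]
        a, b = sorted(pair)
        for src, dst in ((a, b), (b, a)):
            if src not in best or score > best[src][1]:
                best[src] = (dst, score)
    return {src: dst for src, (dst, _) in best.items()}


edges = nearest_neighbour_edges(similarity)
for src in sorted(edges):
    print(src, "->", edges[src])
```

The resulting directed edges could then be handed to a graph library for layout and clustering.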

The second visualization, Interactive/Clusters, presents exactly the same data in a different way: the languages are represented by bubbles whose size corresponds to the number of labels in Wikidata for the respective language. Hovering over any language in the graph will reveal more detailed information. Again, each language is connected to the one to which it is most similar in terms of its usage across the Wikidata entities. While the previous two Dashboard tabs (Ontology and Language/Class) focused on the way the Wikidata language items are connected in Wikidata itself, this tab represents the empirical similarity relations between languages, based on the way they are used to refer to Wikidata entities. By comparing the structure of languages in the ontology with the usage similarity patterns here, we can study whether languages with similar properties also tend to refer to the same sets of entities, or whether they are used in a way that is independent of their properties.

Language Status Tab. We use the UNESCO language endangerment categories and the Ethnologue language status as (and when) reported in Wikidata to study several indicators of how a particular language stands in Wikidata and across the WMF's projects in general.

The horizontal axes in the interactive charts in this tab always represent the language status category. Each bubble represents a Wikidata language, but different variables are mapped onto the size of the bubble and the charts' vertical axis in each panel. We focus on the following language usage indicators here: (a) the number of sitelinks for the respective Wikidata language, (b) the number of labels that it has (i.e. the number of entities to which it refers in Wikidata), and (c) the WDCM reuse statistic for all the items referred to from a particular language (for the definition of the WDCM reuse statistic see the Description tab).

By following the usage indicators (sitelinks, number of labels, and reuse across the WMF projects) for Wikidata languages of less favorable status, we can recognize which languages we need to focus on in order to represent them in Wikidata and across the Wikimedia universe and thus help their preservation. When combined with the structural and empirical similarity data provided on the previous tabs in this Dashboard, these data can help formulate strategies to improve the digital representation of underrepresented languages in general.

Language Usage Tab. The charts in this Dashboard tab represent various indicators of language usage in Wikidata and across the Wikimedia projects and are meant for a more general assessment of the Wikidata languages in comparison to the detailed analytics provided in the previous tabs.