Wikidata Identifier Landscape

From Meta, a Wikimedia project coordination wiki

The Wikidata Identifier Landscape dashboard tracks how Wikidata external identifiers are used and how their usage overlaps across Wikidata items.

Dashboard Tabs. To get an insight into the dataset, browse the Similarity Map tab, which presents a global overview of the overlap in the usage of Wikidata identifiers, and the Tables tab, where the respective data can be found. The Overlap Network tab visualizes all Wikidata external identifiers in a network of nearest neighbors. The Identifier Classes tab presents insights into the relationships between the Wikidata external identifiers that belong to the same class of Wikidata identifiers. The Particular Identifier tab provides insights into the data for any Wikidata external identifier of choice.

WD External Identifiers Overlap Network. Each bubble in the network represents a WD external identifier. Nearest neighbors in terms of the number of shared items (i.e. overlap) are connected: each identifier links to its own nearest neighbor. The size of the bubble corresponds to the total number of items across which the respective identifier overlaps with other identifiers. In the dashboard, you can hover over a bubble to see the details of the respective identifier (identifier ID and the measure of total overlap) and use the toolbox (in the top-right corner of the network) to zoom, pan, or download. The Fruchterman-Reingold algorithm in {igraph} is used to lay out the identifier network.

Introduction

This dashboard was developed by WMDE in response to the Phabricator task "Analyze and visualize the identifier landscape of Wikidata".

The Wikidata Identifier Landscape dashboard relies on a dataset obtained by running ETL in Apache Spark (with PySpark) against the copy of the Wikidata dump in HDFS (WMF Data Lake), followed by post-processing in R. All machine learning procedures are performed in R; the t-SNE dimensionality reduction is handled by {Rtsne}.
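As a rough illustration of what the ETL step has to produce, the sketch below counts, for a toy item-to-identifier mapping, how many items use each identifier and how many items each pair of identifiers shares. It is plain Python, not the PySpark and R pipeline the dashboard runs, and the item IDs, property IDs, and the function name `count_usage_and_overlap` are placeholders chosen for the example.

```python
from collections import Counter
from itertools import combinations


def count_usage_and_overlap(item_identifiers):
    """Count per-identifier usage and pairwise overlap.

    item_identifiers maps an item ID to the set of external identifier
    properties used on that item (a hypothetical toy structure).
    """
    usage = Counter()
    overlap = Counter()
    for identifiers in item_identifiers.values():
        unique = sorted(set(identifiers))
        usage.update(unique)
        # Every unordered pair of identifiers on the same item shares that item.
        overlap.update(combinations(unique, 2))
    return usage, overlap


# Toy data: placeholder IDs, not content taken from the Wikidata dump.
items = {
    "Q1": {"P214", "P227", "P244"},
    "Q2": {"P214", "P227"},
    "Q3": {"P214"},
    "Q4": {"P244", "P227"},
}

usage, overlap = count_usage_and_overlap(items)
print(usage["P214"])              # number of items that use P214
print(overlap[("P214", "P227")])  # number of items that use both
```

Pair keys are stored in sorted order, so each unordered pair appears once. At dump scale the same counting would be expressed as a distributed aggregation, not an in-memory loop.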

The goal of this dashboard is to provide insight into the structure of the overlap in the usage of various Wikidata external identifiers. Several means of data visualization are employed to that end, relying on {plotly}, among other tools, to visualize complex semantic maps and networks.

N.B. All the results presented on this dashboard are relative to the latest version of the Wikidata dump processed and copied to the WMF Data Lake.

Visualizations

Visualizing the overlap structure of the Wikidata external identifiers is a challenging task. To provide as thorough an insight as possible into the similarity in the usage of various identifiers, we employ several different approaches to data visualization. The landing tab, Similarity Map, presents a two-dimensional map in which each Wikidata external identifier (that is used at all) is represented by a bubble. The higher the overlap across the Wikidata items described by a particular pair of identifiers, the closer together the bubbles that represent them stand in the map. The size of each bubble corresponds to the number of items that the respective identifier describes. Since any Wikidata external identifier can fall into more than one Wikidata class of identifiers, we have avoided a fixed color scheme for marking identifiers that belong to the same class. Instead, the user selects a particular class of identifiers of interest, and the respective bubbles in the map are colored to ease recognition.
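To make "higher overlap means closer bubbles" concrete, the sketch below turns usage and overlap counts into a pairwise distance matrix of the kind a dimensionality reduction method such as t-SNE can take as input. The source does not say which dissimilarity measure the dashboard uses, so the Jaccard distance here is an assumption, and the counts and function names are placeholders.

```python
def jaccard_distance(usage_a, usage_b, shared):
    """Return 1 - |A ∩ B| / |A ∪ B| from item counts for A, B, and both."""
    union = usage_a + usage_b - shared
    if union == 0:
        return 0.0  # convention for this sketch: two unused identifiers coincide
    return 1.0 - shared / union


def distance_matrix(usage, overlap):
    """Build a symmetric distance matrix over all identifiers in `usage`."""
    ids = sorted(usage)
    matrix = []
    for a in ids:
        row = []
        for b in ids:
            if a == b:
                row.append(0.0)
            else:
                # Overlap keys are unordered pairs, so check both orders.
                shared = overlap.get((a, b), overlap.get((b, a), 0))
                row.append(jaccard_distance(usage[a], usage[b], shared))
        matrix.append(row)
    return ids, matrix


# Toy counts with placeholder property IDs (assumed, not real data).
usage = {"P214": 3, "P227": 3, "P244": 2}
overlap = {("P214", "P227"): 2, ("P214", "P244"): 1, ("P227", "P244"): 2}

ids, dist = distance_matrix(usage, overlap)
for identifier, row in zip(ids, dist):
    print(identifier, [round(value, 3) for value in row])
```

A matrix like this could then be handed to a t-SNE implementation that accepts precomputed distances; {Rtsne}, for instance, has an `is_distance` argument for that purpose.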

Wikidata Identifier Landscape: Semantic Map

A more straightforward (and probably more popular) approach to visualizing datasets such as the one at hand is to employ graphs. On the Overlap Network tab, each identifier is again represented by a bubble and points towards its nearest neighbor: the identifier with which it shares the highest number of items that they both describe. The size of the bubble in this visualization does not represent the number of items described by a particular identifier but the extent of its total overlap with other identifiers, which makes the hubs in the network easier to spot.
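The nearest-neighbor rule described above can be sketched in a few lines. The example assumes that "total overlap" is the sum of an identifier's pairwise overlaps, which is one possible reading of the description; the overlap counts, property IDs, and the function name are placeholders.

```python
def nearest_neighbor_edges(overlap):
    """Derive one directed edge per identifier plus its total overlap.

    overlap maps an unordered pair (id_a, id_b) to the number of items
    that carry both identifiers.
    """
    best = {}   # identifier -> (nearest neighbor, shared item count)
    total = {}  # identifier -> sum of its pairwise overlaps (assumed measure)
    for (a, b), shared in overlap.items():
        for source, target in ((a, b), (b, a)):
            total[source] = total.get(source, 0) + shared
            current = best.get(source)
            # Keep the neighbor with the most shared items; first one wins ties.
            if current is None or shared > current[1]:
                best[source] = (target, shared)
    edges = {source: target for source, (target, _) in best.items()}
    return edges, total


# Toy overlap counts with placeholder property IDs.
overlap = {
    ("P214", "P227"): 5,
    ("P214", "P244"): 2,
    ("P227", "P244"): 3,
    ("P244", "P345"): 1,
}

edges, total = nearest_neighbor_edges(overlap)
for source in sorted(edges):
    print(source, "->", edges[source], "| total overlap:", total[source])
```

In this toy network P227 has the largest total overlap and would be drawn as the biggest bubble, even though each identifier still emits exactly one edge.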

Wikidata Identifier Landscape Network

The Tables tab enables the user to look up specific identifiers and inspect the exact extent of their overlap with all other Wikidata external identifiers. A second table under the same tab provides the counts of items described by each of the considered identifiers.
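A per-identifier overlap table of this kind amounts to filtering the pairwise counts for one identifier and sorting the result. The sketch below shows the idea on toy data; the function name `overlap_table` and the property IDs are hypothetical and not taken from the dashboard.

```python
def overlap_table(identifier, overlap):
    """List (other identifier, shared items) rows for one identifier."""
    rows = []
    for (a, b), shared in overlap.items():
        if identifier == a:
            rows.append((b, shared))
        elif identifier == b:
            rows.append((a, shared))
    # Largest overlap first; ties are ordered by identifier ID for stable output.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


# Toy overlap counts with placeholder property IDs.
overlap = {
    ("P214", "P227"): 5,
    ("P214", "P244"): 2,
    ("P227", "P244"): 3,
    ("P244", "P345"): 1,
}

table = overlap_table("P244", overlap)
for other, shared in table:
    print(other, shared)
```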

As already explained, Wikidata external identifiers belong to one or more Wikidata identifier classes. The Identifier Classes tab provides a browser of these classes, generates a local neighborhood similarity structure (based on the overlap in identifier usage) for all the identifiers found in the selected class, and lists all identifiers in that class. Similarly, the Particular Identifier tab provides information about a particular Wikidata external identifier: its local neighborhood structure and a set of example uses of that identifier (represented by a selection of items that it describes, the Wikidata classes to which these items belong, and the values of the selected identifier associated with them).