Research:Wikidata gap analysis
- 1 Background
- 2 Research
- 2.1 RQ1: Missing labels/missing articles
- 2.2 RQ2: Missing descriptions/missing articles
- 2.3 RQ3: Missing images
- 2.4 RQ4: Common infoboxes
- 2.5 RQ5: Primary fields
- 2.6 RQ6: Data availability
- 2.7 RQ7: Trustworthiness
- 2.8 RQ8: Time-based datatypes
- 2.9 RQ9: Coordinate usage
- 2.10 RQ10: Hierarchical property usage
- 3 Conclusion
- 4 Notes
The Wikimedia Foundation's engineering and product development teams are interested in ideas that allow them to integrate their services with those of Wikidata, the free knowledge base. The idea is to open up opportunities to surface information to readers, generate and verify information, and create opportunities to recruit community members. Of these opportunities, the priority is reader information.
To get a better idea of what the limitations and opportunities in this space are, the WMF has commissioned this report to (in decreasing order of importance):
- Understand any systemic bias or limitations the content on Wikidata may have;
- Understand any technical restrictions Wikidata may have that impact particular ideas for integration;
- Understand any community restrictions or limitations Wikidata may have that impact particular ideas for integration.
Wikidata capabilities, limitations and plans
Wikidata's base of information is a goldmine, but one of the biggest issues surfaced was the lack of a proper API. At the moment, Wikidata simply has the MediaWiki API; retrieving information for items is done by requesting the revision text for that item (which is a JSON blob) and then processing it client-side. This is somewhat inefficient, and there is little granularity in which parameters you can select.
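As a rough illustration of the client-side processing described above, the following Python sketch pulls a label and description out of an item blob. The blob is a hypothetical, heavily trimmed example, and the field layout (`labels`, `descriptions`, `sitelinks`, `claims`) is an assumption modelled on Wikidata's entity JSON; `summarise` is an illustrative helper, not part of any API.

```python
import json

# Hypothetical, heavily trimmed item blob. The field layout is an assumption
# modelled on the entity JSON that Wikidata serves for an item.
RAW_ITEM = """
{
  "id": "Q76",
  "labels": {"en": {"language": "en", "value": "Barack Obama"}},
  "descriptions": {"en": {"language": "en", "value": "44th President of the United States"}},
  "sitelinks": {"enwiki": {"site": "enwiki", "title": "Barack Obama"}},
  "claims": {}
}
"""


def summarise(raw, lang):
    """Return (label, description) for one language; None where a value is missing."""
    item = json.loads(raw)
    label = item.get("labels", {}).get(lang, {}).get("value")
    description = item.get("descriptions", {}).get(lang, {}).get("value")
    return label, description


print(summarise(RAW_ITEM, "en"))
print(summarise(RAW_ITEM, "de"))
```

Note that the whole blob has to be fetched and parsed even though only two fields are used, which is the inefficiency the paragraph above refers to.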
More importantly, the lack of any clear importance ranking for Wikidata parameters makes it hard to build useable article summaries. We can pull out, say, the description and title in a fairly standard way, but if we look at the Wikidata item on Barack Obama, the lack of importance ranking means that his birth date is ranked just as highly as the fact that he's President of the United States and a Nobel Prize winner. From a reader point of view, the last two bits are clearly more vital in any summary, but there's no way for a machine to tell. Other issues surfaced include the lack of an ability to include data with associated units (we can include a "weight" of 200, but not know whether that's pounds or ounces or grams or tola), and access to arbitrary items (it is currently impossible to incorporate data from an item by item ID, unless the page you're incorporating it on has that ID).
Wikidata's future development plans include tackling the arbitrary item incorporation, and the association of units with numeric values. Further down the road is building a fully-fledged API, but this is currently being reworked.
One issue surfaced by Lydia during our meeting with her was the social limitations of Wikidata as a platform for expansion. On the Wikidata side, these largely stem from the small size of the community: there are concerns about relying on the data due to the limited number of people available to check it, and concerns about encouraging newcomers to contribute to Wikidata given that lack of oversight. From other projects, the issue is one of trust: sister projects tend not to trust Wikidata, its content and its capabilities enough, yet, to willingly automate and pass over things like infobox generation and content.
A lot of Wikidata's efforts over the next few months are focused on solving the trust problem, and their future development plans reflect this focus. This is going to be a big factor in how easy or non-controversial integration between Wikidata and other projects is, for us, even if the integrations are reader-facing.
Wikidata contains slightly fewer than 14 million distinct items, with associated statements (properties, and values associated with those properties). The bias in this content reflects the bias on our other projects, since data is drawn from there; relatively little research has been done specifically on bias in Wikidata, and that research which has been done is normally framed as looking at Wikipedia and using Wikidata as a convenient way of accessing the metadata. The bias of this content is paramount, however, to the use cases we are investigating; as such, it will be the main subject of this research.
A lot of ideas have been surfaced by both WMF staffers and Wikidata staffers and community members; if you have more ideas, drop them on the talkpage. Highlights are:
- Displaying Wikidata-sourced placeholder articles where projects lack a local article on a subject: this has been extensively discussed with the community (see here and here) and people genuinely seem to approve, which is a promising start. The idea is that if a user navigates to an article (say 'Foo') on a project that lacks it, Wikidata will step in, identify that they have a data item for Foo, and construct a dummy article. One question here would be whether this is better or worse, for editors and readers, than a redlink; there is some promising research, but it's limited in how detailed it is and what methodology was used.
- Generating short, more easily-digested summaries of content; this could be done either in search results, to provide a more self-contained piece than the current search results page does (since that merely displays the first N characters, which may or may not be self-contained), or in articles themselves (as seen in the screenshot to the right).
- Providing multilingual search: allowing users to search in different languages, or across different projects. This would take advantage of Wikidata's "sitelinks", along with the localised item titles and descriptions. It is heavily interlinked with "generating short, more easily-digested summaries of content".
- Powering local infoboxes, eliminating the need for manual updates and synchronization of content in multiple languages, and making it easier to selectively iterate on which data is displayed in an infobox (without having to modify potentially thousands of pages that invoke a template).
- Interactive timelines specifically enriching articles about historical events. Examples have already been built, including Histropedia and Magnus Manske's Tempo Spatial Display of Information, which also includes maps.
- Interactive maps, as above. Whether to support timelines or to enable exploration of an item as it relates to other items near it ("show churches nearby"), structured data could enable new types of discovery.
- Interactive charts could replace manually generated and updated static images that show relevant data such as population and similar time series data, visualize scientific datasets, and more.
- Hierarchical and sequential navigation refers to functionality which enables fast exploration of related topics beyond the existing capabilities of the category topics. For example, Wikidata could make it easier to navigate through all the books written by an author, or replace manually maintained navigation boxes with software-provided user interfaces.
The use cases have particular content and research questions in common, listed here, that this document intends to answer:
|Relevant content||Use cases||Questions|
|'time' datatypes and properties||
|properties to indicate a location||
|Properties that define order, hierarchy or membership (follows, followed by, replaces, succeeded by, has part, part of, cast member, relative, notable works)||
RQ1: Missing labels/missing articles
For research question 1, we want to answer:
- How many Wikidata items have an article in a language project, but no label in that language?
- How many Wikidata items have a label in a language, but no equivalent article?
Labels are necessary prerequisites for a lot of ideas for Wikidata integration, including multilanguage search, short summaries and article placeholders. To look at coverage, we took a randomly-selected sample of one million Wikidata items, and for each item, identified the difference between labels and sites: the labels without a site for the same language, and the sites without a label for the same language. Common sites without specific languages were removed.
Of the 1 million items identified, only 44,273 labels were missing for languages where we have a linked article; since we have 2,239,967 links, that comes to just under 2%. Conversely, there are 1,411,842 additional labels: labels in a language without a corresponding site link.
This is pretty promising, in theory, for multilingual search and article placeholders: it indicates that we have a lot of language coverage for labels in languages where we lack an equivalent article, and could use that to automatically generate placeholders. It also indicates that we have very few missing labels, and so search results in a multilingual context are likely to be fairly complete.
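The comparison described above can be sketched in Python as a pair of set differences per item. This is a minimal sketch, not the code used for the report: the item layout, the `<language>wiki` naming of site IDs and the contents of `NON_LANGUAGE_SITES` are all assumptions, and `label_gaps` is a hypothetical helper name. The last lines recompute the share of missing labels from the two figures quoted above.

```python
# Assumed IDs for "common sites without specific languages".
NON_LANGUAGE_SITES = {"commonswiki", "simplewiki"}


def site_language(site_id):
    """Map a site ID such as 'enwiki' to a language code, or None if it has none.

    Assumption: Wikipedia site IDs look like '<language>wiki', with underscores
    where the language code uses hyphens.
    """
    if site_id in NON_LANGUAGE_SITES or not site_id.endswith("wiki"):
        return None
    return site_id[: -len("wiki")].replace("_", "-")


def label_gaps(item):
    """Return (languages with a sitelink but no label, languages with a label but no sitelink)."""
    labels = set(item.get("labels", {}))
    sites = {site_language(s) for s in item.get("sitelinks", {})} - {None}
    return sites - labels, labels - sites


# Hypothetical item: labels in three languages, articles in two, plus Commons.
ITEM = {
    "labels": {"en": {}, "de": {}, "fr": {}},
    "sitelinks": {"enwiki": {}, "eswiki": {}, "commonswiki": {}},
}
missing, additional = label_gaps(ITEM)
print(missing, additional)

# Share of sitelinks lacking a label, using the counts reported above.
share = 44_273 / 2_239_967
print(f"{share:.2%}")  # prints 1.98%
```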
However, the 1.4 million additional labels are not evenly distributed; indeed, most languages have very few, with a small number having very many (Fig.1). Those languages with excessive coverage are English, German and other common European languages where we already have wide article coverage and saturated reader bases. In other words, while we have many unassociated labels, most of them are not necessarily useful for attracting or increasing engagement with a "new" audience.
RQ2: Missing descriptions/missing articles
For research question 2, we want to answer:
- How many Wikidata items have an article in a language project, but no description in that language?
- How many Wikidata items have a description in a language, but no equivalent article?
For this we will use the same dataset that we did with RQ1, comparing descriptions and sitelinks.
The 1 million items in the sample contain 5,962,032 descriptions - an amazingly large number! Not only that, but almost all of them - 89% - don't have associated sitelinks. More importantly, a lot more languages (Fig.2) are represented for descriptions than labels (Fig.1). Anecdotal feedback from Wikidata users suggests the prevalence of localised descriptions is due to semi-automated editing, and so the level of detail is likely to be limited, but it's still a really good start for things like article placeholders and search results.
Unfortunately the inverse is also true; there are a large number of sitelinks that lack descriptions - almost 73%. These are widely distributed between different language projects (Fig.3), and have implications for (e.g.) multilingual search.
RQ3: Missing images
Images are necessary for UI-friendly descriptions and highly helpful with article placeholders. In Wikidata, "image" is not a component or datatype - instead, it's a class of data held by particular properties. Using a hand-gathered list of these properties, we looped through the 1 million retrieved items. If an item had a claim with a name matching one of these properties, it was tagged as having an associated image. If it didn't, it was tagged as missing an image.
Based on this, only 4.6% of Wikidata items have an associated image. Moreover, the experience of identifying where one would even look for an image is suboptimal. The lack of any concrete exposure of properties at the moment, and how variable they are (the community can create new ones or delete old ones on a whim), means that it is exceedingly difficult to identify what an image even is in the Wikidata item. Where multiple images are present, the lack of any kind of an importance indicator makes it difficult to identify which would be most beneficial to the reader in a placeholder or in search results.
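The tagging step described above can be sketched as follows. The property IDs in `IMAGE_PROPERTIES` are illustrative and are not the hand-gathered list used for this report; the item layout and the helper names are likewise assumptions.

```python
# Illustrative property IDs that hold image values; NOT the report's actual list.
IMAGE_PROPERTIES = {"P18", "P41", "P94", "P154"}


def has_image(item):
    """True if any claim on the item uses one of the listed image properties."""
    return any(prop in IMAGE_PROPERTIES for prop in item.get("claims", {}))


def image_share(items):
    """Fraction of items tagged as having an associated image."""
    items = list(items)
    if not items:
        return 0.0
    return sum(has_image(item) for item in items) / len(items)


# Hypothetical three-item sample: only the first item carries an image claim.
SAMPLE = [
    {"claims": {"P18": [], "P31": []}},
    {"claims": {"P31": []}},
    {"claims": {}},
]
print(image_share(SAMPLE))
```

Because the list is maintained by hand, any image property the community adds later is silently missed, which is the fragility the paragraph above points to.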
RQ4: Common infoboxes
See the conclusion.
RQ5: Primary fields
See the conclusion.
RQ6: Data availability
See the conclusion.
RQ7: Trustworthiness
See the conclusion.
RQ8: Time-based datatypes
Wikidata has datatypes built around time and temporal units - these are necessary if we want to roll out things such as interactive timelines. This date-time datatype has the capability to:
- Store date-times in a variety of calendar formats;
- Store date-times with varying levels of precision;
- Store date-times of different levels of detail (from just the year, down to seconds)
These are all good and allow us to easily establish, from a set of date-times associated with events, what order they come in. Approximately 12.3% of Wikidata items have one or more associated date-times, so for collections of articles this is certainly plausible.
However, the fact that an element of a Wikidata item has a date-time type is not actually exposed by the API. Instead, the individual properties - which any user can create, remove or modify - do this. This means that to identify whether an item even contains properties to parse as date-times, the client would have to either maintain a list locally of what properties are "date-times" - assuming nobody has modified a property or added new properties - or programmatically retrieve each and every property associated with the item to get the data type from that item.
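The first workaround, a locally maintained list of date-time properties, might look like the Python sketch below. The property IDs in `TIME_PROPERTIES`, the claim layout (`mainsnak`, `datavalue`, `value`, `time`) and the `+YYYY-MM-DDT...` string format are assumptions, and the helper names are hypothetical. The sketch also shows how the extracted values can be ordered, ignoring calendar model and precision for simplicity.

```python
import re

# Assumed, hand-maintained list of properties known to hold date-times.
# It goes stale as soon as a property is added, removed or changed.
TIME_PROPERTIES = {"P569", "P570", "P571", "P577"}

# Assumed value format: sign, year, month, day, e.g. "+1952-03-11T00:00:00Z".
TIME_RE = re.compile(r"^([+-])(\d+)-(\d\d)-(\d\d)T")


def sort_key(time_string):
    """Turn a time string into a (year, month, day) tuple usable for ordering."""
    match = TIME_RE.match(time_string)
    if match is None:
        raise ValueError(f"unrecognised time value: {time_string!r}")
    sign, year, month, day = match.groups()
    signed_year = -int(year) if sign == "-" else int(year)
    return signed_year, int(month), int(day)


def dated_claims(item):
    """Yield (property, time string) pairs for claims whose property is in the local list."""
    for prop, statements in item.get("claims", {}).items():
        if prop not in TIME_PROPERTIES:
            continue
        for statement in statements:
            value = statement.get("mainsnak", {}).get("datavalue", {}).get("value")
            if isinstance(value, dict) and "time" in value:
                yield prop, value["time"]


# Hypothetical item with two date-time claims and one unrelated claim.
ITEM = {
    "claims": {
        "P570": [{"mainsnak": {"datavalue": {"value": {"time": "+2001-05-11T00:00:00Z", "precision": 11}}}}],
        "P569": [{"mainsnak": {"datavalue": {"value": {"time": "+1952-03-11T00:00:00Z", "precision": 11}}}}],
        "P31": [{"mainsnak": {"datavalue": {"value": {"numeric-id": 5}}}}],
    }
}
timeline = sorted(dated_claims(ITEM), key=lambda pair: sort_key(pair[1]))
print(timeline)
```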
RQ9: Coordinate usage
Maps or geographic features dependent on Wikidata require, well, geographic data. Wikidata has a geographic locations datatype and a planned geographic shape datatype. Approximately 11.9% of Wikidata entries contain properties with geographic coordinates.
Again, however, the fact that a property has this type is not actually exposed through the API, making it difficult to rely on automatically. The lack of priority or importance also makes it difficult to identify what property should be used, in a case where multiple geographic properties are present.
RQ10: Hierarchical property usage
Wikidata does contain some hierarchical properties; these are not the same as an importance ranking, but they could serve as a substitute in cases where we want to automatically determine what information to include, or generate things like interactive hierarchical trees around a particular topic. Taking a list of these properties, along with the 1 million items already retrieved, we find that 1.8% of pages have one or more of these hierarchical properties associated with them.
Again, the nature of these properties is not called out in the API response that items with those properties provide, making it hard to accurately, automatically identify whether a page has hierarchical elements or what those elements are (and in what order they should be placed).
Conclusion
We ended the research project without investigating infobox usage due to the number of systemic problems we found while delving into the other research questions. Infobox usage could be investigated once those problems have been addressed.
Wikidata, its content and its software have fantastic potential with regard to the creation of new experiences for readers and contributors. The incredible growth of Wikidata is a testament to the interest it has generated in its ecosystem, and to the strength and commitment of its community. However, the content and software have not reached the level of maturity required for widespread and deep integration with Wikimedia Foundation products. Our recommendation is that the Wikimedia Foundation not look into additional opportunities to integrate products with Wikidata at this time. Surgical integration of very specific types of content in very specific environments may be considered on a case-by-case basis, only after investigating that the appropriate content and tools are present, usable and reliable.
Instead, the WMF should focus on increasing the resourcing to Wikidata and setting out a clear idea of what we would need the system to do before we can integrate with it. This is due to widespread, systemic architectural problems and content deficiencies in Wikidata as it currently stands, which act as blockers to wider adoption and use. These are:
- The idea of article placeholders or short summaries is undermined by the answers to research questions one and two. For either idea to be viable, we require content to be available in a localised way, on a consistent basis. Wikidata has almost all labels localised for sites where we have articles on those subjects, but almost all of the "extraneous" localised labels are for high-volume sites (enwiki, dewiki). In other words, we would only be able to provide placeholders or additional context on projects where coverage is already excellent and extensive. And, while we have localised labels, most of the sitelinks on Wikidata (almost 73% in our sample) lack a localised description, acting as a blocker to short summaries. This is further compounded by the lack of any clear identifier for importance or primacy in Wikidata items, making it difficult to identify what the most pertinent information to surface about a subject is when creating placeholders.
- The idea of geographic, temporal or hierarchical models is undermined by the fact that the Wikidata API does not expose whether something is a coordinate, a date-time or an element of a hierarchy directly: instead, we would have to query the property pages themselves, for each item being considered. Not only that, but data coverage is itself scant in some areas; fewer than 2% of Wikidata items have any kind of hierarchical element on them at all.
Proposed Wikidata improvements
- Property datatypes should be noted in the information returned from the API when querying an item that contains that property;
- A focus should be put on description localisation, which is currently lagging substantially behind the unstructured data on our projects and the structured labels on Wikidata;
- Importance levels for properties, either on a per-property or per-item basis, should be introduced, offering a way to programmatically identify high-value information to surface to users, or to resolve conflicts when multiple images or geographic elements are present.
Notes
- The plan was to have one API for simple queries, and one for complex queries, where you'd log a request and it would compute it in the background and send it to you some time later. This plan was thrown out of sync by the discovery that it was possible to make complex queries fast enough for near-instantaneous responses. This is a pretty good problem to have.
- As of writing, the countID had exceeded 19m, but with deletions the actual number of items hovered at around 13.8m
- Manske, Magnus (2015) Sex and Artists
- Solomon, J., and Wash, R. Bootstrapping wikis: developing critical mass in a fledgling community by seeding content. In Proc. CSCW (2012), 261–264.
- For example, "simple" or "commons"
- Once you know the Spanish for "A football player", you can tag all entities that are football players that way