Research:Building a Wikidata Content Gap Index
This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.
In 2017, the Wikimedia Foundation announced a strategic goal to support “knowledge and communities that have been left out by structures of power and privilege”. To date, much of the work in this area has focused on Wikipedia. In this project, we aim to develop a formal framework of content gaps in Wikidata. We will identify existing metrics from current research and develop novel metrics and methods to further the state of the art. We will then evaluate these metrics to identify which measure gaps most accurately and which best support the development of a content gap index.
Context and Motivation
Previous research has shown that Wikidata is generally of high, and improving, quality, but that its effectiveness is hampered by missing data, among other factors. Such gaps have also been recognised by the Wikimedia Foundation in the knowledge gap white paper[1], which outlines a taxonomy of knowledge gaps and calls for research that identifies and bridges them.
To recognise and address these gaps, the Wikimedia Foundation has developed an index of content metrics for Wikipedia[2], which maps types of content gap to quantitative measurements for formal assessment. However, for many content gap classes the framework lacks metrics for formally measuring gaps. Additionally, the metrics defined are largely tied to Wikipedia articles and do not necessarily generalise or apply to Wikidata. For example, when it comes to completeness of the graph, the complexity of the Wikidata ontology model makes it difficult to identify and measure gaps; it is easier to look at specific case studies based on entity type, e.g., people or places. Similarly, when it comes to individual entities, measuring content as a number of characters or bytes does not work well with structured data. On the other hand, simply counting statements is too simplistic, as some statements are more descriptive than others.
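The limitation of plain statement counts can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not a metric proposed by this project: an entity is modelled as a dictionary mapping property IDs to lists of values, and the per-property weights are hypothetical numbers standing in for "how descriptive" a property is.

```python
def statement_count(entity):
    """Raw number of statements: every value counts the same."""
    return sum(len(values) for values in entity.values())


def weighted_statement_count(entity, weights, default=1.0):
    """Statement count in which each property carries its own weight.

    `weights` maps property IDs to hypothetical descriptiveness weights;
    properties without an entry fall back to `default`.
    """
    return sum(weights.get(prop, default) * len(values)
               for prop, values in entity.items())


# Illustrative entity; the values are placeholders, not real Wikidata content.
entity = {
    "P31": ["value-a"],             # one statement
    "P569": ["value-b"],            # one statement
    "P106": ["value-c", "value-d"]  # two statements
}

# Hypothetical weights, chosen only to show that the two measures diverge.
weights = {"P31": 0.5, "P569": 2.0}

raw = statement_count(entity)
weighted = weighted_statement_count(entity, weights)
```

Two entities with the same raw count can receive different weighted scores, which is the distinction the paragraph above draws. How such weights should be chosen is left open here.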
We thus want to develop a content gap index for Wikidata which aligns with the knowledge gap taxonomy and is operationalised similarly to the Wikipedia one, but which takes into account the specificities of Wikidata as a structured knowledge base. We are conscious that prior research has already addressed some of these challenges. Relevant research on Wikidata has separately looked at its gender diversity representations[3] in terms of gender identity distributions; compared Wikidata to other knowledge graphs on a range of graph quality metrics, including completeness and accuracy[4]; analysed race and citizenship bias in the knowledge base[5]; and studied Wikidata content gaps as the imbalance between Wikidata contributions and Wikipedia pageviews[6], the latter representing information needs. Our goal, then, is to collect the existing metrics that have been developed for classifying Wikidata content gaps, and also to conceptualise and develop novel metrics and methods for identifying these gaps. We will further compare the accuracy and effectiveness of these metrics to identify which best complement the development of a content gap index.
Building on Wikipedia Content Gaps
There is a wide body of work on content gaps within Wikipedia, and a full review of all existing metrics is outside the scope of this research. However, as a starting point, we aim to identify some of the most common metrics and explore whether they apply to Wikidata. We believe further analysis would be an important area for future work.
We recognise that there has been prior work within the Wikimedia Foundation (and beyond) to identify and classify types of content gap within Wikipedia, for example Jonathan Morgan’s work[7] and the knowledge gap index[8]. We will use this existing framework and the literature it identifies as the basis for our own analysis, which we will update to take into account subsequently published literature.
Methods
- Analyse existing metrics used to identify content gaps in Wikipedia and adapt them to Wikidata.
- Analyse metrics used to identify content gaps in Wikidata through a systematic literature review.
- Create a unified framework of content gaps.
Research Questions
- What are the existing approaches to identify and measure content gaps in Wikipedia?
- To what extent do these Wikipedia content gap metrics apply to Wikidata?
- What are the existing metrics used to measure content gaps in Wikidata?
- How do different metrics compare against each other and are certain metrics more reliable and suitable for developing an index than others?
Goals and Outputs
Our ultimate goal is to develop a knowledge index of metrics for identifying and measuring content gaps within Wikidata. This knowledge index will be shared with the wider Wikidata and Wikimedia communities, and published as a scientific paper which will collect, compare and evaluate different metrics and methods for quantifying content gaps. In addition, we will produce a paper consolidating work conducted in this space to date by gathering and categorising metrics used to evaluate Wikipedia content gaps, as well as existing gaps identified within Wikidata. We aim to conduct surveys and interviews with the Wikidata community to aid in this evaluation process.
Timeline
Please provide in this section a short timeline with the main milestones and deliverables (if any) for this project.
Policy, Ethics and Human Subjects Research
The initial phase of our work focuses on the literature review. Once this is complete, we plan to reach out to the Wikidata community to gather their views on the metrics. We will seek institutional ethics approval beforehand, and will update this page with further details of the process to be used and the approvals received when available.
Results
Based on a systematic literature review of prior work in this space, we have developed the framework below to categorise existing studies and analyses of content gaps within Wikidata, as well as to conceptualise novel or extended methods to measure gaps. Each dimension has a set of values associated with it, and a dimension can be categorical (it takes one or more categories as its value), binary (it takes exactly one of two values) or spectrum-based (it takes a value anywhere between two endpoints).
The aim of this framework is threefold:
- To help categorise existing studies within this space based on what they have previously explored. This aims to help us understand what is meant when there is talk of, e.g., a gender gap within Wikidata. How does that gap arise and how has it been observed?
- To help conceptualise new methods for measuring and assessing gaps. For example, if a gender gap has been observed within individual entities, does it also apply at the level of the whole graph? We believe this is essential for identifying potential solutions to content gaps.
- To serve as a basis for identifying gaps within our understanding of content gaps.
Dimension | Values | Facet Nature | Description |
---|---|---|---|
Level | Property, Entity, Subclass, Class | Categorical | Indicates the abstraction level(s) at which the gap was measured or observed. |
Intrinsicity | Intrinsic ↔ Extrinsic | Binary | Identifies whether the gap is intrinsic to Wikidata or results from, or reflects, an external source. |
Determination | Assertion ↔ Inference | Binary | Reflects how the gap was identified: measured directly or inferred from properties. |
Scale | Individual → Whole | Spectrum | Represents the scale at which the gap occurs: is it a gap within an individual entity or something observed at the scale of the whole graph? |
Features | Description ↔ Structure | Binary | Identifies whether the gap occurs within textual features and descriptions or within the properties or structural elements of the graph. |
Temporality | Snapshot ↔ Periodic | Binary | Identifies whether the gap was a one-off observation (e.g., recency bias) or a recurring or periodically observed phenomenon. |
Directness | Direct ↔ Proxy | Binary | Identifies whether the gap was measured directly or using proxy measures or tools. |
Source | Behavioural ↔ Architectural | Binary | Identifies whether the gap derives from user and editor behaviour or from architectural constraints. |
Granularity | Single → Whole | Spectrum | Identifies whether the gap was measured from a single entity or at the level of the whole graph. Differs from Scale in that it covers how the gap was measured, whereas Scale covers where the gap occurs. |
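As a minimal sketch of how a study could be recorded against these dimensions, the table can be encoded as a small data structure. The class and field names below are hypothetical, and representing the two spectrum dimensions as numbers between 0.0 and 1.0 is an assumption made for illustration; the framework itself does not prescribe an encoding.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    """Abstraction levels listed in the Level dimension."""
    PROPERTY = "property"
    ENTITY = "entity"
    SUBCLASS = "subclass"
    CLASS = "class"


@dataclass(frozen=True)
class GapClassification:
    """Position of one study along the nine framework dimensions."""
    levels: frozenset    # Level: categorical, one or more Level members
    intrinsic: bool      # Intrinsicity: True = intrinsic, False = extrinsic
    asserted: bool       # Determination: True = assertion, False = inference
    scale: float         # Scale: 0.0 = individual ... 1.0 = whole graph
    structural: bool     # Features: True = structure, False = description
    periodic: bool       # Temporality: True = periodic, False = snapshot
    direct: bool         # Directness: True = direct, False = proxy
    behavioural: bool    # Source: True = behavioural, False = architectural
    granularity: float   # Granularity: 0.0 = single entity ... 1.0 = whole graph

    def __post_init__(self):
        # Categorical dimension: require at least one valid level.
        if not self.levels or not all(isinstance(lv, Level) for lv in self.levels):
            raise ValueError("levels must contain at least one Level member")
        # Spectrum dimensions: assumed here to be normalised to [0.0, 1.0].
        for name in ("scale", "granularity"):
            if not 0.0 <= getattr(self, name) <= 1.0:
                raise ValueError(f"{name} must lie between 0.0 and 1.0")


# Made-up example, not a classification of any real study.
example = GapClassification(
    levels=frozenset({Level.ENTITY, Level.CLASS}),
    intrinsic=True, asserted=True, scale=0.0, structural=True,
    periodic=False, direct=True, behavioural=True, granularity=1.0,
)
```

In this made-up record, Scale and Granularity take opposite ends of their spectra: the gap occurs within individual entities but was measured across the whole graph, which is the distinction the Granularity row draws.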
To further this aim, we have also developed a typology, based on the same literature review, of the types of gap that have been observed and measured in Wikidata. We caution that this list is unlikely to be exhaustive, as many gaps have been theorised or observed qualitatively without being formally measured. There are also likely to be many gaps that are recognised or understood but have not been published in the academic literature. Nevertheless, this typology serves as a starting point for understanding the types of gap that are understood to occur within Wikidata.
Category | Subcategory | Description |
---|---|---|
Demographic | Gender | Gaps associated with the gender of individuals. |
Demographic | Race | Gaps associated with the race of individuals. |
Demographic | Citizenship | Gaps associated with the nation of an individual's birth or the country in which they are resident. |
Geographic | Culture | Gaps associated with cultural elements or heritage. |
Geographic | Language | Gaps associated with a specific language(s). |
Socio-economic | | Gaps associated with an individual or a topic's socioeconomic background. |
Recency | | Gaps associated with the recency of an observed phenomenon. |
Occupation | | Gaps associated with an individual's/individuals' job role or working responsibilities. |
References
- Leila Zia, Isaac Johnson, Bahodir Mansurov, Jonathan Morgan, Miriam Redi, Diego Saez-Trumper, and Dario Taraborelli. 2019. Knowledge Gaps – Wikimedia Research 2030. https://doi.org/10.6084/m9.figshare.7698245
- https://meta.wikimedia.org/wiki/Research:Knowledge_Gaps_Index/Measurement/Content
- https://wigedi.com/
- Färber, M., Bartscherer, F., Menne, C., & Rettinger, A. (2018). Linked data quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. Semantic Web, 9(1), 77-129.
- Shaik, Zaina, Filip Ilievski, and Fred Morstatter. "Analyzing race and citizenship bias in Wikidata." 2021 IEEE 18th International Conference on Mobile Ad Hoc and Smart Systems (MASS). IEEE, 2021.
- Abián, David, Albert Meroño-Peñuela, and Elena Simperl. "An analysis of content gaps versus user needs in the Wikidata knowledge graph." International Semantic Web Conference. Cham: Springer International Publishing, 2022.
- https://meta.wikimedia.org/wiki/Research:Content_gaps_on_Wikipedia
- https://meta.wikimedia.org/wiki/Research:Knowledge_Gaps_Index/Measurement