Research:Content gaps on Wikipedia
Wikipedia is incomplete by design. The opportunity to share new information with the world is a major motivating factor among both new and established Wikipedia contributors. However, when important information about a topic is absent, incomplete, biased, or otherwise inaccessible to readers, these content gaps can undermine Wikipedia’s ability to serve the needs of its global audience.
Although a great deal of research has identified different types of content gaps and their characteristics, there has not yet been an attempt to synthesize this body of work into actionable guidance for identifying, prioritizing, and measuring content gaps. The goal of this project is to characterize previously identified content gaps and arrange them hierarchically in a taxonomy, in order to facilitate future work on prioritizing which content gaps to focus on, measuring content gaps consistently, and evaluating the impact of interventions meant to close content gaps.
- Summarize findings from a body of relevant academic and industry research focused on content gaps related to the selection, extent, and framing of hypertextual Wikipedia content (e.g. text, links and citations, and structured metadata, but not multimedia)
- Identify the empirical methods used in these various studies, and their advantages and limitations with respect to their general applicability for large-scale analysis of content gaps across different languages of Wikipedia and for different forms of hypertext-based information
- Identify the potential causes of content gaps described in these various studies, and the supporting evidence for each
- Develop a taxonomy of content gaps
- Provide recommendations for topic-, language-, and format-agnostic metrics and measurement techniques that can support the evaluation of both technological and programmatic interventions to close content gaps
- What are the selection, extent, and framing gaps that have been identified in previous literature?
- Which of the proposed causes for these gaps are best supported by currently available evidence?
- What are the characteristics of previous programmatic and technological interventions that have shown some success at addressing these content gaps?
- What metrics have been used to quantify the extent of content gaps or their change over time, and which of these metrics show the most promise for general applicability beyond a specific topic, language, or type of content?
This project will begin with a literature review of previous work related to content gaps. The review will follow the three-part classification of knowledge gap types outlined in the associated Wikimedia Research 2030 white paper: selection, extent, and framing gaps. It will also:
- Identify methods and metrics used to identify these kinds of gaps in previous research, and compare and contrast the benefits and limitations of these methods
- Identify potential causes of content gaps, and evaluate the evidence provided for these proposed causes in previous research
By organizing previous research according to thematic categories related to gap type, methods/metrics, and proposed causes, we will be able to provide the first draft of a taxonomy of content gaps on Wikipedia.
This literature review will focus on content gaps in information represented in text (e.g. Wikipedia articles, or other textual entities) or hypertext (e.g. links between articles, categories, and other text-based metadata). Multimedia gaps, and gaps specific to Wikidata and other Wikimedia projects (e.g. Wiktionary), are beyond the scope of this literature review and may be addressed in a separate study.
- Research:Explaining the Wikipedia reader gender gap
- Knowledge Gaps — Wikimedia Research 2030
- Research:Closing the Gender Content Gap in Wikipedia
- Research:Gender gap in Wikipedia's content
- Research:Automatic new article topics suggestion