Research:Untangling Wikipedia's Sources: Mapping References To Reveal Global Knowledge

From Meta, a Wikimedia project coordination wiki
Created: 2025-03-15 21:00:00
Collaborators: Prof. Dr. Manuel Burghardt (University of Leipzig)
Duration: 2024-01 – 2026-12

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.


🚧 This is an active, evolving research project. Everything may change. 🚧

This page documents work in progress. Questions, methods, and findings are all subject to refinement—or even complete transformation—as we learn more.

✨ A case in point: This project began with a focus on parsing plain text citations, treating unstructured references as a problem to solve. But as we explored the data, the sheer prevalence and opacity of URLs became impossible to ignore. We realized that URLs themselves—not just the text around them—were the real treasure trove. So we pivoted. What started as a quest to structure text became a mission to enrich links.

The lesson? Everything can change. And that's a feature, not a bug.

💡 We welcome your insights, critiques, and collaboration. If you see a better path, a new question, or a flaw in our logic, please join the conversation.

Wikipedia contains millions of URLs pointing to the world's knowledge—news articles, government data, academic papers, archived web pages, and more. Yet despite their abundance, we have no consistent, scalable pipeline to transform these URLs into actionable metadata that can help us understand Wikipedia's sourcing practices, knowledge gaps, and relationship with the broader web.

This project develops methodologies to unlock the latent information in Wikipedia's URLs, treating them not as endpoints but as gateways to understanding the encyclopedia's intellectual lineage, geographic reach, and epistemic diversity.

Background

Previous citation research has focused heavily on scholarly identifiers (DOIs, ISBNs, PMIDs) and structured templates. This focus, while valuable, has created a blind spot: URLs—the most democratic and widely used citation format—remain analytically opaque. A URL from elpais.com tells us little by itself. But when enriched with metadata about the source's country, media type, political bias, or ownership, it becomes a powerful signal.

Consider what URLs can reveal:

  • An elpais.com URL → Spanish newspaper, factual reporting, and a Wikipedia article for more context (en:El País)
  • An insee.fr URL → French government statistics, official data

The challenge is that no standardized pipeline exists to:

  1. Extract URLs from all Wikipedia references (not just those in templates)
  2. Resolve and normalize them across languages and formats
  3. Enrich them with structured metadata from diverse sources
  4. Analyze them at scale to reveal meaningful patterns

Literature Review

The study of Wikipedia citations has evolved significantly, but a persistent methodological gap remains: the underutilization of URLs as a source of insight.

Identifier-Based Approaches

A substantial body of research has focused on citations that include structured identifiers. Lin and Fenner (2014) analyzed Wikipedia references across PLOS publications, demonstrating the potential of using DOIs to track scholarly impact. This approach was extended by Maggio et al. (2017), who characterized Wikipedia as a "gateway to biomedical research" by analyzing citations with identifiers, revealing the relative distribution and use of scholarly sources.

Singh, West, and Colavizza (2021) made a significant contribution by creating a comprehensive dataset of citations with identifiers extracted from English Wikipedia. Their work, published in Quantitative Science Studies, provided the research community with a valuable resource for studying scholarly citations at scale. Similarly, Kousha and Thelwall (2017) investigated whether Wikipedia citations serve as important evidence of the impact of scholarly articles and books, finding meaningful correlations between Wikipedia mentions and traditional impact metrics.

The Wikimedia Foundation has also contributed to this line of inquiry. Redi et al. (2018) measured the proportion of open-access sources across Wikipedia languages and topics, using identifiers to assess the accessibility of cited scholarly content. Their findings revealed significant variation across languages and disciplines in the availability of freely accessible research.

Note: some references for this subsection are still missing and will be added later.

Template and API-Based Methods

Other researchers have leveraged Wikipedia's citation templates and reference markup. Zagovora et al. (2020, 2022) developed a sophisticated approach to tracking reference evolution, creating a dataset of individual edit histories for all references in English Wikipedia. Their work, validated through crowdsourcing, traces the lifecycle of citations from creation through modification and deletion. This methodologically innovative approach covers every reference within <ref> tags regardless of structure, but it does not address the challenge of enriching unstructured citations with external metadata.

Pooladian and Borrego (2017) highlighted methodological issues in measuring Wikipedia citations, using Library and Information Science as a case study. Their work underscores the importance of carefully considering what different methods capture—and what they miss.

The URL Gap

Despite these advances, none of the existing approaches systematically leverages URLs as a primary data source. Bilder (2014), in his role as Wikimedia Ambassador for Crossref, advocated for increased use of scholarly identifiers across Wikimedia projects, noting the "patchy adoption" that limits identifier-based analyses. This observation points to a deeper issue: by focusing on what is easiest to harvest (identifiers, templates), the research community has systematically understudied what is most prevalent (URLs).

URLs appear in all types of references—from structured template citations to unstructured "text" references. They point to news articles, government data, personal blogs, archived content, and millions of other sources that shape Wikipedia's knowledge ecosystem. Yet no consistent pipeline exists to transform these URLs into the kind of structured metadata that would enable large-scale analysis of sourcing patterns, geographic representation, or media diversity.

This project directly addresses this gap, developing methodologies to make URLs legible as data.

Research Questions

This project asks: What can URLs teach us about Wikipedia that we cannot learn from structured citations alone?

RQ1: The URL Landscape. What is the full universe of URLs cited across Wikipedia languages? How many are unique, how many are dead, how many are archived?
RQ2: The "Text" URL Goldmine. What can we learn from the millions of URLs hiding in unstructured "text" references? Are they systematically different from URLs in template citations?
RQ3: Archival Dependency. How dependent is Wikipedia on web archives (web.archive.org, archive.today, etc.)? What does this tell us about the fragility of its sources?
RQ4: Source Enrichment at Scale. When we enrich URLs with metadata (country, bias, media type, ownership), what patterns emerge about Wikipedia's sourcing across languages and topics?
RQ5: Knowledge Equity Through URLs. Do URLs reveal different sourcing patterns than DOIs? Do they capture knowledge from regions, languages, and publication types that identifiers miss?

Methodology

Rather than treating URLs as a problem to clean, this project builds pipelines to turn URLs into data.

1. URL Extraction Pipeline

Using Wikimedia Enterprise snapshots, we extract every URL from every reference across multiple Wikipedia languages. This includes:

  • URLs in structured templates (cite web, cite news)
  • URLs hiding in unstructured "text" references
  • URLs embedded in archive wrappers (web.archive.org/*/http://original.com)
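
The extraction step for a single reference could be sketched as follows. This is a minimal illustration, not the project's actual implementation: the regular expression, the function name `extract_urls`, and the sample reference strings are all assumptions, and reading Wikimedia Enterprise snapshots is not shown.

```python
import re

# Hypothetical pattern: an http(s) URL runs until whitespace or a
# wikitext/HTML delimiter such as |, ], } or <.
URL_PATTERN = re.compile(r"https?://[^\s<>\[\]{}|\"]+")


def extract_urls(reference_text: str) -> list[str]:
    """Return every http(s) URL in one reference, in order of appearance."""
    urls = []
    for match in URL_PATTERN.finditer(reference_text):
        # Trailing punctuation usually belongs to the sentence, not the URL.
        urls.append(match.group(0).rstrip(".,;:)"))
    return urls


# A template citation and an unstructured "text" reference (both invented).
template_ref = "{{cite web |url=https://www.insee.fr/fr/statistiques |title=Statistiques}}"
text_ref = ("Informe anual. http://elpais.com/123. Archived at "
            "[https://web.archive.org/web/2020/http://elpais.com/123 Wayback]")

print(extract_urls(template_ref))
print(extract_urls(text_ref))
```

Because the pattern does not stop at a second `http://`, an archive wrapper is returned as one URL, which leaves unwrapping to a later normalization step. Stripping a trailing `)` is a simplification that would truncate URLs legitimately ending in a parenthesis.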

Initial analysis of Spanish, French, and German Wikipedias (June 2025 snapshot) reveals the scale:

  • DE: 8.1M "text" references → 4.8M non-Wikipedia URLs
  • FR: 6.5M "text" references → 3.9M non-Wikipedia URLs
  • ES: 4.8M "text" references → 2.9M non-Wikipedia URLs
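
As a quick check on these figures, the ratio of non-Wikipedia URLs to "text" references can be computed directly. The sketch below assumes the right-hand number counts "text" references that contain at least one non-Wikipedia URL; if it counts URLs instead, the result is a URLs-per-reference ratio rather than a share. The variable and function names are illustrative.

```python
# Figures in millions, copied from the list above:
# (number of "text" references, number with non-Wikipedia URLs).
counts = {"DE": (8.1, 4.8), "FR": (6.5, 3.9), "ES": (4.8, 2.9)}


def url_share(text_refs: float, with_urls: float) -> float:
    """Percentage of "text" references that carry a non-Wikipedia URL."""
    return 100 * with_urls / text_refs


shares = {lang: round(url_share(*pair), 1) for lang, pair in counts.items()}
print(shares)  # roughly 59.3 (DE), 60.0 (FR) and 60.4 (ES)
```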

2. URL Normalization & Resolution

A critical challenge: the same source appears in many variations.

http://elpais.com/123
https://www.elpais.com/123
http://elpais.com/123?utm_source=wikipedia
https://web.archive.org/web/2020/elpais.com/123

Our pipeline normalizes URLs to base domains, resolves archive wrappers to original sources, and handles redirects—creating a clean mapping between Wikipedia references and the actual sources being cited.
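
A minimal sketch of this normalization, using only the Python standard library, is shown below. The Wayback Machine pattern, the function names, and the choice to drop `utm_` parameters and a leading `www.` are assumptions made for illustration. Redirect resolution needs network access and is omitted, and reducing a host to its registrable domain would additionally require a public suffix list.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit

# Hypothetical pattern for Wayback Machine wrappers:
# web.archive.org/web/<timestamp or *>/<original URL>
WAYBACK = re.compile(r"^https?://web\.archive\.org/web/[^/]+/(.+)$", re.IGNORECASE)


def unwrap_archive(url: str) -> str:
    """Return the original URL if `url` is a Wayback wrapper, else `url`."""
    match = WAYBACK.match(url)
    if match is None:
        return url
    original = match.group(1)
    # Wrapped URLs are sometimes stored without a scheme.
    if not re.match(r"^https?://", original, re.IGNORECASE):
        original = "http://" + original
    return original


def normalize(url: str) -> tuple[str, str]:
    """Return (host, canonical key) with scheme, www. and utm_* removed."""
    parts = urlsplit(unwrap_archive(url.strip()))
    host = parts.hostname or ""  # urlsplit lowercases the hostname
    if host.startswith("www."):
        host = host[4:]
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if not k.lower().startswith("utm_")]
    key = host + parts.path
    if kept:
        key += "?" + urlencode(kept)
    return host, key


variants = [
    "http://elpais.com/123",
    "https://www.elpais.com/123",
    "http://elpais.com/123?utm_source=wikipedia",
    "https://web.archive.org/web/2020/elpais.com/123",
]
for variant in variants:
    print(normalize(variant))  # each prints ('elpais.com', 'elpais.com/123')
```

Under these assumptions, the four variants listed above collapse to the same host and key, which is the property the mapping between references and sources relies on.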

3. URL Enrichment Pipeline

This is the core innovation: attaching rich metadata to URLs at scale.

Enrichment Sources and What They Reveal
Metadata Layer | Source | What It Reveals
Domain-level | Media Bias Fact Check (MBFC) | Political bias, factual reporting, media type, country
Institutional | GDELT, Wikidata | Ownership, funding, transparency
Geographic | IP geolocation, WHOIS | Where sources are physically located
Scholarly | OpenAlex, Crossref | For academic URLs, full citation metadata
Archival | Memento API, Wayback Machine | Whether URL is archived, how often, when it died

Initial MBFC enrichment of top domains demonstrated the power of this approach: suddenly, we could see not just that nytimes.com is frequently cited, but that it's a U.S.-based, left-center, high-factual-reliability newspaper—and compare it to spiegel.de or lemonde.fr across languages.
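
Domain-level enrichment of this kind amounts to joining reference counts against a metadata lookup table. The sketch below is an illustration under stated assumptions: the `DomainMetadata` fields, the `METADATA` table, and the sample input are hypothetical, and the single nytimes.com row simply restates the description in the paragraph above. It does not call any MBFC interface.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class DomainMetadata:
    country: str
    bias: str
    factual: str
    media_type: str


# Hypothetical lookup table; a real one would be built from an enrichment
# source such as those listed in the table above.
METADATA = {
    "nytimes.com": DomainMetadata("United States", "Left-Center", "High", "Newspaper"),
}


def enrich(domains: Iterable[str]) -> list[tuple[str, int, Optional[DomainMetadata]]]:
    """Count references per domain and attach metadata where it is known."""
    counts = Counter(domains)
    return [(domain, n, METADATA.get(domain)) for domain, n in counts.most_common()]


def coverage(enriched: list[tuple[str, int, Optional[DomainMetadata]]]) -> float:
    """Share of references whose domain has a metadata match."""
    total = sum(n for _, n, _ in enriched)
    matched = sum(n for _, n, meta in enriched if meta is not None)
    return matched / total if total else 0.0


# Invented input: one normalized domain per reference.
rows = enrich(["nytimes.com", "nytimes.com", "example.org"])
print(rows)
print(coverage(rows))
```

Tracking coverage alongside the enriched rows makes explicit how many references fall outside the metadata source, which matters when the lookup covers only the top domains.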

4. Cross-Language Comparative Analysis

By building this pipeline once and running it across languages, we can ask genuinely comparative questions:

  • Does French Wikipedia rely more on government domains than German?
  • Do Spanish Wikipedia's URLs point to more U.S. sources than Latin American ones?
  • Which language editions cite more archived content?

Key Preliminary Findings

The "Text" URL Goldmine

The generic "text" reference type—often dismissed as noise—is actually a URL treasure trove:

  • 59-60% of "text" references contain at least one non-Wikipedia URL
  • These URLs reveal sourcing patterns invisible to template-based analysis:
    • German: Academic (Google Books, DOI) and news (spiegel.de)
    • French: Government data (insee.fr) and cultural heritage (gallica.bnf.fr)
    • Spanish: Geographic data (geonames.usgs.gov) suggesting translation from English sources

This suggests that focusing only on template citations systematically undercounts certain types of sources—especially government data, archival content, and non-English materials.

The Archival Layer

URLs reveal Wikipedia's deep dependency on web archives:

  • web.archive.org appears in top domains across all languages
  • But archives are not monolithic: archive.today, wikiwix.com, timetravel.mementoweb.org, and language-specific tools (redirecter.toolforge.org in DE) all serve different roles
  • The prevalence of archive URLs raises questions: Are we studying Wikipedia's sources or Wikipedia's archived sources?
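
One way to quantify this dependency is to flag references whose host is a known archive service and report their share. The sketch below assumes the host list is limited to the services named above, which is certainly incomplete, and the function names and sample URLs are illustrative.

```python
from urllib.parse import urlsplit

# Archive-related hosts named in the list above; a fuller list is assumed
# to be needed in practice.
ARCHIVE_HOSTS = {
    "web.archive.org",
    "archive.today",
    "wikiwix.com",
    "timetravel.mementoweb.org",
    "redirecter.toolforge.org",
}


def is_archive_url(url: str) -> bool:
    """True if the URL's host is an archive host or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == a or host.endswith("." + a) for a in ARCHIVE_HOSTS)


def archive_share(urls: list[str]) -> float:
    """Fraction of URLs that point at an archive service."""
    if not urls:
        return 0.0
    return sum(is_archive_url(u) for u in urls) / len(urls)


sample = [
    "https://web.archive.org/web/2020/http://elpais.com/123",
    "https://elpais.com/123",
    "http://archive.wikiwix.com/cache/?url=x",
    "https://www.insee.fr/",
]
print(archive_share(sample))  # 0.5 for this invented sample
```

Running the same function per language edition would give a first comparable measure of archive reliance, before unwrapping the archived URLs to study the original sources.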

What URLs Teach Us That DOIs Don't

When we enrich URLs with MBFC metadata, patterns emerge that DOI-based analysis could never capture:

  • U.S. dominance: In French Wikipedia's top 100 domains, U.S.-based domains (21 domains) account for 6.6M references—far more than French domains (9 domains, 0.65M)
  • Bias distribution: Most matched references come from "Left-Center" (27 domains, 6.7M refs) or "Right-Center" (17 domains, 0.66M refs) sources
  • Media type: "Newspapers" dominate (32 domains), but "Organization/Foundation" (6 domains) account for 5.8M references—showing the outsized influence of a few institutional sources

These patterns raise normative questions: Should Wikipedia's citations reflect the global knowledge landscape, and if so, what does "should" mean when URLs reveal such concentration?

Why This Matters: From URLs to Understanding

Building a URL-to-metadata pipeline transforms what we can ask about Wikipedia:

Old Question (Identifier-Based) | New Question (URL-Enriched)
How many scholarly articles does Wikipedia cite? | What types of sources (news, government, blogs, archives) does Wikipedia rely on, and how does this vary by language?
Which journals are most cited? | Which countries, media outlets, and institutional sources shape Wikipedia's worldview?
What is the open access percentage? | What is the political bias distribution of Wikipedia's news sources?
How many citations are to dead links? | Which sources are most fragile? Which communities are most dependent on archives?

The last question is particularly urgent. If a language edition relies heavily on URLs from a single country or a handful of news outlets, and those URLs rot, an entire community's knowledge infrastructure becomes unstable. URL analysis lets us see these dependencies before they become crises.

Next Steps

Pipeline Development

  1. Scale URL enrichment beyond top domains to all URLs in the dataset
  2. Integrate multiple enrichment sources (MBFC, GDELT, OpenAlex, WHOIS) into a unified pipeline
  3. Build archive analysis tools to trace the relationship between original URLs and their archived versions

Research Directions

  1. Comparative URL ecology: How do URL sourcing patterns differ across 10+ Wikipedia languages?
  2. URL decay and preservation: Which types of sources (geographic, political, institutional) are most likely to go dead? Which communities are most affected?
  3. Bias through URLs: When Wikipedia cites a URL, whose voice is it amplifying? Can we trace the political economy of Wikipedia's sources through URL metadata?
  4. URLs as equity indicators: Do URLs from the Global South, minority languages, or independent media survive as well as those from the Global North?

Community Engagement

  1. Share findings with Wikipedia communities about their URL sourcing patterns
  2. Develop tools to help editors assess the diversity and fragility of their sources
  3. Advocate for better URL practices—consistent archiving, template usage, and metadata inclusion

References

  • Bilder, G. (2014, August 7). Citation needed. Front Matter. https://doi.org/10.64000/4jf4e-19h85
  • Kousha, K., & Thelwall, M. (2017). Are Wikipedia citations important evidence of the impact of scholarly articles and books? Journal of the Association for Information Science and Technology, 68(3), 762–779. https://doi.org/10.1002/asi.23694
  • Lin, J., & Fenner, M. (2014). An analysis of Wikipedia references across PLOS publications. Altmetrics14: Expanding Impacts and Metrics an ACM Web Science Conference 2014 Workshop, 23–26. http://pfigshare-u-files.s3.amazonaws.com/1546358/Altmetrics14_PLOS_v2.pdf
  • Maggio, L. A., Willinsky, J. M., Steinberg, R. M., Mietchen, D., Wass, J. L., & Dong, T. (2017). Wikipedia as a gateway to biomedical research: The relative distribution and use of citations in the English Wikipedia. PLoS ONE, 12(12), e0190046. https://doi.org/10.1371/journal.pone.0190046
  • Pooladian, A., & Borrego, Á. (2017). Methodological issues in measuring citations in Wikipedia: A case study in Library and Information Science. Scientometrics, 113(1), 455–464.
  • Redi, M., Taraborelli, D., & Orlowitz, J. (2018, August 20). How many Wikipedia references are available to read? We measured the proportion of open access sources across languages and topics. Wikimedia Foundation. https://wikimediafoundation.org/news/2018/08/20/how-many-wikipedia-references-are-available-to-read/
  • Singh, H., West, R., & Colavizza, G. (2021). Wikipedia citations: A comprehensive data set of citations with identifiers extracted from English Wikipedia. Quantitative Science Studies, 2(1), 1–19. https://doi.org/10.1162/qss_a_00105
  • Zagovora, O., Ulloa, R., Weller, K., & Flöck, F. (2020). Individual Edit Histories of All References in the English Wikipedia [Dataset]. Zenodo. https://doi.org/10.5281/zenodo.3964990
  • Zagovora, O., Ulloa, R., Weller, K., & Flöck, F. (2022). "I updated the <ref> tag": The evolution of references in the English Wikipedia and the implications for altmetrics. Quantitative Science Studies, 3(1), 147–173. https://doi.org/10.1162/qss_a_00171
