Research:Exploring Wikimedia Communities, Trace Data, Social Systems and Causality

From Meta, a Wikimedia project coordination wiki
Contact
no affiliation
Duration:  2025-10 – ??

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.


Summary

This exploratory project investigates Wikimedia communities, readership and content using causal and social systems approaches. It connects micro-level activity, observed in trace data, to meso- and macro-level social phenomena. The project’s primary output will be clusterings of wiki communities, wiki pages, edits and social groupings. Clusters will be interpretable: they will be describable in terms of their connections to social phenomena.

The project will proceed through focused stages, with each stage exploring specific phenomena and associated data. A final integration stage will synthesize the outputs of the preceding stages to produce cross-domain models. Partial results from the first two stages are summarized below.

The project aims to advance understanding of grassroots knowledge generation and the consumption of that knowledge, and to develop theory and methods for linking trace data to social phenomena.

Introduction

This project applies causal and social systems approaches to study Wikimedia communities, readership and content. It uses these approaches as the basis for a framework for connecting micro-level explanations about how people act and think (for example, about viewing or editing a Wikipedia article) to meso- and macro-level social phenomena (such as networks of editors, cultural difference, regional identities or demographics).

The project has four goals:

  • To study grassroots knowledge generation and the consumption of that knowledge from a social perspective. We expect that research focusing on social aspects of these processes will reveal features overlooked by other approaches.
  • To support applied research in the Wikimedia ecosystem by providing insights into social contexts and dynamics. Such research is critical for developing new site features and understanding community priorities.
  • To refine and formalize approaches for connecting trace data to social phenomena. Previous investigations of these links have relied on a diversity of approaches; this project aims to contribute to a theoretically grounded, integrated framework and toolset for studying such relationships.
  • To advance theories of society, culture and knowledge by building on and developing the author's previous proposals [1] through empirical application.

We will analyze trace data (digital traces of human activity), especially pageviews, edits and wiki content. Note that we view wiki content not just as trace data, but also as symbolic communication among editors and readers. Some analyses will also integrate auxiliary data, such as demographic statistics.

A primary output will be clusterings of wiki communities, wiki pages, edits and people. These groupings will include a time dimension showing cluster evolution, and will be interpretable in social terms; that is, the properties that distinguish the clusters will connect to social phenomena.

Additionally, the project will develop open-source pipelines for data preparation, feature engineering, model generation and visualization.

The project is exploratory and will be carried out in stages. It has three broad, core research questions:

RQ1: What patterns exist in Wikimedia trace data?

RQ2: What hypotheses about social phenomena could explain these patterns?

RQ3: What models can we build to investigate these hypotheses and shed light on these phenomena?

The research questions do not define a specific population as the object of study because Wikimedia communities and readership span global humanity. (See the Approach section for more on this.)

Stages

Given the project’s scope, it will be executed in multiple focused stages.

An initial preparatory stage will select demographic data inputs and explore ways of combining those inputs with established Wikimedia metrics.

Next, a series of intermediate stages will explore limited sets of trace data (for example, pageviews, edit histories or talk page contents) and social phenomena (such as regional differences in reader focus or community dynamics), building on related previous research whenever possible. Domain-specific research questions that link back to the core research questions will provide focus for these stages.

The project will conclude with a final integration stage that will synthesize results from the preceding stages. The clusterings mentioned above will be generated in this stage. Data from preceding stages will serve as engineered features, and the domain-specific insights obtained will guide cross-domain modeling.

We have intentionally not specified which statistical and modeling techniques we will use. These will be selected as appropriate for each stage and refined as preliminary results emerge.

Stages will not necessarily be carried out sequentially. Partial results for stages 0 and 1 are summarized below.

Stage 0: Demographics and derived metrics

This preparatory stage involves selecting and preparing relevant demographic data, and exploring derived metrics that integrate such data with key Wikimedia metrics (e.g., pageviews, editor activity). This stage provides foundational inputs for subsequent stages. Research questions for this stage are:

Stage 0 RQ1: What demographic or socioeconomic indicators are relevant to social phenomena related to Wikimedia communities, readership and content?

Stage 0 RQ2: How can those statistics be integrated with established Wikimedia metrics to define meaningful derived metrics?

Work on this stage is ongoing. Partial results include the following:

For RQ1, population size and internet access rates have been identified as the key inputs. This data allows us to account for the wide variation in population sizes across countries represented in the data. Additionally, Stage 1 uses these inputs as weights for aggregating country-level data.

For RQ2, we propose readership penetration as a derived metric, defined as the ratio of mean daily pageviews to internet-connected population. This provides a rough measure of Wikipedia readership, adjusting for variation in internet access across countries.
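The definition above can be sketched in a few lines of Python. This is only an illustration of the ratio as described, not the project's code: the function name, its parameters and the example figures are hypothetical, and it assumes that the internet access rate is given as a percentage of the population (as in the World Bank indicator).

```python
def readership_penetration(total_pageviews, days, population, internet_access_pct):
    """Mean daily pageviews divided by internet-connected population.

    Hypothetical helper: assumes `internet_access_pct` is a percentage
    (0-100) of the population, and `total_pageviews` covers `days` days.
    """
    if days <= 0:
        raise ValueError("days must be positive")
    connected_population = population * internet_access_pct / 100.0
    if connected_population <= 0:
        raise ValueError("internet-connected population must be positive")
    mean_daily_pageviews = total_pageviews / days
    return mean_daily_pageviews / connected_population


# Illustrative, made-up figures for a single country over a 31-day month.
example = readership_penetration(
    total_pageviews=31_000_000,
    days=31,
    population=10_000_000,
    internet_access_pct=50.0,
)
print(example)  # 0.2
```

A value of 0.2 here would mean one daily pageview for every five internet-connected residents, on average.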

Below is a map of this metric by country for October 2025.

Wikipedia readership penetration by country, October 2025, based on data for pageviews, population and rates of internet access, obtained from the Wikimedia Foundation, the United Nations and the World Bank, respectively.[2][3][4] For privacy reasons, the Wikimedia Foundation excludes data on several countries from public datasets.

Upcoming work on this stage will extend the scope of demographic inputs to include education, age structure, language speakers and language status.

Stage 1: Geographic and topic pageview associations

This stage of the project explores patterns in the articles viewed by Wikipedia readers, across countries and languages. The research questions specific to this stage are:

Stage 1 RQ1: What patterns exist in Wikipedia pageviews, related to the topics and geographic locations associated with the articles viewed, across readers’ own geographic locations and Wikipedia language editions?

(Here, “topics and locations associated with the articles” means topics and locations that the articles are about or closely related to. For example, an article about a scientist born in country A would be associated with both science and country A.)

Stage 1 RQ2: What hypotheses about social phenomena could explain these patterns?

Stage 1 RQ3: What models can we build to investigate these hypotheses?

Additional stage-specific research questions may be added as work on this stage progresses.

This stage analyzes data from the Wikimedia Foundation’s Differential Privacy Pageviews dataset [5] in conjunction with the output of machine learning models that predict articles' geographic and topic associations. This data was supplemented with population counts and internet access rates (see Stage 0).

Due to the size of the pageviews dataset, it was not possible to fetch the models’ predictions for all articles viewed over a reasonable timespan, so pageviews were sampled, and models were queried for the associations of articles appearing in the samples.
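The sampling scheme is not specified on this page. As one possible sketch, assuming that each country's pageviews are sampled with replacement and that an article's chance of being drawn is proportional to its view count, the step could look like the following. The function name, the seed parameter and the example data are hypothetical.

```python
import random
from collections import Counter


def sample_pageviews(article_views, sample_size, seed=0):
    """Draw `sample_size` pageviews with replacement.

    `article_views` maps article titles to view counts for one country.
    Each draw picks an article with probability proportional to its
    view count. Returns a Counter of how often each article was drawn.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    articles = list(article_views)
    weights = [article_views[a] for a in articles]
    draws = rng.choices(articles, weights=weights, k=sample_size)
    return Counter(draws)


# Made-up view counts for one country.
views = {"Article A": 900, "Article B": 90, "Article C": 10, "Article D": 0}
sample = sample_pageviews(views, sample_size=200)

# Only the distinct articles in the sample need model predictions.
articles_to_query = sorted(sample)
```

Under this assumption, the models only need to be queried for the distinct articles in `articles_to_query`, which is typically far fewer than the full set of articles viewed.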

Based on per-country samples of pageviews from October 2025, we see the following trends, across world regions, for the combined pageviews of all Wikipedia language editions:

(a) Readers consistently access articles associated with the region they are located in at a higher rate than readers in other regions do.

(b) By contrast, the proportions of broad topic areas (e.g., media, science, geography) associated with pageviews vary relatively little across regions.

Plausible explanations for (a) include reader preference for content related to a local identity, and search engine algorithms that favor local content.

(b) could imply that the contexts and motivations for accessing Wikipedia, along with the frequencies of their occurrence, are globally relatively invariant. (b) also connects to previous research.[6]

Here are overview visualizations for (a), (b) and data quality:

Estimated weighted proportions of regional associations of pageviews, by device region, all Wikipedias, October 2025. Based on samples of pageviews from the public Differential Privacy Pageviews dataset.
Divergence scores for proportions of regional associations of pageviews, by device region, all Wikipedias, October 2025.
Estimated weighted proportions of topic area associations of pageviews, by device region, all Wikipedias, October 2025.
Divergence scores for proportions of topic area associations of pageviews, by device region, all Wikipedias, October 2025.
Data availability and sample status for samples of Differential Privacy Pageviews data, all Wikipedias, October 2025.
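The captions above refer to weighted proportions and divergence scores, but this page does not define how either is computed. The sketch below shows one way such quantities could be obtained, under two labeled assumptions: country-level proportions are combined into a regional distribution using a per-country weight (for example, internet-connected population, as in Stage 0), and divergence is measured with the Jensen-Shannon divergence. Both function names and all figures are hypothetical, and the project may use a different measure.

```python
import math


def weighted_proportions(country_props, weights):
    """Combine per-country category proportions into one distribution.

    `country_props` maps country -> {category: proportion}.
    `weights` maps country -> non-negative weight (assumption: e.g.
    internet-connected population).
    """
    total_weight = sum(weights[c] for c in country_props)
    categories = sorted({k for props in country_props.values() for k in props})
    return {
        k: sum(weights[c] * country_props[c].get(k, 0.0) for c in country_props)
        / total_weight
        for k in categories
    }


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions.

    Ranges from 0 (identical) to 1 (no overlap). Assumption: this is
    only one candidate for a divergence score.
    """
    categories = set(p) | set(q)
    mix = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in categories}

    def kl(a):
        # Kullback-Leibler divergence of `a` from the mixture `mix`.
        return sum(
            a[k] * math.log2(a[k] / mix[k]) for k in a if a[k] > 0
        )

    return 0.5 * kl(p) + 0.5 * kl(q)


# Made-up proportions for two countries in one region.
props = {
    "Country A": {"local region": 0.6, "other regions": 0.4},
    "Country B": {"local region": 0.2, "other regions": 0.8},
}
region = weighted_proportions(props, {"Country A": 3.0, "Country B": 1.0})
score = js_divergence(region, props["Country B"])
```

A score near 0 would indicate that the two distributions are nearly the same, which is the pattern described in (b) for broad topic areas.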

Additional details, including methodological notes and an overview of the data processing pipeline, are available in this short paper. A single image with many visualizations is here.

Further work on this stage will include modeling noise and missing data due to differential privacy processing, refining the sampling approach, and calculating e-values to assess the robustness of findings. We will also conduct a granular topic analysis of the top articles contributing to geographical self-focus and develop models for clustering countries based on reader focus. Finally, we plan to calculate content availability baselines and investigate patterns across Wikipedia language editions.

Approach

(In progress)

Prior work

(In progress)

Timeline

No timeline is currently specified for this project.

Policy, ethics and human subjects research

Only publicly available data will be used for this project. No interviews or surveys are planned at this time. This project has not undergone a policy, ethics and human subjects review.

Resources

Further details about Stage 1 can be found in this short paper.

Code for stages 0 and 1 is here.

Notes

  1. See Green, Andrew R. (2019). Derrumbando la Barrera Cualitativa-Cuantitativa: Perspectivas sobre el Pensamiento, las Expresiones Formales, el Lenguaje y la Investigación Social. Ph.D. dissertation. Maldonado Soto, Ricardo (advisor). Mexico: Instituto Nacional de Antropología e Historia. Broadly, the proposals draw on Cognitive Science, Cognitive Linguistics and Systems Science to connect approaches that are typically seen as incommensurable.
  2. Pageview data downloaded using the Wikimedia Foundation's REST API.
  3. United Nations, Department of Economic and Social Affairs, Population Division (2025). World Population Prospects 2024. https://population.un.org/wpp/downloads?folder=Standard%20Projections
  4. World Bank Group (2026). Individuals using the internet (% of population). https://data.worldbank.org/indicator/IT.NET.USER.ZS.
  5. Wikimedia Foundation. "Pageviews Differential Privacy — Current". 
  6. Lemmerich, Florian; Sáez-Trumper, Diego; West, Robert; Zia, Leila (2019). "Why the World Reads Wikipedia: Beyond English Speakers". WSDM '19: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining. doi:10.1145/3289600.3291021.