Research:Exploring Wikimedia Communities, Trace Data, Social Systems and Causality
This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.
This page is currently a draft. More information pertaining to this may be available on the talk page.
Summary
This exploratory project investigates Wikimedia communities, readership and content using causal and social systems approaches. It connects micro-level activity, observed in trace data, to meso- and macro-level social phenomena. The project’s primary output will be clusterings of wiki communities and related entities. Clusters will be interpretable, that is, they will be describable in terms of their connections to social phenomena.
The project will proceed through focused stages, with each stage exploring specific phenomena and associated data. A final integration stage will synthesize the outputs of the preceding stages to produce cross-domain models. Preliminary results from the first stage are summarized below.
The project aims to advance understanding of grassroots knowledge generation and the consumption of that knowledge, and to develop theory and methods for linking trace data to social phenomena.
Introduction
This project applies causal and social systems approaches to study Wikimedia communities, readership and content. It uses these approaches as the basis for a framework for connecting micro-level explanations about how people act and think (for example, about viewing or editing a Wikipedia article) to meso- and macro-level social phenomena (such as networks of editors, cultural difference, regional identities or demographics).
The project has four goals:
- To study grassroots knowledge generation and the consumption of that knowledge from a social perspective. We expect that research focusing on social aspects of these processes will reveal features overlooked by other approaches.
- To support applied research in the Wikimedia ecosystem by providing insights into social contexts and dynamics. Such research is critical for developing new site features and understanding community priorities.
- To refine and formalize approaches for connecting trace data to social phenomena. Previous investigations of these links have relied on a diversity of approaches; this project aims to contribute to a theoretically grounded, integrated framework and toolset for studying such relationships.
- To advance theories of society, culture and knowledge by building on and developing the author's previous proposals [1] through empirical application.
We will analyze trace data (digital traces of human activity), especially pageviews, edits and wiki content. Note that we view wiki content not just as trace data, but also as symbolic communication among editors and readers. Some analyses will also integrate auxiliary data, such as demographic statistics.
A primary output will be clusterings of wiki communities, wiki pages, edits and people. These groupings will include a time dimension showing cluster evolution, and will be interpretable in social terms—that is, the properties that distinguish the clusters will connect to social phenomena.
Additionally, the project will develop open-source pipelines for data preparation, feature engineering, model generation and visualization.
The project is exploratory and will be carried out in stages. It has three broad, core research questions:
RQ1: What patterns exist in Wikimedia trace data?
RQ2: What hypotheses about social phenomena could explain these patterns?
RQ3: What models can we build to investigate these hypotheses and shed light on these phenomena?
The research questions do not define a specific population as the object of study because Wikimedia communities and readership span global humanity. (See the Approach section for more on this.)
Stages
Given the project’s scope, it will be executed in multiple focused stages.
Each stage will explore a limited set of trace data (for example, pageviews, edit histories or talk page contents) and social phenomena (such as regional differences in reader focus or community dynamics) and will build on related previous research. Domain-specific research questions that link back to the core research questions will provide focus for each stage.
The project will conclude with a final integration stage that will synthesize results from the preceding stages. The clusterings mentioned above will be generated in this stage. Data from preceding stages will serve as engineered features, and the domain-specific insights obtained will guide cross-domain modeling.
We have intentionally not specified which statistical and modeling techniques we will use. These will be selected as appropriate for each stage and refined as preliminary results emerge.
A first stage, summarized below, is currently underway.
Stage 1: Geographic and topic pageview associations
The first stage of this project explores patterns in the articles viewed by Wikipedia readers, across countries and languages. The research questions specific to this stage are:
Stage 1 RQ1: What patterns exist in Wikipedia pageviews, related to the topics and geographic locations associated with the articles viewed, across readers’ own geographic locations and Wikipedia language editions?
Stage 1 RQ2: What hypotheses about social phenomena could explain these patterns?
(In Stage 1 RQ1, “associated with the articles” means, “that the articles are about, or that are closely related to the topic of the articles”. For example, an article about a scientist born in country A would be associated with both science and country A.)
Additional stage-specific research questions will be added as work on this stage progresses.
This stage analyzes data from the Wikimedia Foundation’s Differential Privacy Pageviews dataset [2] in conjunction with the output of machine learning models that predict articles' geographic and topic associations. This data was supplemented with population counts and internet access rates.
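The text does not say how the population counts and internet access rates are used. One plausible use, shown here purely as an assumption, is to normalize a country's pageviews by its estimated number of internet users, so that countries of very different sizes can be compared. The function name and the example numbers are hypothetical.

```python
def views_per_internet_user(views, population, internet_access_rate):
    """Pageviews divided by the estimated number of internet users.

    internet_access_rate is a fraction between 0 and 1 (assumed convention).
    Returns None when an input is missing or the estimate is not positive.
    """
    if views is None or population is None or internet_access_rate is None:
        return None
    internet_users = population * internet_access_rate
    if internet_users <= 0:
        return None
    return views / internet_users


# Invented numbers, for illustration only (not project data).
example = views_per_internet_user(
    views=1_000_000, population=5_000_000, internet_access_rate=0.8
)
```

Returning `None` instead of raising keeps countries with missing auxiliary data in the table, where they can be flagged as a data quality issue rather than silently dropped.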
Due to the size of the pageviews dataset, it was not possible to fetch the models’ predictions for all articles viewed over a reasonable timespan, so pageviews were sampled, and models were queried for the associations of articles appearing in the samples.
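The sampling step can be sketched as follows. This is a minimal illustration, not the project's pipeline: the record layout, the function names `sample_pageviews` and `articles_to_query`, and the choice of view-weighted sampling with replacement are all assumptions. The point is that the set of distinct articles in the samples is much smaller than the full dataset, so the association models only need to be queried for those articles.

```python
import random
from collections import Counter


def sample_pageviews(rows, n_samples, seed=0):
    """Draw n_samples pageviews per country, weighted by view count.

    rows: iterable of (country, article, views) tuples (hypothetical layout).
    Returns {country: Counter({article: times_sampled})}.
    """
    by_country = {}
    for country, article, views in rows:
        by_country.setdefault(country, []).append((article, views))

    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    samples = {}
    for country, items in by_country.items():
        articles = [a for a, _ in items]
        weights = [v for _, v in items]
        # Sampling with replacement: each draw stands for one pageview.
        drawn = rng.choices(articles, weights=weights, k=n_samples)
        samples[country] = Counter(drawn)
    return samples


def articles_to_query(samples):
    """Distinct articles whose associations must be fetched from the models."""
    return {a for counts in samples.values() for a in counts}


# Made-up toy rows, for illustration only (not project data).
rows = [
    ("MX", "Frida Kahlo", 900),
    ("MX", "Photosynthesis", 100),
    ("JP", "Mount Fuji", 700),
    ("JP", "Photosynthesis", 300),
]
samples = sample_pageviews(rows, n_samples=50)
to_query = articles_to_query(samples)
```

Drawing the same number of samples per country gives every country equal weight in later comparisons, at the cost of noisier estimates for articles with few views.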
Based on per-country samples of pageviews from October 2025, we see the following trends, across world regions, for the combined pageviews of all Wikipedia language editions:
(a) Readers consistently access articles associated with the region they are located in at a higher rate than readers in other regions.
(b) By contrast, the proportions of pageviews associated with broad topic areas (e.g., media, science, geography) vary relatively little across regions.
Possible causes of (a) include reader preference for content related to a local identity, and search engine algorithms that favor local content.
(b) could imply that the contexts and motivations for accessing Wikipedia, as well as the frequencies of their occurrence, are globally relatively invariant. (b) also connects to previous research [3].
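One way to quantify trends (a) and (b) from sampled pageviews is sketched below. The record layout, the function names and the toy numbers are assumptions made for illustration; they are not the project's data or necessarily its metrics, and the toy values were constructed to mirror the described patterns rather than to provide evidence for them. For (a), the sketch compares the share of a region's pageviews that go to articles associated with that region against the same share among readers elsewhere. For (b), it reports the spread of each topic's share across regions.

```python
from collections import defaultdict

# Each record: (reader_region, article_region, article_topic, views).
# Invented toy numbers, not results.
records = [
    ("Africa", "Africa", "geography", 40),
    ("Africa", "Europe", "media", 35),
    ("Africa", "Asia", "science", 25),
    ("Europe", "Europe", "geography", 45),
    ("Europe", "Africa", "science", 20),
    ("Europe", "Asia", "media", 35),
    ("Asia", "Asia", "geography", 42),
    ("Asia", "Europe", "media", 33),
    ("Asia", "Africa", "science", 25),
]


def home_region_rates(records):
    """For each region R, return a pair: the share of views by readers in R
    that go to articles associated with R, and the same share among readers
    located outside R."""
    total = defaultdict(int)      # views per reader region
    to_region = defaultdict(int)  # views per (reader region, article region)
    for reader, art_region, _topic, views in records:
        total[reader] += views
        to_region[(reader, art_region)] += views

    rates = {}
    for region in total:
        inside = to_region.get((region, region), 0) / total[region]
        outside_views = sum(
            v for (r, a), v in to_region.items() if r != region and a == region
        )
        outside_total = sum(v for r, v in total.items() if r != region)
        outside = outside_views / outside_total if outside_total else 0.0
        rates[region] = (inside, outside)
    return rates


def topic_share_spread(records):
    """For each topic, return the maximum minus the minimum of its pageview
    share across reader regions."""
    total = defaultdict(int)
    by_topic = defaultdict(int)
    for reader, _art_region, topic, views in records:
        total[reader] += views
        by_topic[(reader, topic)] += views

    topics = {t for _, t in by_topic}
    spread = {}
    for topic in topics:
        shares = [by_topic.get((r, topic), 0) / total[r] for r in total]
        spread[topic] = max(shares) - min(shares)
    return spread


rates = home_region_rates(records)
spread = topic_share_spread(records)
```

In this toy setup, trend (a) corresponds to the first value in each pair of `rates` exceeding the second, and trend (b) to small values in `spread`. A fuller analysis would also need a baseline for how much content about each region exists in each language edition.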
Here are overview visualizations for (a), (b) and data quality:





Additional details, including more visualizations, methodological notes, and an overview of the data processing pipeline, are available in this blog post. A single image with many of the visualizations is here.
Further work on this stage will include clustering analyses, calculating content availability baselines, modeling the alteration and removal of data due to differential privacy, improving sampling methods for countries with fewer pageviews, and investigating patterns at country-level granularity and by language edition.
Approach
(In progress; some initial notes are here.)
Prior work
(In progress)
Timeline
No timeline is currently specified for this project.
Policy, ethics and human subjects research
Only publicly available data will be used for this project. No interviews or surveys are planned at this time. This project has not undergone a policy, ethics and human subjects review.
Resources
Further details about Stage 1 are in this blog post.
Code for Stage 1 is here.
References
1. Green, Andrew R. (2019). Derrumbando la Barrera Cualitativa-Cuantitativa: Perspectivas sobre el Pensamiento, las Expresiones Formales, el Lenguaje y la Investigación Social. Ph.D. dissertation. Maldonado Soto, Ricardo (advisor). Mexico: Instituto Nacional de Antropología e Historia.
2. Wikimedia Foundation. "Pageviews Differential Privacy — Current".
3. Lemmerich, Florian; Sáez-Trumper, Diego; West, Robert; Zia, Leila (2019). "Why the World Reads Wikipedia: Beyond English Speakers". WSDM '19: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining. doi:10.1145/3289600.3291021.