Research:Assessing electoral campaigns' impact on political information demand using Wikipedia page view data

Sascha Göbel

This page documents a proposed research project.
Information may be incomplete and may change before the project starts.

Project summary and state of research

Wikipedia covers vast amounts of political information, including topics such as political systems, highly specific policy issues, election results and analyses, and a very large number of biographies of politicians (e.g., biographies for all members of all terms of the United States Congress). Previous social science research focused on the creation and negotiation of political content on Wikipedia[1][2]. In contrast, recent efforts emphasize Wikipedia's role as a popular and widely used outlet for the consumption of political information[3]. Citizens use the platform as a trustworthy and neutral source of information[4] about political issues and elites in order to inform their political decisions and preferences. But which events exactly prompt people to request political information on Wikipedia, beyond the mere occurrence of elections[3], and why is the information provided therefore beneficial or even necessary for society?

This project aims to extend our knowledge of what causes the at times massive demand for political information on Wikipedia. In doing so, it connects ongoing Wikimedia research that seeks to understand the use of information provided on Wikipedia[5][6] or Wikidata[7] with empirical political science research into campaign effects on political attitudes and behavior[8][9]. In particular, the project studies the impact of electoral campaigns on political information demand at local levels of geographical aggregation. To this end, the date and location of campaign trail events (intervention) will be tracked by mining the Twitter accounts[10][11] of select candidates in past US gubernatorial elections. Archived Wikipedia page view data serves as a measure of public attention and information demand[12]. Accordingly, raw server access logs for relevant Wikipedia articles (e.g., candidate biographies, candidates' political parties, or salient campaign issues) will be used to measure campaign-related information demand (outcome) both in areas exposed to campaign events and in comparable areas not exposed to them. Comparable areas will be identified using region-specific auxiliary information.

State-of-the-art methods in causal inference will be used to contrast information demand between regions in different exposure states and to estimate the effect of electoral campaigns on demand for political information. The investigated process is observational in nature and does not lend itself to experimental randomization. However, the geographical and temporal scope of the data allows campaign events to be treated as an intervention and their causal impact on a target population within geographical bounds to be estimated without random assignment, in a quasi-experimental[13] fashion. The use of raw server access logs of Wikipedia articles in combination with other Web data hence enables us to (1) understand demand for political information on Wikipedia, (2) better study the effects of electoral campaigns, and (3) produce causal insights that ordinary survey data or other more aggregated observational data cannot offer.

Research questions and contributions

The project described above addresses the following research questions:

  1. Do electoral campaigns affect citizens' demand for political information? What is the direction and magnitude of this effect, i.e., how large is the average increase/decrease of information demand caused by campaign events?
  2. When do people request political information on Wikipedia? What, then, is the potential societal value of political information on the platform?

The project thus sets out to yield insights of interest to:

the Wikipedia community:
  • We know that political information is frequently requested on Wikipedia[3][12]. However, we do not yet know what sparks this interest. This knowledge is invaluable for judging which purpose political information on Wikipedia serves for its users. An increase in demand for political information on Wikipedia in the aftermath of electoral campaign events would corroborate citizens' use of the platform to assist opinion formation and decision-making. Given that political information is sometimes heavily contested on Wikipedia[14] and even used for political advertising[3], such knowledge is all the more important. It could help the Wikipedia community to better assess the social consequences of (politically motivated) editorial bias on the platform.
  • Knowing when people are especially likely to request political information on Wikipedia helps the community to tailor the provision of political content and its supervision better to the needs of users.
researchers in academia:
  • The rich political science literature on electoral campaigns has predominantly focused on assessing effects in terms of electoral outcomes[8][9]. Yet campaign trails serve not merely to win votes but also to inform uninformed voters and arouse interest. Focusing only on electoral outcomes neglects the latter purpose of electoral campaigns. This project is the first to look at changes in the demand for political information as a consequence of electoral campaigns. It investigates whether campaigns spark people's interest in the candidate, the election, political issues, or politics in general. In doing so, the project adds a new dimension to the study of electoral campaigns.
  • Assessing the impact of electoral campaigns on vote shares using observational data is notoriously difficult. The main problem is the long time interval between sparse measurements of the outcome (e.g., election results). This makes it hard to measure campaign effects, since campaigns by opposing candidates offset each other over time[15]. Furthermore, the lifespan of campaign effects on voter preferences is fairly short[16]. At the same time, laboratory experiments suffer from low external validity[17]. This study is the first to assess the impact of campaign trails using observational data that provides frequent measures of the outcome (demand for political information) at real-time intervals. Raw server access logs of Wikipedia articles make it possible to measure the demand for political information individually and as it occurs. As such, they enable an immediate assessment of the behavioral consequences of electoral campaigns. In addition, the geolocation of information demand via the Internet Protocol addresses reported in access logs makes it possible to aggregate the demand for political information at any geographical and temporal level. This makes it feasible to exploit quasi-experimental conditions for robust causal inference.
  • Wikipedia's potential for studying political behavior and understanding general social processes is still largely untapped in the social sciences[18][19]. The bulk of social science research relying on Web data focuses on social media such as Twitter or Facebook. This project acknowledges and highlights Wikipedia's value for tackling central questions in political science research using computational social science approaches[20].

Data and research design

At least three types of data will be combined to answer the above research questions:

  1. Wikipedia data (outcome): Political information demand is measured for a specified period via individual page view data of a select set of Wikipedia articles. The data will be extracted from raw server access logs. Access times and Internet Protocol addresses for geolocation of access origins are obtained using regular expressions as well as select R packages and public APIs.
  2. Twitter data (intervention): Events on the campaign trails of select candidates in select gubernatorial elections are identified by mining the candidates' Twitter accounts. The time and location of events are obtained using regular expressions as well as select R packages and public APIs.
  3. Auxiliary information: To assess the comparability of regions in different exposure states and match them accordingly, additional information, such as regional sociodemographic data, range of news media, etc., will be collected. For this, various official and Web data sources are used.
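The Wikipedia data step in the list above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the log line layout, the `REGION_BY_PREFIX` lookup table, the function names, and the sample lines are all assumptions made for the example, and the format of Wikimedia's raw server access logs may differ.

```python
import re
from collections import Counter

# ASSUMPTION: each log line reads "<ISO timestamp> <IPv4 address> <article title>".
# The actual layout of Wikimedia's raw server access logs may differ.
LOG_LINE = re.compile(
    r"^(?P<day>\d{4}-\d{2}-\d{2})T\d{2}:\d{2}:\d{2}\s+"
    r"(?P<ip>\d{1,3}(?:\.\d{1,3}){3})\s+"
    r"(?P<title>\S+)$"
)

# ASSUMPTION: hypothetical stand-in for a geolocation lookup (IP prefix -> region).
# The prefixes are reserved documentation ranges, not real users.
REGION_BY_PREFIX = {
    "192.0.2.": "county_A",
    "198.51.100.": "county_B",
}


def region_for_ip(ip):
    """Return the region label for an IP address, or None if it is unknown."""
    for prefix, region in REGION_BY_PREFIX.items():
        if ip.startswith(prefix):
            return region
    return None


def daily_views_by_region(lines, articles):
    """Count views of the selected articles per (region, day)."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None or match["title"] not in articles:
            continue  # skip malformed lines and unrelated articles
        region = region_for_ip(match["ip"])
        if region is not None:
            counts[(region, match["day"])] += 1
    return counts


# Made-up sample lines for illustration only.
sample_lines = [
    "2016-10-01T09:15:00 192.0.2.10 Candidate_X",
    "2016-10-01T21:40:12 192.0.2.77 Candidate_X",
    "2016-10-01T10:02:31 198.51.100.5 Candidate_X",
    "2016-10-02T08:00:00 198.51.100.5 Unrelated_article",
    "not a log line",
]
views = daily_views_by_region(sample_lines, {"Candidate_X"})
print(dict(views))
```

Aggregating to (region, day) counts at this stage means that only non-identifying totals, rather than individual addresses, are carried into later analysis steps.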

The researcher will be responsible for the data collection. The support of the Wikimedia Foundation is, however, kindly requested for the provision of raw server access logs.

The overarching approach used to assess the impact of electoral campaigns on political information demand and to demonstrate causality between the intervention and the outcome is spatial ecological inference[21]. Here, units (Wikipedia users requesting political information) are assigned to different exposure states based on whether or not they are located within a specified region (e.g., community, county) surrounding the intervention (i.e., the campaign event). The outcome (demand for political information) is then measured for the different exposure states at the level of the respective geographic regions. Owing to the fine geographical and temporal resolution of the data, several methods are suited to address the issue of confounding. For instance, matching[22] or semi-parametric difference-in-differences[23] are suitable candidates. However, given the temporal scope of the data, the synthetic control method[24][25] seems most promising in the present case. This method systematically combines regions not exposed to the intervention into optimal control cases by weighting them according to their comparability to the region exposed to the intervention. In addition, and as opposed to the other two methods, it accounts for time-varying confounders. A careful selection of cases (i.e., elections, candidates, and regions) will be conducted once the project commences. Similarly, strategies to deal with further threats to internal validity, such as violations of the stable unit treatment value assumption due to spatial spillovers of the intervention, will be developed at early stages of the project. The project involves a lot of programming and statistical computation. These will be conducted using the open-source programming languages R and Python.
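To make the weighting idea behind the synthetic control method more concrete, the following is a minimal sketch under simplifying assumptions. It chooses non-negative donor weights that sum to one by minimizing the squared pre-intervention gap between the exposed region and the weighted donor regions, using a simple pairwise Frank-Wolfe iteration chosen here only to keep the example dependency-free. It matches on pre-period outcomes alone (no covariates), and the function names and all numbers are hypothetical toy values rather than real page view data; a full analysis would rely on established implementations of the methods cited above.

```python
def synthetic_control_weights(treated_pre, donors_pre, iterations=5000, tol=1e-12):
    """Donor weights (non-negative, summing to 1) that best reproduce the
    treated region's pre-intervention outcomes in the least-squares sense."""
    n_donors = len(donors_pre)
    periods = range(len(treated_pre))
    weights = [1.0 / n_donors] * n_donors  # start from equal weights
    for _ in range(iterations):
        synthetic = [sum(w * d[t] for w, d in zip(weights, donors_pre)) for t in periods]
        residual = [synthetic[t] - treated_pre[t] for t in periods]
        # Gradient of the squared error with respect to each weight (up to a factor 2).
        gradient = [sum(residual[t] * d[t] for t in periods) for d in donors_pre]
        gain = min(range(n_donors), key=lambda j: gradient[j])
        active = [j for j in range(n_donors) if weights[j] > 0.0]
        lose = max(active, key=lambda j: gradient[j])
        # Shift weight from donor `lose` to donor `gain` with an exact line search.
        direction = [donors_pre[gain][t] - donors_pre[lose][t] for t in periods]
        curvature = sum(x * x for x in direction)
        slope = sum(residual[t] * direction[t] for t in periods)
        if curvature == 0.0 or slope > -tol:
            break  # no further improvement possible
        step = min(-slope / curvature, weights[lose])
        weights[gain] += step
        weights[lose] -= step
    return weights


def post_period_gaps(treated_post, donors_post, weights):
    """Observed outcome minus synthetic control outcome for each post-period."""
    return [
        treated_post[t] - sum(w * d[t] for w, d in zip(weights, donors_post))
        for t in range(len(treated_post))
    ]


# Toy daily page view counts (made up): one exposed county, three donor counties.
treated_pre = [120.0, 130.0, 125.0, 135.0]
donors_pre = [
    [100.0, 110.0, 105.0, 115.0],
    [140.0, 150.0, 145.0, 155.0],
    [300.0, 280.0, 310.0, 290.0],
]
treated_post = [160.0, 170.0]
donors_post = [[102.0, 108.0], [142.0, 148.0], [305.0, 295.0]]

toy_weights = synthetic_control_weights(treated_pre, donors_pre)
print([round(w, 3) for w in toy_weights])
print(post_period_gaps(treated_post, donors_post, toy_weights))
```

The post-period gaps are the quantity of interest: they estimate how much information demand in the exposed region departs from what the weighted combination of unexposed regions would predict.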


All progress will be reported on this page on a regular basis. The results of this project will also be presented at leading international conferences in political science (e.g., the annual conferences of the American Political Science Association (APSA) and the European Political Science Association (EPSA)) and computational social science (e.g., the International Conference on Computational Social Science), as well as at Wikipedia-related conferences (e.g., Wikimania) and internal research colloquia. Research papers resulting from this project will be submitted to open access journals for publication. Code and non-personal data produced throughout this project will be made publicly available on a project-specific repository at


This research proposal is part of a larger PhD project funded by the Excellence Initiative of the German federal and state governments through a full scholarship at the Graduate School of Decision Sciences (GSC 1019) at the University of Konstanz.

Time schedule

2017-10 → Project kicks off
2017-10 — 2017-11 → case selection, theoretical work/framing, full development of research design
2017-12 — 2018-02 → data collection and processing
2018-03 — 2018-05 → implementation of research design
2018-06 — 2018-07 → analyses, visualization of results
2018-08 — 2018-09 → writing of research paper
2018-10 — Project ends → communication of results, revision, submission of research paper to open-access journal
This is a very preliminary time schedule and may well be adjusted as the project progresses. It is accordingly also possible that the duration of the project will be extended by a few months.

Wikimedia Policies, Ethics, and Human Subjects Protection

This project and the researchers involved in it will fully and consistently comply with the Wikimedia Foundation's open access and privacy policies. With regard to the non-public data requested in this proposal, the researchers who are involved in this project and are granted access to non-public data by the Wikimedia Foundation on the basis of a formal collaboration will enter into a non-disclosure agreement as well as a memorandum of understanding. Non-public data provided by the Wikimedia Foundation will not be redistributed or made available, directly or indirectly, and will be used in full accordance with the Wikimedia Foundation's instructions. Research findings will refer to aggregate forms of the data and will be presented visually or otherwise in a manner such that re-identification of users is precluded.


  1. Kalla, Joshua L.; Aronow, Peter M. (2015). "Editorial bias in crowd-sourced political information". PLoS ONE 10 (9). doi:10.1371/journal.pone.0136327. 
  2. Neff, Jessica J.; Laniado, David; Kappler, Karolin E.; Volkovich, Yana; Aragón, Pablo; Kaltenbrunner, Andreas (2013). "Jointly they edit: Examining the impact of community identification on political interaction in Wikipedia". PLoS ONE 8 (4). doi:10.1371/journal.pone.0060584. 
  3. Göbel, Sascha; Munzert, Simon (2017). "Political advertising on the Wikipedia marketplace of information". Social Science Computer Review. Forthcoming. doi:10.1177/0894439317703579.
  4. Pande, Mani (2011). "Wikipedia Readership Survey 2011". Wikimedia Research. Retrieved 13 July 2017. 
  5. Zia, Leila; Leskovec, Jure; West, Robert; Wulczyn, Ellery; Taraborelli, Dario; Morgan, Jonathan; Singer, Philipp; Lemmerich, Florian; Strohmaier, Markus (2015). "Characterizing Wikipedia Reader Behaviour". Wikimedia Research. Retrieved 13 July 2017. 
  6. Paolotti, Daniela; Tizzoni, Michele; Panisson, André; Cattuto, Ciro (2016). "Quantifying the global attention to public health threats through Wikipedia pageview data". Wikimedia Research. Retrieved 13 July 2017. 
  7. Krötzsch, Markus; Masopust, Tomáš; Voigt, Hannes; Krause, Alexander; Bielefeldt, Adrian; Gonsior, Julius (2016). "Understanding Wikidata Queries". Wikimedia Research. Retrieved 13 July 2017. 
  8. Brady, Henry E.; Johnston, Richard, eds. (2006). Capturing campaign effects. Ann Arbor: University of Michigan Press.
  9. Hillygus, D. Sunshine (2010), "Campaign Effects on Vote Choice", in Leighley, Jan E.; Edwards III, George C., Oxford Handbook on Elections and Political Behavior, Oxford: Oxford University Press, pp. 326–345.
  10. Golbeck, Jennifer; Grimes, Justin M.; Rogers, Anthony (2010). "Twitter use by the U.S. Congress". Journal of the Association for Information Science and Technology 61 (8): 1612–1621. doi:10.1002/asi.21344.
  11. Graham, Todd; Broersma, Marcel; Hazelhoff, Karin; van 't Haar, Guido (2013). "Between broadcasting political messages and interacting with voters. The use of Twitter during the 2010 UK general election campaign". Information, Communication & Society 16 (5): 692–716. doi:10.1080/1369118X.2013.785581. 
  12. Munzert, Simon (2015). "Using Wikipedia article traffic volume to measure public issue attention" (PDF). Unpublished manuscript. Retrieved 13 July 2017.
  13. Keele, Luke; Titiunik, Rocío (2016). "Natural experiments based on geography". Political Science Research and Methods 4 (1): 65–95. doi:10.1017/psrm.2015.4. 
  14. Yasseri, Taha; Sumi, Robert; Rung, András; Kornai, András; Kertész, János (2012). "Dynamics of Conflicts in Wikipedia". PLoS ONE 7 (6). doi:10.1371/journal.pone.0038869. 
  15. Gelman, Andrew; King, Gary (1993). "Why are American presidential election campaign polls so variable when votes are so predictable?". British Journal of Political Science 23 (4): 409–451. 
  16. Mitchell, Dona-Gene (2012). "It's about time: The lifespan of information effects in a multiweek campaign". American Journal of Political Science 56 (2): 298–311. doi:10.1111/j.1540-5907.2011.00549.x. 
  17. Iyengar, Shanto; Simon, Adam F. (2000). "New perspectives and evidence on political communication and campaign effects". Annual Review of Psychology 51: 149–169. doi:10.1146/annurev.psych.51.1.149.
  18. Mesgari, Mostafa; Okoli, Chitu; Mehdi, Mohamad; Nielsen, Finn Å.; Lanamäki, Arto (2015). "'The sum of all human knowledge': A systematic review of scholarly research on the content of Wikipedia". Journal of the Association for Information Science and Technology 66 (2): 219–245. doi:10.1002/asi.23172. 
  19. Schroeder, Ralph; Taylor, Linnet (2015). "Big data and Wikipedia research: Social science knowledge across disciplinary divides". Information, Communication & Society 18 (9): 1039–1056. doi:10.1080/1369118X.2015.1008538. 
  20. Cioffi-Revilla, Claudio (2010). "Computational social science". Wiley Interdisciplinary Reviews: Computational Statistics 2 (3): 259–271. doi:10.1002/wics.95. 
  21. Wakefield, Jonathan (2004). "A critique of statistical aspects of ecological studies in spatial epidemiology". Environmental and Ecological Statistics 11 (1): 31–54. doi:10.1023/B:EEST.0000011363.12720.38. 
  22. Stuart, Elizabeth A. (2010). "Matching methods for causal inference: A review and a look forward". Statistical Science 25 (1): 1–21. doi:10.1214/09-STS313. 
  23. Abadie, Alberto (2005). "Semiparametric difference-in-differences estimators". Review of Economic Studies 72 (1): 1–19. doi:10.1111/0034-6527.00321. 
  24. Abadie, Alberto; Diamond, Alexis; Hainmueller, Jens (2010). "Synthetic control methods for comparative case studies: Estimating the effect of California’s tobacco control program". Journal of the American Statistical Association 105 (490): 493–505. doi:10.1198/jasa.2009.ap08746. 
  25. Xu, Yiqing (2017). "Generalized synthetic control method. Causal inference with interactive fixed effects models". Political Analysis 25 (1): 57–76. doi:10.1017/pan.2016.2. 


Sascha Göbel
University of Konstanz
Graduate School of Decision Sciences & Center for Data and Methods
78457 Konstanz, Germany

Simon Munzert
Humboldt University of Berlin
Department of Social Sciences
10099 Berlin, Germany