Research:Quantifying the global attention to public health threats through Wikipedia pageview data

11:59, 29 May 2016 (CET)
Daniela Paolotti
Michele Tizzoni
André Panisson
Ciro Cattuto
Duration:  2016-07 — 2017-07
Open data project: no URL provided
Open access project: no URL provided

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.

We plan to leverage the high-quality and high-granularity page view data provided by Wikimedia to obtain a high-resolution and time-resolved map of the patterns of global attention to an epidemic outbreak, and link such patterns to health communication activities as well as to the epidemiology of the disease. This will allow us to derive a data-driven quantitative description of the interplay between the diffusion of information, the public concern raised by an outbreak and the dynamics of the infection spread.


In the last decade, significant advances in digital approaches to the domains of epidemiology and population health have set the new paradigm of “Digital Epidemiology” [1]. Recent and important examples of this new research field are based on the analysis of data from social media such as Twitter to measure vaccine sentiment [2] [3], data from Google searches [4] and participatory Web platforms for influenza surveillance [5] [6], and notably data from Wikipedia to forecast disease activity [7] [8]. When considering the threat posed by emerging infectious diseases, such as the novel coronaviruses causing SARS and MERS, novel influenza subtypes such as A/H1N1 in 2009 and H7N9 in 2013, and arboviruses like Zika and Chikungunya, digital epidemiology has shown value in supporting public health decision makers to rapidly identify outbreak sources [9], monitor disease incidence [3] and deploy a quick response [10].

Modeling collective attention to a public health threat

While most of the literature has focused on the use of novel data streams to monitor or forecast disease burden, very little work has been done on quantifying the public discussion, the degree of public concern and, more generally, the dynamics of attention induced by a current health threat such as an outbreak. This aspect, however, is highly relevant both for disease forecasting and for public health policy: human behaviour and behavioral responses to an outbreak play an important role in the transmission of infectious diseases [11]. Public reactions to an outbreak may range from relatively mild, as during the 2009 A/H1N1 pandemic, to severe, as was the case for the 2015 MERS-CoV outbreak in South Korea, or even scale up to violent unrest, as in the 2014 West Africa Ebola outbreak. Individual behavior is also key in the current Zika outbreak, as the virus can cause birth defects when it infects pregnant women and can also be transmitted by a man to his sex partners. The current Zika outbreak, in particular, poses peculiar challenges for communication to the public due to its association with microcephaly in newborns, its transmission modalities, and its current prevalence in areas that are going to witness intense international travel due to the upcoming Rio Olympics 2016. The proposed investigation aims at a better quantitative understanding of public attention and awareness during a disease outbreak, with the potential to inform realistic mathematical epidemic models [12]. Moreover, understanding the influence of attention and behaviour on the spread of diseases, and quantifying this behavioral component, can be key to improving public health communication strategies and the dissemination of outbreak-related information.

By now, it is evident that there is a huge potential to use Web-related data sources to measure and quantify the complex interplay between the spread of information, individual behaviour and the epidemiology of an infectious disease. However, such potential has remained untapped so far due to the lack of adequate data sources and of an empirical framework to link a measure of concern to the disease epidemiology. Recent work in the field has mainly relied on Twitter data (see for instance [3]) or other proxies for individual behavior [13]. Recent efforts at ISI Foundation have focused on mapping out collective attention patterns by means of two Web-based platforms, EbolaTracking and ZikaTracking. These systems were developed as an aid in monitoring the global Twitter conversation about Ebola and Zika, respectively, and provide an awareness tool able to follow multiple events in a geo-referenced context and in real time. Yet, several gaps still need to be bridged before we can reach a comprehensive understanding of the complex interplay of information dynamics, public awareness, and disease epidemiology and dynamics.

In this context, Wikipedia represents an extremely valuable data source to understand how information spreads during an epidemic outbreak and to directly quantify the level of public attention and concern induced by the epidemic. In several countries, Wikipedia represents a widely accessed source of information for health-related topics, and Wikipedia page views can be reasonably considered a valid and accurate proxy to measure the level of public concern and attention to a health topic.

Study summary

In this study, we plan to leverage the high-quality and high-granularity page view data provided by Wikimedia to obtain a high-resolution and time-resolved map of the patterns of global attention to an epidemic outbreak, and link such patterns to health communication activities as well as to the epidemiology of the disease. This will allow us, for the first time, to derive a data-driven quantitative description of the interplay between the diffusion of information, the public concern raised by an outbreak and the dynamics of the infection spread. Eventually, the results will allow us to define, test and validate behavior-contagion models able to close the feedback loop between the behavioral changes triggered in the population by individuals' perception of and response to the spread of the disease and of information about it, and the actual disease spread itself. This might aid in devising appropriate and effective strategies for communicating the epidemic situation to the general public and for informed education on, and dissemination of, the effects of intervention strategies.

Specifically, we will focus on two recent epidemic outbreaks where human behavior and public attention have played a significant role: the Zika virus and the MERS coronavirus (MERS-CoV) outbreaks.

Objectives and research questions

The project will achieve the following objectives:

  • Quantify global patterns of attention to specific public health threats (mainly Zika and MERS-CoV, but also Ebola, Yellow fever, and others) through the analysis of fine-grained, geo-tagged and time-stamped pageview data of disease-related Wikipedia pages.
  • Compare the global patterns of attention with epidemiological data: how is attention driven by the incidence, prevalence, or other epidemiological features of the disease (for instance its basic reproductive number)? What about local attention patterns?
  • Identify geographic correlations in the patterns of attention. Do information and concern spread between countries that are at risk? What are the roles of geographic distance from the source of the outbreak, shared languages, cultural similarities, etc.?
  • Compare the patterns of attention to those measured from social media, and more specifically from Twitter. What types of relevant signals are better computed from Wikipedia pageview data than from social media streams such as Twitter?
  • Identify the drivers of attention. What is the role of traditional media sources and of public health agencies in the global conversation? Are they able to bring the global attention to relevant sources of information? What is the interplay between information conveyed by mass media and social media?
  • Derive quantitative input from the measured/observed patterns of attention to be integrated into epidemic models that take into account the behavioral component (e.g., the models in [12]).

Methods and work plan

As mentioned above, we will initially focus on two current public health threats: the Zika virus and the MERS coronavirus. The former has been declared a Public Health Emergency of International Concern by the WHO; it is presently detected in Africa, the Americas, Asia and the Pacific, and is starting to spread in several US states as well. The latter emerged in the Middle East in 2012 and caused its largest outbreak in South Korea in June 2015.

The goal of this study is to uncover and characterize the dynamics of attention/concern related to a public health threat in a geo-referenced and time-resolved fashion. Our analysis will focus on the geolocalized hourly pageview data made available by the Wikimedia Foundation. Based on the “Current Schema” described in the specifications for the available data format, we will select the pages containing relevant keywords in the title (in order to collect all the relevant pages regardless of language) such as: “Zika”, “MERS”, “Middle East respiratory syndrome”, and others. We will also consider, for each source thus selected, the list of pages linking to it as well as their translations in other languages. The result of our query will be a time series of the daily number of page views of Wikipedia pages related to Zika and MERS, across languages and spatially resolved down to the finest available granularity (geographical entities Country and City as specified in the “Current Schema”).
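
The selection and aggregation step described above could look roughly like the following minimal sketch. It is not the project's actual pipeline: the record layout (timestamp, title, country, views), the timestamp format and the function names are hypothetical, since the real field names are defined by the “Current Schema”. Keywords are matched at the start of a word so that titles such as “Farmers” are not picked up by “MERS”; other false positives remain possible, so a real keyword filter would still need manual review.

```python
import re
from collections import defaultdict
from datetime import datetime

# Keywords quoted in the text; the list is meant to be extended.
KEYWORDS = ("zika", "mers", "middle east respiratory syndrome")

# Match a keyword only at the start of a word, case-insensitively.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in KEYWORDS) + r")",
    re.IGNORECASE,
)

def is_relevant(title):
    """Return True if the page title contains one of the keywords."""
    # Page titles use underscores instead of spaces; normalise first so
    # that word boundaries are detected correctly.
    return _PATTERN.search(title.replace("_", " ")) is not None

def daily_views(records):
    """Aggregate hourly records into daily totals per (date, country).

    Each record is a hypothetical (timestamp, title, country, views)
    tuple with timestamps like "2016-02-01T09"; the real layout may differ.
    """
    totals = defaultdict(int)
    for timestamp, title, country, views in records:
        if is_relevant(title):
            day = datetime.strptime(timestamp, "%Y-%m-%dT%H").date()
            totals[(day.isoformat(), country)] += views
    return dict(totals)

# Made-up rows, for illustration only.
sample = [
    ("2016-02-01T09", "Zika_virus", "BR", 120),
    ("2016-02-01T10", "Zika_virus", "BR", 80),
    ("2016-02-01T10", "Zika_fever", "US", 40),
    ("2016-02-01T11", "Olympic_Games", "BR", 500),
    ("2016-02-02T00", "Middle_East_respiratory_syndrome", "KR", 15),
]
print(daily_views(sample))
```

On the made-up rows, the “Olympic_Games” row is dropped and the two hourly Zika rows for Brazil are summed into a single daily count of 200. Following links and interlanguage translations, as described above, is not covered by this sketch.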

Our work will be organized along the following research directions:

  • The dynamics of collective attention to a public health threat as measured by pageview data. Time-resolved, geo-referenced pageview data, for communities where such data is available, are a valuable proxy for information-seeking behavior and collective attention to specific topics or issues. We will characterize and model the response of communities to attention drivers that span institutional communication by public health agencies, mass media coverage, and the spread of information and disinformation driven by social media. We will use unsupervised machine learning and anomaly detection techniques to detect change points and attention shifts, and relate them to events for which spatio-temporal metadata is available (e.g., media coverage in local TV newscasts). To this end, we will leverage large-scale data sources of news/broadcast/print activity such as the GDELT Project. We will use time series analysis to map out influence relations between regions and cities [14], and characterize the determinants of such influence (e.g., shared languages, distribution of the same mass media content, etc.).
  • Comparison of Wikipedia pageview data with disease-related Twitter activity. We will compare the spatially-resolved time series of page views with the time series of tweets collected via the Twitter streaming API and containing the keyword “Zika” or “MERS”. We will integrate the Zika pageview data into a private version of the ZikaTracking platform to visually explore the two data streams. We will subsequently investigate the level of spatial and temporal correlation between time series, and examine similarities and differences between the spread of information on Twitter, the sentiment of Twitter messages, and the attention level recorded by Wikipedia page views. Ultimately, we will aim at quantifying and modeling the community perception of health risk, its spatio-temporal evolution, and its relation to actual health risk due to the considered public health threat.
  • Comparison of pageview time series with time series of disease incidence. This will be done by considering epidemiological data at both national and local level. The volume of page views will then be matched to the number of cases observed at national level. Temporal correlations will be examined to understand how and when the detection of the first disease case triggers a relevant shift of public attention. We will also examine how temporal variations in the incidence of the disease are correlated with page views and whether major events (such as the detection of a new case, or official alerts issued by public health agencies) affect the pattern of page views.
  • Analysis of spatial correlations between pageview data and the spatial spread of infection. This will be done by comparing the geo-referenced time series of page views with the spatially resolved disease incidence/prevalence provided by the WHO and other national surveillance systems. We will examine whether geographic distance plays a role in the spread of concern/attention, and if this role is similar to the spatial correlations that drive the spread of the infection.
  • Definition of a phenomenological parameter capturing population awareness and/or concern, to be integrated into spatial epidemic models. Our final research activity will be the parameterization of behavior-disease models based on the quantitative results of the activities described above. We will first try to parameterize compartmental epidemic models (e.g., [12]) and then extend our approach to the integration of awareness into a large-scale spatial epidemic model based on human mobility: GLEAM. Results of this parameterization will be validated by comparing modeling results and retrospective forecasts with epidemiological data.
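
As an illustration of the temporal-correlation step in the list above, a lagged Pearson correlation between a daily case series and a daily pageview series is one simple way to ask how many days attention trails (or leads) reported incidence. This is only a sketch under stated assumptions: the function names and the toy series are hypothetical, and plain Pearson correlation is used here for simplicity; it is not the transfer-entropy analysis of [14], nor necessarily the method the project will adopt.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / sqrt(var_x * var_y)

def lagged_correlation(cases, views, max_lag):
    """Return {lag: r} for lags in [-max_lag, max_lag].

    A positive lag pairs cases[t] with views[t + lag], i.e. page views
    trail the case counts by `lag` days; a negative lag means page
    views lead the case counts.
    """
    n = len(cases)
    if n != len(views) or max_lag > n - 2:
        raise ValueError("series must match and be longer than max_lag + 1")
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = cases[: n - lag], views[lag:]
        else:
            a, b = cases[-lag:], views[: n + lag]
        result[lag] = pearson(a, b)
    return result

# Toy data, for illustration only: views echo cases two days later.
cases = [0, 0, 1, 3, 6, 4, 2, 1, 0, 0]
views = [0, 0, 0, 0, 10, 30, 60, 40, 20, 10]

correlations = lagged_correlation(cases, views, max_lag=3)
best_lag = max(correlations, key=correlations.get)
print(best_lag, round(correlations[best_lag], 3))
```

On real data the two series would first need to be aligned on the same dates and region, and the correlation at each lag should be read with care, since both series are strongly autocorrelated during an outbreak.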

Policy, Ethics and Human Subjects Research

The proposed work will not track individual activity or profiles of users and editors. It will only aim at modeling collective attention patterns aggregated over time and space. Raw pageview data will not be redistributed or made available, directly or indirectly, through data visualization devices.


  1. Salathé, Marcel; Bengtsson, Linus; Bodnar, Todd J.; Brewer, Devon D.; Brownstein, John S.; Buckee, Caroline; Campbell, Ellsworth M.; Cattuto, Ciro; Khandelwal, Shashank; Mabry, Patricia L.; Vespignani, Alessandro (2012). "Digital Epidemiology". PLoS Computational Biology 8 (7): e1002616. ISSN 1553-7358. doi:10.1371/journal.pcbi.1002616. 
  2. Salathé, Marcel; Khandelwal, Shashank (2011). "Assessing Vaccination Sentiments with Online Social Media: Implications for Infectious Disease Dynamics and Control". PLoS Computational Biology 7 (10): e1002199. ISSN 1553-7358. doi:10.1371/journal.pcbi.1002199. 
  3. Signorini, Alessio; Segre, Alberto Maria; Polgreen, Philip M. (2011). "The Use of Twitter to Track Levels of Disease Activity and Public Concern in the U.S. during the Influenza A H1N1 Pandemic". PLoS ONE 6 (5): e19467. ISSN 1932-6203. doi:10.1371/journal.pone.0019467. 
  4. Ginsberg, Jeremy; Mohebbi, Matthew H.; Patel, Rajan S.; Brammer, Lynnette; Smolinski, Mark S.; Brilliant, Larry (2008). "Detecting influenza epidemics using search engine query data". Nature 457 (7232): 1012–1014. ISSN 0028-0836. doi:10.1038/nature07634. 
  5. Paolotti, D.; Carnahan, A.; Colizza, V.; Eames, K.; Edmunds, J.; Gomes, G.; Koppeschaar, C.; Rehn, M.; Smallenburg, R.; Turbelin, C.; Van Noort, S.; Vespignani, A. (2014). "Web-based participatory surveillance of infectious diseases: the Influenzanet participatory surveillance experience". Clinical Microbiology and Infection 20 (1): 17–21. ISSN 1198-743X. doi:10.1111/1469-0691.12477. 
  6. Smolinski, Mark S.; Crawley, Adam W.; Baltrusaitis, Kristin; Chunara, Rumi; Olsen, Jennifer M.; Wójcik, Oktawia; Santillana, Mauricio; Nguyen, Andre; Brownstein, John S. (2015). "Flu Near You: Crowdsourced Symptom Reporting Spanning 2 Influenza Seasons". American Journal of Public Health 105 (10): 2124–2130. ISSN 0090-0036. doi:10.2105/AJPH.2015.302696. 
  7. McIver, David J.; Brownstein, John S. (2014). "Wikipedia Usage Estimates Prevalence of Influenza-Like Illness in the United States in Near Real-Time". PLoS Computational Biology 10 (4): e1003581. ISSN 1553-7358. doi:10.1371/journal.pcbi.1003581. 
  8. Generous, Nicholas; Fairchild, Geoffrey; Deshpande, Alina; Del Valle, Sara Y.; Priedhorsky, Reid (2014). "Global Disease Monitoring and Forecasting with Wikipedia". PLoS Computational Biology 10 (11): e1003892. ISSN 1553-7358. doi:10.1371/journal.pcbi.1003892. 
  9. Salathé, Marcel; Freifeld, Clark C.; Mekaru, Sumiko R.; Tomasulo, Anna F.; Brownstein, John S. (2013). "Influenza A (H7N9) and the Importance of Digital Epidemiology". New England Journal of Medicine 369 (5): 401–404. ISSN 0028-4793. doi:10.1056/NEJMp1307752. 
  10. Althouse, Benjamin M; Scarpino, Samuel V; Meyers, Lauren Ancel; Ayers, John W; Bargsten, Marisa; Baumbach, Joan; Brownstein, John S; Castro, Lauren; Clapham, Hannah; Cummings, Derek AT; Del Valle, Sara; Eubank, Stephen; Fairchild, Geoffrey; Finelli, Lyn; Generous, Nicholas; George, Dylan; Harper, David R; Hébert-Dufresne, Laurent; Johansson, Michael A; Konty, Kevin; Lipsitch, Marc; Milinovich, Gabriel; Miller, Joseph D; Nsoesie, Elaine O; Olson, Donald R; Paul, Michael; Polgreen, Philip M; Priedhorsky, Reid; Read, Jonathan M; Rodríguez-Barraquer, Isabel; Smith, Derek J; Stefansen, Christian; Swerdlow, David L; Thompson, Deborah; Vespignani, Alessandro; Wesolowski, Amy (2015). "Enhancing disease surveillance with novel data streams: challenges and opportunities". EPJ Data Science 4 (1). ISSN 2193-1127. doi:10.1140/epjds/s13688-015-0054-0. 
  11. Funk, S.; Salathe, M.; Jansen, V. A. A. (2010). "Modelling the influence of human behaviour on the spread of infectious diseases: a review". Journal of The Royal Society Interface 7 (50): 1247–1256. ISSN 1742-5689. doi:10.1098/rsif.2010.0142. 
  12. Perra, Nicola; Balcan, Duygu; Gonçalves, Bruno; Vespignani, Alessandro (2011). "Towards a Characterization of Behavior-Disease Models". PLoS ONE 6 (8): e23084. ISSN 1932-6203. doi:10.1371/journal.pone.0023084. 
  13. Springborn, Michael; Chowell, Gerardo; MacLachlan, Matthew; Fenichel, Eli P (2015). "Accounting for behavioral responses during a flu epidemic using home television viewing". BMC Infectious Diseases 15 (1): 21. ISSN 1471-2334. doi:10.1186/s12879-014-0691-0. 
  14. Borge-Holthoefer, J.; Perra, N.; Goncalves, B.; Gonzalez-Bailon, S.; Arenas, A.; Moreno, Y.; Vespignani, A. (2016). "The dynamics of information-driven coordination phenomena: A transfer entropy analysis". Science Advances 2 (4): e1501158. ISSN 2375-2548. doi:10.1126/sciadv.1501158.