Research:Characterizing Readers Navigation
This project aims to understand the behavior of Wikipedia readers. Previous work used qualitative approaches, focusing on the motivations behind visiting an article and modeling reader intention in a taxonomy. In this project, we take a data-driven approach to discover patterns that show how users satisfy their information needs.
We plan to study how readers explore the graph of Wikipedia and to study the navigation trajectories in different spaces: across topics, across languages, and in the content of the page (hover popups).
Previous research has successfully addressed the question "Why we read Wikipedia?" (Singer et al., 2017; Lemmerich et al., 2019) in order to understand, e.g., the motivations behind reading Wikipedia. Here, we investigate whether and to what degree this is reflected in how we use Wikipedia. By characterizing readership behavior and patterns, this project contributes towards the goal of addressing knowledge gaps in readership. Following the idea of the pipeline of online participation (Shaw and Hargittai, 2018), this will also provide crucial insights into possible causes of knowledge gaps in contributorship and content.
Empirical characterization of reading strategies
What are different strategies used in the exploration of Wikimedia content?
- For example, to what degree do readers navigate according to a depth-first search (i.e., "going down the rabbit hole") or a breadth-first search (i.e., multi-tab browsing)?
How does the content affect the navigation in Wikimedia projects?
- How do strategies differ across topics?
- Can we define metrics for the usage of articles that are more informative than simply counting pageviews? For example, Gildersleve and Yasseri (2018) suggested metrics for articles based on clickstream data to identify articles that function as sinks/sources or bottlenecks/distributors.
Modeling of reading sessions
How can we identify content relevant to specific topics based on readers' interests using machine-learning models?
- Possible use cases include support for list-building for Wikimedia projects. For example, for the WikiProject Covid-19 we could identify sets of articles that were read in conjunction with articles on the 2019-20_coronavirus_pandemic.
- Previous research on navigation vectors generated embeddings from reading sessions, capturing the semantic similarity of articles using word2vec. Qualitative evaluation suggested that the approach was not necessarily perceived as more useful in recommending related articles than text-based approaches (such as the RelatedArticles extension based on the morelike search). Here, we assess the potential of more sophisticated models based on deep neural networks, such as LSTMs, which explicitly take into account the sequential nature of the process.
We base the analysis on the server logs available in the Hive table "web-request".
The format of the data and the variety of navigation patterns pose a challenge in studying how people explore content on Wikipedia. The server logs contain no explicit information about user sessions, so we aggregate the set of pages read by the same user under an identifier composed of the request IP and user-agent (hashed).
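As an illustration of such an identifier, the sketch below hashes the IP and user-agent pair. The choice of SHA-256, the salt, and the function name `reader_id` are assumptions made for this example; the text only states that the data is hashed.

```python
import hashlib


def reader_id(ip: str, user_agent: str, salt: str = "hypothetical-salt") -> str:
    """Return a hashed identifier for one (IP, user-agent) pair.

    SHA-256 and the salt are assumptions for illustration only.
    """
    payload = f"{salt}|{ip}|{user_agent}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Toy values: the same pair maps to the same identifier, a different IP does not.
a = reader_id("192.0.2.1", "Mozilla/5.0")
b = reader_id("192.0.2.1", "Mozilla/5.0")
c = reader_id("192.0.2.2", "Mozilla/5.0")
```

Pageviews that share an identifier can then be grouped before any session logic is applied. Note that such an identifier is only a proxy for a reader: several people behind one IP with the same browser would be merged.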
The first challenge is the definition of "session." In this context, we consider two types of aggregation strategies:
Reading session: This approach takes into account the navigation events sorted by timestamp, as the readers generated them. It represents the natural way to model the trajectory of a reader across different topics. While it constitutes a pragmatic and seemingly straightforward approach to generating sessions, it suffers from the following drawbacks:
- It requires a heuristic to split different reading sessions (e.g., one hour without activity).
- It projects the reading session onto a one-dimensional sequence of pageviews, ignoring more intricate aspects of reader navigation such as multi-tab browsing.
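The splitting heuristic for reading sessions can be sketched as follows. This is a minimal example that assumes pageviews for a single reader are available as `(timestamp, title)` pairs; the function name `split_sessions` and the tuple layout are hypothetical, and the one-hour threshold is the common choice mentioned in the text.

```python
from typing import List, Tuple

Pageview = Tuple[float, str]  # (unix timestamp in seconds, page title)


def split_sessions(pageviews: List[Pageview],
                   max_gap: float = 3600.0) -> List[List[Pageview]]:
    """Split one reader's pageviews into sessions at gaps longer than max_gap."""
    sessions: List[List[Pageview]] = []
    for view in sorted(pageviews):  # order by timestamp
        if sessions and view[0] - sessions[-1][-1][0] <= max_gap:
            sessions[-1].append(view)  # continue the current session
        else:
            sessions.append([view])    # gap too long (or first view): new session
    return sessions


# Toy data: B and C are 4400 seconds apart, so the session is cut between them.
views = [(0.0, "A"), (600.0, "B"), (5000.0, "C"), (5100.0, "D")]
sessions = split_sessions(views)
```

The result is a flat sequence per session, which is exactly the one-dimensional projection described in the second drawback.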
Navigation session: In this approach, we represent reader navigation as a tree (instead of a sequence) by taking into account the referer of each pageview. In this way we can capture different reading strategies, such as multi-tab browsing. Nevertheless, this approach has the following drawbacks:
- Unclear representation of behaviors where the referer field is not defined (e.g., a sequence of pages loaded from search engine results).
- Unclear representation of how the content was consumed by the reader (e.g., a user jumping between different tabs).
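A navigation tree of this kind could be built roughly as in the sketch below. It assumes time-ordered pageviews given as `(page, referer)` pairs, with `None` standing for an external or undefined referer. Attaching each pageview to the most recent earlier view of its referer is an assumed heuristic for this example, not necessarily the rule used in the project.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# (page title, referer title, or None when the referer is external/undefined)
Pageview = Tuple[str, Optional[str]]


def build_navigation_trees(
    pageviews: List[Pageview],
) -> Tuple[List[int], Dict[int, List[int]]]:
    """Return root indices and a parent-index -> child-indices mapping."""
    children: Dict[int, List[int]] = defaultdict(list)
    roots: List[int] = []
    last_seen: Dict[str, int] = {}  # page title -> index of its latest view
    for idx, (page, referer) in enumerate(pageviews):
        parent = last_seen.get(referer) if referer is not None else None
        if parent is None:
            roots.append(idx)            # no known referer: start a new tree
        else:
            children[parent].append(idx)
        last_seen[page] = idx
    return roots, dict(children)


# Toy data: A has two children (B and C), as in multi-tab browsing.
views = [("A", None), ("B", "A"), ("C", "A"), ("D", "B"), ("E", None)]
roots, children = build_navigation_trees(views)
```

Pageviews without a usable referer (here A and E) become separate roots, which illustrates the first drawback: consecutive pages opened from search results end up as disconnected trees.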
Empirical characterization of reading strategies
In a first exploratory study, we examined three aspects of reading sessions:
- revisiting basic assumptions in the generation of reading sessions, i.e. the common choice of cutting a session if two pageviews are separated by more than 1 hour
- the length (or sometimes called session depth) of reading sessions depending on the topic of the session (more specifically, the topic of the first pageview)
- the rate of diffusion in the topical space
Modeling of reading sessions with sequential models
Short Summary: We model reading sessions using sequential models such as LSTMs. Since these models explicitly account for the sequential nature of the data (a reading session is a sequence of pageviews), they should be better able to capture the relation between articles based on reader interest. In fact, when evaluating the models on their ability to predict the next article in a reading session, LSTMs lead to substantial improvements over different baselines, most notably embeddings generated using word2vec and a model using text-based similarity. For example,
- Recall@1 (i.e., the top-ranked prediction is in fact the next article in the session) increases by a factor of up to 2 (from 0.139 to 0.280 in some languages) compared to baselines.
- Embeddings from word2vec perform worse than, or at most very slightly better than, simple text-based heuristics. This matches previous qualitative studies in Research:Evaluating_RelatedArticles_recommendations.
- Scalability of LSTMs to large wikis is a problem.
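The recall@k metric used in the evaluation can be sketched as follows, assuming each model returns a ranked list of candidate next articles per session. The function name `recall_at_k` and the toy predictions are illustrative and not taken from the project's code or data.

```python
from typing import Sequence


def recall_at_k(ranked_predictions: Sequence[Sequence[str]],
                targets: Sequence[str],
                k: int = 1) -> float:
    """Fraction of cases where the true next article is among the top-k predictions."""
    hits = sum(
        1 for preds, target in zip(ranked_predictions, targets)
        if target in preds[:k]
    )
    return hits / len(targets)


# Toy example with four sessions and two ranked candidates each.
preds = [["B", "C"], ["D", "A"], ["A", "B"], ["C", "D"]]
targets = ["B", "A", "C", "C"]
r1 = recall_at_k(preds, targets, k=1)  # 2 of 4 top-ranked predictions are correct
r2 = recall_at_k(preds, targets, k=2)  # 3 of 4 targets appear in the top 2
```

With k=1 this reduces to the accuracy of the top-ranked prediction, which is the quantity reported in the first bullet.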
Comparing webrequest-logs and clickstream-data
Systematic studies of reader navigation in Wikipedia are limited by a lack of publicly available data, which follows from the commitment to protect readers’ privacy by not storing or sharing potentially sensitive data. In the past, we have granted access to webrequest-logs when we have found strict alignment of interests between the direction of the Research team or Wikimedia Foundation and researchers in academia or industry, and we have turned down many requests with a heavy heart. Therefore, we wanted to address the following question: how well can the navigation of readers be approximated using publicly available resources alone, most notably the Wikipedia clickstream data?
We answered this question by quantifying the difference between real and synthetic navigation sequences generated from the clickstream data, through 6 different experiments across 8 Wikipedia language versions. Our main finding is that differences are statistically significant but the effect sizes are small, often well within 10%.
We thus provide quantitative evidence for the utility of the Wikipedia clickstream data as a public resource by showing that it can closely capture reader navigation on Wikipedia, and constitutes a sufficient approximation for most practical downstream applications relying on data from readers. More generally, this work provides an example for how clickstream-like data can empower broader research on navigation in other online platforms while protecting users’ privacy.
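One plausible way to generate synthetic navigation sequences from clickstream-like counts is a first-order random walk, sketched below. The first-order (memoryless) assumption, the `(source, target, count)` triple layout, and both function names are assumptions made for this example; see the linked subpage for the procedure actually used.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Transitions = Dict[str, Tuple[List[str], List[int]]]


def build_transitions(clickstream: List[Tuple[str, str, int]]) -> Transitions:
    """Group (source, target, count) triples by source article."""
    grouped = defaultdict(lambda: ([], []))
    for source, target, count in clickstream:
        grouped[source][0].append(target)
        grouped[source][1].append(count)
    return dict(grouped)


def sample_sequence(transitions: Transitions, start: str,
                    max_length: int, rng: random.Random) -> List[str]:
    """Random walk: pick each next article in proportion to its click count."""
    sequence = [start]
    while len(sequence) < max_length and sequence[-1] in transitions:
        targets, counts = transitions[sequence[-1]]
        sequence.append(rng.choices(targets, weights=counts, k=1)[0])
    return sequence


# Toy counts, not real clickstream data.
toy = [("A", "B", 90), ("A", "C", 10), ("B", "C", 5)]
transitions = build_transitions(toy)
seq = sample_sequence(transitions, "A", 5, random.Random(0))
```

A walk stops when it reaches an article without outgoing counts or when `max_length` is hit. Because each step depends only on the current article, such sequences cannot reproduce longer-range dependencies in real sessions, which is one reason a comparison against real navigation sequences is needed.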
More details can be found at Research:Characterizing Readers Navigation/Comparing-webrequest-logs-and-clickstream-data
In order to understand how readers explore content on Wikipedia when learning about a given subject, we systematically characterized the structure of the navigation pathways: i) how readers reach an article, ii) how they transition between articles, and iii) how they combine these patterns into more complex navigation sequences.
Our main findings are that:
- Most readers reach an article from an external search engine such as Google (read more about a recently released public dataset on search referrers in a TECH:Blogpost).
- External search engines also play a major role for navigation between two Wikipedia articles. This happens for a substantial fraction of consecutively viewed pages (often within a few minutes of each other). Often readers do not use the available hyperlinks in the page.
- Reading sessions are generally very short; however, in absolute terms there are millions of readers who visit 10 or more articles, with strong variations with respect to the device (desktop/mobile) or the topic of interest.
Wikipedia is a rich knowledge base that fulfills multiple dynamic information needs that depend on the reader's context. One of the aspects driving what type of information people look up online is the time of the day. For example, during an average evening, we may be more inclined to look for information about a TV show than a math theorem. Naturally, given the substantial amount of time we spend browsing the Web, this tendency extends to online attention, including the content we consume on Wikipedia. We characterize these diurnal patterns to describe their presence and how topics, access methods, and country affect their shape.
We found that each article has a specific consumption fingerprint throughout the day, with a major distinction of content that tends to be consumed during the day versus the evening or night. Articles exhibit similar consumption patterns based on their topics, and the access method and country are associated with different reading habits during the day.
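A consumption fingerprint of this kind can be approximated as the share of an article's pageviews falling into each hour of the day. The sketch below assumes pageview timestamps are given as Unix seconds and bins them in UTC; a real analysis would presumably use the reader's local time, and the function name `hourly_fingerprint` is hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone
from typing import Iterable, List


def hourly_fingerprint(timestamps: Iterable[float]) -> List[float]:
    """Return 24 values: the share of pageviews in each hour of the day (UTC)."""
    counts = Counter(
        datetime.fromtimestamp(ts, tz=timezone.utc).hour for ts in timestamps
    )
    total = sum(counts.values())
    if total == 0:
        return [0.0] * 24
    return [counts.get(hour, 0) / total for hour in range(24)]


# Toy timestamps: one view at 00:00, two during the 01:00 hour, one at 20:00 (UTC).
fp = hourly_fingerprint([0.0, 3600.0, 3700.0, 72000.0])
```

Normalizing by the total makes fingerprints comparable between articles with very different traffic, so articles can be grouped by the shape of their daily pattern rather than by volume.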
More detailed results: Research:Characterizing Readers Navigation/Temporal Rhythms - Work In Progress
As part of this project, we published the following papers:
- Wikipedia Reader Navigation: When Synthetic Data Is Enough (pdf from arXiv)
- A Large-Scale Characterization of How Readers Browse Wikipedia (pdf from arXiv)
- Going Down the Rabbit Hole: Characterizing the Long Tail of Wikipedia Reading Sessions (pdf from arXiv)
- Curious Rhythms: Temporal Regularities of Wikipedia Consumption (pdf from arXiv)
- Singer, Philipp; Lemmerich, Florian; West, Robert; Zia, Leila; Wulczyn, Ellery; Strohmaier, Markus; Leskovec, Jure (2017). "Why We Read Wikipedia". pp. 1591–1600. doi:10.1145/3038912.3052716.
- Lemmerich, Florian; Sáez-Trumper, Diego; West, Robert; Zia, Leila (2019). "Why the World Reads Wikipedia". pp. 618–626. doi:10.1145/3289600.3291021.
- Shaw, Aaron; Hargittai, Eszter (2018). "The Pipeline of Online Participation Inequalities: The Case of Wikipedia Editing". Journal of Communication 68 (1): 143–168. ISSN 0021-9916. doi:10.1093/joc/jqx003.
- Gildersleve, Patrick; Yasseri, Taha (2018). "Inspiration, Captivation, and Misdirection: Emergent Properties in Networks of Online Navigation". pp. 271–282. ISSN 2213-8684. doi:10.1007/978-3-319-73198-8_23.
- Arora, Akhil; Gerlach, Martin; Piccardi, Tiziano; García-Durán, Alberto; West, Robert (2022-02-15). "Wikipedia Reader Navigation: When Synthetic Data Is Enough". Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining. WSDM '22 (New York, NY, USA: Association for Computing Machinery): 16–26. ISBN 978-1-4503-9132-0. doi:10.1145/3488560.3498496.
- Piccardi, Tiziano; Gerlach, Martin; Arora, Akhil; West, Robert (2023-01-13). "A Large-Scale Characterization of How Readers Browse Wikipedia". ACM Transactions on the Web. ISSN 1559-1131. doi:10.1145/3580318.
- Piccardi, Tiziano; Gerlach, Martin; West, Robert (2022-08-16). "Going Down the Rabbit Hole: Characterizing the Long Tail of Wikipedia Reading Sessions". Companion Proceedings of the Web Conference 2022. WWW '22 (New York, NY, USA: Association for Computing Machinery): 1324–1330. ISBN 978-1-4503-9130-6. doi:10.1145/3487553.3524930.
- Piccardi, Tiziano; Gerlach, Martin; West, Robert (2023). "Curious Rhythms: Temporal Regularities of Wikipedia Consumption". arXiv [cs.CY]. http://arxiv.org/abs/2305.09497