Research:Characterizing Readers Navigation

From Meta, a Wikimedia project coordination wiki
Alberto García Durán
Robert West
Duration:  2020-June – ??

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.

This project aims to understand the behavior of Wikipedia readers. Previous work [1] [2] used qualitative approaches, focusing on the motivations behind visiting an article and modeling reader intentions in a taxonomy. In this project, we take a data-driven approach to discover patterns that show how readers satisfy their information needs.

We plan to study how readers explore the Wikipedia graph and how their navigation trajectories unfold in different spaces: across topics, across languages, and within the content of a page (hover popups).

Research questions

Previous research has successfully addressed the question "Why we read Wikipedia?" [1] [2] in order to understand, e.g., the motivations behind reading Wikipedia. Here, we investigate whether and to what degree this is reflected in how we use Wikipedia. By characterizing readership behavior and patterns, this project contributes towards the goal of addressing knowledge gaps in readership. Following the idea of the pipeline of online participation [3], it will also provide crucial insights into possible causes of knowledge gaps in contributorship and content.

Empirical characterization of reading strategies

What are the different strategies used to explore Wikimedia content?

How does the content affect the navigation in Wikimedia projects?

  • How do strategies differ across topics?
  • Can we define metrics for the usage of articles which are more informative than simply counting the number of pageviews? For example, [4] suggested metrics for articles based on clickstream data to identify articles that function as sinks/sources or bottlenecks/distributors.

Modeling of reading sessions

How can we identify content relevant to specific topics based on readers' interests using machine-learning models?

  • Possible use cases include support for list-building in Wikimedia projects. For example, for the WikiProject Covid-19 we could identify sets of articles that were read in conjunction with articles on the 2019-20_coronavirus_pandemic.
  • Previous research on navigation vectors generated embeddings from reading sessions, using word2vec to capture the semantic similarity of articles. Qualitative evaluation suggested that the approach was not necessarily perceived as more useful for recommending related articles than text-based approaches (such as the RelatedArticles extension, based on the morelike search). Here, we assess the potential of more sophisticated models based on deep neural networks, such as LSTMs, which explicitly take into account the sequential nature of the process.


We base the analysis on the server logs available in the Hive table "web-request".

The format of the data and the variety of navigation patterns make it challenging to study how people explore content on Wikipedia. The server logs carry no explicit information about user sessions, so we group the pages read by the same user under an identifier composed of the request IP and the user agent (hashed).
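
As an illustration of the identifier described above, the following is a minimal sketch and not the project's actual pipeline. The hash function (SHA-256), the optional salt, and the name `reader_id` are assumptions made for this example; the source only states that the identifier combines request IP and user agent and is hashed.

```python
import hashlib


def reader_id(ip: str, user_agent: str, salt: str = "") -> str:
    """Return a hashed pseudo-identifier for one (IP, user agent) pair.

    Assumption: SHA-256 with an optional salt. The source does not
    specify which hash function is used.
    """
    # A separator keeps ("ab", "c") and ("a", "bc") from colliding.
    payload = f"{salt}|{ip}|{user_agent}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Requests with the same IP and user agent map to the same identifier.
same = reader_id("203.0.113.7", "Mozilla/5.0") == reader_id("203.0.113.7", "Mozilla/5.0")
```

Note that such an identifier is only an approximation of a reader: several people behind one IP with identical browsers collapse into one identifier, and one person on two devices is split into two.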

The first challenge is the definition of "session." In this context, we consider two types of aggregation strategies:

Reading session: This approach takes the navigation events sorted by timestamp, as the reader generated them. It is the natural way to model the trajectory of a reader across different topics. While it is a pragmatic and seemingly straightforward way to generate sessions, it suffers from the following drawbacks:

  • A heuristic is needed to split different reading sessions (e.g., one hour without activity).
  • It projects the reading session onto a one-dimensional sequence of pageviews, ignoring more intricate aspects of reader navigation such as multi-tab browsing.
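
The inactivity heuristic for reading sessions can be sketched as follows. This is a minimal illustration under stated assumptions: the input is the list of `(timestamp, page)` pairs of a single reader, the function name `split_reading_sessions` is hypothetical, and the one-hour default is the example threshold mentioned in the text.

```python
from datetime import datetime, timedelta


def split_reading_sessions(pageviews, gap=timedelta(hours=1)):
    """Split one reader's pageviews into reading sessions.

    pageviews: iterable of (timestamp, page) pairs for a single reader.
    A new session starts whenever two consecutive pageviews are
    separated by more than `gap`.
    """
    sessions = []
    current = []
    last_ts = None
    for ts, page in sorted(pageviews, key=lambda pv: pv[0]):
        if last_ts is not None and ts - last_ts > gap:
            sessions.append(current)
            current = []
        current.append(page)
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions


t0 = datetime(2020, 6, 1, 12, 0)
views = [
    (t0, "A"),
    (t0 + timedelta(minutes=10), "B"),
    (t0 + timedelta(hours=2), "C"),  # more than one hour after "B"
]
sessions = split_reading_sessions(views)  # [["A", "B"], ["C"]]
```

The result is a flat sequence per session, which is exactly the one-dimensional projection named as a drawback: two pages opened in parallel tabs are indistinguishable from two pages read one after the other.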

Navigation session: In this approach, we represent reader navigation as a tree (instead of a sequence) by taking into account the referer of each pageview. In this way, we can capture different reading strategies, such as multi-tab browsing. Nevertheless, we still have to consider the following drawbacks:

  • Unclear representation of behaviors where the referer field is not defined (e.g., a sequence of pages loaded from search engine results).
  • Unclear representation of how the content was consumed by the reader (e.g., a user jumping between different tabs).
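
A navigation session of this kind can be sketched as a forest built from referer links. The attachment rule used here (link each pageview to the most recent earlier pageview of its referer page, and treat pageviews with an unknown referer as new roots) is an assumption made for this example, as are the function name `build_navigation_forest` and the `(timestamp, page, referer)` input format.

```python
from collections import defaultdict


def build_navigation_forest(pageviews):
    """Build navigation trees for one reader from referer information.

    pageviews: list of (timestamp, page, referer) tuples, where referer
    is the page the reader came from, or None if it is not an article
    (for example a search engine) or is undefined.

    Returns (ordered, roots, children): the pageviews sorted by time,
    the indices of tree roots, and a mapping parent index -> child indices.
    """
    ordered = sorted(pageviews, key=lambda pv: pv[0])
    children = defaultdict(list)
    roots = []
    last_seen = {}  # page -> index of its most recent pageview
    for idx, (_, page, referer) in enumerate(ordered):
        parent = last_seen.get(referer)
        if parent is None:
            roots.append(idx)  # no known referer: start a new tree
        else:
            children[parent].append(idx)
        last_seen[page] = idx
    return ordered, roots, dict(children)


views = [
    (1, "A", None),  # arrives from outside
    (2, "B", "A"),   # opened from A
    (3, "C", "A"),   # also opened from A (e.g., in a second tab)
    (4, "D", "B"),
    (5, "E", None),  # new entry point, starts a second tree
]
ordered, roots, children = build_navigation_forest(views)
# roots == [0, 4]; children == {0: [1, 2], 1: [3]}
```

The branching at "A" is what a flat reading session cannot express. The sketch also makes the first drawback visible: every pageview without a usable referer becomes a separate root, so a run of pages loaded from search results yields disconnected single-node trees.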


Empirical characterization of reading strategies

In a first exploratory study, we examined three aspects of reading sessions:

  • revisiting basic assumptions in the generation of reading sessions, i.e., the common choice of cutting a session if two pageviews are separated by more than 1 hour
  • the length (sometimes called session depth) of reading sessions depending on the topic of the session (more specifically, the topic of the first pageview)
  • the rate of diffusion in the topical space

Modeling of reading sessions

Sequential models

Detailed results: Research:Characterizing_Readers_Navigation/Modeling_Reading_Sessions:_First_Round_of_Analysis.

Short summary: We model reading sessions using sequential models such as LSTMs. Since these models explicitly account for the sequential nature of the data (a reading session is a sequence of pageviews), we expect them to capture the relation between articles based on reader interest better. In fact, when evaluated on their ability to predict the next article in a reading session, LSTMs lead to substantial improvements over different baselines, most notably embeddings generated using word2vec or a model using text-based similarity. For example,

  • recall@1 (i.e., the top-ranked prediction is in fact the next article in the session) increases by a factor of up to 2 (from 0.139 to 0.280 in some languages) compared to baselines.
  • Embeddings from word2vec perform worse than, or at most very slightly better than, simple text-based heuristics. This matches previous qualitative studies in Research:Evaluating_RelatedArticles_recommendations.
  • Scalability of LSTMs to large wikis is a problem.
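
For readers unfamiliar with the metric, recall@k for next-article prediction can be computed as in the sketch below. This is a generic illustration, not the project's evaluation code; the function name `recall_at_k` and the toy data are made up for this example.

```python
def recall_at_k(ranked_predictions, true_next, k=1):
    """Fraction of test cases whose true next article is in the top k.

    ranked_predictions: list of ranked candidate lists, best first.
    true_next: list of the articles that were actually read next.
    """
    if len(ranked_predictions) != len(true_next):
        raise ValueError("inputs must have the same length")
    if not true_next:
        raise ValueError("need at least one test case")
    hits = sum(
        1
        for ranked, target in zip(ranked_predictions, true_next)
        if target in ranked[:k]
    )
    return hits / len(true_next)


# Toy example with three held-out pageviews.
preds = [["B", "C"], ["D", "A"], ["E", "F"]]
truth = ["B", "A", "Z"]
r1 = recall_at_k(preds, truth, k=1)  # only the first case is a hit
r2 = recall_at_k(preds, truth, k=2)  # the first two cases are hits
```

With k=1, a prediction counts only if the model's single top-ranked article is the one the reader actually opened next, which is why this is the strictest variant of the metric.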


  1. Singer, Philipp; Lemmerich, Florian; West, Robert; Zia, Leila; Wulczyn, Ellery; Strohmaier, Markus; Leskovec, Jure (2017). "Why We Read Wikipedia". pp. 1591–1600. doi:10.1145/3038912.3052716.
  2. Lemmerich, Florian; Sáez-Trumper, Diego; West, Robert; Zia, Leila (2019). "Why the World Reads Wikipedia". pp. 618–626. doi:10.1145/3289600.3291021.
  3. Shaw, Aaron; Hargittai, Eszter (2018). "The Pipeline of Online Participation Inequalities: The Case of Wikipedia Editing". Journal of Communication 68 (1): 143–168. ISSN 0021-9916. doi:10.1093/joc/jqx003.
  4. Gildersleve, Patrick; Yasseri, Taha (2018). "Inspiration, Captivation, and Misdirection: Emergent Properties in Networks of Online Navigation". pp. 271–282. ISSN 2213-8684. doi:10.1007/978-3-319-73198-8_23.
