Research:Sequential Models for Analyzing Navigation/Reading Sessions

From Meta, a Wikimedia project coordination wiki
Collaborators: Alberto García Durán, Robert West
Duration:  2020-May – ??

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.


About

This project applies sequential modeling techniques, namely RNNs and Transformers, to capture the sequential dependencies among page views in the reading/navigation sessions extracted from the webrequest server logs. Once trained, such models can support several downstream tasks whose predictions can serve as recommendations to readers, including:

  1. Predicting the next article in a session
  2. Predicting a sequence of articles in a session
  3. Predicting the target of a navigation session
  4. Predicting user intent

In addition to the reading/navigation sessions extracted from the server logs, we also aim to use the underlying graph structure (captured by the hyperlink network of Wikipedia articles) and the textual content of each article to train multi-modal sequential models.

Data

We plan to use reading/navigation sessions extracted from Wikimedia's server logs, where all HTTP requests to Wikimedia projects are logged.

Method

We group the pages read by the same user under an identifier composed of the request IP and user-agent (both hashed). Once users are identified in this way, the next step is to define the term "session." We consider two aggregation strategies:

  • Navigation session: Uses the referrer field of the server logs to represent user navigation as trees. Drawbacks of this approach include:
      1. Unclear representation of behavior when the referrer field is not defined (e.g., a sequence of pages loaded from search engine results).
      2. Unclear representation of how the content was consumed by the reader (e.g., a user jumping between different tabs).
  • Reading session: Orders the navigation events by timestamp, as the reader generated them. This is a natural way to model the reader's trajectory across different topics. Drawbacks include:
      1. A heuristic is needed to split the event stream into separate reading sessions (e.g., one hour without activity).

Mixed approaches are also possible. The analysis will be limited to anonymous users.
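
The two aggregation strategies can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual pipeline: the event fields (`ip`, `user_agent`, `timestamp`, `page`, `referrer`), the function names, and the use of unsalted SHA-256 are all hypothetical. Only the one-hour inactivity threshold comes from the text above.

```python
import hashlib
from collections import defaultdict
from datetime import datetime, timedelta


def reader_id(ip, user_agent):
    """Hash IP and user-agent into an opaque identifier (assumed scheme)."""
    return hashlib.sha256(f"{ip}|{user_agent}".encode("utf-8")).hexdigest()


def reading_sessions(events, timeout=timedelta(hours=1)):
    """Group views per reader, sort by time, split on gaps longer than `timeout`."""
    by_reader = defaultdict(list)
    for ev in events:
        by_reader[reader_id(ev["ip"], ev["user_agent"])].append(ev)

    sessions = []
    for views in by_reader.values():
        views.sort(key=lambda ev: ev["timestamp"])
        current = [views[0]["page"]]
        for prev, ev in zip(views, views[1:]):
            if ev["timestamp"] - prev["timestamp"] > timeout:
                sessions.append(current)  # inactivity gap closes the session
                current = []
            current.append(ev["page"])
        sessions.append(current)
    return sessions


def navigation_trees(views):
    """Link each view to its referrer for one reader's time-sorted views.

    A view whose referrer was not seen earlier becomes a root. Repeated
    visits to the same page are collapsed, which a real pipeline may not want.
    """
    children = defaultdict(list)
    roots, seen = [], set()
    for ev in views:
        ref = ev.get("referrer")
        if ref in seen:
            children[ref].append(ev["page"])
        else:
            roots.append(ev["page"])
        seen.add(ev["page"])
    return roots, dict(children)


t0 = datetime(2020, 5, 1, 12, 0)
events = [
    {"ip": "192.0.2.1", "user_agent": "UA-1", "timestamp": t0,
     "page": "Cat", "referrer": None},
    {"ip": "192.0.2.1", "user_agent": "UA-1", "timestamp": t0 + timedelta(minutes=5),
     "page": "Lion", "referrer": "Cat"},
    {"ip": "192.0.2.1", "user_agent": "UA-1", "timestamp": t0 + timedelta(hours=3),
     "page": "Paris", "referrer": None},
    {"ip": "192.0.2.2", "user_agent": "UA-2", "timestamp": t0 + timedelta(minutes=1),
     "page": "Dog", "referrer": None},
]
sessions = reading_sessions(events)
roots, children = navigation_trees([ev for ev in events if ev["ip"] == "192.0.2.1"])
```

In this toy log, the three-hour gap splits the first reader's views into two reading sessions, while the referrer field links "Lion" under "Cat" and leaves "Paris" as a separate root, which illustrates the undefined-referrer drawback noted above.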

Once the reading/navigation sessions have been extracted from the server logs, we train the following language models on the article sequences from those sessions:

  • Word2Vec
  • LSTM
  • BERT (ongoing)
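
To train a language model on article sequences, each article can be treated as a token. The sketch below shows one plausible way to turn sessions into (context, next article) training examples for the next-article task. The helper names, the `<unk>` placeholder, and the `max_context` window are assumptions for illustration; the page does not describe the project's actual preprocessing.

```python
def build_vocab(sessions):
    """Map each article title to an integer id; id 0 is reserved for unknowns."""
    vocab = {"<unk>": 0}
    for session in sessions:
        for title in session:
            vocab.setdefault(title, len(vocab))
    return vocab


def next_article_examples(sessions, vocab, max_context=5):
    """Emit (context ids, target id) pairs for every position after the first."""
    examples = []
    for session in sessions:
        ids = [vocab.get(title, 0) for title in session]
        for i in range(1, len(ids)):
            context = ids[max(0, i - max_context):i]
            examples.append((context, ids[i]))
    return examples


sessions = [["Cat", "Lion", "Tiger"], ["Paris", "France"]]
vocab = build_vocab(sessions)
examples = next_article_examples(sessions, vocab)
```

A sequence model such as an LSTM would consume the context ids and be trained to score the target id highly. Word2Vec instead learns article embeddings from co-occurrence within a window of the same sequences.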

Baseline

As a baseline, we use the "morelike" feature of mw:Extension:CirrusSearch, which extracts keywords from an article via TF-IDF and searches for other articles containing those keywords.
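
The idea behind this baseline can be illustrated with a toy TF-IDF keyword search. This is not CirrusSearch's implementation: the function names, the tokenized toy corpus, the scoring formula (raw term frequency times log inverse document frequency), and the overlap-based ranking are all simplifying assumptions.

```python
import math
from collections import Counter


def tfidf_keywords(docs, title, k=3):
    """Return the k highest-scoring TF-IDF terms of one article."""
    n = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # document frequency counts each doc once
    tf = Counter(docs[title])
    scores = {w: tf[w] * math.log(n / df[w]) for w in tf}
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


def more_like(docs, title, k=3):
    """Rank the other articles by how many of the query's keywords they contain."""
    keywords = set(tfidf_keywords(docs, title, k))
    overlap = {
        other: len(keywords & set(tokens))
        for other, tokens in docs.items()
        if other != title
    }
    return sorted(overlap, key=lambda t: (-overlap[t], t))


docs = {
    "Cat": "cat feline pet animal".split(),
    "Lion": "lion feline wild animal".split(),
    "Paris": "paris city france capital".split(),
    "Dog": "dog pet animal".split(),
}
keywords = tfidf_keywords(docs, "Cat")
ranking = more_like(docs, "Cat")
```

Here the common term "animal" scores low because it appears in most articles, so the keywords for "Cat" favor rarer terms, and "Paris" ranks last because it shares none of them.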

Result

Based on our initial experiments on SimpleWiki, we have obtained the following results, reported as mean reciprocal rank (MRR):

Method      MRR
MoreLike    0.176
Word2Vec    0.17
LSTM        0.30
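
For reference, MRR averages the reciprocal of the rank at which the correct article appears in each ranked prediction list. The sketch below uses the standard definition; the function name and the choice to score a missing target as 0 are assumptions, and the page does not specify the evaluation protocol behind the numbers above.

```python
def mean_reciprocal_rank(ranked_lists, targets):
    """Average of 1/rank of each target in its ranked list (0 if absent)."""
    total = 0.0
    for ranking, target in zip(ranked_lists, targets):
        if target in ranking:
            total += 1.0 / (ranking.index(target) + 1)  # ranks are 1-based
    return total / len(targets)


ranked_lists = [["A", "B", "C"], ["B", "A", "C"], ["C", "B", "A"]]
targets = ["A", "A", "X"]
mrr = mean_reciprocal_rank(ranked_lists, targets)  # (1 + 1/2 + 0) / 3 = 0.5
```

Under this definition, an MRR of 0.30 corresponds to the correct article appearing, on average, at roughly the third position of the ranked list.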