Research:Characterizing Readers Navigation/Modeling Reading Sessions: First Round of Analysis

In the first round of analysis, we explore the potential of modeling reading sessions with sequential models such as LSTMs.

Main findings:

  • LSTM models yield large improvements in next-article prediction for the datasets on which we were able to run the model; for example, on hiwiki recall@1 more than doubles compared to morelike, from 0.139 to 0.28
  • the previously used Research:Wikipedia_Navigation_Vectors, based on word2vec, perform worse than or on par with the text-based morelike used in the RelatedArticles feature; this is consistent with a previous qualitative evaluation with human raters ([1])

Problems and future directions:

  • Scalability to larger wikis: in order to train the LSTM, we use the GPUs on stat1005 and stat1008. With a single GPU we can only train on smaller datasets (say, up to 1M sessions). As a result, the LSTM is at the moment not suitable for large wikis, for which we need a larger number of sessions to get sufficient coverage. This motivates several approaches:
    • introduce approximations, e.g. in the softmax layer, in order to reduce the computational cost of training
    • targeted sampling: we know that pageviews are very unevenly distributed (i.e. some pages occur orders of magnitude more often than others), so simply increasing the amount of data might not add much new information/signal. Instead, one could preferentially sample sessions containing articles with lower coverage (see the sketch after this list).
    • add additional information from the underlying generative process, such as information about available links or layout information.
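
A minimal sketch of what such targeted sampling could look like (assuming sessions are already extracted as lists of page titles; weighting each session by the inverse frequency of its rarest page is just one illustrative choice):

    import random
    from collections import Counter

    def sample_sessions(sessions, n_samples, seed=0):
        """Preferentially keep sessions that contain rarely-viewed pages.

        sessions: list of reading sessions, each a list of page titles.
        Each session is weighted by the inverse count of its rarest page,
        so sessions covering low-coverage articles are sampled more often.
        """
        counts = Counter(page for session in sessions for page in session)
        weights = [1.0 / min(counts[page] for page in session) for session in sessions]
        random.seed(seed)
        # sampling with replacement; deduplicate afterwards if repeats are undesirable
        return random.choices(sessions, weights=weights, k=n_samples)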


Motivation

Previous research on Research:Wikipedia_Navigation_Vectors modeled reading sessions to generate embeddings of articles which capture their semantic similarity based on reader interests. One possible application was to use these embeddings to generate recommendations for which articles to read next. In a qualitative evaluation, such recommendations were compared with text-based recommendations from the RelatedArticles extension (based on the morelike search), finding that the latter was judged more useful by human raters; this shows that RelatedArticles constitutes a hard-to-beat baseline.

In order to learn the embeddings of articles from reading sessions, the navigation vectors used Word2vec, a common approach in natural language processing to capture the semantic similarity of words (here: articles) in large collections of documents (here: reading sessions). However, one of the main limitations of this model is that it does not explicitly take into account the sequential information of the reading session (only indirectly, via a context window). We hypothesize that the sequential process plays an important role in understanding and modeling reading sessions.
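
For illustration, such embeddings can be trained with an off-the-shelf word2vec implementation by treating each reading session as a "sentence" of article titles. A minimal sketch using gensim (4.x API); the toy sessions and hyperparameters are placeholders, not the settings used for the navigation vectors:

    from gensim.models import Word2Vec

    # toy example: each reading session is a list of article identifiers
    sessions = [
        ["Earth", "Moon", "Tide"],
        ["Moon", "Apollo_11"],
        ["Earth", "Solar_System", "Moon"],
    ]

    # skip-gram word2vec over sessions: articles play the role of words and
    # sessions the role of sentences; sequential order enters only indirectly
    # through the context window
    model = Word2Vec(sentences=sessions, vector_size=32, window=3, min_count=1, sg=1)

    # nearest neighbours in embedding space serve as "read next" candidates
    print(model.wv.most_similar("Moon", topn=2))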

Therefore, we aim to model reading sessions using explicitly sequential models. One of the most well-known approaches from natural language processing is the LSTM (long short-term memory network).



Methods

Data

We extract reading sessions for 14 different Wikipedias (the same as in Why the world reads Wikipedia) from 1 week of webrequest logs. Specifically, we follow the basic approach described in Research:Wikipedia_Navigation_Vectors#Data_Preparation, i.e.

  • keep only requests which are pageviews in the main namespace
  • remove pageviews from identified bots (as a proxy for other automated traffic, we additionally filter sessions with more than 100 pageviews per day)
  • keep only sessions from desktop and mobile-web
  • remove sessions which contain an edit attempt
  • remove sessions which contain the main page of the respective Wikipedia
  • cut reading sessions if the time between consecutive pageviews is longer than 1 hour (see Halfaker et al. 2015)
  • keep only sessions with 2 to 30 pageviews

We split the data randomly into train-, dev-, and test-set (80%/10%/10%) at the level of reading sessions. A minimal sketch of the session cutting, filtering, and splitting steps is given below.
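
The sketch assumes the pageviews of one client have already been extracted from the webrequest logs and sorted by timestamp; the function and variable names are illustrative:

    import random

    SESSION_GAP = 3600        # cut sessions after 1 hour of inactivity (in seconds)
    MIN_LEN, MAX_LEN = 2, 30  # keep only sessions with 2 to 30 pageviews

    def cut_sessions(pageviews):
        """pageviews: list of (timestamp, page_title) for one client, sorted by timestamp."""
        sessions, current, last_ts = [], [], None
        for ts, page in pageviews:
            if last_ts is not None and ts - last_ts > SESSION_GAP:
                sessions.append(current)
                current = []
            current.append(page)
            last_ts = ts
        if current:
            sessions.append(current)
        return [s for s in sessions if MIN_LEN <= len(s) <= MAX_LEN]

    def split_sessions(sessions, seed=0):
        """Random 80/10/10 split into train/dev/test at the level of sessions."""
        random.seed(seed)
        random.shuffle(sessions)
        n_train = int(0.8 * len(sessions))
        n_dev = int(0.1 * len(sessions))
        return (sessions[:n_train],
                sessions[n_train:n_train + n_dev],
                sessions[n_train + n_dev:])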


Basic statistics of the datasets.
wiki    N-sessions  N-pages  N-pageviews  N-pageviews-per-page
arwiki  3864756     536653   18283789     34.07
bnwiki  235596      55210    1080966      19.58
dewiki  18827285    1690625  90471373     53.51
enwiki  156936661   4863258  775580093    159.48
eswiki  25811921    1278193  114638465    89.69
hewiki  1358828     199486   6358130      31.87
hiwiki  901426      113000   3939958      34.87
huwiki  1651606     224810   7559146      33.62
jawiki  21862400    1136453  102855008    90.51
nlwiki  2924407     542331   13172173     24.29
rowiki  748245      144575   3316610      22.94
ruwiki  18478481    1275477  87049576     68.25
ukwiki  1992400     349542   9238183      26.43
zhwiki  7893189     867361   38637735     44.55

Models and baselines

We compare a total of 3 models.

  • LSTM, a sequential model to generate embeddings from reading sessions
  • word2vec, the model used in the navigation vectors to generate embeddings from reading sessions
  • morelike, a text-based model to find similar articles, used for recommendations in the RelatedArticles feature (an example query is sketched below)
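
For reference, the morelike baseline can be queried through the MediaWiki search API via the CirrusSearch morelike: keyword. A minimal sketch (the wiki and seed article are arbitrary examples):

    import requests

    # "morelike" results for a seed article, here on English Wikipedia;
    # the API caps the number of results per request (srlimit)
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "list": "search",
            "srsearch": "morelike:Moon",
            "srlimit": 50,
            "format": "json",
        },
    )
    recommendations = [hit["title"] for hit in resp.json()["query"]["search"]]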

Models are trained on the train-set. Performance is evaluated on the test-set; the dev-set is used for optimizing hyperparameters. Note that for morelike we do not do any additional training and only evaluate on the test-set.
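
A minimal sketch of an LSTM next-article model in PyTorch; the architecture, dimensions, and training step shown here are illustrative assumptions, not necessarily the configuration trained on stat1005/stat1008:

    import torch
    import torch.nn as nn

    class SessionLSTM(nn.Module):
        """Predict the next article from the articles read so far in a session."""

        def __init__(self, n_articles, emb_dim=128, hidden_dim=256):
            super().__init__()
            self.embedding = nn.Embedding(n_articles, emb_dim)
            self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, n_articles)  # full softmax over all articles

        def forward(self, sessions):
            # sessions: (batch, seq_len) tensor of article indices
            hidden, _ = self.lstm(self.embedding(sessions))
            return self.out(hidden)  # next-article logits at every position

    # toy training step: targets are the input sessions shifted by one position
    n_articles = 10000
    model = SessionLSTM(n_articles)
    batch = torch.randint(0, n_articles, (32, 10))  # 32 sessions of 10 pageviews each
    logits = model(batch[:, :-1])
    loss = nn.CrossEntropyLoss()(logits.reshape(-1, n_articles), batch[:, 1:].reshape(-1))
    loss.backward()

The output layer computes a softmax over the full article vocabulary, which dominates the training cost for large wikis and is what motivates the softmax approximations mentioned above.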


Evaluation

We evaluate the models on the task of next-article prediction. For a session of length L in the test-set, we pick a random position i_source = 1, ..., L-1 and aim to predict the next article at position i_target = i_source + 1 (the articles at i_source and i_target are the source and target articles, respectively). We assign a rank to each prediction by comparing the target article with the list of articles predicted from the source article, i.e. if the target article is the 5th most likely recommendation we assign rank = 5. We then calculate the following metrics (a minimal code sketch follows the list):

  • mean reciprocal rank (MRR): the average of the inverse ranks (i.e. 1/MRR corresponds to the harmonic mean of the ranks)
  • recall@k: the fraction of test-cases for which the rank is <= k (i.e. recall@1 is the fraction of times the target-article was the most likely prediction based on the source article).
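
A minimal sketch of how these metrics are computed from the per-test-case ranks (assuming rank 1 means the target article was the top prediction):

    import numpy as np

    def mrr(ranks):
        """Mean reciprocal rank: average of 1/rank over all test cases."""
        return float(np.mean(1.0 / np.asarray(ranks, dtype=float)))

    def recall_at_k(ranks, k):
        """Fraction of test cases where the target article is ranked within the top k."""
        return float(np.mean(np.asarray(ranks) <= k))

    # toy example: ranks of the target article for five test cases
    ranks = [1, 5, 2, 120, 1]
    print(mrr(ranks), recall_at_k(ranks, 1), recall_at_k(ranks, 10))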

Results

We evaluate different models on several wikis using 4 different metrics (MRR, recall@1, recall@10, and recall@100).

Main results:

  • Note that only for 3 smaller wikis (hewiki, hiwiki, huwiki) were we able to finish training the LSTM (top rows of the table below)
    • in each of these cases the LSTM substantially outperforms the two baselines
  • Word2vec sometimes performs worse than (or only slightly better than) the text-based morelike, which does not require any training on reading sessions (hewiki, hiwiki, bnwiki, eswiki, nlwiki, rowiki, ukwiki). This could be due to several reasons:
    • not enough training data (mostly smaller wikis)
  • There is substantial variation across wikis (regardless of the model), e.g. hiwiki vs. ukwiki.
    • It is very likely that this is due to the different composition of mobile vs. desktop traffic: hiwiki has an exceptionally large fraction of readers coming via mobile-web (in some months 80-90%, see wikistats), while for other wikis this fraction is much lower (e.g. ukwiki, where the number of desktop readers is higher than the number of mobile readers, see wikistats).
    • (to be added in more detail below) In fact, we can show that reading sessions are more predictable on mobile than on desktop when separating by access method, i.e. performance on next-article prediction is higher for mobile sessions than for desktop sessions. This is consistent with the purely empirical observation that, given the same number of sessions, desktop sessions contain a higher diversity of pages than mobile sessions (in terms of the number of distinct pages visited in those sessions). This cannot be attributed only to the RelatedArticles feature, which shows 3 related articles for further reading at the bottom of articles in the mobile version, since we observe similar patterns for dewiki, where the feature is not enabled by default.

Next-article prediction for different models and different wikis
Wiki    Model     MRR       Recall@1  Recall@10  Recall@100
hewiki  Morelike  0.113305  0.06      0.217      0.406
hewiki  Word2vec  0.1291    0.07892   0.23134    0.38979
hewiki  LSTM      0.241439  0.158223  0.41163    0.577477
hiwiki  Morelike  0.224332  0.139     0.377      0.488
hiwiki  Word2vec  0.188405  0.113174  0.345736   0.56412
hiwiki  LSTM      0.37979   0.278942  0.57975    0.757443
huwiki  Morelike  0.114129  0.064     0.216      0.379
huwiki  Word2vec  0.127071  0.07669   0.2294     0.40457
huwiki  LSTM      0.24379   0.160018  0.415452   0.617054
arwiki  Morelike  0.125015  0.075     0.221      0.386
arwiki  Word2vec  0.136778  0.08385   0.24293    0.42
bnwiki  Morelike  0.193514  0.117     0.346      0.493
bnwiki  Word2vec  0.132293  0.078647  0.244514   0.435805
dewiki  Morelike  0.089064  0.05      0.173      0.318
dewiki  Word2vec  0.131105  0.08184   0.23067    0.38692
enwiki  Morelike  0.108866  0.054     0.212      0.39
enwiki  Word2vec  0.151636  0.09491   0.26648    0.43196
eswiki  Morelike  0.12912   0.077     0.233      0.402
eswiki  Word2vec  0.131733  0.08121   0.2362     0.40848
jawiki  Morelike  0.106265  0.058     0.206      0.369
jawiki  Word2vec  0.151647  0.09849   0.25689    0.41933
nlwiki  Morelike  0.111285  0.06      0.21       0.393
nlwiki  Word2vec  0.131795  0.08114   0.23303    0.38997
rowiki  Morelike  0.137073  0.078     0.248      0.417
rowiki  Word2vec  0.129519  0.078076  0.235162   0.418069
ruwiki  Morelike  0.110369  0.065     0.211      0.365
ruwiki  Word2vec  0.134226  0.08197   0.24137    0.41118
ukwiki  Morelike  0.092121  0.052     0.174      0.341
ukwiki  Word2vec  0.104915  0.06289   0.19127    0.36027
zhwiki  Morelike  0.095617  0.056     0.182      0.337
zhwiki  Word2vec  0.154163  0.09714   0.27255    0.44906