
Research:Characterizing Wikipedia Reader Behaviour/S3-English Large Scale


Notes on Planned Analysis

Response Distributions (notebook)

  • response histograms for each question
  • Q3 co-occurrences for responses with 2 motivations
  • See how the response distribution for one question changes when conditioned on a specific response to another question (see the sketch after this list)
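
A minimal pandas sketch of these checks, assuming a hypothetical responses table with columns q1, q2, and q3 (with Q3 stored as a delimiter-separated multi-select):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical schema: one row per survey response, with columns 'q1' (information
# depth), 'q2' (familiarity), 'q3' (motivation, possibly multi-select, e.g. "work|bored").
responses = pd.read_csv("survey_responses.csv")

# Response histogram for each question.
for q in ["q1", "q2", "q3"]:
    fig, ax = plt.subplots()
    responses[q].value_counts(normalize=True).plot(kind="bar", ax=ax, title=q)

# Q3 co-occurrence counts for respondents who selected exactly two motivations.
motivations = responses["q3"].str.split("|")
pairs = motivations[motivations.str.len() == 2].apply(lambda m: tuple(sorted(m)))
print(pairs.value_counts())

# Distribution of one question conditioned on a response to another,
# e.g. motivation (q3) given information depth (q1).
print(pd.crosstab(responses["q1"], responses["q3"], normalize="index"))
```
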
Temporal Distributions (notebook)

Plot the answer distribution for each question:

  • by hour of day
  • by weekday vs weekend
  • broken down by device

Response timestamps will be normalized to local time using timezone information derived from the request IP.

It will be especially interesting to see how motivations for reading Wikipedia change over the course of the day and between weekdays and weekends.
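
A sketch of the timezone normalization and the hourly/weekend breakdowns, assuming a hypothetical table with a UTC timestamp and an IANA timezone string already derived from the request IP:

```python
import pandas as pd

# Hypothetical input: one row per response with a UTC timestamp ('utc_ts'),
# a timezone string from the IP ('tz', e.g. "America/New_York"), and q1-q3.
df = pd.read_csv("responses_with_tz.csv", parse_dates=["utc_ts"])

# Shift each timestamp into the reader's local time.
df["local_ts"] = df.apply(
    lambda r: r["utc_ts"].tz_localize("UTC").tz_convert(r["tz"]), axis=1
)
df["hour"] = df["local_ts"].apply(lambda t: t.hour)
df["is_weekend"] = df["local_ts"].apply(lambda t: t.dayofweek >= 5)

# Answer distribution by local hour of day and by weekday vs. weekend, per question;
# adding a 'device' column to the crosstab gives the per-device breakdown.
for q in ["q1", "q2", "q3"]:
    print(pd.crosstab(df["hour"], df[q], normalize="index"))
    print(pd.crosstab(df["is_weekend"], df[q], normalize="index"))
```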

Session Data Preparation and Quality Assurance

Investigate how to:

  • join survey responses with the respondents' other pageview requests
  • sessionize requests (see the sketch after this list)
  • build navigation trees (see Bob and Ashwin's work)
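
As a starting point for the sessionization step, a minimal time-based sketch, assuming requests carry a hypothetical client fingerprint and using an arbitrary one-hour inactivity cutoff:

```python
import pandas as pd

INACTIVITY_CUTOFF = pd.Timedelta("1h")  # assumed cutoff; tune against real traces

def sessionize(pageviews: pd.DataFrame) -> pd.DataFrame:
    """Assign a session id to each pageview.

    Assumes one row per request with a 'client_id' column (hypothetical
    fingerprint, e.g. hashed IP + user agent) and a 'ts' timestamp column.
    A new session starts on a client's first request or after a long gap.
    """
    pageviews = pageviews.sort_values(["client_id", "ts"]).copy()
    gap = pageviews.groupby("client_id")["ts"].diff()
    new_session = gap.isna() | (gap > INACTIVITY_CUTOFF)
    pageviews["session_id"] = new_session.cumsum()
    return pageviews
```
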
Sampling Bias

How do the sessions of people who answered the survey compare to those of people who did not?

Consider session features:

  • session length
  • requests per session
  • # of trees
  • mean # of children per node
  • some measure of whether navigation trees are traversed in a DFS vs. BFS manner

Compare the distributions of these features across respondents and non-respondents.

Logistic regression: predict whether a session comes from a survey respondent (see the sketch after this list). Look at accuracy and the significance of features.

  • hopefully accuracy is poor and no features are significant
  • if not, this will give us insight into which features distinguish respondents from non-respondents
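
A sketch of the in-sample prediction, assuming a hypothetical per-session feature table with a binary 'responded' label; scikit-learn gives the cross-validated accuracy and statsmodels the coefficient significance:

```python
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical per-session feature table; 'responded' marks survey respondents.
features = ["session_length", "requests_per_session", "n_trees",
            "mean_children_per_node", "dfs_vs_bfs_score"]
sessions = pd.read_csv("session_features.csv")
X, y = sessions[features], sessions["responded"]

# Predictive accuracy; ideally close to the majority-class baseline.
acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
print("cross-validated accuracy:", acc)

# Coefficient significance; ideally no feature is significant.
print(sm.Logit(y, sm.add_constant(X)).fit().summary())
```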

Classification Tasks

  • can you predict information depth (Q1 response) from session features?
    • if so, analyze feature weights to get insight into how people read differently when looking up a fact vs. seeking in-depth knowledge
  • can you predict level of familiarity (Q2 response) from session features?
    • if so, analyze feature weights to get insight into how people read differently when familiar vs. unfamiliar with the topic
  • can you predict motivation (Q3 response) from the topics of the articles read? (see the sketch after this list)
    • take the average of the topic vectors over the articles in a session (mean topic)
    • take the element-wise variance of the topic vectors over the articles in a session (topic variability)
    • investigate whether session features help
  • can you predict device from the trace?
    • kind of boring
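
For the motivation task, a sketch of the mean-topic / topic-variability features on placeholder data; the labels and classifier here are assumptions, not the settled pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def session_topic_features(topic_vectors: np.ndarray) -> np.ndarray:
    """Summarize a session's articles (shape n_articles x n_topics) as the
    mean topic vector and the element-wise topic variance, concatenated."""
    return np.concatenate([topic_vectors.mean(axis=0), topic_vectors.var(axis=0)])

# Placeholder data standing in for real sessions: each session is a matrix of
# per-article topic vectors (e.g. from LDA) plus a Q3 motivation label.
rng = np.random.default_rng(0)
sessions = [rng.dirichlet(np.ones(50), size=rng.integers(2, 20)) for _ in range(500)]
labels = rng.choice(["work/school", "intrinsic learning", "bored/random"], size=500)

X = np.vstack([session_topic_features(s) for s in sessions])
clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, X, labels, cv=5).mean())
# On real data, inspect the fitted coefficients to see which topics separate the
# motivations, and append session features to X to test whether they add signal.
```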

Learning Article Representations from Reading Co-Occurrence (notebook)

It is possible to generate article embeddings based on text (e.g. LDA) or on links to other articles (e.g. SVD on the link network matrix). We can also learn representations based on what people actually read together: apply the skip-gram or CBOW word2vec models to our collection of sessions, treating articles as words and sessions as sentences. These embeddings could be used to improve the "read more" recommendations feature, which currently just uses a "more like" search. They might also yield an interesting approach to generating link recommendations: given an article, see which of its nearest neighbors are not yet linked to it. We can also use these vectors to measure how "far" people move around within a session.
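
A minimal gensim sketch of this idea; the sessions here are placeholders and the hyperparameters are illustrative rather than tuned:

```python
from gensim.models import Word2Vec

# Each "sentence" is one reading session, each "word" an article title.
# Placeholder sessions; the real input would be the sessionized pageview traces.
sessions = [
    ["Barack_Obama", "President_of_the_United_States", "White_House"],
    ["Python_(programming_language)", "Guido_van_Rossum", "Monty_Python"],
]

# Skip-gram (sg=1); min_count is 1 only for this toy input and should be raised
# on real data so that rarely-read articles do not get noisy vectors.
model = Word2Vec(sessions, vector_size=100, window=5, min_count=1, sg=1, workers=4)

# Nearest neighbors in "read-together" space: candidates for "read more"
# recommendations or for suggesting links missing from the article.
print(model.wv.most_similar("Barack_Obama", topn=5))

# How "far" a session moves: cosine distance between consecutive articles.
hops = [1 - model.wv.similarity(a, b) for a, b in zip(sessions[0], sessions[0][1:])]
print(sum(hops) / len(hops))
```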

Misc

  • under what motivations do people traverse in a BFS vs. DFS manner? (the random/bored group will be particularly interesting; one possible measure is sketched below)
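
One possible (assumed, not settled) way to quantify DFS- vs. BFS-like behaviour: compare the actual reading order of a navigation tree against its DFS and BFS traversal orders using a rank correlation:

```python
from scipy.stats import kendalltau

def traversal_order(tree, root, mode="dfs"):
    """Visit order for a DFS or BFS traversal of a navigation tree
    given as {parent: [children, ...]}, preserving child order."""
    order, frontier = [], [root]
    while frontier:
        node = frontier.pop(0 if mode == "bfs" else -1)
        order.append(node)
        children = tree.get(node, [])
        frontier.extend(children if mode == "bfs" else reversed(children))
    return order

def dfs_vs_bfs_score(tree, root, actual_order):
    """Positive: the actual reading order looks more DFS-like; negative: more BFS-like."""
    ranks = {node: i for i, node in enumerate(actual_order)}
    ideal = list(range(len(actual_order)))
    dfs_tau, _ = kendalltau([ranks[n] for n in traversal_order(tree, root, "dfs")], ideal)
    bfs_tau, _ = kendalltau([ranks[n] for n in traversal_order(tree, root, "bfs")], ideal)
    return dfs_tau - bfs_tau

# Toy tree: A links to B and E, B links to C and D; pages read in order A,B,C,D,E.
tree = {"A": ["B", "E"], "B": ["C", "D"]}
print(dfs_vs_bfs_score(tree, "A", ["A", "B", "C", "D", "E"]))  # > 0: DFS-like
```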