Research:Characterizing Wikipedia Reader Behaviour/S3-English Large Scale

Notes on Planned Analysis

Response Distributions (notebook)

response histograms for each question
Q3 co-occurrences for responses with 2 motivations
See how response distributions to one question change when conditioned on a specific response in another question
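The three analyses above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's code: the record layout (`Q1` as a single choice, `Q3` as a list of motivations) and the answer labels are placeholder assumptions.

```python
from collections import Counter

# Hypothetical survey rows; field names and answer labels are assumptions.
responses = [
    {"Q1": "fact", "Q3": ["work/school", "intrinsic learning"]},
    {"Q1": "overview", "Q3": ["bored/random"]},
    {"Q1": "fact", "Q3": ["conversation", "media"]},
    {"Q1": "in-depth", "Q3": ["work/school", "intrinsic learning"]},
    {"Q1": "fact", "Q3": ["media"]},
]

def histogram(rows, question):
    """Count how often each answer to a single-choice question occurs."""
    return Counter(row[question] for row in rows)

def q3_cooccurrences(rows):
    """Count unordered motivation pairs among responses with exactly 2 motivations."""
    pairs = Counter()
    for row in rows:
        if len(row["Q3"]) == 2:
            pairs[tuple(sorted(row["Q3"]))] += 1
    return pairs

def matches(answer, value):
    """True if a single-choice answer equals value, or a multi-choice answer contains it."""
    return value in answer if isinstance(answer, list) else answer == value

def conditional_distribution(rows, target, given, value):
    """Relative frequencies of `target` answers among rows where `given` matches `value`."""
    subset = [row for row in rows if matches(row[given], value)]
    counts = histogram(subset, target)
    total = sum(counts.values())
    return {answer: n / total for answer, n in counts.items()} if total else {}

q1_hist = histogram(responses, "Q1")
pair_counts = q3_cooccurrences(responses)
q1_given_media = conditional_distribution(responses, "Q1", "Q3", "media")
```

Sorting each pair before counting makes the co-occurrence table symmetric, so ("media", "conversation") and ("conversation", "media") land in the same cell.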

Temporal Distributions (notebook)

plot answer distribution for each question

by hour of day
by weekday vs weekend
broken down by device

Response timestamps will be normalized to the respondent's local time, using the timezone inferred from the IP address.

How motivations for reading Wikipedia change over the course of the day, or between weekdays and weekends, will be especially interesting.
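A minimal sketch of the normalization and bucketing step follows. It assumes the geolocation lookup has already produced a UTC offset in hours for each response; a fixed offset ignores daylight saving time, so a real pipeline would more likely carry a named timezone. The row layout and answer labels are placeholders.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

def to_local(utc_timestamp, utc_offset_hours):
    """Shift a naive UTC ISO-8601 timestamp to the respondent's local time."""
    utc = datetime.fromisoformat(utc_timestamp).replace(tzinfo=timezone.utc)
    return utc.astimezone(timezone(timedelta(hours=utc_offset_hours)))

def time_buckets(local_dt):
    """Return (hour of day, 'weekday' or 'weekend') for a local datetime."""
    day_type = "weekend" if local_dt.weekday() >= 5 else "weekday"
    return local_dt.hour, day_type

# Hypothetical rows: (UTC timestamp, assumed UTC offset from IP, Q3 answer, device)
rows = [
    ("2016-03-04T23:30:00", -8, "work/school", "desktop"),
    ("2016-03-05T02:15:00", 1, "bored/random", "mobile"),
    ("2016-03-06T20:00:00", 9, "media", "mobile"),
]

# Answer counts keyed by hour of day, weekday/weekend, and device.
counts = Counter()
for ts, offset, answer, device in rows:
    hour, day_type = time_buckets(to_local(ts, offset))
    counts[(hour, day_type, device, answer)] += 1
```

The third row shows why the normalization matters: Sunday evening in UTC is already Monday morning for that respondent, so it falls in the weekday bucket.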

Session Data Preparation and QA

Investigate how to:

join responses and other page view requests
sessionize requests
build navigation trees (see Bob and Ashwin's work)
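One common way to sessionize is to split each client's requests wherever the gap between consecutive requests exceeds an inactivity cutoff. The sketch below assumes a hypothetical `client_id` (for example a hash of IP and user agent) and a one-hour cutoff; neither is specified in these notes.

```python
from collections import defaultdict

INACTIVITY_GAP = 60 * 60  # seconds; an assumed cutoff, not a value from these notes

def sessionize(requests, gap=INACTIVITY_GAP):
    """Group (client_id, unix_ts, title) requests into per-client sessions."""
    by_client = defaultdict(list)
    for client_id, ts, title in requests:
        by_client[client_id].append((ts, title))

    sessions = []
    for client_id, events in by_client.items():
        events.sort()
        current = [events[0]]
        for prev, event in zip(events, events[1:]):
            if event[0] - prev[0] > gap:
                # Long pause: close the running session and start a new one.
                sessions.append((client_id, current))
                current = []
            current.append(event)
        sessions.append((client_id, current))
    return sessions

def join_responses(sessions, responses, gap=INACTIVITY_GAP):
    """Attach each (client_id, unix_ts, answer) response to the session it falls in."""
    joined = []
    for client_id, ts, answer in responses:
        for sess_client, events in sessions:
            start, end = events[0][0], events[-1][0]
            if sess_client == client_id and start <= ts <= end + gap:
                joined.append((answer, events))
                break
    return joined

# Hypothetical page view requests and one survey response.
requests = [("a", 0, "Cat"), ("a", 120, "Lion"), ("a", 10000, "Paris"), ("b", 50, "Dog")]
sessions = sessionize(requests)
joined = join_responses(sessions, [("a", 130, "fact")])
```

Allowing a response to arrive up to one gap after the last page view is a design choice made here so that a survey answered after the final click still joins to its session.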

Sampling Bias

How do sessions from people who answered the survey compare to those who did not?

Consider session features:

session length
requests per session
# of trees
mean # of children per node
some measure of whether navigation trees traversed in DFS vs BFS manner
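As an illustration of the tree-based features, the sketch below builds a navigation forest from referers and derives a few counts from it. The rule used (a request is a child of the most recent request for its referer page, and a root otherwise) is an assumption made for this example, not necessarily the construction used in the work referenced above.

```python
from collections import Counter

def build_forest(session):
    """session: time-ordered list of (title, referer_title or None).

    Returns, for each request, the index of its parent request (None for roots).
    """
    last_seen = {}
    parents = []
    for i, (title, referer) in enumerate(session):
        # Parent is the latest request for the referer page, if it is in the session.
        parents.append(last_seen.get(referer))
        last_seen[title] = i
    return parents

def session_features(session):
    """Requests per session, number of trees, and mean children per internal node."""
    parents = build_forest(session)
    children = Counter(p for p in parents if p is not None)
    mean_children = sum(children.values()) / len(children) if children else 0.0
    return {
        "requests": len(parents),
        "trees": sum(p is None for p in parents),
        "mean_children": mean_children,
    }

# Hypothetical session: two trees rooted at "Cat" and "Paris".
session = [
    ("Cat", None),
    ("Lion", "Cat"),
    ("Tiger", "Cat"),
    ("Paris", None),
    ("France", "Paris"),
]
features = session_features(session)
```

The mean is taken over nodes that have at least one child; averaging over all nodes would be fully determined by the request and tree counts and so would add no information.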

Compare the distributions of selected features between the two groups.

Logistic regression: predict whether a session belongs to a survey respondent. Look at accuracy and the significance of features.

ideally, accuracy is poor and no feature is significant
if not, the significant features will show along which dimensions respondents differ
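The accuracy half of this check can be sketched without any dependencies. The code below fits a logistic regression by plain batch gradient descent on made-up, already standardized features; it is a toy illustration, and the significance tests would in practice come from a statistics package that reports standard errors for the coefficients.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(w, xi):
    """Linear score: bias w[0] plus the weighted features."""
    return w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights (bias first) by batch gradient descent on the log loss."""
    w = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = sigmoid(score(w, xi)) - yi
            grad[0] += err
            for j, xj in enumerate(xi, start=1):
                grad[j] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

def accuracy(w, X, y):
    """Share of rows whose predicted class (threshold 0.5) matches the label."""
    hits = sum(int(sigmoid(score(w, xi)) >= 0.5) == yi for xi, yi in zip(X, y))
    return hits / len(y)

# Made-up standardized session features; y = 1 if the session is in the survey sample.
X = [[-1.0, 0.2], [-0.5, -0.3], [0.5, 0.1], [1.0, -0.2]]
y = [0, 0, 1, 1]

w = fit_logistic(X, y)
model_accuracy = accuracy(w, X, y)
# Majority-class baseline: what a model that ignores the features would achieve.
baseline = max(sum(y), len(y) - sum(y)) / len(y)
```

In this toy data the first feature separates the groups, which is the outcome the analysis hopes not to see. On real data, accuracy close to the majority-class baseline on held-out sessions would suggest that respondents look like non-respondents on these features.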

Classification Tasks

1. can you predict information depth (Q1 response) from session features?
- if so, analyze feature weights to get insight into how people read differently when looking up a fact vs. desiring in-depth knowledge
2. can you predict level of familiarity (Q2 response) from session features?
- if so, analyze feature weights to get insight into how people read differently when familiar vs. unfamiliar
3. can you predict motivation (Q3 response) from topics in the article?
- take the average over the topic vectors of the articles in a session (mean topic)
- take the element-wise variance over the topic vectors of the articles in a session (topic variability)
- investigate if session features help
4. can you predict device from the trace?
- kind of boring
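The two topic features for the motivation task can be computed as below. This sketch assumes each article already has a topic vector (for example from LDA) and uses the population variance; the example vectors are made up.

```python
from statistics import fmean, pvariance

def session_topic_features(topic_vectors):
    """Element-wise mean and variance over the topic vectors of a session's articles."""
    columns = list(zip(*topic_vectors))  # one tuple per topic dimension
    mean_topic = [fmean(col) for col in columns]
    topic_variability = [pvariance(col) for col in columns]
    return mean_topic, topic_variability

# Hypothetical 3-topic vectors for the two articles read in one session.
session_vectors = [
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
]
mean_topic, topic_variability = session_topic_features(session_vectors)
```

A session that stays on one subject yields variances near zero in every dimension, while a session that jumps between subjects yields high variance in the dimensions it moves across.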

Learning Article Representations from Reading Co-Occurrence (notebook)

It's possible to generate article embeddings based on text (LDA) or on links to other articles (SVD on the link network matrix). We can also generate representations based on what people actually read together. We could apply the skip-gram or CBOW word2vec models to our collection of sessions, treating articles as words and sessions as sentences. These embeddings could be used to improve the "read-more" reading recommendations feature, which currently just uses a "more-like" search. They might also yield an interesting approach to generating link recommendations: given an article, see which of its nearest neighbors are not linked from it. We can also use these vectors to measure how "far" people move around in a session.
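The sketch below illustrates the three uses mentioned above without training a model: it generates the (target, context) pairs that a skip-gram model would be trained on, then uses hand-written stand-in vectors for the link recommendation and session distance ideas. The vectors, link sets, window size, and the choice of cosine distance are all assumptions for illustration.

```python
import math

def skipgram_pairs(session, window=2):
    """(target, context) article pairs within `window` positions of each other."""
    pairs = []
    for i, target in enumerate(session):
        lo, hi = max(0, i - window), min(len(session), i + window + 1)
        pairs.extend((target, session[j]) for j in range(lo, hi) if j != i)
    return pairs

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def unlinked_neighbors(article, embeddings, links, k=2):
    """Among the k nearest articles, those that `article` does not link to."""
    ranked = sorted(
        (other for other in embeddings if other != article),
        key=lambda other: cosine(embeddings[article], embeddings[other]),
        reverse=True,
    )
    return [other for other in ranked[:k] if other not in links.get(article, set())]

def session_path_length(session, embeddings):
    """Sum of cosine distances between consecutive articles in a session."""
    return sum(1 - cosine(embeddings[a], embeddings[b]) for a, b in zip(session, session[1:]))

# Stand-in vectors; real ones would come from a model trained on sessions.
embeddings = {
    "Cat": [1.0, 0.0],
    "Lion": [0.9, 0.1],
    "Tiger": [0.8, 0.2],
    "Paris": [0.0, 1.0],
}
links = {"Cat": {"Lion"}}  # hypothetical outgoing links

pairs = skipgram_pairs(["Cat", "Lion", "Paris"], window=1)
candidates = unlinked_neighbors("Cat", embeddings, links)
distance = session_path_length(["Cat", "Paris"], embeddings)
```

Summing distances between consecutive articles measures how much ground a session covers step by step; the distance between the first and last article would instead measure how far the reader ended up from where they started.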

Misc

under which motivations do people navigate in a BFS vs. DFS manner? (the random/bored group will be particularly interesting)
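One possible proxy for this, offered only as a sketch: for each session, take the share of non-root requests that were opened from the immediately preceding request. A chain of clicks (depth-first) scores 1.0, while repeatedly returning to one page to open further links (breadth-first) scores lower. The parent-index representation, the measure itself, and the example sessions are all assumptions.

```python
from collections import defaultdict
from statistics import fmean

def depth_first_share(parents):
    """Fraction of non-root requests whose parent is the immediately preceding request.

    `parents[i]` is the index of the request that request i was opened from,
    or None if request i starts a new tree.
    """
    steps = [(i, p) for i, p in enumerate(parents) if p is not None]
    if not steps:
        return None  # single-page trees only: the measure is undefined
    return sum(p == i - 1 for i, p in steps) / len(steps)

def share_by_motivation(sessions):
    """Mean depth-first share per motivation over (motivation, parents) sessions."""
    grouped = defaultdict(list)
    for motivation, parents in sessions:
        share = depth_first_share(parents)
        if share is not None:
            grouped[motivation].append(share)
    return {motivation: fmean(shares) for motivation, shares in grouped.items()}

# Made-up sessions purely to exercise the code; they imply nothing about real readers.
sessions = [
    ("bored/random", [None, 0, 1, 2]),  # chain: each page opened from the previous one
    ("bored/random", [None, 0, 1]),
    ("work/school", [None, 0, 0, 0]),   # hub: every page opened from the first one
]
result = share_by_motivation(sessions)
```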