Research:Characterizing Wikipedia Reader Behaviour/S3-English Large Scale
Notes on Planned Analysis
Response Distributions (notebook)
- response histograms for each question
- Q3 co-occurrences for responses with 2 motivations
- See how response distributions to one question change when conditioned on a specific response in another question
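As a minimal sketch of the conditioning and co-occurrence steps above: the field names (`q1`, `q2`, `q3`) and the answer labels are illustrative placeholders, not the actual survey schema.

```python
from collections import Counter

# Hypothetical responses; field names and answer labels are placeholders.
responses = [
    {"q1": "fact", "q2": "familiar", "q3": ["work/school"]},
    {"q1": "overview", "q2": "unfamiliar", "q3": ["media", "bored/random"]},
    {"q1": "fact", "q2": "unfamiliar", "q3": ["conversation"]},
    {"q1": "in-depth", "q2": "familiar", "q3": ["intrinsic learning", "work/school"]},
]

def conditional_distribution(rows, target, given, given_value):
    """Share of each `target` answer among rows where `given` equals `given_value`."""
    counts = Counter(r[target] for r in rows if r[given] == given_value)
    total = sum(counts.values())
    return {answer: n / total for answer, n in counts.items()} if total else {}

def q3_cooccurrences(rows):
    """Count unordered motivation pairs for responses with exactly two motivations."""
    pairs = Counter()
    for r in rows:
        if len(r["q3"]) == 2:
            pairs[tuple(sorted(r["q3"]))] += 1
    return pairs

q1_given_unfamiliar = conditional_distribution(responses, "q1", "q2", "unfamiliar")
pair_counts = q3_cooccurrences(responses)
```

Sorting each pair before counting makes the co-occurrence table symmetric, so (media, bored/random) and (bored/random, media) land in the same cell.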
Temporal Distributions (notebook)
Plot the answer distribution for each question:
- by hour of day
- by weekday vs weekend
- broken down by device
Response timestamps will be converted to the respondent's local time using timezone information derived from the IP address.
It will be especially interesting to see how motivations for reading Wikipedia change over the course of the day, or between weekdays and weekends.
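The local-time normalization could look roughly like the sketch below. It assumes that IP geolocation yields a UTC offset in hours for each response; the row layout and the helper names `local_time` and `time_buckets` are hypothetical. A fixed offset ignores daylight saving time, so a real pipeline would more likely map the IP to a named timezone.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

def local_time(utc_ts, utc_offset_hours):
    """Shift a UTC timestamp by the offset assumed to come from IP geolocation."""
    return utc_ts.astimezone(timezone(timedelta(hours=utc_offset_hours)))

def time_buckets(utc_ts, utc_offset_hours):
    """Return (local hour of day, 'weekday' or 'weekend') for a response."""
    local = local_time(utc_ts, utc_offset_hours)
    day_type = "weekend" if local.weekday() >= 5 else "weekday"
    return local.hour, day_type

# Hypothetical rows: (UTC timestamp, UTC offset in hours, Q3 answer).
rows = [
    (datetime(2016, 3, 5, 2, 30, tzinfo=timezone.utc), -8, "bored/random"),
    (datetime(2016, 3, 7, 14, 0, tzinfo=timezone.utc), 1, "work/school"),
]

counts = Counter()
for ts, offset, answer in rows:
    hour, day_type = time_buckets(ts, offset)
    counts[(hour, day_type, answer)] += 1
```

The first row shows why the conversion matters: the request is logged early on a Saturday in UTC, but for the reader it is still Friday evening.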
Session Data Preparation and QA
Investigate how to:
- join responses and other page view requests
- sessionize requests
- build navigation trees (see Bob and Ashwin's work)
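One possible way to sessionize requests and build navigation trees is sketched below. This is not the referenced prior work: the request fields (`client`, `ts`, `page`, `referer`), the 30-minute inactivity cutoff, and the rule of attaching a request to the most recent earlier view of its referer are all assumptions made for illustration.

```python
from collections import defaultdict

GAP_SECONDS = 30 * 60  # assumed inactivity cutoff, not taken from the source

def sessionize(requests, gap=GAP_SECONDS):
    """Split each client's time-ordered requests into sessions at gaps longer than `gap`."""
    by_client = defaultdict(list)
    for req in requests:
        by_client[req["client"]].append(req)
    sessions = []
    for reqs in by_client.values():
        reqs.sort(key=lambda r: r["ts"])
        current = [reqs[0]]
        for prev, nxt in zip(reqs, reqs[1:]):
            if nxt["ts"] - prev["ts"] > gap:
                sessions.append(current)
                current = []
            current.append(nxt)
        sessions.append(current)
    return sessions

def build_trees(session):
    """Return, for each request, the index of its parent request (None for a tree root).

    A request is attached to the most recent earlier request whose page
    matches its referer; otherwise it starts a new tree.
    """
    parents = []
    last_seen = {}  # page title -> index of its most recent request
    for i, req in enumerate(session):
        parents.append(last_seen.get(req["referer"]))
        last_seen[req["page"]] = i
    return parents

# Hypothetical requests with timestamps in seconds.
requests = [
    {"client": "a", "ts": 0, "page": "Cat", "referer": None},
    {"client": "a", "ts": 60, "page": "Lion", "referer": "Cat"},
    {"client": "a", "ts": 120, "page": "Tiger", "referer": "Cat"},
    {"client": "a", "ts": 5000, "page": "Paris", "referer": None},
    {"client": "b", "ts": 30, "page": "Dog", "referer": None},
]
sessions = sessionize(requests)
parents = build_trees(sessions[0])
```

Joining survey responses to page views would then amount to finding the session that contains the request on which the survey was shown.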
Sampling Bias
How do sessions from people who answered the survey compare to those who did not?
Consider session features:
- session length
- requests per session
- # of trees
- mean # of children per node
- some measure of whether navigation trees are traversed in a depth-first (DFS) or breadth-first (BFS) manner
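The features listed above could be computed along these lines. The sketch assumes a session is represented by its request timestamps (in seconds) and a list giving each request's parent index, with `None` marking a tree root. The `dfs_share` measure is only one candidate: it is the share of non-root page views that follow a link from the page viewed immediately before, rather than returning to an earlier page. Averaged over all nodes, the number of children is always (requests - trees) / requests, so the sketch averages over nodes with at least one child instead.

```python
def session_features(timestamps, parents):
    """Compute simple per-session features from timestamps and a parent-index list."""
    n = len(parents)
    trees = sum(1 for p in parents if p is None)

    # Count children per node, then average over internal (non-leaf) nodes.
    children = [0] * n
    for p in parents:
        if p is not None:
            children[p] += 1
    internal = [c for c in children if c > 0]

    # A step is "deepening" when its parent is the request right before it.
    non_root = [(i, p) for i, p in enumerate(parents) if p is not None]
    deepening = sum(1 for i, p in non_root if p == i - 1)

    return {
        "session_length_s": timestamps[-1] - timestamps[0],
        "requests": n,
        "trees": trees,
        "mean_children_per_internal_node": sum(internal) / len(internal) if internal else 0.0,
        "dfs_share": deepening / len(non_root) if non_root else None,
    }

star = session_features([0, 60, 120], [None, 0, 0])   # two links opened from one page
chain = session_features([0, 60, 120], [None, 0, 1])  # each page opened from the previous one
```

A pure chain gives a `dfs_share` of 1.0, while repeatedly returning to the same hub page pushes the value towards 0.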
Compare the distributions of selected features between the two groups.
Logistic regression: predict whether a session belongs to a survey respondent. Look at the accuracy and the significance of the features.
- ideally, accuracy is poor and no feature is significant
- if not, the significant features will show along which dimensions respondents differ from non-respondents
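The accuracy part of this check can be sketched without any external library. The code below fits a plain logistic regression by batch gradient descent on synthetic stand-in data in which both groups are drawn from the same distribution; the learning rate, epoch count, and data are assumptions for illustration. It does not compute feature significance, which would need a statistics package.

```python
import math
import random

def fit_logistic(X, y, lr=0.1, epochs=500):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted probability minus label
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def accuracy(X, y, w, b):
    """Share of rows whose predicted class (decision threshold 0.5) matches the label."""
    preds = [1 if b + sum(wj * xj for wj, xj in zip(w, xi)) > 0 else 0 for xi in X]
    return sum(p == t for p, t in zip(preds, y)) / len(y)

random.seed(0)
# Synthetic data: two standardized session features per session, drawn from
# the same distribution for both groups (that is, no sampling bias).
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(200)]
y = [i % 2 for i in range(200)]  # 1 = answered the survey, 0 = did not
w, b = fit_logistic(X, y)
acc = accuracy(X, y, w, b)  # should stay near chance level for this data
```

In practice the accuracy would be measured on held-out sessions, and the classes would be heavily imbalanced, so a metric such as AUC may be more informative than raw accuracy.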
Classification Tasks
- can you predict information depth (Q1 response) from session features?
  - if so, analyze feature weights to gain insight into how people read differently when looking up a fact vs. seeking in-depth knowledge
- can you predict level of familiarity (Q2 response) from session features?
  - if so, analyze feature weights to gain insight into how people read differently when familiar vs. unfamiliar with the topic
- can you predict motivation (Q3 response) from topics in the article?
  - take the average over the topic vectors of the articles in the session (mean topic)
  - take the element-wise variance over the topic vectors of the articles in the session (topic variability)
  - investigate whether session features help
- can you predict device from the trace?
  - kind of boring
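The mean-topic and topic-variability features for the motivation task could be built as follows. This sketch assumes each article in a session already has a topic vector (for example from LDA) of the same length; the function name and the toy vectors are hypothetical.

```python
from statistics import fmean, pvariance

def session_topic_features(topic_vectors):
    """Concatenate the element-wise mean and variance of a session's article topic vectors."""
    dims = list(zip(*topic_vectors))  # one tuple of values per topic dimension
    mean_topic = [fmean(d) for d in dims]
    topic_variability = [pvariance(d) for d in dims]
    return mean_topic + topic_variability

# Hypothetical session of two articles with three-dimensional topic vectors.
session = [[0.5, 0.5, 0.0], [1.0, 0.0, 0.0]]
features = session_topic_features(session)
```

The population variance is used so that single-article sessions get a variability of zero rather than raising an error.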
Learning Article Representations from Reading Co-Occurrence (notebook)
It's possible to generate article embeddings based on text (LDA) or on links to other articles (SVD on the link network matrix). We can also generate representations based on what people actually read together. We could apply the skip-gram or CBOW word2vec models to our collection of sessions, treating articles as words and sessions as sentences. These embeddings could be used to improve the "read-more" reading recommendations feature, which currently just uses a "more-like" search. They might also yield an interesting approach to generating link recommendations: given an article, see which of its nearest neighbors are not linked to it. We can also use these vectors to measure how "far" people move around in a session.
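Assuming article vectors are already available (training them would need a word2vec implementation, which is not shown here), the last two ideas could be sketched as follows. The toy vectors, the link table, the use of cosine distance, and the reading of "not linked" as "not already linked from the article" are all assumptions.

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def session_path_length(session, vectors):
    """Sum of cosine distances between consecutive articles in a session."""
    return sum(cosine_distance(vectors[a], vectors[b]) for a, b in zip(session, session[1:]))

def link_candidates(article, vectors, links, k=2):
    """Among the k nearest neighbours of `article`, keep those it does not link to yet."""
    others = [t for t in vectors if t != article]
    others.sort(key=lambda t: cosine_distance(vectors[article], vectors[t]))
    return [t for t in others[:k] if t not in links.get(article, set())]

# Hypothetical two-dimensional embeddings and outgoing links.
vectors = {"Cat": [1.0, 0.0], "Lion": [0.8, 0.6], "Tiger": [0.6, 0.8], "Paris": [0.0, 1.0]}
links = {"Cat": {"Lion"}}

candidates = link_candidates("Cat", vectors, links)
distance = session_path_length(["Cat", "Lion", "Paris"], vectors)
```

Here "Tiger" is suggested for "Cat" because it is among the two nearest neighbours but is not yet linked, and the session's path length is the sum of its two hops.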
Misc
- under which motivations do people navigate in a BFS vs. DFS manner? (the random/bored group will be particularly interesting)