Research talk:Characterizing Wikipedia Reader Behaviour/Data

Data for release

Raw survey responses linked to high-level article data, for all languages: one zip file per language, plus one zip file with everything in one place. Draw a 50% sample from each language until we are sure what we want to do with the data challenge. We remove session data (e.g., number of pages viewed, date and time) to protect user privacy. The survey responses include:
- a given reader's survey responses (i.e. their answers to the multiple-choice questions regarding their motivation for visiting Wikipedia)
- the title of the article from where they took the survey
- metadata computed from public data about links associated with the article (indegree, outdegree, pagerank)
- LDA topics computed based on the text of the article (also using all public data)
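A minimal sketch of how the per-language release files could be prepared, assuming each response is a dict with a `language` key. The field names in `SESSION_FIELDS`, the function name `prepare_release`, and the fixed random seed are all hypothetical; the actual column names are not specified here.

```python
import random
from collections import defaultdict

# Hypothetical names for session-level fields; the real column names are not given here.
SESSION_FIELDS = {"pages_viewed", "timestamp"}

def prepare_release(rows, fraction=0.5, seed=0):
    """Drop session fields, then draw a `fraction` sample of responses per language."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_language = defaultdict(list)
    for row in rows:
        # Keep every field except the session-level ones.
        cleaned = {k: v for k, v in row.items() if k not in SESSION_FIELDS}
        by_language[row["language"]].append(cleaned)
    release = {}
    for language, responses in by_language.items():
        k = round(len(responses) * fraction)
        release[language] = rng.sample(responses, k)  # sampling without replacement
    return release
```

Each value in the returned dict would correspond to one per-language file, and concatenating all values would give the combined file.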
[TODO] All pairs for survey response x feature from Table 2 in the paper ("Pairs of usage patterns and survey responses with the largest normalized mean effect across language editions"; 250 pairs): effect, standard deviation, ...
- e.g., the correlation between desktop users and answering that their motivation was work/school-related
All pairs for Figure 3 of the paper ("Visualization of relationships between selected usage patterns and survey answers"; 250 plots)
- i.e. graphs for the correlations above broken down by country-language pair for which there is sufficient data (500 responses)
All pairs for Figure 4 of the paper ("Correlation between the Human Development Index (HDI) and survey responses")
Pairs of (country, language): distribution of the answers and country-specific features (e.g., GDP) that support the analysis in Table 3.
Pairs of (country, language): relationship between socio-economic factors and LDA topics for languages that span many countries (e.g., English, Spanish)
One bar chart for every question (Figure 1 in paper)
Subgroup results (i.e. patterns in Table 2) if the language is one of the top n.
Topic model details for all languages. For each topic, this includes top words associated with the topic, top articles, and a set of random articles.
[TODO] Correlations between survey responses (no significance scores as they can be misinterpreted -- only report correlations and distributions).
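One way the correlations between survey responses could be computed is sketched below, reporting coefficients only and no significance scores. It assumes each response maps a question to a single chosen answer, turns each (question, answer) pair into a 0/1 indicator, and uses the Pearson coefficient between indicators. The function names and the choice to skip pairs from the same question are illustrative assumptions, not the method used in the paper.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists, or None if undefined."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # a constant indicator has no defined correlation
    return cov / sqrt(var_x * var_y)

def response_correlations(responses):
    """responses: list of dicts mapping question -> chosen answer (assumed format)."""
    indicators = {}
    for i, response in enumerate(responses):
        for question, answer in response.items():
            # One 0/1 vector per (question, answer) pair, one entry per respondent.
            indicators.setdefault((question, answer), [0] * len(responses))[i] = 1
    result = {}
    for a, b in combinations(sorted(indicators), 2):
        if a[0] == b[0]:
            continue  # skip answer pairs that belong to the same question
        result[(a, b)] = pearson(indicators[a], indicators[b])
    return result
```

The answer distributions mentioned above can be read off the same indicator vectors by summing each one and dividing by the number of respondents.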

--Isaac (WMF) (talk) 22:07, 16 January 2019 (UTC) (most of this content comes from User:LZia_(WMF))