Research talk:Characterizing Wikipedia Reader Behaviour/Robustness across languages/Work log/2017-11-16

From Meta, a Wikimedia project coordination wiki

Thursday, November 16, 2017

Well, this is not really a work log. But maybe one day posts like this will become closer to one.

Florian did some analysis of sampled sessions from the week the survey was run in. Below are the plots we're seeing along with some descriptions and questions.

Data

As in previous research, we analyzed viewing data separately for each language at an (approximate) user level. For that purpose, we first grouped all log entries in each Wikipedia edition by user (the combination of IP address and user agent serves as a pseudo-identifier). For each language, we then sampled 500,000 users (7,000,000 in total for 14 languages) for our analysis and randomly selected one web-log entry per user as the main entry.
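The grouping and sampling step described above could be sketched as follows. This is only an illustration, not the code used for the analysis: the field names `ip` and `user_agent`, the hashing of the pseudo-identifier, and the helper names `pseudo_user_id` and `sample_users` are all assumptions made for the example.

```python
import hashlib
import random
from collections import defaultdict


def pseudo_user_id(ip, user_agent):
    """Combine IP address and user agent into an approximate user identifier.

    Hashing is an assumption of this sketch; the source only states that the
    combination of the two fields serves as a pseudo-identifier.
    """
    return hashlib.sha256(f"{ip}|{user_agent}".encode("utf-8")).hexdigest()


def sample_users(log_entries, n_users, seed=0):
    """Group log entries by pseudo-user, sample users, pick one main entry each.

    `log_entries` is an iterable of dicts with (hypothetical) keys
    "ip" and "user_agent"; any other keys are carried along untouched.
    """
    by_user = defaultdict(list)
    for entry in log_entries:
        by_user[pseudo_user_id(entry["ip"], entry["user_agent"])].append(entry)

    rng = random.Random(seed)
    user_ids = sorted(by_user)  # sorted for reproducibility with a fixed seed
    chosen = rng.sample(user_ids, min(n_users, len(user_ids)))

    # For every sampled user, keep all entries and one random "main" entry.
    return {
        uid: {"entries": by_user[uid], "main": rng.choice(by_user[uid])}
        for uid in chosen
    }
```

In the setting described above, `n_users` would be 500,000 and the function would be applied once per language edition.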


For those users, we generated a variety of features. In particular, we decomposed a user's entire browsing history into sessions, where we define a session as a contiguous sequence of pageviews with no break longer than one hour. We then computed the following features:
- The number of sessions in the week
- The length of the session of the main entry (in number of requested articles)
- The duration of the session of the main entry (time from first to last request in the session)
- The average time between two requests in the session
- Mobile vs. desktop access (if that varies, it was determined by the main request)
- The referer class of the main request (search engine vs. internal referer vs. no referer, ...)
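As a minimal sketch of the session definition and the numeric features listed above, assuming each user's pageviews are available as a list of `datetime` timestamps (the function and key names here are hypothetical, not taken from the actual analysis code):

```python
from datetime import datetime, timedelta

# A break longer than this starts a new session (one hour, as defined above).
SESSION_GAP = timedelta(hours=1)


def split_sessions(timestamps, gap=SESSION_GAP):
    """Split pageview timestamps into sessions of consecutive requests."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)  # still within the current session
        else:
            sessions.append([ts])    # break longer than `gap`: new session
    return sessions


def session_features(timestamps, main_ts):
    """Compute the numeric features for one user, given the main entry's time."""
    sessions = split_sessions(timestamps)
    main = next(s for s in sessions if main_ts in s)
    n_requests = len(main)
    duration_s = (main[-1] - main[0]).total_seconds()
    return {
        "num_sessions": len(sessions),
        "main_session_length": n_requests,
        "main_session_duration_s": duration_s,
        # Undefined for single-request sessions, so report None there.
        "mean_gap_s": duration_s / (n_requests - 1) if n_requests > 1 else None,
    }
```

How single-request sessions are treated for the average time between requests is not stated above; returning `None` is just one possible choice.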

Code

Let's share a link to the code here.

Results and analysis

For categorical features, we show grouped bar charts; for numerical features, we show quantile plots (https://en.wikipedia.org/wiki/Quantile_function). In these plots, each language edition is represented by one line. The x-value specifies a probability p. The y-value is the feature value v for which the probability that the feature of a random user is less than or equal to v equals p.
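To make the reading of the quantile plots concrete, here is a small sketch of an empirical quantile function. It uses the simplest definition (the smallest observed value v such that at least a fraction p of users have a feature value less than or equal to v); the actual plots may use a different convention, such as interpolation, and the function name is hypothetical.

```python
import math


def empirical_quantile(values, p):
    """Smallest observed v with (share of values <= v) >= p, for 0 < p <= 1."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    ordered = sorted(values)
    k = math.ceil(p * len(ordered))  # number of observations that must be <= v
    return ordered[k - 1]


# One line of a quantile plot is the list of (p, quantile) pairs for one
# language edition, e.g. for session lengths of four hypothetical users:
session_lengths = [3, 1, 2, 4]
curve = [(p, empirical_quantile(session_lengths, p)) for p in (0.25, 0.5, 0.75, 1.0)]
```

Here the point at p = 0.5 has the y-value 2, meaning half of these users had a session of at most 2 pageviews.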

In general, the distributions are quite similar across language editions. Nonetheless, some editions show peculiarities. For example, the Chinese (zh) Wikipedia edition has longer browsing sessions on average but a lower average time spent on each page. By contrast, the Spanish Wikipedia edition shows the opposite: the average time spent per page is the highest among all compared language editions, but sessions are shorter on average.

Regarding referer class, search engines are the most frequent referer in all languages. The prevalence of search engines is lowest for the Bengali and Hindi language editions, followed by Chinese. Instead, internal navigation plays a larger-than-average role in these editions.

Mobile access dominates desktop access in all language editions, and especially so in the Arabic, Bengali, and Hindi language editions. Let's add some description of insights/observations for each plot.

The distribution of sessions across languages and platforms
Session length as a function of quantiles
Average length between sessions as a function of quantiles
Number of sessions as a function of quantiles
Referer class for each language

Questions

What are we missing when looking at these plots? Are there other interesting things we should be looking at?