Research talk:Characterization of articles by mobile readership

Welcome

Brainstorming and data exploration

Ironholds (talk · contribs) placed a dataset on GitHub with the following columns: "article", "views", "rank", "project", "access", "year", "month", "day". We've restricted "project" to en.wikipedia for this phase. There are 89468 rows.

We have two dependent variables: integer "views" and ordinal "rank". We have several independent variables: the "project" + "article" identifier, qualified by "access" method and "year" + "month" + "day" time series data.

All rows are from year 2015, month 11. Per-day frequencies range from 2971 to 2988. The "access" frequencies are 29903 for mobile-web, 29827 for mobile-app, and 29738 for desktop. Grouping by both access method and day gives frequencies ranging from 983 to 1000. These are all close enough to uniform that the imbalance shouldn't distort comparisons.
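As a sanity check on those frequencies, here is a minimal sketch, assuming the GitHub dump is a single CSV (hypothetically named "pageviews.csv") with the columns listed above:

    import pandas as pd

    df = pd.read_csv("pageviews.csv")

    print(df["day"].value_counts())     # per-day row counts, ~2971 to 2988
    print(df["access"].value_counts())  # mobile-web, mobile-app, desktop
    # Joint frequencies by access method and day should span roughly 983-1000.
    print(df.groupby(["access", "day"]).size().describe())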

View counts are more interesting.

  • count: 89468
  • mean: 13946.970448
  • std: 253910.674431
  • min: 134
  • 25%: 345
  • 50%: 5485
  • 75%: 8071
  • max: 14084684

But it turns out that mobile-app rows have roughly an order of magnitude fewer views at every percentile. We'll have to decide whether to sum the two mobile access methods or keep them separate.
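One way to compare the access methods directly, and to sum the two mobile methods if we go that route (continuing the hypothetical frame above):

    # Summary statistics for "views", split by access method.
    print(df.groupby("access")["views"].describe())

    # Collapse mobile-web and mobile-app into a single "mobile" platform,
    # then re-aggregate per article and day.
    df["platform"] = df["access"].replace({"mobile-web": "mobile",
                                           "mobile-app": "mobile"})
    per_platform = (df.groupby(["article", "platform", "year", "month", "day"],
                               as_index=False)["views"].sum())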

TODO:

  1. comparison of summary statistics for mobile and non-mobile pageviews
  2. are there any pages that should be removed? e.g., 'Main_Page', or '-', or anything else?
  3. eyeball the article titles and see if we can spot any interesting patterns
    1. I hope to grab Wikidata instance-of information for better data here
I'd argue for summing mobile-web and mobile-app, yep. I'd advocate for removing both of those examples, particularly because "-" is ambiguous and possibly a bug; it's an article, sure, but it's also the character for "no field value" in HDFS :P.
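A sketch of that filtering, with the exclusion list kept explicit so it's easy to extend as we spot more non-article titles:

    # Drop known non-article rows before any aggregation.
    EXCLUDE = {"Main_Page", "-"}
    df = df[~df["article"].isin(EXCLUDE)]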
One thing I'd recommend is trying to cluster. I have a fear with this data: automated traffic could be driving some of the pages, and it could be automated traffic we're not identifying that's mixed in with the user data. The good news is that one of the easiest heuristics for identifying this is traffic massively biased towards the desktop or mobile platform. The bad news is that we're looking for exactly that kind of bias here ;p. So I'd be interested in clustering the delta between mobile and desktop in the hope that we see essentially three clusters: one with a very high delta, which is probably automata; a second with a moderate delta, which is hopefully the thing we're actually interested in; and a third that represents the content both groups are interested in. If we see that pattern we can test the underlying data to make sure the first group is automata by inspecting the rows server-side, so we can defend eliminating that group pretty solidly. Ironholds (talk) 14:51, 24 January 2016 (UTC)
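A sketch of that clustering idea, assuming scikit-learn and using a log ratio as the mobile/desktop delta (one reasonable choice among several; it is symmetric around zero and tolerates zero counts):

    import numpy as np
    from sklearn.cluster import KMeans

    # Total views per article on each platform (per_platform from above).
    totals = (per_platform.pivot_table(index="article", columns="platform",
                                       values="views", aggfunc="sum")
                          .fillna(0))
    totals["delta"] = np.log1p(totals["mobile"]) - np.log1p(totals["desktop"])

    # Look for the three hypothesized clusters along the delta axis.
    km = KMeans(n_clusters=3, n_init=10, random_state=0)
    totals["cluster"] = km.fit_predict(totals[["delta"]])
    print(totals.groupby("cluster")["delta"].describe())

If the very-high-delta cluster really is automata, its article titles and server-side rows should make that obvious on inspection.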

Clustering

Facets

The pageview counts themselves don't contain enough information to cluster pages, but we can obtain external information from Wikidata and probably elsewhere (a sketch of fetching instance-of data follows the list). Possible facets include:

  • instance-of and subclass-of: is there a difference between the sort of things people read on mobile versus desktop?
  • country, country of origin, and the other country properties: do readers use mobile devices to look at articles grouped by nationality differently?
  • date properties: are mobile devices used more to look up articles describing recent events and news items?
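A hedged sketch of pulling instance-of (P31; subclass-of is P279) claims for a batch of titles through the Wikidata API. wbgetentities accepts up to 50 titles per request, and underscores in titles are normalized to spaces automatically:

    import requests

    WIKIDATA_API = "https://www.wikidata.org/w/api.php"

    def instance_of(titles):
        """Return {entity id: [P31 target ids]} for the given enwiki titles."""
        r = requests.get(WIKIDATA_API, params={
            "action": "wbgetentities", "sites": "enwiki",
            "titles": "|".join(titles), "props": "claims", "format": "json",
        })
        out = {}
        for ent in r.json().get("entities", {}).values():
            if "missing" in ent:   # title has no Wikidata item
                continue
            claims = ent.get("claims", {}).get("P31", [])
            out[ent["id"]] = [c["mainsnak"]["datavalue"]["value"]["id"]
                              for c in claims
                              if "datavalue" in c["mainsnak"]]
        return out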

Methods

  • are we actually using clustering methods (K-means, etc.) or using the word "clustering" to mean faceting?
    I'd probably say we should start with clustering methods, to make sure we're actually looking at genuine pageviews. We could mean faceting for the analysis itself. Ironholds (talk) 23:07, 14 February 2016 (UTC)
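If "clustering" ends up meaning faceting, the analysis itself could be a plain aggregation, e.g. the mobile share of views per instance-of class (assuming a hypothetical "instance_of" column joined in from the Wikidata data above):

    # Fraction of views that came from a mobile access method, per class.
    mobile_share = (df.assign(is_mobile=df["access"].str.startswith("mobile"))
                      .groupby("instance_of")
                      .apply(lambda g: g.loc[g["is_mobile"], "views"].sum()
                                       / g["views"].sum()))
    print(mobile_share.sort_values(ascending=False).head(20))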