- Elucidate the different kinds of mobile traffic and how they appear;
- Talk to Yuri/the Zero team about where their data goes and what it looks like;
- Talk to the Apps team, ditto;
- Talk to Christian/the Ops team about mw.session;
- Check IP rotation/breakdowns;
- Look for stats (internet-wide) on mobile IP rotation;
- Test the hashing approach against desktop requests, and see if it produces the same hinkiness;
- Look for competing session definitions;
- Look at local minimum/Gaussian analysis as a way of checking option 2;
- Inter-time, not offset!
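The "inter-time, not offset" note above can be sketched as follows: measure the gaps between consecutive requests from the same client, rather than each request's offset from the session start. This is only an illustrative sketch; the field names (`client`, timestamp pairs) and the toy data are assumptions, not the actual request logs.

```python
# Sketch: per-client inter-request times ("inter-time"), not offsets
# from the first request of the session.
from itertools import groupby

def inter_times(requests):
    """Given (client, unix_timestamp) pairs, return the gaps between
    consecutive requests for each client."""
    gaps = {}
    # Sorting puts each client's requests together, in time order,
    # which is what groupby() needs to work correctly.
    for client, group in groupby(sorted(requests), key=lambda r: r[0]):
        ts = [t for _, t in group]
        gaps[client] = [b - a for a, b in zip(ts, ts[1:])]
    return gaps

sample = [("a", 0), ("a", 100), ("a", 530), ("b", 50), ("b", 60)]
print(inter_times(sample))  # {'a': [100, 430], 'b': [10]}
```

A histogram of these gaps (rather than of offsets) is what would show the ~430-second inactivity cutoff discussed below.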
The Optimal Blog Post is 7 Minutes
Hi User:Okeyes (WMF) and User:Halfak (WMF), your analysis ("we find that requests tend to near-uniformly cease after 430 seconds of inactivity.(Fig. 6) This is in line with both the results from the ModuleStorage tests in RQ1, and Geiger & Halfaker's work on Wikipedia editors.") seems spot on:
- "The Optimal Blog Post is 7 Minutes" (medium.com); a 7-minute read comes in at around 1,600 words, according to thenextweb.
Nice analysis. So, tell me: how many Wikipedia articles never get read to the end? ;-) What percentage is just too long? Give me a list and I'll get us a bot :-)) (en:Special:LongPages, en:User:Dr_pda/prosesize / en:User:Shubinator/DYKcheck#Prose_length) --Atlasowa (talk) 14:20, 14 April 2014 (UTC)
- Good question! At the moment, we don't know the answer - we're waiting on some changes to the analytics pipeline that should give us more accuracy around pinning down that 430-second-ish time, and seeing what pages actually get read. But this is absolutely on my list of 'questions to answer' - how much time do people spend on articles, and is there any correlation between time spent and article length? Ironholds (talk) 16:26, 14 April 2014 (UTC)
"Grouping" result and sampling
The following conclusion about the low likelihood of "grouping" different clients together is an important result, one that in the years since appears to have been relied upon by research using the IP + user agent combination as a proxy for identifying clients:
- we ran the identifier algorithm over a dataset containing a different, cookie-based unique identifier,
experimentId - the result of Aaron Halfaker and Ori Livneh's work on testing module storage performance. After retrieving the ModuleStorage dataset for 20 January, we see 38,367 distinct IPs, distributed over 36,918 distinct experimentIds, with 94,444 requests. Almost all of the IPs have only one corresponding experimentId, with the highest being 49 associated with a single IP in the 24-hour period selected (Fig. 2).
- [...] restricting the dataset to 'sessions' [and] looking at situations where hashing groups multiple 'Unique Clients' that are a single experimentId - we see a 0.969% effective error rate. The algorithm is, for all intents and purposes, accurate ...
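The check being quoted can be sketched as counting how often a single IP-based fingerprint maps to more than one cookie-based experimentId. The function below is an illustrative reconstruction under that reading, not the original algorithm; the field names and toy data are assumptions.

```python
# Sketch: how often does one IP(-based fingerprint) group together
# multiple distinct cookie-based experimentIds?
from collections import defaultdict

def grouping_rate(records):
    """records: iterable of (ip, experiment_id) pairs.
    Returns the fraction of IPs associated with more than one
    experimentId, i.e. IPs that wrongly merge distinct clients."""
    ids_per_ip = defaultdict(set)
    for ip, eid in records:
        ids_per_ip[ip].add(eid)
    grouped = sum(1 for ids in ids_per_ip.values() if len(ids) > 1)
    return grouped / len(ids_per_ip)

toy = [("1.2.3.4", "e1"), ("1.2.3.4", "e1"),   # one client, two requests
       ("5.6.7.8", "e2"), ("5.6.7.8", "e3")]   # one IP, two clients
print(grouping_rate(toy))  # 0.5
```

On the real dataset, the quoted ~0.969% figure would correspond to this rate computed after the session restriction described above.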
But according to Research:Module_storage_performance#Sample, that ModuleStorage schema appears to have used a very small sampling ratio: it collected 1.49 million pageviews over a timespan of about 3.6 days, which, at around 16 billion views/month, corresponds to less than 0.1% of traffic. So I am wondering how well the "Almost all of the IPs have only one corresponding experimentId" result can be generalized to our overall traffic, especially on larger wikis.
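For the record, the back-of-the-envelope arithmetic behind the "less than 0.1%" figure, using the numbers quoted above (the 30-day month is an assumption):

```python
# Sampling ratio implied by the ModuleStorage collection figures.
pageviews_collected = 1.49e6   # pageviews collected over ~3.6 days
days = 3.6
monthly_views = 16e9           # ~16 billion views/month, assumed 30-day month
views_in_window = monthly_views * days / 30
ratio = pageviews_collected / views_in_window
print(f"{ratio:.4%}")  # ≈ 0.0776%, i.e. under 0.1% as stated
```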