User:Erik Zachte/Wikistats 1
Wikistats 1 is now (Jan 2020) gradually being replaced by Wikistats 2, a complete overhaul done by the WMF Analytics Team. The foremost purpose of this article is to grasp where we stand in the transition to Wikistats 2. Whenever an opinion is included, it is my personal view, and of course will contain my personal biases. Therefore this can't be considered official documentation.
I also intend to collect info about Wikistats 1 here for reference and easy access (even I have difficulty finding some data files I created).
For reference here is the list of open tasks for the team, two columns on Wikistats.
An often-heard complaint about Wikistats 1 was that the raw data weren't available for external processing. Actually, many were (and are) available, but hard to find and often undocumented. I'll provide some pointers here.
Page views per project, per wiki, per month
Jan 2020: Archives (and counts from these archives) on page views are still updated and refreshed daily (hurray!).
- Hourly page views are all packed into one yearly tar file
- 2007-12 till 2016-08: These files (called projectcounts) were produced by Webstatscollector. I have patched the files on at least two occasions, to correct for massive under-counting (up to 40% of messages were not counted for roughly 8 months).
- 2015-07 onward: Similar files (now called projectviews) come from Hadoop and use a new definition for page views (no traffic from bots, all traffic to the mobile site included).
- Csv files on multiple aggregation levels have been packed into one zip file. This is a huge file (Jan 2020: size 122 MB).
In the zip file there is a separate folder for each project (wikipedia, wiktionary, etc.). Each folder contains counts per language since 2008 (some counts started later, e.g. those for the WMF mobile site). Counts are broken down by day, week, month, day of week, and overall total per language. A whitelist of genuine language codes is used to remove cruft.
Note: file projectviews_per_month_all_projects_html.csv is an exception, in that it contains html code snippets, which are reused in the last step of the batch process for the highest level overview report.
Defunct: File projectviews_per_hour_all.csv only contains counts till 2017 (and none even for Wikipedia).
Defunct: Files projectviews_per_month_popular_wikis_normalized_[yyyy][mm].csv were generated for all projects into one file, which was stored in the folder for Wikipedia data. It was only meant to be used in the original Monthly Report Card, and is no longer updated.
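To sketch how such a zip archive of per-project folders could be consumed (the filename, folder name, and column layout below are hypothetical, for illustration only; the real archive's layout may differ):

```python
import csv
import io
import zipfile

# Hypothetical layout, loosely modeled on the description above:
# one folder per project, monthly counts per language code.
SAMPLE = "en,2019-12,1234567\nde,2019-12,234567\n"

# Build an in-memory zip that mimics the archive's per-project folders.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("wikipedia/projectviews_per_month.csv", SAMPLE)

# Read one project's monthly counts back, keyed by (language, month).
counts = {}
with zipfile.ZipFile(buf) as zf:
    with zf.open("wikipedia/projectviews_per_month.csv") as f:
        for lang, month, views in csv.reader(io.TextIOWrapper(f, "utf-8")):
            counts[(lang, month)] = int(views)

print(counts[("en", "2019-12")])  # 1234567
```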
Page views per article, per hour
(see also section Reports below for Wikistats 2 alternative)
There are two sets of public data files about page views per article: one published per hour, one aggregated into monthly chunks but still with hourly granularity (and extrapolations for missing hours). These files go back to December 2007:
- hourly https://dumps.wikimedia.org/other/pagecounts-raw/
- monthly https://dumps.wikimedia.org/other/pagecounts-ez/merged/ These monthly aggregates are still updated, which is great :-) They are used in third-party reports as well.
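As a minimal sketch of reading one line from an hourly pagecounts file, assuming the four space-separated fields these dumps are documented to carry (project code, page title, view count, bytes transferred):

```python
# A pagecounts-raw line has four space-separated fields:
#   project page_title view_count bytes_transferred
# (format as documented for the dumps at dumps.wikimedia.org/other/pagecounts-raw/)
def parse_pagecounts_line(line):
    project, title, count, size = line.rstrip("\n").split(" ")
    return project, title, int(count), int(size)

project, title, views, size = parse_pagecounts_line("en Main_Page 42 123456")
print(project, title, views)  # en Main_Page 42
```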
I consider this the most important data stream from Wikistats 1. It is not about Wikimedia projects per se; it is about what the world at large sought to learn, in our age. I see it as complementary to the Twitter Archive at the Library of Congress. If only we had such a treasure trove for data archaeologists from e.g. WW II, it would be used by many scholars. Its importance will grow in the coming decades, as the data age and ripen. BTW it was a community project (shout-out to Mathias Schindler) that I took over, as it was better to keep it going on Wikimedia servers.
There is some redundancy in those data files, as hourly and monthly files are both publicly available (albeit on the same server, which therefore forms a single point of failure). Who wants to download and archive 720 hourly dumps when there is an aggregate version with no granularity lost, at less than one percent of the cumulative size of those 720?!
So the data gathering and publication are OK, and quite robust. But as for long-term preservation, I'm not so sure, with a single copy on dumps.wikimedia.org. That's why I started to back up to HDFS, with its much better redundancy and fail-over. That HDFS backup part of my script is broken now. Dan replied on Phabricator that he wants to take care of this. Thanks much, Dan :-)
These data files go back to 2008, when page view counts became available. Please be aware that the page view definition changed in 2015, when, among other changes, requests by bots were dropped from the data and mobile traffic became fully counted.
Data collected from database dumps
These data were collected up till Jan 2019. The data archive is here.
Popular reports which have not (or only partially) been migrated
Screenshots with a yellow background come from Wikistats 1 reports. Wikistats followed the early-days principle that meta pages on Wikipedia were shown in yellow. Those yellow pages soon disappeared from Wikipedia, but the color stayed here, as an easy way to tell Wikistats 1 reports apart from other reports. So it's not primarily my bad aesthetics that should be blamed ;-)
Page views per project, per wiki, per month
Wikistats 1 reported on page views for nearly all Wikimedia wikis in many variations. These reports are still generated daily, but no longer reach an audience, as they are no longer copied to the public server (an access-rights issue).
Older versions of the reports (March 2019) are online: example, list of all reports.
Levels of reporting in these reports:
- Separate reports for mobile and non-mobile traffic, and also one for the overall total.
- Separate reports for raw counts and normalized counts
- A normalized version of the monthly counts was introduced to make it easier to compare months. Time and again people got confused when we had a 'drop in page views' in February. I used to say "Remember, February usually has almost 10% fewer days than January" (usual reply: "I got it, of course"). These normalized data were also used in the Monthly Report Card. They were less suited for external communication, but imho much better for internal discussion. Hence both versions of the reports coexisted peacefully.
- Counts per project, per wiki
- Monthly counts (with rounded numbers for brevity)
- Monthly secondary metrics
- month over month (MoM) growth for that wiki (the color of the cell also conveyed rate of decline/growth)
- percentage of overall views that went to this wiki
- ranking for this wiki in this month
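The normalization mentioned above can be sketched as rescaling each month to a standard month length (the 30-day standard here is an assumption; Wikistats may have used a slightly different convention):

```python
import calendar

def normalized(views, year, month, standard_days=30):
    """Rescale a monthly count to a standard 30-day month,
    so that e.g. February and January become comparable."""
    days = calendar.monthrange(year, month)[1]
    return views * standard_days / days

# February's shorter length no longer reads as a 'drop in page views':
jan = normalized(3_100_000, 2020, 1)  # 31 days -> scaled down
feb = normalized(2_900_000, 2020, 2)  # 29 days (leap year) -> scaled up
print(round(jan), round(feb))  # 3000000 3000000
```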
Page views per geography
Wikistats 1 reports used the same data feed as Wikistats 2, so this is about presentation. Only one side of the issue has been migrated to Wikistats 2 (different breakdowns got equal support votes in the survey):
Page views per wiki per country
Wikistats 2 has a panel with a map and table to break down page views to a certain project/wiki (where do those page views come from?). Note: it uses absolute counts rather than relative counts, so large countries are always over-represented. See also xkcd on this.
Wikistats 1 also has an overview page, which provides a breakdown by continent and shows what percentage of people are reached.
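The absolute-vs-relative point can be illustrated with a toy example (all numbers below are made up, except the populations, which are rough approximations):

```python
# Hypothetical illustration of absolute vs relative counts:
# absolute views favor populous countries; per-capita views correct for that.
views = {"US": 3_000_000, "IS": 40_000}          # monthly page views (made up)
population = {"US": 330_000_000, "IS": 360_000}  # inhabitants (approximate)

per_capita = {c: views[c] / population[c] for c in views}
top_absolute = max(views, key=views.get)
top_relative = max(per_capita, key=per_capita.get)
print(top_absolute, top_relative)  # US IS
```

A relative map would thus surface Iceland's far higher per-capita readership, which an absolute map hides entirely.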
Sitemap per wiki
This report received the most support votes (43) for migration to Wikistats 2. And large parts have indeed been migrated to the per-project/wiki panels of Wikistats 2. Those panels look attractive, offer flexibility (filter criteria), let people zoom in and out (date range), support annotations, and make raw data easily available. All in all, major progress.
- Overall trend data in the new panels are bad (see also last section on this page)
- I'd welcome copying the project/wiki names inside the chart, so that partial screen copies are self-explanatory. Now people have to add these themselves or include the full menu at the left in the screen copy.
- Not all useful metrics have been migrated, e.g. speakers per language, admins, active bots, %bot edits.
50 recently active wikipedians, excl. bots
In Wikistats 2 there is this list of top editors by overall activity, but that report is rather bare-bones, I would say.
Any report which allows the user to compare languages at-a-glance
Wikistats 1 has many tables where all wikis in a project can be compared, month by month, on one specific aspect. These reports are oversized: up to 200 columns, up to 240 rows, too many cells overall. They surely can benefit from redesign, but are evidently appreciated for what they present.
This content is great also for outsiders, like journalists.
Mailing list stats
This report is no longer updated on the WMF site (again, merely an rsync issue to fix).
End of 2019 WMF staff asked me about an update, so I switched these back to my own server, where they were originally published.
But eventually these reports may become obsolete, when mailing lists are migrated/reorganized (under consideration, I hear).
Shoutout for Wikistats 2 query tool on page views per article
WMF hosts a Wikistats 2 query tool which produces charts about page views for one or several articles. This is based on the same Wikistats 2 data as the files for offline usage mentioned in section Data. It is a widely applauded product (and IMO rightly so), made by the Analytics Team and others. I'm mentioning it here as it complements the files described in section Data. Wikistats 1 never had such a tool.
Please be aware that the page view definition changed in 2015, when, among other changes, requests by bots were dropped from the data and mobile traffic became fully counted. This tool only reports on counts from 2015 and later.
Two surveys were held in 2015/2016 to provide a summary of what Wikistats 1 entails, and to ask for feedback on what our editor community cared about:
- Reports on traffic (many on page views, a few on page edits) (2015)
- Reports based on data which had been collected from the database dumps (on edit counts, editor counts, article counts, bot and revert trends, etc, etc) (2016)
I use the term visualisation (aka viz.) mostly for data renderings that go beyond simple bar or line charts.
The main visualizations were these:
Wikipedia Views Visualized
Monthly pageviews per country/region, or per wikipedia language. See visualization, documentation.
Status: last monthly update September 2018. The data collection scripts are rather complex, require quite a bit of metadata, and use a complex data structure.
A subset of these data (page views per wiki per country) is available in Wikistats 2.
Wikipedias active editors per million speakers
Status: data are from August 2018. Refreshing this with input from Wikistats 2 should be doable, and metadata from Wikistats 1 could be reused.
Wikipedia edits on a random day
- Wikistats 1 has been written mostly in Perl.
- Older bar charts were HTML only, generated with Perl. Newer line charts were rendered with R.
Musings over debatable choices, which were made for Wikistats 2
I don't want to nitpick here. But a few choices for Wikistats 2 should have been made differently.
Absolute vs Relative
Compare these maps which deal with the same metric in Wikistats 1 and Wikistats 2. See also xkcd on this.
Trend metrics are wrong
The footer says "129.90% over this time range".
It also says "They include data from January 2008 to July 2016". The first bar is actually for December 2007, and the last one for August 2016. A minor issue, one could think; nitpicking after all. Moreover, I can see how these extreme months could be suspicious: they might be incomplete months (better not to show them at all, then). But the 129.90% I quoted is actually based on that first and last month, and nothing in between!
The last month is only 13% of the month before it, so that clearly is an incomplete month indeed. Yet it's the only month used for the overall growth over 9.5 years! Extreme months here are incomparable anyway, as they suffer from seasonality. In my view a sensible way to compress trend data into one metric would be to divide the overall total for the last 12 full months, Aug 2015-Jul 2016 (1,240,308,130), by that for Jan-Dec 2008 (109,278,819), which leads to a ratio of 11.350, or 1135.0%, i.e. +1035.0% (not +129.9%).
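That calculation, spelled out with the totals quoted above:

```python
# Compare totals over two full 12-month windows instead of a single
# (possibly incomplete) first and last month.
total_aug2015_jul2016 = 1_240_308_130  # last 12 full months
total_jan_dec_2008 = 109_278_819       # first 12 full months

ratio = total_aug2015_jul2016 / total_jan_dec_2008
growth_pct = (ratio - 1) * 100
print(round(ratio, 3), round(growth_pct, 1))  # 11.35 1035.0
```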
This issue of building trend figures from extreme months applies to all panels in Wikistats 2. The issue of accepting an incomplete month as one of those extremes may not always apply.
I reported this issue in Dec 2018 in https://phabricator.wikimedia.org/T212032. It was acknowledged within days and marked high priority: "For now we'll be removing the time period trend since as you said it doesn't add value in its current form." That's still a euphemism; it's totally nonsensical in its current form. I'd expect a task about at least removing nonsense to be followed up in days, rather than years. Trust comes on foot and leaves on horseback. And again, at least part of the issue (basing trends on singular months) is endemic to Wikistats 2, in all panels.
Editor counts are confusing
I would rate editor counts as one of the most important metrics for Wikimedia. It was among the 5 topmost strategic goals long ago. Trends were discussed almost monthly in the Metrics Meetings. So we'd better report on active editors consistently.
I have two issues here:
Wikistats 2 has inconsistent editor counts
In some reports Wikistats 2 even includes editors with merely one edit (!). This is about semantics: I consistently repeated over many years that 'metrics lose their meaning in fringe cases'. Say a person does no writing at all, except for a weekly shopping list. Would we call that person a writer? Say a person doesn't travel at all except for an occasional local bus ride, which involves climbing two steps to enter the bus, and again two steps to leave it. Would we call that person a climber? These examples are ludicrous on purpose.
For one report a user can choose to split by activity level and uncheck '1-4 edits', but it is checked by default. And choosing any other breakdown always includes all fringe cases.
Wikistats 2 has a separate metric, 'Active editors', where this stricter filter is applied. Having these two metrics coexist is still confusing for the general public.
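The two definitions discussed above, side by side: counting everyone with at least one edit versus only those clearing the 5+ threshold that remains after unchecking '1-4 edits' (the per-user edit counts here are made up):

```python
# Edits per user in some month (hypothetical numbers).
edits_per_user = {"alice": 250, "bob": 7, "carol": 5, "dave": 1, "eve": 2}

# Loose definition: anyone with at least 1 edit counts as an editor.
editors_1plus = sum(1 for n in edits_per_user.values() if n >= 1)

# Stricter 'active editor' definition: 5 or more edits in the month.
active_editors = sum(1 for n in edits_per_user.values() if n >= 5)

print(editors_1plus, active_editors)  # 5 3
```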
At Wikimania London I spoke about how we should 'strive to err on the side of modesty'. Even with stricter filters our accomplishments are flabbergasting.
Adding oranges to apples does not result in more apples
There was a recent addition to Wikistats 2: a long-awaited breakdown of editors by geography. I looked forward to this. But I liked it less when I saw that editors and IP addresses were added up into one metric. It's like adding apples and oranges into one count, and still calling it apples. Let me explain:
- Registered editors don't always sign in. Even Jimmy Wales 'confessed' to doing so in a keynote speech years ago. I also 'plead guilty' myself. Note that the quotes signal irony here; it's perfectly all right within our rules.
- Many people edit from schools, libraries, and cybercafes (especially the last is prevalent in the Global South), so here unregistered editors would be undercounted rather than the other way around.
- Some (many?) internet providers do not hand out permanent IP addresses, but cycle them per user per session.
- People who edit from a mobile phone swap their phone more often than their PC (if they can afford to have both). In rich countries that might be every 2 or 3 years. So this would contribute even more double counts (3%-4%?).
So all in all, deriving unique editors from a mix of registered and unregistered (not logged-in) editors is shady at best. There has been discussion about using a cookie to tell apart editors using the same IP address. That won't work the other way around, to merge edits from different IP addresses into one editor.
My take: it's all very shady. I'd recommend presenting the solid and the shady numbers as separate metrics, and warning about the latter: 'may contain double counts'.
Is the Wikistats 2 migration almost complete? Such was said in a recent discussion on Facebook (group Wikipedia Weekly). You decide.
Am I negative about Wikistats 2, the part that has been realized? For many aspects, I ain't at all. For some, I am, but this is mostly in the domain of quality control. It's about vetting the outcomes, preferably by someone who really likes numbers, but who stays away from implementation issues. I'm not that person anymore.
- The percentage missing per squid server could be deduced from gaps in the sequence numbers: with a 1:1000 sample rate, consecutive sampled lines should ideally be 1000 apart per squid, but gaps were often much higher.
- This is still a matter of debate between me and the Analytics Team (Jan 23, 2020)
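A sketch of that deduction (the sequence numbers below are made up): with 1:1000 sampling, consecutive sampled log lines should be exactly 1000 apart, so any larger gap reveals sampled messages lost in transit.

```python
# Estimate the fraction of sampled messages lost for one squid server,
# from the sequence numbers of the sampled log lines that did arrive.
def fraction_lost(seq_numbers, sample_rate=1000):
    # How many sampled messages the sequence-number span implies...
    expected = (seq_numbers[-1] - seq_numbers[0]) // sample_rate
    # ...versus how many intervals we actually observed.
    received = len(seq_numbers) - 1
    return 1 - received / expected

# Gaps of 2000 and 3000 hide 1 and 2 lost messages respectively.
seqs = [1000, 2000, 4000, 5000, 8000]
print(fraction_lost(seqs))  # about 3/7 (~0.43) of sampled messages lost
```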