Research:Page view

A page view is a request for the content of a web page. Page views on Wikimedia projects are our most important content consumption metric.

The current definition was drafted in 2014–15. For background information and use cases, see Background. Since summer 2015, this set of pages has been updated regularly with any modifications to the definitions, and acts as their document of record.

The canonical source of page view data for Wikimedia projects is the private pageview hourly dataset stored in the WMF's Hadoop cluster, which goes back to May 2015. Public data derived from this table is available in the following locations:

raw dump files (documentation)
the Wikimedia Analytics API
the Pageviews Analysis web tool (documentation)

(see also "Data sources that use this definition" below)

Definition

For further information, see the detailed breakdown of the generalised filters.

A request from the web request logs is a pageview if it meets the following conditions:

the X-Analytics header does not contain preview=1;
the HTTP response code is 200 OK or 304 Not Modified;
the MIME type is either
- a version of text/html, or
- application/json for mobile-app requests only
either of the following:
- the X-Analytics header contains pageview=1, or
- the URL meets the following tests:
  - it contains a production site (Wikipedias, Wikisources, etc. plus e.g. Meta and Commons);
  - it contains a 'content' directory (mainly /wiki/, but also /zh-hant/ or another language variant directory);
  - it is not an automatically called "Special" page.^[1]

Note that, under this definition, API requests are only counted as pageviews if they come from a mobile app which is using them to fetch page content.
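
The conditions above can be sketched as a single predicate. This is a minimal illustration, not the production Java implementation: the `Request` record, the host and path patterns, the `is_app` flag and the `;`-separated `key=value` parsing of X-Analytics are all assumptions made for the example, and the list of automatically called special pages only contains the three named in the change log below.

```python
import re
from dataclasses import dataclass

# Illustrative values only; the real host and path filters are far more extensive.
PRODUCTION_HOST = re.compile(
    r"(^|\.)(wikipedia|wikisource)\.org$|^(meta|commons)\.wikimedia\.org$"
)
CONTENT_PATH = re.compile(r"^/(wiki|zh-hant)/")
AUTO_SPECIAL = {"Special:BlankPage", "Special:MobileMenu", "Special:HideBanners"}


@dataclass
class Request:
    """Hypothetical, simplified view of one web request log entry."""
    host: str
    path: str
    status: int
    mime_type: str
    x_analytics: str = ""
    is_app: bool = False  # assumed to be determined elsewhere (e.g. from the user agent)


def parse_x_analytics(header: str) -> dict:
    """Parse 'k1=v1;k2=v2' into a dict (the separator is an assumption)."""
    pairs = (item.split("=", 1) for item in header.split(";") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}


def url_looks_like_page(req: Request) -> bool:
    """URL tests: production site, content directory, not an auto-called Special page."""
    if not PRODUCTION_HOST.search(req.host):
        return False
    if not CONTENT_PATH.match(req.path):
        return False
    title = req.path.split("/", 2)[2]
    return title not in AUTO_SPECIAL


def is_pageview(req: Request) -> bool:
    tags = parse_x_analytics(req.x_analytics)
    if tags.get("preview") == "1":
        return False
    if req.status not in (200, 304):
        return False
    is_html = req.mime_type.startswith("text/html")
    is_app_json = req.is_app and req.mime_type.startswith("application/json")
    if not (is_html or is_app_json):
        return False
    return tags.get("pageview") == "1" or url_looks_like_page(req)
```

For example, `is_pageview(Request("en.wikipedia.org", "/wiki/London", 200, "text/html; charset=UTF-8"))` passes every test, while the same request with `x_analytics="preview=1"` is rejected at the first check.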

See also this code which implements most of the definition.

Tagging

For further information, see the detailed breakdown of the tagging process.

After page views are extracted from the request logs, new fields have to be added to provide metadata about the requests.

#	Tag	Description	Filter
1	Spider	A request made by a web spider	`User agent` header is identified as a spider by ua-parser and additional custom regex-based identification
2	App	A request made by a mobile application	`MIME type` header = application/json
3	Zero	A request via a Wikipedia Zero carrier	Request is not to `api.php`; and the `X-Analytics` header contains “zero”; and the request is not tagged as Spider
4	Mobile site	A request to the mobile web site	Request is not to `api.php`; and the URL contains `"m.w"`; and the request is not tagged as Zero or Spider
5	Desktop	A request to the desktop site	Request is not to `api.php`; and the request is not tagged as Zero, Spider or Mobile site
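
The table reads as a cascade in which each tag excludes the ones before it. A minimal sketch of that cascade follows, assuming the rows are evaluated in the order listed. The `SPIDER_UA` pattern is a hypothetical stand-in for ua-parser and the custom regex, and `tag_request` and its parameters are names invented for this example.

```python
import re

# Hypothetical stand-in for ua-parser plus the custom spider regex.
SPIDER_UA = re.compile(r"bot|spider|crawler", re.IGNORECASE)


def tag_request(url, mime_type, user_agent, x_analytics=""):
    """Return the first matching tag from the table, or None if no row applies."""
    if SPIDER_UA.search(user_agent):
        return "spider"
    if mime_type.startswith("application/json"):
        return "app"
    if "api.php" in url:
        # The remaining rows all require that the request is not to api.php.
        return None
    if "zero" in x_analytics:
        return "zero"
    if "m.w" in url:
        return "mobile site"
    return "desktop"
```

Because the checks run top to bottom, the "is not tagged as ..." conditions in the table never need to be tested explicitly: a request only reaches a later row if every earlier row failed to match.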

Resulting format

Field	Example	Type	Description	Extraction method	Use case	Sensitivity
timestamp	"2016-11-26 10:00:00"	ISO 8601 timestamp (YYYY-MM-DD HH:00:00)	UTC timestamp of the request	Presumably a combination of a timestamp-handling UDF and Concat(?)	Everything. Pageviews that aren't over time aren't really useful ;p	Publicly consumable
country	"CA"	(string) two-letter ISO 3166-1 alpha-2 country codes	Represents which nation the IP address geolocates to.	A UDF that implements MaxMind geolocation - combined with something to detect valid XFFs and properly handle those, /and/ correctly identify IPs in the case that the request is coming from an SSL terminator or similar	Allows us to evaluate trends in pageviews data, for a particular project, page or any other combination of variables	Publicly consumable unless combined with `page`
referer	"Google"	string	Either (a) the hostname of the referer or (b) the organisational name of the referer, for particularly prominent referers	Either (a) Grab the referer, simplify it to hostname or (b) grab the referer, and see if it matches a string of regular expressions identifying particularly prominent domains.	Allows us to look at referral trends for particular articles or projects (or overall) and understand how Wikimedia sites' traffic is driven by other places on the internet	Publicly consumable if (b) and not grouped by `page` .^[2]^[3]
project	"frwiki"	(string) project identifier	The language and project that the requests were to, stored as language.project. Must handle the mobile and zero subdomains (i.e., not store en.zero instead of en.wikipedia).	This is how I do it but there's probably something more sensible and hivelike that can be written.	Understanding public consumption of Wikimedia content on a per-language-variant, per-project basis.	Public
page	"London"	string	The URL (or, ideally, page title) the requests came to	URI_path extraction. Ideally we'd have some way of accurately mapping these to Wikimedia page_titles/page_ids, though	Per-page counts for consumption by individual community members/the general public	Public unless combined with country, referer, MCC or device_class.
access_method	"mobile site"	(string) enum	Whether the request was to the "mobile" site or the "desktop" site, or an "app" request	See access method tagging.	Understanding the growth/prominence of the mobile web in relation to desktop traffic, how it varies by country and by project.	Public.
is_spider	0	boolean	Is the request from a known spider or piece of automata? TRUE or FALSE	See Spider identification	Providing spider-filtered counts of traffic, and understanding where spiders come from/what's doing them	Public.
agent_type		String	Categorise the agent making the webrequest as either user or bot	See Spider identification	Our aim is to count only user originated traffic as pageviews	Public.
pageviews	30000	Integer	An integer showing the number of pageviews	how many requests pass through the generalised filters and match the permutation of fields described here?	This is self-explanatory	Public.
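
As an illustration of how filtered requests could be rolled up into rows of this shape, the sketch below normalizes the project, truncates the timestamp to the hour and counts requests per combination of fields. It is not the Hive implementation: the function names, the input tuple layout and the rule of dropping the last host label and the `m`/`zero` subdomains are assumptions made for the example.

```python
from collections import Counter
from datetime import datetime, timezone

# Subdomains that indicate how a site was accessed rather than which project it is.
ACCESS_SUBDOMAINS = {"m", "zero"}


def normalize_project(host: str) -> str:
    """Map e.g. 'en.m.wikipedia.org' or 'EN.ZERO.WIKIPEDIA.ORG' to 'en.wikipedia'."""
    labels = host.lower().split(".")[:-1]  # assumption: drop the top-level domain
    return ".".join(label for label in labels if label not in ACCESS_SUBDOMAINS)


def hour_bucket(ts: datetime) -> str:
    """Truncate a timezone-aware timestamp to the hour, in UTC."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d %H:00:00")


def aggregate(requests):
    """Count pageviews per (timestamp, project, page, access_method).

    `requests` is an iterable of (datetime, host, page, access_method) tuples
    that have already passed the generalised filters.
    """
    counts = Counter()
    for ts, host, page, access_method in requests:
        key = (hour_bucket(ts), normalize_project(host), page, access_method)
        counts[key] += 1
    return counts
```

Lowercasing the host before grouping keeps requests for "EN.wikipedia" and "en.wikipedia" in the same bucket.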

Implementation

The pageview definition is implemented in Java. Please see: [1]

Change log

See also wikitech:Analytics/Data/Pageview hourly#Changes and known problems since 2015-06-16 for changes and fixes to the database table storing the pageview data

2015-02-27: exclude edit attempts from the pageviews filter (implemented in this Gerrit patch, documentation updated in this edit)
2015-03-02: include wikidata.org hits (implemented in this Gerrit patch, documentation updated in this edit)
2015-03-02: include mediawiki.org hits (implemented in this Gerrit patch, documentation updated in this edit)
2015-03-04 include search consistently (implemented in this Gerrit patch, documentation updated in this edit)
2015-05-07 Update hosts filter introducing bug (implemented in this Gerrit patch)
2015-08-13 Correct bug in host filter (implemented in this Gerrit patch)
2015-08-27 Correct bug introduced on 2015-05-07, update filtered hosts and documentation (implemented in this Gerrit patch, documentation updated in this edit)
2015-08-31 Exclude arbitration committee wikis from the pageviews filter (implemented in this Gerrit patch, documentation updated in this edit)
2015-09-17 Update ua-parser to an up-to-date version and enhance spider tagging with a more extensive regular expression. (implemented in this Gerrit patch and this Gerrit patch, documentation updated in this edit)
2015-09-17 Update definition so that if x_analytics header contains tag 'preview', request is not counted as pageview. (implemented in this Gerrit patch, documentation updated in this edit (cross-documentation with NRuiz) this edit)
2015-11-19 enhance spider tagging (implemented in this Gerrit patch, documentation updated in this edit)
2015-12-01 Update definition so that Special:BlankPage, Special:MobileMenu and Special:HideBanners are excluded (2015-11-02: Merged in code but deployed on 2015-12-01).
2016-03-09 Enhance spider tagging with a more extensive regular expression to match the bots that follow the conventions specified in WMF User-Agent Policy (implemented in this Gerrit patch, documentation updated in this edit).
2016-03-23 Add x_analytics header inclusive filtering with tag pageview=1, update ua-parser version to up-to-date user agent parsing definitions, correct a bug in search engine referer tagging (inclusive filtering change implemented in this patch and documented in this edit, upgrade to ua-parser implemented in this patch, and update to search engine referer classification code implemented in this patch).
2016-04-11 Correct handling of the x_analytics header tag pageview=1: it is used only for mobile app pageviews (not implemented in core MediaWiki), and only when preview is not 1 and status_code and mime_type have their expected values. Add Wikipedia/5.0.X to accepted user agents for the iOS app, to cover the user agent change in iOS App release 5.0 (code updated in this patch for x_analytics header tagging, and this patch and this patch for mobile app user-agent update. Documentation updated in this edit for x_analytics header tagging and this edit for mobile app user-agent update).

2016-10-31 Added non-knowledge wikis to pageview refinement. Our system remains whitelist-based, so thus far we have added the outreach wiki and chapter wikis such as de.wikimedia.org. Changes to regex: https://gerrit.wikimedia.org/r/#/c/316845/ An example of wikis we are whitelisting: gerrit:#/c/316838/4/static data/pageview/whitelist/whitelist.tsv

2016-11-12 Corrected definition to account for recent changes to iOS user agent: https://gerrit.wikimedia.org/r/#/c/319374

2016-11-02 Added new self-reported bot to bot list: Phab:T150990

2017-02-09 Corrected code and definition to exclude edit previews (action=submit): phab:T156628. Added DSXS (self-identified bot) to bot regex: phab:T157528.

2017-03-21 Corrected code and definition NOT to filter test[2].wikipedia.org from pageviews: phab:T160484.

Future work

Better definition of mobile apps pageviews

Filtering to "core" pageviews

The new definition definitely includes some hits to things that aren't 'pages', such as search results. A search is an intentional user action and a search page delivers content to the user, so it counts as a pageview, but such hits aren't useful for (e.g.) per-article pageview statistics.

MediaWiki now provides pageID and namespace with each request to an actual 'page', which makes it far easier to get per-page statistics, but the Wikimedia Apps do not. The future work needed to provide per-article data that is more reliable than the current per-title data is:

Get the Apps team to begin providing this information;
Implement code to extract this information and match it to a page title.

Other metrics

How much API traffic do we have?
What are the most-requested URL strings that return a 404?

Data sources that use this definition

Pageview_hourly, Projectview hourly: Documentation of two (private) analytics database tables on Hive, containing core pageview data according to the above definition (from April/May 2015 on)
the Pageviews Analysis web tool (documentation)
Wikistats (per-project data from May 2015 on; updated to the new definition in December 2015)
Wikimedia REST API (per-page data from May 2015 on)
staging.pageviews05 (a private database table on the analytics-store server, also known as Cube v0.5, containing sampled pageview data according to an earlier implementation of the above "new" definition, from 2013 until April 2015)

Differences to earlier implementation of the "new" definition (2013-2015 data)

The page view definition is currently implemented in Java, and used to generate the above-mentioned Hive tables (which in turn form the basis of data provided by Wikistats and the API). Aside from the changes listed above, it also differs somewhat from an earlier implementation in R that forms the basis of the historical data (2013 – April 2015) in the staging.pageviews05 table. For people who need to do historical comparisons with times before April 2015, the following table summarizes some findings about these differences from phab:T108925.

	Hive	pageviews05	estimated maximum difference per day (Hive-pageviews05)
implemented in (language)	Java	R	-
sampled?	no	1:1000	...
Wikis included in this implementation but not the other one	wikimediafoundation.org, mediawiki (?), outreach	...	~ +5000-20000/day (outreach)
spider detection	less aggressive (as of May 2015, improved September 17, 2015 - now probably more aggressive than on pageviews05)	more aggressive	~ +- 2%
daily partitioning	0:00 UTC- 23:59 UTC	shifted by several hours
action=edit	excluded	included	~ - 3416000 / day (?)
action=submit (edit preview)	excluded (since 2017-02-09)	included	~ less than action=edit
search actions	included	excluded	~ +464000 / day
project name normalized to lowercase	yes	no	~ +614000 / day (e.g. "EN.WIKIPEDIA" not counted in pageviews05, but "EN.wikipedia" is)
...	...	...	...

FAQ

Are Special: pages considered pageviews?

The pageview definition tries to count pageviews of content delivered to users, not actions. As such, most "Special:" pages are not considered pageviews, with the notable exception of "Search" pages.

Are edits considered pageviews?

Edits are not considered pageviews, because a) it is assumed that the pageview from which the user clicked the edit button has already been logged, and b) similar to the case of Special: pages above, in these metric definitions we try to separate consumption metrics (e.g. content delivered to users) from production metrics (e.g., action=edit).

Notes

↑ On the other hand, special pages that users purposefully navigate to, like Special:RecentChanges or Special:Version, are included.
↑ https://wiki-search-referrals.wmcloud.org
↑ Research:Wikipedia_clickstream