Research:Page view

From Meta, a Wikimedia project coordination wiki

A page view is a request for the content of a web page. Page views on Wikimedia projects are our most important content consumption metric.

The current definition was drafted in 2014-15. Some background information and use cases can be found here. Since summer 2015, this set of pages has been regularly updated with any modifications to the definitions, and acts as the document of record for the definitions.

The canonical source of page view data for Wikimedia projects is the private pageview hourly dataset stored in the WMF's Hadoop cluster, which goes back to May 2015. Public data derived from this table is available in the following locations:

(see also "Data sources that use this definition" below)

Definition

For further information, see the detailed breakdown of the generalised filters.

A request from the web request logs is a pageview if it meets the following conditions:

  • the X-Analytics header does not contain preview=1;
  • the HTTP response code is 200 OK or 304 Not Modified;
  • the MIME type is either
    • a version of text/html, or
    • application/json for mobile-app requests only
  • either of the following:
    • the X-Analytics header contains pageview=1, or
    • the URL meets the following tests:
      • it contains a production site (Wikipedias, Wikisources, etc. plus e.g. Meta and Commons);
      • it contains a 'content' directory (mainly /wiki/, but also /zh-hant/ or another language variant directory);
      • it is not an automatically called "Special" page.[1]

Note that, under this definition, API requests are only counted as pageviews if they come from a mobile app which is using them to fetch page content.
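The conditions above can be sketched as a single predicate. This is a minimal illustration and not the production Java implementation: the host pattern, the list of content directories, the `is_app` flag and the `key=value;key=value` parsing of X-Analytics are all simplifying assumptions made here, and the three excluded special pages are the ones named in the change log.

```python
import re
from urllib.parse import urlsplit

# Assumption: a heavily abbreviated stand-in for the real list of production sites.
PRODUCTION_HOST = re.compile(
    r"^([a-z0-9\-]+\.)*"
    r"(wikipedia|wikisource|wiktionary|wikibooks|wikimedia|wikidata|mediawiki)\.org$"
)
# Assumption: /wiki/ plus a few example language-variant directories.
CONTENT_DIR = re.compile(r"^/(wiki|zh-hant|zh-hans|sr-ec)/")
# Automatically called special pages named in the change log.
AUTO_SPECIAL = ("Special:BlankPage", "Special:MobileMenu", "Special:HideBanners")


def parse_x_analytics(header):
    """Parse 'k1=v1;k2=v2' into a dict (assumed header layout)."""
    pairs = (item.split("=", 1) for item in header.split(";") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}


def is_pageview(url, status, mime_type, x_analytics="", is_app=False):
    """Return True if the request meets the pageview conditions listed above."""
    tags = parse_x_analytics(x_analytics)
    if tags.get("preview") == "1":
        return False
    if status not in (200, 304):
        return False
    is_html = mime_type.startswith("text/html")
    is_app_json = is_app and mime_type.startswith("application/json")
    if not (is_html or is_app_json):
        return False
    if tags.get("pageview") == "1":
        return True
    parts = urlsplit(url)
    host = parts.hostname or ""
    return bool(
        PRODUCTION_HOST.match(host)
        and CONTENT_DIR.match(parts.path)
        and not any(name in parts.path for name in AUTO_SPECIAL)
    )
```

How a request is recognised as coming from a mobile app is left to the caller here (the hypothetical `is_app` argument); in practice that depends on the user agent.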

Tagging

For further information, see the detailed breakdown of the tagging process.

After page views are extracted from the request logs, new fields have to be added to provide metadata about the requests.

Each entry below gives the tag, the kind of request it represents, and the filter that assigns it:

  1. Spider: a request made by a web spider.
     Filter: the User-Agent header is identified as a spider by ua-parser and additional custom regex-based identification.
  2. App: a request made by a mobile application.
     Filter: the MIME type header is application/json.
  3. Zero: a request via a Wikipedia Zero carrier.
     Filter: the request is not to api.php, the X-Analytics header contains "zero", and the request is not tagged as Spider.
  4. Mobile site: a request to the mobile web site.
     Filter: the request is not to api.php, the URL contains "m.w", and the request is not tagged as Zero or Spider.
  5. Desktop: a request to the desktop site.
     Filter: the request is not to api.php, and the request is not tagged as Zero, Spider or Mobile site.
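As a rough sketch, the tag filters can be read as a chain of checks evaluated in the order they are numbered. The spider regex below is a hypothetical stand-in for ua-parser plus the custom regular expression, and returning only the first matching tag is an assumption of this example rather than something the definition states.

```python
import re

# Hypothetical stand-in for ua-parser and the custom spider regex.
SPIDER_UA = re.compile(r"bot|spider|crawler", re.IGNORECASE)


def tag_request(url, mime_type, user_agent, x_analytics=""):
    """Return the first matching tag, or None if no filter applies."""
    if SPIDER_UA.search(user_agent):
        return "Spider"
    if mime_type.startswith("application/json"):
        return "App"
    # Every remaining tag requires that the request is not to api.php.
    if "api.php" in url:
        return None
    if "zero" in x_analytics:
        return "Zero"
    if "m.w" in url:
        return "Mobile site"
    return "Desktop"
```

For example, `tag_request("https://en.m.wikipedia.org/wiki/London", "text/html", "Mozilla/5.0")` yields "Mobile site", because the URL contains "m.w" and none of the earlier checks match.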

Resulting format

Each field of the resulting dataset is described below:

  • timestamp
    • Example: "2016-11-26 10:00:00"
    • Type: ISO 8601 timestamp (YYYY-MM-DD HH:00:00)
    • Description: UTC timestamp of the request
    • Extraction method: Presumably a combination of a timestamp-handling UDF and Concat(?)
    • Use case: Everything. Pageviews that aren't over time aren't really useful ;p
    • Sensitivity: Publicly consumable
  • country
    • Example: "CA"
    • Type: (string) two-letter ISO 3166-1 alpha-2 country codes
    • Description: Represents which nation the IP address geolocates to.
    • Extraction method: A UDF that implements MaxMind geolocation, combined with something to detect valid XFFs and properly handle those, /and/ correctly identify IPs in the case that the request is coming from an SSL terminator or similar
    • Use case: Allows us to evaluate trends in pageviews data, for a particular project, page or any other combination of variables
    • Sensitivity: Publicly consumable unless combined with page
  • referer
    • Example: "Google"
    • Type: string
    • Description: Either (a) the hostname of the referer or (b) the organisational name of the referer, for particularly prominent referers
    • Extraction method: Either (a) grab the referer, simplify it to hostname or (b) grab the referer, and see if it matches a string of regular expressions identifying particularly prominent domains.
    • Use case: Allows us to look at referral trends for particular articles or projects (or overall) and understand how Wikimedia sites' traffic is driven by other places on the internet
    • Sensitivity: Publicly consumable if (b) and not grouped by page.
  • project
    • Example: "frwiki"
    • Type: (string) project identifier
    • Description: The language and project that the requests were to, stored as language.project.
    • Extraction method: Must handle the mobile and zero subdomains (i.e., not store en.zero instead of en.wikipedia). This is how I do it but there's probably something more sensible and hivelike that can be written.
    • Use case: Understanding public consumption of Wikimedia content on a per-language-variant, per-project basis.
    • Sensitivity: Public
  • page
    • Example: "London"
    • Type: string
    • Description: The URL (or, ideally, page title) the requests came to
    • Extraction method: URI_path extraction. Ideally we'd have some way of accurately mapping these to Wikimedia page_titles/page_ids, though
    • Use case: Per-page counts for consumption by individual community members/the general public
    • Sensitivity: Public unless combined with country, referer, MCC or device_class.
  • access_method
    • Example: "mobile site"
    • Type: (string) enum
    • Description: Whether the request was to the "mobile" site or the "desktop" site, or an "app" request
    • Extraction method: See access method tagging.
    • Use case: Understanding the growth/prominence of the mobile web in relation to desktop traffic, how it varies by country and by project.
    • Sensitivity: Public.
  • is_spider
    • Example: 0
    • Type: boolean
    • Description: Is the request from a known spider or piece of automata? TRUE or FALSE
    • Extraction method: See Spider identification
    • Use case: Providing spider-filtered counts of traffic, and understanding where spiders come from/what's doing them
    • Sensitivity: Public.
  • agent_type
    • Example: (none given)
    • Type: String
    • Description: Categorise the agent making the webrequest as either user or bot
    • Extraction method: See Spider identification
    • Use case: Our aim is to count only user originated traffic as pageviews
    • Sensitivity: Public.
  • pageviews
    • Example: 30000
    • Type: Integer
    • Description: An integer showing the number of pageviews
    • Extraction method: how many requests pass through the generalised filters and match the permutation of fields described here?
    • Use case: This is self-explanatory
    • Sensitivity: Public.
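The pageviews count is, in effect, a group-by over the other fields with the timestamp truncated to the hour. A minimal sketch of that aggregation follows; it assumes each filtered request is a dict with a `dt` datetime plus the fields named in `FIELDS`, which is an in-memory layout invented for this example and not the Hive schema.

```python
from collections import Counter
from datetime import datetime

# Dimension fields used for grouping (assumed subset of the fields described above).
FIELDS = ("country", "referer", "project", "page", "access_method", "is_spider")


def aggregate_hourly(requests):
    """Count requests per (hour, country, referer, project, page, ...) permutation."""
    counts = Counter()
    for request in requests:
        # Truncate the UTC timestamp to the hour, e.g. "2016-11-26 10:00:00".
        hour = request["dt"].strftime("%Y-%m-%d %H:00:00")
        key = (hour,) + tuple(request[field] for field in FIELDS)
        counts[key] += 1
    return counts


example = {
    "dt": datetime(2016, 11, 26, 10, 5, 42),
    "country": "CA",
    "referer": "Google",
    "project": "fr.wikipedia",
    "page": "London",
    "access_method": "desktop",
    "is_spider": False,
}
hourly = aggregate_hourly([example, dict(example, dt=datetime(2016, 11, 26, 10, 59, 1))])
```

Both example requests fall in the same hour and share every other field, so they collapse into one row with a count of 2.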

Implementation

The pageview definition is implemented in Java. Please see: [1]

Change log

See also wikitech:Analytics/Data/Pageview hourly#Changes and known problems since 2015-06-16 for changes and fixes to the database table storing the pageview data.
  • 2015-02-27: exclude edit attempts from the pageviews filter (implemented in this gerrit patch, documentation updated in this edit)
  • 2015-03-02: include wikidata.org hits (implemented in this gerrit patch, documentation updated in this edit)
  • 2015-03-02: include mediawiki.org hits (implemented in this gerrit patch, documentation updated in this edit)
  • 2015-03-04: include search consistently (implemented in this gerrit patch, documentation updated in this edit)
  • 2015-05-07: update hosts filter, introducing a bug (implemented in this gerrit patch)
  • 2015-08-13: correct bug in host filter (implemented in this gerrit patch)
  • 2015-08-27: correct bug introduced on 2015-05-07, update filtered hosts and documentation (implemented in this gerrit patch, documentation updated in this edit)
  • 2015-08-31: exclude arbitration committee wikis from the pageviews filter (implemented in this gerrit patch, documentation updated in this edit)
  • 2015-09-17: update ua-parser to an up-to-date version and enhance spider tagging with a more extensive regular expression (implemented in this gerrit patch and this gerrit patch, documentation updated in this edit)
  • 2015-09-17: update definition so that if the x_analytics header contains the tag 'preview', the request is not counted as a pageview (implemented in this gerrit patch, documentation updated in this edit (cross-documentation with NRuiz) and this edit)
  • 2015-11-19: enhance spider tagging (implemented in this gerrit patch, documentation updated in this edit)
  • 2015-12-01: update definition so that Special:BlankPage, Special:MobileMenu and Special:HideBanners are excluded (merged in code on 2015-11-02 but deployed on 2015-12-01)
  • 2016-03-09: enhance spider tagging with a more extensive regular expression to match the bots that follow the conventions specified in the WMF User-Agent Policy (implemented in this gerrit patch, documentation updated in this edit)
  • 2016-03-23: add x_analytics header inclusive filtering with tag pageview=1, update ua-parser to a version with up-to-date user agent parsing definitions, and correct a bug in search engine referer tagging (inclusive filtering change implemented in this patch and documented in this edit, upgrade to ua-parser implemented in this patch, and update to search engine referer classification code implemented in this patch)
  • 2016-04-11: correct handling of the x_analytics header tag pageview=1: it is only used for mobile app pageviews (not implemented in core MediaWiki), and only when preview is not 1 and status_code and mime_type have the expected values. Add Wikipedia/5.0.X to the accepted user agents for the iOS app, to cover the user agent change in iOS app release 5.0 (code updated in this patch for x_analytics header tagging, and this patch and this patch for the mobile app user-agent update; documentation updated in this edit for x_analytics header tagging and this edit for the mobile app user-agent update)
  • 2016-11-02: added new self-reported bot to bot list: Phab:T150990
  • 2017-02-09: corrected code and definition to exclude edit previews (action=submit): phab:T156628. Added DSXS (self-identified bot) to bot regex: phab:T157528.
  • 2017-03-21: corrected code and definition NOT to filter test[2].wikipedia.org from pageviews: phab:T160484.

Future work

Better definition of mobile apps pageviews

Implementing tagging infrastructure

The tagging infrastructure is not yet implemented. Specifically, we need:

  1. Code to extract project variant and class from a uri_host;
  2. Code to simplify and strip referers;
  3. Code to identify spiders and other automata.

We already have code to identify the access method of the pageview.
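The first two items could look roughly like the sketch below. It is illustrative only: the set of stripped host labels and the table of "prominent" referers are hypothetical examples, and the real code would need a much fuller list of both.

```python
import re
from urllib.parse import urlsplit

# Assumption: host labels that describe the access method, not the project.
STRIPPED_LABELS = {"m", "zero", "www"}

# Hypothetical table mapping an organisational name to a hostname pattern.
PROMINENT_REFERERS = {
    "Google": re.compile(r"(^|\.)google\.[a-z.]+$"),
    "Bing": re.compile(r"(^|\.)bing\.com$"),
}


def project_from_host(uri_host):
    """Reduce e.g. 'en.m.wikipedia.org' or 'EN.zero.wikipedia.org' to 'en.wikipedia'."""
    labels = uri_host.lower().split(".")[:-1]  # lowercase, then drop the top-level domain
    return ".".join(label for label in labels if label not in STRIPPED_LABELS)


def simplify_referer(referer):
    """Return an organisational name for prominent referers, else the bare hostname."""
    host = urlsplit(referer).hostname
    if not host:
        return None  # empty or unparseable referer
    for name, pattern in PROMINENT_REFERERS.items():
        if pattern.search(host):
            return name
    return host
```

Stripping the path and query string from the referer keeps only the hostname, which is what makes the field less sensitive than the raw header.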

Filtering to "core" pageviews

The new definition definitely includes some hits to things that aren't 'pages', such as search results. These are intentionally taken user actions, and a search page provides content to the user, so they count as pageviews; but they aren't useful for (e.g.) per-article pageview statistics.

MediaWiki is now providing pageID and namespace with each request to an actual 'page', which makes it far easier to get per-page statistics, but the Wikimedia Apps are not. The future work needed to provide per-article data that is more reliable than the current per-title data is:

  1. Get the Apps team to begin providing this information;
  2. Implement code to extract this information and match it to a page title.

Other metrics

  • How much API traffic do we have?
  • What are the most-requested missing URL strings, i.e., those with the most requests that return a 404?

Data sources that use this definition

  • Pageview_hourly, Projectview hourly: Documentation of two (private) analytics database tables on Hive, containing core pageview data according to the above definition (from April/May 2015 on)
  • Wikistats (per-project data from May 2015 on; updated to the new definition in December 2015)
  • Wikimedia REST API (per-page data from May 2015 on)
  • staging.pageviews05 (a private database table available via stat1003, containing sampled pageview data according to an earlier implementation of the above definition, from 2013 until April 2015)

Differences to earlier implementation of the "new" definition (2013-2015 data)

The page view definition is currently implemented in Java, and used to generate the above-mentioned Hive tables (which in turn form the basis of data provided by Wikistats and the API). Aside from the changes listed above, it also differs somewhat from an earlier implementation in R that forms the basis of the historical data (2013 - April 2015) in the staging.pageviews05 table (also known as Cube v0.5). For people who need to do historical comparisons with times before April 2015, the following table summarizes some findings about these differences from phab:T108925.

For each aspect, the entries give the behaviour in Hive, the behaviour in pageviews05, and the estimated maximum difference per day (Hive minus pageviews05):

  • implemented in (language)
    • Hive: Java
    • pageviews05: R
    • Difference: -
  • sampled?
    • Hive: no
    • pageviews05: 1:1000
    • Difference: ...
  • Wikis included in this implementation but not the other one
    • Hive: wikimediafoundation.org, mediawiki (?), outreach
    • pageviews05: ...
    • Difference: ~ +5000-20000/day (outreach)
  • spider detection
    • Hive: less aggressive (as of May 2015, improved September 17, 2015; now probably more aggressive than on pageviews05)
    • pageviews05: more aggressive
    • Difference: ~ ±2%
  • daily partitioning
    • Hive: 0:00 UTC - 23:59 UTC
    • pageviews05: shifted by several hours
  • action=edit
    • Hive: excluded
    • pageviews05: included
    • Difference: ~ -3416000 / day (?)
  • action=submit (edit preview)
    • Hive: excluded (since 2017-02-09)
    • pageviews05: included
    • Difference: ~ less than action=edit
  • search actions
    • Hive: included
    • pageviews05: excluded
    • Difference: ~ +464000 / day
  • project name normalized to lowercase
    • Hive: yes
    • pageviews05: no
    • Difference: ~ +614000 / day (e.g. "EN.WIKIPEDIA" not counted in pageviews05, but "EN.wikipedia" is)
  • ...

FAQ

Are Special: pages considered pageviews?

The pageview definition tries to count pageviews of content delivered to users, not actions. As such, most "Special:" pages are not considered pageviews, with the notable exception of "Search" pages.

Are edits considered pageviews?

Edits are not considered pageviews, because a) it is assumed that the pageview from which the user clicked the edit button has already been logged, and b) similar to the case of Special: pages above, in these metric definitions we try to separate consumption metrics (e.g. content delivered to users) from production metrics (e.g. action=edit).

Notes

  1. On the other hand, special pages that users purposefully navigate to, like Special:RecentChanges or Special:Version, are included.

See also