Research:Page view
A page view is a request for the content of a web page. Page views on Wikimedia projects are our most important content consumption metric.
The current definition was drafted in 2014–15. For background information and use cases, see Background. Since summer 2015, this set of pages has been regularly updated with any modifications to the definitions, and acts as the document of record for the definitions.
The canonical source of page view data for Wikimedia projects is the private pageview hourly dataset stored in the WMF's Hadoop cluster, which goes back to May 2015. Public data derived from this table is available in the following locations:
- raw dump files (documentation)
- the Wikimedia Analytics API
- the Pageviews Analysis web tool (documentation)
(see also "Data sources that use this definition" below)
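As an illustration of how the public data can be queried, the following sketch builds a request URL for the per-article route of the Wikimedia Analytics API. The helper name `per_article_url` and its defaults are hypothetical, and the URL layout is an assumption based on the publicly documented REST route, so please check the API documentation before relying on it. No request is actually sent.

```python
from urllib.parse import quote

# Assumed root of the public per-article pageviews route.
API_ROOT = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def per_article_url(project, article, start, end,
                    access="all-access", agent="user", granularity="daily"):
    """Build a per-article pageviews URL (illustrative helper, not an official client).

    start and end are date strings such as "20161101" (YYYYMMDD).
    """
    # Page titles use underscores instead of spaces and must be URL-encoded.
    title = quote(article.replace(" ", "_"), safe="")
    return "/".join([API_ROOT, project, access, agent, title, granularity, start, end])


url = per_article_url("en.wikipedia", "London", "20161101", "20161130")
print(url)
```

The `agent="user"` default mirrors the aim of counting only user-originated traffic; pass a different value to include other agent types.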
Definition
A request from the web request logs is a pageview if it meets the following conditions:
- the X-Analytics header does not contain preview=1;
- the HTTP response code is 200 OK or 304 Not Modified;
- the MIME type is either a version of text/html, or application/json for mobile-app requests only;
- either of the following:
  - the X-Analytics header contains pageview=1, or
  - the URL meets the following tests:
    - it contains a production site (Wikipedias, Wikisources, etc. plus e.g. Meta and Commons);
    - it contains a 'content' directory (mainly /wiki/, but also /zh-hant/ or another language variant directory);
    - it is not an automatically-called "Special" page.[1]
Note that, under this definition, API requests are only counted as pageviews if they come from a mobile app which is using them to fetch page content.
See also this code which implements most of the definition.
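The conditions above can be sketched as a single predicate. This is a minimal illustration, not the production Java code: the `Request` fields, the `key=value;key=value` layout assumed for the X-Analytics header, the `is_mobile_app` flag, and both regular expressions are assumptions made for the example. The list of automatically-called special pages is taken from the change log below.

```python
import re
from dataclasses import dataclass

# Assumption: a tiny stand-in for the real list of production sites.
PRODUCTION_HOST = re.compile(
    r"^([a-z-]+\.)?(m\.)?"
    r"(wikipedia|wikisource|wiktionary|wikimedia|wikidata|mediawiki)\.org$"
)
# Assumption: /wiki/ or a language variant directory such as /zh-hant/.
CONTENT_DIR = re.compile(r"^/(wiki|[a-z]{2,3}-[a-z]+)/")
# Automatically-called special pages named in the change log.
AUTO_SPECIAL = ("Special:BlankPage", "Special:MobileMenu", "Special:HideBanners")


@dataclass
class Request:
    host: str
    path: str
    status: int
    mime_type: str
    x_analytics: str = ""
    is_mobile_app: bool = False  # hypothetical flag, e.g. derived from the user agent


def parse_x_analytics(header):
    """Split 'k1=v1;k2=v2' into a dict (assumed header layout)."""
    pairs = (item.split("=", 1) for item in header.split(";") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}


def is_pageview(req):
    tags = parse_x_analytics(req.x_analytics)
    if tags.get("preview") == "1":
        return False
    if req.status not in (200, 304):
        return False
    is_html = req.mime_type.startswith("text/html")
    is_app_json = req.is_mobile_app and req.mime_type.startswith("application/json")
    if not (is_html or is_app_json):
        return False
    # Either the request is explicitly tagged, or the URL passes all three tests.
    if tags.get("pageview") == "1":
        return True
    directory = CONTENT_DIR.match(req.path)
    if directory is None or PRODUCTION_HOST.match(req.host) is None:
        return False
    title = req.path[directory.end():]
    return not title.startswith(AUTO_SPECIAL)


print(is_pageview(Request("en.wikipedia.org", "/wiki/London", 200, "text/html; charset=UTF-8")))
```

The sketch covers only the conditions listed in this section; later refinements recorded in the change log, such as the exclusion of edit attempts, are not modelled.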
Tagging
After page views are extracted from the request logs, new fields have to be added to provide metadata about the requests.
# | tag | filter |
---|---|---|
1 | Spider: a request made by a web spider | User agent header is identified as a spider by ua-parser and additional custom regex-based identification; |
2 | App: a request made by a mobile application | MIME type header = application/json; |
3 | Zero: a request via a Wikipedia Zero carrier | Request is not to api.php; |
4 | Mobile site: a request to the mobile web site | Request is not to api.php; |
5 | Desktop: a request to the desktop site | Request is not to api.php; |
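A rough sketch of how such tags could be assigned is shown below. It is illustrative only: the spider pattern is a placeholder for the ua-parser plus custom regex combination, and telling Zero, mobile site and desktop requests apart by the `zero` and `m` subdomain labels is an assumption, since the filters in the table above only state that these requests are not to api.php. The function name `tag_request` is hypothetical.

```python
import re

# Assumption: placeholder pattern; production combines ua-parser with a
# much longer custom regular expression.
SPIDER_UA = re.compile(r"bot|spider|crawler", re.IGNORECASE)


def tag_request(host, path, mime_type, user_agent):
    """Return (access_tag, is_spider) for a request already counted as a pageview."""
    is_spider = bool(SPIDER_UA.search(user_agent))
    labels = host.lower().split(".")
    if mime_type.startswith("application/json"):
        access = "app"
    elif path.endswith("/api.php"):
        access = None  # the table gives no tag for other api.php requests
    elif "zero" in labels:
        access = "zero"  # assumed: Zero traffic identified by subdomain
    elif "m" in labels:
        access = "mobile site"  # assumed: mobile site identified by subdomain
    else:
        access = "desktop"
    return access, is_spider


print(tag_request("en.m.wikipedia.org", "/wiki/London", "text/html", "Mozilla/5.0"))
```

Spider status is returned separately from the access tag because the resulting data keeps them in separate fields.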
Resulting format
Field | Example | Type | Description | Extraction method | Use case | Sensitivity |
---|---|---|---|---|---|---|
timestamp | "2016-11-26 10:00:00" | ISO 8601 timestamp (YYYY-MM-DD HH:00:00) | UTC timestamp of the request | Presumably a combination of a timestamp-handling UDF and Concat(?) | Everything. Pageviews that aren't over time aren't really useful ;p | Publicly consumable |
country | "CA" | (string) two-letter ISO 3166-1 alpha-2 country codes | Represents which nation the IP address geolocates to. | A UDF that implements MaxMind geolocation - combined with something to detect valid XFFs and properly handle those, /and/ correctly identify IPs in the case that the request is coming from an SSL terminator or similar | Allows us to evaluate trends in pageviews data, for a particular project, page or any other combination of variables | Publicly consumable unless combined with page |
referer | "Google" | string | Either (a) the hostname of the referer or (b) the organisational name of the referer, for particularly prominent referers | Either (a) Grab the referer, simplify it to hostname or (b) grab the referer, and see if it matches a string of regular expressions identifying particularly prominent domains. | Allows us to look at referral trends for particular articles or projects (or overall) and understand how Wikimedia sites' traffic is driven by other places on the internet | Publicly consumable if (b) and not grouped by page.[2][3] |
project | "frwiki" | (string) project identifier | The language and project that the requests were to, stored as language.project. Must handle the mobile and zero subdomains (i.e., not store en.zero instead of en.wikipedia). | This is how I do it but there's probably something more sensible and hivelike that can be written. | Understanding public consumption of Wikimedia content on a per-language-variant, per-project basis. | Public |
page | "London" | string | The URL (or, ideally, page title) the requests came to | URI_path extraction. Ideally we'd have some way of accurately mapping these to Wikimedia page_titles/page_ids, though | Per-page counts for consumption by individual community members/the general public | Public unless combined with country, referer, MCC or device_class. |
access_method | "mobile site" | (string) enum | Whether the request was to the "mobile" site or the "desktop" site, or an "app" request | See access method tagging. | Understanding the growth/prominence of the mobile web in relation to desktop traffic, how it varies by country and by project. | Public. |
is_spider | 0 | boolean | Is the request from a known spider or piece of automata? TRUE or FALSE | See Spider identification | Providing spider-filtered counts of traffic, and understanding where spiders come from/what's doing them | Public. |
agent_type | | String | Categorise the agent making the webrequest as either user or bot | See Spider identification | Our aim is to count only user originated traffic as pageviews | Public. |
pageviews | 30000 | Integer | An integer showing the number of pageviews | how many requests pass through the generalised filters and match the permutation of fields described here? | This is self-explanatory | Public. |
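The way individual requests collapse into rows of this format can be sketched as follows. This is an assumption-laden illustration rather than the actual Hive job: the input dictionaries, their key names, and the helper names are hypothetical, only a subset of the fields is used, and the project is derived as language.project by dropping the mobile and zero subdomain labels as described in the table.

```python
from collections import Counter
from datetime import datetime


def hour_bucket(ts):
    """Truncate an ISO 8601 UTC timestamp to the 'YYYY-MM-DD HH:00:00' form."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S").strftime("%Y-%m-%d %H:00:00")


def normalise_project(host):
    """Lowercase the host, then drop 'm'/'zero' labels and the top-level domain."""
    labels = [label for label in host.lower().split(".") if label not in ("m", "zero")]
    return ".".join(labels[:-1])


def aggregate(requests):
    """Count requests per (hour, project, page, access_method, agent_type)."""
    counts = Counter()
    for req in requests:
        key = (
            hour_bucket(req["ts"]),
            normalise_project(req["host"]),
            req["page"],
            req["access_method"],
            req["agent_type"],
        )
        counts[key] += 1
    return counts


sample = [
    {"ts": "2016-11-26T10:05:12", "host": "en.m.wikipedia.org", "page": "London",
     "access_method": "mobile site", "agent_type": "user"},
    {"ts": "2016-11-26T10:59:59", "host": "EN.M.wikipedia.org", "page": "London",
     "access_method": "mobile site", "agent_type": "user"},
    {"ts": "2016-11-26T11:00:00", "host": "en.m.wikipedia.org", "page": "London",
     "access_method": "mobile site", "agent_type": "user"},
]
for key, pageviews in aggregate(sample).items():
    print(key, pageviews)
```

Fields such as country and referer are left out of the key here because, per the sensitivity column, they should not be published in combination with page.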
Implementation
The pageview definition is implemented in Java. Please see: [1]
Change log
- See also wikitech:Analytics/Data/Pageview hourly#Changes and known problems since 2015-06-16 for changes and fixes to the database table storing the pageview data
- 2015-02-27: exclude edit attempts from the pageviews filter (implemented in this Gerrit patch, documentation updated in this edit)
- 2015-03-02: include wikidata.org hits (implemented in this Gerrit patch, documentation updated in this edit)
- 2015-03-02: include mediawiki.org hits (implemented in this Gerrit patch, documentation updated in this edit)
- 2015-03-04 include search consistently (implemented in this Gerrit patch, documentation updated in this edit)
- 2015-05-07 Update hosts filter introducing bug (implemented in this Gerrit patch)
- 2015-08-13 Correct bug in host filter (implemented in this Gerrit patch)
- 2015-08-27 Correct bug introduced on 2015-05-07, update filtered hosts and documentation (implemented in this Gerrit patch, documentation updated in this edit)
- 2015-08-31 Exclude arbitration committee wikis from the pageviews filter (implemented in this Gerrit patch, documentation updated in this edit)
- 2015-09-17 Update ua-parser to an up-to-date version and enhance spider tagging with a more extensive regular expression. (implemented in this Gerrit patch and this Gerrit patch, documentation updated in this edit)
- 2015-09-17 Update definition so that if x_analytics header contains tag 'preview', request is not counted as pageview. (implemented in this Gerrit patch, documentation updated in this edit (cross-documentation with NRuiz) this edit)
- 2015-11-19 enhance spider tagging (implemented in this Gerrit patch, documentation updated in this edit)
- 2015-12-01 Update definition so that Special:BlankPage, Special:MobileMenu and Special:HideBanners are excluded (2015-11-02: Merged in code but deployed on 2015-12-01).
- 2016-03-09 Enhance spider tagging with a more extensive regular expression to match the bots that follow the conventions specified in WMF User-Agent Policy (implemented in this Gerrit patch, documentation updated in this edit).
- 2016-03-23 Add x_analytics header inclusive filtering with tag pageview=1, update ua-parser version to up-to-date user agent parsing definitions, correct a bug in search engine referer tagging (inclusive filtering change implemented in this patch and documented in this edit, upgrade to ua-parser implemented in this patch, and update to search engine referer classification code implemented in this patch).
- 2016-04-11 Correct handling of x_analytics header tag pageview=1: only used for mobile app pageviews (not implemented in core MediaWiki), and used only in case preview is not 1, and status_code and mime_type are expected values. Add Wikipedia/5.0.X to accepted user agents for an iOS app, to cover for iOS App release 5.0 user agent change (code updated in this patch for x_analytics header tagging, and this patch and this patch for mobile app user-agent update. Documentation updated in this edit for x_analytics header tagging and this edit for mobile app user-agent update).
- 2016-10-31 Added non-knowledge wikis to pageview refinement. Our system remains whitelist-based, so thus far we have added the outreach wiki and chapter wikis such as de.wikimedia.org. Changes to regex: https://gerrit.wikimedia.org/r/#/c/316845/ An example of wikis we are whitelisting: gerrit:#/c/316838/4/static data/pageview/whitelist/whitelist.tsv
- 2016-11-12 Corrected definition to account for recent changes to iOS user agent: https://gerrit.wikimedia.org/r/#/c/319374
- 2016-11-02 Added new self-reported bot to bot list: Phab:T150990
- 2017-02-09 Corrected code and definition to exclude edit previews (action=submit): phab:T156628. Added DSXS (self-identified bot) to bot regex: phab:T157528.
- 2017-03-21 Corrected code and definition NOT to filter test[2].wikipedia.org from pageviews: phab:T160484.
Future work
Better definition of mobile apps pageviews
Filtering to "core" pageviews
The new definition definitely includes some hits to things that aren't 'pages', such as search results. A search is an intentionally-taken user action and a search page provides content to the user, so it counts as a pageview, but such hits aren't useful for (e.g.) per-article pageview statistics.
MediaWiki is now providing pageID and namespace with each request to an actual 'page', which makes it far easier to get per-page statistics—but the Wikimedia Apps are not. Future work to provide per-article data that is more reliable than the current, per-title data, is:
- Get the Apps team to begin providing this information;
- Implement code to extract this information and match it to a page title.
Other metrics
- How much API traffic do we have?
- What are the most-requested URL strings, i.e., those with the most requests that 404?
Data sources that use this definition
- Pageview_hourly, Projectview hourly: Documentation of two (private) analytics database tables on Hive, containing core pageview data according to the above definition (from April/May 2015 on)
- the Pageviews Analysis web tool (documentation)
- Wikistats (per-project data from May 2015 on; updated to the new definition in December 2015)
- Wikimedia REST API (per-page data from May 2015 on)
- staging.pageviews05 (a private database table on the analytics-store server, also known as Cube v0.5, containing sampled pageview data according to an earlier implementation of the above "new" definition, from 2013 until April 2015)
Differences to earlier implementation of the "new" definition (2013-2015 data)
The page view definition is currently implemented in Java, and used to generate the above-mentioned Hive tables (which in turn form the basis of data provided by Wikistats and the API). Aside from the changes listed above, it also differs somewhat from an earlier implementation in R that forms the basis of the historical data (2013 – April 2015) in the staging.pageviews05 table. For people who need to do historical comparisons with times before April 2015, the following table summarizes some findings about these differences from phab:T108925.
Aspect | Hive | pageviews05 | estimated maximum difference per day (Hive-pageviews05) |
---|---|---|---|
implemented in (language) | Java | R | - |
sampled? | no | 1:1000 | ... |
Wikis included in this implementation but not the other one | wikimediafoundation.org, mediawiki (?), outreach | ... | ~ +5000-20000/day (outreach) |
spider detection | less aggressive (as of May 2015, improved September 17, 2015 - now probably more aggressive than on pageviews05) | more aggressive | ~ +- 2% |
daily partitioning | 0:00 UTC- 23:59 UTC | shifted by several hours | |
action=edit | excluded | included | ~ - 3416000 / day (?) |
action=submit (edit preview) | excluded (since 2017-02-09) | included | ~ less than action=edit |
search actions | included | excluded | ~ +464000 / day |
project name normalized to lowercase | yes | no | ~ +614000 / day (e.g. "EN.WIKIPEDIA" not counted in pageviews05, but "EN.wikipedia" is) |
... | ... | ... | ... |
FAQ
Are Special: pages considered pageviews?
The pageview definition tries to count pageviews of content delivered to users, not actions. As such, most "Special:" pages are not considered pageviews, with the notable exception of "Search" pages.
Are edits considered pageviews?
Edits are not considered pageviews, because a) it is assumed that the pageview from which the user clicked the edit button has already been logged, and b) similar to the case of Special: pages above, in these metric definitions we try to separate consumption metrics (e.g. content delivered to users) from production metrics (e.g., action=edit).
Notes
- [1] On the other hand, special pages that users purposefully navigate to, like Special:RecentChanges or Special:Version, are included.
- [2] https://wiki-search-referrals.wmcloud.org
- [3] Research:Wikipedia_clickstream
See also
- wikitech:Analytics/Data/Redirects — how different types of redirects are handled in the context of pageviews
- Information about the legacy pageview definition that was used until 2015:
- Flow diagram outlining which requests did and didn't get counted as a pageview under the old definition
- wikitech:Analytics/Archive/Webstatscollector — more notes about the old definition and the set of services implementing it
- wikitech:Analytics/Pageviews — comparison with the new (2015-) definition
- Referer data is available for some Wikipedias in the Clickstream dataset, which is visualized at https://wikinav.toolforge.org/