Research:Page view/Tags

This page describes tags added to the data left after generalised filtering of the request logs, intended to turn the output (a set of pageviews) into a more consumer-friendly and privacy-compliant dataset that provides useful metadata to internal and external reusers.

Country

Referer

Project

Page

Access Method

Pageviews definition standardisation

Access method tagging

Status

draft

Filter class

Tagging filter

Pseudocode

If:

The MIME type is "application/json"

Tag as an app request.

Else if:

The URL matches \\.(zero|m)\\.w

Tag as a mobile web request.

Else:

Tag as a desktop request.

An important piece of metadata about the requests is the access method; was it made via the mobile web, the desktop site or the official app? The generalised filtering makes this fairly trivial to determine.

The Apps are available on Android and iOS devices, and go through the API (either mobile or desktop, depending) - for that reason, App requests have historically not been counted, because they don't match the /wiki/ path.

Since we're including API hits (or at least those that match App requests), this issue is solved. At the same time, though, we have to avoid overcounting App pageviews. The Apps send a large number of requests, many of which are not pageviews: search requests, for example. In addition, each request for an "actual" page is in fact two requests: one to load the first section of the page, and one to load all subsequent sections. The solution to both problems is to look for sections=0 in the URL: that parameter only appears for "actual" page requests, and that value only appears for the first of the two requests for each page.

App requests all have "WikipediaApp" in their provided user agent, and use a variety of MIME types. Requests for pages, however, only use one: application/json. This requires further filtering, though, because App pageviews may appear as multiple requests: one for the first section, which includes sections=0 in the URL, and a second request for subsequent sections, with sections=1-N. The first request is the only one that is guaranteed to appear (since some pages do not have multiple sections), and so filtering to remove duplicate requests should remove those that aren't sections=0. This is all handled in the MIME type filtering; what remains is simply to tag the resulting requests (which should be all requests still in the dataset with a MIME type of application/json) as app pageviews.
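The app pageview check described above can be sketched in Python as follows. This is a minimal illustration, not the production filter: the function name `is_app_pageview`, its parameters, and the example URLs are assumptions made for this sketch, and it assumes the MIME type is logged as exactly application/json.

```python
from urllib.parse import parse_qs, urlsplit


def is_app_pageview(url: str, mime_type: str, user_agent: str) -> bool:
    """Return True only for the first-section request of an app page load.

    Hypothetical helper: the name and signature are not from the source.
    """
    # App requests carry "WikipediaApp" in the user agent.
    if "WikipediaApp" not in user_agent:
        return False
    # Page requests from the apps use only this MIME type.
    if mime_type != "application/json":
        return False
    # Only the first of the two page requests has sections=0;
    # the follow-up request (sections=1-N) is treated as a duplicate.
    query = parse_qs(urlsplit(url).query)
    return query.get("sections") == ["0"]


# Illustrative URLs only; the real API parameters may differ.
first = "https://en.wikipedia.org/w/api.php?page=Cat&sections=0"
rest = "https://en.wikipedia.org/w/api.php?page=Cat&sections=1-"
print(is_app_pageview(first, "application/json", "WikipediaApp/2.0"))
print(is_app_pageview(rest, "application/json", "WikipediaApp/2.0"))
```

Parsing the query string rather than searching for the substring "sections=0" avoids accidental matches on values such as sections=01 or on page titles that happen to contain that text.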

Mobile web traffic is also mobile, but consists of organic traffic to Wikimedia's mobile websites which is not generated through an official app. Requests follow the pattern language.m.project.org or language.zero.project.org, and after the application of the generalised filters, any text/html request to a URL with that domain format that has not been tagged already should be a Mobile Web pageview.

There are a lot of language and project combinations; what's the simplest way to accurately identify them all? We took a day of sampled logs (622,149 requests, after the application of the generalised filters) and searched for .m.w in the URL. This produced 215,931 URLs, which we then checked by substring for any that lacked mobile markers. Only 6 of these were incorrect, in all cases because the log file was merging two unrelated URLs in the URL field. Accordingly, .m.w or .zero.w seems an appropriate string to search for. The full pseudocode can be seen in the infobox for this section.

Desktop traffic consists of any traffic not tagged as part of the app or mobile web traffic, and so identifying it is trivial; the check is simply whether it has already been tagged. If it hasn't, the appropriate tag is "desktop".
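The three-way access method tagging can be sketched in Python as below. This is an illustration under stated assumptions: the function name `tag_access_method` and the tag strings "app" and "mobile web" are chosen for this sketch (only "desktop" is named in the text), and the input is assumed to have already passed the generalised filters.

```python
import re

# Pattern from the pseudocode: matches ".m.w" or ".zero.w" in the URL.
MOBILE_WEB = re.compile(r"\.(zero|m)\.w")


def tag_access_method(url: str, mime_type: str) -> str:
    """Tag an already-filtered request with its access method.

    Hypothetical helper: name, signature and tag values are assumptions.
    """
    # After MIME type filtering, any remaining JSON request is an app pageview.
    if mime_type == "application/json":
        return "app"
    # Mobile web hosts look like language.m.project.org or language.zero.project.org.
    if MOBILE_WEB.search(url):
        return "mobile web"
    # Anything not tagged so far is desktop traffic.
    return "desktop"


print(tag_access_method("https://en.m.wikipedia.org/wiki/Cat", "text/html"))
print(tag_access_method("https://en.wikipedia.org/wiki/Cat", "text/html"))
```

The checks run in order and return on the first match, so a request can only ever receive one tag.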

Spider

Pageviews definition standardisation

Spider filter

Status

draft

Filter class

Tagging filter

Pseudocode

If:

ua-parser identifies the user agent's device as "Spider";

Or:

the user agent matches "-"

Or:

the user agent matches


(?i)^(.*(bot|spider|WordPress|AppEngine|AppleDictionaryService|Python-urllib|python-requests|Google-HTTP-Java-Client|[Ff]acebook|[Yy]ahoo|RockPeaks).*|(goo wikipedia|MediaWikiCrawler-Google|wikiwix-bot|Java/|curl|PHP/|Faraday|HTTPC|Ruby|\\.NET|Python|Apache|Scrapy|PycURL|libwww|Zend|wget|nodemw|WinHttpRaw|Twisted|com\\.eusoft|Lagotto|Peggo|Recuweb|check_http|Magnus|MLD|Jakarta|find-link|J\\. River|projectplan9|ADmantX|httpunit|LWP|iNaturalist|WikiDemo|FSResearchIt|livedoor|Microsoft Monitoring|MediaWiki|User:|User_talk:|github|tools.wmflabs.org|Blackboard Safeassign|Damn Small XSS|\S+@\S+\.[a-zA-Z]{2,3}).*)$

Or:

ua-parser identifies the user agent's device as "Automata";

Tag as TRUE.

Else:

Tag as FALSE.

Identifying spiders is crucial for distinguishing organic, human traffic from automated and mechanized traffic. Grouping both without any distinction does not give us an accurate understanding of Wikipedia's usage around the world.

For general user agent parsing, the Analytics team at the Wikimedia Foundation settled on ua-parser, an open-source and widely contributed-to user agent parser that breaks browsers, devices, and versions out of user agent strings. Amongst other things, ua-parser also handles spider identification. If we're standardizing, it seems sensible to standardize spider identification as well as browser identification. Accordingly, the definition of "spider" matches ua-parser's spider identification: a request is from a spider if ua-parser's outputted device is "Spider".

ua-parser does not, however, identify some wiki-specific crawlers and some generic crawling applications. We have to identify them ourselves. First, we tag as a spider any request with an unset user agent (logged as "-"). Second, we match the user agent against the following regular expression (see here for currently used version):

(?i)^(.*(bot|spider|WordPress|AppEngine|AppleDictionaryService|Python-urllib|python-requests|Google-HTTP-Java-Client|[Ff]acebook|[Yy]ahoo|RockPeaks|PhantomJS|http).*|(goo wikipedia|MediaWikiCrawler-Google|wikiwix-bot|Java/|curl|PHP/|Faraday|HTTPC|Ruby|\\.NET|Python|Apache|Scrapy|PycURL|libwww|Zend|wget|nodemw|WinHttpRaw|Twisted|com\\.eusoft|Lagotto|Peggo|Recuweb|check_http|Magnus|MLD|Jakarta|find-link|J\\. River|projectplan9|ADmantX|httpunit|LWP|iNaturalist|WikiDemo|FSResearchIt|livedoor|Microsoft Monitoring|MediaWiki|User:|User_talk:|github|tools.wmflabs.org|Blackboard Safeassign|Damn Small XSS|MeetingRoomApp|\S+@\S+\.[a-zA-Z]{2,3}).*)$


On top of that, we also use a decision tree to improve our bot detection. See the whole process here for more details.
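The rule-based part of the spider check (ua-parser device, unset user agent, and regular expression) can be sketched in Python as below. This is a sketch under several assumptions: the function name `is_spider` is hypothetical, the device family is passed in as a plain string rather than obtained by calling ua-parser, the pattern is an abridged subset of the full expression above with Python-style escaping, and the decision tree step is not covered.

```python
import re

# Abridged for illustration: only a few alternatives from the full
# expression are kept. The first group matches anywhere in the user agent,
# the second group only at its start.
SPIDER_UA = re.compile(
    r"^(.*(bot|spider|WordPress|Python-urllib|python-requests|facebook|yahoo).*"
    r"|(curl|Java/|PHP/|Ruby|Python|wget|libwww|MediaWiki|User:"
    r"|\S+@\S+\.[a-zA-Z]{2,3}).*)$",
    re.IGNORECASE,
)


def is_spider(user_agent: str, device_family: str) -> bool:
    """Return True if the request should be tagged as a spider.

    device_family is assumed to be the device reported by ua-parser.
    """
    # ua-parser already recognises many crawlers.
    if device_family in ("Spider", "Automata"):
        return True
    # An unset user agent is logged as "-".
    if user_agent == "-":
        return True
    # Fall back to the wiki-specific regular expression.
    return SPIDER_UA.match(user_agent) is not None


print(is_spider("curl/7.64.1", "Other"))
print(is_spider("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0", "Other"))
```

Checking the ua-parser device first keeps the regular expression as a fallback for the agents ua-parser misses, which mirrors the order of the pseudocode.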