Research talk:Page view/Archive 1


HTTP 200

This requirement directly contradicts the stated aims for the questions to answer, in particular «should be focused on much-requested or much-appreciated content». Visits to red links can be valuable; is there any particular need to exclude them? I'd be convinced if, e.g., there were significant visits from crawlers to 404 pages. --Nemo 17:36, 28 September 2014 (UTC)

Good point! I'm going to take a look at the dataset and see what we see. If there isn't any substantial set of issues, we can look at including it and adding it as an extra tag or something, if we want to demarcate completed views. If there is, we can treat it as a separate metric. Ironholds (talk) 14:35, 30 September 2014 (UTC)
Bit of ad-hoc testing; a day of sampled logs contained (after filtering by MIME type) 47,374 requests that 404'd, of which 17,341 came from identified spiders. At least another 8,749 came from spiders/feed-readers/other automated things that weren't recognised. That comes to 26,090 out of 47,374, or 55%. There were 966,649 200s after MIME filtering; of those, 119,944 are from identified spiders, and it looks like another 99,194 or so are from other automata or unidentified spiders. The percentages are different here, but obviously it's difficult to tell if that's because of different behavioural patterns (most links are not red links). Ironholds (talk) 15:50, 30 September 2014 (UTC)
Some 20 times fewer than 200s, and only half of them "false"; sounds like they're doing no harm. The benefits of having them somewhere are huge, cf. [1] and countless other discussions in the last decade about failed searches and "there were no results" logs. --Nemo 22:44, 1 October 2014 (UTC)
Fair point. How's this sound, then: include them, but filter them out or mark them for the high-level pageview numbers (since for those, what we want to know is completed content consumption; or, at least, we must be able to extract that from the aggregated data), and keep 404s in for the per-page/per-URL numbers? I'm more of a fan of simply adding that as a new tag, personally, because as you say, it's interesting data, and I can see us being interested in (for example) looking at how success or failure rates vary geographically, and whether the blocker to wider participation is a lack of starting content in locally-interesting subject areas. Ironholds (talk) 22:50, 1 October 2014 (UTC)
FWIW, I did a report on the most requested missing articles (not refreshed monthly). The demand was overwhelming (if 10 repeated requests from one person count as such). I would like to see 404s from humans separately from crawler-induced duds. Erik Zachte (WMF) (talk) 16:20, 7 October 2014 (UTC)
Makes sense. What would people think of adding 404s as another 'tag', as it were? That way we would exclude crawler traffic from the 404 stream while retaining the ability to filter 404s out if we wanted to measure consumption rather than desired consumption. Ironholds (talk) 21:59, 9 October 2014 (UTC)
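
A minimal sketch (in Python; the field names and tag names are assumptions, not the actual schema) of the tagging approach discussed above: 404s stay in the per-page stream under their own tag, so headline pageview counts can filter them out while per-URL reports keep them.

 def tag_request(request):
     tags = set()
     if request["agent_type"] == "spider":
         tags.add("spider")
     if request["status"] == 404:
         tags.add("missing")      # desired-but-absent content
     elif request["status"] == 200:
         tags.add("pageview")     # completed content consumption
     return tags

High-level consumers would then count only requests tagged "pageview" and not "spider".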

To identify spider traffic

This sounds extremely weak, and in particular it doesn't protect against intentional manipulation of page view statistics (which would lie about the user agent). Much more robust would be to (also?) exclude requests from IPs which made abnormal numbers of requests in a limited timespan; this is how Erik Zachte and others typically identified sources of errors in statistics about small numbers of pages. --Nemo 17:42, 28 September 2014 (UTC)

Yeah, and that's how I tend to, too: IP/UA tuples. The difficulty is getting the precision of that filter correct; the ratio of people to IPs varies widely. We can work on it and discuss it, though. Ironholds (talk) 14:00, 30 September 2014 (UTC)
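
A rough sketch of the rate-based heuristic under discussion, assuming requests have already been grouped into a time window; the threshold is a placeholder, and tuning it is exactly the precision problem mentioned above (many people can share one IP).

 from collections import Counter

 def flag_heavy_clients(requests, threshold=500):
     # requests: iterable of (ip, user_agent) tuples seen in one window.
     counts = Counter(requests)
     return {pair for pair, n in counts.items() if n > threshold}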

To identify Zero traffic

I don't think such an edge case for a service which might even be discontinued in the future fits in a definition of page view. --Nemo 17:42, 28 September 2014 (UTC)

@Nemo bis: we have an explicit request to identify and measure Zero traffic; the tags listed in the definition are meant to provide a first set of high-level breakdowns of total pageviews. --Dario (WMF) (talk) 05:01, 30 September 2014 (UTC)
I'm not blaming you, but the request is unreasonable. --Nemo 22:44, 1 October 2014 (UTC)
If the service is discontinued, the filter will be unnecessary and will be removed, simple as that. Reasonable or unreasonable, it's also completely trivial. Ironholds (talk) 22:53, 1 October 2014 (UTC)

API requests

API requests other than those that are used by apps to render the contents of a page are intentionally excluded from the PV definition, which makes sense. I wonder if (and how) we should capture the volume of API requests as a KPI. See also Research_talk:Page_view/App_identification for a related question --Dario (WMF) (talk) 05:08, 30 September 2014 (UTC)

Point. I feel like that's a different research question but still a metric we should be tackling. Ironholds (talk) 14:20, 30 September 2014 (UTC)
Called out here. Ironholds (talk) 14:37, 30 September 2014 (UTC)

Timeframe

It was mentioned in private email that this definition has a tight timeframe. What is the timeframe? QChris (WMF) (talk) 13:32, 30 September 2014 (UTC)

As I understand it from Toby, the expectation is an implementable definition by the end of the second quarter of this year. So, 3 months to get this right. I'm not sure if there are Analytics Engineering expectations too, however. Ironholds (talk) 14:17, 30 September 2014 (UTC)

Vague use cases

Since the use-cases are still vague, could you please provide the specifics? Who needs which data (Fields, and aggregation levels) for which concrete purpose? QChris (WMF) (talk) 13:51, 30 September 2014 (UTC)

I can do my best; what sorta format would you like? (Table, list...) Ironholds (talk) 14:19, 30 September 2014 (UTC)
I don't care much about the format, I care about the content ;-) Pick whatever format you want. (And to avoid us getting blocked on formalities ... if you have no preference, let's start with a list.) QChris (WMF) (talk) 22:30, 1 October 2014 (UTC)
Added! Hopefully this is an improvement? Ironholds (talk) 22:43, 1 October 2014 (UTC)
Yay! It's a great start. But give us more details on them :-)
Let's explain using the requirements that you gave for the “How do people consume our content” item. There, they currently are
  Requirements for this would be a breakdown of pageviews, by access method, over time
That is still a bit vague, because for example “access method” is left undefined, and “over time” does not say anything about granularity. The use of “pageviews” here is a bit tautological, and it also does not really address the “Who”, “Which data”, and “which concrete purpose” parts.
So what I was looking for was something along the lines of this completely made-up description:
 Dave-the-Decisionmaker wants to assess the impact our mobile
 initiatives have and would need numbers relating the total number of
 webpages people viewed from desktop browsers vs. mobile browsers.
 
 Since the assessment of different initiatives should be comparable,
 there is no need to drill down on reading or other kinds of
 actions. There is only interest in the overall effect.
 
 For the same reason, there is no need for “per project”-breakdowns
 or the like.
 
 The data should come with hourly resolution to allow closely
 relating changes in numbers with launches/deployments.
 
 The pure data is enough. Visualization is not needed. It'll be done
 by Dave by hand for specific launches/deployments.
The difference here is that it gives a customer. If there is any doubt around the use case, or clarifications are needed, we know we can just ask Dave-the-Decisionmaker.
It explains (at least in layman's terms) what Dave expects to be counted.
Also, it explains the granularity of the data.
With this information and the provided rationales, one can now judge, discuss, and challenge the use case: for example, seeing that hourly resolution buys more noise than signal, due to our traffic following a strong daily pattern. Daily resolution would probably do a better service.
(I fully agree that this ask for more details around the requirements might appear unneeded, because “we know we need them”. However, drilling down into HTTP status codes, URL paths, MIME types, and the like is basically just a maintenance nightmare. So I want to make sure that the need for this really arises from the use cases, not from the fact that “this is the way it's been done for years”, because some of the use cases I heard of might actually be better solved in a different way.) --QChris (WMF) (talk) 13:56, 5 October 2014 (UTC)
That makes a lot of sense. @QChris (WMF): I've rebuilt the first use case in line with your example, here. Does this contain the sort of detail you're looking for? (Obviously even after I refactor we should get Toby et al to check them for stupid). Ironholds (talk) 13:24, 12 October 2014 (UTC)
Thanks! This helps me a lot.--QChris (WMF) (talk) 11:23, 23 October 2014 (UTC)

Tagging protocol

I'm not quite clear on the tagging protocol. Can webrequests have multiple tags? E.g. could we have a request from a Zero user via a Wikipedia App? It looks like the definition provides for this. E.g. since a request is from an app, it would abide by the rule "MIME type header = application/json;" to be tagged with "App" as well as abiding by the rules "Request is not to api.php;", "the X-Analytics header contains 'zero';" and "the request is not tagged as Spider;" --Halfak (WMF) (talk) 21:10, 3 October 2014 (UTC)

Crucially, yes, although that should be the sole time when it happens: you can have zero-rated app requests. Ironholds (talk) 22:00, 9 October 2014 (UTC)
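
As an illustration of that one sanctioned overlap, a sketch with hypothetical field names; the rules here only paraphrase the ones quoted above, not the full definition.

 def tag_access_method(request):
     tags = set()
     if request["mime"] == "application/json":
         tags.add("app")
     if "zero" in request.get("x_analytics", "") and "spider" not in tags:
         tags.add("zero")         # a zero-rated app request carries both tags
     return tags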

Redundancy with "api.php" and content directory

The general filter rule "the URL contains a 'content' directory (/wiki/, /zh-han/ or other variant-specific directories);" seems to make it impossible that any of the tag filters could fail the rule "Request is not to api.php;". Am I missing something? --Halfak (WMF) (talk) 21:12, 3 October 2014 (UTC)

If you look at the pseudocode, /w/ is included, to make sure we don't miss App hits, basically. Now, we could just go "all the API requests are App requests, so if it's not already tagged, it's not an API request. Sorted!", but unfortunately it's possible to have an api.php request with a text/html MIME type and a 200 status code: a particularly stupid class of PHP exceptions. It's providing an error page, which is HTML, and it succeeded in doing so. Successful text/html request! I guess we could just have "text/html and not api.php, or application/json and [other requirements for apps]" in the first filter's pseudocode? I'm not sure what it'd do to speed, but conceptually it'd be much simpler, because we'd get to remove the endless "and not to api.php" lines. Thoughts? Ironholds (talk) 22:20, 3 October 2014 (UTC)
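
The restructured first filter floated above might look like this sketch; looks_like_app is a stand-in for "[other requirements for apps]", and the user-agent test inside it is an assumption.

 def looks_like_app(request):
     # Stand-in for the "[other requirements for apps]" placeholder.
     return "WikipediaApp" in request.get("user_agent", "")

 def passes_first_filter(request):
     # "text/html and not api.php, or application/json and app rules".
     is_html = request["mime"].startswith("text/html")
     is_json = request["mime"].startswith("application/json")
     to_api = "api.php" in request["url"]
     return (is_html and not to_api) or (is_json and looks_like_app(request))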

Mobile web traffic per url is now available

The second is for per-page counts - the community or research-focused questions. This produces aggregated counts of requests for each URL. One major limitation here is that it does not include mobile traffic, even mobile web traffic, which makes it difficult to accurately represent how reader attention is focused (and how much reader attention is focused) on particular articles.

This statement is no longer correct Erik Zachte (WMF) (talk) 16:12, 7 October 2014 (UTC)

Add tag for 'crap requests'

We discovered last week that we get up to a billion requests a month (peak in September) for the page [uU]ndefined. We want to count those for ops, but not as genuine user requests for community/strategy purposes. (We get tens of thousands from one IP within hours, so there is some loop in client code.) What about a hiveconf:white_list_of_crap_strings that would tag the requests as 'crap'? Erik Zachte (WMF) (talk) 16:38, 7 October 2014 (UTC)

I think this comes back to the "when we say spiders, do we mean spiders, do we mean spiders and automated software, or do we just mean 'stuff we don't care about'?" conversation. Ironholds (talk) 21:59, 9 October 2014 (UTC)
The [uU]ndefined is misbehaved software, but as far as we know it isn't a spider. So yeah, I favor a separate 'rejected because of blacklist string' tag, aka 'crap'. Erik Zachte (WMF) (talk) 22:10, 22 October 2014 (UTC)
Totally; automated libraries, though, where do they live? Okeyes (WMF) (talk) 14:04, 23 October 2014 (UTC)
Not sure what you mean by that. I was thinking of something like 'hiveconf:whitelisted_mediawiki_projects' in [2] Erik Zachte (WMF) (talk) 15:46, 23 October 2014 (UTC)
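
A minimal sketch of the proposed 'crap' tag: a configurable list of known-bad title patterns, seeded here only with the [uU]ndefined case from this thread.

 import re

 CRAP_PATTERNS = [re.compile(r"/wiki/[uU]ndefined$")]

 def is_crap(url):
     # Tagged for ops counts, excluded from genuine-user totals.
     return any(p.search(url) for p in CRAP_PATTERNS)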

False anchors

Media Viewer

What about URLs such as https://az.wikipedia.org/wiki/Mar%C3%A7ello_Malpigi#mediaviewer/File:Marcello_Malpighi_large.jpg ? --Nemo 11:13, 11 November 2014 (UTC)

@Nemo bis: Bah; that's manky. That'll definitely show up as a pageview at the moment :(. Unless we want it to, of course? I mean, directly clicking on an image and ending up on the file page would count. Okeyes (WMF) (talk) 20:30, 24 November 2014 (UTC)
I don't think we do, and it's text/html (because...because the universe is mean). I'll drop an email to the public list. Okeyes (WMF) (talk) 20:01, 25 November 2014 (UTC)
Email sent. Will dig into the VE weirdness now! Okeyes (WMF) (talk) 20:29, 25 November 2014 (UTC)

Background / Primary use cases

I changed the headers to reflect the hierarchy in less space, I hope it's fine. --Nemo 14:53, 29 October 2014 (UTC)

How do people consume our content?

Per wiki?

Are those numbers needed on a per-wiki level, or summed across all wikis? --QChris (WMF) (talk) 11:33, 23 October 2014 (UTC)

Clarified here. Okeyes (WMF) (talk) 14:26, 23 October 2014 (UTC)

“Monthly” in Monthly 1-day rolling average

What period does the “Monthly” in “Monthly 1-day rolling averages” refer to? 28 (Kudos to ezachte!), 30, 31 or “ever-changing 30 and 31 depending on the month one is in”? --QChris (WMF) (talk) 11:33, 23 October 2014 (UTC)

clarified! Okeyes (WMF) (talk) 14:27, 23 October 2014 (UTC)
I still think it is a bit unfortunate to choose 30 days instead of 28, as this will contain 4 or 5 weekends on different days, and hence a meaningless periodic bump, which is distracting. Having said that, I don't think I understand the usefulness of 'rolling whatever'. Having weekly counts instead of 'rolling monthly' would make an exceptional day affect one weekly total: easy to grasp, easy to localize when some event occurred. Erik Zachte (WMF) (talk) 15:52, 23 October 2014 (UTC)
Let me rephrase: I understand that if we, say, look at a strange drop/peak in UVs for a given month, we want to have other metrics also confined (almost) to that same month. But that's different from making habitual trend plots from rolling averages. Erik Zachte (WMF) (talk) 16:02, 23 October 2014 (UTC)
I think we should consider this later; let's get the definition done first. We know we'll be outputting hourly, so how it is visualised is a future problem. Okeyes (WMF) (talk) 18:12, 2 December 2014 (UTC)
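
For reference, the two presentations being compared, over a list of daily totals; a sketch only, not the production aggregation.

 def rolling_30_day_mean(daily):
     # One value per day, once 30 days of history exist.
     return [sum(daily[i - 29:i + 1]) / 30 for i in range(29, len(daily))]

 def weekly_totals(daily):
     # One value per complete 7-day block, as suggested above.
     return [sum(daily[i:i + 7]) for i in range(0, len(daily) - 6, 7)]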

Resolution of Monthly 1-day rolling average

Per-day metrics (instead of per-4-week metrics) are used to exhibit short-term changes (increase resolution). Averages are used to smooth out short-term changes (decrease resolution). Hence, increasing resolution only to decrease it afterwards sounds like it foremost makes the verbal description of the metric longer without increasing the usefulness of the numbers.

(Note that other metrics, where we average daily values over some 30-day period, have a different structure. It totally makes sense to ask daily for active editors. But Page Views happen or they don't; they don't need a period over which they can grow to reach a threshold.) --QChris (WMF) (talk) 11:33, 23 October 2014 (UTC)

I can't parse this, I'm afraid; my bad :(. Could you clarify? Okeyes (WMF) (talk) 14:28, 23 October 2014 (UTC)
This seems related to my comments on previous section. Am I right? Erik Zachte (WMF) (talk) 16:04, 23 October 2014 (UTC)
I think so; see my comments above. Okeyes (WMF) (talk) 18:12, 2 December 2014 (UTC)

Where do people consume our content?

Editors

While the “How do people consume our content?” use-case focusses on readers, “Where do people consume our content?” calls out readers and editors. However, I cannot find how the requirements of the use-case get to editors. How are editors represented in this use-case? --QChris (WMF) (talk) 11:33, 23 October 2014 (UTC)

clarified! Okeyes (WMF) (talk) 14:29, 23 October 2014 (UTC)

Access methods

The use-case as is does not state the need to have “per access method” numbers. However, the requirements list “use case 1” (which has per-access-method numbers) as a base requirement. Is there use in having “per access method” numbers for countries? --11:33, 23 October 2014 (UTC)

I'd think so; specifically, digging into whether there are national/cultural variations in how people get to Wikipedia. If we're interested in encouraging growth in, say, India, and we find that India's mobile consumption far outstrips its desktop consumption, that produces a different set of priorities than if we simply assumed that Indian consumption followed the overall, global mobile/desktop split.
As noted, this use case (and subsequent primary use cases) are not entirely built out yet; the first was a prototype to see if it helped. I'll build these out now :). Okeyes (WMF) (talk) 14:31, 23 October 2014 (UTC)

How do people get to our content?

Internal referers and “Third-party”

Since third party dependencies are mentioned, does this mean we're not interested in internal referers (like wikipedia -> wikipedia, or wikivoyage -> commons)? --QChris (WMF) (talk) 11:33, 23 October 2014 (UTC)

I don't think we particularly are? It feels like the sort of thing that we can dig into via ad-hoc queries if something interesting shows up, but I haven't heard any interest in it as a regular thing (nor have I got any questions about it thus far. @Erik Zachte (WMF):) Okeyes (WMF) (talk) 14:56, 23 October 2014 (UTC)
I have a report which does tally by origin, including internal origin. I have to say I received zero feedback on it over the years. Erik Zachte (WMF) (talk) 16:09, 23 October 2014 (UTC)
It looks like we're getting more attention around referers now that pageviews are lagging behind the general internet growth trend. Okeyes (WMF) (talk) 18:12, 2 December 2014 (UTC)

Aggregation Level

You mention that referers would get aggregated. But on what level? Per company (like Google vs Facebook referers), per domain (like google.es vs google.ch), per page (like example.org/PageFoo vs example.org/PageBar)? --QChris (WMF) (talk) 11:33, 23 October 2014 (UTC)

Google metric?

Since the use-case calls for “responsible for an incredibly high proportion of our readership”: a crude check against the sampled-1000 logs hints at Google being the leader with 10-15%, with second place already at <1%. Is this effectively geared towards Google? --QChris (WMF) (talk) 11:33, 23 October 2014 (UTC)

FYI, this report on Google Requests says "In total Google was somehow involved in 37.0% of daily external page* requests". This report on Crawlers shows other major players. I'd hope to see analysis of several major search engines, and how they affect us. Erik Zachte (WMF) (talk) 16:13, 23 October 2014 (UTC)
Sure, but the reports you linked are counting bots. The use-case is about people. So the linked reports (and the contained numbers) do not apply to the use-case. --QChris (WMF) (talk) 17:12, 23 October 2014 (UTC)

Access Methods

The use-case as is does not state the need to have “per access method” numbers. However, the requirements list “use case 2” (which has per-access-method numbers) as a base requirement. Is there use in having “per access method” numbers for referers? Also ... is there a need for country information (which use case 2 also calls for)? --11:33, 23 October 2014 (UTC)

What articles do readers focus on?

Quality measure

How would the counts determine which pages aren't of “incredibly high quality”? --QChris (WMF) (talk) 11:33, 23 October 2014 (UTC)

Rereading example 2, it seems to me this is about content coverage and article quality: two highly important but quite distinct issues. And the bold text only mentions quality. We have some hope of quantifying what we call in Q2 'knowledge gaps'; quality is an even wilder beast to tame. What about splitting these two issues into separate sections? Erik Zachte (WMF) (talk) 16:21, 23 October 2014 (UTC)

Method

What does “method” refer to? HTTP methods (GET, POST, ...), API actions (action=edit, action=history, ...), something more fine grained, or something completely different? --QChris (WMF) (talk) 11:33, 23 October 2014 (UTC)

Access methods

Just to make sure ... we're not caring at all about access methods here; we're only focusing on the articles themselves. Right? --QChris (WMF) (talk) 11:33, 23 October 2014 (UTC)

Redirects

Since this seems to be per article, how would the definition count redirects? Only towards the redirect source, only to the target, both, or some other scheme? --QChris (WMF) (talk) 11:33, 23 October 2014 (UTC)

Templates

Since this seems to be per article, how would the definition count templates? Would it count each request to a page that uses a template also as template view? --QChris (WMF) (talk) 11:33, 23 October 2014 (UTC)

Bots

I assume so, but since it does not explicitly say so (only the first three use-cases call for it) ... we're not counting bots for this use-case, are we? --QChris (WMF) (talk) 11:33, 23 October 2014 (UTC)

Content namespace

Are we interested in all articles, only those in content namespaces, or only in namespace 0? --QChris (WMF) (talk) 11:33, 23 October 2014 (UTC)

Pages covering multiple articles

Are pages that cover more than one article (like a diff between two different articles) considered? If so, how are they counted? --QChris (WMF) (talk) 11:33, 23 October 2014 (UTC)

Tagging crawlers

I edited the mention of crawlers as follows: [3]. This is a highly visible and often lamented shortcoming of the current stats. Erik Zachte (WMF) (talk) 16:24, 23 October 2014 (UTC)

Thanks! It looks like we're going over the use cases Yet Again, so we'll see what we see. Okeyes (WMF) (talk) 18:15, 2 December 2014 (UTC)

Kevin's use cases

Thoughts:

  1. Is this intended to be exhaustive, first-order-priorities-only, priorities-we-plan-to-work-on-only...?
  2. How do we reconcile privacy protections with including user agent data and country in per-page breakdowns for the community? Some pages are going to get 1 pageview in a day.
    • For internal use, this shouldn't be a problem. If we publish the data set, I think we would need to suppress pages with fewer than X views. I'm open to other ideas. KLeduc (WMF) (talk) 07:17, 28 October 2014 (UTC)
      • Well, as well as rare pages, it's rare tuples. If we have a page with 3m pageviews but only one from a particular combination... it's still PII. I would suggest that we simply strike this until we have an anonymisation strategy in place (a minimal suppression sketch follows this list). Okeyes (WMF) (talk) 14:01, 28 October 2014 (UTC)
  3. Device is going to be expensive to maintain, even if it's cheap to implement, and will shift irregularly.
  4. Why API, and not the apps? Okeyes (WMF) (talk) 21:09, 24 October 2014 (UTC)
  5. Daily counts
    • I think users want hourly counts, as they have right now; that's important for quickly detecting trending topics. As that's already covered by the new hourly pageview dumps, I'm not sure if it is relevant here (but I expect the new pageview dumps to align with these definitions). Erik Zachte (WMF) (talk) 00:07, 3 December 2014 (UTC)
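
A minimal sketch of the threshold suppression floated in point 2; both the tuple shape and the value of k are illustrative.

 def suppress_rare_rows(rows, k=10):
     # rows: dict mapping (page, country, ua_class) -> view count.
     # Rows below k views are dropped before publication.
     return {key: n for key, n in rows.items() if n >= k}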

Side ways

Permalinks

The effect is probably negligible, but it should be noted that permalinks (URLs containing "oldid=" but not "diff=", like [4]) are legitimate pageviews. Significant traffic probably only happens if some external website uses a permalink to link to one of our pages. --Nemo 14:57, 29 October 2014 (UTC)

We should still factor it in. Okeyes (WMF) (talk) 20:31, 24 November 2014 (UTC)
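
A sketch of the check described above, using Python's standard URL parsing.

 from urllib.parse import urlparse, parse_qs

 def is_permalink(url):
     # "oldid= but not diff=" counts as a legitimate pageview.
     params = parse_qs(urlparse(url).query)
     return "oldid" in params and "diff" not in params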

&stable=0 etc.

FlaggedRevs redirects you to URLs like [5] after an edit is completed. On non-FlaggedRevs wikis, you're instead redirected to a URL like w:ru:Мальпиги, Марчелло. Ideally, FlaggedRevs wikis should be comparable to the others. --Nemo 11:13, 11 November 2014 (UTC)

That shouldn't make a difference (we have ?title=) unless, of course, they use a different MIME type? Okeyes (WMF) (talk) 20:31, 24 November 2014 (UTC)

Parameters appended to short URLs

I don't remember exactly which, but a few days ago I saw URLs, produced by a Wikimedia-deployed extension, in the form /wiki/foo?bar=baz . I have no idea why they're making URLs "manually" instead of using fullurl, but apparently such things happen. Ideally they wouldn't be treated differently from the index.php?title=foo&bar=baz URLs they are equivalent to. --Nemo 11:13, 11 November 2014 (UTC)

Huh. Let me know if you spot it again?
I happened to see it today: if you start editing with VisualEditor, but then switch to wikitext editing, you're sent to a /wiki/Foo?action=submit URL (which loads the equivalent of a "show diff" after action=edit). --Nemo 16:38, 25 November 2014 (UTC)
...that's just weird. I'll check out the MIME type. Okeyes (WMF) (talk) 16:54, 25 November 2014 (UTC)
Looks like the same is true of preview actions. We'll have to filter those; thanks! Okeyes (WMF) (talk) 22:07, 25 November 2014 (UTC)
Perhaps exclude everything with 'action=' in the URL? Note that 'search=' ([6]) can also work without curid/oldid/diff/title, while most other "actions" would have at least the title of the special page. mw:Manual:Parameters to index.php in theory knows all of them. --Nemo 21:56, 2 December 2014 (UTC)
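
A sketch of that exclusion; whether search= should also be dropped is left open above.

 from urllib.parse import urlparse, parse_qs

 def has_excluded_params(url):
     # Drops /wiki/Foo?action=... requests (submit, preview, history, raw).
     return "action" in parse_qs(urlparse(url).query)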

primary use cases/existential threats

"A lot of others come from the volunteer community, and third-party researchers, who are also our customers: after solving for existential threats, these use cases are also primary." Not serving good stats to our volunteers, who request better page views stats since many years, might be an existential threat. We're testing their patience. Erik Zachte (WMF) (talk) 23:59, 2 December 2014 (UTC)

Special namespace and actual problems

Does this mean I wouldn't have stats for

? I don't see much gain in excluding special pages, while it would be more interesting to filter out requests which include parameters to index.php, like /wiki/Page?action=history or /wiki/Page?action=raw. --Nemo 17:39, 28 September 2014 (UTC)

Parameters is a good point; we should factor that in. Why is filtering special pages not useful, in your opinion? Simply that having those numbers is worthwhile? Ironholds (talk) 14:02, 30 September 2014 (UTC)
Special pages are worthwhile, and there is no meaningful distinction between them and "content". For instance, translatewiki:Main Page became translatewiki:Special:MainPage, and a WP:RfX page might in ten years become a special page, but the substance wouldn't differ.
What matters is the status code: for instance, Special:MyLanguage above gives a 302 (but so do Special:Diff etc.). --Nemo 22:44, 1 October 2014 (UTC)
Surely there is a distinction? Special pages are the software, or the...ink trails left by contributions. They're not contributed to directly. Although I guess you could make the same argument for WP namespace pages. Ironholds (talk) 22:51, 1 October 2014 (UTC)
I'd like to keep Special pages. Some of those are viewed a lot, and our UI designers (to name one example) will want to know how often. One more use case: a botnet used Special:Random to swamp us with 5% of our overall page views during two months. Erik Zachte (WMF) (talk) 23:10, 2 December 2014 (UTC)
As for localized 'Special:', that could be costly to detect. Would it help to have Ops attach an extra parameter, say ?special=yes, and maybe even ?special=yes&en_page=Random? Erik Zachte (WMF) (talk) 23:10, 2 December 2014 (UTC)
Yeah, I actually built a tool that filtered this with a localised regex; costly indeed :(. An extra parameter would probably be easier, or we could go for Christian's MediaWiki-extension-that-outputs-pageID-and-namespace, which would do the same thing; that's an implementation detail, so I'll leave it up to him. On filtering special pages, I think you're right that we need to find some way to include them, but there are certainly ones we should exclude. For instance, the existing query explicitly calls out login attempts. We might go around building a list of "things to exclude" and rely on that; off the top of my head: EventLogging and banner-related calls, and logins. The risk there is that it's another thing to keep up to date, but that may be a risk worth bearing. Okeyes (WMF) (talk) 14:48, 3 December 2014 (UTC)
I'm going to grab some sampled logs and take a stab at identifying those requests. Let's see what happens! [7] Okeyes (WMF) (talk) 14:57, 3 December 2014 (UTC)

As for the canonical name, the <head> of any MediaWiki-produced page has "wgCanonicalSpecialPageName":false on non-special pages, or the English name of the special page on all special pages. If you need the information as a request header, you "only" need to reuse that. --Nemo 15:40, 3 December 2014 (UTC)

Neat! @QChris (WMF): - see Nemo's line. Okeyes (WMF) (talk) 15:45, 3 December 2014 (UTC)
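
A sketch of reusing that value: pull wgCanonicalSpecialPageName out of the page HTML. The regex assumes the JSON-style serialisation quoted above.

 import re

 _CANONICAL = re.compile(r'"wgCanonicalSpecialPageName":(?:false|"([^"]*)")')

 def canonical_special_name(html):
     # None for non-special pages; the canonical English name otherwise,
     # regardless of the localized alias in the URL.
     m = _CANONICAL.search(html)
     return m.group(1) if m and m.group(1) else None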

Filter external requests from WMF scripts

We filter by "and the request is not internally generated by the cluster;". Shouldn't we also filter external requests which are made after page rendering by WMF scripts? We received millions of housekeeping requests for fundraiser banners, afaik from JavaScript. I'm not sure if this still applies, though. Erik Zachte (WMF) (talk) 23:23, 2 December 2014 (UTC)

Absolutely! I think most of the JS calls from fundraising are caught by the MIME-type filtering, at the moment, but we should go through and make sure. I'm particularly worried about what happens if we include versus exclude bits traffic, and what that's going to do even with MIME filtering. Okeyes (WMF) (talk) 14:49, 3 December 2014 (UTC)

Detect mobile requests

and the URL contains "m.w"; -> and the URL contains ".m.w" (for extra safety) Erik Zachte (WMF) (talk) 23:39, 2 December 2014 (UTC)

Makes sense! Will tweak. Okeyes (WMF) (talk) 14:49, 3 December 2014 (UTC)
It's already there, apparently; the main page just doesn't currently match the tags. Will go through and make sure they do. Okeyes (WMF) (talk) 14:50, 3 December 2014 (UTC)
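
For the record, the tightened check as a one-liner; "stream.wikimedia.org" is an example of a host containing "m.w" but not ".m.w".

 def is_mobile_host(host):
     # e.g. en.m.wikipedia.org, meta.m.wikimedia.org
     return ".m.w" in host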

Webstatscollector, Hive, Hadoop

The page has some "no longer true" statements mentioning Hive and Hadoop. FYI, we lack a wiki page for request logs, AKA squid logs, AKA Domas logs, AKA webstatscollector, etc. There should be one, under a descriptive and easy title, briefly describing the status as of 2015; the legacy pages should then be merged into it. I added a section at wikitech:Request logs with some links to the scattered information I found. --Nemo 09:03, 26 January 2015 (UTC)

Artificial increase in statistics of Romanian Wikipedia

Here one can find that the Romanian Wikipedia's popularity (number of page views) grew by 53% last month.

But here one can find the most visited pages on the Romanian Wikipedia this month

  1. 1 (183 576 views)
  2. 2 (183 470 views)
  3. 6 (183 463 views)
  4. 5 (183 462 views)
  5. 8 (183 456 views)
  6. 4 (183 453 views)
  7. 7 (183 449 views)
  8. 3 (183 446 views)
  9. 9 (183 435 views)
  10. Zero (dezambiguizare) (183 432 views)

Strange, isn't it?

How can we fight artificial increases in the most important Wikipedia statistics? --Perohanych (talk) 04:22, 10 April 2015 (UTC)

Massviews – how to request by API using a PagePile list

API/REST_v1 offers some ways to get Pageviews data. What about the following features using API calls:

  1. I'm missing a way to call massviews with a PagePile list.
  2. A wmflabs call of massviews offers a simple day-by-day list that can be downloaded as CSV. I'd prefer to get the data in CSV format directly, e.g. as application/csv or text or something else. (I want to analyze the data inside my own application and avoid manually saving the data to a CSV file and then loading it in the app.)
  3. What's the best way to get the monthly or yearly summarized views? Of course, I can read the CSV file and add up the 365 values of a year. On the other hand, the database has to add up 24 values for each day; it should be able to return one number as the sum of 24*365 values. (A sketch of one approach follows at the end of this section.)

I'd be happy if someone could tell me a more direct way. Thanks in advance, Juetho (talk) 15:43, 28 December 2016 (UTC)

In the meantime, I found the massviews URL structure. It works if I call the URL directly from the browser, but I'm not able to call it from my own .NET application using the HttpWebRequest class and GetResponse method; I always get "System.Net.WebException: The remote server returned an error: (403) Forbidden." The HTTP status codes page says: "... the server is refusing to respond to it. The user might be logged in but does not have the necessary permissions for the resource." How can I tell the wmflabs server from my application that I'm allowed to call this data? -- Juetho (talk) 09:28, 1 January 2017 (UTC)

This question has been moved, in more up-to-date form, to Talk:Pageviews Analysis. -- Juetho (talk) 10:05, 1 January 2017 (UTC)
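
For later readers: the REST API's per-article endpoint (see API/REST_v1) accepts a monthly granularity, which answers question 3 above without client-side summing, and Wikimedia servers commonly answer 403 to requests that send no descriptive User-Agent, which may be what the .NET client above ran into. A sketch, with example project/article values:

 import requests

 # Endpoint shape per API/REST_v1; "de.wikipedia" and "Berlin" are examples.
 URL = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
        "de.wikipedia/all-access/user/Berlin/monthly/2016010100/2016123100")
 # A descriptive User-Agent; its absence is a common cause of 403 replies.
 resp = requests.get(URL, headers={"User-Agent": "example-tool/0.1 (me@example.org)"})
 for item in resp.json()["items"]:
     print(item["timestamp"], item["views"])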

Is the data comparable with Webalizer?

We are trying to compare page view counts from Webalizer running on our WMDE server with this wiki's page views tool. Is this data comparable? Webalizer provides the following description of the metric 'Pages':

"Pages are, well, pages! Generally, any HTML document, or anything that generates an HTML document, would be considered a page. This does not include the other stuff that goes into a document, such as graphic images, audio clips, etc... This number represents the number of 'pages' requested only, and does not include the other 'stuff' that is in the page. What actually constitutes a 'page' can vary from server to server. The default action is to treat anything with the extension '.htm', '.html' or '.cgi' as a page. A lot of sites will probably define other extensions, such as '.phtml', '.php3' and '.pl' as pages as well. Some people consider this number as the number of 'pure' hits... I'm not sure if I totally agree with that viewpoint. Some other programs (and people :) refer to this as 'Pageviews'."

--Stefan Schneider (WMDE) (talk) 14:01, 20 July 2017 (UTC)

Dialect-Specific Directories

As per T92020:

One of the big improvements of the new definition over the old one is that the new one is not limited to /wiki/: it includes all of the Chinese and Serbian dialects that have their own folder names and, as a result, were not appearing in the old pageview counts.

James F (thanks James!) pointed out that there are other wikis that do this - see the list at https://meta.wikimedia.org/wiki/Wikipedias_in_multiple_writing_systems#With_Automatic_Conversion_System.

The new pageview definition covers every language presented at the previous link. This has been checked using the language tab on the country-specific Wikipedia websites (the third tab at the top of the page, with a dropdown list).

Note: the definition also includes zh.wikipedia.org/zh-hans and zh.wikipedia.org/zh-hant, even though those two variants are not present in the Chinese Wikipedia's language tab. There were no hits on those folders in the week of 2015-03-16 to 2015-03-22.

So views of past versions count as well?

This should be in the article. 85.240.216.101 21:08, 14 February 2018 (UTC)