Research talk:Page view/Generalised filters


Monitor trend in overall percentage rejects

BTW I'm hoping we find a solution to monitor the reject percentage. If our environment or the world at large changes we may find that our filter is no longer adequate. In order to be triggered, it would help to write the hourly percentage of unfiltered requests either to the projectcounts file (conveniently small file with similar overall stats per project) or to a dedicated log file, so we can monitor trend in % rejects. Erik Zachte (WMF) (talk) 15:33, 7 October 2014 (UTC)

I agree. Actually, what I would like to do is set up a system that appends the percentages to a file, endlessly, and also stores, say, 10k rows of [field match is running over] that matched and 10k that didn't, scrubbing it every few weeks or so (we can discuss the precise timings some other time). That would let us go "oh, the percentage went down; let's see what was or wasn't being caught and look for false positives/negatives" conveniently, without having to yank a big chunk of data out of HDFS and apply lots of filters to reduce noise. Ironholds (talk) 21:42, 9 October 2014 (UTC)
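The monitoring Ironholds describes could be sketched roughly as below. This is a hypothetical illustration only: the class name, log format, and sample cap are assumptions, not the eventual implementation. It uses reservoir sampling to keep the matched/unmatched samples bounded.

```python
import random

class RejectMonitor:
    """Sketch of the proposed monitoring: append hourly match percentages
    to a log file, and keep bounded samples of matched and unmatched rows
    for later false-positive/false-negative review."""

    def __init__(self, log_path, max_rows=10_000):
        self.log_path = log_path
        self.max_rows = max_rows
        self.matched, self.unmatched = [], []
        self.seen = {"matched": 0, "unmatched": 0}

    def record(self, row, matched):
        key = "matched" if matched else "unmatched"
        sample = self.matched if matched else self.unmatched
        self.seen[key] += 1
        if len(sample) < self.max_rows:
            sample.append(row)
        else:
            # Reservoir sampling: keeps a uniform sample of everything
            # seen so far without unbounded growth.
            i = random.randrange(self.seen[key])
            if i < self.max_rows:
                sample[i] = row

    def flush_hourly(self, hour_label):
        total = self.seen["matched"] + self.seen["unmatched"]
        pct = 100.0 * self.seen["matched"] / total if total else 0.0
        # Append the hourly percentage so the trend can be tracked over time.
        with open(self.log_path, "a") as f:
            f.write(f"{hour_label}\t{pct:.2f}\n")
```

A drop in the logged percentage would then prompt a look at the stored unmatched sample, as described above.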

Filtering by MIME type

Do we want to restrict to certain charsets? Why not only filter on text/html regardless of what follows? Erik Zachte (WMF) (talk) 15:13, 7 October 2014 (UTC)

We can if we want; it's an efficiency thing, really. If whole-string matching would be faster, let's go with text/html (+ charset if applicable); if regular expressions or fuzzy matching, text/html. The range of possible charsets in use is, afaik, 5, so it's only 6 strings to compare against. Implementation detail, though :). Ironholds (talk) 21:39, 9 October 2014 (UTC)
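The two approaches being compared could look like this. The specific charset list is illustrative, not the actual set of five charsets mentioned above:

```python
# Whole-string approach: a small fixed set of accepted MIME strings.
# The charset values below are examples, not the production list.
ACCEPTED_MIME = {
    "text/html",
    "text/html; charset=utf-8",
    "text/html; charset=UTF-8",
    "text/html; charset=iso-8859-1",
    "text/html; charset=ISO-8859-1",
    "text/html; charset=us-ascii",
}

def is_pageview_mime_exact(mime):
    """Whole-string comparison against a small fixed set."""
    return mime in ACCEPTED_MIME

def is_pageview_mime_prefix(mime):
    """Erik's alternative: accept text/html regardless of what follows."""
    return mime.startswith("text/html")
```

The prefix check is more permissive (it would also pass an unanticipated charset), which is the trade-off under discussion.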

Filtering to applicable sites

I'd include Incubator. It's genuine content. Erik Zachte (WMF) (talk) 15:21, 7 October 2014 (UTC)

Yeah, that makes sense. I'll add it now. Ironholds (talk) 21:42, 9 October 2014 (UTC)

"The actual error rate introduced by removing it is minuscule. In the event that implementing this system requires processing time savings, this filter has the best return from elimination." Maybe the filter doesn't do much now, but that could change, and if so would we notice? Even if a filter may seem a formality now it also works as safeguard against unwanted changes. (talk) 15:33, 7 October 2014 (UTC)

Yeah; for clarity, I think it's worth including, but the challenge is "build this system so that it takes <1 hour to process 1 hour of data". I'm saying we can dispense with this if performance renders it necessary. Okeyes (WMF) (talk) 18:16, 2 December 2014 (UTC)

Filtering to content directories

The pseudo-code is really maintenance-sensitive. Which is another way of saying it might lag behind reality soon. I would opt for a slightly more generic version where e.g. zh- followed by any 2-4 letters would pass. Erik Zachte (WMF) (talk) 15:39, 7 October 2014 (UTC)

There are a lot of requests directed to directories that don't exist (when we were discussing this via email, Christian brought up one of the /sr-*/ directories as not being real, for example); I worry that we would exchange one area of mutability that we have internal control over (the creation of new directory structures) for one that we have no internal control over (non-existent directory structures that random readers or bots choose to point at). How mutable is the directory structure? It seems to have remained static since I started looking at this problem, although that was only in December. We could poke Platform and check how often they get requests to modify these; Chad would know. Ironholds (talk) 21:45, 9 October 2014 (UTC)
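Erik's generic alternative, using his zh- example, might look like the sketch below. The pattern is illustrative only, and it demonstrates Ironholds' concern: it would also pass variant directories that don't actually exist.

```python
import re

# Generic pattern: /zh/ or /zh-xx/ through /zh-xxxx/, rather than an
# explicit whitelist of known variant directories. Illustrative only.
LANG_VARIANT_DIR = re.compile(r"^/zh(-[a-z]{2,4})?/")

def in_content_directory(path):
    """Accept any path under a zh language-variant directory."""
    return bool(LANG_VARIANT_DIR.match(path))
```

Note that a made-up variant like /zh-fake/ passes this check, which is the false-positive trade-off being weighed against the maintenance cost of a whitelist.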

Filtering to exclude internal requests

  1. Apply the test of "if it's from a WMF IP address, and does not contain a valid XFF (as HTTPS requests will), treat it as internal";
  2. Get everyone within the WMF to standardise their user agents, and filter thataway.

...both are suboptimal. We should talk this one through.

Please explain: why are these suboptimal (except for 2 requiring our code base to be adapted and the perpetrators to be disciplined)? Erik Zachte (WMF) (talk) 15:49, 7 October 2014 (UTC)
The first one requires us to be able to accurately identify valid XFF providers (and update when Production does) and to accurately identify our own ranges; at the moment, as I understand it, the production IP ranges and what-machines-we-accept-XFFs-from are stored in various Puppet manifests, which are not the most easily parsable thing for anything other than Puppet itself. Ironholds (talk) 21:54, 9 October 2014 (UTC)
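Option 1 above amounts to a check like the following sketch. The network range shown is an example only; the real production ranges and trusted XFF sources live in the Puppet manifests, which is exactly the maintenance problem just raised.

```python
from ipaddress import ip_address, ip_network

# Example range only; the authoritative list is in Puppet and would need
# to be kept in sync, which is the weakness of this approach.
WMF_NETWORKS = [ip_network("208.80.152.0/22")]

def is_internal(source_ip, x_forwarded_for):
    """Option 1: a request from a WMF address that carries no XFF header
    (as an HTTPS-terminated request would) is treated as internal."""
    addr = ip_address(source_ip)
    from_wmf = any(addr in net for net in WMF_NETWORKS)
    return from_wmf and not x_forwarded_for
```

Keeping WMF_NETWORKS (and the set of valid XFF providers) in sync with the Puppet manifests is the part with no clean answer here.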