User:Jeblad/Dynamic statistics

Dynamic statistics describes an AJAX-based solution for including statistics dynamically on Wikipedia pages. Data about actual page views are collected from the squid proxies and stored as language-specific temporary files. Once every hour the files are processed into JSON files. Those files are downloaded on demand by the client browser.

WikiStats files

The proxy log files are found at WikiStats, and are downloaded by wget and processed by compact.pl. The latter script splits out the individual language parts into individual folders.

Download by wget is attempted every five minutes, unless the queried file already exists in the download folder. If a new file is found and the download succeeds, compact.pl is run. This divides the entries for the different projects into files in the correct subfolders.
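A minimal sketch of the split step, assuming raw lines of the form <project> <title> <count> <bytes>; the function and folder names are illustrative and not taken from compact.pl:

```python
import os

def split_by_project(raw_path, out_root="projects"):
    """Append each line of a downloaded WikiStats file to a per-project file.

    Assumes lines of the form '<project> <title> <count> <bytes>'.
    """
    handles = {}
    try:
        with open(raw_path, encoding="utf-8", errors="replace") as raw:
            for line in raw:
                parts = line.split(" ", 1)
                if len(parts) != 2:
                    continue  # skip malformed lines
                project = parts[0]
                if project not in handles:
                    folder = os.path.join(out_root, project)
                    os.makedirs(folder, exist_ok=True)
                    handles[project] = open(
                        os.path.join(folder, os.path.basename(raw_path)),
                        "a", encoding="utf-8")
                handles[project].write(line)
    finally:
        for handle in handles.values():
            handle.close()
```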

The WikiStats files in the subfolders are kept for the duration of the longest-running statistics period.

Page and redirect files

During processing of the WikiStats files it is necessary to rewrite page names into page ids, and to identify redirects. The data to facilitate this comes from the SQL dumps, which are found via the backup index and are typically named <project>-<date>-pages-articles.xml.bz2 and <project>-<date>-redirect.sql.gz.

When the script extract.pl has built its internal structures, there will be a page hash and a page array holding references to arrays consisting mostly of zeros, plus some pairs of page ids and page titles. The page hash and the page array both point main entries and redirects to the same underlying arrays. This makes it possible to collect statistics for a main entry and its redirects as if they were identical.
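A minimal sketch of this sharing, with hypothetical names and a plain dict standing in for the page array; the point is that a lookup by redirect title and a lookup by page id both resolve to the same underlying array:

```python
def build_structures(pages, redirects):
    """pages: iterable of (page_id, title); redirects: dict of title -> target title."""
    page_hash = {}   # title -> shared array
    page_array = {}  # page id -> shared array (a dict stands in for a sparse array)

    for page_id, title in pages:
        shared = [page_id, title, 0, 0, 0, 0]  # mostly zeros: one counter per period
        page_hash[title] = shared
        page_array[page_id] = shared

    for source, target in redirects.items():
        if target in page_hash:
            page_hash[source] = page_hash[target]  # redirect points to the same array

    return page_hash, page_array

page_hash, page_array = build_structures(
    [(100, "Main page")], {"Main Page": "Main page"})
page_hash["Main Page"][2] += 1   # a hit counted through the redirect ...
assert page_array[100][2] == 1   # ... shows up on the main entry as well
```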

JSON files and SQL load file

When a new file with raw data is available it will be processed by extract.pl. This script collects the statistical data into a unique record for each page, even when the page is reached through a redirect. Each record is identified both by the page title and by the page id. When the collection is done, the records are formatted as JSON files. The files are written according to the page ids so that each record can be found in a bucket; each file represents one bucket, named bucket-<pageid mod bins>.JSON. There are a total of bins buckets, so the numbered files go from zero to bins - 1.
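A minimal sketch of the bucket writing, assuming the records have already been collected into a dict keyed by page id; the record layout, the value of bins and the file header fields are placeholders:

```python
import json
import time

def write_buckets(records, bins=256, norm=None):
    """Write records into bucket-<pageid mod bins>.JSON files.

    records: dict of page_id -> record structure; norm: normalizing factors.
    """
    buckets = {}
    for page_id, record in records.items():
        buckets.setdefault(page_id % bins, {})[str(page_id)] = record

    for number in range(bins):                  # numbered files go from 0 to bins - 1
        body = {
            "timestamp": int(time.time()),      # leading timestamp
            "norm": norm or {},                 # normalizing factors
            "pages": buckets.get(number, {}),   # records keyed by page id
        }
        with open("bucket-%d.JSON" % number, "w", encoding="utf-8") as out:
            json.dump(body, out)
```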

Each JSON file has a leading timestamp, normalizing factors, and then a number of structures identified by page id. An AJAX client can construct the name of the correct JSON file, download it, and then find the correct record inside the file.
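The client-side lookup reduces to reconstructing the bucket name from the page id; a sketch, using the same hypothetical file layout as above and a placeholder base URL in place of the actual AJAX request:

```python
import json
from urllib.request import urlopen

def fetch_record(page_id, bins=256, base_url="https://example.org/stats"):
    """Download the bucket that should hold page_id and return its record, if any."""
    url = "%s/bucket-%d.JSON" % (base_url, page_id % bins)
    with urlopen(url) as response:
        bucket = json.load(response)
    return bucket["pages"].get(str(page_id)), bucket["norm"]
```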

If there is a database server available the script can write an alternate form of the statistics. This is an SQL file that has to be loaded into a database, which is then made available to the public. The server will then serve data according to the actual page ids, typically through a REST interface.
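A sketch of the alternate output; the table name, column layout and period keys (h, d, w, m, as described under Record structure below) are assumptions, not the actual schema written by extract.pl:

```python
PERIODS = ("h", "d", "w", "m")  # hour, day, week, four-week period

def write_sql_load_file(records, path="statistics.sql"):
    """Write the collected records as an SQL load file (hypothetical schema)."""
    columns = ", ".join("%s_count INT, %s_sumsq BIGINT" % (p, p) for p in PERIODS)
    with open(path, "w", encoding="utf-8") as out:
        out.write("CREATE TABLE IF NOT EXISTS page_stats "
                  "(page_id INT PRIMARY KEY, %s);\n" % columns)
        for page_id, record in records.items():
            values = [str(page_id)]
            for p in PERIODS:
                stats = record.get(p, {})
                values.append(str(stats.get("count", 0)))
                values.append(str(stats.get("sumsq", 0)))
            out.write("INSERT INTO page_stats VALUES (%s);\n" % ", ".join(values))
```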

Record structure

Each record consists of an identifier and the named sets of statistics. If there is no identified need for page statistics below a certain threshold, they will not be included. Likewise, if there is no identified need for the page ids and titles of the redirects, they too will not be included.

Each set of statistics then consists of the count of page views within the given time frame (note that this is a period consisting of several sampling intervals) and a sum of squared sampling values. The statistics delivered to the client are thus not ready for presentation; they are a raw form for further processing.
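Given the count, the sum of squares and the number of sampling intervals in the period (for example taken from the normalizing factors in the file header), a client can recover the per-interval mean and standard deviation; a sketch, with hypothetical names:

```python
from math import sqrt

def interval_stats(count, sumsq, intervals):
    """Per-interval mean and standard deviation for one period of a record.

    count: total page views in the period (sum of the sampled values)
    sumsq: sum of the squared sampled values
    intervals: number of sampling intervals in the period
    """
    mean = count / intervals
    variance = max(sumsq / intervals - mean * mean, 0.0)
    return mean, sqrt(variance)
```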

Typically there will be data for four different periods: hour (h), day (d), week (w) and a four-week period (m). The four-week period is not a complete month, as that would create periodic variations due to the inclusion of a variable number of weekends. A four-week period should be sufficiently close in length to a month that it is possible to extrapolate the numbers.

Privacy issues

To limit the possibility of information leakage, especially if the JSON files are stored at external sites, some limits should be respected. No statistics should be produced for periods below a certain threshold, typically ten (10) page views. The size of the buckets should be chosen such that, for a given sampling interval, there will be at least ten (10) requests for any given bucket. There should be no derived statistics that attempt to identify the behaviour of specific users.
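A sketch of how the first limit could be enforced before the files are published, using the same hypothetical record layout as above; the bucket-size limit is a sizing decision rather than a filter and is only noted as a constant here:

```python
MIN_PAGE_VIEWS = 10       # suppress per-period statistics below this count
MIN_BUCKET_REQUESTS = 10  # target minimum requests per bucket per sampling interval

def apply_privacy_limits(record):
    """Drop any period whose page view count falls below the threshold."""
    return {
        period: stats
        for period, stats in record.items()
        if stats.get("count", 0) >= MIN_PAGE_VIEWS
    }
```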

The minimum number of page views before data is published protects the previous readers, while the minimum number of requests for a bucket protects the present reader. If the present reader edits the same page, both the IP address and the user name are potentially available to the provider of the statistics service.

If the statistics are served from a trusted server a database solution can be used, and a single record can then be served in each request. The minimum number of page views before any data is served should still be respected.

Derived statistics

Some statistics can be derived directly from the dataset, while others can be built by further processing. Interesting derived statistics are totals (which pages have the overall largest number of page views), climbers (which pages show an unusually large increase in the number of page views), and in the news (climbers which can also be found through dedicated news search engines).

Total numbers are fairly easy to calculate, as they are a simple sum of all page views within the given time period. An interesting variation is to sort the totals by category. Such sorting by category is not trivial, and it is not obvious how to make it work for overall totals.

Climbers are typically detected by a z-score-like function, z = (x - μ) / σ, or some similar function. Note that x - μ is the difference between the number of page views within the present period and the expected number of page views given the larger period. Likewise μ and σ come from the larger period.
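A sketch of such a detector, assuming the z-score reading above and the hypothetical record fields used earlier; the present period supplies the observed count, while the larger period supplies the expected value and spread:

```python
from math import sqrt

def climber_score(current_views, larger_count, larger_sumsq, larger_intervals):
    """z = (x - mu) / sigma, where mu and sigma come from the larger period."""
    mu = larger_count / larger_intervals
    variance = max(larger_sumsq / larger_intervals - mu * mu, 0.0)
    sigma = sqrt(variance) or 1.0   # avoid division by zero for flat pages
    return (current_views - mu) / sigma
```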

Such climbers should be limited so that ongoing work in newspapers, at schools, etc., does not get listed; only notable major changes should be listed. This also follows from the fact that functions for detecting climbers tend to amplify noise in the data set, and especially that rather small changes in pages that normally have few page views tend to get a much too high rating.

If the climbers are filtered through a news catalog, or a news search service, it is possible to correlate the climbers with known news articles. This makes it possible to lower the threshold for what to list and what to throw away as noise. Such tests either have to use a news feed or a news search engine. As the querying of a news search engine will be very intense, an explicit statement that such use is acceptable is necessary.