Wikistats/archive

From Meta, a Wikimedia project coordination wiki

Information on this page is outdated. For more information on Wikistats, please see MediaWiki.org.

Wikistats is a set of Perl scripts used to generate detailed statistics for Wikimedia projects. These statistics are available at stats.wikimedia.org. Erik Zachte is the author of the scripts, and he is also responsible for running them and posting the results. All statistics are produced by analyzing the database dumps, which are usually created monthly.

See Wikistats csv for information on accessing statistics in comma-separated values (CSV) format.

Documentation

A detailed explanation of some statistics is available. There is no comprehensive documentation, only scattered articles on specific items.

Source code

The scripts are stored at GitHub in wikimedia/analytics-wikistats.

Running Wikistats on your own MediaWiki site

The scripts have not yet been packaged for general consumption, but they can be made to work on any MediaWiki site without too much trouble.

You will need:

  • MediaWiki 1.5 or later (for the dumpBackup.php script, at least)
  • Perl version 5.8 or later (avoid 5.6, which has memory leaks)
  • Ploticus

Here are the (admittedly hacky) steps to generate the statistics. This is known to work on FreeBSD and Windows XP at least.

  1. Create a new directory and unzip the scripts there
    • Note that the script files are in DOS text format. If you are on Unix, you should convert them to Unix format.
    • You might also need to make WikiCounts.pl and WikiReports.pl executable.
    • You may need to update the contact / website information in the file WikiReportsNoWikimedia.pl
  2. Obtain a full XML dump of your MediaWiki data using the dumpBackup.php script as described at MediaWiki#Database_dump
  3. In the directory with the scripts, create these subdirectories:
    • counts
    • dumps
    • reports
  4. Rename your XML dump to en-latest-pages-meta-history.xml
  5. Copy the dump into the dumps directory: dumps/en-latest-pages-meta-history.xml
  6. The scripts support compressed XML dumps (gz, bz2, 7z), so any of these files is accepted:
    • dumps/en-latest-pages-meta-history.xml
    • dumps/en-latest-pages-meta-history.xml.gz
    • dumps/en-latest-pages-meta-history.xml.bz2
    • dumps/en-latest-pages-meta-history.xml.7z
  7. Run this command, where YYYYMMDD is the date the XML dump was taken:
    • WikiCounts.pl -x -i dumps -o counts -l en -d YYYYMMDD
      • This should create a bunch of CSV files in counts
  8. The WikiReportsOutputPlots.pl script is hardcoded to run pl to invoke Ploticus. On some systems (like Unix) the Ploticus executable is named ploticus. If that's the case on your system, edit the script to change the two occurrences of "pl -" to "ploticus -"
  9. Adapt WikiReportsNoWikimedia.pl so that site-specific details are used, such as your site name, admin name, and email address
  10. Run this command, using the same YYYYMMDD as above:
    • WikiReports.pl -x -i counts -o reports -l en -d YYYYMMDD
      • This should create a bunch of HTML, PNG, and SVG files in reports/EN
  11. In the reports directory, download these additional files which are referred to by the HTML in the reports/EN directory using a relative ../ path:
  12. Now you should be able to load reports/EN/index.html in a web browser and see the statistics.
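
On a Unix-like system, the steps above could be scripted roughly as follows. This is a minimal sketch, not part of the Wikistats distribution: the function names (prepare_layout, fix_line_endings, install_dump, use_ploticus_name, run_wikistats) and their arguments are made up for illustration, and calling the scripts through the perl interpreter is an assumption. The WikiCounts.pl and WikiReports.pl options are copied from steps 7 and 10, and the sed substitution mirrors the manual edit in step 8.

```shell
#!/bin/sh
# Sketch of the manual steps above. All function names are hypothetical.
# In every function, $1 is the directory holding the unzipped scripts.

# Step 3: create the subdirectories the scripts expect.
prepare_layout() {
    mkdir -p "$1/counts" "$1/dumps" "$1/reports"
}

# Step 1: strip DOS carriage returns, then make the entry points executable.
fix_line_endings() {
    for f in "$1"/*.pl; do
        [ -f "$f" ] || continue
        tr -d '\r' < "$f" > "$f.tmp" && mv "$f.tmp" "$f"
    done
    for f in "$1/WikiCounts.pl" "$1/WikiReports.pl"; do
        if [ -f "$f" ]; then chmod +x "$f"; fi
    done
}

# Steps 4-6: copy the dump ($2) under the expected name, keeping the
# compression suffix if it is one of the supported ones.
install_dump() {
    case "$2" in
        *.xml.gz)  ext=xml.gz ;;
        *.xml.bz2) ext=xml.bz2 ;;
        *.xml.7z)  ext=xml.7z ;;
        *.xml)     ext=xml ;;
        *) echo "unsupported dump file: $2" >&2; return 1 ;;
    esac
    cp "$2" "$1/dumps/en-latest-pages-meta-history.$ext"
}

# Step 8: only needed when the Ploticus executable is named ploticus.
# Assumes "pl -" occurs nowhere else in the script.
use_ploticus_name() {
    script="$1/WikiReportsOutputPlots.pl"
    if [ -f "$script" ]; then
        sed 's/pl -/ploticus -/g' "$script" > "$script.tmp" &&
            mv "$script.tmp" "$script"
    fi
}

# Steps 7 and 10: $2 is the dump date in YYYYMMDD form.
run_wikistats() {
    ( cd "$1" &&
      perl WikiCounts.pl -x -i dumps -o counts -l en -d "$2" &&
      perl WikiReports.pl -x -i counts -o reports -l en -d "$2" )
}

# Example (the dump path and date are placeholders):
#   prepare_layout . && fix_line_endings . && install_dump . /path/to/dump.xml.bz2
#   run_wikistats . YYYYMMDD
```

Steps 9 and 11 (adapting WikiReportsNoWikimedia.pl and fetching the additional files) still have to be done by hand.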

Notes for Windows XP

The same instructions apply on Windows, but you will need to install the following:

  1. Perl from ActiveState.
  2. Bzip2 for Windows from here (only needed if your dump is compressed with bz2). Unzip it and put the file bin\bzip2.exe in your Windows directory.
  3. Ploticus from here. Unzip it to your Windows directory. Alternatively, install Ploticus from Cygwin, which has built-in PNG support, unlike the generic Windows binary of Ploticus.
  4. Recent scripts can call du (disk space used), df (disk space free), and top (process list) to monitor system resources. Cygwin provides these programs. Resource monitoring is probably not useful for a wiki dump that contains fewer than 100,000 articles or fewer than a couple of million revisions. As of WikiCounts.pl 2.1, resources are not traced unless you specify the -r option.
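
If a POSIX shell is available (for example the one that ships with Cygwin), a quick check like the one below can show which of the helper programs mentioned above are on the PATH. This is only an illustrative sketch: check_tools is a hypothetical helper, not something provided by Wikistats, and the list of tool names is taken from the notes above.

```shell
#!/bin/sh
# check_tools is a hypothetical helper: it prints one line per program
# name and succeeds only if every program was found on the PATH.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}

# Programs named in the notes above; bzip2 matters only for bz2 dumps,
# and du, df and top only when resource tracing (-r) is used.
check_tools perl bzip2 du df top || echo "Some tools are missing; see the list above."

# Ploticus may be installed as either pl or ploticus.
check_tools pl >/dev/null || check_tools ploticus >/dev/null ||
    echo "Ploticus not found under the name pl or ploticus."
```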

Alternate Method

Alternatively, you can run the commands below, which accept an uncompressed dump called pages_full_en.xml.

  • WikiCounts.pl -x -i dumps -o counts -l en -d YYYYMMDD -t -m wp
  • WikiReports.pl -x -i counts -o reports -l en -d YYYYMMDD -t -m wp

Quality Survey

During Wikimania 2006, Jimbo gave a keynote speech in which he asked the community to focus less on counts and more on quality. If you are interested in discussing how Wikistats could contribute, see Wikistats/Measuring Article Quality.

Serving multiple languages

Wikistats supports multiple languages. However, users don't always spot the language links at the top right. To serve the "right" one for users automatically from a common domain on Apache, the following approach is suggested:

  • Create an index.var file in the root directory with sections for each language in the following format:
URI: index; vary="language"

URI: index.CS
Content-language: cs
Content-type: text/html 

URI: index.DE
Content-language: de
Content-type: text/html

...
  • Specify the following Apache directives in the apache.conf or local .htaccess file:
# Default language; adjust as appropriate
LanguagePriority en
Options +Indexes
DirectoryIndex index.var Sitemap.htm
RewriteEngine On
RewriteRule index\.(..)$ /$1/Sitemap.htm [R=302,L]

The rewriting/redirection ensures the user ends up in the right base directory. Specifying "XX/Sitemap.htm" as the URI appears to work at first, but the links on that page break because they resolve against the root directory. Regular Redirect directives do not appear to work in combination with content negotiation.
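
With many languages, writing index.var by hand gets tedious. The sketch below generates the sections in the format shown above from a list of language codes. make_type_map is a hypothetical helper, not part of Wikistats, and it assumes two-letter codes, because the RewriteRule above captures exactly two characters after "index.".

```shell
#!/bin/sh
# make_type_map is a hypothetical helper: it prints an index.var type map
# with one section per language code given as an argument.
make_type_map() {
    printf 'URI: index; vary="language"\n'
    for lang in "$@"; do
        # The URI uses the upper-case code, Content-language the lower-case one.
        upper=$(printf '%s' "$lang" | tr '[:lower:]' '[:upper:]')
        printf '\nURI: index.%s\nContent-language: %s\nContent-type: text/html\n' \
            "$upper" "$lang"
    done
}

# Print the map for Czech and German to standard output.
make_type_map cs de
```

Redirecting the output, for example make_type_map cs de > index.var, would produce the file that the DirectoryIndex directive refers to.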
