Data analysis/mining of Wikimedia wikis

From Meta, a Wikimedia project coordination wiki

See also e.g. Research:Data or wikitech:Analytics#Datasets

This page aims to collect relative strengths and weaknesses of different approaches to data mining Wikimedia wikis. Other data sources (e.g. squid logs) are out of scope.

XML Dumps

  • XML dumps act as an abstraction layer which shields users from physical database changes
These schema changes were not uncommon in the early years, and are probably less common now
  • Existing support structure
  • Dumps available for all projects / languages
  • Suited for offline processing by community / researchers
  • Only part of the database contents is available in XML format (some in SQL dump format, some not at all)
  • Dump generation is a lengthy process; despite improvements, even in the best circumstances some dumps trail the live data by many weeks
The English Wikipedia dump job runs for about two weeks, and the start of the job can be up to two weeks after the close of the month
  • Scripts for dump generation are maintenance intensive
  • Restructuring of the dumps could make them more download friendly
This is a long-standing issue, but not a trivial one (for instance, incremental dumps would still require updates due to article/revision deletions)
  • Dump generation, although much improved, is still an inherently unreliable process
    • Code is intertwined with the general parser code, therefore:
    • The process is suspended during MediaWiki code upgrades
    • No regression tests

Wikistats scripts

  • Lots of functionality, developed in constant dialog with core community
  • Produces large set of intermediate csv files
Many of these are reused by the community and researchers (but lack proper documentation)
  • Monthly batch update for all reports
  • New functionality and bug fixes work for all historic months, as all reports are rebuilt from scratch every time
This has a flip side as well: long run times, and deleted content vanishes from the stats
  • Serves all Wikimedia projects and languages on an equal footing
(some toolserver projects also have this approach)
  • Many reports are multilingual
(but much work is still needed here; it seems the way to go)
  • Wikistats portal as navigational aid
  • Extensive, well formatted activity log that helps to track program flow
This is a great tool for bug fixing, but also for learning what the script does: it partly compensates for the lack of documentation
  • Prerendered reports, not designed for ad hoc querying
  • Hardly any documentation (but see activity log above)
  • Many reports are too rich in details/granularity for casual readers
  • Some scripts score low on maintainability
    • They contain many optimization tweaks, some of which may be entirely obsolete with current hardware resources
    • Not KISS: the WikiReports section contains lots of code to fine-tune layout (even per project), with added complexity as a result
    • Some scripts still contain test code tuned to one particular test environment (file paths)
    • Where WikiCounts might be seen as largely self-documenting code (sensible function and variable names, etc.), this is less so for WikiReports
(as a one-person hobby project for many years, this simply did not have the highest priority)
  • WikiCounts job is not restartable.
WikiCounts has evolved since 2003, when a full English dump took 20 minutes on one thread rather than a full week on 15.
In a new design, reprocessing of the collected data would be put in a separate step, in order to maximize restartability.

Other dump clients

  • Existing code - 3rd-party scripts in several scripting languages can process the XML dumps
  • Simplicity - these scripts tend to be single purpose and are often very simple and efficient, so they can be a good starting point for exploring the dumps
  • Support - presumably only some 3rd-party scripts are supported; of course, their simplicity makes this less of an issue

Mediawiki API

  • Well designed feature set
  • Lots of functionality
  • Acts as abstraction layer
(like XML dumps, shields user from physical database changes)
  • Data from live database, always up to date results
  • Supports several data formats
  • Expertise available among staff and community
  • Even with a special bot flag, clients are limited to x calls per second, each call returning a limited quantity of data
  • Inherently too slow to transfer large volumes of data
  • Data from live database also means
    • Regression testing more difficult
    • Limited ability to rerun reports for earlier periods, whether based on new insights or because of bug fixes
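
For illustration, a minimal client might build query URLs like the following. `action=query` and `list=recentchanges` are real API parameters, but the endpoint and limit shown are only examples, and a real client must also respect the call-rate limits noted above.

```python
# Sketch: construct a MediaWiki API GET request URL.
from urllib.parse import urlencode

def api_url(endpoint, **params):
    """Return a GET URL for a MediaWiki API query."""
    params.setdefault("format", "json")   # one of the supported formats
    return endpoint + "?" + urlencode(sorted(params.items()))

# Example: the 50 most recent changes on English Wikipedia.
url = api_url("https://en.wikipedia.org/w/api.php",
              action="query", list="recentchanges", rclimit=50)
```

Fetching `url` with any HTTP client then returns the live, always up-to-date data the section describes.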

SQL

  • Flexibility - ad hoc queries easy
  • Access to all data
  • Querying language widely known, even among advanced end users
  • Too slow for some purposes
Compare the generation of the English full-history dump: even with 95%+ reuse of cached data it takes a full week on 15 nodes
  • No abstraction layer (see XML dumps above)
  • Requires good knowledge of database schema
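
An ad-hoc query of the kind this section has in mind might look as follows. It is run here against an in-memory SQLite stand-in for the MediaWiki `revision` table (the column names follow the older schema, which underlines the schema-knowledge requirement noted above).

```python
# Illustrative ad-hoc query: edits per month, grouped on the first
# six characters of the MediaWiki-style YYYYMMDDHHMMSS timestamp.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revision (rev_id INTEGER, rev_timestamp TEXT)")
conn.executemany(
    "INSERT INTO revision VALUES (?, ?)",
    [(1, "20100110120000"), (2, "20100120090000"), (3, "20100201000000")])

edits_per_month = conn.execute(
    "SELECT substr(rev_timestamp, 1, 6) AS month, COUNT(*) "
    "FROM revision GROUP BY month ORDER BY month").fetchall()
```

On a real replica the same SQL runs unchanged, which is why the querying language being widely known is such an advantage.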

Live Database

Traditionally widely used by admins

  • The 'real thing': reliability of data as good as it gets
  • What third parties are most likely to use and help with: see for instance the existing StatMediaWiki
  • Performance - only suitable for trivial queries (with the risk that they are not trivial after all)
  • Access - few people have MySQL access to Wikimedia live database, for obvious reasons

Slave Database at WMF

  • Quick to setup, with limited effort
  • Ability to do user analysis by geography (?)
  • Complex ad-hoc queries for editor history without disrupting site operations

  • No access outside WMF staff
This will hamper reusability of the code, although in theory code might be reused on the toolserver
  • Possibility of losing complex queries
  • Reusability of queries on all wikis
  • Need for extra work to create API or script calls for queries found to be useful
  • Bottleneck of complex queries (?)

Tool Server

  • Existing support structure
  • Large community of volunteer devs
  • Time sharing puts upper limit on resource usage
  • Due to its scale and small staff, less tuned to 24/7 operations

NoSQL

  • Built-in data replication & failover (all implementations?)
  • Scales Horizontally
  • Developed by leading Web 2.0 players
  • Designed for really huge data collections
  • Optimized for fast response times on ad hoc queries
  • Still maturing technology
  • Compared to MySQL few large implementations yet
  • Limited expertise available
    • How important is the choice of implementation?
  • Export from live MySQL databases will require extensive effort (?)
  • Needs good initial design for best performance
  • New technology frontier, will appeal to potential new volunteers and staff who want to make their mark
  • Will it be feasible to compact stored information over time (e.g. only preserve aggregated data after x days)?
Some NoSQL solutions reputedly are better tuned to adding new info than to updating/filtering existing info
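
The compaction idea could be sketched like this; the event shape and the cutoff date are assumptions for illustration, not an existing design.

```python
# Sketch: keep raw events newer than a cutoff, collapse older ones
# into per-day counts and drop their detail.
from collections import Counter
from datetime import date

def compact(events, keep_after):
    """Split (day, payload) events into raw recent events and
    aggregated daily counts for everything before keep_after."""
    recent, daily = [], Counter()
    for day, payload in events:
        if day >= keep_after:
            recent.append((day, payload))  # full detail retained
        else:
            daily[day] += 1                # detail discarded, count kept
    return recent, daily
```

In an append-friendly NoSQL store this would typically run as a periodic job that writes the aggregates and deletes the expired raw rows.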

Cassandra

  • Decentralized - no master server, no single point of failure
  • Elasticity - machines can be added on the fly, read and write scale linearly with nodes added
  • Fault-tolerant - redundant storage on multiple nodes, replication over multiple data centers, hot swapping of failed nodes
  • Tunable consistency - from 'writes never fail' to 'deliver fast at the expense of replication integrity'