Database queries on MediaWiki

From Meta, a Wikimedia project coordination wiki

Jump to: navigation, search

This is an article which is being developed from direct experience; please amend as appropriate

This article is intended to address the subject of database queries on Wikipedia. At present, direct access to the database is restricted to sysops and developers; it may possibly be the case that these privileges are extended at some future point. With this in mind, and given that some categories of users (including sysops) are unable to directly access some of the MySQL system tables, the following should serve as a useful starting point.

Users of query should keep in mind that some queries are very cumbersome to the server, and you are supposed to consult with more knowledgeable, experienced users before using queries.

Contents

[edit] Cur tables

Table cur looks like a useful place to start. The table has the following fields:

cur_id
cur_id
cur_namespace
cur_title - the name of the article without the possible namespace prefix
cur_text - the article's contents
cur_comment - comments
cur_user
cur_user_text - user editing/creating table (user name or ip address)
cur_timestamp - edit/creation timestamp
cur_restrictions
cur_counter
cur_ind_title
cur_ind_text
cur_is_redirect - redirect flag
cur_minor_edit - minor edit flag
cur_is_new - New article flag 0 - old article 1 - new article


The following example SQL select will find the title and creator of the last 10 new articles:

  • SELECT cur_title, cur_user_text FROM cur WHERE cur_is_new ORDER BY cur_timestamp DESC LIMIT 10

[edit] Useful sysop queries

[edit] Tracking vandals

If you are tracking a vandal or suspected vandal the following SQL statement will reveal all their handiwork:

  • SELECT cur_title, cur_user_text FROM cur WHERE cur_user_text LIKE '%000.000.000.000%'

(Just substitute the IP address you're checking for the 000.000.000.000 block}

If you want to see quickly whether there's a pattern to their handiwork, you can add the field cur_text to your SQL select and query on an article which you presume to have been vandalised. This will expose the text of all versions of this article.

  • SELECT cur_title, cur_user_text, cur_text FROM cur WHERE cur_title LIKE 'XXX' ORDER BY cur_timestamp DESC

(Just put the name of the article where it says XXX)

This will expose all the changes in reverse time order which have been effected to this article (except those prior to the move from the previous database).

[edit] Detecting spelling errors

To find all spelling mistakes use something like (substituting the misspelling you're interested in locating for teh, of course).

  • SELECT cur_title from cur where cur_namespace=0 and cur_text regexp "[[:<:]]teh[[:>:]]" order by cur_title

[edit] Monitoring growth and change

Certain stats on wikipedia, such as number of articles and number of edits can be obtained from Special:Site_statistics page. But others are harder to obtain.

[edit] number of user contributions

The following two queries will return number of total edits by user, by the order of User ID. Contributions from IP address will be listed as ID = 0.

  • SELECT cur_user, count(*) AS count FROM cur GROUP BY cur_user LIMIT 20
  • SELECT old_user, count(*) AS count FROM old GROUP BY old_user LIMIT 20

(The number 20 should be replaced with other appropriate value.)

[edit] Distribution of sizes of articles

The following query can be used to count the number of article within specific size-range.

  • SELECT COUNT(*) FROM cur WHERE LENGTH(cur_text)>1500 AND LENGTH(cur_text)<2500 AND cur_namespace=0

The numbers 1500 and 2500 in the above example should be replaced with other numbers. By systematically changing the article size ranges, one can obtain information about distribution of number of articles by size ranges.

[user construction: user:sjc ]

[edit] See Also