Database queries on MediaWiki
From Meta, a Wikimedia project coordination wiki
This is an article which is being developed from direct experience; please amend as appropriate
This article is intended to address the subject of database queries on Wikipedia. At present, direct access to the database is restricted to sysops and developers; it may possibly be the case that these privileges are extended at some future point. With this in mind, and given that some categories of users (including sysops) are unable to directly access some of the MySQL system tables, the following should serve as a useful starting point.
Users of query should keep in mind that some queries are very cumbersome to the server, and you are supposed to consult with more knowledgeable, experienced users before using queries.
Contents |
[edit] Cur tables
Table cur looks like a useful place to start. The table has the following fields:
- cur_id
- cur_id
- cur_namespace
- cur_title - the name of the article without the possible namespace prefix
- cur_text - the article's contents
- cur_comment - comments
- cur_user
- cur_user_text - user editing/creating table (user name or ip address)
- cur_timestamp - edit/creation timestamp
- cur_restrictions
- cur_counter
- cur_ind_title
- cur_ind_text
- cur_is_redirect - redirect flag
- cur_minor_edit - minor edit flag
- cur_is_new - New article flag 0 - old article 1 - new article
The following example SQL select will find the title and creator of the last 10 new articles:
- SELECT cur_title, cur_user_text FROM cur WHERE cur_is_new ORDER BY cur_timestamp DESC LIMIT 10
[edit] Useful sysop queries
[edit] Tracking vandals
If you are tracking a vandal or suspected vandal the following SQL statement will reveal all their handiwork:
- SELECT cur_title, cur_user_text FROM cur WHERE cur_user_text LIKE '%000.000.000.000%'
(Just substitute the IP address you're checking for the 000.000.000.000 block}
If you want to see quickly whether there's a pattern to their handiwork, you can add the field cur_text to your SQL select and query on an article which you presume to have been vandalised. This will expose the text of all versions of this article.
- SELECT cur_title, cur_user_text, cur_text FROM cur WHERE cur_title LIKE 'XXX' ORDER BY cur_timestamp DESC
(Just put the name of the article where it says XXX)
This will expose all the changes in reverse time order which have been effected to this article (except those prior to the move from the previous database).
[edit] Detecting spelling errors
To find all spelling mistakes use something like (substituting the misspelling you're interested in locating for teh, of course).
- SELECT cur_title from cur where cur_namespace=0 and cur_text regexp "[[:<:]]teh[[:>:]]" order by cur_title
[edit] Monitoring growth and change
Certain stats on wikipedia, such as number of articles and number of edits can be obtained from Special:Site_statistics page. But others are harder to obtain.
[edit] number of user contributions
The following two queries will return number of total edits by user, by the order of User ID. Contributions from IP address will be listed as ID = 0.
- SELECT cur_user, count(*) AS count FROM cur GROUP BY cur_user LIMIT 20
- SELECT old_user, count(*) AS count FROM old GROUP BY old_user LIMIT 20
(The number 20 should be replaced with other appropriate value.)
[edit] Distribution of sizes of articles
The following query can be used to count the number of article within specific size-range.
- SELECT COUNT(*) FROM cur WHERE LENGTH(cur_text)>1500 AND LENGTH(cur_text)<2500 AND cur_namespace=0
The numbers 1500 and 2500 in the above example should be replaced with other numbers. By systematically changing the article size ranges, one can obtain information about distribution of number of articles by size ranges.
[user construction: user:sjc ]