Research:Editor month dataset

From Meta, a Wikimedia project coordination wiki

An editor month dataset aggregates some of the information present in the revision , user, and other tables of different wikis to make it easier to calculate Wikimedia-wide editing metrics.

Current version[edit]

There is currently an editor month dataset in the staging database on the analytics-store MariaDB server. As of December 2017, the Contributors team at the Wikimedia Foundation uses it to calculate certain key metrics, so it contains all the data up to that date and is likely to be updated monthly. You will need research database access to use it.

mysql:research@analytics-store.eqiad.wmnet [(none)]> describe staging.editor_month;
+-------------------+------------------+------+-----+---------+-------+
| Field             | Type             | Null | Key | Default | Extra |
+-------------------+------------------+------+-----+---------+-------+
| wiki              | varbinary(255)   | NO   | PRI | NULL    |       |
| month             | date             | NO   | PRI | NULL    |       |
| local_user_id     | int(10) unsigned | NO   | PRI | NULL    |       |
| user_name         | varbinary(255)   | NO   | MUL |         |       |
| edits             | int(10) unsigned | NO   |     | 0       |       |
| content_edits     | int(10) unsigned | NO   |     | 0       |       |
| deleted_edits     | int(10) unsigned | NO   |     | 0       |       |
| mobile_web_edits  | int(10) unsigned | NO   |     | 0       |       |
| mobile_app_edits  | int(10) unsigned | NO   |     | 0       |       |
| visual_edits      | int(10) unsigned | NO   |     | 0       |       |
| bot               | tinyint(1)       | NO   |     | 0       |       |
| user_registration | datetime         | YES  |     | NULL    |       |
+-------------------+------------------+------+-----+---------+-------+
12 rows in set (0.02 sec)
Field Explanation
wiki The database name (arwiki, elwiktionary) of the wiki
month The month. Stored as a date (specifically, the first day of that month) rather than a string to make date operations possible.
local_user_id Unique only within that particular wiki. Mainly for easy joins with a wiki's application database. All anonymous editors for a given wiki and given month are grouped into a single row with local_user_id = 0.
user_name The text of the editor's name. Unlike the local_user_id, this is globally unique so it's useful for calculating global user numbers. It is an empty string for the rows representing anonymous users (rather than NULL, which would mean the row could not be used in keys).
edits Number of edits made. Includes edits to deleted pages, which are also broken out in deleted_edits.
content_edits Number of edits to the main namespace. This is not exactly the same thing as edits to content namespaces, but it's a pretty close approximation and there's no way to access the list of content namespaces in the database anyway.
deleted_edits Subset of edits made to pages which were subsequently deleted and remained that way at the time of the query. Unlike edits, this is stateful because pages can be deleted and undeleted at any time.
mobile_web_edits Subset of edits made using the mobile web interface (tagged with mobile edit and either tagged with mobile web edit or not tagged with mobile app edit)
mobile_app_edits Subset of edits made using one of the mobile apps (tagged with mobile app edit)
visual_edits Subset of edits made with the visual editor opened at any point in the session (tagged with visualeditor or visualeditor-switched).
bot True if the account has ever been tagged as a bot on that particular wiki (ug_group = "bot" OR ufg_group = "bot"). This assumes the main reason bots have the bot flag taken away is that they've stopped operating or the community has changed its mind.
user_registration Timestamp of the user's registration. Stored as a datetime rather than a string to make date operations easier. It is NULL for many users who registered before MediaWiki started storing it.

Rejected fields[edit]

  • global_user_id: although Central Auth does assign each global user a unique numerical ID, this is almost never used. Central Auth itself actually uses the text of the (globally unique) user name as the ID.
  • api_edits: not clear how we can detect this or what utility it would have.

Query[edit]

This is a query that generates a table meeting these requirements.

insert into staging.editor_month_new
select
    database() as wiki,
    str_to_date(concat(rev_month, "01"), "%Y%m%d") as month,
    local_user_id,
    ifnull(user_name, "") as user_name,
    ifnull(sum(edits), 0) as edits,
    ifnull(sum(content_edits), 0) as content_edits,
    ifnull(sum(edits * deleted), 0) as deleted_edits,
    ifnull(sum(mobile_web_edits), 0) as mobile_web_edits,
    ifnull(sum(mobile_app_edits), 0) as mobile_app_edits,
    ifnull(sum(visual_edits), 0) as visual_edits,
    if(ug_group = "bot" or ufg_group = "bot", 1, 0) as bot,
    str_to_date(user_registration, "%Y%m%d%H%i%S") as user_registration
from
(
select
    left(rev_timestamp, 6) as `rev_month`,
    rev_user as `local_user_id`,
    count(*) as `edits`,
    sum(page_namespace = 0 or defined_in_site_config is not null) as content_edits,
    sum(
        ts_tags like "%mobile edit%" and 
        (ts_tags like "%mobile web edit%" or ts_tags not like "%mobile app edit%") 
    ) as mobile_web_edits,
    sum(ts_tags like "%mobile app edit%") as mobile_app_edits,
    sum(ts_tags like "%visualeditor%") as visual_edits,
    0 as `deleted`  
from revision
left join page on rev_page = page_id
left join tag_summary on rev_id = ts_rev_id
left join datasets.extra_content_namespaces on database() = wiki and page_namespace = namespace
-- time restriction here
group by left(rev_timestamp, 6), rev_user

union all

select
    left(ar_timestamp, 6) as `rev_month`,
    ar_user as `local_user_id`,
    count(*) as `edits`,
    sum(ar_namespace = 0 or defined_in_site_config is not null) as content_edits,
    sum(
        ts_tags like "%mobile edit%" and 
        (ts_tags like "%mobile web edit%" or ts_tags not like "%mobile app edit%") 
    ) as mobile_web_edits,
    sum(ts_tags like "%mobile app edit%") as mobile_app_edits,
    sum(ts_tags like "%visualeditor%") as visual_edits,
    1 as `deleted`
from archive
left join tag_summary on ar_rev_id = ts_rev_id
left join datasets.extra_content_namespaces on database() = wiki and ar_namespace = namespace
-- time restriction here
group by left(ar_timestamp, 6), ar_user
) revs
left join user on local_user_id = user_id
left join user_groups on local_user_id = ug_user and ug_group = "bot"
left join user_former_groups on local_user_id = ufg_user and ufg_group = "bot"
group by month, local_user_id;

To consider[edit]

  • Public dumps?
  • Make it accessible on the public replicas (e.g. Quarry)?