Research:Defining monthly active editors, 2016
Active editors is a key metric used in the Wikimedia movement. Current definitions agree that an active editor is a registered user who makes 5 or more edits a month to content pages, but there are subtle differences in how this general rule is applied. The WMF Editing Analysis (Neil P. Quinn), Research (Leila Zia), and Analytics (Erik Zachte) teams are working to agree on a fully specified version of this definition.
- Which namespaces are considered content?
- Decision: the main namespace, extra configured content namespaces, and any non-configured content namespaces whose absence would have a major impact on the numbers (the File namespace on Commons is the only current example). However, we should also contact communities who have de-facto content namespaces to see if they would be willing to update their configurations.
- Are deleted pages considered?
- Decision: yes. This makes the metrics more stateless, so the numbers are more consistent no matter when they are calculated. Erik used to be a strong opponent of this, on the grounds that edits to deleted pages were probably unproductive, but he did some research and found that the number of edits excluded this way was only about 0.5% of the number of reverted edits, which were not excluded. In addition, Neil and others felt that this metric should focus on capturing editing activity, rather than also trying to capture productivity, which could be better captured in a separate metric.
- Are all content pages counted, or are non-countable pages (mainly redirects) included as well?
- Decision: all content pages will be counted, which will significantly simplify calculation at the (acceptable) expense of making this definition less similar to the article definition. Erik believes that this is the primary cause of the discrepancy between the Wikistats and Editing Analysis numbers, so this will likely cause the Wikistats numbers to change substantially.
- What do we do about the fact that a page's namespace can change long after an edit is made? (The analogous issue is one reason why we decided to count deleted edits.)
- Decision: if feasible, an edit should be counted if the page was in a content namespace at the time of the edit, in order to make the metric more stateless. This is generally infeasible with calculation done using the dumps or the databases, but it is likely to be possible using the Hadoop Data Lake.
- What do we do when sites start configuring already-existing namespaces as content?
- How should bots be identified?
- Decision: we should treat accounts as bots if they have ever had the bot user right (because some projects remove the right from inactive bots) or if they match the Wikistats bot regex (with the corresponding whitelist, which currently has three users on it). Any account that's ever been flagged on any wiki should be treated as a bot globally, since it's much more likely that a cross-wiki bot makes unflagged edits on a small wiki than that someone uses the same account for a bot on one wiki and normal editing on another (or that someone is mistakenly flagged as a bot on a small wiki).
- Which wikis should be included in the global number?
- Decision: we should adopt the list used by Wikistats, which includes the Foundation wiki and Meta where the Editing Analysis list does not.
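The "ever flagged on any wiki" part of the bot decision above can be illustrated with a short Python sketch. It assumes a hypothetical input format, a sequence of (wiki, user, group, action) rights-change events, and hypothetical wiki and user names. The name-pattern check and its whitelist are left out here.

```python
def accounts_ever_flagged_as_bot(rights_events):
    """Return the users who have ever been added to the bot group on any wiki.

    rights_events is a hypothetical input: an iterable of
    (wiki, user_name, group, action) tuples, with action "add" or "remove".
    """
    bots = set()
    for _wiki, user, group, action in rights_events:
        # A later "remove" never takes the user out again, because some
        # projects remove the right from inactive bots.
        if group == "bot" and action == "add":
            bots.add(user)
    return bots


# Hypothetical events: ExampleSync was flagged and later unflagged on one wiki.
events = [
    ("wiki_a", "ExampleSync", "bot", "add"),
    ("wiki_a", "ExampleSync", "bot", "remove"),
    ("wiki_b", "Alice", "sysop", "add"),
]
global_bots = accounts_ever_flagged_as_bot(events)
```

Because the result is a single global set, an account flagged on one wiki is treated as a bot on every wiki, which matches the reasoning about cross-wiki bots given in the decision.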
Comparison of recent data
|May 2016||80 685||86 608|
|June 2016||78 207||83 323|
|July 2016||73 564||78 118|
|August 2016||74 202||78 556|
|September 2016||77 145||80 453|
|October 2016||79 329|
Editing Analysis definition
The number is calculated from an editor-month table using the following SQL query:
select
    month,
    count(*) as active_editors
from (
    select
        month,
        user_name,
        sum(content_edits) as content_edits
    from staging.editor_month
    where
        bot = 0 and
        local_user_id != 0
    group by month, user_name
) global_edits
where content_edits >= 5
group by month;
The editor-month table is aggregated from both the revision and archive tables, and the content_edits column includes all edits from both.
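For readers without access to the staging database, the aggregation performed by the query can be mirrored in plain Python. This is a minimal sketch: the row fields follow the columns named in the query, while the function name and the sample rows are hypothetical.

```python
from collections import defaultdict


def monthly_active_editors(rows, threshold=5):
    """Count, per month, the users with at least `threshold` content edits.

    Each row is a dict with the columns used in the query:
    month, user_name, content_edits, bot, local_user_id.
    """
    edits = defaultdict(int)  # (month, user_name) -> content edits on all wikis
    for row in rows:
        # Same filter as the WHERE clause of the inner query.
        if row["bot"] != 0 or row["local_user_id"] == 0:
            continue
        edits[(row["month"], row["user_name"])] += row["content_edits"]

    active = defaultdict(int)  # month -> number of active editors
    for (month, _user), total in edits.items():
        if total >= threshold:
            active[month] += 1
    return dict(active)


# Hypothetical sample: one row per user, wiki, and month.
sample = [
    {"month": "2016-05", "user_name": "Alice", "content_edits": 3, "bot": 0, "local_user_id": 11},
    {"month": "2016-05", "user_name": "Alice", "content_edits": 2, "bot": 0, "local_user_id": 42},
    {"month": "2016-05", "user_name": "Bob", "content_edits": 4, "bot": 0, "local_user_id": 12},
    {"month": "2016-05", "user_name": "ExampleSync", "content_edits": 100, "bot": 1, "local_user_id": 13},
]
result = monthly_active_editors(sample)
```

In this sample only Alice counts as active: her edits on two wikis add up to 5, Bob stays below the threshold, and the bot row is filtered out. Grouping by user name rather than by wiki is what makes the figure a global one.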
Included are edits made to all pages which were in content namespaces when the aggregating query was run. This may be different from the namespace the page was in when the edit was made. Content namespaces are defined as the following:
- the main namespace
- other namespaces defined by wikis as content namespaces (currently 69 extra namespaces across the cluster).
- other namespaces that are not configured as content namespaces by their wikis, but that in the judgment of the Editing Analysis team contain user-facing content. Currently (Nov 2016), these namespaces are the File and Category namespaces on Commons, the Property namespace on Wikidata, and the Grants, Research, Iberocoop, and Participation namespaces on Meta.
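The three-part rule above can be sketched as a small lookup. This is an illustration only: the function name, the `configured` input (a mapping from wiki to the namespaces that wiki itself marks as content), and the "examplewiki" entry are hypothetical, and namespaces are identified by name rather than by numeric ID.

```python
# Namespaces treated as content by judgment of the Editing Analysis team,
# as listed above (Nov 2016). Keys are wiki database names.
EXTRA_CONTENT = {
    "commonswiki": {"File", "Category"},
    "wikidatawiki": {"Property"},
    "metawiki": {"Grants", "Research", "Iberocoop", "Participation"},
}

MAIN = ""  # the main namespace has no name prefix


def is_content_namespace(wiki, namespace, configured):
    """Apply the three rules in order: main, wiki-configured, manual additions."""
    if namespace == MAIN:
        return True
    if namespace in configured.get(wiki, set()):
        return True
    return namespace in EXTRA_CONTENT.get(wiki, set())


# Hypothetical configuration for a made-up wiki.
configured = {"examplewiki": {"Portal"}}
```

For example, `is_content_namespace("commonswiki", "File", configured)` is true through the manual additions, while the same namespace on a wiki without such an entry is not counted.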
Users are classified as bots and excluded if they have ever been in the bot user group on that wiki.
Wikis included in global value
All the wikis with the following site_group values are included:
('commons', 'incubator', 'mediawiki', 'sources', 'species', 'wikibooks', 'wikidata', 'wikinews', 'wikipedia', 'wikiquote', 'wikisource', 'wikiversity', 'wikivoyage', 'wiktionary')
This means that Meta-Wiki and other movement-internal wikis like the Wikimania wikis and affiliate wikis are not included.
Wikistats definition
Note: the definition on the data page describes the old situation and needs to be updated. Only namespace 6 on Commons is hard-coded; all others are collected via the API.
Data is based on content dumps, so edits to pages which have since been deleted are not included.
Non-countable namespaces and redirects are excluded.
Pages without an internal link are not excluded, because Wikistats only processes stub dumps and article content is not available in those dumps.
All edits to pages currently in the following namespaces are included:
- the main namespace
- As mw:Analytics/Metrics definitions says: "Wikistats dynamically establishes extra content namespaces per wiki via the API (since July 2013, for all history)".
- on Commons, the File namespace (6) is added (hard-coded, as it did not appear in the API results). The Category namespace was also force-included but has recently been excluded, as it is not included on any other wiki and it raised eyebrows that Wikistats article counts for Commons exceeded the online article counts by 5 million.
Bots are excluded. Recap on how Wikistats detects bots:
- Is a name registered as a bot, in other words, is there a bot flag in the most recent dump of the user group table? Note that when a bot flag is removed in the user group table, all edits by that user name are no longer considered bot edits. (Does this ever happen?)
- Does it sound like a bot? (Nowadays many wikis allow such user names only for bots.) Wikistats is rather restrictive about what sounds like a bot. In Perl: if (($user =~ /bot\b/i) || ($user =~ /_bot_/i)). This means that a name sounds like a bot only where 'bot' is at the end of the string, is followed by a non-alphanumeric character, or is preceded and followed by underscores (in MediaWiki often a placeholder for spaces). It would be interesting (but a bit more work) to break this down by language. I guess some languages are more prone to have 'bot' in real names than others.
- Is it known to be an unregistered bot? (English Wikipedia has a list of false negatives.) Erik copied that list long ago but does not keep it auto-updated.
- Is a name flagged as a bot on at least 10 wikis? Then treat it as a bot on any wiki within the project (this was more relevant in the past, when user names could easily collide). The basic rationale is that on smaller wikis bot registrations are often forgotten. With SUL it is unlikely that people use the same name as a bot on one wiki and as a regular user on another wiki.
- Three names that sound like a bot are hard-coded exceptions (people who wrote to me to tell me they are human): Paucabot|Niabot|Marbot
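The checks above can be combined into one predicate. The following Python sketch is not the Wikistats code: it renders the quoted Perl patterns with Python's `re` module, and the function names, the `flagged` mapping (user name to the set of wikis where the bot flag is set), and the `known_unflagged` set are assumptions made for illustration. It also ignores the restriction of the 10-wiki rule to wikis within one project.

```python
import re

# Python rendering of the two Perl patterns quoted above.
BOT_NAME = re.compile(r"bot\b|_bot_", re.IGNORECASE)
# The three hard-coded exceptions listed above.
HUMAN_EXCEPTIONS = {"Paucabot", "Niabot", "Marbot"}
MIN_FLAGGED_WIKIS = 10


def sounds_like_bot(user):
    """True if the name matches the pattern and is not a listed exception."""
    return user not in HUMAN_EXCEPTIONS and BOT_NAME.search(user) is not None


def is_bot(user, wiki, flagged, known_unflagged=frozenset()):
    """Apply the four checks: local flag, 10-wiki rule, known list, name."""
    wikis = flagged.get(user, set())
    return (
        wiki in wikis
        or len(wikis) >= MIN_FLAGGED_WIKIS
        or user in known_unflagged
        or sounds_like_bot(user)
    )


# Hypothetical data: an account flagged on ten wikis other than the one asked about.
flagged = {"QuietHelper": {f"wiki{i}" for i in range(1, 11)}}
```

Note that underscore counts as a word character for `\b`, so a name like "Some_bot_name" is caught only by the second pattern. This is why the Perl code needs both.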
Wikis included in the global value
All the wikis with the following site_group values are included ([!] marks those not in the Editing Analysis list above):
('commons', 'incubator', 'foundation' [!], 'mediawiki', 'meta' [!], 'sources', 'species', 'wikibooks', 'wikidata', 'wikinews', 'wikipedia', 'wikiquote', 'wikisource', 'wikiversity', 'wikivoyage', 'wiktionary')
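The difference between the two wiki lists can be checked mechanically. A small Python sketch, with the two sets copied from the lists on this page (the variable names are hypothetical):

```python
EDITING_ANALYSIS_GROUPS = {
    "commons", "incubator", "mediawiki", "sources", "species", "wikibooks",
    "wikidata", "wikinews", "wikipedia", "wikiquote", "wikisource",
    "wikiversity", "wikivoyage", "wiktionary",
}

WIKISTATS_GROUPS = {
    "commons", "incubator", "foundation", "mediawiki", "meta", "sources",
    "species", "wikibooks", "wikidata", "wikinews", "wikipedia", "wikiquote",
    "wikisource", "wikiversity", "wikivoyage", "wiktionary",
}

# Groups counted by Wikistats but not by Editing Analysis, and the reverse.
only_wikistats = sorted(WIKISTATS_GROUPS - EDITING_ANALYSIS_GROUPS)
only_editing_analysis = sorted(EDITING_ANALYSIS_GROUPS - WIKISTATS_GROUPS)
```

Here `only_wikistats` contains the two groups marked [!], and `only_editing_analysis` is empty.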
Edits to pages which have since been deleted are not included.
Only countable pages are included, using the standard definition of pages which contain at least one internal link and are not a redirect.
Only pages in content namespaces (the main namespace and extra configured content namespaces) are included.
Bots are excluded and defined by the following characteristics:
- users with a bot flag on that particular wiki
- users with a bot flag on at least 10 other wikis (does this still make sense in a post-SUL world?)
- users with names that match a regex for the word "bot" before non-alphabetic characters or at the end of the name
Wikis included in the global value