Research:Refining the definition of monthly active editors

From Meta, a Wikimedia project coordination wiki

The number of monthly "active editors" has historically been the primary metric used to track the number of contributors active in a given Wikimedia project. Since 2013, the Research and Data team at the Foundation has sought to refine this metric. This page documents historical definitions, criticisms, and new analysis of candidate definitions.

Historical definition

Historically, the official monthly active editor count has been generated for wikistats and the reportcard, using the following definition of an active editor:

A registered and logged-in person (not known as a bot) who makes 5 or more edits in any month in countable namespaces on countable pages.

The complete definition of an active editor (we refer to this definition as "canonical" or "historical") is available on this page. The definition of the total active editors (TAE) metric counting those across Wikimedia projects (which is used in the 2010-15 strategic goals and is one of the core numbers highlighted in the WMF monthly reports) was refined in 2012 to take into account uploads to Wikimedia Commons and activity across several projects.[1]

The data is currently generated from the latest available dump for the corresponding project. Aside from this use on the official Wikimedia statistics, this metric has been used informally in various contexts to indicate registered users completing 5+ edits monthly in a project's main namespace.
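The core of the definition above can be illustrated with a minimal sketch. It assumes that each edit has already been annotated with the month, namespace, countability and bot status it needs; the `Edit` record, its field names and the `active_editors` function are hypothetical and are not part of wikistats.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Set


class Edit(NamedTuple):
    # Hypothetical record layout, chosen for illustration only.
    user: str        # registered user name
    month: str       # "YYYY-MM" of the edit timestamp
    namespace: int   # MediaWiki namespace number
    countable: bool  # page meets the wiki's countability criteria
    is_bot: bool     # account is treated as a bot


def active_editors(
    edits: Iterable[Edit],
    month: str,
    content_namespaces: frozenset = frozenset({0}),
    threshold: int = 5,
) -> Set[str]:
    """Return non-bot users with at least `threshold` qualifying edits in `month`."""
    counts = Counter(
        e.user
        for e in edits
        if e.month == month
        and e.namespace in content_namespaces
        and e.countable
        and not e.is_bot
    )
    return {user for user, n in counts.items() if n >= threshold}


# Invented sample data: only Alice reaches 5 edits on countable mainspace pages.
sample = (
    [Edit("Alice", "2013-05", 0, True, False)] * 5
    + [Edit("Bob", "2013-05", 0, True, False)] * 4
    + [Edit("Carol", "2013-05", 1, True, False)] * 5
    + [Edit("ExampleBot", "2013-05", 0, True, True)] * 10
)
print(active_editors(sample, "2013-05"))  # {'Alice'}
```

Passing a larger `content_namespaces` set would model projects such as Commons or Meta that treat additional namespaces as content namespaces.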

Countable pages

The canonical definition of Active Editors only considers editor activity on so-called countable pages. Countability criteria vary across wikis and can vary for the same wiki over time. They determine how the article counts are generated on Special:Statistics. Main namespace contributions that occur on pages that are not considered countable do not count towards Active Editor status.

Content namespaces

MediaWiki's namespace 0 is called main namespace (aka mainspace) because it's always the main content namespace of a wiki. Some projects (like Commons or Meta) consider additional namespaces as content namespaces. The canonical definition of Active Editors included contributions in these namespaces.

Bot exclusion

The canonical definition of Active Editors excludes the following bots:

  • registered bots, meaning user names with a 'bot flag'
  • user names without a bot flag, but with bot flags on 10 or more other wikis within the same project (this catches many bots on smaller wikis where the bot flag is not consistently enforced)
  • user names that contain a 'bot' substring at specific locations in the name (before a non-alpha character or at the end of the string), with a small number of exceptions, which are verified real persons
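The three rules above can be sketched as a single predicate. This is an illustration under stated assumptions, not the wikistats implementation: the function name, its parameters and the `verified_humans` exception list are hypothetical, and the regular expression is one possible reading of "before non-alpha char or at end of string" that only treats ASCII letters as alphabetic.

```python
import re

# 'bot' (any case) that is NOT followed by an ASCII letter, i.e. followed by a
# non-alpha character or the end of the name. This is an assumed reading of the rule.
BOT_NAME = re.compile(r"bot(?![a-z])", re.IGNORECASE)


def is_bot(
    name: str,
    has_bot_flag: bool,
    flagged_on_other_wikis: int,
    verified_humans: frozenset = frozenset(),  # hypothetical exception list
) -> bool:
    """Apply the three bot-exclusion rules in order."""
    if has_bot_flag:                   # rule 1: registered bot on this wiki
        return True
    if flagged_on_other_wikis >= 10:   # rule 2: bot flag on 10+ other wikis
        return True
    if name in verified_humans:        # exceptions apply to the name heuristic only
        return False
    return BOT_NAME.search(name) is not None  # rule 3: name heuristic


print(is_bot("SineBot", False, 0))   # True: 'Bot' at end of name
print(is_bot("Bottle", False, 0))    # False: 'Bot' is followed by a letter
print(is_bot("Example", False, 12))  # True: flagged on 12 other wikis
```

Only the third rule depends on the spelling of the user name, which is why it is the hardest of the three to enforce consistently.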

Issues

  • The third condition is a heuristic that is hard to automatically enforce in the same way as the bot flag is.
  • Conversely, limiting the bot definition to registered bots only may produce false negatives, i.e. include as active editors users that perform scripted activity.

Activity on deleted pages

Because it is based on public dumps (i.e. dumps that do not include revisions of deleted articles), the legacy Active Editor count fluctuates depending on whether a page still exists at the time a dump is generated. This issue is informally known as deletion drift.

In the early years of Wikimedia, deletion drift did not affect the figures significantly because there was little deletion going on (mostly of privacy-sensitive vandalism, rarely of content deemed 'not notable' as in later years). Deletion drift was therefore not added by conscious design. It did work out well in one sense, however: it filtered out (albeit with some delay) activity on most non-encyclopedic content. In that sense, the metric known as Total Active Editors measures Productive Editors more than Active Editors. Non-productive edits (partly vandalism) on articles that remain are not filtered in this way, but the effect could still be considered a step towards less inflated numbers.

Issues

  • The current definition creates historical inconsistencies, insofar as any data point in the past can potentially change in the future if it includes activity on pages that eventually are deleted or merged into other articles. Technically, active editor figures dating back to 2001 are censored. They could change at any point in the future (even if the probability of a substantial change for 2001 is much smaller than one in 2013). Practically, this means that nobody can cite an authoritative active editor figure for any project for any given month.
  • An anomalous increase in pages deleted or a change in deletion policy will impact monthly Active Editors (a drop in active editors under the current definition may have nothing to do with activity).
  • As noted above, while activity on deleted or merged pages is discounted in the historical Active Editor definition, low-quality (reverted) activity is included. The following example illustrates this issue:
User Foo makes 5 main-namespace edits in a given month. They all get reverted. Foo is considered active.
User Bar makes 100 main-namespace edits in a given month. None of these edits gets reverted. One of the pages Bar edited is later deleted (or merged) as a result of an AfD. The 96 edits that were made on this page are suddenly treated as bad edits. With 4 good edits left, Bar is considered inactive.
  • On the other hand, the current definition is consistent with the dumps. Bad content could be blanked without removing the history from the dumps. At that level, bad edits to good articles and good edits to bad articles have been treated differently all along without arousing much debate.
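The Foo and Bar example can be reproduced with a small sketch that recounts the same month before and after a page deletion. The page titles and the `active` helper are hypothetical; the only assumption is that a dump-based count silently drops every edit made on a page that no longer exists.

```python
from collections import Counter

THRESHOLD = 5  # edits per month required to count as active


def active(edits, deleted_pages=frozenset()):
    """Count edits per user, ignoring pages missing from the dump."""
    counts = Counter(user for user, page in edits if page not in deleted_pages)
    return {user for user, n in counts.items() if n >= THRESHOLD}


# One month of main-namespace edits as (user, page) pairs.
edits = (
    [("Foo", "Page A")] * 5      # all later reverted, but still counted
    + [("Bar", "Page B")] * 96   # page later deleted after an AfD
    + [("Bar", "Page C")] * 4
)

before_deletion = active(edits)                           # both users active
after_deletion = active(edits, deleted_pages={"Page B"})  # only Foo remains

print(sorted(before_deletion))  # ['Bar', 'Foo']
print(sorted(after_deletion))   # ['Foo']
```

The figure for the same month changes from 2 to 1 active editors purely because of when the dump was taken, which is the deletion drift described above.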

WMF standardization

Principles

In proposing a standardized definition for Active Editors, we will follow these guiding principles:

relevant
this is the most important principle: metrics that are simple, reliable, etc. but not very descriptive of a core issue miss the point
consistent
the same definition of a metric should apply to all data points. if a definition changes, all historic counts need to be updated where possible or, when that is not possible, omitted from the trend line when the change is significant (estimated effect > 5% ?)
note: this is less of an issue when all data points are recalculated on every reporting run, and more of one when metrics are stored and data for new months is appended
reliable
monitoring and quick error escalation should safeguard against data capture/processing errors
again, this is more of an issue when aggregated data is stored for a long time and appended with new data (as happens with e.g. transient data, like page views)
understandable by lay person
a metric definition should not refer to MediaWiki internals
replicable in DB
the metric should be replicable with data obtained from sources other than the dumps (MediaWiki DB or EventLogging data)
flexible to config changes
the metric should follow measures where existing content gets redistributed over several namespaces
a metric measures exactly the same thing whether it's applied to the Swahili or English Wikipedia (insofar as different languages have the same definition and config)
stateless
historical data for a metric remains the same, regardless of when it's generated. Metrics that rely on redacted or deleted data are not stateless
(except when the definition changes or data acquisition/processing bugs are discovered in the old data, in which case repairing erroneous data clearly overrides the need for static data)
have only minor deviations from the legacy definition
redefining a metric should produce data comparable with the historical definition within a margin we're comfortable with
computable at arbitrary time resolutions
we should be able to compute the metric hourly, daily, weekly, monthly

Technical implementations

See also the comparison of alternatives at Data analysis/mining of Wikimedia wikis.

Criterion                                  | dump  | db    | archive
Understandable                             | N[2]  | N[3]  | N[4]
Replicable in DB                           | N[5]  | Y     | Y
Flexible to config changes                 | Y[6]  | ??    | ??
Consistent across projects                 | Y[7]  | Y[8]  | Y[8]
Stateless                                  | N[9]  | N[9]  | Y[10]
Small deviation from historical definition | Y     | Y     | N[11]
Arbitrary resolution                       | N[12] | Y[13] | Y[13]

Analysis

The following plots compare monthly active editors series generated from 3 different data sources and with different assumptions:

dump
the canonical Active Editors data as obtained from the dumps (counting activity on all non-deleted main namespace countable articles and filtering both registered and unregistered bots)
db
Active Editors as obtained from the revision table (counting activity on all non-deleted main namespace articles with no page exclusion and filtering registered bots only).
archive
Active Editors as obtained from the union of the archive and the revision tables (counting activity on all main namespace articles, including pages that later got deleted, and filtering registered bots only).

Factor change

The following plots represent the deviation of Active Editors from the canonical definition (dump) expressed as a relative change. Data prior to 2004 is removed due to the small number of observations for smaller wikis.

Difference

The following plots represent the deviation of Active Editors from the canonical definition (dump) expressed as the absolute difference. Data prior to 2004 is removed due to the small number of observations for smaller wikis.
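The two deviation measures used in the plots can be written down as a short sketch. It assumes that "relative change" means the difference from the canonical (dump) count divided by the canonical count; the page does not spell out the formula, and the monthly figures below are invented for illustration and are not taken from the analysis.

```python
from typing import Optional, Tuple


def deviation(candidate: int, canonical: int) -> Tuple[int, Optional[float]]:
    """Return (absolute difference, relative change) against the canonical count.

    Assumed formulas:
      difference      = candidate - canonical
      relative change = (candidate - canonical) / canonical
    The relative change is None when the canonical count is zero.
    """
    difference = candidate - canonical
    relative = difference / canonical if canonical else None
    return difference, relative


# Invented counts for a single month, keyed by data source.
counts = {"dump": 1000, "db": 1050, "archive": 1100}

for source in ("db", "archive"):
    diff, rel = deviation(counts[source], counts["dump"])
    print(f"{source}: difference={diff}, relative change={rel:.1%}")
```

The absolute difference is dominated by large wikis, while the relative change makes deviations on small and large wikis comparable, which is presumably why both views are plotted.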

Notes

  1. "Improving the accuracy of the active editors metric". Wikimedia blog, August 31, 2012
  2. Deletion drift is not easy for an outsider to grasp. The canonical definition refers to MediaWiki internals such as countable page or content namespace.
  3. The canonical definition refers to MediaWiki internals such as countable page or content namespace.
  4. The canonical definition refers to MediaWiki internals such as countable page or content namespace.
  5. The canonical definition is hard to replicate as is in the database because of the way in which a countable page is defined (which in some cases requires checking the page content or the link/category structure) and the heuristics used for bot detection.
  6. The script will absorb configuration changes without effect on the numbers (provided the API information is updated promptly).
  7. All projects follow the same rule for establishing countability (API driven).
  8. Detecting bots only via the bot flag may produce false negatives, depending on the local bot policy of each project and the adoption rates of bot registration.
  9. Future page deletions cause fluctuations in historical Active Editors data.
  10. Including data from the archive table minimizes fluctuations due to page deletions. A very small number of fluctuations may occur due to redacted revisions that are removed from the database.
  11. Including data from the archive table overrepresents active editors whose activity happens on pages that are later deleted.
  12. Canonical Active Editor data can only be generated when a new dump is available.
  13. Counts from the database can be generated at any desired time resolution.

See also