User:Halfak (WMF)/New page creations, deletions, and drafts

From Meta, a Wikimedia project coordination wiki

One of the new projects that I'm working on for the Growth team is an exploration of the nature of article creation for newcomers.

There are a few questions that I'd like to explore, but the first thing to do is really to just get a sense for how many articles are being created, who is creating them and what the rate of deletion is.

Datasets[edit]

In order to preserve consistency between datasets, I'll limit my analysis to all revisions, pages and log events that occurred before November 5th, 2013. The following subsections describe datasets that can be pulled from the database.

Pages[edit]

SQL
SELECT
    ar_page_id AS page_id,
    ar_namespace AS page_namespace,
    ar_title AS page_title,
    COUNT(*) AS revisions,
    MIN(ar_timestamp) AS first_revision,
    MAX(ar_timestamp) AS last_revision,
    TRUE AS archived,
    MIN(ar_rev_id) AS first_rev_id
FROM archive
WHERE ar_timestamp < "20131105000000"
GROUP BY 1,2,3;
UNION
SELECT
    page_id,
    page_namespace,
    page_title,
    COUNT(*) AS revisions,
    MIN(rev_timestamp) AS first_edit,
    MAX(rev_timestamp) AS last_edit,
    False AS archived,
    MIN(rev_id) AS first_rev_id
FROM revision
INNER JOIN page ON page_id = rev_page
WHERE rev_timestamp < "20131105000000"
GROUP BY 1,2,3;
  • A page was created on the date of its first revision (first_edit == creation).
  • Each page with a row in the logging table for a page, it was_deleted.
  • If an ar_page_id appears in the archive table, then the page is_currently_deleted.


Events[edit]

Page creation[edit]

SQL
SELECT
    IFNULL(rev_id, ar_rev_id) AS rev_id,
    IFNULL(rev_page, ar_page_id) AS page_id,
    IFNULL(rev_comment, ar_comment) AS rev_comment,
    IFNULL(rev_user, ar_user) AS user_id,
    IFNULL(rev_user_text, ar_user_text) AS user_text,
    IFNULL(rev_timestamp, ar_timestamp) AS rev_timestamp,
    IFNULL(rev_deleted, ar_deleted) AS rev_deleted,
    IFNULL(rev_len, ar_len) AS rev_len
FROM halfak.nov13_page
INNER JOIN revision ON first_rev_id = rev_id
INNER JOIN archive ON first_rev_id = ar_rev_id
  • The revision (or archive) with the lowest rev_id was the creating revision
  • There should not be duplication between revision and archive, but if there is, prefer revision


Page deletion[edit]

SQL
SELECT
    log_id, 
    log_type,
    log_user AS user_id,
    log_user_text AS user_text,
    log_page AS page_id,
    log_namespace AS page_namespace,
    log_title AS page_title,
    log_timestamp,
    log_comment
FROM logging
WHERE
    log_action = 'delete' AND 
    log_type = 'delete' AND 
    log_timestamp < "20131105000000"
  • Each log_action="delete" AND log_type="delete" row in logging corresponds to a deletion event.
  • Each deletion event has a "reason" in the log_comment field. This reason should be classifiable.
  • log_page IS NULL or log_page = 0 for all records. This might be a bug (#26122) and will make analysis difficult.


Page restorations[edit]

SQL
SELECT
    log_id, 
    log_type,
    log_user AS user_id,
    log_user_text AS user_text,
    log_page AS page_id,
    log_namespace AS page_namespace,
    log_title AS page_title,
    log_timestamp,
    log_comment
FROM logging
WHERE
    log_action = 'restore' AND 
    log_type = 'delete' AND 
    log_timestamp < "20131105000000";
  • Each log_action="restore" AND log_type="delete" row in logging corresponds to a restoration event.
  • Each restoration event has a "reason" in the log_comment field. This reason should be classifiable.
  • When a page is restored, it's page_id changes. log_page stores this new page_id.


Page curation actions[edit]

SQL
SELECT
    log_id,
    log_action,
    log_page AS page_id,
    log_comment
FROM logging
WHERE 
    log_type = 'pagetriage-curation' AND
    log_timestamp < "20131105000000";
  • Each log_type = "pagetriage-curation" row in logging correspends to a page curation event.
  • The log_action field represents the type of action that was taken
    • "reviewed" -- marks the page as "reviewed" in the Page curation system.
    • "tag" -- marks the addition of cleanup templates like "referenced" and "uncategorized"
    • "delete" -- adds a deletion nomination template to the article.
    • "unreviewed" -- marks the page as "unreviewed" in the Page curation system.


Timeseries[edit]

Article status[edit]

In these plots, "now-archived" means that, when the query was run on 20131105, the page appeared in the archive table. "Now-visible" represents pages that had survived deletion. In the case of these plots, a page is considered to be an "article" if page_namespace=0 in the page table or ar_namespace=0 in the archive table.

The number of articles archived and visible are plotted by month created for the English Wikipedia.
Visible article counts. The number of articles archived and visible are plotted by month created for the English Wikipedia.
The proportion of surviving of articles is plotted by month created for the English Wikipedia.
Article survival. The proportion of surviving of articles is plotted by month created for the English Wikipedia.

Page actions[edit]

This plot captures the rate of three relevant page actions:

  • creation: The first revision is saved to particular page
  • deletion: A page is deleted
  • restoration: A page is undeleted

A log scale is used on the y axis to make restorations visible since they are highly uncommon.

The number of page actions per month is plotted on a log scale over time for articles in the English Wikipedia.
Article actions. The number of page actions per month is plotted on a log scale over time for articles in the English Wikipedia.

It looks like logging didn't exist before 2005.

Page lifetime[edit]

Edit diff approximation[edit]

The following plot shows the density of time between the first and last edit to a page (approximation of the page's lifetime).

The density of time between first and last edits is plotted for deleted articles created between 2005 and 2013 in the English Wikipedia
Article lifetime (edit diff approx). The density of time between first and last edits is plotted for deleted articles created between 2005 and 2013 in the English Wikipedia

Note the spike at zero lifetime which represents the pages that were deleted after only one revision. Note that these pages couldn't have even been tagged before they were deleted.

Note also the gentle, but steady lengthening of time between creation and deletion (last edit) between 2007 and 2013. This suggests that the lifetime of deleted pages has been extending over time.

Articles for Creation[edit]

The workflow for Article for Creation is presented in its most simplistic form.
AfC workflow. The workflow for Article for Creation is presented in its most simplistic form.

AfC was originally created in 200?[citation needed] when direct article creation for IP editors was disabled in order to allow IPs to still contribute new articles. A strange artifact of AfC is how it was designed around the creation of pre-article pages in the Wikipedia_talk namespace. This is due to the fact that when article creation was disabled for IP editors, the right to create talk pages remained.

Recently, newly registered editors have been increasingly directed to AfC with the intention that these newcomers would benefit from mentoring along with the review process built into AfC.

Historical notes[edit]

AFC change over time[edit]

AFC functioned as a place to request articles as of 2006: [1]

Submissions reviewed as of Feb 2007: [2]

Anon backstory[edit]

See 2005 [3] RFC Nov 2007: [4]

Relevant moments[edit]

WikiProject created June 2007: [5]

Article Wizard page appeared August 2009 [6]

Hypotheses[edit]

AfC is too slow.
Submissions don't get reviewed fast enough -- possibly due to lack of reviewers
AfC reviews are ineffective at ensuring quality.
Submissions are accepted, but later deleted.
AfC doesn't improve new editor retention
Because of the two above effects, AfC causes new editors to leave just as often as if they create their article in mainspace and risk speedy deletion
AFC is a low-quality dump
A significant number of drafts are unlikely to ever cross our Notability threshold
AFC has poor reviewing
AFC might be biased towards declining submissions, and have incorrect reviewing on the same
AFC might hinder good faith work, eg editathons

Workflow[edit]

There are 4 main states that an AfC page can occupy:

  • Draft: Pages that have not yet been submitted for review.
  • Pending: Pages that are waiting for review by AfC volunteers
  • Accepted: Pages that were accepted and moved to NS=0
  • Declined: Pages that were declined

Any non-accepted submissions that remain idle for more than 60 days[citation needed] can be flagged with CSD criteria G13 "Abandoned Articles for creation submissions". Pages that are currently occupying these states can be discovered via categories.

Historical data extraction[edit]

Regular expression
/\{\{AfC( |_)submission\|([^\|]*)|/g

Sadly, MediaWiki does not keep a history of categorizations that we can simply query in order to find out when an article moved from category to category. The best way we can do so is by performing text analysis on the content of AfC pages and looking for instances of {{AfC submission}}. The template's first positional parameter denotes the state of the submission:

  • Draft: "t"
  • Pending: "" (empty) or "r" for "review in progress"
  • Accepted: "a"
  • Declined: "d"

Part of the AfC process involves appending duplicate templates to the bottom of the article (and then manually removing an other version). In this case, the template with the greatest precedence should be taken. "t" < "" < "r" < ("a", "d"). Since "accepted" and "declined" share precedence we might have a problem determining what status the article occupies if both are present. I'll arbitrarily decide that "a" < "d" and plan for analysis to see how often this occurs.

Strategy[edit]

For all articles that appear in a relevant AfC category (see above), process revision content from the first revision forward in time.

  • If page is in ns=0, process corresponding talk page revisions in parallel to content page revisions.
  • For each revision content, extract all instances of the AfC submission template using the Regular expression above.
    • Extract first group match looking for {"t", "", "r", "a", "d"}
    • Identify the most advanced (based on precedence) status
    • If most advanced status is different from previous revision, record status change at revision timestamp