Research:Wikipedia article creation

This page documents a completed research project.


The process of creating articles is becoming increasingly difficult for new users due to increasingly restrictive criteria[1] and the speed at which their articles are tagged and deleted[2]. This trend is concerning because new users tend to leave the wiki when their work is deleted.

The English Wikipedia Articles for Creation WikiProject has recently been adjusted to encourage new editors to create draft articles outside of the usual article space. However, it's unclear whether such initiatives improve the success rate of articles created by new editors or improve their retention. In this study, we discuss our analysis of newcomer-created articles in the most active Wikipedia projects and answer questions about how different workflows affect the success rate of articles.

Related work

Research has established that the number of active editors in the English Wikipedia has entered a decline and that this decline is the result of decreased retention of new users[3]. Subsequent research by Halfaker et al. has shown evidence that this decline is not due to the quality of newcomers, but rather the increasing complexity newcomers must manage in order to successfully contribute and the negative reactions they receive[4]. One of the key factors in Halfaker et al.'s model predicting the retention of new editors was whether they created articles that were quickly deleted. Related work by User:Mr.Z-man confirmed that new editors who created articles that were deleted are less likely to continue to contribute[5]. Research performed in parallel found that the rate at which newly created articles are deleted has risen sharply in recent years[1] and the speed at which new articles are tagged and deleted has increased dramatically[2].

Given that several other large Wikipedias exhibit trends similar to the English Wikipedia (e.g. German[6]) and the recent interest in creating native draft functionality in the English Wikipedia, we have set out to better understand the nature of newcomer article creation. Specifically, we sought to understand how drafts have affected the success rate of newcomers' articles.

Research questions

For this analysis, we focused on the top 10 Wikipedias by daily number of articles created[7]: English, Spanish, Russian, German, French, Italian, Polish, Chinese, Japanese and Portuguese. For the sake of simplicity, we'll focus our in-depth analyses on the English and German Wikipedias, but we will compare general statistics across all 10 wikis listed previously.

RQ 1: At what scale do new editors create articles?

  1. How many newcomers create articles?
  2. How many articles are created by newcomers?

RQ 2: How successful are new editors in creating articles in Wikipedia?

  1. What is the success rate of articles created by new editors? ...of more experienced editors? ...of IP editors?
  2. How many articles are created indirectly, as drafts? How does the success rate differ for these draft articles?
  3. How has AfC affected the success rate of new editors in English Wikipedia?

Methods

Data was gathered using the page, revision, archive and logging tables. Data was originally extracted on Nov 5th, 2013, so all subsequent queries were bound by this date for fairness across wikis.

General terminology

On Wikipedia, there is a great deal of complexity and nuance to the jargon for various kinds of wiki pages and page creation processes. The following definitions are terms we will use in our research analyses:

Page
a page is any wiki page in any namespace. If we are referring to pages within a particular namespace only, we will either say article (see below) or name the namespace specifically, e.g. "userspace page" or "user talk page"
Article
an article is a page in the main namespace, aka namespace zero
Draft
a draft is any page originally created outside of the main namespace that is intended to be an article.
Articles for Creation, or AFC
Articles for Creation is a project on English Wikipedia originally intended to allow IP editors to request the creation of articles by registered users. However, recently the project has become a sort of pre-review and mentoring space for both IP editors and new editors.

Assumptions and notes

Page creation
  • We assumed that the timestamp of the first revision to a page is the time of creation
Page creator
  • We assumed that the user who saved the first revision to a page is the page's creator
  • Newcomer tenure was determined by comparing the timestamp of page creation with the user_registration timestamp for all users.
    • Note that user_registration represents something different for Research:Attached users, so they are held aside in newcomer analyses.
Page deletion
  • We assumed that the timestamp of the last revision to a page is the time of deletion
  • We chose this approximation because of bug #26122, which does not allow for easily associating deletion events with the page that was deleted.

Datasets

To perform this analysis, we generated two datasets. For most of our analysis, we focused on the top 2 Wikipedias by article count: English & German.

English & German Wikipedia

Article creation process (enwiki). Article creation process workflow for English Wikipedia is presented as a directed graph and split into "draft space" and "main space"

In this dataset, we sought to build as complete a picture of article creation as possible. In order to do this, we needed to be able to distinguish articles that started as drafts from articles that were created directly in the main namespace. The problem is that it is difficult to programmatically tell the difference between an unpublished draft and other kinds of non-namespace zero pages. For example, many users start drafts in their "user sandbox" and later move those pages to main (e.g. en:User:'DesoHaa/sandbox was moved to en:Fredrick_Kúmókụn_Adédeji_Haastrup on May 27th, 2013), but much "user sandbox" usage is unrelated to article drafting.

One exception is the drafts created via Articles for Creation (AFC) in the English Wikipedia. These pages are originally created in the Wikipedia Talk namespace as sub-pages of en:Wikipedia_talk:Article_for_creation. We assume that all such sub-pages represent a draft that is intended to eventually become an article. This allows us to make a clear distinction between AFC pages that have been published and those that have not.

However useful this is for our analysis of AfC specifically, we sought to be able to compare English & German Wikipedia fairly. In order to do this, we invented the concept of an "article page" -- a page that was, at some point, visible in the main namespace -- under the assumption that any page that is moved to the main namespace was probably intended to be an "article" of some sort.

Page move
A page is moved when its namespace or title is changed. Draft articles tend to be "published" by being moved into the main namespace from another namespace.
  • Note that we extract page move information from structured comments that appear in the revision history of pages due to a bug (#57084) which does not allow for easily associating move events with the page that was moved.
    • English -- rev_comment RLIKE '.*moved .*\\[\\[([^\]]+)\\]\\] to \\[\\[([^\]]+)\\]\\].*:.*'
    • German -- rev_comment RLIKE ".*(hat „|verschob „|verschob Seite |verschob die Seite )\\[\\[([^\]]+)\\]\\](“)? nach („)?\\[\\[([^\]]+)\\]\\](“)?(.*)"
Article page
A page that appears in the main namespace at some point. This includes pages that were originally created as articles and those that were moved to namespace zero.
Original namespace
Many article pages were created directly in the main namespace. However, others were created in "userspace" (namespace 2) and Articles for Creation (namespace 5). By capturing the originating namespace, we hope to get a sense for how different workflows affect the survival of articles once they are moved ("published") to the main namespace.
Publication date
An article is "published" when it first appears in the main namespace. This matches the creation date for pages that were initially created as articles, but instead matches the date a page was moved into namespace zero for pages that were created in other namespaces.
Unpublication date
An article is "unpublished" when it is first deleted or moved out of namespace zero. For simplicity, we did not account for undeletions and moves back to the main namespace. Once a page is unpublished, it's gone. Regretfully, the logging table does not track an appropriate identifier for deleted pages (see bug #26122), so we considered the last revision to an archived page to represent the approximate time-of-deletion.
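
The move detection described above can be sketched in Python. This is only an illustration, not the code used in the study: the English pattern is transcribed from the RLIKE expression listed under "Page move" (the doubled backslashes of the SQL string become single backslashes), while `NAMESPACE_PREFIXES`, `namespace_of`, `parse_move` and `is_publication` are hypothetical names, and the prefix table covers only the two draft namespaces discussed here.

```python
import re

# English move-comment pattern, transcribed from the RLIKE expression above.
MOVE_EN = re.compile(r".*moved .*\[\[([^\]]+)\]\] to \[\[([^\]]+)\]\].*:.*")

# Assumption: a minimal prefix table covering only the draft namespaces
# named in this report (2 = User, 5 = Wikipedia talk).
NAMESPACE_PREFIXES = {"User": 2, "Wikipedia talk": 5}


def namespace_of(title):
    """Return the namespace number implied by a title prefix (0 if unknown)."""
    prefix, sep, _ = title.partition(":")
    if sep and prefix in NAMESPACE_PREFIXES:
        return NAMESPACE_PREFIXES[prefix]
    return 0


def parse_move(comment):
    """Return (source_title, target_title) for a move comment, else None."""
    match = MOVE_EN.match(comment)
    if match is None:
        return None
    return match.group(1), match.group(2)


def is_publication(comment):
    """True if the comment records a move from a draft namespace into main."""
    moved = parse_move(comment)
    if moved is None:
        return False
    source, target = moved
    return namespace_of(source) != 0 and namespace_of(target) == 0


# Hypothetical revision comment, for illustration only.
example = "moved [[User:Example/sandbox]] to [[Example article]]: ready for mainspace"
print(parse_move(example), is_publication(example))
```

Note that the pattern requires a colon after the target title, so a move comment without a trailing reason would not be matched by this sketch.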

The other top 8 Wikipedias

In order to be able to generalize our conclusions we sought to get a representative sample from our larger projects. For this dataset, we extracted general page creation & deletion data from the top 10 wikis by size (English, German, Italian, Spanish, Russian, Portuguese, Polish, Japanese, Chinese and French). Due to the way that the analytics slaves which host non-English/German wikis operate, we were unable to efficiently extract move information. So, rather than examining article pages, we instead used all pages that appeared in the main namespace as of Nov. 13th, 2013 as our set of "articles". While this means that we are not able to observe the effects of different workflows on article survival, we are still able to reflect on how creation and deletion rates change over time for those articles that appear in the main namespace.

Article
A page that appeared in the main namespace as of Nov. 13th, 2013. (This includes deleted pages.)
Creation date
The timestamp of page creation
Deletion date
The timestamp of page deletion

Note that, when we compare German and English Wikipedias to the other 8, we revert to these simpler definitions of "article", creation date and deletion date.

Article creator classes

Article lifetime. The density of time between first and last edits is plotted for deleted articles created between 2008 and 2013 in the English Wikipedia.

In order to observe newcomer article creation, we split newly registered users into three groups based on their tenure at the time of article creation:

-day
Newcomers who registered less than a day before saving the first revision of a page.
day-week
Newcomers who registered between 24 hours and 7 days before saving the first revision of a page
week-month
Newcomers who registered between 7 and 30 days before saving the first revision of a page

Note that, when we refer to "newcomers" without specifying which class, we mean the aggregate of the above newcomer classes (i.e. all newcomers with less than 30 days of tenure).

We also examined other groups for comparison:

month-
Wikipedians who registered more than a month before saving the first revision of a page
anons (IP editors)
Editors who created pages "anonymously", i.e. not through a registered account. Note that the English Wikipedia and a handful of other wikis disallowed direct article creation from IP addresses.
autocreated
Attached editors are users with global accounts who came from another wiki (e.g. meta). Since they are automatically registered (hence "autocreated") when they visit the wiki of interest, registration dates don't generally reflect their actual tenure across Wikipedia. For this reason, these users were analyzed separately.
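
The six creator classes above can be expressed as a small classification function. The sketch below is an illustration under stated assumptions, not the study's code: the function name `creator_class` and its parameters are hypothetical, and the handling of exact boundaries (for example, a tenure of exactly 24 hours falling into "day-week") is our assumption, since the definitions do not specify it.

```python
from datetime import datetime, timedelta


def creator_class(created, registered=None, is_anon=False, is_autocreated=False):
    """Classify a page creator by tenure at the time of the first revision.

    created        -- timestamp of the page's first revision
    registered     -- the creator's registration timestamp (registered users)
    is_anon        -- True for IP editors
    is_autocreated -- True for attached (automatically registered) accounts
    """
    if is_anon:
        return "anons"
    if is_autocreated:
        # Registration dates do not reflect real tenure, so keep these apart.
        return "autocreated"
    if registered is None:
        raise ValueError("registered users need a registration timestamp")
    tenure = created - registered
    # Assumption: lower bounds are inclusive, upper bounds exclusive.
    if tenure < timedelta(days=1):
        return "-day"
    if tenure < timedelta(days=7):
        return "day-week"
    if tenure < timedelta(days=30):
        return "week-month"
    return "month-"


# Hypothetical timestamps, for illustration only.
registration = datetime(2013, 10, 1, 12, 0)
print(creator_class(datetime(2013, 10, 1, 15, 0), registration))
print(creator_class(datetime(2013, 10, 20, 9, 0), registration))
```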

Successful articles

In order to examine RQ 2, we sought to formalize "successful" article creation. Preferably, we'd like all newly created articles to meet minimum guidelines for inclusion in an encyclopedia. If an effective review process is in place, new articles below such a threshold are deleted in a reasonable amount of time and articles above the threshold should remain. In such a system, any article that survives a certain amount of time can be assumed to be a successful article creation. In order to identify how long a reasonable amount of time might be, we performed an analysis of the time between creation and deletion (last revision timestamp). Figure #Article lifetime shows a strong cluster between one minute and one hour. While some deletions take more than one year, 87.3% of deletions occurred within one month. Based on these observations, we operationalize a reasonable amount of time as 30 days.
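
The 30-day operationalization can be written down as a short check. This is a minimal sketch assuming each article is represented by its publication timestamp and an optional unpublication timestamp; the names `SURVIVAL_WINDOW`, `survived` and `survival_rate` are hypothetical. It does not handle articles that were less than 30 days old when the data was extracted.

```python
from datetime import datetime, timedelta

# The threshold chosen above: an article is "successful" if it survives 30 days.
SURVIVAL_WINDOW = timedelta(days=30)


def survived(published, unpublished=None, window=SURVIVAL_WINDOW):
    """True if the article stayed in the main namespace for at least `window`.

    `unpublished` is None for articles that were never deleted or moved out.
    """
    if unpublished is None:
        return True
    return unpublished - published >= window


def survival_rate(articles):
    """Proportion of (published, unpublished) pairs that survived."""
    if not articles:
        return 0.0
    surviving = sum(1 for published, unpublished in articles
                    if survived(published, unpublished))
    return surviving / len(articles)


# Hypothetical articles, for illustration only.
sample = [
    (datetime(2013, 1, 1), datetime(2013, 1, 1, 0, 30)),  # deleted after 30 minutes
    (datetime(2013, 1, 1), datetime(2013, 3, 1)),         # deleted after two months
    (datetime(2013, 1, 1), None),                         # never deleted
]
print(survival_rate(sample))
```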

Results

Since the archive table only contains page_id values for revisions of pages deleted after 2007, we'll focus our attention on 2008 through 2013.

RQ 1: At what scale do new editors create articles?

How many newcomers create articles?

To get a sense for how many newcomers create articles and to identify changes over time, we built a monthly timeseries of the number of newly_registered_users, new editors, new page creators, new article page publishers and new draft article publishers. These editor classes represent a "funnel" as users approach creating articles and eventually draft articles.

  • new page creators = new editors who create a page within 1 month of registration
  • new article publishers = new editors who create an article page within 1 month of registration
  • new draft article publishers = new editors who create a draft (original namespace != 0) article page within 1 month of registration.
Relative funnel proportions (enwiki). The monthly (relative) proportion of editors reaching stages of the draft article publishing funnel is plotted for English Wikipedia.
Relative funnel proportions (dewiki). The monthly (relative) proportion of editors reaching stages of the draft article publishing funnel is plotted for German Wikipedia.

Figures #Relative funnel proportions (enwiki) and #Relative funnel proportions (dewiki) show how the relative proportion of newcomers reaching each step in the funnel has changed over time. Both English and German Wikipedia have experienced a slow decline in the proportion of newly_registered_users who make at least one edit (new editors). However, while the proportion of new editors who create pages (page creators) has been declining for German Wikipedia (55% in Jan. 2008 to less than 40% in Oct. 2013), the proportion has been holding steady for English Wikipedia at about 40-45%.

Another difference can be observed in the proportion of new page creators who publish an article page. Both the English and German Wikipedias fluctuate around 60-65% between 2008 and mid-2011, but in the English Wikipedia the percentage of such editors drops to about 32% thereafter, while the German Wikipedia remains near 60%. Note that the timing of this switch corresponds to the time that newcomers began to be directed to en:Wikipedia:Articles for creation. This transition is explored more closely in #How has AfC affected the success rate of new article creators in English Wikipedia?


The table below summarizes statistics for the English and German Wikipedias for the most recent full month in the dataset: October, 2013.

Group                   English (Oct. 2013)                German (Oct. 2013)
                        Users   Relative %  Absolute %     Users  Relative %  Absolute %
Newly registered users  157008  100         100            9633   100         100
New editors             48163   30.6        30.6           4019   41.7        41.7
New page creators       18808   39.1        12.0           1575   39.2        16.4
New article publishers  6045    32.1        3.9            952    60.4        9.9
New draft publishers    118     2.0         0.0            0      0           0.0
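
The two percentage columns can be reproduced from the user counts: "Relative %" compares each stage with the stage before it, and "Absolute %" compares each stage with all newly registered users. The sketch below illustrates that arithmetic using the counts from the table; the function name `funnel_percentages` is hypothetical, and results may differ from the table in the last digit depending on rounding.

```python
# User counts from the table above (October, 2013).
FUNNEL_EN = [
    ("Newly registered users", 157008),
    ("New editors", 48163),
    ("New page creators", 18808),
    ("New article publishers", 6045),
    ("New draft publishers", 118),
]
FUNNEL_DE = [
    ("Newly registered users", 9633),
    ("New editors", 4019),
    ("New page creators", 1575),
    ("New article publishers", 952),
    ("New draft publishers", 0),
]


def funnel_percentages(stages):
    """Return (name, count, relative %, absolute %) for each funnel stage.

    Relative % is the share of the previous stage that reaches this stage;
    absolute % is the share of the first stage that reaches this stage.
    """
    total = stages[0][1]
    previous = total
    rows = []
    for name, count in stages:
        relative = 100.0 * count / previous if previous else 0.0
        absolute = 100.0 * count / total if total else 0.0
        rows.append((name, count, relative, absolute))
        previous = count
    return rows


for name, count, relative, absolute in funnel_percentages(FUNNEL_EN):
    print(f"{name:<24}{count:>8}{relative:>8.1f}{absolute:>8.1f}")
```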

How many articles are created by newcomers?

Next we sought to explore how many articles newcomers were responsible for. Figure #Articles created by experience plots the proportion of articles created in October, 2013 for each of the observed wikis.

Articles created by experience. The proportion of all articles created in Oct. 2013 is plotted by the experience level of the creator for 7 wikis.

In all wikis, Wikipedians with more than a month of tenure create the majority of articles. On the low end is Italian, with "month-" Wikipedians creating 54% of new articles. On the high end are English and German, with "month-" Wikipedians creating 83.6% and 79% of articles, respectively.

In all non-English wikis, anonymous editors (aka IP editors) represent the next largest class of article creators. Their article creation proportion ranges from 14.5% in German Wikipedia to 36.2% in Italian Wikipedia.

The next largest group of article creators is newcomers with less than 1 day of tenure on Wikipedia. In all the observed wikis, a higher proportion of articles is created by newcomers in their first day than by newcomers in the rest of their first week/month combined.


RQ 2: How successful are new editors in creating articles in Wikipedia?

What is the success rate of articles created by new editors?

In order to address this research question, we first sought to get a general sense of the success rate of newcomers in creating articles across wikis and how that success rate has changed over time. Figure #Article survival by experience summarizes the success rate of editors from each of the 6 article creator classes from October, 2013 across the set of languages.

Article survival by experience. The survival proportion of all articles created between Oct. 2012-2013 is plotted by the experience level of the creator.

We were surprised to find that, in wikis that allow article creation by anonymous editors, their survival rate is substantially higher than that of recently registered new editors. In most cases, anons were twice as likely to create an article that would stick as newcomers with less than a day of tenure. One notable exception is Polish Wikipedia, which has the lowest observed survival rate of anon-created articles (22.2%) and the highest observed survival rate of articles created by "-day" newcomers (58.6%).

In general, the rate of article survival is much higher for all editors in Japanese and Polish Wikipedias. It's unclear to us why this is the case. All other wikis seem to follow a similar pattern whereby newcomers with less than a day of tenure have the lowest article survival rate.

Next, we looked at how article survival rate has been changing over time for different article creator classes. The figures below plot the monthly survival rate for each class of editor with linear models overlaid to aid with visualizing trends.

Article survival over time. (German)
Article survival over time. (English)
Article survival over time. (Spanish)
Article survival over time. (French)
Article survival over time. (Italian)
Article survival over time. (Japanese)
Article survival over time. (Polish)
Article survival over time. (Portuguese)
Article survival over time. (Russian)
Article survival over time. (Chinese)

In general, the survival rate of newcomers' articles has been decreasing, even in Polish and Japanese, where the survival rate of newcomer articles is very high. English and German represent exceptional cases where it appears that the success rate of newcomer-created articles is rising for all three newcomer classes. For English, this could be explained by the introduction of Articles for creation. We'll discuss this possibility more in #How has AfC affected the success rate of new article creators in English Wikipedia? For German Wikipedia, we don't have any likely explanations to put forward.

How does the success rate differ for draft articles?

Next we looked to our dataset of article pages for English and German Wikipedia to examine the survival rate of articles that were created in another namespace (drafts) and moved to the main namespace later. The Article page survival figures (enwiki & dewiki) below plot the survival rate of articles over time by their original namespaces, for the three classes of newcomers and experienced Wikipedians.

  • 0 = Main (article) namespace
  • 2 = User namespace
  • 5 = Wikipedia_talk (Project_talk) namespace (used by Articles for creation on English Wikipedia)
Article page survival (enwiki). The monthly proportion of surviving articles is plotted for article pages by the namespace from which they originated and by the tenure of the editor at time of draft creation.
Article page survival (dewiki). The monthly proportion of surviving articles is plotted for article pages by the namespace from which they originated and by the tenure of the editor at time of draft creation.

As expected, newcomers with the least tenure ("-day") have the most divergent survival rates for direct to main (origin = 0) article creations and drafts. In the English Wikipedia, direct article creations survive about 25% of the time while articles that start in userspace (origin = 2) and Articles for creation (origin = 5) survive about 96% of the time once published. Similar, but less substantial differences in the survival rate exist for these newest of newcomers in the German Wikipedia. There, direct article creations by "-day" newcomers survive about 20% of the time while articles that start in userspace survive about 80% of the time.

This is where the similarities between English and German Wikipedia with regard to draft survival seem to end. While the survival rate of both draft types remains high for English Wikipedia through all editor classes, in German Wikipedia the userspace drafts created by slightly more experienced newcomers ("day-week" & "week-month") and by experienced editors have a surprisingly low survival rate, lower than even direct article creations. This differing trend could reflect differences in how userspace drafts are used in German vs. English Wikipedia. One thing is clear, however: more examination is necessary to explain this difference.

The tables below present summary statistics for October, 2013, the most recent complete month in the dataset.

English Wikipedia (Oct. 2013)
origin  creator_tenure  authors  articles  surviving  survival %
0       -day            5142     6221      1511       24.3
0       day-week        759      1415      788        55.7
0       week-month      750      1320      805        61.0
0       month-          5775     37033     35051      94.6
2       -day            69       70        67         95.7
2       day-week        22       23        22         95.7
2       week-month      40       53        45         84.9
2       month-          219      457       387        84.7
5       -day            96       96        92         95.8
5       day-week        29       32        32         100.0
5       week-month      29       31        31         100.0
5       month-          176      529       512        96.8

German Wikipedia (Oct. 2013)
origin  creator_tenure  authors  articles  surviving  survival %
0       -day            802      967       190        19.6
0       day-week        111      163       83         50.9
0       week-month      101      171       113        66.1
0       month-          1615     15183     14216      93.6
2       -day            35       39        31         79.5
2       day-week        23       26        9          34.6
2       week-month      21       30        11         36.7
2       month-          129      352       154        43.8
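
As an illustration of how the per-class rows combine, the sketch below pools the three newcomer classes of the English Wikipedia table and computes one survival percentage per original namespace. The pooled figures are our own arithmetic on the published counts, not results reported by the study, and the name `survival_by_origin` is hypothetical.

```python
# (origin namespace, creator tenure, articles, surviving) for newcomer rows
# of the English Wikipedia table above (October, 2013).
ENWIKI_OCT_2013 = [
    (0, "-day", 6221, 1511),
    (0, "day-week", 1415, 788),
    (0, "week-month", 1320, 805),
    (2, "-day", 70, 67),
    (2, "day-week", 23, 22),
    (2, "week-month", 53, 45),
    (5, "-day", 96, 92),
    (5, "day-week", 32, 32),
    (5, "week-month", 31, 31),
]

NEWCOMER_CLASSES = {"-day", "day-week", "week-month"}


def survival_by_origin(rows, classes=NEWCOMER_CLASSES):
    """Pool article counts over the given tenure classes, per origin namespace.

    Returns a dict mapping origin namespace to survival percentage.
    """
    totals = {}
    for origin, tenure, articles, surviving in rows:
        if tenure not in classes:
            continue
        pooled_articles, pooled_surviving = totals.get(origin, (0, 0))
        totals[origin] = (pooled_articles + articles, pooled_surviving + surviving)
    return {
        origin: 100.0 * surviving / articles
        for origin, (articles, surviving) in totals.items()
        if articles
    }


for origin, pct in sorted(survival_by_origin(ENWIKI_OCT_2013).items()):
    print(f"origin {origin}: {pct:.1f}% of newcomer articles survive")
```

Pooling weights each class by its article count, so the direct-to-main figure is dominated by the large "-day" group.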

How has AfC affected the success rate of new article creators in English Wikipedia?

Surviving article per new page creator. The number of surviving articles per newcomer page creator is plotted with loess fits before and after AfC's era began in Feb. 2011.
AfC drafts created per month. The monthly count of AfC drafts created by newcomers (<= 1 month since registration) is plotted for the English Wikipedia.

Finally, we sought to get a sense for how Articles for creation (AfC) affects the work of new editors. We showed in #How does the success rate differ for draft articles? that articles created through AfC are most likely to survive for all three newcomer classes as well as experienced Wikipedians. However, we also observed in #How many newcomers create articles? that the proportion of new page creators whose pages get published in the main namespace declined sharply around the time that new article creators began being directed to AfC.

Given the high survival rate of AfC published articles (~96%), it could be that AfC is merely acting as an effective filter, where articles that won't survive are just not published in the first place. In order to test this hypothesis, we removed from the dataset all article pages that did not survive at least 30 days and plotted a similar proportion. Figure #Surviving article per new page creator plots this proportion with loess fits for before and after newcomers were directed toward AfC. Despite filtering for surviving articles, it looks like the trend remains. Figure #AfC drafts created per month shows the corresponding rise in the number of AfC drafts created by newcomers during this time period.

This analysis shows that newcomers have published about half as many good articles since they started being directed to AfC when creating articles. However convincing, this analysis merely demonstrates a temporal correlation. More analysis will be necessary to assert with confidence that AfC is causing the decline in successful newcomer-created articles.


Summary

In this study, we examined article creation across the top 10 language Wikipedias by number of articles and performed a focused analysis of draft article creation in English and German Wikipedias.

We found some strong regularities in success rate of articles depending on the experience level of the article's creator; namely, that the more experience an editor has, the more likely their articles are to survive. We were surprised to find that articles created by anonymous editors (where such creations are possible) are more likely to survive than articles created by newcomers who recently registered an account. This result suggests that it's time to review English Wikipedia's policy against anonymous article creation.

We also found that, in general, the rate of survival for newcomer articles has been decreasing over time with two notable exceptions. In the English Wikipedia, the survival of newcomer articles decreased steadily until the introduction of Articles for creation, a space for creating article drafts and receiving review before publishing. In German Wikipedia, the survival of newcomer articles has been rising steadily since 2008, but we see no evidence of a comparable switch toward a draft & review process in that wiki.

Finally, we found correlation-based evidence that directing new article creators to AfC has resulted in a dramatic decline in the creation of good new articles by newcomers. We also showed that the drafts published via AfC are extremely likely to survive. More work is necessary to identify what factor may be limiting AfC's success to this smaller proportion of articles.

Notes

  1. Lam, S. T. K., & Riedl, J. (2009, May). Is Wikipedia growing a longer tail? In Proceedings of the ACM 2009 international conference on Supporting group work (pp. 105-114). ACM.
  2. Research:The Speed of Speedy Deletions
  3. Suh, B., Convertino, G., Chi, E. H., & Pirolli, P. (2009, October). The singularity is not near: slowing growth of Wikipedia. In Proceedings of the 5th International Symposium on Wikis and Open Collaboration (p. 8). ACM.
  4. Halfaker, A., Geiger, R. S., Morgan, J. T., & Riedl, J. (2013). The Rise and Decline of an Open Collaboration System: How Wikipedia’s Reaction to Popularity Is Causing Its Decline. American Behavioral Scientist, 57(5), 664-688.
  5. https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2011-04-04/Editor_retention
  6. http://stats.wikimedia.org/EN/ChartsWikipediaDE.htm
  7. New articles per day from stats.wikimedia.org