Research talk:AfC processes and productivity/Work log/2014-04-12


Saturday, April 12th

Basic stats -- checking dataset

I have my full dataset of extracted AfC page events. Today, I'll be generating some basic stats about the dataset and trying to get a sense for how AfC affects collaboration patterns around the early days of an article.

> select count(*) from nov13_status_change;
+----------+
| count(*) |
+----------+
|   567124 |
+----------+
1 row in set (0.65 sec)
> select count(distinct page_id) from nov13_status_change;
+-------------------------+
| count(distinct page_id) |
+-------------------------+
|                  220455 |
+-------------------------+
1 row in set (0.57 sec)
> select count(*) from afc_page_20140331;
+----------+
| count(*) |
+----------+
|   244316 |
+----------+
1 row in set (0.41 sec)

So, it looks like I'm missing about 24k pages (244,316 - 220,455 = 23,861). This could be the result of limiting my dump analysis to Nov. 2013.

> select count(*) from (SELECT DISTINCT page_id, page_namespace, page_title FROM afc_page_20140331 INNER JOIN nov13_creation USING (page_id) WHERE rev_timestamp <= "20131103010101") as foo;
+----------+
| count(*) |
+----------+
|   215102 |
+----------+
1 row in set (41.92 sec)

Yup. That looks right.

> SELECT page_namespace, count(distinct page_id) from nov13_status_change group by 1;
+----------------+-------------------------+
| page_namespace | count(distinct page_id) |
+----------------+-------------------------+
|              0 |                   32819 |
|              5 |                  187637 |
+----------------+-------------------------+
2 rows in set (0.58 sec)
> select 32819/(32819+187637);
+----------------------+
| 32819/(32819+187637) |
+----------------------+
|               0.1489 |
+----------------------+
1 row in set (0.00 sec)

It looks like about 15% of all drafts started in AfC made it to the main namespace. I wonder if most of the drafts that didn't make it out were never submitted for review. Let's check.

+----------------+-------------------------+
| page_namespace | count(distinct page_id) |
+----------------+-------------------------+
|              0 |                   20092 |
|              5 |                  129098 |
+----------------+-------------------------+
2 rows in set (1.01 sec)

It looks like about 39% of the AfC articles that made it to the main namespace (12,727 of 32,819) don't have a record of being submitted or accepted, but most of the pages that didn't make it out of draft were at least marked "pending" or had some evidence of review: 129,098/187,637 = 68.8%. --EpochFail (talk) 17:32, 12 April 2014 (UTC)
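The shares in this comparison can be recomputed from the two result tables above. This is only a small arithmetic sketch in Python; the dictionary names are made up for illustration, and the counts are copied from the query output, keyed by page_namespace.

```python
# Distinct-page counts copied from the two result tables above,
# keyed by page_namespace (0 = main, 5 = where the drafts live).
all_pages = {0: 32819, 5: 187637}        # every page with a status change
with_record = {0: 20092, 5: 129098}      # pages with some submission/review record


def share(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Share of pages in each namespace that have a submission/review record.
record_share = {ns: share(with_record[ns], all_pages[ns]) for ns in all_pages}

# Share of pages in each namespace with no such record.
no_record_share = {
    ns: share(all_pages[ns] - with_record[ns], all_pages[ns]) for ns in all_pages
}

print(record_share)
print(no_record_share)
```

Keeping the denominators explicit makes it easier to see which population each percentage describes.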


!!! I just realized that we might be having a problem with old MediaWiki behavior with regard to deletions. Time to trim this dataset to fit within the bounds of good data: pages created from "200901" through "201311".

> SELECT
    ->     page_namespace,
    ->     count(distinct page_id)
    -> FROM nov13_status_change
    -> INNER JOIN nov13_creation creation USING (page_id)
    -> WHERE 
    ->     status IN ("reviewing", "pending", "accepted", "declined") AND
    ->     creation.rev_timestamp BETWEEN "200901" and "201311"
    -> GROUP BY 1;
+----------------+-------------------------+
| page_namespace | count(distinct page_id) |
+----------------+-------------------------+
|              0 |                   19971 |
|              5 |                  123350 |
+----------------+-------------------------+
2 rows in set (4.63 sec)

For drafts created between Jan. 2009 and Nov. 2013 that show some evidence of submission or review, 19,971/(19,971 + 123,350) = 13.9% were ever published. --EpochFail (talk) 17:32, 12 April 2014 (UTC)
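One caveat that may be worth checking on the date filter: if rev_timestamp is stored as a 14-character MediaWiki-style string (YYYYMMDDHHMMSS) and compared as a string, which is an assumption about the nov13_creation table, then a bound like "201311" sorts before every full timestamp in November 2013, so the range would effectively end on 31 October. The Python sketch below reproduces that lexicographic comparison; the function name and sample timestamps are invented for illustration.

```python
def in_bounds(ts, lower="200901", upper="201311"):
    """Lexicographic check, analogous to a string BETWEEN lower AND upper."""
    return lower <= ts <= upper


# Made-up timestamps around both edges of the range.
samples = [
    "20081231235959",  # last second of 2008
    "20090101000000",  # first second of 2009
    "20131031235959",  # last second of October 2013
    "20131101000000",  # first second of November 2013
]

results = {ts: in_bounds(ts) for ts in samples}
for ts, ok in results.items():
    print(ts, ok)
```

If November 2013 is meant to be included, a full-width upper bound such as "20131130235959" would avoid the ambiguity.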

Page moves dataset

Brief interlude. I have to gather a dataset of page moves for Bluma. I had a long-running script re-generating these. I just have to extract the moves into a tsv so that Bluma can load them into a DB for querying.

> select count(*) from nov13_move;
+----------+
| count(*) |
+----------+
|  4788744 |
+----------+
1 row in set (4 min 57.41 sec)

Here we go: http://stat1001.wikimedia.org/public-datasets/enwiki/etc/nov13_move.tsv.gz --EpochFail (talk) 17:41, 12 April 2014 (UTC)
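For anyone repeating this kind of export, here is a minimal sketch of writing rows to a gzip-compressed TSV and reading them back. It is not the script that produced the file above; the column names and sample rows are hypothetical, since the schema of nov13_move isn't shown here.

```python
import csv
import gzip
import os
import tempfile

# Hypothetical columns and rows; the real nov13_move schema is not shown above.
HEADER = ["page_id", "timestamp", "from_title", "to_title"]
SAMPLE_ROWS = [
    (1, "20130101000000", "Old_title", "New_title"),
    (2, "20130215123000", "Draft_A", "Article_A"),
]


def write_tsv_gz(path, header, rows):
    """Write a header and rows as tab-separated text, gzip-compressed."""
    with gzip.open(path, "wt", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows)


def read_tsv_gz(path):
    """Read a gzip-compressed TSV back into a list of lists of strings."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
        return list(csv.reader(f, delimiter="\t"))


# Round-trip the sample rows through a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "moves_sample.tsv.gz")
    write_tsv_gz(path, HEADER, SAMPLE_ROWS)
    round_trip = read_tsv_gz(path)

print(round_trip)
```

Every value comes back as a string, so whoever loads the file into a DB has to decide on column types at load time.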

Exploration of time between AfC states

Just as a quick look, I wanted to know the time between when a draft is first marked "pending" and when the first evidence of review occurred. To generate this plot, I found the first timestamp at which a page's status changed to "pending" and compared it to the first timestamp at which the page's status changed to "reviewing", "accepted" or "declined".

Time between submission and review. The density of time between when AfC submissions are first marked as "pending review" and when a review is first completed is plotted.
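The comparison behind this figure can be sketched roughly as follows. This is an illustration in Python rather than the code used to build the plot: the (page_id, timestamp, status) tuple layout and the sample events are assumptions, and skipping pages whose first review predates their first "pending" mark is a choice made here, not something stated above.

```python
from datetime import datetime

REVIEW_STATUSES = {"reviewing", "accepted", "declined"}


def parse_ts(ts):
    """Parse a MediaWiki-style YYYYMMDDHHMMSS timestamp string."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def review_delays(events):
    """Map page_id -> seconds from first "pending" to first review event.

    events: iterable of (page_id, timestamp, status) tuples (assumed layout).
    Pages lacking either kind of event are left out, as are pages whose
    first review comes before their first "pending" mark (an assumption).
    """
    first_pending, first_review = {}, {}
    for page_id, ts, status in events:
        if status == "pending":
            target = first_pending
        elif status in REVIEW_STATUSES:
            target = first_review
        else:
            continue
        # Fixed-width timestamps sort correctly as strings.
        if page_id not in target or ts < target[page_id]:
            target[page_id] = ts

    delays = {}
    for page_id, pending_ts in first_pending.items():
        review_ts = first_review.get(page_id)
        if review_ts is not None and review_ts >= pending_ts:
            delta = parse_ts(review_ts) - parse_ts(pending_ts)
            delays[page_id] = delta.total_seconds()
    return delays


# Made-up events for three pages.
sample_events = [
    (1, "20130101000000", "pending"),
    (1, "20130102000000", "declined"),
    (1, "20130105000000", "pending"),
    (1, "20130110000000", "accepted"),
    (2, "20130201120000", "pending"),   # never reviewed
    (3, "20130301000000", "accepted"),  # never marked pending
]

delays = review_delays(sample_events)
print(delays)
```

Only the earliest event of each kind matters per page, so resubmissions after a decline do not change the result.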

This looks bimodal. I'm surprised to see that reviews tend to take place either about 1 day or about 1 week after a draft is submitted (marked "pending"). Before I go any further, I'm going to limit this whole dataset to pages created between "200901" and "201311" and incorporate page moves. I don't really care whether an article is marked as reviewed if it was moved to the main namespace. --EpochFail (talk) 17:55, 12 April 2014 (UTC)


Done! The figure above is updated (mostly no change).

Here's the time between initial draft creation and submission as "pending":

Time between creation and submission. The density of time between when AfC submissions are created and when they are first marked as "pending review" is plotted.

It looks like drafts are usually submitted for review within minutes of creation. However, if they aren't submitted for review right away, then they're likely to wait a day, a week or even a year before submission! --EpochFail (talk) 19:22, 12 April 2014 (UTC)
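One rough way to put numbers on "minutes, a day, a week, a year" is to bin the delays by time scale. The sketch below is only an illustration in Python: the bucket boundaries and the sample delays are invented, and it is not how the density plot was produced.

```python
from collections import Counter

HOUR = 3600
DAY = 24 * HOUR
WEEK = 7 * DAY
YEAR = 365 * DAY

# (label, exclusive upper bound in seconds); boundaries are arbitrary choices.
BUCKETS = [
    ("under an hour", HOUR),
    ("under a day", DAY),
    ("under a week", WEEK),
    ("under a year", YEAR),
]


def bucket(seconds):
    """Return the label of the first bucket whose upper bound exceeds seconds."""
    for label, upper in BUCKETS:
        if seconds < upper:
            return label
    return "a year or more"


def summarize(delays):
    """Count how many delays (in seconds) fall into each bucket."""
    return Counter(bucket(s) for s in delays)


# Made-up creation-to-submission delays, in seconds.
sample_delays = [120, 300, 90000, 700000, 40000000]
summary = summarize(sample_delays)
print(summary)
```

A table of counts like this complements the density plot when the delays span several orders of magnitude.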


That's all for today. I'm starting to disagree with my previous conclusion that the review backlog is really AfC's problem. I really think that it is about visibility. Next time I sit down, I'll be gathering data on article length and number of collaborators over time for newcomer-created articles in AfC vs. direct to main. --EpochFail (talk) 19:25, 12 April 2014 (UTC)