Research talk:Wikipedia article creation/Work log/Thursday, December 12th
Thursday, December 12th
Today I need to settle an issue that's been troubling me. In order to determine the success rate of articles based on how they were created, I'd like to be able to detect practically all pages as they reach an "entry point" into the article creation process. I've made a handy graphic to help discuss the issues.
Under this conceptualization, a page is a draft until it is moved to the main namespace ("Main space") at which point it becomes an article.
There are three main entry points worth considering:
- AfC draft
- Any page that was originally created in the "Wikipedia_talk" namespace with a title prefixed by "Articles_for_creation/". (Note that this does not include User space AfC drafts. For the time being, I'm considering that loss OK.)
- User space draft
- A page that was originally created in the "User" or "User_talk" namespace and is intended to be an article. A draft that starts in user space can enter the AfC process as a "pending" draft.
- Direct to main
- A page that was originally created in the Main namespace.
The issue I'm struggling with is capturing the intention of a user-space page to find out if it is an article draft or not. Right now, I have about 75 sub-pages under my user page and only one of them is a redirect to an article I created (GroupLens Research for the curious). Many of the user space drafts that are currently pending AfC review are in User:<name>/sandbox, so there might even be old revisions that were test edits and had nothing to do with the version of the article at the time it gets submitted for review. --Halfak (WMF) (talk) 18:31, 12 December 2013 (UTC)
OK. Here's my plan. I'm only going to look at pages that crossed over into the main namespace at some point. I'm not going to worry about failed user space drafts that don't make it to Main and I'm going to worry about AfC drafts that don't make it to Main with some follow-up analysis.
Time to define some terms.
- article page -- A page that was ever moved to the main namespace
- "direct to main" -- If the page started in the main namespace
- "AfC" -- if original_namespace = 5 and original_title LIKE "Articles_for_creation/%"
- "user space draft" -- If the page started in "User" or "User_talk"
- "other" -- otherwise
- first main date -- The date that the page first appeared in the main namespace. This with be the same as first_revision if the article started in main. Otherwise, the date represents the first move.
- first move out of main -- Some pages are moved into a user's Sandbox rather than being deleted. This is nearly synonymous with deletion so I'd like to include. This will be NULL if the page was never moved out of main.
- last revision -- The date of the last revision to a page. This is used as a proxy to the page's deletion date in the case of archived pages.
- archived -- Currently (as of Nov. 13, 2013) appears in the archive table.
An article page will be considered retained if it meets either of the following conditions.
- The page continues to exist in the main namespace
- The page was not moved out of main or archived within 30 days of the first main date
CREATE TEMPORARY TABLE halfak.publish SELECT origin.page_id, origin.page_namespace, origin.page_title, MIN(main_move.timestamp) AS timestamp FROM halfak.nov13_page_origin origin INNER JOIN halfak.nov13_move main_move ON origin.page_id = main_move.page_id AND origin.page_namespace = main_move.page_namespace AND origin.page_title = main_move.page_title AND main_move.from_namespace != 0 AND main_move.to_namespace = 0 WHERE origin.original_namespace != 0 GROUP BY 1,2,3; CREATE UNIQUE INDEX page_name_title ON halfak.publish (page_id, page_namespace, page_title); CREATE TEMPORARY TABLE halfak.unpublish SELECT origin.page_id, origin.page_namespace, origin.page_title, MIN(out_move.timestamp) AS timestamp FROM halfak.nov13_page_origin origin INNER JOIN halfak.nov13_move out_move ON origin.page_id = out_move.page_id AND origin.page_namespace = out_move.page_namespace AND origin.page_title = out_move.page_title AND out_move.from_namespace = 0 AND out_move.to_namespace != 0 GROUP BY 1,2,3; CREATE UNIQUE INDEX page_name_title ON halfak.publish (page_id, page_namespace, page_title); DROP TABLE IF EXISTS halfak.article_page; CREATE TABLE halfak.article_page SELECT page_id, page_namespace, page_title, original_namespace, original_title, first_revision as created, IF(original_namespace = 0, first_revision, publish.timestamp) AS published, IF(archived OR unpublish.page_id IS NOT NULL, IF( archived, IF(unpublish.timestamp > last_revisions, last_revision, unpublish.timestamp), unpublish.timestamp ), NULL ) AS unpublished, FROM halfak.nov13_page INNER JOIN halfak.nov13_page_origin origin USING(page_id, page_namespace, page_title) LEFT JOIN halfak.publish USING(page_id, page_namespace, page_title) LEFT JOIN halfak.unpublish USING(page_id, page_namespace, page_title) WHERE original_namespace != 0 OR publish.page_id IS NOT NULL;
That took longer than expected. I'll babysit this for a while before calling it a night. With any luck, I'll be able to make use of this in the morning. --Halfak (WMF) (talk) 23:34, 12 December 2013 (UTC)