Research talk:New page reviewer impact analysis/Work log/2017-06-11

From Meta, a Wikimedia project coordination wiki

Sunday, June 11, 2017[edit]

Looking at the mediawiki page history table.

hive (wmf)> describe mediawiki_page_history;
OK
col_name	data_type	comment
wiki_db             	string              	enwiki, dewiki, eswiktionary, etc.
page_id             	bigint              	Id of the page, as in the page table.
page_id_artificial  	string              	Generated Id for deleted pages without real Id.
page_creation_timestamp	string              	Timestamp of the page's first revision.
page_title          	string              	Historical page title.
page_title_latest   	string              	Page title as of today.
page_namespace      	int                 	Historical namespace.
page_namespace_is_content	boolean             	Whether the historical namespace is categorized as content
page_namespace_latest	int                 	Namespace as of today.
page_namespace_is_content_latest	boolean             	Whether the current namespace is categorized as content
page_is_redirect_latest	boolean             	In revision/page events: whether the page is currently a redirect
start_timestamp     	string              	Timestamp from where this state applies (inclusive).
end_timestamp       	string              	Timestamp to where this state applies (exclusive).
caused_by_event_type	string              	Event that caused this state (create, move, delete or restore).
caused_by_user_id   	bigint              	ID from the user that caused this state.
inferred_from       	string              	If non-NULL, some fields have been inferred from an inconsistency in the source data.
snapshot            	string              	Versioning information to keep multiple datasets (YYYY-MM for regular labs imports)
	 	 
# Partition Information	 	 
# col_name            	data_type           	comment             
	 	 
snapshot            	string              	Versioning information to keep multiple datasets (YYYY-MM for regular labs imports)
Time taken: 0.129 seconds, Fetched: 22 row(s)
hive (wmf)> SELECT page_id, page_creation_timestamp, page_namespace AS page_creation_namespace, caused_by_user_id AS creation_user_id FROM mediawiki_page_history WHERE wiki_db = 'enwiki' AND caused_by_event_type = 'create' AND snapshot = "2017-05" AND page_namespace_latest = 0 limit 10;
page_id	page_creation_timestamp	page_creation_namespace	creation_user_id
609284	20040421024403	0	60719
2653101	20050910134608	0	36424
39050015	20130408122524	0	18801839
25399871	20091211213244	0	3010696
47053441	20150622221319	0	20318817
41954592	20140215234257	0	494861
35278895	20120331003832	0	4649642
15487338	20080129204923	0	231599
257932	20030704030557	0	6120
26284416	20100221181306	0	11571269
Time taken: 34.195 seconds, Fetched: 10 row(s)

Cool so it looks like we can split up the page creations by the original namespace and we can also get the creating user_id. OK my plan for the next time I sit down is to use this to generate a dataset that I'll start with. I only want page creations after 2008 because everything else is likely to be a little goofy. I think I'll then look into sampling from various time periods to know what proportion of new pages are created by newcomers. --EpochFail (talk) 19:59, 11 June 2017 (UTC)[reply]