WSoR datasets/trending articles

From Meta, a Wikimedia project coordination wiki

This dataset is used for the sprint on new editor retention in trending articles.

Location

  • Dropbox/wsor/yusuke/trending-20110627/data/bursts_Jan2009_daily_rev.tsv
  • Dropbox/wsor/yusuke/trending-20110627/data/bursts_Jul2009_daily_rev.tsv
  • Dropbox/wsor/yusuke/trending-20110627/data/bursts_Jan2010_daily_rev.tsv
  • Dropbox/wsor/yusuke/trending-20110627/data/bursts_Jul2010_daily_rev.tsv
  • Dropbox/wsor/yusuke/trending-20110627/data/bursts_Jan2011_daily_rev.tsv

Fields

    $ head -4 ~/Dropbox/wsor/yusuke/trending-20110627/data/bursts_Jan2009_daily_rev.tsv

title	page_id	redirect?	pageview timestamp	predicted pageview	actual pageview	trending hours	surprisedness	revision	timestamp	user type	username	editcount	new?	protect	editcount_120d+90d
National_Football_League_Most_Valuable_Player_Award	2019918	REDIRECT_RESOLVED	2009/01/03 12:00:00	212	1270	1	4.99056603774	261557507	20090103000010	REG	Howdythere	880	OLD	NO_PROTECT	1
La_Toya_Jackson	152297	REDIRECT_RESOLVED	2009/01/03 12:00:00	65	1668	1	24.6615384615	261557521	20090103000013	ANON	81.151.114.161	0	NEW	NO_PROTECT	0
National_Football_League_Most_Valuable_Player_Award	2019918	REDIRECT_RESOLVED	2009/01/03 12:00:00	212	1270	1	4.99056603774	261557736	20090103000124	REG	Howdythere	881	OLD	NO_PROTECT	1

Each row represents a revision whose 'surprisedness' value is higher than the threshold. Each file covers the revisions found in one month.
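
As a minimal sketch of how such a file might be read, the Python snippet below parses tab-separated rows into dictionaries keyed by the header names. The helper name `read_bursts` is hypothetical, and the inline sample simply repeats the header and the first two data rows from the `head` output above; it also assumes that fields are never quoted.

```python
import csv
import io

# Header and first two data rows, copied from the `head` output above.
HEADER = ["title", "page_id", "redirect?", "pageview timestamp",
          "predicted pageview", "actual pageview", "trending hours",
          "surprisedness", "revision", "timestamp", "user type",
          "username", "editcount", "new?", "protect", "editcount_120d+90d"]
ROWS = [
    ["National_Football_League_Most_Valuable_Player_Award", "2019918",
     "REDIRECT_RESOLVED", "2009/01/03 12:00:00", "212", "1270", "1",
     "4.99056603774", "261557507", "20090103000010", "REG", "Howdythere",
     "880", "OLD", "NO_PROTECT", "1"],
    ["La_Toya_Jackson", "152297", "REDIRECT_RESOLVED", "2009/01/03 12:00:00",
     "65", "1668", "1", "24.6615384615", "261557521", "20090103000013",
     "ANON", "81.151.114.161", "0", "NEW", "NO_PROTECT", "0"],
]
SAMPLE = "\n".join("\t".join(fields) for fields in [HEADER] + ROWS) + "\n"


def read_bursts(handle):
    """Return one dict per revision row (hypothetical helper name)."""
    # QUOTE_NONE: assume fields are never quoted, so quote characters
    # inside article titles are kept as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


rows = read_bursts(io.StringIO(SAMPLE))
for row in rows:
    print(row["title"], row["revision"], row["user type"])
```

To read an actual file, the same helper could be given `open(path, newline="", encoding="utf-8")`; the UTF-8 encoding is an assumption, not something the dataset description states.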

  • title
  • page_id
  • redirect?
  • pageview timestamp
  • predicted pageview: linear prediction from the previous two days
  • actual pageview
  • trending hours: the duration of the continued trending days
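  • surprisedness: relative increase from the predicted to the actual page view count, i.e. (actual - predicted) / predicted, stored as a ratio rather than a percentage; in the first sample row, (1270 - 212) / 212 ≈ 4.99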
  • revision: revision ID
  • timestamp: revision timestamp as date, hour, minute and second (YYYYMMDDhhmmss, e.g. 20090103000010)
  • user type: registered user (REG in the sample), bot, or anonymous user (ANON in the sample)
  • username
  • editcount: the user's edit count up to the revision timestamp
  • new?: whether the user had 30 days of editing history as of the revision (NEW or OLD in the sample)
  • protect
  • editcount_120d+90d
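
The field descriptions can be checked against the sample rows with a short Python sketch. It assumes that surprisedness is computed as (actual - predicted) / predicted, which reproduces the values in the sample but is not confirmed as the exact formula used by the original scripts. The function names are hypothetical, and the timestamp formats are inferred from the sample rows.

```python
import math
from datetime import datetime


def surprisedness(predicted, actual):
    """Relative increase from predicted to actual page views (assumed formula)."""
    return (actual - predicted) / predicted


def parse_pageview_timestamp(value):
    """Parse the 'pageview timestamp' column, e.g. '2009/01/03 12:00:00'."""
    return datetime.strptime(value, "%Y/%m/%d %H:%M:%S")


def parse_revision_timestamp(value):
    """Parse the revision 'timestamp' column, e.g. '20090103000010'."""
    return datetime.strptime(value, "%Y%m%d%H%M%S")


# Values taken from the first two sample rows:
# (predicted pageview, actual pageview, surprisedness as stored in the file).
samples = [(212, 1270, 4.99056603774), (65, 1668, 24.6615384615)]
for predicted, actual, stored in samples:
    computed = surprisedness(predicted, actual)
    # The file stores a rounded value, so compare with a tolerance.
    print(predicted, actual, math.isclose(computed, stored, rel_tol=1e-9))

revision_time = parse_revision_timestamp("20090103000010")
pageview_time = parse_pageview_timestamp("2009/01/03 12:00:00")
print(revision_time.isoformat(), pageview_time.isoformat())
```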

Reproduction

Use the scripts available at [1] and follow the documentation.

Notes

Because producing the dataset for one month takes 1-2 days, samples were prepared for only five months, one every half year between January 2009 and January 2011. Edit count values may be incorrect (see bug:19311).