WSoR datasets/trending articles

From Meta, a Wikimedia project coordination wiki

This dataset is used for the sprint on new editor retention in trending articles.

Location

  • Dropbox/wsor/yusuke/trending-20110627/data/bursts_Jan2009_daily_rev.tsv
  • Dropbox/wsor/yusuke/trending-20110627/data/bursts_Jul2009_daily_rev.tsv
  • Dropbox/wsor/yusuke/trending-20110627/data/bursts_Jan2010_daily_rev.tsv
  • Dropbox/wsor/yusuke/trending-20110627/data/bursts_Jul2010_daily_rev.tsv
  • Dropbox/wsor/yusuke/trending-20110627/data/bursts_Jan2011_daily_rev.tsv
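
The five paths above follow a regular pattern, so they can be generated rather than typed out. The following is a minimal sketch; the names `DATA_DIR`, `MONTHS` and `burst_paths` are illustrative and not part of the dataset or its scripts, and the base directory is assumed to be relative to wherever the Dropbox folder is mounted.

```python
from pathlib import PurePosixPath

# Directory and month labels copied from the list above.
DATA_DIR = PurePosixPath("Dropbox/wsor/yusuke/trending-20110627/data")
MONTHS = ["Jan2009", "Jul2009", "Jan2010", "Jul2010", "Jan2011"]


def burst_paths(base=DATA_DIR):
    """Return the path of each monthly file, in chronological order."""
    return [base / f"bursts_{month}_daily_rev.tsv" for month in MONTHS]


for path in burst_paths():
    print(path)
```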

Fields

    $ head -4 ~/Dropbox/wsor/yusuke/trending-20110627/data/bursts_Jan2009_daily_rev.tsv

title   page_id redirect?       pageview timestamp      predicted pageview      actual pageview trending hours  surprisedness   revision        timestamp       user type       username        editcount       new?    protect editcount_120d+90d
National_Football_League_Most_Valuable_Player_Award     2019918 REDIRECT_RESOLVED       2009/01/03 12:00:00     212     1270    1       4.99056603774   261557507       20090103000010  REG     Howdythere      880     OLD     NO_PROTECT      1
La_Toya_Jackson 152297  REDIRECT_RESOLVED       2009/01/03 12:00:00     65      1668    1       24.6615384615   261557521       20090103000013  ANON    81.151.114.161  0       NEW     NO_PROTECT      0
National_Football_League_Most_Valuable_Player_Award     2019918 REDIRECT_RESOLVED       2009/01/03 12:00:00     212     1270    1       4.99056603774   261557736       20090103000124  REG     Howdythere      881     OLD     NO_PROTECT      1

Each row represents a revision that has its 'surprisedness' value higher than the threshold. Each file covers those revisions found in one month.
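
A minimal sketch of reading such a file in Python, assuming the columns are tab-separated and appear in the order of the header row shown above. The identifiers in `COLUMNS` and the helper `read_bursts` are illustrative choices made for this example; they are not defined by the dataset or by the scripts that produced it.

```python
import csv
import io

# Illustrative column names, in the order of the header row above.
# These identifiers are assumptions made for this sketch.
COLUMNS = [
    "title", "page_id", "redirect", "pageview_timestamp",
    "predicted_pageview", "actual_pageview", "trending_hours",
    "surprisedness", "revision_id", "revision_timestamp",
    "user_type", "username", "editcount", "new_user",
    "protect", "editcount_120d_90d",
]


def read_bursts(stream):
    """Yield one dict per revision row, skipping the header line."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # discard the header row
    for fields in reader:
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        yield dict(zip(COLUMNS, fields))


# Usage with the first data row of the sample above.
sample_row = [
    "National_Football_League_Most_Valuable_Player_Award", "2019918",
    "REDIRECT_RESOLVED", "2009/01/03 12:00:00", "212", "1270", "1",
    "4.99056603774", "261557507", "20090103000010", "REG",
    "Howdythere", "880", "OLD", "NO_PROTECT", "1",
]
sample = "\t".join(COLUMNS) + "\n" + "\t".join(sample_row) + "\n"
rows = list(read_bursts(io.StringIO(sample)))
print(rows[0]["title"], rows[0]["surprisedness"])
```

For a real file, the same function can be given an open file object instead of the in-memory sample, for example `open(path, newline="", encoding="utf-8")`; the encoding is an assumption.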

  • title
  • page_id
  • redirect?
  • pageview timestamp
  • predicted pageview: linear prediction from the previous two days
  • actual pageview
  • trending hours: the duration of the continued trending days
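  • surprisedness: relative increase from the predicted to the actual page view count, (actual - predicted) / predicted, given as a ratio rather than a percentage (in the sample above, 4.99 corresponds to an increase from 212 to 1270)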
  • revision ID
  • revision timestamp: date, hour, minute and second in the form YYYYMMDDHHMMSS (e.g. 20090103000010)
  • user type: registered user (REG), bot, or anonymous user (ANON)
  • username
  • editcount: the user's edit count up to the revision timestamp
  • new user?: whether the user had 30 days of editing history as of the revision (NEW or OLD)
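
The sample rows are consistent with surprisedness being computed as (actual - predicted) / predicted, and with the two timestamp columns using different formats. The sketch below illustrates both points. The formula is inferred from the sample values rather than taken from the scripts, and the function names are hypothetical.

```python
from datetime import datetime


def surprisedness(predicted, actual):
    """Relative increase of the actual page views over the prediction.

    Inferred from the sample rows; a prediction of 0 would raise
    ZeroDivisionError in this sketch.
    """
    return (actual - predicted) / predicted


def parse_pageview_timestamp(value):
    """Parse the pageview timestamp, e.g. '2009/01/03 12:00:00'."""
    return datetime.strptime(value, "%Y/%m/%d %H:%M:%S")


def parse_revision_timestamp(value):
    """Parse the revision timestamp, e.g. '20090103000010'."""
    return datetime.strptime(value, "%Y%m%d%H%M%S")


# Values taken from the first two distinct articles in the sample.
print(surprisedness(212, 1270))
print(surprisedness(65, 1668))
print(parse_revision_timestamp("20090103000010"))
```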

Reproduction

Use the scripts available at [1] and follow the documentation.

Notes

Because producing the dataset for one month takes 1-2 days, samples were prepared for only five months, one every half year between January 2009 and January 2011. Edit count values may be incorrect (see bug:19311).