WSoR datasets

From Meta, a Wikimedia project coordination wiki

This page lists datasets created during the Wikimedia Foundation Summer of Research (WSoR) 2011 that are likely to be of use in the future.

Please note: we are still working out the most viable long-term solution for publishing these datasets, in collaboration with Dario Taraborelli and others. If you have questions or comments about any of the data, please ask on the Talk page.

Add a dataset

To create a new dataset page:

  1. Use this form to create the dataset page. Make sure to give your dataset a name.

  2. Add your dataset to the list below, e.g. {{WSoR_dataset|Dataset name|Short description}}

Datasets

  • policy_counts
    Yearly contribution counts to selected pages in project namespaces.
  • bot
    A curated table of bot user_ids--useful for flagging and removing bots during analysis.
  • user_year_month_namespace
    An aggregation of user activity by namespace and month--useful for visualizing months of editor activity.
  • user_cohort
    A list of users who made at least one edit with the dates of their first and last edits included--useful for grouping editors into cohorts.
  • rev_len_changed
    An approximate diff size for each revision--useful for approximating the amount of content an editor has added.
  • user_first_msg
    The first edit to a user's talk page, with metadata including message type and automated tool used.
  • user_activity_first_msg
    A summary of editor activity before and after they receive their first message.
  • user_approx_registration
    An approximate registration date for editors who started editing before registration dates were recorded.
  • revert
    Reverting revisions, with a field indicating whether the revert was for vandalism (a guess based on a regular expression matched against the edit comment).
  • reverted
    Reverted revisions, with information about the reverted edit and whether the revert was for vandalism (a guess based on a regular expression matched against the edit comment).
  • revision_diff
    The optimal diff information for all revisions from the April 2011 XML database dump.
  • trending_articles
    A list of revisions to trending articles, limited to the periods when each article was trending, covering 5 months.
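
As a rough illustration of how two of the tables above might be used together, the sketch below guesses whether a revert was for vandalism from its edit comment and drops rows belonging to known bots. Everything in it is an assumption made for illustration: the pattern is hypothetical and is not the expression used to build the revert and reverted tables, and the row layout (dictionaries with a user_id key) is invented rather than taken from the actual schemas.

```python
import re

# Hypothetical pattern, for illustration only. It is NOT the expression
# used to build the WSoR revert/reverted tables.
VANDALISM_RE = re.compile(r"\b(vandal\w*|rvv)\b", re.IGNORECASE)


def is_vandalism_revert(comment):
    """Guess from the edit comment whether a revert was for vandalism."""
    return bool(comment) and VANDALISM_RE.search(comment) is not None


def drop_bot_rows(rows, bot_user_ids):
    """Keep only rows whose user_id is absent from the curated bot set.

    `rows` is assumed to be an iterable of dicts with a "user_id" key.
    """
    return [row for row in rows if row["user_id"] not in bot_user_ids]
```

A comment-based guess like this will miss vandalism reverts with empty or unusual comments and will occasionally flag comments that only mention vandalism, so the resulting field is best treated as an approximation.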

WikiProject Datasets: