Research:Detox/Data Releasev2

All data collected or generated for the Wikipedia Detox projects is available under free licenses on Figshare, per our open access policy. There are currently two distinct types of data included:

a complete corpus of discussion comments
a sample of crowd-annotated discussion comments

This document records the schema of the published data files. For details on data collection methodology and modeling, please refer to our research paper. For a quick demo of how to use the data for model building and analysis, check out this IPython notebook.

Datasets

Wikipedia Talk Corpus

Schema for comments_{year}.tsv

This file contains all comments posted in user and article talk page discussions in {year}.

comment_id: Unique comment id. Corresponds to the MediaWiki revision id of the edit that added the comment to a talk page (i.e. discussion).
comment: Comment text. Consists of the concatenation of content added during a revision/edit of a talk page. MediaWiki markup and HTML have been stripped out. To simplify tsv parsing, \n has been mapped to NEWLINE_TOKEN, \t has been mapped to TAB_TOKEN and " has been mapped to `.
raw_comment: Raw comment text. Consists of the concatenation of raw content added during a revision of a talk page.
timestamp: Timestamp in UTC.
page_id: MediaWiki page id of the talk page the comment was made on.
page_title: Title of the talk page the comment was made on.
user_id: MediaWiki user id of the author of the comment. Is always "0" for anonymous contributions.
user_text: Username of the author of the comment. Is an IP in the case of anonymous contributions.
ns: Namespace of the discussion page the comment was made in. Takes on values in {user, article}.
bot: Indicator of whether the comment was made by a bot based on simple heuristics.
admin: Indicator of whether the comment serves an administrative purpose based on simple heuristics.
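The token mapping described for the comment field can be partly undone when the file is read. The following is a minimal sketch, assuming the file is tab-separated with a header row and no quoting (an assumption, not something this page states). The helper names `restore_comment_text` and `read_comments` are hypothetical, and the sample rows are made up for illustration.

```python
import csv
import io

def restore_comment_text(text):
    """Undo the newline and tab token mapping described for the comment field."""
    # The mapping of " to ` is left alone: a backtick in the file may have
    # been a backtick in the original comment, so it cannot be reversed safely.
    return text.replace("NEWLINE_TOKEN", "\n").replace("TAB_TOKEN", "\t")

def read_comments(handle):
    """Yield one dict per row of a comments_{year}.tsv-style file."""
    # Assumption: header row present, tab-delimited, no quote characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        row["comment"] = restore_comment_text(row["comment"])
        yield row

# Tiny in-memory stand-in for a real file (made-up rows, for illustration only).
sample = io.StringIO(
    "comment_id\tcomment\tns\n"
    "1\tHelloNEWLINE_TOKENworld\tuser\n"
    "2\tcol1TAB_TOKENcol2\tarticle\n"
)
rows = list(read_comments(sample))
```

For a file on disk, the same function could be given a handle opened with `open(path, newline="", encoding="utf-8")`; the encoding is also an assumption.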

Wikipedia Talk Labels: Personal Attacks

Schema for attack_annotated_comments.tsv

This file contains the comment text and metadata for comments with personal attack labels generated by crowd-workers. The actual labels are in attack_annotations.tsv since each comment was labeled multiple times.

comment_id: Unique comment id. Corresponds to the MediaWiki revision id of the edit that added the comment to a talk page (i.e. discussion).
comment: Comment text. Consists of the concatenation of content added during a revision/edit of a talk page. MediaWiki markup and HTML have been stripped out. To simplify tsv parsing, \n has been mapped to NEWLINE_TOKEN, \t has been mapped to TAB_TOKEN and " has been mapped to `.
year: The year the comment was posted in.
logged_in: Indicator for whether the user who made the comment was logged in. Takes on values in {0, 1}.
ns: Namespace of the discussion page the comment was made in. Takes on values in {user, article}.
sample: Indicates whether the comment came via random sampling of all comments, or whether it came from random sampling of the 5 comments around a block event for violating WP:npa or WP:HA. Takes on values in {random, blocked}.
split: For model building in our paper we split comments into train, dev and test sets. Takes on values in {train, dev, test}.
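The split column makes it possible to reproduce the train, dev and test partition without re-sampling. A minimal sketch, assuming rows have already been read into dictionaries keyed by column name; `partition_by_split` is a hypothetical helper and the rows below are made up.

```python
from collections import defaultdict

def partition_by_split(rows):
    """Group comment rows by the value of their split column."""
    parts = defaultdict(list)
    for row in rows:
        parts[row["split"]].append(row)
    return dict(parts)

# Made-up rows standing in for attack_annotated_comments.tsv.
comments = [
    {"comment_id": "10", "sample": "random", "split": "train"},
    {"comment_id": "11", "sample": "blocked", "split": "dev"},
    {"comment_id": "12", "sample": "random", "split": "test"},
    {"comment_id": "13", "sample": "blocked", "split": "train"},
]

parts = partition_by_split(comments)
train_ids = [row["comment_id"] for row in parts["train"]]
```

The same grouping works on the sample column if the random and blocked subsets need to be analyzed separately.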

Schema for attack_annotations.tsv

This file contains personal attack labels from several crowd-workers for each comment in attack_annotated_comments.tsv. It is meant to be joined with attack_annotated_comments.tsv on `comment_id`.

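comment_id: Comment id. Matches comment_id in attack_annotated_comments.tsv, i.e. the MediaWiki revision id of the edit that added the comment to a talk page. It is not unique within this file, since each comment was labeled by several workers.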
worker_id: Anonymized crowd-worker id. Might be useful in culling unreliable annotators.
attack: Indicator for whether the worker thought the comment contains a personal attack. The exact question we posed can be found here. The annotation takes on value 0 if the worker selected the option "This is not an attack or harassment" and value 1 otherwise. Takes on values in {0, 1}.
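Because each comment carries several annotations, the labels usually need to be aggregated to one value per comment before joining on comment_id. The sketch below uses a simple majority of the attack column. This aggregation rule is an assumption chosen for illustration, not a method prescribed by this page. `majority_labels` and `join_labels` are hypothetical helpers and the data is made up.

```python
from collections import defaultdict

def majority_labels(annotations, column):
    """Map each comment_id to 1 if more than half of its annotations are 1, else 0."""
    votes = defaultdict(list)
    for ann in annotations:
        votes[ann["comment_id"]].append(int(ann[column]))
    # An exact tie falls to 0 because the comparison is strict.
    return {cid: int(sum(v) / len(v) > 0.5) for cid, v in votes.items()}

def join_labels(comments, labels, column):
    """Attach the aggregated label to each comment row that has one."""
    return [
        dict(comment, **{column: labels[comment["comment_id"]]})
        for comment in comments
        if comment["comment_id"] in labels
    ]

# Made-up rows standing in for the two files.
comments = [
    {"comment_id": "10", "comment": "first example"},
    {"comment_id": "11", "comment": "second example"},
]
annotations = [
    {"comment_id": "10", "worker_id": "w1", "attack": "1"},
    {"comment_id": "10", "worker_id": "w2", "attack": "1"},
    {"comment_id": "10", "worker_id": "w3", "attack": "0"},
    {"comment_id": "11", "worker_id": "w1", "attack": "0"},
    {"comment_id": "11", "worker_id": "w2", "attack": "0"},
    {"comment_id": "11", "worker_id": "w3", "attack": "1"},
]

labels = majority_labels(annotations, "attack")
joined = join_labels(comments, labels, "attack")
```

Other aggregations, such as keeping the raw fraction of 1 values as a soft label, follow the same grouping step.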

Wikipedia Talk Labels: Aggression

Schema for aggression_annotated_comments.tsv

This file contains the comment text and metadata for comments with aggression labels generated by crowd-workers. The actual labels are in aggression_annotations.tsv since each comment was labeled multiple times.

comment_id: Unique comment id. Corresponds to the MediaWiki revision id of the edit that added the comment to a talk page (i.e. discussion).
comment: Comment text. Consists of the concatenation of content added during a revision/edit of a talk page. MediaWiki markup and HTML have been stripped out. To simplify tsv parsing, \n has been mapped to NEWLINE_TOKEN, \t has been mapped to TAB_TOKEN and " has been mapped to `.
year: The year the comment was posted in.
logged_in: Indicator for whether the user who made the comment was logged in. Takes on values in {0, 1}.
ns: Namespace of the discussion page the comment was made in. Takes on values in {user, article}.
sample: Indicates whether the comment came via random sampling of all comments, or whether it came from random sampling of the 5 comments around a block event for violating WP:npa or WP:HA. Takes on values in {random, blocked}.
split: For model building in our paper we split comments into train, dev and test sets. Takes on values in {train, dev, test}.

Schema for aggression_annotations.tsv

This file contains aggression labels from several crowd-workers for each comment in aggression_annotated_comments.tsv. It is meant to be joined with aggression_annotated_comments.tsv on `comment_id`.

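comment_id: Comment id. Matches comment_id in aggression_annotated_comments.tsv, i.e. the MediaWiki revision id of the edit that added the comment to a talk page. It is not unique within this file, since each comment was labeled by several workers.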
worker_id: Anonymized crowd-worker id. Might be useful in culling unreliable annotators.
aggression: Indicator for whether the worker thought the comment has an aggressive tone. The exact question we posed can be found here. The annotation takes on the value 1 if the worker considered the comment aggressive and value 0 if the worker considered the comment neutral or friendly. Takes on values in {0, 1}.
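The worker_id field can support a rough reliability check. One possible heuristic, sketched below, is to measure how often each worker's label matches the per-comment majority. This heuristic is an assumption for illustration and the page does not prescribe any culling method. `worker_agreement` is a hypothetical helper, the 0.5 cutoff is arbitrary, and the rows are made up.

```python
from collections import defaultdict

def worker_agreement(annotations, column):
    """Return, per worker, the share of their labels that match the per-comment majority."""
    by_comment = defaultdict(list)
    for ann in annotations:
        by_comment[ann["comment_id"]].append(int(ann[column]))
    # Majority value per comment; an exact tie falls to 0.
    majority = {cid: int(sum(v) / len(v) > 0.5) for cid, v in by_comment.items()}

    hits = defaultdict(int)
    totals = defaultdict(int)
    for ann in annotations:
        worker = ann["worker_id"]
        totals[worker] += 1
        hits[worker] += int(int(ann[column]) == majority[ann["comment_id"]])
    return {worker: hits[worker] / totals[worker] for worker in totals}

# Made-up rows standing in for aggression_annotations.tsv.
annotations = [
    {"comment_id": "10", "worker_id": "w1", "aggression": "1"},
    {"comment_id": "10", "worker_id": "w2", "aggression": "1"},
    {"comment_id": "10", "worker_id": "w3", "aggression": "0"},
    {"comment_id": "11", "worker_id": "w1", "aggression": "0"},
    {"comment_id": "11", "worker_id": "w2", "aggression": "0"},
    {"comment_id": "11", "worker_id": "w3", "aggression": "1"},
]

agreement = worker_agreement(annotations, "aggression")
# Arbitrary illustrative cutoff: flag workers who agree with the majority less than half the time.
flagged = sorted(worker for worker, rate in agreement.items() if rate < 0.5)
```

Agreement with the majority penalizes workers who are right when most others are wrong, so a flagged worker is a candidate for review rather than automatic removal.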

License

These datasets are released under a CC0 public domain dedication. If you're using this data in your research, please provide attribution via the recommended citation below.

Citation

This dataset can be cited as:

Wulczyn, Ellery; Thain, Nithum; Dixon, Lucas (2016): Wikipedia Detox. figshare. doi.org/10.6084/m9.figshare.4054689

Retrieved: 13:00, Oct 31, 2016 (GMT)