
Research:Automatically labeling low quality content



Created: 04:02, 7 October 2020 (UTC)
Collaborators: Aaron Halfaker, Nikola Banovic
Keywords: labeling, machine learning

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.


We are trying to build models that can automatically flag issues with statements on Wikipedia, along dimensions such as grammar, biased wording, and missing citations. The goal is to use edits from across Wikipedia to learn quality-improving behaviors on statements along these dimensions, and then to use those learned behaviors to build models that can automatically identify such issues.

Methods

Data Extraction

We extracted about 6 million Wikipedia edits from articles of varying quality and preprocessed them to identify their semantic intention. We extracted edits for the following semantic intentions (a sketch of this tagging step follows the list):

  1. Point-of-view
  2. Citations
  3. Clarifications
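
As a rough illustration of this tagging step, the sketch below assigns intention labels to an edit based on its edit comment using simple keyword heuristics. The edit record fields (rev_id, comment, text_before, text_after) and the keyword patterns are assumptions made for the example; the actual pipeline presumably relies on a trained intent classifier rather than comment keywords.

```python
# A minimal sketch of tagging edits with a semantic intention, assuming each
# edit is represented as a dict holding the revision comment and the text
# before/after the change. The keyword heuristics are illustrative placeholders.
import re

INTENT_PATTERNS = {
    "point-of-view":  re.compile(r"\b(npov|pov|neutral|bias)\b", re.I),
    "citations":      re.compile(r"\b(citation needed|cite|source|unsourced)\b", re.I),
    "clarifications": re.compile(r"\b(clarify|clarification|vague|ambiguous)\b", re.I),
}

def tag_intentions(edit):
    """Return the semantic intentions suggested by an edit's comment."""
    comment = edit.get("comment", "") or ""
    return [label for label, pattern in INTENT_PATTERNS.items()
            if pattern.search(comment)]

# Example edit record (hypothetical fields):
edit = {
    "rev_id": 12345,
    "comment": "rm biased wording per NPOV; still needs a source",
    "text_before": "X is obviously the greatest city in the world.",
    "text_after": "X is a large city.",
}
print(tag_intentions(edit))  # ['point-of-view', 'citations']
```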

We use the statements as they stood before each semantically identified edit as positive examples of needing that semantic improvement. For example, an edit with a point-of-view intention is trying to make a statement or paragraph more neutral, so we extract the pre-edit statements it touched as positive examples of needing a point-of-view improvement. We use these statements to train models that can then automatically identify issues such as point-of-view, clarification, and citation needs in unseen Wikipedia statements.
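
The sketch below illustrates one way to recover the pre-edit statements touched by an edit, by diffing naive sentence lists of the text before and after the change with Python's difflib. The sentence splitter and the example texts are placeholders; the project's actual segmentation and diffing of wikitext are likely more careful.

```python
# A minimal sketch of pulling out the pre-edit statements changed by an edit
# so they can serve as positive examples for the edit's tagged improvement.
import difflib
import re

def split_sentences(text):
    # Naive period-based splitter: good enough for a sketch, not for wikitext.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def modified_statements(text_before, text_after):
    """Sentences present before the edit that were removed or rewritten by it."""
    before = split_sentences(text_before)
    after = split_sentences(text_after)
    matcher = difflib.SequenceMatcher(a=before, b=after)
    changed = []
    for op, i1, i2, _, _ in matcher.get_opcodes():
        if op in ("replace", "delete"):
            changed.extend(before[i1:i2])
    return changed

# Each changed sentence becomes a positive example for the edit's intention(s):
positives = [(s, "point-of-view") for s in modified_statements(
    "X is obviously the greatest city in the world. It lies on a river.",
    "X is a large city. It lies on a river.")]
print(positives)  # [('X is obviously the greatest city in the world.', 'point-of-view')]
```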

Statement Quality Identification

At present, we are focusing on three categories of improvements: point-of-view, need for citations, and need for clarification. From the edits extracted and labeled above, we take the statements that were modified and treat them as needing the corresponding quality improvement. We then use these statements to build quality identification models, to show that meaningful quality-improving behaviors can be learned from good-quality, non-vandalism edits.
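
For illustration, the sketch below trains a per-category statement quality model on a handful of toy examples, using a TF-IDF and logistic regression baseline from scikit-learn. The example statements, labels, and model choice are assumptions made for the sketch; they are not the project's actual data or models.

```python
# A minimal sketch of a per-category statement quality model, assuming labeled
# statements from the extraction step above. The TF-IDF + logistic regression
# baseline is purely illustrative; the project's models may differ.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Positive examples: statements later fixed by point-of-view edits.
# Negative examples: statements drawn from text that was left unchanged.
statements = [
    "X is obviously the greatest city in the world.",         # needs POV fix
    "Critics universally agree the film is a masterpiece.",   # needs POV fix
    "X has a population of about 2 million.",                 # fine as-is
    "The film was released in 1994.",                         # fine as-is
]
labels = [1, 1, 0, 0]  # 1 = needs point-of-view improvement

pov_model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
pov_model.fit(statements, labels)

# Flag unseen statements that may need a point-of-view improvement.
print(pov_model.predict_proba(["Everyone knows this policy was a disaster."])[:, 1])
```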

Based on the results, we intend to expand this approach to detect a wider variety of issues with statements, using the same strategy of learning quality-improving behaviors from Wikipedia edits.

Figure: Visual depiction of the proposed pipeline for identifying content issues in statements on Wikipedia using edits.

Policy, Ethics and Human Subjects Research

Our work does not record any data that is not already publicly available in the Wikipedia ecosystem. We do not collect any information beyond Wikipedia editors' assessments of the statements that are part of the study. The University of Michigan Institutional Review Board Health Sciences and Behavioral Sciences has determined that this study is exempt from IRB oversight (Date: 9/24/2020, IRB No. HUM00187850). To solicit feedback from the community, we intend to post a small number of predictions from the model on English Wikipedia's Featured Article Review space, to show the potential of this research for easing the review process.

Results

This work will directly benefit the Wikipedia community in its efforts to improve the quality of Wikipedia articles. Automatically detecting issues with statements in Wikipedia articles can accelerate article quality improvement efforts.

References