Community Tech/Improve the plagiarism detection bot

Tracked in Phabricator:
Task T120435

The "Improve the plagiarism detection bot" project aims to make it easier to find and flag plagiarism, so that it can be reverted and the users who added it can be educated.

The new tool is CopyPatrol -- come try it out!

This tool is based on EranBot, the plagiarism detection bot which is currently running on English Wikipedia. EranBot is built and maintained by en:User:ערן (aka Eran); Community Tech's role is supporting and building on Eran's work.

Goals for this project

This process needs people to check the possible copyright violations. Currently, only a small number of users are using the EranBot interface on en.wp. Our goal is to increase the number of people participating, in the following ways:

  • Make the tool easier for people to pick up and learn
  • Make the workflow easier to use
  • Make the tool more satisfying to use, so that people return and continue to help check possible violations
  • Eliminate some false positives, so that people don't have to spend time checking them

Status

Dec 8

CopyPatrol is now available for French Wikipedia! You can select the language project at the top left of the CopyPatrol interface. There are thirty languages that we can set up; we'll be putting out a call soon to see which other language WPs are interested in using the tool. Here are some comments from French WP: [1]


June 23

Several new features on CopyPatrol today: There's a history link under the user name, there are labels for "Article" and "Source" in the Compare panels, you can undo your own review if you've made a mistake, and it shows the correct time of the suspected edit.

June 15

CopyPatrol now has a dropdown Compare panel, which shows the comparison between the Wikipedia edit and the possible copyvio source as a part of the interface. This is a huge step -- one of the most important things that we wanted to do with this tool -- so please come take a look, and tell us what you think. :)

June 8

The CopyPatrol interface now has OAuth support, so people can log in. We're changing the name from Plagiabot to CopyPatrol for a couple of reasons -- "Plagiabot" could make users think that this is all done by bot, and "CopyPatrol" sounds like a job that humans can help with. It's also easier to spell. :)

May 18

Progress on the new Plagiabot interface is going well. The basic item structure is there, with links to the article, the diff and the editor info. WikiProjects are displayed on the page (although not yet clickable/searchable). There are dropdowns to show the text comparisons, and a link to the Turnitin report.

Coming up next: OAuth support so that people can log in, showing the text comparison in the dropdown compare panes, an interface to filter items by WikiProject, and a history log. We'll see lots more progress over the next few weeks.

May 9

We're currently building a new interface for Plagiabot on Tool Labs, which will make the tool easier to learn and use. There are a lot of links and pieces of information that people need in order to review a suspected copyvio edit, and the new interface should organize those in a way that will make sense for more users, and facilitate their workflows. We also want to include the comparison view in the interface page, so that users don't have to open more tabs to see the article and source comparisons.

We're working off the wireframes that are posted on this page. The UI is still developing as we go along, so please post your questions and suggestions on the talk page!

April 8

We talked to User:Eran, the creator of EranBot (aka Plagiabot; see en:User:EranBot/Copyright/rc), at the Wikimedia Hackathon. It's a great tool; our job will be to port it to Tool Labs, and make interface improvements that will encourage more people to use it on a regular basis.

There are a lot of ideas for possible improvements on the Notes page.

Rationale

People copy and paste text into Wikipedia articles. We want to detect when that happens, so that we can check it out, revert, and notify the contributor that we can't use copyright-protected texts, or block them if necessary. There is currently a tool called EranBot, which checks new edits against the Turnitin database, and reports possible copyvio for users to inspect and address. The proposal to improve the system was the ninth most popular suggestion in the 2015 Community Wishlist Survey, with 63 supporting votes.

Wireframes

These wireframes are a draft, posted here for discussion. Come join the discussion on the talk page.

These are wireframes for continued development on the CopyPatrol tool. We're making changes on both and figuring things out as we go, so don't expect the wireframes and the prototype to match exactly. :)

Requirements

This is a draft of requirements for the new interface; this needs discussion. Come join the discussion on the talk page.

In this description: "Editor" is the person who made the suspected edit; "user" is the person who's using the EranBot interface to evaluate edits.

Page structure

  • User can sign in using OAuth, connected to their wiki account. Only logged-in users will be able to review open cases. (T132081)
  • User can filter items by WikiProject, using an input field at the top of the page that autocompletes with WikiProject names. The user can select multiple WikiProjects. Pressing Submit reloads the page, showing just the items that are in the selected WikiProjects. (T132350)
  • User can filter items based on triage state -- open cases, all Page fixed cases, all No action needed cases. (T132352)
  • User can filter items to see only the cases that the user has reviewed. (T132352)
  • User can sort by oldest/newest. (T132349)
  • User has access to all unchecked items in the system, using infinite scroll. (T132348)
  • User can see a chronological log (history) of all actions taken using the interface, which can be sorted and filtered by time, editor's name, user's name, and WikiProject. (T132830)
  • User can see a brief set of instructions near the top of the page, with a link to more detailed instructions.
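
As a rough illustration of the filtering and sorting requirements above, here is a minimal sketch in Python. It is not the tool's actual code: the `Item` record, the state names and the `filter_items` function are all hypothetical, and it assumes that selecting multiple WikiProjects means "show items in any of them", which the requirements don't spell out.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

# Hypothetical review states, named after the buttons in the requirements.
OPEN = "open"
PAGE_FIXED = "page_fixed"
NO_ACTION_NEEDED = "no_action_needed"


@dataclass
class Item:
    """One suspected-copyvio edit (hypothetical record layout)."""
    title: str
    timestamp: datetime
    wikiprojects: set = field(default_factory=set)
    state: str = OPEN
    reviewed_by: Optional[str] = None


def filter_items(items, wikiprojects=None, state=None, reviewer=None,
                 newest_first=True):
    """Return the items matching every filter that is set, sorted by edit time."""
    selected = []
    for item in items:
        # Assumption: an item matches if it is in at least one selected WikiProject.
        if wikiprojects and not (item.wikiprojects & set(wikiprojects)):
            continue
        if state is not None and item.state != state:
            continue
        if reviewer is not None and item.reviewed_by != reviewer:
            continue
        selected.append(item)
    return sorted(selected, key=lambda i: i.timestamp, reverse=newest_first)
```

For example, `filter_items(items, wikiprojects=["Medicine"], state=OPEN)` would return the open cases in that WikiProject, newest first.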

Item structure

User has easy access to the following links/information: (T132831)

  • Page
    • Article title
    • Link to article history
  • Edit
    • Diff of the suspected edit
    • Date/timestamp of the suspected edit
  • Editor
    • User name
    • Link to contributions
    • Link to user talk page
    • Editor's edit count
      • Note: If editor's user page and/or talk page are redlinks, show as red
  • WikiProjects
    • A set of bubbles with each related WikiProject
  • Review
    • Button: Page fixed
    • Button: No action needed
    • Once the item's been reviewed, display the name of the user who reviewed it, and the date/timestamp.

Comparison

  • For each item, user can open a pane that shows a two-column comparison between the suspected edit on the left, and the possible copyvio source on the right. (T132832)
  • The potential text matches will be highlighted in red (as in Earwig's tool).
  • Both columns will have a scroll bar, so the user can line up the corresponding text matches.
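
To give a feel for what "potential text matches" could mean in the comparison pane, here is a minimal sketch that finds runs of consecutive words shared by the edit and the source, using Python's standard `difflib`. This is only an illustration under our own assumptions; it is not how Turnitin or Earwig's tool compute matches, and the function name and the `min_words` cutoff are made up for the example.

```python
from difflib import SequenceMatcher


def matching_passages(edit_text, source_text, min_words=4):
    """Return runs of at least min_words consecutive words found in both texts."""
    edit_words = edit_text.split()
    source_words = source_text.split()
    # Compare word lists rather than characters; autojunk=False keeps common
    # words from being discarded as "junk" on long inputs.
    matcher = SequenceMatcher(None, edit_words, source_words, autojunk=False)
    passages = []
    for block in matcher.get_matching_blocks():
        # The final block always has size 0, so this check also skips it.
        if block.size >= min_words:
            passages.append(" ".join(edit_words[block.a:block.a + block.size]))
    return passages
```

The returned passages are the spans an interface could highlight in both columns. The comparison here is case-sensitive and exact, so lightly reworded text would not be caught.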

Workflow

  • When the user has determined whether the suspected edit is acceptable or not, they can take the following actions: (T132079)
    • Page Fixed: The edit is determined to be copyvio, and the user has reverted or fixed the page.
    • No Action Needed: The edit is a false positive, nothing needs to be done.
  • (Call to action/reminder to leave a message on the editor's talk page?)

False Positives

  • Don't mark as copyvio if the potential violation is within quotation marks and followed by a <ref> tag. (to-do: check the policies on acceptable length of quotes?)
  • Don't mark as copyvio if the suspected edit is reverted within ten minutes. (i.e., check the page history ten minutes after the edit was made, and don't add the edit to the report if it was reverted in that time)
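
The two rules above could be sketched roughly like this in Python. This is a simplified illustration, not the bot's implementation: the function names are hypothetical, the regular expression only handles straight double quotes, and it assumes the caller already knows when (or whether) the edit was reverted.

```python
import re
from datetime import datetime, timedelta

# A straight-quoted passage immediately followed by a <ref> tag
# (optional whitespace in between). Curly quotes are not handled.
QUOTED_WITH_REF = re.compile(r'"([^"]+)"\s*<ref[\s>/]')

REVERT_WINDOW = timedelta(minutes=10)


def is_quoted_and_cited(wikitext, matched_text):
    """True if matched_text lies inside a quotation that is followed by a <ref> tag."""
    return any(matched_text in m.group(1)
               for m in QUOTED_WITH_REF.finditer(wikitext))


def was_reverted_quickly(edit_time, revert_time):
    """True if the edit was reverted within ten minutes (revert_time is None if never)."""
    if revert_time is None:
        return False
    return timedelta(0) <= revert_time - edit_time <= REVERT_WINDOW


def should_report(wikitext, matched_text, edit_time, revert_time=None):
    """Report the edit only if neither false-positive rule applies."""
    if is_quoted_and_cited(wikitext, matched_text):
        return False
    if was_reverted_quickly(edit_time, revert_time):
        return False
    return True
```

Note that the sketch does not check quote length, which is still an open to-do in the first rule.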

Advanced features

  • An indication on the item if an editor has been flagged as copyvio multiple times before
  • A whitelist for trusted editors who have been flagged as false positives
  • Investigate as possible sources of false positives: adding tables, adding timelines, list articles
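
As a rough sketch of the first two ideas, past review outcomes could be tallied per editor to decide what indication to show. Everything here is an assumption made for illustration: the outcome names, the labels, and the threshold of two confirmed copyvios are not decided anywhere in these requirements.

```python
from collections import Counter

# Hypothetical outcome names, matching the two review buttons.
PAGE_FIXED = "page_fixed"              # reviewer confirmed a copyvio
NO_ACTION_NEEDED = "no_action_needed"  # reviewer marked a false positive


def editor_labels(reviews, repeat_threshold=2, whitelist=frozenset()):
    """Map each editor to "trusted", "repeat" or "normal".

    reviews is an iterable of (editor, outcome) pairs from past reviews.
    """
    reviews = list(reviews)
    # Count only confirmed copyvios; false positives don't count against an editor.
    confirmed = Counter(editor for editor, outcome in reviews
                        if outcome == PAGE_FIXED)
    labels = {}
    for editor, _ in reviews:
        if editor in whitelist:
            labels[editor] = "trusted"
        elif confirmed[editor] >= repeat_threshold:
            labels[editor] = "repeat"
        else:
            labels[editor] = "normal"
    return labels
```

How editors get onto the whitelist is left open here; the sketch just takes it as an input.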

Technical discussion and background

Hosting the EranBot interface

This is an open question that needs discussion. Come join the discussion on the talk page.

One of the core questions that we need to answer is where the interface should be hosted. Currently, the interface is on en.wp in User:EranBot's user space -- User:EranBot/Copyright/rc. The Community Tech team is currently suggesting that the tool be hosted on Tool Labs. (A very rudimentary skeleton for the tool is at http://tools.wmflabs.org/plagiabot/ now.)

Pros for Tool Labs

  • Requiring people to add a line of code to their common.js is a barrier that turns away potential users.
  • It is not possible to build a Special page on-wiki that depends on a database or service that is not hosted in the Wikimedia Foundation production data centers.
  • Operating the bot as a Wikimedia Foundation production service will require security review and resource negotiations that will slow down development.
  • Easier and faster to make an interesting/engaging interface because code changes and deployments will not be subject to the Wikimedia Foundation production change timelines and protocols.
  • Easier to embed the comparison reports within the interface, so people don't have as many extra clicks.
  • Gives us the freedom to use more libraries and go through less intense code-review processes.
  • Discoverability is the same as the current page -- you need to follow a link or bookmark the page anyway; we could point those links to the Tool Labs version.

Cons for Tool Labs

  • Some people don't want to go to Tools in order to work on an en.wp workflow; it's a potential barrier.
  • On-wiki, navigation popups give links for article history, diff preview and editor info (although we could build those functions into the new interface).
  • From Diannaa: "Links that I have already visited would no longer appear in a different color, which I currently use to alert me to whether I have visited a page before. This lets me know that I have likely already given the user a warning and that it's a repeat violator."
  • Tool Labs wouldn't give us access to the Echo notification system.

Internal Community Tech team assessment

Support: Very high. Lots of support on the proposal, with several people specifically calling for a tool that can be used on multiple projects and multiple languages.
Impact: Medium to High, depending on what we're able to do. Integrating the human-checked false positive/true positive data into EranBot's existing database and improving the API could be particularly useful for research and machine learning projects, potentially improving the bot’s true positive rate and requiring less human involvement. The ability to adapt this for multiple projects and languages would be especially helpful.
Feasibility: There's an existing tool on English Wikipedia - EranBot, aka Plagiabot, based on the Turnitin database - and Community Tech has done some work to make the results more broadly useful in the last couple of months, including displaying the tool's results alongside Copyvios Detector's reports. (There are more details on ticket T110144.) Turning EranBot into an extension would be considerably more difficult than making improvements to the bot.
Risk: Medium, higher for more involved work. We'll need considerable discussion on the scope and definition.
Status: We're confident that we can do some helpful work on this wish this year. We need more investigation and discussion to figure out a clear scope of work. We'll be able to focus on it more in a few months.