Grants:IdeaLab/Automatic copy-paste & copyright identification

From Meta, a Wikimedia project coordination wiki
Automatic copy-paste & copyright identification
AI based plagiarism detection to identify and triage copyvio insertions to Wikipedia
idea creator
project manager
community organizer
created on 17:22, Monday, March 28, 2016 (UTC)

Project idea[edit]

What is the problem you're trying to solve?[edit]

In the 2015 Community Wishlist Survey, Wikimedians ranked "Improve the copy and paste detection bot" as the #9 critical issue. The underlying problem is that users copy and paste copyrighted content from other sources (including websites) into Wikipedia. Wikimedians have traditionally crowdsourced detection using resources such as Copyscape, Turnitin, or simply their own intuition. However, a good chunk of these cases are obvious ones, making them monotonous and tedious to deal with continuously.

What is your solution?[edit]

This is where Artificial Intelligence (AI) would be of great benefit. AI is a branch of computer science that makes use of computers to perform tasks that we commonly associate with intelligent humans. While we don't have AIs that are smart enough to replace a human Wikipedia editor, many AI strategies are very good at automating some of the more monotonous and voluminous tasks. Using AI, we would be able to detect obvious cases automatically and triage the rest based on urgency and likelihood for human editors to review.

Recently something similar was achieved for edit quality control on many language editions of Wikipedia, as well as Wikidata, using machine learning in a project called Revision Scoring as a Service, where the vast majority of edits were triaged as not needing review. This system is able to distinguish productive edits from damaging ones, and it also classifies intent (good faith/bad faith) so that damaging edits such as newbie mistakes can be separated from malicious edits. The output of this tool is used by a variety of third-party developers, including but not limited to huggle, raun, Real Time Recent Changes, and Dexbot (the automatic vandalism-revert bot of Persian Wikipedia). The system was trained by gathering feedback from the local community, using Wiki labels to identify what such edits look like. This way the system is trained on the needs of the local community.

Likewise, the problem of plagiarism detection (identification of copyvios) can be handled with cutting-edge AI algorithms. Bearing in mind that, just as we have a history of reverted vandalism edits to train AI on vandalism, we also have a history of content deleted over copyright/plagiarism concerns, which would serve as a starting point for the AI implementation. Such labelled information is a goldmine for AI algorithms to train on. This way, even less community effort would be needed during the training phase of the implementation.
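As a concrete illustration, the "copy-paste score" for an inserted passage can be as simple as character n-gram overlap against a candidate source; a trained model would combine many such signals learned from the labelled deletion history. The sketch below is illustrative only, with hypothetical function names that are not part of any existing tool:

```python
def char_ngrams(text, n=5):
    """Return the set of overlapping character n-grams for a text."""
    text = " ".join(text.lower().split())  # normalize case and whitespace
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def copy_paste_score(inserted_text, source_text, n=5):
    """Jaccard similarity of character n-grams, in [0, 1].

    A high score suggests the inserted text was copied from the source.
    This is only one feature; a real classifier would be trained on
    many features using labelled copyvio deletions.
    """
    a, b = char_ngrams(inserted_text, n), char_ngrams(source_text, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```

Identical texts score 1.0, unrelated texts near 0.0, and lightly paraphrased copies fall in between, which is exactly the gradient a triage system needs.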

I hope to give a talk on the matter at Wikimania 2016 as a kick-off for the project and for community outreach.

Project goals[edit]

The end goal of this project is to provide a service API (much like Revision Scoring as a Service) for bots, gadgets, and tools to use. This would provide a feed that triages edits with a copy-paste score.
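A consumer of such a feed might triage edits by thresholding the score. The thresholds and return strings below are purely illustrative assumptions, not part of any defined API; a deployed service would tune them against labelled data from the local community:

```python
def triage(score, auto_flag=0.9, needs_review=0.5):
    """Map a copy-paste score in [0, 1] to an illustrative triage decision.

    Thresholds are hypothetical: edits above `auto_flag` are treated as
    obvious cases, edits above `needs_review` are queued for humans, and
    the rest are skipped, mirroring the triage approach described above.
    """
    if score >= auto_flag:
        return "likely copyvio: flag for immediate attention"
    if score >= needs_review:
        return "uncertain: queue for human review"
    return "likely fine: no review needed"
```

Tools like huggle could apply such a function to the feed to prioritize patrol queues rather than presenting every edit.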

If time permits, the project would also include a UI for human review, but it would be preferable for tool developers to integrate the output into their own tools. Tools such as huggle could benefit from this: users would not need to check yet another page and could instead adjust their huggle feed.

Get involved[edit]



  • We need more plagiarism detection tools. —M@sssly 21:54, 28 March 2016 (UTC)
    • Nothing to do with machine learning? The current en:User:CorenSearchBot works pretty well, and it doesn't involve machine learning at all. The main purpose of an automatic copy-paste detection tool is to locate the possible sources (from search engines or databases). Once sources are located, it's easy to compare the added text with them and give it a score. I know there are other approaches that involve machine learning, such as comparing the writing style of the added text with the old text to identify possible copyvios. But these approaches seem to be useless: unless we have identified the sources, we can't revert the addition of text or send a new article to CV just because it looks like a copyright infringement. Antigng (talk) 01:03, 29 March 2016 (UTC)

Expand your idea[edit]

Would a grant from the Wikimedia Foundation help make your idea happen? You can expand this idea into a grant proposal.

Expand into an Individual Engagement Grant
Expand into a Project and Event Grant