What is the problem you're trying to solve?
Editors, especially new ones, will from time to time copy content from other sites without rephrasing it themselves, thereby infringing copyright. Copyright covers the form of the prose, not the facts. Usually the editors want the facts, and more or less unwittingly create a copyright infringement. Editors should of course not copy content verbatim, but it still happens from time to time.
At the Norwegian (Bokmål) Wikipedia we had a copyright infringement detector for a time, but it has been deprecated because the API it used is no longer available. The old Norwegian help page is at w:no:Bruker:Jeblad/Påvisning av opphavsrettsbrudd, the code at w:no:Bruker:Jeblad/Gadget-copyvio-check.js, and the style at w:no:Bruker:Jeblad/Gadget-copyvio-check.css. One possibility would be to reimplement this as a Node.js service, but it is probably just as easy to start from scratch and implement it as an extension.
Copyright violations must be interpreted in a legal context; many discussions on Wikipedia tend to center on the idea that some concept is copyrighted, but it is the form of the expression that is copyrighted. This means there must be some distance between the old expression and the new one to dismiss any claim of copyright violation, and that this distance must be maintained over some larger piece of text.
What is your solution?
What makes the proposed system possible is that it checks edits as they are made, before they can propagate to other sites.
- The user makes an edit
- The diff of the edit is used to build query sets
- The query sets are split on sentence boundaries
- Filter out sets with too few words
- The queries are sent to a search engine
- Keep results with a sufficient number of hits
- Accumulate the results for each sentence
- Filter out pages with more than N hits
- Check the fragments from the search engines for maximum similarity (edit distance via locality-sensitive hashing)
- Filter out pages with similarity above some level
- Tag and log the edits with a "high similarity"-tag
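The early steps of the pipeline above can be sketched roughly as follows. This is a minimal illustration, not the proposed implementation: the function names (`build_queries`, `similarity`), the word threshold, and the use of plain Jaccard similarity over word shingles as a stand-in for the locality-sensitive-hashing comparison are all assumptions for the sake of the example.

```python
import re

def build_queries(added_text, min_words=5):
    """Split added text on sentence boundaries and drop sentences
    too short to make distinctive search queries.
    (Sketch: real sentence splitting needs language-aware rules.)"""
    sentences = re.split(r'(?<=[.!?])\s+', added_text.strip())
    return [s for s in sentences if len(s.split()) >= min_words]

def shingles(text, k=3):
    """Word k-grams over the lowercased text; a simplified stand-in
    for the hashing step of a locality-sensitive hashing scheme."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def similarity(a, b, k=3):
    """Jaccard similarity over word shingles, approximating the
    edit-distance comparison between an edit and a search-engine
    fragment. Returns a value in [0.0, 1.0]."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)
```

In this sketch, edits whose best `similarity` score against any returned fragment exceeds a configured threshold would be the ones tagged as "high similarity".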
Such edits should then be inspected for possible copyright violations. The warning should also be shown to the editor so the copyvio can be fixed swiftly. The tag will, however, be visible to everyone, so anyone can fix it.
It should be possible to continue editing a specific copyvio in order to fix it, and the tag should then be removed when it no longer applies. If a follow-up edit changes the diff without solving the issue, the tag should be retained. A new edit that is not a continuation of a previous edit will not remove tags set on the previous edit. It will, however, be possible for a user with sufficient rights to remove such a tag; the log entry will still be retained.
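The tag lifecycle rules above can be condensed into a small decision function. This is a hedged sketch under assumed names (`update_tags`, the literal tag string "high similarity", and a plain dict for the edit record are all hypothetical, not part of the proposal):

```python
def update_tags(edit, is_continuation, still_matches):
    """Apply the tag lifecycle rules: a follow-up edit clears the
    'high similarity' tag only when the match is gone; an unrelated
    new edit leaves tags on the previous edit untouched."""
    if not is_continuation:
        return edit  # tags set on the previous edit are retained
    if still_matches:
        return edit  # diff changed, but the issue persists
    cleared = dict(edit)
    cleared["tags"] = [t for t in edit["tags"] if t != "high similarity"]
    return cleared
```

The log entry is deliberately not touched here: even when a privileged user removes the tag itself, the log of the original detection remains.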
Changes to external sites will not trigger a reevaluation of the tag; it will only change due to edits of the internal content and matches or non-matches against the original text snippets delivered by the search engines. This makes the system causal with respect to the original timestamp.
Note that from the third step onward the processing will be done in a separate worker thread so as not to block the delivery of the page.
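Running the later pipeline steps off the request path could look something like the following sketch. The hook name `on_edit_saved` and the `check_copyvio` placeholder are assumptions for illustration; the actual extension would use MediaWiki's own deferred-update mechanisms rather than a Python thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

# Background pool for the search-and-compare steps, so saving
# the page is never blocked on external search-engine calls.
executor = ThreadPoolExecutor(max_workers=4)

def check_copyvio(diff_text):
    """Placeholder for the heavy steps: query the search engine,
    accumulate per-sentence hits, compare similarity, tag the edit."""
    return {"diff": diff_text, "tagged": False}

def on_edit_saved(diff_text):
    """Hypothetical hook called after the edit is persisted;
    queues the copyvio check and returns immediately."""
    return executor.submit(check_copyvio, diff_text)
```

The key property is that the save request returns as soon as the job is queued; tagging happens asynchronously once the results are in.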
The primary goal is to remove the easy claim that "Wikipedia is full of copyright violations". It should be non-trivial to add a copyright violation to an article, and trivial to find and remove it. How this is done, and why, should be clearly visible to everyone.