Talk:CopyPatrol

From Meta, a Wikimedia project coordination wiki
NOTE: This page may not be regularly checked. If you need prompt attention from the maintainers please ping a member of Community Tech.

New backend coming soon

Tracked in Phabricator:
Task T333724

Hello all! I'm here to inform you that the backend (bot) that powers CopyPatrol will soon be replaced with a new one. I've been working with @JJMC89 on this for quite some time. We now have a demo ready, and are asking you all to see how it fares alongside the legacy feed powered by @EranBot.

You can check out the new feed on our staging instance at toolforge:plagiabot. Feel free to test out saving reviews there for the time being, as it is using a test database, but note the production CopyPatrol should still be tended to as well.

Our main concern is the volume of cases that appear in the new feed versus the old. We worry many of these are false positives, and we may be putting too much burden by cluttering the feed with illegitimate cases.

Other questions, which may affect the number of cases reported by the bot:

  • Should the bot skip reverted edits? We're planning on changing it so that it doesn't, and for CopyPatrol to clearly indicate which edits have already been reverted, and if you are a sysop, we'll provide a link to revision-delete the diff. Do you agree with this approach?
  • The new backend checks replaced text, and not just added text. We hope this surfaces more copyvios, but it may be leading to too many false positives. Let us know if you have any thoughts on this.
  • The current threshold for matching text against a source is 50%. We're wondering if that should be changed at all.
  • Compared to the old feed, the new one surfaces many more sources, including non-internet sources. Some such as this example have over 30 sources. Is this overkill? Maybe we should collapse the sources in the view to say, 10 maximum, or just omit showing them at all? This is with the understanding that sources towards the top will have a higher matching percentage.
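
The threshold and source-collapsing questions above can be made concrete with a minimal Python sketch. This is not the bot's actual code: the `Source` type, its field names, and both functions are hypothetical, and it assumes a case is reported when at least one source meets the threshold. The 50% threshold and the cap of 10 are simply the values mentioned in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    url: str
    percent: float  # share of the edit's text matching this source, 0-100


def should_report(sources, threshold=50.0):
    """Report a case only if at least one source meets the threshold."""
    return any(s.percent >= threshold for s in sources)


def visible_sources(sources, limit=10):
    """Return the highest-matching sources first, capped at `limit`."""
    ranked = sorted(sources, key=lambda s: s.percent, reverse=True)
    return ranked[:limit]
```

Raising `threshold` would shrink the feed, while lowering `limit` would only shorten what is displayed for each case.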

Feel free to leave your thoughts on the associated task (phab:T333724), or here in this thread. Pinging a few of our most prolific users: @Diannaa @Moneytrees @Sphilbrick @L3X1 @DanCherek @Ymblanter @Framawiki

Thanks for your feedback! MusikAnimal (WMF) (talk) 21:40, 16 August 2023 (UTC)

Hi @MusikAnimal (WMF). The new tool is listing a huge number of cases: 521 cases are listed for August 16, for example, where the original CopyPatrol only listed 108. That's an impossible number of cases for us to complete given the number of patrollers we have that work on this task daily. I can only do about 20 cases per hour tops, and often a lot less. Even with the old version of CopyPatrol, if a key person misses even one day, we have difficulties. So that has to be fixed.
Something I see in the old version that I am not yet seeing in the new version: When I click on the iThenticate link, the old version tells me the date the source was crawled. That can be a helpful clue to help determine if the material was copied from elsewhere on Wikipedia or if it's a true copyvio, so I would like to see it included.
We don't need to see a huge list of possible sources. This is especially true where the edit itself is tiny. Typically a lot of the potential sources are replicating the same material. Here is an example. All the editor did was move some prose from an image caption into the body of the article. If an editor has added a lot of copyvio from multiple sources, it's usually noticeable right away from the page history, and can be checked with Earwig's tool.
I love that you've added the ability to search within the loaded pages on the iThenticate report. That is impossible to do in the original version of CopyPatrol, at least on my setup. That's all for now. Diannaa (talk) 02:15, 17 August 2023 (UTC)
I am getting an error message when I attempt to mark a case as "Page fixed" or "No action needed". It says, 'Something went wrong. Please try again.' Diannaa (talk) 02:47, 17 August 2023 (UTC)
Ah, that's a glitch I must have recently introduced. I'll fix it soon, but for now you can ignore the reviewing process since it's identical to the old one, anyway. MusikAnimal (WMF) (talk) 00:59, 18 August 2023 (UTC)
This should be fixed now. MusikAnimal (WMF) (talk) 01:58, 18 August 2023 (UTC)
Hello. As there are unresolved EranBot listings from 2015 to 2016, I would like to request that all of these listings be restored to check if they were already resolved. Currently, listings before June 20, 2016 are not at CopyPatrol per Phab. Thanks! MrLinkinPark333 (talk) 19:34, 17 August 2023 (UTC)
Hi @MrLinkinPark333! As per the phab task, those old reports are still accessible in the EranBot archives. There is no viable means to import them into CopyPatrol, I'm afraid. MusikAnimal (WMF) (talk) 00:36, 18 August 2023 (UTC)
Okay. Thank you for the update! MrLinkinPark333 (talk) 00:53, 18 August 2023 (UTC)
en:User:EranBot/Copyright/Batches lists all the pages where the postings were made, and the work that I did to clean them up before we initiated the CopyPatrol interface. If you wish to investigate those reports, you could do so from those postings. The iThenticate links no longer work though. But I don't think that's a good use of editor time; old cases are very difficult to solve, and we already have a huge amount of work between CopyPatrol, en:wp:CCI, and en:WP:CP, and very few people willing to do it. Postings from Batch 46 forward would not need to be checked, because they are duplicates of items that were also listed at CopyPatrol and we dealt with them as they happened on a daily basis. I switched over to working the CopyPatrol queue somewhere around June 17, 2016, and don't have time to do any of those old reports in addition to the hours I spend daily on the CopyPatrol queue. Diannaa (talk) 00:42, 18 August 2023 (UTC)
The iThenticate IDs still work; the URL was switched to a new one when CopyPatrol was introduced, so appending them to https://copypatrol.toolforge.org/ithenticate/<ID> works. A lot of the pages are already deleted or the additions are long gone, though all the blacklisted links have to be removed as well. It's still probably worth looking at though. Isochrone (talk) 09:41, 18 August 2023 (UTC)
I am very interested in participating, although I am on a bus in Slovenia at the moment, with a packed schedule, so we will see. Sphilbrick (having issues with login so will post logged out) 188.198.37.7 08:59, 19 August 2023 (UTC)

Mirrors

A new suggestion: We spend an inordinate amount of time repairing unattributed copying within Wikipedia. If some of the more common Wikipedia mirrors could be identified and whitelisted, it would reduce the amount of time we spend on that, which is not as serious a violation as a true copyright violation (copying copyright material from external news sources, books, or elsewhere). There's already a whitelist at User:EranBot/Copyright/Blacklist but some of the ones I frequently see are not listed there: Bookpedia and Handwiki, for example. Diannaa (talk) 15:10, 19 August 2023 (UTC) Adding: It looks like "Wikia" is on Eran's list; but it's now called "Fandom". Should we whitelist that? Diannaa (talk) 15:38, 19 August 2023 (UTC)

Or perhaps pages with a high similarity to existing articles could be marked as such in the UI to quickly identify/filter CWW. As for removing mirrors, the list at en:WP:MIRRORS is quite extensive and machine-friendly.
N.B. are the iThenticate links meant to be broken? Isochrone (talk) 17:52, 19 August 2023 (UTC)
@MusikAnimal (WMF): we are getting an error message when attempting to view iThenticate reports in the new version. 'Oops! An Error Occurred. The server returned a "500 Internal Server Error"'. Diannaa (talk) 09:44, 21 August 2023 (UTC)
@Diannaa Fixed. Sorry about that! If it wasn't obvious, this new version of CopyPatrol is a complete rewrite, so some bugs were expected. We'll get everything fixed before we go "live", though :)
I'll also note that I just got a 500 error from iThenticate itself. I just refreshed and the report loaded fine, so if you run into this you can try the same. If it happens a lot, we'll report it to Turnitin. MusikAnimal (WMF) (talk) 16:29, 21 August 2023 (UTC)
I should have mentioned, the new ignore lists are centralized at User:CopyPatrolBot/UrlIgnoreList and User:CopyPatrolBot/UserIgnoreList. Please feel free to edit them as desired. Before we deploy the new CopyPatrol, we'll ensure all the entries are copied over from the old ignore lists, so don't worry about that. MusikAnimal (WMF) (talk) 16:34, 21 August 2023 (UTC)
I don't have any knowledge of Regex so I won't be able to add any urls myself unfortunately. Diannaa (talk) 16:48, 21 August 2023 (UTC)
Yeah, I was wondering if it would be possible to leverage the recently introduced w:en:Special:BlockedExternalDomains system. Just as with the Spamblacklist, the CopyPatrol URL ignore list almost never truly needs regular expressions, rather just plain URLs. Pinging @Ladsgroup for input. I'm happy to file a ticket for this as well as help code and review this effort, if we don't think it will be terribly hard. So basically we'd like to generalize the UI, something like Special:EditUrlList/Pagename.json. I imagine there are other use cases beyond Spamblacklist and CopyPatrol ignore list. MusikAnimal (WMF) (talk) 17:03, 21 August 2023 (UTC)
Sure thing. I don't think it's too hard to make that happen. Amir (talk) 03:58, 24 August 2023 (UTC)
Bug filed. Thanks, Amir! MusikAnimal (WMF) (talk) 23:51, 29 August 2023 (UTC)
Hi @MusikAnimal (WMF), is User:EranBot/Copyright/Blacklist still used? Because it looks like it is still the one maintained by patrollers. Framawiki (talk) 17:01, 11 December 2023 (UTC)
@Framawiki Yes, until the new version goes live, that's the page to use. A redirect will be left when it is changed. We're still waiting on the final approval from Turnitin before we switch everything over. MusikAnimal (WMF) (talk) 02:17, 19 December 2023 (UTC)
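
For patrollers who are not comfortable with regex, the point above about plain URLs being enough can be sketched as follows. This is only an illustration, not how CopyPatrol or the ignore lists are actually implemented: the function names are hypothetical, and the domains in the usage note are examples taken from the mirrors mentioned in this thread.

```python
import re


def plain_entry_to_pattern(entry):
    """Turn a plain domain such as 'fandom.com' into a compiled regex.

    The entry is escaped so that '.' is matched literally, and the
    pattern accepts an optional scheme and any subdomains.
    """
    escaped = re.escape(entry.strip().lower())
    return re.compile(
        r"^(?:https?://)?(?:[\w-]+\.)*" + escaped + r"(?:[/:?#]|$)"
    )


def is_ignored(url, entries):
    """Return True if `url` matches any plain entry in the ignore list."""
    url = url.lower()
    return any(plain_entry_to_pattern(e).search(url) for e in entries)
```

With a helper along these lines, an entry like `fandom.com` or `handwiki.org` could be added as-is, and `is_ignored("https://harrypotter.fandom.com/wiki/Hogwarts", ["fandom.com"])` would match without the editor writing any regex by hand.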

Edit summaries

I just started looking at the new tool. I don't yet have comments on the new tool per se, but since the code is being worked on, I thought I'd throw out an idea that I would find helpful, and I think it would be pretty easy to implement.

In a nutshell, I propose that the edit summary be posted as part of the information displayed about the identified edit.

I am fully aware that as soon as I click on the diff button, I can easily see the edit summary, so you might be puzzled why I would want it on the case listing page. My rationale is that I have found, through experience, that the edit summary is one of the most important things to look at because it helps define my process. For example, if the edit summary is "rvv", I'm not going to start with the iThenticate report to see if the text matches some other source; I'm going to look at the history to see if the edit summary is accurate and this is a false positive, because the edit reverts to an earlier version and the matching text arises because the earlier version is in some mirror.

In contrast, if the edit summary states "material copied from [some other article], see that article for attribution", my process will be a little different.

"So what", you might be thinking, because I'm always going to click on the diff where I can see the edit summary. The point is that I have different processes depending on the edit summary, and I think it would be more efficient if I could glance at edit summaries and work on similar issues as a group. So, for example, I could glance down the page and look for all of the edit summaries containing RVV, or revert to earlier version or something similar, handle all of those, then come back and look for all of the edit summaries indicating it's a copy from another article, handle all those, and then look for another group of similar articles. Maybe my age is showing, but I don't switch gears as easily as I used to, so I would find it more efficient if I could handle half a dozen reports consecutively where my process is the same, then switch to a different type.

If this only helps me it's not worth implementing, but if someone else finds this potentially useful, I think it's almost a trivial change: copy the edit summary and place it on the report somewhere. (My simple suggestion would be to just drop it below the iThenticate report button, but if there is an easier option, as long as it's always in the same place I'll be happy.)--Sphilbrick (talk) 12:06, 29 August 2023 (UTC)

@Sphilbrick Ask and you shall receive :) In addition to edit summaries, I've also added tags and the edit size. The tags are especially useful I hope, as they will tell you if it's a revert, or if it was reverted. In the latter case, I was thinking of providing a "revdel" link next to the "Diff" link for quick access to the revision delete form. Would that be useful? MusikAnimal (WMF) (talk) 00:15, 30 August 2023 (UTC)
This is great. (wish you had been at my board meeting last night, a lot of asking, and not a lot of receiving:). Yes, easy access to the Revdel button would be nice. Edit: I just noticed you said "I have" rather than "I will"; very nice, thanks. Sphilbrick (talk) 10:40, 30 August 2023 (UTC)

Review comments

I'm not sure how difficult this is, but perhaps adding a review comment button (i.e. under the resolve options) would be useful, as opposed to more options as previously proposed? I know this is mainly a focus on backend changes and I can file a task on Phab if appropriate, but for some cases it may not be obvious to other "patrollers" what action was taken.
I can make a little mockup if that helps. Thanks for all the work you and the comtech team are doing. Isochrone (talk) 19:52, 30 August 2023 (UTC)

I believe what you're asking for is basically the same as phab:T279083, only more generalized. I was thinking we could allow adding any arbitrary comments, but also have a dropdown of commonly used ones. That list can be configurable by CopyPatrol users.
With the new system this is all much easier to implement, so I will look into it :) MusikAnimal (WMF) (talk) 21:17, 1 September 2023 (UTC)

Pre-filled revision deletion

You mentioned the possibility of a link to the revision deletion template.

This reminds me of something I've always wanted to ask for, but didn't think I could justify setting up a project for this small request. However, if you are actively working on a new version maybe now's the time.

I use a number of the options in Special:RevisionDelete when generally working on RD1 requests, but if I am carrying out a revision deletion in the context of copy patrol work, four of the five choices are identical in close to 100% of the cases. It would be nice and helpful if a customized RD1 template came up when doing copy patrol tasks.

I would preset the template with:

  • Delete revision text Set
  • Delete edit summary Do not change
  • Delete performer's username/IP address Do not change

Reason: <Pre-fill with the RD1 option>

To put it differently, the standard invocation has three visibility restrictions for which the default is "Do not change" for all three. Change the first default from "Do not change" to "Set". The reason field is a drop-down box allowing the editor to choose from seven options. There is no default, so prefill or make the default the RD1 option.

There is also a field for "other/additional reason". I don't know about other editors, but I typically use that field to add the URL of the copyrighted source material. I fully grant that changing the first field and selecting the reason is only a couple of clicks, but a couple of clicks repeated ten thousand times adds up. This customized template would mean I could just drop in the source URL, which I typically already have in my buffer, and complete the RD1 in half the time.--Sphilbrick (talk) 13:10, 31 August 2023 (UTC)

@Sphilbrick Done! I've also added an undo link (as you wouldn't usually rollback here), and a Delete link for new pages. The latter fills in the deletion summary with G12, and also supplies the top source URL. I can't do the same for Undo and also have the automated summary (undid revision by so-and-so), unfortunately. MusikAnimal (WMF) (talk) 21:13, 1 September 2023 (UTC)
Oh, I should have mentioned, however, that the deletion reason auto-selection only works for English Wikipedia, as we must hard-code the value. This doesn't scale well and is fragile (i.e. if someone changes the copyvio reason at en:MediaWiki:Deletereason-dropdown then our code must also be updated). Longer-term, I was thinking we could have an interface page where admins for said wiki can customize the links in CopyPatrol. This would allow CopyPatrol users to update the deletion reason as needed without developer intervention, and also allow each wiki to customize links that meet their workflows. MusikAnimal (WMF) (talk) 21:20, 1 September 2023 (UTC)
I tried the revdel option and loved it. I didn't even ask to prefill the source as I thought that was asking too much, but at least in this case it worked. I was able to invoke the revdel and complete it with a single click. KUDOS Sphilbrick (talk) 13:05, 2 September 2023 (UTC)

Match percentage

The percentage shown next to each "Compare" line now shows two places after the decimal point instead of rounding to a full percentage point. Is this something that was requested? It probably doesn't hurt anything, and if there is value in the extra decimal places, fine, but I can't think of a situation where I would need the numbers to the right of the decimal point.--Sphilbrick (talk) 11:41, 3 September 2023 (UTC)

Missing reports

I understand this is a complete rewrite, so one shouldn't expect the exact same set of cases in the rewrite and the legacy feed. However, I am puzzled to see this page, Draft:Nordic Film & TV Fund, show up in the legacy feed but not in the rewrite. It may be gone by the time you see this, but it was a 93% match and essentially a copy-paste from the "about us" page for the organization. I notice it's in draft space, but I do see some entries in draft space in the new version, so I am puzzled why this one wasn't picked up. Sphilbrick (talk) 12:26, 4 September 2023 (UTC)

It is at toolforge:plagiabot/en?id=8ef096d3-d98d-4c31-9d25-dcbf294c2286. — JJMC89(T·C) 17:24, 4 September 2023 (UTC)
Thanks, wonder how I missed it. Good to see. Sphilbrick (talk) 22:02, 4 September 2023 (UTC)

Damage score

I noticed "damage score" for the first time today. Example:

Link

My very cursory review of the entries on the current page identified three examples.

Can you explain what this means?--Sphilbrick (talk) 22:28, 14 September 2023 (UTC)

@Sphilbrick It's the same as the ORES score in the old system. ORES has been replaced by a new service called LiftWing, so we can't call it "ORES" anymore. The models are still the same, though, in this case the "damaging" model. I didn't add a link yet as I assume the Machine Learning team will move the documentation now that it has a new name. MusikAnimal (WMF) (talk) 03:07, 16 September 2023 (UTC)
OK thanks. Sphilbrick (talk) 12:49, 16 September 2023 (UTC)

Do we even need damage scores?

On the topic of damage scores (previously called ORES), I'm wondering just how useful this information is for CopyPatrol users. I ask because it is by far the slowest part of the application, especially with the new LiftWing system that replaced ORES, which requires us to make a separate request for each revision instead of doing a bulk query. Once we fetch a damage score, we cache it, but since the feed is constantly updated, it usually will take a while on the first load of a session. If we take out damage scores entirely, you should experience a significant performance improvement. Pinging a few top users for feedback: @Sphilbrick, Diannaa, DanCherek, L3X1, Ymblanter, and Moneytrees: Thanks, MusikAnimal (WMF) (talk) 17:32, 5 October 2023 (UTC)

I guess I should note that since the old ORES system is now gone, I had to disable it in the old UI entirely, so you all have been going at least a few weeks now without "damage" (aka ORES) scores and no one seems to have complained… perhaps I already have the answer I need. MusikAnimal (WMF) (talk) 17:34, 5 October 2023 (UTC)
I am not even sure I know what a damage score is. Unless it is the same as the percentage of text overlap, I am probably not using it at all. Ymblanter (talk) 17:44, 5 October 2023 (UTC)
I see, it is not the same. I am unlikely to use it. Ymblanter (talk) 17:45, 5 October 2023 (UTC)
I don't think I ever used it, user edit count is what grabs my attention first then I go straight to the diff enL3X1 ¡‹delayed reaction›¡ 01:44, 6 October 2023 (UTC)
Removing it would not affect my workflow at all. DanCherek (talk) 18:08, 5 October 2023 (UTC)
I don't use it. Diannaa (talk) 21:00, 5 October 2023 (UTC)
Great, thanks for the replies, all! MusikAnimal (WMF) (talk) 00:48, 10 October 2023 (UTC)
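
The performance concern described at the top of this section (one request per revision, cached after the first fetch) can be illustrated with a small Python sketch. It is a simplified model rather than CopyPatrol's real code: the `ScoreCache` class and the `fetch_one` callable are hypothetical stand-ins for whatever actually queries the scoring service.

```python
class ScoreCache:
    """Cache damage scores so each revision is only requested once."""

    def __init__(self, fetch_one):
        # fetch_one: callable taking a revision ID and returning a score.
        self._fetch_one = fetch_one
        self._scores = {}
        self.requests_made = 0

    def get_many(self, rev_ids):
        """Return {rev_id: score}, fetching only the uncached revisions."""
        result = {}
        for rev_id in rev_ids:
            if rev_id not in self._scores:
                # No bulk query is assumed: one request per revision.
                self._scores[rev_id] = self._fetch_one(rev_id)
                self.requests_made += 1
            result[rev_id] = self._scores[rev_id]
        return result
```

On the first load of a feed page every revision is a cache miss, so the number of requests equals the number of rows. Later loads only pay for revisions that have newly entered the feed.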

Fix the problem by the same user!

A user should NOT be allowed to "mark his/her own articles as fixed". Otherwise this tool will NOT be trusted. Here is an example: (https://copypatrol.toolforge.org/ar/?filter=reviewed&filterUser=Kamalelsayedmohamed). His articles are 99% copied from other sites, and then the user marked them as "Fixed" or "No action needed"! Dr-Taher (talk) 05:59, 30 November 2023 (UTC)

@Dr-Taher See task T334272. 1AmNobody24 (talk) 06:58, 30 November 2023 (UTC)
Thanks @1AmNobody24, but it has been more than 7 months, and no action has been taken! Dr-Taher (talk) 10:05, 30 November 2023 (UTC)
I'll get this implemented in the new version, which we'll be rolling out before the end of the year. However if the intention is solely to prevent misuse, it's worth noting a bad actor can easily get around this by simply creating a new account and using that to review their other account's edits. Perhaps use of CopyPatrol should be limited to autoconfirmed accounts? MusikAnimal (WMF) (talk) 20:23, 30 November 2023 (UTC)
Could there be an option for local communities to associate it with different rights? (For example, it could be limited on EnWiki to new page reviewers if the community wants it, since Autoconfirmed gaming is very easy.) — Red-tailed hawk (nest) 03:29, 2 December 2023 (UTC)
@Red-tailed hawk There's talk about that here, task T178700. Auto-confirmed globally and either that or Extended confirmed for EN Wiki (@MusikAnimal (WMF) your task) Nobody (talk) 13:06, 4 December 2023 (UTC)

Can the tool access paywalled full texts?

Curious whether this tool would detect violations like this from 2015 which copied from this source (you'll need to log in)? If not, have you considered whether the tool can be linked up with The Wikipedia Library to access full texts? Smartse (talk) 10:59, 19 December 2023 (UTC)

@Smartse I tried it by copying that old version to Draft:Sandbox. CopyPatrol picked up the edit [1]. In the iThenticate report, it shows that source as a 13% match. Nobody (talk) 13:16, 19 December 2023 (UTC)
@1AmNobody24: Thanks for that - I see that percentage at 9% for link.springer.com, but looking at https://www.ithenticate.com/ I see that they do indeed have the full texts for many paywalled articles. Good to see that we should catch edits like this today, but I wonder how many we missed! Smartse (talk) 12:29, 21 December 2023 (UTC)

Question about marking edits

When I encounter an edit that somebody else has already fixed (by removing content and adding copyvio-revdel tags, or by tagging for G12), should I mark the edit as "Page fixed" or as "No action needed"? I've been marking these sorts of things as "Page fixed", since it was a true copyvio and the page was fixed, but the use of "you" in 'If you fixed the problem, tagged the page for revision deletion, or tagged the page for deletion as a copyright violation, mark it as "Page fixed"' is now giving me a bit of pause. — Red-tailed hawk (nest) 02:54, 21 December 2023 (UTC)

@Red-tailed hawk I also mark those as "Page fixed". Do you think something like 'If the problem is fixed, the page tagged for revision deletion, or tagged for deletion as a copyright violation, mark it as "Page fixed"' could be better? Nobody (talk) 06:27, 21 December 2023 (UTC)
I think the proposed text would work well, yes. — Red-tailed hawk (nest) 16:38, 22 December 2023 (UTC)

Is that list still working? Cause this came in. Nobody (talk) 09:52, 22 December 2023 (UTC)

Sometimes things slip through; I don't know why. Diannaa (talk) 15:35, 26 December 2023 (UTC)

Is copy patrol down?

Only 4 cases going back quite some hours. enL3X1 ¡‹delayed reaction›¡ 21:34, 25 December 2023 (UTC)

I'm not seeing any significant gaps, just a general slowdown. I guess people had something else to do on Christmas Day. Diannaa (talk) 15:34, 26 December 2023 (UTC)