Community Wishlist Survey 2017/Miscellaneous/Overhaul spam-blacklist

From Meta, a Wikimedia project coordination wiki

Overhaul spam-blacklist

  • Problem: The current blacklist system is archaic: it does not allow for levels of blacklisting and is confusing to editors. One main problem is that the spam blacklist does not discriminate by namespace (an often recurring comment is that it should be possible to discuss a link in talk namespaces, though not to use it in content namespaces). The blacklist is also a black-and-white choice: it is not possible to block additions only by non-autoconfirmed editors, or to allow additions only by admins. Giving warnings is not possible either (on en.wikipedia, we implemented XLinkBot, which reverts and warns; giving a warning to IPs and 'new' editors that a certain link is in violation of policies/guidelines would be a less bitey solution).
  • Who would benefit: The community at large
  • Proposed solution: Basically, replace the current mw:Extension:SpamBlacklist with a new extension based on mw:Extension:AbuseFilter, by taking the 'conditions' parsing out of the AbuseFilter and replacing it with parsing only of regexes that match added external links. (Technically, the current AbuseFilter is capable of doing what would be needed, except that in its current form it is extremely heavyweight to use for the number of regexes that are on the blacklists.) Expansions could be added in the form of whitelisting fields, namespace selectors, etc.
expanded solution
  1. Take the current AbuseFilter, rename it to SpamFilter, and take out all the code that interprets the rules ('conditions').
  2. Add two fields to replace the 'conditions' field:
    • one text field for regexes that block added external links (the blacklist). Can contain many rules (one on each line, like current spam-blacklist).
    • one text field for regexes that override the block (a whitelist overriding this blacklist field; that is generally simpler and cleaner than writing a complex regex, and not everybody is a specialist in regexes).
  3. Add a namespace choice (checkboxes like in search, so one can choose not to blacklist something in one particular namespace, with the addition of an 'all', a 'content-namespace only' and a 'talk-namespace only' option).
    • Some links are fine in discussions but should not be used in mainspace; others are a total no-no
    • Some image links are fine in the file-namespace to tell where it came from, but not needed in mainspace
  4. Add user status choice (checkboxes for the different roles, or like the page-protection levels)
    • Disallow IPs and new users from using a certain link (e.g. to stop spammers from creating socks, while leaving it free to most users).
  5. Leave all the other options:
    • Discussion field for evidence (or better, a talk-page like function)
    • Enabled/disabled/deleted - if a rule is not needed, turn it off; if it is obsolete, delete it
    • 'Flag the edit in the edit filter log' - maybe nice to be able to turn it off, to get rid of the real rubbish that doesn't need to be logged
    • Rate limiting - catch editors that start spamming an otherwise reasonably good link
    • Warn - could be a replacement for en:User:XLinkBot
    • Prevent the action - as is the current blacklist/whitelist function
    • Revoke autoconfirmed - make sure that spammers are caught and checked
    • Tagging - for combining certain rules to be checked by RC patrollers.
    • I would consider adding a button to auto-block editors on certain typical spambot-domains (a function currently taken by one of Anomie's bots on en.wikipedia).
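
As a rough illustration of steps 2 to 4 above, here is a minimal Python sketch of how a single rule could be evaluated. It is not the AbuseFilter or SpamBlacklist code: the class name SpamRule, its fields, the namespace constants and the example domain spam.example are all assumptions made up for this sketch.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Set

# Illustrative namespace numbers (0 = main, 1 = Talk, 6 = File).
MAIN, TALK, FILE = 0, 1, 6


@dataclass
class SpamRule:
    """Hypothetical rule: one blacklist field, one whitelist field, plus selectors."""
    blacklist: List[str]                                   # regexes that block a link
    whitelist: List[str] = field(default_factory=list)     # regexes that override the block
    namespaces: Optional[Set[int]] = None                  # None means all namespaces
    exempt_groups: Set[str] = field(default_factory=set)   # user groups the rule skips
    action: str = "prevent"                                # e.g. "prevent", "warn", "tag"

    def __post_init__(self):
        # Compile once so that checking an edit is plain regex testing.
        self._black = [re.compile(p, re.IGNORECASE) for p in self.blacklist]
        self._white = [re.compile(p, re.IGNORECASE) for p in self.whitelist]

    def check(self, url, namespace, user_groups):
        """Return the rule's action if it applies to this added link, else None."""
        if self.namespaces is not None and namespace not in self.namespaces:
            return None  # rule is not active in this namespace
        if self.exempt_groups & set(user_groups):
            return None  # editor belongs to an exempt group
        if any(w.search(url) for w in self._white):
            return None  # whitelist overrides the blacklist
        if any(b.search(url) for b in self._black):
            return self.action
        return None


# Example: warn when the link is added in mainspace by a non-autoconfirmed editor.
rule = SpamRule(
    blacklist=[r"\bspam\.example\b"],
    whitelist=[r"\bspam\.example/about\b"],
    namespaces={MAIN},
    exempt_groups={"autoconfirmed"},
    action="warn",
)
```

Calling rule.check(url, namespace, user_groups) for each added external link would then give the action to take, or None when the rule does not apply. The cheap selectors (namespace, user group) are tested before any regex, which is one way to keep the per-edit cost low when there are many rules.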

This should overall be much more lightweight than the current AbuseFilter (all it does is regex testing, as the spam-blacklist does; only it has to cycle through maybe thousands of such filters). One could consider expanding it to have rules blocked or enabled only on certain pages (for heavily abused links that actually should only be used on their own subject page). Another consideration would be to have a 'custom reply' field, telling the editor who gets blocked by the filter why the edit was blocked.

Possible expanded features:

  1. block or whitelist links matching regexes on specific pages (disallow linking throughout except on the subject page)
  2. block or whitelist links matching regexes when added by a specific user/IP/IP-range (disallow specific users from using a domain)
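
The two expanded features could be sketched as follows. This is only an assumption about how such checks might look: the function is_blocked and its parameters are hypothetical, and the domain and IP ranges are reserved documentation values.

```python
import ipaddress
import re


def is_blocked(url, page_title, editor_ip, pattern,
               allowed_pages=(), blocked_ranges=None):
    """Hypothetical check: should adding `url` on `page_title` be blocked?

    allowed_pages:  page titles where the link is whitelisted (feature 1).
    blocked_ranges: if given, the rule only applies to editors whose IP
                    falls inside one of these CIDR ranges (feature 2).
    """
    if not re.search(pattern, url, re.IGNORECASE):
        return False  # link does not match the rule at all
    if page_title in allowed_pages:
        return False  # per-page exception, e.g. the subject's own page
    if blocked_ranges is None:
        return True   # rule applies to every editor
    if editor_ip is None:
        return False  # no IP available, so a range-limited rule does not apply
    addr = ipaddress.ip_address(editor_ip)
    return any(addr in ipaddress.ip_network(r) for r in blocked_ranges)


PATTERN = r"\bsubject\.example\b"

# Feature 1: blocked everywhere except on the page "Subject".
on_other_page = is_blocked("https://subject.example/", "Other page", None,
                           PATTERN, allowed_pages={"Subject"})
on_subject_page = is_blocked("https://subject.example/", "Subject", None,
                             PATTERN, allowed_pages={"Subject"})

# Feature 2: blocked only for editors in a given IP range.
in_range = is_blocked("https://subject.example/", "Other page", "192.0.2.15",
                      PATTERN, blocked_ranges=["192.0.2.0/24"])
```

A real implementation would also need to match on user names and handle page moves; the sketch only shows that both features reduce to one extra lookup after the regex match.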
  • More comments:
  • Phabricator tickets: task T6459 (where I proposed this earlier)

Discussion

  • I agree, the size of the current blacklists is difficult to work with; I would be blacklisting a lot more spam otherwise. A split of the current blacklists is also desired:
  • I still want to see a single, centralized, publicly available, machine readable spam blacklist for all the spammers, bots, black hat SEOs and other lowlifes so that they can be penalized by Google and other search engines. This list must continue to be exported to prevent spam on other websites. Autoblocking is also most useful here.
  • The same goes for URL shorteners and redirects -- this list would also be useful elsewhere. This is one example where the ability to hand out customized error messages (e.g. "hey, you added a URL shortener; use the original URL instead") is useful.

My issue with this (as with supposed “spam-fighting” in general) is that it causes far too much collateral damage, both to users and to content. Many useful sites are blacklisted purely because a user is banned, and if a user gets globally banned the link 🔗 gets globally blacklisted and removed from every Wikimedia property, even if it was used as a source 100% of the time. Now imagine that a year or so later someone wants to add content using that same link (which is now called a “spamlink”): this user will be indefinitely banned simply for sourcing content. I think 🤔 that having unsourced content is a larger risk to Wikimedia projects than alleged “spam” has ever been. This is especially worrisome for mobile users (who will inevitably become the largest userbase): when you attempt to save an edit, it doesn't even tell you why your edit won't save but simply says “error”, so a user might attempt to save it again and then get blocked for “spamming”. Abuse filters currently don't function 100% accurately, and having editors leave the project forever simply because they attempted to use “the wrong 👎🏻” reference is bonkers. Sent 📩 from my Microsoft Lumia 950 XL with Microsoft Windows 10 Mobile 📱. --Donald Trung (Talk 🤳🏻) (My global lock 😒🌏🔒) (My global unlock 😄🌏🔓) 10:15, 15 November 2017 (UTC)

Also, after a link has been blacklisted, someone might attempt to translate a page and get blocked. The potential for collateral damage is very high; how would this "feature" keep collateral damage to a minimum? --Donald Trung (Talk 🤳🏻) (My global lock 😒🌏🔒) (My global unlock 😄🌏🔓) 10:15, 15 November 2017 (UTC)
@Donald Trung: that is not going to change; actually, this suggestion gives more freedom in how to blacklist and whitelist material. The current system is black-and-white; this gives many shades of grey to the blacklisting system. In other words, your comments are related to the current system.
Regarding the second part of your comment: yes, that is the intended use of the system. If a link is spammed to one page, then translating that page does not make it a good link on the translation (and this situation could actually also be avoided in the new system). --Dirk Beetstra T C (en: U, T) 10:39, 15 November 2017 (UTC)
  • The blacklist currently prevents us from adding a link to a site, from the article about that site. This is irrational. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 14:03, 15 November 2017 (UTC)
    • @Pigsonthewing: What do you mean, do I have an unclear sentence? If it is what I think, then yes, I would like per-article exceptions (though that is a less important feature of it). --Dirk Beetstra T C (en: U, T) 14:29, 15 November 2017 (UTC)
    • Ah, I think I get it, you are describing a shortcoming of the current system. That is indeed one of the problems, though there are reasons why sometimes we do not want to do that (e.g. malware sites), and cases where the link gets more broadly blacklisted (we blacklist all of .onion, which is then indeed not linkable on .onion, but also not on subject X whose official website is a .onion). But the obvious cases are there indeed. I would indeed like to have the possibility to blanket whitelist for specific cases, like <subject>.com on <subject>, allowing full (primary) referencing on that single page. It is now sometimes silly that we have to allow a /about link to a site on the subject's wiki page to avoid nullifying the blacklist regex, or a whole set of specific whitelistings to allow sourcing on their own page. On heavily abused sites, I would like to really allow whitelisting only for a very specific target ('you can only use this link on <subject> and nowhere else'). --Dirk Beetstra T C (en: U, T) 14:35, 15 November 2017 (UTC)

Or just add an option to AbuseFilter to compare against a regexp list that's on a wikipage. (Would require some thought in that we might want to expose the matching rule in the error message and logs, but otherwise easy.)

More generally, it would be nice if we could standardize on AbuseFilter instead of having five or six different anti-abuse systems with fractured UX and capabilities. That's a bit beyond CommTech's scope though. --Tgr (WMF) (talk) 23:54, 18 November 2017 (UTC)

No, User:Tgr (WMF), using the current AbuseFilter for this is going to be a massive overload of the servers; it will still interpret the whole rule, and we would probably have hundreds if not thousands of separate filters for this. It also would not allow for whitelisting (unless, again, you write a full rule with even more overload), namespace exclusion (unless ..), user-level exclusion (unless ..).
Making the AbuseFilter more modular may be an idea .. please read my suggestions above as a detailed request for capabilities. I am not familiar enough with the coding of the AbuseFilter to see how far this would need to go. --Dirk Beetstra T C (en: U, T) 11:00, 20 November 2017 (UTC)

Voting