Talk:IP Editing: Privacy Enhancement and Abuse Mitigation/Improving tools

From Meta, a Wikimedia project coordination wiki



Should we include "WhoIs" as an existing tool? — Arthur Rubin T C (en: U, T) 02:54, 30 August 2019 (UTC)

WHOIS has become more or less a generic name for an information blurb that is available from dozens of providers; many IP lookups give as much information as a typical WHOIS page. There's an experimental version on the ToolForge (it's the link available on enwp IP address user pages/contribs pages). I'd suggest linking the ToolForge version, but there are so many other versions that people kind of pick and choose their favourite providers. Risker (talk) 03:26, 30 August 2019 (UTC)

Linking to other feedback

So it's not lost, and is available for others to read.

wikidata:Wikidata:Project_chat/Archive/2019/08#New tools and IP masking. /Johan (WMF) (talk) 10:13, 9 September 2019 (UTC)

Feedback about proposed tools

We've put out some ideas on the project page about tools we can build to help improve vandalism detection and mitigation on our projects. We want your help with brainstorming on these ideas. What are some costs, benefits and risks we might be overlooking? How can we improve upon these ideas? What sounds exciting, what sounds sub-optimal? We want to hear all your thoughts. -- NKohli (WMF) (talk) 00:12, 14 January 2020 (UTC)

Feedback about IP info feature

  • One concern that I have about this is that many admins have sufficient knowledge to make IP rangeblocks by looking at the IP address, the whois data, and the contributions. How would admin rangeblocking work under this scenario? "Owner: Wikimedia Foundation" might be true for many different IP address ranges. A block on shouldn't necessarily also affect 2620:0:860::/46. How would an admin view the contributions for a specific range, and issue a block? I'll note that it isn't sufficient to simply aggregate blocks based on the announced prefixes. For IPv4, many residential ISPs provide a single IP per customer which can be static for many months, so there is no need for a rangeblock at all. In IPv6, we almost always want to block the /64, but that is never the announced prefix, and for some providers, especially mobile ones, a /64 is too wide. Users need to see the IP address, WHOIS data, and contributions history, not an opaque "similarity percentage", in order to understand how to interact with that IP address. ST47 (talk) 00:34, 15 January 2020 (UTC)
  • "VPN" is not just a green checkmark or a red x. There are many different types of proxies and VPNs, some are public and some are only used by specific schools or corporations, and some data sources are of higher quality than others. Admins routinely use the reverse DNS output, not just of a single IP address but of the entire IP range to observe trends, and other tools like port scanning, DNS caches, blocklists, and threat intelligence, to identify proxies. Many of these require access to the true IP address. ST47 (talk) 00:39, 15 January 2020 (UTC)
  • I think the WMF should be supplying their own summary whois and geolocation information, but not at the expense of hiding the addresses. They should also be providing rDNS. I share the same concerns about proxies expressed above by ST47. Sometimes we are dealing with a level of sophistication far beyond what any blacklist can possibly tell you. I really don't trust half the blacklist results, especially at the levels of precision we need and the levels we can get from blacklists. There's also a whole bunch of grey areas. Also, those who don't understand IP addresses regularly defer to those of us who do. There's no real deficiency there. -- zzuuzz (talk) 19:33, 15 January 2020 (UTC)
  • I support this general idea. Please emphasize and fund community conversation before investment in tech. Blue Rasberry (talk) 20:19, 15 January 2020 (UTC)
  • There are two things we will miss with this information:
    1. A way to identify a range. While the name is useful for a small company, a government institution or a university, any nationwide mobile operator in a big country has an eight-digit number of users who can potentially edit Wikimedia projects from its network. The operator name by itself tells us little, and a range block on the entire Vodafone network would simply be dangerous. Some abusers are localised to a small subrange, often a /18 or even a /24, and having this information is extremely useful. I also wonder how we can translate this to rangeblocks...
    2. A way to check other forms of IP abuse, from open proxies to leaky colos. To check whether an IP is involved in any form of abuse, I just google it: you usually find some strange results for IPs that are VPNs or proxies, and next to nothing for legitimate ones. I don't think a VPN red/green status will work. For instance, my work computer is technically connected to a corporate VPN, but it is not abusive, as it can be accessed only by me. I think that a real check would need a heavy investment in regularly updated databases of VPNs, proxies, etc.
    On the other hand, this will make identifying some people easier, not more difficult. For instance, we have a user in Ukrainian Wikipedia who works for a Japanese university. It is highly likely that very few Ukrainians work for this specific Japanese university, which will make identifying him easier for everyone, not just for the few who go the extra mile and use Whois -- NickK (talk) 13:39, 18 January 2020 (UTC)
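The range arithmetic discussed in the comments above (treating an IPv6 editor as their /64, leaving an IPv4 editor as a single address unless a wider subrange such as a /24 is chosen, and checking whether an address falls inside a given block) can be sketched with Python's standard `ipaddress` module. This is only an illustration: the function names `enclosing_range` and `in_range` and the default prefix lengths are assumptions made for the example, not part of any proposed tool, and the addresses other than 2620:0:860::/46 are documentation-range placeholders.

```python
import ipaddress


def enclosing_range(address: str, v4_prefix: int = 32, v6_prefix: int = 64) -> str:
    """Return the CIDR range that an admin might treat as "one editor".

    Defaults follow the rule of thumb from the discussion: a single
    address for IPv4, the /64 for IPv6. Both are assumptions and would
    need to be adjustable per provider (e.g. a /24 for some IPv4 ISPs).
    """
    ip = ipaddress.ip_address(address)
    prefix = v4_prefix if ip.version == 4 else v6_prefix
    # strict=False masks off the host bits instead of raising an error.
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))


def in_range(address: str, cidr: str) -> bool:
    """Check whether an address is covered by a (hypothetical) range block."""
    ip = ipaddress.ip_address(address)
    net = ipaddress.ip_network(cidr)
    return ip.version == net.version and ip in net


if __name__ == "__main__":
    print(enclosing_range("2001:db8:abcd:12:1:2:3:4"))
    print(enclosing_range("192.0.2.77", v4_prefix=24))
    print(in_range("2620:0:861::1", "2620:0:860::/46"))
```

A tool that hides addresses would still need to expose something equivalent to these two operations for rangeblocks to remain usable, which is the concern raised above.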

Feedback about Finding similar editors feature

  • Automated behavioral comparison is a great idea, if you can find a way to make it work from a performance standpoint. Why limit it to only IP editors? ST47 (talk) 00:41, 15 January 2020 (UTC)
  • Yes, great idea. I agree with the noted risk that automation can inappropriately accuse editors, but we already have an underregulated and nonstandard accusation detection process. The way to counter the ethical challenges is by funding documentation, more accessible instructions, and online and in-person meetups for people to raise issues and develop solutions. We already have a very labor-intensive system which is not scaling. Our greatest threat is not the early bias of the first automation, but the existing problem of undetected and unanswered misconduct. By not having semi-automation we permit too much behavior which we ought to exclude. Blue Rasberry (talk) 20:25, 15 January 2020 (UTC)
  • It is a useful tool, but it can in no way replace finding editors in the same range. Hint: very few LTAs are interested in exactly the same pages; many have the same editing pattern but can edit any page. A realistic example from Ukrainian Wikipedia: an LTA is POV-pushing on Crimea topics. Here are three edits for analysis:
    1. replaced in Ivan (footballer) "Ivan was born in Sevastopol, Ukraine" with "Ivan was born in Sevastopol, Russia"
    2. replaced in Crimea "Crimea is internationally recognised as a part of Ukraine" with "Crimea is wrongly recognised as a part of Ukraine"
    3. replaced in Ivan (singer) "Born in Sevastopol, Ivan works for MTV Ukraine" with "Born in Sevastopol, Ivan works for MTV Russia"
    A local patroller finds edit 1, recognises our known LTA, checks the contributions and reverts edit 2.
    A machine learning tool gets edit 1 as an input and will most likely suggest reverting edit 3 (which is almost identical). Unfortunately, edit 3 is a legitimate update for a person who really moved from MTV Ukraine to MTV Russia. I wonder whether ORES will even find edit 2. I have already spotted such behaviour in Ukrainian Wikipedia: after an edit war, the edits of the side that ended up having consensus were labelled as abusive by the machine learning tool.
    I also wonder whether this tool will work across wikis, e.g. if made this edit in Ukrainian Wikipedia and made this edit in Polish Wikivoyage.
    Thus I would probably use such a tool as an additional instrument, but it will not replace range contributions for me -- NickK (talk) 13:39, 18 January 2020 (UTC)

Feedback about Database for documenting LTAs

  • Is the idea that this database is being populated by CheckUsers, but used by users without any special permissions? Based on the fact that it doesn't actually show IP addresses or user agents, just a count of matching ones? There's probably still a privacy concern there. Also, due to dynamic IP addresses, showing "IPs: 0 out of 5 match" is one thing, but how many are matching the same IP ranges, ISPs, geolocations? For user agents, a simple match isn't very effective because most browsers increment versions every month; it should be checking for the same browser/OS/platform. If you could do even more fingerprinting, that would be great too. ST47 (talk) 00:46, 15 January 2020 (UTC)
  • An LTA database was discussed once upon a time at enwiki (link). I remain of the opinion that a public library which both registered and unregistered users can consult is a useful tool. As a responder to requests for blocks, I have often been educated by some unregistered user pointing to an LTA page along with the latest vandal. And I will sometimes visit LTA pages on wikis where I have no edits, so would not benefit from having this information hidden from me on that basis. For such an idea to work effectively, I think you should basically throw out any ideas of read-level security. I also think when you have more information limited to a more restricted group, you get more misinformation. Don't take this as total opposition to the idea, I'm just not persuaded that LTA pages should be deprecated. -- zzuuzz (talk) 19:10, 15 January 2020 (UTC)
  • Yes, great idea. Please fund Wikimedia community organizations and focus groups to develop text and documentation for how this should be. I am especially interested in developing categories or labels for humans to apply to different sorts of behavior. If we have a database we need sorting systems, and socially we are far, far from being able to have reasonable or useful conversations about sorting. This will require slow conversation in many places over time, and a small, steady investment now will save great expense later. Without the labels the quality of the data we collect will suffer and we will be unable to usefully discuss various types of long term abuse. Blue Rasberry (talk) 20:22, 15 January 2020 (UTC)
  • I don't think a public library of LTAs is a good idea. This library should be private and require special permissions. We already have an LTA in Ukrainian Wikipedia who studied the public rules (filters, rangeblocks etc.) and adapted their abusive edits to them, becoming even more abusive as their ability to circumvent restrictions improved. Like ST47, I also think that the pages/IPs/UAs dimensions are not sufficient: an LTA can easily make similar edits to a different topic, using the next IP from the same range (because we blocked the previous one!) and an updated UA (because of a browser update). The ability to compare manually will be more useful -- NickK (talk) 13:39, 18 January 2020 (UTC)
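The point made above about user agents, that an exact string match breaks with every browser update and the comparison should be on browser/OS/platform instead, could look roughly like the following. This is a deliberately simplistic sketch: the pattern lists and the names `ua_family` and `same_family` are assumptions for illustration only, real user-agent parsing has many more cases, and nothing here reflects how the proposed database would actually work.

```python
import re

# Order matters: Edge UAs also contain "Chrome/", and Chrome UAs also
# contain "Safari/", so the more specific pattern is listed first.
_BROWSER_PATTERNS = [
    ("Edge", re.compile(r"Edg/")),
    ("Firefox", re.compile(r"Firefox/")),
    ("Chrome", re.compile(r"Chrome/")),
    ("Safari", re.compile(r"Version/.*Safari/")),
]

# Android UAs contain "Linux" and iOS UAs contain "Mac OS X", so the
# mobile platforms are checked before the desktop ones.
_OS_PATTERNS = [
    ("Windows", re.compile(r"Windows NT")),
    ("Android", re.compile(r"Android")),
    ("iOS", re.compile(r"iPhone|iPad")),
    ("macOS", re.compile(r"Mac OS X")),
    ("Linux", re.compile(r"Linux")),
]


def _first_match(patterns, user_agent: str) -> str:
    """Return the name of the first pattern that matches, else "Other"."""
    return next((name for name, pat in patterns if pat.search(user_agent)), "Other")


def ua_family(user_agent: str) -> tuple:
    """Reduce a user-agent string to (browser, OS), ignoring versions."""
    return (_first_match(_BROWSER_PATTERNS, user_agent),
            _first_match(_OS_PATTERNS, user_agent))


def same_family(ua_a: str, ua_b: str) -> bool:
    """True when two user agents share browser and OS, whatever the version."""
    return ua_family(ua_a) == ua_family(ua_b)


if __name__ == "__main__":
    old = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0"
    new = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0"
    print(old == new, same_family(old, new))
```

Matching on the family is far less specific than matching the full string, so it would only be useful combined with other signals such as IP range or ISP, as the comments above suggest.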