Talk:IP Editing: Privacy Enhancement and Abuse Mitigation/Improving tools

From Meta, a Wikimedia project coordination wiki



Should we include "WhoIs" as an existing tool? — Arthur Rubin T C (en: U, T) 02:54, 30 August 2019 (UTC)

WHOIS has become more or less a generic name for an information blurb that is available from dozens of providers; many IP lookups give as much information as a typical WHOIS page. There's an experimental version on the ToolForge (it's the link available on enwp IP address user pages/contribs pages). I'd suggest linking the ToolForge version, but there are so many other versions that people kind of pick and choose their favourite providers. Risker (talk) 03:26, 30 August 2019 (UTC)

Linking to other feedback

So it's not lost, and is available for others to read.

wikidata:Wikidata:Project_chat/Archive/2019/08#New tools and IP masking. /Johan (WMF) (talk) 10:13, 9 September 2019 (UTC)

Feedback about proposed tools

We've put out some ideas on the project page about tools we can build to help improve vandalism detection and mitigation on our projects. We want your help with brainstorming on these ideas. What are some costs, benefits and risks we might be overlooking? How can we improve upon these ideas? What sounds exciting, what sounds sub-optimal? We want to hear all your thoughts. -- NKohli (WMF) (talk) 00:12, 14 January 2020 (UTC)

Summary of previous feedback

There was extensive discussion at Talk:IP_Editing:_Privacy_Enhancement_and_Abuse_Mitigation, which I will attempt to summarize:

There was clear and overwhelming community rejection of masked editing, to the extent of a likely community consensus to place a preemptive block against all masked edits. If the Foundation were to terminate the ability for logged-out users to IP-edit (replacing IP editing with masked editing), the Foundation would effectively be denying logged-out users any ability to edit at all. Any new tools for handling masked edits would then be completely non-functional, as there would be exactly zero masked edits to examine.

If the Foundation drops the masking aspect, there is of course interest in tools or improvements for dealing with IP-edits. Alsee (talk) 21:08, 20 January 2020 (UTC)

Feedback about IP info feature

  • One concern that I have about this is that many admins have sufficient knowledge to make IP rangeblocks by looking at the IP address, the whois data, the contributions. How would admin rangeblocking work under this scenario? "Owner: Wikimedia Foundation" might be true for many different IP address ranges. A block on shouldn't necessarily also affect 2620:0:860::/46. How would an admin view the contributions for a specific range, and issue a block? I'll note that it isn't sufficient to simply aggregate blocks based on the announced prefixes. For IPv4, many residential ISPs provide a single IP per customer which can be static for many months, so there is no need for a rangeblock at all. In IPv6, we almost always want to block the /64 - but that is never the announced prefix, and for some providers, especially mobile ones, a /64 is too wide. Users need to see the IP address WHOIS data and contributions history, not an opaque "similarity percentage", in order to understand how to interact with that IP address. ST47 (talk) 00:34, 15 January 2020 (UTC)
  • "VPN" is not just a green checkmark or a red x. There are many different types of proxies and VPNs, some are public and some are only used by specific schools or corporations, and some data sources are of higher quality than others. Admins routinely use the reverse DNS output, not just of a single IP address but of the entire IP range to observe trends, and other tools like port scanning, DNS caches, blocklists, and threat intelligence, to identify proxies. Many of these require access to the true IP address. ST47 (talk) 00:39, 15 January 2020 (UTC)
  • I think the WMF should be supplying their own summary whois and geolocation information, but not at the expense of hiding the addresses. They should also be providing rDNS. I share the same concerns about proxies expressed above by ST47. Sometimes we are dealing with a level of sophistication far beyond what any blacklist can possibly tell you. I really don't trust half the blacklist results, especially at the levels of precision we need and the levels we can get from blacklists. There's also a whole bunch of grey areas. Also, those who don't understand IP addresses regularly defer to those of us who do. There's no real deficiency there. -- zzuuzz (talk) 19:33, 15 January 2020 (UTC)
  • I support this general idea. Please emphasize and fund community conversation before investment in tech. Blue Rasberry (talk) 20:19, 15 January 2020 (UTC)
  • There are two things we will miss with this information:
    1. A way to identify a range. While the name is useful for a small company, a government institution or a university, any nationwide mobile operator in a big country has an eight-digit number of users who can potentially edit Wikimedia projects from their networks. This information in itself makes little sense: a range block on the entire Vodafone network would probably be simply dangerous. Some abusers are localised to a small subrange, often /18 or even /24, and having this information is extremely useful. I also wonder how we can translate this to rangeblocks...
    2. A way to check other forms of IP abuse, from open proxies to leaky colos. In order to check whether an IP is involved in any form of abuse, I just google it: usually you find some strange results for IPs that are VPNs or proxies, and you get next to nothing for legitimate ones. I don't think a VPN red/green status will work. For instance, my work computer is technically connected to a corporate VPN, but it is not abusive as it can be accessed only by me. I think that in order to make a real check, a heavy investment with regular updates of VPN/proxy databases is needed.
    On the other hand, this will make identifying some people easier, not more difficult. For instance, we have a user in Ukrainian Wikipedia who works for a Japanese university. It is highly likely that there are very few Ukrainians working for this specific Japanese university, which will make identifying him easier for everyone, not just for the few who will go the extra mile and use Whois -- NickK (talk) 13:39, 18 January 2020 (UTC)
To the extent that easy access to the information is an issue, this feature could be a gadget that has to be manually activated in Preferences/Gadgets. Any user who needs the information can activate the gadget, just as they can now run their Whois search, but there will be less casual discovery of the information. ChristianKl❫ 16:39, 20 January 2020 (UTC)
  • To me this feature does appear to provide the information that non-admins need to be able to access. On the other hand it doesn't seem to provide the necessary information to do rangeblocks for the reasons Zzuuzz and ST47 have pointed out. ChristianKl❫ 16:39, 20 January 2020 (UTC)
  • There are quality issues with this proposal. Easily accessed tools to identify the "owner" of an IP regularly provide different information for exactly the same IP; I see it on a regular basis, where "respected" third party IP information sources will give different granularity and different information. There are also issues when looking at ranges, where an entire range is identified as being "owned" by one organization, when in fact they only have one or a few IPs within a larger range. In some ways, this change is no different than the present situation; anyone looking up an IP externally could encounter the same issues. But they aren't usually being "published" by the WMF, which would be the major change here. Anything that the WMF uses will be dependent on third party sources, just as is used now. The difference is that it's pretty transparent that they're not WMF sources; either they name the website, or they say it's generated by script data under the management of community members. The absence of that buffer is a non-negligible risk if we wind up blocking huge ranges because they're supposedly a colocation host, only to later find that the colo only has a /32 or even only a handful of IPs. (And yes, I know that we're blocking ranges that are far too large now, but...) Risker (talk) 20:09, 20 January 2020 (UTC)
  • Even putting rangeblocks aside, non-admins will often check the contributions of adjacent IPs when reverting vandalism from IP-hopping vandals. Shifting that burden entirely to admins is not realistic. --Ahecht (TALK) 21:28, 20 January 2020 (UTC)
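
The range arithmetic raised in this thread (blocking a /64 rather than the much wider announced prefix) can be sketched with Python's standard ipaddress module. This is only an illustration under stated assumptions, not how any MediaWiki tool works: the function name suggested_block, the default prefix lengths and the example address 2620:0:861:1::1 are made up for the example; only the 2620:0:860::/46 prefix is taken from the discussion.

```python
import ipaddress

# Announced prefix quoted in the discussion above.
ANNOUNCED_PREFIX = ipaddress.ip_network("2620:0:860::/46")


def suggested_block(address, v6_prefix=64):
    """Return a candidate block range for one address (hypothetical helper).

    IPv4: the single address (/32), since many ISPs hand out one IP per customer.
    IPv6: the enclosing /64 by default; the prefix length is adjustable because
    a /64 can be too wide for some mobile providers.
    """
    ip = ipaddress.ip_address(address)
    if ip.version == 4:
        return ipaddress.ip_network(f"{ip}/32")
    # strict=False masks off the host bits instead of raising an error.
    return ipaddress.ip_network(f"{ip}/{v6_prefix}", strict=False)


example = "2620:0:861:1::1"  # assumed address inside the announced prefix
block = suggested_block(example)

print(block)                              # 2620:0:861:1::/64
print(block.subnet_of(ANNOUNCED_PREFIX))  # True
# Number of distinct /64 ranges hidden behind the single announced /46.
print(ANNOUNCED_PREFIX.num_addresses // block.num_addresses)  # 262144
```

The last figure shows why an owner name or an announced prefix alone is too coarse: one /46 contains 262144 separate /64 ranges, and an admin needs the actual address to pick the right one.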

Feedback about Finding similar editors feature

  • Automated behavioral comparison is a great idea, if you can find a way to make it work from a performance standpoint. Why limit it to only IP editors? ST47 (talk) 00:41, 15 January 2020 (UTC)
  • Yes great idea. I agree with the noted risk that automation can inappropriately accuse editors, but we already have an underregulated and nonstandard accusation detection process. The way to counter the ethical challenges is by funding documentation, more accessible instructions, and online and in-person meetups for people to raise issues and develop solutions. We already have a very labor intensive system which is not scaling. Our greatest threat is not the early bias of the first automation, but the existing problem of undetected and unanswered misconduct. By not having semi-automation we permit too much behavior which we ought to exclude. Blue Rasberry (talk) 20:25, 15 January 2020 (UTC)
  • It is a useful tool, but it can in no way replace finding editors in the same range. Hint: very few LTAs have an interest in exactly the same pages; many have the same editing pattern but can edit any pages. A realistic example from Ukrainian Wikipedia: an LTA is POV-pushing on Crimea topics. Here are three edits for analysis:
    1. replaced in Ivan (footballer): "Ivan was born in Sevastopol, Ukraine" with "Ivan was born in Sevastopol, Russia"
    2. replaced in Crimea: "Crimea is internationally recognised as a part of Ukraine" with "Crimea is wrongly recognised as a part of Ukraine"
    3. replaced in Ivan (singer): "Born in Sevastopol, Ivan works for MTV Ukraine" with "Born in Sevastopol, Ivan works for MTV Russia"
    A local patroller finds edit 1, finds out that it is our known LTA, checks contributions and reverts edit 2.
    A machine learning tool gets edit 1 as an input and will most likely suggest to revert edit 3 (which is almost identical). Unfortunately, edit 3 is a legitimate update for a person who really moved from MTV Ukraine to MTV Russia. I wonder whether ORES will even find edit 2. I have already spotted such behaviour in Ukrainian Wikipedia: after an edit war edits of the side which ended up being consensual were labelled as abusive by the machine learning tool.
    I also wonder whether this tool will work across wikis, e.g. if made this edit in Ukrainian Wikipedia and made this edit in Polish Wikivoyage.
    Thus I would probably use such tool as an additional instrument, but it will not replace range contributions for me — NickK (talk) 13:39, 18 January 2020 (UTC)
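
The three-edit example above can be made concrete with a deliberately naive similarity measure. The sketch below is an assumption for illustration only: it is not ORES and not the proposed tool, and the helpers changed_words and jaccard are hypothetical. It compares edits purely by which words were removed and added, which is enough to reproduce the failure mode described: edit 3 looks identical to edit 1, while edit 2 shares nothing with it.

```python
def changed_words(before, after):
    """Set of ("-", word) removals and ("+", word) additions between two texts."""
    old, new = set(before.split()), set(after.split())
    return {("-", w) for w in old - new} | {("+", w) for w in new - old}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# The three edits from the example, as (text before, text after).
EDITS = {
    1: ("Ivan was born in Sevastopol, Ukraine",
        "Ivan was born in Sevastopol, Russia"),
    2: ("Crimea is internationally recognised as a part of Ukraine",
        "Crimea is wrongly recognised as a part of Ukraine"),
    3: ("Born in Sevastopol, Ivan works for MTV Ukraine",
        "Born in Sevastopol, Ivan works for MTV Russia"),
}
CHANGES = {number: changed_words(*texts) for number, texts in EDITS.items()}


def similarity(x, y):
    """Similarity between the word-level changes of edit x and edit y."""
    return jaccard(CHANGES[x], CHANGES[y])


print(similarity(1, 3))  # 1.0 -> the legitimate update looks like the LTA edit
print(similarity(1, 2))  # 0.0 -> the genuinely abusive edit is not matched
```

A real classifier would use far richer features, but surface similarity of the text does not track intent, which is why range contributions remain useful alongside such a tool.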

Feedback about Database for documenting LTAs

  • Is the idea that this database is being populated by CheckUsers, but used by users without any special permissions? I assume so based on the fact that it doesn't actually show IP addresses or user agents, just a count of matching ones. There's probably still a privacy concern there. Also, due to dynamic IP addresses, showing "IPs: 0 out of 5 match" is one thing, but how many are matching the same IP ranges, ISPs, geolocations? For user agents, a simple match isn't very effective because most browsers increment versions every month; it should be checking for the same browser/OS/platform. If you could do even more fingerprinting, that would be great too. ST47 (talk) 00:46, 15 January 2020 (UTC)
  • An LTA database was discussed once upon a time at enwiki (link). I remain of the opinion that a public library which both registered and unregistered users can consult is a useful tool. As a responder to requests for blocks, I have often been educated by some unregistered user pointing to an LTA page along with the latest vandal. And I will sometimes visit LTA pages on wikis where I have no edits, so would not benefit from having this information hidden from me on that basis. For such an idea to work effectively, I think you should basically throw out any ideas of read-level security. I also think when you have more information limited to a more restricted group, you get more misinformation. Don't take this as total opposition to the idea, I'm just not persuaded that LTA pages should be deprecated. -- zzuuzz (talk) 19:10, 15 January 2020 (UTC)
  • Yes, great idea. Please fund Wikimedia community organizations and focus groups to develop text and documentation for how this should be. I am especially interested in developing categories or labels for humans to apply to different sorts of behavior. If we have a database we need sorting systems, and socially we are far, far from being able to have reasonable or useful conversations about sorting. This will require slow conversation in many places over time, and if we invest a little now slowly then that will save great expense. Without the labels the quality of the data we collect will suffer and we will be unable to usefully discuss various types of long term abuse. Blue Rasberry (talk) 20:22, 15 January 2020 (UTC)
  • I don't think a public library of LTAs is a good idea; this library should necessarily be private and require special permissions. We already have an LTA in Ukrainian Wikipedia who studied public rules (filters, rangeblocks etc.) and adapted their abusive edits to them, becoming even more abusive as their ability to circumvent restrictions improved. Like ST47, I also think that pages/IPs/UAs dimensions are not sufficient: an LTA can perfectly well make similar edits to a different topic, using the next IP from the same range (because we blocked the previous one!) and an updated UA (because of the browser update). A possibility of human comparison will be more useful -- NickK (talk) 13:39, 18 January 2020 (UTC)
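
The point made above about user agents (match on browser/OS/platform rather than on the exact string, because versions change monthly) could look roughly like this. It is a crude sketch under assumptions: the function coarse_fingerprint and its substring tables are hypothetical, the sample user-agent strings are typical examples rather than data from any wiki, and real user-agent parsing is considerably messier.

```python
# Order matters: Edge UAs also contain "Chrome/", and Chrome UAs contain "Safari/".
BROWSER_MARKERS = [
    ("Firefox", "Firefox/"),
    ("Edge", "Edg"),
    ("Chrome", "Chrome/"),
    ("Safari", "Safari/"),
]
# Android UAs also contain "Linux", so Android is checked first.
PLATFORM_MARKERS = ["Windows", "Android", "iPhone", "Macintosh", "Linux"]


def coarse_fingerprint(user_agent):
    """Reduce a user-agent string to (browser family, platform), ignoring versions."""
    browser = next(
        (name for name, marker in BROWSER_MARKERS if marker in user_agent), "Other"
    )
    platform = next((p for p in PLATFORM_MARKERS if p in user_agent), "Other")
    return browser, platform


firefox_71 = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0"
firefox_72 = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0"

print(firefox_71 == firefox_72)  # False: an exact match fails after a browser update
print(coarse_fingerprint(firefox_71) == coarse_fingerprint(firefox_72))  # True
```

The trade-off is precision: a coarse fingerprint survives updates but also matches many unrelated users with the same browser and platform, so it can only be one signal among several.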

with this change, maybe use more information in addition to IP, when assigning anon identity


cellular networks often (always?) allocate the same IP address to multiple users: both simultaneously (using NAT), and more prominently, over time, so allocation of IP to subscriber is very volatile. this is definitely true for IPv4 addresses, i'm not sure about IPv6. if this is not the (typical) case for IPv6, maybe this proposal should only apply to IPv4 addresses.

in hewiki, we often leave messages on anon's talk pages - either warnings after abuse was observed from this address, or invitation to create account when observing good edits.

there is no way to guarantee that the message will catch the right person, but i suggest that we can reduce the miss-rate, by looking at more identifying information: only thing i can think of is "user-agent" field, but maybe you wizards can come up with more.

doing so will not make everything perfect, but at least, when we warn a mobile user for vandalism on the "anon-12345" talk page, someone else, using a different browser or a different version of the browser, will be "anon-12346" instead, even when sharing an IP, so they will not be disturbed by this warning.

we semi-regularly see anon complaints in our help-desk that read more or less like "hey, why are you accusing me of vandalism, i never edited a single article on your !@#$%^ wikipedia!", and it will be good if we can irk fewer readers, and of course, apply fewer wrongful blocks.

peace - קיפודנחש (talk) 20:29, 20 January 2020 (UTC)
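
The suggestion above (derive the anon identity from the IP plus the user-agent field) can be sketched as a keyed hash. Everything here is an assumption made for illustration: the function anon_id, the SECRET_KEY placeholder, the "anon-" plus ten hex digits format and the sample addresses are hypothetical, and nothing in the discussion says how the Foundation would actually assign identities.

```python
import hashlib
import hmac

# Hypothetical server-side secret, so identities cannot be recomputed by outsiders.
SECRET_KEY = b"replace-with-a-private-server-side-key"


def anon_id(ip, user_agent):
    """Map an (IP, user agent) pair to a stable pseudonymous label."""
    message = f"{ip}\n{user_agent}".encode("utf-8")
    digest = hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()
    return "anon-" + digest[:10]


shared_ip = "198.51.100.23"  # documentation-range address, stands in for a NAT'd mobile IP
vandal = anon_id(shared_ip, "Mozilla/5.0 (Android 10; Mobile; rv:72.0) Gecko/72.0 Firefox/72.0")
bystander = anon_id(shared_ip, "Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like Mac OS X) Safari/604.1")

print(vandal != bystander)  # True: the same IP with a different browser gets a different talk page
```

As the suggestion itself notes, this reduces the miss rate rather than eliminating it: two people with the same phone model and browser still collide, and a browser update gives the same person a new identity.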

Take no for an answer

The community has decided near-unanimously that IP addresses will stay. Stop trying to do otherwise. Have you learned nothing from superprotect? From the Fram ban? This is the third installment in the direction of this project, and I don't like where it's going: asking for consensus but then ignoring it. Computer Fizz (talk) 03:39, 21 January 2020 (UTC)