Talk:IP Editing: Privacy Enhancement and Abuse Mitigation/Improving tools

From Meta, a Wikimedia project coordination wiki

Main project page (discuss)
Ideas for privacy enhancement (discuss)  · Improving anti-vandalism tools (discuss)

WhoIs?[edit]

Should we include "WhoIs" as an existing tool? — Arthur Rubin T C (en: U, T) 02:54, 30 August 2019 (UTC)

WHOIS has become more or less a generic name for an information blurb that is available from dozens of providers; many IP lookups give as much information as a typical WHOIS page. There's an experimental version on the ToolForge (it's the link available on enwp IP address user pages/contribs pages). I'd suggest linking the ToolForge version, but there are so many other versions that people kind of pick and choose their favourite providers. Risker (talk) 03:26, 30 August 2019 (UTC)

The following fields are useful from whois data: nets.cidr (IP subnet range/mask), asn (ISP/owner ID), asn_cidr (ISP/owner full range/mask), nets.name and nets.description (ISP/owner name), city-state-country (the ISP/owner registration address, not a geolocation; city is seldom provided).
To identify dynamic IPs of one ISP the subnet mask is crucial, the ISP name with country is useful to identify different subnets of the same ISP. —Aron Man.🍂 edits🌾 15:40, 25 January 2020 (UTC)
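The field list above can be sketched as a small extraction step. This is a minimal illustration, assuming a result dict shaped like the `ipwhois` library's `lookup_whois()` output; the sample values are made up for the example, not a real lookup.

```python
# Sketch: pulling out the WHOIS fields listed above from an
# ipwhois-style lookup result. The sample dict is hypothetical;
# a real result would come from a live WHOIS query.
sample = {
    "asn": "14907",
    "asn_cidr": "198.35.26.0/23",
    "nets": [{
        "cidr": "198.35.26.0/23",
        "name": "WIKIMEDIA",
        "description": "Wikimedia Foundation Inc.",
        "city": "San Francisco",
        "state": "CA",
        "country": "US",
    }],
}

def summarize_whois(result):
    """Return the fields most useful for range work: subnet, ASN, owner."""
    net = result["nets"][0]
    return {
        "subnet": net["cidr"],           # range/mask, crucial for rangeblocks
        "asn": result["asn"],            # ISP/owner id
        "asn_cidr": result["asn_cidr"],  # ISP/owner full allocation
        "owner": net["description"] or net["name"],
        "registered_in": ", ".join(
            filter(None, [net.get("city"), net.get("state"), net.get("country")])
        ),  # registration address, not geolocation
    }

print(summarize_whois(sample))
```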

Linking to other feedback[edit]

So it's not lost, and is available for others to read.

wikidata:Wikidata:Project_chat/Archive/2019/08#New tools and IP masking. /Johan (WMF) (talk) 10:13, 9 September 2019 (UTC)

Feedback about proposed tools[edit]

We've put out some ideas on the project page about tools we can build to help improve vandalism detection and mitigation on our projects. We want your help with brainstorming on these ideas. What are some costs, benefits and risks we might be overlooking? How can we improve upon these ideas? What sounds exciting, what sounds sub-optimal? We want to hear all your thoughts. -- NKohli (WMF) (talk) 00:12, 14 January 2020 (UTC)

Summary of previous feedback[edit]

There was extensive discussion at Talk:IP_Editing:_Privacy_Enhancement_and_Abuse_Mitigation, which I will attempt to summarize:

There was clear and overwhelming community rejection of masked editing, to the extent of a likely community consensus to place a preemptive block against all masked edits. If the Foundation were to terminate the ability for logged-out users to IP-edit (replacing IP editing with masked editing), the Foundation would effectively be denying logged-out users any ability to edit at all. Any new tools for handling masked edits would then be completely non-functional, as there would be exactly zero masked edits to examine.

If the Foundation drops the masking aspect, there is of course interest in tools or improvements for dealing with IP-edits. Alsee (talk) 21:08, 20 January 2020 (UTC)

I am also quite amazed at how the Foundation deals with the will of the (at least so far) overwhelming majority ... --Udo T. (talk) 20:59, 22 January 2020 (UTC)
@Alsee: Thanks for the summary, although I think it's premature to say that there has been a community consensus. That discussion was the first reaction to the first announcement that we need to think and work in this area. As you can see on the page and in the discussions below, our thinking has evolved since then and we're exploring more ideas based on what we're learning from the folks on this page.
I agree with you that this is a scary idea, and it makes sense that you would want to draw clear lines in order to protect the wiki from falling into chaos. However, the truth is that the rules about the use of personally identifiable information on the internet are coming under more and more scrutiny, and we have to figure out how to respond to the incoming changes. Publishing people's IP addresses on the internet is not going to be a viable solution forever, either because of new rules or because the technology changes around us. For instance, there are ongoing discussions about a federal privacy law in the United States concerning personal data protection. Similar laws are already in effect in other places. In light of all this, we think it's important to put into place tools that have the ability to work in the absence of IP addresses while still providing our wikis with the level of protection that they need. In order to do that, we need to have these conversations, understand the costs and risks, and see if we can figure out solutions together. Your valuable experience as a long-term community member can be helpful in shaping these tools. -- NKohli (WMF) (talk) 02:56, 23 January 2020 (UTC)
NKohli (WMF) my specialty and focus is in Foundation-community engagement itself. I'm happy to say I've seen positive developments in Foundation engagement, but to be honest, the Foundation is still struggling to figure out how to engage with the community. This discussion is about engagement itself, not about the substance of your project. I am trying to alert you that there is a problem here - it is in part a communication problem. The smoke alarms are squealing and you don't seem to hear them. I am trying to alert you that the flames are about to escalate badly unless you can change the approach to the community. I can't explain and solve community engagement in this post. However, I can boil down your current situation to these minimalistic points:
  • The community believes it has the right to participate in the discussion and decision about whether or not we do masking.
  • The community has the capability to prevent masked editing.
There are complex philosophical arguments about whether the two points above should be true, and about whether they are morally legitimate. However, debating philosophy will not help your immediate needs. Right or wrong, good or bad, legitimate or not, those two points are your current reality.
You are, perhaps unintentionally, creating an appearance of implicitly declaring masking to be mandatory and non-negotiable. You are, perhaps unintentionally, creating an appearance that you are unwilling to even discuss whether or not masking will be built and deployed. You are, perhaps unintentionally, creating an appearance that you consider the subject out-of-scope of your job duties. Right or wrong, your current approach conflicts badly with community beliefs and expectations.
The community values discussion and consensus-based resolution of issues. The community has a strong anti-value against "refusal to discuss". It doesn't matter how polite and friendly you are, it doesn't matter how good your intentions are. It is a serious error to treat masking itself as a topic you consider already closed, already resolved, and therefore not constructively discussable. That discussion is in fact happening, regardless of your participation. It is a mistake to fail to participate in that conversation. The community will interpret your refusal to discuss the subject as ignoring the community, as dysfunctional, as rude, as abusive, as illegitimate. Other teams have faced essentially the same situation, and the current approach consistently ends badly. If you are unwilling or unable to participate in that discussion, the only remaining helpful thing I (or another community member) can do is organize and deliver a formal consensus. The community has the ability to block all masked edits, rendering all of your work null. You need the community as a partner.
I understand your difficulty. However, your most immediate job requirement is to acknowledge the situation and to find a way to engage with the subject that the community wants to discuss. I would be eager to offer any help I can towards successful dialog between the Community and Foundation. Alsee (talk) 14:58, 24 January 2020 (UTC)

Feedback about IP info feature[edit]

  • One concern that I have about this is that many admins have sufficient knowledge to make IP rangeblocks by looking at the IP address, the whois data, the contributions. How would admin rangeblocking work under this scenario? "Owner: Wikimedia Foundation" might be true for many different IP address ranges. A block on 91.198.174.0/24 shouldn't necessarily also affect 2620:0:860::/46. How would an admin view the contributions for a specific range, and issue a block? I'll note that it isn't sufficient to simply aggregate blocks based on the announced prefixes. For IPv4, many residential ISPs provide a single IP per customer which can be static for many months, so there is no need for a rangeblock at all. In IPv6, we almost always want to block the /64 - but that is never the announced prefix, and for some providers, especially mobile ones, a /64 is too wide. Users need to see the IP address WHOIS data and contributions history, not an opaque "similarity percentage", in order to understand how to interact with that IP address. ST47 (talk) 00:34, 15 January 2020 (UTC)
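The range arithmetic behind the comment above can be sketched with Python's stdlib `ipaddress` module: the two example blocks are in different address families and cannot contain each other, and for IPv6 the usual blocking unit is the /64 containing a given address, not the much wider announced prefix. The specific IPv6 address used is invented for illustration.

```python
import ipaddress

# The two ranges from the comment above: an IPv4 /24 and an IPv6 /46.
v4_block = ipaddress.ip_network("91.198.174.0/24")
v6_block = ipaddress.ip_network("2620:0:860::/46")

# Distinct address families: a block on one can never cover the other.
print(v4_block.version != v6_block.version)  # True

# For IPv6, the usual blocking unit is the /64 that contains the
# address, not the (much wider) announced prefix.
addr = ipaddress.ip_address("2620:0:860:ed1a::1")  # hypothetical address
per_user = ipaddress.ip_network(f"{addr}/64", strict=False)
print(per_user)          # 2620:0:860:ed1a::/64
print(addr in v6_block)  # True: that /64 sits inside the announced /46
```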
  • @ST47, NickK, ChristianKl, Ahecht, and HaussmannSaintLazare: Pinging everyone who chimed in about ranges and range-blocks - Thanks for bringing up the concerns about detecting ranges to apply appropriate range-blocks. We could potentially build into the Finding similar editors feature the ability to see the number of users associated with a range (/16, /48 etc.) for an IP address. In addition to that, we could also show contributions from a range (I realize there was an idea proposed about doing RangeContributions in the past but it did not get built) and make it work for a masked IP address. There is a possibility that if you can see all the contributions coming in from a range (maskedIP/16 or maskedIP/48 etc.) you can make a call about blocking that range or not, and then the software can block the range on behalf of the user. Do you think that could work?
  • @ST47: you mention - "Users need to see the IP address WHOIS data” - can you tell me more about what pieces of information you look at when you do a whois on an IP address and how would you use it? Thanks so much. -- NKohli (WMF) (talk) 23:08, 23 January 2020 (UTC)
  • @NKohli (WMF): For your first point: Seeing the number of "users" on a given range may not be terribly useful because I don't know how you would count users, especially in IP ranges that are highly dynamic. Seeing contributions from a masked range would be essential, but there are still at least two key limitations. One is that I wouldn't know how big a range to query without being able to see the WHOIS data - pick too small a range and your block won't be effective; pick too large a range and you might block too many people unnecessarily. The WHOIS includes the actual size of the IP address block, which is often the best size to block. Another is that knowing about the type of range helps us evaluate blocks. Knowing what type of company owns the IP range provides important info, and I would be generally less comfortable placing rangeblocks without knowing that information.
  • And that leads in to your second point. When looking at range contributions, I look at the WHOIS and start with the most specific listed range. Different companies allocate their space differently, sometimes the most specific meaningful range is a /16 (especially residential and mobile), some go as far down as /28 (especially Cogent). Without knowing the network size for the specific IP address, I don't know what size range to look at in the contributions. The WHOIS data also tells me what company operates an IP address. If it's a school (and there's a history of abuse on the range) we tend towards longer anon-only blocks. If it's a mobile range, we tend towards much shorter blocks because there are often many more legitimate customers on the same range. Residential and commercial ISPs depend on the company and the contribs, some are very busy and blocks need to be kept short and specific, others are very quiet and can be safely rangeblocked if needed, you would want to look at the rate of good edits coming from the IP range to tell the difference. If it's the public wifi at a specific Starbucks, then the other IP addresses in the range are almost certain to be irrelevant. Data centers and similar need to be hard blocked, while the other types of ranges are almost always anon-only blocks. So, the AS, network, and all of the organization names (sometimes there is more than one) would guide that. In some cases, we would use the AS number or organization name to get a list of all of the networks assigned to that AS or organization, which we would then query for contributions and block log, to determine whether they should be blocked as well. This is especially true for data centers, as we block those on sight. 
Some have many dozens or hundreds of IP ranges, so being able to identify that IP address's owner and being able to find and block all of the other IP ranges with the same owner allows us to prevent a user from evading a range block on their proxy by simply switching to a different site with the same VPN provider, or releasing and renewing their IP address with their cloud services provider. ST47 (talk) 05:12, 24 January 2020 (UTC)
  • @ST47: Thanks for your response. It adds a lot of context that I did not know earlier. If we could surface IPs partially (say, 198.72.38.***), what are the kinds of things that you won't be able to do? I'm especially interested in your thoughts about doing rangeblocks in such a scenario. Assume that this feature still exists and that we surface high-level location and company/ownership information about the IP address. It removes the device association, which will improve privacy for unregistered editors. -- NKohli (WMF) (talk) 18:34, 18 February 2020 (UTC)
  • @NKohli (WMF): I think that managing rangeblocks without knowing ranges will be tricky. A rangeblock is always a compromise between blocking the entire range to prevent the user from editing and blocking as small a subset as possible to minimise collateral damage. Not having IPs will be very much like blocking blindfolded. A provider name will definitely help to understand the behaviour (e.g. if the IP is static or dynamic) but will not resolve the problem. What can help is, firstly, getting contributions from any range from /16 to /32 for IPv4 and from /19 to /48 for IPv6; secondly, getting default ranges from a Whois (e.g. a provider uses a /16 and the specific IP belongs to a /20 group). Something like a drop-down menu Get contributions for: /16 (default range), /17, /18, /19, /20 (default subrange), /21, ..., /31 with a Block this range button — NickK (talk) 14:36, 24 January 2020 (UTC)
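The drop-down proposed above can be sketched as follows: given an IPv4 address, list every containing range from /16 down to /31, with the WHOIS-listed allocation marked as the default. The /20 default and the example address (from a documentation range) are assumptions for illustration only.

```python
import ipaddress

def candidate_ranges(ip, whois_prefixlen=20, widest=16, narrowest=31):
    """List containing CIDR ranges for an IPv4 address, widest first,
    marking the WHOIS-listed allocation (assumed here) as the default."""
    addr = ipaddress.ip_address(ip)
    ranges = []
    for prefix in range(widest, narrowest + 1):
        net = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
        label = str(net)
        if prefix == whois_prefixlen:
            label += "  (default, from WHOIS)"
        ranges.append(label)
    return ranges

for entry in candidate_ranges("198.51.100.7"):
    print(entry)
```

Each entry would map to a "contributions for this range" view plus a block button, as described in the comment above.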
  • @ST47 and NickK: Got it. This is helpful. What I'm hearing from you both is that in order to do a rangeblock you will need access to more information than is currently suggested in the mockup on the project page. This will include (but not be limited to):
    • Listed specific/default range for an IP in the WHOIS
    • Network operator(s)
    • Having an option to look at contributions from many different ranges (/16, /17 etc.)
Of course, this is not exhaustive. As a first pass on this feature, I am looking to get a sense of what parts of a WHOIS are useful so we can show those and save the trouble of doing a manual WHOIS for our power users. @ST47 and NickK: - does it feel like having such a feature would be helpful in the first place? I'm not talking about IP Masking here -- only the IP info feature to show information about an IP address in a more standardized way, on the wiki. Thanks a lot. -- NKohli (WMF) (talk) 23:30, 5 February 2020 (UTC)
@NKohli (WMF): Personally for me knowing the operator, location and range is usually enough. This would however not change my work a lot: for most popular providers on my wiki (e.g. major Ukrainian mobile operators for Ukrainian Wikipedia) I already recognise the ranges without WHOIS, if I happen to get something really obscure I will probably google to find out if it is not a VPN or open proxy. But, as I stated above, there are risks associated with making this data more visible: for real users of obscure providers (e.g. a person editing from a Japanese university on Ukrainian Wikipedia) this will attract more attention to their personal information — NickK (talk) 00:48, 6 February 2020 (UTC)
@NickK: Thanks for your response. I heard that you recognize a lot of IPs on sight, without needing to do a WHOIS. If we could surface IPs partially (say, 198.72.38.***), what are the kinds of things that you won't be able to do, given that this feature still exists and we surface high-level location and company/ownership information to you? -- NKohli (WMF) (talk) 18:34, 18 February 2020 (UTC)
@NKohli (WMF): If I am given a /24 range instead of an address, this will probably allow me to do most of the administrative actions I am doing now. I might still be unable to distinguish two different users in the /24 range (if they happen to exist) and might have some collateral damage doing rangeblocks, but /24 should resolve most problems — NickK (talk) 15:16, 22 February 2020 (UTC)
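The partial-masking scheme discussed above (last octet replaced by ***) carries exactly the information of a /24 range, so tools could translate between the two forms. A minimal sketch, assuming last-octet-only masking of IPv4 addresses:

```python
import ipaddress

def masked_to_network(masked):
    """'198.72.38.***' -> the /24 network it denotes."""
    a, b, c, last = masked.split(".")
    assert last == "***", "only last-octet masking is handled in this sketch"
    return ipaddress.ip_network(f"{a}.{b}.{c}.0/24")

def mask_ip(ip):
    """'198.72.38.145' -> '198.72.38.***'"""
    a, b, c, _ = str(ipaddress.ip_address(ip)).split(".")
    return f"{a}.{b}.{c}.***"

print(masked_to_network("198.72.38.***"))  # 198.72.38.0/24
print(mask_ip("198.72.38.145"))            # 198.72.38.***
```

This equivalence is why NickK notes above that a /24 would let him do most current admin actions: the masked form and the /24 are the same information.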
  • @NKohli (WMF): looking to get a sense of what parts of a WHOIS are useful
The exact fields that are useful are listed in #WhoIs?. —AronM🍂 edits🌾 12:36, 6 February 2020 (UTC)
Missed that. Thanks for pointing that out, @Aron Manning:. -- NKohli (WMF) (talk) 01:42, 11 February 2020 (UTC)
  • "VPN" is not just a green checkmark or a red x. There are many different types of proxies and VPNs, some are public and some are only used by specific schools or corporations, and some data sources are of higher quality than others. Admins routinely use the reverse DNS output, not just of a single IP address but of the entire IP range to observe trends, and other tools like port scanning, DNS caches, blocklists, and threat intelligence, to identify proxies. Many of these require access to the true IP address. ST47 (talk) 00:39, 15 January 2020 (UTC)
  • @ST47: How do you usually go about assessing if an IP address needs additional scrutiny like what you just mentioned? And how do you go about it? Do you have a list of tools you use? Thanks. -- NKohli (WMF) (talk) 23:14, 23 January 2020 (UTC)
  • @NKohli (WMF): I start with WHOIS and ipcheck. Inconsistent or inconclusive results in the ipcheck tool, if the WHOIS company name "sounds kinda like" a hosting provider but it isn't one that I recognize, or especially in cases of a long-term abuse user who is using multiple IP addresses from completely different ranges, would all call for additional investigation. The reverse DNS of the ip address is shown in the whois tool itself, the range reverse DNS is here, I use nmap for port scanning, the spamhaus lists and the XBL in particular are good for identifying compromised machines that are acting as a proxy, and talos can have useful information even though it's primarily for email spam. ST47 (talk) 05:24, 24 January 2020 (UTC)
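The reverse-DNS triage described above can be roughly sketched as hostname keyword matching. The keyword lists below are illustrative guesses only, not a vetted ruleset; as ST47 notes, real checks (ipcheck, Spamhaus XBL, nmap port scans) go far beyond string matching.

```python
# Illustrative keyword lists - assumptions, not an authoritative ruleset.
HOSTING_HINTS = ("amazonaws", "googleusercontent", "linode", "digitalocean", "ovh")
RESIDENTIAL_HINTS = ("dsl", "cable", "pool", "dyn", "customer")

def classify_rdns(hostname):
    """First-pass triage of a reverse-DNS hostname by keyword."""
    h = hostname.lower()
    if any(k in h for k in HOSTING_HINTS):
        return "likely datacenter/hosting"
    if any(k in h for k in RESIDENTIAL_HINTS):
        return "likely residential/dynamic"
    return "unknown - needs manual checks"

print(classify_rdns("ec2-52-14-0-1.us-east-2.compute.amazonaws.com"))
print(classify_rdns("dsl-189-203-1-2.example-isp.net"))
```

Inspecting rDNS across a whole range, as described above, would run this over every address's hostname to observe trends rather than judging a single IP.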
  • @ST47: Thanks a bunch -- I will make note of these tools on the project page so we don't lose track of them. I have a follow-up question - how often would you say you encounter an IP address that you would spend time investigating with the tools? And how much time do you invest when doing such investigations? -- NKohli (WMF) (talk) 23:39, 5 February 2020 (UTC)
  • I think the WMF should be supplying their own summary whois and geolocation information, but not at the expense of hiding the addresses. They should also be providing rDNS. I share the same concerns about proxies expressed above by ST47. Sometimes we are dealing with a level of sophistication far beyond what any blacklist can possibly tell you. I really don't trust half the blacklist results, especially at the levels of precision we need and the levels we can get from blacklists. There's also a whole bunch of grey areas. Also, those who don't understand IP addresses regularly defer to those of us who do. There's no real deficiency there. -- zzuuzz (talk) 19:33, 15 January 2020 (UTC)
  • @Zzuuzz: Thanks for pointing out the fragility of blacklists. I'll ask you what I asked ST47 above - How do you usually go about assessing if an IP address needs additional scrutiny? And how do you go about doing the additional checks? Do you have a list of tools you use? Like, are there any blacklists that you would always trust? Thanks. -- NKohli (WMF) (talk) 23:16, 23 January 2020 (UTC)
  • @NKohli (WMF): How do you usually go about assessing if an IP address needs additional scrutiny? Some things include: unblock requests, apparent sockpuppetry, bot-like edits (not only rapid edits, but also certain types of spam and vandalism), scams, egregious libel, previous blocks or reports (eg ProcseeBot or Open proxy detection), as well as 'known issues'. Most admin activity will inevitably involve looking at Whois, geolocation and rDNS. Certain ISPs are inherently suspicious (eg Korea Telecom, Softbank Japan, MTN Ghana), sometimes there's country-hopping, sometimes there's indications that it's a data centre.
  • @Zzuuzz: Got it. If I'm understanding this right, you look at the IP contributions, block logs and WHOIS (including geolocation and rDNS). Does that sound right? -- NKohli (WMF) (talk) 00:16, 6 February 2020 (UTC)
  • To assess if an IP needs further scrutiny? Yes, probably, but I think a better summary is the one I listed above. The 'contributions' part can involve a whole host of things (and may also include contributions from an account, as I'm a checkuser). Unblock requests are another distinct thing, where I might deploy every advanced check from search results to port scans. -- zzuuzz (talk) 17:46, 9 February 2020 (UTC)
  • how do you go about doing the additional checks? There are two types of check: those which involve connecting to the IP, and those which involve using the Duck Test. Very often I'll just put the IP address in my browser address bar and check the response. Sometimes I might use nmap or telnet, or some similar connection tools which I don't think I really need to go into details about. And sometimes I'll check for an active proxy connection, whereas sometimes I'll just check for a port response. As I've indicated in previous feedback, sometimes I'll target a neighbouring IP or others in the same range.
  • @Zzuuzz: Got it. How often would you say you need to use advanced tools to investigate IPs and how much time does it take? -- NKohli (WMF) (talk) 00:16, 6 February 2020 (UTC)
  • I can spend anything from 10 seconds to 30 minutes checking if an IP is an open proxy. I might deploy port scans less than 1% of the time, but I might check simple port responses most of the time, I might check for a working proxy around less than quarter of the time, and probably less than 10% would be based on whois/rdns alone, if I were to guess. Because of the nature of the work I typically do, mainly dealing with LTA sockpuppets and serious vandalism, I probably do some of this most of the time. -- zzuuzz (talk) 17:46, 9 February 2020 (UTC)
  • are there any blacklists that you would always trust? Probably the only blacklist I fully trust is the list of exit nodes provided by Tor. This isn't always that helpful, since Tor is mostly automatically blocked by MediaWiki. I think the next blacklist I'd specify is the one provided by VPN Gate[1]. This comes with caveats, that it "sometimes contains wrong IP addresses", and it isn't complete, and it isn't directly queryable. I would place most other major blacklists into their own third category, with some more generally reliable than others, but all a bit 'fuzzy'. The IP check tool contains some of these blacklists, and you can tell the 'fuzziness' - if you check some IPs you will get some lists that say yes, and others that say no. I'd use these lists as supplemental information combined with contributions and page histories, posts on other websites, and things like geolocation, dynamicity, and ownership of the range.
  • One of the biggest problems with blacklists is dynamic IPs and 'peer-to-peer' open proxy networks. VPN Gate can be considered just one example of this; Tor used to have similar issues. In a common scenario, people using mobile phone networks act as open proxies, sometimes for a really short time (measured in hours). This ends up with a 'blacklisting' for an open proxy which is no longer open, or a whole phone network being blacklisted, or a false negative result. -- zzuuzz (talk) 21:52, 24 January 2020 (UTC)
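The 'fuzziness' described above suggests any automated aggregation should present blacklist hits as leads, not verdicts. A minimal sketch of that framing, with made-up list names and results:

```python
def summarize_blacklists(verdicts):
    """verdicts: {list_name: True (listed), False (not listed), None (no data)}.
    Summarizes hits cautiously: a hit may be stale (e.g. a phone that
    briefly acted as a proxy), and a miss is not proof of cleanliness."""
    hits = [name for name, v in verdicts.items() if v is True]
    misses = [name for name, v in verdicts.items() if v is False]
    if not hits:
        return "no list flags this IP (not proof it is clean)"
    return (f"flagged by {len(hits)} of {len(hits) + len(misses)} lists "
            f"({', '.join(hits)}) - treat as a lead, not a verdict")

print(summarize_blacklists(
    {"tor-exit": False, "list-a": True, "list-b": None, "list-c": True}
))
```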
  • @Zzuuzz: Thanks! This is very helpful. I'm wondering if we could automate blacklist checking against some standard blacklists and show a combined result. I'm not sure how technically feasible that is but it sounds like it could save a lot of manual checking. -- NKohli (WMF) (talk) 00:16, 6 February 2020 (UTC)
  • Really, I'd refer you to my previous answer. To say than an IP is an average of 80% open proxy is really not good enough, IMO. It might be useful to flag an IP as potentially open, or likely not open, but I think this might be misleading more often than not. -- zzuuzz (talk) 17:46, 9 February 2020 (UTC)
  • @Zzuuzz: Sorry, I should have clarified - it wouldn't be an absolute indicator but if we were to provide some indication (say maybe only by checking Tor lists), that could potentially be useful. Of course we would have to make it amply clear that this can be incorrect. I get your point though - we would have to be careful to not convey that what they are seeing is necessarily correct. -- NKohli (WMF) (talk) 01:42, 11 February 2020 (UTC)
@NKohli (WMF): The tools mentioned above by ST47 and Zzuuzz (whois-ISP info, geoloc and VPN detection) are those that I've proposed to include in the design of the CU2.0 tool in this comment: phab:T237593#5643299, with an example in phab:T174553#5643405 and reiterated in phab:T238782#5788564. In my experience determining the ISP and geoloc is a primary step to connecting dynamic IPs, and VPN reports/likelihood are useful to recognize if the real source might be concealed. This data could be efficiently included in the result tables, greatly increasing the efficiency with which the tool can be used and eliminating the need to individually look up each IP in whois and VPN databases. So far this feedback has not been considered or discussed for the design, although this would be the best time to include it in the mock-ups, so that later this feature request does not need another round of design and implementation. —Aron Man.🍂 edits🌾 12:17, 25 January 2020 (UTC)
The above proposes to add columns for the ISP, the geolocation and the proxy report. Sorting by the ISP is important, but the most useful feature would be to group edits by ISP (subnet). This would make a long list of edits by an IP hopper into one entry. The group should open like a collapsed section by clicking an "open" button and show the individual edits. —Aron Man.🍂 edits🌾 15:50, 25 January 2020 (UTC)
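The grouping proposed above can be sketched as collapsing a history listing by subnet, so an IP hopper's edits appear as one expandable row. The grouping sizes (/24 for IPv4, /64 for IPv6) are assumptions; a real implementation would use the WHOIS-listed allocation per ISP.

```python
import ipaddress
from collections import defaultdict

def group_edits_by_subnet(edits):
    """edits: list of (ip_string, edit_summary) tuples.
    Returns {subnet: [(ip, summary), ...]} using assumed grouping sizes."""
    groups = defaultdict(list)
    for ip, summary in edits:
        addr = ipaddress.ip_address(ip)
        prefix = 24 if addr.version == 4 else 64
        net = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
        groups[str(net)].append((ip, summary))
    return dict(groups)

edits = [
    ("198.51.100.7", "edit A"),
    ("198.51.100.93", "edit B"),  # same /24 as edit A: one collapsed row
    ("203.0.113.5", "edit C"),
]
for subnet, items in group_edits_by_subnet(edits).items():
    print(f"{subnet}: {len(items)} edit(s)")
```

In the UI described above, each printed row would be a collapsed entry with an "open" button revealing the individual edits.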
@Aron Manning: CheckUser project is specifically focused around the work Checkusers do and hence the technical work is all about the Checkuser extension. This project proposal is about bringing this functionality to the broader community. The table columns you mention should be added to MediaWiki core or maybe made into a new extension altogether. I agree that this is important work and we will dedicate the time and resources needed to build it in, once we have a concrete definition of what we want to build. -- NKohli (WMF) (talk) 00:19, 6 February 2020 (UTC)
@NKohli (WMF): We are mixing the CheckUser project and the public user info (the popup bubble) here, because both display information from the same provider: whois (subnet/ISP), geoloc, proxy. See the task to implement querying the data: phab:T174553 ("Create a mechanism that allows fetching geolocation and subnet data for IP addresses"). This can be part of the core, or checkuser, I don't see if that decision has been made. As long as the query result can be displayed in a popup on the history page, it doesn't matter.
However, the same information is also necessary for the checkuser extension: it would be best to show the subnet, geoloc and proxy info next to each IP address, to eliminate the need to look up each and every individual IP with those tools. This information is best suited to be presented on the Compare tab of the new CU2.0 extension as additional columns. This can significantly speed up a CU investigation by avoiding opening 3+ tools for each IP. When you asked about what information is necessary for a CU investigation, this was the answer: subnet, rDns, geoloc, proxy. —AronM🍂 edits🌾 13:16, 6 February 2020 (UTC)
@Aron Manning: I think we're both talking about doing the same thing, just from different angles. I agree that this information would be helpful for checkusers -- it can be helpful for everyone else too, that's why we're talking about doing this for a broader audience, not just checkuser. Which is why I'm seeking everyone's input on it. It doesn't need to live inside Checkuser (technically or otherwise). Once it's implemented, we will add the links to access this information in Checkuser as well. -- NKohli (WMF) (talk) 01:42, 11 February 2020 (UTC)
  • I support this general idea. Please emphasize and fund community conversation before investment in tech. Blue Rasberry (talk) 20:19, 15 January 2020 (UTC)
  • There are two things we will miss with this information:
    1. A way to identify a range. While the name is useful for a small company, a government institution or a university, any nationwide mobile operator in a big country has an eight-digit number of users who can potentially edit Wikimedia projects from their networks. This information in itself makes little sense, a range block on the entire Vodafone network will probably be simply dangerous. Some abusers are localised to a small subrange, often /18 or even /24, and having this information is extremely useful. I also wonder how we can translate this to rangeblocks...
    2. A way to check other forms of IP abuse, from open proxies to leaky colos. In order to check whether an IP is involved in any form of abuse, I just google it; usually you find strange results for IPs that are VPNs or proxies, and next to nothing for legitimate ones. I don't think a VPN red/green status will work: for instance, my work computer is technically connected to a corporate VPN, but it is not abusive as it can be accessed only by me. I think that in order to make a real check, a heavy investment with regular updates of VPN/proxy etc. databases is needed.
    On the other side, this will make identifying some people easier, not more difficult. For instance, we have a user working for a Japanese university in Ukrainian Wikipedia. It is highly likely that there are very few Ukrainians working for this specific Japanese university, which will make identifying him easier for everyone, not just for the few who will go the extra mile and use Whois — NickK (talk) 13:39, 18 January 2020 (UTC)
  • @NickK: Got it. Zzuuzz also mentioned that checking for a VPN is not a black/white thing. Would it be helpful to, say, show that an IP address is from a corporate VPN, or to have some way to indicate how "big" that VPN might be? It looks to me like there is often a lot of manual searching and institutional knowledge required for these processes. Like you mentioned that you recognize a lot of IP addresses on sight. How would a newcomer to this process be able to do those things? Can the tools help? I appreciate your thoughts. Thanks again. -- NKohli (WMF) (talk) 17:36, 11 February 2020 (UTC)
    @NKohli (WMF): I think the best thing you can do is to use knowledge of bot owners who are active in blocking open proxies. They do use multiple databases and use algorithms that can be generalised for all wikis.
    Regarding VPNs, I wonder what is your strategy on finding information on VPNs. Usually it is not public, and I am not aware of any tool that would allow to find it out. For instance, how can you find out the number of people using the VPN of the company I am working for? I don't have any tools for this, do you? — NickK (talk) 19:11, 11 February 2020 (UTC)
  • @NickK: I'll admit that I do not know whether that is technically feasible or not. That will warrant a discussion with the engineers to figure out the best way to track and surface VPN information. When I said how "big", I meant using the IP to figure out the company and seeing how big the network block they own is. Of course, with a VPN that all changes. But if it's a private VPN, chances are that you wouldn't see that information in any blacklists or elsewhere, right? Isn't that already a problem with the existing system? -- NKohli (WMF) (talk) 23:25, 12 February 2020 (UTC)
    @NKohli (WMF): From what I know, there are two reliable ways of identifying VPNs.
    • The first one is based on public information, e.g. googling the range or checking whois. For instance, 216.162.44.0/22 is quite easy to identify as a public VPN provider by googling it (reported pretty much everywhere). 217.148.76.0/24 is quite easy to identify as a private VPN (belonging to the company Caixa) by checking whois. However, this does not really answer how many people are actually using them: the first one is linked to a small tech company with clearly thousands of people using their IPs, the second one is linked to a known company but with no accurate way to estimate how many of their employees use this specific VPN.
    • The second one is based on technical information like the number of different devices connecting from the same IP address. It is interesting to explore both different UAs and different MAC addresses. For instance, a large corporate network will tend to have very few different UAs, due to the company's wish to uniformise as much as possible, but a lot of MAC addresses. A lot of desktop computer UAs and MAC addresses will probably mean a popular VPN or open proxy, e.g. the VPN offered by Opera. A lot of mobile UAs and MAC addresses will probably mean a popular Wi-Fi network, e.g. the public Wi-Fi of an airport or a railway station. Such identification can technically be done on the server side and proactively (e.g. raising an alert if behaviour similar to a public VPN or proxy is detected), but I don't know whether this is the kind of information we are allowed to collect.
    Regarding private VPNs, I don't think I really want to see them. If we have a user, usually in a restrictive country like China, who subscribes to a private VPN in a non-restrictive country (e.g. the Netherlands) and connects to Wikipedia through it, do we really want to take any action against their Dutch IP? I don't think so — NickK (talk) 15:16, 22 February 2020 (UTC)
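For illustration, the device-diversity heuristic NickK describes (few desktop UAs but many devices suggests a corporate network, and so on) could be sketched roughly like this. The function name and thresholds are invented for the example, not part of any existing tool:

```python
# Hedged sketch of the UA-diversity heuristic described above.
# Thresholds are arbitrary illustrative assumptions.
def classify_shared_ip(desktop_uas: int, mobile_uas: int, devices: int) -> str:
    """Guess the kind of shared network behind one IP from the variety
    of user agents (UAs) and distinct devices observed on it."""
    if devices > 50 and desktop_uas <= 3:
        # Many machines running uniform software: typical corporate network.
        return "corporate network"
    if desktop_uas > 20 and devices > 20:
        # Many different desktop setups: popular VPN or open proxy.
        return "vpn or open proxy"
    if mobile_uas > 20 and devices > 20:
        # Many different phones: public Wi-Fi (airport, railway station).
        return "public wi-fi"
    return "unknown"
```

A real implementation would also have to handle the server-side data collection and the privacy constraints NickK raises; this only shows the shape of the classification step.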
To the extent that easy access to the information is an issue, this feature could be a gadget that has to be manually activated in Preferences/Gadgets. Any user who needs the information can activate the gadget, just as they can now run a Whois search, but there will be less casual discovery of the information. ChristianKl❫ 16:39, 20 January 2020 (UTC)
  • To me this feature does appear to provide the information that non-admins need to be able to access. On the other hand it doesn't seem to provide the necessary information to do rangeblocks for the reasons Zzuuzz and ST47 have pointed out. ChristianKl❫ 16:39, 20 January 2020 (UTC)
  • There are quality issues with this proposal. Easily accessed tools to identify the "owner" of an IP regularly provide different information for exactly the same IP; I see it on a regular basis, where "respected" third party IP information sources will give different granularity and different information. There are also issues when looking at ranges, where an entire range is identified as being "owned" by one organization, when in fact they only have one or a few IPs within a larger range. In some ways, this change is no different than the present situation; anyone looking up an IP externally could encounter the same issues. But they aren't usually being "published" by the WMF, which would be the major change here. Anything that the WMF uses will be dependent on third party sources, just as is used now. The difference is that it's pretty transparent that they're not WMF sources; either they name the website, or they say it's generated by script data under the management of community members. The absence of that buffer is a non-negligible risk if we wind up blocking huge ranges because they're supposedly a colocation host, only to later find that the colo only has a /32 or even only a handful of IPs. (And yes, I know that we're blocking ranges that are far too large now, but...) Risker (talk) 20:09, 20 January 2020 (UTC)
  • Even putting rangeblocks aside, non-admins will often check the contributions of adjacent IPs when reverting vandalism from IP-hopping vandals. Shifting that burden entirely to admins is not realistic. --Ahecht (TALK PAGE) 21:28, 20 January 2020 (UTC)
  • I feel that in Japanese Wikipedia many range blocks are used for a long time, and they affect many IP users. As an example, I show 36.11.224.0/24 and 36.11.225.0/24 (6-month range blocks from January 1, 2020 to July 1, 2020). I found these range blocks when trying to edit the Japanese Wikipedia with my KDDI smartphone. Other administrators should be able to correct such blocks, for example by shortening or narrowing them. But in the Japanese Wikipedia, such corrections are hardly ever made because of the small number of administrators. I disagree with this proposal because it will make the administrators' activities invisible and make the situation worse. Perhaps one or a few Wikipedias require such a feature, but if so, they should consider enabling it on those Wikipedias only, after discussion within each of them; I don't think it's appropriate to make changes to MediaWiki that affect all Wikipedias. Thank you. --HaussmannSaintLazare (talk) 18:53, 21 January 2020 (UTC)
IP info feature mock-up 2.png
  • See mock-up on right. This would be very useful for helping editors review IP edits and identify COI edits coming from the subject (example, source). In the checkuser tool it is crucial for identifying dynamic IPs (IP hopping) and suspicious subnets and IPs (VPNs). A solution to identify (track) anon editors is necessary, but best discussed separately, as that's a different topic altogether. —Aron Man.🍂 edits🌾 09:21, 2 February 2020 (UTC)
  • In principle a good idea. I think we can learn from range blocks; I have personally applied a few before. Per above, proxy detection is not a binary yes/no operation, hence it will need the best available resources to consolidate all possible proxy-detection methods, and we will have to try out for a period whether the cross or tick works for cases like tethering. Abused proxies are many, so this is key. P.S. The in-depth discussion above on how to detect proxies, reverse DNS, etc. is useful, but it also shows abusers how we detect them; can we do this in a private setting (like some mailing lists)? I don't wish too many beans spilled. --Camouflaged Mirage (talk) 15:28, 17 February 2020 (UTC)

Feedback about Finding similar editors feature[edit]

  • Automated behavioral comparison is a great idea, if you can find a way to make it work from a performance standpoint. Why limit it to only IP editors? ST47 (talk) 00:41, 15 January 2020 (UTC)
  • @ST47: We were not sure how good an idea it is to make this tool work for everyone, not just IP editors. There could be concerns about legitimate socks (used for privacy or other reasons) being exposed if we make it work for all. There's potential to find workarounds though. I figured it was probably easiest to start with just unregistered editors and then expand it to everyone if the community decides so. -- NKohli (WMF) (talk) 21:26, 12 February 2020 (UTC)
  • Yes, great idea. I agree with the noted risk that automation can inappropriately accuse editors, but we already have an underregulated and nonstandard detection process. The way to counter the ethical challenges is by funding documentation, more accessible instructions, and online and in-person meetups for people to raise issues and develop solutions. We already have a very labor-intensive system which is not scaling. Our greatest threat is not the early bias of the first automation, but the existing problem of undetected and unanswered misconduct. By not having semi-automation we permit too much behavior which we ought to exclude. Blue Rasberry (talk) 20:25, 15 January 2020 (UTC)
  • @Bluerasberry: Indeed - ample documentation and configuration options for the communities to decide what works for them would be important to make this work successfully. I agree that the current system is very labor intensive, which is why we are focusing on tools that can assist in the work our prolific editors do. -- NKohli (WMF) (talk) 21:26, 12 February 2020 (UTC)
  • It is a useful tool, but in no way can it replace finding editors in the same range. Hint: very few LTAs have an interest in exactly the same pages; many have the same editing pattern but can edit any page. A realistic example from Ukrainian Wikipedia: an LTA is POV-pushing on Crimea topics. Here are three edits for analysis:
    1. 198.73.209.236 replaced, in Ivan (footballer), "Ivan was born in Sevastopol, Ukraine" with "Ivan was born in Sevastopol, Russia"
    2. 198.73.209.230 replaced, in Crimea, "Crimea is internationally recognised as a part of Ukraine" with "Crimea is wrongly recognised as a part of Ukraine"
    3. 55.45.87.173 replaced, in Ivan (singer), "Born in Sevastopol, Ivan works for MTV Ukraine" with "Born in Sevastopol, Ivan works for MTV Russia"
A local patroller finds edit 1, finds out that it is our known LTA, checks 198.73.209.0/24 contributions and reverts edit 2.
A machine learning tool gets edit 1 as input and will most likely suggest reverting edit 3 (which is almost identical). Unfortunately, edit 3 is a legitimate update for a person who really moved from MTV Ukraine to MTV Russia. I wonder whether ORES will even find edit 2. I have already spotted such behaviour in Ukrainian Wikipedia: after an edit war, edits from the side whose version ended up as the consensus were labelled as abusive by the machine learning tool.
I also wonder whether this tool will work across wikis, e.g. if 198.73.209.236 made this edit in Ukrainian Wikipedia and 198.73.209.230 made this edit in Polish Wikivoyage.
Thus I would probably use such a tool as an additional instrument, but it will not replace range contributions for me — NickK (talk) 13:39, 18 January 2020 (UTC)
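As an aside for readers unfamiliar with CIDR notation: the /24 check the patroller performs by hand is a simple range-membership test, which Python's standard ipaddress module can do directly. A minimal sketch using the IPs from the example above:

```python
# Range-membership check, as done implicitly when browsing
# 198.73.209.0/24 contributions.
import ipaddress

def same_range(ip: str, cidr: str) -> bool:
    """True if ip falls inside the given CIDR range."""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(cidr)

# Edits 1 and 2 come from the same /24; edit 3 does not.
same_range("198.73.209.236", "198.73.209.0/24")  # True
same_range("198.73.209.230", "198.73.209.0/24")  # True
same_range("55.45.87.173", "198.73.209.0/24")    # False
```

This is only the mechanical part; as NickK notes, deciding which range to check in the first place still takes human judgment.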
  • @NickK: I totally did not mean to suggest this tool as a replacement for Range Contributions. As I imagined it, searching this tool would also surface editors in the same (/16, /32, etc.) ranges, in order of increasing proximity (or whatever sort order is preferred). Since the IPs will be in the database, it is easy for us to do this. How do you currently look up range contributions? Is there a place to find all other IPs active from the same range?
  • And to your point about whether this will work across wikis - my gut reaction is that it would be better if we can have an option to see cross-wiki results, but I wonder about the technical feasibility of that. Would cross-wiki results be helpful? Thanks. -- NKohli (WMF) (talk) 19:42, 13 February 2020 (UTC)
    @NKohli (WMF): There is currently a great tool for cross-wiki IP range contributions, see here how it works for 198.73.209.0/24. It is really helpful for cross-wiki vandals from small ranges — NickK (talk) 15:27, 22 February 2020 (UTC)
See the Wikimania 2019 research presentation (Wikimania2019 research presentation sockpuppetDetection.pdf) and intro: wikimania:2019:Research/Sockpuppet_detection_in_the_English_Wikipedia. I've been looking at this project with great interest. This would be very helpful to identify LTAs and users to investigate. —Aron Man.🍂 edits🌾 10:54, 2 February 2020 (UTC)
  • @Aron Manning: Thanks for linking to those resources. Those are indeed the tools we are looking into for this feature. It seems quite promising so far. -- NKohli (WMF) (talk) 19:42, 13 February 2020 (UTC)
  • In general useful, especially in sockpuppet investigations. Good idea. --Camouflaged Mirage (talk) 15:31, 17 February 2020 (UTC)
  • There is also an impersonation risk. If this feature is a black box and trolls figure out how to trigger it, they might deliberately trigger it to make it look like users they don't like are sockpuppets. Bawolff (talk) 21:28, 17 February 2020 (UTC)

2 more automated approaches focusing on prevention of simple cases of socking[edit]

I suggest 2 more automated approaches, in parallel, that focus on preventing simple cases of socking done by mistake or by inexperienced users:

  • Track the last logged-in user with a cookie/local setting and record any other account logging in from the same browser. Ask the user if the previous account is an alternative account of theirs. Save the answer (with timestamp, without IP). Also save a non-editing (not sanctionable, only informative) entry in the CU log.
If a user makes an edit to a page that was edited before by an alt account, then tell the user to log in with the account that first edited the page and prohibit the edit. Exception: if the alt account is abandoned and irrevocably retired.
If a user makes an edit to a page that was edited before by a paired account that was declared NOT to be an alt, then inform the user that this edit will be scrutinized and ask for confirmation before saving it. Publish the list of these edits for editors to review.
  • Prohibit editing the same page from the same IP within a certain time-frame (e.g. one week), with the exception of highly dynamic IPs and ranges (these need to be identified by admins). Reddit does this kind of filtering: votes from the same IP do not count. Apply this more strictly to votes of any kind, without the one-week limit.

Benefits of these preventive methods:

  • Fewer cases of accidental socking by inexperienced users, causing fewer negative experiences for newcomers.
  • Bad-faith editors need more knowledge and investment to sock: they must change IP (enforced by the software) and understand how cookies and local settings work.
  • Clearer distinction between bad-faith and mistaken socking. Currently the two are treated with the same severe punishments, although inexperienced users mostly sock because they don't know the rules. This has caused many disappointments and negative opinions about Wikipedia.
  • Prevention is a more humane way of dealing with socking. Those users who find out how to circumvent these measures prove their bad faith by doing so, and thus deserve the sanctions. —Aron Man.🍂 edits🌾 11:06, 2 February 2020 (UTC)
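A rough sketch of the second preventive measure above (the Reddit-style same-IP, same-page window). The in-memory store, function name, and window length are assumptions for illustration only; a real implementation would live inside MediaWiki's edit pipeline:

```python
# Illustrative sketch: reject a second edit to the same page from the
# same IP within a time window (dynamic-IP exceptions omitted).
import time

WEEK = 7 * 24 * 3600  # window length in seconds (one week)

# In-memory stand-in for whatever store a real implementation would use.
_last_edit = {}  # (ip, page) -> timestamp of the last accepted edit

def allow_edit(ip, page, now=None):
    """Return True and record the edit, or False if this IP already
    edited this page within the window."""
    now = time.time() if now is None else now
    last = _last_edit.get((ip, page))
    if last is not None and now - last < WEEK:
        return False  # same IP edited this page too recently
    _last_edit[(ip, page)] = now
    return True
```

For votes, the same check would apply with no window limit, matching the stricter rule proposed above.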

Feedback about Database for documenting LTAs[edit]

  • Is the idea that this database is being populated by CheckUsers, but used by users without any special permissions? Based on the fact that it doesn't actually show IP addresses or user agents, just a count of matching ones? There's probably still a privacy concern there. Also, due to dynamic IP addresses, showing "IPs: 0 out of 5 match" is one thing, but how many are matching the same IP ranges, ISPs, geolocations? For user agents, a simple match isn't very effective because most browsers increment versions every month, it should be checking for the same browser/OS/platform. If you could do even more fingerprinting, that would be great too. ST47 (talk) 00:46, 15 January 2020 (UTC)
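To illustrate ST47's point about version churn: a matcher could compare user agents by browser family and OS rather than by exact string, so routine monthly version bumps still match. The regexes below are deliberately simplified stand-ins for real UA parsing, which is much messier in practice:

```python
# Simplified family extraction; a real tool would use a proper UA parser.
import re

def ua_family(ua):
    browser = re.search(r"Firefox|Chrome|Edg|Safari", ua)
    system = re.search(r"Windows|Android|Mac OS X|iPhone|Linux", ua)
    return (browser.group(0) if browser else "?",
            system.group(0) if system else "?")

def same_platform(ua_a, ua_b):
    """Match on browser family and OS, ignoring version numbers."""
    return ua_family(ua_a) == ua_family(ua_b)
```

Under this comparison, Firefox 121 and Firefox 122 on Windows match, while Firefox and Chrome on the same machine do not.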
  • @ST47: The idea I had when writing that feature idea is that it would be a replacement for pages like Muppets LTA on enwiki. There are a lot of such pages which document IP addresses publicly. With IP masking in place, these IP addresses would be masked and storing them in the clear would not be very helpful. But since we will have the IPs in the database, we can provide quick matching/similarities. But the bigger advantage is the ability for the tool to auto-flag similarities and point out when an edit comes from an IP that belongs to a known LTA. We can of course expand the pattern-matching to include factors like user-agent, editing patterns etc. About usage, I think it could be used by anyone without any special permissions unless the community feels otherwise. Does that sound like a useful feature to you? -- NKohli (WMF) (talk) 20:02, 13 February 2020 (UTC)
  • An LTA database was discussed once upon a time at enwiki (link). I remain of the opinion that a public library which both registered and unregistered users can consult is a useful tool. As a responder to requests for blocks, I have often been educated by some unregistered user pointing to an LTA page along with the latest vandal. And I will sometimes visit LTA pages on wikis where I have no edits, so would not benefit from having this information hidden from me on that basis. For such an idea to work effectively, I think you should basically throw out any ideas of read-level security. I also think when you have more information limited to a more restricted group, you get more misinformation. Don't take this as total opposition to the idea, I'm just not persuaded that LTA pages should be deprecated. -- zzuuzz (talk) 19:10, 15 January 2020 (UTC)
  • @Zzuuzz: Please see my reply to ST47 above. I was not planning for this information to be restricted to a small group - rather it would work much like LTA pages, except more machine-readable and with the ability of searching and pattern-matching. Do you think this idea does not capture something that LTA pages on wiki do more effectively? Thanks. -- NKohli (WMF) (talk) 20:02, 13 February 2020 (UTC)
  • I will say that I'm intrigued, and I look forward to further refinements of the proposal. I have a concern which is that the IPs listed on a LTA page have two purposes: they show a sense of the IP ranges sure, along with range blocks, ISPs, geolocation and some signs of dynamicity. But another primary purpose is to show examples of the behaviour through diffs and contributions, with the actual IP not being very relevant. Looking at your reply to ST47, I think you might find that LTAs rarely use the same IP. If they did we'd probably just block them. And they often don't just stick to one user agent or set of articles. They might use many ranges and ranges more than once, and then they will share them with a lot of other people. Anyway, if no information is lost from LTA pages, then there is nothing to be concerned about, and I'm sure things like searching can be improved, and pattern matching used, so I look forward to hearing more. -- zzuuzz (talk) 20:49, 13 February 2020 (UTC)
  • Yes, great idea. Please fund Wikimedia community organizations and focus groups to develop text and documentation for how this should be. I am especially interested in developing categories or labels for humans to apply to different sorts of behavior. If we have a database we need sorting systems, and socially we are far, far from being able to have reasonable or useful conversations about sorting. This will require slow conversation in many places over time, and if we invest a little now slowly then that will save great expense. Without the labels the quality of the data we collect will suffer and we will be unable to usefully discuss various types of long term abuse. Blue Rasberry (talk) 20:22, 15 January 2020 (UTC)
  • I don't think a public library of LTAs is a good idea; this library should necessarily be private and require special permissions. We already have an LTA in Ukrainian Wikipedia who studied public rules (filters, rangeblocks, etc.) and adapted their abusive edits to them, becoming even more abusive as their ability to circumvent restrictions improved. Like ST47, I also think that the pages/IPs/UAs dimensions are not sufficient: an LTA can perfectly well make similar edits to a different topic, using the next IP from the same range (because we blocked the previous one!) and an updated UA (because of a browser update). A possibility of human comparison would be more useful — NickK (talk) 13:39, 18 January 2020 (UTC)
  • I think it is impossible because the LTA standards are different on each wiki. If you use the standard of some wiki, you need to send a message to every wiki's community portal to build consensus, not just discuss it here. -HaussmannSaintLazare (talk) 19:27, 21 January 2020 (UTC)
  • We have LTA compilations on major projects; they are deliberately brief to prevent gaming the system, etc. We also don't want shrines for vandals. This database could be like CUwiki, where sysop or higher permission is needed for access. I think we are fine with the current status quo on this. --Camouflaged Mirage (talk) 15:33, 17 February 2020 (UTC)
  • I for one never enjoyed enwiki's practice of publicly documenting LTAs, but regardless I agree with others that this can't be decided here for the global community. I think the question is whether or not we want this database made available to us, and I see nothing wrong with that. Each community can decide for themselves if they want to make use of it or stick with the status quo. I'd also love to hear how this system would work. As Zzuuzz says, many socks have nothing in common beyond behavioural evidence, and I'm not sure how you'd document that in a machine-readable format. I do think it'd be cool if we, say, added Wikibase to the CheckUser wiki, along with some tools to query it. This would make it easier to dig up long-term technical connections. MusikAnimal talk 23:25, 17 February 2020 (UTC)

Public or private?[edit]

So there doesn't seem to be any specific note on whether this'll be public or private. To be clear, I'm talking about publicly and privately readable. I've seen people give their thoughts under the presumption that it could be public, and others doing the same but assuming private. I'll lay down my own beliefs here.
Advantages to public
# Allows more users to make use of it - a lot of the time, IPs and the like fight against LTAs
# Hiding this information could result in a loss of knowledge on how to fight them, false information, and bits of it being passed around and slowly becoming public anyway.
# Since this will most likely replace existing pages like en:WP:Sockpuppet investigations/etc., this would essentially be turning what is already a public working system into a private one.
Advantages to private
# LTAs can learn what information is held about them and use that to change their disruption strategies
# Giving the user a publicly viewable space about them is contrary to en:WP:DENY.
My personal opinion is that it should be publicly viewable, but not publicly editable. Publicly editable just results in giving an LTA another place to cause harm. What do you all think? Let's go RfC style with the bold text Public or Private. Computer Fizz (talk) 06:01, 30 January 2020 (UTC)
  • Public for transparency and for distributing the weight of fighting LTAs across the community. DENY is an unproven concept that didn't stop any LTA that I know of. The community being informed, however, is like vaccinating the whole community against a virus: effective in preventing the spread. —Aron Man.🍂 edits🌾 14:38, 30 January 2020 (UTC)
  • It depends. This seems like a per-community decision and I don't think we're going to be able to make a global decision without a proper RfC. Even then if it's decided to be private, there's nothing stopping a community from publishing details on LTAs to the public eye. MusikAnimal talk 23:06, 17 February 2020 (UTC)
@MusikAnimal: Per-community decision? What do you mean? That some LTA pages are public but others are private? That sounds really complicated. Computer Fizz (talk) 08:27, 20 February 2020 (UTC)
Every project (community) has their own way of doing things. Some may want documentation of LTAs to be public, others might want it private. We can't decide for them here, is what I'm saying. MusikAnimal talk 16:19, 24 February 2020 (UTC)
  • By community, but... - the whole idea of a database seems to be making it a cross-language function. I'm fine with that, but the problem is I can see different communities wanting different things. For en-wiki, I think the benefits of public heavily outweigh the damage done by exposing the tracking methods. However, from the above, an editor from uk-wiki might massively disagree with that. Is a cross-language database capable of having different permissions - e.g. the "home wiki" for an LTA has a set of rules, and all LTA profiles associated with that wiki use a single set of rules? We'd probably want to limit it to a couple of choices. Ext-protect for all (editing-wise) seems wise. Nosebagbear (talk) 17:52, 24 February 2020 (UTC)
@MusikAnimal: @Nosebagbear: Unless I've misunderstood the WMF about this, I assume it would be one single website for all WMF wikis to use. Computer Fizz (talk) 23:19, 2 March 2020 (UTC)
  • Different levels of privacy for different information about the same (l)user. There can be a public report with general information, a private report available to extended confirmed editors that has more information about tactics, a report available to admins that has very sensitive heuristics for detection and a report for CUs that has IP/UA/other protected information. Example: there are some aspects of detecting paid-for spam that I am not comfortable sharing except to highly trusted users, while I am more freely able to talk about others and others can be published. MER-C (talk) 17:54, 28 May 2020 (UTC)

with this change, maybe use more information in addition to IP, when assigning anon identity[edit]

PROBLEM

cellular networks often (always?) allocate the same IP address to multiple users: both simultaneously (using NAT) and, more prominently, over time, so the allocation of IP to subscriber is very volatile. this is definitely true for IPv4 addresses; i'm not sure about IPv6. if this is not the (typical) case for IPv6, maybe this proposal should only apply to IPv4 addresses.

in hewiki, we often leave messages on anons' talk pages - either warnings after abuse was observed from the address, or an invitation to create an account when we observe good edits.

there is no way to guarantee that the message will catch the right person, but i suggest that we can reduce the miss-rate by looking at more identifying information: the only thing i can think of is the "user-agent" field, but maybe you wizards can come up with more.

doing so will not make everything perfect, but at least, when we warn a mobile user for vandalism on the "anon-12345" talk page, someone else sharing the IP but using a different browser or browser version will be "anon-12346" instead, so they will not be disturbed by this warning.

we semi-regularly see anon complaints at our help desk that read more or less like "hey, why are you accusing me of vandalism, i never edited a single article on your !@#$%^ wikipedia!", and it would be good if we could irk fewer readers and, of course, issue fewer wrongfully-applied blocks.

peace - קיפודנחש (talk) 20:29, 20 January 2020 (UTC)
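One possible reading of this proposal is to derive the anon label from a hash over IP plus user agent, so two readers sharing an IP but using different browsers get different talk pages. A minimal sketch; the label format and truncation length are invented for the example, not a planned implementation:

```python
# Hypothetical anon-ID derivation from IP plus user agent.
import hashlib

def anon_id(ip, user_agent):
    """Stable opaque label for the (IP, UA) pair."""
    digest = hashlib.sha256(f"{ip}|{user_agent}".encode()).hexdigest()
    return "anon-" + digest[:8]
```

The same IP with two different browsers then yields two different labels, which is exactly the "anon-12345" vs "anon-12346" separation described above. (As noted later in this thread, UA strings rotate with browser updates, so a real scheme would need to normalise the UA first.)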

Perhaps a cookie can be served up to record that it is the same person doing stuff, then give them the appropriate talk page and allow better matching of anonymous users. This would be good even while IP addresses of logged-out people are still in use. If cookies are too obvious, perhaps you could use a bit of local storage in the browser to match users with themselves. You could also then trace them hopping across IP addresses and still target messages and blocks. Graeme Bartlett (talk) 12:05, 22 January 2020 (UTC)
@Graeme Bartlett: Associating alternative accounts should be done, and it is already done for blocked users with the 'enwikiBlockID' cookie. This information should be used to prevent edits that would result in a block if a check were run, to reduce the punitive factor of this tool. MediaWiki so far has no such preventive solutions (Reddit has: double-voting from the same IP does not count); these have yet to be developed. —Aron Man.🍂 edits🌾 07:56, 23 January 2020 (UTC)
No, no no nonono. If their identity is dependent on a particular cookie, that means they can just clear that cookie and immediately be back. At least now they have to find a proxy first. Computer Fizz (talk) 19:26, 30 January 2020 (UTC)
The cookie is in addition to the IP, and the problem to be solved here is identifying good-faith editors, not blocked editors who want to circumvent a block. —Aron Man.🍂 edits🌾 11:51, 31 January 2020 (UTC)
@Aron Manning: No, what's being discussed is all identification. Plus, even if it is only for good editors, if the cookie gets cleared, their good track record is lost. IP editing should only be for people who want to make one or two minor typo fixes and then leave; serious editors can create an account, so I don't plan to design around serious IP editors. Additionally, if identity depended on a cookie and an IP, then either of those things changing would give you a new anon ID, including clearing that cookie. Things like cookies, user agents, and MAC addresses aren't reliable because they can easily be spoofed. With an IP, at the very least you have to find a proxy or a cafe. Laptop Fizz (talk) 00:00, 1 February 2020 (UTC)
I think you misunderstood: obviously both the IP and the cookie identify the editor. If the IP changes (dynamic IPs, public wifis), the cookie associates the editor. If the cookie is deleted, then we are back to the present conundrum: is the same IP the same editor? This is an unanswered question and we just assume "yes". A cookie obviously makes the solution useful for the majority of cases: non-evading editors. For evading editors we need a different approach. These are two completely different cases. What are you designing, by the way? Do you have any practical suggestions? —Aron Man.🍂 edits🌾 00:48, 1 February 2020 (UTC)
@Aron Manning: If both IP and cookie are used to assign an identity, what happens if they link to two different existing identities? Which takes priority? If it's the IP, then edits coming from a different IP could be listed there, making it useless as an IP-based tool. If the cookie takes priority, then you can just clear your cookies. What identity the user has should not be in the user's control at all, which is why it should be based only on IP. But that's a sidetrack from the fact that IPs should continue to be displayed as IPs and not some representation. Laptop Fizz (talk) 00:56, 1 February 2020 (UTC)
The cookies are encoded tokens that can't be forged, therefore the user can't fake their identity. As the IP is unreliable, the cookie takes precedence, if present. If not (e.g. it was removed), then the last cookie used by that IP (and the same UA) is sent to the client again, provided the IP is not known to be shared. This is as far as you can get with IPs. —Aron Man.🍂 edits🌾 06:54, 1 February 2020 (UTC)
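The precedence being described (cookie outranks IP; otherwise fall back to the identity last seen for that IP and UA, unless the IP is known to be shared) could be sketched as follows. All names here are illustrative, not an existing MediaWiki API, and cookie signing/verification is assumed to have happened already:

```python
# Hypothetical identity-resolution precedence for anon editors.
def resolve_identity(cookie_id, ip, ua, last_seen, shared_ips):
    """cookie_id: verified identity from the cookie, or None.
    last_seen maps (ip, ua) -> identity; shared_ips is a set of IPs
    known to be shared. Returns an identity, or None to mint a new one."""
    if cookie_id is not None:
        return cookie_id  # the cookie wins if present
    if ip not in shared_ips:
        return last_seen.get((ip, ua))  # re-attach the previous identity
    return None  # shared IP, no cookie: treat as a new editor
```

This makes the trade-off in the thread concrete: deleting the cookie on a non-shared IP re-attaches the old identity, while on a shared IP it yields a fresh one.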

┌──────────────────────────┘
So if you delete the cookie and change your user agent, it gives you a new identity. Which is bad. Computer Fizz (talk) 01:31, 2 February 2020 (UTC)

Only bad-faith editors change their UA. Disclaimer: this is only one possible approach; the developers haven't discussed this so far, AFAICT. What would be your suggestion that won't give the same identity to different users on a shared or reallocated IP? —Aron Man.🍂 edits🌾 08:19, 2 February 2020 (UTC)
I wanna try experimenting with disabling IP editing. Yes, a lot of good edits are made from IPs, and the ratios are different on each wiki, but it would be interesting to see. Picking something like the French Wikipedia or whatever, try disabling IP editing for three months and look at what changes, and whether it would be better or worse for all wikis. Computer Fizz (talk) 18:29, 2 February 2020 (UTC)
That's an unrelated, parallel approach; discuss it in its own topic. Also see the research on why it is unfeasible in that form. However, flagging each IP edit for review is feasible and a good approach. See a small wiki's report about using flagged revisions for ALL IP edits. My impression from the statistics is that vandalism increased after flagged revisions was disabled. AronMan (talk) 09:58, 3 February 2020 (UTC)
In regard to Only bad-faith editors change their UA: note that there are quite a few browser extensions out there that rotate or randomize the user agent, to undermine the invasive tracking that some websites and advertisers do. I used to have one. It was always on, so my one account would have incidentally sent varying user agents to Wikipedia. If someone simultaneously changes account and UA, then yeah, that's pretty iron-clad evidence of sophisticated and premeditated bad faith. Alsee (talk) 17:10, 7 February 2020 (UTC)
@Alsee: Thanks for pointing that out. I should have been more clear: "Very few good-faith editors will *delete the identification cookie* and also *change the user agent*." —AronM🍂 edits🌾 02:00, 8 February 2020 (UTC)
At work, I have multiple web browsers installed that I use somewhat interchangeably. (I also have portable devices that I connect to the corp. wifi, which egress through the same external-facing IP. Come to think of it, that applies at home also.) Not only do they have different user-agent strings, but also separate cookie storage. There's only a small number of common UA strings, so they probably don't add much unless combined with other browser-fingerprinting techniques like font enumeration. Pelagic (talk) 18:11, 16 February 2020 (UTC)

Show both[edit]

¿Por qué no los dos? (Why not both?) Display “Anon ip C0A80002 ℹ️ on Device AZ9B372GG2 ℹ️”. You could then search content/history for matching IP hashes or device IDs (e.g. same phone, dynamic address). But then do you have a "User" page for each? Maybe new namespace(s). To which page would you post a message? Pelagic (talk) 18:48, 16 February 2020 (UTC)

Take no for an answer[edit]

The community has decided near-unanimously that IP addresses will stay. Stop trying to do otherwise. Have you learned nothing from Superprotect? From the Fram ban? This is the third installment of this project, and I don't like where it's going: asking for consensus but then ignoring it. Computer Fizz (talk) 03:39, 21 January 2020 (UTC)

Fram's ban is totally off-topic: it involves neither the same teams nor the same issues. Lofhi (talk) 15:58, 21 January 2020 (UTC)
I don't fully agree that en:WP:FRAMBAN is off topic. Fram's ban had the full support of all in the Foundation who were aware of it; IP masking has no objections from the Foundation, and almost unanimous disapproval from (at least) en.Wikipedia, if not others. The question is whether the Foundation has learned from previous incidents to listen to the users, rather than enforcing something which the Foundation believes appropriate but the users do not. I'm not convinced. — Arthur Rubin T C (en: U, T) 21:59, 21 January 2020 (UTC)
You said it: at least from enwiki. Enwiki is not the only community. Also, there are subjects where the community's opinion is a little out of date. In France, for example, the Commission nationale de l'informatique et des libertés considers that an IP address is personal data, because it's information related to a physical person who can be identified directly. It doesn't matter whether this information is confidential or public. So the Foundation is displaying sensitive data, and this is not something to be taken lightly. There are other ways to identify our anonymous editors without displaying this data. But well, we have to reject the whole thing and refuse to think of other alternatives: damn WMF. Lofhi (talk) 17:21, 22 January 2020 (UTC)
@Computer Fizz, Arthur Rubin, and Lofhi: Firstly, thanks for engaging. I appreciate your input.
I agree that honest & open consultation is really important here, and I think it’s fair to talk about trust in the WMF. This project is complex and difficult, and having truly open communication about it is the only way that we'll figure out the right path.
To answer Computer Fizz's question: unfortunately, this isn't a yes-or-no situation. The rules about the use of personally identifiable data on the internet are changing, and we have to figure out how to respond to the new situation that's developing. Publishing people's IP addresses on the internet is not going to be an option forever, either because of new rules or because the technology changes around us. For example, there are ongoing talks about a federal data privacy law in the United States. Lofhi also points out that similar laws already exist in other countries too. Also, Google has announced that they will be making user-agent strings more restrictive in the future. If we want to get ahead of the upcoming changes to data protection practices, we have to build tools to identify and block vandals and spammers without publishing personally identifiable information in public view. It would be irresponsible of us to assume that we can keep doing the same things forever, when we know for sure that things are already changing.
But there's a huge potential cost if we get this wrong -- if losing access to IPs means that volunteers can't identify and block the right people, then Wikipedia is basically immediately overrun by vandals and it destroys the project. We heard that loud and clear from our previous discussions, and we agree. We wouldn't let that happen. That's why we're continuing to talk about it, so we can keep learning and hearing ideas from you all. -- NKohli (WMF) (talk) 02:50, 23 January 2020 (UTC)
Here's what I think needs to be done:
Add a checkbox saying "I acknowledge editing while logged out will publish my IP" on the first edit; once ticked, a cookie will be set so it isn't shown again.
If the law prevents you from ever displaying IPs publicly, then disable IP editing and require account creation. IPs are already treated much more leniently than accounts, and I don't see any reason to give them more power to wreck a wiki.
I've read countless pages of stuff with both the WMF's concerns and my on-wiki experience, and that's what I think is the best solution. Computer Fizz (talk) 04:04, 23 January 2020 (UTC)
You are ignoring the information you were given: an IP address is sensitive data. It's not a question of asking the approval of the anonymous editor or not. The Foundation wishes to hide this sensitive information in one way or another. Ticking a box doesn't mean someone understands what is at stake (e.g. opt-out settings in the EU since the GDPR). And it is precisely on this point that the Foundation is seeking advice: how to mask this sensitive information. All this in addition to the information given by NKohli (WMF), which I find justified.
Finally, your last proposal seems opposed to limits to configuration changes for inclusivity reasons. It doesn't fit with the founding principles: the ability of almost anyone to edit (most) articles without registration. The devs have already refused to allow idwiki to block edits from anonymous contributors, so I can't see why they would justify forcing an account now. Lofhi (talk) 13:17, 23 January 2020 (UTC)
@Lofhi: - @NKohli (WMF): has noted already that there is a major portion of the communities who normally support IP editing but would change their minds if the new tools didn't allow equally effective management. I personally don't think they're going to get that high - perhaps 80%, except there's a partial Pareto issue with rangehopping vandals. Editors will have to make up their own minds about the minimum level of success they're willing to accept from NKohli's team before they switch viewpoints. I imagine that if a major proportion does, it would be the first significant change. It was worked out on en-wiki late last year that there's some CSS that could be used to block every IP, so it would just take an interface admin, a bot, and a 'Crat (to give the bot the permissions) - plus a community consensus! Nosebagbear (talk) 15:57, 23 January 2020 (UTC)
Some of our (en.Wikipedia) best vandal-fighters are static IPs, so I would hate to prevent them from helping. But if the masking plus tools is not at least as good as the status quo (no masking, but no additional tools), I would support preventing IPs from editing outside of discussion areas. As an aside, I'm a moderator on a message board, and we moderators have access to the IPs of all contributions, even though the board requires registration to post. I'm not one of the moderators who uses the tracking software available there, but....
Furthermore, Wikipedia has already been found in violation of some EU privacy directives, and "we" seem to be ignoring them. The "right to be forgotten" is incompatible with the pillars. I haven't been keeping up with current en.Wikipedia directives (policy, guidelines, etc.) on the issue, but there have been a number of incidents where it has been explicitly disregarded. — Arthur Rubin T C (en: U, T) 18:01, 23 January 2020 (UTC)
Neither GDPR nor the more general rulings and directives that make up the "right to be forgotten" should raise any general concern if we put a time limit on IPs being visible - there's a very strong Legitimate Interest in being able to combat socks. The WMF Legal Counsel should be up to producing the LIA & DPIA to indicate that. Nosebagbear (talk) 18:48, 23 January 2020 (UTC)
Article 17 - Right to erasure (‘right to be forgotten’):
1. The data subject shall have the right to obtain from the controller the erasure of personal data concerning him or her without undue delay and the controller shall have the obligation to erase personal data without undue delay where one of the following grounds applies: [...]
2. Where the controller has made the personal data public and is obliged pursuant to paragraph 1 to erase the personal data, the controller, taking account of available technology and the cost of implementation, shall take reasonable steps, including technical measures, to inform controllers which are processing the personal data that the data subject has requested the erasure by such controllers of any links to, or copy or replication of, those personal data.
3. Paragraphs 1 and 2 shall not apply to the extent that processing is necessary: (a) for exercising the right of freedom of expression and information;
Article 17 seems irrelevant to this discussion. As for I would support preventing IPs from editing outside of discussion areas: it seems opposed to the founding principles, and I would not support this if someone is ever interested. Lofhi (talk) 22:56, 23 January 2020 (UTC)
I'm saying that the "right to be forgotten" is (an interpretation of) an EU directive which is explicitly not followed by en.Wikipedia; if IP masking is the result of attempts to follow EU directives, consistency suggests en.Wikipedia might very well refuse to follow it, as well. If mandated by the Foundation, and the tools provided do not provide a net improvement over the status quo, en.Wikipedia would almost certainly block attempts by IPs to edit user-facing material. There are some on en.Wikipedia who would block all IP editing if IPs are masked, even if the tools are significantly improved. The possibility of having actual IP information disappear after a year or so hadn't occurred to me. There are still en.Wikipedia guidelines which require specific IPs to be identified as, for example, coming from the Department of Defense, (recognized as) from the CIA, from Congressional offices, etc. See en:Wikipedia:Blocking IP addresses#Sensitive IP addresses. — Arthur Rubin T C (en: U, T) 21:35, 24 January 2020 (UTC)
@NKohli (WMF): Thanks for your diplomatic answer and for taking the position that it’s fair to talk about trust in the WMF. At the time when FRAMBAN was blowing up, there was some feedback that most folks at the Foundation truly didn't understand what en.wp (also de.wp and zh.wp after similar prior incidents) was getting upset about. I still feel that outside the community of English Wikipedia, people don't appreciate how incredibly damaging it was. It must be difficult for you to navigate trust issues when your intended task is to collect functional software requirements. But the bad feelings towards the Foundation are there – for many reasons, not just framban and the associated allegations of impropriety at the highest levels of the WMF – and we're all going to be dealing with them for a while. Pelagic (talk) 17:26, 16 February 2020 (UTC)

Medium sized projects[edit]

At nl.wiktionary I notice a gradually increasing number of anonymous IPv6 edits. Possibly as a consequence of the GDPR, the information provided about the owners of IP addresses is gradually becoming more cryptic. So I would like to offer strong support for the first two proposals on the page. Would it be possible to develop a kind of online test that users have to complete before they get access to the IP info feature? That might offer a solution to some of the objections.

For the type of disruptive behavior we usually confront, establishing that there is a common online identity is sufficient, there is usually no need to find out a real life ID. I understand that a dictionary might be different from an encyclopedia in this respect, but that would also mean that at least for some projects these tools might be very useful.

As we don't experience significant LTA, I don't feel I can say anything useful on the third proposal.--MarcoSwart (talk) 12:38, 26 January 2020 (UTC)

@MarcoSwart: I was thinking that the IP info feature could be made available to users with some permission level or even something like `autoconfirmed` status. Thanks for mentioning that this could be useful. I appreciate your comment. -- NKohli (WMF) (talk) 18:53, 18 February 2020 (UTC)
@NKohli (WMF): If I approached this matter just from the goals of our project, I would simply agree with you, and "autoconfirmed" would be sufficient. But in my experience, compliance with the GDPR involves taking some care that the people who deal with personal data are properly instructed. --MarcoSwart (talk) 08:23, 19 February 2020 (UTC)

A simple solution?[edit]

I've only gotten halfway through reading this page, so forgive me if somebody has already suggested this above.

A lot of the discussion seems to be along the lines of ”if we don't have access to IP addresses, then we won't be able to do X, Y, Z”. So make it possible for (some subset of) users to see the IP address!

The immediate problem I see isn't in storing the IP, it's in publishing it and in storing it forever in the wikitext page source. Instead of doing that, we could store the actual IP address along with derived info, but only fetch the address when the info card is opened, or alternately, on click-through to a contributor-info page.

Of course, regulatory environments can change, and MediaWiki software is used outside the USA. It may become (or, in some places, already be) illegal to store network addresses anywhere in the database, or it may be permissible to store them for only a specific time. There should be options that site operators (not just Wiki[mp]edia Foundation) can configure for separate retention policies on the IP address, location, ISP/AS, etc.

Cheers! — Pelagic (talk) 15:36, 16 February 2020 (UTC)
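The configurable-retention idea above could look something like this — a hypothetical per-field policy table and purge check, not any existing MediaWiki setting:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention policy a site operator might configure;
# field names and defaults are illustrative only.
RETENTION = {
    "ip_address": timedelta(days=90),
    "geolocation": timedelta(days=365),
    "isp_asn": None,  # None = keep indefinitely
}

def should_purge(field: str, stored_at: datetime, now: datetime) -> bool:
    """True if a stored value has outlived its configured retention window."""
    ttl = RETENTION.get(field)
    return ttl is not None and now - stored_at > ttl

now = datetime(2020, 6, 1, tzinfo=timezone.utc)
assert should_purge("ip_address", now - timedelta(days=120), now)
assert not should_purge("isp_asn", now - timedelta(days=9999), now)
```

Separate windows per field is the point: a site could drop the raw address quickly while keeping the coarser ISP/AS data that anti-abuse tools rely on.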

I now see that Nosebagbear also mentioned timelimit on IPs being visible, above.
[Re-posting, as the mobile talk view (reply box) lost my previous edit.]
Pelagic (talk) 16:23, 16 February 2020 (UTC)
Thank you Pelagic, for your comments here in general. I want to stress that we have in no way decided on any specific solution, but as you note, data-privacy standards on the internet are currently evolving and it might be that we don't get to make the decisions on this, so having started this conversation, it seems like the best opportunity to investigate what we can do to lessen our dependency on IPs.
On a personal note, I'd really want us to find tools to make it easier to fight vandalism, spam and harassment in general, for everyone regardless of user rights, for a couple of reasons: not put the burden on a smaller subset of users (a lot of vandal-fighters aren't admins and some simply don't want to go through that kind of process, which – depending on the wiki – can be stressful), but also because some smaller wikis always end up without local editors with specific rights, then having to depend on stewards or global sysops who might not even speak the local language. /Johan (WMF) (talk) 03:08, 17 February 2020 (UTC)
Pelagic: Do you see this as a complement to the rest of the proposed suggestions, or are you trying to find a solution that would mean that they would be unnecessary? /Johan (WMF) (talk) 03:19, 17 February 2020 (UTC)
Sorry I missed your ping, Johan (WMF). It may have been swamped by pingspam from Structured Discussions on another wiki. Complementary, definitely. Keep the name masking where a community wants to enable that, keep the improved investigation tools (proxy lookup, similar-users list). Keep the bubble or hover-card, but fetch its content dynamically using AJAX-like techniques. You can then apply user groups/rights/permissions/flags to show more information, possibly including the IP address, to trusted users, and a less-than-normal level of detail to users not in good standing on the wiki. (For example, maybe IPs don't get to retrieve location and ISP data via the API unless they have been whitelisted. Maybe community sanctions for some behaviours could include limiting access to others' location.) By keeping the info out of the page cache, you prevent automated scrapers from harvesting it off the served-up HTML/JS. You can store the sensitive information in separate silos, but still make it available to the right people in place, without forcing them off-page to a different server or tool-set. Different wiki operators and communities can assign the permissions as they see fit. Other tools could check the same permissions, e.g. suppress the location column for a user who lacks the view-IP-location perm. Pelagic (talk) 11:47, 6 May 2020 (UTC)
No worries, thank you, Pelagic. We'll take this feedback into account as we think about our next steps going forward. /Johan (WMF) (talk) 11:53, 6 May 2020 (UTC)
[edit conflict, thanks for the quick reply Johan (WMF)] In other words, separating the storage and applying permissions checks should underpin all tools, I'm not saying hover cards or some other UI will be the one solution for all use cases. My choice of heading was ill-considered. Pelagic (talk) 12:06, 6 May 2020 (UTC)
One last thought. I didn't address your mention of smaller communities. They might decide that everyone who is extended-confirmed also qualifies for view-IP-address. There should be some way to automate the right, if that's desired. It's information we publish openly now, accessing it shouldn't require a huge brouhaha like RfA or applying for Checkuser. Pelagic (talk) 12:21, 6 May 2020 (UTC) Though I notice Marco's mention of "proper instruction" above. Pelagic (talk) 12:27, 6 May 2020 (UTC)
Along these lines, if the IP addresses were "semi-masked" so that the "ISP" part of the IP address was shown and same actual IP used within a reasonable period of time would have the same "semi-mask", this would allow tools to work. They might need to be rewritten, but they could be made to work. For example, if 1.2.3.4 is part of a /23 block (that is, 1.2.2.x-1.2.3.x), and no other 1.2.2/23 address had edited in the recent past, the "semi-masked" address would be something like 1.2.2.0-2020a. If 1.2.3.5 edited tomorrow, it would be 1.2.2.0-2020b. If there was no editing from 1.2.3.4 for a long time, the next time it edited, it would show as 1.2.2.0-2020c or if it was next year, 1.2.2.0-2021a. The definition of "recent past" would vary by net-block, it might be months for some ISPs, hours for cell phone providers, or "forever" for known-fixed IPs. Davidwr/talk 17:55, 17 February 2020 (UTC)
@Davidwr: Thanks for mentioning that! I was talking about that solution with our privacy experts and it sounds like that's a viable option. It would surely be work to rewrite the tools. I'm hoping we don't lose any existing functionality and users' workflows are not affected. -- NKohli (WMF) (talk) 18:41, 18 February 2020 (UTC)

rDNS[edit]

The reverse DNS lookup can usually be turned back into an IP address. It makes no sense to show that while hiding IPs.

More generally, I'm concerned that this seems to be going a bit in the reverse direction. It seems to be identifying possible implementations and then talking about what they would mean. Instead we should start from a place of use cases ("As a privacy-fearing user, I want the following info about me kept secret from X." "As an admin, I want to be able to do the following things, for which I currently use IP addresses."). I fear that by not starting from use cases, we will end up with the privacy and usability properties that whatever solution we pick incidentally has, instead of intentionally getting as close to the properties we truly want as we can. Bawolff (talk) 21:24, 17 February 2020 (UTC)

Bawolff: Noted, and thanks!
With regards to the general question, this is based on a few months of discussion at Talk:IP Editing: Privacy Enhancement and Abuse Mitigation, listening to the use cases there and trying to get an understanding of the problems, as well as other conversations, as briefly covered in IP Editing: Privacy Enhancement and Abuse Mitigation/Improving tools#Background. What do you think is lacking in the process? Is it that you feel the use cases should have been documented in a different way, or that we haven't spent enough time talking to people to be at this stage? /Johan (WMF) (talk) 01:52, 18 February 2020 (UTC)
That's a good question. I think part of the confusion is that this project has two conflicting goals - to make a better checkuser tool (in essence) and to improve the privacy of logged-out users. No doubt these goals are related, and it's important to consider how they affect each other. But ultimately they are separate, and I think combining them to such an extent confuses both issues and sets up certain dynamics where it suggests we are trying to maintain the status quo by improving one part at the expense of the other. I'm not sure this balancing framing is the healthiest framing for this discussion. So with that in mind, I'll respond to the parts separately:
  • For the improved checkuser/IP info tools: I think, of the two, this has the clearest goals. The use case could be summed up as, "As a checkuser (or admin dealing with anon vandals) I want tools to help me better identify and respond to LTA bad actors." Checkuser tools (and the external tools at the bottom of an anon's contributions) have not received much love (excepting some volunteer effort) in probably a decade. It's well past time these were improved. It sounds like the first step is to take the info that people seek out themselves and integrate it more directly into the MW display. That sounds uncontroversial to me, and an idea that should have been done ages ago. The other ideas suggested for this goal are a DB of LTAs and using machine learning and other magic to automatically identify sock puppets. I'm less convinced by these ideas, but they still seem worth exploring and at least fleshing out. My main criticism of them is: how does an LTA DB improve on the status quo of on-wiki docs and possibly checkuser wiki docs? Are there queries that users want to do that aren't supported by the wiki platform? For auto-identification via AI magic, my main criticism would basically be: what are the false positive, false negative, true positive, and true negative rates of such a system? What would acceptable rates be? Would malicious people be able to manipulate such a system to their advantage? Would such a system encode prejudice and biases in such a way as to be discriminatory in some fashion (more an issue when using AI in, say, real-world policing, but still: say you have a small number of users editing from some small region with a local dialect of English - would the system claim they are all sockpuppets because they use similar turns of phrase)?
  • For privacy enhancement: The ultimate question here is, privacy for whom from what? I find this part of the proposal to be lacking in clear goals. Privacy doesn't exist in a vacuum; it is something that is defined in context. It's unclear what precisely we want to protect our users from. Almost as importantly, it's unclear what we do not want to protect them from. For example, you can get a rough idea what part of the world someone is from by what time of day they edit. However, that's not an aspect of user privacy that we really want to protect our users from. In some ways this almost feels more PR-driven than anything else - it's not the early 2000s anymore, and the internet has decided that IP addresses are now considered something that shouldn't be shared publicly, so we don't want to be the people who are sharing them. I may sound derisive when I say that, but I don't mean to be - it is a valid concern. Part of privacy is making sure our (anon) users are comfortable. If they used to be comfortable with IPs being public but now aren't, then it is reasonable to want to change that.
To be fair though, the doc does state officially that the goal is "to protect our unregistered editors from persecution, harassment and abuse by not publishing their IP addresses," which is a much loftier goal. I suppose what I'm saying is I'd like that to be much more specific as to what threats we are trying to protect anons from, and by how much. We obviously cannot protect people from every form of persecution, harassment or abuse, so we should be specific about what types we want to protect them from. I imagine that is what the "by not publishing their IP" is about, but that's hardly enough. For example, the aforementioned rDNS is technically not publishing an IP. But if you take a recent IP editor (chosen randomly from recentchanges; OK, it took 3 tries to get one that proves my point): w:User:78.56.70.162 - s/he has a reverse DNS of 78-56-70-162.static.zebra.lt (which in turn has a DNS A record of 78.56.70.162; not all rDNS go full circle, but this one does, although that's beside the point). It's not hard to figure out someone's IP address from 78-56-70-162.static.zebra.lt. So at the very least (I assume) we also want to hide information that can usually easily be turned into an IP address. Then the question becomes: how easily? Is showing the ASN too much? Is something like w:CongressEdits a form of harassment (which would really only require the ASN name)? Is it only not a form of harassment because we think that Congress should not be editing, so we agree with the goals of the harasser? It would surely be harassment if, instead of Congress, it was targeting edits from some region known for an ethnic group that is unpopular with whomever set up the account. What sort of protection we want to provide for logged-out users is a hard question, and I think it's a question we need to think about carefully. We should think about it independently of any technical solutions.
It is certainly a question of trade-offs, and the risk of hindering anti-abuse efforts is certainly real. We should think about it independently of any technical solutions, but more in the form of: troll type A wants information B about victim type C to do D; admin E wants information type F in order to prevent class G of abusive behaviour. We need to enumerate the various types of information we ourselves want, the types of information we want to hide from malicious parties, figure out how important each case is, and figure out what changes on balance make sense.
Or to put it another way: This proposal concentrates on how and what we are changing, but not why. However without the why, we are probably going to make bad trade-offs and in the extreme case perhaps even make trade-offs that defeat the point of the whole venture. Bawolff (talk) 09:32, 18 February 2020 (UTC)
Thank you, Bawolff. Here's my suggested model for how to understand this project: better tools for handling vandalism is a worthy cause in itself. However, in this context, the project is about privacy for unregistered users. To do so, however, we need to address the fact that this would cause issues for the current models for how to handle spam, vandalism and harassment. In that sense, the main project is the IP masking – to which the current anti-vandalism toolset is a blocker. Does that help understanding how we're approaching it?
As to the why, part of it is that we've been talking about this for as long as I've been around in the movement and longer, but more importantly (and now I start pasting from answers we've given elsewhere, if this echoes what you've read before) evolving data-privacy standards on the internet. The rules about the use of personally identifiable information on the internet are coming under more and more scrutiny, and we have to figure out how to respond to the incoming changes. Publishing people's IP address on the internet is not going to be a viable solution forever, either because of new rules or because the technology changes around us. In light of all this, we think it's important to put into place tools that have the ability to work in absence of IP addresses while still providing our wikis the level of protection that they need. That's mainly covered in IP Editing: Privacy Enhancement and Abuse Mitigation. Is your criticism that users have to understand this page – on improving tools – on its own, without the context of IP Editing: Privacy Enhancement and Abuse Mitigation, and that is not happening here, or that it's unclear in general? /Johan (WMF) (talk) 14:51, 18 February 2020 (UTC)
Let me put it another way. Consider the following modest proposal: if you want to edit without registering an account, first you publicly upload a photocopy of your passport. You then get a cookie and a randomly generated ID for as long as you keep the cookie. If you lose the cookie, you go through the process again. As far as I can tell, this proposal meets all the goals and requirements of the project: IP addresses aren't published. Admins can still track abuse just as well as they used to. It's also fairly obviously ridiculous from a privacy perspective. If this proposal meets the requirements, I think that clearly means the requirements aren't sufficiently thought out. I don't think it makes sense to be talking about changes to the system for privacy if we don't have specific privacy goals/risk-reduction targets. Otherwise how will we evaluate whether the proposals meet our needs? How will we know if this project is an actual success at improving privacy, or simply a re-arrangement of the existing system with its existing flaws? Bawolff (talk) 00:52, 19 February 2020 (UTC)
Bawolff: That's fair. In this software project, and in this discussion, we're focused on trying to make sure this project doesn't cause an additional burden for the communities in their work fighting vandalism, spam and harassment (that is, if certain things take more time to ensure a higher degree of privacy for non-registered editors, we'll at least have offset that by making sure other parts of the workflow are now faster), making sure we build this with the help of the expertise of the communities. This means that the discussion here skews towards what can be shown.
At the same time we’re also having a discussion with Legal (and will of course involve Security) – the Foundation's experts on privacy – to make sure everything we do is in line with the goals of the project, which is our main formal mechanism to avoid the kind of scenario you're describing.
We're also interested in any privacy concerns the communities are surfacing. I want to stress that everything here is really just at the idea stage. /Johan (WMF) (talk) 05:08, 25 February 2020 (UTC)
Purely on the latter point, for en-wiki, I only occasionally see privacy-related concerns from editors about editors (vs. what content should go in articles etc.). Some of those aren't relevant here (e.g. people asking for OS because they put up personal details they didn't mean to). There are some requests, usually via OTRS, asking for their "account and all contributions" to be deleted – the former usually because they named their account with their own name, and the latter usually because they're angry with Wikipedia for various reasons. Courtesy vanishing can handle the former, so nothing suggesting a sea change in that route. 16:47, 26 February 2020 (UTC)

No machine learning yet, comments on LTA database[edit]

Stop chasing shiny new things when the existing software is derelict and irreplaceable. I do not want to see machine learning until you address each and every single item on this list: Talk:IP_Editing:_Privacy_Enhancement_and_Abuse_Mitigation/Archives/08-2019#Open_questions. That you find some items on the list boring does not matter. Your job is to serve the community. You cannot build a shiny new house if the foundation (pun intended) is crumbling.

That said, the LTA database is on the list. A list of common articles/IPs/UAs isn't going to cut it - some LTAs are described by a behavior across entire topic area(s) and other LTAs are only characterised by behavior. I want to be able to see how articles, subjects, topics, edits, socks, IPs and UAs link together visually. (The graph should be fully integrated with the admin tools - it should be a one-click operation to block, delete or CU from it.) You will need to borrow heavily from cybersecurity threat intelligence/attack graph visualisation packages in order to address this problem fully and not in the usual half-baked manner in which the WMF deals with community requests. A textual description is easy, but is also lazy, wastes potential and loses information.
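The core of the linkage graph described above can be sketched with the standard library alone: accounts become connected whenever they share an attribute (IP, user agent, article edited), and connected components surface candidate clusters. All account names, IPs and attribute values here are invented for illustration; a real tool would of course need the visual, one-click layer on top.

```python
from collections import defaultdict

# Invented example data: each entry is (account, attributes of one edit).
edits = [
    ("AccountA", {"ip": "198.51.100.7", "ua": "UA-1", "page": "Topic X"}),
    ("AccountB", {"ip": "198.51.100.7", "ua": "UA-2", "page": "Topic Y"}),
    ("AccountC", {"ip": "203.0.113.9",  "ua": "UA-2", "page": "Topic X"}),
    ("AccountD", {"ip": "192.0.2.33",   "ua": "UA-9", "page": "Topic Z"}),
]

# Index attribute -> accounts, then derive account-to-account adjacency.
by_attr = defaultdict(set)
for account, attrs in edits:
    for kind, value in attrs.items():
        by_attr[(kind, value)].add(account)

adjacent = defaultdict(set)
for accounts in by_attr.values():
    for a in accounts:
        adjacent[a] |= accounts - {a}

def cluster(start):
    """Depth-first walk collecting every account reachable via shared attributes."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(adjacent[node] - seen)
    return seen
```

Here `cluster("AccountA")` pulls in AccountB (shared IP) and, transitively, AccountC (shared user agent with AccountB), while AccountD stays isolated – the kind of transitive link that a textual list of common IPs/UAs misses.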

There is one thing I want to add to the list: CUs being able to search for accounts with the same emails. You don't have to expose the email for this to work. MER-C (talk) 18:22, 28 May 2020 (UTC)
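One way the "match without exposing the email" idea could work is comparing keyed hashes of normalised addresses, so the tool never needs to display the address itself. This is a sketch of the general technique only; the salt value, normalisation rules and account data below are illustrative assumptions, not anything MediaWiki actually does.

```python
import hashlib
import hmac

SERVER_SALT = b"example-secret-salt"  # would be a protected server-side secret

def email_fingerprint(email: str) -> str:
    """Keyed hash of a case-normalised email; not reversible without the salt."""
    normalised = email.strip().lower().encode("utf-8")
    return hmac.new(SERVER_SALT, normalised, hashlib.sha256).hexdigest()

# Invented accounts: only fingerprints would be stored/queried, never addresses.
accounts = {
    "Alice": email_fingerprint("alice@example.org"),
    "Sock1": email_fingerprint("Alice@Example.org"),  # same mailbox, other case
    "Bob":   email_fingerprint("bob@example.org"),
}

def accounts_sharing_email(name: str) -> set:
    """All accounts whose stored fingerprint matches the named account's."""
    target = accounts[name]
    return {acct for acct, fp in accounts.items() if fp == target}
```

A CU query against fingerprints answers "which accounts registered the same address?" without ever surfacing the address to the person running the check.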

Local discussions[edit]

Hi everyone, we had a number of local discussions to gather information about workflows and to make sure everyone's aware of what we're doing. To record what information we're keeping in mind when developing, in addition to the feedback here, we've written a report at IP Editing: Privacy Enhancement and Abuse Mitigation/Improving tools/Local discussions 202006.

You can see the typical questions at IP Editing: Privacy Enhancement and Abuse Mitigation/Improving tools/Discussion starter. /Johan (WMF) (talk) 10:08, 1 July 2020 (UTC)