Talk:Data retention guidelines
Comments from //Shell
Indefinite retention of emails
Non-personal information associated with a user account (server logs)
This should include the contents of some HTTP headers, which may have privacy concerns, including:
- Referer: the previous page visited, which may be on any other site (in my opinion if this is from another site, it is strictly private and can only be used as analytic data, only in aggregated forms by origin domain). Almost all browsers send this information by default (unless the user has installed a filtering plugin).
- Accept-Language: the default language of the browser used, or the list of preferred languages defined in browser preferences; some combinations of preferred languages may be very user-specific, notably if the language or languages are very uncommon in the country or region associated with the geolocated IP (e.g. Icelandic or Wolof selected by a user currently in locations like Monaco, Addis Ababa or Harbin, China).
- User-Agent: and Accept: which identify precisely the type and version of the browser, and of its supported or installed plugins. This information is used by CheckUser admins trying to identify a user from past navigation with the same browser installation when the IP alone is not enough to assert that this is the same user. The exact configuration of these combinations of software versions may be unique to a user, notably when the user has installed some uncommon plugin (this includes media player extensions, or localized versions of security tools) or uses an uncommon browser for a specific platform.
- X-Chrome-*: and similar custom HTTP headers defined by browsers or plugins (including antivirus tools). Some of these headers contain user IDs associated with the registration of the plugin or browser; this is very common for media players, for custom browsers embedded within game software or game consoles, and for some smart TV sets, set-top boxes, or brands of mobile devices.
- Via: and similar HTTP headers defined by proxies relaying the user's navigation. Some of these headers identify the originating user behind a non-anonymizing proxy. Frequently, they contain personal information such as an authorized user name registered on the proxy, the IP address of the connected user, a hardware identifier of a mobile device using a public hotspot, a user ID assigned internally by the proxy or hotspot (for example in a McDonald's restaurant or in a train station), or session identifiers generated on those proxies or hotspots and locally associated with an identified user, whose account may persist there for a long time and which will be sent again each time the same user returns to the same location to use the hotspot with the same device or the same local user account. Generally these identifiers (and the full set of HTTP headers) may be requested by the admins of these proxies or hotspots when they receive an alert that one of their users is using their service to abuse external sites such as Wikimedia.
There are also:
- Cookies: but they are defined by the visited site itself and should be subject to the policy about permanent or session cookies defined by the visited Wikimedia sites (this includes cookies generated once the user logs in to any Wikimedia site with SUL).
- Data collected by media players for tracking the quality of connections for the delivery of streams. In some cases the media players will switch to use another stream.
- Some media such as video and audio include timecodes that also allow the site to track which part of the media has been played, and how many times, by the user. When the user pauses the media, rewinds to repeat it, or skips some parts, the media server may know it.
- DNS resolution requests and similar "site info" requests, including requests for TXT records checked by security tools, or for "finger" and "whois" info: not all of them come from an ISP; they may be sent directly to Wikimedia DNS servers by a plugin in the browser or by the browser itself (trying to assess the site). Some of these requests may be very user-specific if they test some aliased subdomain names within Wikimedia domains, or if they perform queries that are typically only performed by ISPs. Users may perform direct DNS requests to Wikimedia domains. In some cases the ISP may reveal information about the user for whom it forwards the DNS resolution request, as part of the DNS query itself, in timely reproducible patterns of events. These requests do not reach a webserver but an infrastructure server managed by Wikimedia (but possibly hosted by a third-party domain hosting provider, operating with its own data retention and privacy policies).
More generally, this data includes everything that is stored by the webserver in the server logs, and it is much more than just the IP or the URL visited with its query parameters. Some webserver logs may add query parameters not present in the URL but sent in POST data (which may be converted by one of the front proxies used by Wikimedia sites into GET parameters present in the URL submitted to the backend server).
Note that there are logs stored in front proxies (including the various Squid instances connected to the public IP address) and logs stored by backend webservers. There may be filters in front proxies, and front proxies may anonymize part of these requests (notably requests whose cacheable results will be delivered to multiple users).
Server logs are subject to US laws where those laws require sites in the US to retain these logs for some period of time. All these logs are also used by CheckUser admins. verdy_p (talk) 00:53, 15 January 2014 (UTC)
- Hi, Verdy:
- Thanks for your detailed thinking on this. There are many different parts to this; let me try to respond in pieces:
- User Agent information: We agree that UAs should be treated as personal information, and covered by this policy; that is why it is in the definition of PI :) We're already working on this, for example by filtering UAs in Labs and by working to sanitize them in Event Logging.
- Other HTTP headers: I see your point about putting this in PI. We’re talking with analytics and ops about how best to handle them.
- US law and log retention: There may be some unusual circumstances where we're required to stop deleting logs (i.e., if we're sued and the logs have some data relevant to that) but as a general matter there are no US laws (federal or state) that require log retention.
- Hope that helps clarify. —LVilla (WMF) (talk) 02:16, 4 February 2014 (UTC)
- Further followup on DNS: the DNS tool we use doesn't log requests at all, only aggregate counts. Hope that helps. —LVilla (WMF) (talk) 20:08, 4 February 2014 (UTC)
- Thanks a lot for taking note of these issues and revisiting a few missing/unclear items.
- However, this subject of metadata in server requests, as well as the integration of active components (like multimedia plugins), is not closed. As technologies will continue to evolve, and browsers as well (or security suites) performing some hidden background requests to many other third parties, we'll need to track it for a long time. The issue is more severe with components that are mandatory parts of the Internet architecture itself (notably DNS, IP routing data exchanges, finger, the PKI architecture and secure authentication key exchanges) and other technologies supposed to mitigate this risk (such as DNT protocols). DNS is now the most attacked protocol (in terms of global network neutrality) by ISPs themselves (and all their third-party service providers).
- I'm not even sure that the use of HTTPS now on Wikimedia will really improve privacy, or whether it will just help those who want to identify and track users... Even users of The Onion Network may find problems in terms of being tracked, even if the exchanged contents are encrypted! It will still be easy to track recent changes occurring in MediaWiki projects and correlate them with traffic initiated from one "anonymous site" whose authentication key may be indexed at its source and correlated with traffic reaching the public sites.
- Maybe we publish too many things in Wikimedia public logs. We could mitigate this risk by reducing the precision of timestamps to only 5 minutes, and by shuffling entries from multiple users so that they won't have a deducible order of occurrence. We should also probably hide part of the IP addresses of non-logged-in users, keeping only about 20 bits. We could also assign better "anonymous user names" for these IPs, for example by hashing these addresses with the time of creation of the user name and some secret data used at that time for a limited period and changed regularly: the server would issue new randomized data for each new period of time, for example once every week, by encrypting the start time of that period with an encryption key owned only by the WMF, and then using that time-key as additional data alongside the IP address to generate a string hash used as the "public user name". We should better protect the privacy of IP users, notably because they may be logged out by accident (by expiration of their current login session), and so we should not reveal these IPs publicly (let's leave that possibility only to CheckUsers using server logs).
- Note that the public username assigned for IP-only users (connected with IPv4 or IPv6), the encrypted user id generated as above (a unique but temporary id not lasting more than one week; so that admins can still block most abusers easily for one week), may take the form of a 128-bit IPv6 address allocated in a private IPv6 address block: It will not be routable on the Internet (except possibly via Wikimedia servers offering some routing to these users, using the privately stored secure mappings). This form would work with existing tools that expect to parse IP users as those using a username looking like an IP address.
- And this should be investigated to make sure that there are no "black hats" exploiting them to track users up to their source (even if these black hats don't know exactly the route followed by this traffic).
- For this I would advocate the development or support of very secure browsers which could hide the user's traffic directly from its source (TOR has this in its specific version of the Mozilla browser; but users are at risk when using any mobile device from famous brands, except possibly the rare mobile devices built on top of Linux OSes, such as Ubuntu Mobile). verdy_p (talk) 14:08, 19 February 2014 (UTC)
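For readers following the proposal above (temporary pseudonyms for IP editors, truncated addresses, coarse timestamps), here is a minimal sketch of how the three ideas could fit together. It is only an illustration under stated assumptions: it uses a keyed hash (HMAC-SHA256) over the week number and the IP rather than the encrypt-then-hash construction described in the comment, and the function names, the `secret` parameter and the choice of the `fd00::/8` unique-local block are all hypothetical, not anything MediaWiki or the WMF is known to implement.

```python
import hashlib
import hmac
import ipaddress
from datetime import datetime, timezone

PERIOD_SECONDS = 7 * 24 * 3600  # assumed one-week rotation, as suggested in the comment


def period_index(ts: datetime) -> int:
    """Number of whole weeks elapsed since the Unix epoch."""
    return int(ts.timestamp()) // PERIOD_SECONDS


def pseudonym(ip: str, ts: datetime, secret: bytes) -> ipaddress.IPv6Address:
    """Map an IP to a temporary address in fd00::/8 (hypothetical scheme).

    The same IP yields the same pseudonym within one period and an
    unrelated one in the next, as long as `secret` stays private.
    """
    addr = ipaddress.ip_address(ip)
    msg = period_index(ts).to_bytes(8, "big") + addr.packed
    digest = hmac.new(secret, msg, hashlib.sha256).digest()
    # One 0xfd prefix byte plus 120 bits of the keyed hash = 128 bits.
    return ipaddress.IPv6Address(b"\xfd" + digest[:15])


def truncate_ipv4(ip: str, bits: int = 20) -> ipaddress.IPv4Network:
    """Keep only the leading `bits` bits of an IPv4 address."""
    return ipaddress.ip_network(f"{ip}/{bits}", strict=False)


def coarsen(ts: datetime, minutes: int = 5) -> datetime:
    """Round a timestamp down to a multiple of `minutes`."""
    step = minutes * 60
    return datetime.fromtimestamp(int(ts.timestamp()) // step * step, tz=timezone.utc)
```

With these definitions, `truncate_ipv4("192.0.2.77")` gives `192.0.0.0/20`, and `coarsen` applied to 14:08:33 UTC gives 14:05:00 UTC. Because the pseudonym parses as an IPv6 address, tools that recognise IP users by the shape of the username would in principle keep working, which is the point made in the comment; whether that holds for any particular tool is untested here.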
- Thanks for continuing the discussion, Verdy. Let me respond briefly:
- Server-request metadata: I agree that there will be a lot of changes in the future. That's why I like the change we made in response to your earlier comments - instead of using a precise, defined list, we gave ourselves some flexibility so that we can do the right thing when new technologies arise. Thank you again for raising that - it is probably one of the most important changes we made in response to community feedback.
- Browsers: I agree that it would be good if browsers and other related tools took privacy more seriously, but that's well outside the scope of what the Foundation can do at this time - we need to focus on what we can control.
- That's why I also suggested the possibility of hosting some authenticated users on trusted anonymizing proxies offered to them by local chapters acting on behalf of the Foundation to control these users. It could be better than just asking these users to use only TOR, when they can't predict whether their TOR exit node will be logged. The TOR Browser is in any case an existing solution that can be proposed to these users, as long as the Foundation allows these authenticated users to choose the trusted chapter to which they will connect (via TOR for their originating traffic, where they are) to visit Wikimedia sites. Proxies offered by chapters could use a technical solution developed in partnership between the Foundation and the candidate chapters (or related partners, like privacy protection groups or NGOs). Those partners will limit the number of users they will accept to proxy, and these trusted proxies will be identified by the Foundation servers as such, without them knowing anything else than which partner is in charge of controlling these proxied users. I'm convinced that TOR connections are not bad for the Wikimedia projects, as long as users are associated with a registered account, even if that account is not associated with a real user name known directly by the Foundation (and accessible to US law or to the NSA and other "Big Ears" elsewhere in the world). verdy_p (talk) 00:58, 22 February 2014 (UTC)
- Hope that helps explain the situation - thanks again for your serious comments on these important issues. -LuisV (WMF) (talk) 19:14, 21 February 2014 (UTC)
Advertising on projects
This discussion has been open since the 10th of January, and is due to close in 4 days. However, it seems that no advertising of its existence has been made (as of today) on the French Wikipedia (correct me if I'm wrong). I see that as a problem, since those guidelines will affect all users of the projects of the Wikimedia Foundation...
This is part of the personal information definition, and it needs to be more specific. First, please revise to "...location information (if you have not posted it publicly)". In other words, personal information voluntarily provided on a WMF project by an individual can't really be treated in the same way as personal information that has not been publicly provided.
With respect again to location, when wearing my checkuser hat, I think we might need to be a bit more clear as to what would or would not fall into the "location" issue. Is naming the country giving away location? This comes up regularly when addressing sockpuppetry issues. Risker (talk) 16:44, 10 February 2014 (UTC)
Closing of the Consultation Period for the Data Retention Guidelines
The community consultation for the Data Retention Guidelines has closed as of 14 February 2014. We thank the community members who have participated in this discussion since the opening of the consultation on 09 January 2014 and have helped make the Guidelines better as a result. Although we are closing the community consultation, we welcome community members to continue the discussion. The Guidelines are intended to evolve and expand over time. You can read more about the consultation on the Wikimedia blog. Mpaulson (WMF) (talk) 00:02, 15 February 2014 (UTC)
If I read this correctly, it means that after 90 days IP info is not retained, meaning someone with the CheckUser permission will not be able to go back more than 3 months to investigate a possible sockpuppet situation? That seems unfortunate. I'm Tony Ahn (talk) 06:03, 31 May 2014 (UTC)
- It's been 90 days for quite a while now, and represents the balance point between protecting user privacy and allowing us to effectively investigate abuse. While it can make investigating long-term abuse more difficult, it is ultimately a good thing. Ajraddatz (talk) 06:32, 31 May 2014 (UTC)
- Considering that check user rights are used without a clear policy to govern their use, the rights are handed out on a popularity vote rather than measurable evidence of competence or maturity, and without any independent transparent accountability, including the fact that users being investigated may never be informed either that they have been subject to this process or why they were under suspicion, then putting a limit on how far back users can be pursued in this way is probably a good thing even if there were not legal reasons for doing so. --Fæ (talk) 07:21, 31 May 2014 (UTC)
- That is incorrect Fæ, there is a clear policy that governs its purpose, use, assignation, etc. and how it ties into the policy with regard to privacy; please see it at Checkuser policy. Your reflections on the assignation reflect your general unhappiness that you share across multiple forums within the whole of WMF. If you believe that there is a better process to undertake, then I look forward to your solid proposals in the RFCs, rather than your plaintive snipes across these forums. — billinghurst sDrewth 16:45, 31 May 2014 (UTC)
- Thanks for your response. My statement appears entirely correct against the policy you have linked to, please explain which of these statements is not correct.
- On Commons CUs rights are given out on a simple popularity vote. There is no other check of competence or maturity.
- Though there is a system for raising complaints, there is no transparent system of accountability, since running a CU carries no requirement to lay out a public justification, nor even to inform the affected parties that a CU has been run on their account. If you don't know it happened and you don't see a justification, how could the parties ever raise a complaint and supply the "links and proofs of bad behavior" that are required by policy in order to complain?
- You refer to RFCs, I would welcome a link to any.
- As for "plaintive snipes", that appears a value judgement about my character that you have not bothered to support with any evidence, and so it is not possible to defend against; I would appreciate it if you avoided haphazard personal attacks and focused on the issue in hand. Thanks --Fæ (talk) 17:51, 31 May 2014 (UTC)
- A nomination, an ability to ask and answer questions, and a vote (>80%, 25+ votes) is not a popularity contest, no matter how much you may not like it. Checkusers are identified to WMF and there is an age requirement. Clearly covered in the policy. If you wish for a change then put it forward to the community on the appropriate page. It is not relevant to data retention period.
- Raising complaints, OC, and minimum number of checkusers is accountability. If you wish for more, then put forward a proposal to the community on the appropriate page. It is not relevant to data retention period.
- It was a statement about your comments, not your character. That you made the comments, and the manner that you made them, on a discussion about duration of retention should have been sufficiently indicative. — billinghurst sDrewth 03:25, 1 June 2014 (UTC)
- "your general unhappiness that you share across multiple forums within the whole of WMF" and "your plaintive snipes across these forums" are unambiguously not a statement about my comments in this discussion, please do not argue that black is white. A comment about another editor's "general unhappiness" is a comment intentionally about the person, not the matter at hand. I find your response colours the discussion, taking it on a tangent as a personal attack, when my original comment here was entirely non-personal but about the systems we have in place. I have no idea why this is such a sensitive or fragile issue that you would want to put me off expressing my point of view on meta.
- However, well done, you 'win', if that was your objective. I cannot see the point in discussing the issue further on this thread if it is just going to be an excuse for you to have a series of jibes at my character rather than taking my points seriously. I'll go focus on some content creation issues and leave this discussion to more worthy people. --Fæ (talk) 05:03, 1 June 2014 (UTC)
- Beyond one week, most dynamic IPs are no longer valid and cannot be associated with a user. Dynamic IPs are the standard, even more so with mobile users and users of proxies.
- And most abuses will be made from mobile networks or proxies. So there's little need to keep that data, as we can't investigate them at the ISP to match them with a user.
- If we need to keep data for 90 days, this would mean that we cannot take any action against massive abusers for a considerable time and need other tools to detect massive abusers.
- For the rest, it is only a question of individual problematic edits in specific topics: do we really need to keep this dangerous and massive data for so long? It's like a hammer to kill a mosquito, and we increase the risk of having this data seized and reused for something else against many users (not abusers) by correlating it with other data collected privately and abusively.
- My opinion is that log retention should be reduced to the strict minimum required by the laws applicable to the location of the servers collecting this data (and in the US, not moved to another state or jurisdiction when there are multiple servers or local front proxies, except possibly as offsite backups with strong encryption).
- The personal data used by CheckUser is extremely dangerous for the vast majority of legitimate users. verdy_p (talk) 08:35, 31 May 2014 (UTC)
- @Verdy p: The data is more than just real people's edits, it is also for spambots which are quite prolific across the systems.
Three months of data is about the right length to get a consistent pattern of abuse. Remember that we are not looking at the straight raw data; the process is to run a check on either an IP, a range of IPs, or a username, so it is targeted, with the vast bulk of data not being seen. Re your reflections, they are opinions, broad sweeping statements, and not supported by evidence, and they don't align with what I see. While some of what you say may align with some nations, and some providers, it is not universal. While there is validity for general users, it is not accurate where we see spambots. Re abused open proxies, they are far more predominantly not dynamic addresses.
Rhetorical statements and opinions not supported by fact are problematic in this situation, especially where the stream is opinion that follows hypothesis by more opinion to your predetermined conclusion.
Then outlandish statements like "personal data used by CheckUser is extremely dangerous for the vast majority of legitimate users" are quite provocative. 1) Checkusers don't use personal data especially not for legitimate users, and would rarely see personal data for legitimate users and when seen would hardly be pursuing it and not publishing it. 2) How can the viewing of data be extremely dangerous? What is the basis for such a careless statement? The truth is that the vast bulk of our users are making occasional edits, and that the data is completely innocuous, and puts them in no danger as they edit their article on One Direction, Kylie Minogue, their favourite footballer, etc. The fact that it is not searched, is never viewed, and is not shared should set your mind to rest that the vast majority of our users are not exposed to danger. Your hyperbole is unhelpful. — billinghurst sDrewth 17:16, 31 May 2014 (UTC)
- I was certainly not provocative or rhetorical as you state here. You seem to overvalue your own work in this area. My comments are general considerations and I maintain that these logs are dangerous, including legally, to keep for too long. If they were not, we would not have specific CheckUser rights and a strong policy for using this tool, nor any limit on how long the logs are kept. In fact you are using your own opinion, which goes completely against the existing policy. I maintain that long retention times are more a problem than a solution, even against spambots. One month is quite enough against them, and they should be handled by a much larger army of normal users. Extending this time will not solve the problem better; we can work with the general community of reviewers and should better work on tools allowing them to handle most of the spambot traffic.
- I recently saw someone banned for one full year for a single edit, only because that edit caused a problem and he was logged off at the time. Unfortunately this was a dynamic IP, and this block means that any other user assigned this IP during that year will be forbidden from editing (this IP is used by a major ISP in the US that can assign it to any user in a very large region), and the block was made by an admin who did not even check the status of this IP and did not even request CheckUser. And this was just for a strange comment posted on a talk page (not the correct one) by someone visibly new wanting to comment in his own way for the first time. That short message was not even spam: not massive, not repeated anywhere else, not advertising, not insulting or harassing anyone, and politically neutral. All that should have been done is to revert that comment, alert the user that this was not the right place to post it, and direct him to some other place explaining things. And it was also not sent from an open proxy, just a dynamic IP. The user just forgot to sign (or most probably did not even have the time to add the signature, as he was banned completely, including from his own user talk page and from any attempt to recreate an account). Such bans are harmful to the project; we forget the mission of the project to be educational and to teach best practices to users.
- At the same time I've been a victim of personal harassment by someone who also damaged a lot of pages and refused to listen to anyone for a long time. It took considerable time to have that user blocked, even after that user made death threats. Spambots are not so much of a problem that we need to develop and maintain huge hammers to kill these mosquitos: their behavior is highly predictable, and what they post is reproduced consistently with minor variation because their automated brain has limited choices and no imagination, and they are slow to adapt. Spambots have clearly identifiable patterns; maybe they don't care about their personal image, they just insist on posting their spew and not changing it. If the content is too easily identifiable they introduce some limited typos or use encoding quirks that no human would even type on a keyboard (such as replacing characters with similar ones from other scripts, or posting in "1347" / "LEAT" style, frequently also with abuse of capitals to get heard).
- So please calm yourself, even if spambots irritate you. We cannot do a good job by acting in haste and overreacting nervously. If you feel too nervous, it's time for you to take a wikibreak to ease your mind: you are not alone, so don't take this task too personally if you participate in it. One good thing about the existing policies is that admins should never work alone and decide everything alone, and CheckUser admins should also work in cooperation with other users addressing most issues. verdy_p (talk) 18:40, 31 May 2014 (UTC)
- Your commentary was both wikt:provocative and wikt:rhetorical. I provided my opinion of the general usefulness of three months versus one month. I also commented that from my experience at looking at checkuser data, that your examples of dynamic IPs encompassed a subset of the situations that I see from the data. I believe that the provision of an opinion of the medium of checkuser data based on experience is relevant to the conversation. I don't see how the rest of your commentary, nor your blocking example, is relevant to the data retention guideline.
- Please read the definitions you cite. I did not provoke anyone in the initial message I posted, and certainly not you, because when I posted it above you had not yet said anything here (you've changed the order of the discussion: I had posted in this thread before you, when you inserted a reply to someone else above my own comment, posted one day before).
- You just started to accuse me of being provocative (against whom? why?) and rhetorical even though my message was short enough (your message was longer and only targeted me), so yes, it is your message that was provocative and rhetorical. I gave a personal opinion without forcing anyone to share it, unlike what you are doing. I also asked you to remain calm, but apparently you have been nervous since the beginning and cannot hear that.
- I gave arguments explaining my position, based on the simple existence of the length limitation in the policy. Think about it: there are reasons why many people are getting nervous about keeping logs of personal data. It is not just a question of the individual actions that can be taken by CheckUsers, but more about the risk taken if this data is disclosed, even accidentally, or because someone would like to attack it and make intrusive use of this data, notably someone who also holds large amounts of other data. For this reason this data must have the minimum lifetime needed for technical or legal reasons but nothing more. And this is true whether these are static or dynamic IP assignments or other sensitive data. Notably this data tracks everyone, not just the few spambots you want to find and block. That's the definition of a "hammer to kill mosquitos", a well-known expression that is descriptive enough without being considered "rhetorical" or "provocative".
- Maybe I've used some terms that you find more irritating than I think; if so, sorry, English is not my native tongue. But it was definitely not personal, unlike what you did, and the general spirit was fairly understandable; don't infer subtle interpretations that I did not imply. You started by replying to me that "data is more than people edits". I know that perfectly well, because edits in Wikimedia are all visible to everyone, and the policy is definitely not about these edits but about other personal data that people send merely by their presence or through the technical means of communication they use, with little they can really do to avoid it (the only solution against that is to use anonymizing proxies, but they are slow and in fact we know that they are used by abusers). So you're proposing to extend the retention time of personal data that essentially comes from legitimate users, only to try to discover a few tracks left by a few spambots or abusers. I maintain that this log retention is dangerous (and probably even more so than spambots themselves, which can't really do much irreversible damage). By contrast, damage caused by intrusion into personal data is almost always irreversible, think about it seriously: the longer we keep this personal data, the more we are all exposed to these risks. In addition, the actions taken by admins based on a few tracks collected and kept a bit abusively are rarely definitive proof. They have known side effects that can affect anyone at any time, even people who didn't do anything on Wikimedia sites (e.g. tracking IPs that were used at some past time by a few abusers). verdy_p (talk) 05:55, 1 June 2014 (UTC)
If there are any cases of retaining data longer than stated in this guideline...
Though I suppose this guideline is only about what WMF will do to retain original data on its servers, not what people who have access rights will do after seeing the data, I am concerned about this issue:
It is suggested by Ombudsman commission and a current practice to retain stale checkuser data in case long term abuse is involved and for the purpose of explaining some checkuser actions. Does retaining data on checkuser wiki count as "retaining non-public data"? Should we mention those aspects in this guideline as well? Do we have to notify related parties when there is a need to retain data longer than it should be retained?--朝鲜的轮子 (talk) 05:37, 8 June 2014 (UTC)
Data of logged-out visitors
Under "Articles viewed by a particular user" the table only mentions logged-in users as an example. Do you retain a complete history including logged-out page views? For example, would it be possible for you to create a list of all IP addresses that viewed a given article in the last 15 days? --Tinz (talk) 12:05, 8 June 2014 (UTC)
- Yes, it's possible. --Nemo 12:30, 8 June 2014 (UTC)
- IP addresses, user agents and other fingerprintable information are stripped from the request logs at 90 days. So yes, this sort of thing is possible in the short term (e.g. 15 days), but not in the long term. The document is a little misleading in the example. --Halfak (WMF) (talk) 15:50, 9 June 2014 (UTC)
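For illustration only, the 90-day stripping described in the reply above could be sketched as follows. This is a minimal sketch, not the Foundation's actual pipeline: the record layout, the field names (`ip`, `user_agent`, `page`, `timestamp`) and the `sanitize` helper are all hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Retention window taken from the discussion above (90 days).
RETENTION = timedelta(days=90)

# Hypothetical field names; a real request log may use different ones.
FINGERPRINT_FIELDS = ("ip", "user_agent")


def sanitize(record, now):
    """Return a copy of `record`; once it is 90 days old or older,
    the fingerprintable fields are left out of the copy."""
    if now - record["timestamp"] < RETENTION:
        return dict(record)
    return {k: v for k, v in record.items() if k not in FINGERPRINT_FIELDS}


now = datetime(2014, 6, 9, tzinfo=timezone.utc)
recent = {"ip": "192.0.2.1", "user_agent": "ExampleBrowser/1.0",
          "page": "Article_X", "timestamp": now - timedelta(days=15)}
old = {"ip": "192.0.2.1", "user_agent": "ExampleBrowser/1.0",
       "page": "Article_X", "timestamp": now - timedelta(days=91)}

recent_clean = sanitize(recent, now)
old_clean = sanitize(old, now)
```

Under these assumptions, a 15-day-old record still carries its IP address, which is why the list Tinz asks about is possible in the short term, while a 91-day-old record keeps only the page and timestamp.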
Page request data retention
User:WNT and others raised the question of why the Foundation collects and retains data about pages visited by a logged-in user and we realized that the wording of the data retention guidelines was unclear and have since changed it. The Guidelines used to state that we can retain “a list of articles viewed by a logged-in user” and that “After at most 90 days, it will be deleted, aggregated, or anonymized.” This wording could easily be read to mean that if you are user XYZ, we have a list of every article that you read in the last 90 days. The Foundation is not interested in the behavior of users as individuals, and we have changed that language to “A list of articles viewed by readers.” The second change we made is to the maximum retention period of non-aggregate data. It now reads that “After 90 days, if retained at all, then only in aggregate form.” and just to clarify, aggregation means that we have removed all information that would identify specific readers. So, after aggregating the data in question, we would be able to keep, for example, the information that "5,000 readers visited article X on mobile devices on a given date" or that “the link from article X to article Y had the highest (75%) click-through rate for page X", but not who those readers were. We hope this addresses your concern. --DarTar (talk) 00:47, 19 June 2014 (UTC)
- But don't the webserver logs contain the full list of accessed URLs, so that for each connection request from any IP (logged in or not) they record the pages that were visited or submitted, along with a timestamp, the requested server hostname, some browser metadata from the HTTP headers (such as the User-Agent), server-generated session cookies (retrieved by the browser from its cache and sent along with the request), and some form data (when it is URL-encoded and appended to the queried URL), or possibly all of it (if the server also logs the POST form data in the attached request body)?
- If so, the data will be archived exactly like all other server logs and will still be usable in non-aggregated form by the CheckUser tool, even if it is not allowed in non-aggregated form for generating usage statistics.
- Also, I agree with the change from "read" to "visited": logging the fact of "reading" an article would require the server to use some JavaScript to track browser behavior (onload events) or user behavior (scrolling, hovering over areas and moving the mouse in the page area) on the page once it is loaded, whereas a "visit" is a technical access which may be performed automatically by the browser (anticipating clicks and preloading pages, for example) and does not mean that the page was ever rendered and actually viewed.
- There is only one place where the server actually checks whether the user was "reading" a page: when it uses a "captcha" that the user must read and answer correctly on that page to continue (a captcha is undesirable in most cases when just reading articles; captchas are sent only when creating an account, when submitting an edited page containing new external links while not logged in, or with similar actions by which users create or modify data stored on the servers; captchas are there only to stop unauthorized bots). verdy_p (talk) 08:14, 19 June 2014 (UTC)
- Yes and no. So, yes, we have those logs. No, they're not saved to the CheckUser table; the CU data is associated with edit requests, not page requests. In at least the page request logs, the session cookie (which lasts literally until you close the browser, and no longer) is stripped before it makes it to R&D: I don't recall seeing this in the CU data, either. --Ironholds (talk) 16:42, 19 June 2014 (UTC)
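As a rough illustration of the aggregation DarTar describes at the top of this thread (keeping counts such as "5,000 readers visited article X on mobile devices on a given date" but not who those readers were), here is a minimal sketch. The record fields (`ip`, `article`, `date`, `device`) and the `aggregate_views` helper are hypothetical and are not taken from any actual Foundation schema.

```python
from collections import Counter


def aggregate_views(requests):
    """Count views per (article, date, device class).

    Only the three grouping fields are read; IP addresses and any
    other per-reader fields never reach the output.
    """
    counts = Counter()
    for r in requests:
        counts[(r["article"], r["date"], r["device"])] += 1
    return counts


# Hypothetical raw page-request records.
raw = [
    {"ip": "192.0.2.1", "article": "X", "date": "2014-06-18", "device": "mobile"},
    {"ip": "192.0.2.2", "article": "X", "date": "2014-06-18", "device": "mobile"},
    {"ip": "192.0.2.1", "article": "Y", "date": "2014-06-18", "device": "desktop"},
]

daily = aggregate_views(raw)
```

Note that this sketch does nothing about very small counts: a group with a single view can still point to one reader, so a real aggregation step would likely need a minimum-count threshold or similar safeguard.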
Is the content we submit when we preview our edits stored indefinitely?
- On preview? I don't think so (if it somehow is, I'll be damned if any of the researchers know where it lives ;p.) Ironholds (talk) 20:36, 7 July 2014 (UTC)
Why retain user experience research data indefinitely?
The Wikimedia Foundation has hired a Lead User Experience Researcher to design, build, and implement a system to provide user experience research as part of how we build functionality at the Wikimedia Foundation. Part of that system includes the collection of qualitative data via design research methods. We use this data to recruit research participants. Also, once we implement research with participants (and collect qualitative data), we analyse that data, looking for patterns which then become findings informing the design of functionality.
We do user experience research with participants to inform what to build and why, as well as to iterate concepts and existing functionality. It is one way to bring the community into the design process, and iterate before release. The methodologies used are:
- Recruiting participants: Using a recruiting survey and a database to collect opted-in user experience research participants. Participants for the other types of research mentioned below will be drawn from the people who opt in. (We will collect name, email, country (for scheduling purposes), and answers to questions about use of wikis and about what kinds of research people are willing to participate in. We will keep this data indefinitely, as it is useful to do research with participants over time, and it is important to have a reliable source of a wide range of participants to invite to research sessions in a timely manner. It is important to have the right people participate in research, in order to properly answer the research question. People who opt in to be research participants can always opt out and be removed from the database by sending a request to do so to a dedicated email alias.)
- Remote usability studies: (both moderated and unmoderated) A researcher has a short conversation with participants, and asks them to attempt to achieve a goal or accomplish a series of tasks in a prototype or existing functionality using their own machine or device. As the participants work toward the goal or task, the researcher (and, in some cases, observers) watches while the participant shares their screen. After a series of these sessions, researchers do analysis by reviewing recordings and notes, and looking for patterns in the data which become findings. (These sessions are recorded using Google+ Hangouts On Air and the recordings are kept indefinitely. People sign a release form before participating.)
- In-person usability studies: Same method as remote usability studies, but in person. (These sessions are recorded and the recordings are kept indefinitely. People sign a release form before participating.)
- Exploratory research: For example, collaborative design, observing users working with existing functionality, interviews, conversations with users to better understand their needs and wants, design ethnography, and diary studies. This method is similar to usability studies, but more about investigating people’s needs within their own contexts. We observe people accomplishing goals in their existing functionality, and have conversations about that. (These sessions are recorded and the recordings are kept indefinitely. People sign a release form before participating.)
- Surveys about specific subjects: A survey is sent out to a broad set of users of different varieties to better understand user needs and practices. For example, people’s use of mobile or better understanding the ecosystem of devices people use. (The data in these surveys will be retained indefinitely, as they are useful over time.)
- Surveys to gather feedback on specific functionality: These surveys are embedded in Wiki functionality and are short. They are used mostly for new functionality to gather feedback, collect bugs, and compile suggestions for improvement, in context while people are using that functionality. The data collected is about people’s reaction to a specific functionality in context. (Data in these surveys will be retained indefinitely, as they are useful over time.)
See question I have asked on Archive of blp articles
http://en.wikipedia.org/wiki/Wikipedia:Biographies_of_living_persons/Noticeboard#Archive_of_blp_articles It does raise issues about archives and how retention is enforced and managed; arbcom has been emailed with my comments about this as well. Once an article is redacted by oversight, what steps are taken to remove this redacted information from db dumps? What are the dump retention periods?
How long do we retain non-public data?
@Mpaulson (WMF): It would be nice to clarify what we mean by IP address in (*) under the table in section "How long do we retain non-public data?". If we change "IP address" to "IP address for anonymous users" or something along that line, that would be clearer. I appreciate that this is defined in the Definitions section but that's not immediately available when we read the text. This had caused some confusion on our end. Thanks a lot! --LZia (WMF) (talk) 17:49, 13 July 2015 (UTC)
Added exception for extended retention of personal information in abuse cases.
It came to my attention recently that I mistakenly did not include an explicit exception in the data retention guidelines that addresses abuse cases. Without extended retention periods for abuse cases, WMF and trusted volunteers who work tirelessly to protect the Wikimedia projects would not be able to effectively do their jobs. Longer, and sometimes indefinite, retention of information (such as IP addresses and user agent information) in these rare cases of abuse is necessary for investigatory and enforcement purposes. Please don't hesitate to reach out to me if you have any questions or concerns about the recent addition. Mpaulson (WMF) (talk) 22:06, 17 November 2015 (UTC)
More information about exception for reader research data
It has recently come to my attention that we did not request an explicit exception for retaining a specific subset of the data that we collected for understanding Wikipedia readers beyond the general 90-day limit. The data collected is from the period of 2016-03-01 to 2016-03-08 and includes webrequest logs. The details of what gets collected as part of webrequest logs are captured here. We have consulted the Legal team about an exception that will give us until 2016-08-31 to anonymize or aggregate the data. We did not immediately delete the data before aggregation/anonymization because the anonymized data from this specific week is crucial to continuing the research to understand Wikipedia readers (we ran a survey that week on English Wikipedia, and to analyze potential bias in the survey results we need a base for comparison with the general reader population in that same week, to avoid having to work with data that may differ from the survey data because of seasonality, changes in topic trends, etc.).
The research, as documented here, will help us build an ontology of Wikipedia readers and Wikipedia articles with respect to their usage by Wikipedia readers. Such an ontology can help the Foundation and Wikimedia communities to understand the different groups of Wikipedia readers and articles at a deeper level. With this information we hope that we can all provide better services (improved search, for example) to Wikipedia readers. The early results of this research can be found through the documentation in this table; the latest survey results are documented here. If you have questions or concerns, please feel free to reach out to me. --LZia (WMF) (talk) 20:34, 16 August 2016 (UTC)
Semi-protected edit request on 28 June 2019
REMOVE SYSTEM DEBUGGER DATA RECOVERY MODE
This edit request has been answered.
2806:1016:4:C4F6:7955:C65F:E648:74EA 21:42, 28 June 2019 (UTC)
Added exception for page views investigation
The Privacy team has temporarily extended the retention period for two datasets so that the Data Engineering team can investigate the impact of a technical issue with data collection. Between June 4, 2021 and January 27, 2022, some of the Foundation’s caching nodes stopped collecting web traffic data (see the Phabricator task for more details). This resulted in data loss for web requests and the derived pageviews, which impacts the Foundation’s ability to correctly report on Wikimedia pageviews and fundraising banner impressions.
The Data Engineering team required a temporary short-term extension to the usual 90-day retention period in order to better estimate what data was not collected and which projects and geographies were most affected. The wmf.pageview_actor dataset is being used to estimate the data loss for pageviews and the wmf.webrequest dataset is being used to estimate the data loss for fundraising banners. Information from both datasets is required because webrequest data for visited banners is not reported as pageviews. Deletion of these datasets was paused on February 16, 2022 and deletion will resume by March 18, 2022.
If you have questions or concerns, please reach out to email@example.com. If you are interested in a meeting to discuss this exception and investigation, please sign up below and we will contact you with details. MMoss (WMF) (talk) 19:51, 11 March 2022 (UTC)