User talk:Markus Krötzsch/Wikidata queries

Feedback

@Markus Krötzsch: Hi. I went over the steps and I can't spot any issue for the 2017 data, but I should emphasize that I'm by no means an expert in this space, so we should treat my input as one input only. If the data is already available in this format, please let me know where it lives. I will sample the data and eyeball it as well.

@Smalyshev (WMF): Is this something the Search Platform team will provide sign-off for? We don't have the expertise needed in Research to be able to provide the sign-off. If Search Platform will do the sign-off, what other steps are required? (In Research, we have followed a 3-step process listed at Research:Improving link coverage/Release page traces in the past few cases where we were considering releasing data.)

Thanks! --LZia (WMF) (talk) 22:42, 19 March 2018 (UTC)

Looks good

The proposal looks good to me for a particular data set. In fact, since it looks good, I think it could be used as a model in the future, and with this in mind I'd like to outline certain future concerns. As far as I can see, most of them do not apply to the current data set, but if we do the same in the future, we need to be aware of them.

String anonymization looks fine, but we need to be aware of other data types. Coordinate and time handling looks fine too. However, external IDs, for example, are currently reported as both string and URI. I am not sure whether using a particular external ID could constitute PII (after all, they are public IDs), but given that those could identify books, people and other physical entities, it is plausible. Which means we might need to handle external IDs too.
In the future, we might have additional data types, such as creator links for SDC. It might be necessary to handle those too. In general, we'd need to be aware of new types popping up.
Numeric values do not seem to be of particular concern, but since we know illegal numbers exist, we may want to randomize numeric data too. Since most numeric data would be statistically useless anyway, we might as well deal with it. Unless, of course, the parser treats numbers as string literals, in which case it should already be happening?
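
To make the point about numbers concrete, here is a minimal sketch of scrubbing string and numeric literals in one pass. It is not the proposal's actual code: the function name `scrub_literals`, the placeholder values and the regex-based matching are all assumptions made for illustration, and a real implementation would more likely work on the parsed query than on raw text.

```python
import re

# Hypothetical placeholders; the real replacement scheme is not specified here.
STRING_PLACEHOLDER = '"STR"'
NUMBER_PLACEHOLDER = "0"

# Double-quoted string literal, allowing backslash escapes inside it.
STRING_RE = re.compile(r'"(?:[^"\\]|\\.)*"')

# Standalone integer or decimal. The lookbehind skips digits that belong to
# entity IDs (wd:Q5, wdt:P31) or variable names (?x1); the lookahead skips
# digits that run into a following word character.
NUMBER_RE = re.compile(r'(?<![\w:?$"])-?\d+(?:\.\d+)?(?![\w"])')


def scrub_literals(query: str) -> str:
    """Replace string and numeric literals in a query with fixed placeholders."""
    # Strings go first, so digits inside quoted text disappear with the string.
    query = STRING_RE.sub(STRING_PLACEHOLDER, query)
    return NUMBER_RE.sub(NUMBER_PLACEHOLDER, query)


example = (
    'SELECT ?p WHERE { ?p wdt:P31 wd:Q5 ; rdfs:label "Jane Doe 1980"@en ; '
    'wdt:P1082 ?n . FILTER(?n > 12345.6) }'
)
scrubbed = scrub_literals(example)
```

In this sketch entity and property IDs are left untouched, while the label and the numeric threshold are both replaced. Note that a blanket rule like this would also rewrite numbers that carry no personal information, such as a LIMIT value, so a real tool would have to decide which numeric positions to keep.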

Also, a selfish question: will the software doing all this work be made available? It would be a huge help to use something like this for further data publishing, e.g. phab:T143819. Also, public code review for the software would both help find potential issues and help people have more confidence in the anonymization process. Smalyshev (WMF) (talk) 20:23, 16 December 2017 (UTC)

Data location?

@Markus Krötzsch: is there a place where the data set can be inspected, to get an idea of what it looks like? Smalyshev (WMF) (talk) 20:19, 21 June 2018 (UTC) Answering myself: there's example data here: https://github.com/Wikidata/QueryAnalysis/tree/master/exampleMonthsFolder/exampleMonth/anonymousRawData Smalyshev (WMF) (talk) 20:19, 21 June 2018 (UTC)