User:Markus Krötzsch/Wikidata queries

From Meta, a Wikimedia project coordination wiki
The datasets described on this page have been released by now. For details and download links, please see https://iccl.inf.tu-dresden.de/web/Wikidata_SPARQL_Logs/en and the accompanying publication:
Stanislav Malyshev, Markus Krötzsch, Larry González, Julius Gonsior, Adrian Bielefeldt:
Getting the Most out of Wikidata: Semantic Technology Usage in Wikipedia’s Knowledge Graph.
In Proceedings of the 17th International Semantic Web Conference (ISWC-18), Springer 2018. PDF

The page below is kept for reference.

We are considering the release of abridged versions of the request logs for the Wikidata SPARQL query service. This service is accessed by sending GET requests to https://query.wikidata.org/sparql?query=query, where the query is a (URL encoded) database query in the SPARQL format. These requests are logged internally. The proposal is to publish an excerpt of this data.
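For illustration, such a request can be assembled by URL-encoding the query and appending it as the `query` parameter. This is a minimal sketch using only the Python standard library; the example query itself is illustrative:

```python
from urllib.parse import quote

# Endpoint from this page; the query string is URL-encoded as a whole.
ENDPOINT = "https://query.wikidata.org/sparql"

def build_request_url(sparql):
    """Return the GET request URL for a given SPARQL query string."""
    return ENDPOINT + "?query=" + quote(sparql, safe="")

url = build_request_url("SELECT ?s WHERE { ?s ?p ?o } LIMIT 1")
print(url)
```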

What data is proposed to be shared publicly?

We propose to publish information about all requests that have been made within a period of twelve weeks in summer 2017. The data would be published in tabular form, where each line in the table has the following entries:

  1. Modified query: the original query, reformatted and processed for reducing identifiability (see below)
  2. Timestamp: The time of the request
  3. Agent type: This field states what kind of user agent seems to have made this request. It is simply "browser" for all browser-like agents, and might be slightly more specific for bot-like agents (e.g. "Java"). See below.
  4. Bot hint: This field indicates whether we believe that the query was issued by an automated process. This is true for all queries that came from non-browser agents, and in addition for some queries that used a browser-like agent. It is a heuristic measure.
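Putting the four fields together, a single published record could be serialized as one tab-separated line. All field values below are hypothetical; the page does not fix a concrete file format:

```python
# One hypothetical log record with the four fields listed above; the
# actual release may use a different serialization.
record = {
    "query": "SELECT ?var1 WHERE { ?var1 ?var2 ?var3 }",
    "timestamp": "2017-07-01T12:00:00Z",
    "agent_type": "browser",
    "bot_hint": "false",
}

def to_tsv(rec):
    """Join the four fields into one tab-separated line."""
    return "\t".join(rec[k] for k in ("query", "timestamp", "agent_type", "bot_hint"))

print(to_tsv(record))
```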

Overall, the data amounts to around 200 million requests. Even after removing all queries that we believe were sent by bots, more than 650,000 queries remain. The queries are very diverse in terms of size and structure.

How is this data generated?

All source code used for generating the data is published.

Modified query

The query strings will be heavily processed to remove potentially identifying information as far as possible, and to reduce spurious signals that could be used to reconstruct user traces. The following steps are performed:

  • Stage 1: A SPARQL programming library (OpenRDF) is used to transform the original query string into an object model. If this fails (invalid query), the query is dropped completely. We do not publish any information about invalid requests.
  • Stage 2: The structure of the parsed SPARQL query is modified:
    • All comments are removed
    • All string literals in the query are replaced by placeholders of the form "stringN" (e.g., "string1", "string2") that have no relationship to the original string (we simply number the strings in the order they are found in the query).
      • The same string will be uniformly replaced by the same placeholder within each query, but the same string across different queries will usually not be replaced by the same placeholder.
      • The only exceptions are very short strings (of at most 10 characters; this is configurable), strings that represent a number, lists of language tags in language service calls (e.g., "en,de,fr"), and a small number of explicitly whitelisted strings that are used to configure the query service (e.g., the string "com.bigdata.rdf.graph.analytics.BFS", which instructs BlazeGraph to do a breadth-first search). These strings are preserved.
    • All variable names are replaced by generated variable names of the form "varN" or "varNLabel"
      • Replacement is uniform on the level of queries like for strings.
      • The ending "Label" is preserved, since BlazeGraph has a special handling for such variables.
    • All geographic coordinates are rounded to the nearest full degree (latitude and longitude). This is also done with coordinates in the alternative, more detailed format, where latitude and longitude are separate numerical values.
  • Stage 3: The modified query is converted back into a string
    • All formatting details (whitespace, indentation, ...) are standardized in this process
    • No namespace abbreviations are used in the generated query, and no namespace declarations are given.
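The Stage 2 rewriting can be sketched roughly as follows. This is a simplified, regex-based illustration: the real pipeline (whose source code is published) operates on the OpenRDF parse tree, so this sketch ignores edge cases such as '#' or escaped quotes inside string literals. The length threshold and whitelist entry are taken from the text above:

```python
import re

SHORT_STRING_LIMIT = 10  # "at most 10 characters" threshold from the text
WHITELIST = {"com.bigdata.rdf.graph.analytics.BFS"}  # example from the text

def obfuscate(query):
    """Simplified sketch of Stage 2: strip comments, replace string
    literals by placeholders, and rename variables uniformly."""
    # Remove comments ('#' to end of line).
    query = re.sub(r"#[^\n]*", "", query)

    # Replace string literals by per-query placeholders "string1", "string2", ...
    strings = {}
    def replace_string(m):
        s = m.group(1)
        if (len(s) <= SHORT_STRING_LIMIT or s in WHITELIST
                or re.fullmatch(r"\d+(\.\d+)?", s)):
            return m.group(0)  # preserved: short, numeric, or whitelisted
        if s not in strings:
            strings[s] = '"string%d"' % (len(strings) + 1)
        return strings[s]
    query = re.sub(r'"([^"]*)"', replace_string, query)

    # Rename variables uniformly within the query, preserving "Label" endings.
    variables = {}
    def replace_var(m):
        name = m.group(1)
        has_label = name.endswith("Label") and len(name) > 5
        base = name[:-5] if has_label else name
        if base not in variables:
            variables[base] = "var%d" % (len(variables) + 1)
        return "?" + variables[base] + ("Label" if has_label else "")
    return re.sub(r"\?(\w+)", replace_var, query)
```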

Example: the well-known query for the 10 largest cities with a female mayor:

#Largest cities with female mayor
#added before 2016-10
#TEMPLATE={"template":"Largest ?c with ?sex head of government","variables":{"?sex":{"query":" SELECT ?id WHERE { ?id wdt:P31 wd:Q48264 .  } "},"?c":{"query":"SELECT DISTINCT ?id WHERE {  ?c wdt:P31 ?id.  ?c p:P6 ?mayor. }"} } }
SELECT DISTINCT ?city ?cityLabel ?mayor ?mayorLabel
WHERE
{
  BIND(wd:Q6581072 AS ?sex)
  BIND(wd:Q515 AS ?c)

	?city wdt:P31/wdt:P279* ?c .  # find instances of subclasses of city
	?city p:P6 ?statement .            # with a P6 (head of government) statement
	?statement ps:P6 ?mayor .          # ... that has the value ?mayor
	?mayor wdt:P21 ?sex .       # ... where the ?mayor has P21 (sex or gender) female
	FILTER NOT EXISTS { ?statement pq:P582 ?x }  # ... but the statement has no P582 (end date) qualifier
	
	# Now select the population value of the ?city
	# (wdt: properties use only statements of "preferred" rank if any, usually meaning "current population")
	?city wdt:P1082 ?population .
	# Optionally, find English labels for city and mayor:
	SERVICE wikibase:label {
		bd:serviceParam wikibase:language "en" .
	}
}
ORDER BY DESC(?population)
LIMIT 10

turns into the following normalized query, which yields the same results:

SELECT DISTINCT ?var1  ?var1Label  ?var2  ?var2Label 
WHERE {
  BIND (  <http://www.wikidata.org/entity/Q6581072>  AS  ?var3 ).
  BIND (  <http://www.wikidata.org/entity/Q515>  AS  ?var4 ).
  ?var1 ( <http://www.wikidata.org/prop/direct/P31> / <http://www.wikidata.org/prop/direct/P279> *) ?var4 .
  ?var1  <http://www.wikidata.org/prop/P6>  ?var5 .
  ?var5  <http://www.wikidata.org/prop/statement/P6>  ?var2 .
  ?var2  <http://www.wikidata.org/prop/direct/P21>  ?var3 .
 FILTER (  (  NOT EXISTS  {
   ?var5  <http://www.wikidata.org/prop/qualifier/P582>  ?var6 .
 }
 ) 
) .
  ?var1  <http://www.wikidata.org/prop/direct/P1082>  ?var7 .
 SERVICE  <http://wikiba.se/ontology#label>   {
    <http://www.bigdata.com/rdf#serviceParam>  <http://wikiba.se/ontology#language>  "en".
  }
}
ORDER BY  DESC( ?var7 )
LIMIT 10

Agent type

The agent type is set to "browser" for all user agents that start with "Mozilla" (for example, "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; chromeframe/12.0.742.100)" would be considered a "browser" for this purpose). For requests that do not originate from browsers, the agent type will be the name of the query software if the software identifies itself (e.g., "auxiliary matcher" is a popular tool used by Magnus Manske), or a general platform category (e.g., "Java") if not. The strings used here will be manually whitelisted to ensure that none contains personal information (e.g., detailed system descriptions will not appear). Moreover, it will be ensured that every agent type occurs in at least 10,000 queries spread across more than one week. All remaining agents will be labeled "other".
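The classification just described might be sketched as follows. The two whitelist sets here contain only the examples named in the text, not the actual release whitelist:

```python
# Illustrative whitelists: only the examples mentioned on this page.
TOOL_WHITELIST = {"auxiliary matcher"}  # tools that identify themselves
PLATFORM_WHITELIST = {"Java"}           # general platform categories

def agent_type(user_agent):
    """Classify a raw user-agent string as described above."""
    if user_agent.startswith("Mozilla"):
        return "browser"
    for tool in TOOL_WHITELIST:
        if tool in user_agent:
            return tool
    for platform in PLATFORM_WHITELIST:
        if platform in user_agent:
            return platform
    return "other"
```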

Bot hint

The bot hint field is "true" if we believe that the source of the query was a bot (i.e., some automated software tool issuing large numbers of queries without human intervention). This is the case if the user agent was not a browser, or if the query traffic pattern was very unnatural (e.g., millions of similar queries in one hour). This field is provided for convenience and only makes explicit how we interpreted the logs.
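In code, the heuristic might look like this. The page gives no exact traffic threshold, so the cut-off below is a hypothetical placeholder:

```python
# Hypothetical cut-off: the page only says "very unnatural" traffic
# (e.g., millions of similar queries in one hour), not a precise number.
QUERIES_PER_HOUR_THRESHOLD = 10_000

def bot_hint(agent_type, queries_in_hour):
    """True if the query likely came from a bot: any non-browser agent,
    or a browser-like agent with an implausibly high query rate."""
    if agent_type != "browser":
        return True
    return queries_in_hour > QUERIES_PER_HOUR_THRESHOLD
```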

Privacy

An important goal is to reduce the risk of any identifying information being present in the published data. Identifying information is information that can be traced back to a specific human with (objectively) high probability.

Exclusions: Data that won't be published

Only the aforementioned fields are included. In particular, the following will not be published:

  • No IP addresses, not even in hashed or otherwise obfuscated form
  • No user session information
  • No user comments in queries
  • No malformed queries
  • No further HTTP request parameters
  • No additional server-side information (response times, cache status, etc.)

Replacements: Data that will be obfuscated

Obfuscated aspects of the queries include:

  • All geographic coordinates
  • All complex string literals that are not whitelisted
  • Most formatting details of the query (use of whitespace, indentation, etc.)

Obfuscated details in other fields include:

  • User agents will only be broadly classified, and no individual browser agent strings will be published

Comparison to data released by others

Several other major projects, such as DBpedia and the British Museum, have released SPARQL query logs since at least 2011. For example, several such log files are available from the USEWOD website. Typically, the data includes the query, some time information, some source information (e.g., hashed IPs), or some user agent information. The published queries have been studied by researchers to understand the usage of these data collections and of the SPARQL query language as such. To the best of our knowledge, privacy concerns have not been raised in relation to this data over the past few years.

In comparison, we propose to release slightly less data than is available elsewhere, in particular omitting data that can be directly used to reconstruct query sessions by the same user. We also add some obfuscation of the actual queries as an additional safety measure (this is not usually done).

Rationale for chosen obfuscation

Our proposed query modification is based on our analysis of the (non-public) query logs. Our main guidelines are:

  • Minimize non-essential information. We normalize the syntax of queries, rename variables, and remove comments since doing so does not change the meaning of the query, yet reduces the implicit information that might be found in the query.
  • Better safe than sorry. We obfuscate most string contents (string literals and variable names) although we have seen no evidence anywhere in the data that these might ever cause any privacy concerns. Since we cannot be certain whether a longer string may hold sensitive information or not, we go for the more restrictive option here.
  • Geographic coordinates are sensitive. Coordinates are sometimes used by mobile apps to find nearby items. This could be used to identify people, hence we propose to strongly reduce the resolution of coordinates (a cell of one full degree in latitude and longitude spans about 110km x 40km (68mi x 25mi) even in the very north of Europe or very south of America). One could coarsen even further, but already at this scale it is impossible to create any relevant movement profiles or to pinpoint street addresses.
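The proposed coarsening amounts to rounding each value to the nearest full degree. A minimal sketch (the example coordinate is illustrative):

```python
def round_coordinate(lat, lon):
    """Round a latitude/longitude pair to the nearest full degree,
    as proposed for coordinates in published queries."""
    return (round(lat), round(lon))

# A point in central Dresden collapses to its one-degree cell.
print(round_coordinate(51.0504, 13.7373))  # -> (51, 14)
```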

We do not think that times or numbers in general are similarly sensitive in the data we consider. Times only occur up to day precision. There is a relatively small number of days in human history, and especially in the past hundred years (where people might be born), there are many events on each. Numbers are not very frequent in queries. They could be rounded too, but the risk of this data being identifying in any way is considered very low.

Wikidata items and properties abound in queries, but they are part of the publicly available content of the site. There are no "user items" or "user properties" that could be related to accounts, and there is no data about user accounts that could be queried in this way.

Note that this proposal is about the release of past query logs. In particular, it will not be possible for attackers to exploit the details given here in some way to influence the logs.

Feedback

space for feedback/discussion