Wikidata, the knowledge base of Wikimedia, is collecting a large amount of structured knowledge across all Wikimedia projects and languages. Since 2015, a query service is available to retrieve and analyze this data using the SPARQL query language. Since the data is rich and the query language is powerful, many complex questions can be asked and answered in this way. The service is used not only by individual power users but also by applications inside and outside of Wikimedia, which issue a large number of queries to provide users with the information they request.
Understanding these queries can be extremely useful to improve both the content of Wikidata ("What information do users care about most?") and the quality of the query service ("Which frequent types of queries are critical for performance?"). This project therefore aims to analyze the query logs to gain insights that may be valuable to the community and to WMF operational staff alike. It is hoped that a completely anonymized form of query logs can be gleaned from the available data, and that this can then be shared openly with the community for further analysis.
Research Questions and Expected Outcomes
If successful, the project will answer two orthogonal research questions:
- Which parts of the content of Wikidata are of interest to users?
- Which query load does this application put on the database server?
An answer to the first question may state which properties and items are queried for, but also which expressive features of Wikidata are (not) used in queries. More fine-grained analysis may even reveal the importance of particular combinations of properties (patterns) or even of individual statements stored in the system. The outcome may encourage editors to intensify their work (many now wonder how/if their contributions are used) or to investigate further why some content is not used as hoped for. The outcome may also provide a metric to show reuse, that can help to express relevance to funders.
The second question tries to classify (streams of) SPARQL queries according to the demands that they pose towards a database server that tries to answer them. Relevant dimensions include the SPARQL features that are used, the shape of the query patterns, the size of the intermediate results, or the speed at which queries arrive. The actual meaning of queries in terms is not of interest here. The outcome of this work could be a descriptive characterization of the query load (giving typical metrics) and, going further, a public benchmark that provides developers with realistic query loads to test and improve their system.
Both insights might also help the WMF in making their query services faster and more reliable. Understanding usage patterns is the key to many forms of optimizations. For example, it is possible that queries could be simplified significantly by slightly changing the encoding (based on RDF) of the Wikidata content in the database underlying the query service.
Data: What's there & what's needed
We expect that the query logs are available only as part of general server access logs that contain queries as part of URLs accessed by users. In addition to URLs, this data may contain time stamps, IP addresses, user agents, and HTTP header information. We do not know the details yet.
For answering the above research questions, only a small part of these logs is actually needed:
- SPARQL query strings
- Relative query times (how much time has passed between two queries)
- If available, the information if a query was asked by a specific software tool or bot
- During processing: A limited amount of "session" information to identify queries that have been asked by the same user/application in a short time
Published data should only include information of types 1 and 2, and optionally 3 (if some tools are found to account for many queries). Session information (item 4) may help during data cleansing, since it is possible that users try out many queries until they find the one that answers the question they really had. This may help to filter meaningless or unintended queries, or to find out which example queries a user started with (if not recognized, the example queries of Wikidata may lead to a bias in all analyses). User identities or long-term query histories of individual users are not of interest to our research questions.
Extract and validate data
The first step will be to extract from the raw server logs individual query records that contain valid SPARQL queries and a subset of the log data that may be required for further processing. The main processing step will be to evaluate and analyze query strings. Invalid queries should be marked; understanding how many queries are invalid and why can yield some insights. We intend to use an open source SPARQL processing library such as Apache Jena to process queries. Records obtained from this step may include:
time, SPARQL query string, isValidQuery?, IP, User agent
Cleanse, normalize, and anonymize queries
SPARQL is a rich language, and the (technically) same query can be written in many ways. We will remove user comments and formatting, and try to normalize query shape to help comparing queries. For example, the order of query parts in SPARQL is often irrelevant, so one can re-order parts in a uniform way to facilitate comparisons across queries. Stronger normalizations may help to compare more queries, but it may be useful to retain the original query shape for reference, since a different way of writing the same query may still affect database performance. For analyzing usage, however, it is not necessary to retain the shape as long as the meaning is preserved.
Cleansing and normalization will usually lead to queries that do not contain any user-specific information. However, we will investigate possible privacy issues in the processed queries in separation. In particular, we will analyze the string contents (SPARQL queries may contain strings, which could contain sensitive information) of queries. We aim to have this ready early on so that external experts can already be consulted. The assumption that needs to be verified is that anonymized queries together with their relativized time stamps can safely be published to a wider community.
Extract relevant features
Queries can be analyzed with respect to a large amount of features and metrics, some of which are not easy to obtain. In order to prepare further analysis, we will first identify relevant features and classify queries accordingly. We can foresee that the following will be of interest:
- SPARQL features. SPARQL has many algebraic operators and filter conditions, which may or may not be used in a specific query. The use of subqueries and custom SERVICE calls is also relevant here.
- Vocabulary elements. A query may explicitly refer to elements of the Wikidata vocabulary or other vocabularies. This includes Wikidata properties and entities as well as parts of the Wikidata ontology. A special case of this is the use of BlazeGraph-specific vocabulary (e.g., to deactivate query optimization).
- Size. We can quantify the size of a query in several ways, e.g., by counting the number of operators, triples, or variables used.
- Similarity to known example queries. We will attempt to quantify the similarity of queries to known example queries. The primary purpose of this measure is to distinguish test and demonstration queries from production queries. The measure should define how likely it seems that a given query was obtained by modifying a known example. This comparison would naturally involve otherwise irrelevant features such as comments and naming of variables that are available from the original query string.
- Execution time, cache usage, timeout. If available, the time to response, cache usage and whether or not a timeout happened should be recorded as features as well. This information could be useful to analyse queries with respect to their runtime behaviour on the WMF servers. However, this needs to be validated carefully, since the overall server load may influence query times significantly.
- Query agent information. This feature should record whether the query was created by a known query agent, such as a bot or a specific server (known by its IP). Individual user IPs or browsers should not be recorded. The goal is only to single out agents that create large volumes of queries that could be an unjustified bias in the analysis. We will conduct a pre-analysis of the logs to discover relevant indications and anomalies (IPs with large amounts of queries, special user agent strings that are not browsers, etc.) that should be recorded here.
- Query structure. A complex "feature" is the algebraic structure of the query as a whole. This would take the form of an abstracted parse tree of the query. Part of this analysis should be relevant structure sharing, i.e., the use of the same subquery in multiple queries.
- Query templates. One can generalize concrete queries by abstracting from specific constants (items, properties), so as to obtain query templates. The number of distinct templates is expected to be much smaller than the number of queries, so they can be used as a feature to group queries. Especially if one template occurs in a large number of instantiations, this may suggest that the queries have a common source (bot or tool).
- Pattern complexity. For conjunctive patterns (Basic Graph Patterns) in queries, we are also interested in specific measures that classify their shape (as a graph). Simple measures of this type are the diameter and maximal degree of the query graph (these can already be used to classify "star-shaped" and "line-shaped" queries). A more advanced measure is treewidth. Depending on how many different forms of graph patterns we will encounter, we will attempt to compute these measures (computing treewidth of larger graphs is computationally challenging, but there are implementations that can be used).
- Regular expression complexity. We are interested in the complexity of property path expressions used in queries. We will classify these expressions according to language-theoretic properties. This includes features such as star-height, determinism, and being part of specific levels of language-theoretic hierarchies. We expect that most expressions are very simple and can therefore readily be classified. Depending on how many non-obvious patters remain, we will either manually extend the patterns or build a special purpose classifier for particular language classes.
- Relative time of query execution. This quantity will help to identify temporal variations in query loads, which may also help to group queries later on.
Create a query-feature database
To enable further processing, the queries and (selected) features will be precomputed and stored in a common database. This should serve as a flexible foundation for further analysis. An RDF representation that can be queried with SPARQL may be a good solution. This would also enable exploring the dataset interactively (e.g., to query for the total number of queries that use certain features in combination) and thus support hypothesis generation for further analysis. It will depend on the actual size of the dataset if this solution will be feasible. Alternatively, an offline processing solution for storing and retrieving queries with their features will be considered.
Query distribution analysis and clustering
Based on the pre-computed features, one can further analyze and cluster queries. The simplest approach will be to extract histograms of feature distributions to understand obvious groups or spot anomalies. Correlations between features can be tested based on hypotheses developed earlier. A more advanced approach is to run unsupervised machine learning (clustering) algorithms to identify relevant types of queries.
The goal of this phase is to decompose the overall mass of queries into more informative groups which may exhibit distinct characteristics in terms of query structure, features used, and execution time. It might be more meaningful to analyze the groups individually, e.g., if a large number of uniform queries is generated by one tool (e.g., Reasonator), then it should be studied in separation from other queries.
A particularly relevant way of grouping queries is by execution time, if this feature is available. It would be interesting to check if there are specific query features that can statistically predict runtimes.
Query performance: Insights, conclusions, recommendations
Based on the data gathered so far, one can derive general insights about performance-related aspects of queries. Research questions include:
- Which important types of (concurrent) query loads should the query service be optimized for?
- Which features have significant negative correlations with performance?
- What kinds of bursts, hot spots or otherwise anomalous performance impacts can be found?
- What kind of query patterns are most important?
- What are the most challenging queries in terms of size or pattern complexity?
Answering some of these questions may lead to several outputs:
- Recommendations for optimizing the WDQS setup or BlazeGraph as such
- Guidelines for users to improve the performance of their queries
- Benchmarks that collect a representative set of relevant query types and patterns
Usage patterns: Insights, conclusions, recommendations
Based on the data gathered so far, one can derive general insights about content-related aspects of queries. Research questions include:
- Which parts of the content of Wikidata are most interesting to users of the query service? (e.g., which Wikidata properties and classes are most commonly used in queries?)
- Which aspects of the Wikibase data model are most relevant in queries? (e.g., to what extent are references used in queries?)
- Which aspects of the RDF translation are most relevant in queries? (e.g., "Are queries using the BestRank class?")
Answering some of these questions may lead to several outputs:
- Feedback to the Wikidata editor community regarding the wider interest in their work.
- Recommendations on how to improve the RDF encoding, e.g., by adding structure that simplifies common queries
- Correlation of access paths (tools) and content (e.g., are certain tool used more central to the work of certain communities, such as, GLAM or bio?)
An advanced analysis would be to identify which parts of the data are "hot" in the sense that many queries are relying on them for (some of) their answers. This is a challenging question that may lead to a longer term effort, but this project is an important starting point.
This section gives a short overview of the main areas of research where the proposed project has a potential for high scientific impact.
The scientific and technological progress in the field of RDF databases and graph-based query answering hinges upon the availability of meaningful benchmarks. They help researchers and developers to compare the performance of competing approaches, to identify current shortcomings that may inspire further work, and to understand the requirements arising in relevant application areas. Good benchmarks are rare and creating interesting new benchmarks is a field of active research. Examples related to RDF and graph databases include the DBpedia Benchmark, the Berlin SPARQL Benchmark, LinkBench, gMark, and the benchmarks of the Linked Data Benchmark Council LDBC.
Nevertheless, the availability of relevant RDF benchmarks remains unsatisfactory and there is much ongoing work to improve this (gMark and LDBC are examples of ongoing projects). While several benchmarks are based on real RDF data, benchmark queries are often artificial and therefore cannot reflect the information need of any real user group. Moreover, there is hardly ever any dynamic information regarding the frequency and temporal distribution of (different types of) queries. Also synthetic benchmarks, such as the aforementioned gMark, rely on relevant statistical information to generate data.
Finally, there is a specific need for benchmarks that cover the advanced knowledge-graph structure of Wikidata, which is significantly different from simpler RDF datasets (which are lacking, e.g., references and qualifiers, e.g., for temporal annotations). Wikidata's additional features and the encoding that they require have significant impact on query answering performance. Similar encodings are natural for other complex knowledge graphs, but current RDF benchmarks are not covering this area.
The study of community behaviour and user information needs when interacting with large-scale data management systems is still in its infancy. A major obstacle is that there is often no sufficient data available to conduct such studies. The only large-scale SPARQL query logs that we are aware of are the logs of the Bio2RDF and DBpedia SPARQL endpoints that are available in the USEWOD datasets. It could be interesting to compare especially the DBpedia queries to the ones asked in Wikidata, but we expect that the two sets are extremely different in many aspects, for the following reasons:
- The richer information found in Wikidata enables much more complex queries, as we can already see in the example queries (e.g., using references and qualifiers).
- The two datasets often contain very different information.
- Where the two datasets cover similar information, it is often encoded in different ways in RDF.
- The user communities asking queries are different (DBpedia traditionally has been very prominent among Semantic Web researchers; Wikidata has notable GLAM activities and users; etc.).
- The landscapes of third-party tools that use the query services are completely different.
Policy, Ethics and Human Subjects Research
The project does not involve studies with human subjects, and the goal of the study is not to analyze human behaviour beyond the usage of a specific technical service. However, the raw data that needs to be analyzed may contain IP addresses or other identifying information. The project team will work closely with WMF to ensure that user rights are protected at all times. Non-disclosure agreements and restricted server-side data access will provide the necessary legal and technical framework for ensuring that this is strictly observed.
- Denny Vrandečić, Markus Krötzsch: Wikidata: a free collaborative knowledgebase. Commun. ACM, 57(10):78-85, 2014
- Mohamed Morsey, Jens Lehmann, Sören Auer, Axel-Cyrille Ngonga Ngomo: DBpedia SPARQL Benchmark – Performance Assessment with Real Queries on Real Data. Proc. of the Int. Semantic Web Conf. (1) 2011: 454-469
- Christian Bizer, Andreas Schultz: The Berlin SPARQL Benchmark. Int. J. Semantic Web Inf. Syst. 5(2): 1-24 (2009)
- Timothy G. Armstrong, Vamsi Ponnekanti, Dhruba Borthakur, Mark Callaghan: LinkBench: a database benchmark based on the Facebook social graph. Proc. of the 2013 ACM SIGMOD Int. Conf. on Management of Data: 1185-1196
- Guillaume Bagan, Angela Bonifati, Radu Ciucanu, George H. L. Fletcher, Aurélien Lemay, Nicky Advokaat: gMark: Controlling Workload Diversity in Benchmarking Graph Databases. arXiv, http://arxiv.org/abs/1511.08386, v3, 6 Feb 2016.
- Linked Data Benchmark Council: http://ldbcouncil.org/, accessed 2016-06-17
- Songyun Duan, Anastasios Kementsietsidis, Kavitha Srinivas, Octavian Udrea: Apples and oranges: a comparison of RDF benchmarks and real RDF datasets. Proc. of the 2011 ACM SIGMOD Int. Conf. on Management of Dat: 145-156
- Fredo Erxleben, Michael Günther, Markus Krötzsch, Julian Mendez, Denny Vrandečić: Introducing Wikidata to the Linked Data Web. In Proc. of the 13th Int. Semantic Web Conf. (ISWC), 2014.
- Daniel Hernández, Aidan Hogan, Markus Krötzsch: Reifying RDF: What Works Well With Wikidata? In Proc. of the 11th Int. Workshop on Scalable Semantic Web Knowledge Base Systems, 32-47, 2015
- Markus Luczak-Roesch: USEWOD Homepage, http://usewod.org, accessed 2016-06-17