WikiCite/WDQS graph split

This page documents the impacts of the Wikidata Query Service (WDQS) graph split of May 2025 in a way oriented to the WikiCite community. Its goal is to serve as an entry point for anyone interested in WikiCite and curious about the "graph split" and what it means for the WikiCite endeavour.
Overview
As Wikidata grows in size, technical challenges for scalability appear. The Wikidata Query Service, in particular, depends on a piece of software, Blazegraph, that is considered to be nearing its capacity at the current size of Wikidata and unable to keep up with the speed at which the number of Wikidata items and statements grows. The technical team of Wikidata and numerous volunteers have been considering alternative solutions since at least 2021.
WikiCite is the project within Wikidata which curates the citation metadata of scholarly articles. Scholarly data, comprising over 30% of entries on Wikidata, are considered key pieces of Wikidata's growth puzzle. To prevent complete crashes in the near future, it was decided that the information about scholarly articles and similar items would not be served anymore by the main Wikidata Query Service. Instead, a second Query Service was created, serving only the scholarly information. The split went live on May 9, 2025.
The knowledge graph of Wikidata, then, is split for queries on the Wikidata Query Service. The split does not directly affect systems that do not use the Wikidata Query Service.
The change is relatively contained. However, because the Query Service is used in numerous applications, which are not always easy to track, it might cause unexpected or misleading information to appear on Wikidata and in applications that rely on it.
FAQ
This Frequently Asked Questions (FAQ) page complements the one at Wikidata:SPARQL query service/WDQS graph split, but caters in particular to the WikiCite audience. It was created in the context of a rapid grant by Wikimedia CH in preparation for the WikiCite 2025 conference. New questions may be directed to the talk page, where they will be triaged for inclusion here.
Practical questions
How do I know if a query I rely on has been affected?
It is often not trivial to know whether a particular query has been affected, as query results may be split across different graphs. You may use this tool to check how many results a query returns in each of the endpoints.
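The same check can also be scripted. The following is a minimal sketch, not the tool mentioned above: it runs one query against both endpoints and reports where rows came back. The endpoint URLs and the User-Agent string are assumptions that are not given on this page, so verify them before use.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint URLs; they are not stated on this page, so verify before use.
ENDPOINTS = {
    "main": "https://query-main.wikidata.org/sparql",
    "scholarly": "https://query-scholarly.wikidata.org/sparql",
}


def run_sparql(endpoint_url, query, timeout=60):
    """Send a SPARQL query over HTTP GET and return the parsed JSON response."""
    url = endpoint_url + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        url,
        headers={
            "Accept": "application/sparql-results+json",
            # Hypothetical User-Agent; replace it with your own contact details.
            "User-Agent": "wikicite-split-check/0.1 (example script)",
        },
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def count_rows(sparql_json):
    """Count result rows in a SPARQL JSON results document."""
    return len(sparql_json["results"]["bindings"])


def describe_split(counts):
    """Say which graphs returned rows, given a {graph name: row count} mapping."""
    non_empty = sorted(name for name, n in counts.items() if n > 0)
    if not non_empty:
        return "no results in any graph"
    if len(non_empty) == 1:
        return "results only in " + non_empty[0]
    return "results split across " + ", ".join(non_empty)


def check_query(query):
    """Run the query on every endpoint and return (counts, description)."""
    counts = {
        name: count_rows(run_sparql(url, query))
        for name, url in ENDPOINTS.items()
    }
    return counts, describe_split(counts)
```

Calling `check_query("SELECT ?item WHERE { ?item wdt:P4327 ?bhl_id } LIMIT 100")` needs network access. A query that returns rows from only one endpoint can usually stay as it is on that endpoint, while one that is split needs one of the approaches described in the next answer.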
Why is my query giving me results in both graphs? How can I deal with it?
Your query likely relies on multiple kinds of entities, some of which are in the main graph (e.g. books) and others in the scholarly graph (e.g. theses). The design of the split tried to minimize issues and maximize simplicity, but that meant that, in some cases, information got divided between the two graphs.
For example, items with BHL Title IDs are present in both graphs:
SELECT ?item ?bhl_id WHERE {
?item wdt:P4327 ?bhl_id .
}
There are at least three possible ways to retrieve all results, as before:
- Run the query on the main and scholarly endpoints and combine the results programmatically.
- Run the query on some unsplit endpoint. For example, until November 2025, you may run the query on https://query-legacy-full.wikidata.org/. For some use cases, you may also run it on QLever.
- Add a simple UNION clause, combining the results with those of the federated endpoint.
For example, you may run on main:
SELECT ?item ?bhl_id WHERE {
{?item wdt:P4327 ?bhl_id .}
UNION
{SERVICE wdsubgraph:scholarly_articles {?item wdt:P4327 ?bhl_id . }}
}
If your case is more complex, you may need to consult the Internal Federation Guide.
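The first option above, combining the results programmatically, can be sketched as follows. This is an illustrative helper, not an official client: it assumes both endpoints returned results in the standard SPARQL JSON results format (for example via any SPARQL client library), and the function name is hypothetical.

```python
import json


def merge_results(*result_sets):
    """Concatenate rows of several SPARQL JSON results, dropping exact duplicate rows."""
    # Collect the variable names of all inputs, keeping first-seen order.
    variables = []
    for result in result_sets:
        for var in result["head"]["vars"]:
            if var not in variables:
                variables.append(var)

    seen = set()
    merged = []
    for result in result_sets:
        for binding in result["results"]["bindings"]:
            # Serialize the whole row so that type, value and language all count.
            key = json.dumps(binding, sort_keys=True)
            if key not in seen:
                seen.add(key)
                merged.append(binding)
    return {"head": {"vars": variables}, "results": {"bindings": merged}}
```

Dropping duplicate rows is a design choice of this sketch. A plain UNION would keep them, so remove the `seen` check if the query relies on repeated rows. Ordering and LIMIT clauses are applied per endpoint, so they need to be re-applied to the merged rows.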
Until when can I use query-legacy-full as a stop-gap?
According to the May 2025 scaling update, the query-legacy-full endpoint is expected to be available until November 2025, when it will be deprecated.
As of June 2025, there are no scheduled changes to the main and scholarly endpoints.
Is there a way to avoid writing federated queries? For example, by submitting the original query via an API?
In some (but not all) cases, you may be able to avoid writing federated queries by either:
- Using an unsplit endpoint, such as QLever or some other alternative SPARQL endpoint
- Rewriting your workflow to retrieve data in a SPARQL-independent way, for example via the Wikidata:REST API
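As a rough illustration of the SPARQL-independent route, the sketch below reads the statements of a single item through the REST API. The base URL, the `property` filter parameter and the response shape (statements grouped by property ID, with the value under `value.content`) are assumptions to check against the Wikidata:REST API documentation, and the User-Agent string is hypothetical.

```python
import json
import urllib.request

# Assumed base URL of the Wikibase REST API on Wikidata; check the
# Wikidata:REST API page for the current version prefix.
REST_BASE = "https://www.wikidata.org/w/rest.php/wikibase/v1"


def statements_url(item_id, property_id=None):
    """Build the URL for the statements of one item, optionally for one property."""
    url = f"{REST_BASE}/entities/items/{item_id}/statements"
    if property_id:
        url += f"?property={property_id}"
    return url


def statement_values(statements, property_id):
    """Extract plain values for one property from a statements response (assumed shape)."""
    values = []
    for statement in statements.get(property_id, []):
        value = statement.get("value", {})
        # Skip "unknown value" and "no value" statements, which carry no content.
        if value.get("type") == "value":
            values.append(value.get("content"))
    return values


def fetch_statements(item_id, property_id=None, timeout=30):
    """Download and parse the statements of one item (requires network access)."""
    request = urllib.request.Request(
        statements_url(item_id, property_id),
        # Hypothetical User-Agent; replace it with your own contact details.
        headers={"User-Agent": "wikicite-rest-example/0.1 (example script)"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

For instance, `statement_values(fetch_statements("Q87830400", "P50"), "P50")` would list the author items of one article without touching either query service. This approach works item by item, so it suits lookups of known items rather than graph-wide searches.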
If I adapt my query now, will it need to be revisited later?
While there are no scheduled changes as of June 2025, it is possible that queries will need to be revisited in the future. There is a high likelihood that the query service will move off the Blazegraph backend at some point. The current split buys time to keep the Query Service working until the best alternative emerges. When the move off Blazegraph happens, many old queries will need to be rewritten. This may happen in a few years, but no timeline has been set yet.
For an idea of the progress of the migration off Blazegraph, see the WDQS backend alternatives page.
Theoretical questions
Does the split affect anything else other than the Wikidata Query Service? For example, the Action API?
No, the split does not affect functionalities of Wikidata that do not rely on the Wikidata Query Service. For example, requests via the REST API or the Action API should be unaffected.
Although there are assessments of growth in multiple parts of Wikidata, as well as discussions on the limits of Wikidata and on regulating mass edits to prevent uncontrolled growth, as of 2025 there are no concrete plans to extend the "split" beyond the SPARQL query service.
Which items exactly are included in the scholarly graph?
As of June 2025, the technical rules for the split select a set of items considered under the "scholarly article" scope and all their direct information. In other words, the scholarly graph includes the triples where scholarly articles are the subjects.
You may find the list of items considered "scholarly articles" below, clicking on "expand" on the right of the information box. You may also run a SPARQL query to navigate the subclass of (P279) hierarchy for the instance of (P31) values included in the split.
For example, the triple Wikidata as a knowledge graph for the life sciences (Q87830400) author (P50) Andra Waagmeester (Q19845625) is in the scholarly graph, but the label "Andra Waagmeester" is in the main graph.
There are other triples in the scholarly graph where Andra Waagmeester (Q19845625) is the object of the triple (w.wiki/EEeJ). However, there are no content-rich triples in the scholarly graph where Andra Waagmeester (Q19845625) is the subject of the triple (w.wiki/EEeR).
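The SPARQL query mentioned above for navigating the subclass of (P279) hierarchy can be generated with a small helper. This is a sketch under assumptions: the helper name is hypothetical, scholarly article (Q13442814) is used only as an example root, and the query is presumably best run on the main endpoint, where class items and labels live.

```python
import re

QID_PATTERN = re.compile(r"Q[1-9][0-9]*")
LANGUAGE_PATTERN = re.compile(r"[a-z]+(-[a-z0-9]+)*")


def subclass_query(root_qids, language="en"):
    """Build a SPARQL query listing every subclass (P279, transitively) of the given classes."""
    for qid in root_qids:
        if not QID_PATTERN.fullmatch(qid):
            raise ValueError(f"not a QID: {qid!r}")
    if not LANGUAGE_PATTERN.fullmatch(language):
        raise ValueError(f"not a language code: {language!r}")
    values = " ".join(f"wd:{qid}" for qid in root_qids)
    return (
        "SELECT DISTINCT ?root ?class ?classLabel WHERE {\n"
        f"  VALUES ?root {{ {values} }}\n"
        # P279* also matches the root itself (zero steps in the hierarchy).
        "  ?class wdt:P279* ?root .\n"
        f'  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{language}" . }}\n'
        "}"
    )


# Example root: scholarly article (Q13442814).
print(subclass_query(["Q13442814"]))
```

Passing the full list of instance of (P31) values used by the split rules as `root_qids` would show every class reachable from them in the hierarchy.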
Which governance mechanisms control the split rules?
The list of QIDs was determined after discussions in multiple places, including an analysis of possible "instance of" values on a Google Spreadsheet and a Wikidata page named "WDQS Split Refinement". The feedback page was announced in the April 2024 scaling update and was open until May 15, 2024.
As of August 2025, there are no official community forums discussing further refinements of the graph split rules.
Do the items in scholarly have any information on the main query service?
Taking the example above, the only triple on the scholarly graph where Andra Waagmeester (Q19845625) is the subject is: Andra Waagmeester (Q19845625) <http://wikiba.se/queryService#subgraph> <https://query.wikidata.org/subgraph/wikidata_main>. This states that the information about this item is present in the main subgraph.
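The marker triple described above can be looked up for any item. The sketch below only builds the query and reads the answer; the helper names are hypothetical, and it assumes the result comes back in the standard SPARQL JSON results format from whichever client you use to run it on the scholarly endpoint.

```python
import re

# Predicate IRI quoted on this page for the subgraph marker triple.
SUBGRAPH_PREDICATE = "http://wikiba.se/queryService#subgraph"


def subgraph_query(qid):
    """Build a query listing the subgraph marker triples of one item."""
    if not re.fullmatch(r"Q[1-9][0-9]*", qid):
        raise ValueError(f"not a QID: {qid!r}")
    return f"SELECT ?subgraph WHERE {{ wd:{qid} <{SUBGRAPH_PREDICATE}> ?subgraph . }}"


def subgraph_names(sparql_json):
    """Return the last path segment of each ?subgraph IRI in a SPARQL JSON result."""
    names = []
    for binding in sparql_json["results"]["bindings"]:
        iri = binding["subgraph"]["value"]
        names.append(iri.rstrip("/").rsplit("/", 1)[-1])
    return names
```

For the example item, `subgraph_query("Q19845625")` produces a query whose result on the scholarly graph should contain the `wikidata_main` IRI quoted above, telling a client to fetch the rest of the item's data from the main endpoint.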
Is there a reason why some kinds of works (e.g. monograph (Q193495)) were excluded?
The choice of what to include in the graph was complex, with multiple elements at play. Some monographs, for example, are relatively more popular and may have Wikipedia articles, which is rare for scholarly articles. One goal was to limit sitelinks within the scholarly graph.
The different people involved in the decision also wanted a "simple" way to describe what counts as scholarly, and landed on a list of about 30 types. While this may seem complex, consider that Wikidata has thousands of types that are used for written works.
Will article-adjacent entities (e.g. authors, scientific journals, etc.) be available in both the regular and scholarly query services?
No, data for authors and scientific journals (such as their labels) are available only from the main query service. There is no duplication of triples.
Are the rules for the split definition fixed? Will they change in time?
Based on conversations with DCausse (WMF) as of 20 May 2025, in the long run the responsible team expects publication type of scholarly work (P13046) to be widely used, so that its presence on an item may become the sole rule for inclusion in the `scholarly graph`. This would eventually make the list of P31 values used for inclusion (v2.yaml) obsolete.
If you are interested in technical details, the current rules are listed in the code repository for the Wikidata Query Service, under a "resources" folder. They follow a versioned schema, including wdqs-subgraph-definitions-v1.yaml and wdqs-subgraph-definitions-v2.yaml. If new rules appear, they will eventually be added as new versions (-v3, -v4 and so on) in the same `/resources/` location.
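If you track these files in a script, the versioned naming scheme makes it easy to pick the newest definition. The following is a small sketch with a hypothetical helper name. It assumes future files keep the `wdqs-subgraph-definitions-vN.yaml` pattern, and the v3 and v10 names in the example are invented for illustration.

```python
import re

VERSIONED_NAME = re.compile(r"wdqs-subgraph-definitions-v(\d+)\.yaml")


def latest_definition(filenames):
    """Return the definitions file with the highest version number, or None."""
    best = None
    for name in filenames:
        match = VERSIONED_NAME.fullmatch(name)
        if match is None:
            continue  # Ignore files that do not follow the naming scheme.
        version = int(match.group(1))
        if best is None or version > best[0]:
            best = (version, name)
    return best[1] if best else None


# v1 and v2 are named on this page; v3 and v10 are hypothetical future files.
listing = [
    "wdqs-subgraph-definitions-v1.yaml",
    "wdqs-subgraph-definitions-v2.yaml",
    "wdqs-subgraph-definitions-v3.yaml",
    "wdqs-subgraph-definitions-v10.yaml",
    "README.md",
]
print(latest_definition(listing))
```

Comparing the versions as integers rather than as strings keeps v10 ahead of v2 and v3.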
Are there any triples common to both the scholarly and the main graphs?
As of June 2025, the way the split is set up explicitly excludes triples in the scholarly graph from the main graph. The main graph, thus, has everything in Wikidata minus the scholarly information.
What about the information about properties? In which graph are they found?
The information about all the properties on Wikidata (e.g. their labels) is found in the main graph.
Are the examples in both Query Services the same?
Currently, yes. This is likely to change, though.
Does the graph split occur in real time?
Yes. For example, adding the publication type of scholarly work (P13046) to an item will immediately remove it from queries directed to the main graph and add it to the scholarly graph.
Who manages the Wikidata Query Service? Are they the same people managing the split?
The Wikimedia Foundation Search Team manages the Wikidata Query Service. The names and roles of the people in the team are listed in WDQS backend updates, including Guillaume Lederrey, David Causse, Ryan Kemper, and others.
The Wikimedia Deutschland Wikidata staff collaborates actively with the Search Team, including, for example, the Portfolio Lead Product Manager for Wikidata, Lydia Pintscher.
As of June 2025, there is an open position in the Wikimedia Foundation as Lead Product Manager, Wikidata Platform, whose description includes the product lifecycle management of the Wikidata Query Service (WDQS) and related successor services. There is also an open position for a Tech Lead, Wikidata Platform, who will work on:
- Stability, performance, and scalability of the Wikidata Query Service (WDQS) architecture and data pipeline
- Articulating a vision for Wikidata Platform’s query infrastructure that supports continued growth and future sustainability
- Developing new query methods, APIs, algorithms, and indexing strategies to optimize graph search capabilities for priority use cases
How many triples are now queryable on each endpoint? How many items?
As of June 2025, the scholarly endpoint hosts around 8.6 billion triples and the main endpoint hosts around 8.4 billion triples. The query-legacy-full endpoint hosts around 16.8 billion triples.
There are around 77 million items on the main graph and 45 million items on the scholarly graph.
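The figures above can be put in proportion with a few lines of arithmetic. This sketch uses only the approximate numbers quoted in this answer; the variable and function names are illustrative.

```python
# Figures quoted above (June 2025): billions of triples and millions of items.
TRIPLES_BILLION = {"main": 8.4, "scholarly": 8.6, "legacy_full": 16.8}
ITEMS_MILLION = {"main": 77, "scholarly": 45}


def split_total(figures):
    """Sum of the main and scholarly figures."""
    return figures["main"] + figures["scholarly"]


def scholarly_share(figures):
    """Fraction of the combined main and scholarly total held by the scholarly graph."""
    return figures["scholarly"] / split_total(figures)


triples_total = round(split_total(TRIPLES_BILLION), 1)
triples_gap = round(triples_total - TRIPLES_BILLION["legacy_full"], 1)
triples_share_pct = round(100 * scholarly_share(TRIPLES_BILLION), 1)
items_share_pct = round(100 * scholarly_share(ITEMS_MILLION), 1)

print(f"main + scholarly triples: {triples_total} billion")
print(f"difference from query-legacy-full: {triples_gap} billion")
print(f"scholarly share of triples: {triples_share_pct}%")
print(f"scholarly share of items: {items_share_pct}%")
```

The two split endpoints add up to about 17.0 billion triples, roughly 0.2 billion more than the figure for query-legacy-full. This page does not explain the difference, and rounding of the approximate figures is one possible cause. The scholarly graph holds about 50.6% of the triples and about 36.9% of the items, which is in line with the "over 30% of entries" mentioned in the overview.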
Are there any affected tools? Gadgets? Workflows?
An open, community-curated list of the affected resources is available at Wikidata:SPARQL query service/WDQS graph split/Affected tools.
How many items on the scholarly graph have sitelinks to Wikipedia, Commons, or other projects?
As of June 2025, on main, there are about 10 million sitelinks to English Wikipedia (https://w.wiki/EZNL), 6 million sitelinks to Commons (https://w.wiki/EZNP) and about 96 million sitelinks in total (https://w.wiki/EZNN).
On the scholarly graph, there are only 285 sitelinks to English Wikipedia (https://w.wiki/EZNQ), 4561 links to Commons (https://w.wiki/EZNR) and 50 thousand sitelinks in total (https://w.wiki/EZNT).
Is the {{Cite Q}} template broken?
No, the Module:Cite Q (Q33429959) and Template:Cite Q (Q22321052) don't rely directly on the Wikidata Query Service and still work fine.
See also
- A blog post by Bob DuCharme on how suddenly "many Blazegraph engineers [became] Amazon Neptune engineers" and how Amazon started owning the trademark "Blazegraph".