WikiCite/Roadmap 2023/Federation

Implementation options

Option A

One proposal for handling the huge volume of scholarly article data in Wikidata is moving it to a separate federated wikibase instance. This page outlines a technical proposal to initialize, synchronize, and cut over to this separate instance. The physical infrastructure requirements, costs, and the ownership and management of the separate instance also need to be considered.

Initialization

Wikibase Federation allows an instance to use the properties defined on Wikidata (the Federated Properties feature). We probably want this to simplify data synchronization. The set of properties already in Wikidata covers almost everything we need for scholarly articles; additional ones could be added through the regular Wikidata property proposal process if needed. However - see below - some new properties may be needed to support federation itself.
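
As a rough, hypothetical illustration of scoping that property set, the following Python sketch asks the Wikidata Query Service which properties a small sample of scholarly-article items actually uses; the sample size and user-agent string are arbitrary, and nothing here depends on the new wikibase existing yet.

    # Sketch: list the Wikidata properties used on a sample of scholarly article
    # items, to scope what the new wikibase must provide (whether via Federated
    # Properties or locally defined copies).
    import requests

    WDQS = "https://query.wikidata.org/sparql"
    QUERY = """
    SELECT DISTINCT ?prop ?propLabel WHERE {
      { SELECT ?article WHERE { ?article wdt:P31 wd:Q13442814 } LIMIT 100 }
      ?article ?p ?value .
      ?prop wikibase:directClaim ?p .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    """

    resp = requests.get(WDQS, params={"query": QUERY, "format": "json"},
                        headers={"User-Agent": "wikicite-federation-sketch/0.1"})
    resp.raise_for_status()
    for row in resp.json()["results"]["bindings"]:
        print(row["prop"]["value"], row["propLabel"]["value"])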

Initializing items:

  1. Every instance of Q13442814 (scholarly article) or of its subclasses, as well as any other relevant item already in Wikidata, would have a new item created for it in the new wikibase. Some mapping between the two is needed to support synchronization and coordination, especially for items that may be retained long-term in Wikidata (for instance, where there is a sitelink for the article itself). Perhaps that calls for a custom property - "Wikidata item for this item"?
  2. Non-item-valued properties on each of these items can be copied easily. Item-valued properties will be harder, unless the target item is one that would in any case be copied to the new wikibase (see the sketch after this list):
    1. Articles that are comments on or corrections to other articles should be fine - we just replace the linked item id from Wikidata with the item id from the new wikibase.
    2. Journal, publisher, and similar items should perhaps also be copied to the new wikibase and handled similarly?
    3. Author items will be tricky to copy: there are so many item-valued properties on authors themselves, as humans, that it would be best to keep them in Wikidata. If so, the author property (P50) will need to work differently - instead of being wikibase-item valued it would be *Wikidata*-item valued - perhaps a URL value could be used as a stopgap?
    4. Main subject items - I don't think it makes sense to copy these, so this property will also need to be *Wikidata*-item valued.
  3. Other aspects of wikibase infrastructure (search engine, query service server) will need to be set up with the new data, and their configuration for this instance may need to be somewhat different from that of the main Wikidata site.
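
To make step 2 concrete, here is a minimal, hypothetical sketch of copying a single article item. It assumes Federated Properties (so property ids carry over unchanged), a placeholder API URL for the new wikibase, an already-authenticated requests session with a CSRF token, and a pre-built mapping for the item-valued targets (journals, publishers, etc.) that are themselves copied. Qualifiers, references, sitelinks, and the *Wikidata*-item-valued stopgaps for authors and main subjects are deliberately left out.

    # Hypothetical sketch of initialization step 2: copy one scholarly article
    # item from Wikidata into the new wikibase. Item-valued statements are kept
    # only when their target has itself been copied (ITEM_MAP); all other
    # item-valued statements (authors, main subjects, ...) are skipped here.
    import json
    import requests

    WIKIDATA_API = "https://www.wikidata.org/w/api.php"
    NEW_WIKIBASE_API = "https://metadata.example.org/w/api.php"  # placeholder
    ITEM_MAP = {}  # Wikidata QID -> local QID for journals, publishers, etc.

    def fetch_item(qid):
        r = requests.get(WIKIDATA_API, params={
            "action": "wbgetentities", "ids": qid,
            "props": "labels|descriptions|claims", "format": "json"})
        r.raise_for_status()
        return r.json()["entities"][qid]

    def rewrite_claims(claims):
        out = {}
        for pid, statements in claims.items():
            kept = []
            for st in statements:
                snak = st["mainsnak"]
                if snak.get("datatype") == "wikibase-item":
                    if snak["snaktype"] != "value":
                        continue
                    local_qid = ITEM_MAP.get(snak["datavalue"]["value"]["id"])
                    if local_qid is None:
                        continue  # needs a Wikidata-valued property instead
                    snak["datavalue"]["value"] = {"entity-type": "item",
                                                  "numeric-id": int(local_qid[1:]),
                                                  "id": local_qid}
                st.pop("id", None)          # let the new wikibase assign statement ids
                st.pop("references", None)  # not handled in this sketch
                st.pop("qualifiers", None)
                st.pop("qualifiers-order", None)
                kept.append(st)
            if kept:
                out[pid] = kept
        return out

    def copy_item(qid, session, csrf_token):
        src = fetch_item(qid)
        data = {"labels": src["labels"],
                "descriptions": src["descriptions"],
                "claims": rewrite_claims(src["claims"])}
        return session.post(NEW_WIKIBASE_API, data={
            "action": "wbeditentity", "new": "item", "data": json.dumps(data),
            "token": csrf_token, "format": "json"}).json()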

Synchronization

Until a set cut-over date (when scholarly article items will be removed from Wikidata in favor of the new wikibase), we should probably make the new instance read-only and regularly synchronize it with data from Wikidata. A tool would need to be developed to update the initialized data as changes are made on the Wikidata side, or perhaps simply to reinitialize it at regular intervals. Incremental updating may be preferable, however, if the timescale and effort involved in initializing are very large.
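
A minimal sketch of what such an incremental synchronization tool could look like, assuming it polls Wikidata's recent changes API (an EventStreams-based listener would be another option); ITEM_MAP and update_local_copy() are placeholders standing in for the mapping table and a re-copy routine like the initialization sketch above.

    # Hypothetical incremental sync: poll Wikidata's recent changes since the
    # last sync timestamp and refresh any changed item that we already mirror.
    import requests

    WIKIDATA_API = "https://www.wikidata.org/w/api.php"
    ITEM_MAP = {}  # Wikidata QID -> local QID, as in the initialization sketch

    def update_local_copy(qid):
        """Placeholder: re-run the copy/initialize step for this item."""
        pass

    def changed_items_since(timestamp):
        """Yield ids of items edited since `timestamp` (e.g. "2023-06-01T00:00:00Z")."""
        params = {"action": "query", "list": "recentchanges",
                  "rcnamespace": 0,            # the item namespace on Wikidata
                  "rcend": timestamp,          # stop once we reach the last sync point
                  "rcprop": "title|timestamp",
                  "rclimit": "max", "format": "json"}
        while True:
            data = requests.get(WIKIDATA_API, params=params).json()
            for rc in data["query"]["recentchanges"]:
                yield rc["title"]              # e.g. "Q13442814"
            if "continue" not in data:
                break
            params.update(data["continue"])

    def sync(last_sync_timestamp):
        for qid in set(changed_items_since(last_sync_timestamp)):
            if qid in ITEM_MAP:                # only items we have initialized
                update_local_copy(qid)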

Cutover

At cut-over the new wikibase is opened for updates. Some scholarly article items that have been copied will be retained in Wikidata, while others will be removed in favor of the new wikibase. What are the criteria for this separation?

Option B

Option A focuses on creating a second copy of the data. This would create a large volume of duplicate data while the original Wikidata items continue to exist. As the Wikidata equivalents drift, the other wiki would need to be updated as well, and this would be a slow process owing to the number of items and the amount of time it takes to update a Wikibase in general.

Federation Option B creates a separate Wikibase, the "Wikibase of Metadata", as in Option A. The Wikidata RDF graph is synchronized with that of the Wikibase of Metadata, allowing both datasets to be queried from the same Blazegraph query service without going through the intermediary of MediaWiki. Entities on the Wikibase of Metadata are mapped to Wikidata entities via a property "exact equivalent Wikidata entity," allowing the two graphs to map onto each other. This allows the Wikibase of Metadata to use Wikidata as a referential starting point while adding new items not already in Wikidata, or additional statements about Wikidata items without directly editing those items.
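
For illustration, a query under this model might look like the sketch below. The local query-service URL, the local RDF prefix (mwdt:), and the id of the mapping property (assumed here to be local P2, a URL-type property whose value is the Wikidata entity URI, so that it joins directly in RDF) are placeholders, not decided names.

    # Hypothetical Option B query: both RDF graphs sit in the same Blazegraph,
    # so one SPARQL query can hop from a local item to its Wikidata equivalent.
    import requests

    LOCAL_QUERY_SERVICE = "https://query.metadata.example.org/sparql"  # placeholder

    QUERY = """
    PREFIX mwdt: <https://metadata.example.org/prop/direct/>
    PREFIX wdt:  <http://www.wikidata.org/prop/direct/>

    SELECT ?localArticle ?wikidataItem ?journal WHERE {
      ?localArticle mwdt:P2 ?wikidataItem .   # "exact equivalent Wikidata entity" (assumed P2)
      ?wikidataItem wdt:P1433 ?journal .      # "published in", from the synced Wikidata graph
    }
    LIMIT 10
    """

    resp = requests.get(LOCAL_QUERY_SERVICE, params={"query": QUERY, "format": "json"})
    resp.raise_for_status()
    for row in resp.json()["results"]["bindings"]:
        print(row["localArticle"]["value"], "->", row["wikidataItem"]["value"])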

Using federated properties sounds nice in theory, but it is still under development and it adds complexity. It is easier to maintain separate properties, copy the ones we need, and map them. Mapping of properties across Wikibases would work the same way it does for items.
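
A hypothetical sketch of that copy-and-map step for a single property (the API URL, session handling, and token are placeholders, as elsewhere on this page):

    # Copy a property definition from Wikidata to the new wikibase and return
    # the (Wikidata PID, local PID) pair that the mapping table should record.
    import json
    import requests

    WIKIDATA_API = "https://www.wikidata.org/w/api.php"
    NEW_WIKIBASE_API = "https://metadata.example.org/w/api.php"  # placeholder

    def copy_property(wd_pid, session, csrf_token):
        src = requests.get(WIKIDATA_API, params={
            "action": "wbgetentities", "ids": wd_pid,
            "props": "labels|descriptions|datatype", "format": "json"}).json()["entities"][wd_pid]
        data = {"labels": src["labels"],
                "descriptions": src["descriptions"],
                "datatype": src["datatype"]}
        resp = session.post(NEW_WIKIBASE_API, data={
            "action": "wbeditentity", "new": "property",
            "data": json.dumps(data), "token": csrf_token, "format": "json"}).json()
        return wd_pid, resp["entity"]["id"]  # e.g. ("P50", "P7")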

To make this Wikibase easier to use, extensions could be developed to pull "shadow statements" live from Wikidata for display in the browser, as well as an interface widget to quickly convert between Wikidata and local QIDs. This would allow users to refer to the contents of Wikidata without needing multiple browser tabs, while not creating a redundant permanent copy that would need to be updated at a later date; the "redundant permanent copy" is handled directly through Blazegraph.
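
The lookup behind such a "shadow statements" view could work roughly as in the sketch below: resolve the local item's mapped Wikidata entity, then pull that entity's live claims for display. The local API URL and mapping property id (P2) are placeholders, and a real gadget would presumably do this client-side rather than in Python.

    # Hypothetical shadow-statement lookup: follow the "exact equivalent
    # Wikidata entity" mapping and fetch the live Wikidata claims.
    import requests

    LOCAL_API = "https://metadata.example.org/w/api.php"  # placeholder
    WIKIDATA_API = "https://www.wikidata.org/w/api.php"
    MAPPING_PID = "P2"  # "exact equivalent Wikidata entity" (assumed id)

    def wikidata_equivalent(local_qid):
        ent = requests.get(LOCAL_API, params={
            "action": "wbgetentities", "ids": local_qid,
            "props": "claims", "format": "json"}).json()["entities"][local_qid]
        claims = ent.get("claims", {}).get(MAPPING_PID, [])
        if not claims:
            return None
        value = claims[0]["mainsnak"]["datavalue"]["value"]  # entity URI or bare QID
        return value.rsplit("/", 1)[-1]                      # -> "Q13442814"

    def shadow_statements(local_qid):
        wd_qid = wikidata_equivalent(local_qid)
        if wd_qid is None:
            return {}
        return requests.get(WIKIDATA_API, params={
            "action": "wbgetclaims", "entity": wd_qid,
            "format": "json"}).json().get("claims", {})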

This is not to say that we can't do a mass migration of content from Wikidata to this new Wikibase. Rather, this gives us flexibility in how we implement it and a usable query service in the meantime. If a decision is made to permanently relocate an item from Wikidata to the Wikibase of Metadata, the data could be migrated automatically from Wikidata on request, as the properties and items will already be mapped. Then, once the Wikidata item is deleted, the "exact equivalent Wikidata entity" statement on the new local item is marked as deprecated, preserving the mapping while also indicating it is no longer active.
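
The final deprecation step could be scripted roughly as follows, again with a placeholder API URL, mapping property id, and token handling; rank changes go through the standard wbsetclaim action.

    # Hypothetical post-migration step: once the Wikidata item has been deleted,
    # mark the local mapping statement as deprecated so the mapping is preserved
    # but flagged as no longer active.
    import json
    import requests

    LOCAL_API = "https://metadata.example.org/w/api.php"  # placeholder
    MAPPING_PID = "P2"  # "exact equivalent Wikidata entity" (assumed id)

    def deprecate_mapping(local_qid, session, csrf_token):
        claims = session.get(LOCAL_API, params={
            "action": "wbgetclaims", "entity": local_qid,
            "property": MAPPING_PID, "format": "json"}).json().get("claims", {})
        for claim in claims.get(MAPPING_PID, []):
            claim["rank"] = "deprecated"
            session.post(LOCAL_API, data={
                "action": "wbsetclaim", "claim": json.dumps(claim),
                "summary": "Wikidata item deleted after migration to the Wikibase of Metadata",
                "token": csrf_token, "format": "json"})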

Analysis

One hypothetical cost proposal is as follows. Assume a usage model where a relatively small group of people will be hammering the servers with read and write requests. At minimum, we will need two database servers: one primary and one replica. The largest consumer of resources will be Blazegraph, especially if a local copy of the Wikidata Query Service is hosted. Assuming Blazegraph and MariaDB (for MediaWiki) can share a (physical) server, two dedicated servers, each with 256 GB of RAM, will suffice. In addition, an app server will be needed for MediaWiki and the other services supporting it. That server could have as little as 64 GB of RAM. Two database servers and one app server doubling as a data dump server cost around 560€ per month on Hetzner.

However, simply to get started, James Hare has offered to host a prototype Wikibase on hardware he already owns.