Wikidata/Notes/Change propagation

From Meta, a Wikimedia project coordination wiki

This page is only a draft and is not meant to represent the definitive architecture.

This page describes how Wikidata propagates changes to client wikis.

Overview

  • Every change to the repository's databases is recorded in the recent changes of each affected client wiki (e.g. the Wikipedias).
  • Dispatcher scripts periodically poll for recent changes.
  • Each client wiki is notified of changes to the Wikidata database via an entry in its job queue. The posted jobs are used to invalidate and re-render the affected pages.
  • Each change also becomes visible on the recent changes special page of the affected client wiki.
  • To reduce server load, multiple consecutive edits by the same user to the same Wikidata item can be coalesced into a single edit.

Assumptions and terminology

The data managed by the Wikibase repository is structured into (data) entities. Every entity is maintained as a wiki page containing structured data. There are several types of entities, but one is particularly important in this context: items. Items are special in that they are linked with article pages on each client wiki (e.g., each Wikipedia). For more information, see the data model primer.

The propagation mechanism is based on the assumption that each data item on the Wikidata repository has at most one site link to each client wiki, and that only one item on the repository can link to any given page on a given client wiki. That is, any page on any client wiki can be associated with at most one data item on the repository.

(See comment on discussion page about consequences of limiting change propagation to cases where Wikipedia page and Wikidata item have a 1:1 relation)
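The uniqueness assumption above can be sketched as a simple invariant check. This is an illustrative model only (class and method names such as SiteLinkTable are hypothetical, not Wikibase API): each item holds at most one site link per client wiki, and each client page is linked from at most one item.

```python
class SiteLinkConflict(Exception):
    """Raised when the one-item-per-page invariant would be violated."""


class SiteLinkTable:
    """Hypothetical model of the repo's site-link bookkeeping."""

    def __init__(self):
        # (site_id, page_title) -> item_id
        self._links = {}

    def add_link(self, item_id, site_id, page_title):
        key = (site_id, page_title)
        existing = self._links.get(key)
        if existing is not None and existing != item_id:
            # A second item may not claim a page already linked to another item.
            raise SiteLinkConflict(
                f"{page_title} on {site_id} is already linked to {existing}")
        self._links[key] = item_id

    def item_for_page(self, site_id, page_title):
        # Any client page is associated with at most one data item.
        return self._links.get((site_id, page_title))
```

Because of this invariant, a change to an item can be routed to the (single) affected page on each client wiki by a plain lookup.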

This mechanism also assumes that all wikis, the repository and the clients (i.e. Wikidata and the Wikipedias), can connect directly to each other's databases. Typically, this means that they reside in the same local network. However, the wikis may use separate database servers: wikis are grouped into sections, where each section has one master database and potentially many slave databases (together forming a database cluster).

Communication between the repository (Wikidata) and the clients (Wikipedias) is done via an update feed. For now, this is implemented as a database table (the changes table) which is accessed by the dispatcher scripts directly, using the "foreign database" mechanism.

Support for third-party clients, that is, client wikis and other consumers outside of Wikimedia, is not essential at this stage and will not be implemented for now. It shall, however, be kept in mind in all design decisions.

Recording changes

Every change performed on the repository is logged into a table (the "changes table", namely wb_changes) in the repo's database. The changes table behaves similarly to MediaWiki's recentchanges table in that it only holds changes for a certain time (e.g. a day or a week); older entries get purged periodically. Unlike the recentchanges table, however, wb_changes contains all information necessary to report and replay the change on a client wiki: besides information about when the change was made and by whom, it contains a structural diff against the entity's previous revision.
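The structural diff stored with each change can be pictured as a list of per-field operations. This is a simplified sketch, not the actual Wikibase diff format: entity revisions are modeled as flat dicts, and each operation records what a client must do to replay the change without refetching the whole entity.

```python
def structural_diff(old, new):
    """Return a list of (op, field, value) operations turning old into new.

    Simplified model: entity revisions are flat field->value dicts.
    """
    ops = []
    for field in sorted(set(old) | set(new)):
        if field not in old:
            ops.append(("add", field, new[field]))
        elif field not in new:
            ops.append(("remove", field, old[field]))
        elif old[field] != new[field]:
            ops.append(("change", field, new[field]))
    return ops
```

Storing the diff (rather than just a "something changed" flag) is what lets client wikis update local caches and recent changes entries without a round trip back to the repository.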

Effectively, the changes table acts as an update feed. Care shall be taken to isolate the database table as an implementation detail from the update feed, so it can later be replaced by an alternative mechanism, such as PubHub or an event bus. Note however that a protocol with queue semantics is not appropriate (it would require one queue per client).

Dispatching changes

Changes on the repository (e.g. wikidata.org) are dispatched to client wikis (e.g. Wikipedias) by a dispatcher script. This script polls the repository's wb_changes table for changes, and dispatches them to the client wikis by posting the appropriate jobs to the client's job queue.

The dispatcher script is designed in a way that allows any number of instances to run and share load without any prior knowledge of each other. They are coordinated via the repository's database using the wb_changes_dispatch table:

  • chd_client: the client's database name (primary key).
  • chd_latest_change: the ID of the last change that was dispatched to the client.
  • chd_touched: a timestamp indicating when updates have last been dispatched to the client.
  • chd_lock_name: the name of the global lock used by the dispatcher currently updating that client (or NULL).

The dispatcher operates by going through the following steps:

  1. Lock and initialize
    1. Choose a client to update from the list of known clients.
    2. Start DB transaction on repo's master database.
    3. Read the given client's row from wb_changes_dispatch (if missing, assume chd_latest_change = 0).
    4. If chd_lock_name is not null, call IS_FREE_LOCK(chd_lock_name) on the client's master database.
    5. If that returns 0, another dispatcher is holding the lock. Exit (or try another client).
    6. Decide on a lock name (dbname.wb_changes_dispatch.client or some such) and use GET_LOCK() to grab that lock on the client's master database.
    7. Update the client's row in wb_changes_dispatch with the new lock name in chd_lock_name.
    8. Commit DB transaction on repo's master database.
  2. Perform the dispatch
    1. Get n changes with IDs > chd_latest_change from wb_changes in the repo's database. n is the configured batch size.
    2. Filter changes for those relevant to this client wiki (optional, and may prove tricky in complex cases, e.g. cached queries).
    3. Post the corresponding change notification jobs to the client wiki's job queue.
  3. Log and unlock
    1. Start DB transaction on repo's master database.
    2. Update the client's row in wb_changes_dispatch with chd_lock_name=NULL and updated chd_latest_change and chd_touched.
    3. Call RELEASE_LOCK() to release the global lock we were holding.
    4. Commit DB transaction on repo's master database.

This can be repeated multiple times by one process, with a configurable delay between runs.
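The lock-dispatch-unlock cycle above can be sketched as follows. This is a minimal, self-contained model: the LockManager stands in for MySQL's IS_FREE_LOCK()/GET_LOCK()/RELEASE_LOCK() on the client's master database, and post_jobs stands in for posting to the MediaWiki job queue; the real implementation runs SQL transactions against live databases.

```python
class LockManager:
    """Mimics MySQL named locks: at most one holder per lock name."""

    def __init__(self):
        self._held = set()

    def is_free(self, name):      # IS_FREE_LOCK(name)
        return name not in self._held

    def get_lock(self, name):     # GET_LOCK(name) with zero timeout
        if name in self._held:
            return False
        self._held.add(name)
        return True

    def release(self, name):      # RELEASE_LOCK(name)
        self._held.discard(name)


def dispatch_to_client(client, state, changes, locks, post_jobs,
                       batch_size=100):
    """One dispatch pass for one client.

    state models the client's wb_changes_dispatch row
    (chd_latest_change, chd_lock_name); changes models wb_changes.
    """
    # 1. Lock and initialize.
    lock_name = f"repo.wb_changes_dispatch.{client}"
    if state.get("chd_lock_name") and not locks.is_free(state["chd_lock_name"]):
        return False              # another dispatcher holds the lock
    if not locks.get_lock(lock_name):
        return False
    state["chd_lock_name"] = lock_name

    # 2. Perform the dispatch: at most batch_size pending changes.
    pending = [c for c in changes if c["id"] > state["chd_latest_change"]]
    batch = pending[:batch_size]
    if batch:
        post_jobs(client, batch)
        state["chd_latest_change"] = batch[-1]["id"]

    # 3. Log and unlock.
    state["chd_lock_name"] = None
    locks.release(lock_name)
    return True
```

Because the lock is taken on the client's database and the bookkeeping lives in one row per client, any number of dispatcher instances can run concurrently: a second instance simply fails to acquire the lock and moves on to another client.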

Change notification jobs

The dispatcher posts change notification jobs to the client wiki's job queue. These jobs contain a list of Wikidata changes. When processing such a job, the client wiki performs the following steps:

  1. If the client maintains a local cache of entity data, update it.
  2. Find which pages need to be re-rendered after the change. Invalidate them and purge them from the web caches. Optionally, schedule re-render (or link update) jobs, or even re-render the page directly.
  3. Find which pages have changes that do not need re-rendering of content, but influence the page output, and thus need purging of the web caches (this may at some point be the case for changes to language links).
  4. Inject notifications about relevant changes into the client's recentchanges table. For this, consecutive edits by the same user to the same item can be coalesced.
  5. Possibly also inject a "null-entry" into the respective pages' history, i.e. the revision table.
(See comment on discussion page about recentchanges versus history table)
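The client-side processing can be sketched like this. All collaborators here (entity_cache, pages_using, invalidate_page, inject_recentchanges) are hypothetical stand-ins for the client wiki's real subsystems; the sketch covers steps 1-4 above and omits the optional history null-entry.

```python
def handle_change_notification_job(changes, entity_cache, pages_using,
                                   invalidate_page, inject_recentchanges):
    """Process one change notification job on a client wiki (sketch)."""
    for change in changes:
        item_id = change["item_id"]
        # 1. Update the local entity cache, if the client maintains one.
        if entity_cache is not None:
            entity_cache[item_id] = change["new_data"]
        # 2./3. Invalidate every page that uses this item and purge it
        # from the web caches (modeled here by a single callback).
        for page in pages_using(item_id):
            invalidate_page(page)
        # 4. Inject a notification into the client's recentchanges table.
        inject_recentchanges(change)
```

Keeping these steps in a job (rather than doing them during dispatch) lets the client wiki absorb bursts of repository edits at its own pace.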

Coalescing changes

The mechanism described above causes numerous database transactions for every change made on Wikidata, and potentially frequent reads of the data, depending on what is needed to render the page. And this happens on every client wiki, of which there may be more than a hundred. Since edits to the Wikibase database tend to be fine-grained (for example, setting an item's description or a single link), this can put considerable load on the servers. Coalescing changes can mitigate this problem:

As explained in the section on dispatching changes, entries from the job list are processed in batches; by default, a batch contains no more than 100 entries.

If several changes affecting the same Wikidata item are to be processed in the same batch, they can be coalesced, provided they were made by the same user. This reduces the number of database transactions, since the coalesced changes trigger only a single transaction on the client. The process can be fine-tuned by adjusting the batch size and the delay between the processing of individual batches.
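Coalescing within one batch can be sketched as follows. The change records here (id, item_id, user, diff) are a simplified model of wb_changes rows: consecutive changes to the same item by the same user collapse into one change carrying the combined diff.

```python
def coalesce_changes(batch):
    """Collapse consecutive same-user, same-item changes in a batch (sketch)."""
    coalesced = []
    for change in batch:
        if (coalesced
                and coalesced[-1]["item_id"] == change["item_id"]
                and coalesced[-1]["user"] == change["user"]):
            # Merge into the previous change: concatenate the diffs and
            # advance the change ID to the newest edit in the run.
            coalesced[-1]["diff"] = coalesced[-1]["diff"] + change["diff"]
            coalesced[-1]["id"] = change["id"]
        else:
            coalesced.append(dict(change))
    return coalesced
```

Each coalesced entry then produces a single client-side transaction and a single recent changes notification instead of one per edit.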