Talk:Wikidata/Notes/Change propagation

From Meta, a Wikimedia project coordination wiki

Here are a few comments from a user's and robot writer's PoV:

  1. any page on any client wiki can be associated with at most one data item on the repository. - I admit I don't know exactly how this association is made. But if it can be created/modified by users, you can be 99.999% sure this assumption will fail.
  2. these changes can be coalesced together (at least if they were all performed consecutively by the same user) - I don't see a need to limit the grouping by user; for instance, images from commons change only when the page is purged/edited, without any link to how many times the image has been updated on commons.
  3. it isn't clear to me whether polling would work at an item level or lower (e.g. if the birth date of a person is updated, is the whole page invalidated?)
  4. I don't see any provision for requesting updates in real time. This will become important once Wikidata serves infobox-filling data. Let's say that we have a huge update on Wikidata (like modifying the nationality of thousands of people). The last change will only be polled after several dozen seconds, if not minutes, plus the time needed to execute the queue. If somebody decided to edit the page during this time, they would need to see the updated version in order to have a reliable result. This would cause another foreign database read and MUST also clear the job associated with that particular item.--Strainu (talk) 20:22, 18 October 2012 (UTC)
Hi Strainu, thanks for your feedback.
ad 1: the 1-to-1 relationship between wikipedia pages and wikidata items is maintained in the wikidata item itself, and is enforced by the software. Attempting to link to the same wikipedia page from another data item will cause an error.
ad 2: note that these changes will be visible in watchlists, etc. If we group together changes by multiple users, this may become hard to read (we would have to link to multiple users as authors in a single row of the watchlist feed). That's why I'd prefer to only group edits by the same user, at least for now.
ad 3: if any aspect of an item changes, any page using that item is invalidated and re-rendered. Remember that the page may contain conditionals like {{#if:...}} that depend on stuff from the data item, so anything on the page may change when the item changes. So it needs to be fully re-rendered.
ad 4: you are right, we need a way to request an update right now, especially if the wikidata item was edited directly on the wikipedia page. How this will work depends on how exactly the caching is done. We are currently considering ditching the local item cache and always fetching the item data from the primary copy in the repository - in that case, all that is needed for an update is a regular purge of the page.
-- Daniel Kinzler (WMDE) (talk) 16:07, 28 October 2012 (UTC)
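The grouping rule discussed under "ad 2" (coalesce only consecutive changes to an item by the same user) can be illustrated with a minimal sketch. The `Change` record and the `coalesce` function are hypothetical names made up for this illustration, not part of Wikibase, and the sketch assumes that a coalesced entry should carry the timestamp of the last change in its run.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Change:
    item_id: str    # hypothetical item identifier
    user: str       # author of the change on the repository
    timestamp: int  # any monotonically increasing value works here


def coalesce(changes):
    """Merge runs of consecutive changes to one item by one user.

    Each run becomes a single Change that carries the timestamp of the
    last change in the run, so every watchlist row has exactly one author.
    """
    # Order by item, then by time, so that runs are adjacent.
    ordered = sorted(changes, key=lambda c: (c.item_id, c.timestamp))
    merged = []
    # groupby only groups adjacent entries, which is exactly the
    # "performed consecutively by the same user" condition.
    for (item_id, user), run in groupby(ordered, key=lambda c: (c.item_id, c.user)):
        last = list(run)[-1]
        merged.append(Change(item_id, user, last.timestamp))
    return merged
```

With this rule, edits by user A that are interrupted by an edit from user B stay in separate rows, which is the readability trade-off described above.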
ad Daniel: I thought that for the purpose of infoboxes (not the interlanguage links) it was decided that Wikipedia pages must be able to link to several Wikidata pages. Example: A Wikipedia page has several infoboxes for a car model that exists in several variants or styles, with quite different properties. Each car variant or model run (1991-1995 versus 1995-1998 etc.) would be a Wikidata page, and on some Wikipedias these are represented by a page each, while on others there is only a summary page. The Wikipedia page that combines multiple models would have a 1:n relationship with Wikidata items for the purpose of change propagation. Note that this is not a discussion about the semantic linking of Wikidata to Wikipedia (which is ok), but I fear that the semantic association between Wikidata and Wikipedia is the wrong place to associate change propagation with. --G.Hagedorn (talk) 13:08, 2 November 2012 (UTC)

"and that only one item on the repository can link to any given page on a given client wiki". Why is this assumed? Also, why does each wiki need a local DB cache? That would certainly need the external storage URL reuse (proposed here) if used at all. 02:20, 1 November 2012 (UTC)

First of all, note that a new version of the draft is now online.
  • A one-to-one relationship between data items and Wikipedia pages of a given language is built in by design. It's even in the original proposal. This is of course pretty restrictive, but also provides several advantages. Anyway, it's too late now to change it, and this is not the place to discuss it.
  • Regarding the local URL cache: a bit of refactoring is needed to make stuff in the External Store accessible from another wiki. I have included this possibility in the new version of the draft. A local cache of the item data is not strictly necessary, but we do need a local copy of the item-to-page mapping (the sitelinks table) so we can join it against the page table. That provides an efficient way to determine which items are used where on which wiki.
Daniel Kinzler (WMDE) (talk) 21:29, 1 November 2012 (UTC)
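The join of a local sitelinks copy against the page table might look roughly like the following sketch. It uses an in-memory SQLite database only to stay self-contained; the column names (`item_id`, `site_id`, `page_title`), the two-column `page` table, and the function `pages_using_items` are assumptions made for illustration and do not reflect the actual Wikibase or MediaWiki schema.

```python
import sqlite3


def pages_using_items(conn, site_id, item_ids):
    """Return (page_id, page_title, item_id) rows for one client wiki.

    Joins the local copy of the item-to-page mapping against the page
    table to find the local pages that use any of the given items.
    """
    if not item_ids:
        return []
    placeholders = ",".join("?" for _ in item_ids)
    query = (
        "SELECT p.page_id, p.page_title, s.item_id "
        "FROM sitelinks AS s "
        "JOIN page AS p ON p.page_title = s.page_title "
        f"WHERE s.site_id = ? AND s.item_id IN ({placeholders}) "
        "ORDER BY p.page_id"
    )
    return conn.execute(query, [site_id, *item_ids]).fetchall()


def demo():
    """Build a tiny hypothetical schema and look up one changed item."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT UNIQUE);
        CREATE TABLE sitelinks (
            item_id TEXT, site_id TEXT, page_title TEXT,
            UNIQUE (item_id, site_id),    -- one link per item and wiki
            UNIQUE (site_id, page_title)  -- one item per page and wiki
        );
        INSERT INTO page VALUES (1, 'Berlin'), (2, 'Paris');
        INSERT INTO sitelinks VALUES
            ('q1', 'enwiki', 'Berlin'),
            ('q2', 'enwiki', 'Paris'),
            ('q1', 'dewiki', 'Berlin');
        """
    )
    return pages_using_items(conn, "enwiki", ["q1"])
```

The two `UNIQUE` constraints in the sketch mirror the one-to-one assumption from the first bullet; without them the join could return several items per page.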

Consequences of limiting change propagation to use cases with a 1:1 relation between Wikipedia page and Wikidata item

Daniel writes "The propagation mechanism is based on the assumption that each data item on the Wikidata repository has at most one site link to each client wiki, and that only one item on the repository can link to any given page on a given client wiki. That is, any page on any client wiki can be associated with at most one data item on the repository." (2012-11-07).

While this condition is easily fulfilled in the case of interlanguage links, it allows only a subset of data to become Wikidata-derived. Where minor variants of a concept or thing exist, one Wikipedia will aggregate them on a single page while another will create separate pages. Only the latter Wikipedia can use Wikidata for such a page. For example, on automobile pages models are routinely grouped in the English Wikipedia (, etc.). The multiple infoboxes either cannot derive their content from Wikidata items that are model-specific (different models have different lengths, etc.) and which may or may not be identified as a Wikipedia page in some other language version, or at least changes remain invisible. Perhaps even more frequently, tabular data (e.g. are not supported. --G.Hagedorn (talk) 14:30, 7 November 2012 (UTC)
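To make the quoted assumption concrete, here is a minimal sketch of a registry that enforces both halves of it: at most one site link per item and client wiki, and at most one item per client page. `SiteLinkRegistry` and `SiteLinkConflict` are hypothetical names for this illustration only; the real enforcement lives in the repository software and may work differently.

```python
class SiteLinkConflict(Exception):
    """Raised when a link would violate the one-to-one assumption."""


class SiteLinkRegistry:
    def __init__(self):
        self._page_to_item = {}  # (site_id, page_title) -> item_id
        self._item_to_page = {}  # (item_id, site_id) -> page_title

    def link(self, item_id, site_id, page_title):
        """Link an item to a page, rejecting any second association."""
        page_key = (site_id, page_title)
        item_key = (item_id, site_id)
        owner = self._page_to_item.get(page_key)
        if owner is not None and owner != item_id:
            raise SiteLinkConflict(
                f"{site_id}:{page_title} is already linked from {owner}"
            )
        existing = self._item_to_page.get(item_key)
        if existing is not None and existing != page_title:
            raise SiteLinkConflict(
                f"{item_id} already links to {site_id}:{existing}"
            )
        self._page_to_item[page_key] = item_id
        self._item_to_page[item_key] = page_title

    def item_for_page(self, site_id, page_title):
        """Return the single item associated with a page, or None."""
        return self._page_to_item.get((site_id, page_title))
```

Under this model a summary page that covers several model variants can be registered for only one of the corresponding items; linking a second item to it raises `SiteLinkConflict`, which is the limitation described above.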

recentchanges versus revision table

As of 2012-11-07 the change propagation plan primarily focussed on recentchanges. However, the recentchanges table in MediaWiki is secondary information that is routinely rebuilt from the history (revision table) after software upgrades or import operations (an XML import from the command line requires rebuilding it). Injecting information only into recentchanges, but not into the page history, will break the assumption that recentchanges can be rebuilt. It should be a requirement that after a rebuildrecentchanges run the recent changes with respect to Wikibase-injected changes are the same. I believe the simplest way is to actually create Wikidata changes in the page history (since the text is saved elsewhere, the storage impact of creating a revision record is minimal), and to modify the history display such that these changes can be hidden, shown initially only in summarized form, etc.

Furthermore, I consider transparency of changes a prime wiki feature that enables massive collaboration. The diffs of data changes should therefore be accessible for prolonged periods. Some pages may be scrutinized only after the information in recentchanges has expired. Only entries in the revision history will make this possible. These entries should have the following properties:

  • The Wikidata-derived revisions are null edits in the sense that they create a dated and authored revision, but leave the text content unchanged. This adds very little overhead to the database operation.
    • A consequence of being authored is that users who do not have the same account on the client wiki need to be marked with an appropriate interwiki prefix.
    • Where multiple Wikidata changes are aggregated, the date/time should be the time of the last change.
  • The Wikidata-derived revisions should carry a flag or tag that can be evaluated by version analysis and rendering.
    • For example the mediawiki core version history display should add css classes to Wikidata-derived revisions, allowing coloring or filtering
    • An option to hide Wikidata-derived revisions from display is desirable.
    • When a diff of the page content is requested, and the diff spans Wikidata-derived revisions, minimally a note should be displayed that additional changes have been made.
  • The Wikidata-derived revisions provide a link to the Wikidata repository that displays the data diff generated by Wikidata
    • This minimally could be an http link inside the comment field.
    • more desirable might be a separate revision metadata field, which could serve both as a flag and as the link to the wikidata diff.
    • In both cases, when displaying a page content diff, MediaWiki core can recognize Wikidata revisions, and either display the comments for easy access to the Wikidata diffs, or use the explicit metadata field to create appropriate links. The second option is slightly more flexible, at the cost of changing the database structure.

--G.Hagedorn (talk) 14:30, 7 November 2012 (UTC)
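The properties listed above could be modelled roughly as follows. This is only a sketch of the proposal, not existing MediaWiki behaviour: `RepoChange`, `NullRevision`, `to_null_revision`, the tag name `wikidata-derived`, the `prefix:user` form of the interwiki marker, and the example URLs are all hypothetical. As a simplification, an aggregated revision links to the diff of the last change only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepoChange:
    user: str
    timestamp: str  # ISO 8601 UTC strings compare chronologically
    diff_url: str   # link to the data diff on the repository


@dataclass(frozen=True)
class NullRevision:
    author: str
    timestamp: str
    tag: str                    # flag usable for filtering or CSS classes
    diff_url: str               # link to the repository diff
    text_changed: bool = False  # null edit: page text stays untouched


def to_null_revision(changes, local_users, repo_prefix="wikidata"):
    """Turn aggregated repository changes by one user into one null revision."""
    if not changes:
        raise ValueError("at least one change is required")
    if len({c.user for c in changes}) != 1:
        raise ValueError("only changes by a single user can be aggregated")
    # The revision is dated with the last of the aggregated changes.
    last = max(changes, key=lambda c: c.timestamp)
    # Users without a matching local account get an interwiki prefix.
    author = last.user if last.user in local_users else f"{repo_prefix}:{last.user}"
    return NullRevision(author, last.timestamp, "wikidata-derived", last.diff_url)
```

Because such a record would live in the page history rather than only in recentchanges, rebuilding recentchanges from the revision table would reproduce it.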

Polling to Dispatching

Does the change from polling to dispatching mean that only WMF-based mediawiki installations will be able to use Wikidata? --G.Hagedorn (talk) 15:29, 10 December 2012 (UTC)

I just talked to Daniel. The changes he made mean that we can now work with multiple client wikis as opposed to one. 3rd party wikis were not included before and are not included now. They're on the radar but honestly not on the roadmap at the moment because there are so many other things to do before that. --Lydia Pintscher (WMDE) (talk) 11:40, 11 December 2012 (UTC)