Wikidata/Notes/URI scheme/de

From Meta, a Wikimedia project coordination wiki

Wikidata will offer data for numerous items. According to Semantic Web standards, every item of interest should be identified by a URI. That URI should be different from the URI (or URL) of the different representations of documents describing the item.

Wikipedia today

Let us take a moment to look at the URL scheme that Wikipedia uses:

  1. is the URL of the article Erde in the German language (de)
  2. Between and there is a language link (interwiki), which is stated explicitly in the source text of both pages

Note that the titles of Wikipedia articles, and with them the URLs, can change (see also the study by Siorpaes & Hepp). If, for example, a Kevin Müller becomes Federal Chancellor one day, he will probably displace the footballer of the same name from the main title. Name changes, for example through marriage or through appointment as Pope, also lead to title changes. In most cases there will then be corresponding redirects that are easy for readers to find and understand, but the same is not necessarily true for machines.

How DBpedia does it

For comparison, this is how DBpedia solves the problem:

  1. is the URI for the item Germany
  2. is the HTML representation for the description about Germany
  3. offers the machine readable data about Germany, in a number of different formats, like RDF, JSON, Turtle, etc. Note that the format is specified by the suffix, i.e. .json, .rdf, .ttl, .ntriples, etc.
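The suffix convention in item 3 can be sketched as follows. This is only an illustration: the base URL `https://example.org/data/` is a hypothetical placeholder (the actual DBpedia addresses are not given in this note), and the media types paired with each suffix are the commonly registered ones, assumed here rather than taken from DBpedia.

```python
# Hypothetical base URL; a placeholder, not DBpedia's real address.
DATA_BASE = "https://example.org/data/"

# Suffixes named in the list above, each paired with a commonly used
# media type (an assumption made for illustration only).
SUFFIX_MEDIA_TYPES = {
    "json": "application/json",
    "rdf": "application/rdf+xml",
    "ttl": "text/turtle",
    "ntriples": "application/n-triples",
}


def data_url(item_name: str, suffix: str) -> str:
    """Build a format-specific data URL by appending the suffix to the item name."""
    if suffix not in SUFFIX_MEDIA_TYPES:
        raise ValueError(f"unsupported format suffix: {suffix}")
    return f"{DATA_BASE}{item_name}.{suffix}"


if __name__ == "__main__":
    for suffix, media_type in SUFFIX_MEDIA_TYPES.items():
        print(data_url("Germany", suffix), "->", media_type)
```

In such a scheme the format is fixed by the URL itself, so a client that only understands one serialization has to know which suffix to ask for.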

The name of the item is equivalent to the name of the article in the English Wikipedia. An effort to provide internationalized URIs in DBpedia is underway, by basically creating a multitude of language-specific URIs. There is a dedicated note with more details on the relationship between DBpedia and Wikidata.

There are plans for DBpedia to (additionally) use URIs that are derived not from the title of a page (which may change), but from its page ID, because such URIs are more stable. This approach may also work for Wikidata: since items are stored in MediaWiki, the page ID could be used as the item ID. Unique, stable, easily dereferenceable. Chrisahn (talk) 22:45, 10 April 2012 (UTC)

Points to consider

  1. URIs should refer unambiguously to one item
  2. URIs should be persistent
  3. URIs should be canonical within Wikimedia projects
  4. URIs in Wikimedia projects should not be based on a single language such as English
  5. URIs should be easy to handle
  6. URIs should not break caching

This list is sorted by importance.

A solution that is based solely on a label, and maybe an English label at that, is highly problematic. Why should the canonical URI for Rome be ? Why not ? And if it is the latter, what would the URI be for the Roma people? How should disambiguation be handled, and how can one disambiguate without relying on a language again? What happens if a label changes its meaning? All these problems disappear once a unique but inherently meaningless identifier is used. The disadvantages of such identifiers lie in their usability: they are not easy to write, to understand, or to remember. Tools could help with these problems, and indeed Wikidata plans to offer such tools.

Proposal for Wikidata

The following is a proposal for the Wikidata URI scheme:

  1. redirects to the English UI (HTML) of Wikidata about the item the English Wikipedia article of Germany is about. Note that English here and in 2 and 3 is just an example, it works for all other Wikipedia languages equivalently.
  2. is a non-persistent URI for the item the English Wikipedia article of Germany is about. This is a convenience URI that redirects to the actual URI.
  3. redirects to the machine-readable data about the item the English Wikipedia article of Germany is about, i.e. to, including a sameAs between URI 2 and URI 5.
  4. or or{sth}/q11867 is the wiki page about the item identified as q11867 (which, in this example, is Germany).
  5. is the persistent URI of the item identified as q11867.
  6. provides the data about the item (in several serialization formats, depending on the request header).
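The relationship between the convenience URIs (1 to 3) and the persistent URI (5) can be sketched as a lookup from a language and article title to an item ID. This is a minimal illustration under stated assumptions: the `SITELINKS` table, the function name, and the base `https://example.org/id/` are hypothetical; only the example ID q11867 comes from the list above.

```python
from typing import Optional

# Hypothetical site-link table: (language code, article title) -> item ID.
# Because it follows Wikipedia's titles, any entry may change over time,
# which is why URIs built from it are called non-persistent.
SITELINKS = {
    ("en", "Germany"): "q11867",
    ("de", "Deutschland"): "q11867",
}

# Hypothetical base for persistent item URIs (URI 5 in the list above).
ITEM_BASE = "https://example.org/id/"


def resolve_convenience(lang: str, title: str) -> Optional[str]:
    """Return the persistent item URI a convenience URI would redirect to.

    Returns None when no article with that title is known, in which case
    no convenience URI exists for the item.
    """
    item_id = SITELINKS.get((lang, title))
    if item_id is None:
        return None
    return ITEM_BASE + item_id


if __name__ == "__main__":
    # Titles in different languages resolve to the same persistent URI.
    print(resolve_convenience("en", "Germany"))
    print(resolve_convenience("de", "Deutschland"))
```

If an article is renamed, only the key in the lookup table changes; the item ID, and therefore the persistent URI, stays the same.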

Questions and comments

The proposed solution raises a number of questions and issues, and some points need clarification.

  1. What about entities that do not have a Wikipedia article (yet)? (with respect to URIs 1-3)
    • Well, they do not have convenient URIs (yet).
      • We have labels for each concept, in different languages. We could use those for a lookup. We could even list all pages that have the same label, along with their definitions, providing an automatic disambiguation page. -- Duesentrieb 13:40, 22 December 2011 (UTC)
      • The relation between concept labels and interwiki links (titles on Wikipedia) should be clarified. I suggest using "Foo" as the en label if the page on en is called "Foo" or "Foo (xxx)". -- Duesentrieb 13:40, 22 December 2011 (UTC)
      • If at some point we model Wiktionary, we would have another source of term/meaning mappings. -- Duesentrieb 13:40, 22 December 2011 (UTC)
  2. Should URIs 1 and 3 redirect or should they just return the content?
    • Real redirects seem better from an educational point of view
  3. What will be the content of the browser address line in case of entering URI 1?
    • It should be URI 4
  4. About URI 6: is this sufficient for the different representation formats, or should there be different URIs for RDF, JSON, etc.?
    • Encoding the content type in the URI is a bad idea because it makes linking much harder. Assume there is an application that can handle RDF N3 triples but not JSON, and that this application finds a link to a "JSON resource" somewhere on the web. It cannot extract any information from that resource, even though the server is capable of generating N3 triples. The content type should only be encoded in the HTTP header.
      • But many examples like DBpedia do exactly this. Also, one can use URI 5 and content negotiation over that. --Denny Vrandečić (WMDE) (talk) 21:50, 3 April 2012 (UTC)
        • Yes, you're right. Many applications use different URIs for different MIME types. But that doesn't make a wrong thing right. One of the main principles of the architecture of the WWW is that the client can choose an appropriate format of the resource for its needs. Using different URIs makes this practically impossible.
          • There would be a content-negotiated identifier, #5, and this can be used. #4 and #6 are merely direct links to these content-negotiated resources. --17:14, 10 April 2012 (UTC)
            • That looks like a solution to this problem, but it isn't. Assume that I found the resource with URI #4 and want to link to it. To do this correctly I have to know that this URI doesn't support content negotiation (and of course I have to know what content negotiation is) and that it is therefore not a good idea to use this URI for linking. Furthermore, I have to know that the URI which is suitable for linking is #5. So I have to read the Wikidata manual to create a proper link. And imagine I'm a machine: someone has to change my algorithms to create proper links to Wikidata. In other words, links with URI #4 or #6 will spread like a virus and will leave content negotiation mostly not working.
        • DBpedia uses content negotiation too. When accessed by a browser, redirects to , but for tools that send other Accept headers, the server responds with other redirects (I think). Chrisahn (talk) 22:50, 10 April 2012 (UTC)
    • RDF and JSON may turn out to be sufficiently different.
    • Note that we further have two levels of provenance data, too, that need to be published somewhere.
      • More generally, we need output options: just saying "RDF" (and perhaps XML or N3 or Turtle) isn't enough. -- Duesentrieb 13:40, 22 December 2011 (UTC)
        • E.g. we want to specify whether property values should be reified to allow the inclusion of value qualifiers in the output. -- Duesentrieb 13:40, 22 December 2011 (UTC)
        • Or perhaps only the label and description in one specific language are desired (e.g. for display via Ajax). -- Duesentrieb 13:40, 22 December 2011 (UTC)
          • The question here is: should the information be delivered in a coarse-grained or a fine-grained manner? If a client is interested in the population of Berlin, should it be possible to ask the server for that specific integer, or is it necessary to download all the facts about Berlin and then extract the population on its own? I prefer the second option because it makes loose coupling between client and server much easier (and because this is the way it is done on the web).
        • If it is really necessary to define output options, this can (and should) be done by defining a corresponding MIME type with such options. So the content type requested by a client could look like this: application/wikidata+xml;arg="foo";otherarg="bar". But this approach has a big disadvantage: by defining our own MIME type we make loose coupling hard, because the client has to know this MIME type. By using standards such as RDF, we do not force anyone to change client software in order to gather information from Wikidata.
        • But standards like RDF do not answer the question about granularity! --Denny Vrandečić (WMDE) (talk) 21:50, 3 April 2012 (UTC)
          • So, how important is the question about granularity?
  5. TBD: referencing specific revisions of the data. -- Duesentrieb 13:51, 22 December 2011 (UTC)
    • This shouldn't be a question of the URI, but of the data model or export formats.
      • I'm not sure this is a good idea. Assume that I'm interested in last month's version of a resource. How do I get this version? Get the current version and then remove all changes made since last month? That would mean that all of the changes have to be delivered.
    • The Wikidata implementation is based on MediaWiki, so this may be relatively simple: If the URL contains the revision ID, the server pulls the JSON for that revision from the database, parses it and returns the equivalent triples/statements. Very similar to what MediaWiki does with oldid URLs. Also allowing dates in the URL shouldn't be too hard either. For queries spanning many pages though, this simple approach will likely be too expensive. Chrisahn (talk) 23:02, 10 April 2012 (UTC)
  6. Is it really necessary that the HTML representation of a wiki entry and the "other" representations are identified by different URIs? Why not do a GET with MIME type "text/html" to get the HTML representation for a browser, and a GET on the same resource with MIME type "application/rdf+xml" or "application/json" to get an RDF or JSON representation?
    • This seems really wide-spread, although I do agree that it is not strictly necessary. But it makes 'using' Wikidata through the browser address bar much easier. --21:50, 3 April 2012 (UTC)
      • Well, I think that depends on the meaning of 'using Wikidata'. If I want to link to a wiki resource in my application or website, I have to decide which resource to choose. If other human users should use the link, I have to link to the resource which returns HTML. If I want to enable other applications to deal with the linked information, I have to link to the resource which returns text/xml. If I want both to be able to deal with the linked data, I have to add two (or even more) links. That makes linking very hard, and we have to tell 'the world' to add both links to their sites and applications. Furthermore, I can't figure out why creating only a single resource makes using Wikidata with a browser more difficult. (See also point 4.) The one which returns HTML, so that all my human users can read it
        • As said, you can just use URI #5 for that. A single URI for the item. URIs #4 and #6 are just convenience items without content-negotiation. --Denny Vrandečić (WMDE) (talk) 17:14, 10 April 2012 (UTC)
          • That doesn't work in practice. See point 4.
    • That's what DBpedia does (I think). Chrisahn (talk) 23:05, 10 April 2012 (UTC)
  7. The meaning of "non-persistent URIs" should be clarified. It's very error-prone to rename things, especially if software has to deal with it. The word 'tree' has a special meaning. Even though the tree in my garden wouldn't change in any way if I decided to call it "car" from today on, it would be really difficult to talk with other people about my car. "Cool URIs don't change". The primary purpose of a URI is to identify a resource. It's very unimportant, especially in the context of machine-to-machine communication, whether the URI looks "nice" or not. So the URI isn't better than , even though the first one tempts people to change the meaning of URIs from time to time.
    • The 'non-persistent URIs' would be merely convenience URIs that are based on Wikipedia's titles. Wikidata does not mandate Wikipedia's naming policies. Wikidata provides persistent URIs for everything (the numerical identifiers), but the idea with the convenience URIs is to, additionally, provide some human-readable and guessable URIs based on the titles in the language Wikipedias. As they are not persistent in Wikipedia, they cannot be in Wikidata. I hope this answers this question. --Denny Vrandečić (WMDE) (talk) 21:50, 3 April 2012 (UTC)
      • OK, that seems comprehensible to me. But we have to tell 'everybody' not to use these non-persistent URIs for linking or, more generally, for creating references, because the resources these URIs point to could change at any time, and if this happens a machine would certainly not recognize it. The question is: how to accomplish that?
  8. I think the idea of Chrisahn to use the page ID to identify an entry ("How DBpedia does it") is great!
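Points 4 and 6 both turn on content negotiation over a single URI. The following minimal sketch shows how a server could pick a representation from the Accept header of a request to the persistent URI (#5). The target names `page`, `data.rdf` and `data.json` are hypothetical stand-ins for URIs #4 and #6, and the parsing is deliberately simplified: it honours q-values and `*/*`, but ignores partial wildcards such as `text/*` and the specificity rules of full HTTP negotiation.

```python
from typing import List, Tuple

# Hypothetical mapping from media type to the representation that the
# persistent URI would redirect to (stand-ins for URIs #4 and #6).
REPRESENTATIONS = {
    "text/html": "page",
    "application/rdf+xml": "data.rdf",
    "application/json": "data.json",
}
DEFAULT = "page"  # assumed fallback when nothing matches


def parse_accept(header: str) -> List[Tuple[str, float]]:
    """Parse an Accept header into (media type, q) pairs, best first."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # q defaults to 1 when not given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q-value as "not acceptable"
        prefs.append((media_type, q))
    # sorted() is stable, so entries with equal q keep their header order.
    return sorted(prefs, key=lambda p: p[1], reverse=True)


def negotiate(accept: str) -> str:
    """Choose the representation for a request to the persistent URI."""
    for media_type, q in parse_accept(accept):
        if q <= 0:
            continue
        if media_type in REPRESENTATIONS:
            return REPRESENTATIONS[media_type]
        if media_type == "*/*":
            return DEFAULT
    return DEFAULT


if __name__ == "__main__":
    print(negotiate("text/html,application/xhtml+xml;q=0.9,*/*;q=0.8"))
    print(negotiate("application/json;q=0.5, application/rdf+xml;q=0.9"))
```

With this kind of dispatch, a link to the single persistent URI serves browsers and data clients alike, which is the property argued for in the comments above.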