Wikidata/Notes/Multilingual data

From Meta, a Wikimedia project coordination wiki

Wikidata will be a multilingual website. It will be possible to add, edit, and use data in any Wikipedia language. Data edited in Arabic will immediately be available in French. All Wikidata editors, no matter what language they choose, work on the same, common data set. This is accomplished by representing all entities in the system with internal IDs, which are translated on display to their labels in the user's language. So Vienna is not represented internally by the string "Vienna", but by an internal ID, say, "q55866".

The UI will translate all these IDs to their appropriate labels on the fly. (Open question: what to do if a label is not available in the user's language? Fallback languages?)
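One way the fallback idea could work is a lookup that walks an ordered list of languages and, as a last resort, shows the raw ID. The following is only a minimal sketch: the function name `resolve_label`, the shape of the `LABELS` dictionary, and the choice to fall back to the ID are assumptions made for illustration, not a decided design.

```python
def resolve_label(labels, entity_id, preferred_languages):
    """Return the first available label for entity_id, else the raw ID.

    `labels` maps an internal ID to a {language: label} dictionary.
    `preferred_languages` is ordered: the user's language first,
    followed by any fallback languages.
    """
    entity_labels = labels.get(entity_id, {})
    for language in preferred_languages:
        if language in entity_labels:
            return entity_labels[language]
    # No label in any acceptable language: show the internal ID instead.
    return entity_id


# Illustrative data only, using the example IDs from this note.
# The English label for Austria is left out on purpose to show the fallback.
LABELS = {
    "q55866": {"en": "Vienna", "de": "Wien"},
    "q26964606": {"de": "Österreich"},
}

print(resolve_label(LABELS, "q55866", ["fr", "en"]))
print(resolve_label(LABELS, "q26964606", ["fr", "en", "de"]))
print(resolve_label(LABELS, "q26964606", ["fr", "en"]))
```

The first call falls back from French to English, the second falls back all the way to German, and the third finds nothing and returns the ID unchanged.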

For the exported data, the situation is different. The statement "Vienna is in the country Austria" would not be represented in RDF as, e.g.,

Vienna Location Austria .

or in JSON as

"Vienna" : { "Location" : "Austria" }

but rather in RDF as, e.g.,

q55866 p66099 q26964606 .

and in JSON as

"q55866" : { "p66099" : "q26964606" }

which some developers might find unfriendly.

In order to mitigate that, two proposals could be pursued: first, the obvious one, exporting labels; second, localizing the export on explicit request.

Exporting labels

In addition to the data given above, the labels would also be exported. In RDF this would be

q55866 p66099 q26964606 .
q55866 rdfs:label "Vienna"@en .
q55866 rdfs:label "Wien"@de .
p66099 rdfs:label "Location"@en .
p66099 rdfs:label "Ort"@de .
q26964606 rdfs:label "Austria"@en .
q26964606 rdfs:label "Österreich"@de .

and in JSON

"q55866" : { 
  "p66099" : "q26964606" ,
  "labels" : {
    "en" : { "label" : "Vienna" , "language" : "en" },
    "de" : { "label" : "Wien" , "language" : "de" }
  }
},
"p66099" : {
  "labels" : {
    "en" : { "label" : "Location" , "language" : "en" },
    "de" : { "label" : "Ort" , "language" : "de" }
  }
} ,
"q26964606" : {
  "labels" : {
    "en" : { "label" : "Austria" , "language" : "en" },
    "de" : { "label" : "Österreich" , "language" : "de" }
  }
}
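To illustrate how a consumer might use such an export, here is a minimal sketch that reads the JSON above and renders the statement in a chosen language. It assumes the fragment is wrapped in one enclosing object, that every key other than "labels" is a property ID, and that each property has a single entity ID as its value. The helper names `label_of` and `render_statements` are hypothetical.

```python
import json

# The example export from above, wrapped in an enclosing object.
EXPORT = json.loads("""
{
  "q55866": {
    "p66099": "q26964606",
    "labels": {
      "en": {"label": "Vienna", "language": "en"},
      "de": {"label": "Wien", "language": "de"}
    }
  },
  "p66099": {
    "labels": {
      "en": {"label": "Location", "language": "en"},
      "de": {"label": "Ort", "language": "de"}
    }
  },
  "q26964606": {
    "labels": {
      "en": {"label": "Austria", "language": "en"},
      "de": {"label": "Österreich", "language": "de"}
    }
  }
}
""")


def label_of(export, entity_id, language):
    """Look up a label in the export; return the raw ID if none exists."""
    entry = export.get(entity_id, {}).get("labels", {}).get(language)
    return entry["label"] if entry else entity_id


def render_statements(export, subject_id, language):
    """Render every statement about subject_id as a line of labels."""
    lines = []
    for key, value in export[subject_id].items():
        if key == "labels":
            continue  # not a statement, just the subject's own labels
        terms = (subject_id, key, value)
        lines.append(" ".join(label_of(export, t, language) for t in terms))
    return lines


print(render_statements(EXPORT, "q55866", "en"))
print(render_statements(EXPORT, "q55866", "de"))
```

The data itself stays language-neutral; only the final rendering step depends on the language requested.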

While this is perfectly correct, it considerably increases the bandwidth needed and sends data that is often not required (how often will someone need the name for Austria in Latin?). Assuming a page holds about 50 statements and given 200 Wikipedia languages, the size of the export increases from about 5 KB to 300 KB: a factor of 60, most of it for labels that will never be used.

To mitigate this problem, one option is to return labels for a language only if that language is among those listed by the user agent in the HTTP Accept-Language header. This has the drawback of requiring a Vary: Accept-Language response header, which harms caching, and labels may still be sent even when they are not needed. Another option is to use a query string parameter to specify the requested label languages. This gives each distinct representation its own URI and avoids cache misses for identical responses.
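Both options can be sketched in a few lines. The code below is illustrative only: the parameter name `languages`, the URL layout, and the helper names are assumptions, and the Accept-Language parsing is deliberately simplified (it matches language tags exactly and does no prefix matching such as "de-AT" to "de").

```python
from urllib.parse import urlencode


def parse_accept_language(header):
    """Return the language tags of an Accept-Language header, best first."""
    weighted = []
    for position, part in enumerate(header.split(",")):
        tag, _, params = part.strip().partition(";")
        tag = tag.strip().lower()
        if not tag or tag == "*":
            continue
        quality = 1.0  # default weight when no q-value is given
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        if quality > 0:
            weighted.append((-quality, position, tag))
    # Sort by descending quality, keeping header order for ties.
    return [tag for _, _, tag in sorted(weighted)]


def filter_labels(labels, languages):
    """Option 1: keep only labels whose language was requested."""
    wanted = {language.lower() for language in languages}
    return {lang: text for lang, text in labels.items() if lang in wanted}


def export_url(base_url, entity_id, languages):
    """Option 2: put the requested languages into the URI itself.

    Sorting and de-duplicating means the same set of languages always
    yields the same URI, so a cache sees one key per representation.
    """
    query = urlencode({"languages": ",".join(sorted(set(languages)))}, safe=",")
    return f"{base_url}/{entity_id}?{query}"


requested = parse_accept_language("de-AT, de;q=0.9, en;q=0.5")
print(requested)
print(filter_labels({"en": "Vienna", "de": "Wien"}, ["de", "fr"]))
print(export_url("https://example.org/entity", "q55866", ["en", "de"]))
```

With the query string variant, a client that needs no labels simply omits the parameter, so the label-free export remains a single cacheable resource.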

Localizing the export

Exported labels let developers of tools that consume data from Wikidata display human-readable names. But they do not let a developer glance at the raw data and understand it, or explore the data intuitively. For educational purposes, too, such data is of limited use.

The suggestion is that an export can be explicitly requested to use localized data structures, in addition to the labels. That is, instead of

q55866 p66099 q26964606 .

the export would, on explicit request, contain

Vienna Location Austria .

which means we have come full circle. But ... (need to add the problems).
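The localization step itself could be as simple as substituting each ID with its label in the requested language. The following is a rough sketch under stated assumptions: triples are plain tuples of IDs, labels come from a {ID: {language: label}} dictionary, and the function name `localize_triples` is hypothetical. IDs without a label in the requested language are left unchanged.

```python
def localize_triples(triples, labels, language):
    """Replace each ID in each triple by its label in `language`."""

    def localize(term):
        # Keep the raw ID when no label exists for this language.
        return labels.get(term, {}).get(language, term)

    return [tuple(localize(term) for term in triple) for triple in triples]


# Illustrative data only, using the example IDs and labels from this note.
TRIPLES = [("q55866", "p66099", "q26964606")]
LABELS = {
    "q55866": {"en": "Vienna", "de": "Wien"},
    "p66099": {"en": "Location", "de": "Ort"},
    "q26964606": {"en": "Austria", "de": "Österreich"},
}

for subject, predicate, obj in localize_triples(TRIPLES, LABELS, "en"):
    print(f"{subject} {predicate} {obj} .")
```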