Multilingual Wikidata

From Meta, a Wikimedia project coordination wiki

Multilingual Wikidata is a set of standards and, eventually, functionality for supporting multilingual content within a Wikidata dataset.

Multilingual Datasets

A multilingual Wikidata dataset is any dataset with translatable content. Here, translatable means that an attribute is language-specific (it is expressed in a particular language) but that the choice of language is indifferent with respect to the overall entity or model. In particular, datasets that are concerned with an attribute's original language, or with the processes by which translation or transliteration occurs, should usually not use multilingual attributes: by definition, a translatable attribute is one that is expressed in a particular language, but for which any particular linguistic expression is equivalent to all others.

Consider a simple model for animals, for example:

    > DESC animal;
    
    COLUMN        TYPE           DESC
    ----------------------------------------------------
    species_name  VARCHAR2(50)   Species name in the
                  NOT NULL       Linnaean taxonomy
    common_name   VARCHAR2(50)   Animal common name
                  NOT NULL
                  TRANSLATABLE

The species name is a language-specific attribute, but not a multilingual or translatable one, since all names in the Linnaean taxonomy must be in Latin. The common name, however, is both language-specific and translatable: it does not matter in the model which common name is used when referring to a particular type of animal, nor is there a concept of the original language of a common name and how other common names might derive from it.
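The distinction can be sketched in application code. The following is only an illustration, not part of any Wikidata API: the `Animal` class, its field names, and the use of short language codes as keys are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Animal:
    # Language-specific but not translatable: always Latin.
    species_name: str
    # Translatable: language code -> common name. No entry is treated
    # as the "original"; every expression is equivalent to the others.
    # (Keying by language code is an assumption of this sketch.)
    common_names: Dict[str, str] = field(default_factory=dict)

    def common_name(self, lang: str) -> Optional[str]:
        """Return the common name in the given language, if one exists."""
        return self.common_names.get(lang)


wolf = Animal("Canis lupus", {"en": "wolf", "de": "Wolf", "fr": "loup"})
print(wolf.species_name)       # Canis lupus
print(wolf.common_name("fr"))  # loup
```

The species name is a single value because only one language is valid for it, while the common name is a mapping in which no language has special status.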

For other types of datasets, however, such concerns are important. For example, in cataloging there is the concept of a parallel title, which is an equivalent of the original title in another language. The process by which a parallel title is assigned is a subject of concern for catalogers. In the case of movies, a film has one original title and is then assigned new titles as it is released in different linguistic markets; these new titles are often quite different from a direct translation, because they are chosen to market the film as effectively as possible.

Database-level Implementation

Every multilingual table in a Wikidata dataset, defined as a table containing at least one translatable column, follows the same pattern during implementation. A base table is created containing the entity primary key and all non-translatable attributes. A second table, with the string _ML suffixed to the base table name, is created containing the primary key and all translatable attributes. A _ML table always has the following columns:

    COLUMN        TYPE          DESC
    ----------------------------------------------------
    language_id   INT(15)       Language of the translated
                                content; foreign key to
                                Language
    primary_lang  TINYINT(1)    Whether content in this language
                  NOT NULL      should be considered "primary" or
                                somehow take precedence over other
                                translations, for example because the
                                content was originally in this language.
                                Primary language content should be given
                                preference when choosing an expression
                                to do additional translations from.
                                Only one language should be tagged
                                as primary.

The combination of base entity primary key and language id will always be unique in the _ML table.
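As a rough sketch of this pattern, the animal model could be split into a base table and an `animal_ML` table as follows. SQLite (through Python's `sqlite3` module) is used here only so that the example runs on its own. The `animal_id` key, the `language` table layout, the partial index enforcing a single primary language, and the fallback to the primary-language row in `common_name()` are assumptions of this example, not part of the specification above. Marking English as primary is likewise arbitrary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE language (
    language_id INTEGER PRIMARY KEY,
    code        TEXT NOT NULL UNIQUE
);
-- Base table: primary key plus all non-translatable attributes.
CREATE TABLE animal (
    animal_id    INTEGER PRIMARY KEY,
    species_name TEXT NOT NULL
);
-- _ML table: primary key, language, primary flag, translatable attributes.
CREATE TABLE animal_ML (
    animal_id    INTEGER NOT NULL REFERENCES animal(animal_id),
    language_id  INTEGER NOT NULL REFERENCES language(language_id),
    primary_lang INTEGER NOT NULL DEFAULT 0,
    common_name  TEXT NOT NULL,
    PRIMARY KEY (animal_id, language_id)
);
-- Assumption: enforce "only one primary language" with a partial index.
CREATE UNIQUE INDEX animal_ML_one_primary
    ON animal_ML (animal_id) WHERE primary_lang = 1;
""")

conn.executemany("INSERT INTO language VALUES (?, ?)",
                 [(1, "en"), (2, "de"), (3, "fr")])
conn.execute("INSERT INTO animal VALUES (1, 'Canis lupus')")
conn.executemany("INSERT INTO animal_ML VALUES (?, ?, ?, ?)",
                 [(1, 1, 1, "wolf"), (1, 2, 0, "Wolf")])


def common_name(conn, animal_id, lang_code):
    """Return the name in lang_code, else the primary-language name, else None."""
    row = conn.execute(
        """
        SELECT ml.common_name
        FROM animal_ML AS ml
        JOIN language AS l ON l.language_id = ml.language_id
        WHERE ml.animal_id = :id
          AND (l.code = :code OR ml.primary_lang = 1)
        ORDER BY (l.code = :code) DESC
        LIMIT 1
        """,
        {"id": animal_id, "code": lang_code},
    ).fetchone()
    return row[0] if row else None


print(common_name(conn, 1, "de"))  # Wolf
print(common_name(conn, 1, "fr"))  # wolf (no French row, falls back to primary)
```

The composite primary key on `(animal_id, language_id)` is what guarantees the uniqueness described above: inserting a second German row for the same animal fails with an integrity error.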

Wikidata Enhancements

As part of future enhancements, Wikidata should support multilingualism in its data definition UIs through the TRANSLATABLE column/attribute modifier flag. If any entity attribute is flagged as translatable, Wikidata should automatically create a _ML table for the entity in the dataset's underlying SQL DDL.
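One way such automatic generation might work is sketched below. This is a hypothetical illustration, not an existing Wikidata feature: the `Column` class, the `generate_ddl` function, the `translatable` flag standing in for the TRANSLATABLE modifier, the `animal_id` key name, and the `Language(language_id)` foreign key target are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Column:
    name: str
    sql_type: str
    translatable: bool = False  # stands in for the TRANSLATABLE modifier


def _create_table(name: str, lines: List[str]) -> str:
    """Format a CREATE TABLE statement from column/constraint lines."""
    return f"CREATE TABLE {name} (\n    " + ",\n    ".join(lines) + "\n);"


def generate_ddl(table: str, pk: str, columns: List[Column]) -> List[str]:
    """Return DDL for the base table and, if needed, its _ML table."""
    base = [c for c in columns if not c.translatable]
    translated = [c for c in columns if c.translatable]

    # Base table: primary key plus all non-translatable attributes.
    statements = [
        _create_table(
            table,
            [f"{pk} INT(15) NOT NULL PRIMARY KEY"]
            + [f"{c.name} {c.sql_type}" for c in base],
        )
    ]
    # _ML table: only created when at least one column is translatable.
    if translated:
        statements.append(
            _create_table(
                f"{table}_ML",
                [
                    f"{pk} INT(15) NOT NULL REFERENCES {table}({pk})",
                    "language_id INT(15) NOT NULL REFERENCES Language(language_id)",
                    "primary_lang TINYINT(1) NOT NULL",
                ]
                + [f"{c.name} {c.sql_type}" for c in translated]
                + [f"PRIMARY KEY ({pk}, language_id)"],
            )
        )
    return statements


ddl = generate_ddl(
    "animal",
    "animal_id",
    [
        Column("species_name", "VARCHAR(50) NOT NULL"),
        Column("common_name", "VARCHAR(50) NOT NULL", translatable=True),
    ],
)
print("\n\n".join(ddl))
```

For the animal model, this yields two statements: one for `animal` holding `species_name`, and one for `animal_ML` holding `common_name` together with the `language_id` and `primary_lang` columns.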

See Also