Wikidata/Archive/Notes

From Meta, a Wikimedia project coordination wiki

The diagram below shows the database schema changes that come with MediaWiki 1.5, a first alpha version of which was released on May 3, 2005. The old database schema is on the left, the new one on the right:

[Figure: Database-restructure.png]

(Note that the REVISION table has recently been altered to decouple the references to the article text from the revision numbers.)

These schema changes are the most essential requirement for Wikidata, as they allow it to associate an individual page with several data elements with only minimal changes to the schema. Primarily, the following changes are required:

  • A new table SCHEMA which defines the existing Wikidata schemas. This will only exist after prototype I, as the first prototype will use a hardcoded schema.
  • A new column rev_field in the REVISION table which associates an individual article revision with a field from a schema.
  • A new column rev_language which can be used for schemas which allow multiple revisions under the same title (as is the case in the Ultimate Wiktionary).

Beyond that, new tables can be created modeled after the TEXT table above for every desired type of data, such as strings, numbers, blobs, and dates. Each of these data elements is associated with a revision and, through that, associated with a schema.
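
The following is a minimal sketch of one possible reading of these changes, using an in-memory SQLite database. Only rev_field and rev_language come from the text above; the other columns, the data_string and data_number tables, and the fields_of_page helper are hypothetical names chosen for illustration, not the MediaWiki 1.5 schema.

```python
import sqlite3

# Illustrative only: apart from rev_field and rev_language, all names
# here are assumptions made for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE revision (
    rev_id       INTEGER PRIMARY KEY,
    rev_page     INTEGER NOT NULL,
    rev_field    TEXT,   -- schema field this revision row belongs to
    rev_language TEXT    -- language code for language-dependent fields
);
-- One table per data type, modeled after the TEXT table.
CREATE TABLE data_string (rev_id INTEGER REFERENCES revision(rev_id), value TEXT);
CREATE TABLE data_number (rev_id INTEGER REFERENCES revision(rev_id), value REAL);
""")

conn.execute("INSERT INTO revision VALUES (1, 10, 'Last name', NULL)")
conn.execute("INSERT INTO data_string VALUES (1, 'Smith')")
conn.execute("INSERT INTO revision VALUES (2, 10, 'Salary', NULL)")
conn.execute("INSERT INTO data_number VALUES (2, 52000)")


def fields_of_page(page_id):
    """Collect every data element of a page, whatever its type table."""
    rows = conn.execute(
        """
        SELECT r.rev_field, COALESCE(s.value, n.value)
        FROM revision r
        LEFT JOIN data_string s ON s.rev_id = r.rev_id
        LEFT JOIN data_number n ON n.rev_id = r.rev_id
        WHERE r.rev_page = ?
        ORDER BY r.rev_id
        """,
        (page_id,),
    ).fetchall()
    return dict(rows)
```

Each data element reaches its schema field through the revision row, so adding a new data type only means adding another table of the same shape.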

Schemas and namespaces

In the Wikidata model, schemas directly build on the existing model of namespaces (prefixes like "User:", "Image:", "Wikipedia:"). A schema must be associated with a particular namespace. Thus, using the existing page_namespace field, we can immediately look up what schema a page is part of. This then allows us to search for the content of individual fields, if necessary, in different languages.

Presently, namespaces are very difficult to create and change. They are defined in essentially four places:

  • includes/Defines.php - lists the hardcoded system namespaces and their ID numbers
  • includes/Namespace.php - defines the "canonical" namespace names which will work in any language
  • language/Language*.php - local namespace names
  • LocalSettings.php - an arbitrary number of custom namespaces

The system namespaces are referred to extensively in the code in many situations. For example, a regular page is always linked to a discussion page, and this relationship is hardcoded. Additionally, certain parameters for each namespace can be set in LocalSettings.php.

For Wikidata, but also for existing Wikimedia projects like Wikibooks, it is necessary to add and change namespaces with relative ease. One of the first steps of the Wikidata implementation is therefore a namespace manager (SpecialNamespace.php). This requires changes throughout the codebase and also the addition of two database tables:

The NAMESPACE table:

+-------------------+-------------+------+-----+---------+----------------+
| Field             | Type        | Null | Key | Default | Extra          |
+-------------------+-------------+------+-----+---------+----------------+
| ns_id             | int(8)      |      | PRI | NULL    |                |
| ns_system         | varchar(80) |      |     | 0       |                |
| ns_target         | varchar(200)|      |     | 0       |                | 
| ns_subpages       | tinyint(1)  |      |     | 0       |                |
| ns_search_default | tinyint(1)  |      |     | 0       |                |
+-------------------+-------------+------+-----+---------+----------------+
  • ns_id: the number which is used to refer to the namespace in the code and in other tables.
  • ns_system: if this is a system namespace, a standard reference to it (e.g. "NS_IMAGE", "NS_USER").
  • ns_target: if a link in this namespace does not have a prefix, it should be assumed to be prefixed with this text. This can be used to create closed namespaces, where any non-prefixed link points inside the same namespace (could be useful for Wikibooks or for wiki farms), but also to create "InterWiki" namespaces, where any link points to another wiki by default.
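  • ns_closed: Links within namespaces which are defined as ns_closed always point to pages in the same namespace, even if they are not prefixed. (This field does not appear in the table above; the same behavior can be obtained by setting ns_target to the namespace's own prefix.)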
  • ns_subpages: Are subpages allowed within this namespace?
  • ns_search_default: Should this namespace be searched by default?
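
As a rough illustration of the ns_target rule described above, the sketch below resolves a link written on a page in a given namespace. The function name resolve_link, the set of known prefixes, and the convention that an empty string means "no target" are assumptions made for this example.

```python
def resolve_link(link, ns_target, known_prefixes):
    """Return the full title that a link points to.

    link           -- link text as written on the page
    ns_target      -- ns_target of the namespace the page lives in
                      ("" is assumed to mean that no target is set)
    known_prefixes -- names of all existing namespaces
    """
    prefix, sep, _rest = link.partition(":")
    if sep and prefix in known_prefixes:
        # Explicitly prefixed links are left untouched.
        return link
    # Unprefixed links are assumed to carry the ns_target prefix.
    return ns_target + link if ns_target else link
```

With ns_target set to "Cookbook:", the link "Pasta" resolves to "Cookbook:Pasta", which gives the closed-namespace behavior; pointing ns_target at an interwiki prefix instead would send every unprefixed link to another wiki.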

The NAMESPACE_NAMES table:

+------------+--------------+------+-----+---------+-------+
| Field      | Type         | Null | Key | Default | Extra |
+------------+--------------+------+-----+---------+-------+
| ns_id      | int(8)       |      |     | 0       |       |
| ns_name    | varchar(200) |      |     |         |       |
| ns_default | tinyint(1)   |      |     | 0       |       |
+------------+--------------+------+-----+---------+-------+

  • ns_name: one possible prefix for this namespace
  • ns_default: Is this the "canonical" name, that is, should other names redirect to it?

This schema allows any namespace to have an arbitrary number of names: for example, "Image:", "Video:" and "Sound:" could all exist as synonyms that redirect to the canonical "File:". Another advantage is that namespaces can be renamed through the database, where this previously required editing the language file (not trivially possible if multiple installations share the same codebase).
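
A minimal sketch of how a synonym could be mapped to the canonical name with the NAMESPACE_NAMES table, again using in-memory SQLite. The sample rows, the namespace IDs, and the canonical_title helper are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE namespace_names (ns_id INTEGER, ns_name TEXT, ns_default INTEGER)"
)
# Sample rows for illustration; the IDs are not prescribed by the text.
conn.executemany(
    "INSERT INTO namespace_names VALUES (?, ?, ?)",
    [(6, "File", 1), (6, "Image", 0), (6, "Video", 0), (6, "Sound", 0), (2, "User", 1)],
)


def canonical_title(title):
    """Rewrite a prefixed title so that it uses the canonical namespace name."""
    prefix, sep, rest = title.partition(":")
    if not sep:
        return title  # no prefix at all
    row = conn.execute(
        """
        SELECT canon.ns_name
        FROM namespace_names AS alias
        JOIN namespace_names AS canon
          ON canon.ns_id = alias.ns_id AND canon.ns_default = 1
        WHERE alias.ns_name = ?
        """,
        (prefix,),
    ).fetchone()
    # Unknown prefixes are treated as part of the page title.
    return f"{row[0]}:{rest}" if row else title
```

Renaming a namespace then amounts to inserting the new name and moving the ns_default flag, without touching any language file.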

In the namespace manager, it must be possible to:

  • Create a new namespace or a new synonym. If there are existing pages using that prefix, the namespace manager should propose to automatically move them into the real namespace. This makes it possible to convert so-called "pseudo-namespaces" into real ones.
  • Rename a namespace. Here, the namespace manager should not perform the operation until all existing links point to the new name (it could be created as a synonym first).
  • Remove a namespace. This operation should only be permitted if there are no pages in the namespace in question.
  • Change the properties of a namespace.
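
Two of the rules above (moving pseudo-namespace pages on creation, and refusing to remove a non-empty namespace) can be sketched with a small in-memory model. The NamespaceManager class and its methods are hypothetical stand-ins, not the planned SpecialNamespace.php code; pages are modeled as (namespace ID, title) pairs, and pseudo-namespace pages are assumed to live in the main namespace (ID 0).

```python
MAIN_NS = 0  # pseudo-namespace pages are assumed to sit in the main namespace


class NamespaceManager:
    """Hypothetical in-memory stand-in for the proposed namespace manager."""

    def __init__(self):
        self.names = {}  # ns_id -> list of names
        self.pages = []  # list of (ns_id, title)

    def pseudo_pages(self, name):
        """Titles that only look like they belong to the namespace `name`."""
        prefix = name + ":"
        return [t for ns, t in self.pages if ns == MAIN_NS and t.startswith(prefix)]

    def create(self, ns_id, name, move_existing=False):
        """Register a name; optionally move pseudo-namespace pages into it."""
        self.names.setdefault(ns_id, []).append(name)
        if move_existing:
            prefix = name + ":"
            self.pages = [
                (ns_id, t[len(prefix):])
                if ns == MAIN_NS and t.startswith(prefix)
                else (ns, t)
                for ns, t in self.pages
            ]

    def remove(self, ns_id):
        """Only permitted while the namespace contains no pages."""
        if any(ns == ns_id for ns, _ in self.pages):
            raise ValueError("namespace still contains pages")
        del self.names[ns_id]
```

The move is opt-in here because the text says the manager should propose the move rather than perform it silently; pseudo_pages gives the list that such a proposal would show.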

Each of these operations should be individually configurable as a right, using the new user/groups/rights permission scheme in MediaWiki 1.5.

Any new MediaWiki request will have to load these namespaces from the database or from the memory cache. Furthermore, the default namespace names must be read into the tables upon installation or upgrade from an older version.

Schemas

A schema is essentially an association of a namespace with multiple fields. It is important at this point to note that in the context of Wikidata, the contents of a namespace are not necessarily directly user-visible. This is because some tables define relationships between data, rather than the data itself. These relationships must be set and evaluated transparently to the user: No user wants to manually define, say, that "key 15 is associated with key 40".

Furthermore, in the Wikidata model, page titles have to be understood as "link keys". They must be unique, and their only purpose is to allow linking from one page to another using the standard means of the wiki. They are not necessarily the primary key of the table, nor does the table have to contain only distinct rows.

A specific example:

Schema:Persdata
TITLE=$Firstname $Lastname ??NR??

Field name (EN)   Field name (DE) Type             Options
NR                NR              {autoinc}        hidden
Last name         Nachname        STRING 20
First name        Vorname         STRING 20
Salary            Gehalt          NUMBER        
Notes             Notizen         LONGTEXT         local, default:en
Department        Abteilung       {key}=>
                                  STORE:en_Department:NR
                                  SHOW:en_Department:Depname

This table already shows that field names are internationalizable. This is essential for the Ultimate Wiktionary, as field names like "Word type" have to be shown in the user's language. The option "hidden" indicates that the field is not editable by the user; this is useful, for example, for auto-increment fields. The option "local" indicates that the data in this field is language-dependent, as is the case for dictionary definitions or, in this case, notes.

For language-dependent fields you can define that, if there is no content in a particular language, the content from an existing language should be used as a default. If this option is not set, the data would simply be missing in that language.
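
The fallback rule for "local" fields can be sketched as follows. The function localized_value and the representation of a field as a dictionary from language code to content are assumptions made for illustration.

```python
def localized_value(values, language, default_language=None):
    """Look up a language-dependent field.

    values           -- mapping of language code to content
    language         -- language requested by the user
    default_language -- value of the "default:" option, or None if unset
    """
    if language in values:
        return values[language]
    if default_language is not None:
        # Fall back to the content in the configured default language.
        return values.get(default_language)
    return None  # no default configured: the data is simply missing
```

For the "Notes" field above, declared with "local, default:en", a German reader would see the English note until a German one is written.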

The "Department" column is probably the most interesting, as it introduces the concept of show/store keys. Show/store keys are simple references to other tables, where the user-visible content is different from the one stored in the database; in this case, a list of departments ("Accounting", "IT", "Research", etc.) is shown to the user, and the associated ID is stored in the field (and looked up again later).
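
A show/store key can be pictured as a two-way lookup against the referenced table. In this sketch the department table is a plain dictionary from NR to Depname, and the helper names are hypothetical.

```python
# Stand-in for the en_Department namespace: NR -> Depname.
departments = {1: "Accounting", 2: "IT", 3: "Research"}


def choices(table):
    """Values shown to the user (the SHOW column)."""
    return sorted(table.values())


def store_key(table, shown):
    """Translate the value the user picked into the key that is stored."""
    for key, label in table.items():
        if label == shown:
            return key
    raise KeyError(shown)


def show_value(table, stored):
    """Translate a stored key back into the user-visible value."""
    return table[stored]
```

The user only ever sees the names returned by choices; the page itself holds the NR value, so renaming a department does not require touching the pages that refer to it.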

For Wiktionary, we require another concept, the {multikey}. Let's say we have two namespaces, WORD and DEFINITION, and WORD contains the following column:

Field name (EN)   Type
Definition        {multikey:*}=>
                  STORE:en_Definition:Word
                  EDIT:en_Definition:Definition

Here we define that any word can be associated with multiple definitions. Which user interface is provided for adding these definitions is open to debate: it could be a simple single textarea, or a dynamic JavaScript form where the user can add new text fields by clicking a button (if the latter, a simpler entry mechanism should be provided as a fallback).
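
As a sketch of the simple textarea variant, the function below turns the submitted text into one DEFINITION row per non-empty line, each storing the key of the word it belongs to. The one-definition-per-line convention and the name parse_definitions are assumptions made for this example.

```python
def parse_definitions(word, textarea):
    """Split textarea input into DEFINITION rows linked back to `word`.

    Assumes one definition per line; blank lines are ignored.
    """
    return [
        {"Word": word, "Definition": line.strip()}
        for line in textarea.splitlines()
        if line.strip()
    ]
```

Each resulting row carries the "Word" key named in the STORE clause, which is what lets a single word own any number of definitions.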

Changing schemas

The column types that comprise a schema will not be easily changeable after a schema is defined, as a change of type could result in a loss of data. Nor will it be trivially possible to remove columns. It will, however, be possible to rename and internationalize the column names, to add columns, and to make non-lossy changes to column types (e.g. increase a field length).
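
One conservative way to decide whether a type change is non-lossy is sketched below. Column types are modeled as (name, length) pairs, with a length of None for types without one; this representation and the function name are illustrative assumptions.

```python
def is_lossless_change(old, new):
    """Return True if changing a column from `old` to `new` cannot lose data.

    Both arguments are (type_name, length) pairs, e.g. ("STRING", 20);
    length is None for types that have no length.
    """
    old_type, old_len = old
    new_type, new_len = new
    if old_type != new_type:
        return False  # converting between types may lose data
    if old_len is None or new_len is None:
        return old_len == new_len
    return new_len >= old_len  # growing a field is safe, shrinking is not
```

Under this rule, widening "Last name" from STRING 20 to STRING 40 is allowed, while shortening it or turning "Salary" into a string is rejected.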

Links

It should be possible to link directly to any individual data element; the links tables will have to be altered to deal with this, and a new syntax is required. However, for the time being, the existing link table structure is sufficient to allow basic linking from any Wikidata field to any page; that is, all fields that contain textual data can also contain full wiki syntax.

Revisions, diffs, recent changes

Due to the above storage model, we can apply all the mechanisms of the wiki to a page containing structured data. Altering an individual data element alters the page as a whole and results in a new revision. In the first prototype, a revision will always include a full copy of every single data element, but more efficient storage is of course desirable.

The "Diff" output that shows the difference between two revisions will have to loop through all the elements of a page and show diffs for each of them. The "Recent changes" table will not have to be changed. Multiple field changes in one edit should generate a single line in the log. It may be desirable to generate automatic summaries of simple changes. The same view will also be used for page histories; however, it should also be possible to get the history of changes to a single field.
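
The per-field diff loop can be sketched with Python's standard difflib module. Revisions are modeled here as dictionaries from field name to value; that representation and the name field_diffs are assumptions made for this example, not a description of MediaWiki's diff engine.

```python
import difflib


def field_diffs(old, new):
    """Return a unified diff for every field that differs between revisions.

    old, new -- mappings of field name to value; values are compared as text.
    """
    diffs = {}
    for field in sorted(set(old) | set(new)):
        before = str(old.get(field, ""))
        after = str(new.get(field, ""))
        if before != after:
            diffs[field] = list(
                difflib.unified_diff(
                    before.splitlines(),
                    after.splitlines(),
                    fromfile=field,
                    tofile=field,
                    lineterm="",
                )
            )
    return diffs
```

The keys of the returned dictionary list the changed fields, which is also the raw material for a one-line automatic summary, and filtering by a single key gives the history of one field.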

Program-triggered changes to hidden tables (especially relationship tables) should not be shown in the Recent Changes log; the user-visible changes should be logged instead.

Implementation strategy

The further implementation strategy is essentially as follows:

  1. Create namespace manager.
  2. Model two relational databases in a single Wikidata database and benchmark queries to determine feasibility of proposed specifications.
  3. Hardcode a schema for GEMET data and build basic proof of concept.
  4. Create schema editor, frontend and view code, and integrate functionality.
  5. Build Ultimate Wiktionary schema.