User:Nikola Smolenski/Wikidata

These are my notes and questions about Wikidata.

Internal

Language

Have we decided on a notation, both for on-wiki use and for our internal documents?

Berlin->population = 1234
Berlin::population = 1234
?

You mean for documentation? Internally the system would store the data in JSON for phase 1 and 2. --Denny Vrandečić (WMDE) (talk) 16:17, 20 March 2012 (UTC)

What wording are we using? Subject/predicate/value?

See Wikidata/Glossary. --Denny Vrandečić (WMDE) (talk) 16:17, 20 March 2012 (UTC)

Usage

Use cases

What are use cases of Wikidata?

Some use cases on Wikipedia:

Ability to refresh data at multiple pages/projects at once.

Easier referencing.

Ability to build lists (problems).

Ability to simulate bot-created articles.

Ability to draw charts, graphs and maps.

Ability to query the data.

Wiktionary.

Wikisource.

For phase 1 we are only looking at the language links for the Wikipedias, for phase 2 only at the infoboxes. Other use cases will arise during that work. --Denny Vrandečić (WMDE) (talk) 16:20, 20 March 2012 (UTC)

Should Wikipedias be equipped with their own internal wikidata systems? Should Wikidata work similarly to Commons?

Like Commons. No internal Wikidata in every Wikipedia. --Denny Vrandečić (WMDE) (talk) 16:20, 20 March 2012 (UTC)

If this were as easy as it seems, why hasn't it been done already? Why aren't there wikis with their own wikidata? Why has no one attempted to do it with templates?

They did. Check Commons. To edit it is pretty hard, though. --Denny Vrandečić (WMDE) (talk) 16:20, 20 March 2012 (UTC)

Comparison with other similar systems: SMW, OpenStreetMap...

Markup language

Is it decided what markup language should be used? Should multiple markup languages be allowed?

JSON

How exactly will JSON be used? What about objects or arrays which are members of the base object? What about properties that are defined multiple times?

A possibility regarding objects of objects: { "key1": { "key2": "value" } } is the same as { "key1/key2" : "value" }

A possibility regarding arrays: { "key": [ "valuea", "valueb" ] } is the same as { "key1": "valuea", "key2": "valueb" }
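The two possibilities above can be sketched as a small flattening function. This is only an illustration of the proposed equivalences, not a decided format; the function name `flatten` and the 1-based array indexing are assumptions read off the examples.

```python
def flatten(value, prefix=""):
    """Flatten nested dicts and lists into a single-level dict.

    Nested object keys are joined with "/" ("key1/key2"); array items get a
    1-based index appended to the key ("key1", "key2"), as in the notes above.
    """
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}/{key}" if prefix else key
            flat.update(flatten(child, path))
    elif isinstance(value, list):
        for index, child in enumerate(value, start=1):
            flat.update(flatten(child, f"{prefix}{index}"))
    else:
        # A plain value: store it under the path built so far.
        flat[prefix] = value
    return flat


nested_object = flatten({"key1": {"key2": "value"}})
nested_array = flatten({"key": ["valuea", "valueb"]})
```

One thing the sketch makes visible: the array rule is ambiguous, since { "key": [ "valuea" ] } and { "key1": "valuea" } flatten to the same result.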

Wikitext

Wikitext has the underused ability to create a definition list, like this:

; word : definition

This could also be used to extract the data, display it neatly at the same time, and allow for use of wikimarkup around the data.

Other

CSV (related to Multiple data sources)?
Parser functions?
Plugins?

None

Is there a need for a markup language if the data is not editable by hand anyway?

No markup language will be used for the data. --Denny Vrandečić (WMDE) (talk) 16:21, 20 March 2012 (UTC)

Procedurally generated predicates

In some cases, it will be possible to generate a predicate automatically from another predicate. For example, name:sr could be generated from name:ja. As a completely different example, Darth_Vader->father_of could be generated from Luke_Skywalker->son_of.
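A minimal sketch of the second example, assuming statements are kept as (subject, predicate, value) triples. The `INVERSE` table and the function name are hypothetical; only the son_of/father_of pair comes from the notes.

```python
# Hypothetical table of inverse predicates (an assumption for illustration).
INVERSE = {"son_of": "father_of"}


def derive_inverse(statements):
    """Return statements implied by INVERSE, skipping ones already stated."""
    stated = set(statements)
    derived = []
    for subject, predicate, value in statements:
        inverse = INVERSE.get(predicate)
        if inverse is None:
            continue  # no rule for this predicate
        candidate = (value, inverse, subject)
        # Explicitly entered data wins: do not generate what is already there.
        if candidate not in stated and candidate not in derived:
            derived.append(candidate)
    return derived


statements = [("Luke_Skywalker", "son_of", "Darth_Vader")]
generated = derive_inverse(statements)
```

Here explicitly stated data simply suppresses the generated statement, which is only one possible answer to the overwrite question raised below.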

Should there be procedural parameters at all? If not, users will likely generate them by bots.

If yes, should we go into depth, or just supply the most basic ones?

Should they be user-definable?

How should data overwrite procedurally generated predicates? (See also Data priority.)

Would it be possible to include external data sources (weather, stockmarket...)?

This is out of scope for the initial development of Wikidata. --Denny Vrandečić (WMDE) (talk) 16:21, 20 March 2012 (UTC)

Query language

Should there be a query language that will enable querying of the data?

"Semantic query language" (SQL)?

"Google query language": allow only ANDing or ORing of the parameters. ORing of ANDed parameters would not be possible.
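To make the restriction concrete, here is a rough sketch of such a flat query: all conditions are combined either with AND or with OR, and nesting is not expressible. The item layout, property names and function names are assumptions made for the example.

```python
def matches(item, conditions, mode="and"):
    """Check one item against flat conditions combined with AND or OR only."""
    if mode not in ("and", "or"):
        raise ValueError("mode must be 'and' or 'or'")
    results = (item.get(key) == expected for key, expected in conditions.items())
    return all(results) if mode == "and" else any(results)


def query(items, conditions, mode="and"):
    """Return the names of all items that match, in insertion order."""
    return [name for name, data in items.items() if matches(data, conditions, mode)]


# Illustrative items; the property names are hypothetical.
items = {
    "Berlin": {"country": "Germany", "type": "city"},
    "Hamburg": {"country": "Germany", "type": "city"},
    "Paris": {"country": "France", "type": "city"},
    "Bavaria": {"country": "Germany", "type": "state"},
}

german_cities = query(items, {"country": "Germany", "type": "city"}, mode="and")
french_or_state = query(items, {"country": "France", "type": "state"}, mode="or")
```

A query such as "(Germany AND city) OR (France AND city)" cannot be written in this form, which is exactly the limitation noted above.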

User wizards?

Ability to have multiple query languages as plugins?

Yes, such a query language is needed in phase 3. It is part of the project to find an answer to this question, but later. --Denny Vrandečić (WMDE) (talk) 16:22, 20 March 2012 (UTC)

Multiple predicate values

Multiple problems occur when a predicate can have multiple values. Potentially all predicates could have multiple values, even those generally thought of as unique (e.g. date of birth: there are historical persons with an uncertain date of birth, for whom several dates are possible).

There are multiple kinds of multiple data. Some are readily distinguishable: Berlin->population_in_2011, Berlin->population_in_2010. Some come from uncertainty about the data (birth date). A special problem is data which are nearly always unique, but still sometimes may be multiple.

There are also co-dependent multiple data. For example, a person might have been born in year X in city A or in year Y in city B, but certainly not in year X in city B.
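One way to picture co-dependent data is to store whole alternatives rather than independent lists of values. The sketch below is an assumption about a possible structure, not a proposed data model; "X", "Y", "A" and "B" are the placeholders from the paragraph above.

```python
# Each alternative bundles the values that belong together.
birth_alternatives = [
    {"birth_year": "X", "birth_place": "A"},
    {"birth_year": "Y", "birth_place": "B"},
]


def is_consistent(alternatives, **claim):
    """True if a single alternative agrees with every key in the claim."""
    return any(
        all(alt.get(key) == value for key, value in claim.items())
        for alt in alternatives
    )


def values_of(alternatives, key):
    """All distinct values for one key, in order. This view loses the pairing."""
    seen = []
    for alt in alternatives:
        if key in alt and alt[key] not in seen:
            seen.append(alt[key])
    return seen
```

With this layout, is_consistent(birth_alternatives, birth_year="X", birth_place="B") is false, while asking for the years alone still yields both "X" and "Y".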

Cyril and Methodius problem: the data are clear, but are always presented in a way that "multiplies" them.

Note that in some cases, multiplicity may disappear if data is observed "the other way".

Potential solutions:

Always have a default value and use that. Problem: in some cases, there is no reason to choose one value over another.

If the data has multiple values, have the default value be "uncertain" and then display that (translated).

Do what Wikipedia does: date_of_birth1, date_of_birth2... Problem: you have to do this always.

Have a number of functions that can present multiple data in different ways. For example a function that returns "1, 2 or 3", another that returns "1, 2, 3"...

Display an error.

Multiple values always must be somehow separated (for example, referenced differently).

Multiple values always as separate articles?

Berlin->population = Population_of_Berlin
Population_of_Berlin->census
Population_of_Berlin->estimate

Don't allow for unexpected multiple values. If a value is multiple, it will have to be handled on Wikipedia manually.
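The presentation functions mentioned among the potential solutions above could look roughly like this. The function names are made up for the sketch, and the word "or" would of course have to be translated per language.

```python
def join_or(values):
    """Format values as "1, 2 or 3"."""
    items = [str(value) for value in values]
    if len(items) <= 1:
        return "".join(items)  # empty string or the single value
    return ", ".join(items[:-1]) + " or " + items[-1]


def join_comma(values):
    """Format values as "1, 2, 3"."""
    return ", ".join(str(value) for value in values)
```

A template could then pick the function that suits the context, for example join_or for uncertain birth dates and join_comma for plain enumerations.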

This is an open question, less with regards to saving and editing the data, but more with regards to querying and displaying it. This needs to be tackled for phase 2, which is rather soon, and some thought here would be useful. See Wikidata/Notes/Data model for this. --Denny Vrandečić (WMDE) (talk) 16:26, 20 March 2012 (UTC)

Multiple data sources

EXIF data

Article metadata (time of last update)

Data for one page defined on another page.

Data priority

If there are multiple data sources, what should be their priority?

First came - first served?
Last came - first served?
Always give priority to user defined data.
Always give priority to more locally-defined data.
Create multiple values? Only in some cases?
Display an error.
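Several of the options listed above can be compared side by side in a small resolver. This is a sketch under assumptions: candidates are (source, value) pairs in arrival order, the strategy names are invented here, and the "more locally-defined" option is left out because it needs a notion of locality the notes do not define.

```python
def resolve(candidates, strategy):
    """Pick a value from (source, value) pairs given in arrival order."""
    if not candidates:
        raise ValueError("no candidates to resolve")
    values = [value for _, value in candidates]
    if strategy == "first":        # first come, first served
        return values[0]
    if strategy == "last":         # last come, first served
        return values[-1]
    if strategy == "user_first":   # user-defined data wins, else the first value
        for source, value in candidates:
            if source == "user":
                return value
        return values[0]
    if strategy == "multiple":     # keep everything as multiple values
        return values
    if strategy == "error":        # refuse to choose between conflicting values
        if any(value != values[0] for value in values):
            raise ValueError("conflicting values: %r" % (values,))
        return values[0]
    raise ValueError("unknown strategy: %r" % (strategy,))


# Made-up example values; "exif" and "user" are hypothetical source labels.
candidates = [("exif", "2011-05-01"), ("user", "2011-05-02")]
```

For the candidates shown, "first" returns the EXIF value, "last" and "user_first" return the user value, "multiple" returns both, and "error" raises.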

Architecture

Caching

There are several use cases, depending on how often the data changes, how often the article changes, the article's size and complexity, and how often the article is visited.

An article that is built quickly and accessed rarely need not be cached at all, just have its cache invalidated. When the first reader visits it, it will be rebuilt.

The same goes for an article that is edited very often: when the first editor edits it, it will be rebuilt.

If data is changed in quick succession, articles perhaps don't need to be rebuilt at all (e.g. someone makes a mistake while editing the data, then reverts).

If data is changing very quickly (stockmarket ticker), perhaps the data shouldn't even be a part of the article text but displayed via ajax.
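The "invalidate now, rebuild on the first visit" idea from the cases above can be sketched as follows. The class, its `render` callback and the `render_count` counter are assumptions for illustration and say nothing about how MediaWiki's caches actually work.

```python
class LazyArticleCache:
    """Drops cached articles on data change; rebuilds only when someone reads."""

    def __init__(self, render):
        self._render = render      # hypothetical function: title -> rendered text
        self._cache = {}
        self.render_count = 0      # how many rebuilds actually happened

    def invalidate(self, title):
        """Called whenever data used by the article changes. Cheap."""
        self._cache.pop(title, None)

    def get(self, title):
        """Called on a page view. Rebuilds only if the cache entry is missing."""
        if title not in self._cache:
            self._cache[title] = self._render(title)
            self.render_count += 1
        return self._cache[title]


data = {"Berlin": 1234}
cache = LazyArticleCache(lambda title: f"{title}: {data[title]}")

first_view = cache.get("Berlin")   # first reader triggers a build
data["Berlin"] = 9999              # someone errs in editing the data...
cache.invalidate("Berlin")
data["Berlin"] = 1234              # ...then reverts
cache.invalidate("Berlin")
builds_before_next_view = cache.render_count
second_view = cache.get("Berlin")  # one rebuild, not two
```

In this sketch the mistaken edit and its revert cost no rebuild at all as long as nobody views the article in between.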

This is a very urgent problem, and we need to start working on this as soon as possible. The main reason why phase 1 is so long in the plan, is in order to get this question resolved. --Denny Vrandečić (WMDE) (talk) 16:27, 20 March 2012 (UTC)

See Wikidata/Notes/Caching investigation. --Chrisahn (talk) 22:18, 20 April 2012 (UTC)

Internal API

This may be a good chance to start implementing MediaWiki's internal API. This has been proposed previously. The goal is to have functions that will provide exactly the same functionality as API calls, removing code duplication.

Backend

I believe the backend should be very flexible, enabling people to use multiple storage options.

Completely modular architecture: three groups of plugins: for data entry, data search, data storage?

Key/value store textually in DB.
Key/value store by IDs in DB.
External key/value store.
Creating DB tables automatically by identifying clusters of data.
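As a rough sketch of what pluggable storage could mean, the first two options above can sit behind one common interface. All names here (`Storage`, `TextKeyStore`, `IdKeyStore`) are hypothetical, and plain dictionaries stand in for database tables.

```python
from typing import Optional, Protocol


class Storage(Protocol):
    """Interface a storage plugin would have to provide (an assumption)."""

    def put(self, subject: str, predicate: str, value: str) -> None: ...
    def get(self, subject: str, predicate: str) -> Optional[str]: ...


class TextKeyStore:
    """Key/value store keyed by the textual "subject->predicate" string."""

    def __init__(self) -> None:
        self._rows = {}

    def put(self, subject: str, predicate: str, value: str) -> None:
        self._rows[f"{subject}->{predicate}"] = value

    def get(self, subject: str, predicate: str) -> Optional[str]:
        return self._rows.get(f"{subject}->{predicate}")


class IdKeyStore:
    """Key/value store that replaces names with integer IDs before storing."""

    def __init__(self) -> None:
        self._ids = {}    # name -> integer ID
        self._rows = {}   # (subject ID, predicate ID) -> value

    def _id(self, name: str) -> int:
        return self._ids.setdefault(name, len(self._ids))

    def put(self, subject: str, predicate: str, value: str) -> None:
        self._rows[(self._id(subject), self._id(predicate))] = value

    def get(self, subject: str, predicate: str) -> Optional[str]:
        key = (self._ids.get(subject), self._ids.get(predicate))
        return self._rows.get(key)


def demo(store: Storage) -> Optional[str]:
    """Calling code depends only on the interface, not on the backend."""
    store.put("Berlin", "population", "1234")
    return store.get("Berlin", "population")
```

Both demo(TextKeyStore()) and demo(IdKeyStore()) behave the same from the caller's point of view, which is the property a modular backend would need.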

There is a work package in phase 3 to deal with this. Input is very much appreciated. --Denny Vrandečić (WMDE) (talk) 16:28, 20 March 2012 (UTC)