Talk:Wikidata/Technical proposal

Microformats

Changes to infoboxes need to be mindful of those that already emit microformats; I'm happy to advise further if and when needed. pigsonthewing 16:58, 27 October 2011 (UTC)

Thank you. I will come back to you; this will be relevant as soon as phase 2 starts. Please ping me again if I do not get back to you! --denny 10:28, 25 November 2011 (UTC)
You should probably use Microdata and not Microformats, as the former is the W3C-standardized version. You should also consider RDFa. Also check Schema.org[1] and how they organize reuse. — Jeblad 16:48, 20 December 2011 (UTC)

References database

Since WikiData will be creating an interwiki database, and also a database to supply information to infoboxes, I was wondering whether it would also be possible to use it to create a reference/citation database. That is, whenever someone adds a citation to a specific book (or journal, etc.) in any wiki, rather than having to add all the bibliographic details in each separate article, it would call upon the WikiData citation reference. That way there would be ONE canonical/centralised place for a reference that can be updated/corrected once and will then display correctly everywhere that book is referenced. It helps to avoid link rot, can be translated, and greatly reduces the work of adding additional citations to articles.

An example of this kind of system is in use here: http://wiki.cancer.org.au/australia/Citation:O%27Gorman_T,_MacDonald_N,_Mould_T,_Cutner_A,_Hurley_R,_Olaitan_A_2009 This is the Cancer Council of Australia's wiki with a new "citation" namespace. See also, at the bottom, the "cited by" section; this is the same effect as the "file usage on other wikis" that we see on Commons files. That way, when a new article wishes to use this reference, it just adds the reference from the "existing citations" dropdown list. It is then displayed in the article correctly according to whatever style guide (and language) that wiki has (here is that example citation displayed: http://wiki.cancer.org.au/australia/Clinical_question:What_is_the_evidence_based_surgical_approach_for_hysterectomy_in_low_and_high_risk_apparent_early_stage_endometrial_cancer#References ).

Eventually we would create an amazing ISBN/ISSN database that would be the world's first universal, multilingual, up-to-date and freely accessible bibliographic dataset. This would be a massively cool gift to the world. We could even pre-populate it with the data from the several national libraries that have already released their datasets under CC0. It would encourage others to do the same by using the share-alike principle.

Is that possible? Wittylama 00:18, 29 October 2011 (UTC)

A similar database already exists on the French Wikipedia ("Référence" - example). Zorglub 01:07, 29 October 2011 (UTC)
That's very interesting! I was not aware of that system. I particularly like how you can change the display style to show different formats for the same reference (e.g. BibTeX, wikisyntax, list). If I understand how that system works, it is particularly used as a way to clearly identify all the different editions of a book that has been republished multiple times, correct? What I'm particularly hoping the new WikiData can be used for is the ability to place code that would look, for example, like {{reference|2-213-02191-0|1988|page 12}} and that would automatically insert the footnote, displayed correctly according to the local wiki's language and manual of style. In the case of the French Wikipedia that would be (I think): "Nicolas Grimal, Histoire de l'Égypte ancienne, Fayard, Paris, 25 novembre 1988, broché (ISBN 2-213-02191-0) page 12." If instead you were on the English Wikipedia and wanted to refer to the later English edition of the same book, you would write {{reference|0-631-19396-0|1994|page 12}} and it would give you "Nicolas Grimal, A History of Ancient Egypt, Blackwell, August 1994 (ISBN 0-631-19396-0) page 12." Do you see what I mean? Obviously I'm just inventing the code to demonstrate the idea, but the concept is to have a multilingual database of publication metadata that can be kept up to date, so that if someone adds more information later, it propagates to all the articles that have used that book as a reference. Wittylama 11:15, 31 October 2011 (UTC)
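To make the idea above concrete, here is a minimal sketch in Python, purely illustrative and not any planned Wikidata interface; the record fields, the render_citation function, and the style codes are all hypothetical:

# Hypothetical centralized store of bibliographic records, keyed by ISBN.
CITATIONS = {
    "2-213-02191-0": {"author": "Nicolas Grimal", "title": "Histoire de l'Égypte ancienne",
                      "publisher": "Fayard", "place": "Paris", "year": 1988},
    "0-631-19396-0": {"author": "Nicolas Grimal", "title": "A History of Ancient Egypt",
                      "publisher": "Blackwell", "year": 1994},
}

def render_citation(isbn, page, style="en"):
    """Render one footnote from the shared record, following the local wiki's style."""
    rec = CITATIONS[isbn]
    if style == "fr":
        return "%s, %s, %s, %s, %s (ISBN %s) page %s" % (
            rec["author"], rec["title"], rec["publisher"],
            rec.get("place", ""), rec["year"], isbn, page)
    return "%s, %s, %s, %s (ISBN %s), p. %s" % (
        rec["author"], rec["title"], rec["publisher"], rec["year"], isbn, page)

# {{reference|0-631-19396-0|1994|page 12}} would then roughly correspond to:
print(render_citation("0-631-19396-0", 12, style="en"))

Correcting, say, a misspelled author name in the shared record would then propagate to every article that cites either edition.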
Why do people always want a specific database for each kind of data? It is possible to think about a global concept for data storage, like this:
data.MainName = Histoire de l'Égypte ancienne
data.MainClass = Book
data.MainID = 145
data.Parameter1.Name = Title
data.Parameter1.Class = Book
data.Parameter1.Value1 = Histoire de l'Égypte ancienne
data.Parameter2.Name = Author
data.Parameter2.Class = Book
data.Parameter2.Value1 = Nicolas
data.Parameter2.Value2 = Grimal
...
This can be used for scientific data or person data too:
data.MainName = Methane
data.MainClass = Chemical compound
data.MainID = 149
data.Parameter1.Name = Molecular mass
data.Parameter1.Class = Physical Property
data.Parameter1.Value1 = 16
data.Parameter1.Value2 = g/mol
data.Parameter1.Value3 = {{data|146}} link to another data unit
data.Parameter2.Name = Heat capacity
data.Parameter2.Class = Physical Property
data.Parameter2.Value1 = ...
or
data.MainName = Albert Einstein
data.MainClass = Scientist
data.MainID = 204
data.Parameter1.Name = Name
data.Parameter1.Class = Person data
data.Parameter1.Value1 = Albert
data.Parameter1.Value2 = Einstein
data.Parameter2.Name = Birthday
data.Parameter2.Class = Person data
data.Parameter2.Value1 = 14
data.Parameter2.Value2 = March
data.Parameter2.Value3 = 1879 ...
To call the data you would insert something like {{addData|Albert Einstein|Birthday}} to get 14 March 1879 in the appropriate wiki format, or {{addData|Methane|Molecular mass}} to get 16 g/mol<ref>....</ref>. Snipre 14:31, 31 October 2011 (UTC)
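As a purely illustrative sketch of the generic scheme above (Python, not a proposed implementation; the record names and the add_data helper are made up), the data units could be plain nested maps, and {{addData|...}} would then be a simple lookup:

# Hypothetical generic data units: one record per subject, with typed parameters.
DATA = {
    "Albert Einstein": {
        "class": "Scientist",
        "id": 204,
        "parameters": {
            "Name": {"class": "Person data", "values": ["Albert", "Einstein"]},
            "Birthday": {"class": "Person data", "values": [14, "March", 1879]},
        },
    },
    "Methane": {
        "class": "Chemical compound",
        "id": 149,
        "parameters": {
            "Molecular mass": {"class": "Physical property", "values": [16, "g/mol"]},
        },
    },
}

def add_data(subject, parameter):
    """Rough equivalent of {{addData|subject|parameter}}: return the stored values
    joined as plain text (real output would be formatted per wiki and language)."""
    values = DATA[subject]["parameters"][parameter]["values"]
    return " ".join(str(v) for v in values)

print(add_data("Albert Einstein", "Birthday"))  # 14 March 1879
print(add_data("Methane", "Molecular mass"))    # 16 g/mol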

That is interesting. We will take a closer look at fr.wp and what they are doing as soon as we start. Commons also has a very elaborate metadata-capturing scheme. --denny 10:27, 25 November 2011 (UTC)

Official discussion

Will there be a mailing list opened for this project? Where is the "official" discussion going to happen?--Kozuch 14:22, 1 November 2011 (UTC)

Good question; is it a public project using MediaWiki? --Karima Rafes 16:57, 14 November 2011 (UTC)
There will be a public way to discuss the project. Right now the project has not started yet, but it will be an open and public project. --denny 10:25, 25 November 2011 (UTC)
Great to hear that. Are there links to the public discussions that led to this project, by the way? — Kennyluck 20:43, 1 January 2012 (UTC)

Sister projects

The page mentions only Wikipedia. I assume it's obvious that this needs to work also with sister projects (at least interwikis)... Nemo 08:18, 4 November 2011 (UTC)

Agreed. Helder 13:20, 17 November 2011 (UTC)
Agreed. --denny 10:26, 25 November 2011 (UTC)

Data means storage

One big issue, in my experience with data, is the integrity of the data and their storage. At least one project on de:Wiki is extracting data from infoboxes and creating a database on the Toolserver (I will call this process bottom-up). This is a problem for reuse, because wrong information present in an infobox will be included in the database and then propagated to other wikis.

The alternative is a top-down approach: a database is created outside the wikis (I mean on a different project than those using the data), and each wiki refers to that database to build its lists or infoboxes.

The second problem is the connection to the database: a template that connects to the database every time someone opens an article would generate huge traffic for the servers (I'm not a specialist, so this is only an assumption). And as the data won't change every day, a constant update is not necessary.

My proposal is to avoid using a template to link the wikis and the database, and instead to define bots which are responsible for updating the code in wiki articles using the database. This process already exists: on the Toolserver you can find tools which generate wikicode from a form. In our case, instead of using a form, the tool would use data from the database and generate the wikicode, which would then be copied into the articles by the bots. Snipre 09:40, 2 December 2011 (UTC)

However, by doing that we centralize data storage, and modifications of infoboxes in the individual wikis won't be included in the database. A specific procedure will have to be provided in order to add or modify data in the main database. Something to discuss, because that is a change to the Wikipedia spirit. Snipre 09:49, 2 December 2011 (UTC)
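As an illustration of the bot-based process proposed above, here is a rough Python sketch; the record layout, the template fields, and the update_article placeholder are all hypothetical, and a real bot would edit through the MediaWiki API:

# Hypothetical database record for one subject.
record = {"name": "Methane", "formula": "CH4", "molar_mass": "16.04 g/mol"}

def build_infobox(rec):
    """Generate infobox wikicode from one record (top-down: the database is
    the source, the article only receives the generated markup)."""
    return "\n".join([
        "{{Infobox chemical",
        "| name = %s" % rec["name"],
        "| formula = %s" % rec["formula"],
        "| molar mass = %s" % rec["molar_mass"],
        "}}",
    ])

def update_article(title, wikicode):
    """Placeholder for the bot edit; a real bot would call the MediaWiki edit
    API and replace only the generated part of the article."""
    print("Would update [[%s]] with:\n%s" % (title, wikicode))

update_article("Methane", build_infobox(record))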

Wikispecies

Considering that Wikispecies adds this same type of information (which should be reusable), are there plans to merge Wikispecies with Wikidata, to facilitate the integration of taxonomic information on Wikipedia and related projects? Raylton P. Sousa 23:02, 15 December 2011 (UTC)

The idea is to start with interlanguage links and infoboxes on Wikipedia, and then see which other Wikimedia projects may benefit from Wikidata, and in what way. I expect that once we have some experience with the centralized data stores, it will become clearer what kind of interaction with Wikidata makes sense for which project. -- Duesentrieb 13:15, 16 January 2012 (UTC)

Schema.org as a baseline, and some additional notes

It's very tempting to propose a structure based upon Schema.org[2] where pages use a multiple-inheritance hierarchy, possibly with something similar to a category system, and some kind of item specifier inside the page. The page could be exported through an API that mimics the fields from Schema.org (possibly with RDF and also as JSON according to Schema.rdfs.org), and then allow easy reuse by any service that follows the common Schema.org definitions.

The pages could then be imported with mechanisms similar to those used in InstantCommons, and with additional parser functions the DOM can be traversed and the information extracted. Such functions could use simple XPath selectors or more complex XSL transforms.
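For example, a parser function on the importing wiki might pull a single field out of an exported XML page with an XPath-style selector. A minimal Python sketch, where the XML layout is invented purely for illustration:

import xml.etree.ElementTree as ET

# Invented example of an exported data page (not an actual export format).
exported = """
<subject name="Albert Einstein">
  <property name="birthDate">1879-03-14</property>
  <property name="birthPlace">Ulm</property>
</subject>
"""

root = ET.fromstring(exported)
# ElementTree supports a limited XPath subset; a full XSL transform would be
# needed for more complex restructuring.
birth_date = root.find("./property[@name='birthDate']").text
print(birth_date)  # 1879-03-14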

When the information is extracted on the destination page at some external wiki, it is still not so simple to reuse the content. Several mechanisms are necessary to build additional descriptive text around the value, even if a localized template is available. It seems like this can be solved with guided rule-based machine translation, even if this is BAD according to the community at Wikipedia. There are additional notes about this at the page Some notes about small projects.

In some situations it might also be necessary to transform values from one unit system to another; examples are currencies and conversion from the metric system to inches.
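For instance, a minimal Python sketch with a made-up conversion table and function name, not a proposed module:

# Hypothetical conversion table: (from_unit, to_unit) -> factor.
CONVERSIONS = {
    ("cm", "in"): 1 / 2.54,
    ("km", "mi"): 0.621371,
}

def convert(value, from_unit, to_unit):
    """Convert a stored value to the unit preferred by the local wiki.
    Currency conversion would additionally need a dated exchange rate."""
    if from_unit == to_unit:
        return value
    return value * CONVERSIONS[(from_unit, to_unit)]

print(round(convert(100, "cm", "in"), 1))  # 39.4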

On the server side one could propose a system of hierarchical categories where some of them reflect a schema; a page at data.wm is then built by collecting such supercategories during editing, and those are used to build the set of legal fields for a specific subject. For each subject at data.wm, tags could be used as markers for properties on the page, as these should be static, while on the individual Wikipedia pages the information could be imported by use of parser functions.

There are some additional notes at Wikipedia:Gjenbruk (Norwegian) (also subpages) and Help:Schema (Norwegian) (also subpages). — Jeblad 20:27, 20 December 2011 (UTC)

P1.1. Focus on an XML representation, especially one that supports RDF, RDF Schema, or OWL. In addition, make sure to follow the definitions from Schema.org if possible. Note also that it is probably not necessary to do any reasoning with OWL; it is only about encoding the information as described by this standard. (Phase 2: «We will not provide automatic mappings to external vocabularies and ontologies or use internal reasoning engines.»)

JSON should be available as a transform of the XML representation, and not used directly.

P1.4. Translation of datasets into localized versions is a good bit more complex than just throwing together some translated strings, as the values themselves sometimes need translation.

O1.2. If some prerequisites exist, then editing can be more or less automatic. That is, there must be no conflicting entries in any Wikipedia article.

O1.3. A lot of reuse centers around national statistics. Those datasets are not identified by unique identifiers, so the URL of the resource together with an XPath statement to navigate within the resource must be used as an identifier.

P2.1. Not only values need transformation, but also units. This kind of transformation is missing. Recalculating the precision might rely on the units used for the value.

There might be diverging values for some properties, and it can be unclear how to handle them. A typical example is location data, where there might be some constraints, some inherent resolutions, and some uncertainty in the given value.

A property might have a name, a value, and a unit (typed values?). In addition there might also be a realm, which often creates problems in term bases. To further complicate this, there might be a time limit on the correctness of the value.
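One possible shape for such a property record, as a Python sketch whose fields simply mirror the points above and are not a proposed schema:

from dataclasses import dataclass
from typing import Optional

@dataclass
class Property:
    """One typed value with the qualifiers discussed above."""
    name: str                         # e.g. "molecular mass"
    value: float
    unit: Optional[str] = None        # e.g. "g/mol"; None for plain numbers
    realm: Optional[str] = None       # the context in which the value applies
    precision: Optional[float] = None
    valid_from: Optional[str] = None  # ISO dates bounding correctness in time
    valid_to: Optional[str] = None

population = Property(name="population", value=3400000, realm="Berlin",
                      precision=50000, valid_from="2011-01-01", valid_to="2011-12-31")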

P2.7. This also influences the automatic creation of stub articles. This is very important for small languages. Note that this is about automatic generation of text in other languages; it is not about machine translation of preexisting text. Such guided machine generation of text is somewhat easier than machine translation.

O2.6. Trust modeling of factoids; there is at least one very interesting trust model.

P3.1. (P3.3.) Generating complex structures and lists from Wikidata implies some heavy language rewrites, unless it is limited to HTML structures only.

O3.5. Some "flat" queries can be reformulated as hierarchical queries. Most important for Wikipedia are perhaps searches in time and space, where each of the four axes is continuous but can be reformulated as named segments. It is, for example, possible to split time into years and latitude/longitude into degrees.
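A sketch of such a reformulation in Python; the segment names and cell format are arbitrary:

def segments(year, lat, lon):
    """Turn continuous time/space coordinates into named, hierarchical
    segments that a "flat" query can be grouped or filtered by."""
    return {
        "year": str(year),                            # time split into years
        "cell": "%+03d%+04d" % (int(lat), int(lon)),  # one-degree lat/lon cell
    }

print(segments(1879, 48.40, 9.99))  # {'year': '1879', 'cell': '+48+009'}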

O3.6. A visualization can have implications for the article in Wikipedia that go beyond a simple injection of a widget into the article; it can be the whole article that needs some transformation.

Status updates

How is the project going? Will anyone provide regular status updates? I think a lot of people would be interested. Also, a lot of volunteers might want to help out, but currently there are no "official" instructions about what is going on and what will be going on... or whether help will be needed and wanted.--Kozuch 21:23, 21 December 2011 (UTC)

I'm also interested in the status of this. --MZMcBride 21:36, 21 December 2011 (UTC)
Me three. Wittylama 22:25, 3 January 2012 (UTC)
Me also. Tpt 17:06, 8 January 2012 (UTC)
Me five. --Yair rand 20:36, 8 January 2012 (UTC)

Hi folks. Preparations for the project are coming along nicely; we expect to kick off in early April. The team is almost complete, but we are still in the process of hiring, among others, someone in charge of community communications, who would provide status updates and also coordinate volunteer efforts.

By April, we should have a detailed road map. I suppose there will be some things in there that are suitable for volunteers to work on. However, in my experience, volunteers can best do "stuff that comes up along the way", because they are not committed to a road map and can react immediately to community demand. I expect this to be especially true for UI-related things. But we'll see.

Cheers -- Duesentrieb 13:00, 16 January 2012 (UTC)

Thank you for the status update. :-) --MZMcBride 16:14, 18 January 2012 (UTC)
Thanks too. I hope you will give us regular updates and will make enough use of community volunteers so that this project will not get doomed like others did (e.g. the Strategy wiki)...--Kozuch 21:12, 20 January 2012 (UTC)

Hiring

Was the hiring successful? Did you get the people you wanted?--Kozuch 21:25, 21 December 2011 (UTC)

This page announces that the team is planned to be complete by March 2012. --Spischot 13:16, 24 December 2011 (UTC)
Currently it looks like part of the team will start work in March, and we'll have the full project kickoff in April. -- Duesentrieb 13:08, 16 January 2012 (UTC)

Overlapping articles

At Phase 1 you write: "Every Wikipedia can only contain one such article." How can this system handle situations where articles of different Wikipedias overlap, or do not exactly match, or one of them has a single article covering two subtopics for which the other one has separate articles? Bináris tell me 08:06, 27 December 2011 (UTC)

There is at least one project that has analyzed alignment between articles at different projects. As I recall, there are several problems with this, often because some of the articles focus on different topics (is an article about a church about the congregation or the building?) or because a subtopic holds the main title ("Plane (aircraft)" is located at "Plane"). — Jeblad 21:46, 27 December 2011 (UTC)
It's an interesting (and legitimate) problem, but I suspect we'll solve it the way we currently solve interlanguage links that don't precisely match - manually. If any other project tried to do this it would fail, because it would need massively clever automation to fix all the errors; but we've got the community, who have already done (and will continue to do) the hard work of disambiguating articles/concepts that match up between languages. In fact, centralising the interlanguage links in one place will make us faster and more efficient at this task that we already do. Wittylama 00:59, 11 January 2012 (UTC)
As I recall from the project's paper, the misalignments between the articles were considerable, and they did use the language links. — Jeblad 04:21, 11 January 2012 (UTC)
Wikidata will provide a centralized interface for managing interwiki links, but will not by itself solve the problems of granularity and misalignment. Initially, the quality of interlanguage links is expected to remain the same as with the current bot-based system. It will hopefully make the problem more manageable, though. E.g. no more removing a bad interlanguage link only to have it put back two minutes later by a bot. It will also provide a central place for discussing problems with topic alignment.
It would be interesting to provide tools for splitting and merging concepts. I expect Wikidata will need to address this at some point, but it may not be part of the baseline implementation. -- Duesentrieb 13:07, 16 January 2012 (UTC)