Talk:Wikidata/Technical proposal/Archives/2012/March


Official discussion

Will there be a mailing list opened for this project? Where is the "official" discussion going to happen?--Kozuch 14:22, 1 November 2011 (UTC)

Good question. Will it be a public project using MediaWiki? --Karima Rafes 16:57, 14 November 2011 (UTC)
There will be a public way to discuss the project. Right now the project has not started yet, but it will be an open and public project. --denny 10:25, 25 November 2011 (UTC)
Great to hear that. Are there links to public discussions that led to this project, by the way? — Kennyluck 20:43, 1 January 2012 (UTC)
Just as an update and for the sake of completeness: There is now a mailing list at https://lists.wikimedia.org/mailman/listinfo/wikidata-l --Lydia Pintscher (WMDE) (talk) 16:10, 15 March 2012 (UTC)

Data means storage

One big issue, in my experience with data, is the integrity of the data and their storage. At least one project on de:Wiki is extracting data from infoboxes and creating a database on the Toolserver (I will call this process bottom-up). This is a problem for reuse, because wrong information present in the infobox will be included in the database and then propagated to other wikis.

The alternative is a top-down approach: a database is created outside of the wikis (I mean on a different project than those using the data), and each wiki refers to that database to build its lists or infoboxes.

Then the second problem is the connection to the database: a template which connects to the database every time someone opens an article will generate huge traffic for the servers (I'm not a specialist, so this is only an assumption). And as the data won't change every day, a constant update is not necessary.

My proposal is to avoid using a template to link the wikis to the database, and instead to define bots which will be responsible for updating the code in wiki articles from the database. This process already exists: on the Toolserver you can find tools which generate wikicode from a form. In our case, instead of using a form, the tool would use data from the database and generate the wikicode, which would then be copied into the articles by the bots. Snipre 09:40, 2 December 2011 (UTC)
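
As an illustration of the kind of bot run described here, a minimal sketch in Python; the query endpoint, the field names, and the save step are all hypothetical placeholders, not anything the project has specified:

import re
import requests

DATA_API = "https://wikidata.example/api/values"  # hypothetical endpoint

def fetch_value(subject, field):
    # Ask the central database for one value (hypothetical API).
    r = requests.get(DATA_API, params={"subject": subject, "field": field})
    r.raise_for_status()
    return r.json()["value"]

def update_infobox(wikitext, field, value):
    # Rewrite a single "| field = ..." parameter inside an infobox.
    pattern = re.compile(r"(\|\s*%s\s*=\s*)[^\n|]*" % re.escape(field))
    return pattern.sub(lambda m: m.group(1) + str(value), wikitext)

# A bot run would loop over the tracked articles, fetch fresh values,
# rewrite the infobox parameters, and save each page with an edit
# summary citing the database revision used, so the change stays
# visible in the article's history.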

However, by doing that we centralize data storage, and modifications of infoboxes in individual wikis won't be included in the database. A specific procedure will have to be provided in order to add or modify data in the main database. Something to discuss, because that is a modification of the Wikipedia spirit. Snipre 09:49, 2 December 2011 (UTC)
Yes, this will be a technical challenge, as you said. But we have extremely smart people on the team, so I hope they can use all those smarts and be brilliant enough to make this work ;-) Regarding the workflow changes: yes, there will be some. How exactly that will work is a matter of discussion and trying. We'll start with that in April. Hope that clarifies it a bit. --Lydia Pintscher (WMDE) (talk) 16:15, 15 March 2012 (UTC)

Reconciling Wikidata with Wikipedia expectations :-)

... I seriously wonder, however, how the social engineering of Wikidata is expected to work. Important factors for Wikipedia are trust, transparency, shallow hierarchies, and a long-term publishing platform - how are these planned to work with Wikidata? Can I, with simple and transparent methods, see who changed which information in a given Wikipedia article (that is, including all the information that in future comes from Wikidata)? Can I see this over years? I have very high esteem for Daniel and Denny, but fear that presently the design may be too technically focussed. I look forward to a general discussion venue (or will just look, it may well exist already :-) ). G.Hagedorn 23:39, 11 February 2012 (UTC)

Hi Gregor! I fully understand your concerns. You can be assured that they are on our mind. First of all, there will be an edit history for every data record, exactly as there is for wiki pages. So the situation will not be worse than it is now.
Currently, changes to templates or images included on a page are not easily visible when looking at the page's history, and they also do not show up for people watching the page. Without further measures, this problem would be the same with Wikidata. And since the data would typically be essential to the article (more so than an image), this would be quite annoying.
Because of this, we are considering ways to integrate changes to the data records into the recent changes of the pages that use the data (so they show up in watchlists) - or even recording them in the page's history. How that should be done is one of the main engineering challenges, but you can be sure that we are thinking about it. -- Duesentrieb 10:18, 12 February 2012 (UTC)
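
As a rough illustration of the propagation described here, a sketch under the assumption that the repository keeps a usage table mapping each data record to the pages that transclude it; all names are invented:

import time
from dataclasses import dataclass

@dataclass
class RecentChange:
    page: str
    comment: str
    timestamp: float

# Hypothetical usage table: data record -> pages transcluding it.
USAGE = {"Einstein.birthdate": ["en:Albert Einstein", "de:Albert Einstein"]}

def propagate(record_id, editor, recentchanges):
    # Add an entry to the recent changes of every page using the record,
    # so the edit shows up on the watchlists of those pages.
    for page in USAGE.get(record_id, []):
        recentchanges.append(RecentChange(
            page=page,
            comment="data record %s changed by %s" % (record_id, editor),
            timestamp=time.time(),
        ))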
Hi, Daniel, that certainly sounds good! You are correct, similar problems occur with templates (although in practice templates rarely provide content) and media files. In flagged revisions, this is more or less solved, I believe. The nice solution would perhaps be to have a true, overarching change record as history, regardless of which dependency changed, but of course that sounds scary to me as well. - I assumed you would be able to track records in Wikidata, but that alone does not provide transparency yet. If it is deeply hidden and fragmented, as in many "database transparency" attempts, it may require forensic effort. -- For Wikidata I wonder what arguments exist against a simple solution, in which, at least initially, data are not inserted dynamically into a page, but simple bot-like page rewrites (using the job queue) occur with appropriately updated template calls? I see advantages in that it breaks no social expectations, does not reduce transparency (it can enhance it by using a Wikidata user plus a comment citing the source and update), and works well with the current page caching mechanisms. What, other than "waste of storage", argues against this? -- G.Hagedorn 12:28, 13 February 2012 (UTC)
Oh, ease of use, clarity about how and where the data can and should be edited, delayed effect of changes, etc.
But: Wikidata does not in any way force the local community to transclude the data directly. We will just provide the mechanisms to do so. If a wiki however decides that they would rather use a bot to fetch the data from the API and insert it directly into the article, that's fine with us. -- Duesentrieb 12:57, 13 February 2012 (UTC)
(Does the discussion on this happen here, or is there a trade-off chart somewhere else?) I think your arguments are valid, but they need to be weighed with priorities. At least the "clarity about how and where the data can and should be edited" can be largely overcome by simply adding a comment "<!-- Do not update, automatically updated by XXX-->". This comment is clearly visible in edit mode, it is in line with current practices, and it works. And delays are non-existent, provided the job queue is resolved immediately (it won't always be, but it's a separate task to manage computing resources).
Of course it is more primitive than what we may dream of. If your task force could achieve a perfect solution by the end of 2013, it may be worth it, but we all doubt that, don't we? My argument is not that you should never provide more complex forms, but that if you make them the priority, and consider "the community" something alien, it will take forever, or as long as the rich-text editor, whichever is shorter. Nothing usable has come up yet, and people are migrating in masses to Drupal (which has all this, plus a Drupal variant of a "Wikidata" already at a very advanced stage). My recommendation: aim at transparent updating as the first step, get that introduced, and offer an advanced "live inclusion" extension only in a second phase. G.Hagedorn 07:35, 15 February 2012 (UTC)
For templates, people can normally use "Related changes", and set the namespace to templates. With Wikidata, I assume that the pages will be on a separate wiki, thus removing that possibility. I think that if pages had something similar to the "Templates used on this page" list for Wikidata data points, it would help improve transparency. --Yair rand 12:46, 14 February 2012 (UTC)

Wikidata is a really interesting project. The way the feedback to the projects (e.g. the Wikipedias) is done is the most crucial point for me. A few thoughts/questions:

  • Is it planned that the Wikipedias will depend on Wikidata in the long run, like they depend on Commons today? I.e., will the information be copied from Wikidata to Wikipedia, or will it be included, such that if Wikidata went down for some reason, the Wikipedias would suddenly be missing data?
  • What is the best way to transmit data back to the Wikipedias? An instant, silent update seems possible in the "inclusion" scenario of my first point. An instant edit to the Wikipedia history is an option, but that would mean that one edit on Wikidata could lead to hundreds of edits in the local projects, making the system quite vulnerable to vandalism/edit wars. Maybe it would make sense to install a verification system (similar to "flagged revisions") on Wikidata, or to use periodic instead of instant updates.
  • There is hardly "one Wikimedia community"; most Wikipedia language editions are self-sufficient and rather isolated in practice. Different Wikipedias have evolved different standards, e.g. with respect to sourcing data, and it will not be easy to get all communities to agree on one standard. If an edit on Wikidata affects many different projects, this will create a completely new situation. This is maybe not so important for the interwiki links in stage 1, but certainly for the later stages.

--Tinz (talk) 14:19, 21 March 2012 (UTC)

To your first point: The latter. But they will be running on the same infrastructure, so it's unlikely that only one of Wikipedia and Wikidata would go down.
About your second point: Making sure it's not easy to insert wrong information unseen is definitely on our list of things to do. How exactly this will look still needs to be discussed.
To your last point: Yes that's definitely a challenge and one I will be working on with everyone. But in the end each Wikipedia is free to use or not use specific data from Wikidata. --Lydia Pintscher (WMDE) (talk) 18:05, 21 March 2012 (UTC)

Wikidata and Wikisource

Like a lot of Wikisource contributors, I'm very interested in the Wikidata project, and I have some questions about the relations between Wikidata and Wikisource:

  1. The project is to store interwiki links on Wikidata. What about interproject links? For example, many writers have their biography on Wikipedia, the list of their books on Wikisource, some quotations on Wikiquote, and pictures on Wikimedia Commons, and there are interlanguage links in all these projects. An example, Dante Alighieri: Wikipedia, Wikisource, Wikiquote, Commons.
  2. The author pages on many Wikisources have a template like an infobox at the top, with the same information as in the Wikipedia infobox. Can Wikidata provide this information to Wikisource infoboxes as it will to Wikipedia ones? Wikisource could use the extension developed for Wikipedia, I think, without major changes.
  3. Wikisource has the beginnings of a semantic system with the Proofread Page extension and its Index namespace, which provides metadata on books that is shown in the main namespace (an example: Index, a chapter in main), but this metadata is stored in a template inserted into Index pages and can't be retrieved with an API. So, can we imagine using, in the future, the core of the semantic system created for Wikidata in order to have a great semantic system inside Wikisource? I would be very happy to work on it!

Tpt (talk) 20:25, 9 March 2012 (UTC)

Hi Tpt, thanks for your comments! Before I reply to your individual questions, let me clarify something: the Wikidata software development project run by Wikimedia Germany is limited in scope, and has a very tight time frame and well-defined deliverables. The goal of this effort is to provide a foundation for the Wikidata community project and cover some basic use cases, which can be expanded on later. The Wikidata development team will be focusing on the Wikipedia-related use cases described in the technical proposal, but we will keep other possible use cases in mind, so the system is flexible enough to cover them as well. So, while we are focusing on Wikipedia first, this does not mean other projects are to be left out.
Now to your questions:
  1. Yes, we will aim to cover "sister links" as well as "language links".
  2. Yes, that should be no problem at all. The same is true for "creator" pages on Commons.
  3. Yes, I believe so. I envision something similar becoming possible for the metadata about images on Commons. However, the details are a bit unclear - one question would be whether this data would be stored in the central Wikidata repository, or locally in Wikisource, just using Wikidata software components.
One more thing: Wikidata does not aim to provide any kind of inference, which is why I try to avoid the label "semantic". Wikidata will provide rich structured data, and perhaps enough semantics can be imposed on that to make some level of inference or reasoning possible using third-party tools. But there are currently no plans for Wikidata itself to support this kind of thing.
HTH -- Daniel Kinzler (WMDE) (talk) 10:36, 12 March 2012 (UTC)
For the third question, the data are about a specific edition of a book reproduced on Wikisource. So I think they could be stored in Wikidata, if this project wants to provide a list of books with their editions like openlibrary.org does, or in Wikisource if it doesn't. Tpt (talk) 15:48, 12 March 2012 (UTC)

A question

I have been trying to read up on all the material and figure out whether a particular use case that is of interest to me will work or not. In case my question is one of those that require an RTFM response, please do indicate the exact link. The question is: if I look at the history of an article (say an en.wiki article) which makes use of linked data, and click on a particular older version of the page, would I get the older version of the linked data? That is, would the data stored in the database system also have historical versioning? (A programmer's analogy would be whether it would be like getting the SVN snapshot for a particular date.) Wikimedia Commons does not do this (that is, if I upload multiple versions of an image on Commons, the Wikipedia page history will link to and show the current image), but it seems like it is often used as an analogy for explaining Wikidata to people. Please let me know if I am not being clear. Shyamal (talk) 11:49, 23 March 2012 (UTC)

That's a good question! There is a problem in MediaWiki itself with this: If you look up a previous version of a page involving a template, it also does not resolve to the previous version of the template but to the current version of the template. Since templates will most likely be the ones requesting the data from Wikidata, the whole situation is very muddled. My thinking is, for now: as long as MediaWiki does not resolve to the previous versions of templates (nor Commons, as you point out yourself), it would be inconsistent to do so for Wikidata. But since we retain a complete history of the data, it will be possible to change this later, together with the behavior of templates. --Denny Vrandečić (WMDE) (talk) 09:19, 24 March 2012 (UTC)
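For what it's worth, once a complete history is retained, a point-in-time lookup could later look something like the following sketch; the storage layout and all names are invented for illustration, not the planned implementation:

import bisect

# Hypothetical history store: record id -> list of (timestamp, value)
# pairs, sorted by timestamp, one pair per revision of the data record.
HISTORY = {
    "Einstein.birthdate": [(1100000000, "1879-03-14")],
}

def value_as_of(record_id, page_revision_timestamp):
    # Return the record value that was current at the given timestamp,
    # i.e. the latest revision not newer than the page revision.
    revisions = HISTORY[record_id]
    timestamps = [ts for ts, _ in revisions]
    i = bisect.bisect_right(timestamps, page_revision_timestamp)
    if i == 0:
        return None  # the record did not exist yet at that time
    return revisions[i - 1][1]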
Do you know w:Wikipedia:WikiProject User scripts/Scripts/TimeTraveller? Never tried it, but it should do that. --Atlasowa (talk) 14:45, 25 March 2012 (UTC)
The page says that they show the current templates and pictures. It only changes the links between pages. --Denny Vrandečić (WMDE) (talk) 10:10, 30 March 2012 (UTC)

Article, Talk, Facts

Just found a link to this project and it appeals to me a lot. There is a tremendous amount of data just waiting to be disclosed by Wikipedia. Just think of all the app developers that are scanning FTP sites to find just that one table containing public toilets in New York, only to find out that they can't convert the DBF format. Or the fact that you could actually ask Wikipedia "What's the latest stock price of Google" and it responds with "$641.24, updated less than 5 minutes ago". If Wikipedia could get this project off the ground, it might mean the difference between Archie and Google for data.

Below is how I think I would implement it if I were the boss and had a lot of money to burn.

I like the way Snipre proposes to retrieve the data: {{addData|Albert Einstein|Birthday}}. Why not add the data in the same vein?

Let's take the same example: en:Albert Einstein. Now there are five "tabs": Article, Talk, Edit this page, History, Watch. What if WP added a sixth, "Facts"? This would contain lines like:

Albert Einstein (Scientist).BirthDay=3/14/1879
Albert Einstein (Scientist).LastName=Einstein

Or to type less, since these facts are already linked to Einstein because they are on "his" Facts page:

~.Class=Scientist
~.BirthDay=3/14/1879
~.LastName=Einstein
Mass–energy equivalence (Scientific theory).PublishingDate=1905

(I've added Mass–energy equivalence to show that facts on article X do not have to be about X.) The Facts page can be edited just like the Article page and Talk page can. Also, robots can of course edit them. So

Google (Company).CurrentStockPrice$=641.24 
Google (Company).Ticker=GOOG

on the Facts page could easily be maintained by a robot and incorporated into this article with {{addData|~.CurrentStockPrice$}} or into en:Stock market using {{addData|Google (Company).CurrentStockPrice$}}
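
Purely as an illustration that such Facts lines are machine-readable, a rough parser sketch; the syntax is the proposal above, while the code and all names are hypothetical:

import re

# "Subject (Class).Field=Value", with "~" standing for the page's own subject.
FACT = re.compile(
    r"^(?:~|(?P<subject>[^(]+?)\s*\((?P<cls>[^)]+)\))"
    r"\.(?P<field>[\w$]+)=(?P<value>.*)$")

def parse_fact(line, page_subject, page_class):
    # Returns (subject, class, field, value), or None for lines
    # this simple sketch does not understand.
    m = FACT.match(line.strip())
    if not m:
        return None
    subject = m.group("subject") or page_subject
    cls = m.group("cls") or page_class
    return (subject.strip(), cls, m.group("field"), m.group("value"))

# parse_fact("~.BirthDay=3/14/1879", "Albert Einstein", "Scientist")
# -> ("Albert Einstein", "Scientist", "BirthDay", "3/14/1879")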

In AI, many attempts have been made to turn a sentence like "He was born in 1879" into facts. Maybe many of the facts could be retrieved automatically from the article (especially articles that were originally made by a bot, like city populations etc.).

I would like to see this as free as possible, the Wikipedia way. So if someone knows that a painting was made after 1805, he should be free to add the fact

~.Date=After 1805

Nonetheless, I think .Date should have a distinct type, defined somewhere, so spelling errors or data errors can be found automatically. After that it's up to more robots to decide that a value "After 1805" means ">12/31/1804".
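
A toy normalizer along those lines, to show the kind of robot this implies; everything here is invented for illustration:

import re
from datetime import date, timedelta

def normalize_date(value):
    # Turn a fuzzy date like "After 1805" into a comparable bound,
    # so ">12/31/1804" style comparisons become possible.
    m = re.match(r"After (\d{4})$", value.strip(), re.IGNORECASE)
    if m:
        year = int(m.group(1))
        return (">", date(year, 1, 1) - timedelta(days=1))
    m = re.match(r"(\d{1,2})/(\d{1,2})/(\d{4})$", value.strip())
    if m:
        month, day, year = map(int, m.groups())
        return ("=", date(year, month, day))
    return None  # unparseable: flag for human review

# normalize_date("After 1805") -> (">", datetime.date(1804, 12, 31))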

It would also be nice to have "child facts". For Einstein, it would be nice to have a list of publications like:

~.Publications
On a Heuristic Viewpoint Concerning the Production and Transformation of Light
On the Motion of Small Particles Suspended in a Stationary Liquid, as Required by the Molecular Kinetic Theory of Heat
On the Electrodynamics of Moving Bodies
Does the Inertia of a Body Depend Upon Its Energy Content?
/~.Publications
On a Heuristic Viewpoint Concerning the Production and Transformation of Light (Publication).ScientificArea=Photoelectric effect

Again, somewhere else it is defined that Publications must be a list of objects of type Publication. Of course it's not required that there actually is an article on WP called en:On a Heuristic Viewpoint Concerning the Production and Transformation of Light (Publication)

One more thing I'd like to see is that Wikipedia actually implements a search engine that responds to questions. So if someone asks "What's the latest stock price of Google" for the first time, the question gets added to the list of "Questions not understood" and the person asking can leave their email in the meantime. Then, on some page somewhere (a Questions tab? A whole different project?) someone adds this:

What's the latest stock price of {X}
The current price of {X,Company}.OfficialName at Nasdaq is ${X,Company}.CurrentStockPrice$

The software would look for a Company with name X and return the wanted facts.

To respond to "What's the latest stock price of GOOG":

What's the latest stock price of {X (Company).Ticker}
The current price of {X}.OfficialName at Nasdaq is ${X}.CurrentStockPrice$

This would look in the Ticker data to get to the right object. Also:

What weekday was Easter in {X (Easter date).Year}?
That year, Easter was on a {X.DayOfWeek}
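
A back-of-the-envelope sketch of that matching step; the question and answer templates are the proposal above, while the resolution code and data are hypothetical:

import re

FACTS = {("Google", "Company"): {"OfficialName": "Google Inc.",
                                 "CurrentStockPrice$": "641.24",
                                 "Ticker": "GOOG"}}

QUESTION = re.compile(r"What's the latest stock price of (?P<X>.+?)\??$")

def answer(question):
    m = QUESTION.match(question.strip())
    if m is None:
        return None  # add it to the "Questions not understood" list
    x = m.group("X").strip()
    for (subject, cls), fields in FACTS.items():
        # {X} may name the company directly or match its Ticker fact.
        if cls == "Company" and x in (subject, fields.get("Ticker")):
            return "The current price of %s at Nasdaq is $%s" % (
                fields["OfficialName"], fields["CurrentStockPrice$"])
    return None

# answer("What's the latest stock price of GOOG")
# -> "The current price of Google Inc. at Nasdaq is $641.24"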
  • Facts should be able to be tagged (like OR, POV, etc.)
  • Talk pages would be about both the Article and the Facts page
  • Facts should be familiar pages in the interface, but have a standard database in the background
  • The database should have an API to make it very easy to add/delete/modify facts. It should be really easy to add "Latest tweet" to en:Justin Bieber.
  • Pages like en:Boiling point would have facts like H2O (Chemical formula).BoilingPointC=100. en:Water would have a fact ~.ChemicalFormula=H2O. The system should be clever enough to deduce that Water (Chemical substance).BoilingPointC=100. So at the place where fields are defined, you should also be able to tell that (Chemical substance).ChemicalFormula corresponds to the class (Chemical formula).
  • Robots should be encouraged not to write the Article tab but only the Facts tab. For all these articles, all we need is one template that gathers the facts for one class. So the template {{Chemical substance}} would show a nice article for en:Water even if the article hadn't been written by anyone. The template {{City}} would deduce from Houston (City).USState=TX that it's in Texas and would also tell some things about Texas.
  • More facts can be deduced with Rules (see the sketch after this list). At the en:Melting point Facts one could add:
(Chemical substance).MeltingPointC>20 -> .StatesOfMatterAtRoomTemperature=Solid
  • Wikipedia would become even more like an Excel sheet where you change one thing and thousands of pages suddenly change. Robots and users would need more surveillance.
  • The format of some data is simply too specific. OpenMap data would be tedious to implement as "facts". How to handle that kind of data is at best the next step.
  • Interwiki links are also "facts", but right now they are too free to consider them as such. "Article X mentions Y" is the best possible inference. "Libertarians usually do not agree with en:Keynes" does not imply libertarians have anything to do with Keynes.
  • New interfaces are possible. Now somewhere in the text it says "The Annus Mirabilis papers are four articles pertaining to...". You would have a DVD-like interface where you can click the simple "list fact" Publications even if the Article itself is babbling about "Annus Mirabilis papers"
  • It would be very cumbersome to say "Water boils at {{addData|~.BoilingPointC}}", so that won't be used.
  • Which is too bad, because most users would like to hover their mouse to see the Fahrenheit version.
  • It would however be nice to type in the Article "Water boils at 100 {!.BoilingPointC} C" to automatically add the fact that 100 is also the correct value for ~.BoilingPointC.
  • Casing has no value at all. No one sees a real difference between BoilingPointC and boilingPointC except for lazy computers. If someone types boilingpointC, WP should automatically correct it to BoilingPointC and not assume that someone meant some completely different boiling point.
  • It would be nice if WP "knows" that C can be converted to F and $ to Euro so WP can show regionalized versions of the same page.
  • Classes should have inheritance. A scientist is a person (so will have a birthdate) and may have publications.
  • Multiple inheritance might get things too complicated for classes, but for single articles (like en:Leonardo da Vinci) it would be nice to say
~.Class=Scientist,Writer,Painter

to enable all the fields for those three classes. Many articles fall into different categories, so inherently we need multiple inheritance.

  • Conflicting facts should be marked automatically and not used anywhere. If the Facts page on Einstein says he was born "3/14/1879" and another page says he was born "After 1870", real people need to look into it and decide who's right.
  • A "List of new medical inventions between 1950 and 1960" could be generated automatically instead of laboriously made manually. If people want to have such a list, all articles about medical inventions that don't have a date could get a "Fact" that's labeled "unknown".

Joepnl (talk) 02:18, 31 March 2012 (UTC)