Talk:Requests for comment/How to deal with open datasets

Perspectives from Italy

Experiences

WMIT has been working a bit on open data matters, also in collaboration with OKFN; we hope to do better now that we're joining forces with the OSM-IT community by becoming the local OSM chapter.

--Nemo 09:40, 15 May 2014 (UTC)

Suggestions

So, this is what I suggest you do, in order.

  1. If they "just" want to join the open data wave with Wikimedia, propose instead to have some joint or jointly communicated activity, even without actual edits to Wikimedia projects.
  2. As a start, get the data out. Publish it as they have it, under CC0. Let people know (see point 1) and let them figure out what can be used.
  3. If this information is available somewhere already, cross-reference it with Wikipedia. BNCF proves this will increase their visibility and allow them to spot singularities in their dataset. (A rough sketch of this kind of cross-referencing appears after this list.)
  4. If they don't have a sensible way to publish, or curate, the data they have, and it's a lot of stuff, propose that they look into using Wikibase directly. It costs orders of magnitude less than developing any internal solution.
  5. Only once they've done, or looked into, all of the above would it be helpful for them to try to import the data Wikipedia needs to use; IMHO it's always a good thing when non-Wikimedia entities devote human resources to contributing to Wikidata via bots or otherwise, reducing the work of our volunteers. --Nemo 09:40, 15 May 2014 (UTC)
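
As a rough illustration of point 3, the sketch below checks each label in a local CSV file against Wikidata's public search API (wbsearchentities) to see whether a matching item already exists. The file name and the "label" column are made-up assumptions; treat this as a starting point, not a finished tool.

    # Rough sketch: look up each label from a local CSV against Wikidata,
    # so records can be cross-referenced before any import is attempted.
    # The file name and the "label" column are made up for illustration.
    import csv
    import requests

    API = "https://www.wikidata.org/w/api.php"

    def search_wikidata(label, language="en"):
        """Return candidate Wikidata items for a given label."""
        params = {
            "action": "wbsearchentities",
            "search": label,
            "language": language,
            "format": "json",
        }
        response = requests.get(API, params=params, timeout=10)
        response.raise_for_status()
        return response.json().get("search", [])

    with open("institution_records.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            candidates = search_wikidata(row["label"])
            if candidates:
                print(row["label"], "->", candidates[0]["id"])
            else:
                print(row["label"], "-> no match found")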

Fantastic

Just wanted to show support for this initiative. It's a shame there's no submission about this for Wikimania. EdSaperia (talk) 10:02, 15 May 2014 (UTC)

Ed, I think the submission period has already passed, but if you want we can organize an open discussion about it. The lead text is already written ;)--Micru (talk) 11:24, 15 May 2014 (UTC)

My personal opinion

I think this topic is going to become more and more important, and we should seek a general agreement instead of each community looking for its own solution/approach (as is happening now). From my POV one of the trickiest issues is "who takes care of datasets?". If neither Commons nor Wikidata wants to take care of providing dataset support, then we should see if there is enough interest in starting a new (sub)project from either one. I would say datasets can be better managed by a platform like Commons, since they can be considered "files" with associated metadata, licenses, etc. (even if they are not media) and visualized in the Media Viewer with fantastic tools like Vega (select "parallel coords" from the dropdown list to see what I mean).

There is big potential in having a platform to import cleaned and uniformly formatted datasets, either to be used directly in Wikipedia (when convenient, and under CC-BY licenses) or to be imported into Wikidata (when convenient, and under CC0 licenses). There are already many bots importing datasets into the Wikidata structure, so if we had a "dataset platform" (like CKAN, but not necessarily CKAN itself) operating with "Wikidata-compatible formatted data", then it would be much easier to manage batch imports without programming a bot (just request which dataset to import, get approval, done), or to update existing data, because we would have proper references indicating which dataset each statement comes from. True, it can be done without such a platform (it is being done already), but only people with enough technical knowledge venture to do it. By making it easier we would universalize data contribution.

I would like to put an emphasis on "Wikidata-compatible formatted data", because if you go to any of the data-sharing portals you'll find such a jungle of formats, custom-made units and fields that the data is hard to reuse. If we enforce using Wikidata concepts and formatting when uploading a dataset, the dataset immediately becomes a thousand times more useful and reusable.
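
To make "Wikidata-compatible formatted data" a bit more concrete, here is a minimal sketch of the kind of normalization step this implies. The column names, the unit conversion and the property mapping are illustrative assumptions, and the output is a simplified structure rather than the exact Wikibase JSON serialization.

    # Minimal sketch: turn one row of a "jungle format" CSV into a
    # Wikidata-style statement structure. Column names, the unit
    # conversion and the property mapping are illustrative assumptions.

    RAW_ROW = {"city": "Firenze", "inhabitants (thousands)": "382.3"}

    # Hypothetical mapping from source columns to Wikidata property IDs.
    PROPERTY_MAP = {"inhabitants (thousands)": "P1082"}  # P1082 = population

    def normalize(row):
        statements = []
        for column, prop in PROPERTY_MAP.items():
            value = float(row[column]) * 1000  # convert the custom unit to a plain count
            statements.append({
                "property": prop,
                "value": {"amount": int(value), "unit": "1"},  # "1" = unitless quantity
                "reference": {"source_row": row},  # keep provenance for later referencing
            })
        return {"subject_label": row["city"], "statements": statements}

    print(normalize(RAW_ROW))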

About the technical part, no idea! Perhaps it is easier to adapt one of the existing solutions, but more important for now is to discuss its relevance to our mission. I say definitely important, but some discussion will be needed to shape a proper proposal.--Micru (talk) 12:13, 15 May 2014 (UTC)

If you are repeating yourself, you're going wrong

Far from encouraging data importation from external sources into Wikidata, I'd like to consider intermediate-query tools to dynamically convert existing, external, free data into Wikidata-compatible data structures. Any replication of data, when not fully automated, adds redundancy and invariably causes incoherence.

Honestly I have no idea about how to build such tools :-) --Alex brollo (talk) 13:52, 15 May 2014 (UTC)
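
For what it's worth, here is a very rough sketch of what such an intermediate-query tool might look like: nothing is copied or stored, the external source is queried on demand and the record is reshaped into a Wikidata-style structure. The endpoint, field names and property mapping are entirely hypothetical.

    # Very rough sketch of an "intermediate-query" wrapper: fetch a record
    # from an external open-data source at query time and reshape it into
    # Wikidata-style claims, without copying anything into our databases.
    # The endpoint and field names below are entirely hypothetical.
    import requests

    EXTERNAL_ENDPOINT = "https://example.org/api/records/{record_id}.json"  # hypothetical

    def fetch_as_claims(record_id, property_map):
        """Return Wikidata-style claims built on the fly from an external record."""
        record = requests.get(EXTERNAL_ENDPOINT.format(record_id=record_id), timeout=10).json()
        claims = []
        for field, prop in property_map.items():
            if field in record:
                claims.append({"property": prop, "value": record[field]})
        return claims

    # Example: map the hypothetical "population" field to P1082 (population).
    print(fetch_as_claims("12345", {"population": "P1082"}))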

Well, not necessarily. Redundancy is deprecated within the same database, but it is much appreciated for digital preservation. Metadata and data are harvested and copied all over the web, and the Wikimedia world re-writes much of the information out there with its own language, policies and tools. We "reproduce" general knowledge within our own ecosystem. So, imho, it is OK to import several datasets here for our own purposes. It's the only way possible. --Aubrey (talk) 10:32, 4 June 2014 (UTC)

External dataset platforms

Just a quick comment on figshare; the platform is closed source, but all data and hosted content is CC0 or CC-BY. I'm sure they would be open to talking about any collaboration. - Lawsonstu (talk) 18:09, 15 May 2014 (UTC)

Mark from figshare checking in here to confirm the above. - mhahnel

Archiving External Datasets

There are lots of CKAN instances out there, but many of them only store metadata about the datasets and do not host the datasets themselves. The problem is that the data eventually goes away. Highly volatile data like General Transit Feed Specification datasets have a tendency to disappear when new data becomes available. For historic research this is problematic. There may be a spot for a Wikimedia-related project to describe external datasets, and also to archive them. Having a second copy of what Figshare (and other data hosting services) hold is a good idea, since lots of copies keep stuff safe. I think the Internet Archive has been used this way a little bit, but some healthy competition is a good thing. - Edsu (talk) 10:44, 18 May 2014 (UTC)

Edsu: I agree that this is one of the core aspects. I asked on Commons and there is no visible opposition to enabling support for data files. In fact it was already requested in bugzilla:43151, initially for CSV and ODS, later changed to just ODS. In that bug report concerns were raised that CSV lacks a standard (other than RFC 4180). A quick visit to datahub.io suggests that the two most popular formats are CSV (774) and RDF+XML (500). On other data portals CSV seems to be prevalent too. On the mailing list there was a thread discussing an extension to display datasets, which also requires the data to be stored somewhere.--Micru (talk) 08:43, 19 May 2014 (UTC)
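
Those format counts can be re-checked at any time against a CKAN instance. The sketch below assumes datahub.io exposes the standard CKAN search API at /api/3/action/package_search and simply asks for the resource-format facet.

    # Quick sketch of how the format counts mentioned above could be
    # re-checked against a CKAN instance such as datahub.io, assuming the
    # standard CKAN search API is exposed at /api/3/action/package_search.
    import json
    import requests

    CKAN_API = "https://datahub.io/api/3/action/package_search"

    def resource_format_counts():
        """Return a {format: dataset count} facet from the CKAN search API."""
        params = {
            "q": "*:*",
            "rows": 0,                                # we only want the facet counts
            "facet.field": json.dumps(["res_format"]),
            "facet.limit": 20,
        }
        result = requests.get(CKAN_API, params=params, timeout=30).json()["result"]
        return result.get("facets", {}).get("res_format", {})

    for fmt, count in sorted(resource_format_counts().items(), key=lambda kv: -kv[1]):
        print(f"{fmt}: {count}")
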
I would oppose putting these files on Commons. Commons is "a database of 21,323,188 freely usable media files to which anyone can contribute". These files are not in scope. Multichill (talk) 20:52, 27 May 2014 (UTC)
I agree with Multichill that Commons should not mimic Figshare, datahub.io and the others: if they store data dumps, then we should not do the same. Competition is good, but we must first figure out how we'll be different, i.e. what more we'd offer, and then start doing it. If we don't have a special idea or use for that stuff, we should just join forces with others (as we already do: WMF posts select data on datahub.io; we copy bigger datasets to archive.org).
In short: I still have no idea what your aim is. --Nemo 23:15, 27 May 2014 (UTC)
To name a few possibilities, ours could e.g. be editable/correctable, be discussable on wiki (i.e. datasets would have talk pages), integrate easily into visualisation plugins for Wikipedia articles (see the planned Vega Visualisation Plugin), and accept and archive live feeds; plus, hooks into Wikidata's semantic information might make sets of datasets more useful/powerful. EdSaperia (talk) 12:59, 29 May 2014 (UTC)
Additional/different features don't create, by themselves, a specific role/space for a project. --Nemo 15:59, 29 May 2014 (UTC)
I was answering your statement "we must first figure out how we'll be different i.e. what we'd offer more". What role do you think e.g. Wikidata has that makes it a worthy project, other than its features? Do you mean "what will the mission statement of datacommons be?"? How about something like: to support data visualisation in Wikipedia? EdSaperia (talk) 14:02, 30 May 2014 (UTC)

Identifiers

Background: I am an archaeologist and I contribute to collaborative efforts such as Pleiades, whose content is made available under CC-BY.

While data duplication seems problematic, I think storing identifiers in Wikidata for this kind of scholarly resource would greatly help, much in the same way as is already done with VIAF. In the case of Pleiades, importing the data is not an optimal path IMHO, but once cross-references are in place, things like importing geographic coordinates could become feasible.--Steko (talk) 11:02, 12 June 2014 (UTC)
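
A hedged sketch of that identifier-first approach: read the external identifier stored on a Wikidata item, then fetch details such as coordinates from the external resource on demand. The property ID used for "Pleiades ID" and the field names in Pleiades' JSON export are assumptions here, and the example item is arbitrary.

    # Hedged sketch of the identifier-first approach: read an external ID
    # (here a Pleiades ID) stored on a Wikidata item, then fetch details
    # such as coordinates from the external resource on demand.
    # The property ID and the Pleiades JSON field name are assumptions.
    import requests

    WIKIDATA_API = "https://www.wikidata.org/w/api.php"
    PLEIADES_ID_PROPERTY = "P1584"  # assumed property ID for "Pleiades ID"

    def pleiades_id(item_id):
        """Return the Pleiades identifier stored on a Wikidata item, if any."""
        params = {"action": "wbgetclaims", "entity": item_id,
                  "property": PLEIADES_ID_PROPERTY, "format": "json"}
        claims = requests.get(WIKIDATA_API, params=params, timeout=10).json()["claims"]
        statements = claims.get(PLEIADES_ID_PROPERTY, [])
        if statements:
            return statements[0]["mainsnak"]["datavalue"]["value"]
        return None

    def pleiades_coordinates(place_id):
        """Fetch a representative point for a Pleiades place (field name assumed)."""
        url = f"https://pleiades.stoa.org/places/{place_id}/json"
        return requests.get(url, timeout=10).json().get("reprPoint")

    item = "Q2044"  # Florence, used purely as an example item
    pid = pleiades_id(item)
    if pid:
        print(item, "->", pid, "->", pleiades_coordinates(pid))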

I agree that cross-referencing with other databases and vocabularies is a good thing. I would even go so far as to say that this may be the most important aspect of Wikidata in the long run, at least for third-party uses of the data. -- Daniel Kinzler (WMDE) (talk) 07:42, 16 June 2014 (UTC)

Editable Data Tables

I would like to point out another possibility besides a) modeling tabular data as data items and b) treating them as media files (both of which I find cumbersome and impractical): it would be fairly easy to add support for CSV as a page "content model" to MediaWiki. The table could be edited with the normal edit page, as CSV, or alternatively with a nicer, more graphical interface. It would be rendered as a (sortable?) wiki table. No wiki syntax needed at all.

A rudimentary version of this could be written in a few days. The tricky bits, like allowing categories on tables or a nice editor, could be added later.
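
To illustrate the rendering half of that idea, here is a minimal sketch of turning CSV text into a sortable wikitable. A real MediaWiki content handler would be written in PHP, so this only shows the transformation, not the actual extension code.

    # Minimal sketch of the rendering half of a CSV content model:
    # turn raw CSV text into a sortable wikitable. A real MediaWiki
    # content handler would be PHP; this only illustrates the idea.
    import csv
    import io

    def csv_to_wikitable(csv_text):
        rows = list(csv.reader(io.StringIO(csv_text)))
        if not rows:
            return ""
        lines = ['{| class="wikitable sortable"']
        lines.append("! " + " !! ".join(rows[0]))        # header row
        for row in rows[1:]:
            lines.append("|-")
            lines.append("| " + " || ".join(row))
        lines.append("|}")
        return "\n".join(lines)

    print(csv_to_wikitable("city,population\nFirenze,382258\nPisa,88627"))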

Note that importing large data sets into a single Wikibase data item will not work well. The Wikibase data structure is not designed for items with a very large number of statements; an item with several thousand statements would be very hard to handle, and might be impossible to even export without triggering a timeout. "Flat" CSV would scale better, but not indefinitely. I think a flat table format for wiki pages could be quite useful for medium-sized tables, from 10 to 10,000 rows, maybe. For larger data sets, a different approach would be needed, but I doubt that it makes sense to import larger data sets into a wiki as a single "page" at all.

I personally think that it makes a lot of sense to import individual records from open datasets on demand, ideally using tools that automatically use the metadata on http://datahub.io/ and other CKAN instances to map the external record to Wikidata properties. This would be particularly handy for generating source references. However, Wikidata as a project, and Wikibase as a software, are not suitable for large-scale imports of tabular data. -- Daniel Kinzler (WMDE) (talk) 07:39, 16 June 2014 (UTC)
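
As a hedged illustration of the source-reference point: CKAN's package_show action returns dataset metadata that could be attached as a reference to each imported record. The CKAN instance, the dataset name and the property IDs below are assumptions, and the output is a simplified structure rather than a real Wikibase reference.

    # Hedged sketch of the "source references" idea: use CKAN dataset
    # metadata (package_show) to build a reference for an imported record.
    # The CKAN instance, dataset name and property IDs are illustrative.
    import datetime
    import requests

    CKAN_PACKAGE_SHOW = "https://datahub.io/api/3/action/package_show"

    def reference_for_dataset(dataset_name):
        """Build a simple reference structure from a CKAN dataset's metadata."""
        meta = requests.get(CKAN_PACKAGE_SHOW, params={"id": dataset_name},
                            timeout=30).json()["result"]
        resource_url = meta["resources"][0]["url"] if meta.get("resources") else None
        return {
            "stated in": meta.get("title"),              # dataset title
            "P854": resource_url,                        # reference URL (assumed property)
            "P813": datetime.date.today().isoformat(),   # retrieved date (assumed property)
        }

    # Hypothetical dataset name, purely for illustration.
    print(reference_for_dataset("example-open-dataset"))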

Live Data?

It'd be fantastic to be able to link to feeds that update regularly, and have those updates be reflected. EdSaperia (talk) 22:53, 28 June 2014 (UTC)