Talk:DataNamespace

Thoughts

Initially, I think there are two main questions, both important:

  1. What community would be responsible for this data?
  2. How would it be implemented technically?

For both, there is a follow-up question: "Is there currently social/community (1) or technical (2) capacity to do this?" For #1, it seems it should be Wikidata. Some such data can already be represented on Wikidata using the triplet-based system. I think this may include the example given in this article: (California, population property, numeric values for each ethnicity, with a qualifier stating which ethnicity). But that may not be a good representation.

Other tabular data would be more difficult to represent in triplet form, but I don't know that it's impossible. We really need input from the existing Linked Data and Open Data communities.

I think it's important to understand that while Wikibase (the extension used by Wikidata) may not be able to do that, Wikidata is still probably the best place to put this if we do it. Either Wikibase can be expanded to handle this, or two extensions can run on the same wiki. I don't think it is productive to have two independent data communities on two different wikis. Superm401 | Talk 00:01, 22 August 2013 (UTC)

Superm401, great questions:
Is there currently social/community capacity to do this?
I think the existing Wikidata and Wikipedia communities would be the primary target of this proposal. I was not thinking of importing tabular open data that is not already available in some Wikipedia article.
Is there currently technical capacity to do this?
It appears Wikibase is not appropriate for the job; see comments below by denny and Daniel Kinzler (WMDE).
Can this be represented in the form of triplets?
My initial proposal was geared towards high-quality, curated datasets with traceable provenance as the primary objects, not towards (1) the entity/property relations an existing dataset could be broken down into or (2) tabular data that could be reconstituted from triplets. I expect most of these tabular datasets will be hard to translate into structured data, and some entities referred to by a dataset may simply never exist in Wikidata.
--DarTar (talk) 23:17, 27 August 2013 (UTC)

Wikidata

The proposal looks good to me. I understand that it will be hard to explain to many people the subtle difference between tabular data and semantic data (I like your conceptualization). Just one question: what about sources and references? Would they be per table (easy)? Per cell (uhm)? Not at all? Also, as Superm401 said, it probably makes sense to assume it would be on the Wikidata site for general data, but with a different software extension. --denny (talk) 14:44, 22 August 2013 (UTC)

denny, I was assuming sources would be specified, at least initially, only as a dataset-level property (which could be stored as a regular statement in Wikidata if the dataset is an identifiable entity). This assumes the provenance of the dataset as a whole could be traced to a single reliable source, as opposed to datasets generated by querying individual statements of various provenance. --DarTar (talk) 23:22, 27 August 2013 (UTC)
Yes, that would work and it would simplify things considerably. --denny (talk) 10:19, 28 August 2013 (UTC)

What and where

I think this is cool and useful.

So there would be a new namespace on a wiki, possibly Wikidata, containing tabular datasets named wikidata:Data:Goiânia_accident_dosage.dsv or wikidata:Data:Goiânia_accident_dosage.json? It's unclear from the proposal whether you will choose between JSON and delimiter-separated values (and a particular kind, tab- or comma-separated), or whether the Data namespace will support many formats.

I don't know yet; ideally it should be format-agnostic and use some qualifier that specifies the appropriate handler to use, but I'm just speculating. --DarTar (talk) 23:41, 27 August 2013 (UTC)
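
Purely as an illustration of the format-agnostic idea above (every field name here is invented, not part of the proposal), a Data: page could carry a qualifier naming its handler alongside the rows, sketched as a Lua data module:

    -- hypothetical contents of a Data: page, sketched as a Lua data module;
    -- all field names are invented for illustration
    return {
        format  = "csv",                         -- qualifier naming the handler to apply
        headers = { "dosage_Gy", "persons" },
        rows = {
            { "<0.5",  18 },                     -- figures borrowed from Option 2 below
            { "0.5-1",  6 },
        },
    }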

any page associated with a dataset can be used to store metadata, or (even better) the metadata can be stored on Wikidata... This needs fleshing out. Will the proposal mandate a particular location? The obvious page on which to store metadata about a dataset is its talk page (Data_talk:name_of_Data), and then all the richness of citation templates can be used (though importing them all to Wikidata is a hassle, and citation templates impose a language). That's how schema talk pages are used for schema information, e.g. m:Schema_talk:ServerSideAccountCreation. I'm not clear about how metadata would go into Wikidata as statements about a page of tabular data.

Every dataset would have its own unique identifier stored in Wikidata as an entity, with metadata attached as properties to the entity. This is the same mechanism currently used for MediaWiki artifacts like categories, see wikidata:Q6513597. --DarTar (talk) 23:41, 27 August 2013 (UTC)
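
As a sketch of the mechanism DarTar describes (the entity ID and property labels below are invented for illustration), the dataset page and its Wikidata entity could be linked like this:

    -- hypothetical: a Data: page registered as a Wikidata entity, with
    -- dataset-level metadata attached as ordinary statements (IDs invented)
    local dataset_entity = {
        id   = "Q0000001",                       -- invented entity identifier
        page = "wikidata:Data:Goiânia_accident_dosage.json",
        statements = {
            source  = "…",                       -- dataset-level provenance, per the thread with denny above
            license = "CC0",                     -- invented example value
        },
    }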

... if the data table exists as an entity in Wikidata. I think Wikidata can store statements about a page on a particular wiki, so using Wikidata for this purpose doesn't mandate that the Data: namespace also be on Wikidata. Meta is primarily in English with localized versions, whereas my understanding is that Wikidata aims for language-neutrality.

Agreed, I don't know what the best place would be for this namespace. Daniel below has a point about data about Wikimedia projects (which could live on Meta) vs data from Wikimedia projects (stored in Wikidata), but even this distinction is blurred by the fact that Wikidata contains not just abstract entities but also entities for artifacts. --DarTar (talk) 23:41, 27 August 2013 (UTC)

tabular data that can be easily embedded into an article will allow us to develop extensions or gadgets in MediaWiki to easily toggle between a tabular view and a chart view, replacing the need for static images or vector graphs. This is most excellent! The Score extension is similar in that it can render source material in multiple ways (MIDI file, sheet music, digital audio). Obviously we need to consider how the non-JavaScript fallback works.

Yes, compatibility with mobile browsers and fallbacks for browsers with no JS should be considered. --DarTar (talk) 23:41, 27 August 2013 (UTC)

-- S Page (WMF) (talk) 23:42, 22 August 2013 (UTC)

WikiBase, ContentHandler, MediaHandler

  • The WikiBase extension is indeed not well suited for dealing with tabular data.
  • A content handler for tabular data would be nice to have (in fact, I wrote a very basic one as a workshop demo at Wikimania).
Daniel Kinzler (WMDE), link? --DarTar (talk) 23:51, 27 August 2013 (UTC)
  • A data namespace on Meta for data about Wikimedia projects sounds good.
  • The Wikidata community might well decide that they want a namespace for tabular data, using an appropriate content handler. That would probably not be for data about Wikimedia projects, though.

But there are limits to what the content handler can do:

  • ContentHandler support for tabular data would probably only work for tables with a few dozen columns and a few thousand rows. Larger data sets would become very slow to handle, may cause out-of-memory errors, etc.
  • Consider a MediaHandler instead of a ContentHandler - MediaHandlers deal with uploaded blobs (images, etc.) and are designed to handle large files.
  • It's still possible to implement in-wiki display with a MediaHandler. There already is paging support, which could be extended to support things like "show columns A, F and D for rows 200 to 350" (see the sketch after this list).
  • The MediaHandler could support different native formats without conversion, so people could just upload CSV, WDDX, ODT, or whatever.
  • With a MediaHandler, there would be no on-wiki editing... but that's not feasible for large data sets anyway, and probably not even desirable in most cases.
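
As a generic sketch of that kind of partial view (this is not the actual MediaHandler API, just an illustration of the operation in Lua):

    -- pick the named columns for a row range from a table of rows;
    -- `rows` is a list of records keyed by column name (generic sketch, not MediaWiki API)
    local function slice(rows, columns, first, last)
        local result = {}
        for i = first, math.min(last, #rows) do
            local record = {}
            for _, col in ipairs(columns) do
                record[col] = rows[i][col]
            end
            result[#result + 1] = record
        end
        return result
    end

    -- e.g. slice(rows, { "A", "F", "D" }, 200, 350)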

(notes by User:Daniel Kinzler (WMDE), copied on Meta with permission)

Daniel Kinzler (WMDE), excellent feedback. The original proposal was mostly targeted at the small, tabular datasets that are currently embedded in Wikipedia articles, so there would be no need to support large files at the cost of dropping on-wiki editing. I really see these as two separate use cases: small, editable datasets vs large datasets that can only be uploaded and previewed or summarized, but not edited on wiki. I also think that if we were to advertise this as a repository for large tabular datasets we would immediately hit MediaWiki's upload bottleneck (people often think of hundreds of MB or a couple of GB when referring to large datasets). --DarTar (talk) 23:51, 27 August 2013 (UTC)

Scope of the namespace

Heya,

I like the proposal but have some minor questions:

1) What in the proposal will prevent us from becoming a general-purpose dataset hosting site? There is an assumption that the Wikidata community will take ownership of this namespace, but are they up for that? Even if they are, there is still the question of which datasets we should host and which we should not. Do we only move tables from Wikipedia articles to this new namespace, or can, for example, Chapter folks upload their self-assessment data as a table? I think the scope should be a bit more clearly defined upfront.

Drdee, this is a question that came up in all the comments above; I guess I should better specify the scope (only move tables from Wikipedia articles). --DarTar (talk) 23:56, 27 August 2013 (UTC)

2) Do you want to enforce a link between a dataset and its metadata if the metadata lives on a separate page, or should both be available on a single page? Having both data and metadata on a single page will prevent certain problems (like the metadata page being deleted, renamed, or not updated), and when consuming the dataset you only have to make one call instead of two to get all the information.

I hadn't thought of that, interesting. I imagine this could be handled in the same way as Commons handles the deletion of media files and the associated metadata page in the File namespace, but that still doesn't answer the problem of how to jointly delete a dataset and the corresponding entity page on Wikidata. I don't see this as an intractable problem; the same issue applies to entities linked to Wikipedia articles that get deleted or renamed. --DarTar (talk) 23:56, 27 August 2013 (UTC)
I wouldn't worry too much about this. This is the same problem we have to solve for Wikimedia Commons and the structured data about the multimedia content there anyway, and this solution should work analogously for the data namespace, no matter where it is located (but if it is not Wikidata or Commons, it would need to be turned into a Wikibase repository). --denny (talk) 10:31, 28 August 2013 (UTC)
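
For contrast with the entity-based sketch earlier, a single-page layout along the lines Drdee suggests might look like this (all field names invented); one fetch returns both the metadata and the rows:

    -- hypothetical single-page layout: metadata and data travel together,
    -- so consumers make one call instead of two (field names invented)
    return {
        metadata = { source = "…", license = "CC0" },   -- dataset-level metadata inline
        data = {
            headers = { "dosage_Gy", "persons" },
            rows    = { { "<0.5", 18 }, { "0.5-1", 6 } },
        },
    }

The tradeoff, as the replies above suggest, is that inline metadata cannot be curated as Wikidata statements without some synchronization mechanism.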

3) We have a prototype to embed Limn charts in a Wiki.

Demo! --DarTar (talk) 23:56, 27 August 2013 (UTC)

Drdee (talk) 13:47, 27 August 2013 (UTC)

Partnership with external organizations

I think it is a very interesting (and needed) proposal. Personally I would recommend contacting external organizations that are already dealing with this very same kind of problem and partnering with them. A potential candidate could be Datahub (OKFN/CKAN). Maybe Wikidata can take care of describing the fields of each raw data file semantically, or of providing the mapping to Wikidata properties. Not everything must be contained in Wikidata, but how to interpret the data is also valuable knowledge, and it might be worthwhile to develop that kind of bridge. --Micru (talk) 02:07, 28 August 2013 (UTC)

I agree with Micru. This project would allow the Wikimedia world to host other interesting datasets from the open data world (many of them would fit well in Wikipedia articles). Moreover, digital preservation and revision control of datasets are a huge problem (see for example this post). Wikidata could be part of the solution. --Aubrey (talk) 10:06, 28 August 2013 (UTC)
Micru, Aubrey: in fact we've been working with OKFN/CKAN for a while: we created and have been using the DataHub as a data registry for open data about Wikimedia projects, and we're continuing to work with them to determine how to best support our community (the DataHub recently upgraded to CKAN v.2 and there are some glitches we're trying to sort out). Integrating CKAN with MediaWiki is a totally different story, and I believe that a dedicated data namespace for reusing tabular data across Wikimedia projects is a much more realistic goal than integrating two totally unrelated platforms in production. I'd be curious to hear of any experiment that has been made in that direction, though. --DarTar (talk) 00:41, 24 September 2013 (UTC)

What can be hosted on Wikidata

A lot of "table" information can be stored on Wikidata just as wikidata is now.

If a table shows or compares the characteristics of a bunch of Wikidata items (compare the characteristics of different game consoles; list the recurring characters in a TV show with the episodes they were in), this can be done by having a separate Wikidata item for each line of the table and a property for each column.

If you want to show a progression over time - population for example - then have a 'population' property with multiple values for the population (integer datatype will be available in October) at different points in time, each value with a qualifier property for the point in time corresponding to that value. Note that this data can be expressed as a table or as a 2d graph - each population-value/time-qualifier is a point on the graph. A graph with multiple lines representing different sources of population figures (estimate, CIA factbook, national census office) needs an additional qualifier for each value. A graph with multiple lines comparing the population in different places needs to import data from different items, each representing a different place.
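
For illustration, the qualifier structure Filceolaire describes might be laid out roughly like this (all values invented; this is a conceptual sketch, not Wikibase's actual serialization):

    -- sketch: a 'population' property with one entry per point in time;
    -- a 'source' qualifier separates multiple lines on the same graph (values invented)
    local population_statements = {
        { value = 1500000, qualifiers = { point_in_time = "1990" } },
        { value = 1750000, qualifiers = { point_in_time = "2000" } },
        { value = 2000000, qualifiers = { point_in_time = "2010", source = "national census" } },
    }
    -- each value/time pair is one point on the population graph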

The existing Wikidata data structure with qualifiers could even be used to represent arbitrary 3-dimensional points, with each point represented by a value (representing a label for the data point, for instance) with qualifiers for x-coord, y-coord, and z-coord. Add a time coordinate and it can vary over time as well. 3D coordinates can also be represented by a distance value with a qualifier using the 'coordinate' datatype to give polar coordinates for the point. Of course it's starting to get complicated now. This is however effectively what the data in the astronomical infoboxes includes (taken from star catalogs), all of which is due to be integrated into Wikidata.

A separate data namespace may well have a use case but, as far as I can see, the tables we have on Wikipedia can pretty much all be represented in the current Wikidata with no changes. The development needed is all related to developing tools to extract and display the data. These tools should also, of course, automatically update the tables and graphs when the data is changed or when an additional data point is added.

Or have I missed something? Filceolaire (talk) 00:46, 29 August 2013 (UTC)

See the Global Economic Map project here and on Wikidata. This is aiming to store tables of economic statistics on Wikidata. Filceolaire (talk) 00:58, 29 August 2013 (UTC)
Filceolaire, sorry for taking so long to come back to you. I think you're right that in principle every type of tabular data could be transformed into a set of statements in Wikidata using qualifiers, but this has a number of huge implications for usability and for the underlying data curation model:
  • each header in a tabular dataset must already exist as an entity in Wikidata (it's not an accident that the first three items in the Global Economic Map task force proposal are precisely about this issue: Wikidata already has a page on each country.). It's not clear to me if the community will ever accept populating Wikidata with arbitrary entities solely for the purpose of storing tabular data, even if they don't represent anything notable or worth being created.
  • Moreover, what happens to the original dataset once individual entities it refers to are deleted? How is provenance maintained for the original dataset once you lose control of entities? How do you update a dataset whose entities have been renamed or deleted in the meantime?
  • there are two possible curation models I can think of:
  1. tabular data to Wikidata statements: this is the model in which people donate/import a tabular dataset from an existing source (say, US census data); the dataset is then represented as an object with a unique identifier in the data namespace; it is then broken down and converted into a series of statements for the corresponding entities in Wikidata when appropriate.
  2. Wikidata statements to tabular data: the second model (the one you're advocating) has people donate or import data only if it can be natively represented in the form of Wikidata statements; this data, once adequately modeled and imported, is then reaggregated in the form of a tabular dataset.
  • Model #2 imposes a huge overhead on data donors/curators, for the reasons that I mentioned above. Most data publishers/curators out there (researchers, governmental agencies, scientific institutions) typically publish, document and maintain datasets as individual objects. Asking them to convert this data into Wikidata structures as a condition for this data to be published will not work unless (a) there is dedicated interface support for this workflow, (b) all the usability barriers are removed, (c) all the relevant entities and qualifiers exist, and (d) the problems I mentioned above with entities being refactored are properly addressed. Model #1 is much more likely to meet the expectations of a data donor and will allow the Wikidata community to selectively curate and import this data into structured data only when appropriate. Model #1 is also more flexible; it will still support all the use cases you refer to without creating an unnecessary burden on the data donor's end. I also don't agree with your characterization of curating and documenting a tabular dataset as the equivalent of annotating a chart. If you look at what most popular open data repositories (Dryad, CKAN, FigShare, etc.) do, that's exactly how they work. --DarTar (talk) 19:48, 24 September 2013 (UTC)

How to incorporate that bar chart into Wikidata

Looking at the bar chart on the project page, how could the data for it be encoded in Wikidata?

One of many static bar charts used across Wikipedia

Option 1

We could have a special wikidata page just for that bar chart using special bar-chart properties.

Brazil cesium human chart (Item)
Type of chart : Bar chart (Property with Item)
Chart title : "Goiânia incident radioactive contamination" (Property with multilingual text)
Bar : "0.5 to 1" (Property with multilingual text)
color : Pale Green (qualifier)
Height : "6"(qualifier)
Bar : "0.5 to 1" (Property)
color : blue (qualifier)
Height : "2" (qualifier)

This is effectively describing the chart rather than the data. You could just as well have an OpenOffice file and store it on Commons.

Option 2

We could add the data to the item for the Goiânia incident:

Goiânia incident (Item)
Chemical : Cesium (Property -> Item)
Dosage : < 0.5 Gy (qualifier -> number with units)
Number of persons : "18" (qualifier -> number)
Outcome : Outpatient (qualifier -> Item)
Chemical : Cesium (Property -> Item)
Dosage : 0.75 +- 0.25 Gy (qualifier -> number with range and units)
Number of persons : "6" (qualifier -> number)
Outcome : Outpatient (qualifier -> Item)
Chemical : Cesium (Property -> Item)
Dosage : 0.75 +- 0.25 Gy (qualifier -> number with range and units)
Number of persons : "2" (qualifier -> number)
Outcome : In-patient (qualifier -> Item)

Here we are modelling the data, not the chart. The set of properties used here can probably be reused for describing medical trials.

Wikisource's need for a Data namespace

On it.wikisource the need for a comfortable place for data has been felt for years. While Wikidata seems a perfect storage place for "general interest" metadata, there is also a need for much more granular, book-specific data:

  1. relationship between nsPage and ns0 to manage internal links and anchors
  2. relationship between nsPage number and book page number
  3. dictionaries of words
  4. searching for and highlighting scannos
  5. ....

Lua is a powerful tool for managing pretty large sets of data (previously managed by "mega-switches" inside templates), but a Lua-specific data structure is needed; such data can be used efficiently in view mode, but isn't accessible while editing.

The best would be for Lua and JavaScript to be able to read the same sets of data (i.e. for Lua to read JSON structures and plain text), with Lua using mw.loadData()'s particular features. I can't imagine how this could be done. --Alex brollo (talk) 21:43, 20 December 2013 (UTC)
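
A minimal sketch of the mw.loadData() pattern Alex mentions, with invented module names (mw.loadData loads a pure-data module once per page and hands out a read-only copy, so many #invoke calls can share one data set):

    -- Module:Data/example (hypothetical page): a pure data module, no functions
    return {
        pageOffsets = { 9, 10, 11 },   -- nsPage number -> book page number (invented values)
    }

    -- Module:Example (hypothetical): reads the data via mw.loadData
    local data = mw.loadData('Module:Data/example')
    local p = {}
    function p.bookPage(frame)
        return data.pageOffsets[tonumber(frame.args[1])]
    end
    return p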

Similar? PiRSquared17 (talk) 02:03, 25 December 2013 (UTC)

Datasets on Wikidata

See also: bugzilla:62555. --Micru (talk) 00:30, 12 March 2014 (UTC)

Time to revisit this! Now that we have a Data: namespace implemented on Commons :)

How to get this introduced into MediaWiki Core? –SJ talk 19:50, 21 October 2022 (UTC)