Requests for comment/How to deal with open datasets
Thanks to initiatives like the one led by the Open Knowledge Foundation, more and more governments are opening their data on platforms like CKAN. Scientists who are part of the Open Access movement also disclose their data, some of which may be relevant to the Wikimedia movement's mission. How should we adapt to this new situation? How does it affect our mission?
- 1 Which kind of data are we talking about
- 2 Relationship of datasets with Wikidata
- 3 Examples of external dataset platforms
- 4 Potential use cases
- 5 Open Questions to the Community
Which kind of data are we talking about
By datasets we refer to:
- static data tables that are used in Wikipedias (see DataNamespace for some examples)
- data that is used to generate visualizations (like those enabled by Vega)
- data that is better handled in batches (hence the name "dataset")
- data that doesn't fit into Wikidata (see below)
Relationship of datasets with Wikidata
Wikidata was created as a collaboratively edited knowledge base, which means that its community has to be selective when integrating data into its structure for several reasons.
- Information overflow: on Wikidata, contributors and data coexist to create semantic structures that can be queried in several ways. Adding too much data might make data curation impossible (think of copy-pasting text dumps into Wikipedia).
- Structure of the project: Wikidata is conceived in a way that each relevant piece of information is stored in its representing item (a city item stores its population), which is great for adding contextualized data, but difficult when this data is just relevant for a specific graph.
- Software limitations: on Wikidata each field is supposed to be editable. That requires formatters and widgets, which would render the pages unusable when the number of elements is in the range of hundreds or thousands. Datasets do not need to be edited, just versioned as a whole, like files in Commons.
Even if Wikidata is not the right place, the data structure of Wikibase (the software powering Wikidata) should be taken into account.
Examples of external dataset platforms
There are other projects that are dealing with datasets in different ways.
CKAN is an open source data management system used by governments to disclose their data. There are (as of May 2014) 83 instances running. Dataset example: "2011 Annual Report for the Vancouver Landfill"
ZENODO is developed by CERN under the EU FP7 project OpenAIREplus. Their goal is to make as much European-funded research output as possible available to all. Dataset example: "Conversion of fluoride and chloride catalized by SAM-dependent fluorinase in Nocardia brasiliensis"
Figshare is another platform for sharing data, but it is closed source. Dataset example: "Life history traits and maximum intrinsic population growth rate of 107 chondrichthyan species"
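To make the "link or import" question more concrete for CKAN, here is a minimal sketch of how a tool might look up datasets on a CKAN instance through its Action API (`package_search`). The instance URL, the helper names `ckan_search_url` and `resource_links`, and the sample response are hypothetical and only for illustration. The sketch builds the request URL and reads an already-decoded JSON response, and it makes no network call itself.

```python
from urllib.parse import urlencode


def ckan_search_url(base_url: str, query: str, rows: int = 5) -> str:
    """Build a CKAN Action API search URL (hypothetical helper)."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"


def resource_links(response: dict) -> list[tuple[str, str, str]]:
    """Collect (dataset title, file format, file URL) from a decoded response."""
    if not response.get("success"):
        return []
    links = []
    for package in response["result"]["results"]:
        for resource in package.get("resources", []):
            links.append((
                package.get("title", ""),
                resource.get("format", ""),
                resource.get("url", ""),
            ))
    return links


# Made-up response in the shape CKAN returns for package_search.
sample_response = {
    "success": True,
    "result": {
        "count": 1,
        "results": [
            {
                "title": "Example dataset",
                "resources": [
                    {"format": "CSV", "url": "https://data.example.org/file.csv"}
                ],
            }
        ],
    },
}

print(ckan_search_url("https://data.example.org/", "landfill"))
print(resource_links(sample_response))
```

Such a lookup would only return links to the external files; whether the files themselves are then imported or merely referenced is exactly the question raised in this request for comment.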
Potential use cases
Cooperation with governments
Together with the Open Knowledge Foundation Austria and the Cooperation Open Government Data Austria, Wikimedia Austria is currently establishing an open data portal that will host non-governmental open data in Austria. Open data is an important part of free knowledge and often serves as a valuable resource for Wikimedia projects. Some of this data will be incorporated into Wikidata, but in other cases that won’t be possible (for the reasons described above). However, it might still be interesting to use some datasets for visualization purposes. See the Wikimania '14 submission.
See also projects like CitySDK
Partnerships with citizen organizations
Recently Amical Wikimedia was approached by OCMunicipal, a citizen organization whose aim is to monitor town hall spending. To achieve this, they engage citizens and journalists to process the data. Once the data is processed and standardized, it becomes more useful and better aligned with our mission of disseminating knowledge. It would also help their mission if potential partners and contributors knew that their work can be reused in Wikipedia.
The Analytics team from the Wikimedia Foundation generates a great deal of valuable information that has to be formatted in cumbersome wikitext in order to be published. Their work would be much easier if a tool to manage datasets were available.
Better management of existing tables
As outlined in a previous proposal (see DataNamespace), there is a great deal of information in Wikipedia pages that cannot be shared across language editions. It is also hard to update when new data becomes available and, as explained above, it does not always fall within Wikidata's scope.
With an extension using a visualisation language like Vega, we could support rendered visualisations, even live visualisations, in Wikipedia.
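As a rough illustration of what "a dataset plus a visualisation language" could look like, the sketch below builds a small chart specification as plain data. It uses Vega-Lite, the higher-level grammar that compiles to Vega, which is an assumption made here for brevity and not a choice stated in this proposal. The helper name `bar_chart_spec`, the field names, and the row values are all made up for the example.

```python
import json


def bar_chart_spec(rows: list[dict], category: str, value: str) -> dict:
    """Return a minimal Vega-Lite bar chart spec with the data inlined."""
    return {
        "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
        "data": {"values": rows},
        "mark": "bar",
        "encoding": {
            "x": {"field": category, "type": "nominal"},
            "y": {"field": value, "type": "quantitative"},
        },
    }


# Placeholder rows; a real dataset would be stored and versioned separately.
rows = [
    {"item": "A", "amount": 10},
    {"item": "B", "amount": 25},
]

spec = bar_chart_spec(rows, "item", "amount")
print(json.dumps(spec, indent=2))
```

Because the specification is just JSON, the same stored dataset could in principle feed charts in any language edition, with only labels left to translate.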
Open Questions to the Community
Here are some open questions to spark a constructive discussion:
- What should be the relationship of Wikimedia projects with external datasets? Should we import them whenever possible or just link them?
- If datasets fall within the scope of the Wikimedia movement's vision, where should they be stored and how should they be maintained, given that Commons and Wikidata each have a different scope?
- Should a technical solution be sought to complement the perceived gap?
- There have been many technical changes recently. Is the community ready for yet another tool? If not, should it be limited to people working in outreach activities?
- Would it make your job as editor more complex or easier to have an additional tool to deal with datasets?