Wikimedia Conference 2010/Developers' Workshop/Notes/Structured Data

From Meta, a Wikimedia project coordination wiki

This pad contains live notes from the Structured Data Working Group at Wikimedia Conference 2010/Developers' Workshop.

Outline

The first session was used to identify main topics to discuss today and tomorrow. The resulting topics are:

  • Using MW templates for managing structured data (do we need better declarations of template parameters? datatypes? data extraction from templates, ...)
  • Data import/re-use from external sources (live import vs. data integration, caching, push vs. pull, versioning issues, trust and provenance)

Participants

(based on introductory round in the morning)

  • Tatiana de la O (acracia): API, RDF storage
  • Gregor Hagedorn: image metadata (?), large datasets (identification in biology, matrix keys for querying data)
  • Simon GESIS: wiki for social science databases (import/export)
  • Leszek Krupinski (leafnode): importing metadata properly, database access to wikis
  • John Erling Blad (jblad): using external data in Wikipedia, re-using public data
  • Daniel Kinzler (Duesentrieb): multi-lingual metadata, commons, metadata extraction
  • APPER (Chris): PersonData, tool server
  • Lars Aronsson (LA2): practical use of metadata, personal metadata
  • Jonathan Gray (OKF): Open Data, Open Content, bibliographic metadata, browsing
  • Inez: structured data extraction, article recommendation, (WYSIWYG background)
  • Sebastien: Automatic Wikification
  • Anja Jentzsch (anjeve): DBpedia
  • Robert Isele: DBpedia
  • Kolossos: Maps, Template Tiger
  • Markus Krötzsch: Semantic MediaWiki
  • Jeroen De Dauw: Semantic Maps

Use Cases

  • Image metadata on Commons
  • PersonenDaten on German WP
  • Geodata on WP
  • Bibliographical records (e.g. using FRBR ontology)

Using MW templates for managing structured data

  • Should we use templates, or develop some other way of presenting/recording meta-data?

How to manage metadata about templates?

  • Proposal by Daniel:

Declare template parameters, including documentation, the relation expressed (e.g. an RDF property), an optional flag, etc.

  • Proposal by Markus:

SMW already has a feature like the one proposed by Daniel. In addition, SMW declares properties on separate pages to have a local name and datatype (instead of just using a technical URI of some external ontology directly)

  • Properties should be first-class objects (like in RDF), existing globally and possibly being used in more than one template
  • Some property values can have multiple languages
    • we want to support this only for plain text values
    • we treat it as in RDF internally
  • Problem: not all properties have reasonable one-to-one mappings to template fields, e.g. sometimes multiple fields have to be consolidated into one property value
  • Possible solution 1: Have a parser function for declaring template fields to have some "meaning", do this with an extension and incrementally introduce it to a WP project; actual values are obtained by hooking into template transclusion and checking if the template has declared meanings for its parameters to process them (advantages: no changes in core, incremental adoption/extension possible)
  • Possible solution 2: Have a parser function for processing instantiated template parameters; the parser function is inserted into the template code (wrapped around the value) and processed on the pages the template is used on (advantages: no additional database lookup when using templates)
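
As a rough illustration of possible solution 1 only, the following Python sketch models a registry of declared meanings that is consulted whenever a template is transcluded. It also shows one way the consolidation problem noted above (several fields feeding one property value) could be handled. All names here (`DECLARATIONS`, `extract_properties`, the `Infobox person` template and the `ex:` property identifiers) are hypothetical and not part of MediaWiki or SMW; a real extension would be written in PHP against MediaWiki's parser hooks.

```python
# Hypothetical sketch of "possible solution 1": declared meanings are stored
# per template and looked up whenever the template is transcluded.

# Each declaration maps a property identifier to the template fields it is
# built from, plus a function that consolidates those fields into one value.
DECLARATIONS = {
    "Infobox person": {
        "ex:name": (["name"], lambda name: name.strip()),
        # Several fields consolidated into a single property value.
        "ex:birthDate": (
            ["birth_year", "birth_month", "birth_day"],
            lambda y, m, d: f"{int(y):04d}-{int(m):02d}-{int(d):02d}",
        ),
    },
}


def extract_properties(template, params):
    """Return property values for one transclusion of `template`.

    Templates without declarations yield an empty dict, which is what
    allows incremental adoption: undeclared templates are simply ignored.
    """
    result = {}
    for prop, (fields, combine) in DECLARATIONS.get(template, {}).items():
        # Skip a property if any of the fields it needs was not supplied.
        if all(f in params for f in fields):
            result[prop] = combine(*(params[f] for f in fields))
    return result


if __name__ == "__main__":
    call = {"name": " Ada Lovelace ", "birth_year": "1815",
            "birth_month": "12", "birth_day": "10"}
    print(extract_properties("Infobox person", call))
    print(extract_properties("Some undeclared template", call))
```

Under solution 2 the same consolidation logic would instead sit inside the template code itself, wrapped around the parameter values, so no separate registry lookup would be needed at transclusion time.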

Which information to declare?

Core information:

  • parameter name
  • datatype
  • unique identifier for the property (possibly from a standard vocabulary, or from the wiki)
  • human-readable documentation

Auxiliary information:

  • field required or not (used for editing)
  • information for sanitizing inputs
  • list-related attributes (e.g. separators for lists)
  • ...

Basic datatypes:

  • text, multilingual text, dates, numbers, wiki page names, geo coordinates, URLs
  • lists of <anything>
  • Units of measurement?
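
To make the lists above more concrete, here is a minimal sketch of what one parameter declaration could look like as a data structure. It is an assumption for illustration, not an agreed design: the class and field names (`ParameterDeclaration`, `property_id`, `list_separator`) are hypothetical, units of measurement are left out because they are still an open question, and `parse` only trims and splits values without converting them to the declared datatype.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Datatype(Enum):
    """Basic datatypes listed in the notes (units of measurement omitted)."""
    TEXT = "text"
    MULTILINGUAL_TEXT = "multilingual text"
    DATE = "date"
    NUMBER = "number"
    PAGE = "wiki page name"
    GEO = "geo coordinates"
    URL = "URL"


@dataclass
class ParameterDeclaration:
    # Core information
    name: str            # template parameter name
    datatype: Datatype   # datatype of a single value
    property_id: str     # unique identifier of the property
    documentation: str   # human-readable documentation
    # Auxiliary information
    required: bool = False
    list_separator: Optional[str] = None  # if set, the value is a list

    def parse(self, raw):
        """Trim a raw value; split it into a list if a separator is declared."""
        if raw is None or not raw.strip():
            if self.required:
                raise ValueError(f"missing required parameter: {self.name}")
            return [] if self.list_separator is not None else None
        if self.list_separator is not None:
            parts = raw.split(self.list_separator)
            return [p.strip() for p in parts if p.strip()]
        return raw.strip()


if __name__ == "__main__":
    occupation = ParameterDeclaration(
        name="occupation",
        datatype=Datatype.TEXT,
        property_id="ex:occupation",  # hypothetical identifier
        documentation="Occupations of the person, separated by semicolons.",
        list_separator=";",
    )
    print(occupation.parse("mathematician; writer"))
```

Modelling "lists of <anything>" as an element datatype plus an optional separator keeps the set of basic datatypes small; a nested list type would be an alternative if lists of lists were ever needed.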

Other issues (later)

  • Data model?
  • Data types?
  • Mapping to external ontologies?