Semantic MediaWiki/Original implementation proposals
In this article, we collect concrete proposals on how to implement the targeted functionality. This is the main area for discussion, since it is less settled than the other areas. Initially, it is filled with some proposals extracted from earlier projects and ideas, but it should evolve into something that we really want to see implemented.
Some statements on this page refer to Wikidata. I do not know the current stage of implementation of this project: maybe some additional features of this project can be incorporated unexpectedly. You may want to comment if you have more information.
According to the ideas and experience from other projects, the following types of annotation should be supported:
- annotation with atomic data (like a stripped-down version of Wikidata), which is associated with certain data-fields (like "population" or "date of birth") and types ("integer" or "date"),
- categorization of articles (possibly improved), and
- typing of links (to structure the relationships between articles).
All of the annotated data should be integrated into a single data export, possibly in a standardized format. To help users, categories, link-types, and data-fields should have dedicated articles with descriptions (like categories do now). These articles might also include automatically generated information to enhance usability, as category articles do now. By creating articles, users can also introduce new categories, types, and data-values; so the developers are not the ones to dictate what types are available, and the community keeps all the power.
Categories and link-types can be classified in a hierarchy. Data-fields can be assigned to a data type (needed for parsing etc.).
In contrast to Wikidata, dedicated "data classes" that work similarly to templates are not planned in the first phase of the project. However, it is expected that template-like editing masks can be introduced later, building on a basic mechanism for associating data-fields with articles. Other features of Wikidata, e.g. localization tags, might be more difficult to obtain, while the hoped-for Wikidata applications like searching and sorting can be achieved more easily with the help of a standard-conformant approach that allows existing software to be reused.
It is suggested to strictly respect "typing" in annotations: only articles can be annotated, categorized, or semantically linked. This is important to be able to use standard data formats and existing software. (See below for details.)
Should the treatment of images or multimedia be different from articles? Or do we just put these into additional dedicated categories automatically (so that programs can restrict to image data by restricting to this category)?
Editing syntax and management of categories/types/…
How can users enter the annotation data in a nice way? Which things might not be so easy to enter?
For categorization, this is well established:
- An article is categorized by entering the hidden field [[Category:name of the category]]. The keyword "Category:" is localized for each language-specific wiki.
- To maintain and browse categories, a special article with the same prefix can be generated. In addition to custom text, it contains a list of all articles that are "directly" in this category and its subcategories.
- Categories can be structured in a hierarchy: the super-categories of a category are specified by putting the category's article into the respective categories as if it were a normal article.
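The category mechanism described above can be sketched in a few lines. This is only an illustration under assumptions: the dictionary `page_categories` and the helper names `category_listing` and `super_categories` are hypothetical and do not reflect how MediaWiki stores categories internally.

```python
CATEGORY_PREFIX = "Category:"

# Hypothetical sample data: page title -> categories entered via [[Category:...]].
page_categories = {
    "Berlin": ["Capitals"],
    "Hamburg": ["Cities"],
    # Putting the category's own article into "Cities" makes
    # "Cities" a super-category of "Capitals".
    "Category:Capitals": ["Cities"],
}

def category_listing(category, page_categories):
    """Return (articles, subcategories) that were put directly into `category`."""
    articles, subcategories = [], []
    for page, categories in page_categories.items():
        if category not in categories:
            continue
        if page.startswith(CATEGORY_PREFIX):
            subcategories.append(page[len(CATEGORY_PREFIX):])
        else:
            articles.append(page)
    return sorted(articles), sorted(subcategories)

def super_categories(category, page_categories):
    """Return all direct and indirect super-categories of `category`."""
    found, todo = set(), [category]
    while todo:
        current = todo.pop()
        for parent in page_categories.get(CATEGORY_PREFIX + current, []):
            if parent not in found:  # also protects against cyclic hierarchies
                found.add(parent)
                todo.append(parent)
    return found
```

Here `category_listing("Cities", page_categories)` yields the articles and subcategories a category article would list, and `super_categories` walks the hierarchy upwards.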
For typed links, we have the following proposal:
- Types are managed like categories: they have dedicated articles and can be put into a hierarchy.
- There needs to be a new namespace for this: what about "link:" or "relation:" for the English Wikipedia?
- A naming convention for such types would be useful. Typical identifiers could be "has capital" or "is located in".
- Would this possibly create confusion among authors (about the direction of the typed link; compare with the proposed syntax below)? Maybe having simple identifiers like those for data-fields ("capital") would be better.
- Links are typed by giving a type-name within the brackets, as a third parameter in addition to the link target and the (optional) alternative label:
[[article|alternative label in print|link type]]. If there is no alternative label, one can write [[article||link type]]. Conversely, leaving out link types is also possible: [[article|alternative label in print]] or just [[article]].
- There could be extra information or a simple search mask on the articles for link types. Any ideas?
- It could be helpful to have the possibility to annotate a link with more than one type, e.g. by providing a comma-separated list instead of a single type name.
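To make the proposed link syntax concrete, here is a minimal parsing sketch. It assumes the three-part form [[article|label|link type]] with an optional comma-separated list of types, as suggested above. The names `LINK_RE`, `TypedLink`, and `parse_links` are made up for this illustration, and the sketch ignores existing link forms that already use several pipes (such as image links).

```python
import re
from typing import NamedTuple, Optional, Tuple

# target, optional label, optional link type(s); none may contain brackets or pipes
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]*))?(?:\|([^\[\]|]*))?\]\]")

class TypedLink(NamedTuple):
    target: str
    label: Optional[str]     # None if no alternative label was given
    types: Tuple[str, ...]   # empty if the link is untyped

def parse_links(wikitext):
    """Find all links in `wikitext` and split them into target, label and types."""
    links = []
    for target, label, types in LINK_RE.findall(wikitext):
        type_names = tuple(t.strip() for t in types.split(",") if t.strip())
        links.append(TypedLink(target.strip(), label.strip() or None, type_names))
    return links
```

For example, `parse_links("[[Berlin||has capital, is located in]]")` gives one link to "Berlin" without a label and with two types, while `parse_links("[[Berlin]]")` gives an untyped link.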
Annotation with simple data could work as follows:
- The annotations are placed in a dedicated environment in the wiki-source, which is not rendered as text. Example implementations used the keywords <meta> … </meta> for this, but there might be better expressions (since "meta"-data is data about the article, like the authors or the language, but the annotation above talks about data within the article).
- The downside is that this approach creates separated ontology-sections in the wiki-source. An alternative is to give a separate input field for such basic data. See the mock-up-screenshot at the home of Wikidata.
- Values are assigned to data fields by writing "identifier=value" inside the appropriate environment.
- Identifiers of data fields should have dedicated articles in a special namespace ("data:" in English?). There, one can place a description of how to use the data-field, and a choice box for the data-field's type. The type is needed for parsing the data entered by the user (to obtain a time or date instead of a string etc.).
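The following sketch shows how such annotations could be read from the wiki-source. It is an illustration only and makes several assumptions that are not decided above: one "identifier=value" pair per line, dates written in ISO format (YYYY-MM-DD), and a hypothetical lookup table `FIELD_TYPES` standing in for the types chosen on the data-field articles. Unknown data-fields fall back to plain strings.

```python
import re
from datetime import date

# Hypothetical stand-in for the types selected on the data-field articles.
FIELD_TYPES = {"population": "integer", "date of birth": "date"}

# One parser per supported type (the ISO date format is an assumption).
PARSERS = {
    "integer": int,
    "float": float,
    "string": str,
    "date": date.fromisoformat,
}

META_RE = re.compile(r"<meta>(.*?)</meta>", re.DOTALL)

def extract_annotations(wikitext, field_types=FIELD_TYPES):
    """Collect typed values from all <meta> environments in `wikitext`."""
    annotations = {}
    for block in META_RE.findall(wikitext):
        for line in block.splitlines():
            if "=" not in line:
                continue  # skip blank lines and anything that is not an assignment
            identifier, raw = (part.strip() for part in line.split("=", 1))
            type_name = field_types.get(identifier, "string")
            annotations[identifier] = PARSERS[type_name](raw)
    return annotations

def strip_annotations(wikitext):
    """Remove the <meta> environments so that they are not rendered as text."""
    return META_RE.sub("", wikitext)
```

A value that does not fit the declared type (e.g. "population=many") raises a `ValueError` here; a real implementation would have to decide how to report such errors to the editing user.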
Export format and interoperability
Making (annotation) data machine-readable requires a machine-readable format. Which one should we use? What are the details (e.g. predefined namespaces)?
- An XML-based format seems to be generally the best solution, since it can capture highly structured data in a form that programs can use very easily.
- A candidate would be "OWL/RDF" (not to be confused with plain RDF or RDFS): it can express much more than the above annotations, so it is powerful enough for possible future extensions. It also comes with a simple semantics (see below) that is not as unintuitive as that of RDF(S). It supports a range of data-types. It is read by many software tools. It is an official W3C recommendation, i.e. an open data standard.
Is the meaning of the above export format interpreted in a standard way? Is the meaning obvious to humans? How exactly are the suggested types of data defined?
The semantics of OWL/RDF specifications is based on the Web Ontology Language OWL. It admits complicated expressions that are not for the average user, but its basic idea is very clear: classes (here: categories) are sets, instances (here: articles) are elements of these sets, and relations (here: typed links) are labeled arrows between instances. Annotations with simple data are effectively just typed links: instead of connecting two articles, they connect an article and a single data-value. (However, OWL allows specific expressions about article-article links, so one should not generally view articles as a special data-type for simple-data links but keep the two concepts separate.)
- Its semantics is defined formally, so there is a standard interpretation to guide software developers.
- It is downwards compatible with XML and RDF(S): all tools for these formats give some support for OWL/RDF syntax as well. Conversely, XML or RDF documents are usually not OWL.
- Overall, there is a strong tool support, and many tools are produced by universities and are open source. Commercial tools for industrial strength applications exist as well.
- It is a widely used standard that is considered the basis for future developments in the field. Building on it will give us many options for future extensions or adjustments.
- OWL supports many data-types, but we still might need customized extensions (e.g. by creating composed types). Integers, floats, strings, dates and times should work, but things like GPS-coordinates might not be there. Will this be a problem?
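As a rough illustration of the mapping described above (categories as classes, articles as instances, typed links as relations between articles, data-fields as relations to typed values), the following sketch writes one article's annotations as RDF/XML. It is not a complete or validated OWL export. The base URI `http://example.org/wiki/`, the property namespace `http://example.org/property#`, and the function names are assumptions made for this example, and spaces in names are simply replaced by underscores.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
OWL = "http://www.w3.org/2002/07/owl#"
XSD = "http://www.w3.org/2001/XMLSchema#"
BASE = "http://example.org/wiki/"      # hypothetical base URI for articles
PROP = "http://example.org/property#"  # hypothetical namespace for link types and data-fields

for prefix, namespace in (("rdf", RDF), ("owl", OWL), ("prop", PROP)):
    ET.register_namespace(prefix, namespace)

def uri(title):
    return BASE + title.replace(" ", "_")

def name(identifier):
    return identifier.replace(" ", "_")

def export_article(title, categories, links, data):
    """Serialize one article: `links` holds (link type, target) pairs,
    `data` maps a data-field to a (value, XML Schema type name) pair."""
    root = ET.Element(f"{{{RDF}}}RDF")
    # Declarations: categories are classes, link types and data-fields are properties.
    for category in categories:
        ET.SubElement(root, f"{{{OWL}}}Class", {f"{{{RDF}}}about": uri("Category:" + category)})
    for link_type, _ in links:
        ET.SubElement(root, f"{{{OWL}}}ObjectProperty", {f"{{{RDF}}}about": PROP + name(link_type)})
    for field in data:
        ET.SubElement(root, f"{{{OWL}}}DatatypeProperty", {f"{{{RDF}}}about": PROP + name(field)})
    # The article itself is an instance of its categories.
    node = ET.SubElement(root, f"{{{RDF}}}Description", {f"{{{RDF}}}about": uri(title)})
    for category in categories:
        ET.SubElement(node, f"{{{RDF}}}type", {f"{{{RDF}}}resource": uri("Category:" + category)})
    # Typed links point to other articles, data-fields to typed literal values.
    for link_type, target in links:
        ET.SubElement(node, f"{{{PROP}}}{name(link_type)}", {f"{{{RDF}}}resource": uri(target)})
    for field, (value, xsd_type) in data.items():
        element = ET.SubElement(node, f"{{{PROP}}}{name(field)}", {f"{{{RDF}}}datatype": XSD + xsd_type})
        element.text = str(value)
    return ET.tostring(root, encoding="unicode")
```

A call such as `export_article("Berlin", ["Capitals"], [("is located in", "Germany")], {"population": (3400000, "integer")})` returns the XML as a string. Types that the XML Schema data-types do not cover, like the GPS-coordinates mentioned above, would need an extra convention that this sketch does not address.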
Required MediaWiki and Database changes
OK, so what do we have to do now? Here go the software-side implications of the great plans above.
A fundamental question: if we have export in a standard format like OWL/RDF, do we need extra copies of the data in our main database? Or is the string-based XML-format enough for further use? (Especially the data-values of different types could require a few new database tables).