Semantic MediaWiki/Blueprint


On this page, we presented the details of our plan for implementation in 2005. It served as a blueprint for the actual programming and also describes in some detail how we envisioned it all working in practice (also see Semantic MediaWiki/Original implementation proposals).

Now that the system is coded and working, this blueprint is historical, although there is a lot of good feedback on its Talk:Semantic MediaWiki/Implementation page.

Overview of features

The intended extension of MediaWiki will support various forms of semantic annotation of Wiki-articles. These are:

  • Annotation of article-article-links by means of configurable relations (link-types).
  • Annotation of articles with simple data-values that are assigned to configurable attributes.
  • Support for using physical units of measurement in all numerical attributes, without need for prior configuration.

The new annotation data will be combined with annotations from the current category system to generate standard-compliant OWL/RDF output, which can be fully processed with tools that support OWL DL or OWL Lite, but which can also be treated in a meaningful way by software that supports RDF, RDFS or XML.

Furthermore, some new functions for providing user feedback will be introduced. In particular, the given annotations will be displayed at the bottom of each article, and annotated data-values can have special HTML-code for displaying helpful information (unit conversions) as tooltips or on mouse-over. This is important to ensure that annotations are visible (and thus checkable) immediately for all users.

The proposed annotation methods were developed to be as simple and unintrusive as possible. We stay true to the important Wiki-principles of having all information right inside the Wiki-source and of providing visible user feedback for (most) inputs, and we suggest a syntax that is easy to understand without any technical background knowledge on semantic annotation.

Outline

The above features can concretely be realized with only a small number of modifications (which, in fact, is also a feature). We first describe the envisaged functionality; the individual steps for implementation are discussed in the section "Steps for implementation" below.

Relations and attributes

When a user assigns a "type" to a certain link, this link specifies a certain relation between articles. Besides this, articles can have attributes that have (one or more) data-values of a certain data-type. Stated differently, attributes and relations are used inside articles to classify the relationship of an article to a data-value or to another article, respectively. For an example, take the relation "has Capital" between a country and a city and the attribute "Population" for a country (or a city).

Relations and attributes are really used to "categorize" such connections between articles and between articles and data. Clearly, we need to have a way to define and describe relations and attributes very much like in the case of categories today. Therefore, two new namespaces "Relation:" and "Attribute:" will be introduced, and articles in these namespaces will be considered to describe the respective concepts. These articles are used to collect human-readable information on how to use the annotations, but also machine-readable entries that define special features of a relation or attribute (For example, any attribute needs a data type, like integer for "Population").

Annotations in the Wiki-Source

Existing relations and attributes can be used for annotation in articles (in fact, we will even allow use of relations and attributes that do not have a defining article yet; this is a Wiki-principle that is already in use for categorization). Annotation of links with link types will basically use the following scheme:

[[Relation name::article name|alternative link label]]

Both the annotating "Relation name" and the alternative label can be omitted. The deviation from our earlier proposal to introduce the link type as a third field at the end of the link has deep reasons of usability that will be explained elsewhere. The following sentence, which could appear in the article on Berlin, is an example of link annotation:

Berlin is the capital of [[is capital of::Federal Republic of Germany|Germany]].

Although this introduces some extension in the Wiki-source, it is readily understandable for humans. A more detailed discussion on usability is given below.

Annotation of articles with data-values can be done in a very similar syntax, using the following scheme:

[[Attribute name:=data value|alternative label]]

The alternative label might not be necessary in many cases, but it is a fallback if the article-text is not supposed to contain the exact data value. The following sentence, for the article about Berlin, is an example usage of data annotation:

Berlin has about [[Population:=3.390.444|3.4 Mio]] inhabitants.

In order to use this annotation, the parser needs to know the datatype of "Population" (it could also be a string, instead of an integer), so it should only consider evaluating this for cases where the attribute has been declared already. Of course, string could always be used as a default datatype. In cases where the given value is not understood by the machine (e.g. when it is not a syntactically correct number), the corresponding place in the article could be colored or even linked to a page that explains possible errors. We will also allow units of measurement following data values; details are given below.
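The two schemes above can be sketched as a small parser. This is only an illustration in Python, not the extension's actual code: the names `parse_annotation` and `extract_annotations` and the layout of the returned tuple are assumptions made for this example, and datatype handling is left out.

```python
import re

# Matches the text between [[ and ]]; illustrative only.
LINK_RE = re.compile(r"\[\[(.+?)\]\]")

def parse_annotation(link_body):
    """Split the inside of a link into (kind, name, value, label)."""
    target, sep, label = link_body.partition("|")
    label = label if sep else None
    # Find whichever separator ("::" or ":=") occurs first in the target.
    found = [(pos, kind)
             for pos, kind in ((target.find("::"), "relation"),
                               (target.find(":="), "attribute"))
             if pos >= 0]
    if not found:
        return ("plain", None, target, label)  # ordinary, untyped link
    pos, kind = min(found)
    return (kind, target[:pos].strip(), target[pos + 2:].strip(), label)

def extract_annotations(wikitext):
    """Return the parsed form of every [[...]] link in the given source."""
    return [parse_annotation(m.group(1)) for m in LINK_RE.finditer(wikitext)]
```

Applied to the Berlin examples, the sketch returns a "relation" entry for the typed link and an "attribute" entry for the population value, while untyped links come back as "plain".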

Units of measurement

Physical units are very important in Wikipedia; annotations with plain numbers will usually not be enough. One solution would be to require that any single attribute gets only values in some fixed unit. But this would lead to users wildly defining new attributes just to use their own favorite unit for some (existing) annotation property. This would be very unpleasant both for the user (who would need to create new attributes all the time) and for those who process our annotations (who have to deal with many, only informally connected attributes that describe the same features).

On the other hand, OWL does not foresee any way of giving units with the data values. Even worse, there is little hope of ever getting descriptions for converting between different units into OWL. Hence, OWL (and RDF, RDFS, XML) cannot effectively deal with values that are given in different units, even if we provide some information on these units.

We propose the following solution for this problem: users are allowed to provide units for all numerical data types. Even if the unit is unknown to the system, it will be easy to extract it from the annotation, because numbers cannot contain spaces (so everything behind the first space must be the unit). Units are then processed by first creating a variant of the name of the given attribute by appending the extracted unit in parentheses. This variant is then used to store or export the given data value. This allows users to use any unit, while still keeping different units apart (we do not export "miles" under the same label as "kilometers"). Furthermore, whatever unit is used, the annotations are still explicitly connected to the same attribute and special software (that has knowledge about conversion between units) could make use of this information.
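The unit handling just described is simple enough to sketch in a few lines of Python. The function names `split_value_and_unit` and `export_name` are hypothetical and chosen only for this illustration.

```python
def split_value_and_unit(raw):
    """Treat everything after the first space as the unit (may be empty)."""
    number, _, unit = raw.strip().partition(" ")
    return number, unit.strip()

def export_name(attribute, unit):
    """Build the unit-specific variant of an attribute name."""
    return f"{attribute} ({unit})" if unit else attribute
```

For instance, the value "12.5 miles" for an attribute "Length" would be stored under the name "Length (miles)", while a value without a unit keeps the plain attribute name.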

We can further extend the internal support for units by building knowledge on conversion between units into MediaWiki. Attributes can have special data types (see below), with built-in information about a given unit system, so that MediaWiki can auto-convert data values to a fixed target unit before exporting them. For example, if an attribute has the type "length", then MediaWiki could automatically transform data values with unit "inch", "miles", or "kilometers" into "meters" before storing/exporting them. Users can freely use units as before, but for known units, MediaWiki behaves as if the user had converted the value into some standard unit (such as "meter"). In the exported annotations, one cannot distinguish between the converted and the user-specified units, so the external handling of annotations is not affected.

Converting values into standard units preserves the users' freedom, but still simplifies the task of external programs a lot. If only attribute values in one unit are considered, then one can completely rely on standard tools and no customized implementations (with built-in unit support) are needed. Furthermore, unit support can be added successively into MediaWiki without any limitation on usage.
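As a rough illustration of such auto-conversion, the following Python sketch normalizes lengths to meters and leaves unknown units untouched. The table `LENGTH_TO_METER` and the function `normalize_length` are assumptions for this example; a real implementation would cover more units and spellings.

```python
# Hypothetical conversion table for a "length" type: factor into meters.
LENGTH_TO_METER = {
    "meter": 1.0,
    "kilometers": 1000.0,
    "miles": 1609.344,
    "inch": 0.0254,
}

def normalize_length(value, unit):
    """Convert a known unit to meters; keep unknown units as given."""
    factor = LENGTH_TO_METER.get(unit)
    if factor is None:
        return value, unit  # unknown unit: export under its own name
    return value * factor, "meter"
```

Because unknown units fall through unchanged, support for further units can be added later without invalidating existing annotations.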

Datatypes

Attributes need to have a specified data type, so that meaningful export and internal processing of a value can be achieved. Datatypes can be given inside articles, using the namespace "Attribute:" to assign them to some attribute. Datatypes themselves are built into MediaWiki and cannot be created online by the users. The datatype of an attribute specifies the following information:

  • The type of the possible values for this attribute: it could be a whole number, a decimal number, a string, a date, etc. This also controls the data type used in export functions, since XML Schema provides types for many different basic data formats.
  • If applicable, the standard unit of the attribute. If users give no unit with an annotation, then this unit should be assumed (can be empty).
  • If applicable, a collection of supported units and the way to convert them into the standard unit. This allows MediaWiki to convert some units automatically to the preferred standard unit.

The parts of the above that are concerned with units only apply in some cases and the suggested system can be started without any such unit-aware types. However, units are helpful, because they enable MediaWiki to provide automatic conversions into other units as well (see below for possible user feedback, based on this idea).
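The three pieces of information listed above could be held in a small record. The following Python sketch is one possible shape, not a description of the actual implementation: the class `Datatype`, its field names, and the restriction to purely multiplicative conversions (which would not cover temperature scales, for example) are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Datatype:
    """Hypothetical record for one built-in datatype."""
    xsd_type: str                 # XML Schema type used on export, e.g. "float"
    standard_unit: str = ""       # assumed when an annotation gives no unit
    to_standard: dict = field(default_factory=dict)  # unit -> factor into standard unit

    def variants(self, value, unit=""):
        """Return the value in every supported unit, e.g. for a tooltip."""
        unit = unit or self.standard_unit
        if unit not in self.to_standard:
            return {unit: value}  # unknown unit: nothing to convert
        in_standard = value * self.to_standard[unit]
        return {u: in_standard / factor for u, factor in self.to_standard.items()}

# Example instance with two supported units.
LENGTH = Datatype("float", "meter", {"meter": 1.0, "kilometer": 1000.0})
```

A type without unit information simply leaves `standard_unit` and `to_standard` empty, which matches the remark that the system can be started without any unit-aware types.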

User feedback and export

Annotations must be accessible to the user without editing the Wiki-source. A first feature to do this is to display all annotations as a list below the article. Each item of the list is of the form "Relation name: link to article" or "Attribute name: data value". For simple attributes, the extracted value is given with the unit from the article, but for unit-aware attributes, the value can be given in various common units (e.g. in "kilometer" and "miles"). It should be rather easy to create such a quick fact sheet automatically at the very bottom of each article.

In addition, relations and attributes can be used to create meaningful tooltips for the annotated area in the article. So a typed link would no longer just contain the target, but also the associated relation. For attributes with unit-aware types, the annotated data values would probably not be links, but could still display helpful tooltips: a short text with conversions of the unit into various other common units would be very helpful for users.

Finally, the user will be able to request the annotation of an article in one or more standard file formats that were conceived for such a purpose. Initially, we want to support exporting of complete OWL/RDF (including annotations, headers, and definitions of categories, relations, and attributes) and of OWL/RDF snippets (containing only the annotations given within the article).

More elaborate forms of user feedback are still under consideration, but are not in the scope of the first extension. These might include complex internal query mechanisms or special listings or statistics on attribute/relation pages.

Relational hierarchy

As in the case of categories, relations can also be usefully organized in hierarchies. The idea is that all annotations that apply to a subrelation do also apply to its superrelations. For example, "is capital of" is a subrelation of "is located in": whenever a city is the capital of some country, the city is located within this country. Specifying a relational hierarchy allows us to get more information out of just a few annotations. The superrelations of some relation are described within the article of the relations (in the namespace "Relation:"). We propose a syntax that is similar to a typed link between relational articles, but using a special link type like "is subrelation of":

[[is subrelation of::Name of Superrelation]]

Another possible name for "is subrelation of" might be "implies relation" since this is really what it does in practice. Of course, "is subrelation of" is not a real relation in our sense: it is predefined and cannot be modified by the user. In OWL export, it corresponds to a language operator for relations and is not a relation in itself. However, using the typed link syntax seems convenient from a usability viewpoint and one could still create an article "Relation:is subrelation of" to explain the specific nature and usage of this "relation".
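The effect of a relational hierarchy can be illustrated with a short Python sketch that derives the implied annotations. The function names and the representation of the hierarchy as a dictionary from a relation to its direct superrelations are assumptions for this example.

```python
def superrelations(relation, parents):
    """Return all relations implied by `relation`, following the hierarchy upward."""
    seen = set()
    stack = [relation]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:  # also guards against cyclic definitions
                seen.add(parent)
                stack.append(parent)
    return seen

def implied_triples(triples, parents):
    """Add, for every triple, the same statement for each superrelation."""
    result = set(triples)
    for subject, relation, obj in triples:
        for sup in superrelations(relation, parents):
            result.add((subject, sup, obj))
    return result
```

With "is capital of" declared as a subrelation of "is located in", the single annotation that Berlin is the capital of Germany also yields the statement that Berlin is located in Germany.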

Further properties for relations in this spirit may follow in the future. Most importantly, we will need transitivity to be available as soon as we have real OWL DL tools that work on Wikipedia's annotation.

Semantic search

In addition to the simple article-based user feedback proposed above, it would be helpful to offer at least basic built-in search capabilities, based on the annotations. The evaluation of OWL/RDF ontologies with things like relational hierarchies and transitive relations can be quite complex, but much simpler searches can already be helpful. A first candidate for implementation would be a simple "triple search" where the user can search for basic subject-relation-object triples (here: either "article-relation-article" or "article-attribute-data value"). For example, a query can ask for all articles that are in the relation "belongs to" with "European Union", thus returning all EU countries. This example is a query for possible subjects of a triple. The opposite type of query asks for possible objects, e.g. for everything that the article "France" "belongs to".

Because simple relations are always specified in the article of their subject, the latter type of search is already answered in the "fact sheets" that were proposed above. So the user just navigates to the article on France to see (at the bottom of the page) what it "belongs to". This does not cover relations that have to be inferred in a more complex way (as in the case of transitivity), but it is still helpful for the user. So we do not need a dedicated interface to enter queries for objects of relations. To query for subjects, one needs to provide the user with an interface to specify the desired relation and object. Although this could be done with some text boxes, one can also provide links to hard-coded searches (assuming that searches are triggered by calling certain URLs with parameters) in the fact sheet. For example, if one looks at the article on "Germany", then the fact sheet, among other things, should say "belongs to: European Union". It would be easy to create a link "(What else belongs to European Union?)" after this entry. This would be a convenient interface for the user to pose queries, since it suggests interesting questions based on the article the user is already looking at. Furthermore, it is very easy to implement this.
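Generating such a hard-coded search link is a matter of URL-encoding the relation and the object. In the Python sketch below, the script path `/semantic/search` and the parameter names `predicate` and `object` are purely hypothetical placeholders.

```python
from urllib.parse import urlencode

def subject_query_link(relation, obj, base="/semantic/search"):
    """Build a URL that asks for all subjects standing in `relation` to `obj`."""
    query = urlencode({"predicate": relation, "object": obj})
    return f"{base}?{query}"
```

The fact sheet entry "belongs to: European Union" could then be followed by a link whose target is `subject_query_link("belongs to", "European Union")`.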

It must be noted that queries for subjects are much harder to answer than those for objects, since the former require us to search through virtually all articles in Wikipedia. The implementation plan below suggests a way to do this in a highly efficient way, based on the fact that triple search requires only RDF-based evaluation of the annotations; we do not need to consider more complicated OWL constructs. Also, the above is mainly tailored towards queries for relations that are given by typed links. Queries for attribute values are slightly more complicated, since one is usually not interested in the list of all cities that have exactly the population of the city one is looking at. However, many datatypes can be ordered and it makes some sense to ask for a list of all articles, ordered by their population values. But such questions must be adjusted to the data type and more complicated queries will only be addressed in future implementation phases.

Simple triple queries are, of course, quite limited: we can ask for all things that "are located in" "Spain", but we cannot combine such queries in order to find all articles that belong to the category "Cities" and "are located in" "Spain". This latter query poses two problems: (1) the user needs an interface to enter such complex queries at all and (2) relations like "belongs to the category" are not adequately treated in plain RDF. In particular, belonging to a category is a transitive relation: if capitals are a subcategory of cities, and Madrid is categorized as a capital, then we want Madrid to appear among the cities of Spain as well. So the query answering software has to infer that Madrid is a city. Luckily, such basic category relations are already supported by many tools, though more complex OWL constructs still receive less support. So, the main problem for more complex queries remains a simple user interface. This is not in the scope of the first extension phase, but it will be an important target for future work. Technical support for combining this with complex data-attribute queries is also available, but again it requires a suitable user interface to enter the corresponding questions at all.
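The Madrid example can be made concrete with a small Python sketch of the inference involved. It is only meant to show what a query tool has to do; the function names and the dictionary-based representation of category membership are assumptions, and a real system would delegate this to an RDFS or OWL reasoner.

```python
def in_category(article, category, categories_of, supercategories_of):
    """True if the article belongs to the category directly or via subcategories."""
    stack = list(categories_of.get(article, ()))
    seen = set()
    while stack:
        current = stack.pop()
        if current == category:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(supercategories_of.get(current, ()))
    return False

def located_in_category(triples, place, category, categories_of, supercategories_of):
    """Combine a triple query ("is located in") with a category condition."""
    return sorted(
        s for s, p, o in triples
        if p == "is located in" and o == place
        and in_category(s, category, categories_of, supercategories_of)
    )
```

If Madrid is categorized only as a capital and "Capitals" is a subcategory of "Cities", the combined query for cities located in Spain still finds Madrid.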

Steps for implementation

Environment: namespaces, database layout, etc.

The extension will introduce two new namespaces "Relation:" and "Attribute:". These will be taken into account during parsing.

Database extensions are not required, but can be added for performance reasons to cache some of the new data. For most such data, we prefer a solution based on an independent "triple store" -- a database that stores and processes (searches, etc.) annotations efficiently (see below). A possible exception is the association of data attributes to data types. Since this information is required during parsing, it should be stored in an additional table to allow for efficient access.

Finally, special features such as RDF-output and semantic searching will require some independent scripts to run on dedicated URLs. This should not be a problem.

Parsing of Wikisource and generation of articles

The parsing has to take into account the syntactical extensions proposed above. This affects only the parsing of links (areas in [[ ]] in the Wikisource). For these areas, one has to

  1. Check the link target for ":=" and "::" and extract real link target (after "::"/":=") and relation/attribute (before "::"/":=").
    1. If the link has no semantic annotation at all, just process it as usual.
    2. If the link was tagged with a relation ("::"), process the rest of the link as before, but store the tuple (relation,link target) in some temporary data structure. Special handling for relations of the type "is Subrelation of" can be implemented here, as well.
    3. If the link was tagged with an attribute (":="), extract annotation value (string) and look for alternative text (if not present, then use annotation value string). Use the alternative text as plain text within the article. Now find out the data-type of the attribute (a remark on this follows below) and hand over the link value to a dedicated data-parsing submethod (hard-coded for the given data type, but much of the code is reused for many data types). If the data type is numerical, the method first extracts the unit by finding the non-numerical part at the end of the value. If there is no unit, a standard unit can be assumed, depending on the data type, or the unit string is just empty. Next, the method extracts the real data value and generates (1) an XML Schema conformant string for this value, (2) a unit string (the unit might be changed if a conversion for this unit is supported), and (3) a string for displaying the data value at the page bottom or in tooltips (possibly with values in other units, and possibly with the given unit and value at the top position). Store the attribute-name together with these strings in some temporary data structure.
  2. After parsing the Wikisource, a "fact sheet" is created at the bottom of the page (after the category links). It is a box where all annotations (ordered alphabetically) are given, together with their specified values. The annotations and values are taken from the aforementioned temporary data structures. Duplicate annotations are reasonable (e.g. if data unit conversion tooltips are needed in many places), but duplicates must be eliminated in the temporary annotation storage.
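Step 2 above can be sketched as follows in Python. The function name `fact_sheet`, the plain-text output, and the case-insensitive ordering are assumptions for this example; the real fact sheet would be rendered as an HTML box.

```python
def fact_sheet(annotations):
    """Turn collected (name, value) pairs into sorted, duplicate-free lines."""
    unique = set(annotations)  # the same annotation may occur many times in the text
    ordered = sorted(unique, key=lambda pair: (pair[0].casefold(), pair[1]))
    return [f"{name}: {value}" for name, value in ordered]
```

Here `value` is the display string produced during parsing, so the sheet shows the annotated values and not the alternative text from the article.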

Remarks:

  • To find out the data-type of an attribute, one would have to look up the Wikisource of the attribute. To make this process more efficient, we suggest introducing one additional database table for this.
  • The fact sheet gives the "true" values for the annotations, not the alternative text provided by the user. This is reasonable, because it is supposed to show the annotation values, and not the text, and because alternative texts are often adjusted to the textual context (such as irregular plural forms) which is not adequate in the fact sheet.
  • The fact sheet gives the annotations in alphabetical order, because this allows quick comparison between different articles with similar annotation values (e.g. two cities); otherwise, one would have trouble finding the corresponding attributes. It should still be easy to find the places of annotation within the article (backlinks for editing them would be feasible to simplify this, but this is not targeted here).
  • It is still open how to syntactically denote the data type of an attribute.

Storage of annotation data

We recommend storing the annotation data in some separate data base that is optimized for holding RDF-based data. Such "triple stores" are freely available. We think that Redland is a major candidate: it is a mature free RDF-triple-store with support for advanced querying. It is written in C, but has language bindings for PHP and many other languages. Since triple stores work on a standard data format, it is not too hard to switch to another implementation later on, if desired.

The suggested process is as follows: After parsing an article, the annotations have been collected in some data structure. Now this data is transformed into RDF-triples (using the XML-Schema version that was computed above for storing attribute values). The triples for an article are straightforward translations of the annotations in the article. For data values with units, the attribute name is modified by adding the unit in parentheses. To update the triples in the triple store, one has to (1) remove all triples with the article as its subject and (2) add all the newly computed triples. This can be done after storing the Wikisource. Race conditions can occur between multiple edits, but we do not have to care much about them, since any further edit can fix mixed up information (in the unlikely case that there really is some editing conflict during update of the triple store). So usually the user can be provided with the finished complete article, even if the triples are still in the process of being stored. Some further optimizations might help to make updating more efficient -- the authors of the triple stores could give some hints here.
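The update procedure can be illustrated with a toy in-memory store in Python. This is not the API of Redland or of any other triple store; the class `TripleStore` and its method `update_article` are invented for this sketch, which only shows the two steps (remove, then add) and the unit-specific attribute names.

```python
class TripleStore:
    """Toy in-memory stand-in for a real triple store (illustration only)."""

    def __init__(self):
        self.triples = set()

    def update_article(self, article, annotations):
        """Replace all triples of `article`; annotations are (name, value, unit)."""
        # (1) remove all triples that have the article as their subject
        self.triples = {t for t in self.triples if t[0] != article}
        # (2) add the newly computed triples, with the unit appended to the name
        for name, value, unit in annotations:
            predicate = f"{name} ({unit})" if unit else name
            self.triples.add((article, predicate, value))
```

Since every save replaces the complete set of triples for the article, a later edit repairs whatever an earlier conflicting update may have left behind.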

Though it generates some overhead in storing articles, using a triple store has major advantages: all semantic services of Wikipedia can work on the triple store, independently from the rest of the database. The triple store comes with efficient functions for searching triples that can immediately be used by Wikipedia. Finally, the store already is a database of all annotations; so, if regular full dumps of annotation data are to be done, one can export the triple store right away.

Output: OWL/RDF output and semantic search

OWL/RDF and OWL/RDF-Snippets can be requested for each article via dedicated URLs. Similarly, search queries are posed basically by requesting URLs that trigger an appropriate script (though links to this URL might be provided by a simpler interface as explicated above). By virtue of using an RDF triple store, such services can be provided independently from the main Wiki database: only the triple store is needed. In particular, the services mentioned here can be implemented in a modular fashion without changing the MediaWiki source at all. Scripts of this kind can reside in a particular subdirectory and can be added or removed as desired. Using a well-developed triple store such as Redland, one can write such extensions in many programming languages, by just using the language bindings of the triple store API.

The particular services mentioned above are fairly easy to implement. Full OWL/RDF output for one particular article first requires fetching all the triples where this article is the subject. In addition, one has to provide basic declarations of all the included annotation elements, i.e. for categories, relations, and attributes. Thus, it is necessary to iterate over the retrieved triples to find out which of these elements are involved. For categories and relations, it is enough to provide a basic declaration that is not further related to their actual definition in Wikipedia. More information on attributes of this relation or on the categorical hierarchy can be retrieved by users through additional requests (and OWL/RDF tools allow easy merging of such specifications). For attributes, the declaration has to involve the datatype. However, one only needs the XML-Schema datatype (such as integer, float, string, etc.) and not the Wikipedia datatype that is possibly associated with unit information (such as length, weight, temperature, astronomical length, etc.). Thus, this information can be retrieved from the triple store, as well (provided that it has been appropriately stored there, when parsing the attribute article) and no access to the Wiki database is needed. After collecting these data, one can produce OWL/RDF in a straightforward way, by just casting triples into proper syntactic form. There are some choices here since RDF allows different forms of expression, and we should provide an example somewhere.
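As one possible form of expression, the following Python sketch writes the declarations and triples for a single article in Turtle, which is used here instead of RDF/XML only because it is shorter to show. The base URI `http://example.org/wiki/`, the function names, and the mapping of article titles to URIs are assumptions; category declarations are omitted, and the output has not been checked against OWL DL tools.

```python
from urllib.parse import quote

BASE = "http://example.org/wiki/"  # hypothetical base URI

def iri(title):
    """Map a page title to a URI reference (spaces become underscores)."""
    return "<" + BASE + quote(title.replace(" ", "_"), safe=":") + ">"

def to_turtle(article, relations, attributes):
    """relations: [(name, target)]; attributes: [(name, xsd_type, lexical_value)]."""
    lines = [
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .",
    ]
    # Basic declarations of the annotation elements that occur in the article.
    for name in sorted({n for n, _ in relations}):
        lines.append(f"{iri('Relation:' + name)} a owl:ObjectProperty .")
    for name, xsd_type in sorted({(n, t) for n, t, _ in attributes}):
        lines.append(
            f"{iri('Attribute:' + name)} a owl:DatatypeProperty ; rdfs:range xsd:{xsd_type} ."
        )
    # The triples themselves, with the article as subject.
    for name, target in relations:
        lines.append(f"{iri(article)} {iri('Relation:' + name)} {iri(target)} .")
    for name, xsd_type, value in attributes:
        literal = value.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(
            f'{iri(article)} {iri("Attribute:" + name)} "{literal}"^^xsd:{xsd_type} .'
        )
    return "\n".join(lines)
```

Leaving out the three prefix lines and the declarations gives a rough idea of what a snippet containing only the article's own triples would look like.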

The creation of OWL/RDF snippets works similarly, but they contain only the triples that are associated with the article. The definitions for the involved categories, relations, and attributes must be retrieved independently, if needed (however, many applications may have fixed assumptions on the definition of these elements anyway, e.g. a scientific desktop application will expect the atomic mass of some element to be given as a floating point number). So, the overhead of transferring these definitions is removed. Likewise, no headers (other than maybe a basic XML-header, for correctness) are included. Each of these services requires only an article name as its sole input.

For simple triple search, one needs a script that accepts three parameters: subject, predicate, and object. Either subject or object should be left out; the query will then return all articles that fit into the specified relation. Specifying both subject and object can be used as a simple check on whether the given triple is in the database, but this might mainly be useful for simple debugging. Specifying neither subject nor object is also a meaningful query for all triples with that relation, but it might not be a good idea to allow this search (unless we have very efficient processing for large numbers of results). Likewise, queries that do not provide a relation can be evaluated, but again, this might not be a useful behavior. The queries themselves can be resolved by the triple store, which accepts them in a standard query language (e.g. SPARQL, which resembles SQL). Results can be rendered (as links) on an HTML page, though a useful clustering for huge numbers of results would be needed at some stage. At the beginning, one might be content with giving all results on one page or blocking queries that yield too many results. RDF-output for query results could be done, but it might not be a good idea: by providing OWL/RDF of the annotations already, external applications can do searches offline. We should not encourage them to use our server CPUs for answering their queries.
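The behaviour of such a search script can be sketched in Python over a plain set of triples. In practice the query would be passed to the triple store; the function `triple_search`, its parameter names, and the choice to reject queries without a relation, queries with neither subject nor object, and oversized result sets are assumptions that follow the cautious options discussed above.

```python
def triple_search(triples, subject=None, predicate=None, obj=None, max_results=100):
    """Return all triples matching the pattern; None means "any value"."""
    if predicate is None:
        raise ValueError("a relation must be given")
    if subject is None and obj is None:
        raise ValueError("either a subject or an object must be given")
    hits = sorted(
        t for t in triples
        if t[1] == predicate
        and subject in (None, t[0])
        and obj in (None, t[2])
    )
    if len(hits) > max_results:
        raise ValueError("query yields too many results")
    return hits
```

Giving both subject and object reduces the search to a membership check, which returns either a one-element list or an empty one.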

These basic services can be augmented by others, e.g. more elaborate search scripts, without modifying MediaWiki. This allows for easy introduction and testing of improvements in the future. More resource-intensive functions could also be disabled for Wikipedia, while being installed on smaller MediaWiki projects. Finally, it allows users to write their own plugins for their own MediaWiki-based semantic wiki at home.

In addition to OWL/RDF output on a single article basis, we should also provide dumps of all annotations on a regular basis. Then, people can experiment at home without making online requests to Wikipedia. Using a triple store simplifies this task, because all annotations are already stored in a central database that can be exported easily. Since the triple store is free and available on most platforms, we could even dump its internal database without any modifications (it might be more useful to users than a single 100Mb RDF-file).

This article is associated with the project Semantic MediaWiki.