Semantic MediaWiki/Background: Ontologies and the Semantic Web

From Meta, a Wikimedia project coordination wiki

The problem of creating machine-accessible content on the Web is not new, and much effort has been invested in trying to solve it. The investigations in this field were strongly reinforced by the articulation of the vision of the Semantic Web, which was envisaged as an improved World Wide Web that allows users to search for actual content instead of text. Based on machine-readable descriptions of web content, "intelligent" software was supposed to gather and organize information, relate data from distributed sources, and answer questions. The basic ingredients for these features are ontologies – formal specifications of various kinds that describe important features in a domain of interest.

Today, the intended revolution of Web-usage has turned into a gradual evolution, and it is clear that the full implementation of the original objectives will still take years to come. Nonetheless, the great amounts of research and development in these fields have established versatile technologies with many applications. The extension of MediaWiki should take advantage of these achievements: the commitment to technologies that have already become standard should allow us to reuse existing software and to stay in the mainstream of future developments.

Wikipedia vs. the Semantic Web

Obviously, Wikipedia is not the Web and it must be understood that the "semantification" of the two will differ markedly. A closer look shows that this is rather fortunate for Wikipedia: some unsolved problems of the Semantic Web simply do not occur in our single-site context. But let us start with some points where a "Semantic Wikipedia" is quite similar to a Semantic Web. Among others, the Semantic Web confronts us with the following issues:

  • web pages that were created for humans must be annotated for use by programs,
  • to be readable and editable by the public, annotations must be provided in standardized data formats,
  • to be understandable by the public, annotations must have a formalized meaning,
  • translating informal information into formal annotations can be difficult; we must develop methods to guide the users,
  • programs must be able to integrate information from many sites,
  • many different people will create their annotations in a distributed way; we must expect contradictions and errors in the gathered data.

In this respect, Wikipedia is not that different from the Web. In particular, the distributed, decentralized way of providing content is an important similarity, suggesting the use of Semantic Web methods for our setting. For most of the above issues, we already have some concrete answers available today: Annotations can be built upon non-proprietary data formats that have a standardized syntax and semantics (meaning). There is quite a bit of methodology and experience in designing ontologies, so an understanding of what types of annotations are more difficult for users has developed.

A multitude of programs exist that can work on these standard data formats. Many of these are still under heavy development, both in companies and in universities. Most software is free (partly free-as-in-speech), but there are also industrial strength applications that are developed commercially. Ontology languages and software are generally designed to work on distributed specifications, which may be incomplete or partially erroneous.

On the other hand, the Semantic Web faces far more difficult issues. Even if everybody used a common standard language for annotations (there is more than one), different names might still be used for the same concepts. There is no "world community" to negotiate the usage of annotations and ontologies, so these can become incompatible. Furthermore, there is no easy way of creating annotations: instead of a convenient MediaWiki interface, people would have to write their ontologies directly in a technical syntax. Finally, the motivation for creating annotations is currently rather weak, since most people do not want to provide their data to the world in a machine-readable way, but rather want humans to visit their sites and click on advertisements. These might be some of the reasons why we are much closer today to a "semantic Wikipedia" than to a semantic WWW in general.

The Web Ontology Language and others

As mentioned above, there are various languages for writing annotation data in a way that is understood globally. Here, we want to discuss RDF/RDFS and OWL, both of which have a machine-accessible XML syntax. Both are W3C recommendations, like HTML and XML, but OWL is a more recent development which is arguably more evolved. However, OWL is backward compatible: OWL ontologies can also be processed with tools that were conceived for RDF or even for XML. The converse is generally not true.

Resource Description Framework (RDF)

RDF is a very simple format for describing relations between all kinds of resources (though the various syntactic formats are confusing for most humans). What an RDF specification describes is basically just a directed graph where both the nodes (i.e. resources) and the edges (i.e. relationships, properties) have labels. That is all, but one can express rather complex relationships.

In the context of Wikipedia, this could be implemented by means of typed links (as described in the section on related work): articles are resources that can easily be described by URIs, and typed links are the labeled edges between them. The resulting structure could then be queried to obtain information. Such queries are just questions about the graph, e.g. "Find all nodes that have a link of type birthplace to France".
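To make the idea of "questions about the graph" concrete, here is a minimal sketch in Python. It assumes that typed links are stored as (subject, link type, object) triples; the article names and the helper subjects_with are hypothetical and chosen only for illustration, not taken from any existing implementation.

```python
# Hypothetical typed links between articles, stored as
# (subject, link type, object) triples: a labeled directed graph.
TRIPLES = {
    ("Victor_Hugo", "birthplace", "France"),
    ("Claude_Monet", "birthplace", "France"),
    ("Marie_Curie", "birthplace", "Poland"),
    ("France", "capital", "Paris"),
}


def subjects_with(triples, link_type, target):
    """Return all nodes that have an edge labeled `link_type` pointing to `target`."""
    return sorted(s for (s, p, o) in triples if p == link_type and o == target)


# "Find all nodes that have a link of type birthplace to France"
born_in_france = subjects_with(TRIPLES, "birthplace", "France")
print(born_in_france)
```

The query is nothing more than a filter over the edges, which is why even very simple tools can answer it.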

Moreover, RDF relationships can also be declared between a resource and a so-called literal. In effect, literals are just simple data values with an associated data type (the available types are defined in the standard and are closely related to the data types in XML Schema). Thus, one can annotate resources with data properties that have certain values. The result can still be depicted as a directed graph, which now has two distinct kinds of nodes: resources, and literals that carry data values together with their type information. RDF has other features, such as the description of resource collections (sets, lists, etc.), but we will not go into these details. With RDFS, queries can also be formulated that make use of such metadata, either on its own or in combination with the data it describes.
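The distinction between resources and typed literals can be sketched as follows. This is only an illustration under our own assumptions: the Literal class, the property names, and the helper functions are hypothetical, and only the XML Schema namespace URI is taken from the standard.

```python
from datetime import date
from typing import NamedTuple

# Namespace of the XML Schema data types that RDF literals refer to.
XSD = "http://www.w3.org/2001/XMLSchema#"


class Literal(NamedTuple):
    """A data value (hypothetical helper): lexical form plus data type URI."""
    lexical: str
    datatype: str


triples = [
    # resource -> resource (a typed link)
    ("Victor_Hugo", "birthplace", "France"),
    # resource -> literal (data properties)
    ("Victor_Hugo", "birthDate", Literal("1802-02-26", XSD + "date")),
    ("Victor_Hugo", "name", Literal("Victor Hugo", XSD + "string")),
]


def to_python(literal):
    """Convert a literal to a Python value according to its declared type."""
    if literal.datatype == XSD + "date":
        return date.fromisoformat(literal.lexical)
    if literal.datatype == XSD + "integer":
        return int(literal.lexical)
    return literal.lexical


def data_properties(triples, subject):
    """Collect only the literal-valued properties of a resource."""
    return {
        p: to_python(o)
        for (s, p, o) in triples
        if s == subject and isinstance(o, Literal)
    }


hugo = data_properties(triples, "Victor_Hugo")
print(hugo["birthDate"].year)
```

Because the type travels with the value, a program can compare dates as dates rather than as strings.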

RDF Schema (RDFS)

However, RDF is not sufficient for more elaborate purposes, because it cannot describe anything beyond simple directed graphs. In particular, there is no internal mechanism to implement classes (e.g. for categorization of articles). Sure, one can define relationships with a label "hasClass" between an article and its class, but a typical RDF tool will not recognize this as a special relationship. In fact, the class (category) will just be treated like any other resource (article).

This creates problems with subclass relationships: if A is a subclass of B, and B is a subclass of C, then A is a subclass of C. But this will not be derived by RDF tools, because we cannot express the notion that the relation "subclass of" is transitive. Indeed, "subclass of" is just a label – a string that has no internal meaning whatsoever.

To overcome this problem, RDF was extended by a simple ontology language called RDF Schema (RDFS). In this language, special relationships such as subclassOf are predefined and are treated in a standardized way. This enables programs to handle various "structural" descriptions in an adequate way, instead of treating them like plain meaningless labels. This facilitates classification, as RDFS has a predefined Class (object) and a property type which states that a resource belongs to a class (is-of type). Any resource of type Class is treated as a class, and can thus be used as the type of other resources. In addition, classes can be organized into a hierarchy by relating them via the subclassOf property.

The meaning of these expressions is built into the language. For example, let A be a subclass of the class B and assume that the specification contains a resource r of type A. Now, when a user enters a query for all objects of type B, then r will also be returned: software that implements the RDFS specification infers this relationship, much like inheritance in object-oriented systems. These features are very helpful, because they simplify annotations considerably. Without the built-in semantics (meaning), one would have to state explicitly that r is of type B (and possibly of many other classes). On the other hand, software is required to become more "intelligent" than when working with simple RDF alone.
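The inference described here can be sketched in a few lines of Python. The sketch assumes a plain set of triples and uses the standard names rdf:type and rdfs:subClassOf as simple strings; the functions superclasses and instances_of are hypothetical helpers, not part of any RDFS library, and real RDFS reasoners cover many more rules.

```python
TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

triples = {
    ("A", SUBCLASS_OF, "B"),
    ("B", SUBCLASS_OF, "C"),
    ("r", TYPE, "A"),
}


def superclasses(triples, cls):
    """Return cls together with every class reachable via subClassOf."""
    seen = {cls}
    stack = [cls]
    while stack:
        current = stack.pop()
        for s, p, o in triples:
            if s == current and p == SUBCLASS_OF and o not in seen:
                seen.add(o)
                stack.append(o)
    return seen


def instances_of(triples, cls):
    """Return all resources whose stated type is cls or one of its subclasses."""
    return sorted(
        s for (s, p, o) in triples
        if p == TYPE and cls in superclasses(triples, o)
    )


# r is only declared to be of type A, but is found for B and C as well.
print(instances_of(triples, "B"))
print(instances_of(triples, "C"))
```

A tool that knows only plain RDF would return nothing for B and C, since no triple states those types explicitly.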

Aside from these extensions (and some more of a similar kind), RDFS is very closely related to RDF. Every RDFS specification is valid RDF, which in turn is syntactically correct, valid XML. The only difference is that capable programs can make use of the additional knowledge provided by the built-in semantics (meaning). However, one can still use RDF tools to work with the data.

Too much and too little: a critical look at the expressive power of RDFS

Like all Semantic Web technologies that manage, express, and store their data as text-based documents serialized in XML, the combination of RDF and RDFS has some disadvantages. There are two major sources of trouble: (1) RDF(S) treats all properties and classes as resources and (2) statements that can be made about some resource are usually legal for any resource. For example, one can easily state that a class has itself as a type (i.e. it is an instance of itself). This creates some problems. When we speak of classes (or categories), we usually imagine them as "collections of things". If something is of a certain type, then it just belongs to that collection. For instance, the class Person is an abstraction that can be understood as the collection of all persons. Unfortunately, this interpretation of classes is no longer applicable when we allow classes to be their own type: in common set theory, no set can contain itself.

In effect, the correct formal interpretation of RDFS is much more complicated and is not easily communicated to the average user. This, of course, is quite problematic in the context of Wikipedia, because we cannot provide prior training for editors working on annotations. But the complicated semantics of RDFS gives us even more expressive power; sometimes more than we would reasonably like to have. By definition, even predefined resources like Class and subclassOf are just resources. Thus, we can legally state that "subclassOf is-of type Class" – a statement that is rather nonsensical. This further adds to the confusion that users may encounter when working with RDFS. Although users can still work with the idea that classes are collections of resources, standards-compliant software has to obey the official semantics to process arbitrary RDF(S) input. Thus, the behavior of such tools might not be what the user expected.

On the other hand, RDFS is a very weak language for making more elaborate descriptions. For example, like RDF, it has no means of stating that a property is transitive. So, if we state that Frankfurt lies in Germany and that Germany lies in Europe, we cannot derive the information that Frankfurt is in Europe. Yet, the average user would take it for granted that this knowledge is given in the specification, and would like to obtain Frankfurt in a search for European cities.
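The missing derivation is easy to state once a tool is told that a property is transitive. The following Python sketch computes the transitive closure of a single property; the property name liesIn and the function transitive_closure are our own hypothetical choices, and this is precisely the kind of knowledge that cannot be declared in RDFS itself.

```python
facts = {
    ("Frankfurt", "liesIn", "Germany"),
    ("Germany", "liesIn", "Europe"),
}


def transitive_closure(facts, predicate):
    """Add (a, predicate, c) whenever (a, predicate, b) and (b, predicate, c) hold."""
    closed = set(facts)
    changed = True
    while changed:
        changed = False
        # Snapshot the current links so we can safely add to `closed` below.
        links = [(s, o) for (s, p, o) in closed if p == predicate]
        for a, b in links:
            for c, d in links:
                if b == c and (a, predicate, d) not in closed:
                    closed.add((a, predicate, d))
                    changed = True
    return closed


closed = transitive_closure(facts, "liesIn")
print(("Frankfurt", "liesIn", "Europe") in facts)   # stated explicitly?
print(("Frankfurt", "liesIn", "Europe") in closed)  # derivable?
```

A search for European cities over the closed set would return Frankfurt, whereas a search over the stated facts alone would not.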

Another limitation is that RDFS cannot construct complex class expressions: if the user wants to have all resources that belong to the class "City" and are located in Germany, then RDFS cannot be used as a query language. Likewise, we cannot say that the class Human consists exactly of the classes Woman and Man, along with many other more elaborate statements that we might want to make (this concerns the possibility of extending our annotation framework later on; for the moment, we have no need for such complicated expressions).

Finally, RDF(S) has a feature called reification that allows us to use statements as resources. So, we can express "the fact that Frankfurt lies in Germany is-a type of geographical relation". Though this sounds complicated, it actually has quite a few practical applications. It allows us to annotate our annotations, for example, with a source for a statement or a time for which it is true.
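As a rough illustration of reification, the following Python sketch turns a statement into a resource of its own, using the RDF reification vocabulary (rdf:Statement, rdf:subject, rdf:predicate, rdf:object) as plain strings. The identifier stmt1, the property source, the value Some_Atlas, and the helper functions are hypothetical and only serve the example.

```python
REIFICATION_PROPERTIES = {"rdf:type", "rdf:subject", "rdf:predicate", "rdf:object"}


def reify(statement_id, subject, predicate, obj):
    """Describe the statement (subject, predicate, obj) as a resource."""
    return {
        (statement_id, "rdf:type", "rdf:Statement"),
        (statement_id, "rdf:subject", subject),
        (statement_id, "rdf:predicate", predicate),
        (statement_id, "rdf:object", obj),
    }


def annotations(graph, statement_id):
    """Return everything said about a statement beyond its own description."""
    return {
        p: o
        for (s, p, o) in graph
        if s == statement_id and p not in REIFICATION_PROPERTIES
    }


graph = {("Frankfurt", "liesIn", "Germany")}
graph |= reify("stmt1", "Frankfurt", "liesIn", "Germany")

# Annotate the annotation: attach a (hypothetical) source to the statement.
graph.add(("stmt1", "source", "Some_Atlas"))

print(annotations(graph, "stmt1"))
```

Note that the statement resource is just another node in the graph, so anything that can be said about an article can also be said about a statement.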

However, reification turns out to be extremely powerful; so powerful that, in combination with simple (very useful) extensions like those mentioned above, it rules out the possibility of implementing a program that can fully evaluate these specifications (the language becomes undecidable). That is the reason why one usually chooses to sacrifice reification for some other practical features and decidable (implementable) formal semantics.

The Web Ontology Language (OWL)

OWL has much simpler semantics that disallow some freedoms of RDFS, in exchange for more powerful descriptions in other areas. The added power also poses some problems in intuitive usability, so we should restrict the system to simple OWL annotations. To be added …

Software applications for the Semantic Web

Here we will introduce some tools, preferably non-commercial ones, that can be used to work on ontologies in standard file formats.

cultureset.com

http://www.cultureset.com

Cultureset.com is based on the idea that the web is improved when common data formats are used. In this case, the data is about "when did things happen in my life?", "when did things happen in my friends' lives?", and "when did they happen concurrently?".

In addition, cultureset.com looks at the problem of how to get more end-users involved, by asking questions these users might ask. In this case, users are either "publishers" (people who have already assembled web content about when things happened) or "users" (general end-users of the forward-facing app functionality). Their questions might be:

(publishers) - How do I get more traffic back to my site? How do I quickly convert my data into a common format that cultureset.com can process?

(users) - Why should I use this application? Is it fun? Can I learn something from it? Does it help me make connections with other people? Does it help me enhance the connections I already have?

The application tries to answer all of these questions at once. At the same time, the authors of cultureset are considering how to join the growing community of people and projects who are pursuing the goals of the semantic web.

Cultureset has introduced its own XML format for representing its data, but is also able to read HEML (http://heml.mta.ca) and iCalendar data. It is likely that more data formats will be accepted (JSON, etc.), and that open-source tools will be integrated where possible (SIMILE timelines for data representation, etc.). Perhaps there is the possibility for Semantic MediaWiki and cultureset.com to interchange data easily. At some point, our XML format may be completely replaced by something more widely used; for the moment, we have not found anything that perfectly suits our data needs.

The focus on increasing end-usership will remain highly important. Although some might cringe at the coarse sound of "getting more traffic", traffic is in fact the lifeblood of the web.

Cultureset is at the moment a non-commercial site, and contains no advertising. This may change in the future.


This article is associated with the project Semantic MediaWiki.