Talk:Semantic MediaWiki/Blueprint

From Meta, a Wikimedia project coordination wiki

Please consider using our mailing lists on Sourceforge for further discussions. The talk pages in our various project and demo wikis are becoming untrackable, and the lists could easily handle some more traffic. --Markus Krötzsch 21:48, 17 October 2005 (UTC)

Attributes vs. Relations

In his weblog, Graham Glass raises the following issue:

I was a bit surprised to see that two additional kinds of tags were proposed: attributes and relations. Apparently, attributes are features of the thing described in the page, whereas relations exist between pages. For example, on the page for Berlin, instead of the current MediaWiki markup:
Berlin is the capital of [[Germany]]. Berlin has [[3.390.444]] inhabitants.
you would write the enhanced markup:
Berlin is the capital of [[is capital of::Germany]]. Berlin has [[population:=3.390.444]] inhabitants.
where "is capital of" is a relation, "population" is an attribute, and "Germany" is described on a separate Wiki page. Note that attributes and relations are specified using a different syntax.
I don't see the need for both relations and attributes. To me, "capital city" and "population" are both just properties of Berlin. In other words, I think you could write a simpler form of markup that uses a single unified syntax to indicate a property:
Berlin is the capital of [[is capital of::Germany]]. Berlin has [[population::3.390.444]] inhabitants.
It's tempting to say that "attributes are different because they have literal values and don't ever link to a separate page" but this is not true. For example, the page associated with a Rose might include the following text:
The color of a rose is often [[color::Red]].
where "color" is clearly an attribute of a Rose but "Red" has its own page.

In reply to this important remark, I will try to provide some not too short explanations. Most importantly, from the perspective of our readers, we really want attribute and relation annotations to be treated differently. In short, relations are just annotated links and they otherwise behave like normal links: they are rendered as hyperlinks in the article, and if the target does not exist yet, there will be a red link that allows you to create it. This usage corresponds to the link to Germany in the above examples.

In contrast, attributes are used to annotate data values that do not correspond to links, but that would just be represented as plain text in today's articles. An example is the number 3390444 above: there is no article of that name, and if there were one, it would probably not have much to do with the population of Berlin. Thus, the annotated value should not become a link in the final article. However, as explained on the implementation page, we plan to introduce support for units of measurement in the case of attribute values. This allows us to have features like automated unit conversion: tooltips that give the transformed value in several related units can be shown to the readers when hovering over some data value. This is obviously quite different behavior compared to annotated links.
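The difference in rendering described above can be sketched as follows. This is only an illustration, not the project's parser: the regular expression, the function name `render`, and the choice to return a list of tuples are all assumptions made for this example.

```python
import re

# Hypothetical pattern for the two inline forms discussed above:
#   [[relation::Target]]  and  [[attribute:=value]]
ANNOTATION = re.compile(r"\[\[([^\[\]:|]+?)(::|:=)([^\[\]|]+?)\]\]")

def render(wikitext):
    """Return (display_wikitext, annotations) for annotated wikitext."""
    annotations = []

    def replace(match):
        name = match.group(1).strip()
        value = match.group(3).strip()
        kind = "relation" if match.group(2) == "::" else "attribute"
        annotations.append((kind, name, value))
        # Relations stay ordinary links; attribute values become plain text.
        return "[[" + value + "]]" if kind == "relation" else value

    return ANNOTATION.sub(replace, wikitext), annotations
```

Applied to the Berlin example, the relation is left as a normal link to Germany, while the population value comes out as plain text with no link.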

In addition to the user view, there are also some technical reasons. Links (marked by relations) are plain text that refers to certain URLs in the Wiki. In contrast, data values (marked by attributes) belong to a specific data type, e.g. they can denote a number, a date, or a simple string. We have to parse data values for several reasons: (1) we want to do calculations with them (such as conversions to other units), (2) we want to compare them (looking for the ten greatest cities is different from looking for the ten cities with the alphabetically greatest population number), (3) we want to generate RDF export, and such export requires values in a standard format (so we need to obtain the real value and transform it back into some standard-conformant string; we cannot expect our editors to write in the international standard way, especially in non-English Wikipedias). Now for parsing, we need to know the data type. Thus attributes require entirely different technical handling than relations: if a nonexistent relation is used for annotation, we can still fully process the data. If a nonexistent data type is used, then we are unable to do meaningful parsing and cannot export the annotation at all.
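Point (2) is easy to see in a small sketch. The parser below is an assumption for illustration only: it handles whole numbers with thousands separators and nothing else (decimal separators would need locale information). Berlin's figure comes from the discussion; the two towns are invented.

```python
def parse_population(text):
    """Parse a whole number written with '.', ',' or spaces as thousands separators."""
    return int(text.replace(".", "").replace(",", "").replace(" ", ""))

# Berlin's figure is from the discussion; the two towns are invented for illustration.
populations = {"Berlin": "3.390.444", "Town A": "95.000", "Town B": "1.200.000"}

# Ordering by parsed value versus ordering by the raw text.
by_value = sorted(populations, key=lambda name: parse_population(populations[name]), reverse=True)
by_text = sorted(populations, key=lambda name: populations[name], reverse=True)

print(by_value)  # ['Berlin', 'Town B', 'Town A']
print(by_text)   # ['Town A', 'Berlin', 'Town B']

# A normalized form for export, independent of how the editor wrote the number.
print(str(parse_population("3 390 444")))  # 3390444
```

Sorting the raw strings puts the smallest town first, which is the "alphabetically greatest population number" problem mentioned above.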

The previous paragraph also hints at the deep formal differences that specification languages see between attributes and relations. You may want to have a look at the specifications of RDF(S) or OWL to get an impression of the underlying conceptual discrepancies between both things.

So, summing up, one can say that:

  • Relations describe relations between articles, which are displayed as links and thus connect to other relevant resources. Searching for links is simple: e.g. one can search for all articles with a "is capital of" link to "Germany".
  • Attributes describe data-properties of articles, which are displayed as plain text, support unit conversions, can be ordered, and must be translated into standard compliant expressions for export. Searching for attributes requires advanced options, e.g. one would seldom like to search for all articles with "population" attribute being exactly 3390444, but rather for all that are at least or at most that size, and one wants to order the results increasingly or decreasingly. When searching for physical quantities, one wants to work with units, e.g. when searching all rivers longer than 500 miles.
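The river query from the last point can be sketched like this. The data, the unit table, and the function names are hypothetical; the only fixed fact is that the international mile is defined as exactly 1.609344 km.

```python
KM_PER_MILE = 1.609344  # international mile, exact by definition

TO_KM = {"km": 1.0, "miles": KM_PER_MILE}

def length_km(value, unit):
    """Convert a length to kilometres so values in different units compare correctly."""
    return value * TO_KM[unit]

def longer_than(articles, threshold, unit):
    """Names of articles whose length exceeds the threshold, longest first."""
    limit = length_km(threshold, unit)
    hits = [(length_km(v, u), name) for name, (v, u) in articles.items()
            if length_km(v, u) > limit]
    return [name for _, name in sorted(hits, reverse=True)]

# Invented rivers, annotated in whatever unit the editor happened to use.
rivers = {"River A": (1200, "km"), "River B": (400, "miles"), "River C": (600, "miles")}

print(longer_than(rivers, 500, "miles"))  # ['River A', 'River C']
```

The query is stated in miles, yet it finds a river annotated in kilometres, because every value is normalized before comparison.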

But what about the example with the red rose above? Well, as a data value, it could only be a string (there is no special type for colors) and there would indeed not be many additional features (no units, ordering only alphabetical, etc.). So we would naturally conceive the property "color" as a relation that links to the article of red. Compare this to a scientific context, where colors are encoded as wavelengths: there we need unit conversion and comparison of wavelengths, but we won't have articles for every wavelength. So this is an application for attributes. In general, the decision for either modelling must be made carefully (of course it can be changed later on). A good example where string data types are probably useful is the first names of prominent people: we do not want the first name to be a link, but we might want to search for all people that have a certain first name (there was even a discussion on how to allow this based on the current technology). Using a data type also yields features such as displaying all people ordered alphabetically by their first (or, more difficult with current articles, last) name.

Consequently, we think that the chosen distinction is helpful and necessary. Note that we really made the syntax for attributes and relations very similar, so users can easily get used to both at the same time. Completely unifying both syntaxes seems not to be desirable, since the effect of both annotations on the article layout is different. We think that this should be clear from the source code. I understand that most of what I said above is not explained in much detail on the project pages. Having a mock-up example of an annotated article-source and its desirable output would probably be helpful. I will add it when I find the time … --Markus Krötzsch 12:58, 7 September 2005 (UTC)

Natural language processing

Would it be useful to use some basic natural language processing to encourage editors to annotate existing links, or attributes on a page? This could be as simple as a list of search expressions, e.g. 'capital of' in the same sentence as a link, but I'm sure much more sophisticated checks would be possible without too much complication. When a phrase is recognised, the new syntax could be suggested to the editor and/or the article listed on a "Please Annotate" page. Of course, this would probably be a phase 2 feature - better to get something working before trying to automate it. Rattle 23:53, 21 August 2005 (UTC)

Yes, such features could really be built on top of a basic annotation mechanism. It should then be rather orthogonal to the annotation itself and can be provided with the help of external tools or "annotation bots" as well. Templates would also be a good starting point to bootstrap annotation. Since we could probably only provide very basic automatic suggestions, they will probably only be useful in the beginning of the project. Afterwards, people will have supplied most "obvious" annotations anyway. So one might consider just annotating automatically with a bot and letting the users (when they begin to use the features) check whether the annotations were correct. But again, this is all quite independent from the annotation capability in itself. --Markus Krötzsch 18:19, 25 August 2005 (UTC)
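The "list of search expressions" idea from this thread could look roughly like the sketch below. It is a guess at the simplest possible form, not a description of any existing bot: the phrase table, the sentence splitting, and the function name `suggest` are all assumptions, and only the 'capital of' expression comes from the comment above.

```python
import re

# Hypothetical phrase-to-relation table; only "capital of" comes from the thread.
SUGGESTIONS = {"capital of": "is capital of"}

# Plain links only: no namespace, no pipe, and no existing annotation.
PLAIN_LINK = re.compile(r"\[\[([^\[\]:|]+)\]\]")

def suggest(wikitext):
    """Propose annotated links for plain links that share a sentence with a known phrase."""
    hints = []
    for sentence in re.split(r"(?<=[.!?])\s+", wikitext):
        for phrase, relation in SUGGESTIONS.items():
            if phrase in sentence:
                for target in PLAIN_LINK.findall(sentence):
                    hints.append("[[" + relation + "::" + target + "]]")
    return hints
```

Such hints would only be suggestions for an editor or a "Please Annotate" page; a sentence with several links would produce several candidates, most of them wrong.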

Syntax of annotations

After reading your paper presented at Wikimania 2005 about Wikipedia and the Semantic Web and this page, I would like to make some comments about the syntax proposed for the metadata. IMHO embedding semantic relations inside wiki links is not a good idea for some reasons:

  • The metadata will be scattered through all the page
  • Users would be confused with those strange links

My main concern is the last one, because users will need to learn the purpose of this new syntax, obscuring the contents for those that do not understand the purpose of those semantic relations, maybe preventing them from contributing to Wikipedia.

I've worked (a little bit) some time ago on ontologies and RDF, and I also imagined that it would be extremely useful to add semantic data to Wikipedia. My syntax also used the new domain Relation: (but not Attribute:, great idea!). However, it was closer to the one used for categories, because in my opinion this metadata should be placed at the end of the page, and should be "hidden" (not shown as text of the article, but in a special frame at the bottom of the page). For example, at the end of the article about On Her Majesty's Secret Service, the semantic data will be grouped between its categories and the interwikis (actually, like categories, it can be put anywhere, however it should normally be written at the end of the page and inside templates, e.g. infoboxes):

[[Category:1969 films]]
[[Category:James Bond movies]]

[[Relation:Main actor|George Lazenby]]
[[Relation:Written by|Ian Fleming]]

[[Relation:Release date|1969]]

[[Attribute:Runtime|140 min]]

[[de:Im Geheimdienst Ihrer Majestät]]
[[fr:Au service secret de Sa Majesté]]
[[sv:I hennes majestäts hemliga tjänst]]

The metadata can be displayed by the MediaWiki software in the same way as categories.

Later, in the Relation: namespace, the page defining Main actor would look like:



Another advantage of this syntax is that it looks like the existing wikicode used in MediaWiki, with special links and their parameters. Thus, users won't have to learn a new syntax, and of course the software will need fewer changes.

I hope this helps. --surueña 16:11, August 25, 2005 (UTC)

I agree that this is a valid alternative to our current proposal and we considered this option more than once during our planning. Still it has advantages as well as disadvantages, and so far we considered the disadvantages to be more relevant. The main problem is that moving the annotation to the end of the page leads to a noticeable split of annotation and "normal" Wikisource, such that we might end up with a "two-class society" inside Wikipedia, where annotations are only edited by a minority of users. We think that this is an important problem, since we are not just talking about some "metadata" here: the annotations are extremely close to the content of the article, and they can lead to disputes and discussions like any other part of the content. This is even emphasized since many external applications will only get the annotations, without any explaining text. If we want a high quality of annotation, we therefore need all our contributors to consider and to judge them, and we have concerns that the splitting of annotation from text would constrict this.
From this viewpoint, your first comment (that annotations and text would be mixed up) appears to be a feature rather than a bug: we want the article text to be fully consistent with the annotation, and thus annotations should be stated at the same place where the article makes the corresponding statement. Of course, you are right that this distributes annotations around the whole article, but it also binds the annotations to the appropriate place in the article so that they can be found easily. At the same time, we plan to provide a summary of all annotations at the bottom of each page, and it would even be possible to include "edit"-links with each such annotation to bring the user to the relevant position inside the text. So it should be easy enough to find the annotation.
While these are the arguments in favour of inline-annotation, it is also true that it makes the Wikisource more complicated. The question is how severe this effect really is. We tried to create a simple human-readable syntax which is similar for link types and for data values. Would this really repel contributors (that are not repelled by all the other syntactic elements, like normal links, images, templates, tables, ...)? One should also consider the fact that only a fraction of all links will get any annotation at all. In many cases, one will annotate facts right inside a template and not touch the article text. In cases where there are no templates on a page, there are usually not so many common properties for the article (otherwise one could create a template ;-). So we hope that we do not decrease the number of Wikipedia editors.
Furthermore, the data-annotation feature in conjunction with units of measurement has another application: one can have tooltips (e.g. via CSS -- no Javascript ;-) that give a value in different units. So you move your mouse to the place where the text says "1.200 miles" and the tooltip gives the length in "km" as well. This feature is easy to implement based on basic unit support, but it will only make sense for inline-annotations. Having a unit conversion at the end of the page is also helpful (and will be implemented) but it is obviously not as simple for reading ... Thus, even though new syntax requires some familiarisation for the editors, it makes Wikipedia simpler for the readers, and the editors are really rewarded for their efforts. Some people might also be enthused even more by seeing such cool things being achieved through their contribution. Together with the additional promotion that Wikipedia might get from external (desktop) applications that use the data, the number of editors might even be increased after the introduction of annotations (inline or not), so I am not concerned about losing the strong support from the community.
These are the main reasons for our decision, though we are aware that your objection on usability is a major issue. We tried to alleviate it by developing a simple syntax, but there is really no perfect solution for this problem. In an emergency (if editor numbers decrease after introducing the feature), we could still transform inline-annotations into a syntax that conforms to your approach automatically (gradually by a bot). The converse is not that easy to do, especially for data values. --Markus Krötzsch 19:31, 25 August 2005 (UTC)
P.S.: I still think that your proposed syntax is really nice as well. Maybe it can be used in some or the other place where "invisible" annotations are desired (especially in the "Relation:"-articles). It can certainly coexist to some extent with inline annotation (though, for the above reasons, I would not want to have any possibly inconsistent/disputed content annotation "hidden" at the bottom of the page). --Markus Krötzsch 19:36, 25 August 2005 (UTC)
I can see both points here. In terms of syntax, I think allowing any link to have semantics added to it is hugely more powerful than having to define all semantic relationships separately. On the other hand, from a systematic point of view, scattering semantic information all over a document (particularly in an environment like this that encourages the editing of sections in isolation) sounds incredibly fragile. But maybe this is just the difference between function and practice. The function is more powerful if it can work on any link; the human process will be more manageable if the largest amount of semantic annotation is done within template-driven infoboxes. --glenn mcdonald 20:35, 10 October 2005 (UTC)

The online prototype is awesome! [Remark by Markus Krötzsch: the online demo now has moved to] And very promising. And I must say that it changed my opinion about the location of annotations. I agree with you that it should be possible to put them anywhere because they must be close to the related contents. However, I still believe that the majority of semantic annotations will be inside infoboxes, but who knows.

With respect to the syntax of annotations, I do not think that having two types of annotations is a good idea because editors would need to learn both, and the wikicode syntax would grow more than necessary. If annotations should be sometimes visible and sometimes hidden, it's better to design a syntax that allows both behaviors.

Although I like the current syntax for relations and attributes because it is intuitive and easy to use (i.e. relations: is capital of::Germany and attributes: Population:=3,390,444; I will name it operator syntax from now on), IMHO it has some disadvantages:

  1. It is a completely new wikicode syntax, with no resemblance to current wikicode constructs. This is a big change in the wikicode syntax, and implies more changes to MediaWiki as well as to the current bots (as has been stated below on this page, they would probably not work anymore).
  2. Not backward compatible, i.e. page titles that are valid today will be forbidden with this syntax (the :: and := would be illegal).
  3. Template issues [1].

In my opinion the worst problem is breaking backward compatibility. Although forbidding := and :: in page titles may be seen as a minor problem, Wikipedia has enough technical limitations, so the tendency should be to reduce the number of illegal titles instead of increasing it (e.g. I'm sure std::out and X:=Y are very valuable for a wikibook about programming languages).

Thus, it's better to reuse any of the currently available wikicode constructs:

  • Links with a special namespace, e.g. [[Image:X]], [[Media:X]], [[Category:X]], …
  • HTML tags, like <math>, <gallery>, <timeline>, …

In my opinion the best option is to follow the namespace notation because a semantic annotation is a type of metadata, and currently the closest type of metadata are categories, so using this syntax for the annotations is a natural choice, isn't it? Therefore I still propose the namespace notation:

 [[Relation:is capital of | Germany ]]
 [[Attribute:Population   | 3 390 444 ]]

As I said, they are completely analogous to category annotations, i.e. the construct is invisible in the page body, it just shows a summary at the bottom of the page (and of course, simply adding a colon before the namespace creates a link to the page of the special namespace and not an annotation). And, there's an easy solution for those cases where the semantic annotation must be shown at the body of the article: the templates {{relation|is capital of|[[Germany]]}} and {{attribute|Population|3390444}}.

Those new templates do not exist yet (these names are currently unused in the Wikipedia, as well as abbreviations like rel and atr), and they would be very simple:




The usage of these templates is simple and safe (if the value of the annotation has to be written more than once, then semantic annotations would be error prone), but the biggest advantage is that it is very intuitive, even for current editors.
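To make the namespace notation proposed above concrete, here is a minimal sketch of how such annotations might be pulled out of a page, assuming they behave like category links: removed from the body, collected for a summary, and left alone when a leading colon is present. The pattern and the function name `extract` are hypothetical, and values that themselves contain links or further pipes are not handled.

```python
import re

# Hypothetical pattern for the namespace notation:
#   [[Relation:name | value]]  or  [[Attribute:name | value]]
# A leading colon ([[:Relation:name]]) is an ordinary link and is not matched.
NAMESPACE_ANNOTATION = re.compile(
    r"\[\[\s*(Relation|Attribute):([^\[\]|]+)\|([^\[\]|]+)\]\]")

def extract(wikitext):
    """Return (body_without_annotations, annotations)."""
    found = [(m.group(1).lower(), m.group(2).strip(), m.group(3).strip())
             for m in NAMESPACE_ANNOTATION.finditer(wikitext)]
    body = NAMESPACE_ANNOTATION.sub("", wikitext)
    return body, found
```

Because the namespace prefix comes first, a parser can decide from the start of the link whether it is an annotation, which is the efficiency point raised further down in this comment.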

The namespace notation can be used in infobox templates without any disruption (see discussion about semantic templates) because editors can choose to pass to the template a value of any kind —internal links (piped too), external links and even no links at all but only a word, although maybe the last two shouldn't be allowed— and also multivalued annotations —very common in infoboxes, e.g. some countries have more than one official language—, more difficult to achieve with the operator syntax.

 [[Relation:has capital|[[Madrid (city)|Madrid]]]]
 [[Relation:has official language|[[German language|German]], [[French language|French]], [[Italian language|Italian]], [[Romansh language|Romansh]] ]]

However, the usage of templates has to be further investigated, and probably an escape sequence should be implemented because notes are very common in infoboxes (for example, comments on the value can be put in parentheses without being considered part of the semantic annotation).

Efficiency is another advantage: special links are always detected by the namespace, so MediaWiki would not have to process semantic annotations differently by searching for an operator in the middle. This is important because we all know that performance is currently a real problem for Wikipedia, and the less processing the syntax requires the better. But of course maybe this cost is negligible (and we must consider the usage of templates for visible annotations), so this must be further investigated too.

But a big advantage of the namespace notation is that, as a standardized internal link, bots will probably work without changes. Maybe there should be a standard for making extensions to the wikicode syntax to prevent bots from breaking in the future (e.g. bots will not follow a link to an unknown namespace), and thus a paper about the design of the semantic wiki would be very useful for the next Wikimania. What do you think?

I want to say that after thinking about a lot of different solutions, I really like this one. Of course there are many other solutions, for example using one more parameter when the editor doesn't want to hide the annotation (e.g. [[Relation:is capital of | Germany | Germany]] or [[Attribute:Population|3 390 444|3.4 million]]), and the large number of possibilities introduced by a tag notation.

This notation also has some disadvantages, for example it requires more typing than the operator syntax, but metadata is a small amount of text w.r.t. the whole article, and all in all it has many advantages. But the main reason for not introducing the operator notation is the precedent for future expansions: if the semantic annotation introduced a completely new operator, why shouldn't my cool extension do the same? And in that case, an overly complex wikicode means more CPU time, bots will stop working and more titles will be invalid.

I hope this helps. --surueña 13:20, 16 October 2005 (UTC)

Well, just a quick comment: I think the problem with our notation is not that big after all: we have no problem with articles that are called "std::out" and the like. The only issue is that std will be stripped as a semantic relation before a link to such an article is created. This could be prevented in various ways, e.g. by suggesting the use of an HTML HEX-Notation for the ":" inside the article name, or by introducing <nosemantics> tags. Of course, the chosen syntax affects the required implementation, but since the implementation uses regular expressions for finding annotations, it would not be very complicated to adjust the syntax at a later stage. At the moment, we focus more on the technical work behind the scenes: adding a triplestore, an efficient and powerful search, and unit handling will be important, no matter what syntax we might finally choose. The usage of templates for including semantic information into articles seems to be a very good idea: we could also use it to include search results "in place", e.g. for having an automatically created filmography inside the article of an actor. --Markus Krötzsch 21:41, 17 October 2005 (UTC)
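The "std::out" case and the escaping workaround can be illustrated with a small sketch. The regular expression is a stand-in for whatever pattern the implementation uses, and the helper names are hypothetical. Whether MediaWiki resolves a hexadecimal character reference inside a link target the way `html.unescape` does here is an assumption of this example.

```python
import html
import re

# Stand-in pattern for the operator syntax; not the prototype's actual expression.
ANNOTATION = re.compile(r"\[\[([^\[\]:|]+?)(::|:=)([^\[\]|]+?)\]\]")

def relation_of(link):
    """Return (name, value) if the link would be read as an annotation, else None."""
    match = ANNOTATION.fullmatch(link)
    return None if match is None else (match.group(1), match.group(3))

def target_title(link):
    """Title a plain link points to, with character references decoded."""
    return html.unescape(link[2:-2])

print(relation_of("[[std::out]]"))              # ('std', 'out')
print(relation_of("[[std&#x3A;&#x3A;out]]"))    # None
print(target_title("[[std&#x3A;&#x3A;out]]"))   # std::out
```

The unescaped link is read as a relation named "std" pointing at "out", while the escaped form contains no literal colon and passes through as an ordinary link.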

Editors and Usability

I can see good work is being done on this very worthy task, but what I'd like to see more discussion on is the usability of any planned implementation. In particular, the practical matter of your average Wikipedia editor taking an interest and successfully adding and editing semantic information. The success of this project depends on that being possible (and probable) for people who have no background in the science.

Consider the motivation for your average editor. Where is the payoff for adding semantic annotations of an article? An application that makes a compelling use of the data is not going to be compelling until there is a fair amount of data in there.

Now I am not trying to be defeatist about this, rather bring home the point that we have to keep the investment required to do these annotations as low as possible, with as little as possible prior knowledge or education required. I accept there will be quite a number of people who can understand what is being proposed and will be able to use it, but we really want a large number of people participating to spread the annotations throughout the wiki.

As enthusiastic as the people here are (and others who will join in) we cannot do it all ourselves. We need to provide something that is intuitive, useful by itself within the scope of the wiki and that people will embrace. This will encompass both implementation and promotion, with a simple design and simple message that will grab users.

I am in the process of working on something simple that might go some way toward this goal, but I'd like to get an idea of what people's thoughts are on this. Darkov 04:53, 30 August 2005 (UTC)

I perfectly agree with this. We definitely have to take users (i.e. editors) with us when planning such extensions. It is a major goal of this project. Discussions at Wikimania showed that a surprisingly large number of people immediately see the benefits of semantic annotation and are willing to use it in practice. This encouraged us to go for implementation directly.
Yet you are right that the voice of the community must be heard at some stage. And that is why we established this project portal instead of building extensions in secret. But the project is still in the planning phase and we believe that the technical decisions we have to make now are not well-suited for a general community-wide discussion: those with background knowledge can always find arguments why the things are good in the way that they were planned. Rather, we would like to implement a prototype and let the public look at it in practice. They can then give practical comments that we may be able to resolve on the technical level. We are currently cooperating with developers that are working on the prototype and we are looking forward to a public demo within the next weeks (with restricted features, but nonetheless quite functional).
I am happy to see that you insist on providing good documentation and guidance to the users. I think that this is often neglected when people with technical backgrounds propose "improvements". Our approach includes dedicated articles for all annotation elements (relations and attributes; categories have them already), so it is easy to give detailed descriptions on how an annotation is to be used. Furthermore the community can create annotation elements or change their descriptions easily. So the system should indeed be shaped by the community. The formal semantics is still necessary for enabling software developers to support our data correctly, and in fact it comes for free with any standardized export format. But it is only a structural semantics: the intuitive semantics of all annotations is defined by the users.
For the problem of convincing users to create annotations we have a twofold strategy: firstly, we will try to provide as many features as possible with reasonable implementation effort. In particular, this will include simple search functions that should already prove to be very helpful to many users. The prototype will already offer such capabilities. Secondly, we intend to cooperate with Wikiprojects that are currently engaged in editing articles on certain topics. This is helpful, since such projects could also develop their own annotation elements in a coordinated way instead of adding them freely wherever they need them. Wikispecies is a good candidate for this. Furthermore it turns out that annotation in some areas (say geographical relations) can be done very efficiently even by non-experts and can be achieved in a short time. For example, a small group of German Wikipedians annotated all people-related articles in German Wikipedia within one weekend (annotation elements in this case were predefined in the Personendaten project and they had some tool support). So it would be possible to bootstrap some annotations to show the community their utility and proper usage.
Anyway, good documentation will be vital and I would be happy if you would contribute in explaining the ideas of semantic annotation when the implementation is set up (and of course comment on the usability of the current approach). --Markus Krötzsch 18:52, 31 August 2005 (UTC)
You write someone is working on a prototype. Who is doing that? If you have something to show by October, please consider demoing at WikiSym 2005, see the SpeedDemo section on the conference wiki or talk to Sunir directly. Since Max Voelkel will be there, maybe he can demo for you? --Dirk Riehle
While we happily do the conceptual work, we leave the implementation to more experienced PHP-developers ;-) The current prototype is "sponsored" by the company Doccheck, who use MediaWiki for a public medical knowledge base (GFDLed). Klaus Lassleben is responsible for the implementation and might add some updated information on implementation progress if he finds the time. If the demo is ready soon enough, Max is certainly going to present it in San Diego. --Markus Krötzsch 17:02, 5 September 2005 (UTC)

Hello, this is my first contribution to these project pages. At the moment the separate storage of the semantic part of links has been finished. This means that a triple of type "subject-relation-object" will be split from the common wiki link syntax and stored in a separate table. For displaying or editing the articles this information will be retrieved from the tables and "injected" into the article on the fly. So the common MediaWiki mechanism of storing articles will not be changed at all, which will allow you to switch the semantic functionality on and off as you like.
I apologize for the pessimistic timelines, but this project has to be done in parallel to our every-day-work. Anyway I am confident that there will be something to present at the WikiSym. Next steps will be:
    • Building a public test environment based on the MediaWiki 1.5rc2, possibly something like "". Timeline: will be done till 8.Sep.2005
    • First steps for implementing a search, capable of searching for everything that has a specified relation to the specified subject. Maybe there will be two fields for relation and subject, both with autocompletion based on Ajax. Timeline: 3 weeks from now
    • Implementation of administration pages for semantic relations, based on their own namespace. Timeline: 5 weeks from now
--Klaus Lassleben 19:30, 5 September 2005 (UTC)
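The storage scheme described in this comment (triples split from the link syntax, kept in a separate table, and injected again on load) might be sketched as follows. The prototype itself is PHP inside MediaWiki; this is an independent illustration, and the table layout, the function names, and the naive first-matching-link injection are all assumptions.

```python
import re
import sqlite3

RELATION_LINK = re.compile(r"\[\[([^\[\]:|]+?)::([^\[\]|]+?)\]\]")

def open_store():
    """In-memory database with one table for articles and one for triples."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE article (title TEXT PRIMARY KEY, text TEXT)")
    db.execute("CREATE TABLE triple (subject TEXT, relation TEXT, object TEXT)")
    return db

def save(db, title, wikitext):
    """Store plain wikitext and the relation triples separately."""
    triples = []

    def strip(match):
        triples.append((title, match.group(1).strip(), match.group(2).strip()))
        return "[[" + match.group(2).strip() + "]]"

    plain = RELATION_LINK.sub(strip, wikitext)
    db.execute("INSERT INTO article VALUES (?, ?)", (title, plain))
    db.executemany("INSERT INTO triple VALUES (?, ?, ?)", triples)

def load(db, title):
    """Inject the stored relations back into the plain text for display or editing."""
    (text,) = db.execute("SELECT text FROM article WHERE title = ?", (title,)).fetchone()
    rows = db.execute("SELECT relation, object FROM triple WHERE subject = ?", (title,))
    for relation, obj in rows.fetchall():
        # Naive assumption: the first plain link to the object carries the relation.
        text = text.replace("[[" + obj + "]]", "[[" + relation + "::" + obj + "]]", 1)
    return text

def find_subjects(db, relation, obj):
    """Everything that has the given relation to the given object."""
    query = "SELECT subject FROM triple WHERE relation = ? AND object = ?"
    return [row[0] for row in db.execute(query, (relation, obj))]
```

Since the article table holds ordinary wikitext, ignoring the triple table gives back a page without semantics, which matches the on/off behaviour described above.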

A Possible Solution for minimizing the "Readability Issues" of Semantic Annotations

First of all, I must say I am very much in favour of having the Semantic Annotation wherever the data itself is located. This would ensure that we don't create two inconsistent versions of the same info. But the [[Wiki]] tag is probably the most glamorous tag ;-) of all the Wiki Syntax!!! Squeezing too much into it can make it slightly annoying from the readability point of view. The issue is not one of technical complexity so much as of visual complexity. This could affect even long-time veterans. Having said all that, I would strongly like to stick with the :: and := Syntax.

Rather than scaring off newbies, here is an alternative.

What I propose is an "Autohide Semantic Annotations" capability based on a neighbourhood-matching algorithm. I have created a JavaScript-based "Proof of Concept" to illustrate what I mean. All those JavaScript haters out there can easily recode it in PHP.

  • When the edit page is loaded, the TextArea contains the unaltered wikitext from the database, including the annotations.
  • If the browser supports JavaScript, all annotations are removed from the wikitext one by one and pushed into an array, each along with 2 strings that are the left neighbour and right neighbour of the tag. At the end of this process, normal-looking wikitext is put back into the TextArea.
  • At the time of form submission, the left and right neighbours are used to search for the exact spot where the annotation was supposed to be located.
    • If a unique spot is found, the annotation is reintroduced.
    • If no spot is found or there is more than one spot, then the info for that wiki tag is added to an error string.
  • If the error string is empty, the submission takes place and the server loads the wikitext into the DB as well as the triple store, as already discussed by the others.
  • If the error string is non-empty, the validation function adds the error string, which also contains the annotations, at the bottom of the text in a separate section. If the user knows about :: and :=, he can easily fix it. Otherwise he can leave the message as is and someone else can fix it later. Either way, the annotation is never lost.
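The hiding step in the list above can be sketched as follows. This is an illustrative Python version, not the JavaScript proof of concept: the regular expression, the function name `hide_annotations`, the dictionary keys, and the default `neighbour_length` are assumptions. Neighbours are taken from the already-stripped text, so they can be matched against what the editor actually sees.

```python
import re

# Assumed pattern: [[name::Target]] (relation) or [[name:=value]] (attribute).
ANNOTATED_LINK = re.compile(r"\[\[[^\[\]|]+?(?:::|:=)([^\[\]|]+)\]\]")


def hide_annotations(wikitext, neighbour_length=10):
    """Return (plain_text, hidden), where hidden lists every stripped
    annotation together with its left and right neighbour strings."""
    pieces, spans = [], []
    cursor = 0      # position in the annotated wikitext
    plain_len = 0   # length of the plain text built so far
    for match in ANNOTATED_LINK.finditer(wikitext):
        before = wikitext[cursor:match.start()]
        plain_link = "[[" + match.group(1) + "]]"
        start = plain_len + len(before)
        spans.append((start, start + len(plain_link), match.group(0)))
        pieces += [before, plain_link]
        plain_len = start + len(plain_link)
        cursor = match.end()
    pieces.append(wikitext[cursor:])
    plain = "".join(pieces)

    hidden = [
        {
            "annotated": annotated,
            "plain": plain[start:end],
            "left": plain[max(0, start - neighbour_length):start],
            "right": plain[end:end + neighbour_length],
        }
        for start, end, annotated in spans
    ]
    return plain, hidden


source = ("Berlin is the capital of [[is capital of::Germany]]. "
          "Berlin has [[population:=3.390.444]] inhabitants.")
plain, hidden = hide_annotations(source)
```

After this call, `plain` is what would be shown in the TextArea and `hidden` is the array kept aside until form submission.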

Reveal Annotations Button: At any time during the editing process, this button can be clicked and the hidden annotations are applied to the text at the appropriate locations.

How does the algorithm work?


  • The algorithm starts with the full left-neighbour and right-neighbour strings and searches for a unique link bracketed by them.
  • If a unique match is found, the annotation is put back in place of the link and the wikitext is returned.
  • Otherwise the search is repeated, shrinking the neighbours one char at a time, until the search string becomes shorter than the preferences permit or the search string is found in more than one place.
  • Coming out of the loop without returning implies that either the string was never found or more than one slot was found. In that case the error message is updated.
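The search loop described above can be sketched like this. It is a rough Python illustration, not the demo's code: the names `find_slot` and `restore_annotations`, the dictionary keys, and the choice to shrink both neighbours from their far ends are assumptions. `min_match` plays the role of the "Minimum Neighbour Match" preference.

```python
def find_slot(text, left, plain, right, min_match=4):
    """Return the index in `text` where `plain` should be re-annotated,
    or None if no slot or more than one slot is found."""
    while True:
        needle = left + plain + right
        first = text.find(needle)
        if first != -1:
            if text.find(needle, first + 1) != -1:
                return None  # more than one candidate slot
            return first + len(left)
        if len(left) <= min_match and len(right) <= min_match:
            return None  # neighbours cannot shrink any further
        # Shrink by dropping the characters farthest from the link.
        if len(left) > min_match:
            left = left[1:]
        if len(right) > min_match:
            right = right[:-1]


def restore_annotations(text, hidden, min_match=4):
    """Reinsert annotations; return (text, errors), where errors lists the
    annotations that could not be placed uniquely."""
    placed, errors = [], []
    for item in hidden:
        slot = find_slot(text, item["left"], item["plain"],
                         item["right"], min_match)
        if slot is None:
            errors.append(item["annotated"])
        else:
            placed.append((slot, item))
    # Apply from the end of the text so earlier offsets stay valid.
    for slot, item in sorted(placed, key=lambda p: p[0], reverse=True):
        text = text[:slot] + item["annotated"] + text[slot + len(item["plain"]):]
    return text, errors


hidden = [{
    "annotated": "[[is capital of::Germany]]",
    "plain": "[[Germany]]",
    "left": "apital of ",
    "right": ". Berlin h",
}]
edited = "The capital of [[Germany]]. Berlin is big."
restored, errors = restore_annotations(edited, hidden)
```

In this example the full neighbours no longer match because the text after "Berlin" changed, so the loop shrinks them by one character on each side before it finds the unique slot.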

Here is a LIVE DEMO to show what I mean


  1. Always Show Annotations: The semantic enthusiasts and power users of Wikipedia can check this option to disable all hiding/revealing functionality for themselves.
  2. Preferred Neighbour Length: The greater this length, the lower the likelihood of reaching situations where there is more than one candidate slot for reintroducing an annotation. The larger the number, the greater the amount of work JavaScript needs to do. But on contemporary PCs, even a length equal to the whole wikitext should not produce visible slowness.
  3. Minimum Neighbour Match: The higher this value, the more sensitive it is to changes farther from the specific wiki tag. Very large numbers are not recommended. Very small numbers (4 or less) can possibly reintroduce the tag at another place if the original tag is deleted.

How does this look from the point of view of the user?

The algorithm might have sounded complex/trivial or terrific/stupid or something in between, depending on who you are ;-). But that is immaterial here. What matters is how a user sees it. For example, when a user searches Google, he hardly cares about the algorithms used there or their complexity.

  • Most users hardly touch the portions that are semantically tagged. So Wikipedia looks like it always looked. No change in behaviour at all. Inserting or manipulating text in other places will be absolutely unnoticeable.
  • If a warning is generated at all, most wiki users can fix it with their common sense.
  • Even if they are scared and don't know what to do, they can just save the page; the annotation is not lost and can be fixed by the semantic enthusiasts later.
  • AlphaGeeks who love seeing the semantic annotation can turn the feature off and immerse themselves in the :: filled world.

Suggest Your Own Approaches

There must be millions of ways of hiding and reintroducing the annotations with minimum interference to the users and maximum likelihood of exact restoration. Maybe we can have a competition here... And the best algorithm wins. My suggestion was just a small step in that direction. Guys who have coded various forms of diff algorithms could probably suggest solutions.

Whatever the approach, my primary hope is that

  • Newbies are not scared off the system.
  • Semantification of Wikipedia meets with minimum opposition.

SudarshanP 16:49, 15 September 2005 (UTC)

These are some very interesting ideas. The problem clearly will not be solved easily, and the proposed handling of cases with a non-empty error string might not be very appealing to some people. But it is obvious that one cannot find a general solution, since the wiki text can change completely while still containing the same links (which then should have the same annotations, which cannot be achieved by reintroducing them at their old positions). Of course, it could also be admissible to keep the relationships to all available targets constant: we just store the annotations (like "Berlin is capital of Germany") before editing, and reinsert them as annotations into some/all of the links that link to the relevant target (here "Germany"). There are some pros and cons here:
  • If we insert tags into all available links (i.e. tag all links to Germany with "is capital of"), then editing this relation becomes cumbersome -- too many occurrences. Of course, "remove relation" could be an automatic editing function for each semantic annotation that was found in a text. Anyway, the annotation becomes detached from the place (sentence) in the article that really gives this relation.
  • If we insert tags only into the first link to some topic, then we collect annotations at the beginning of the article, again more or less detached from context. On the other hand, both of these approaches are quite easy to implement.
  • If we count the number of the link where the annotation was obtained from in the article (e.g. the fourth link to "Germany" was tagged with "is capital of"), then we might be able to insert it at the same position (even if heavy context changes occurred, but not if whole sections were swapped or if similar links were inserted). If the required number of links does not exist after editing, then we might fall back to some of the other methods.
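The link-counting variant from the last point can be sketched as follows. This is only an illustration in Python under stated assumptions: the regular expression, the names `record_ordinals` and `reinsert_by_ordinal`, and the decision to report unplaced annotations instead of applying a fallback are all hypothetical, and only relation links (`::`) are considered.

```python
import re

# Assumed pattern: a plain link [[Target]] or a relation link [[rel::Target]].
LINK = re.compile(r"\[\[(?:([^\[\]|:]+)::)?([^\[\]|:]+)\]\]")


def record_ordinals(wikitext):
    """Return [(relation, target, n)]: the n-th link to `target` in the
    article carried the annotation `relation`."""
    seen, records = {}, []
    for match in LINK.finditer(wikitext):
        relation, target = match.group(1), match.group(2)
        seen[target] = seen.get(target, 0) + 1
        if relation is not None:
            records.append((relation, target, seen[target]))
    return records


def reinsert_by_ordinal(plain_text, records):
    """Annotate the n-th link to each recorded target again.

    Returns (text, unplaced); `unplaced` lists the records for which the
    edited text no longer has enough links to the target.
    """
    wanted = {(target, n): relation for relation, target, n in records}
    seen = {}

    def annotate(match):
        target = match.group(2)
        seen[target] = seen.get(target, 0) + 1
        if match.group(1) is not None:
            return match.group(0)  # already annotated, leave untouched
        relation = wanted.pop((target, seen[target]), None)
        if relation is None:
            return match.group(0)
        return "[[" + relation + "::" + target + "]]"

    text = LINK.sub(annotate, plain_text)
    unplaced = [(rel, target, n) for (target, n), rel in wanted.items()]
    return text, unplaced


original = ("[[Germany]] is in Europe. Berlin is the capital of "
            "[[is capital of::Germany]]. [[Germany]] borders France.")
records = record_ordinals(original)
edited = ("[[Germany]] lies in Europe. Its capital is Berlin, see "
          "[[Germany]]. [[Germany]] borders France.")
text, unplaced = reinsert_by_ordinal(edited, records)
```

Anything left in `unplaced` would be the point where one of the other methods (or an error message) has to take over.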
There are certainly many problems here. The challenge is to preserve the annotations mostly at the right position, since otherwise the editors who want to handle annotations become frustrated and stop annotating. Maybe a heuristic approach that combines context information with position of occurrence and some other data (e.g. overall number of links to this target) would be satisfactory, but this would be quite some work. --Markus Krötzsch 22:50, 28 September 2005 (UTC)

Have you thought about the changes needed to the bots that Wikimedia uses

Bots are routinely used to keep Wikipedia and other Wikimedia projects in good shape. One can say it is the responsibility of the bot writers to fix their own bots ;-) But Wikipedia cannot live without these bots. So the impact of these changes on the bots would need to be analysed. I am not saying that the bots will be impacted... But who knows. The bots might remove or damage the annotations...

SudarshanP 09:10, 16 September 2005 (UTC)

True, we need to take this into account. Bots that don't investigate the link target will not be affected, since they just think that "is capital of::Germany" is some strangely named article. But you are right that some bots might want to follow article links for whatever reason, and these then have to be adapted to extract the right link as well. The introduction of data values in link syntax might also create confusion with bots that scan for links. I think we can postpone this discussion for now, and come back when we know the details of our own implementation ;-) --Markus Krötzsch 22:12, 28 September 2005 (UTC)

Benefits of Annotation to Interlanguage cross pollination of Info

One of the important aspects of a semantic annotation is that it is true irrespective of language, even if we use different words to assert the same thing. Let me give you an example. Consider the sentence: "Berlin is the capital of Germany".

The article about Germany in English already says [[de:Deutschland]] and the article about Berlin in English already says [[de:Berlin]]. The converse assertions are made in the de Wikipedia as well. Even if we had the info in only one of the Wikipedias, we could get the annotation trivially translated to the other language.

Let us say somebody stated somewhere (maybe in some relation translation repository or wherever) that

[[en:X is capital of Y]]=[[en:X capital of Y]]=[[de:X ist das Kapital von Y]]=[[de:X Hauptstadt Y]]=...=[[fr:...]]=[[it:...]]=[[cyc:???]]

By the way, I have no clue of German ;-). So pardon me if I am wrong :-)) Any country capital information annotation is now available to any other language.

Let me give another example to illustrate: The German article says that the mains voltage is 230V and the mains frequency is 50 Hertz. The English article does not mention this at all. If someone just annotated this in German and provided the English translation of the relation, similar to the one described above, the semantic annotation could be automatically appended to the fact sheet of the English article under a subheading: "The German article about Germany says that:"

  • Germany has Mains Voltage of 230V.
  • Germany has Mains Frequency of 50 Hertz.

Once a single relation or attribute is translated, the same attribute for another country, say France or Italy, can be translated into English from a German annotation without human intervention. If the translation template is available, then even assertions from other languages can be used.
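A tiny sketch of how such a translation could work, in Python. Everything here is hypothetical: the table `EQUIVALENT_RELATIONS`, the lookup `INTERLANGUAGE`, the function names, and the German label "Netzspannung" are assumptions for illustration ("Hauptstadt" and the Deutschland/Germany pairing are taken from the examples above). Page titles are mapped through the interlanguage links; values without an entry, such as "230 V", are passed through unchanged.

```python
# Hypothetical repository of equivalent relation labels per language.
EQUIVALENT_RELATIONS = [
    {"en": "is capital of", "de": "Hauptstadt"},
    {"en": "mains voltage", "de": "Netzspannung"},  # assumed German label
]

# Stand-in for the existing interlanguage links, e.g. [[de:Deutschland]].
INTERLANGUAGE = {
    ("de", "Deutschland"): {"en": "Germany"},
    ("de", "Berlin"): {"en": "Berlin"},
    ("de", "Frankreich"): {"en": "France"},
    ("de", "Paris"): {"en": "Paris"},
}


def translate_title(title, source, target):
    """Map a page title via interlanguage links; leave literals as they are."""
    return INTERLANGUAGE.get((source, title), {}).get(target, title)


def translate_triple(triple, source, target):
    """Translate (subject, relation, object) or return None if the relation
    has no known equivalent in the target language."""
    subject, relation, obj = triple
    for labels in EQUIVALENT_RELATIONS:
        if labels.get(source) == relation and target in labels:
            return (
                translate_title(subject, source, target),
                labels[target],
                translate_title(obj, source, target),
            )
    return None


berlin = translate_triple(("Berlin", "Hauptstadt", "Deutschland"), "de", "en")
paris = translate_triple(("Paris", "Hauptstadt", "Frankreich"), "de", "en")
```

The point of the sketch is only that one translated relation label serves every article that uses it, so the Paris triple comes for free once "Hauptstadt" is known.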

To minimize redundancy, any equivalents of the assertions already there in the main article can be ignored while appending foreign assertions. Providing the CycEquivalent allows this to even be translated into CycL which may be very useful for inference in conjunction with Cyc.

Apart from providing cross-language annotation, this might also provide a skeleton information base for stub articles. Say you have an article about the cerebellum in English. Right now, if you go to the Portuguese article, you will see only a stub with negligible info. If annotation translation were available, the stub would have a lot more information.

I know it will be quite some time before my dreams are fulfilled :). I don't think it is very hard to implement either... So hey, lemme at least dream!!!

SudarshanP 14:47, 16 September 2005 (UTC)

I must say that I am skeptical towards wide-reaching automatic generation of article content (there will be huge amounts of annotations, and transferring them might generate a lot of noise in many less obvious cases -- we don't want to overload articles with automatically generated data). But there might be special cases where helpful applications can come into reach once we have the extensions. I guess Denny shares some of your visions in this respect ... (but they are not yet within our scope) --Markus Krötzsch 22:27, 28 September 2005 (UTC)
I agree auto-generation of content deserves skepticism. This fine plan says "A first feature to do this is to display all annotations as a list below the article." So, as I think SudarshanP proposes, when displaying an article, the fact sheet should show all annotations from all translations while attempting to weed out redundancy.
The key difficulty of Interlanguage cross pollination of Info is that in this plan editors make annotations in-line with written text. That preserves the richness and context of language (e.g. "Berlin has about 3.4M inhabitants", or "Plato suggests 50000 lived in Atlantis"), at the cost of requiring translation. If instead article source had a "fact block" containing only annotations, then this would be completely independent of language and available to all translations; before reading this plan that's what I envisioned for a Semantic MediaWiki, but this plan is a better incremental change and doesn't preclude it.
--skierpage 22:48, 2 October 2005 (UTC)

Unconnected Ramblings

Hey, by the way, I have been wondering... Why hasn't anyone written a bot that automatically inserts all pictures from another language Wikipedia, say English, German or French, into a less dense Wikipedia like Portuguese? The language equivalence is specified in the documents anyway. The stub articles would then be much richer in the target language. Anyway, that is unconnected to annotation and can probably be done right now. Has someone already done this? If so, why is it not reflected all over Wikipedia?

Of course the images will need to have a flag stating whether they have English or another lang within the pic. Copying images if the flag is false into languages where the equivalent article is a stub should be trivial to implement and very enriching to Wikipedia. Just my ramblings... SudarshanP 14:47, 16 September 2005 (UTC)

The flag-thingy actually touches annotations. I think the main problem is that you can hardly create more than a basic stub with such fully automatic movements of data. Firstly, the placement of the images in a complex article is unclear (especially since the image might illustrate a part of the article that is not immediate from the article title, but belongs to some subsection). Secondly, there are many cases in which two articles in different languages have completely different pictures, but where each of the articles has a sufficient amount of pictures already.
So I think that the main application would be to have a bot that transfers pictures from full-featured articles of one language to the corresponding stubs of another language. But there are many problems (interlanguage links that are not one-to-one, data blowup by copying unused images around, name and description of images, ...). Anyway, there surely is some more promising place for discussing these matters ;-) --Markus Krötzsch 22:20, 28 September 2005 (UTC)

Semantic annotations in/from infobox templates

Templated infoboxes effectively already represent at least semi-well-constructed semantic tagging. So it seems to me like one huge potential jumpstart to this project would be to add a post-edit hook that updates the database of semantic relationships with any parameters from instances of infobox templates.
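As a rough picture of what such a hook might extract, here is a small Python sketch. It is an assumption-laden illustration, not a MediaWiki hook: the function name `infobox_triples`, the regular expression, and the sample "Infobox City" template are hypothetical, and the parser deliberately ignores nested templates and piped links inside parameter values.

```python
import re

# Assumed shape: {{Infobox Something | name = value | name = value }}
# Nested templates and piped links inside values are not handled.
INFOBOX = re.compile(r"\{\{\s*(Infobox[^|{}]*)\|([^{}]*)\}\}", re.IGNORECASE)


def infobox_triples(subject, wikitext):
    """Return (subject, parameter, value) for every named, non-empty
    parameter of every infobox template instance in the article."""
    triples = []
    for match in INFOBOX.finditer(wikitext):
        for part in match.group(2).split("|"):
            name, sep, value = part.partition("=")
            name, value = name.strip(), value.strip()
            if sep and name and value:
                triples.append((subject, name, value))
    return triples


article = """{{Infobox City
| country = Germany
| population = 3.390.444
| mayor =
}}
Berlin is the capital of [[Germany]]."""
triples = infobox_triples("Berlin", article)
```

Each parameter name is treated as the relation or attribute name here, which is exactly the "blanket rule" whose suitability is in question.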

I'm not sure I think this is really the right blanket rule, of course, but it's at least a reasonable approximation, and it'd give us a lot more data, quickly, to search against. Arguably the implementation of the current typing on individual links could be postponed, and conceivably even eliminated (although I'm not necessarily suggesting that).

And if you don't want template parameters to be synonymous with semantic annotations, actually, then this raises some very interesting questions. Clearly you'd want to define the semantic relationships in the templates, not the instances, but the link extensions you're currently proposing don't lend themselves to that, since the links are defined in the instances, not the templates.

Apologies if this has already been discussed; I'm just jumping in in the hopes of being helpful.

--glenn mcdonald 21:42, 10 October 2005 (UTC)

I agree that we really have to use the templates for such a jump-start introduction of semantic links. This will certainly be attempted, probably by using a not overly smart bot. We are not generally in favour of using templates as a replacement for individual annotations -- they just seem to be too inflexible to capture enough information.
However, most of our current implementation efforts are concerned with features that cover processing of data (unit conversion, data storage in triples, search engine, ...). The input syntax is an important part, but changing it will be possible even at much later stages of implementation. One could even think of different input models. So all ideas and comments are very welcome (but please use the list -- see the top of this page). --Markus Krötzsch 22:07, 17 October 2005 (UTC)


Suggested datatypes: array (n-dimensional), vector and matrix as synonyms for 2- and 3-dimensional array, complex number, boolean, char, string, integer, float, time, date, quaternions... Fuelbottle 20:39, 15 November 2005 (UTC) also mentions coordinates, though like date it isn't yet implemented. MediaWiki already has support for a universal date format [[2006-02-01]] that it displays in your preferred format, e.g. 2006-02-01. Wikipedia has the {{coor}} coordinate template, e.g. {{coor dms|32|42|54|N|117|09|45|W|type:city(1,305,736)}} (not implemented on meta.wikimedia). Clearly the datatype unit parsing should handle these two entry formats. skierpage 11:05, 1 February 2006 (UTC)
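For what it is worth, parsing those two entry formats is not much work. A minimal Python sketch follows; the function names `parse_wiki_date` and `parse_coor_dms` are made up for illustration and have nothing to do with the actual datatype code.

```python
from datetime import date


def parse_wiki_date(link):
    """Parse a date link such as '[[2006-02-01]]' into a date object."""
    return date.fromisoformat(link.strip().strip("[]"))


def parse_coor_dms(template):
    """Parse '{{coor dms|D|M|S|N/S|D|M|S|E/W|...}}' into decimal degrees
    (latitude, longitude); south and west come out negative."""
    parts = [p.strip() for p in template.strip().strip("{}").split("|")]
    if parts[0] != "coor dms" or len(parts) < 9:
        raise ValueError("not a coor dms template: " + template)

    def to_decimal(degrees, minutes, seconds, hemisphere):
        value = int(degrees) + int(minutes) / 60 + float(seconds) / 3600
        return -value if hemisphere in ("S", "W") else value

    return to_decimal(*parts[1:5]), to_decimal(*parts[5:9])


day = parse_wiki_date("[[2006-02-01]]")
lat, lon = parse_coor_dms(
    "{{coor dms|32|42|54|N|117|09|45|W|type:city(1,305,736)}}")
```

Trailing parameters such as `type:city(...)` are simply ignored by this sketch.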

Units of Measure

If you convert all lengths to metric, then in some places you get a bunch of values like 2.54 centimeters. Many of us will recognize that as an inch, but some would be put off by it. I have no solution beyond having the recipient indicate a preference for some particular units, which might be taking customization too far.

I think that's pretty reasonable customization. Again, MediaWiki supports a preference for date and time display format (you set it in), so it's not going too far for it to support preferences for other units and formats. SMW_Datatype.php already creates the USER representation of the datatype that appears in the infobox (e.g. it shows areas as "963.6 km² (372.048 miles²)" regardless of what units you enter), so it would "only" have to consult user preferences for which unit to show first and whether to show other units. skierpage 11:35, 1 February 2006 (UTC)
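To illustrate the kind of preference lookup meant here, a minimal Python sketch. It does not reflect what SMW_Datatype.php actually does: the function `format_area`, its parameters `preferred` and `show_other`, and the one-decimal rounding are assumptions made for the example.

```python
KM2_PER_MILE2 = 2.589988110336  # square kilometres in one square mile


def format_area(km2, preferred="km²", show_other=True):
    """Render an area stored in km² with the user's preferred unit first."""
    values = {"km²": km2, "miles²": km2 / KM2_PER_MILE2}
    if preferred not in values:
        raise ValueError("unknown unit: " + preferred)
    order = [preferred] + [unit for unit in values if unit != preferred]
    rendered = ["{:.1f} {}".format(values[unit], unit) for unit in order]
    if show_other and len(rendered) > 1:
        return rendered[0] + " (" + ", ".join(rendered[1:]) + ")"
    return rendered[0]


metric_first = format_area(963.6)
miles_first = format_area(963.6, preferred="miles²")
miles_only = format_area(963.6, preferred="miles²", show_other=False)
```

The stored value stays the same in all three calls; only the order and the presence of the secondary unit depend on the (hypothetical) user preferences.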