Semantic MediaWiki/Problem statement
Despite the great success of Wikipedia and related projects, the restriction to pure text- or multimedia-based content sets narrow limits on how the gathered data can be used. Creating lists is difficult because the information that belongs in a list tends to be spread over a large number of articles. For example, many users would be interested in data such as:
- a list of all philosophical notions that were introduced by Rawls,
- the collection of all Bulgarian tennis players that are younger than 20 years,
- a diagram that displays the average lifespan of persons in Wikipedia, for all decades in the previous thousand years,
- a gallery of all images of New York City that can be found in any Wikipedia, ordered by the time at which they were taken.
The information required to produce such output is present in Wikipedia, but it is not readily retrieved. The reason is that the information is hidden in the text of Wikipedia's articles, so that automatic retrieval of the required data is practically impossible, especially because we would often like to gather data from the Wikipedias in all languages.
Thus, the only solution to problems such as the above has been to provide the required data sets manually, a method that is prone to errors and creates huge problems of maintenance and scalability. More recently, first attempts have been made to overcome these restrictions. The basic idea is to enable computer-aided processing of (some parts of) wiki content by making explicit certain information that was otherwise hidden in natural-language text. Thus, one introduces ideas of semantic annotation into MediaWiki.
The purpose of this Wikiproject is to discuss and develop possible ways to semantically enrich Wikipedia that respect the specific requirements of this context, in order to find solutions that meet the needs of the individual areas of use while still allowing close cooperation between the various semantically enabled MediaWiki projects.
One practical solution to this problem would be to add tags to all the bits and pieces of information on Wikipedia. This would allow a very simple implementation of search software that runs each piece through a simple expression:
- philosophical notions introduced by Rawls
- would return items tagged with "philosophical notions" and "introduced by: Rawls"
- Bulgarian tennis players younger than 20 years
- would return items tagged with "country: Bulgaria" and "tennis player" and "age: 20 years" or less
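A minimal sketch of such a tag lookup, assuming the tags of each item are already available as plain strings. The item names, the `build_index` and `find_all` helpers, and the in-memory dictionary are assumptions for illustration only; a real wiki would keep the index in its database.

```python
from collections import defaultdict

# Hypothetical tagged items, used only for illustration.
ITEMS = {
    "Veil of ignorance": {"philosophical notions", "introduced by: Rawls"},
    "Original position": {"philosophical notions", "introduced by: Rawls"},
    "Categorical imperative": {"philosophical notions", "introduced by: Kant"},
}


def build_index(items):
    """Map each tag to the set of item names that carry it."""
    index = defaultdict(set)
    for name, tags in items.items():
        for tag in tags:
            index[tag].add(name)
    return index


def find_all(index, wanted_tags):
    """Return the names of the items that carry every tag in wanted_tags."""
    sets = [index.get(tag, set()) for tag in wanted_tags]
    if not sets:
        return set()
    return set.intersection(*sets)


index = build_index(ITEMS)
rawls_notions = find_all(index, ["philosophical notions", "introduced by: Rawls"])
print(sorted(rawls_notions))
```

Intersecting the per-tag sets keeps the lookup cheap even when there are many items, because only the items that carry at least one of the requested tags are ever touched.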
For something like the third query you'd probably be best off teaching the person how to form a MySQL database query like:
- SELECT lifespan, birth_date FROM people WHERE birth_date >= '1000-01-01' AND deceased = TRUE;
- then write a script to average all the lifespans in each decade
- then paste the resulting list into a spreadsheet program and create a graph...
This kind of information is already present for many subjects in the infoboxes, which display data such as age and country.
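The script that averages the lifespans per decade could look roughly like the sketch below. It assumes that the query result has already been reduced to pairs of lifespan (in years) and birth year, and that people are grouped by the decade in which they were born; both are assumptions, as is the sample data.

```python
from collections import defaultdict

# Hypothetical rows as (lifespan_in_years, birth_year); not real data.
ROWS = [(70, 1804), (50, 1809), (80, 1811), (60, 1815)]


def average_lifespan_by_decade(rows):
    """Group the rows by decade of birth and average the lifespans in each group."""
    by_decade = defaultdict(list)
    for lifespan, birth_year in rows:
        decade = (birth_year // 10) * 10  # 1804 -> 1800, 1811 -> 1810
        by_decade[decade].append(lifespan)
    return {
        decade: sum(values) / len(values)
        for decade, values in sorted(by_decade.items())
    }


averages = average_lifespan_by_decade(ROWS)
for decade, average in averages.items():
    # Tab-separated lines can be pasted directly into a spreadsheet program.
    print(f"{decade}s\t{average:.1f}")
```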
The parsing process could simply look up strings of words in a database of tags; if a string is not found there, it would be looked up as a synonym, and if it is still not found, as a script. It would start with the longest strings of words, working from left to right. For example:
- "Bulgarian tennis players younger than 20 years" wouldn't find any tags, synonyms, or scripts
Then it would search again but without the last word
- "Bulgarian tennis players younger than 20" and still wouldn't find anything
Then it would search again and again, removing the last word after every failure, until it gets to:
- "Bulgarian", which would finally result in a hit for a synonym: "country: Bulgaria", which is a tag. The parsing would then continue with:
- "tennis players younger than 20 years" which would then start the cycle of not finding anything and removing the last word each time, until...
- "tennis players" which would return a result in the synonyms database: "tennis player".
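The lookup cycle described so far can be sketched as follows. This is only an illustration under some assumptions: the `TAGS` and `SYNONYMS` tables are hypothetical stand-ins for the database, the script lookup is left out, and words that match nothing at all are simply collected and skipped, which the description above does not specify.

```python
# Hypothetical lookup tables; a real wiki would load these from its database.
TAGS = {"country: Bulgaria", "tennis player"}
SYNONYMS = {"Bulgarian": "country: Bulgaria", "tennis players": "tennis player"}


def longest_match(words):
    """Find the longest leading run of words that is a tag or a synonym.

    Returns (tag, number_of_words_consumed), or (None, 0) if nothing matches.
    """
    for end in range(len(words), 0, -1):  # drop the last word after each failure
        phrase = " ".join(words[:end])
        if phrase in TAGS:
            return phrase, end
        if phrase in SYNONYMS:
            return SYNONYMS[phrase], end
    return None, 0


def parse(query):
    """Convert a query into a list of tags, working from left to right."""
    words = query.split()
    tags, unmatched = [], []
    pos = 0
    while pos < len(words):
        tag, consumed = longest_match(words[pos:])
        if tag is None:
            unmatched.append(words[pos])  # assumption: skip one word and go on
            pos += 1
        else:
            tags.append(tag)
            pos += consumed
    return tags, unmatched


tags, unmatched = parse("Bulgarian tennis players")
print(tags, unmatched)
```

In the worst case this tries every shorter prefix at every position, which is quadratic in the number of words. For queries of a few words that should not matter.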
The search would continue in this manner until all the text has been converted into a list of tags. The next hit, however, would be a synonym that does not map directly to a tag:
- "younger than" = "age less than"; the parser would then look up "age" and find that it is a script:
- "age" is a numerical property that expects a time span, so its script would be written to parse the words immediately following it, looking for a number followed by a unit of time. A complementary script would be written to take the age tags and store them in a normalized manner when building the database, rather than indexing them literally. The actual search process would also need to be able to match ranges like "less than 20 years".
- the age script would find the synonym "less than" and replace it with "<"
- then find "20" then look for a unit
- then it would find "years" and normalize "20 years" to something like a number of seconds; the script would then exit and the parser would continue, but at this point there is no more text left
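A rough sketch of what such an age script might do. The function name `parse_age`, the table of units, the fixed 365-day year, and the "=" default when no comparator is present are all assumptions made for the example.

```python
# Assumed conversion factors; a year is taken as 365 days for simplicity.
SECONDS_PER_UNIT = {
    "second": 1,
    "minute": 60,
    "hour": 3600,
    "day": 86400,
    "week": 7 * 86400,
    "year": 365 * 86400,
}
COMPARATORS = {"less than": "<", "more than": ">"}


def parse_age(words):
    """Parse words such as ["less", "than", "20", "years"].

    Returns (operator, seconds, number_of_words_consumed).
    Raises ValueError if the words are not a number followed by a known unit.
    """
    pos = 0
    operator = "="  # assumed default when no comparator is given
    phrase = " ".join(words[:2])
    if phrase in COMPARATORS:
        operator = COMPARATORS[phrase]
        pos = 2
    if len(words) < pos + 2:
        raise ValueError("expected a number followed by a unit of time")
    amount = float(words[pos])  # raises ValueError if this is not a number
    unit = words[pos + 1].lower().rstrip("s")  # "years" -> "year"
    if unit not in SECONDS_PER_UNIT:
        raise ValueError(f"unknown unit of time: {words[pos + 1]}")
    return operator, amount * SECONDS_PER_UNIT[unit], pos + 2


op, seconds, consumed = parse_age("less than 20 years".split())
print(op, seconds, consumed)
```

Returning the number of words consumed lets the surrounding parser continue right after the unit of time.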
The resulting query would then have been normalized to an array of tags:
- "country: Bulgaria"
- "tennis player"
- "age < 20 years"
A search engine could easily be written to find objects with all of these tags.
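As a hedged illustration, a search over a normalized query of this kind might look like the sketch below. The `Item` class, the `search` function, the storage of ages in seconds, and the sample players are invented for the example and do not describe an existing implementation.

```python
import operator
from dataclasses import dataclass, field

YEAR = 365 * 86400  # assumed length of a year in seconds


@dataclass
class Item:
    name: str
    tags: set = field(default_factory=set)
    age_seconds: float = 0.0  # age stored in normalized form


OPS = {"<": operator.lt, ">": operator.gt, "=": operator.eq}


def search(items, tags, age=None):
    """Return the names of items that carry all tags and meet the age condition.

    age is an optional (operator_symbol, seconds) pair such as ("<", 20 * YEAR).
    """
    results = []
    for item in items:
        if not set(tags) <= item.tags:  # every requested tag must be present
            continue
        if age is not None and not OPS[age[0]](item.age_seconds, age[1]):
            continue
        results.append(item.name)
    return results


# Made-up sample data, not real people.
ITEMS = [
    Item("Player A", {"country: Bulgaria", "tennis player"}, 18 * YEAR),
    Item("Player B", {"country: Bulgaria", "tennis player"}, 25 * YEAR),
    Item("Player C", {"country: Serbia", "tennis player"}, 19 * YEAR),
]

found = search(ITEMS, ["country: Bulgaria", "tennis player"], age=("<", 20 * YEAR))
print(found)
```

Exact tags are checked with a subset test, while the age is compared numerically, which is why it has to be stored in normalized form rather than as literal text.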