Generic article retrieval

For external applications that embed Wikipedia into their interface, e.g. KStars (a sky program) and Amarok (a music player), it is hard to fetch the right Wikipedia article in a predictable way in every localised language without maintaining and constantly updating large article lists, which is in practice not possible. So there is a need for a predictable way of accessing an article in every given language. Currently these programs only use the English language Wikipedia, and because of this lack of accessibility for automated tools they use Wikipedia only in a sub-optimal way. As there is an effort to integrate Wikipedia into KDE, a working solution is finally needed.

Proposal

This proposal will not go into technical details (especially for the Web API) but addresses both the Wikipedia community and the software developers: what they should take into account, how they should design their solutions in principle, and what can be done in practice right now.

Current situation

Currently, Wikipedia article lemmas are mainly constructed according to one basic principle:

Name (disambiguation)

The "disambiguation" word is in most cases placed after the "Name" with a space and in round brackets.

  1. The problem with the "Name" is that it is generally translated into the local language of the wiki. Thus an external program would need to use dictionaries, which is not failure-proof.
  2. The problem with the "disambiguation" word is that in most cases it only exists if the same name clashes for different topics, and of course it is not always the same word either and is translated into the local languages as well.

standardized names via REDIRECT

So there is a need for an easy (both technically and, for editors, obvious) way to access articles. A good solution would be to use the existing natural lemma scheme of Wikipedia and improve it. The design goals are:

  1. No article needs to be moved because of this solution; if an article has a different name (and if that name is reasonable according to other principles of the local project) than the one needed for automated access, create the needed lemma as a REDIRECT.
  2. No cryptic names. A lemma needs to be as obvious and natural as possible for every person in every language, and thus should be as close as possible to the existing lemmas of the articles.
  3. The "name" part of the lemma is an international (not translated) name that is generally accepted (e.g. scientific names), or a part of it that a machine can extract with a regular expression.
  4. The "disambiguation" word is standardized within one local Wikipedia language for different classes of topics and can be a translated word.
  5. If there is an accepted disambiguation for one class of articles in one local project, every lemma of that group in that project has to use that disambiguation.
  6. On Meta-wiki there will be lists that contain, for every object group and every local language, one entry with the regular expression that constructs the "name" out of the common international name, followed by an entry for the local "disambiguation" of that group. External groups can grab these lists and use them in their software in order to access Wikipedia in every supported language.

Let's look at an example of how this could work:

Asteroids
international name: "\(([:digit:])*\)\ ([:alpha:])*" (name as given by the IAU)

project  name                           disambiguation
en.wp    "([:digit:])*\ ([:alpha:])*"   (none)
de.wp    "([:alpha:])*"                 "\ \(Asteroid\)"

The asteroid Ceres, e.g., has the official name "(1) Ceres". The English language Wikipedia uses 1 Ceres, while the German language Wikipedia uses Ceres (Asteroid).

You could imagine other examples with chemical compounds (taking the scientific abbreviation), with animals and plants (taking the Latin name), and so forth.

This approach works in those cases where a database of commonly accepted names for that class exists and where a program that wants to embed the articles can make use of such a database. This would work, e.g., for KStars and Kalzium (a periodic table program), which already have built-in lists of commonly accepted names.
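
As a minimal sketch of the client side, the following Python snippet builds the local lemmas for "(1) Ceres" out of the IAU name. The parsing pattern and the per-language templates are purely illustrative stand-ins for the regex lists that would live on Meta-wiki:

import re

# Hypothetical rules as they could be published on Meta-wiki: a pattern
# that splits the IAU name into its parts, and a per-language template
# that builds the local lemma from those parts.
IAU_NAME = re.compile(r"\((\d+)\)\s+(\w+)")  # e.g. "(1) Ceres"

LEMMA_RULES = {
    "en": "{number} {name}",    # en.wp: "1 Ceres"
    "de": "{name} (Asteroid)",  # de.wp: "Ceres (Asteroid)"
}

def local_lemma(iau_name, lang):
    m = IAU_NAME.match(iau_name)
    if m is None:
        raise ValueError("not an IAU asteroid name: %r" % iau_name)
    number, name = m.groups()
    return LEMMA_RULES[lang].format(number=number, name=name)

print(local_lemma("(1) Ceres", "en"))  # 1 Ceres
print(local_lemma("(1) Ceres", "de"))  # Ceres (Asteroid)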

failure-tolerant retrieval

For Amarok this approach does not work that way: even if you could suggest a common artist database, Amarok could not make use of it, because it reads the artist information out of the file metadata (e.g. ID3 tags). Since this metadata is only rarely fully accurate, a failure-tolerant solution is needed as well.

But a standardisation of lemma names helps in this case as well. Think, e.g., of the names of music groups. They often clash with other names, e.g. in the case of the music project "Enigma": Enigma is a disambiguation page with lots of different entries, and the article we are looking for is at Enigma (musical project). So currently there is no way for Amarok to directly find the correct article in this example. But which general disambiguation could be chosen to fit all musicians? It has to be natural for Karl Marx (Komponist) as well as for Ludwig van Beethoven. A neutral disambiguation lemma for all of them would be "Name (music)". Of course this lemma only makes sense as a REDIRECT. The groups of articles could also overlap, so that one article has several such redirects that get accessed by different kinds of software. A music program could then take the name, add the standardized disambiguation for that language, and would avoid most (but not all) of the disambiguation pages.
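
A short sketch of what a music player could do with such a list, assuming the standardized disambiguation words per language were published on Meta-wiki (the German word "Musik" is only an assumption here):

# Hypothetical per-language disambiguation words for the "music" group.
MUSIC_DISAMBIGUATION = {"en": "music", "de": "Musik"}  # "Musik" is assumed

def music_lemma(tag_name, lang):
    # Append the standardized disambiguation to the name from the ID3 tag.
    return "%s (%s)" % (tag_name.strip(), MUSIC_DISAMBIGUATION[lang])

print(music_lemma("Enigma", "en"))  # Enigma (music)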

But the problem for, e.g., Amarok is that the name taken out of the file information is incorrect in many cases (misspelled names, missing special characters...), and of course these redirects do not always exist (this also holds for the other groups above where common names do exist). So there is a need for a sophisticated search engine inside MediaWiki. The search engine needs to provide the following possibilities:

  1. A mode where it returns only one result so that it can be directly accessed.
  2. The result has to be highly accurate via sophisticated heuristics (these heuristics need to be extracted out of Wikipedia itself).

In the case of music, the "(music)" disambiguation can help the MediaWiki search engine weight results within the "namespace" created by "(music)" much higher than others. Let's take, e.g., the German musician Herbert Grönemeyer. In many cases the name in the file information will be written "Herbert Groenemeyer" or just "Groenemeyer". A music program would, e.g., take "Groenemeyer", add "(music)", and query Wikipedia with the string "Groenemeyer (music)". The search engine will notice that "Groenemeyer (music)" is just a substring of the existing REDIRECT "Herbert Grönemeyer (music)" with only two different letters (heuristics for umlauts and other special characters could also be taken into account) and would return that result. Other possibly existing "Grönemeyers" would thus be weighted much lower.
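
The following sketch illustrates the kind of heuristic the search engine could apply, assuming it has a list of candidate redirect lemmas; the umlaut transliteration and the similarity scoring are illustrative, not a finished ranking algorithm:

import difflib

# Illustrative transliteration table for German umlauts and ß.
UMLAUTS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})

def normalize(lemma):
    return lemma.lower().translate(UMLAUTS)

def best_match(query, candidates):
    # Score every candidate by its similarity to the normalized query
    # and return the highest-scoring one.
    scored = [
        (difflib.SequenceMatcher(None, normalize(query),
                                 normalize(c)).ratio(), c)
        for c in candidates
    ]
    return max(scored)

candidates = ["Herbert Grönemeyer (music)", "Grönemeyer (disambiguation)"]
print(best_match("Groenemeyer (music)", candidates))
# the redirect "Herbert Grönemeyer (music)" wins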

Web API to search engine

All these steps would help in many cases and increase accessibility a lot. But these possibilities do not always help when you have much more structured information than you can directly put into a lemma.

Let's take our "Karl Marx" example again. There were two famous persons with this name: Karl Marx (the philosopher) and Karl Marx (Komponist) (the musician). Imagine a program that returns biographies for given data. We would need to differentiate between these two persons in an automated fashion. In this use case you often have structured information about birth and death for every person.

  • In the German language Wikipedia there is the so-called "Personendaten" template (person data), which is used in all biographical articles and is needed for the German Wikipedia DVD and other things (you need to look at the page code, as the template is hidden by default).
  • In other languages persons are often placed in categories like "born 1234" and "died 4321" (see the sketch below).
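
For illustration, a small sketch that extracts the birth year from such a category link in the wikitext; the "Born 1234" category name follows the hypothetical scheme above, while real projects use their own conventions (en.wp, e.g., uses "1818 births"):

import re

# Hypothetical category scheme "[[Category:Born 1818]]".
BORN = re.compile(r"\[\[Category:Born\s+(\d+)\]\]")

def birth_year(wikitext):
    m = BORN.search(wikitext)
    return int(m.group(1)) if m else None

print(birth_year("[[Category:Born 1818]]\n[[Category:Died 1883]]"))  # 1818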

Furthermore there are different article types like REDIRECTs and disambiguation pages, which is relevant information for a client application.

So one could imagine accessing this data via a Web API of the MediaWiki search engine:

  • The client can make requests with this structured information, e.g. a request with name="Karl Marx" born="1818". That way the search engine would return the philosopher and not the musician (see the sketch after this list).
  • The search engine could return this information in a structured way, and the client application could use it for a further request or whatever else.
  • Of course the structured lemmas of the form "name (disambiguation)" could be accessed even better that way, via name="foobar name" and disambiguation="foobar disambiguation".
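
A sketch of what such a request could look like from the client side; the endpoint and the field names are purely hypothetical, since no such API exists yet:

import json
import urllib.parse
import urllib.request

# Purely hypothetical endpoint of the MediaWiki search engine.
API = "https://de.wikipedia.org/w/search-api"

def retrieve(**fields):
    # Encode the structured fields as request parameters.
    url = API + "?" + urllib.parse.urlencode(fields)
    with urllib.request.urlopen(url) as response:
        return json.load(response)

# Would return the philosopher, not the musician:
# retrieve(name="Karl Marx", born="1818")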

But this raises the question: how do we define such request fields for the Web API for all projects in as generic and easily maintainable a manner as possible, and how do we extract them out of Wikipedia without making the articles more complex?

  1. The list of existing request fields will be discussed and collected at Meta-wiki and enabled in all projects after consensus.
  2. The fields are empty by default, thus disabled, and will be ignored if used in a request. Every local Wikipedia project has to define their content by themselves. Field content definitions that are needed by external groups (like the disambiguation field) will be stored (or be accessible) in a central place (maybe, for the time being, manually on Meta-wiki until an automated solution exists; see Generic article retrieval#standardized names via REDIRECT).
  3. Field definitions can be edited by the local projects via a "Search:" namespace (or a similar name).

What could these field definitions look like?

  1. A definition consists of a token ending in ":" followed by a regular expression.
  2. The token defines which part of the article the regular expression is applied to.
  3. A field definition can contain more than one line. The list is worked through from top to bottom and stops at the first hit.
  4. A field definition can return values, which are placed after a "=" in the matching line.
  5. The request variable is defined with the special token "definition:". If the entered variable does not fit the regular expression of the definition, an error is returned.

So let's take some examples:

  • The "search:name" article could contain:
definition: $1 = ([:graph:])*
lemma: ^$1
  • The "search:name-disambiguation" could contain:
definition: $1 = ([:alnum:])*
lemma: \ \($1\)
  • In order to access an article in, let's say, the German language Wikipedia with the English lemma, you can do the following (it will look for the interwiki link if it does not find the lemma):
definition: $1 = ([:graph:])*
lemma: ^$1
text: \[\[([:alpha:]){2,3}:$1\]\]
  • The "search:page-type" could contain:
text: \{\{Disambiguation\}\} = disambiguation
text: ^\#REDIRECT = redirect
text: .* = text
  • The "search:person-born" could contain in case of the category solution of person data:
definition: $1 = ([:digit:])*
text: ^\[\[Category:Born\ $1\]\]
  • And of course a "search:namespace" could define the namespace field so that you could restrict your search to a namespace...

The MediaWiki search engine will take these regular expressions together with the client's string, compare them with the data string of the client, and return the article that matches best. By the way, this scheme of tokens would also work without problems with the proposed Wikidata, as you could define each data field of Wikidata as a token (in fact the current scheme can be viewed as a Wikidata with the two data fields "lemma" and "text").
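
A minimal sketch of how such a field definition could be evaluated, using the "search:page-type" example from above (written with Python regular expression syntax instead of the POSIX character classes used in the examples):

import re

# The "search:page-type" definition: lines are tried from top to bottom,
# the first matching line wins and yields the value after "=".
PAGE_TYPE = r"""
text: \{\{Disambiguation\}\} = disambiguation
text: ^\#REDIRECT = redirect
text: .* = text
"""

def evaluate(definition, article):
    # "article" maps tokens such as "lemma" and "text" to their content.
    for line in definition.strip().splitlines():
        token, _, rest = line.partition(":")
        pattern, _, value = rest.partition("=")
        subject = article.get(token.strip(), "")
        if re.search(pattern.strip(), subject, re.MULTILINE):
            return value.strip()
    return None

print(evaluate(PAGE_TYPE, {"text": "#REDIRECT [[1 Ceres]]"}))  # redirect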

That way the articles won't become more complex, and the solution is flexible, extensible, distributable and locally maintainable, which is a great advantage for a wiki solution.

Conclusion

  • All three parts of the suggested proposal (standardized names via REDIRECT, failure-tolerant retrieval and a Web API to the search engine) can be realised completely independently of each other, but work together without gaps.
  • The three parts are maintainable by a large community and do not add artificial complexity to the articles, which is a great plus.
  • We can start right now with existing technology by creating the standardized names, and thus enable Wikipedia for programs like Kalzium and KStars right now. This is a great advantage, as we can collect experience step by step while the two other components are coded in parallel and improved with feedback from the realisation of the first component.

So let's start!