Jump to content

A new look at the interwiki link (2nd phase)

From Meta, a Wikimedia project coordination wiki
This page is deprecated. Please see the current version at A newer look at the interwiki link.

Got accidentally across the article on interwiki links, an idea initially triggered by Ellywa and further developed by GerardM and HenkvD. I agree fully that the present practice will become unmanageable, as the amount of links to be maintained will grow exponentially with the growth of the Wikipedia's in various languages.

The main reason is indeed that the interwiki links are currently defined as part of the descriptive part of any lemma, resulting in a lot of redundancy in the overall database. This article is an attempt to contribute to the idea, possibly coping with some of the objections as well.

Golden Rules in Database Design

[edit]

In relational database design a golden rule is to avoid any redundancy in the data (i.e. tables). So, if you have an application that is for example storing personal info (names and addresses) and score data (e.g. for a game), make sure a name and an address is stored in one table (and only one!) and store the score data in another table. Why? Well, the same person may have more than one score and you generally don't want to repeat the name/address information in the score table. Because if the address changes, you have to change that in many places.

But having the information in different tables, you want to have a linking mechanism to know which score belongs to which person. In other words: the two tables in the example have to be interrelated. Therefore the name Relational Database.

There are basically two mechanisms to define such a relationship, i.e.:

  • Still include the fields for names and addresses in the score table, but instruct the database system to maintain automatically the integrity across tables that share the same fields. So if an address is changed, the database system takes care about corresponding changes in all other tables where that address exist. This method, although perfectly right, is often disliked by developers, because it may impose restrictions and reduces the ease with which a database can be developed.
  • The second mechanism makes use of unique keys where possible. Unique keys are automatically assigned by the database system. Other tables (e.g. the score table) refers to this key (e.g. golf player Smith has unique key 48964) and can find the name and address of the guy with that particular score in the persons table.

So far the general basic explanation of relational databases. Back to the problem addressed by the users mentioned above. My basic approach would be the following: If you are able to define the principles of tables and their relationships, you are able to program the necessary functionality in an unambiguous way.

Lay-out of Interwiki Tables

[edit]

An interwiki table could look like this:

en fr de nl es
231377 8597 4721 18458 5616
348315 184713 4677 34090 5584
295225 184202 32658 16600 6018
382591 121788 120179 17945 4574


The numbers refer to the cur_id fields in the cur table (that is the main table where everything is stored). The cur_id field is an unique key for every article in any language version of Wikipedia. So, nl - 18458 is an article about "hond" in the Dutch Wikipedia. The English lemma on "dog", however will have another unique id, because it is in another table.

The above table is not very well readable by humans that are editing a page, but for the database system it is easy to replace the cur-id fields that are stored in the interwiki table by the contents of the cur_title field, which gives you the titles of the articles as follows.

en fr de nl es
dog chien Hund hond perro
cat chat Katze kat gata
pet animal domestique Haustier huisdier animal doméstico
doghouse   Hundehütte    


When viewing a Dutch page with title "hond" and cur_id=18458, the interwiki links are now automatically generated by the system in the following manner [http://en.Wikipedia.org/search/cur_id=231377 | English: dog], etc. The http link here is something artificial, i.e. some way to define in an easy way that the search is not on the title, but on the id. Btw: Searching on a numeric key rather than on text will generally contribute in a positive manner to the performance of the system.

Title Changes

[edit]

This mechanism already solves the problem of changing a title. The unique id remains the same, so the database will automatically resolve the new title for an article.

Expansion

[edit]

The above tables represents a certain (more or less mature) state of the system. Only the article on the doghouse is not yet available in all languages of this example.

Let's now consider how this may have evolved over time. Suppose the Spanish Wikipedia does not have yet an article on dog, cat or pet. So the table (in human readable form) looks like this:

en fr de nl es
dog chien Hund hond  
cat chat Katze kat  
pet animal domestique Haustier huisdier  
doghouse   Hundehütte    


Now a Spanish user starts to write an article on "animal doméstico". Once finished, he (read for the rest of this article he/she or him/her, etc) calls up the interwiki subsystem and selects French and English as foreign reference languages (Dutch and German are like Chinese to him). He searches for "animal domestique", knowing that this is the right translation for his Spanish equivalent and will see the following table:

en fr es
pet animal domestique  


He confirms that this is the right fit and the table will now look as follows:

en fr de nl es
dog chien Hund hond  
cat chat Katze kat  
pet animal domestique Haustier huisdier animal doméstico
doghouse   Hundehütte    


Next he realises that an article about perro and gata does not exist yet in the Spanish Wikipedia, but decides to link the "animal doméstico" article to other foreign articles about these subjects as well. So after searching for dog (he knows the English (semi-)equivalent) and chat for the French (semi-)equivalent and confirmation, the table looks as follows:

en fr de nl es
dog chien Hund hond animal doméstico
cat chat Katze kat animal doméstico
pet animal domestique Haustier huisdier animal doméstico
doghouse   Hundehütte    


If an user, viewing the German article on "Hund" clicks to the Spanish language he will get the article on "animal doméstico" and the same will happen if he was initially looking to the German article on "Katze".

But in the reversed case, i.e. if an user is viewing the Spanish article on "animal doméstico" and clicks to the English version, the system should in that case show a multi-choice page, requesting to choose between "dog", "cat" or "pet" respectively. The link on the Spanish page could have "multiple articles" (but then in Spanish of course) as text. The actual url should consist of a search term/query containing all relevant cur-id's of the English Wikipedia, i.e. [http://en.Wikipedia.org/search/cur_id=231377&cur_id=348315&cur_id=295225 | English: multiple articles].

Once the Spanish Wikipedia is extended with articles on "perro" and / or "gata" only the author of these articles has to do something. When searching on "hond" (because this user happens to be a retired Dutch, living in Spain), he will see that "animal doméstico" is already existing and he will change that to "perro". The same for "gata". The end result will be the first table in this article.

Automatic Maintenace Tasks

[edit]

The system should automatically take care of a few maintenance tasks.

New article (e.g. on "animal doméstico") and no interwiki defined

[edit]

The table will automatically be extended with a new entry where only the field for es is filled. (this could be called an orphan link), like this:

en fr de nl es
dog chien Hund hond  
cat chat Katze kat  
pet animal domestique Haustier huisdier  
doghouse   Hundehütte    
        animal doméstico


The Spanish article will show no interwiki links, nor will the other languages show a link to the Spanish article. Both an user working on "huisdier" and amazed that there is no link to a Spanish article could correct this (provided he understands Spanish). But also the other way around, a Spanish user with knowledge of a foreign language could do this. If the correction is made complete and will result in an empty line (i.e. the original orphan with only the "animal doméstico" entry, which is now empty for all languages), the system can then automatically remove this line. If the user making the correction forgets to remove the orphan entry, the system can also automatically remove this. The logic is then: if duplicates exist for a language, remove the orphan entry(ies).

Deleting Articles

[edit]

Deleting an article is no problem. The system can easily remove references to articles in any language that do not exist anymore.

A more complex situation

[edit]

The above described tasks are obviously batch processes. So it may happen that an Italian entry is added to the Spanish orphan before this orphan entry is moved to "pet" etc. The table would then look as follows:

en fr de nl es it
dog chien Hund hond    
cat chat Katze kat    
pet animal domestique Haustier huisdier    
doghouse   Hundehütte      
        animal doméstico animale domestico


This is a more complicated case, which need some further re-thinking. It probably would require an auxiliary table that registers which entry in the fifth record has been the parent and which are child(ren). If a parent is moved (in this example the Spanish entry), the children are moved as well, but if only a child is moved (in this example the Italian entry), only the child is moved and the Spanish entry automatically becomes an orphan again.

Please note that where in the above explanation for the sake of readability the verbose text, such as dog, cat, pet etc. is used, the system only stores the unique id's (the field cur_id) for any article.
This complex situation is in fact quite easy. The spanish and italian entries are considered to be a complete new entry, something like a horse. If the spanish entry is fixed to pet the italian leaves behind, as the italian still could mean horse. There is no way of knowing that. The italian should be fixed seperately. HenkvD 19:56, 30 Jun 2005 (UTC)
But then you harm the italian contribution, because he initially only mastering spanish as foreign language, found the animal domestico as hook to connect his article. And that is right because that spanish article is about pet. So if the spanish article is corrected, why not correct the italian as well? The only condition is that the spanish article was the parent to which the italian one was linked later. Taka's suggestion for the implementation can easily be extended to have the system take care of this parent-child relationship. RonaldB 22:17, 30 Jun 2005 (UTC)
The parent-child relationship itself might be too complex. Just a few questions: How is it shown to the users? How does the spanish user know that his change has effect on the italian entry? I think it is essential on wiki's that relations are 100% clear, and that the action you take should have predicable results. HenkvD 07:52, 1 Jul 2005 (UTC)
The user doesn't need to know beforehand about the dependency structure. He just sees that es and it are apparently linked to each other. If he moves the Spanish entry the system can figure out that there are one or more children and prompts the user about the dependency with the question "Do you want to move it:animale domestico as well?" with the option to confirm or deny. If denied only the Spanish entry is moved and the Italian one becomes an orphan.
If someone only moves the Italian entry, having no children, there is no need for such a prompt.
The ultimate parent might be given another color, but that may raise more questions than it clarifies. Another option is to display the creation date just below the entry. For an experienced user it might be convenient to know what is the parent. If for instance a group of 5 children have to be moved, he can pick the parent immediately.
RonaldB 15:46, 1 Jul 2005 (UTC)
A prompt "Do you want to move article .... as well?" is a fine option on the parent-child issue. On the other hand it will work fine as well without a parent-child complexity. Parent-child might be an addittional feature which is nice-to-have, but it is not a must-have. HenkvD 18:02, 1 Jul 2005 (UTC)
If you don't understand other language (even you cannot guess..), you cannot answer, and the scheme does not work. Maintaining the hierarchies (of interwiki links) is a nightmare. Even in current 1:1 mapping, it does not work. What's the relation between en:Middle school, en:Gymnasium (school), en:High school, and en:Secondary education? The animals are easiest case. They can be calssified by the taxonomy. -- ChongDae 9 July 2005 14:45 (UTC)

Comments / Discussion

[edit]

A relational table with this layout:

en fr de nl es
231377 8597 4721 18458 5616
348315 184713 4677 34090 5584
295225 184202 32658 16600 6018
382591 121788 120179 17945 4574

seems to me a bit unprofessional. A much more flexible table looks like this:

topic_id article_id language_id creation_date comment (this is not a field)
2569 231377 5 1-7-05 22:33:55  
2569 8597 8 3-8-05 03:04:28 this is now a child of cur_id=231377 in language_id=5
but a parent of cur_id=4721 in language_id=12
2569 4721 12 12-9-05 12:44:12  
2569 18458 20 and so on  
2569 5616 17  
139 348315 5
139 184713 8
139 4677 12
139 34090 20
139 5584 17
3751 295225 5
3751 184202 8
3751 32658 12
3751 16600 20
3751 6018 17

To explain it: you don't want to have a table that you need to change everytime a new language is added. In this (latter) model, adding a new language is a matter of just adding a new record to the language table. No structural changes of the database model are needed.

The problem (and maybe the solution) is the "topic_id", which in this model refers to a "topic" or "article title" which is independent of any language. It is a solution, because it is language independent. The problem however that a number holds no information about the topic, and there should be some kind of "name" to make clear what each given topic is. And then you need a language again. But i am sure that with some thinking some good solution for this can be found.

Taka 07:27, 30 Jun 2005 (UTC)

You are absolutely right, this is the way to implement the table. I used the spreadsheet like representation because it is more suitable to explain. The topic id can be generated automatically by the system if a new orphaned article is added. There is no reason to give the topic_id any meaning other than it is a number connecting entries in that table. So it is really language independent.
By adding a field with the current date/time stamp at the moment a new record is created, the system can easily keep track of grandparent-parent-child-grandchild etc. relationships. See the few added fields in de date column. RonaldB 22:57, 30 Jun 2005 (UTC)

A few practical aspects

[edit]

On paper everything looks nice. However, I am wondering:

  • what to do with articles that are being turned into disambiguation pages and subsequently split? This may happen at any given moment in any language.
  • how much load will this "nice feature" have on the Wikiservers?

Quistnix 18:06, 30 August 2005 (UTC)[reply]

* The serverload will hardly be effected. Only one extra table needs to be accessed.
* Articles that are linked from Disambiguation pages are treated the same as current interwikilinks. The disambiguations pages itself are less linked because the disabiguation mostly differ per language. The disambiguation pages that have interwikilinks can have these in the new situation as well, just as other articles.
* Splitting up any article (disambiguation or not) is just treated as new articles, by adding and removing links to other languages.
HenkvD 10:21, 2 September 2005 (UTC)[reply]