Using OmegaWiki for Commons

From Meta, a Wikimedia project coordination wiki
Koniks in the Oostvaardersplassen

Wikimedia Commons is the repository of digital content for the Wikimedia projects. It holds thousands of pictures, movies and sound recordings for you to use, if you can find them, that is. There are great pictures, but who would guess that Image:Konik-etalage2.JPG is about horses?

One way to enhance Commons is to add keywords or categories to the pictures. This would help a lot, because searching on file names like Konik1 gets you nowhere unless you already know the name. Once pictures are categorised or have keywords attached, there is something meaningful to search on, and the collection becomes much more useful.

With OmegaWiki we can take the next step, the one needed to help people who do not speak English. Imagine that all keywords and categories are entries in OmegaWiki (OW), and that translations exist for all these words and phrases. A user enters a word in their own language, the engine looks the word up in OW, and instead of "Pferd" it searches for "horse". The Konik picture has the category "horse", so it is found. Hooray!
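A minimal sketch of this lookup, assuming two small in-memory tables in place of the real OmegaWiki and Commons databases. The names `TRANSLATIONS`, `IMAGES_BY_KEYWORD` and `search_images` are made up for illustration and are not part of either project.

```python
# Hypothetical stand-in for OmegaWiki: (language, word) -> keyword used on Commons.
TRANSLATIONS = {
    ("de", "pferd"): "horse",
    ("nl", "paard"): "horse",
    ("en", "horse"): "horse",
}

# Hypothetical stand-in for the Commons category index: keyword -> image titles.
IMAGES_BY_KEYWORD = {
    "horse": ["Image:Konik-etalage2.JPG"],
}


def search_images(word, language):
    """Map the user's word to the Commons keyword, then return matching images."""
    keyword = TRANSLATIONS.get((language, word.lower()))
    if keyword is None:
        return []  # OW does not know this word in this language
    return IMAGES_BY_KEYWORD.get(keyword, [])


print(search_images("Pferd", "de"))
```

The point is only that the search never compares "Pferd" with the file name: it goes through the shared keyword.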

Technically, on the OW side, we would treat the Commons words and phrases as part of a "Commons" glossary. They can be existing words or new ones. In OW we need a way to indicate what content is missing; one example would be "glossary X item without a translation in MY language". When these translations are added, we get two benefits for the price of one: OmegaWiki has new translations and Commons has better search functionality.
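The "glossary X item without a translation in MY language" report could be sketched roughly as follows. The glossary contents and the function name `missing_translations` are assumptions made for this example, not an existing OW feature.

```python
# Hypothetical "Commons" glossary and the translations OW already has for each item.
COMMONS_GLOSSARY = ["horse", "dog", "tree"]
KNOWN_TRANSLATIONS = {
    "horse": {"de": "Pferd", "nl": "paard"},
    "dog": {"de": "Hund"},
    "tree": {},
}


def missing_translations(glossary, translations, language):
    """Return the glossary items that have no translation in the given language."""
    return [
        item
        for item in glossary
        if language not in translations.get(item, {})
    ]


# Items a Dutch-speaking contributor could help with.
print(missing_translations(COMMONS_GLOSSARY, KNOWN_TRANSLATIONS, "nl"))
```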

One extra challenge is derived words, e.g. plurals and diminutives. A derived word is in essence the same as its headword, so a "paardje" (the Dutch diminutive of "paard", horse) should work in much the same way as "horse". How exactly is still an open question.
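One possible approach, sketched here under the assumption that OW would store derived forms alongside their headword, is to normalise every search term to its headword before doing anything else. The table and the function `headword` are hypothetical.

```python
# Hypothetical table of derived forms: (language, derived word) -> headword.
DERIVED_TO_HEADWORD = {
    ("nl", "paardje"): "paard",   # diminutive
    ("nl", "paarden"): "paard",   # plural
    ("en", "horses"): "horse",    # plural
}


def headword(word, language):
    """Return the headword for a derived form, or the word itself if none is known."""
    normalised = word.lower()
    return DERIVED_TO_HEADWORD.get((language, normalised), normalised)


print(headword("paardje", "nl"))
```

After this step the search can continue with the headword exactly as it would for a word typed in its base form.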

Mock up

Erik made a mockup of what this functionality might look like. In a mail about this he wrote the following:

1) A new tag for images of dogs is created. (In this demo, I call categories "tags", because I hope this will be what they are eventually called.)

2) The user can choose from the languages they speak to clarify which language this tag name is written in.

3) Based on the tag name and language, a lookup on OW is performed, which fetches all the meanings associated with this word.

4) The user selects one of these meanings.

5) Automagically, another lookup is performed to determine the available translations, if any. After saving the tag, it is then instantly available under these names in the other languages.

In the demo, the first two meanings have translations available, while the other two do not.
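The five steps can be sketched in code as below. This is not the mockup itself: the meanings, definitions and translations are invented for the example, and `lookup_meanings` and `create_tag` are hypothetical names. As in the demo, the first two meanings have translations and the other two do not.

```python
from dataclasses import dataclass, field


@dataclass
class Meaning:
    meaning_id: int
    definition: str
    translations: dict = field(default_factory=dict)  # language code -> word


# Hypothetical OW content for the English word "dog" (invented for illustration).
MEANINGS = {
    ("en", "dog"): [
        Meaning(1, "domesticated canine animal", {"it": "cane", "de": "Hund"}),
        Meaning(2, "male of a canine species", {"de": "Rüde"}),
        Meaning(3, "to follow someone persistently"),
        Meaning(4, "mechanical catch or pawl"),
    ],
}


def lookup_meanings(tag_name, language):
    """Step 3: fetch all meanings OW knows for this word in this language."""
    return MEANINGS.get((language, tag_name.lower()), [])


def create_tag(tag_name, language, chosen):
    """Steps 4 and 5: bind the tag to one meaning and collect its translations."""
    names = {language: tag_name}
    names.update(chosen.translations)
    return {"meaning_id": chosen.meaning_id, "names": names}


candidates = lookup_meanings("dog", "en")   # steps 1 to 3
tag = create_tag("dog", "en", candidates[0])  # the user picks the first meaning
print(tag["names"])
```

Note that the saved tag is identified by `meaning_id`; the names in each language are just labels attached to it.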

Why is this so powerful? Because, if OW itself is successful and contains many words, it almost instantly makes the entire media repository on Commons available to speakers of all languages. (Now, hopefully, you can see why we've been excited about getting millions of translations for free from the Logos project.) No need to create many different tags - just select the right meaning. Furthermore, it builds bridges from other projects to OW. The language work we are constantly doing will no longer be redundant, but focused on one place.

A 14-year-old Italian kid can then use the tag "cane" to look for photos of dogs, while a Maori girl from New Zealand can use "kurii". Moreover, the same category hierarchy can be used to browse in different languages (based on user preferences, a fallback hierarchy would be queried to determine which language to use when no translation is available).
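The fallback idea could look roughly like this, assuming each user has an ordered list of preferred languages and that English is the last resort. Both assumptions, and the function name `display_name`, are illustrative only.

```python
def display_name(tag_names, preferred_languages, default="en"):
    """Pick the tag name in the first preferred language that has a translation."""
    for language in preferred_languages:
        if language in tag_names:
            return tag_names[language]
    # No preferred language is available: fall back to the default, if present.
    return tag_names.get(default)


names = {"en": "dog", "it": "cane"}

# A user who prefers Maori, then Italian, sees the Italian name.
print(display_name(names, ["mi", "it"]))
```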

We could also automatically make use of synonyms, plurals and inflections (though this requires further changes to the category code beyond internationalization). Given that we are mapping one of multiple meanings to a single tag, there will be tag collisions -- those will have to be dealt with through disambiguation. But this is not important: try to see the tag name merely as a key to a meaning. What this key is called is secondary.
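To make the collision case concrete, here is a small sketch in which two tags share the English name "bank" but point to different meanings. The data layout and the function `find_tags` are assumptions for illustration; when more than one tag matches, the interface would have to ask the user to disambiguate.

```python
# Two hypothetical tags that collide on their English name.
TAGS = [
    {"meaning_id": 10, "gloss": "financial institution",
     "names": {"en": "bank", "de": "Bank"}},
    {"meaning_id": 11, "gloss": "edge of a river",
     "names": {"en": "bank", "de": "Ufer"}},
]


def find_tags(name, language, tags):
    """Return every tag whose name in the given language matches (case-insensitive)."""
    wanted = name.lower()
    return [t for t in tags if t["names"].get(language, "").lower() == wanted]


matches = find_tags("bank", "en", TAGS)
if len(matches) > 1:
    # Collision: show the glosses so the user can choose a meaning.
    for t in matches:
        print(t["meaning_id"], t["gloss"])
```

In German the name "Ufer" matches only one tag, which shows why the name is just a key and the meaning is what counts.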

The key principle of selecting a meaning and then performing automatic translations can be used in many different contexts. For example, in Wikidata, one could use the same principle to internationalize field names such as "Country", "Flag" and "Population".

This application also shows that OW must contain everything from words to names to phrases. There is no limit to the scope of it. This makes it a potentially massively useful tool for both human and machine translation.

The category internationalization functionality will not be part of the first release of OmegaWiki, but we believe we can get funding to work on this later. I believe that OW, in combination with better tagging features in general, could make our tagging system the most advanced one available. Flickr, for example, has no localization, is unlikely to ever get semi-automatic localization, and apparently supports no synonyms either.