Talk:OmegaWiki data design

From Meta, a Wikimedia project coordination wiki

Some random ideas


most commonly referred to as a word by the public. A token is a punctuation mark (such as a comma or a parenthesis) or a letter sequence not containing spaces.
This sentence will (probably) be separated into tokens thus (a "vertical bar" is inserted between adjacent tokens):
a word form that represents a group of related word forms. The selection of the lemma from among the declined word forms is arbitrary, but as a rule of thumb, nouns have their nominative singular as their lemma. The lemma of a verb is taken to be the first person singular in Latin and Ancient Greek, and the infinitive in the Slavic languages and German.
The lemma is the form you will most often look up in a printed dictionary.
E.g. in English, the lemma give stands for the forms give, gives, giving, gave, given.
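The token splitting described above can be sketched as follows (a hypothetical illustration; a real tokenizer would need to handle many more cases):

```python
import re

def tokenize(sentence):
    """Split a sentence into tokens: letter sequences not containing
    spaces, and single punctuation marks such as commas or parentheses."""
    # \w+ matches letter/digit runs; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", sentence)

# Display the tokens with a vertical bar inserted between adjacent tokens:
print(" | ".join(tokenize("They saw the girl.")))
# They | saw | the | girl | .
```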


Each token belongs to a lemma. (Unless the sentence is meant to be ambiguous: "They saw the girl." Here saw can be the present tense of the verb saw or the past tense of see. Unless the original author intended the sentence to carry both meanings, the token is assigned a single lemma.)

Each lemma has certain morphological properties, e.g. the formation of its declined forms. (That the word means is the same in the singular and the plural is a morphological property.) The absence of certain forms is also an example of a morphological property.

Each lemma has certain surface-syntactical properties, e.g. the gender, whether the noun is countable, etc.

Each lemma has certain deep-syntactic properties, e.g. that the lemma give has three complements: the actor (represented by the subject in the active), the patient (represented by the direct object in the active), and the addressee (represented by the indirect object in the active).

Each form has its lemma and a set of categories. E.g. gives is the third person singular active of give.
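One way to picture the form table just described, purely for illustration (the field names are invented, not an actual schema):

```python
# Each form points to its lemma and carries a set of categories,
# as in "gives is the third person singular active of give".
forms = {
    "gives": {"lemma": "give",
              "categories": {"person": "third", "number": "singular",
                             "voice": "active"}},
    "gave":  {"lemma": "give",
              "categories": {"tense": "past"}},
    "given": {"lemma": "give",
              "categories": {"form": "past participle"}},
}
print(forms["gives"]["lemma"])  # give
```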


These innate properties shall be taken into account while devising the representation.

--TMA 11:25, 16 July 2005 (UTC)

simplified vs traditional

From the page:

I have as yet no proper way of distinguishing different charactersets for one language eg simplified and traditional Chinese

Those are *in* ISO 15924 -- simplified is Hans, traditional is Hant. In language codes the ordinary way, apparently, is [ISO 639 language]-[ISO 15924 script]-[ISO 3166 country]-[variant], with anything after the language optional, thus zh ("Chinese"), zh-Hant ("Traditional Chinese"), zh-Hant-SG ("Traditional Chinese in Singapore"), etc. See for example [1]. —Muke Tever 02:47, 24 July 2005 (UTC)
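The tag composition described in this thread (language, then optional script, region and variant) can be sketched like this; it is a simplification, since real BCP 47 tags have additional rules:

```python
def make_tag(language, script=None, region=None, variant=None):
    """Compose language[-script][-region][-variant]; everything
    after the language is optional."""
    return "-".join(p for p in (language, script, region, variant) if p)

print(make_tag("zh"))                              # zh
print(make_tag("zh", script="Hant"))               # zh-Hant
print(make_tag("zh", script="Hant", region="SG"))  # zh-Hant-SG
```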

need for examples

I think that some examples with real use cases could be very useful for understanding and commenting on the ERD. Could the designers give some? luna 09:45, 28 July 2005 (UTC)

Font Issues

Since Ultimate Wiktionary is a database of "raw" content, fonts should play no part. They only come into play when the text is being displayed or printed, and then the choice of font will be up to the application/user. Some sort of markup may be appropriate to indicate the source language of texts, as, for example, Japanese, Chinese (simplified) and Chinese (traditional) may have the same "word", but there are national font preferences when it comes to display or print. --JimBreen 04:27, 31 July 2005 (UTC)

I trust that the majority of people cannot display the Cherokee or the Amharic script. I would love to have the WMF host fonts for all the scripts that we use in our Ultimate Wiktionary. Finding free fonts is a lot of work, and often you find shareware fonts that are not paid for. I want to inform about the possibilities, not because we must but because we can. GerardM 05:25, 1 August 2005 (UTC)

Alternate spellings

I'm not sure: is there a place for alternate spellings in the data? (The link from one Expression to many spellings in the misspellings table seems to do that.) Then, are all the spellings at the same level, or can we specify a preferred spelling? I think we need both. For example, in French, the word for "key" can be spelled "clef" or "clé"; both are pronounced the same (/kle/), and both orthographies are commonly used. Other words have two spellings because the French Academy changed them recently, so the newer spelling should be the preferred one. This is the case for the word for "master", which was spelled "maître" and, since 1990, can (and should) be spelled "maitre". Most people are not aware of that reform (I only discovered it some months ago), so the first spelling is still correct and often used (and the second one is considered a mistake by those same people). --Kipmaster 08:47, 29 August 2005 (UTC)

A Misspelling is just that: plain wrong under all circumstances. Words that are certified by an authority like the Académie française can be marked as such. According to the Académie, words that they did not certify are wrong. In UW we can have many spellings, and potentially we can include or exclude non-certified words. As spellings for the same word can coexist, a relation between these words needs to be defined. GerardM 12:06, 31 August 2005 (UTC)
I'm glad to read "(i)n UW we can have many spellings". This presumably includes all valid orthographical variants (which is a better term than "spelling" for such things). Just as long as the Japanese part of UW can have both 合気道 and 合氣道 in the same entry, because they are two written versions of the same word (aikido). --JimBreen 04:48, 5 September 2005 (UTC)
合気道 and 合氣道 are two different Expressions, and as such they are two different Words; they share the same meaning. These words are related as orthographical variants; I do not know whether these variations each belong to a specific named orthography. GerardM 20:54, 6 September 2005 (UTC)
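The relation between orthographical variants discussed in this thread could be sketched as follows; the table layout and field names are invented for illustration and are not the actual WiktionaryZ schema:

```python
# Hypothetical relation between orthographical variants.
variants = [
    {"expressions": ("clef", "clé"), "language": "fr", "preferred": None},
    {"expressions": ("maître", "maitre"), "language": "fr",
     "preferred": "maitre"},              # 1990 reform spelling
    {"expressions": ("合気道", "合氣道"), "language": "ja",
     "preferred": None},
]

def variants_of(expression):
    """Return every expression recorded as a variant of the given one."""
    for row in variants:
        if expression in row["expressions"]:
            return [e for e in row["expressions"] if e != expression]
    return []

print(variants_of("clef"))  # ['clé']
```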

multiple mediafiles?

Looking at your data design and reading your comments, I'm not quite sure whether your design covers the following. I think the representation tables are supposed to cover this, but I'm not totally sure. And if they are, there is a need for additional items.

One thing Wiktionary would be great for is helping people learn languages. Being able to attach media files to words would allow you to hear how a word is spoken by a native speaker. Even better would be the ability to hear different pronunciations of the word from different people.

If you have enough media files for words, you will be able to automatically generate spoken sentences from text. Obviously it wouldn't be as good as someone reading out a sentence, but it would probably be adequate, provided the words in the sentence come from the same speaker. Consequently, each media file needs additional elements attached to it:

  • name of the person
  • dialect (might be covered by representation type)
  • gender?
  • Emotional tone? (In an ideal world we would have different versions of certain words according to someone's emotional state.)

An identifier for the person would help people decide which version of the word they preferred to hear if there were multiple. :ChrisG 03:27, 20 September 2005 (UTC)
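The per-file metadata proposed in this comment could be sketched as a record like the following; all field names are invented for illustration:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Pronunciation:
    """One recording of a word; field names are invented."""
    word: str
    file: str                      # e.g. an Ogg recording
    speaker: str                   # name of the person
    dialect: Optional[str] = None  # might be covered by representation type
    gender: Optional[str] = None
    tone: Optional[str] = None     # emotional tone, if recorded

# With several recordings per word, a learner can pick the speaker
# and dialect they prefer to hear.
recordings = [
    Pronunciation("tomato", "tomato-uk.ogg", "A. Speaker", dialect="en-GB"),
    Pronunciation("tomato", "tomato-us.ogg", "B. Speaker", dialect="en-US"),
]
```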


Hi there to anyone reading this. I find the data design very interesting, but after examining it I came up with a doubt. I'm an (extremely) active user at the Spanish Wiktionary, and the way we deal with verbal conjugations there is by means of templates. I've also used templates to create inflections in French, German, and Modern Greek, and they really are a time saver. The drawback that I see here is that one would need to type in the whole conjugation table by hand for a Spanish verb, for example, making it a very time-consuming task. My main question is: is there a chance to include some shorthand for regular inflections in several languages? ppfk (@) December 06 2005.

Ultimate Wiktionary will initially provide basic functionality. The current implementation of inflections is that we allow for boxes specific to languages. Inflections exist for verbs, nouns and adjectives. There are standard inflections, and the rules for these can be implemented in software. It will need the confirmation of an editor to include these inflections.
However, it will need someone to write the rules in the software. It is something that we cannot provide by default, as we do not know these rules and because we have plenty of functionality on our plate :) GerardM 15:24, 7 December 2005 (UTC)
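As an illustration of the kind of rule being discussed, here is a minimal sketch for one regular paradigm (Spanish -ar verbs, present indicative); real per-language rules, supplied by editors, would be far more complete:

```python
# Present indicative endings for regular Spanish -ar verbs
# (yo, tú, él/ella, nosotros, vosotros, ellos).
PRESENT_AR = ["o", "as", "a", "amos", "áis", "an"]

def conjugate_ar(infinitive):
    """Apply the regular -ar paradigm to an infinitive."""
    stem = infinitive[:-2]  # drop the -ar ending
    return [stem + ending for ending in PRESENT_AR]

print(conjugate_ar("hablar"))
# ['hablo', 'hablas', 'habla', 'hablamos', 'habláis', 'hablan']
```

A rule like this would spare editors from typing whole conjugation tables by hand for every regular verb.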


Please have a look at

in particular you may find inflection paradigms there (only a list of inflected forms, without the regular expressions needed to do the work). It also deals with:

  • canonical forms of multi-word entries
  • syntactic, morphological and semantic properties

and so on.


"Dog Food"

I've had a look at the Ultimate Wiktionary/WiktionaryZ data model Erik recently released, and it appears the model has a Language table that will be used both as content by the WiktionaryZ dataset and operationally by the Multilingual MediaWiki to control such things as the user interface, the language of pages, etc. Also, the main page indicates that UW/WiktionaryZ user interface labels will be derived from data in the dataset.

I think there are two drawbacks to this approach, though. Logically, there should be a difference between operational data used by MediaWiki and content data that is part of an individual Wikidata dataset. In the future, the sorts of attributes that each will require for its model of the Language entity will tend to diverge. For example, for certain applications a Language must be modeled as it changes over time, either in the case of languages which go extinct, or for languages that significantly evolve over time (e.g. Old vs. Modern English). Such detail is unnecessary for MediaWiki to control the languages used on a particular installation, however.

More importantly, though, having data used by the site software be world-editable as a Wikidata dataset is a very large security hole. At the very least it makes acts of vandalism more devastating, to say nothing of denial-of-service attacks through turning languages off or changing their character sets. Use of WiktionaryZ for label data also has the potential for very bad cross-site scripting attacks if HTML-injection filters are not added or have any flaws in them.

Jleybov 23:58, 23 March 2006 (UTC)

The language table will not be world-editable. It would also not be world-editable if it were only a WiktionaryZ table. As it is the explicit intention to create a user interface for the languages that are actually used (the majority), there is a direct link between a new language and the need for a localisation of MediaWiki for yet another language.
When a language changes over time, it is an issue with the words of that language. It has nothing to do with the existence of that language. By the way, Old and Middle English are considered languages in their own right in ISO 639-3. Then again, the table is first and foremost about the existence of a language; that is all. GerardM 16:44, 14 June 2006 (UTC)
In the context of WiktionaryZ I can understand how there could be a close connection between Wikidata languages and MediaWiki languages: if a language is supported by WiktionaryZ, there will be content in that language, and so naturally there should be a MediaWiki localization for it.
I'm assuming, though, that integration across different Wikidata datasets is desirable enough that the Language and Script entities in WiktionaryZ will be shared with other datasets. In that case, the assumption that the languages modelled by Wiktionary and the languages used by MediaWiki should move together in lock-step breaks down. For example, my cataloging project will need many ancient and dead languages: basically every language for which we have written specimens of text. Just because we wish to account for items written in Etruscan or Pahlavi, though, does not mean we should support MediaWiki localizations in those languages.
The fundamental issue here is that it is a conceptual mistake to mix system set-up data with content data, and as Wiktionary and other Wikidata-flavored projects evolve, the mismatch between modelling requirements in these two very different contexts will increase. For example, the Language table already has an is_enabled attribute, which has nothing to do with modelling a language as a real-world, ontic entity.
Jleybov 01:24, 28 June 2006 (UTC)

Universal Pronouncing Dictionary

I have given some thought to the data design of the so-called "Omega Wiktionary". The data design ought to be based on a universal pronouncing Wiktionary. Every term of every language ought to be transliterated into an agreed-upon phonetic alphabet. While many may prefer the IPA, my preference is for X-SAMPA because the entries can be displayed in recognizable ASCII character order. The next entry should be some code for the language source, such as ISO 639-2. The next entry could be a display of the word in the Unicode character set most in use for that language. The next entries could be: grammatical classifications, a definition in the original language, and a translation into at least one of the six UN languages. Thus, the design of the entries could be as follows:

    1. Pronunciation in X-SAMPA
    2. Language code from ISO 639-2
    3. Unicode display in native characters
    4. Grammatical classifications
    5. Definition in the native language
    6. Translation into at least one of the 6 UN languages

-Walter Ziobro
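The six-field entry proposed above could be sketched as a record like this; every value is illustrative, including the X-SAMPA transcription, and the field names are invented:

```python
entry = {
    "pronunciation_xsampa": "dOg",          # 1. pronunciation in X-SAMPA
    "language": "eng",                      # 2. ISO 639-2 code
    "native_form": "dog",                   # 3. Unicode display in native characters
    "grammar": ["noun"],                    # 4. grammatical classifications
    "definition": "a domesticated canine",  # 5. definition in the native language
    "translations": {"fra": "chien"},       # 6. translation into a UN language
}
print(entry["native_form"])  # dog
```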