Talk:Tables for Wiktionary

Note: content that was here was moved to Talk:Vortaro tables because I believe it was regarding the Vortaro tables. There's an outside chance it was actually re: the original tables listed on this page.

(page=spelling <<->> expression(word) <<->> meaning <<-> concept) Aliter 19:03, 20 Oct 2004 (UTC)

I elaborated a bit on the line above, but it would be nice to know what we see as the requirements. Aliter 00:56, 16 Mar 2005 (UTC)

?? A Wiktionary page has one or more Words ??

Can this be correct ?

Currently, in English Wiktionary, a Wiktionary Page has one word. That word can have one or more meanings (or none); none or many alternate spellings; one or more occurrences for different parts of speech; one or more occurrences for different languages. But, by definition, ONE WORD is ONE PAGE.--Richardb 13:57, 5 Jun 2005 (UTC)

?? A Word has a Language?? Again, a word can have several languages.

If you/we mean something other than what is currently commonly understood by the term "word", then you/we need to use a different term: a WORD ENTITY.

When you are talking about tables, it is inherent that you are talking about tables. GerardM 09:52, 6 Jun 2005 (UTC)

A WORD ENTITY has

a language (By definition, each WORD ENTITY is for one language)
a spelling - by definition (but possibly several alternate spellings)
many WORD TYPES (parts of speech)

Alternate spellings are different words; they are a special type of synonym. A word HAS one word type otherwise it is a different word. GerardM 09:52, 6 Jun 2005 (UTC)

Any WORD ENTITY/WORDTYPE combination can have multiple MEANINGS.
--Richardb 13:57, 5 Jun 2005 (UTC)

?? Can one MEANING have several related WORD ENTITIES ?? I.e., can many synonymous words all "link" to the one MEANING? Seems dubious.--Richardb 13:57, 5 Jun 2005 (UTC)

Yes, they can, when they are exact synonyms. GerardM 09:52, 6 Jun 2005 (UTC)
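As a rough illustration of the model discussed in this thread, here is a minimal Python sketch. It assumes GerardM's reading (a word entity has one language and one word type, and exact synonyms share one meaning). The class names, field names and sample data are made up for illustration and are not taken from any existing Wiktionary schema.

```python
from dataclasses import dataclass, field


@dataclass
class Meaning:
    """A meaning that one or more word entities can link to."""
    definition: str


@dataclass
class WordEntity:
    """One spelling in one language with one word type (part of speech)."""
    spelling: str
    language: str
    word_type: str
    meanings: list = field(default_factory=list)


@dataclass
class Page:
    """A page is keyed by spelling and collects every word entity spelled that way."""
    spelling: str
    entities: list = field(default_factory=list)


# Hypothetical sample data: two words treated here as exact synonyms
# link to the very same Meaning object.
shrub = Meaning("a spiny evergreen shrub with yellow flowers")
gorse = WordEntity("gorse", "en", "noun", [shrub])
furze = WordEntity("furze", "en", "noun", [shrub])

# One page (one spelling) can hold entities for several languages and word types.
page = Page("die", [
    WordEntity("die", "en", "noun"),
    WordEntity("die", "en", "verb"),
    WordEntity("die", "de", "article"),
])
```

Under this reading, "one word is one page" holds for the spelling, while language, word type and meaning hang off the individual word entities.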

Senses vs. Translations

It's important to keep in mind that a word in one language may not have a simple translation into another language. Suppose you have a Mandarin word M with senses M1, M2 and M3. Obviously, the Ultimate Wiktionary will contain explanations of those three senses written in Mandarin. Now, suppose there's an English word E with senses E1 and E2, and E1 matches exactly the sense of M1 (that's going to be a rare occurrence), while M2 and M3 don't have exact equivalents in English but need to be paraphrased with several words. So for M1 we have what could be called a "translation", while for M2 and M3 we don't. Therefore the senses M2 and M3 need to be explained in English as well.

The upshot then is the following: the Ultimate Wiktionary needs to contain descriptions of all senses of all words of all languages, with each description available in every language. en:user:AxelBoldt

Indeed. I'm a translator, and have dealt with a couple of different terminology systems. One that is often used in translation circles is Trados MultiTerm (MT). MT uses the structure of Concept -> Language(s) -> Term(s) -> Meaning(s), Example(s), etc., with the focus on Concepts as the root category.

While this works much better when dealing with multilingual glosses, it makes it devilishly hard to set this up Wiki-style. If we follow the suggestion listed in the first FAQ, users should be able to just plug in a new word -- but then how do they set up a concept, especially a concept that might not match any single word in that user's language? Or how do they relate their new word to an existing concept where the only terms entered so far are in a language that user doesn't understand? Eiríkr Útlendi 19:26, 11 January 2006 (UTC)
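To make the concept-rooted layout a bit more concrete, here is a rough Python sketch. It assumes a structure of Concept -> Language -> Term(s) along the lines described above; it is not MultiTerm's actual data format, and every name and placeholder value in it is hypothetical. It covers the case from AxelBoldt's example where a sense has a term in one language but only a description in another.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """Root entry: one sense, described and named in several languages."""
    concept_id: int
    # language code -> description of the sense written in that language
    descriptions: dict = field(default_factory=dict)
    # language code -> terms that express the sense exactly (may be missing)
    terms: dict = field(default_factory=dict)

    def translations(self, language):
        """Return the exact-match terms for a language, or an empty list."""
        return self.terms.get(language, [])

    def explain(self, language):
        """Return the description in a language, or None if none exists yet."""
        return self.descriptions.get(language)


# Placeholder data for sense M2 of the Mandarin word M:
# it has a term in Mandarin but no exact English equivalent.
m2 = Concept(concept_id=2)
m2.terms["zh"] = ["M"]
m2.descriptions["zh"] = "(explanation of sense M2 written in Mandarin)"
m2.descriptions["en"] = "(explanation of sense M2 written in English)"

print(m2.translations("en"))  # no exact English term
print(m2.explain("en"))       # but an English description exists
```

The sketch does not answer the question of how a user who only knows one language finds the right existing concept; it only shows that terms and descriptions can be filled in independently per language.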

Wolfgang's proposal

Dear all,

I came across many XML formats for coding linguistic information and was disappointed that they typically cover only one aspect in a satisfactory manner, whereas others are difficult or impossible to code.

Currently we are using OLIF (http://www.olif.net) to store data for machine translation and we have access to terminological databases like IATE (European Commission) in DXLT format (see e.g. http://forum.europa.eu.int/irc/DownLoad/kSepAiJ_mSGRygAeJGCUa_HlQtNHZOpHlbT2_MZqh3RDTFQXJV5p4Tj3etVU/IATE-XML_and_mapping.doc).

Furthermore we have a huge number of texts, many of them with translations (EN,DE,FR,ES). We aligned some texts at the sentence level (TMX format see http://www.lisa.org/tmx) and built parallel corpora in Lucene databases (see http://lucene.apache.org/).

The variety of data that we have led us to the conclusion that it would be unwise to have a single root element, because different applications require different roots, and data duplication should be avoided.

I then thought about structuring linguistic data in a relational database.

We are thinking about using the following tables (there are some more; these are just the most prominent ones):

Lexem (containing a lemma and links to LexInfo and to an inflection paradigm).

LexInfo (containing a list of all possible combinations of lexem attributes such as language-partOfSpeech-gender-naturalGender-transitivity-preposition...).

Variant (containing 2 Lexem IDs which are orthographic variants or have more than one LexInfo or inflection paradigm).

SurfForm (containing inflected forms, links to lexem ID and to inflForm ID).

InflForm (contains a list of all inflected forms like DE-verb-1Person-Sing-Indic-Präs).

RegEx (input InflForm and inflection paradigms, output regular expression to be applied to generate surface form).

Dict (containing lemmas and multiword expressions in canonical Form).

Meaning (containing definitions, examples and semantic Type ID).

SemRel (containing semantic relationships between Meanings).

Trans (unidirectional link between two Dict entries, containing condition when it is applicable).

and linking tables between SurfForms and Lexems, between Dict entries and SurfForms, and Meaning and Dict.
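A few of the tables listed above could be sketched as follows, using Python's built-in sqlite3 module. This is only a sketch under assumptions: the column names are invented, only four of the tables are shown, and the inflection paradigm, Dict, Meaning and the linking tables are left out. The sample rows use the Konto plurals from this proposal.

```python
import sqlite3

# Hypothetical column names; only a subset of the proposed tables is shown.
SCHEMA = """
CREATE TABLE LexInfo (
    id INTEGER PRIMARY KEY,
    language TEXT NOT NULL,
    part_of_speech TEXT NOT NULL,
    gender TEXT
);
CREATE TABLE Lexem (
    id INTEGER PRIMARY KEY,
    lemma TEXT NOT NULL,
    lexinfo_id INTEGER NOT NULL REFERENCES LexInfo(id)
);
CREATE TABLE InflForm (
    id INTEGER PRIMARY KEY,
    label TEXT NOT NULL UNIQUE
);
CREATE TABLE SurfForm (
    id INTEGER PRIMARY KEY,
    form TEXT NOT NULL,
    lexem_id INTEGER NOT NULL REFERENCES Lexem(id),
    inflform_id INTEGER NOT NULL REFERENCES InflForm(id)
);
"""


def build_demo_db():
    """Create an in-memory database holding the Konto example."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO LexInfo VALUES (1, 'DE', 'noun', 'neuter')")
    conn.execute("INSERT INTO Lexem VALUES (1, 'Konto', 1)")
    conn.execute("INSERT INTO InflForm VALUES (1, 'DE-noun-Nom-Plur')")
    # Three competing plural forms point at the same lexem and inflected form.
    conn.executemany(
        "INSERT INTO SurfForm (form, lexem_id, inflform_id) VALUES (?, 1, 1)",
        [("Konten",), ("Kontos",), ("Konti",)],
    )
    return conn


def analyse(conn, form):
    """Return (lemma, inflected-form label) pairs for a surface form."""
    rows = conn.execute(
        "SELECT Lexem.lemma, InflForm.label "
        "FROM SurfForm "
        "JOIN Lexem ON Lexem.id = SurfForm.lexem_id "
        "JOIN InflForm ON InflForm.id = SurfForm.inflform_id "
        "WHERE SurfForm.form = ?",
        (form,),
    )
    return rows.fetchall()


conn = build_demo_db()
print(analyse(conn, "Konti"))  # [('Konto', 'DE-noun-Nom-Plur')]
```

Because there is no single root table, the same data can be entered from the surface form (as in analyse) or from the lemma, which is the point of avoiding a single root element.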

Furthermore we have the corpus databases:

Text (contains the IDs, classification code, text sort code, and storage location of the texts).

TextIndex (index to retrieve Texts from a search expression).

Sentence (contains the text of a sentence).

TUV (link between two sentences which are mutual translations).

TUVIndex (index to retrieve TUVs from a search expression).

Note1: every tuple in every table has an ID.

Note2: there is no link between corpus databases and the other ones now (we could tag the texts to do so).

Note3: the relational database system is somewhat extensible (e.g. you can use WordNet in parallel with MindNet if you want; you can add syntactic information to the sentence database (Treebank); you can add morphological information like word decomposition, derivational morphology, slot grammar info and so on).

Note4: you should add some administrative information to all entries in all tables (updated by ..., criticised by ..., owner ?)

To show some strong points of this approach, please check if you can code this into your approaches:

Bank Plural Bänke => Meaning Sitzgelegenheit

Bank Plural Banken => Meaning Geldinstitut

Konto => Plural Konten, Kontos, Konti

Spritzgussverfahren Variant of Spritzgußverfahren Variant of Spritzgießverfahren, LexInfo: same as Verfahren
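The Bank case can be sketched in a few lines of Python. This is a simplified illustration, not the proposed schema: the plural field stands in for a full inflection paradigm, the meaning field stands in for a link into the Meaning table, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lexem:
    lexem_id: int
    lemma: str
    plural: str   # stands in for a full inflection paradigm
    meaning: str  # stands in for a link into the Meaning table


# Two lexems share the lemma "Bank" but differ in plural and meaning.
LEXEMS = [
    Lexem(1, "Bank", "Bänke", "Sitzgelegenheit"),
    Lexem(2, "Bank", "Banken", "Geldinstitut"),
]


def meanings_of(form):
    """Return the meanings reachable from a lemma or a plural form."""
    return sorted({lx.meaning for lx in LEXEMS if form in (lx.lemma, lx.plural)})


print(meanings_of("Bank"))    # ['Geldinstitut', 'Sitzgelegenheit']
print(meanings_of("Bänke"))   # ['Sitzgelegenheit']
print(meanings_of("Banken"))  # ['Geldinstitut']
```

The singular stays ambiguous, while each plural form selects exactly one meaning, which is what a format with a single lemma-keyed root finds hard to express.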

Some advantages:

- possibility to inherit properties Leertaste => look for Taste, same LexInfo applies

- look up surface forms and analyse possibilities ("der" can be determiner Nom. Sing masc, Gen. Sing. fem, ... relative pronoun, ... demonstrative pronoun)

- clear distinction lemma/surface form/dict entry/meaning

- avoids duplications and reduces required disc space
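Two of the advantages above, inheriting properties and analysing surface forms, can be illustrated with a small Python sketch. It uses plain dictionaries instead of database tables, lists only a few of the readings of "der", and uses a naive "longest known ending" rule for inheritance. That rule is an assumption made for this example, not part of the proposal.

```python
# LexInfo per known lemma (hypothetical attribute names).
LEXINFO = {
    "Taste": {"language": "DE", "part_of_speech": "noun", "gender": "fem"},
    "Verfahren": {"language": "DE", "part_of_speech": "noun", "gender": "neuter"},
}

# Surface form -> possible analyses (deliberately incomplete).
SURFACE_FORMS = {
    "der": [
        ("determiner", "Nom. Sing. masc."),
        ("determiner", "Gen. Sing. fem."),
        ("relative pronoun", "Nom. Sing. masc."),
        ("demonstrative pronoun", "Nom. Sing. masc."),
    ],
}


def inherited_lexinfo(word):
    """Return the LexInfo of the longest known lemma that ends the word."""
    lowered = word.lower()
    candidates = [lemma for lemma in LEXINFO if lowered.endswith(lemma.lower())]
    if not candidates:
        return None
    return LEXINFO[max(candidates, key=len)]


def analyse(form):
    """Return every stored analysis of a surface form."""
    return SURFACE_FORMS.get(form, [])


print(inherited_lexinfo("Leertaste"))            # same LexInfo as Taste
print(inherited_lexinfo("Spritzgussverfahren"))  # same LexInfo as Verfahren
print(len(analyse("der")))                       # 4 readings stored here
```

A real implementation would need proper compound splitting, since a purely string-based ending match can pick the wrong head word.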

Of course there are some shortcomings/limitations:

1) Fragmentation of information into many tables

2) it is more a computational-linguistic approach than a human translator's approach

3) you have to be explicit about exactly what you want to link

4) it can end up being a tough job just to code an ambiguous word such as "key" (noun, adj, verb, different meanings, sometimes the same translation for different meanings, ...)

5) adding entries requires searching for existing entries, which is not trivial because there is no root element

Some points to think about:

a) Nom. Sing. => ein Abgeordneter/der Abgeordnete (i.e. Nom. Sing. is not a perfect InflForm)

b) das Mädchen => gender neuter, natural gender fem.

c) grüner Tee => not a series of lemmata "grün Tee" but a series of inflected forms.

d) hyphenation

Users would benefit most if they could download the complete database (or get it at marginal cost on CD). In that case, they could add their own tables or modify existing ones.

Your comments are welcome!

Wolfgang