Vortaro tables

From Meta, a Wikimedia project coordination wiki
Jump to navigation Jump to search

In a series of comments on Tables for Wiktionary, Polyglot announced a potential structure for a relational database to hold Wiktionary data.

Polyglot's comment[edit]


For a few years now I have been chewing on this collection of tables:

OpenOffice.org Drawing (desc) PDF-version (desc)

It represents a table structure to be used in PostgreSQL (since that supports inheritance). I also have a design where all the tables are relational, so it can be used in MySQL. I prefer PostgreSQL because it allows the use of stored procedures, but that's not relevant in this discussion.

I hope I can save people some time when creating a database schema for a multilingual dictionary.

I'm sure it can be improved upon, but quite a lot of thought has already gone into it. Polyglot 20:28, 6 Sep 2004 (UTC)

Here comes an explanation of how data would be stored into these tables:

Words go into the Wordlist table. Each spelling is only stored once. The WordDescriptions table is used to describe what kind of word it is in a given language and what its properties are. A word like are would be described several times for English (we are, they are, you are (twice once singular and once plural) and it's also a measurement in Dutch (acre), maybe in other languages as well.

There is also a field where the Validity is stored. This is a score for how many times this entry was verified.

Now we have to start from the other side to the concepts table. I see a concept as an idea. It can be illustrated by means of graphics from the images table or sounds (sound of a bird or a musical instrument). An example would be the combination father/papa. They stand for the same idea, but the meaning is different and it wouldn't be right to substitute/translate padre (es,it) as daddy. Another example would be words that stand for the same concept, but that are different in style. spoken/written; street language/common language/uptown language. Or layman's language and specialist's language. In the meanings table they are divided and it is possible to indicate the usage, the context and the actual number and gender they stand for. Consider Mädchen (girl), grammatically this is a neutral noun, but in actuality it stands for a female person.

Then we get to the dictionary table. Sometimes a meaning becomes several words in another language. I like to use the example of bathroom which becomes salle de bain in French. salle, de and bain are all words that are described separately. They are put together in the ExpressionParts table to be used as one of the translations of bathroom.

I was talking through Skype yesterday with Gerard and his opinion was that things could be simplified. I have been normalising to the extreme. It would be possible to, instead of having a WordList table, to have a table containing Expressions or even entire phrases, then describe those, if applicable. On the other side, maybe the difference between Concepts and meanings doesn't necessarily need to be made. All that needs to be discussed.

On the bottom left is a collection of lookup tables for languages, parts of speech, cases and verb tenses. I see all of these as concepts and that's how it becomes possible to retrieve their names in a given language (this can even be used for the interface to this dictionary).

On the top right are tables for the phonetic transcriptions. A transcription can be reused and it is possible to enter more than one transcription for a given word or expression.

In the relatedwords table it is possible to indicate which words are a conjugation or a declination of each other. Synonyms, antonyms and translations can be entered in the Dictionary table, or they can be deduced from the relations between Meanings and Dictionary.

In the context table it is possible to indicate which words often appear together in the same sentence. And also which concepts are related. This is to help choose the right WordDescription in case of ambiguity. The idea is to be able to use this dictionary for machine translation or as a translation memory. It is very well possible to enter entire sentences and their translations.

Gerard remarked I don't have any way of entering etymologies yet. This could be implemented as a free text field of the WordDescriptions table, where links to other words/terms can be made the Wiki way. The database doesn't know about those relations, but I don't think that's a big problem.

Another thing that is lacking is a way to keep track of the changes that people make. This is a though one. I tried to make up for it with those Validity fields. It would be possible to encode all the changes in an xml-format and keep track of them in a separate table. More or less the way the wiki works. With an intelligent xml-system it should be possible to selectively undo inappropriate changes to the database.

Polyglot 08:35, 7 Sep 2004 (UTC)

About etymologies: it would be nice to be able to find easily words with the same etymon in different languages. It can be interesting to learn that English /cheese/, Spanish /queso/ and German /Käse/ all derive from Latin /caseus/ which is also the source of the international scientific word /caseine/.