User:Millosh/Dictionaries

From Meta, a Wikimedia project coordination wiki

At the moment I am working on a Serbian dictionary of synonyms. During that work I got some ideas about work on Wiktionaries:

Let's say that one word with its synonyms/translations is enough for one entry in Wiktionary. (Maybe I should read some Wiktionary documentation, but I suppose that this is the minimum.)

In short, this may be done for dozens of languages on dozens of Wiktionaries.

Stage 1, one language dictionary

  • Take a dictionary between English (or any other language) and your language. Of course, take it in a machine-readable format (not encrypted).
  • Take the first word in (let's say) English.
  • Take its first translation in your language. Connect this word in your language with the other translations of the same English word.
  • Find which other English words have the same translation. Connect the word with the other translations of those words as well.
  • You will get a list of connected words. There will be a lot of mess, but you will be able to make some simple methods for cleaning most of it. The rest will be cleaned by humans because this is a wiki :)
  • Of course, you may do that with a lot of different dictionaries...
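The steps above could be sketched roughly like this in Python. This is only a minimal sketch: the word labels (A1, B1, ...) are placeholders, and the names `build_indexes` and `connected_words` are invented for the example. It assumes the dictionary has already been parsed into (language B word, language A word) pairs.

```python
from collections import defaultdict

# Placeholder (language B word, language A word) pairs, as they might come
# out of a machine-readable dictionary. These are not real dictionary data.
pairs = [
    ("B1", "A1"),
    ("B1", "A2"),
    ("B2", "A2"),
    ("B2", "A3"),
    ("B3", "A4"),
]


def build_indexes(pairs):
    """Build lookup tables in both directions from the raw pairs."""
    b_to_a = defaultdict(set)
    a_to_b = defaultdict(set)
    for b_word, a_word in pairs:
        b_to_a[b_word].add(a_word)
        a_to_b[a_word].add(b_word)
    return b_to_a, a_to_b


def connected_words(word, a_to_b, b_to_a):
    """Collect every language A word that shares a language B word with `word`."""
    connected = set()
    for b_word in a_to_b.get(word, ()):
        connected |= b_to_a[b_word]
    connected.discard(word)  # a word is not its own synonym candidate
    return connected


b_to_a, a_to_b = build_indexes(pairs)
print(sorted(connected_words("A2", a_to_b, b_to_a)))
```

The result is still the raw, uncleaned list of candidates; the cleaning methods mentioned above would work on top of it.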

Imagine that we analyzed two words from language A in the dictionary "language B -> language A" and that we got the following results (of course, this is a simplified table):

A58 - B65 - A58, A43, A21, A63
    - B69 - A58, A28, A21, A38
    - B71 - A58, A43, A21, A88
    - B89 - A58, A43, A21, A63

A21 - B31 - A21, A43, A76, A20
    - B44 - A21, A43, A39, A22
    - B65 - A58, A43, A21, A63
    - B69 - A58, A28, A21, A38
    - B71 - A58, A43, A21, A88
    - B89 - A58, A43, A21, A63

Let's say that every time a word from language A appears among the translations of a language B word which also translates to A58, its connection with A58 gets one point. The same goes for A21. So, we will have the following situation for the words A58 and A21:

A58(A21) = 4
A58(A43) = 3
A58(A63) = 2
A58(A28) = 1
A58(A38) = 1
A58(A88) = 1

A21(A43) = 5
A21(A58) = 4
A21(A63) = 2
A21(A28) = 1
A21(A38) = 1
A21(A88) = 1
A21(A76) = 1
A21(A39) = 1
A21(A20) = 1
A21(A22) = 1
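The counting can be checked with a short Python sketch. The data are copied from the simplified table above; the function name `score_connections` is invented for the example, and counting one point per shared language B word is the assumed reading of the rule.

```python
from collections import Counter

# Language B words and their language A translations, from the table above.
b_to_a = {
    "B31": ["A21", "A43", "A76", "A20"],
    "B44": ["A21", "A43", "A39", "A22"],
    "B65": ["A58", "A43", "A21", "A63"],
    "B69": ["A58", "A28", "A21", "A38"],
    "B71": ["A58", "A43", "A21", "A88"],
    "B89": ["A58", "A43", "A21", "A63"],
}


def score_connections(word, b_to_a):
    """Give one point to each other A word per B word it shares with `word`."""
    scores = Counter()
    for translations in b_to_a.values():
        if word in translations:
            scores.update(a for a in translations if a != word)
    return scores


scores_a58 = score_connections("A58", b_to_a)
scores_a21 = score_connections("A21", b_to_a)

for other, points in scores_a58.most_common():
    print(f"A58({other}) = {points}")
```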

To begin with, this may mean:

  • The closest synonym of the word A58 is the word A21.
  • The closest synonym of the word A21 is the word A43.
  • Words A21, A58, A43 and A63 are synonyms (a group which we may call "G(As)1").
  • It seems that words A28, A38, A88, A76, A39, A20 and A22 are not related to the group G(As)1. We will keep these connections in memory, but we will not write them into the dictionary. Imagine that the word blood literally means "red bird" in some language. Of course, there are some red birds in the area where that language is spoken. So, in this sense, blood will be connected with the word "bird" and, almost for sure, with some species of bird. However, this will be the only connection to birds. The other connections will be within the descriptions of erythrocyte, lymphocyte, heart and so on. Of course, mistakes are possible, but we may analyze the results :)
  • This may be very useful for smaller languages which have some two-language dictionaries (where language B is English). We may be able to generate one-language Wiktionaries for all such languages.
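One possible way to separate the group from the weak connections is a simple score threshold. The sketch below is only an assumption: the text does not fix a threshold, and the value `MIN_SCORE = 2` was chosen because it reproduces the group G(As)1 from this example. The function names are invented, and the sketch merges all analyzed words into a single group, which is a simplification.

```python
# Assumed threshold: connections with fewer points are kept only in memory.
MIN_SCORE = 2

# Scores copied from the lists above.
scores = {
    "A58": {"A21": 4, "A43": 3, "A63": 2, "A28": 1, "A38": 1, "A88": 1},
    "A21": {"A43": 5, "A58": 4, "A63": 2, "A28": 1, "A38": 1, "A88": 1,
            "A76": 1, "A39": 1, "A20": 1, "A22": 1},
}


def synonym_group(scores, min_score=MIN_SCORE):
    """Merge each analyzed word with its connections scoring >= min_score."""
    group = set()
    for word, connections in scores.items():
        strong = {other for other, points in connections.items()
                  if points >= min_score}
        if strong:
            group |= {word} | strong
    return group


def weak_connections(scores, group):
    """Connected words left outside the group (remembered, not published)."""
    seen = {other for connections in scores.values() for other in connections}
    return seen - group


group = synonym_group(scores)
weak = weak_connections(scores, group)
print(sorted(group))
print(sorted(weak))
```

With real dictionaries the threshold would probably need to depend on how many language B words each word has, so a fixed value is only a starting point.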

Stage 2, two languages dictionary

(To be continued.)

Stage 3, cross language dictionaries

(To be continued.)