Research:Ideas/How much repetitive text is there in Wikipedias in different languages? Can it be useful for translation memory?

From Meta, a Wikimedia project coordination wiki


This page documents a proposed research project.
Information may be incomplete and may change before the project starts.

Intuitively, Wikipedia seems to contain many identical or near-identical sentences, such as "He was born in <year>" or "<City name> is a city in <country>". Plenty of sentences are unique to each article, but the recurring ones could be useful as translation memory: they could feed extensions like ContentTranslation and Translate, and be exported to other friendly projects such as OmegaWiki and Moses. A measured figure, rather than one based on intuition, would help inform decisions about the development of these products.

Support needed

Needed: data mining of full article texts to find identical and similar sentences. Essentially the same algorithm can be run for every language. In some languages, such as Chinese or Thai, the rules for sentence and word segmentation differ, but apart from that step the algorithm can be reused across all our languages.
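
As a rough illustration of the kind of measurement described above, here is a minimal sketch in Python. It is not a proposed implementation: the function names (split_sentences, normalize, repetition_ratio) are hypothetical, the sentence splitter is a naive punctuation-based one that would not work for languages such as Chinese or Thai, and "similar" is approximated only by masking digits, so that sentences differing in a year count as the same template. Names of people and places are not masked here.

```python
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (assumption): a sentence ends at ., ! or ? followed by whitespace.
    # Languages with different segmentation rules need their own splitter.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def normalize(sentence):
    # Lowercase, mask digit runs, and collapse whitespace so that
    # "He was born in 1950." and "He was born in 1872." share one template.
    s = sentence.lower()
    s = re.sub(r"\d+", "<num>", s)
    s = re.sub(r"\s+", " ", s)
    return s.strip()


def repetition_ratio(articles):
    # Count how often each normalized sentence occurs across all articles.
    counts = Counter()
    for text in articles:
        for sentence in split_sentences(text):
            counts[normalize(sentence)] += 1
    total = sum(counts.values())
    # A sentence is "repeated" if its template occurs more than once.
    repeated = sum(c for c in counts.values() if c > 1)
    ratio = repeated / total if total else 0.0
    return ratio, counts
```

Calling repetition_ratio on a list of article texts returns the share of sentences whose template occurs more than once, together with the per-template counts; counts.most_common() then lists the most frequent templates, which would be the natural candidates for translation memory entries.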


