Problems with server-side Chinese to Chinese translation in MediaWiki

From Meta, a Wikimedia project coordination wiki

Update: I am currently working on a list of two-character words for polygraphic characters

Problems with Chinese to Chinese conversion. (This is a relatively old essay. While the examples given are now rendered correctly, similar issues remain; thanks to helpful technical staff, they are easily resolved once they are caught.)



Example 1 -- 宋雅王后


Original Simplified version: 宋雅王后(Queen Sonja of Norway)名宋雅·哈拉尔森(Sonja Haraldsen)1937年7月4日生于 奥斯陆(Oslo),挪威国王哈拉尔五世的王后。她父母是卡尔·奥古斯特·哈拉尔森(Karl August Haraldsen)和达格妮·哈拉尔森(Dagny F. Ulrichsen)夫妇,1968年8月29日嫁与哈拉尔王储,成为挪威王太子妃。1991年1月17日哈拉尔继承王位后,成为王后陛下。

Conversion by MW: 宋雅王後(Queen Sonja of Norway)名宋雅·哈拉爾森(Sonja Haraldsen)1937年7月4日生於 奧斯陸(Oslo),挪威國王哈拉爾五世的王後。她父母是卡爾·奧古斯特·哈拉爾森(Karl August Haraldsen)和達格妮·哈拉爾森(Dagny F. Ulrichsen)夫婦,1968年8月29日嫁與哈拉爾王儲,成為挪威王太子妃。1991年1月17日哈拉爾繼承王位後,成為王後陛下。

This is incorrect: 王后 (queen) is repeatedly converted to 王後 (after the king), which is humorously wrong in 宋雅王後 (after King Sonja) and 成為王後陛下 (acceded to the throne after the king, rather than acceded to the throne to become queen).

MS Office's converter's rendering: 宋雅王后(Queen Sonja of Norway)名宋雅•哈拉爾森(Sonja Haraldsen)1937年7月4日生於 奧斯陸(Oslo),挪威國王哈拉爾五世的王后。她父母是卡爾•奧古斯特•哈拉爾森(Karl August Haraldsen)和達格妮•哈拉爾森(Dagny F. Ulrichsen)夫婦,1968年8月29日嫁與哈拉爾王儲,成為挪威王太子妃。1991年1月17日哈拉爾繼承王位後,成為王后陛下。

In this version, 后 is converted properly in all circumstances, including 位後 (after something), where 後 is the correct form.
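The difference between the two renderings can be illustrated with a minimal sketch, assuming a converter that tries phrase entries (longest match first) before falling back to a per-character table. The tables and the `convert` function below are hypothetical and tiny; they are not the actual data or code of MediaWiki or MS Office.

```python
# Hypothetical mapping tables for illustration only.
# Naive per-character Simplified -> Traditional mapping.
CHAR_TABLE = {"后": "後", "继": "繼", "为": "為"}
# Phrase entries take priority over the per-character mapping.
PHRASE_TABLE = {"王后": "王后"}


def convert(text, phrases, chars):
    """Convert text using longest-match phrase lookup, then single characters."""
    max_len = max((len(p) for p in phrases), default=1)
    out = []
    i = 0
    while i < len(text):
        # Try the longest possible phrase starting at position i first.
        for length in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + length]
            if chunk in phrases:
                out.append(phrases[chunk])
                i += length
                break
        else:
            # No phrase matched: fall back to the per-character table.
            out.append(chars.get(text[i], text[i]))
            i += 1
    return "".join(out)


print(convert("成为王后", {}, CHAR_TABLE))            # 成為王後 (wrong)
print(convert("成为王后", PHRASE_TABLE, CHAR_TABLE))  # 成為王后 (correct)
print(convert("继承王位后", PHRASE_TABLE, CHAR_TABLE))  # 繼承王位後 (correct)
```

With an empty phrase table every 后 becomes 後; a single phrase entry for 王后 is enough to fix this case without affecting 位後.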

Example 2 -- 沙漠


Another example: in the article "Desert", 沙漠气候干燥 (the desert climate is dry) is erroneously converted to 沙漠氣候干燥 (the desert climate interfere dry). The correct conversion is 沙漠氣候乾燥.

Interface conversion issues


There are also parts of the interface that are erroneously converted, or not converted at all.



  * 编辑本页 (edit page) remains unconverted, and it looks ugly to Traditional users.
  * 繁简转换 (Traditional-simplified conversion software) is unconverted.
  * 联系我们 (contact us) is unconverted.
  * 永久链接 (permalink) is unconverted.

上载文件 (upload file) is converted to 上載文件. Although this is technically a valid conversion (文件 can be converted to 文件 or 檔案, depending on context), it is wrong in this particular case because 上載文件 has the different meaning of "upload document" (the proper conversion would be 上載檔案).

Issues with errors


These are only a few examples of the many errors to be found, many of which are potentially confusing or, in some cases, make the article wrong.



Some statistics may be in order. I will give two quotes from Jack Halpern:

  1. "A number of surveys, such as [Xiandai 1986], have demonstrated that the 2000 most frequent SC characters account for approximately 97% of all characters occurring in contemporary SC corpora. Of these, 238 simplified forms, or almost 12%, are polygraphic; that is, they map to two or more traditional forms."
  2. "Some preliminary calculations based on our comprehensive Chinese lexical database, which currently contains approximately three million items, show that more than 20,000 of the approximately 97,000 most common SC word-units contain at least one polygraphic character, which leads to one-to-many SC-to-TC mappings. This represents an astounding 21%. A similar calculation for TC-to-SC mappings resulted in 3025, or about 3.5%, out of the approximately 87,000 most common TC word-units."



This means that approximately 20,000 individual entries are needed for a good conversion system. At present, to the best of my knowledge (perhaps Zhengzhu could correct me), there are fewer than 500 entries in the current database, and that count includes many proper names, which are not part of the figure of 20,000.

From this, it is clear that if we are serious about C2C conversion, we have a long way to go -- at least 19,500 more entries are needed.
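The arithmetic behind these figures can be checked with a short sketch. The numbers are taken from the quotes and estimates above; the variable names are only illustrative, and the quoted counts are themselves approximate ("more than 20,000", "fewer than 500").

```python
# Figures quoted above (approximate values from the essay and the Halpern quotes).
polygraphic_chars, frequent_chars = 238, 2_000
sc_polygraphic_words, sc_words = 20_000, 97_000
tc_polygraphic_words, tc_words = 3_025, 87_000
current_entries = 500  # upper bound on entries in the current database

char_share = round(polygraphic_chars / frequent_chars * 100, 1)
sc_share = round(sc_polygraphic_words / sc_words * 100)
tc_share = round(tc_polygraphic_words / tc_words * 100, 1)
missing_entries = sc_polygraphic_words - current_entries

print(char_share)       # 11.9 ("almost 12%")
print(sc_share)         # 21
print(tc_share)         # 3.5
print(missing_entries)  # 19500
```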

It will also be a very long time before ambiguous phrases are properly convertible (e.g. 阴干, which can convert to 陰乾 "to dry pickles" or 陰干 "even number"; or 编制, which can convert to 編制 "establish" or 編製 "make by knitting"); this will most likely require AI software. However, these are not nearly as big a problem as the other improper conversions.
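To see why such phrases cannot be settled by a lookup table alone, here is a minimal sketch that enumerates every Traditional candidate for a Simplified word. The two small tables and the `candidates` function are hypothetical and cover only the characters used in the examples above.

```python
from itertools import product

# Hypothetical tables for illustration only.
# Polygraphic characters: one Simplified form, several Traditional forms.
POLYGRAPHIC = {"干": ["干", "乾", "幹"], "制": ["制", "製"]}
# Characters with a single Traditional form.
ONE_TO_ONE = {"阴": "陰", "编": "編"}


def candidates(word):
    """Return every Traditional spelling the character tables allow for word."""
    options = [POLYGRAPHIC.get(ch, [ONE_TO_ONE.get(ch, ch)]) for ch in word]
    return ["".join(combo) for combo in product(*options)]


print(candidates("阴干"))  # ['陰干', '陰乾', '陰幹']
print(candidates("编制"))  # ['編制', '編製']
```

A word list can discard candidates that are not real words, but when more than one candidate is a valid word, as with 陰乾 and 陰干, only the surrounding context can pick the right one.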