The Chinese writing system is based on logograms, where each character has its own meaning. There are well over 50,000 Chinese characters, but only 3,000 to 5,000 are commonly used in daily written communication. Chinese is the most widely spoken language in the world when measured by the number of speakers, with mainland China (PRC) alone contributing a population of over 1.3 billion. An overview of the Chinese language can be found at .
Written Chinese has evolved slowly over its long history. Because complex meanings can usually be expressed by a combination of characters, new characters were introduced only when absolutely necessary. However, a major fork of the written language occurred during the 1950s, after the end of the civil war in China. A whole new set of Simplified characters was introduced in mainland China (PRC), in the hope of making the written language easier to learn and thereby improving the literacy rate of the population. Whether it served this purpose is debatable. To date, Simplified writing is the official written language in the PRC and Singapore; in Malaysia, it is the official writing system used when the Chinese language is taught. The original writing system, commonly referred to as Traditional writing, is mainly used in Taiwan (ROC), Malaysia, Hong Kong (PRC), and Macau (PRC). Both systems are widely used by smaller Chinese communities throughout the world.
In addition to the differences at the character level, there are also differences at the word and phrase level among the regions. For example, foreign words are commonly translated into Chinese using a combination of Chinese characters that, when pronounced, mimic the pronunciation of those words in the original language. However, there are arbitrarily many ways of choosing Chinese characters that give similar pronunciations. Therefore the same non-Chinese word may have different Chinese translations in different Chinese-speaking regions. There are also regional variations that reflect different cultures and customs.
People who were educated in one writing system can usually understand text written in the other, but not without significant effort. For example, reading articles written in the other system can be much slower. Misunderstandings also occur from time to time because of regional variation in words and phrases. An automatic system that converts between the two writing systems would therefore greatly improve the quality of communication within the Chinese community.
The Simplified character set contains an estimated 2,500 characters that are written differently from their Traditional counterparts. In most cases, there is a one-to-one relation between a Simplified character and a Traditional character. There are, however, about 200 commonly used Simplified characters that have multiple corresponding Traditional forms depending on their meanings. This is one difficulty in developing an automatic conversion system. Current natural language processing techniques either cannot guarantee a high accuracy rate or are computationally too expensive to deploy in large-scale projects such as Wikipedia. Another difficulty is accounting for regional variation, which evolves quickly and unpredictably. As with any living language, monitoring new words and phrases is a difficult task.
In this paper, we describe a semi-automatic system that employs the power of a wiki environment to accomplish the conversion between Traditional and Simplified Chinese. The automatic part of the system performs conversion using a mapping table. The wiki elements are the ability of end users to change the mapping table, and the ability to manually specify correct conversions within the wiki text using a newly introduced markup. Combining these two features, the Chinese Wikipedia is able to deliver content that is tailored to a user's specific language preference.
The conversion to a user-specified Chinese variant happens when a page is rendered. The system does not detect which language variant the page is written in. In fact, a page can be written in mixed variants. (This requires that an editor can read and understand the mixed source text. For educated Chinese readers, this is generally no more difficult than reading and understanding the markup for a table, say.) The conversion is performed using a mapping table, and is implemented using the PHP function strtr(). A large portion of the mapping table consists of one-to-one character mappings between the two writing systems. If a character can map to multiple characters in the other writing system, common phrases containing that character are added to the mapping table.
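The table lookup described above can be sketched as follows. This is a minimal Python illustration, not the MediaWiki code: it assumes that, at each position, the longest matching table entry wins and that replaced text is not scanned again (which is how PHP's strtr() behaves when given an array). The function name convert and the tiny table S2T are hypothetical and exist only for this example.

```python
def convert(text, table):
    """Replace substrings of `text` using `table`, preferring the longest match.

    At each position the longest key that matches is applied, and the output
    of a replacement is never re-scanned. Characters with no entry are copied
    through unchanged.
    """
    if not table:
        return text
    max_len = max(len(key) for key in table)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in table:
                out.append(table[piece])
                i += n
                break
        else:
            # No entry starts here: keep the character as-is.
            out.append(text[i])
            i += 1
    return "".join(out)


# Hypothetical Simplified-to-Traditional fragment. The Simplified character
# "发" corresponds to either "發" or "髮", so a default single-character
# mapping is combined with phrase entries for the other reading.
S2T = {
    "发": "發",      # default one-to-one entry
    "头": "頭",
    "头发": "頭髮",  # phrase entry overrides the default for "发"
    "理发": "理髮",
}

print(convert("发展", S2T))    # default mapping applies
print(convert("头发", S2T))    # phrase entry applies
print(convert("理发店", S2T))  # phrase entry applies inside a longer string
```

Because phrase entries are longer than single-character entries, they take precedence wherever they occur, which is what lets a flat table handle many of the one-to-many cases.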
Initial conversion tables
The initial character-to-character mapping is extracted from the Unihan database, version 4.1, distributed by the Unicode Consortium. Phrases containing characters with one-to-many mappings are extracted from several open source Chinese input systems, including SCIM and libtabe. Scripts for extracting these initial mappings have been included in the MediaWiki package since version 1.4.0.
Customizable conversion tables
The initial table constructed above contains some errors and is far from complete. It contains few mappings for regional variations, for example. Unfortunately, there are no generally available standard conversion tables that are either accurate or complete. The zh community therefore decided to provide a system that allows users to modify the conversion tables, and such a system was later implemented. The customizable conversion tables are stored as regular wiki pages in the MediaWiki: namespace, in a simple format that allows conversion rules to be defined.
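To make the idea of a user-editable table concrete, here is a small Python sketch. The page format is not specified in this paper, so the one-rule-per-line "source => target" syntax with "#" comments is purely an assumption for illustration, as are the names parse_rules and merge_tables. The sketch shows community rules being layered on top of the initial table so that they override or extend it.

```python
def parse_rules(page_text):
    """Parse a hypothetical rule page: one 'source => target' rule per line.

    Text after '#' is treated as a comment; lines without '=>' are ignored.
    """
    rules = {}
    for line in page_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "=>" not in line:
            continue
        source, target = (part.strip() for part in line.split("=>", 1))
        if source:
            rules[source] = target
    return rules


def merge_tables(base, overrides):
    """Return a new table in which user-supplied rules take precedence."""
    merged = dict(base)
    merged.update(overrides)
    return merged


# Hypothetical page content with two regional vocabulary rules.
page = """
# Regional vocabulary, Simplified (mainland) to Traditional (Taiwan)
软件 => 軟體
信息 => 資訊   # "information"
"""

initial_table = {"软": "軟", "件": "件", "信": "信", "息": "息"}
table = merge_tables(initial_table, parse_rules(page))
print(table["软件"])
```

Keeping the user rules separate from the extracted table means the initial table can be regenerated from its sources without losing community corrections.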
Under current zh community policy, any regular user can request changes to the conversion table, and sysops edit the corresponding wiki pages to implement the changes.
Because of the inherently dynamic and ambiguous nature of the language, the automatic conversion system described above cannot provide perfect conversion. First, it is nearly impossible to construct a "perfect" conversion table because the language keeps changing. New words and phrases are introduced constantly, and there will always be a delay before they are added to the conversion table. Second, Chinese writing does not mark word boundaries, so a sentence can often be parsed into words in more than one way, and different parses may result in different conversions. Therefore, even with a "perfect" mapping table, perfect conversion requires understanding the semantics of a sentence. This is a tremendous challenge facing the natural language processing community.
For these reasons, a manual markup was implemented that allows the correct conversion to be specified within the wiki text. This allows any editor of Wikipedia to correct errors made by the automatic system.
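The interplay between manual markup and automatic conversion can be sketched in Python as follows. The markup syntax is not given in this paper, so the inline form -{variant:text; variant:text}- and the variant names zh-hans and zh-hant used here are assumptions for illustration only, and auto_convert is a deliberately simplified per-character stand-in for the table-driven converter. Text inside the markup is taken from the editor; everything else goes through the automatic conversion.

```python
import re

# Hypothetical inline markup: -{zh-hans:...; zh-hant:...}- or -{verbatim}-
MARKUP = re.compile(r"-\{(.*?)\}-", re.S)


def auto_convert(text, table):
    """Simplified stand-in for the automatic converter (per character only)."""
    return "".join(table.get(ch, ch) for ch in text)


def pick_variant(body, variant):
    """Choose the editor-supplied text for `variant` from a markup body."""
    options = {}
    for part in body.split(";"):
        if ":" in part:
            name, value = part.split(":", 1)
            options[name.strip()] = value.strip()
    if not options:
        # No variants listed: the text is emitted verbatim, unconverted.
        return body
    # Fall back to the first listed option if the variant is missing.
    return options.get(variant, next(iter(options.values())))


def render(text, variant, table):
    """Convert `text` for `variant`, honoring manual markup spans."""
    out = []
    pos = 0
    for match in MARKUP.finditer(text):
        out.append(auto_convert(text[pos:match.start()], table))
        out.append(pick_variant(match.group(1), variant))
        pos = match.end()
    out.append(auto_convert(text[pos:], table))
    return "".join(out)


s2t = {"写": "寫", "头": "頭", "发": "發"}
source = "我用-{zh-hans:计算机; zh-hant:電腦}-写-{头发}-"
print(render(source, "zh-hant", s2t))
print(render(source, "zh-hans", {}))
```

In the example, the word for "computer" is supplied by the editor for each variant, the character "写" is converted automatically, and the second markup span protects "头发" from conversion.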
Following the upgrade to version 1.4 of the MediaWiki software, the conversion system was deployed at all Chinese Wikimedia sites, including Wikipedia, Wikibooks, Wiktionary and Wikiquote. Since the main activity of the Chinese community is currently focused on Wikipedia, it is also the main source of feedback. So far the feedback from both Simplified and Traditional Chinese users has been generally positive.
Because manual conversion involves modifying the source text of an article, its widespread use was discouraged at the beginning to allow more thorough testing of the related code. The focus of the community has therefore been on improving the conversion tables. As more and more users request changes to the conversion table, processing these requests has become an increasingly tiresome task for the sysops. It is therefore likely that the Chinese community will adopt a policy of encouraging the use of the manual markup in the near future, especially once the related code has been tested fairly extensively. There is a plan to develop additional software to collect such manual markup and use it to semi-automatically update the conversion tables.
Conversion system for other languages
Other languages have conversion needs similar to those of Chinese. For example, the Serbian language can be written in two scripts, Cyrillic and Latin. A similar approach can be implemented to support conversion for such languages. Work is underway to modularize the Chinese conversion system and restructure the code to provide an easy interface for supporting conversion in other languages.
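As a rough illustration of how the same table-driven approach could carry over to Serbian, the Python sketch below maps between the standard Serbian Cyrillic and Latin alphabets. It is not part of the system described here, and the function names to_latin and to_cyrillic are hypothetical. The Cyrillic-to-Latin direction is a per-character lookup; the reverse direction needs longest-match handling because three Latin letters are digraphs (lj, nj, dž).

```python
# Standard Serbian Cyrillic-to-Latin letter correspondences (lowercase).
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}
LAT_TO_CYR = {lat: cyr for cyr, lat in CYR_TO_LAT.items()}


def to_latin(text):
    """Cyrillic to Latin: one lookup per character, preserving capitals."""
    out = []
    for ch in text:
        lower = ch.lower()
        if lower in CYR_TO_LAT:
            lat = CYR_TO_LAT[lower]
            # Capital digraphs are written in title case here, e.g. "Lj".
            out.append(lat.capitalize() if ch != lower else lat)
        else:
            out.append(ch)
    return "".join(out)


def to_cyrillic(text):
    """Latin to Cyrillic: try two-letter digraphs before single letters."""
    out = []
    i = 0
    while i < len(text):
        for n in (2, 1):
            piece = text[i:i + n]
            cyr = LAT_TO_CYR.get(piece.lower())
            if cyr is not None and len(piece) == n:
                out.append(cyr.upper() if piece[0].isupper() else cyr)
                i += n
                break
        else:
            # Not a Serbian Latin letter (digit, punctuation, q, w, ...).
            out.append(text[i])
            i += 1
    return "".join(out)


print(to_latin("Београд"))
print(to_cyrillic("Ljubav"))
```

The greedy digraph rule is only a heuristic: where a Latin letter pair actually represents two separate letters, it would be misconverted, so exception entries or manual markup would still be needed, much as with the one-to-many characters in Chinese.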
Towards a more intelligent conversion system
A unified Chinese writing system is not likely to emerge any time soon, so conversion systems will continue to be needed to facilitate communication between the different Chinese-speaking regions. The system described in this paper exploits the wiki nature of the Wikimedia projects, and uses the power of the human community to provide language-specific content. For non-wiki environments, more intelligent automatic systems are needed.
The Chinese Wikipedia, with its user-customizable conversion tables and manual conversion markup, can provide a rich dataset for research toward such intelligent conversion systems. In particular, the manual markup for special cases that cannot be handled by the automatic conversion will be most valuable. One item of future work is to provide a clean interface so that such data can be easily extracted from the Chinese Wikipedia.
We have presented an outline of the Chinese conversion system currently running at various Wikimedia projects. The system performs simple automatic conversion with help from users in a wiki fashion. The system is at an early stage, and much work is needed to provide better service. Nevertheless, it has proven useful in actual deployment.