Automatic conversion between simplified and traditional Chinese

From Meta, a Wikimedia project coordination wiki

This page describes the script conversion system currently implemented at the Chinese Wikipedia (zh). It has been running since December 23, 2004[1] and has generally been well received by the community there. There has been some interest in implementing similar conversion systems for other languages,[citation needed] and it is hoped that this page can shed some light on how to use the infrastructure that has been developed as a by-product of the Chinese conversion system.

For an essay expounding on some of the problems with the current conversion system, see Problems with server-side Chinese to Chinese translation in MediaWiki.

Besides Chinese Wikipedia, the Chinese Wiktionary, Wikiquote, and Wikibooks also use the conversion system. While Multilingual Wikisource hosted texts in all languages without a conversion system, handling simplified and traditional Chinese was a problem there. Chinese Wikisource opened in 2005, as part of the trend of opening language subdomains, and this solved the problem.

Multilingual sites like Meta, Wikimedia Commons, and Multilingual Wikisource have no automatic conversion between simplified and traditional Chinese. Multilingual Wikisource no longer accepts Chinese articles; they must be posted on Chinese Wikisource unless they are still copyright-restricted in Greater China.

Background

Current Chinese language speakers use two different writing systems – simplified and traditional. Traditional Chinese characters are used primarily in the Republic of China (Taiwan), Hong Kong, Macau, and the Chinese diaspora in North America and Indonesia. Simplified Chinese is used in Mainland China, Singapore, and the Chinese diaspora in Malaysia.

[Table: Sample differences of Chinese writing systems. Columns: Traditional, Simplified. The sample entries are not recoverable.]

Automatic conversion between them is critical for the future of Chinese Wikipedia. Existing conversion tools such as libiconv convert text between two different encodings (such as GB and BIG5), whereas the task here is to map between characters within the single Unicode character set, which includes both simplified and traditional Chinese characters. Most simplified and traditional characters map one-to-one, but about 100 pairs do not. For example, traditional Chinese has two different characters, '鬱' and '郁', with different meanings but the same pronunciation 'yù'; simplified Chinese merged them into the single character '郁'. In addition, Chinese sentences are written without spaces, so word boundaries are not marked, which makes parsing more complex.
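To make the one-to-many problem concrete, here is a minimal PHP sketch. It is not the code that Chinese Wikipedia runs: the table contents and the names $s2tSketch and toTraditionalSketch are assumptions made for illustration. It relies on the documented behaviour of PHP's strtr(), which tries the longest keys first when given an array, so a phrase entry can override the character-level default.

```php
<?php
// Hypothetical simplified-to-traditional table (illustrative, far from complete).
// Single characters give the default mapping; longer phrase entries resolve
// cases where one simplified character has several traditional forms.
$s2tSketch = [
    '忧'   => '憂',
    '浓'   => '濃',
    '忧郁' => '憂鬱', // in this word, 郁 stands for traditional 鬱
    // No entry for a bare 郁: by default it is left unchanged.
];

function toTraditionalSketch(string $text, array $table): string {
    // strtr() with an array tries the longest keys first and never
    // re-scans text it has already replaced.
    return strtr($text, $table);
}

echo toTraditionalSketch('忧郁', $s2tSketch), "\n"; // 憂鬱
echo toTraditionalSketch('浓郁', $s2tSketch), "\n"; // 濃郁
```

A phrase table like this cannot anticipate every context, because word boundaries are not marked in Chinese text. That is why some cases still need to be resolved by hand.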

A fully automatic conversion would require some AI features and would be difficult to build. Instead, a new kind of markup can be introduced so that the ambiguous cases are resolved by hand.

Implementing similar conversion systems for other languages

In MediaWiki 1.4

Most conversion-related functions are implemented in the language object.[citation needed] For Chinese, this is in LanguageZh.php. At a minimum, the language object should override the following methods[citation needed] to support multiple variants:

  • getVariants()
function getVariants() {
    return array('zh', 'zh-cn', 'zh-tw', 'zh-sg', 'zh-hk');
}

This function returns an array of language variants that are supported. There should exist language files for the supported variants.

  • getPreferredVariant()
function getPreferredVariant()

This function should return the user's preferred variant.

  • convert()
function convert( $text, $isTitle=false )

This function implements the actual conversion of the text. The flag $isTitle signals whether the input text is the article title; in Chinese, titles are sometimes converted a bit differently. This function is called near the end of the parser (Parser.php).[citation needed] Anything that is parsed by the parser will be converted. Note that the parsing of the manual markup -{}- is also done in this function. This should probably be separated out so that other languages can reuse it easily.[citation needed]
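The following PHP sketch shows one way the manual markup could be handled separately from the automatic conversion. It is not the actual convert() implementation. It assumes two forms of the markup: -{text}- leaves text unconverted, and -{zh-cn:text;zh-tw:text}- selects text by variant. The function names convertSketch and pickVariantSketch are hypothetical, and the $isTitle flag is ignored.

```php
<?php
// Hypothetical helper: choose the text for $variant from "code:text;code:text".
function pickVariantSketch(string $inner, string $variant): string {
    $rules = [];
    foreach (explode(';', $inner) as $rule) {
        $pair = explode(':', $rule, 2);
        // Treat "code:text" as a rule only if the code looks like a zh variant.
        if (count($pair) === 2 && preg_match('/^zh(-[a-z]+)?$/', trim($pair[0]))) {
            $rules[trim($pair[0])] = trim($pair[1]);
        }
    }
    if ($rules === []) {
        return $inner; // plain -{text}-: emit the text unchanged
    }
    // Assumption: fall back to the first rule when the variant is not listed.
    return $rules[$variant] ?? reset($rules);
}

// Hypothetical converter: text outside -{ }- goes through the table,
// text inside is resolved by the markup rules only.
function convertSketch(string $text, string $variant, array $table): string {
    // With DELIM_CAPTURE, even indexes are outside the markup, odd are inside.
    $parts = preg_split('/-\{(.*?)\}-/su', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    $out = '';
    foreach ($parts as $i => $part) {
        $out .= ($i % 2 === 0)
            ? strtr($part, $table)
            : pickVariantSketch($part, $variant);
    }
    return $out;
}

echo convertSketch('-{zh-cn:计算机;zh-tw:電腦}-很浓', 'zh-tw', ['浓' => '濃']), "\n"; // 電腦很濃
echo convertSketch('-{浓}-浓', 'zh-tw', ['浓' => '濃']), "\n"; // 浓濃
```

Keeping the markup parsing in its own function, as here, is the kind of separation that would let another language supply only its own table and variant codes.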

  • getExtraHashOptions()
function getExtraHashOptions() {
    $variant = $this->getPreferredVariant();
    return '!' . $variant;
}

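This function returns a string identifying the user's preferred language variant: the variant code prefixed with '!'. It is used for caching, so that output rendered for one variant can be told apart from output rendered for another.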
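As a rough illustration of why the variant belongs in the cache key, the sketch below appends the string returned by getExtraHashOptions() to a page title. The array $cacheSketch and the function renderCacheKeySketch are hypothetical stand-ins and do not reflect how MediaWiki's own cache builds its keys.

```php
<?php
// Hypothetical stand-in for a rendered-page cache; not MediaWiki's parser cache.
function renderCacheKeySketch(string $pageTitle, string $extraHashOptions): string {
    // Appending the variant suffix keeps renderings of the same page apart.
    return $pageTitle . $extraHashOptions;
}

$cacheSketch = [];
$cacheSketch[renderCacheKeySketch('Main_Page', '!zh-cn')] = 'simplified rendering';
$cacheSketch[renderCacheKeySketch('Main_Page', '!zh-tw')] = 'traditional rendering';

echo count($cacheSketch), "\n"; // 2
```

Without the suffix, both renderings would share one key and a reader could be served the wrong script.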

Note that some of this code is not Chinese-specific, for example getExtraHashOptions(), getPreferredVariant(), and the parsing code for the manual markup in convert(). Some refactoring can be done if and when there is a need to implement conversion systems for other languages.[citation needed]
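One possible shape for such a refactoring is sketched below, purely as an illustration. The class names VariantLanguageSketch and ZhSketch are hypothetical, and so is the rule of falling back to the first supported variant. The idea is that the variant-independent methods live in a base class, and each language supplies only its variant list and its conversion.

```php
<?php
// Hypothetical base class holding the parts that are not language-specific.
abstract class VariantLanguageSketch {
    private string $requested;

    public function __construct(string $requestedVariant) {
        $this->requested = $requestedVariant;
    }

    abstract public function getVariants(): array;
    abstract public function convert(string $text, bool $isTitle = false): string;

    public function getPreferredVariant(): string {
        $variants = $this->getVariants();
        // Assumption: fall back to the first supported variant.
        return in_array($this->requested, $variants, true)
            ? $this->requested
            : $variants[0];
    }

    public function getExtraHashOptions(): string {
        return '!' . $this->getPreferredVariant();
    }
}

// Language-specific part: only the variant list and the conversion itself.
class ZhSketch extends VariantLanguageSketch {
    public function getVariants(): array {
        return ['zh', 'zh-cn', 'zh-tw', 'zh-sg', 'zh-hk'];
    }

    public function convert(string $text, bool $isTitle = false): string {
        return $text; // conversion tables omitted in this sketch
    }
}

echo (new ZhSketch('zh-tw'))->getExtraHashOptions(), "\n"; // !zh-tw
```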

The code for supporting customizable conversion tables is a bit messy currently.[citation needed] Code cleanup is underway and its usage will be updated here shortly.[citation needed]

In MediaWiki 1.5+

Since April 2005, there has been a LanguageConverter object that encapsulates most conversion-related functionality. A conversion system for the Serbian language is being actively worked on, and a preliminary test site was set up for it.

There was also a test site with conversion support for English, set up just for fun. (All links to these test sites are now broken.)

See also

Reference links - Chinese