Automatic conversion between simplified and traditional Chinese

From Meta, a Wikimedia project coordination wiki

This page describes the writing system conversion currently implemented at the Chinese Wikipedia (zh). It has been running since December 23, 2004[1] and has generally been well received by the community there. There has been some interest in implementing similar conversion systems for other languages, and it is hoped that this page can shed some light on how to reuse the infrastructure that was developed as a by-product of the Chinese conversion system.

For an essay expounding on some of the problems with the current conversion system, see Problems with server-side Chinese to Chinese translation in MediaWiki.

In addition to Chinese Wikipedia, Chinese Wiktionary, Wikiquote, and Wikibooks also use the conversion system. While Multilingual Wikisource hosted texts in all languages without the conversion system, it was unclear how to handle simplified and traditional Chinese there. Once Chinese Wikisource was opened in 2005, as part of the trend of opening language subdomains, the problem was solved.

Multilingual sites like Meta, Wikimedia Commons, and Multilingual Wikisource have no automatic conversion between simplified and traditional Chinese. Multilingual Wikisource no longer accepts Chinese articles, so they have to be posted on Chinese Wikisource.

Background

Chinese speakers currently use two different writing systems: simplified and traditional. Traditional Chinese characters are used primarily in the Republic of China (Taiwan), Hong Kong, Macau, and the Chinese diaspora in North America. Simplified Chinese is used in Mainland China and Singapore. Chinese written in Malaysia, though it has no official status there, is commonly simplified as well.

Sample differences of Chinese writing systems
Traditional Simplified

Automatic conversion between them is critical for the future of Chinese Wikipedia. Existing conversion tools such as libiconv convert characters between two different encodings (such as GB and BIG5), but the problem for Chinese is mapping between characters inside the Unicode character set, which includes both simplified and traditional Chinese characters. For most simplified and traditional Chinese characters there exists a one-to-one mapping, but about 100 pairs of simplified and traditional characters are not one-to-one. In addition, because there is no space between adjacent words in a Chinese sentence, parsing is more complex.

A completely automatic conversion would need some AI features, and would be difficult. Instead, we can introduce some kind of new markup to solve the problem.

Technical description

Conversion

The conversion implemented at Chinese Wikipedia is rather simple, without employing fancy artificial intelligence (AI) features. Instead, it exploits the wiki nature of Wikipedia and other related projects to build community knowledge, as discussed later.

The conversion happens when a page is rendered. It doesn't detect what language variant the page is written in. In fact, a page can be written in mixed variants.

It currently supports seven language variants:

  • zh (no conversion)
  • zh-hans (generic simplified)
  • zh-hant (generic traditional)
  • zh-cn (Mainland China; simplified)
  • zh-tw (Taiwan; traditional)
  • zh-hk (Hong Kong and Macau; traditional)
  • zh-sg (Singapore and Malaysia; simplified)

Accordingly, the system maintains a separate conversion table for each variant that requires conversion. Each table consists of simple mapping rules. For example, in the zh-hans table, there may be a rule:


國 => 国

This rule says that, for display in zh-hans, the character 國 (meaning "country"), which is the traditional form, should be converted to 国, the simplified form. To accommodate different usages of the same concept in different Chinese variants, the table also allows rules for converting phrases, for example:

电脑 => 计算机

This rule says that for displaying in zh-cn, any occurrence of the phrase 电脑, the common translation for "computer", or literally "electronic brain" in Taiwan and Hong Kong, should be converted to 计算机, or "recording and counting device," which is the common translation used in Mainland China.

The initial conversion tables were created semi-automatically. The Unihan database was used to create mappings between characters in Simplified and Traditional Chinese. Three phrase dictionaries bundled with SCIM and libtabe were used to create the initial phrase mapping tables.

With this approach, phrases that should never be converted can also be specified in the customizable table, by mapping the phrase to itself. For example:

电脑 => 电脑

This will prevent the phrase 电脑 from being converted.

These rules are stored in an associative array, and the conversion is done with the PHP function strtr(). This approach does lead to problems with ambiguous word segmentation in Chinese, which the system currently does not handle.
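The table lookup described above can be sketched in a few lines of PHP. The table contents, the variable names, and the helper convertText() are illustrative assumptions, not the actual contents of ZhConversion.php. The sketch relies only on the documented behaviour of strtr() with an array argument: the longest keys are tried first, and text that has already been replaced is not searched again.

```php
<?php
// Hypothetical zh-cn table mixing single-character and phrase rules.
$defaultTable = [
    '國'   => '国',
    '電'   => '电',
    '腦'   => '脑',
    '電腦' => '计算机', // phrase rule; preferred over the two character rules
];

// Hypothetical community override: map a phrase to itself to protect it.
$customTable = [
    '電腦' => '電腦',
];

// Hypothetical helper: apply one conversion table to a piece of text.
function convertText(string $text, array $table): string
{
    // Guard against an empty table so the text is returned unchanged.
    return $table === [] ? $text : strtr($text, $table);
}

// Default conversion: the longest matching key wins at each position.
echo convertText('中國的電腦', $defaultTable), "\n";

// With the override merged in (later string keys replace earlier ones),
// the protected phrase is left alone while other rules still apply.
echo convertText('中國的電腦', array_merge($defaultTable, $customTable)), "\n";
```

Because matching is purely greedy from left to right, a phrase rule can still fire across a real word boundary, which is the segmentation problem mentioned above.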

The code implementing these functions is available at trunk/phase3/includes/ZhConversion.php and trunk/phase3/languages/ in MediaWiki's SVN repository.

Conversion tables

The above approach relies on the accuracy and completeness of the conversion tables. Unfortunately, no generally available standard conversion table is either accurate or complete. The Chinese Wikipedia community therefore decided to provide a system that allows users to modify the conversion tables, and such a system was later implemented. The customisable conversion tables are stored as regular wiki pages in the MediaWiki: namespace, in a simple format that allows conversion rules to be defined.

Under current community policy, any regular user can propose changes to the conversion tables, and sysops edit the corresponding wiki pages to implement the changes.

Conversion table links:

Wiki linking

A common problem in the Chinese Wikipedia is that an article title can be rendered differently under different variants, which causes confusion when users of different variants link to the same article. The current implementation attempts to resolve links across variant forms. For example, suppose some word "foo" in zh-cn is equivalent to "bar" in zh-tw. If someone writes an article titled "Foo", zh-tw users will see the title "Bar" when the article is rendered. Further, when a zh-tw user writes a link [[bar]], it is linked to the article "Foo" automatically.
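The lookup described above might be sketched as follows. This is a minimal sketch under stated assumptions: the function name resolveVariantLink(), the in-memory list of existing titles, and the toy "Foo"/"Bar" tables are all hypothetical, and titles are assumed to be already normalized (for example, first letter capitalized). It is not the actual MediaWiki code.

```php
<?php
/**
 * Hypothetical helper: resolve a link target by trying the title as
 * written, then its converted form under each variant table.
 * Returns the matching existing title, or null if none is found.
 */
function resolveVariantLink(string $target, array $existingTitles, array $variantTables): ?string
{
    // The title as written takes priority.
    if (in_array($target, $existingTitles, true)) {
        return $target;
    }
    foreach ($variantTables as $table) {
        if ($table === []) {
            continue; // nothing to convert with
        }
        $candidate = strtr($target, $table);
        if (in_array($candidate, $existingTitles, true)) {
            return $candidate;
        }
    }
    return null;
}

// Toy tables mirroring the "foo"/"bar" example in the text.
$tables = [
    'zh-cn' => ['Bar' => 'Foo'],
    'zh-tw' => ['Foo' => 'Bar'],
];
$titles = ['Foo'];

var_dump(resolveVariantLink('Bar', $titles, $tables));
```

Checking the literal title first keeps ordinary links cheap; the variant lookups only run when the literal title does not exist.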

Manual markup

Sometimes a conversion rule is needed only in rare situations. Putting such rules in the customizable tables incurs unnecessary cost when processing other pages. Manual markup was therefore implemented to allow conversion rules to be specified directly in the wiki text. The general syntax of the markup is as follows:

-{language_code1: text1; language_code2: text2; ...}-

For example:

-{zh-cn: 计算机; zh-tw: 电脑}-

This markup can also be used to specify that some text should not be converted at all, by not supplying any language code:

-{text not to be converted}-
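As an illustration of the two forms above, the following sketch resolves the markup for one target variant. The function name renderManualMarkup(), the regular expressions, and the fallback to the first listed alternative when the reader's variant is missing are assumptions made for this example; flags such as A| and T| are ignored here, and a real pipeline would also have to shield the result from the table-based conversion.

```php
<?php
/**
 * Hypothetical helper: replace each -{...}- span with the text chosen
 * for $variant, or with the raw text when no language code is given.
 */
function renderManualMarkup(string $text, string $variant): string
{
    $result = preg_replace_callback(
        '/-\{(.*?)\}-/su',
        function (array $m) use ($variant): string {
            // Collect "code: text" pairs separated by semicolons.
            $alternatives = [];
            foreach (explode(';', $m[1]) as $part) {
                if (preg_match('/^\s*(zh(?:-[a-z]+)?)\s*:\s*(.*?)\s*$/su', $part, $p)) {
                    $alternatives[$p[1]] = $p[2];
                }
            }
            if ($alternatives === []) {
                // No language code: the text is emitted as written.
                return $m[1];
            }
            // Assumption: fall back to the first alternative listed.
            return $alternatives[$variant] ?? reset($alternatives);
        },
        $text
    );
    return $result ?? $text; // keep the input if the regex engine fails
}

echo renderManualMarkup('-{zh-cn: 计算机; zh-tw: 电脑}-', 'zh-tw'), "\n";
echo renderManualMarkup('-{text not to be converted}-', 'zh-cn'), "\n";
```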

A conversion rule can be declared to apply throughout the article, using the following syntax:

-{A|zh-cn:foo; zh-tw: bar}-

Any "foo" or "bar" in the remainder of the text will be converted according to this rule. (This is in CVS HEAD but not yet in 1.4.)
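A rough sketch of how such an article-wide rule could be collected and applied is shown below. The function name applyArticleRules() is hypothetical, and the sketch makes two simplifying assumptions that the text above does not confirm: the marker itself is rendered as the reader's form, and the rule is applied to the whole text rather than only to the part after the marker.

```php
<?php
/**
 * Hypothetical helper: collect -{A|...}- rules and apply them to the text.
 */
function applyArticleRules(string $text, string $variant): string
{
    $table = [];
    $stripped = preg_replace_callback(
        '/-\{A\|(.*?)\}-/su',
        function (array $m) use (&$table, $variant): string {
            $alternatives = [];
            foreach (explode(';', $m[1]) as $part) {
                if (preg_match('/^\s*(zh(?:-[a-z]+)?)\s*:\s*(.*?)\s*$/su', $part, $p)) {
                    $alternatives[$p[1]] = $p[2];
                }
            }
            if (!isset($alternatives[$variant])) {
                return ''; // no form for this reader: drop the marker
            }
            // Every listed form is rewritten to the reader's form.
            foreach ($alternatives as $form) {
                if ($form !== '') {
                    $table[$form] = $alternatives[$variant];
                }
            }
            return $alternatives[$variant];
        },
        $text
    );
    $stripped = $stripped ?? $text;
    return $table === [] ? $stripped : strtr($stripped, $table);
}

echo applyArticleRules('-{A|zh-cn:foo; zh-tw: bar}- is also written foo or bar.', 'zh-tw'), "\n";
```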

Article titles can be converted manually by specifying the conversion rule in the article body with this tag:

-{T|zh-cn: foo; zh-tw: bar}-

When the article is displayed, a zh-cn user will see "foo" as the article title, and a zh-tw user will see "bar".

The same markup can also set a fixed display title, regardless of the language variant, using the syntax below:

-{T|foo}-

This is useful when the title is the same in every variant and its first letter should not be capitalized.
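The two title forms above could be handled along these lines. This is a sketch only: the function name extractDisplayTitle() and its return shape (display title or null, plus the body with the tag removed) are assumptions made for illustration.

```php
<?php
/**
 * Hypothetical helper: pull a -{T|...}- tag out of the article body.
 * Returns [displayTitle or null, body without the tag].
 */
function extractDisplayTitle(string $wikitext, string $variant): array
{
    $title = null;
    $body = preg_replace_callback(
        '/-\{T\|(.*?)\}-/su',
        function (array $m) use (&$title, $variant): string {
            $alternatives = [];
            foreach (explode(';', $m[1]) as $part) {
                if (preg_match('/^\s*(zh(?:-[a-z]+)?)\s*:\s*(.*?)\s*$/su', $part, $p)) {
                    $alternatives[$p[1]] = $p[2];
                }
            }
            if ($alternatives === []) {
                // -{T|foo}-: one fixed title for every variant.
                $title = trim($m[1]);
            } elseif (isset($alternatives[$variant])) {
                // -{T|zh-cn: foo; zh-tw: bar}-: pick the reader's form.
                $title = $alternatives[$variant];
            }
            return ''; // the tag itself is not rendered in the body
        },
        $wikitext
    );
    return [$title, $body ?? $wikitext];
}

print_r(extractDisplayTitle('-{T|zh-cn: foo; zh-tw: bar}-Body text', 'zh-tw'));
```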

Two magic words have also been introduced to prevent automatic conversion of the title or the whole article:

  • __NOTC__: prevents automatic conversion of the article title.
  • __NOCC__: prevents automatic conversion of the whole article.

Implementing a similar conversion system for other languages

In MediaWiki 1.4

Most conversion-related functions are implemented in the language object. For Chinese, this is in LanguageZh.php. At a minimum, the language object should override the following methods to support multiple variants:

  • getVariants
function getVariants() {
        return array('zh', 'zh-cn', 'zh-tw', 'zh-sg', 'zh-hk');
}

This function returns an array of the language variants that are supported. Language files should exist for each supported variant.

  • getPreferredVariant()
function getPreferredVariant()

This function should return the user's preferred variant.

  • convert()
function convert( $text, $isTitle=false )

This function implements the actual conversion of the text. The flag $isTitle signals whether the input text is the article title; in Chinese, titles are sometimes converted a bit differently. This function is called near the end of the parser (Parser.php), so anything that is parsed by the parser will be converted. Note that the parsing of the manual markup -{}- is also done in this function. This should probably be factored out so that other languages can reuse it easily.

  • getExtraHashOptions()
function getExtraHashOptions() {
        $variant = $this->getPreferredVariant();
        return '!' . $variant;
}

This function returns a string identifying the user's preferred language variant. It is used for caching purposes, so that pages rendered for different variants are cached separately.

Note that some of these functions are not specific to Chinese, for example getExtraHashOptions() and getPreferredVariant(), as well as the parsing code for the manual markup in convert(). Some refactoring can be done if and when there is a need to implement conversion systems for other languages.
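Putting the four methods together, a language object for another language might look roughly like the sketch below. Everything except the four method names is an assumption: the class name LanguageXx, the variant codes "xx-a" and "xx-b", the constructor, and the toy tables are placeholders, and the real class would extend MediaWiki's language base class and read the preferred variant from user preferences.

```php
<?php
// Hypothetical language object for an imaginary language "xx".
class LanguageXx
{
    /** @var string variant requested by the user (assumed to come from preferences) */
    private $preferredVariant;

    /** @var array per-variant conversion tables; contents are placeholders */
    private $tables = [
        'xx-a' => ['bar' => 'foo'],
        'xx-b' => ['foo' => 'bar'],
    ];

    public function __construct($preferredVariant = 'xx')
    {
        $this->preferredVariant = $preferredVariant;
    }

    // Supported variants; 'xx' means "no conversion", like 'zh' for Chinese.
    public function getVariants()
    {
        return array('xx', 'xx-a', 'xx-b');
    }

    // Fall back to the unconverted variant for unknown codes.
    public function getPreferredVariant()
    {
        return in_array($this->preferredVariant, $this->getVariants(), true)
            ? $this->preferredVariant
            : 'xx';
    }

    // This sketch treats titles like ordinary text and ignores $isTitle.
    public function convert($text, $isTitle = false)
    {
        $variant = $this->getPreferredVariant();
        if (!isset($this->tables[$variant])) {
            return $text;
        }
        return strtr($text, $this->tables[$variant]);
    }

    // Makes the cache key differ per variant.
    public function getExtraHashOptions()
    {
        return '!' . $this->getPreferredVariant();
    }
}

$lang = new LanguageXx('xx-b');
echo $lang->convert('foo and bar'), ' ', $lang->getExtraHashOptions(), "\n";
```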

The code supporting the customizable conversion tables is currently a bit messy. Code cleanup is underway, and its usage will be documented here shortly.

In MediaWiki 1.5

The upcoming MediaWiki 1.5 will include a LanguageConverter object that encapsulates most conversion-related functionality. A conversion system for the Serbian language is being actively worked on, and a preliminary test site is up and running at

There is also a test site with conversion support for English, just for fun atm:

See also

Reference links - Chinese