User:DChan (WMF)/Forms of writing used in Chinese editions of Wikipedia

From Meta, a Wikimedia project coordination wiki

Forms of writing used in Chinese editions of Wikipedia

Engineers often ask me to explain the relationship between (A) the different forms of Chinese and (B) the editions of Wikipedia written in them. I'm not aware of a single place that has pointers to all this information, so this is my attempt to create one.

Summary table

There are eight editions (language versions) of Wikipedia in different forms of Chinese. This table summarises the forms of writing used in them. Note that Chinese text also appears in non-Chinese Wikipedias (e.g. to give someone's personal name in Chinese), and likewise other languages appear in Chinese editions of Wikipedia. See below for background information.

Variety | BCP 47 tag | Wikipedia URL | Active users (2020-05) | Content
Standard Written Chinese | zh | | 9075 | Written in mixed Hant/Hans script (each page is a mix, because each edit is written in the editor's preferred variant).
  • LanguageConverter (part of MediaWiki) performs script conversion and vocabulary conversion.
  • Vocabulary conversion is powered by hundreds of subject-specific lexical lists (e.g. Physics and Showbiz).
Cantonese | yue | | 304 | Written in Hant script.
  • Uses Cantonese grammar and vocabulary throughout.
  • Including certain characters that don't exist in Standard Written Chinese.
  • There is client-side code to perform script conversion (Hans→Hant on save, Hant→Hans on view).
Min Nan | nan | | 72 | Articles mainly use Latin script (Pe̍h-ōe-jī romanization).
  • Discussion pages mostly use Hant script.
  • In many cases, the discussion page contains a roughly parallel copy of the article written in Hant script (using Min Nan grammar and vocabulary), e.g. Hiong-káng and Talk:Hiong-káng.
  • In other cases, a Hant version of the article exists in the Help namespace, e.g. Tâi-oân and Help:Tâi-oân.
    • The Latin article links to the Hant article by using the template {{TwinHAN}}
    • The Hant article links to the Latin article by using the template {{TwinPOJ|Tâi-oân}}
    • The content of the articles may not be exactly parallel.
Wu | wuu | | 56 | Written in Hans script (using Wu grammar and vocabulary).
Hakka | hak | | 32 | Articles mainly use Latin script (Pha̍k-fa-sṳ romanization).
  • But some use the Hant script instead (using Hakka grammar and vocabulary).
  • For some topics, both exist, e.g. Sîn-kâ-pô and 新加坡.
    • The Latin article links to the Hant article by using {{Hakka-TW|0|新加坡}}
    • The Hant article links to the Latin article by using {{Hakka-TW|1|Sîn-kâ-po}}
    • Generally, linked articles may not be exact equivalents.
Gan | gan | | 21 | As on Chinese Wikipedia, articles can be written in mixed Hant/Hans script (each page can be mixed).
  • But here Hant predominates.
  • LanguageConverter performs script conversion, but unlike on Chinese Wikipedia, vocabulary conversion is not used.
Min Dong | cdo | | 20 | Articles and discussion pages mainly use Latin script (Foochow romanization), though Hant/Hans script is allowed too.
Classical Chinese | lzh | | 75 | Articles and discussion pages use Hant script.

The following background information explains terms used in the table.

Wikipedia in Standard Written Chinese

There are numerous varieties of spoken Chinese: Mandarin, Cantonese, Hakka, Wu etc. They vary in pronunciation (greatly), vocabulary (considerably) and grammar (somewhat), and are not mutually intelligible.

  • Standard Written Chinese is a written form that mostly uses the vocabulary and grammar of spoken Mandarin. (Pronunciation is not strongly represented in Chinese characters.)
  • Yet Mandarin speakers and non-Mandarin speakers alike use it as a written standard.
  • Therefore it is often just called "Chinese" — for instance "Chinese Wikipedia" means the Standard Written Chinese edition of Wikipedia.
  • Note that for non-Mandarin speakers, writing in this standard means using different vocabulary and grammar than they do in speech (diglossia).

There is, however, some variation in Standard Written Chinese, as explained below.

Script variation: Hant and Hans scripts

Chinese characters (also called "Han characters") can be written according to two standards, called "Hans" and "Hant".

  • Hant ("Han traditional") is standard in Taiwan and Hong Kong, and was standard in Mainland China before the 1960s.
  • Hans ("Han simplified") is standard in Mainland China today. Compared to Hant, it modifies certain characters to have fewer penstrokes.
  • It is laborious trying to read Hans if you are only familiar with Hant, and vice versa.
  • It is infeasible to write correct Hans if you are only familiar with Hant, and vice versa.
  • Wikipedia editors are fairly evenly split between Hans and Hant.
  • Script variation bears no relation to the differences in varieties of spoken Chinese. For instance you can write Mandarin in either Hans or Hant.
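
Because script and spoken variety are independent, a language code and a script code can be combined freely. The following is a minimal Python sketch, assuming the BCP 47 convention of appending a four-letter ISO 15924 script subtag (`Hans` or `Hant`) to the language subtag. The language subtags are the ones listed in the summary table; the function name `make_tag` is hypothetical.

```python
# Language subtags as listed in the summary table.
LANGUAGE_SUBTAGS = {
    "Standard Written Chinese": "zh",
    "Cantonese": "yue",
    "Min Nan": "nan",
    "Wu": "wuu",
    "Hakka": "hak",
    "Gan": "gan",
    "Min Dong": "cdo",
    "Classical Chinese": "lzh",
}

# The two Han script standards discussed in this section.
SCRIPT_SUBTAGS = ("Hans", "Hant")


def make_tag(variety: str, script: str) -> str:
    """Join a language subtag and a script subtag, e.g. 'zh' + 'Hant'."""
    if script not in SCRIPT_SUBTAGS:
        raise ValueError(f"unknown script subtag: {script}")
    return f"{LANGUAGE_SUBTAGS[variety]}-{script}"


# The same variety can be tagged with either script.
print(make_tag("Standard Written Chinese", "Hans"))  # zh-Hans
print(make_tag("Standard Written Chinese", "Hant"))  # zh-Hant
```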

Vocabulary variation

There are vocabulary differences between Standard Written Chinese as used in Mainland China and as used in Taiwan, Hong Kong etc. These are similar to, but more extensive than, the English vocabulary differences between the USA and the UK, Canada etc. They affect terms in science, technology, engineering etc, and also, very extensively, non-Chinese names. E.g. John Lennon's surname is written as 列侬 in Mainland China, 連儂 in Hong Kong and 藍儂 in Taiwan.


Chinese Wikipedia uses MediaWiki functionality called LanguageConverter to display articles automatically in the reader's preferred variety of Standard Written Chinese. LanguageConverter handles script conversion using a fixed lookup table (like many other software libraries). However, and far more unusually, it can also handle vocabulary conversion, powered by lexical lists within Chinese Wikipedia itself, arranged by subject, e.g. Physics or Showbiz. Each item gives different versions of a term, e.g.:

  { type = 'item', original = 'Electric potential', rule = 'zh-tw:電位;zh-cn:电势;zh-hk:電勢' }
  { type = 'item', original = 'Lennon, John', rule = 'zh-cn:约翰·列侬;zh-tw:約翰·藍儂;zh-hk:約翰·連儂' }

Some other Wikipedia editions use LanguageConverter to convert scripts; however, only Chinese Wikipedia uses it for vocabulary conversion. To the best of my knowledge, no other project attempts anything like this sort of vocabulary conversion for Chinese.
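
To make the rule strings in the lexical lists concrete, here is a minimal Python sketch of how one such string could be split into per-variant forms and looked up for a reader. It is not LanguageConverter's actual code: the function names `parse_rule` and `pick_term` are hypothetical, and the fallback order is an assumption supplied by the caller.

```python
from __future__ import annotations


def parse_rule(rule: str) -> dict[str, str]:
    """Split 'variant:term;variant:term' into a {variant: term} mapping."""
    forms = {}
    for part in rule.split(";"):
        part = part.strip()
        if not part:
            continue
        variant, _, term = part.partition(":")
        forms[variant.strip()] = term.strip()
    return forms


def pick_term(rule: str, preferred: str, fallbacks: tuple[str, ...] = ()) -> str | None:
    """Return the term for the preferred variant, else the first fallback that exists."""
    forms = parse_rule(rule)
    for variant in (preferred, *fallbacks):
        if variant in forms:
            return forms[variant]
    return None


# Rule strings copied from the examples above.
POTENTIAL = "zh-tw:電位;zh-cn:电势;zh-hk:電勢"
LENNON = "zh-cn:约翰·列侬;zh-tw:約翰·藍儂;zh-hk:約翰·連儂"

print(pick_term(POTENTIAL, "zh-cn"))  # 电势
print(pick_term(LENNON, "zh-hk"))     # 約翰·連儂
```

A reader whose variant is missing from a rule gets whichever fallback the caller lists first, or `None` if no listed variant is present.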

Wikipedia in other modern varieties of written Chinese

As stated earlier, Standard Written Chinese mostly uses the grammar and vocabulary of Mandarin, which means non-Mandarin speakers use different grammar and vocabulary in their speech than in standard writing. Alternatively, they can write down the exact words they would speak.

  • This gives rise to Written Cantonese, Written Hakka, Written Wu etc, each of which is very different from Standard Written Chinese (and not readily intelligible to a Mandarin speaker).
  • These written varieties are mostly used in informal contexts.
  • Wikipedia is highly unusual as a collection of formal writing because it has editions in six of these written varieties besides Standard Written Chinese:
    • Cantonese Wikipedia
    • Min Nan Wikipedia
    • Wu Wikipedia
    • Hakka Wikipedia
    • Gan Wikipedia
    • Min Dong Wikipedia

Script variation: Hant, Hans, Romanization

Aside from Chinese characters, Latin letters can also be used to transcribe varieties of Chinese. This is called romanization. Min Nan Wikipedia, Hakka Wikipedia and Min Dong Wikipedia all mainly use romanization rather than Chinese characters for article text.

Cantonese Wikipedia and Gan Wikipedia are primarily written in Hant script. Cantonese has client-side code to allow viewing or entering text in Hans script. Gan uses the same LanguageConverter functionality as Chinese Wikipedia (but only for script conversion, not vocabulary conversion).

Wu Wikipedia is primarily written in Hans script.
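
The save/view arrangement described above for Cantonese Wikipedia can be pictured as a character-by-character lookup. The following is only a Python sketch under stated assumptions, not the site's actual client-side code: the tables hold a handful of illustrative character pairs, and the names `on_save` and `on_view` are hypothetical.

```python
# Illustrative Hans -> Hant pairs only; a real table has thousands of entries.
HANS_TO_HANT = {"广": "廣", "东": "東", "话": "話", "汉": "漢", "语": "語"}
# Reverse table, valid here because every pair above is one-to-one.
HANT_TO_HANS = {hant: hans for hans, hant in HANS_TO_HANT.items()}


def convert(text: str, table: dict) -> str:
    """Replace each character found in the table; leave all others unchanged."""
    return "".join(table.get(ch, ch) for ch in text)


def on_save(text: str) -> str:
    """Hypothetical save hook: store Hant regardless of what was typed."""
    return convert(text, HANS_TO_HANT)


def on_view(stored: str, reader_wants_hans: bool) -> str:
    """Hypothetical view hook: show Hans only to readers who ask for it."""
    return convert(stored, HANT_TO_HANS) if reader_wants_hans else stored


stored = on_save("广东话")      # typed in Hans
print(stored)                   # 廣東話
print(on_view(stored, True))    # 广东话
```

A purely character-level table like this is a simplification. Some Hans characters correspond to more than one Hant character (for example 发 can be 發 or 髮), so a practical converter needs more context than a single character.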

Classical Chinese

Classical Chinese was the standard written form of Chinese for many centuries until around 1920. In some ways Classical Chinese Wikipedia serves a similar role to Latin Wikipedia. By convention, Classical Chinese is written in Hant script, so there is no need for script conversion.