Pinyin

From Meta, a Wikimedia project coordination wiki

Pinyin (see w:Pinyin) is a Roman alphabet-based phonetic rendering of Chinese word pronunciations. It requires four diacritical marks over the vowels (macron, acute accent, caron, and grave accent), representing the four tones (in addition to the unmarked, neutral "5th" tone) that distinguish word meaning.

ā "singing tone", á "surprise tone", ǎ "question tone", à "short tone".

A good idea for functionality to add to MediaWiki: PinyintoUnicode Source (GNU GPL). It takes a word like 'Feng1shui3' and converts it to 'Fēngshuǐ'. This has to be done within context marks, like <pinyin>Feng1shui3</pinyin>, to isolate the function.
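The conversion described above can be sketched in a few lines of Python. This is not the PinyintoUnicode source itself, only a minimal illustration under stated assumptions: every syllable carries its own trailing tone digit (1-5, with 5 meaning the unmarked neutral tone), "v" is accepted as a stand-in for "ü", and the mark is placed by the usual pinyin rule (on a or e if present, on the o of ou, otherwise on the last vowel), so 'shui3' becomes 'shuǐ'. The names pinyin_to_unicode and expand_pinyin_tags are hypothetical.

```python
import re

# Toned forms of each vowel, indexed by tone number minus one (tones 1-4).
TONE_MARKS = {
    "a": "āáǎà", "e": "ēéěè", "i": "īíǐì",
    "o": "ōóǒò", "u": "ūúǔù", "ü": "ǖǘǚǜ",
}

# One syllable: a run of letters immediately followed by a tone digit.
SYLLABLE = re.compile(r"([A-Za-zÜü]+)([1-5])")

# Context marks that isolate the text to convert (assumed tag syntax).
PINYIN_TAG = re.compile(r"<pinyin>(.*?)</pinyin>", re.DOTALL)


def mark_syllable(letters, tone):
    """Return one syllable with its tone digit replaced by a diacritic."""
    # Assumption: "v" is used as an ASCII substitute for "ü".
    letters = letters.replace("v", "ü").replace("V", "Ü")
    if tone == 5:  # neutral tone carries no mark
        return letters
    lower = letters.lower()
    if "a" in lower:
        idx = lower.index("a")
    elif "e" in lower:
        idx = lower.index("e")
    elif "ou" in lower:
        idx = lower.index("o")
    else:
        vowels = [i for i, ch in enumerate(lower) if ch in TONE_MARKS]
        if not vowels:  # no vowel to mark; leave the letters alone
            return letters
        idx = vowels[-1]
    marked = TONE_MARKS[lower[idx]][tone - 1]
    if letters[idx].isupper():
        marked = marked.upper()
    return letters[:idx] + marked + letters[idx + 1:]


def pinyin_to_unicode(text):
    """Convert every letters+digit syllable in text to toned pinyin."""
    return SYLLABLE.sub(
        lambda m: mark_syllable(m.group(1), int(m.group(2))), text)


def expand_pinyin_tags(wikitext):
    """Convert only the text enclosed in <pinyin>...</pinyin> marks."""
    return PINYIN_TAG.sub(lambda m: pinyin_to_unicode(m.group(1)), wikitext)
```

With these definitions, pinyin_to_unicode("Feng1shui3") returns "Fēngshuǐ", and expand_pinyin_tags leaves any digits outside the context marks untouched, which is the reason for isolating the function in the first place.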

Correcting improper character sets used to display pinyin should not be an issue, since pinyin is not so much a character set as a very limited array of display marks over vowels. Within Unicode it is a standard feature and is well incorporated into the standard sets. Still, if pinyin to IPA conversion becomes useful at some point, that conversion process might require some correction of misused characters. Most problematic is the third tone mark, as in "ě" (a caron), for which a similar, rounder-shaped (not sharp) diacritic, the breve as in "ĕ", may be substituted.
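One possible way to correct the misused third tone mark is to swap the combining breve (U+0306) for the combining caron (U+030C) after decomposing the text. The sketch below uses Python's standard unicodedata module; the function name breve_to_caron is hypothetical, and the approach assumes the input is known to be pinyin, since it would also alter legitimate breves in other languages (for example Romanian "ă").

```python
import unicodedata

COMBINING_BREVE = "\u0306"  # rounded mark, often used by mistake
COMBINING_CARON = "\u030C"  # pointed mark, the correct third tone mark


def breve_to_caron(text):
    """Replace breve-marked letters with their caron-marked equivalents."""
    # Decompose so that every diacritic becomes a separate combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    fixed = decomposed.replace(COMBINING_BREVE, COMBINING_CARON)
    # Recompose into single precomposed characters where they exist.
    return unicodedata.normalize("NFC", fixed)
```

Note that the result is in NFC form throughout, so other characters in the input may also be normalized as a side effect.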

  • See Pinyin to Unicode converter. This page converts text written in pinyin, with syllable-final tone numbers, into Unicode. Simply enter or paste in the pinyin and convert.

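Latin-1 Supplement - Unicode U+0080 - U+00FF - (128-255)
(character = named entity = decimal reference = hexadecimal reference)
á = &aacute; = &#225; = &#xE1;
à = &agrave; = &#224; = &#xE0;
é = &eacute; = &#233; = &#xE9;
è = &egrave; = &#232; = &#xE8;
í = &iacute; = &#237; = &#xED;
ì = &igrave; = &#236; = &#xEC;
ó = &oacute; = &#243; = &#xF3;
ò = &ograve; = &#242; = &#xF2;
ú = &uacute; = &#250; = &#xFA;
ù = &ugrave; = &#249; = &#xF9;
ü = &uuml; = &#252; = &#xFC;
subtract 32 for upper case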

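Latin Extended-A - Unicode U+0100 - U+017F - (256-383)
(character = decimal reference = hexadecimal reference)
ā = &#257; = &#x101;
ē = &#275; = &#x113;
ě = &#283; = &#x11B;
ī = &#299; = &#x12B;
ō = &#333; = &#x14D;
ū = &#363; = &#x16B;
subtract 1 for upper case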

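Latin Extended-B - Unicode U+0180 - U+024F - (384-591)
(character = decimal reference = hexadecimal reference)
ǎ = &#462; = &#x1CE;
ǐ = &#464; = &#x1D0;
ǒ = &#466; = &#x1D2;
ǔ = &#468; = &#x1D4;
ǖ = &#470; = &#x1D6;
ǘ = &#472; = &#x1D8;
ǚ = &#474; = &#x1DA;
ǜ = &#476; = &#x1DC;
subtract 1 for upper case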
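The decimal and hexadecimal numeric character references for the characters listed above, and the "subtract 32" and "subtract 1" rules for upper case, can be checked with a short Python sketch. The names ncr_row, upper_offset, LATIN_1 and EXTENDED are hypothetical helpers introduced only for this illustration.

```python
# Toned vowels from the Latin-1 Supplement block.
LATIN_1 = "áàéèíìóòúùü"
# Toned vowels from the Latin Extended-A and Extended-B blocks.
EXTENDED = "āēěīōūǎǐǒǔǖǘǚǜ"


def ncr_row(ch):
    """Format one row: character, decimal reference, hex reference."""
    cp = ord(ch)
    return f"{ch} = &#{cp}; = &#x{cp:X};"


def upper_offset(ch):
    """Distance in code points from a lower-case letter to its upper case."""
    return ord(ch) - ord(ch.upper())


if __name__ == "__main__":
    for ch in LATIN_1 + EXTENDED:
        print(ncr_row(ch), "upper case offset:", upper_offset(ch))
```

The offset is 32 for every Latin-1 Supplement letter in the list and 1 for every Latin Extended-A and Extended-B letter, which matches the rules stated with each table.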


From Helmer Aslaksen's page on Reading and Writing Pinyin in Unicode

Warning: Some older browsers have trouble with hexadecimal numeric character references, so it may be safest to use decimal.
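If existing text already contains hexadecimal references, one way to follow this advice is to rewrite them as decimal references. The following is a minimal sketch, assuming the references are well formed (an ampersand, "#x" or "#X", hex digits, then a semicolon); the function name hex_ncrs_to_decimal is hypothetical.

```python
import re

# Matches hexadecimal numeric character references such as &#x1CE;
HEX_NCR = re.compile(r"&#[xX]([0-9A-Fa-f]+);")


def hex_ncrs_to_decimal(html):
    """Rewrite hexadecimal character references as decimal ones."""
    return HEX_NCR.sub(lambda m: f"&#{int(m.group(1), 16)};", html)
```

Decimal references and named entities already present in the text are left unchanged, since the pattern only matches the "#x" form.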