Wikipedias in multiple writing systems
This article describes each Wikipedia that uses multiple writing systems. If you are a native speaker of one of the languages listed below that requires automatic script conversion between writing systems, you are welcome to help us write a comparative table of letters and transliteration rules. We can then help you create such a converter.
If you wish to help us, contact us.
(original message was written by Kprwiki)
- 1 With Automatic Conversion System
- 2 With Partial Automatic Conversion System
- 3 Existing Automatic Conversion System ready to be implemented
- 4 Without Automatic Conversion System
- 4.1 Cyrillic - Latin converter
- 4.2 Cyrillic - Latin - Arabic converter
- 4.3 Latin - Arabic converter
- 4.4 Indic-Arabic converter request
- 4.5 Chinese Character - Latin Converter
- 4.6 Latin–Latin conversion
- 4.7 Other converter
- 5 Considering to Introduce Multi-writing System in the near future
- 6 Notes
With Automatic Conversion System
Some Wikipedias use an automatic conversion system.
The en:Chinese language has two major writing systems, simplified and traditional Chinese, and some regional variants have their own vocabulary. Therefore, the Chinese Wikipedia supports four variants: zh-cn (Mainland China), zh-tw (Taiwan), zh-hk (Hong Kong and Macao) and zh-sg (Singapore). See Automatic conversion between simplified and traditional Chinese for details.
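The two-layer idea described above (script conversion plus variant-specific vocabulary) can be sketched as follows. This is only an illustration, not the converter MediaWiki actually uses: the tables are tiny hypothetical samples, the function name `convert` is made up, and the source text is assumed to be written in simplified characters.

```python
# Tiny sample tables for illustration only; a real converter needs
# thousands of entries. The source text is assumed to be simplified.
S2T_CHARS = {"软": "軟"}  # simplified -> traditional, per character

# Variant-specific vocabulary, applied after the character conversion.
PHRASES = {
    "zh-cn": {},
    "zh-sg": {},
    "zh-hk": {},
    "zh-tw": {"軟件": "軟體"},  # Taiwan commonly uses 軟體 for "software"
}
TRADITIONAL_VARIANTS = {"zh-tw", "zh-hk"}


def convert(text: str, variant: str) -> str:
    """Convert simplified source text to the requested variant."""
    if variant in TRADITIONAL_VARIANTS:
        text = "".join(S2T_CHARS.get(ch, ch) for ch in text)
    for src, dst in PHRASES[variant].items():
        text = text.replace(src, dst)
    return text


for v in ("zh-cn", "zh-tw", "zh-hk", "zh-sg"):
    print(v, convert("软件", v))
```

The point of the design is that character mapping alone is not enough: zh-tw and zh-hk share the traditional script but can still differ in wording, so vocabulary rules are kept per variant.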
The en:Serbian language has two writing systems, and two dialects. So there are four variants in the language.
- Latin alphabet Ekavian
- Cyrillic alphabet Ekavian
- Latin alphabet Ijekavian
- Cyrillic alphabet Ijekavian
The Serbian Wikipedia supports automatic conversion between the two writing systems. It does not convert between the dialects, because the differences between them are small.
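As a rough illustration of the script side of this conversion, here is a minimal sketch of Cyrillic to Latin transliteration for Serbian. The function name `cyrillic_to_latin` is made up for this example, and the code is not taken from the Serbian Wikipedia's converter.

```python
# Serbian Cyrillic -> Latin, lowercase letters only; case is handled below.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}


def cyrillic_to_latin(text: str) -> str:
    """Transliterate Serbian Cyrillic text letter by letter."""
    out = []
    for ch in text:
        lat = CYR_TO_LAT.get(ch.lower())
        if lat is None:
            out.append(ch)                # not a Serbian Cyrillic letter
        elif ch.isupper():
            out.append(lat.capitalize())  # Љ -> Lj, Џ -> Dž
        else:
            out.append(lat)
    return "".join(out)


print(cyrillic_to_latin("Википедија"))  # Vikipedija
print(cyrillic_to_latin("Љубав"))       # Ljubav
```

The reverse direction is harder than it looks, because the Latin digraphs lj, nj and dž do not always stand for a single Cyrillic letter (for example, the "nj" in "injekcija" is written нј, not њ), so a Latin to Cyrillic converter needs an exception list.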
The Kazakh language has three writing systems: the Cyrillic, Latin, and Perso-Arabic (Central Asian branch) alphabets. The Kazakh Wikipedia therefore supports automatic conversion between them: Cyrillic to Latin and vice versa, plus read-only conversion from Cyrillic or Latin to Arabic.
The current version of the converter used in the Kazakh Wikipedia contains some serious mistakes. Please update the LanguageKk.php file with this file.
The Kurdish language uses three writing systems, depending on the region:
- the Latin alphabet is used in Turkey and Syria,
- the Arabic alphabet in Iraq and Iran, and
- the Cyrillic alphabet in the former USSR.
The Kurdish Wikipedia supports automatic conversion between two of the writing systems, Latin and Arabic.
Inuktitut as spoken in Canada has two writing systems: Inuktitut syllabics are used in parts of the autonomous territory of Nunavut, while other regions use the Latin alphabet. An automatic conversion system between the two has been created (see iu:Wikipedia:Conversion script). However, because syllabics do not have uppercase letters, conversion from syllabics to Latin displays only lowercase Latin letters. Conversely, syllabics cannot distinguish between lowercase and uppercase Latin letters.
It will probably be enabled when the MediaWiki version is updated, sometime in July 2011.
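The loss of letter case described above can be shown with a minimal sketch. The table covers only the six syllabic characters needed to spell "Inuktitut", and the function names are made up for this example; it is not the script referenced at iu:Wikipedia:Conversion script.

```python
# Only the characters needed for the word "inuktitut"; a real table
# covers the whole syllabary.
SYLLABIC_TO_LATIN = {
    "ᐃ": "i", "ᓄ": "nu", "ᒃ": "k", "ᑎ": "ti", "ᑐ": "tu", "ᑦ": "t",
}
LATIN_TO_SYLLABIC = {lat: syl for syl, lat in SYLLABIC_TO_LATIN.items()}
MAX_KEY = max(len(key) for key in LATIN_TO_SYLLABIC)


def to_latin(text: str) -> str:
    """Syllabics -> Latin; the result is always lowercase."""
    return "".join(SYLLABIC_TO_LATIN.get(ch, ch) for ch in text)


def to_syllabics(text: str) -> str:
    """Latin -> syllabics, trying the longest matching key first."""
    text = text.lower()  # syllabics have no case, so case is dropped here
    out, i = [], 0
    while i < len(text):
        for size in range(MAX_KEY, 0, -1):
            chunk = text[i:i + size]
            if chunk in LATIN_TO_SYLLABIC:
                out.append(LATIN_TO_SYLLABIC[chunk])
                i += len(chunk)
                break
        else:
            out.append(text[i])  # no match: copy the character unchanged
            i += 1
    return "".join(out)


word = to_syllabics("Inuktitut")
print(word, to_latin(word))  # the capital "I" does not survive the round trip
```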
With Partial Automatic Conversion System
The Tajik language uses three writing systems, depending on the region:
- Cyrillic alphabet in Tajikistan.
- Arabic alphabet in Afghanistan.
- Latin alphabet.
The Tajik Wikipedia currently has an automatic conversion system for two of the writing systems (Cyrillic - Latin), but not into Perso-Arabic.
See references for the development of a Cyrillic - Perso-Arabic conversion system at http://tajpers.narod.ru/
The Uzbek language has three writing systems: the Latin, Cyrillic and Arabic alphabets.
The Uzbek Wikipedia currently has an automatic conversion system for two of the writing systems (Latin - Cyrillic), but not into Perso-Arabic.
Automatic conversion between all three writing systems is desirable, since the Perso-Arabic script is used in Afghanistan. A converter into Arabic could be developed.
The Gan language has three major writing systems. These are simplified and traditional Gan Chinese, and Romanized Gan.
The Gan Wikipedia currently has an automatic conversion system for two writing systems (simplified and traditional Gan Chinese), but not into Romanized Gan. Automatic conversion into Romanized Gan would be desirable so that non-Gan speakers can learn and comprehend the Gan language more easily.
Existing Automatic Conversion System ready to be implemented
The Kyrgyz language has three major writing systems: Cyrillic Kyrgyz, Latinized Kyrgyz, and Perso-Arabic Kyrgyz (used in Xinjiang, China).
Automatic conversion between the three writing systems is desirable, since the Kyrgyz in China do not use Cyrillic.
An Arabic to Cyrillic converter is under development so that Chinese Kyrgyz can also contribute to Wikipedia without knowing Cyrillic.
The Uyghur language has two writing systems: the Arabic and Cyrillic alphabets. Latin has sometimes been used for typing by people whose computers lack support for those alphabets, although this is becoming less common as support improves.
The Latin alphabet is used by Uyghurs in Turkey, Western countries and parts of Xinjiang; the Cyrillic alphabet is used in CIS countries; and the Perso-Arabic script is used predominantly in Xinjiang.
Automatic conversion between the three writing systems is desirable to prevent conflicts between users with different preferences.
The Chechen language has two writing systems: the Cyrillic and Latin alphabets.
Automatic conversion from Cyrillic into Latin is desirable, since the official script is Latin and many Chechens living outside the Russian Federation cannot read Cyrillic.
Without Automatic Conversion System
Sorted according to the similarity of the required conversion system.
Cyrillic - Latin converter
The Tatar language has three major writing systems. These are Cyrillic Tatar, Latinized Tatar, and Perso-Arabic Tatar.
An automatic conversion between the three writing systems was very desirable in order to avoid Tatar script conflicts.
Serbo-Croatian uses two writing systems: the Latin and Cyrillic alphabets.
Cyrillic - Latin - Arabic converter
The Azerbaijani language has three writing systems: the Latin, Cyrillic and Perso-Arabic alphabets.
Automatic conversion between the three writing systems is desirable to prevent the creation of a South Azerbaijani Wikipedia written in the Perso-Arabic script.
The Turkmen language has three writing systems: Latin (used in Turkmenistan), the Perso-Arabic alphabet (used in Iran and Afghanistan) and Cyrillic (historically used in Turkmenistan).
Automatic conversion between the three writing systems is desirable. Although Turkmen is officially written in the Latin alphabet, the old Cyrillic alphabet is still in wide use, and many political parties opposed to the authoritarian rule of President Niyazov continued to use Cyrillic on websites and in publications, most likely to distance themselves from the alphabet that Niyazov created.
Latin - Arabic converter
The Malay language is normally written in a Latin alphabet called Rumi, although a modified Arabic script called Jawi also exists. Rumi and Jawi are co-official in Brunei. Efforts are under way to preserve the Jawi script and to revive its use among Malays in Malaysia, and students taking Malay language examinations in Malaysia have the option of answering questions in Jawi. The Latin alphabet, however, is still the most commonly used script in Malaysia, for both official and informal purposes.
An automatic conversion from Latin to Jawi script should be set up.
- Omniglot article about written Malay
- Jawi writing for PC
- rumi-jawi transliteration software written using C#/WPF
Indic-Arabic converter request
Several different scripts are used for writing the Punjabi language. In the Punjab province of Pakistan, the script used is Shahmukhi, which is essentially the same as the Urdu script. In the Indian state of Punjab, Sikhs and others use the Gurmukhī script. Hindus, and those living in neighbouring Indian states such as Haryana and Himachal Pradesh, sometimes use the Devanāgarī script. Shahmukhi and Gurmukhī are the scripts most commonly used for writing Punjabi and are considered the official scripts of the language.
What about setting up automatic Gurmukhī - Shahmukhi transliteration based on this source, as is done in, for example, the Kazakh Wikipedia?
The Kashmiri language has three major writing systems: Devanagari Kashmiri, Perso-Arabic Kashmiri and Romanized Kashmiri.
Automatic conversion between the three writing systems is very desirable in order to avoid Kashmiri script conflicts. However, an accurate conversion script is very difficult to develop (see also http://desceco.org/O-COCOSDA2010/proceedings/paper_38.pdf )
The Sindhi language can be written using a modified Persian alphabet or the Devanagari script. Most young Sindhi people in India do not know the Persian alphabet and use Devanagari, which leaves the current Wikipedia available solely to those in Pakistan.
Chinese Character - Latin Converter
Automatic Han to Latin conversion may be difficult but perhaps possible with reasonable accuracy. Completely automatic Latin to Han conversion is either impossible or extremely difficult and will almost certainly be inaccurate without knowledgeable human intervention (indeed, this is a similar problem to an input method for Han characters). Without the latter, only contribution in Han is possible. This would then disadvantage contributors who only know the Latin orthography.
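The asymmetry described above can be illustrated with a small sketch. Mandarin Pinyin is used here only as a widely known stand-in; the same structure would apply to Hakka, Minnan or Cantonese with their own dictionaries. The tables and function names are hypothetical samples made up for this example.

```python
# Sample data only. Word-level entries resolve characters that have
# more than one reading (here 行: "xíng" or "háng").
WORD_READINGS = {"银行": "yínháng", "行人": "xíngrén"}
CHAR_READINGS = {
    "银": ["yín"], "行": ["xíng", "háng"], "人": ["rén"],
    "是": ["shì"], "事": ["shì"], "市": ["shì"],
}
LONGEST_WORD = max(len(word) for word in WORD_READINGS)


def han_to_latin(text: str) -> str:
    """Han -> Latin: longest dictionary word first, then single characters."""
    out, i = [], 0
    while i < len(text):
        for size in range(min(LONGEST_WORD, len(text) - i), 1, -1):
            word = text[i:i + size]
            if word in WORD_READINGS:
                out.append(WORD_READINGS[word])
                i += size
                break
        else:
            # No word matched: fall back to the first listed reading,
            # which is only a guess for characters with several readings.
            out.append(CHAR_READINGS.get(text[i], [text[i]])[0])
            i += 1
    return " ".join(out)


def latin_to_han_candidates(syllable: str) -> list:
    """Latin -> Han: one syllable usually maps to several characters."""
    return [ch for ch, readings in CHAR_READINGS.items() if syllable in readings]


print(han_to_latin("银行"), han_to_latin("行人"))
print(latin_to_han_candidates("shì"))
```

In the Han to Latin direction a word dictionary removes most of the ambiguity, whereas in the Latin to Han direction even this tiny table returns three candidates for one syllable, which is why that direction resembles an input method that needs a human to choose.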
The Hakka language has two major writing systems: Traditional Chinese characters and Romanized Hakka (see the existing Chinese character --> Hakka dictionary).
Hakka Wikipedia currently does not have an auto-converting system for the two writing systems. An automatic conversion from Traditional Chinese characters into Romanized Hakka would be desirable to avoid conflicts between users with different preferences and enable users to comprehend the meaning of every word more easily.
- Recent Updates:
- The 4000 most used Han characters have been translated to Romanized Hakka.
- An application for the enabling of the translations has been made here and the task is awaiting to be assigned. --Hakka (talk) 13:33, 2 November 2013 (UTC)
Automatic conversion between the two writing systems in both directions (Romanized Minnan --> Traditional Minnan Chinese characters, and Traditional Minnan Chinese characters --> Romanized Minnan) is desirable in order to avoid existing conflicts between users with different script preferences.
The Cantonese language has two major writing systems: Cantonese written in traditional Chinese characters, and Romanized Cantonese (namely the Cantonese Penkyamp Romanization and Cantonese Yale Romanization systems).
Automatic conversion between the two writing systems is desirable so that non-Cantonese speakers can learn and comprehend the Cantonese language more easily.
- See Discussion Page
Minnan Wikipedia (Latin–Latin)
Someone commented on a user page to raise the possibility of automatic conversion between the two leading Latin orthographies here. They are Pe̍h-ōe-jī and Tâi-uân Bân-lâm-gí Lô-má-jī Phing-im Hong-àn (Tâi-lô). Each is strictly a function (in the mathematical sense) of the other. The conversion table is available. Something very simple on the level of the script conversion tool at ang: might just work. Incidentally, it might even serve as a rudimentary spellchecker if implemented properly. See also this thesis and this blog post.
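A rule-based conversion of this kind could look roughly like the sketch below, which goes from Pe̍h-ōe-jī to Tâi-lô only. It is an assumption-laden illustration, not the conversion table mentioned above: it handles only a handful of well-known spelling correspondences, works on single syllables with the tone diacritics already removed, and the function name `poj_to_tailo` is made up.

```python
# Ordered substitution rules, POJ -> Tai-lo. Order matters: "chh" must
# be rewritten before "ch". Tone diacritics are NOT handled here.
POJ_TO_TL = [
    ("chh", "tsh"),
    ("ch", "ts"),
    ("oa", "ua"),
    ("oe", "ue"),
    ("eng", "ing"),
    ("ek", "ik"),
    ("o\u0358", "oo"),  # o followed by COMBINING DOT ABOVE RIGHT
    ("\u207f", "nn"),   # SUPERSCRIPT LATIN SMALL LETTER N (nasalisation)
]


def poj_to_tailo(syllable: str) -> str:
    """Convert one toneless POJ syllable to its Tai-lo spelling."""
    result = syllable.lower()
    for src, dst in POJ_TO_TL:
        result = result.replace(src, dst)
    return result


for poj in ("chhoe", "koa", "cheng", "sa\u207f", "ko\u0358"):
    print(poj, "->", poj_to_tailo(poj))
```

A complete converter would also have to carry the tone marks across and preserve capitalisation, and the reverse table would be needed for the other direction.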
Automatic conversion between the three writing systems is desirable to prevent the creation of a Mongolian Wikipedia written in the Classical Mongolian script and the Latinized Mongolian script.
Automatic conversion between the two writing systems is necessary because the 'Kalmyks' (known as Oirats in China) use the Todo script only.
The Gothic language can be written in Gothic (Wulfila's alphabet), Latin, or Runic. Most articles at the Gothic Wikipedia are written in Wulfilan or Latin. Only the main page is also available in a Runic version.
The Ladino language has two major writing systems: Latinized Ladino and Rashi script (a variant of the Hebrew script).
Automatic conversion between the two writing systems is desirable to prevent the duplication of articles.
Vlax Romani has two major writing systems: Latinized Romani and Cyrillic Romani.
Automatic conversion into Baybayin would be desirable to enable more Tagalog users to learn their traditional script.
First, a Unicode version of the Baybayin fonts should be created.
Considering to Introduce Multi-writing System in the near future
- Requests for new languages/Wikipedia Hanja, incubator:Test-wp/ko-hanja
- User:Yes0song/Automatic conversion in Korean language
- User:Masoris/Edit Page Suggestion for Automatic Conversion System
- ko:사용자:Yes0song/다지모 (Korean)
- ko:사용자:Yes0song/다지모/한자 혼용판 도입 (Korean)
(1) There are two writing systems: the traditional w:Hanacaraka (also called Carakan, an abugida script) and Latin. Latin is far more prevalent, to the extent that almost all publications in Javanese (of which there are only a small number) are in Latin. A one-to-one conversion is possible from Latin to Hanacaraka. Hanacaraka was added to Unicode only recently (2009), and there exist one Hanacaraka Unicode font and several non-Unicode fonts. Since the Unicode encoding is not yet supported through TrueType, the font relies on SIL's Graphite.
The Javanese Wikipedia has already requested that WebFont be implemented. In the future it would be desirable to have automatic conversion like that of the Chinese or Cyrillic projects.
(2) Another thing to consider: the Javanese language has (at least) two registers (sets of vocabulary) based on social standing: polite/palace Javanese (krama) and brash/market Javanese (ngoko). Both are used in Central Java; the former is more common in publications, while the latter is more common in conversation. In some places the latter is also found in publications, mainly in Suriname (for example, ngoko is used in the Suriname-Javanese Bible, which would look and sound vulgar to Javanese people), where the former is no longer in use for historical and geographical reasons. The same is true for East Javanese people, who vehemently oppose the use of the former because of its association with the aristocracy, and for people of other ethnicities all around Indonesia. Therefore there are four combinations/variants in the Javanese language:
- Hanacaraka krama
- Latin krama
- Hanacaraka ngoko
- Latin ngoko
Converting from krama to ngoko sometimes only requires one-to-one mapping of vocabulary, but in other instances requires one-to-many or many-to-one, or even a change in the grammar.
(3) Historically, there were also a third (and even a fourth) script used to write Javanese: the Arabic script (called the Pegon alphabet and the Arab gundul alphabet) and, long before that, Sanskrit/Pallava (the Old Javanese/Kawi script). http://www.omniglot.com/writing/javanese.htm
These old scripts are not yet used in Wikimedia projects, but in the future they would probably be beneficial for Wikisource and the Javanese Wiktionary.
The Konkani language has four main writing systems: the Devanāgarī, Latin, Kannada and Malayalam scripts. This is because Konkani does not have a unique script of its own, so the scripts of the other languages native to the regions are used.
An automatic conversion system would help keep script issues from hindering the growth of the language.
The Laz language has two writing systems: the Georgian script and the Latin script. Automatic conversion into Georgian would be desirable to enable more Laz users from Georgia to take part.
The alphabet is here http://en.wikipedia.org/wiki/Laz_grammar#Alphabet Georgian and Latin
- Yes, Laz live in Georgia, do not know the Latin alphabet as well. And in Turkey residing Laz do not know Georgian alphabet. Therefore, it is important to write Laz Wikipedia with two alphabets. so is made, for example, Serbian Wikipedia. I mean, this is not a big problem. Thanks! Dato deutschland 07:13, 11 March 2010 (UTC)
- Laz Georgian - Latin - Georgian converter - Both direction converter, easy to implement into Laz Wikipedia
The Brahui language has two main writing systems: Arabic script and the Latin script. This is because:
- The current online Arabic keyboard does not contain the required number of vowels for Brahui.
- Sometimes vowels are used as consonants depending upon their position in a word. This is quite confusing for people who are getting literacy instruction in the Brahui language.
A system that can convert between the two scripts would help keep script issues from hindering the growth of the language.
1. The Manchu language is near extinction in terms of native speakers; however, a lot of enthusiasts and academics are learning it as a second language. In China, I believe, learners mainly use the Manchu script, while in the West they learn the language in both the Latin and Manchu scripts.
2. A little snag we might run into is the fact that the Manchu script is normally written vertically, from top to bottom. However, if need be, that rule can be bent and we can write it horizontally, and people can manually rotate their screens if they wish to read it in the Manchu script.
3. The Jurchen script is used for writing an earlier stage of Manchu, the Jurchen language. If it ever works out properly in Unicode, we might create a separate Jurchen Wikipedia, just as we have separate modern and Old English Wikipedias.
The Talysh language has three writing systems: Latin, Cyrillic and Perso-Arabic.
An automatic conversion between the three writing systems can be developed. However, unlike Kazakh Wikipedia (maybe like Uzbek Wikipedia), the Talysh community decided to use Latin script first.
- This article is based on the Korean article ko:사용자:Yes0song/다지모/다양한 표기법 현황