
Wikipedias in multiple writing systems

From Meta, a Wikimedia project coordination wiki

This article describes each Wikipedia that uses multiple writing systems.[1] If you are a native speaker of one of the languages listed below that needs automatic conversion between writing systems, you can help us write a comparison table of letters and transliteration rules. We can then help you build such a converter.

Third-party sites that help integrate transliteration efforts also exist and are welcome. (Original message written by Kprwiki.)

You can also consult the existing transliteration tools available online.

Languages with automatic conversion systems

Wikis in these languages have language conversion systems, either within the MediaWiki software (see the MediaWiki.org documentation for technical details) or through local scripts or gadgets.

Full support

Anglo-Saxon

Anglo-Saxon has two writing systems: Latin and runes. An automatic transliteration system is already enabled on every page of the Anglo-Saxon Wikipedia.

Balinese

The Balinese language has two writing systems: the Latin script and the Balinese script.

An automatic transliteration system has been developed on the Balinese projects to convert Latin characters into Balinese characters, but it is unclear whether the reverse conversion is supported.

Three transliteration variants are defined:

  1. DHARMA transliteration (ban-x-dharma)
    Transliteration rules following the DHARMA project's "strict transliteration".
    Mostly follows ISO 15919, with modifications for precision and broader coverage.
  2. Palmleaf.org transliteration (ban-x-palmleaf)
    Transliteration rules developed for Palmleaf.org.
  3. Puri Kauhan Ubud transliteration (ban-x-pku)
    Transliteration rules developed at Puri Kauhan Ubud and widely used in Bali.
    Also the default Balinese-to-Latin transliteration variant.

Chinese

Written vernacular Chinese (also known as Standard Chinese (cmn), which uses the Chinese macrolanguage code zh on Wikimedia sites) has two main writing systems: Simplified Chinese (zh-Hans) and Traditional Chinese (zh-Hant). Vocabulary and syntax also differ across the Chinese-speaking regions.

The Chinese Wikipedia (zhwiki), along with some other zh.wiki* projects [note 1], supports six variants:

  1. Simplified Chinese (Mainland China) (zh-Hans-CN)
  2. Traditional Chinese (Hong Kong) (zh-Hant-HK)
  3. Traditional Chinese (Macau) (zh-Hant-MO)
  4. Simplified Chinese (Malaysia) (zh-Hans-MY)
  5. Simplified Chinese (Singapore) (zh-Hans-SG)
  6. Traditional Chinese (Taiwan) (zh-Hant-TW)

In URLs, Wikidata labels, and probably database schemas, the variants are shortened to zh-cn, zh-hk, zh-mo, zh-my, zh-sg, and zh-tw. For Wikidata, however, it is currently unclear which variants (including zh-hans, zh-hant, and the original "zh") should be used and which should not.
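The relationship between the six full variant tags and their shortened forms can be sketched in a few lines. This is only an illustration of the naming pattern described above: the names `VARIANTS`, `short_code`, and `SHORT_TO_FULL` are hypothetical and are not taken from MediaWiki.

```python
# The six variant tags listed above (language-script-region).
VARIANTS = [
    "zh-Hans-CN", "zh-Hant-HK", "zh-Hant-MO",
    "zh-Hans-MY", "zh-Hans-SG", "zh-Hant-TW",
]


def short_code(tag: str) -> str:
    """Drop the script subtag and lowercase, e.g. 'zh-Hans-CN' -> 'zh-cn'.

    Assumes a three-part tag; two-part tags such as 'zh-Hans' are out of scope.
    """
    language, _script, region = tag.split("-")
    return f"{language}-{region}".lower()


# Reverse lookup from the shortened code back to the full tag.
SHORT_TO_FULL = {short_code(tag): tag for tag in VARIANTS}

if __name__ == "__main__":
    for short, full in SHORT_TO_FULL.items():
        print(short, "<-", full)
```

The reverse table is needed because the shortened code no longer says which script a variant uses; that information has to be looked up rather than derived.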

The variants are also supported by the /zh translation pages powered by Special:Translate.

Bug reports and feature requests can be filed on the Chinese Wikipedia (users facing a language barrier can use the talk page).

Gothic

Gothic has two writing systems: Latin and Gothic. An automatic transliteration system is already enabled on every page of the Gothic Wikipedia.

Update 2024: This automatic transliteration system does not work in the 2022 version of the Vector skin; consider switching to another skin in order to use the conversion.

Inuktitut

Inuktitut as spoken in Canada has two writing systems: Inuktitut syllabics are used in parts of the territory of Nunavut, while other regions use the Latin alphabet. An automatic conversion system between the two has been created. However, because the syllabary has no capital letters, conversion from syllabics to Latin produces only lowercase Latin letters.
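The loss of capitalization can be seen in a minimal round-trip sketch. The table below covers only the six syllabic characters needed to spell the word "inuktitut" and is an assumption made for illustration; it is not the mapping used by the actual converter, and the function names are hypothetical.

```python
# Tiny illustrative table: just enough to spell "inuktitut".
SYLLABIC_TO_LATIN = {
    "ᐃ": "i", "ᓄ": "nu", "ᒃ": "k", "ᑎ": "ti", "ᑐ": "tu", "ᑦ": "t",
}
LATIN_TO_SYLLABIC = {latin: syl for syl, latin in SYLLABIC_TO_LATIN.items()}


def to_latin(text: str) -> str:
    """Replace each known syllabic character; leave everything else as is."""
    return "".join(SYLLABIC_TO_LATIN.get(ch, ch) for ch in text)


def to_syllabics(text: str) -> str:
    """Greedy Latin-to-syllabics conversion (two-letter syllables first)."""
    lowered = text.lower()  # syllabics have no case, so case is discarded here
    out = []
    i = 0
    while i < len(lowered):
        for size in (2, 1):
            chunk = lowered[i:i + size]
            if chunk in LATIN_TO_SYLLABIC:
                out.append(LATIN_TO_SYLLABIC[chunk])
                i += len(chunk)
                break
        else:
            # Unknown character: pass it through (already lowercased).
            out.append(lowered[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    syllabics = to_syllabics("Inuktitut")
    print(syllabics, "->", to_latin(syllabics))
```

Once the text is in syllabics, nothing records that the Latin original began with a capital letter, so the round trip cannot restore it without extra rules such as capitalizing sentence starts.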

Automatic conversion is enabled on the Inuktitut Wikipedia. Note that the variant codes do not use Wikipedia's "iu" (the ISO 639-1 macrolanguage code); they use ike-Cans for syllabics and ike-Latn for Latin.

Kurdish

Tracked in Phabricator:
Task T199895

The Kurdish language uses three writing systems, depending on the region:

  • the Latin alphabet is used in Turkey and Syria,
  • the Arabic alphabet in Iraq and Iran, and
  • the Cyrillic alphabet in the former USSR; because it is no longer in use, this writing system is not covered here.

The Kurdish Wikipedia offers an automatic conversion system between the Latin and Arabic writing systems.

Kurdish Latin-Arabic converter.

Tachelhit (Shilha)

The Tachelhit language has two writing systems: Tifinagh and Latin. Some sources also mention that Arabic script was once used, but that usage is too old to be relevant here.

An automatic Tifinagh-to-Latin transliteration system has been set up on the test wiki; the reverse conversion was recently deployed in the MediaWiki software.

Tracked in Phabricator:
Task T59138

Wu Chinese has two major writing systems, Simplified Chinese and Traditional Chinese.

Automatic conversion between the two writing systems, as on zhwiki, is desirable. It has been supported since MediaWiki 1.41.

Partial support

Cantonese

Cantonese Traditional to Cantonese Simplified Han characters

The Cantonese language can be written in Traditional or Simplified characters.

The Cantonese Wikipedia has a one-way conversion system from Traditional to Simplified characters, implemented as a JavaScript gadget. (This Phabricator task describes the process for replacing it with a converter provided by the system.) All articles are written and edited in Traditional characters, because Traditional-to-Simplified conversion is more reliable than the reverse: Simplified characters erase some distinctions that Traditional characters preserve.
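Why the one-way direction is the safer one can be shown with a very small sketch. The table holds only four well-known character pairs chosen for illustration; it is not the gadget's actual table, and the function names are hypothetical.

```python
from collections import defaultdict

# Illustrative sample: two Traditional characters can share one Simplified form.
TRAD_TO_SIMP = {
    "發": "发",  # "to emit"
    "髮": "发",  # "hair"
    "後": "后",  # "after"
    "后": "后",  # "empress"
}


def to_simplified(text: str) -> str:
    """Traditional -> Simplified: every input character has exactly one output."""
    return "".join(TRAD_TO_SIMP.get(ch, ch) for ch in text)


def traditional_candidates() -> dict:
    """Invert the table to see which Simplified characters are ambiguous."""
    inverse = defaultdict(list)
    for trad, simp in TRAD_TO_SIMP.items():
        inverse[simp].append(trad)
    return dict(inverse)


if __name__ == "__main__":
    print(to_simplified("發"), to_simplified("髮"))
    print(traditional_candidates())
```

The forward table is a plain function, whereas the inverted table maps one Simplified character to several Traditional candidates, so a reverse converter would need word-level context to choose among them.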

Cantonese Traditional Characters to Cantonese Romanizations

Cantonese can also be written in romanized form. There are three main Cantonese romanization schemes: Penkyamp, Jyutping, and Yale. The Cantonese Wikipedia could eventually incorporate all three as a transliteration function, so that native and non-native Cantonese readers can read articles in any of the three orthographies.

The Penkyamp transliteration tool can be found here. The full Chinese to Penkyamp list can be found here.

Crimean Tatar

Tracked in Phabricator:
Task T23582 resolved
Tracked in Phabricator:
Task T326864 invalid

The Crimean Tatar language has three main writing systems: Latin, Cyrillic, and Arabic.

The Crimean Tatar Wikipedia mainly uses the Latin script, but Cyrillic has been the de facto official script in Crimea since the annexation of Crimea by the Russian Federation.

Following major work on the MediaWiki core code, conversion between Latin and Cyrillic has been developed for crhwiki and the crh test Wiktionary.

We are still waiting for volunteers to give feedback on the Crimean Tatar Arabic script. Should there also be a crh-Arab conversion? Share your opinion on the talk page.

Since January 2023, there have also been discussions about adding support for Dobrujan Tatar (crh-RO), a dialect of Crimean Tatar spoken in Romania.

Gan Chinese

The Gan language has three main writing systems: Simplified Gan Chinese, Traditional Gan Chinese, and Romanized Gan.

The Gan Wikipedia currently has an automatic conversion system for two writing systems (Simplified and Traditional Gan Chinese), but not for Romanized Gan. Automatic conversion to Romanized Gan would be desirable so that people who do not speak Gan can learn and understand the language more easily.

Serbian

The Serbian language has two writing systems, Cyrillic (sr-Cyrl) and Latin (sr-Latn), and two main dialects. In theory there are therefore four variants of the language:

  1. Ekavian, Cyrillic alphabet (sr-Cyrl-ekavsk)
  2. Ekavian, Latin alphabet (sr-Latn-ekavsk)
  3. Ijekavian, Cyrillic alphabet (sr-Cyrl-ijekavsk)
  4. Ijekavian, Latin alphabet (sr-Latn-ijekavsk)

The Serbian Wikipedia offers an automatic conversion system between the two writing systems, but not between the dialects, because there are few differences between them.
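As a rough illustration of script conversion of this kind, the following sketch maps Serbian Cyrillic to Latin one character at a time using the standard letter correspondences. It is not the MediaWiki converter: the function name is hypothetical, and all-caps words are not handled specially (an uppercase digraph letter becomes "Lj", "Nj", or "Dž").

```python
# Standard Serbian Cyrillic -> Latin letter correspondences (lowercase).
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}


def cyrillic_to_latin(text: str) -> str:
    """Convert character by character, keeping anything not in the table."""
    out = []
    for ch in text:
        lower = ch.lower()
        latin = CYR_TO_LAT.get(lower)
        if latin is None:
            out.append(ch)                   # punctuation, digits, Latin text
        elif ch != lower:
            out.append(latin.capitalize())   # uppercase input, e.g. Љ -> Lj
        else:
            out.append(latin)
    return "".join(out)


if __name__ == "__main__":
    print(cyrillic_to_latin("Википедија на српском језику"))
```

The Cyrillic-to-Latin direction is the simple one, since each Cyrillic letter has a single Latin equivalent. Going the other way requires deciding whether a sequence such as "nj" stands for one letter or two, which a character table alone cannot settle.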

Currently the variant codes in use, "sr-ec" and "sr-el", are incorrect; they are awaiting patches to fix them.

Serbo-Croatian

Tracked in Phabricator:
Task T268033 resolved

Serbo-Croatian is a pluricentric language with four standardized varieties (Bosnian, Croatian, Montenegrin, and Serbian), two main pronunciations (Ijekavian and Ekavian), and two writing systems: Latin (sh-Latn) and Cyrillic (sh-Cyrl). Based on consensus among its editors, a one-way transliterator from Latin to Cyrillic was deployed on the Serbo-Croatian projects on 1 December 2022 (except for the Serbo-Croatian test Wikivoyage, which uses the code hbs, for which there is no conversion support).

A proposal to implement the converter for both scripts has been made here.

Tajik

The Tajik language uses three writing systems, depending on the region:

  • the Cyrillic alphabet in Tajikistan;
  • the Arabic alphabet in Afghanistan;
  • the Latin alphabet.

The Tajik Wikipedia currently has an automatic conversion system for two of the writing systems (Cyrillic - Latin), but not for Perso-Arabic.

See tajpers.narod.ru for references on developing a Cyrillic - Perso-Arabic conversion system.

Talysh

Tracked in Phabricator:
Task T258975 resolved

The Talysh language has three writing systems: Latin, Cyrillic, and Perso-Arabic.

A one-way automatic transliteration system from Latin to Cyrillic has been developed; there is not yet any support for the reverse direction or for the Talysh Arabic script.

Uzbek

The Uzbek language has three writing systems:

  • Latin,
  • the Cyrillic alphabet, and
  • the Arabic alphabet.

The Uzbek Wikipedia currently has an automatic conversion system for two of the writing systems (Latin - Cyrillic), but not for Perso-Arabic.

Automatic conversion between all three writing systems is desirable, since the Perso-Arabic script is used in Afghanistan. A converter to Arabic script could be developed, and if it were ever deployed, the Southern Uzbek test Wikipedia would no longer be needed.

Languages with existing automatic conversion systems yet to be implemented

Wikis in these languages do not support language conversion automatically, but useful external tools exist to help readers read the wikis in different scripts. Hopefully, in the near future, these tools can be brought into the wikis, or even into the MediaWiki software.

As for the scripts used on these wikis:

  1. Either they simply use the most widely used script;
  2. Or they have pages in at least two scripts, which may or may not have navigation templates.

Azerbaijani

Tracked in Phabricator:
Task T31218 declined

The Azerbaijani language has three writing systems: Latin, Cyrillic, and Perso-Arabic.

The Azerbaijani Wikipedia is written in the Latin script.

However, because the Latin and Perso-Arabic scripts are incompatible, a South Azerbaijani Wikipedia was created in July 2015.

Automatic conversion between the Latin and Cyrillic scripts is desirable to make the wiki readable for Azerbaijanis living in Dagestan.

The Batak languages can be written using the Latin script and the Batak script (Surat Batak). A Latin - Surat Batak converter already exists [2].

Belarusian (Classical and Official orthographies)

The Belarusian language has two writing systems, Cyrillic and Latin.

In addition, the language is written in two spelling varieties: Classical Belarusian (used until 1933) and the Russifying Official Belarusian introduced in 1933. This situation necessitated the creation of two separate Belarusian Wikipedias. Both are written in Cyrillic.

Hence, a Latin converter is a pressing need for both, especially for the Belarusian diaspora and the Belarusian democratic opposition.

There is also a versatile converter that converts between Cyrillic and Latin, and between Classical and Official Belarusian:

Furthermore, this converter also offers conversion into Archaic (that is, Old) Belarusian, which is none other than the Ruthenian language, written in either Cyrillic or Latin letters.

NB1: The following converter should be avoided:

because it does not convert from the Belarusian Cyrillic to the Belarusian Latin alphabet, but transliterates Belarusian Cyrillic on the model of Russian romanization, in line with the official document Instruction on transliteration of Belarusian geographical names with letters of Latin script, which denies any official role to the Belarusian Latin alphabet.

Last but not least, until the mid-20th century Belarusian was written by Muslims in a third national alphabet, namely in Arabic letters, known as the Belarusian Arabic alphabet. No Cyrillic/Latin - Arabic converter has been developed yet, but some scholars are working to this end. See also Revised Proposal to encode Arabic characters used for Bashkir, Belarusian, Crimean Tatar, and Tatar languages.

NB2: In late 2021 a project of the Latin alphabet-based Belarusian Wikipedia, that is, the Biełaruskaja Wikipedyja łacinkaj, commenced.

The Buginese language can be written using the Latin script and the Lontara script. There is already a Latin - Lontara converter, which needs only small edits to be ideal. There is also an online Latin - Aksara Lontara converter [3].

The Chechen language has two writing systems: Cyrillic and Latin.

Automatic conversion from Cyrillic into Latin is desirable, since many Chechens living outside the Russian Federation cannot read Cyrillic.

The Konkani language has five writing systems: Devanagari script, Latin script, Kannada script, Arabic script and Malayalam script. The Goan Konkani Wikipedia has articles in the Devanagari, Latin and Kannada scripts. Although there exists a project for a script converter, it hasn't been developed yet.

In the absence of an on-Wiki system, an external tool, Konkanverter is being used to manually transliterate text.

It needs to be investigated whether MediaWiki's LanguageConverter system can be used to implement the script conversion.

Girgit, a tool for transliteration between the three scripts, has been released under the GPL. It is worth investigating whether it can be integrated into the Konkani Wikipedia.[2][3]

The Karakalpak language has two writing systems, Latin and Cyrillic.

Currently kaawiki uses the Latin script and does not have a conversion system.

There is a Karakalpak converter on Transliteration.kpr.eu; it supports conversion from Cyrillic to Latin, but the reverse conversion does not currently work.

The Kyrgyz language has three major writing systems. These are Cyrillic Kyrgyz, Latinized Kyrgyz, and Perso-Arabic Kyrgyz (used in Xinjiang, China).

An automatic conversion between the three writing systems is desirable since the Kyrgyz in China do not use Cyrillic.

An Arabic-to-Cyrillic converter is under development (tentative source code) so that Kyrgyz speakers in China can also contribute to Wikipedia without knowing Cyrillic.

Laz

The Laz language has two writing systems: Georgian script and Latin script. An automatic conversion into Georgian would be desirable to enable more Laz users from Georgia.

The alphabet is on Wikipedia, in Georgian and Latin.

The Polish language is typically written in Latin letters. Yet, in western Belarus Catholics mostly identify as Poles and speak the local Slavic vernacular, defined as Polish. However, they have no knowledge of the Latin alphabet. Hence, (mostly devotional) Polish-language books are published for them in Cyrillic.[4]

Supplying the Polish Wikipedia with a converter to such Polish Cyrillic would give this Polish minority population of 300,000 access to the Polish Wikipedia, which is one of the world's largest Wikipedias.

There are some readily available converters of this kind, namely

The Sindhi language can be written using a modified Persian alphabet or the Devanagari script. Most young Sindhi people in India do not know the Persian alphabet and use Devanagari, which leaves the current Wikipedia accessible only to readers in Pakistan.

A Sindhi Arabic-to-Devanagari conversion tool could be created (based on this table and this table), tested, and then installed on the Sindhi Wikipedia so that Sindhi articles can be read in the Devanagari script at the click of a tab. That would also eliminate the need for a separate wiki written in Sindhi Devanagari.

The Sundanese language can be written using the Latin script and the Sundanese script (Aksara Sunda). A Latin – Aksara Sunda converter already exists [4].

The Tatar language has three major writing systems: Cyrillic Tatar, Latinized Tatar, and Perso-Arabic Tatar.

Automatic conversion between the three writing systems is very desirable in order to avoid Tatar script conflicts.

As of September 2021, a Tatar Cyrillic-to-Latin conversion tool is available at baltoslav.eu, but reverse conversion is not yet supported.

The Turkmen language has three writing systems: Latin (used in Turkmenistan), Perso-Arabic alphabet (used in Iran and Afghanistan) and Cyrillic (historically used in Turkmenistan).

An automatic conversion between the three writing systems is desirable because although officially, Turkmen is rendered in the Latin alphabet, the old Cyrillic alphabet is still in wide use and many political parties in opposition to the authoritarian rule of President Niyazov continued to use the Cyrillic alphabet on websites and publications, most likely to distance themselves from the alphabet that Niyazov created.

The Uyghur language has three writing systems, Arabic, Latin and Cyrillic.

The Latin alphabet is used by Uyghurs in Turkey, Western countries, and parts of Xinjiang; the Cyrillic alphabet is used in CIS countries; and the Perso-Arabic script is used officially in Xinjiang.

Automatic conversion between the three writing systems is desirable to prevent conflicts between users with different preferences. Such a converter already exists: Yulghun.

Languages whose automatic conversion systems have been removed

Kazakh

Tracked in Phabricator:
Task T268143 resolved
Tracked in Phabricator:
Task T350684 resolved

Kazakh has three writing systems: Cyrillic (kk-Cyrl), Latin (kk-Latn), and Perso-Arabic (kk-Arab). The Kazakh Wikipedia (kkwiki), along with other kk.wiki* projects, therefore used to support automatic conversion between these writing systems.

In late 2023, the MediaWiki language conversion for Kazakh was removed.

Languages without automatic conversion system

Unfortunately, these languages have no language conversion support, either within the wikis or externally. The script-related problems affecting their content are the same as in the section above. They are sorted by the similarity of the conversion system required.

Hopefully, in the near future, the language conversion tools can be developed and deployed for them.

Arabic, Cyrillic and Latin

The Shughni language has three writing systems: Latin, Cyrillic, and Perso-Arabic.

The Shughni test Wikipedia is written in the Cyrillic, Latin, and Arabic scripts.

Automatic conversion between the Latin and Cyrillic scripts at Wikimedia Incubator is desirable to make the wiki readable for the 40,000 Shughni people in Tajikistan and the 20,000 in Afghanistan. Transliteration to the Shughni Arabic script can be added at a later date.

Cyrillic and Latin

The Bosnian language uses two writing systems: Latin and Cyrillic. Currently the Bosnian Wikipedia uses the Latin script and has no Cyrillic support. Some sources mention that Bosnian was written in Arabic script before the 1900s, but that is not relevant to present-day development.

A Cyrillic-Latin converter for Bosnian would be perfect.

It's possible that Lojban can be written in both Latin and Cyrillic, see Lojban grammar Wikipedia article.

The Nogai language can be written in both Cyrillic and Latin scripts. The Nogai test Wikipedia on Incubator is written mostly in Cyrillic, but the community has asked whether content could also be shown in Latin.

Tracked in Phabricator:
Task T169453 declined

The Romanian language can be written using either the Latin script or the Cyrillic script. Currently the Romanian Wikipedia uses only the Latin script, as some users think Cyrillic Romanian should be marked as "Moldovan".

Automatic conversion between the two writing systems was considered as per Proposals for closing projects/Deletion of Moldovan Wikipedia 2. However, owing to a number of large-scale community conflicts of interest, the idea is now considered off-limits and is unlikely to be revisited.

As a former Incubator administrator explained, a Cyrillic Romanian (or Moldovan, if you prefer) project is available on Fandom.

Vlax Romani has two major writing systems: Latinized Romani and Cyrillic Romani.

Arabic and Latin

The Brahui language has two main writing systems: Arabic script and the Latin script. This is because:

  1. The current online Arabic keyboard does not contain the required number of vowels for Brahui.
  2. Sometimes vowels are used as consonants depending upon their position in a word. This is quite confusing for people who are getting literacy instruction in the Brahui language.

A system that can convert between the two scripts would help keep script issues from hindering the growth of the language.

The Komering language has three major writing systems: Latin (officially used), Arabic (used by local Muslims), and the Komering script (not currently encoded separately in Unicode, which treats it as Rejang script). A proposal to develop a conversion system is being discussed at incubator:Talk:Wp/kge/Halaman Utamo.

The Malay language is normally written in a Latin alphabet called Rumi, although a modified Arabic script called Jawi also exists. Rumi and Jawi are co-official in Brunei. Efforts are under way to preserve the Jawi script and to revive its use among Malays in Malaysia, and students taking Malay language examinations in Malaysia have the option of answering in Jawi. The Latin alphabet, however, is still the most commonly used script in Malaysia, for both official and informal purposes.

An automatic conversion from Latin to Jawi script should be set up.

Arabic and Brahmic scripts

The Haryanvi language has two writing systems: Devanagari, used in India, and Shahmukhi (a modified Arabic script), used in Pakistan.

Currently the Haryanvi test Wikipedia on Incubator has many more articles in Shahmukhi (added since late 2023) than in Devanagari, in which only a handful of articles exist, created at least five years ago.

Tracked in Phabricator:
Task T12034 declined

The Kashmiri language has three writing systems. These are Devanagari Kashmiri, Perso-Arabic Kashmiri and Romanized Kashmiri.

Automatic conversion between the three writing systems is very desirable in order to avoid Kashmiri script conflicts. However, an accurate conversion script is very difficult to develop (see also [5]).

Punjabi

Several different scripts are used for writing the Punjabi language. In the Punjab province of Pakistan, the script used is Shahmukhi, which is essentially the same as the Urdu script. In the Indian state of Punjab, Sikhs and others use the Gurmukhī script. Hindus, and those living in neighbouring Indian states such as Haryana and Himachal Pradesh, sometimes use the Devanāgarī script. Shahmukhi and Gurmukhī are the scripts most commonly used for writing Punjabi and are considered the official scripts of the language.

One suggestion is to set up automatic Gurmukhī - Shahmukhi transliteration based on this source [dead link], as was done on the Kazakh Wikipedia, so that everyone can read both wikis in either the Gurmukhī or the Shahmukhi script.

The Tamil language can also be written in Arwi (Tamil Arabic script). A Tamil-to-Arwi conversion tool could be created, tested, and then installed on the Tamil Wikipedia so that Tamil articles can be read in the Arabic script at the click of a tab. That would also eliminate the need for a separate wiki written in Arwi.

Brahmic scripts and Latin

The Meitei language can be written using Meitei (or Meetei Mayek), Bengali and Latin scripts, and has several dialects. An automatic conversion system was proposed on Incubator, see incubator:User talk:Artoria2e5#A query.

The Pali language can be written using Devanagari, Brahmi and Latin scripts. An automatic conversion system was proposed here.

The Sylheti language can be written using Sylheti Nagri and Bengali scripts. The Sylheti test projects on Incubator are exclusively using Sylheti Nagri, and only use Bengali scripts in some talk pages.

A proposal to create a conversion system was discussed on the langcom mailing list, but a survey on Incubator showed that some contributors oppose implementing such a system.

CJKV and Latin

Automatic Han to Latin conversion may be difficult but perhaps possible with reasonable accuracy. Completely automatic Latin to Han conversion is either impossible or extremely difficult and will almost certainly be inaccurate without knowledgeable human intervention (indeed, this is a similar problem to an input method for Han characters). Without the latter, only contribution in Han is possible. This would then disadvantage contributors who only know the Latin orthography.

The Mindong language has two major writing systems. These are Traditional Chinese characters, and Romanized Foochowese (the writing system is known as "Bàng-uâ-cê").

Mindong Wikipedia currently does not have an auto-converting system for the two writing systems. An automatic conversion from Traditional Chinese characters into Romanized Foochowese would be desirable to avoid conflicts between users with different preferences and enable users to comprehend the meaning of every word more easily.

The 6,104 most used Han characters to Romanized Foochowese list can be found here.

An Eastern Min transliteration tool can be found here.

The Hakka language has two major writing systems: Chinese characters (Simplified and Traditional) and Romanized Hakka (see the existing Chinese character --> Hakka dictionary).

Hakka Wikipedia currently does not have an auto-converting system for the two writing systems. An automatic conversion from Traditional Chinese characters into Romanized Hakka would be desirable to avoid conflicts between users with different preferences and enable users to comprehend the meaning of every word more easily.

The 4000 most used Han characters to Romanized Hakka list can be found here.

The Minnan language has two major writing systems: Romanized Minnan and Minnan written in Traditional Chinese characters.

Automatic conversion in both directions, from Romanized Minnan --> Traditional Chinese characters and from Traditional Chinese characters --> Romanized Minnan, is desirable in order to avoid existing conflicts between users with different script preferences.

There were suggestions to combine the former Incubator test Wikipedia in Chữ Nôm with the Vietnamese Wikipedia, which uses Chữ Quốc ngữ, but that would be extremely difficult.

Wp/vi-nom no longer exists on Incubator. A substantially equivalent project exists at [6].

Different Latin scripts/orthographies

Norwegian (Bokmål and Nynorsk)

The Norwegian language, while nowadays written only in the Latin script, has several major orthographies; an exact count is hard to give.

Currently the well-known orthographies are:

  1. Bokmål, which the Norwegian Wikipedia currently uses; it is the official orthography as defined by the supreme court, and probably the one supported by Google Translate (which supports only one "Norwegian") and possibly by other machine translation tools;
  2. Riksmål, probably also used on the Norwegian Wikipedia, though no evidence has been provided; it had no IETF language tag as of September 2021;
  3. Nynorsk, which the Nynorsk Norwegian Wikipedia currently uses;
  4. Høgnorsk (IETF language tag hognorsk), also used on nnwiki, but only on a handful of pages (see nn:Special:Prefixindex/Nn/).

Historical records on nowiki indicate that it was originally the single Norwegian Wikipedia. Nynorsk speakers later reached a consensus to split off their articles and found nnwiki, leaving nowiki as the de facto Bokmål Wikipedia. Some users, however, disagree with this outcome and want to merge both back into one nowiki, using scripts to convert between the orthographies.

Southern Min (Minnan)

Someone commented on a user page to raise the possibility of automatic conversion between the two leading Latin orthographies here. They are Pe̍h-ōe-jī and Tâi-uân Bân-lâm-gí Lô-má-jī Phing-im Hong-àn (Tâi-lô). Each is strictly a function (in the mathematical sense) of the other. The conversion table is available. Something very simple on the level of the script conversion tool at ang: might just work. Incidentally, it might even serve as a rudimentary spellchecker if implemented properly. See also this thesis and this blog post.

The Nigeria Yoruba and the Benin Yoruba orthographies are different. The Yoruba Wikipedia uses the Nigeria Yoruba spelling.

The Nigeria Yoruba orthography is based on Samuel Crowther’s 1852 orthography, which was influenced by the Church Missionary Society writing system. Its rules were standardized during the 1875 Yoruba Orthography Conference. In 1966, the Western Nigeria Ministry of Education set up a committee to review the orthographic rules, and the Report of the Yoruba Orthography Committee was published in 1969. Following reactions to it, a larger committee published the Report of the Enlarged Committee on Yoruba Orthography in 1972.

In 1971, the Joint Working Party was set up to achieve practical reforms in multiple Nigerian languages, and the Yoruba Working Party accepted most of the recommendations of the Orthography Committees. In 1974, the Joint Consultative Committee on Education, set up by the Federal Ministry of Education, approved that the recommendations of the Joint Working Party be used by all Ministries of Education in Nigeria and by the West African Examinations Council.

The Benin Yoruba orthography is based on the Benin National Alphabet created by the National Linguistic Commission in 1975 and adopted in law the same year. The Benin National Alphabet defines several Benin language orthographies, including a Yoruba one. The national alphabet was updated a few times, including in 1990 and in 2006.

The main differences between the Nigeria Yoruba and the Benin Yoruba orthographies are as follows: ẹ ọ p ṣ in Nigeria are spelled ɛ ɔ kp sh in Benin.
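The four correspondences listed above can be sketched as a small Python function. This is a minimal sketch, not an existing converter: the name `nigeria_to_benin` and the rule table are hypothetical, and it assumes that the input follows the Nigerian orthography and that these four substitutions are the only ones needed. Text is decomposed (NFD) first so that letters carrying both a dot below and a tone mark are still matched.

```python
import unicodedata

DOT_BELOW = "\u0323"  # combining dot below, as in ẹ ọ ṣ after NFD

# Hypothetical rule table (Nigeria spelling -> Benin spelling),
# covering only the four correspondences named in the text.
RULES = [
    ("e" + DOT_BELOW, "\u025b"),  # ẹ -> ɛ
    ("o" + DOT_BELOW, "\u0254"),  # ọ -> ɔ
    ("s" + DOT_BELOW, "sh"),      # ṣ -> sh
    ("E" + DOT_BELOW, "\u0190"),  # Ẹ -> Ɛ
    ("O" + DOT_BELOW, "\u0186"),  # Ọ -> Ɔ
    ("S" + DOT_BELOW, "Sh"),      # Ṣ -> Sh
    ("p", "kp"),
    ("P", "Kp"),
]


def nigeria_to_benin(text: str) -> str:
    """Respell Nigerian-orthography Yoruba text in the Benin orthography."""
    decomposed = unicodedata.normalize("NFD", text)
    for source, target in RULES:
        decomposed = decomposed.replace(source, target)
    # Recompose whatever still has a precomposed form (e.g. tone marks on a, i, u).
    return unicodedata.normalize("NFC", decomposed)


if __name__ == "__main__":
    print(nigeria_to_benin("\u1ecdp\u1eb9"))  # input spelled with ọ, p, ẹ
```

Because tone marks remain as separate combining characters after decomposition, a tone-marked ẹ or ọ keeps its tone mark on the new base letter.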

Cyrillic, Latin and Mongolic

The Kalmyk language can be written using the Cyrillic script and the Todo script.

Automatic conversion between the two writing systems is necessary because the Kalmyks in China (known there as Oirats) use only the Todo script.

The Manchu language has three writing systems: Manchu script, Jurchen script, and the Latin script.

  1. The Manchu language is near extinction in terms of native speakers; however, many enthusiasts and academics are learning it as a second language. Learners in China apparently use mainly the Manchu script, while learners in the West use both the Latin and Manchu scripts.
  2. One snag is that the Manchu script is normally written vertically, from top to bottom. If need be, that rule can be bent: the text can be laid out horizontally, and people can manually rotate their screens if they wish to read it in the Manchu script.
    The vertical script is now supported.
  3. The Jurchen script is used for writing an earlier stage of Manchu, the Jurchen language. If it is ever properly supported in Unicode, a separate Jurchen Wikipedia could be created, just as there are separate modern and Old English Wikipedias.
  4. All in all, langcom has repeatedly requested that a conversion system for Manchu be deployed as soon as possible.

The Mongolian language can be written using the Cyrillic script, the Classical Mongolian script and the ’Phagspa script (see Unicode; mainly used for art).[7]

Automatic conversion between the three writing systems is desirable to prevent the creation of separate Mongolian Wikipedias written in the Classical Mongolian script and in Latinized Mongolian.

The Xibe language can be written using either the Latin script or the Xibe script. The Xibe test Wikipedia currently has much of its content in the Xibe script; many of these pages were previously written in Latin and were manually converted to Xibe in late 2023.

An automatic conversion between both writing systems is desirable for readers.

Other converter

Peul/Fulfulde has two major writing systems: the Latin script and the Adlam script. Arabic Ajamiya is also used in Cameroon and neighbouring countries.

There are already some pages that have been converted manually, for example: Gine/adlam

The Javanese language is spoken primarily on the island of Java, and also by the Javanese diaspora in Indonesia and Suriname.

(1) There are two writing systems: traditional Hanacaraka (also called Carakan, an abugida) and Latin. Latin is far more prevalent: almost all publications in Javanese (of which there are only a small number) are in Latin. A one-to-one conversion is possible from Latin to Hanacaraka. Hanacaraka only recently (2009) received its own Unicode encoding, and there exist one Hanacaraka Unicode font and several non-Unicode fonts. Since the Unicode encoding is not yet supported through TrueType, the font uses SIL's Graphite.

The Javanese Wikipedia has already requested that WebFont be implemented. In the future, it would be desirable to have automatic conversion like that of the Chinese or Cyrillic projects.

(2) Another consideration: Javanese has (at least) two registers (sets of vocabulary) based on social standing: polite/palace Javanese (krama) and brash/market Javanese (ngoko). Both are used in Central Java; krama is more common in publications, while ngoko is more common in conversation. In some places ngoko is also found in publications, mainly in Suriname (for example, the Suriname-Javanese Bible uses ngoko, which would be vulgar to the eyes and ears of the Javanese people), where krama is no longer in use for historical and geographical reasons.

The same is true of East Javanese people, who have vehemently opposed the use of krama because of its association with the aristocracy, and of people from other ethnic groups throughout Indonesia. There are therefore four combinations (variants) of Javanese:

  • Hanacaraka krama
  • Latin krama
  • Hanacaraka ngoko
  • Latin ngoko

Converting from krama to ngoko sometimes only requires one-to-one mapping of vocabulary, but in other instances requires one-to-many or many-to-one, or even a change in the grammar.

(3) Historically, a third (and even a fourth) script was also used to write Javanese: the Arabic script (called the Pegon alphabet and the Arab gundul alphabet) and, long before that, Sanskrit/Pallava (Old Javanese/Kawi script). http://www.omniglot.com/writing/javanese.htm

These old scripts are not yet used in Wikimedia projects, but they would probably be beneficial in the future for Wikisource and the Javanese Wiktionary.

(4) Javanese Hanacaraka is related to the Sundanese and Balinese scripts, and Wikimedia currently hosts a Sundanese Wikipedia with its sister projects, as well as a Balinese Wikipedia.

There are discussions at the Korean Wikipedia's Dajimo about introducing a hanja system that uses automatic conversion.

There are also some discussions about the differences in Korean grammar between South Korea and North Korea, though the need for script conversion is still under analysis.

The Ladino language has two major writing systems: Latinized Ladino and the Rashi script (a variant of the Hebrew script).

Automatic conversion between the two writing systems is desirable to prevent the duplication of articles. However, it faces a technical challenge that is very hard to resolve; see the talk page for details.

The Tagalog language can be written in the Latin or Baybayin scripts. However, because Baybayin has been shelved by local governments, there appears to be little support for a potential conversion system.

Notes

  1. As of September 2021, the Chinese Wiktionary and Wikisource only allow the simplified-traditional conversion system, while on the Chinese Wikibooks, Wikinews and Wikiquote, zh-Hant-MO is merged into zh-Hant-HK, and likewise zh-Hans-MY is merged into zh-Hans-SG.

See also

More lists of Wikipedias by various criteria: