Automatic conversion in Konkani language

From Meta, a Wikimedia project coordination wiki

This page explains automatic script conversion for the Konkani language. Konkani (language code kok) is written in multiple scripts. It is spoken in India, particularly in the Konkan area, which includes Goa, Karnataka, Maharashtra and some parts of Kerala. Konkani is also spoken in Wichita, KS, Kenya, Uganda, Pakistan, the Persian Gulf, and Lisbon in Portugal. There are 3.6 million native speakers. The Konkani language is written in the following scripts:

  1. Devanagari (official)
  2. Roman
  3. Kannada
  4. Malayalam
  5. Arabic

As of now, there are no wiki projects for this language. Automatic transliteration between these scripts is in development, and this page aims to give volunteer developers the details they need.

Automatic conversion system

The following is a step-by-step procedure for adding multi-script conversion for Konkani. Feel free to add more information. Since April 2005, there has been a LanguageConverter object that encapsulates most conversion-related functionality.

Prerequisites

If you are a volunteer trying to follow these steps, you should have a running instance of MediaWiki on your local computer. You should be familiar with some Indic scripts. You should also have the proper input methods and fonts for the scripts addressed here.

Add Konkani support to MediaWiki

At the time of writing, Konkani is not a supported language in MediaWiki. We need to add the Konkani language variants as identified languages of MediaWiki. Open wiki/languages/Names.php and add the following entries:

 'kok' => 'कोंकणी', 
 'kok-ml' => 'കൊങ്കണി', 
 'kok-kn' => 'ಕೊಂಕಣಿ',

kok is the default variant of Konkani, written in the Devanagari script. kok-ml is Konkani written in the Malayalam script, and kok-kn is Konkani written in the Kannada script.

Write the Language Classes for Konkani

The newly added language needs a language definition class. Create a PHP file named LanguageKok.php in the wiki/languages/classes folder with the following content.

<?php
require_once( dirname( __FILE__ ) . '/../LanguageConverter.php' );
require_once( dirname( __FILE__ ) . '/LanguageKok_ml.php' );
require_once( dirname( __FILE__ ) . '/LanguageKok_kn.php' );
/**
 * Konkani -Devanagari(कोंकणी)
 *
 * @ingroup Language
 */
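class LanguageKok extends Language {
	function __construct() {
		global $wgHooks;
		parent::__construct();
		// Script variants of Konkani handled by this class.
		$variants = array( 'kok', 'kok-ml', 'kok-kn' );
		// Every variant falls back to the default Devanagari variant.
		$variantfallbacks = array(
			'kok'    => 'kok',
			'kok-ml' => 'kok',
			'kok-kn' => 'kok',
		);
		// $wgHooks, $variants and $variantfallbacks are not used yet;
		// they are passed to the converter in a later step.
	}
}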

This class defines the language variants and the fallbacks for each variant.

As you may have noticed, it refers to three other PHP files. LanguageConverter.php contains an existing class, which is explained later. LanguageKok_kn.php and LanguageKok_ml.php are two new files we are going to create in the wiki/languages/classes folder. They hold the definition classes for the language variants. Create them and add the content given below. The content of LanguageKok_kn.php is

<?php
/**
 * Konkani -Kannada 
 *
 * @ingroup Language
 */
class LanguageKok_kn extends Language {
}

Content of LanguageKok_ml.php is

<?php

/**
 * Konkani Malayalam
 *
 * @ingroup Language
 */
class LanguageKok_ml extends Language {
}

As you can see, these classes are empty. Logic specific to Konkani in the Malayalam or Kannada script would go here, but for the sake of simplicity we are not adding any code.

Write a language converter

The next step is to write the language converter itself for Konkani. It should use the default script, i.e. Devanagari, as the base and contain the logic to convert it to any other script. Writing such a converter is not easy, but thanks to the existing transliteration systems in MediaWiki, there is a reusable LanguageConverter class. You can find it in wiki/languages/LanguageConverter.php. We are going to extend that class and add our custom code. Open LanguageKok.php again and update it so that it contains the following code.

<?php

require_once( dirname( __FILE__ ) . '/../LanguageConverter.php' );
require_once( dirname( __FILE__ ) . '/LanguageKok_ml.php' );
require_once( dirname( __FILE__ ) . '/LanguageKok_kn.php' );

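/**
 * Converter between the Konkani script variants.
 *
 * @ingroup Language
 */
class KokConverter extends LanguageConverter {
	// Devanagari => Malayalam. Illustrative entries only: apart from 'क',
	// these pairs are not real letter correspondences.
	var $mToMalayalam = array(
		'क' => 'ക', 'ल' => 'ആ',  'र' => 'ഇ', 'व' => 'ഈ',  'प' => 'ഉ',
	);

	// Devanagari => Kannada. Illustrative entries only, as above.
	var $mToKannada = array(
		'क' => 'ಕ', 'ल' => 'ಕ್',  'र' => 'ಇ', 'व' => 'ಜ್',  'प' => 'ಮ್',
	);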

	function loadDefaultTables() {
		$this->mTables = array(
			'kok-ml' => new ReplacementArray( $this->mToMalayalam ),
			'kok-kn' => new ReplacementArray( $this->mToKannada ),
			'kok'    => new ReplacementArray()
		);
	}

}

/**
 * Konkani -Devanagari(कोंकणी)
 *
 * @ingroup Language
 */
class LanguageKok extends Language {
	function __construct() {
		global $wgHooks;

		parent::__construct();

		$variants = array( 'kok', 'kok-ml', 'kok-kn' );
		$variantfallbacks = array(
			'kok'    => 'kok',
			'kok-ml' => 'kok',
			'kok-kn' => 'kok',
		);

		$flags = array();
		$this->mConverter = new KokConverter( $this, 'kok', $variants, $variantfallbacks, $flags );
		$wgHooks['ArticleSaveComplete'][] = $this->mConverter;
	}

}

You can see that we added a KokConverter class, which has a replacement (transliteration) table mapped to each language variant. The transliteration rules in this example do not make sense; they are only for illustration. The tables need to be expanded to cover all the transliteration rules of each script.
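To make the idea of a replacement table concrete, here is a minimal standalone sketch in plain PHP. It is not MediaWiki code: it assumes that PHP's built-in strtr() is an adequate stand-in for what a replacement table does, and the table contents and the transliterate() helper are hypothetical names chosen for illustration. Unlike the stub table above, the five pairs here are real Devanagari to Malayalam letter correspondences.

```php
<?php
// Hypothetical, partial Devanagari => Malayalam table for illustration only.
// A real table must cover every letter, vowel sign and conjunct.
$toMalayalam = [
    'क' => 'ക',
    'र' => 'ര',
    'ल' => 'ല',
    'व' => 'വ',
    'प' => 'പ',
];

/**
 * Transliterate $text using a replacement table.
 * strtr() tries the longest keys first and does not rescan text it has
 * already replaced, so multi-character keys win over single letters.
 */
function transliterate(string $text, array $table): string {
    // An empty table means "no conversion", as for the default 'kok' variant.
    return $table === [] ? $text : strtr($text, $table);
}

echo transliterate('कल', $toMalayalam), "\n"; // prints "കല"
```

Characters that have no entry in the table, such as Latin letters or digits, pass through unchanged, which is usually the desired behaviour for mixed-script wiki text.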

Testing

Now it is time to see whether this code works. We need to set the content language of the wiki to kok so that this code is executed. To do this, locate $wgLanguageCode in your wiki/LocalSettings.php and change its value as follows:

$wgLanguageCode = "kok";

Now open your wiki in a browser and look for the dropdown that appears alongside the Page and Discussion tabs. The scripts should be listed there.

Automated Testing

We can also test the transliteration using PHPUnit-based tests. Create a test class file named LanguageKokTest.php in the wiki/tests/phpunit/languages folder with the following content.

<?php
/**
 * @author Santhosh Thottingal
 * @copyright Copyright © 2011, Santhosh Thottingal
 * @file
 */

/** Tests for MediaWiki languages/LanguageKok.php */
class LanguageKokTest extends MediaWikiTestCase {
	private $lang;

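	function setUp() {
		// Language codes are lowercase, matching the 'kok' entry in Names.php.
		$this->lang = Language::factory( 'kok' );
	}
	function tearDown() {
		unset( $this->lang );
	}

	function testTranslate() {
		// assertEquals() takes the expected value first, then the actual value.
		$this->assertEquals( 'ക', $this->lang->mConverter->convertTo( 'क', 'kok-ml' ) );
		$this->assertEquals( 'ಕ', $this->lang->mConverter->convertTo( 'क', 'kok-kn' ) );
	}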
}

Now, in your terminal, go to the wiki/tests/phpunit directory and execute the test case:

php phpunit.php languages/LanguageKokTest.php

If everything is set up correctly, all tests should pass.

Transliteration Rules

An intra-Indic script transliteration tool is already implemented in SILPA. It is an open-source project and contains many language-specific rules for Indic languages. It is a Python project, but the logic can be reused without much difficulty.

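Common Rules

In Unicode, major Indic scripts such as Devanagari, Kannada and Malayalam each occupy their own block of 128 code points, and corresponding letters sit at the same position within each block. The code points of equivalent letters therefore differ by a multiple of 128. For example, if you add 128 to the code point of Kannada ka (U+0C95), you get Malayalam ka (U+0D15).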
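The offset rule can be sketched as a small standalone PHP function. This is only an illustration under stated assumptions: the INDIC_BLOCK_START constant and the shiftIndicScript() helper are hypothetical names, the block start values are taken from the Unicode code charts, and the code needs PHP 7.2 or later with the mbstring extension for mb_ord() and mb_chr().

```php
<?php
// Start of each script's Unicode block (values from the Unicode code charts).
const INDIC_BLOCK_START = [
    'devanagari' => 0x0900,
    'kannada'    => 0x0C80,
    'malayalam'  => 0x0D00,
];

/**
 * Shift every character of the $from block by the distance between the two
 * blocks; characters outside the $from block are copied unchanged.
 * Hypothetical helper: it does not check that the resulting code point is
 * actually assigned in the $to script.
 */
function shiftIndicScript(string $text, string $from, string $to): string {
    $start  = INDIC_BLOCK_START[$from];
    $offset = INDIC_BLOCK_START[$to] - $start;
    $out = '';
    // Split the UTF-8 string into individual code points.
    foreach (preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        $cp = mb_ord($char, 'UTF-8');
        $out .= ($cp >= $start && $cp < $start + 128)
            ? mb_chr($cp + $offset, 'UTF-8')
            : $char;
    }
    return $out;
}

echo shiftIndicScript('ಕ', 'kannada', 'malayalam'), "\n";  // prints "ക"
echo shiftIndicScript('क', 'devanagari', 'kannada'), "\n"; // prints "ಕ"
```

Devanagari and Kannada are seven blocks apart, so the second call shifts by 7 × 128 = 896. A plain offset is only a starting point: not every position is assigned in every script, and script-specific behaviour such as schwa deletion or Malayalam chillus still needs explicit rules.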

Schwa deletion of Devanagari

Read more in the article Schwa deletion in Indo-Aryan languages.

Chillus of Malayalam

Issues with Arabic transliteration

Notes to Developers

  1. Important: If you are going to work on this stub code, please inform sthottingal @ wikimedia.org to avoid any duplication of efforts and for better coordination.
  2. The code samples explained here are heavily simplified for the sake of explanation. The code will become more complex once it covers all language-specific rules. You may refer to LanguageSr.php or LanguageZh.php in the wiki/languages/classes folder to understand how the Serbian and Chinese Wikipedia projects achieved their automatic transliteration systems.
  3. Do not start with all scripts, take only one script that you are familiar with.

Girgit

Girgit, a tool for transliteration between the three scripts, has been released under the GPL. It is worth investigating whether it can be integrated into the Konkani Wikipedia.[1][2]

See also

  • http://www.literatureindia.com/2009/10/16/the-source-code-of-girgit-is-now-freely-available/
  • https://code.google.com/archive/p/girgit/source#svn%2Ftrunk%2FGirgit%2Fsrc%2Fin%2Fchitthajagat%2Fgirgit%2Fserver