Diversifying Wikipedia: Technological and socio-political aspects

Submission no.

Title of the submission: Diversifying Wikipedia: Technological and socio-political aspects

Type of submission (discussion, hot seat, panel, presentation, tutorial, workshop): presentation

Author of the submission: Ritesh Kumar

E-mail address: riteshkrjnugmailcom

Username: riteshkrjnu

Country of origin: India

Affiliation, if any (organisation, company etc.): Dr. Bhim Rao Ambedkar University, Agra

Personal homepage or blog

Abstract (at least 300 words to describe your proposal)

Despite great strides towards diversity in the coverage of its articles, Wikipedia still lags behind in the diversity of languages in which that knowledge is disseminated. According to data from the 2001 Census, there are 1635 mother tongues in India. Of these, just 21 languages are represented on Wikipedia, and the number of articles in most of them is negligible compared with the number in a language like English. Besides English, the largest Indian language on Wikipedia is Hindi, which has a global rank of 49 (based on the number of articles) with over 116 thousand articles. Table 1 below summarises the number of articles in each of the Indian languages on Wikipedia along with their global rank.

Sl. No.	Global Rank	Language	Articles
1	1	English	4,678,838
2	49	Hindi	116,629
3	59	Newar	71,163
4	61	Tamil	65,589
5	63	Urdu	63,568
6	66	Telugu	60,070
7	76	Marathi	40,994
8	78	Malayalam	37,576
9	81	Bengali	33,519
10	83	Western Punjabi	33,240
11	95	Nepali	26,926
12	97	Gujarati	25,636
13	98	Bishnupriya Manipuri	25,126
14	107	Kannada	17,306
15	109	Punjabi	16,237
16	127	Sanskrit	10,274
17	132	Oriya	8,560
18	148	Bihari (Bhojpuri)	5,569
19	184	Assamese	2,998
20	243	Sindhi	577
21	248	Maithili	525

Table 1: Wikipedia articles in Indian languages (extracted from [1])
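To make the scale of the gap concrete, the following minimal sketch expresses each count in Table 1 as a percentage of the English count. The figures are copied from the table; the names `ARTICLES` and `share_of_english` are illustrative only and are not part of any Wikipedia tool.

```python
# Article counts copied from Table 1 (language -> number of articles).
ARTICLES = {
    "English": 4_678_838,
    "Hindi": 116_629,
    "Newar": 71_163,
    "Tamil": 65_589,
    "Urdu": 63_568,
    "Telugu": 60_070,
    "Marathi": 40_994,
    "Malayalam": 37_576,
    "Bengali": 33_519,
    "Western Punjabi": 33_240,
    "Nepali": 26_926,
    "Gujarati": 25_636,
    "Bishnupriya Manipuri": 25_126,
    "Kannada": 17_306,
    "Punjabi": 16_237,
    "Sanskrit": 10_274,
    "Oriya": 8_560,
    "Bihari (Bhojpuri)": 5_569,
    "Assamese": 2_998,
    "Sindhi": 577,
    "Maithili": 525,
}


def share_of_english(language: str) -> float:
    """Return a language's article count as a percentage of the English count."""
    return 100.0 * ARTICLES[language] / ARTICLES["English"]


# Print one row per language other than English.
for language, count in ARTICLES.items():
    if language != "English":
        print(f"{language:<22}{count:>9,d}{share_of_english(language):>8.2f}%")
```

On these figures, even Hindi, the largest of the other languages in the table, stands at roughly 2.5 per cent of the English article count.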

This biased representation of languages on Wikipedia could be understood in terms of two major factors:

a. Technological factors: Until very recently, the lack of technologies for input and rendering of the scripts used to write Indian languages was a major hurdle to their widespread use on the web. These languages were used only sporadically online, and mostly in the Roman script. However, with the inclusion of most of the characters from almost all the major Indian scripts in the Unicode standard and the development of phonetic keyboards, the issues of both input and rendering of Indian scripts have been solved to a large extent, except for a few grey areas. Even so, although the input methods for Indic scripts have been simplified and standardised to a great extent, a large part of the population is still unaware of these developments and continues to refrain from using these tools. This could be addressed through awareness programmes, workshops and training in Indic script input in different parts of the country, carried out through broad collaboration among institutions and universities.

Another way of improving community participation and enriching the content in a particular language is the use of technology, especially natural language processing tools, to make the contributors' task easier. One way to achieve this is to use machine translation systems to translate articles from a major language like English into other languages, and then have experts edit and adapt those articles in their own language. This would not only provide parallel articles in a large number of languages but also considerably ease the job of editors and contributors.
As more articles are translated, the parallel corpora thus created could also be used to further improve the machine translation systems, which would in turn feed back into Wikipedia content generation.
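The translate, review and feed-back loop described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `ParallelCorpus`, `translate_and_review`, `machine_translate` and `human_review` are hypothetical names introduced here, the translation step is passed in as a plain function rather than tied to any particular machine translation system, and the human review step stands in for the work of community editors.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ParallelCorpus:
    """Sentence pairs (source, reviewed target) collected from edited articles."""

    pairs: List[Tuple[str, str]] = field(default_factory=list)

    def add(self, source: str, target: str) -> None:
        self.pairs.append((source, target))

    def to_tsv(self) -> str:
        """Serialise as one tab-separated pair per line, e.g. for later MT training."""
        lines = []
        for source, target in self.pairs:
            # Tabs inside a sentence would break the format, so replace them.
            lines.append(source.replace("\t", " ") + "\t" + target.replace("\t", " "))
        return "\n".join(lines)


def translate_and_review(
    sentences: List[str],
    machine_translate: Callable[[str], str],  # hypothetical MT system
    human_review: Callable[[str, str], str],  # hypothetical editor step
    corpus: ParallelCorpus,
) -> List[str]:
    """Draft each sentence with MT, let an editor correct it, and keep the pair."""
    reviewed = []
    for source in sentences:
        draft = machine_translate(source)
        final = human_review(source, draft)
        # Only the human-approved text enters the corpus, not the raw draft.
        corpus.add(source, final)
        reviewed.append(final)
    return reviewed
```

The design choice worth noting is that only the reviewed text is stored: if raw machine output were fed back into training, the system would tend to reinforce its own errors rather than learn from the editors' corrections.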

A bigger issue is that a large number of languages are not represented on Wikipedia at all. While some of these languages have a script of their own, the majority of the world's languages are spoken languages without one. The question is how to create content in languages that have no script of their own. There are two possible solutions: using a standardised, universal script such as the International Phonetic Alphabet (IPA), or using a better-known script that is already used for a nearby major language. Both solutions have problems. IPA is not widely known, so its use may defeat the purpose of accessibility, while a script commonly associated with another language may not be acceptable to community members. Of the two, however, a better-known script does at least address the accessibility issue.

b. Socio-political factors: Along with the technological factors, the socio-political status of languages and the politics of language in India in general have played a crucial role in creating an imbalance in the representation of languages on Wikipedia. Officially, languages in India are divided into 22 scheduled languages (those included in the Eighth Schedule of the Constitution) and 100 non-scheduled languages. Besides these, there are hundreds of languages that are not counted in the official figures because they have fewer than 10,000 speakers, even though most of these lesser-known languages have a robust and stable population of speakers and are not endangered. Furthermore, several languages are classified as dialects or varieties of some major language without any convincing reason. As a result, it becomes very difficult to distinguish between distinct languages and their varieties.

Moreover, the global policy of the Wikimedia Foundation on opening a new language edition sends contradictory signals in the Indian context. On the one hand, it states that it does not consider “political differences”, so that it can give “unbiased access” to the “sum of all human knowledge” to every single person. On the other hand, it categorically excludes “regional dialects” from opening a new language edition, even though these are inherently political entities, in the sense that the distinction between languages and dialects is always political. This policy, along with the socio-political status assigned to Indian languages, has created an environment in which many languages are actively excluded from Wikipedia, thereby undermining its basic goal. While a solution to the larger linguistic issues of India is beyond the scope of this paper, improving linguistic diversity on Wikipedia so as to make it accessible to all requires a revision of the Wikimedia Foundation's eligibility criteria and language policy, informed by sociolinguistic research on languages, dialects and varieties.

Thus, increasing the linguistic diversity of Wikipedia, and thereby its accessibility to a large population, requires a two-fold effort. The first part is the use of the latest technologies, including improved input methods and advanced NLP techniques, for the rapid, large-scale development of articles in many different languages. The second is an understanding of the linguistic scenario of India and the adoption of a better-informed stance towards languages, so as to encourage Wikipedia articles in a large number of them.

Track: WikiCulture & Community

Language of Track: English

Length of session (if other than 30 minutes, specify how long): 30 minutes

Will you attend the conference in Kolkata at your own cost if your submission is not accepted?: Yes

Slides or further information (optional)

Special requests

Interested attendees

If you are interested in attending this session, please sign with your username below. This will help reviewers to decide which sessions are of high interest. Sign with a hash and four tildes. (# ~~~~).

riteshkrjnu.