Wikimedia Blog/Drafts/How Odia Wikimedia community is building up to acquire content from newspapers and state portals

From Meta, a Wikimedia project coordination wiki

Title ideas

  • How the Odia Wikimedia community is working to acquire content from newspapers and state portals
  • How the Odia Wikimedia community is working to enrich Wikipedia using content from newspapers and state portals
  • How a Unicode converter could be key to improving the Odia Wikipedia
  • How the Odia Wikimedia community is making Odia sources easier to use


A brief, one-paragraph summary of the post's content, about 20-80 words. On the blog, this will be shown in the chronological list of posts or in the featured post carousel on top, next to a "Read more" link.

  • The Odia-language Wikimedia community has built character encoding converters that turn digital content from newspapers, magazines and portals, published in proprietary, non-standard encodings, into Unicode. The converted text can enrich the Odia Wikipedia, in time for its thirteenth birthday.


Odia language input in Odia Wikipedia.png
Accessing sources in the Odia language is set to become easier than ever before. Photo by Subhashish Panigrahi, freely licensed under CC BY-SA 4.0.

Around this time last year, I wrote about the Odia-language character encoding converters that the Wikipedia community was developing. They convert text from proprietary encodings into Unicode, a universal standard. In time for today's thirteenth anniversary of the Odia Wikipedia, these converters are now available for use.

A character encoding maps each character in a set to a numeric code, and is used in computation, data storage, and transmission of text. Before Unicode, fonts for different scripts relied on several different, mutually incompatible encoding systems.

However, most media outlets—as well as the state government—are still using old encoding systems for Odia. These require the installation of a particular font using the same encoding system to read documents. Unicode makes this much easier, as most modern computers come with Unicode fonts preinstalled.

A character encoding converter is generally used to convert from one encoding system to another. Massive amounts of content, not archived on a regular basis, could now be converted to Unicode and, in turn, provide Wikipedia editors with easily accessible sources to create new articles and enhance existing ones.
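
To make the idea concrete, here is a minimal Python sketch of a table-driven converter. It is not the code of the Odia converters described in this post: the mapping table and the `convert` function are hypothetical, and the legacy keys are invented rather than taken from any real Odia font. The sketch assumes that a legacy font stores Odia glyphs under ordinary Latin character codes, so converting means looking up each legacy sequence and emitting the matching Unicode character.

```python
# Hypothetical legacy-to-Unicode table. The keys are invented for this
# illustration and do not match any real Odia font encoding.
LEGACY_TO_UNICODE = {
    "k": "\u0B15",   # ORIYA LETTER KA
    "m": "\u0B2E",   # ORIYA LETTER MA
    "A": "\u0B3E",   # ORIYA VOWEL SIGN AA
    "kh": "\u0B16",  # ORIYA LETTER KHA (two legacy characters, one letter)
}


def convert(text, table=LEGACY_TO_UNICODE):
    """Replace legacy sequences with Unicode, trying the longest key first."""
    keys = sorted(table, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # No mapping found: keep the character (spaces, digits, punctuation).
            out.append(text[i])
            i += 1
    return "".join(out)


print(convert("kA mA"))
```

Trying the longest key first keeps a two-character legacy sequence such as the invented "kh" from being read as "k" followed by an unmapped "h". A real converter may need rules beyond a lookup table, which this sketch leaves out.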

The Odia language is spoken by over 40 million people in eastern India, across various Indian cities, and by expatriates abroad. It is one of the oldest languages in South Asia, and is recognized as a "classical language" by the Indian government. The Odia Wikipedia celebrates its thirteenth anniversary today, June 3.

Wikipedian Jnanaranjan Sahu receiving an award at the Odisha Dibasa 2014 celebration. Photo by Biswadeep Mishra, freely licensed under CC BY-SA 3.0.

However, the "classical language" status has not yet boosted knowledge production or use of the language on the Internet. Almost all online newspapers and state publications, such as Odia-language journals, public announcements, and information portals, host their content in various legacy character encodings that do not allow users to easily access and share information. This has, unsurprisingly, proven a major hurdle for the small Odia Wikimedia community, who hope to enrich their project with Odia-language citations.

To help solve these problems, the community turned to two existing encoding converters. These had been developed by friends from Srujanika, a non-profit based in Bhubaneswar, Odisha, that works on promoting science education in school curricula in the Odia language, as well as the digitization of early Odia literature. They became the building blocks on which Wikimedian Manoj Sahukar built new converters after substantially rewriting their code. I was also part of the rebuilding process, from the initial development of the converters to the design of their interface, and I helped to design handouts teaching new users how to use them.

The community played a major role in promoting the converters on social media. An op-ed in the Odia newspaper Samaja helped to reach more people who were unaware of the uses for Unicode. Many Internet users did not realize that what they had been sharing on their blogs or social media was written in various legacy encodings. Such text neither appears in search engines nor can be shared in an accessible way.

The converters were improved over time by using news and articles from newspapers and magazines as test cases. Citing Odia-language sources wasn't so easy before; making use of any content from a newspaper could take Odia Wikipedia editors hours.

From September 2014 to March 2015, a small project ran to convert text from several newspapers and magazines so that it could be cited in articles in the Odia Wikipedia. This matters because, when these sources are not available in Unicode, search engines and Wikipedia users can have difficulty finding them.

A conversation with developer and Wikimedian Manoj Sahukar about designing an encoding converter for Odia. Audio recording by Subhashish Panigrahi, freely licensed under CC BY-SA 3.0.

Because the converters were hosted separately on Google Drive, there was no single place to find them all. Odia Wikipedians wanted their Wikipedia to host a single converter, where a user could select the appropriate input encoding. Wikimedian and developer Jnanaranjan Sahu came up with a responsive, wiki-based converter that went live on May 12 and is now available for use. The converter lets users choose the source encoding from a drop-down menu, and converts the input into Unicode. Issues with this conversion process can be reported via a Google spreadsheet.
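
The idea of one converter with a selectable source encoding can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the wiki-based converter itself: the encoding names, the tiny tables, and the `to_unicode` function are all hypothetical, and the legacy characters are invented.

```python
# Hypothetical encoding names and tables, invented for illustration only.
# Each table maps one legacy character to one Unicode character.
CONVERSION_TABLES = {
    "legacy-font-a": {"k": "\u0B15", "A": "\u0B3E"},  # KA, VOWEL SIGN AA
    "legacy-font-b": {"q": "\u0B15", "w": "\u0B3E"},  # same letters, other codes
}


def to_unicode(text, source_encoding):
    """Convert text using the table chosen by name, as a drop-down menu would."""
    try:
        table = CONVERSION_TABLES[source_encoding]
    except KeyError:
        raise ValueError(f"unknown source encoding: {source_encoding!r}") from None
    # Characters without a mapping (spaces, punctuation) pass through unchanged.
    return "".join(table.get(ch, ch) for ch in text)


# The same Odia text, typed in two different legacy fonts, converts identically.
print(to_unicode("kA", "legacy-font-a") == to_unicode("qw", "legacy-font-b"))
```

Keeping every table behind one function means that supporting another legacy encoding only requires adding a table, which mirrors the appeal of a single page with a drop-down menu.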

Combining five different converters into one, Jnanaranjan says, was a necessary next step in development:

When I found that there are different URLs for different converters, and that the URLs lead to a bunch of different sites, it seemed quite messed up. It would have been difficult for users to locate each of the converters. I thought it would be easier for users if they could find all the encoding converters for Odia on one page on their home wiki. So, I tried to tweak the source code and design this converter.

He also explains that several newspapers, whose news is encoded in older systems, are now rich information sources. "Converting them and using the information to add more citations to Wikipedia could help to achieve the dream of every single person being able to contribute more information to the Odia Wikipedia," he says, "so all human knowledge may be available in our language."

Subhashish Panigrahi, Wikimedian and Programme Officer, Centre for Internet and Society, Bengaluru, India.


Ideas for social media messages promoting the published post:

Twitter (@wikimedia/@wikipedia):

(Tweet text goes here - max 117 characters)
More citations for Odia Wikipedia (@odiawiki) as the community designs character encoding converters to make more content available in Unicode.
The Odia Wikimedia community (@odiawiki) has built new converters, making Odia sources more accessible than ever.
The Odia Wikimedia community (@odiawiki) has built a tool to make Odia sources easier to access and share.
A 13th birthday present from the Odia Wikimedia community (@odiawiki) - a tool to make Odia sources easier to use.


  • The Odia-language Wikimedia community has successfully built character encoding converters to convert digital content from newspapers, magazines and portals created in proprietary non-standard encoding systems into Unicode. Because mainstream media do not make use of the Unicode standard, a massive amount of content does not appear in search engines, and it is difficult to locate relevant content. The converters will help editors to search for useful content online and, ultimately, enrich the Odia Wikipedia with new sources, just in time for its thirteenth birthday.