Indic Tech Internship Programme 2023/OCR for complex pages in Wikisource


Goal:

The goal of this project is to develop an Optical Character Recognition (OCR) system for processing images uploaded to Tamil Wikisource. The primary objective is to extract the text from complex page images of Tamil PDF documents, making the content more accessible and searchable on the platform.

Problem Statement:

Tamil Wikisource, a platform hosting Tamil-language documents, lacks an efficient OCR system for processing complex PDF page images. Users frequently upload images containing Tamil text, and the current system struggles to extract and convert this text accurately. The existing OCR solution needs better accuracy to provide a more reliable way of converting Tamil PDF images into machine-readable text.

Proposed Solution

Build a web-based OCR engine that processes every page of a PDF and creates the corresponding text content in Wikisource.

Tech Stack: Tesseract, the Python Flask framework, and the Wikisource API. Source code
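The flow described above (PDF pages in, OCR text out, one Wikisource page per PDF page) could be sketched roughly as follows. This is a minimal illustration, not the project's actual code. It assumes the pytesseract and pdf2image packages, a Tesseract install with the Tamil ("tam") language data, and the MediaWiki Action API edit parameters that Wikisource exposes. The function names, the edit summary, and the example file name are hypothetical.

```python
from typing import Callable, Dict, Iterable, List


def pdf_to_images(pdf_path: str, dpi: int = 300) -> list:
    """Render each PDF page to an image (assumes pdf2image and Poppler are installed)."""
    from pdf2image import convert_from_path  # imported lazily; optional dependency
    return convert_from_path(pdf_path, dpi=dpi)


def tesseract_tamil_ocr(image) -> str:
    """OCR one page image with Tesseract's Tamil model.

    Assumes pytesseract, the Tesseract binary, and the "tam" language
    data are installed.
    """
    import pytesseract  # imported lazily; optional dependency
    return pytesseract.image_to_string(image, lang="tam")


def ocr_pages(images: Iterable,
              ocr: Callable[[object], str] = tesseract_tamil_ocr) -> List[str]:
    """Run the OCR callable on every page image and return one text per page."""
    return [ocr(image).strip() for image in images]


def build_edit_requests(file_name: str, page_texts: List[str],
                        csrf_token: str) -> List[Dict[str, str]]:
    """Build one Action API 'edit' payload per page.

    Wikisource stores proofread pages under titles of the form
    "Page:<file name>/<page number>". The payloads are only built here;
    sending them (with a logged-in session) is left out of this sketch.
    """
    requests = []
    for page_number, text in enumerate(page_texts, start=1):
        requests.append({
            "action": "edit",
            "title": f"Page:{file_name}/{page_number}",
            "text": text,
            "summary": "Add OCR text (illustrative summary)",
            "token": csrf_token,
            "format": "json",
        })
    return requests
```

In a Flask application, a route handler could call `pdf_to_images`, pass the result to `ocr_pages`, and then POST each payload from `build_edit_requests` to the wiki's `api.php` endpoint. Keeping the OCR step behind a callable makes it easy to swap in a different engine or a stub during testing.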

Future Enhancements:

  • Improve the accuracy of the OCR system, for example by exploring more advanced OCR algorithms or by training models specifically for Tamil-language recognition.
  • Incorporate feedback mechanisms to improve accuracy iteratively over time.
  • Add a graphical interface for selecting an area of an image on which to run OCR.
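To improve accuracy iteratively, the accuracy first has to be measured. One possible way to do that, shown here only as a sketch, is to compare the raw OCR output with the text after a proofreader has corrected it and compute a character error rate (CER). The function names are hypothetical, and using proofread text as the reference is an assumption rather than something the project specifies.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_text: str, corrected_text: str) -> float:
    """Edit distance divided by the length of the corrected (reference) text."""
    if not corrected_text:
        return 0.0 if not ocr_text else 1.0
    return edit_distance(ocr_text, corrected_text) / len(corrected_text)
```

Note that this counts Unicode code points. In Tamil, a consonant and its vowel sign are separate code points, so a single wrong syllable may count as one or two errors. Tracking the CER per page over time would show whether a new model or preprocessing step actually helps.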