Community Tech/OCR Improvements

From Meta, a Wikimedia project coordination wiki

This project aims to improve optical character recognition (OCR) tools on Wikisource. Currently, Wikisource editors use a range of OCR tools in the proofreading process. These tools are very important, but they have many issues. Some of these issues include:

  • The tools can be difficult to discover for new users.
  • Some tools are broken, inefficient, or unreliable.
  • The user experience is unintuitive and uninviting.
  • It can be difficult to determine which tool is appropriate for a specific text.

For all of these reasons, users may be discouraged from editing Wikisource. We hope to improve the OCR tools so that editors can work with greater ease and support. This project was the #2 request in the 2020 Community Wishlist Survey. In the course of this project, we’ll investigate and identify the key issues, collaborate with various communities, and implement solutions. We look forward to reading your feedback on the talk page!

Why use OCR tools

For Wikisource, OCR tools are a crucial component of the editor experience. OCR stands for “optical character recognition.” An OCR tool converts an image file containing text into machine-encoded text. When the process is complete, the user has a digitized version of the text, which can be edited, searched, and stored electronically. OCR tools are commonly used by many online communities and platforms, including Wikisource.

When editors add books to Wikisource, they typically do the following:

  1. Upload a file to Wikimedia Commons. The book is usually a PDF or DjVu file, containing images of scanned pages.
  2. Create an index page (powered by the Proofread Page extension) for the book on Wikisource.
  3. Proofread the book, page by page:
    1. [This is where OCR tools come in] Convert the image into editable, machine-encoded text with an OCR tool.
    2. Once completed, the user has a newly digitized version of the text.

How to use OCR tools

In Wikisource, OCR tools can be accessed when the user clicks the “Edit” tab on the page.

Edit tab to open proofreading view and OCR tools in Wikisource

Once they have clicked “Edit,” they will see the original image file of the text (on the right). Sometimes, the file already has OCR text (on the left), such as when it is imported from the Internet Archive, which automatically OCRs some texts, especially for languages with Latin scripts. However, these texts often go through the OCR process again with tools on Wikisource, which may improve the existing text layer. To do this, the user uses the OCR tools (described below) to render the image file as text (displayed on the left).

Sometimes, texts may not have gone through the OCR process at all. In these cases, the user will see the image file (on the right) and a blank section (on the left). Users will use the OCR tools (described below) to render the image file as text (displayed on the left). Once complete, the text is ready for proofreading.

Note that the right and left designations are reversed for RTL (right-to-left) languages.

It is important to understand that OCR tools do not work for all texts. For example, hand-written manuscripts are usually not supported by OCR tools. This is because handwritten characters are not as standardized as the characters in computer-generated fonts. In these cases, users typically need to manually type the text as displayed in the image file.

Example of OCR tools available to Wikisource users in the Proofread view

OCR tools available on Wikisource

OCR Gadget

The OCR Gadget, also known as “basic” OCR, is a widely used OCR tool for Wikisource editors. It generates new text with Tesseract, an open-source OCR engine sponsored by Google, and it is hosted on Toolforge. OCR Gadget is considered better than Google OCR (which we’ll describe below) at recognizing text columns. However, it produces more character errors. Additionally, it has limited language support. While OCR Gadget generally supports languages with Latin scripts, it doesn’t generally support Indic languages. For example, it does not support Hindi or Punjabi.

To enable OCR Gadget, go to Special:Preferences. In the Gadgets tab, go to “Editing tools for the Page: namespace,” and select “OCR: Enable OCR button in Page: namespace.” Once enabled, OCR Gadget can be accessed in the toolbar (see screenshot example of the grey-colored “OCR” icon). Upon clicking on the icon, an OCR-ed version of the text should appear on the left side. The user can then proceed to proofread that version of the text.

Example of OCR gadget in the proofread view

Google OCR

In 2016, the Community Tech team developed Google OCR, which was wish #25 in the 2015 Community Wishlist Survey. The Google OCR tool was meant to address the lack of Indic language support in Tesseract-based OCR systems, such as OCR Gadget. This new OCR tool used the Cloud Vision API provided by Google.

With the development of Google OCR, Wikisource editors could receive OCR support for the following languages: Multilingual Wikisource, Arabic, Assamese, Bulgarian, Bengali, English, Spanish, Hindi, Kannada, Marathi, Malayalam, Neapolitan, Odia, Russian, Sanskrit, Tamil, Telugu, and Gujarati. However, some Indic languages that had active Wikisource communities were not included. You can read the full list of languages supported by the Google Cloud Vision API.

Generally, Google OCR is considered to be a rather accurate OCR tool. However, it sometimes fails to recognize text in columns properly, so lines from adjacent columns are interleaved in the output.
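
The interleaving problem can be illustrated with a small sketch. It does not reflect how Google OCR or any other tool works internally; the word boxes, the function names, and the hand-picked column boundary are all assumptions made for the example. It only shows why reading a two-column page strictly top to bottom mixes the columns, and how grouping by column first avoids that.

```javascript
// Hypothetical word boxes, as an OCR engine might report them: the text
// plus the x/y position of its top-left corner. All names are illustrative.
const words = [
  { text: 'Left-1',  x: 10,  y: 10 },
  { text: 'Right-1', x: 310, y: 12 },
  { text: 'Left-2',  x: 10,  y: 40 },
  { text: 'Right-2', x: 310, y: 41 },
];

// Naive reading order: top to bottom across the full page width.
// Lines from the two columns end up interleaved.
function naiveOrder(boxes) {
  return [...boxes].sort((a, b) => a.y - b.y || a.x - b.x).map((w) => w.text);
}

// Column-aware order: assign each box to a column by its x position,
// then read each column top to bottom, left column first.
// columnBoundary is an assumed x coordinate lying between the columns.
function columnOrder(boxes, columnBoundary) {
  const byY = (a, b) => a.y - b.y;
  const left = boxes.filter((w) => w.x < columnBoundary).sort(byY);
  const right = boxes.filter((w) => w.x >= columnBoundary).sort(byY);
  return [...left, ...right].map((w) => w.text);
}

console.log(naiveOrder(words));       // Left-1, Right-1, Left-2, Right-2
console.log(columnOrder(words, 200)); // Left-1, Left-2, Right-1, Right-2
```

Real pages rarely have a fixed, known column boundary, which is one reason column detection is harder in practice than this sketch suggests.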

To enable the Google OCR Gadget, go to Special:Preferences. In the Gadgets tab, go to “Editing tools for the Page: namespace,” and select “Google OCR: Enable the Google OCR button to submit the page image to Google's OCR service.” Once enabled, Google OCR can be accessed in the toolbar (see screenshot example below) by clicking on the tri-color “OCR” icon. Upon clicking on the icon, an OCR-ed version of the text should appear on the left side. Alternatively, you can go to the website directly and add the image for single-image usage (but this is primarily used for non-Wikisource purposes).

Example of Google OCR available in the proofread view

hOCR

hOCR is a Phetools hOCR script. It is part of the OCR Gadget, and it uses the same Tesseract back-end. For more information, you can check out the hOCR Wikipedia page.

Indic OCR

In 2018, IndicOCR was developed by Jay Prakash, a volunteer developer. IndicOCR uses Google Drive, which uses a different OCR back-end than Cloud Vision. The tool was meant to address the limitations of Google OCR by providing support for a wider range of Indic languages, including Bengali, Bhojpuri, Gujarati, Hindi, Kannada, Maithili, Malayalam, Marathi, Nepali, Odia, Punjabi, Sanskrit, Tamil, Telugu, and Urdu. Some of these languages (such as Urdu) do not yet have Wikisource communities, but this OCR tool could provide support for such communities in the future.

To enable IndicOCR, you can add the following code to your local wiki common.js page.

mw.loader.load('//meta.wikimedia.org/w/index.php?title=User:Indic-TechCom/Script/IndicOCR.js&action=raw&ctype=text/javascript');

If you also want to add an extra button in the Visual Editor, add the following code to your local wiki common.js page as well.

mw.loader.load('//meta.wikimedia.org/w/index.php?title=User:Indic-TechCom/Script/OCR4VE.js&action=raw&ctype=text/javascript');
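
If you load both scripts, the two lines above can be combined. The following is a minimal sketch: rawScriptUrl and indicOcrScripts are hypothetical names introduced here for illustration, and only mw.loader.load and the two script titles come from the snippets above. It is written in older-style JavaScript on the assumption that this is the safest choice for a common.js page.

```javascript
// Build the raw-JavaScript URL for a script page hosted on Meta-Wiki.
// rawScriptUrl is a hypothetical helper, not part of MediaWiki.
function rawScriptUrl(title) {
  return '//meta.wikimedia.org/w/index.php?title=' + title +
    '&action=raw&ctype=text/javascript';
}

var indicOcrScripts = [
  'User:Indic-TechCom/Script/IndicOCR.js', // main IndicOCR button
  'User:Indic-TechCom/Script/OCR4VE.js'    // optional Visual Editor button
];

// mw is only defined inside a MediaWiki page, so guard the call to keep
// the snippet harmless if it is run anywhere else.
if (typeof mw !== 'undefined') {
  indicOcrScripts.forEach(function (title) {
    mw.loader.load(rawScriptUrl(title));
  });
}
```

Remove the second entry from the list if you do not want the Visual Editor button.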

Once enabled, it is identified by the text analysis icon (which looks like a magnifying glass over text) in the toolbar (see screenshot example below). Upon clicking on the icon, an OCR-ed version of the text should appear on the left side. Alternatively, you can go to the website directly and add the image for single-image usage. For more information, check out the documentation.

Example of Indic OCR tool in the proofread view

OCR4Wikisource

OCR4Wikisource, developed by T. Shrinivasan, is a Python script that is set up to run on Linux operating systems. It requires you to provide your password in plain text (on your personal device). The tool downloads the book from Wikimedia Commons, splits the file into individual pages, uploads the pages to Google Drive one by one for OCR, downloads the OCR-ed text, and uploads it to the respective Wikisource pages. This entire process runs on a personal device, so the user does not need to click an OCR icon for each page. The end result is that OCR-ed versions of the pages are uploaded directly to Wikisource.
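
The bulk workflow described above can be sketched as a simple loop. This is not the tool's actual code (OCR4Wikisource is a Python script, and its internals are not described here). Every function and step name below is hypothetical, and the steps are passed in as stubs so that the sketch runs without any network access.

```javascript
// Run OCR over every page of a book. Each step is injected as a function
// because the real tool's internals are unknown; all names are hypothetical.
function bulkOcr(bookTitle, steps) {
  const file = steps.downloadBook(bookTitle);   // e.g. fetch from Commons
  const pages = steps.splitIntoPages(file);     // one image per page
  return pages.map((page, index) => {
    const pageNumber = index + 1;
    const text = steps.ocrPage(page);           // e.g. OCR service round trip
    steps.savePage(bookTitle, pageNumber, text); // e.g. save to Wikisource
    return pageNumber;
  });
}

// Stub steps standing in for the real downloads and uploads.
const saved = [];
const stubSteps = {
  downloadBook: (title) => `${title} scan`,
  splitIntoPages: (file) => [`${file} p1`, `${file} p2`],
  ocrPage: (page) => `text of ${page}`,
  savePage: (title, pageNumber, text) => saved.push({ title, pageNumber, text }),
};

console.log(bulkOcr('Example.djvu', stubSteps)); // [ 1, 2 ]
console.log(saved.length);                        // 2
```

A real implementation would perform these steps asynchronously and handle failures per page. The sketch keeps them synchronous only to stay short.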

This is the only bulk OCR tool available to users, so some users prefer it. The quality of the OCR is also considered to be rather high. Before IndicOCR was developed, many Indic language Wikisourcers used OCR4Wikisource. You can read more in the documentation.

To enable OCR4Wikisource, you will need to download the zip file from a link (provided in the documentation), and then you’ll need to follow steps within the Terminal to complete the process.

The primary issues with OCR tools

Discoverability

If you are a new Wikisource editor, using OCR tools for the first time can be confusing. You may not know that you should use OCR tools. If you do know that you should use OCR tools, you may not know which tools are available or how to access them. The documentation on these processes varies by wiki, and some wikis have more extensive documentation than others. As a result, new editors usually need to directly interact with experienced Wikisource editors to receive this information.

Once users do learn about the OCR tools, there is no simple “quick install.” Rather, different tools require different installation processes. Some can be enabled by checking a box in Preferences. Some are enabled by copying and pasting some code into the common.js page. Others are scripts that need to be run. Overall, discovery and installation are disjointed and often confusing.

Diversity of choices

There are simply too many OCR tools to choose from. Sometimes, a diversity of tools can be a good thing. However, in the case of Wikisource OCR, the range is confusing. This is because all of the tools are meant to accomplish the same thing: a textual rendering of the image file. Consequently, editors shouldn’t need to pick between various tools that look the same, are named similarly, have similar icons, and are designed to, in theory, do the same thing. Instead, editors should have a more streamlined experience, where they can either pick just one tool or at least be guided to the most appropriate tool for their workflow, without needing to conduct research themselves.

Reliability

Many of the OCR tools don’t work very well. For example, OCR Gadget has been out of service for significant periods of time in the past, and it has suffered from a lack of gadget maintainers. The hOCR tool does not work for non-Latin scripts. Meanwhile, many of the OCR tools have a slew of reported issues, including slow response times and low-quality text output. The tools also struggle with certain layouts, such as text divided into columns (e.g., in magazine pages). They also have problems with non-English characters and diacritics.

Open questions

  • Have we covered all of the main OCR tools used by Wikisource editors?
  • Have we covered the major problems experienced when using OCR tools?
  • Which OCR tools do you use the most, and why?
  • What are the most common and frustrating issues you encounter when using OCR tools?
  • Which problems, overall, do you find the most critical to fix, and why?
  • Anything else you would like to add?

Please share your feedback on the talk page!