Community Tech/OCR Improvements

From Meta, a Wikimedia project coordination wiki

This project aims to improve optical character recognition (OCR) tools on Wikisource. Currently, Wikisource editors use a range of OCR tools in the proofreading process. These tools are very important, but they have many issues. Some of these issues include:

  • The tools can be difficult to discover for new users.
  • Some tools are broken, inefficient, or unreliable.
  • The user experience is unintuitive and uninviting.
  • It can be difficult to determine which tool is appropriate for a specific text.

For all of these reasons, users may be discouraged from editing Wikisource. We hope to improve the OCR tools so that editors can work with greater ease and support. This project was the #2 request from the 2020 Community Wishlist Survey. In the course of this project, we’ll investigate and identify the key issues, collaborate with various communities, and implement solutions that make it easier for volunteers to contribute. We look forward to reading your feedback on the talk page!

Why use OCR tools

For Wikisource, OCR tools are a crucial component of the editor experience. OCR stands for “optical character recognition.” An OCR tool converts an image file with text into machine-encoded text. When the process is complete, the user has a digitized version of the text, which can be edited, searched, and stored electronically. OCR tools are commonly used by many online communities and platforms, including Wikisource.

When editors add books to Wikisource, they typically do the following:

  1. Upload a file to Wikimedia Commons. The book is usually a PDF or DjVu file, containing images of scanned pages.
  2. Create an index page (powered by the Proofread Page extension) for the book on Wikisource.
  3. Proofread the book, page by page:
    1. [This is where OCR tools come in] Convert the image into editable, machine-encoded text with an OCR tool.
    2. Once completed, the user has a newly digitized version of the text.

How to use OCR tools

In Wikisource, OCR tools can be accessed when the user clicks the “Edit” tab on the page.

Edit tab to open proofreading view and OCR tools in Wikisource

Once they have clicked “Edit,” they will see the original image file of the text (on the right). Sometimes, the file has already been OCR-ed (with the resulting text shown on the left), such as when it comes from the Internet Archive, which automatically OCRs some texts, especially for languages with Latin scripts. However, these texts often go through the OCR process again with tools on Wikisource, which may improve the existing text layer. To do this, the user will use the OCR tools (described below) to render the image file into text (displayed on the left).

Sometimes, texts may not have gone through the OCR process at all. In these cases, the user will see the image file (on the right) and a blank section (on the left). Users will use the OCR tools (described below) to render the image file into text (displayed on the left). Once complete, the text is ready for proofreading.

Note that the right and left designations are the opposite for RTL (right-to-left) languages.

It is important to understand that OCR tools do not work for all texts. For example, handwritten manuscripts are usually not supported by OCR tools. This is because handwritten characters are not as standardized as those in computer-generated fonts. In these cases, users typically need to manually type the text as displayed in the image file.

Example of OCR tools available to Wikisource user in the Proofread view

OCR tools available on Wikisource

OCR Gadget

The OCR Gadget, also known as “basic” OCR, is a widely used OCR tool for Wikisource, originally developed by Phe. It is hosted on Toolforge and generates new text with Tesseract, an open source OCR system sponsored by Google. It is part of a wider suite of tools for Wikisource known as phetools, and uses a sophisticated system of speculative pre-processing and caching to deliver great interactive performance.

The backend uses hOCR, a structured standard format for OCR output, to communicate with the Gadget. OCR Gadget is considered better than Google OCR (described below) at recognizing text columns. However, it produces more character errors and has limited language support. While OCR Gadget generally supports languages with Latin scripts, it doesn’t generally support Indic languages. For example, it does not support Hindi or Punjabi. The tool also lacks an active maintainer, which has led to long stretches of partial or complete outages in the past.
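
To give a sense of what hOCR output looks like to a script, here is a minimal sketch that extracts words, bounding boxes, and confidence values from hOCR markup. It relies only on the hOCR convention that each word sits in an element with class "ocrx_word" whose title attribute carries "bbox x0 y0 x1 y1" and, optionally, "x_wconf N". The function name parseHocrWords is hypothetical, and this is not how the OCR Gadget itself reads the data; it is only an illustration of the format.

```javascript
// Minimal sketch, not the OCR Gadget's code: read word-level data from hOCR.
// Limitation: only word spans without nested markup (e.g. <strong>) are matched.
function parseHocrWords(hocr) {
  // Match "leaf" spans: an opening tag, text with no nested tags, a closing tag.
  const spanPattern = /<span\b([^>]*)>([^<]*)<\/span>/g;
  const words = [];
  let match;
  while ((match = spanPattern.exec(hocr)) !== null) {
    const attrs = match[1];
    const text = match[2].trim();
    // Keep only spans marked as words by the hOCR class name.
    if (!/class=['"][^'"]*\bocrx_word\b[^'"]*['"]/.test(attrs)) {
      continue;
    }
    const titleMatch = attrs.match(/title=(['"])(.*?)\1/);
    const title = titleMatch ? titleMatch[2] : '';
    const bbox = title.match(/bbox (\d+) (\d+) (\d+) (\d+)/);
    const conf = title.match(/x_wconf (\d+(?:\.\d+)?)/);
    words.push({
      text: text,
      // [left, top, right, bottom] in image pixels, or null if missing.
      bbox: bbox ? bbox.slice(1, 5).map(Number) : null,
      // Recognition confidence, or null if the engine did not report one.
      confidence: conf ? Number(conf[1]) : null,
    });
  }
  return words;
}
```

A regular expression keeps the sketch free of dependencies; in a browser, a DOM parser would be the more robust choice.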

To enable OCR Gadget, go to Special:Preferences. In the Gadgets tab, go to “Editing tools for the Page: namespace,” and check “OCR: Enable OCR button in Page: namespace.” Once enabled, OCR Gadget can be accessed in the toolbar (see screenshot example of the grey-colored “OCR” icon). Upon clicking on the icon, an OCR-ed version of the text should appear on the left side. The user can then proceed to proofread that version of the text.

Example of OCR gadget in the proofread view

Google OCR

In 2016, the Community Tech team developed Google OCR, which was wish #25 in the 2015 Community Wishlist Survey. The Google OCR tool was meant to address the lack of Indic language support in Tesseract-based OCR systems, such as OCR Gadget. This new OCR tool used the Cloud Vision API provided by Google.

With the development of Google OCR, Wikisource editors could receive OCR support for the following languages: Multilingual Wikisource, Arabic, Assamese, Bulgarian, Bengali, English, Spanish, Hindi, Kannada, Marathi, Malayalam, Neapolitan, Odia, Russian, Sanskrit, Tamil, Telugu, and Gujarati. However, some Indic languages with active Wikisource communities were not included. You can read the full list of languages supported by the Google Vision API.

Generally, Google OCR is considered to be a rather accurate OCR tool. However, it sometimes fails to recognize text in columns properly, which leaves lines from adjacent columns interleaved.
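
The interleaving problem can be illustrated with a small sketch. It assumes each recognized line comes with the x and y position of its top-left corner, and that the horizontal position of the gap between two columns (gutterX) is already known. Both function names and the input shape are hypothetical; this shows the symptom and one simple remedy for left-to-right text, not how Google OCR or any Wikisource tool orders text.

```javascript
// Illustration only: why multi-column pages can come out interleaved.

// Naive reading order: top to bottom, then left to right.
// On a two-column page this alternates between the columns.
function naiveOrder(lines) {
  return lines
    .slice()
    .sort(function (a, b) { return a.y - b.y || a.x - b.x; })
    .map(function (line) { return line.text; });
}

// Column-aware order: split the lines at an assumed gutter position,
// then read the left column fully before the right column.
function columnAwareOrder(lines, gutterX) {
  const byY = function (a, b) { return a.y - b.y; };
  const left = lines.filter(function (l) { return l.x < gutterX; }).sort(byY);
  const right = lines.filter(function (l) { return l.x >= gutterX; }).sort(byY);
  return left.concat(right).map(function (line) { return line.text; });
}

// Example page: column A on the left, column B on the right.
const examplePage = [
  { text: 'A1', x: 10, y: 10 },
  { text: 'B1', x: 300, y: 10 },
  { text: 'A2', x: 10, y: 30 },
  { text: 'B2', x: 300, y: 30 },
];
```

With this example page, naiveOrder alternates between the two columns, while columnAwareOrder with a gutter at x = 200 reads column A in full before column B.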

To enable the Google OCR gadget, go to Special:Preferences. In the Gadgets tab, go to “Editing tools for the Page: namespace,” and check “Google OCR: Enable the Google OCR button to submit the page image to Google's OCR service.” Once enabled, Google OCR can be accessed in the toolbar (see screenshot example below) by clicking on the tri-color “OCR” icon. Upon clicking on the icon, an OCR-ed version of the text should appear on the left side. Alternatively, you can go to the website directly and add in the image for single-image usage (but this is primarily used for non-Wikisource purposes).

Example of Google OCR available in the proofread view

Indic OCR

In 2018, IndicOCR was developed by Jay Prakash, a volunteer developer. IndicOCR uses Google Drive, which uses a different OCR back-end than Cloud Vision. The tool was meant to address the limitations of Google OCR by providing support for a wider range of Indic languages, including Bengali, Bhojpuri, Gujarati, Hindi, Kannada, Maithili, Malayalam, Marathi, Nepali, Oriya, Punjabi, Sanskrit, Tamil, Telugu, and Urdu. Note that some of these languages (such as Urdu) do not yet have Wikisource communities, but this OCR tool could provide support for such communities in the future.

To enable IndicOCR, you can add the following code to your local wiki common.js page.

mw.loader.load('//meta.wikimedia.org/w/index.php?title=User:Indic-TechCom/Script/IndicOCR.js&action=raw&ctype=text/javascript');

If you also want an extra button in the Visual Editor, add the following code to your local wiki common.js page as well.

mw.loader.load('//meta.wikimedia.org/w/index.php?title=User:Indic-TechCom/Script/OCR4VE.js&action=raw&ctype=text/javascript');
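
If you would rather not load these scripts on every page, the two calls can be wrapped in a small helper that runs only in the Page: namespace. This is a sketch under assumptions: the helper name loadIndicOcr and its visualEditor option are hypothetical, it assumes the Page: namespace reports the canonical name 'Page' through mw.config.get('wgCanonicalNamespace'), and it assumes the buttons are only needed while proofreading. The two script URLs are the ones given above.

```javascript
// Hypothetical common.js helper: load IndicOCR only in the Page: namespace.
// The mw object is passed in so the function can be tried outside a wiki.
function loadIndicOcr(mw, options) {
  var base = '//meta.wikimedia.org/w/index.php?title=User:Indic-TechCom/Script/';
  var suffix = '&action=raw&ctype=text/javascript';

  // Assumption: the Page: namespace has the canonical name 'Page'.
  if (mw.config.get('wgCanonicalNamespace') !== 'Page') {
    return [];
  }

  var scripts = ['IndicOCR.js'];
  if (options && options.visualEditor) {
    // Optional extra button for the Visual Editor.
    scripts.push('OCR4VE.js');
  }

  var urls = scripts.map(function (name) { return base + name + suffix; });
  urls.forEach(function (url) { mw.loader.load(url); });
  return urls; // Returned so the caller can see what was requested.
}

// In common.js, where the global mw object exists, you would call:
// loadIndicOcr(mw, { visualEditor: true });
```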

Once enabled, it is identified by the text analysis icon (which looks like a magnifying glass over text) in the toolbar (see screenshot example below). Upon clicking on the icon, an OCR-ed version of the text should appear on the left side. Alternatively, you can go to the website directly and add in the image for single-image usage. For more information, check out the documentation.

Example of Indic OCR tool in the proofread view

OCR4Wikisource

OCR4Wikisource, developed by T. Shrinivasan, is a Python script that is set up to run on Linux operating systems. It requires you to provide your password in plain text (on your personal device). The tool downloads the book from Wikimedia Commons, splits the file into individual pages, uploads the pages to Google Drive one by one for OCR, downloads the OCR-ed text, and uploads it to the respective Wikisource pages. This entire process runs on a personal device, so the user does not need to click the OCR icon for each page individually. The end result is that OCR-ed versions of the pages are uploaded directly to Wikisource.

This is the only bulk OCR upload tool available to users, so some users prefer it. The quality of the OCR is also considered to be rather high. Before IndicOCR was developed, many Indic language Wikisourcers used OCR4Wikisource. You can read more in the documentation.

To enable OCR4Wikisource, you will need to download the zip file from a link (provided in the documentation), and then you’ll need to follow steps within the Terminal to complete the process.

The primary issues with OCR tools

Discoverability

If you are a new Wikisource editor, it can be confusing to first use OCR tools. You may not know that you should use OCR tools. If you do know that you should use OCR tools, you may not know which tools are available or how to access them. The documentation on these processes varies by wiki, and some wikis have more extensive documentation than others. As a result, new editors usually need to directly interact with experienced Wikisource editors to receive this information.

Once users do learn about the OCR tools, there is no simple “quick install.” Rather, different tools require different installation processes. Some can be enabled by checking a box in Preferences. Some are enabled by copying and pasting some code into the common.js page. Others are scripts that need to be run. Overall, discovery and installation are disjointed and often confusing.

Diversity of choices

There are simply too many OCR tools to choose from. Sometimes, a diversity of tools can be a good thing. However, in the case of Wikisource OCR, the range is confusing. This is because all of the tools are meant to accomplish the same thing: a textual rendering of the image file. Consequently, editors shouldn’t need to pick between various tools that look the same, are named similarly, have similar icons, and are designed to, in theory, do the same thing. Instead, editors should have a more streamlined experience, where they can either pick just one tool or at least be guided to the most appropriate tool for their workflow, without needing to conduct research themselves.

Reliability

Many of the OCR tools don’t work very well. For example, OCR Gadget has been out of service for significant periods of time in the past, and it has suffered from a lack of gadget maintainers. The hOCR tool does not work for non-Latin scripts. Meanwhile, many of the OCR tools have a slew of reported issues, including slow response times and low-quality text output. The tools also struggle with certain layouts, such as text divided into columns (e.g., in magazine pages). They also have problems with non-Latin characters and diacritics.

Open questions

  • Have we covered all of the main OCR tools used by Wikisource editors?
  • Have we covered the major problems experienced when using OCR tools?
  • Which OCR tools do you use the most, and why?
  • What are the most common and frustrating issues you encounter when using OCR tools?
  • Which problems, overall, do you find the most critical to fix, and why?
  • Anything else you would like to add?

Please share your feedback on the talk page!

Status Updates

April 21, 2021

Hello, everyone! We are very excited to share our first project update below:

Project principles

As a team, we first conducted research on OCR tools for Wikisource, which we shared on this project page. Then, we collected feedback on the talk page. Following this feedback, we decided to establish some project principles. This way, we could have a stronger sense of the project and our goals. The principles are as follows:

  1. We want to improve the overall experience of OCR tools: Our #1 goal of the project is to improve the OCR experience on Wikisource. This means that we want the tools to be easier to discover and understand for newcomers, and we want the tools to be easier to use effectively for all Wikisource editors.
  2. We can’t build a new OCR tool: The original wish was entitled “New OCR tool.” Unfortunately, we don’t have the time or resources to build a new OCR tool, which would be an intensive, lengthy project. As a team, we try to take on smaller projects that last a few months, so that we can fulfill multiple wishes per year. However, we can make meaningful improvements to the existing OCR tools.
  3. We can improve Wikimedia OCR: The Wikimedia OCR tool (formerly known as Google OCR) was developed by the Community Tech team. As a result, we are able to make impactful changes to the tool, and we have already identified some areas for improvement. We have therefore made improving this tool one of our project priorities.
  4. We can address some major issues: On the project talk page, we heard users share some common pain points related to the OCR experience, including the lack of easily accessible bulk OCR functionality, minimal support for texts with multiple columns, and other issues. We can’t fix all of the issues, but we will at least investigate some of the top issues and see whether we can make improvements.

Completed work

The team has already begun work on the project! Here is what we have completed so far:

Work in development

  • Move Wikimedia OCR to Wikisource Extension: We want to improve the current user experience, which requires that users install or enable multiple separate tools. To do this, we are moving the Wikimedia OCR to the Wikisource Extension. Once this work is complete, all users will be able to see the Wikimedia OCR tool on the Proofread page (with no installation required).
    • Note: If wikis don’t want the tool displayed automatically, they can choose to opt out. Also, users will still be able to configure their toolbars with other OCR tools.
  • Add Support for Tesseract on Wikimedia OCR: To improve Wikimedia OCR, we have decided to add Tesseract to it. This way, users do not need to install two separate OCR tools via Preferences, since both OCR engines will be available via Wikimedia OCR. This is currently testable on ocr-test.wmcloud.org.
  • Accept Google Options on the API: This work is the first stage in being able to improve the quality of OCR for pages containing multiple languages. The final result will apply to both Google and Tesseract engines.
  • Improve performance of Tesseract engine: We have identified a way that we can dramatically improve the speed of the Tesseract engine. If we move Tesseract from Toolforge to Cloud VPS, we could see it run much faster (potentially, about 10 times faster!). This work is in progress, and we hope its completion will result in an improved user experience for Wikisourcers.
  • Investigate how to improve multiple column issues: Users have shared that Wikisource lacks sufficient OCR support for texts with multiple columns. For this reason, we have launched an investigation to see how this issue could be addressed. So far, we have come up with two potential approaches, and the investigation is in progress.

Work that is coming up

  • Add Tesseract options on the API: Through our technical investigation, we learned that Tesseract has many options that may help improve the OCR experience. For example, Tesseract has multiple page segmentation modes, which could help with multiple column support. It also has options to handle multiple languages within one text. For this reason, we want to make some of these options available for an improved editor experience.
  • Determine the user experience for choosing OCR engine: Once Wikimedia OCR has 2 engines (Tesseract and Google Cloud Vision), there will need to be a user experience to support how this is handled on the Proofread page. We will be working on developing a proposal for this experience soon.
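
As a rough sketch of what engine and Tesseract options on the API could look like to a caller, the function below builds a request URL against the test host mentioned above. The page segmentation mode numbers are Tesseract's own values for its --psm option. Everything else is an assumption made for illustration: the /api.php path, the parameter names (engine, image, langs[], psm), the layout labels, and the function name buildOcrRequestUrl are hypothetical and may not match the real API.

```javascript
// Tesseract page segmentation modes (values of its --psm option).
// The keys are hypothetical labels chosen for this sketch.
const PAGE_SEGMENTATION_MODES = {
  automatic: 3,    // fully automatic page segmentation (Tesseract's default)
  singleColumn: 4, // assume a single column of text of variable sizes
  singleBlock: 6,  // assume a single uniform block of text
  singleLine: 7,   // treat the image as a single text line
};

// Sketch only: the path and parameter names below are assumptions.
function buildOcrRequestUrl({ imageUrl, engine = 'tesseract', langs = [], layout = 'automatic' }) {
  if (!Object.prototype.hasOwnProperty.call(PAGE_SEGMENTATION_MODES, layout)) {
    throw new Error('Unknown layout: ' + layout);
  }
  const params = new URLSearchParams();
  params.set('engine', engine);
  params.set('image', imageUrl);
  // One entry per language, for texts that mix several languages.
  langs.forEach((lang) => params.append('langs[]', lang));
  // Page segmentation modes are a Tesseract concept, so only send them there.
  if (engine === 'tesseract') {
    params.set('psm', String(PAGE_SEGMENTATION_MODES[layout]));
  }
  return 'https://ocr-test.wmcloud.org/api.php?' + params.toString();
}
```

For example, buildOcrRequestUrl({ imageUrl: 'https://example.org/page.jpg', langs: ['en', 'fr'], layout: 'singleColumn' }) produces a URL that asks the Tesseract engine for English and French with page segmentation mode 4.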

Open questions

  • What are your general thoughts about the project principles?
  • How do you feel about our work to make Wikimedia OCR automatically available, with no installation required?
  • How do you feel about our work to add Tesseract to Wikimedia OCR?
  • Ideally, what user experience do you recommend for choosing an OCR engine when using Wikimedia OCR?
  • What do you think of our work to improve the speed of Tesseract?
  • Anything else you would like to add?

Please share your feedback on the project talk page!

May 18, 2021

Hello, everyone! We hope all of you are staying safe in these times. We know many of this wish’s enthusiasts collect resources in the languages of India, and our thoughts are with anyone impacted by the devastating COVID-19 surge.

A few improvements are currently underway and we’d love to hear your input as we continue to move the OCR efforts along. The feedback we’re looking for can be summarized as two main asks:

  • Which OCR is best for the language you conduct your transcriptions in?
  • How much faster are the new engine improvements at loading your transcriptions now? Instructions below.

Completed Work

“Under the hood” engine improvements are now live - Since our last update, we’ve released a version of the newly supported OCR engines to Beta and in Advanced tools. We’d love to hear which engines work best for the different Wikisource wikis so that we may set up the right Default engine for each Wikisource.

To try out the improvements, visit our Advanced Tools page, which will require the URL of an image to transcribe. Once you’re there, try transcribing with both the Tesseract and Google OCR engines in our Advanced Options, and let us know which generally performs best for the languages you most often work with on a given Wikisource project. We’re also looking forward to hearing how fast or slow the transcriptions are. We’d especially love to hear details on which OCR engine performs best for the language(s) that you normally transcribe.

Multi-column support in Advanced Options - The column support options are now live inside the Advanced Tools form. We’re looking to hear any tips and tricks about what options work best as you test them inside the test environment.

Opportunities to participate in Design Tests & help us set up a “Default”

  • Design Flows and User Experience Tests
    • The interface you currently see in Beta (as of late May 2021) is still a work in progress. We are still finalizing the page layout and our designs, and we are conducting unmoderated user research to see how intuitive our proposed improvements are. If you’re interested in participating in our user tests, please let us know on the talk page. We are looking for a mix of both advanced contributors and newcomers across projects.

Next Steps

  • Implementing design results from user tests
    • Once we conclude the round of tests mentioned above, the engineers will implement the finalized flows.
  • Onboarding designs for newcomers
    • Part of our improvements will also involve adding a pulsating user interface element to let newcomers know what OCR is and how they can use it in their transcription efforts.

Updates on Community Tech staffing

We wanted to update you on some changes that impact the scope of how much we can tackle:

  • A new product manager, Natalia Rodriguez, has joined the team. A note from her: “I am excited to work with all of you to fulfil this wish! It’s my very first wish, and I am excited to learn from you all to deliver a better OCR experience.”
  • We have some engineers with upcoming parental leave starting in the second half of 2021.
  • We have had an Engineering Manager departure and are hiring with hopes to fill the position by end of September.

We thank you for your patience as we onboard new members to the team and support our colleagues. The team will write a longer update on this matter as we are currently also undergoing annual planning and scoping out how this impacts our roadmap.

Open Questions

  • In this advanced tools release, which engine performs best (Tesseract or Google OCR) for the languages you most often use in your transcriptions and the Wikisource projects you most contribute to?
  • In this advanced tools release, what has your experience been in terms of load time per extraction?
  • What are some suggestions on the parameters for the engines and multi-column support?
  • Would you like to participate in our unmoderated user research?

As a new product manager for the team, Natalia would love to know what other details are useful in project updates. Please share any feedback!

Please share your feedback on the project talk page!