Community Tech/OCR Improvements

From Meta, a Wikimedia project coordination wiki

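This project focuses on the optical character recognition (OCR) tools used on Wikisource. Currently, Wikisource editors use a range of OCR tools while proofreading. These tools are all important, but they also have many problems. Some of them are:

  • New users find it hard to discover that the tools exist.
  • Some tools do not work, are inefficient, or are unreliable.
  • The user experience is unintuitive and uninviting.
  • There is no easy way to choose the best tool for a particular text.

Any of these reasons may discourage users from editing Wikisource. By improving the OCR tools, we hope to greatly reduce editors' workload and give them more support. This project was the #2 request in the 2020 Community Wishlist Survey. Over the course of the project, we plan to investigate and identify the main issues, collaborate with various communities, and implement solutions so that volunteers are better supported and can contribute more easily. Please share your comments and feedback on the talk page. Thank you!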

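Why OCR tools are used

On Wikisource, OCR tools are a make-or-break component of the editing experience. OCR stands for optical character recognition. An OCR tool converts an image file that contains text into machine-encoded text. When the process is complete, the user has a digital version of the text that can be edited, searched, and stored electronically. Many online communities and platforms routinely use OCR tools, and Wikisource is one of them.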

When editors add books to Wikisource, they typically do the following:

  1. Upload a file to Wikimedia Commons. The book is usually a PDF or DjVu file, containing images of scanned pages.
  2. Create an index page (powered by the Proofread Page extension) for the book on Wikisource.
  3. Proofread the book, page by page:
    1. [This is where OCR tools come in] Convert the image into editable, machine-encoded text with an OCR tool.
    2. Once completed, the user has a newly digitized version of the text.

How to use OCR tools

In Wikisource, OCR tools can be accessed when the user clicks the “Edit” tab on the page.

Edit tab to open proofreading view and OCR tools in Wikisource

Once they have clicked “Edit,” they will see the original image file of the text (on the right). Sometimes the file already has OCR text (shown on the left), such as when it comes from the Internet Archive, which automatically runs OCR on some texts, especially for languages with Latin scripts. However, these texts often go through the OCR process again with tools on Wikisource, which may improve the existing text layer. To do this, the user will use the OCR tools (described below) to render the image file into text (displayed on the left).

Sometimes, texts may not have gone through the OCR process at all. In these cases, the user will see the image file (on the right) and a blank section (on the left). Users will use the OCR tools (described below) to render the image file into text (displayed on the left). Once complete, the text is ready for proofreading.

Note that the right and left designations are the opposite for RTL (right-to-left) languages.

It is important to understand that OCR tools do not work for all texts. For example, hand-written manuscripts are usually not supported by OCR tools. This is because handwritten characters are not as standardized as those in computer-generated fonts. In these cases, users typically need to manually type the text as displayed in the image file.

Example of OCR tools available to Wikisource user in the Proofread view

OCR tools available on Wikisource

OCR Gadget

The OCR Gadget, also known as “basic” OCR, is a widely used OCR tool for Wikisource, originally developed by Phe. It is hosted on Toolforge and uses Tesseract, an open source OCR system sponsored by Google, to generate new text. It is part of a wider suite of tools for Wikisource known as phetools, and uses a sophisticated system of speculative pre-processing and caching to deliver great interactive performance.

The backend uses hOCR, a standard structured format for OCR output, to communicate with the Gadget. OCR Gadget is considered better than Google OCR (which we’ll describe below) at recognizing text columns. However, it produces more character errors. Additionally, it has limited language support. While OCR Gadget generally supports languages with Latin scripts, it doesn’t generally support Indic languages. For example, it does not support Hindi or Punjabi. The tool also lacks an active maintainer, which has led to long stretches of partial or complete outages in the past.
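To make the hOCR format more concrete, here is a minimal sketch of reading word text, bounding boxes, and confidence values out of an hOCR fragment. The sample markup is hand-written for illustration (it follows the ocr_line / ocrx_word conventions that Tesseract emits), and parseHocrWords is a hypothetical helper, not code from the gadget or phetools. It assumes the class attribute comes before the title attribute, as in the sample.

```javascript
// Hand-written hOCR fragment for illustration; real engine output is much longer.
const sampleHocr = `
<div class='ocr_page' title='bbox 0 0 600 800'>
  <span class='ocr_line' title='bbox 36 92 300 116'>
    <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
    <span class='ocrx_word' title='bbox 110 92 200 116; x_wconf 88'>world</span>
  </span>
  <span class='ocr_line' title='bbox 36 130 300 154'>
    <span class='ocrx_word' title='bbox 36 130 120 154; x_wconf 91'>again</span>
  </span>
</div>`;

// Hypothetical helper: extract every word with its bounding box and confidence.
// A regular expression is enough for this flat sample; production code
// should use a real HTML parser instead.
function parseHocrWords(hocr) {
  const wordPattern =
    /<span[^>]*class=['"]ocrx_word['"][^>]*title=['"]bbox (\d+) (\d+) (\d+) (\d+)(?:; x_wconf (\d+))?[^'"]*['"][^>]*>([^<]*)<\/span>/g;
  const words = [];
  for (const m of hocr.matchAll(wordPattern)) {
    words.push({
      text: m[6],
      bbox: { x0: Number(m[1]), y0: Number(m[2]), x1: Number(m[3]), y1: Number(m[4]) },
      confidence: m[5] === undefined ? null : Number(m[5]),
    });
  }
  return words;
}

const words = parseHocrWords(sampleHocr);
console.log(words.map((w) => w.text).join(" ")); // prints "Hello world again"
```

Because every word carries its own coordinates, a client can rebuild the reading order or highlight low-confidence words without running OCR again.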

To enable OCR Gadget, go to Special:Preferences. In the Gadgets tab, go to “Editing tools for the Page: namespace,” and click “OCR: Enable OCR button in Page: namespace.” Once enabled, OCR Gadget can be accessed in the toolbar (see screenshot example of the grey-colored “OCR” icon). Upon clicking on the icon, an OCR-ed version of the text should appear on the left side. The user can then proceed to proofread that version of the text.

Example of OCR gadget in the proofread view

Google OCR

In 2016, the Community Tech team developed Google OCR, which was wish #25 in the 2015 Community Wishlist Survey. The Google OCR tool was meant to address the lack of Indic language support in Tesseract-based OCR systems, such as OCR Gadget. This new OCR tool used the Cloud Vision API provided by Google.

With the development of Google OCR, Wikisource editors could receive OCR support for the following languages: Multilingual Wikisource, Arabic, Assamese, Bulgarian, Bengali, English, Spanish, Hindi, Kannada, Marathi, Malayalam, Neapolitan, Odia, Russian, Sanskrit, Tamil, Telugu, and Gujarati. However, some Indic languages with active Wikisource communities were not included. You can read the full list of languages supported by the Google Vision API.

Generally, Google OCR is considered to be a rather accurate OCR tool. However, it sometimes fails to recognize text in columns properly, so lines from adjacent columns are interleaved.
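The interleaving problem can be illustrated with a small sketch. It is not how Google OCR or any Wikisource tool orders text. It only shows why reading lines strictly top to bottom mixes two columns, and how grouping by column first avoids that. The coordinates, the page width, and both function names are made up for the example.

```javascript
// Each recognized line has its text and the top-left corner of its bounding box.
// The values are invented for a two-column page that is 600 pixels wide.
const recognizedLines = [
  { text: "Left 1", x: 40, y: 100 },
  { text: "Right 1", x: 340, y: 102 },
  { text: "Left 2", x: 40, y: 130 },
  { text: "Right 2", x: 340, y: 131 },
];

// Naive reading order: sort by vertical position only.
function readTopToBottom(lines) {
  return [...lines].sort((a, b) => a.y - b.y).map((line) => line.text);
}

// Column-aware reading order: split at the page midpoint, then read each half.
function readByColumn(lines, pageWidth) {
  const middle = pageWidth / 2;
  const left = lines.filter((line) => line.x < middle);
  const right = lines.filter((line) => line.x >= middle);
  return [...readTopToBottom(left), ...readTopToBottom(right)];
}

console.log(readTopToBottom(recognizedLines)); // [ 'Left 1', 'Right 1', 'Left 2', 'Right 2' ]
console.log(readByColumn(recognizedLines, 600)); // [ 'Left 1', 'Left 2', 'Right 1', 'Right 2' ]
```

Real pages need more than a fixed midpoint (uneven columns, headings that span the page), which is part of why column support is hard.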

To enable the Google OCR Gadget, go to Special:Preferences. In the Gadgets tab, go to “Editing tools for the Page: namespace,” and click “Google OCR: Enable the Google OCR button to submit the page image to Google's OCR service.” Once enabled, the Google OCR Gadget can be accessed in the toolbar (see screenshot example below) by clicking on the tri-color “OCR” icon. Upon clicking on the icon, an OCR-ed version of the text should appear on the left side. Alternatively, you can go to the website directly and add in the image for single-image usage (but this is primarily used for non-Wikisource purposes).

Example of Google OCR available in the proofread view

Indic OCR

In 2018, IndicOCR was developed by Jay Prakash, a volunteer developer. IndicOCR uses Google Drive, which uses a different OCR back-end than Cloud Vision. The tool was meant to address the limitations of Google OCR by providing support for a wider range of Indic languages, including Bengali, Bhojpuri, Gujarati, Hindi, Kannada, Maithili, Malayalam, Marathi, Nepali, Oriya, Punjabi, Sanskrit, Tamil, Telugu, and Urdu. Note that some of these languages (such as Urdu) do not yet have Wikisource communities, but this OCR tool could support such communities in the future.

To enable IndicOCR, you can add the following code to your local wiki common.js page.

mw.loader.load('//meta.wikimedia.org/w/index.php?title=User:Indic-TechCom/Script/IndicOCR.js&action=raw&ctype=text/javascript');

If you also want an extra button in the Visual Editor, add the following code to your local wiki common.js page as well.

mw.loader.load('//meta.wikimedia.org/w/index.php?title=User:Indic-TechCom/Script/OCR4VE.js&action=raw&ctype=text/javascript');
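As an optional variation, the sketch below loads the IndicOCR script only on pages in the proofreading namespace. It is an illustration, not part of the IndicOCR documentation. rawScriptUrl is a hypothetical helper that rebuilds the URL shape used above, and the check assumes that the proofreading namespace has the canonical name "Page". The script may already perform its own namespace check, in which case the plain one-line form is sufficient.

```javascript
// Hypothetical helper: build a raw-script URL in the same shape as the lines above.
function rawScriptUrl(pageTitle) {
  return "//meta.wikimedia.org/w/index.php?title=" + pageTitle +
    "&action=raw&ctype=text/javascript";
}

const indicOcrUrl = rawScriptUrl("User:Indic-TechCom/Script/IndicOCR.js");

// `mw` only exists inside MediaWiki pages, so guard against other environments.
if (typeof mw !== "undefined") {
  // Assumption: the proofreading namespace has the canonical name "Page".
  if (mw.config.get("wgCanonicalNamespace") === "Page") {
    mw.loader.load(indicOcrUrl);
  }
} else {
  // Outside MediaWiki (for example in Node.js), just show the URL.
  console.log(indicOcrUrl);
}
```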

Once enabled, it is identified by the text analysis icon (which looks like a magnifying glass over text) in the toolbar (see screenshot example below). Upon clicking on the icon, an OCR-ed version of the text should appear on the left side. Alternatively, you can go to the website directly and add in the image for single-image usage. For more information, check out the documentation.

Example of Indic OCR tool in the proofread view

OCR4Wikisource

OCR4Wikisource, developed by T. Shrinivasan, is a Python script that is set up to run on Linux operating systems. It requires you to provide your password in plain text (on your personal device). The tool will download the book from Wikimedia Commons, split the file into individual pages, upload the pages to Google Drive one by one for OCR, download the OCR-ed text, and upload it to the respective Wikisource pages. This entire process runs on a personal device, so the user does not have to click an OCR icon for each page. The end result is that OCR-ed versions of the pages are uploaded directly to Wikisource.
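The bulk workflow described above can be sketched as a simple pipeline. This is not the OCR4Wikisource code (which is a Python script). Every step function here is a hypothetical stand-in supplied by the caller, and the example runs against in-memory fakes rather than Commons, Google Drive, or Wikisource.

```javascript
// Run the bulk OCR steps in order. `steps` holds hypothetical stand-ins
// for the real network operations, so the flow can be read and tested offline.
function runBulkOcr(bookTitle, steps) {
  const file = steps.downloadFromCommons(bookTitle);
  const pageImages = steps.splitIntoPages(file);
  const savedPages = [];
  pageImages.forEach((pageImage, index) => {
    const text = steps.ocrViaDrive(pageImage);
    // Wikisource names proofreading pages "Page:<file name>/<page number>".
    const pageName = `Page:${bookTitle}/${index + 1}`;
    steps.saveToWikisource(pageName, text);
    savedPages.push(pageName);
  });
  return savedPages;
}

// In-memory fakes, for illustration only.
const savedText = new Map();
const fakeSteps = {
  downloadFromCommons: (title) => ({ title, pageImages: ["img-1", "img-2", "img-3"] }),
  splitIntoPages: (file) => file.pageImages,
  ocrViaDrive: (pageImage) => `text of ${pageImage}`,
  saveToWikisource: (pageName, text) => savedText.set(pageName, text),
};

const savedPages = runBulkOcr("Example.pdf", fakeSteps);
console.log(savedPages); // [ 'Page:Example.pdf/1', 'Page:Example.pdf/2', 'Page:Example.pdf/3' ]
```

Keeping the steps injectable also makes it clear where credentials would be needed: only the download and save steps talk to wiki accounts.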

This is the only bulk OCR upload tool available to users, so some users prefer it. The quality of the OCR is also considered to be rather high. Before IndicOCR was developed, many Indic language Wikisourcers used OCR4Wikisource. You can read more in the documentation.

To enable OCR4Wikisource, you will need to download the zip file from a link (provided in the documentation), and then you’ll need to follow steps within the Terminal to complete the process.

The primary issues with OCR tools

Discoverability

If you are a new Wikisource editor, it can be confusing to first use OCR tools. You may not know that you should use OCR tools. If you do know that you should use OCR tools, you may not know which tools are available or how to access them. The documentation on these processes varies by wiki, and some wikis have more extensive documentation than others. As a result, new editors usually need to directly interact with experienced Wikisource editors to receive this information.

Once users do learn about the OCR tools, there is no simple “quick install.” Rather, different tools require different installation processes. Some can be enabled by checking a box in Preferences. Some are enabled by copying and pasting some code into the common.js page. Others are scripts that need to be run. In total, the discovery and installation is disjointed and often confusing.

Diversity of choices

There are simply too many OCR tools to choose from. Sometimes, a diversity of tools can be a good thing. However, in the case of Wikisource OCR, the range is confusing. This is because all of the tools are meant to accomplish the same thing: a textual rendering of the image file. Consequently, editors shouldn’t need to pick between various tools that look the same, are named similarly, have similar icons, and are designed to, in theory, do the same thing. Instead, editors should have a more streamlined experience, where they can either pick just one tool or at least be guided to the most appropriate tool for their workflow, without needing to conduct research themselves.

Reliability

Many of the OCR tools don’t work very well. For example, OCR Gadget has been out of service for significant periods of time in the past, and it has suffered from a lack of sufficient gadget maintainers. The hOCR tool does not work for non-Latin scripts. Meanwhile, many of the OCR tools have a slew of reported issues, including slow response times and low-quality text output. The tools also struggle with certain layouts, such as text divided into columns (e.g., in magazine pages). They also have problems with non-Latin characters and diacritics.

Open questions

  • Have we covered all of the main OCR tools used by Wikisource editors?
  • Have we covered the major problems experienced when using OCR tools?
  • Which OCR tools do you use the most, and why?
  • What are the most common and frustrating issues you encounter when using OCR tools?
  • Which problems, overall, do you find the most critical to fix, and why?
  • Anything else you would like to add?

Please share your feedback on the talk page!

Status Updates

April 21, 2021

Hello, everyone! We are very excited to share our first project update below:

Project principles

As a team, we first conducted research on OCR tools for Wikisource, which we shared in this project page. Then, we collected feedback on the talk page. Following this feedback, we decided to establish some project principles. This way, we could have a stronger sense of the project and our goals. The principles are as follows:

  1. We want to improve the overall experience of OCR tools: Our #1 goal of the project is to improve the OCR experience on Wikisource. This means that we want the tools to be easier to discover and understand for newcomers, and we want the tools to be easier to use effectively for all Wikisource editors.
  2. We can’t build a new OCR tool: The original wish was entitled “New OCR tool.” Unfortunately, we don’t have the time or resources to build a new OCR tool, which would be an intensive, lengthy project. As a team, we try to take on smaller projects that last a few months, so that we can fulfill multiple wishes per year. However, we can make meaningful improvements to the existing OCR tools.
  3. We can improve Wikimedia OCR: The Wikimedia OCR tool (formerly known as Google OCR) was developed by the Community Tech team, so we are able to make impactful changes to it, and we have already identified some areas for improvement. For this reason, we have made improving this tool one of our project priorities.
  4. We can address some major issues: On the project talk page, we heard users share some common pain points related to the OCR experience, including: lack of an easily accessible bulk OCR functionality, minimal support of texts with multiple columns, and other issues. We can’t fix all of the issues, but we will try to at least investigate some of the top issues and see if we can issue improvements.

Completed work

The team has already begun work on the project! Here is what we have completed so far:

Preliminary work

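  • Move Wikimedia OCR to the Wikisource extension: We want to improve on the current user experience, which requires users to install or enable several separate tools. To that end, we are moving Wikimedia OCR into the Wikisource extension. Once this work is complete, Wikimedia OCR will be displayed to all users on the proofreading page (no installation required).
    • Note: Wikis that do not want this tool displayed automatically can opt out. Alternatively, users can configure their own toolbars to display other OCR tools.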
  • Add Support for Tesseract on Wikimedia OCR: To improve Wikimedia OCR, we have decided to add Tesseract to it. This way, users do not need to install two separate OCR tools via Preferences, since both OCR engines will be available via Wikimedia OCR. This is currently testable on ocr-test.wmcloud.org.
  • Accept Google Options on the API: This work is the first stage in being able to improve the quality of OCR for pages containing multiple languages. The final result will apply to both Google and Tesseract engines.
  • Improve performance of Tesseract engine: We have identified a way that we can dramatically improve the speed of the Tesseract engine. If we move Tesseract from Toolforge to Cloud VPS, we could see it run much faster (potentially, about 10 times faster!). This work is in progress, and we hope its completion will result in an improved user experience for Wikisourcers.
  • Investigate how to improve multiple column issues: Users have shared that Wikisource lacks sufficient OCR support for texts with multiple columns. For this reason, we have launched an investigation to see how this issue could be addressed. So far, we have come up with two potential approaches, and the investigation is in progress.

Upcoming work

  • Add Tesseract options on the API: Through our technical investigation, we learned that Tesseract has many options that may help improve the OCR experience. For example, Tesseract has multiple page segmentation modes, which could help with multiple column support. It also has options to handle multiple languages within one text. For this reason, we want to make some of these options available for an improved editor experience.
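  • Investigate the user experience for choosing an OCR engine: Once Wikimedia OCR has two engines (Tesseract and Google Cloud Vision), we will need to provide a user experience for choosing between them during proofreading. We will soon begin drafting a proposal for this experience.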
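To illustrate the Tesseract options mentioned above, here is a minimal sketch that builds arguments for the tesseract command-line program. The --psm flag (page segmentation mode) and the -l flag (languages joined with "+") are real Tesseract options. The tesseractArgs helper, the PSM constant names, and the file name are assumptions for the example, and nothing here describes how Wikimedia OCR passes these options internally.

```javascript
// A few of Tesseract's page segmentation modes (values used with --psm).
const PSM = {
  AUTO_WITH_OSD: 1, // automatic segmentation with orientation and script detection
  AUTO: 3,          // fully automatic segmentation, no OSD (Tesseract's default)
  SINGLE_COLUMN: 4, // assume a single column of text of variable sizes
  SINGLE_BLOCK: 6,  // assume a single uniform block of text
};

// Hypothetical helper: build the argument list for the `tesseract` program.
// "stdout" as the output base makes Tesseract print the text instead of writing a file.
function tesseractArgs(imagePath, { langs = ["eng"], psm = PSM.AUTO } = {}) {
  if (langs.length === 0) {
    throw new Error("at least one language is required");
  }
  return [imagePath, "stdout", "-l", langs.join("+"), "--psm", String(psm)];
}

// English and Hindi on the same page, treated as a single column.
const args = tesseractArgs("page.png", { langs: ["eng", "hin"], psm: PSM.SINGLE_COLUMN });
console.log(["tesseract", ...args].join(" "));
// prints "tesseract page.png stdout -l eng+hin --psm 4"
```

The order of the languages matters to Tesseract, so an interface that exposes this option should let users put the primary language first.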

Open questions

  • What are your general thoughts about the project principles?
  • How do you feel about our work to make Wikimedia OCR automatically available, with no installation required?
  • How do you feel about our work to add Tesseract to Wikimedia OCR?
  • Ideally, what user experience do you recommend for choosing an OCR engine when using Wikimedia OCR?
  • What do you think of our work to improve the speed of Tesseract?
  • Anything else you would like to add?

Please share your feedback on the project talk page!

May 31, 2021

Hello, everyone! We hope all of you are staying safe in these times. We know many of this wish’s enthusiasts collect resources in the languages of India, and our thoughts are with anyone impacted by the devastating COVID-19 surge.

A few improvements are currently underway and we’d love to hear your input as we continue to move the OCR efforts along. The feedback we’re looking for can be summarized as two main asks:

  • Which OCR is best for the language you conduct your transcriptions in?
  • How much faster are the new engine improvements at loading your transcriptions now? Instructions below.

Completed work

“Under the hood” engine improvements are now live – Since our last update, we’ve released a version of the newly supported OCR engines to Beta and in Advanced tools. We’d love to hear which engines work best for the different Wikisource wikis so that we may set up the right Default engine for each Wikisource.

To try out the improvements, visit our Advanced Tools page, which will require the URL of an image to transcribe. Once you’re there, try transcribing with both the Tesseract and Google OCR engines in our Advanced Options, and let us know which generally performs best for the languages you most often work with on a given Wikisource project. We’re also looking forward to hearing how fast or slow the transcriptions are. We’d especially love to hear details on which OCR performs best for the language(s) that you normally transcribe.

Multi-column support in Advanced Options – The column support options are now live inside the Advanced Tools form. We’re looking to hear any tips and tricks about what options work best as you test them inside the test environment.

Opportunities to participate in Design Tests & help us set up a “Default”

  • Design Flows and User Experience Tests
    • The interface you currently see in Beta (as of late May 2021) is still a work in progress. We are still finalizing the layout on the page and are finalizing our designs by conducting unmoderated user research to see how intuitive our proposed improvements are. If you’re interested in participating in our user tests, please let us know in the talk page. We are looking for a mix of both advanced contributors and newcomers across projects.

Next Steps

  • Implementing design results from user tests
    • Once we conclude the round of tests mentioned above, the engineers will implement the finalized flows.
  • Onboarding designs for newcomers
    • Part of our improvements will also involve adding a pulsating user interface element to let newcomers know what OCR is and what they can use it for in their transcription efforts.

Updates on Community Tech staffing

We wanted to update you on some changes that impact the scope of how much we can tackle:

  • We have a new product manager that joined the team, Natalia Rodriguez. A note: “I am excited to work with all of you to fulfil this wish! It’s my very first wish, and I am excited to learn from you all to deliver a better OCR experience.”
  • We have some engineers with upcoming parental leave starting in the second half of 2021.
  • We have had an Engineering Manager departure and are hiring with hopes to fill the position by end of September.

We thank you for your patience as we onboard new members to the team and support our colleagues. The team will write a longer update on this matter as we are currently also undergoing annual planning and scoping out how this impacts our roadmap.

Open Questions

  • In this advanced tools release, which engine performs best (Tesseract or Google OCR) for the languages you most often use in your transcriptions and the Wikisource projects you most contribute to?
  • In this advanced tools release, what has your experience been in terms of load time per extraction?
  • What are some suggestions on the parameters for the engines and multi-column support?
  • Would you like to participate in our unmoderated user research?

As a new product manager for the team, Natalia would love to know what other details are useful in project updates. Please share any feedback!

Latest update: August 31, 2021

Hello everybody! We are excited to announce our completed OCR improvements. Working to improve the Transcription Tools in Wikisource has taught us so much, and we are incredibly grateful for all of your feedback from the beginning of the project and for your continued input as we polished the User Interface in the project's final stages. Below, we've fleshed out a summary of all that our improvements entailed. Thank you again for your continued input!

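Underlying engine improvements

  • Reliability: Before this work began, the available OCR tools were bundled as separate gadgets. We have added Wikimedia OCR, which is faster and more stable at transcription and will be maintained by the Community Tech team going forward. Because it can run on internally supported infrastructure, we also expect far less downtime. The tool supports both the Tesseract and Google OCR engines, which can be switched from a single icon via a dropdown in the toolbar. It is available on every language Wikisource. The older gadgets remain available alongside it, and each project decides for itself when to adopt the new tool or whether to opt out of it.
OCR menu in the toolbar
  • Speed: Before this work, transcription jobs could take up to 40 seconds in some cases. With our improvements, transcription now takes under 4 seconds on average. We are proud of this improvement and expect it to make the tool easier for proofreaders to use.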

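Advanced tools improvements

  • Multi-language support: When a document contains more than one language, you can transcribe it by using the Languages field in the Advanced tools to search for or type in the languages used in the text, in order of priority.
  • Crop tool / support for multi-column text: For pages with layouts more complex than usual, such as text set in columns, we introduced a Cropper tool that lets you specify which region of the image to transcribe.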

UI Crop tool in Advanced tools
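The idea behind cropping multi-column pages can be sketched numerically: each column becomes its own rectangle, and each rectangle is transcribed separately so that lines from different columns are never mixed. The Cropper tool itself is interactive. The columnCrops helper, the equal-width assumption, and the page dimensions below are hypothetical and only illustrate the geometry.

```javascript
// Hypothetical helper: split a page into equal-width column rectangles,
// leaving `gutter` pixels of empty space between neighbouring columns.
function columnCrops(pageWidth, pageHeight, columns, gutter = 0) {
  const columnWidth = (pageWidth - gutter * (columns - 1)) / columns;
  const crops = [];
  for (let i = 0; i < columns; i++) {
    crops.push({
      x: Math.round(i * (columnWidth + gutter)),
      y: 0,
      width: Math.round(columnWidth),
      height: pageHeight,
    });
  }
  return crops;
}

// A 1000 x 1400 pixel page with two columns and a 40 pixel gutter.
console.log(columnCrops(1000, 1400, 2, 40));
// [ { x: 0, y: 0, width: 480, height: 1400 },
//   { x: 520, y: 0, width: 480, height: 1400 } ]
```

Transcribing the rectangles from left to right (or right to left for RTL scripts) and joining the results gives the text in reading order.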

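Discoverability and ease of use of OCR

  • For new users who are unfamiliar with transcription, or with the term OCR, we added a pulsing blue user interface (UI) element to the new tool's icon. The new UI explains what OCR means and what transcription involves in this context. We would be delighted if everyone editing on Wikisource projects could use the tool as an option for effortless transcription.

We will be watching how these changes play out in practice over the coming months, and we look forward to hearing your views in the 2022 Community Wishlist Survey. This work was only possible thanks to all of you.

Please share your comments and feedback on the project talk page.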