This page introduces the Transkribus interface, which can be used to transcribe documents, create and train new models, or test existing models.
General Overview of the Procedure
The entire process of creating and training a new model is quite extensive. The flowchart below outlines the steps involved in the workflow, from gathering the required training data to making the model available on your Wikisource.
NOTE: Certain advanced processes, such as customizing the shapes of polygons or editing baseline data, are left out of the flowchart for the sake of simplicity. They are detailed in their respective sections.
The following are the prerequisites for creating and training a new model:
- Have a functional account on Transkribus with enough credits to perform OCR operations
- Keep between 5,000 and 15,000 words (around 25-75 pages) of transcribed material in your desired language ready to be uploaded
- If you are working with printed text rather than handwritten text, less training data will be needed (around 50 pages)
- Please note that the number of pages of a particular type for which the model is being created is crucial to the performance of the model
- When creating a model for a particular style of handwritten text, ensure that all the manuscripts available are of that particular style only
- And a lot of patience, for this is going to take some time!
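As a rough way to check the word-count prerequisite above before uploading, the following minimal sketch counts the words in a set of transcribed pages. The function names and the whitespace-based definition of a "word" are assumptions made for illustration; scripts that do not separate words with spaces would need a different counting rule.

```python
def count_words(pages):
    """Count whitespace-separated words across a list of page transcriptions."""
    return sum(len(page.split()) for page in pages)


def training_data_status(pages, minimum=5_000, comfortable=15_000):
    """Compare the total word count with the 5,000-15,000 word guideline."""
    total = count_words(pages)
    if total < minimum:
        return total, "too little"
    if total < comfortable:
        return total, "within range"
    return total, "plenty"


# Illustrative input only: two very short "pages" of transcribed text.
sample_pages = ["first line of text\nsecond line", "another page of text"]
print(training_data_status(sample_pages))  # (10, 'too little')
```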
The Transkribus workspace
Transkribus has a feature-rich web interface that provides a host of functionality, including proofreading, text recognition, access to models for multiple languages, and experimenting with manuscripts. This is where you will spend the majority of your time as you prepare ground truth documents, build a new model, train it on the relevant documents, and validate its accuracy. Once you have logged in with a Transkribus account, you will be directed to a dashboard that looks similar to the one shown below. Don’t worry if you do not have any collections yet!
While working with Transkribus, it is important to be familiar with a few terms. Not all of them are immediately relevant, but you can always come back for reference!
Any image or page of a manuscript that is uploaded to Transkribus is considered a document
A collection is a group of related documents (e.g. of a particular language or style) that helps you to organize your work desk better
A baseline model is a Transkribus model that deals only with the baselines common to all the textual material in the document
Note: Having a dedicated baseline model is helpful in some cases
A Handwritten Text Recognition (HTR) model is what performs the actual OCR by detecting the handwritten text and generating the required output text
Note: It is often used in tandem with a baseline model
All documents that have already been proofread and have correct transcription of text can be labeled as ground truth data, to form the basis of building a new model
Any process run on Transkribus, like performing text recognition on a document, is classified as a job and is queued on the Transkribus server
Usually consisting of 90% of the entire data set, the training data contains documents that the algorithm uses to train a new model on a particular handwriting
Usually consisting of 10% of the entire data set, the validation set contains documents on which the model validates its performance in recognizing the handwriting effectively
The length of training is measured in epochs, where one epoch is one full pass of the model over the training data
Note: Having a very high number of epochs can cause the model to be over-trained on the training set, causing it to perform poorly on new data
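The over-training effect mentioned in the note on epochs can be illustrated with a small sketch. It assumes you have one validation error value per epoch and simply picks the epoch where that error is lowest. The function name and the error values are invented for illustration and do not come from Transkribus or from any real training run.

```python
def best_epoch(validation_errors):
    """Return the 1-based epoch number with the lowest validation error."""
    lowest = min(validation_errors)
    return validation_errors.index(lowest) + 1


# Made-up validation error rates (in percent), one value per epoch.
# The error falls at first, then rises again as the model over-trains.
errors = [12.0, 8.5, 6.1, 5.4, 5.9, 6.8]
print(best_epoch(errors))  # 4
```

In this invented series, training beyond the fourth epoch makes the model worse on unseen pages even though it keeps fitting the training pages more closely.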
Uploading documents to a collection
The easiest way to add documents to Transkribus is by creating a new collection. Once you create a new collection on the Collections tab, you will be redirected to a screen as shown below.
The interface includes the following options (numbered accordingly):
- Name of the collection you are currently working with
- Click on Upload Document to upload new documents to the collection
- An option to choose whether you are uploading an image or a PDF
- Set title of the document you are uploading
- As indicated, this allows you to upload file(s) to the collection
After the document is successfully uploaded, the collection screen should display the list of documents in that collection. Clicking on any of the documents will take you to a list of individual pages of the document, as shown in the figure below.
You can add or delete a page from the document, perform handwriting recognition (using an HTR model) on a page, set the status of the page to one of the four allowed page statuses, as well as export a subset of the pages. Further options to filter the pages being displayed are available via the Filter option on the right side of the toolbar on the page.
Your Transkribus work area
Once you click on any of the documents under the Work Desk section on your Transkribus interface, you will be redirected to a screen as shown below.
It is where all the work related to your manuscript will take place. The interface includes the following options (numbered accordingly):
- Cursor tool for moving the manuscript around
- Pen tool to indicate baselines for your manuscript
- Region selector tool to define the various regions in your manuscript
- A tool to add tables to the manuscript regions
- A button to provide more information and keyboard shortcuts
- A layout editor that allows you to see your lines and regions in one place
- Zoom controllers
- Center your document with respect to the viewing area
- Fit the document to the viewing area
- Rotate your document
- Change the view to full screen
- Start transcription with an existing model
- Option to download the existing document
- A drop down to change the status of the page to one of the following
- In Progress
- Ground Truth
- Save progress on your current document
Apart from these, there are also buttons to undo/redo changes, a virtual keyboard, and options to share your work.
Adding ground truth
Before training a model, you will need to prepare your training data. This means preparing enough images and their corresponding correct transcriptions to train the model. This process, known as the addition of ground truth, ensures that the model can be trained on existing validated data.
While preparing your manuscripts as ground truth data, you can utilise any of the public models available on Transkribus to transcribe your text and make corrections. In case there are no models for your kind of text, you will have to transcribe manually. When you are done with the transcription, save each page as ground truth. This indicates that the pages can be used to train your model. Once the pages have been marked appropriately, you can begin the training process.
Training a custom model
Layout Recognition Model (optional)
This is an optional activity. If you are not sure whether your language requires a layout recognition model, please raise a ticket on Phabricator. A layout recognition (line detection) model is mainly worth building when the handwriting or script is difficult to train on directly and the placement of letters or characters varies. By default, Transkribus internally uses the Mixed Line Orientation model as the layout detection model. This works well for most Western scripts.
The process of training the layout model begins with a section as shown below.
- Go to the Training section and choose a collection as prompted. Select the Baselines model option, as shown in Fig 2.
- In the dialog box that appears, fill in the required details, such as the model name (numbered 3 in the figure above) and description (numbered 4 in the figure above). The field named epochs (numbered 5 in the figure above) determines how many times the model will iterate over the provided data set.
- The next step involves selecting the training data containing the corrected baselines that were prepared in the previous step. Select all relevant documents or collections that you want the model to learn from. Similarly, select the data set to be used for validation as well.
- NOTE: Ideally, 90% of the entire data available should be used for training while 10% should be used for validation.
- Trigger the model training process
The training process takes a few minutes to complete. You can check the progress of the training process in the Jobs tab. Once complete, this job readies the layout recognition model that can further be used to create the main model!
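The 90/10 split between training and validation data recommended in the steps above can be planned before you select pages in the interface. The following is a minimal sketch, assuming pages are identified by simple page numbers. The function name, the fixed random seed, and the rule of keeping at least one validation page are choices made for this example, not Transkribus behavior.

```python
import random


def split_pages(page_ids, validation_share=0.1, seed=0):
    """Shuffle page ids and split them into (training, validation) lists."""
    shuffled = list(page_ids)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_validation = max(1, round(len(shuffled) * validation_share))
    return shuffled[n_validation:], shuffled[:n_validation]


# Example: a document with 50 ground truth pages, numbered 1 to 50.
training, validation = split_pages(range(1, 51))
print(len(training), len(validation))  # 45 5
```

Shuffling before splitting helps the validation set reflect the whole document rather than only its first or last pages.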
Correcting layouts (optional)
After the training phase, Transkribus takes the generated text regions and represents them as polygons, offering the capability to modify these shapes. This functionality, however, is exclusively accessible within the Transkribus Expert Client, which provides advanced features for more intricate document processing.
The region highlighted as 1 in the above figure showcases a chosen polygonal shape. It is important to note that these shapes are essentially composed of individual points linked by straight lines. The visualization consists of interconnected dots that form the outline of the polygon, with each straight line connecting two adjacent dots.
- The tool referenced by 2 adds points to an already selected shape. The new points can be placed on either the text region itself or its baseline, allowing for more precise customization.
- The tool pointed to by 3 in the above figure removes a designated point from the chosen shape. It is especially useful for refining or shortening baselines so that they accurately correspond to the layout of the document.
The process of tailoring the shape to specific requirements involves the manipulation of these defining points. By relocating the points that make up the polygon, users have the flexibility to modify the shape to better match the contour of the corresponding text block.
In essence, the capability to adjust polygonal ground truths in Transkribus, facilitated through the Expert Client, introduces a multifaceted toolset. The combination of interconnected points forming polygons, the addition of new points, the freedom to move points, and the option to eliminate points provides an extensive range of controls.
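The point operations described above (adding, moving, and removing points) can be sketched on a polygon stored as a plain list of (x, y) points. This is only an illustration of the idea under that assumption. It is not how the Expert Client stores or edits shapes, and all function names here are hypothetical.

```python
def add_point(polygon, index, point):
    """Return a new polygon with `point` inserted before position `index`."""
    return polygon[:index] + [point] + polygon[index:]


def move_point(polygon, index, point):
    """Return a new polygon with the point at `index` replaced by `point`."""
    return polygon[:index] + [point] + polygon[index + 1:]


def remove_point(polygon, index):
    """Return a new polygon without the point at `index`."""
    if len(polygon) <= 3:
        raise ValueError("a polygon needs at least three points")
    return polygon[:index] + polygon[index + 1:]


def edges(polygon):
    """List the straight lines joining adjacent points, including the closing one."""
    return [(polygon[i], polygon[(i + 1) % len(polygon)])
            for i in range(len(polygon))]


# A rectangular text region, then three edits applied in sequence.
region = [(0, 0), (100, 0), (100, 40), (0, 40)]
region = add_point(region, 2, (100, 20))   # extra point on the right edge
region = move_point(region, 2, (110, 20))  # pull it outwards
region = remove_point(region, 4)           # drop the bottom-left corner
print(region)  # [(0, 0), (100, 0), (110, 20), (100, 40)]
```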
For languages like Balinese and Javanese, this feature is particularly helpful because the script and its corresponding baselines are more erratic than in Western scripts. Correcting the layouts helps to improve the accuracy of the model being trained and, in turn, of the transcribed text.