Bidirectional Scene Text Recognition with a Single Decoder

Author: Maurits J.R. Bleeker.

Have you ever wondered how self-driving cars can anticipate new street signs on the road? How Google is able to index photos based on the text in a photo? How Google Maps can automatically find out in which street a photo was taken? In this blog post, we discuss our ECAI 2020 full paper about Scene Text Recognition (STR). We introduce a new method for bidirectional STR, the BIdirectional Scene TExt Transformer (Bi-STET).

The world we, as humans, live in is full of textual ‘clues’ such as street signs, billboards, brand names on buildings, traffic signs, etc. These textual clues provide extra context information about the environment: Where am I at the moment? What company is based here? Where should I go next? What is the primary selling product of the store in front of me?

Small text fragments are not only present in the physical world around us; they are also often used in videos and images to provide more contextual information. For example, a news broadcast on TV typically presents its primary information in the form of small text fragments in the video footage.

Figure 1: Video footage (Source: YouTube)

With the ever-growing amount of videos and images on the internet, automatic text extraction from images and video frames is a key technology for better visual content storage and indexing. If a computer is able to extract text automatically from images and videos, a lot of extra context information about an image or video can be obtained. For example, the text in the image above provides exactly the topic of this video/news item. This extra context information can be utilized in applications such as image/video retrieval, autonomous driving, handwriting recognition from documents, and aid applications for visually impaired people.

‘Reading’, or extracting, text from images and video frames is called Text Spotting. Text Spotting is decomposed into two sub-tasks: Text Detection and Text Recognition (also called Scene Text Recognition). Text Detection focuses on finding words in images. Once a Text Detection method has extracted the regions of an input image (or video frame) where a word is depicted, a Text Recognition method tries to transcribe the ‘word’ in each of these regions.

There is a major difference between Text Spotting and traditional Optical Character Recognition (OCR). OCR is commonly applied to documents and structured text only, which is a very narrow domain. Text Spotting algorithms, on the other hand, can deal with any kind of unstructured text in natural scenes. The difficulty with text in natural scenes is that the text can have uncommon font types, no clear row structure, different orientations, etc.

Text spotting is challenging to solve as a single task. Most previous methods focus on either Text Detection or Recognition. For this work, we only focus on Text Recognition. That is: given a word image (an image which only contains a word), let the computer transcribe the correct sequence of characters depicted in the input image. We ignore the Text Detection aspect for now, and we assume that there is a method which is able to crop word(s) from an input image.

Text Recognition can be approached as 1) a (word) classification task or 2) a sequence prediction task. The most common approach to transcribe the word depicted in an input image is by predicting a sequence of characters. By predicting a sequence of characters, instead of a word directly, the output of the method is not restricted to a fixed vocabulary of words, which is the case with direct word classification. The common Text Recognition approach is to predict a word depicted in an image character-by-character from left-to-right. Each character prediction is based on the local regional information in the input image and on the already predicted character sequence.

How does this work in practice? Given an input image, the model predicts the full sequence of characters depicted in the input image one by one, from left to right. Suppose, for example, that the characters s-t-a-d-i-u have already been predicted by the model. Then, based on general knowledge of words and characters, what would be the most likely character to be predicted next (without seeing the input image)? The character “m” is much more likely than the character “x”. A Text Recognition algorithm can learn this knowledge from data, similar to the way humans learn to read. This is called a language model: a model of how character/word sequences behave. Besides the language model, the Text Recognition model also has the image as input to ‘see’ if there is really an m depicted in the input image. Based on both the language model and the input image, the algorithm can make a reasonable prediction for the next character in the sequence.
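To make this intuition concrete, here is a toy sketch in plain Python. The probabilities and the simple multiplication rule are invented for illustration only; this is not the actual Bi-STET decoder, just a way to see how language-model evidence and visual evidence can be combined to pick the next character after the prefix ‘stadiu’.

# Toy illustration: combine a character language model with visual evidence.
# All probabilities below are invented for the example; they are not taken
# from a trained model.

def next_char_scores(lm_probs, visual_probs):
    """Combine language-model and visual scores for each candidate character.

    lm_probs: P(char | already-predicted prefix), from a learned language model.
    visual_probs: P(char | image region), from the visual features of the image.
    Here we simply multiply the two sources of evidence.
    """
    return {c: lm_probs.get(c, 1e-6) * visual_probs.get(c, 1e-6)
            for c in set(lm_probs) | set(visual_probs)}

# Prefix already decoded: "stadiu". The language model strongly prefers "m".
lm_probs = {"m": 0.85, "x": 0.001, "s": 0.05}
# The image region is blurry, so the visual evidence is less decisive.
visual_probs = {"m": 0.4, "x": 0.3, "s": 0.3}

scores = next_char_scores(lm_probs, visual_probs)
best = max(scores, key=scores.get)
print(best)  # 'm': the language model resolves the ambiguous visual evidence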

Images and video frames can contain a lot of interfering factors, such as noise, (motion) blur, lighting conditions, and shadows, which can make it challenging for the Text Recognition model to read the next character in the sequence. In those cases, the algorithm will primarily rely on the learned language model. However, a language model can only rely on the already predicted character sequence, and not on the future. In our ‘stadiu-’ example, this is not a major problem, because most of the word has already been predicted and we are quite sure the next character will be the letter m. But what if the first character of the input image is hard to read? Then there is no extra language information available to make an educated ‘guess’.

This is one major disadvantage of unidirectional sequence modelling. For this reason, bidirectional STR has been introduced: we solve this problem by predicting the character sequence twice. With bidirectional sequence modelling, an input image is decoded not only from left-to-right but also from right-to-left.

The component which predicts the output sequence is called the ‘decoder’: it decodes a sequence of characters from an input image. To decode a character sequence bidirectionally from an input image, two decoders are needed: one which decodes from left-to-right and another from right-to-left. Although this might result in more robust language models, and therefore better predictions, it is not ideal to have two separate decoders for the same task: both decoders need to be optimized while training the algorithm, and both require computation for a task which is almost identical. Therefore the question remains: would it not be possible to have one decoder for both decoding directions?

With our Bi-STET model, we introduce a new method for bidirectional STR with only one decoder. We achieve this by relying on the Transformer architecture, which is ‘directionless’.

Method and Experimental Set-up

To create more robust predictions, bidirectional sequence decoding was introduced for STR by [2]. See Figure 2 for a high-level overview of this pipeline. First, an input word image is encoded by an encoder. After the encoding, two decoders are used to transcribe the character sequence in the input image. The left-to-right (ltr) decoder predicts “FLTNESS”, while the right-to-left (rtl) decoder predicts “SSENTIF”.

The left-to-right decoder has difficulty correctly transcribing the second character ‘i’ and predicts an ‘l’ instead. The right-to-left decoder, however, correctly predicts the ‘i’. This is because it has already seen the characters ‘ssent’, and based on that character sequence the next character is much more likely to be an ‘i’ than an ‘l’. If we only had a single left-to-right decoder, this character sequence would have been transcribed incorrectly.

Figure 2: STR method with two decoders. One for ltr and one for rtl. Source [2]

The problem, however, remains that we need two decoders to achieve this, as can be seen in Figure 2. The models used for both the decoders and the encoder are Recurrent Neural Networks (RNNs). By design, RNNs have a recurrent inductive bias, because the output sequence is predicted step by step.

The Transformer [1], on the other hand, is a neural network architecture which processes a sequence fully in parallel instead of sequentially like an RNN. This is why a Transformer decoder is not restricted to one single decoding direction, as RNNs are.
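As a small illustration (a PyTorch sketch with our own toy shapes and sizes, not the Bi-STET code), the snippet below shows that a Transformer decoder layer processes all target positions in one parallel pass; any notion of order or direction has to be supplied from the outside, via positional encodings and the attention mask.

import torch
import torch.nn as nn

d_model = 32
decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=4)

# One target sequence of 7 embedded characters and a "memory" of 10 encoded
# image features (sequence-first layout, batch size 1).
tgt = torch.randn(7, 1, d_model)      # embedded target characters
memory = torch.randn(10, 1, d_model)  # encoder output for the word image

# All 7 target positions are processed in a single parallel pass; nothing in
# the layer itself fixes a left-to-right order. Order only enters through the
# positional encodings added to `tgt` (omitted here) and through the attention
# mask that hides "future" positions during training.
causal_mask = torch.triu(torch.full((7, 7), float("-inf")), diagonal=1)
out = decoder_layer(tgt, memory, tgt_mask=causal_mask)
print(out.shape)  # torch.Size([7, 1, 32])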

For this reason, we propose to use the Transformer instead of RNNs as the network architecture for both the encoder and the decoder. By using a Transformer, we can use one single decoder to decode an output sequence both from left-to-right and from right-to-left.

One final problem remains: how do we tell the model whether it needs to decode a sequence from ltr or rtl? The model depicted in Figure 2 is the standard way to implement bidirectional STR. There are two decoders, and we simply use one of them depending on whether we want to decode from ltr or rtl. In other words, the decoding direction is determined by the architecture of the model. Our model, however, has just one single decoder. The solution to this problem is to provide some extra context information to the model: besides the input image, we also give the model an extra input embedding which tells it whether the sequence needs to be decoded from left-to-right or from right-to-left. By doing this, we implement the decoding direction at the input level of the model, and not at the architecture level, which is the case when two different decoders are used. This input-level implementation saves us an extra decoder. Figure 3 gives an overview of the entire method.

Figure 3: High-level overview of BI-STET
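To make the input-level idea from Figure 3 concrete in code, here is a simplified PyTorch sketch; the class name, tensor shapes, and hyperparameters are illustrative assumptions, not the exact Bi-STET implementation. One shared decoder serves both directions, and the only things that distinguish a right-to-left pass from a left-to-right pass are a learned direction embedding added to the character embeddings and a reversed target sequence.

import torch
import torch.nn as nn

class BidirectionalDecoder(nn.Module):
    """One Transformer decoder shared by both decoding directions (sketch)."""

    def __init__(self, vocab_size, d_model=256, nhead=8, num_layers=6):
        super().__init__()
        self.char_embed = nn.Embedding(vocab_size, d_model)
        # One learned embedding per decoding direction: 0 = ltr, 1 = rtl.
        self.direction_embed = nn.Embedding(2, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tgt_tokens, memory, direction):
        # tgt_tokens: (seq_len, batch) character indices; for the rtl pass the
        # caller simply feeds the reversed target sequence.
        # memory: (src_len, batch, d_model) encoded image features.
        # direction: 0 for left-to-right, 1 for right-to-left.
        seq_len = tgt_tokens.size(0)
        dir_idx = torch.full_like(tgt_tokens, direction)
        x = self.char_embed(tgt_tokens) + self.direction_embed(dir_idx)
        # (Positional encodings are omitted in this sketch.)
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        h = self.decoder(x, memory, tgt_mask=mask)
        return self.out(h)  # character logits for each position

# Usage: the same decoder weights handle both directions.
decoder = BidirectionalDecoder(vocab_size=40)
memory = torch.randn(26, 1, 256)           # encoded word-image features
ltr_tokens = torch.randint(0, 40, (7, 1))  # e.g. "fitness" as character indices
rtl_tokens = ltr_tokens.flip(0)            # reversed target for the rtl pass
ltr_logits = decoder(ltr_tokens, memory, direction=0)
rtl_logits = decoder(rtl_tokens, memory, direction=1)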

Our overall method is called Bi-STET. We evaluate Bi-STET on seven publicly available benchmark test sets for STR and compare our method with other state-of-the-art (SOTA) approaches.

Bi-STET is trained end-to-end on synthetically generated training data. The reason for this is that ‘word images’ are relatively easy to generate realistically, in contrast to full natural scenes. Because of the recent trend in deep learning that ‘the more data the better’, a vast amount of synthetic data has been generated for this task. For training, we use more than 12 million images taken from the SynthText and Synth90k datasets. See Figures 4.1 and 4.2 for two examples of these training images. For each image, different backgrounds, font types, distortions, etc. are sampled.

Figure 4.1: Image from Synth90k. Source: https://www.robots.ox.ac.uk/~vgg/data/text/
Figure 4.2: Another image from Synth90k. Source: https://www.robots.ox.ac.uk/~vgg/data/text
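As a toy illustration of this sampling idea, here is a minimal sketch using PIL and NumPy; the real Synth90k/SynthText pipelines are far more elaborate, with sampled fonts, perspective distortions, shadows, and realistic blending.

import random

import numpy as np
from PIL import Image, ImageDraw, ImageFont

def synth_word_image(word, size=(128, 32)):
    """Render a single synthetic 'word image' (a toy stand-in for the
    Synth90k/SynthText generation pipelines)."""
    # Random background colour plus pixel noise.
    bg = np.random.randint(0, 255, size=3)
    img = np.ones((size[1], size[0], 3), dtype=np.uint8) * bg.astype(np.uint8)
    img = np.clip(img + np.random.normal(0, 15, img.shape), 0, 255).astype(np.uint8)
    image = Image.fromarray(img)

    # Draw the word in a contrasting colour; real pipelines also sample
    # font types, perspective distortions, shadows, etc.
    draw = ImageDraw.Draw(image)
    fg = tuple(int(255 - c) for c in bg)
    draw.text((random.randint(2, 20), random.randint(2, 10)), word,
              fill=fg, font=ImageFont.load_default())

    # Small random rotation as a simple geometric distortion.
    return image.rotate(random.uniform(-5, 5), expand=False), word

image, label = synth_word_image("stadium")
image.save("stadium_synthetic.png")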

Results

For the first evaluation, we compare our method with the approach by Shi et al. [2]. This is, to the best of our knowledge, the only other method that has implemented bidirectional STR. Table 1 shows that our approach outperforms the method by Shi et al. on six out of seven of the public benchmark sets. Bi-STET’s bidirectional decoding leads to higher-scoring sequence predictions than using a single decoding direction.

Table 1: BI-STET compared to the method by Shi et al. [2]. Citations are from the original paper.

In Table 2, we evaluate Bi-STET in terms of prediction accuracy on the seven public evaluation sets and compare it to other SOTA STR methods. Bi-STET meets or outperforms SOTA methods on 6 out of 12 evaluation experiments. We achieve new SOTA results on the ICDAR03 and IIIT5K datasets.

Table 2: BI-STET compared to SOTA STR methods. Citations are from the original paper.

Evaluation

One common problem for STR methods is transcribing curved and/or oriented text from images; see Figure 5 for an example. Many STR methods introduce special components to handle those special cases. With Bi-STET, we do not include any specific component for this. Bi-STET is just a plain Transformer-based image-to-text encoder-decoder, and the model relies solely on end-to-end sequence modelling. Nevertheless, our method meets or outperforms methods which are specifically optimized to handle curved and/or oriented examples. Figure 5 depicts examples of curved text from the CUTE80 dataset that are correctly and incorrectly predicted by Bi-STET, with the correct label given in black, to illustrate how well our method is able to deal with oriented text in images.

Figure 5: Examples of curved text examples from the CUTE80 dataset that are correctly and incorrectly predicted by Bi-STET. In black, the ground truth is given.

Summary and Conclusion

In this blog post, we have introduced Bi-STET, a new method for bidirectional STR. With Bi-STET, we obtain SOTA results for bidirectional STR. Additionally, we meet or achieve SOTA results compared to other STR methods which rely on approaches other than plain bidirectional STR. By using a Transformer-based model, we can implement bidirectional STR without using two decoders.

I hope you enjoyed reading this blog post. Feel free to have a look at the full paper for more results and evaluations, and at the code on GitHub. If you want to use this work for your own project, please cite the paper below. If you have any questions or feedback, please reach out!

@article{bleeker2019bidirectional,
  title={Bidirectional Scene Text Recognition with a Single Decoder},
  author={Bleeker, Maurits and de Rijke, Maarten},
  journal={arXiv preprint arXiv:1912.03656},
  year={2019}
}

Bibliography

  1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008).
  2. Shi, B., Yang, M., Wang, X., Lyu, P., Yao, C., & Bai, X. (2018). Aster: An attentional scene text recognizer with flexible rectification. IEEE transactions on pattern analysis and machine intelligence, 41(9), 2035-2048.