Unsupervised writer adaptation of whole-word HMMs with application to word-spotting

https://doi.org/10.1016/j.patrec.2010.01.007

Abstract

In this paper we propose a novel approach for writer adaptation in a handwritten word-spotting task. The method exploits the fact that the semi-continuous hidden Markov model separates the word model parameters into (i) a codebook of shapes and (ii) a set of word-specific parameters.

Our main contribution is to employ this property to derive writer-specific word models by statistically adapting an initial universal codebook to each document. This process is unsupervised and does not even require the appearance of the keyword(s) in the searched document. Experimental results show an increase in performance when this adaptation technique is applied. To the best of our knowledge, this is the first work dealing with adaptation for word-spotting. The preliminary version of this paper obtained an IBM Best Student Paper Award at the 19th International Conference on Pattern Recognition.

Introduction

Handwritten word-spotting is the pattern recognition task which consists in detecting words in handwritten documents (Rath and Manmatha, 2007). The key aspect of word-spotting is that search is performed without a full-blown handwriting recognition (HWR) system. Indeed, a straightforward strategy for keyword detection would be to apply a HWR system to recognize all words in a document and then search for the keyword in the obtained text. However, the influential work by Manmatha et al. (1996) revealed that such a strategy is too cumbersome and that in practice an image matching approach is sufficient for certain applications.

As opposed to the previous works that have employed word-spotting for historical document retrieval (Rath and Manmatha, 2007, Adamek et al., 2007, Edwards et al., 2004, Chan et al., 2006, Kolcz et al., 2000, Terasawa and Tanaka, 2007, Van der Zant et al., 2008), the application of interest of this work is the filtering of modern mail documents. Our system is applied to a flow of incoming mail documents (customer letters), where documents containing a particular keyword (such as “cancellation”) have to be flagged.

In that scenario, our word-spotting system can be confronted with documents produced by a huge variety of writers (normally, one different writer per document). In HWR, a set of techniques known as writer adaptation has been proposed to improve the performance of a writer-independent system by customizing the model to the current writer. However, these techniques have not been investigated for word-spotting before. Moreover, we consider the case of whole-word models, for which these techniques cannot be applied directly.

The main contribution of this article is to propose a novel writer style adaptation method for word-spotting. The proposed method is unsupervised, i.e. a keyword model is adapted using only unlabeled data. Moreover, examples of the keyword are not even required to be present in the adaptation set. The rest of the introduction provides a more detailed picture of the background, existing adaptation techniques and finally our approach.

The term spotting refers to “search without explicit recognition”. Originally, word-spotting was formulated for detecting words or phrases in speech messages (Myers et al., 1981, Rose and Paul, 1990, Knill and Young, 1994), and then extended to locate words in typed text documents (Kuo and Agazzi, 1994, Cho and Kim, 2004, Chen et al., 1993). The work (Manmatha et al., 1996) pioneered the application of word-spotting to off-line handwritten documents, as a way to automatically index historical document collections. This enabled the paradigm of search engines for handwritten document images (Saykol et al., 2004, Rath et al., 2003, Srihari et al., 2005).

The important contribution of the work (Manmatha et al., 1996) is to deliberately avoid the use of a HWR system for the search and indexing of word images. Instead, these authors revealed that an image matching approach allows easy indexing without any training requirements (as opposed to the costly training phase in a HWR system). Since then, word-spotting has been posed as a content-based image retrieval problem (Kolcz et al., 2000, Srihari et al., 2004, Adamek et al., 2007, Terasawa, 2005). Assuming that the words of a document collection have been segmented, word-spotting can be formulated as an image database search application: given an exemplary image (the query), the goal is to retrieve all word images that are close enough to the example, as determined by a similarity measure of choice. This paradigm is also referred to as query-by-example (QBE). A typical similarity measure is dynamic time warping (Rath and Manmatha, 2003).
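As an illustration of the query-by-example setting described above, the following is a minimal sketch of a dynamic time warping distance between two word images, each represented as a sequence of feature frames. The function name `dtw_distance`, the Euclidean local cost, and the three-way step pattern are assumptions made for the example; they are not taken from the cited works.

```python
import numpy as np

def dtw_distance(a, b):
    """DTW cost between two frame sequences of shape (T_a, D) and (T_b, D)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Pairwise Euclidean distance between every frame of a and every frame of b.
    local = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    n, m = local.shape
    # acc[i, j] holds the cheapest alignment cost of a[:i] with b[:j].
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Extend the cheapest of the three allowed predecessor alignments.
            acc[i, j] = local[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    return acc[n, m]
```

In a retrieval setting, the candidate word images would be ranked by this cost with respect to the query, smallest first.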

Our previous work (Rodríguez-Serrano and Perronnin, 2009) showed that the QBE accuracy can be boosted by using more than one example image for querying and combining these using whole-word HMMs. We also demonstrated that the particular choice of a semi-continuous HMM provides competitive performance with reduced training sets: with as few as a single training sample, the accuracy is higher than with a traditional DTW approach. This is thanks to the prior information incorporated by the Gaussian codebook. We take this approach as the baseline for the current work; more details are given in Section 2.

The described approach involves little training compared to HWR systems based on character models and it is much simpler to set up. In our application we estimate N whole-word models with little training material (typically of the order of 10–100 positive samples per keyword). In contrast, a sophisticated HWR system requires of the order of 10K or 100K samples (El-Yacoubi et al., 1999, Knerr et al., 1998) of labeled word images.

This fact, however, impedes the use of traditional adaptation techniques as discussed next.

Personalization of handwriting models, known as writer adaptation, has been a subject of interest among the handwriting recognition community (Connell and Jain, 2002, Kienzle and Chellapilla, 2006, Brakensiek et al., 2001, Mouchère et al., 2007). Owing to the practical limitation of maintaining well-trained models for each possible individual, recognition systems are trained with large amounts of varied data so that an overall good performance is obtained for all the styles. Adaptation techniques go a step further and modify the parameters of a writer-independent system such that the new parameters are optimal on a (relatively small) set of data of a particular writer.

We concentrate on statistical adaptation techniques, successful in speech recognition (Gauvain and Lee, 1994, Leggetter and Woodland, 1995) and handwriting recognition (Vinciarelli and Bengio, 2002, Brakensiek et al., 2001), since they are especially suited to HMM-based frameworks. Here, the speaker/writer-independent set of parameters θ is transformed into θad using a (relatively) small amount of data from the corresponding speaker/writer. This new data set is referred to as the adaptation set, to distinguish it explicitly from the training set, which refers to the set of samples used to train the writer-independent model. Two types of adaptation techniques are common in the literature: supervised adaptation and self-adaptation.

In supervised adaptation techniques a labeled set of samples from the writer is available. The writer-independent parameters can be updated using learning techniques such as Maximum-A-Posteriori (MAP) (Gauvain and Lee, 1994) or Maximum Likelihood Linear Regression (MLLR) (Leggetter and Woodland, 1995). While this can be useful for a system which allows an enrollment phase, it is not applicable to our scenario.

Self-adaptation techniques allow adaptation in an unsupervised way. Here, a document is recognized using a HWR system, and the output can be treated as writer-dependent labeled material and exploited to retrain the model (Ball and Srihari, 2008). When using this technique, one must assume that there will be errors in the transcription. Therefore, sample selection criteria must be imposed, e.g. considering only the samples whose recognition confidence is above a threshold.

Nevertheless, self-adaptation cannot be used in our case either, for a more subtle reason. The output of a HWR system on a new document is a set of characters together with their labels. One assumes that on a sufficiently long document there will be enough character-label pairs to retrain the original model accurately. However, when we apply a word detector to the typical mail document, we may have few or even zero occurrences of the search word, which is probably not enough for retraining.

The limitations of supervised and self-adaptation approaches in the explained scenario raise the need for an unsupervised adaptation method. Given a new document image and a keyword model, the challenge is to extract writer style information from the words of the document and use it to improve the keyword model for that document. Here, word labels are not available, which means that all the information for the adaptation must come from unlabelled image data.

We propose the use of a semi-continuous hidden Markov model (SC-HMM) (Huang and Jack, 1990) to achieve such an unsupervised adaptation method. In this type of model, first the input space is clustered using a GMM, which is usually referred to as universal background model or universal GMM. Then the means and covariances of all the states are fixed to the values of the means and covariances of this GMM. Finally, the remaining HMM parameters (transition probabilities and mixture weights) are estimated from the data.
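The parameter sharing described above can be sketched as follows: every state evaluates the same pool of Gaussians and differs only in its mixture weights. This is a minimal sketch assuming diagonal covariances; the function names and array layouts are hypothetical and chosen for the example only.

```python
import numpy as np

def log_gaussians(frames, means, variances):
    """Log density of each frame under each diagonal Gaussian; shape (T, K)."""
    diff = frames[:, None, :] - means[None, :, :]            # (T, K, D)
    return -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                   + np.sum(np.log(2.0 * np.pi * variances), axis=1))

def state_log_emissions(frames, means, variances, state_weights):
    """Log emission of each frame in each state; shape (T, S).

    means, variances: shared codebook, shape (K, D).
    state_weights: state-specific mixture weights, shape (S, K), rows sum to 1.
    """
    log_pk = log_gaussians(frames, means, variances)          # (T, K)
    # log sum_k w[s, k] * p_k(x_t), accumulated in the log domain for stability.
    joint = log_pk[:, None, :] + np.log(state_weights)[None, :, :]  # (T, S, K)
    return np.logaddexp.reduce(joint, axis=2)
```

Because the codebook enters only through `means` and `variances`, it can be exchanged without touching `state_weights` or the transition probabilities.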

The main contribution of this article is to exploit this parameter separation to propose a novel adaptation technique for tasks in which a whole handwritten word is modeled with an HMM, such as word-spotting. In writer-independent scenarios, a word model is obtained by training a “universal” GMM with many different words from many different writers. But for a new input document, the models of the keywords to spot can be adapted by replacing the universal GMM with a document-specific GMM, and leaving the word-dependent parameters unchanged. To obtain the document-specific GMM, one can apply standard adaptation techniques (such as MAP or MLLR) to the universal GMM.

To the best of the authors’ knowledge (and except for the preliminary conference version (Rodríguez et al., 2008)) this adaptation technique is novel, and also this is the first work to consider adaptation in a word-spotting problem. The rest of the article is structured as follows. Section 2 describes the writer-independent scenario based on SC-HMMs which is the baseline of this work. Section 3 describes how this is modified by the proposed writer style adaptation. Section 4 explains the particular adaptation techniques that are implemented to obtain personalized GMMs. Section 5 reports the experimental validation. Finally, in Section 6 conclusions are drawn.

Word modeling

We use a statistical approach that builds on (Rodríguez-Serrano and Perronnin, 2009) to model handwritten words. A word image is described as a sequence X = x_1, x_2, ..., x_T, where a frame x_t is a vector of features extracted at different positions of a sliding window. Each keyword to be searched is modeled by a SC-HMM (Huang and Jack, 1990) which is trained using several sequences of the keyword. The main property of a SC-HMM is that all the states of all keywords share a common pool of Gaussians {pk,k=

Proposed writer style adaptation

The main contribution of this work is to provide an adaptation method that exploits the separation of the parameters λn and θ explained in the previous section. The procedure is as follows. First, a universal shape vocabulary is built by training the GMM p(·|θ) using frames from a large number of samples from different writers. Then, for a new document, we apply a statistical adaptation technique to make the vocabulary specific to that document. The parameters λn remain unchanged.
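The procedure can be sketched in code for the case where only the Gaussian means of the universal GMM are adapted. This is a minimal sketch under stated assumptions: the relevance-factor form of MAP, the default value of `relevance`, the diagonal covariances, and the function name are all choices made for the example and may differ from the update used in the paper.

```python
import numpy as np

def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Return MAP-adapted means of a diagonal-covariance GMM (means only).

    frames: unlabeled frames from the new document, shape (T, D).
    weights, means, variances: universal GMM, shapes (K,), (K, D), (K, D).
    relevance: assumed prior weight; larger values stay closer to the prior.
    """
    # Posterior responsibility of each Gaussian for each frame.
    diff = frames[:, None, :] - means[None, :, :]
    log_p = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                    + np.sum(np.log(2.0 * np.pi * variances), axis=1))
    log_post = np.log(weights) + log_p
    log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)
    post = np.exp(log_post)                       # (T, K)
    # Zeroth- and first-order statistics of the document.
    counts = post.sum(axis=0)                     # (K,)
    first_order = post.T @ frames                 # (K, D)
    # Interpolate between the document statistics and the universal means.
    return (first_order + relevance * means) / (counts + relevance)[:, None]
```

The adapted means would replace the universal ones when scoring the words of that document, while the word-specific parameters λn are reused unchanged. No word labels are needed, since the statistics are collected over all frames of the document.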

This implies

Adaptation techniques

We have experimented with the two most popular statistical adaptation techniques for obtaining a set of source-dependent (SD) parameters θad from a source-independent (SI) set θ: MAP and MLLR.
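For reference, the mean updates of the two techniques are commonly written as below. The notation is an assumption and does not come from the text above: τ is a prior weight, γ_k(t) is the posterior probability of Gaussian k given frame x_t, w_k are the GMM mixture weights, and (A, b) is an affine transform estimated by maximizing the likelihood of the adaptation data. The exact variants used in the paper may differ.

```latex
% MAP update of the mean of Gaussian k (\tau is an assumed prior weight)
\hat{\mu}_k = \frac{\tau\,\mu_k + \sum_{t=1}^{T} \gamma_k(t)\, x_t}
                   {\tau + \sum_{t=1}^{T} \gamma_k(t)},
\qquad
\gamma_k(t) = \frac{w_k\, p_k(x_t)}{\sum_{j=1}^{K} w_j\, p_j(x_t)}

% MLLR update: one affine transform (A, b) shared by a class of Gaussians
\hat{\mu}_k = A\,\mu_k + b
```

MAP moves each Gaussian individually and only as far as the adaptation data support it, whereas MLLR ties many Gaussians to a shared transform, so Gaussians that receive little or no adaptation data are still updated.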

Experimental conditions

Adaptation is evaluated in the context of a word detection task. Experiments are conducted on word images extracted from 630 scanned letters (written in French), which contain unconstrained handwriting from approximately the same number of writers. Some letter examples are shown in Fig. 2.

Given a new document, the pipeline of the baseline system is as follows. First, a segmentation process produces word images from the documents. Each resulting word image is checked against each of the keywords

Conclusions

In the proposed system, the detection results of a writer-independent word-spotting system are improved by adapting to the writer style of each input page. To the best of our knowledge, this is the first work to apply writer adaptation in a word-spotting task.

Traditional HMM adaptation techniques cannot be used for problems where whole words are modeled with an HMM (such as word-spotting or small-vocabulary word recognition) because (i) the writer is not present to provide adaptation samples,

Acknowledgments

The work of the CVC authors was partially supported by the Spanish projects TIN2006-15694-C02-02 and CONSOLIDER-INGENIO 2010 (CSD2007-00018).

References (42)

  • A. Brakensiek et al.
  • Chan, J., Ziftci, C., Forsyth, D., 2006. Searching off-line Arabic documents. In: Proc. 2006 IEEE Computer Society...
  • Chen, F.R., Wilcox, L.D., Bloomberg, D.S., 1993. Word spotting in scanned images using hidden Markov models. In: IEEE...
  • S.D. Connell et al., 2002. Writer adaptation for online handwriting recognition. IEEE Trans. Pattern Anal. Machine Intell.
  • Edwards, J., Teh, Y.W., Forsyth, D.A., Bock, R., Maire, M., Vesom, G., 2004. Making Latin manuscripts searchable using...
  • A. El-Yacoubi et al., 1999. An HMM-based approach for off-line unconstrained handwritten word modeling and recognition. IEEE Trans. Pattern Anal. Machine Intell.
  • Fink, G.A., Plötz, T., 2006. Unsupervised estimation of writing style models for improved unconstrained off-line...
  • Gales, M., 1996. The generation and use of regression class trees for mllr adaptation. Tech. Rep. CUED/F-INFENG/TR.263,...
  • J.-L. Gauvain et al., 1994. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Trans. Speech Audio Process.
  • Huang, X.D., Jack, M.A., 1990. Semi-continuous hidden Markov models for speech signals. In: Readings in Speech...
  • Kienzle, W., Chellapilla, K., 2006. Personalized handwriting recognition via biased regularization. In: ICML’06: Proc....

    1

    When this work was carried out, he was a Ph.D. student at the CVC and visitor at XRCE. He is now with the University of Leeds, UK.