Abstract
In this paper we define a two-dimensional extension of Stochastic Context-Free Grammars for page segmentation of structured documents. Two sets of text classification features are used to perform an initial classification of each zone of the page. The page segmentation is then obtained as the most likely hypothesis according to the grammar. This approach is compared to Conditional Random Fields, and the results show significant improvements in several cases. Furthermore, grammars provide a detailed segmentation that allows a semantic evaluation, which also validates the model.
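The two-step idea in the abstract (classify each zone, then pick the most likely segmentation under a grammar) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' parser: the page is reduced to a grid of cells, `cell_logp` stands in for the output of the initial classifier, the grammar and its probabilities are invented, and `parse_page` is a hypothetical name for a CKY-style dynamic program over rectangles.

```python
import math
from functools import lru_cache

def parse_page(cell_logp, terminal_rules, binary_rules, start="Page"):
    """Most likely parse of a grid page under a toy 2D stochastic grammar.

    cell_logp[r][c][label]: log-probability of `label` for cell (r, c)
                            (assumed output of an initial zone classifier).
    terminal_rules: {nonterminal: [(label, prob), ...]}
    binary_rules:   {nonterminal: [(direction, first, second, prob), ...]}
                    "V" stacks first above second, "H" puts first left of second.
    Returns (log-probability, grid of labels), or (-inf, None) if no parse exists.
    """
    rows, cols = len(cell_logp), len(cell_logp[0])

    @lru_cache(maxsize=None)
    def best(nt, r0, c0, r1, c1):
        # Best derivation of `nt` over rows [r0, r1) and columns [c0, c1).
        top = (-math.inf, None)
        for label, p in terminal_rules.get(nt, []):
            # The whole rectangle is explained as one region of class `label`.
            score = math.log(p) + sum(cell_logp[r][c][label]
                                      for r in range(r0, r1)
                                      for c in range(c0, c1))
            if score > top[0]:
                top = (score, ("leaf", label))
        for direction, first, second, p in binary_rules.get(nt, []):
            cuts = range(r0 + 1, r1) if direction == "V" else range(c0 + 1, c1)
            for k in cuts:
                if direction == "V":
                    a = best(first, r0, c0, k, c1)
                    b = best(second, k, c0, r1, c1)
                else:
                    a = best(first, r0, c0, r1, k)
                    b = best(second, r0, k, r1, c1)
                score = math.log(p) + a[0] + b[0]
                if score > top[0]:
                    top = (score, (direction, k, first, second))
        return top

    labels = [[None] * cols for _ in range(rows)]

    def fill(nt, r0, c0, r1, c1):
        # Follow the back-pointers and write the chosen label into each cell.
        _, back = best(nt, r0, c0, r1, c1)
        if back[0] == "leaf":
            for r in range(r0, r1):
                for c in range(c0, c1):
                    labels[r][c] = back[1]
            return
        direction, k, first, second = back
        if direction == "V":
            fill(first, r0, c0, k, c1)
            fill(second, k, c0, r1, c1)
        else:
            fill(first, r0, c0, r1, k)
            fill(second, r0, k, r1, c1)

    score, _ = best(start, 0, 0, rows, cols)
    if score == -math.inf:
        return score, None
    fill(start, 0, 0, rows, cols)
    return score, labels

# Toy page: one column of four zones; the classifier prefers "header" for zone 2.
probs = [{"header": 0.9, "body": 0.1},
         {"header": 0.2, "body": 0.8},
         {"header": 0.6, "body": 0.4},
         {"header": 0.1, "body": 0.9}]
cell_logp = [[{k: math.log(v) for k, v in p.items()}] for p in probs]

# Invented grammar: a page is a header region stacked above a body region.
terminal_rules = {"Header": [("header", 1.0)], "Body": [("body", 1.0)]}
binary_rules = {"Page": [("V", "Header", "Body", 1.0)]}

score, labels = parse_page(cell_logp, terminal_rules, binary_rules)
print(labels)
```

In this toy page, labeling each zone independently would place a header below the body, which the invented grammar cannot derive. The parse instead assigns that zone to the body region, which shows in miniature how a grammar can override local classification errors.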
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Álvaro, F., Cruz, F., Sánchez, JA., Terrades, O.R., Benedí, JM. (2013). Page Segmentation of Structured Documents Using 2D Stochastic Context-Free Grammars. In: Sanches, J.M., Micó, L., Cardoso, J.S. (eds) Pattern Recognition and Image Analysis. IbPRIA 2013. Lecture Notes in Computer Science, vol 7887. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38628-2_15
DOI: https://doi.org/10.1007/978-3-642-38628-2_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-38627-5
Online ISBN: 978-3-642-38628-2
eBook Packages: Computer Science, Computer Science (R0)