Overview of DocILE 2023: Document Information Localization and Extraction

Šimsa, Štěpán; Uřičář, Michal; Šulc, Milan; Patel, Yash; Hamdi, Ahmed; Kocián, Matěj; Skalický, Matyáš; Matas, Jiří; Doucet, Antoine; Coustaty, Mickaël; Karatzas, Dimosthenis

doi:10.1007/978-3-031-42448-9_21

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14163)

Included in the following conference series:

International Conference of the Cross-Language Evaluation Forum for European Languages

Abstract

This paper provides an overview of the DocILE 2023 Competition, its tasks, participant submissions, the competition results and possible future research directions. This first edition of the competition focused on two Information Extraction tasks: Key Information Localization and Extraction (KILE) and Line Item Recognition (LIR). Both tasks require detecting pre-defined categories of information in business documents. The second task additionally requires correctly grouping the information into tuples that capture the structure laid out in the document. The competition used the recently published DocILE dataset and benchmark, which remains open to new submissions. The diversity of the participant solutions indicates the potential of the dataset: submissions included pure Computer Vision, pure Natural Language Processing and multi-modal solutions, and together they used all parts of the dataset, including the annotated, synthetic and unlabeled subsets.
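To make the difference between the two tasks concrete, the following is a minimal Python sketch under stated assumptions: each extracted field carries a category, its text and a bounding box, and LIR fields additionally carry a line-item index. The `Field` class, its attribute names and the example categories are hypothetical illustrations, not the official DocILE data format. KILE output is then a flat set of fields, whereas LIR output groups fields into one tuple per line item.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Field:
    """Hypothetical extracted field (not the official DocILE format)."""
    fieldtype: str  # pre-defined category, e.g. an illustrative "document_id"
    text: str       # extracted value
    bbox: Tuple[float, float, float, float]  # (left, top, right, bottom), normalized
    line_item_id: Optional[int] = None  # assumed to be set only for LIR fields


def split_kile_and_lir(
    fields: List[Field],
) -> Tuple[List[Field], Dict[int, List[Field]]]:
    """Separate standalone KILE fields from LIR fields grouped per line item."""
    kile: List[Field] = []
    line_items: Dict[int, List[Field]] = defaultdict(list)
    for field in fields:
        if field.line_item_id is None:
            # KILE: the field stands alone; only category and location matter.
            kile.append(field)
        else:
            # LIR: the field must also land in the correct line-item tuple.
            line_items[field.line_item_id].append(field)
    return kile, dict(line_items)
```

Keeping the grouping key on each field, rather than nesting fields inside line-item objects, lets a single list hold both kinds of output; a prediction that finds every LIR value but assigns one to the wrong `line_item_id` would still be wrong for LIR.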

Notes

1. Clusters are formed by documents that have a similar visual layout and similar placement of semantic information within this layout.
2. https://rrc.cvc.uab.es/?ch=26.
3. The LiLT paper [28] pre-trains the model on IIT-CDIP [9], a document dataset.

References

[1] Davis, B., Morse, B., Cohen, S., Price, B., Tensmeyer, C.: Deep visual template-free form parsing. In: ICDAR (2019)
[2] Hammami, M., Héroux, P., Adam, S., d’Andecy, V.P.: One-shot field spotting on colored forms using subgraph isomorphism. In: ICDAR (2015)
[3] Herzig, J., Nowak, P.K., Müller, T., Piccinno, F., Eisenschlos, J.M.: TaPas: weakly supervised table parsing via pre-training. arXiv (2020)
[4] Hong, T., Kim, D., Ji, M., Hwang, W., Nam, D., Park, S.: BROS: a pre-trained language model focusing on text and layout for better key information extraction from documents. In: AAAI (2022)
[5] Huang, Y., Lv, T., Cui, L., Lu, Y., Wei, F.: LayoutLMv3: pre-training for document AI with unified text and image masking. In: ACM-MM (2022)
[6] Huang, Z., et al.: ICDAR2019 competition on scanned receipt OCR and information extraction. In: ICDAR (2019)
[7] Jocher, G., Chaurasia, A., Qiu, J.: YOLO by Ultralytics (2023). https://github.com/ultralytics/ultralytics
[8] Katti, A.R., et al.: CharGrid: towards understanding 2D documents. In: Riloff, E., Chiang, D., Hockenmaier, J., Tsujii, J. (eds.) Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018, pp. 4459–4469. Association for Computational Linguistics (2018). https://aclanthology.org/D18-1476/
[9] Lewis, D., Agam, G., Argamon, S., Frieder, O., Grossman, D., Heard, J.: Building a test collection for complex document information processing. In: SIGIR (2006)
[10] Lin, W., et al.: ViBERTgrid: a jointly trained multi-modal 2D document representation for key information extraction from documents. In: Lladós, J., Lopresti, D., Uchida, S. (eds.) ICDAR 2021. LNCS, vol. 12821, pp. 548–563. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-86549-8_35
[11] Lohani, D., Belaïd, A., Belaïd, Y.: An invoice reading system using a graph convolutional network. In: Carneiro, G., You, S. (eds.) ACCV 2018. LNCS, vol. 11367, pp. 144–158. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-21074-8_12
[12] Majumder, B.P., Potti, N., Tata, S., Wendt, J.B., Zhao, Q., Najork, M.: Representation learning for information extraction from form-like documents. In: ACL (2020)
[13] Mathew, M., Bagal, V., Tito, R., Karatzas, D., Valveny, E., Jawahar, C.: InfographicVQA. In: WACV (2022)
[14] Mathew, M., Karatzas, D., Jawahar, C.: DocVQA: a dataset for VQA on document images. In: WACV (2021)
[15] Mindee: docTR: Document Text Recognition (2021). https://github.com/mindee/doctr
[16] Olejniczak, K., Šulc, M.: Text detection forgot about document OCR. In: CVWW (2023)
[17] Powalski, R., Borchmann, Ł., Jurkiewicz, D., Dwojak, T., Pietruszka, M., Pałka, G.: Going full-TILT boogie on document understanding with text-image-layout transformer. In: Lladós, J., Lopresti, D., Uchida, S. (eds.) ICDAR 2021. LNCS, vol. 12822, pp. 732–747. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-86331-9_47
[18] Riba, P., Dutta, A., Goldmann, L., Fornés, A., Ramos, O., Lladós, J.: Table detection in invoice documents by graph neural networks. In: ICDAR (2019)
[19] Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y
[20] Schreiber, S., Agne, S., Wolf, I., Dengel, A., Ahmed, S.: DeepDeSRT: deep learning for detection and structure recognition of tables in document images. In: ICDAR (2017)
[21] Šimsa, Š., Šulc, M., Skalický, M., Patel, Y., Hamdi, A.: DocILE 2023 teaser: document information localization and extraction. In: Kamps, J., et al. (eds.) ECIR 2023. LNCS, vol. 13982, pp. 600–608. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-28241-6_69
[22] Šimsa, Š., et al.: DocILE benchmark for document information localization and extraction. arXiv preprint arXiv:2302.05658 (2023). Accepted to ICDAR 2023
[23] Skalický, M., Šimsa, Š., Uřičář, M., Šulc, M.: Business document information extraction: towards practical benchmarks. In: Barrón-Cedeño, A., et al. (eds.) CLEF 2022. LNCS, vol. 13390, pp. 105–117. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-13643-6_8
[24] Straka, J., Gruber, I.: Object detection pipeline using YOLOv8 for document information extraction. In: Aliannejadi, M., Faggioli, G., Ferro, N., Vlachos, M. (eds.) Working Notes of CLEF 2023 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, 18–21 September. CEUR Workshop Proceedings, CEUR-WS.org (2023)
[25] Tanaka, R., Nishida, K., Yoshida, S.: VisualMRC: machine reading comprehension on document images. In: AAAI (2021)
[26] Tang, Z., et al.: Unifying vision, text, and layout for universal document processing. arXiv (2022)
[27] Tran, B.G., Bao, D.N.M., Bui, K.G., Duong, H.V., Nguyen, D.H., Nguyen, H.M.: Union-RoBERTa: RoBERTas ensemble technique for competition on document information localization and extraction. In: Aliannejadi, M., Faggioli, G., Ferro, N., Vlachos, M. (eds.) Working Notes of CLEF 2023 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, 18–21 September. CEUR Workshop Proceedings, CEUR-WS.org (2023)
[28] Wang, J., Jin, L., Ding, K.: LiLT: a simple yet effective language-independent layout transformer for structured document understanding. In: ACL (2022)
[29] Wang, Y., Du, J., Ma, J., Hu, P., Zhang, Z., Zhang, J.: USTC-iFLYTEK at DocILE: a multi-modal approach using domain-specific GraphDoc. In: Aliannejadi, M., Faggioli, G., Ferro, N., Vlachos, M. (eds.) Working Notes of CLEF 2023 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, 18–21 September. CEUR Workshop Proceedings, CEUR-WS.org (2023)
[30] Web: Industry Documents Library. https://www.industrydocuments.ucsf.edu/. Accessed 20 Oct 2022
[31] Web: Public Inspection Files. https://publicfiles.fcc.gov/. Accessed 20 Oct 2022
[32] Xu, Y., et al.: LayoutLMv2: multi-modal pre-training for visually-rich document understanding. In: ACL (2021)
[33] Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., Zhou, M.: LayoutLM: pre-training of text and layout for document image understanding. In: KDD (2020)
[34] Zhang, Z., Ma, J., Du, J., Wang, L., Zhang, J.: Multimodal pre-training based on graph attention network for document understanding. IEEE Trans. Multimed. (2022)
[35] Zhong, X., Tang, J., Jimeno-Yepes, A.: PubLayNet: largest dataset ever for document layout analysis. In: ICDAR (2019)
[36] Zhou, J., Yu, H., Xie, C., Cai, H., Jiang, L.: iRMP: from printed forms to relational data model. In: HPCC (2016)
[37] Zhu, Y., et al.: Aligning books and movies: towards story-like visual explanations by watching movies and reading books. In: ICCV (2015)

Author information

Authors and Affiliations

Rossum, Prague, Czech Republic
Štěpán Šimsa, Michal Uřičář, Matěj Kocián & Matyáš Skalický
Visual Recognition Group, Czech Technical University in Prague, Prague, Czech Republic
Yash Patel & Jiří Matas
University of La Rochelle, La Rochelle, France
Ahmed Hamdi, Antoine Doucet & Mickaël Coustaty
Computer Vision Center, Universitat Autònoma de Barcelona, Barcelona, Spain
Dimosthenis Karatzas
Second Foundation, Prague, Czech Republic
Milan Šulc

Corresponding author

Correspondence to Michal Uřičář.

Editor information

Editors and Affiliations

Democritus University of Thrace, Xanthi, Greece
Avi Arampatzis
University of Amsterdam, Amsterdam, The Netherlands
Evangelos Kanoulas
CERTH-ITI, Thessaloniki, Greece
Theodora Tsikrika
CERTH-ITI, Thessaloniki, Greece
Stefanos Vrochidis
Utrecht University, Utrecht, The Netherlands
Anastasia Giachanou
Elsevier, Amsterdam, The Netherlands
Dan Li
University of Amsterdam, Amsterdam, The Netherlands
Mohammad Aliannejadi
University of Lausanne, Lausanne, Switzerland
Michalis Vlachos
University of Padua, Padova, Italy
Guglielmo Faggioli
University of Padua, Padova, Italy
Nicola Ferro

About this paper

Cite this paper

Šimsa, Š. et al. (2023). Overview of DocILE 2023: Document Information Localization and Extraction. In: Arampatzis, A., et al. Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2023. Lecture Notes in Computer Science, vol 14163. Springer, Cham. https://doi.org/10.1007/978-3-031-42448-9_21

DOI: https://doi.org/10.1007/978-3-031-42448-9_21
Published: 11 September 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-42447-2
Online ISBN: 978-3-031-42448-9
eBook Packages: Computer Science, Computer Science (R0)