Zero-Shot Invoice Information Extraction Using Foundation Models with Spatial Prompt Tuning
Keywords:
Document Understanding Transformer, Foundation Models, Invoice, Prompt Tuning, Zero-Shot.
Abstract
Extracting structured information from scanned invoices is difficult because of diverse layouts, linguistic variability, and the scarcity of annotated training data. To address this, the study introduces a zero-shot invoice information extraction framework that combines the Donut (Document Understanding Transformer) foundation model with spatial prompt tuning. Unlike conventional OCR-based pipelines, the proposed approach operates directly on document images, without explicit text recognition or task-specific fine-tuning. The method was implemented in Python and evaluated on the SROIE v2 dataset, which comprises 973 annotated invoice images. Spatially aware natural language prompts guide the model’s attention toward relevant regions such as headers or totals. In the experiments, the model achieved 98.0% accuracy, surpassing baseline methods such as BiLSTM-CRF and LayoutLM by over 4%. The results support the model’s effectiveness and scalability for real-world document automation, especially in zero-shot settings with high template variability.
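As a rough illustration of what a spatially aware prompt might look like, the sketch below builds one question per target field and wraps it in the question-answering prompt template used by publicly released Donut DocVQA checkpoints. The abstract does not give the actual prompt wording, so the field names (the four key fields commonly associated with SROIE), the region hints, and all function names here are assumptions made for illustration only.

```python
# Minimal sketch, NOT the authors' implementation.
# FIELD_REGION_HINTS is a hypothetical mapping from a target field to a
# coarse page region; the paper's real prompts are not given in the abstract.
FIELD_REGION_HINTS = {
    "company": "header at the top of the page",
    "date": "upper part of the page",
    "address": "header at the top of the page",
    "total": "totals area near the bottom of the page",
}


def build_spatial_question(field: str) -> str:
    """Return a natural language question that names the field and a region hint."""
    region = FIELD_REGION_HINTS[field]
    return f"What is the {field} shown in the {region}?"


def build_donut_docvqa_prompt(question: str) -> str:
    """Wrap a question in the task prompt format of Donut DocVQA checkpoints.

    Assumption: the decoder is prompted in DocVQA style; the study may use
    a different task token or template.
    """
    return f"<s_docvqa><s_question>{question}</s_question><s_answer>"


def build_prompts(fields):
    """Build one decoder prompt per requested field."""
    return {
        field: build_donut_docvqa_prompt(build_spatial_question(field))
        for field in fields
    }


if __name__ == "__main__":
    for field, prompt in build_prompts(FIELD_REGION_HINTS).items():
        print(f"{field}: {prompt}")
```

In a zero-shot setup of this kind, each prompt would be tokenized and passed to the model's decoder together with the encoded invoice image, and the generated answer span would be read back as the extracted field value, with no OCR step in between.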
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
IJISAE open access articles are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Under this license, readers must give appropriate credit, provide a link to the license, and indicate if changes were made; if they remix, transform, or build upon the material, they must distribute their contributions under the same license as the original.