Advanced Scene Text and Handwriting Recognition for Hindi Using Synthetic Data and Transfer Learning

Authors

  • Barkha Sahu

Keywords

Hindi Script Recognition, Scene Text Recognition, Handwriting Recognition, Synthetic Data, Transfer Learning, Multilingual Models.

Abstract

Recognizing scene text and handwriting in Hindi is difficult because of the complexity of the Devanagari script, the diversity of font styles, and the scarcity of annotated datasets. This paper proposes a framework for Scene Text Recognition (STR) and Handwriting Recognition (HWR) in Hindi that combines synthetic data generation with transfer learning. To address data scarcity, synthetic datasets were built from both Unicode and non-Unicode fonts and from varied handwriting styles. Transfer learning from models pre-trained on extensive multilingual datasets significantly improved recognition performance by enabling cross-script knowledge transfer. Experiments showed a 33% improvement in Word Recognition Rate (WRR) on the IIIT-ILST Hindi dataset, supporting the effectiveness of the approach. Transfer learning experiments across six Indian languages further indicated inter-script benefits: Hindi models gained more from other Indian scripts than from English datasets. The work highlights the importance of synthetic data augmentation and cross-lingual learning for the accuracy and robustness of Hindi STR and HWR systems. Future research will focus on integrating generative models such as GANs for realistic data synthesis and on developing comprehensive open-source benchmarks for Indian script recognition.
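The abstract reports its headline result as Word Recognition Rate (WRR). As a minimal sketch, assuming the common definition of WRR as the fraction of word images whose predicted transcription exactly matches the ground truth, the metric could be computed as below. The function names, the NFC normalization step, and the sample words are illustrative assumptions, not details taken from the paper.

```python
import unicodedata


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization and trim whitespace.

    Assumption: canonically equivalent Devanagari encodings should be
    treated as the same word when scoring.
    """
    return unicodedata.normalize("NFC", text).strip()


def word_recognition_rate(predictions, ground_truths):
    """Return the fraction of words predicted exactly right (0.0 to 1.0)."""
    if len(predictions) != len(ground_truths):
        raise ValueError("predictions and ground_truths must have equal length")
    if not ground_truths:
        raise ValueError("at least one word is required")
    correct = sum(
        normalize(p) == normalize(g)
        for p, g in zip(predictions, ground_truths)
    )
    return correct / len(ground_truths)


# Hypothetical sample: the third prediction drops the anusvara, so it is wrong.
truth = ["नमस्ते", "भारत", "हिंदी", "पुस्तक"]
preds = ["नमस्ते", "भारत", "हिदी", "पुस्तक"]

wrr = word_recognition_rate(preds, truth)
print(f"WRR = {wrr:.2%}")  # prints "WRR = 75.00%"
```

Because WRR counts only exact matches, a single missing diacritic makes the whole word wrong, which is one reason the metric is demanding for Devanagari.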

Published

16.01.2023

How to Cite

Barkha Sahu. (2023). Advanced Scene Text and Handwriting Recognition for Hindi Using Synthetic Data and Transfer Learning. International Journal of Intelligent Systems and Applications in Engineering, 11(1), 478 –. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/7731

Section

Research Article