Reliable Multimodal AI for Structured Knowledge Extraction and Study Material Generation in Real Classrooms: A Transparent Scoping Survey, Taxonomy, Benchmarks, and Research Roadmap

Authors

  • Soma Kiran Kumar Nellipudi, Nidhibehen Patel

Keywords

Multimodal learning, lecture understanding, automatic note generation, educational knowledge graphs, retrieval-augmented generation, factuality, verification, benchmarks, trustworthy AI.

Abstract

Educational knowledge in real classrooms is distributed across speech, slides, whiteboards, handwritten mathematics, code, and ad hoc diagrams. This makes accurate and persistent study support difficult even when recordings are available. Recent multimodal models and large language model (LLM) systems can summarize lectures and generate notes, but real deployment remains limited by alignment drift, OCR and ASR noise, incomplete extraction of formal STEM content, and hallucinations that can silently corrupt study artifacts. This paper presents a transparent scoping survey of a balanced 100-paper corpus organized into five clusters: multimodal lecture understanding, educational artifact generation, structured knowledge extraction, reliability and hallucination control, and benchmarks and evaluation. We explicitly treat the last two clusters as a transfer toolkit layer for classroom AI rather than as classroom-native systems. Beyond synthesis, the paper contributes: (1) a review protocol with an explicit audit trail and descriptive-count caveats; (2) a reliability-first classroom pipeline in which alignment is the operational core; (3) an operational intermediate representation (IR) with typed fields, evidence granularity, verification records, and abstention behavior; (4) a worked micro-example that carries a 30-second lecture snippet into evidence-linked flashcards; (5) a lecture-grounded versus resource-grounded verification matrix; and (6) a reviewer-ready multimodal faithfulness protocol for mixed evidence such as noisy board crops, OCR, and ASR. The result is a sharper, more operational roadmap for trustworthy classroom AI.
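The abstract describes an operational intermediate representation (IR) with typed fields, evidence granularity, verification records, and abstention behavior. The following is a minimal illustrative sketch of what such an IR unit could look like; all class names, field names, and thresholds here are assumptions for illustration, not the paper's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Evidence:
    """Link from a generated unit back to raw lecture material (hypothetical fields)."""
    modality: str                 # e.g. "asr", "ocr", "slide", "board_crop"
    source_id: str                # identifier of the recording or frame
    start_s: Optional[float] = None  # time span for audio/video evidence
    end_s: Optional[float] = None

@dataclass
class Verification:
    """One verification record, lecture-grounded or resource-grounded (assumed labels)."""
    method: str                   # e.g. "lecture_grounded", "resource_grounded"
    verdict: str                  # e.g. "supported", "unsupported", "uncertain"
    score: float = 0.0            # confidence of the check in [0, 1]

@dataclass
class IRUnit:
    """A typed unit of extracted knowledge, e.g. a definition or a flashcard claim."""
    unit_type: str                # e.g. "definition", "formula", "flashcard_claim"
    text: str
    evidence: List[Evidence] = field(default_factory=list)
    checks: List[Verification] = field(default_factory=list)

    def should_abstain(self, threshold: float = 0.5) -> bool:
        """Abstention behavior: refuse to emit a unit that is unverified
        or whose best verification score falls below the threshold."""
        if not self.checks:
            return True
        return max(c.score for c in self.checks) < threshold
```

Under this sketch, a flashcard claim with no verification record would be withheld rather than silently emitted, which is the abstention behavior the abstract argues for.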

References

Xuebai Zhang, Shyan-Ming Yuan, Ming-Dao Chen, and Xiaolong Liu, “A Complete System for Analysis of Video Lecture Based on Eye Tracking,” IEEE Access, 2018. https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8438455

Dipesh Chand and Hasan Ogul, “A Framework for Lecture Video Segmentation from Extracted Speech Content,” 2021 19th World Symposium on Applied Machine Intelligence and Informatics (SAMI), 2021. https://www.researchgate.net/profile/Dipesh-Chand/publication/350294257

Alan Chern et al., “A Smartphone-Based Multi-Functional Hearing Assistive System to Facilitate Speech Recognition in the Classroom,” IEEE Access, 2023. https://ieeexplore.ieee.org/document/7938619

Mu-Chun Su et al., “A Video Analytic In-Class Student Concentration Monitoring System,” IEEE Transactions on Consumer Electronics, 2020. https://ieeexplore.ieee.org/abstract/document/9610134

Bhargava Urala Kota et al., “Automated Detection of Handwritten Whiteboard Content in Lecture Videos for Summarization,” IEEE Access, 2021. https://par.nsf.gov/servlets/purl/10113238

Nigel Bosch and Sidney K. D’Mello, “Automatic Detection of Mind Wandering from Video in the Lab and in the Classroom,” IEEE Transactions on Affective Computing, 2020. https://ieeexplore.ieee.org/document/8680698

Muhammad Bagus Andra and Tsuyoshi Usagawa, “Automatic Lecture Video Content Summarization with Attention-Based Recurrent Neural Network,” 2019 International Conference of Artificial Intelligence and Information Technology (ICAIIT), 2019. https://ieeexplore.ieee.org/abstract/document/8834514

H. Zeng, X. Shu, Y. Wang, Y. Wang, L. Zhang, T.-C. Pong, and H. Qu, “EmotionCues: Emotion-Oriented Visual Summarization of Classroom Videos,” IEEE Trans. Vis. Comput. Graph., vol. 27, no. 7, pp. 3168–3181, 2021. https://ink.library.smu.edu.sg/cgi/viewcontent.cgi?article=6366&context=sis_research

Venkatesh Jatla, Sravani Teeparthi, Ugesh Egala, Sylvia Celedon-Pattichis, and Marios S. Pattichis, “Fast and Accurate Video Analysis and Visualization of Classroom Activities Using Multiobjective Optimization of Extremely Low-Parameter Models,” IEEE Access, 2025. https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10988841

Jingen Li, Jiatian Mei, Di Wu, Mingtao Zhou, and Lin Jiang, “Multimodal Speech Recognition Assisted by Slide Information in Classroom Scenes,” 2024 7th International Conference on Video and Image Processing (ICVISP), 2025. https://ieeexplore.ieee.org/abstract/document/10959642

Shashank Shetty, Arun S. Devadiga, S. Sibi Chakkaravarthy, and K. A. Varun Kumar, “Ote-OCR Based Text Recognition and Extraction from Video Frames,” 2014 IEEE 8th International Conference on Intelligent Systems and Control (ISCO), 2014. https://www.researchgate.net/profile/Shashank-Shetty-3/publication/301405380

Md. Saifuddin Khalid and Md. Iqbal Hossan, “Usability Evaluation of a Video Conferencing System in a University’s Classroom,” in Proc. 19th Int. Conf. Comput. Inf. Technol. (ICCIT), Dhaka, Bangladesh, Dec. 2016, pp. 184–189. https://www.researchgate.net/publication/305904926

Nen-Fu Huang, Hao-Hsuan Hsu, So-Chen Chen, Chia-An Lee, Yi-Wei Huang, Po-Wen Ou, and Jian-Wei Tzeng, “VideoMark: A Video-Based Learning Analytic Technique for MOOCs,” 2017 IEEE 2nd International Conference on Big Data Analysis (ICBDA), 2017. https://ieeexplore.ieee.org/abstract/document/8078738

Kenny Davila and Richard Zanibbi, “Visual Search Engine for Handwritten and Typeset Math in Lecture Videos and LaTeX Notes,” 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), 2018. https://pdfs.semanticscholar.org/3a9e/29504ce39568ca64c6e27335aae6ce6eb751.pdf

M. R. Rahman, S. Shah, and J. Subhlok, “Visual Summarization of Lecture Video Segments for Enhanced Navigation,” in Proc. 2020 IEEE Int. Symp. Multimedia (ISM), Dec. 2020, pp. 154–157, https://arxiv.org/pdf/2006.02434

Kenny Davila and Richard Zanibbi, “Whiteboard Video Summarization via Spatio-Temporal Conflict Minimization,” 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), pp. 1727-1734, 2017. https://cs.rit.edu/~rlaz/files/Kenny_ICDAR_2017.pdf

D. Dickson, C. V. Sharma, and K. Kwok, “Whiteboard Content Extraction and Analysis for the Classroom Environment,” 2008 IEEE International Symposium on Multimedia, pp. 131-138, 2008. https://www.researchgate.net/profile/Allen-Hanson-2/publication/221558684

Z. Tang and J. R. Kender, “A Unified Text Extraction Method for Instructional Videos,” 2005 IEEE International Conference on Image Processing (ICIP), vol. 2, pp. II-1088-II-1091, 2005. https://www.researchgate.net/profile/Lijun-Tang/publication/224622476

S. Banerjee, S. Kundu, and B. B. Chaudhuri, “Automatic Detection of Handwritten Texts from Video Frames of Lectures,” 2014 14th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 479-484, 2014. https://ieeexplore.ieee.org/abstract/document/6981089

M. A. Choudary and S.-F. Liu, “Summarization of Visual Content in Instructional Videos,” IEEE Transactions on Multimedia, vol. 9, no. 7, pp. 1443-1455, 2007. https://www.researchgate.net/profile/Chekuri-Choudary/publication/3424658

Chengpei Xu, Wenjing Jia, Ruomei Wang, Xiangjian He, Baoquan Zhao, Yuanfang Zhang, “Semantic Navigation of PowerPoint-Based Lecture Video for AutoNote Generation,” IEEE Transactions on Learning Technologies, 2023. https://ieeexplore.ieee.org/abstract/document/9927330

Chengpei Xu, Ruomei Wang, Shujin Lin, Xiaonan Luo, Baoquan Zhao, Lijie Shao, Mengqiu Hu, “Lecture2Note: Automatic Generation of Lecture Notes from Slide-Based Educational Videos,” IEEE ICME 2019, 2019. https://www.researchgate.net/profile/Baoquan-Zhao/publication/334997213

A.W.R.P. Karunarathna, T.U.M.N. Premarathna, R.G.S. Dilshan, W.A.K.H.R. Wanniarachchi, Y.M.C.N. Bimsara, I.T.S. Piyatilake, “Voicense: AI-Powered Lecture Note Generation Tool,” IEEE ICITR 2024, 2024. https://ieeexplore.ieee.org/abstract/document/10857774

A. Madhavi, A. Chilakamarri, C. Jupudi, S. Madanaboina, and S. Sriram, “Automatic Running Notes Generation from Audio Lecture using NLP for Comprehensive Learning,” in Proc. 15th Int. Conf. Computing Communication and Networking Technologies (ICCCNT), 2024. https://ieeexplore.ieee.org/abstract/document/10723991

Baoquan Zhao, Songhua Xu, Shujin Lin, Ruomei Wang, Xiaonan Luo, “A New Visual Interface for Searching and Navigating Slide-Based Lecture Videos,” IEEE ICME 2019, 2019. https://www.researchgate.net/profile/Baoquan-Zhao/publication/334997587

Jin-Xia Huang, Yohan Lee, Oh-Woog Kwon, “DIRECT: Toward Dialogue-Based Reading Comprehension Tutoring,” IEEE Access, 2023. https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10003215

N. Singh, V.K. Gunjan, M.M. Nasralla, “A Parametrized Comparative Analysis of Performance Between Proposed Adaptive and Personalized Tutoring System 'Seis Tutor' With Existing Online Tutoring System,” IEEE Access, 2022. https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9755124

M.A. Hasan, N.F.M. Noor, S.S.B. Ab Rahman, M.M. Rahman, “The Transition From Intelligent to Affective Tutoring System: A Review and Open Issues,” IEEE Access, 2020. https://ieeexplore.ieee.org/document/9252896

Lijia Chen, Pingping Chen, Zhijian Lin, “Artificial Intelligence in Education: A Review,” IEEE Access, 2020. https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9069875

M. Murtaza, Y. Ahmed, J. A. Shamsi, F. Sherwani, and M. Usman, “AI-Based Personalized E-Learning Systems: Issues, Challenges, and Solutions,” IEEE Access, vol. 10, pp. 81323-81342, 2022. https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9840390

Amir Hadifar, Semere Kiros Bitew, Johannes Deleu, Chris Develder, Thomas Demeester, “EduQG: A Multi-Format Multiple-Choice Dataset for the Educational Domain,” IEEE Access, 2023. https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10051840

Tim Steuer, Anna Filighera, Thomas Tregel, “Investigating Educational and Noneducational Answer Selection for Educational Question Generation,” IEEE Access, 2022. https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9791321

Shoya Matsumori, Kohei Okuoka, Ryoichi Shibata, Minami Inoue, Yosuke Fukuchi, Michita Imai, “Mask and Cloze: Automatic Open Cloze Question Generation Using a Masked Language Model,” IEEE Access, 2023. https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10024779

Kanokwan Atchariyachanvanich, Srinual Nalintippayawong, and Thanakrit Julavanich, “Reverse SQL Question Generation Algorithm in the DBLearn Adaptive E-Learning System,” IEEE Access, 2019. https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8703745

Ming Liu, Jingxu Zhang, Lucy Michael Nyagoga, Li Liu, “Student-AI Question Co-Creation for Enhancing Reading Comprehension,” IEEE Transactions on Learning Technologies, 2024. https://ieeexplore.ieee.org/abstract/document/10321718

R. M. Elshiny and A. Hamdy, “Automatic Question Generation Using Natural Language Processing and Transformers,” in Proc. 2023 International Conference on Computer and Applications (ICCA), 2023, pp. 1-6. https://ieeexplore.ieee.org/abstract/document/10401848

Sugiyanto Yoannatan Widjaja, Alfa Yohannis, “AI-Powered Automatic Question Generation for Teachers,” IEEE SIML 2025, 2025. https://www.researchgate.net/profile/Sugiyanto-Yoannatan-W/publication/393937138

A. J. Winata, D. J. Surjawan, and V. C. Mawardi, “Utilizing Large Language Models for Developing Automatic Question Generation in Education,” in Proc. 2025 International Conference on Advancement in Data Science, E-Learning and Information System (ICADEIS), 2025. https://ieeexplore.ieee.org/abstract/document/10933227

P. Preetha, G. Sivakamasundari, and K. Srimathi, “Enhancing Assessments: A Comparative Study of T5 and BART Transformer for QG,” in Proc. 2025 International Conference on Computing, Communication, and Multimedia (ICCMC), 2025. https://ieeexplore.ieee.org/abstract/document/11140610

N. Nair, S. Pikle, S. Save, R. Varghese, and K. Sonawane, “FlashMe: Automatic Flashcard Generation,” in Proc. 14th Int. Conf. Computing Communication and Networking Technologies (ICCCNT), 2023, https://ieeexplore.ieee.org/abstract/document/10308164

Irene Li et al., “What Should I Learn First: Introducing LectureBank for NLP Education and Prerequisite Chain Learning,” AAAI, 2019. https://arxiv.org/abs/1811.12181

Sudeshna Roy et al., “Inferring Concept Prerequisite Relations from Online Educational Resources,” AAAI, 2019. https://arxiv.org/abs/1811.12640

Irene Li et al., “R-VGAE: Relational-Variational Graph Autoencoder for Unsupervised Prerequisite Chain Learning,” COLING, 2020. https://aclanthology.org/2020.coling-main.99.pdf

Jifan Yu et al., “MOOCCube: A Large-scale Data Repository for NLP Applications in MOOCs,” ACL, 2020. https://aclanthology.org/2020.acl-main.285.pdf

Fu-Rong Dang et al., “Constructing an Educational Knowledge Graph with Concepts Linked to Wikipedia,” Journal of Computer Science and Technology, 2021. https://jcst.ict.ac.cn/fileup/1000-9000/PDF/2021-5-18-0328.pdf

Mehmet Cem Aytekin and Yücel Saygın, “ACE: AI-Assisted Construction of Educational Knowledge Graphs with Prerequisite Relations,” Journal of Educational Data Mining, 2024. https://jedm.educationaldatamining.org/index.php/JEDM/article/view/737

Sasha Spala et al., “SemEval-2020 Task 6: Definition Extraction from Free Text with the DEFT Corpus,” SemEval, 2020. https://aclanthology.org/2020.semeval-1.41.pdf

Safder et al., “Deep Learning-based Extraction of Algorithmic Metadata in Full-Text Scholarly Documents,” https://e-space.mmu.ac.uk/625933/8/Deep%20Learning-based%20Extraction%20of%20Algorithmic%20Metadata%20in%20Full-Text%20Scholarly%20Documents%20e.pdf

Sarthak Jain et al., “SciREX: A Challenge Dataset for Document-Level Information Extraction,” in Proc. 58th Annual Meeting of the Association for Computational Linguistics (ACL), 2020. https://aclanthology.org/2020.acl-main.670.pdf

Kyle Lo et al., “S2ORC: The Semantic Scholar Open Research Corpus,” ACL, 2020. https://aclanthology.org/2020.acl-main.447.pdf

Yiheng Xu et al., “LayoutLM: Pre-training of Text and Layout for Document Image Understanding,” KDD, 2020. https://dl.acm.org/doi/pdf/10.1145/3394486.3403172

Yang Xu et al., “LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding,” ACL, 2021. https://aclanthology.org/2021.acl-long.201.pdf

Yupan Huang et al., “LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking,” ACM MM, 2022. https://dl.acm.org/doi/pdf/10.1145/3503161.3548112

Junlong Li et al., “DiT: Self-supervised Pre-training for Document Image Transformer,” ACM MM, 2022. https://dl.acm.org/doi/pdf/10.1145/3503161.3547911

Yulin Li et al., “StrucTexT: Structured Text Understanding with Multi-Modal Transformers,” ACM MM, 2021. https://dl.acm.org/doi/pdf/10.1145/3474085.3475345

Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park, “OCR-Free Document Understanding Transformer,” in Proc. European Conference on Computer Vision (ECCV), 2022. doi: 10.1007/978-3-031-19815-1_29.

Minghao Li et al., “TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models,” arXiv, 2021. https://arxiv.org/pdf/2109.10282

Srikar Appalaraju et al., “DocFormer: End-to-End Transformer for Document Understanding,” ICCV, 2021. https://openaccess.thecvf.com/content/ICCV2021/papers/Appalaraju_DocFormer_End-to-End_Transformer_for_Document_Understanding_ICCV_2021_paper.pdf

Zineng Tang et al., “Unifying Vision, Text, and Layout for Universal Document Processing,” CVPR, 2023. https://openaccess.thecvf.com/content/CVPR2023/papers/Tang_Unifying_Vision_Text_and_Layout_for_Universal_Document_Processing_CVPR_2023_paper.pdf

Kenton Lee et al., “Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding,” ICML, 2023. https://proceedings.mlr.press/v202/lee23g/lee23g.pdf

Patrick Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,” NeurIPS, 2020. https://proceedings.neurips.cc/paper/2020/file/6b493230205f780e1bc26945df7481e5-Paper.pdf

K. Guu, K. Lee, Z. Tung, P. Pasupat, and M.-W. Chang, “REALM: Retrieval-Augmented Language Model Pre-Training,” in Proc. ICML, 2020. https://proceedings.mlr.press/v119/guu20a/guu20a.pdf

Vladimir Karpukhin et al., “Dense Passage Retrieval for Open-Domain Question Answering (DPR),” EMNLP, 2020. https://aclanthology.org/2020.emnlp-main.550.pdf

Gautier Izacard and Edouard Grave, “Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering (Fusion-in-Decoder / FiD),” EACL, 2021. https://aclanthology.org/2021.eacl-main.74.pdf

Joshua Maynez et al., “On Faithfulness and Factuality in Abstractive Summarization,” ACL, 2020. doi: 10.18653/v1/2020.acl-main.173. https://aclanthology.org/2020.acl-main.173/

Wojciech Kryściński et al., “Evaluating the Factual Consistency of Abstractive Text Summarization (FactCC),” arXiv, 2019. doi: 10.48550/arXiv.1910.12840. https://arxiv.org/abs/1910.12840

A. Wang, K. Cho, and M. Lewis, “Asking and Answering Questions to Evaluate the Factual Consistency of Summaries (QAGS),” in Proc. 58th Annual Meeting of the Association for Computational Linguistics (ACL), 2020, pp. 5008–5020, doi: 10.18653/v1/2020.acl-main.450. https://aclanthology.org/2020.acl-main.450/

Artidoro Pagnoni et al., “Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics,” NAACL, 2021. doi: 10.18653/v1/2021.naacl-main.383. https://aclanthology.org/2021.naacl-main.383/

Philippe Laban et al., “SummaC: Re-Visiting NLI-based Models for Inconsistency Detection in Summarization,” TACL, 2022. doi: 10.1162/tacl_a_00453. https://aclanthology.org/2022.tacl-1.10/

Or Honovich et al., “TRUE: Re-evaluating Factual Consistency Evaluation,” NAACL, 2022. doi: 10.18653/v1/2022.naacl-main.287. https://aclanthology.org/2022.naacl-main.287/

Stephanie Lin et al., “TruthfulQA: Measuring How Models Mimic Human Falsehoods,” ACL, 2022. doi: 10.48550/arXiv.2109.07958. https://arxiv.org/abs/2109.07958

Junyi Li et al., “HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models,” arXiv, 2023. doi: 10.48550/arXiv.2305.11747. https://arxiv.org/abs/2305.11747

Potsawee Manakul et al., “SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models,” EMNLP, 2023. doi: 10.18653/v1/2023.emnlp-main.557. https://aclanthology.org/2023.emnlp-main.557/

Luyu Gao et al., “RARR: Researching and Revising What Language Models Say, Using Language Models,” ACL, 2023. doi: 10.18653/v1/2023.acl-long.910.

Liyan Tang et al., “MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents,” arXiv, 2024. doi: 10.48550/arXiv.2404.10774. https://arxiv.org/abs/2404.10774

C. Dong, Y. Yuan, K. Chen, S. Cheng, and C. Wen, “How to Build an Adaptive AI Tutor for Any Course Using Knowledge Graph-Enhanced Retrieval-Augmented Generation (KG-RAG),” arXiv:2311.17696, 2023. doi: 10.48550/arXiv.2311.17696. https://arxiv.org/abs/2311.17696

Y. Hicke, A. Agarwal, Q. Ma, and P. Denny, “AI-TA: Towards an Intelligent Question-Answer Teaching Assistant Using Open-Source Large Language Models,” arXiv:2311.02775, 2023. doi: 10.48550/arXiv.2311.02775. https://arxiv.org/abs/2311.02775

D. Yang, S. Lee, M. Kim, J. Won, N. Kim, D. Lee, and J. Yeo, “YA-TA: Yet Another Teaching Assistant: A Case Study on Using Large Language Models for Learning Python,” arXiv:2409.00355, 2024. doi: 10.48550/arXiv.2409.00355. https://arxiv.org/abs/2409.00355

Zifei FeiFei Han et al., “Improving Assessment of Tutoring Practices using Retrieval-Augmented Generation,” arXiv, 2024. doi: 10.48550/arXiv.2402.14594. https://arxiv.org/abs/2402.14594

Zachary Levonian et al., “Retrieval-Augmented Generation to Improve Math Question-Answering: Trade-offs Between Groundedness and Human Preference,” arXiv, 2023. doi: 10.48550/arXiv.2310.03184. https://arxiv.org/abs/2310.03184

Dong Won Lee et al., “Lecture Presentations Multimodal Dataset: Towards Understanding Multimodality in Educational Videos,” ICCV (IEEE/CVF), 2023. doi: 10.1109/ICCV51070.2023.01838. https://openaccess.thecvf.com/content/ICCV2023/papers/Lee_Lecture_Presentations_Multimodal_Dataset_Towards_Understanding_Multimodality_in_Educational_Videos_ICCV_2023_paper.pdf

Zhe Chen et al., “M3AV: A Multimodal, Multigenre, and Multipurpose Audio-Visual Academic Lecture Dataset,” ACL (Long), 2024. doi: 10.18653/v1/2024.acl-long.489. https://aclanthology.org/2024.acl-long.489/

Haoxu Wang et al., “SlideSpeech: A Large-Scale Slide-Enriched Audio-Visual Corpus,” ICASSP, 2024. doi: 10.1109/ICASSP48485.2024.10448079. https://arxiv.org/abs/2309.05396

Katharina Anderer et al., “MaViLS: a Benchmark Dataset for Video-to-Slide Alignment, Assessing Baseline Accuracy with a Multimodal Alignment Algorithm Leveraging Speech, OCR, and Visual Features,” Interspeech, 2024. doi: 10.21437/Interspeech.2024-978. https://www.isca-archive.org/interspeech_2024/anderer24_interspeech.pdf

Pan Lu et al., “Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering,” NeurIPS, 2022. doi: 10.48550/arXiv.2209.09513. https://proceedings.neurips.cc/paper_files/paper/2022/file/11332b6b6cf4485b84afadb1352d3a9a-Paper-Conference.pdf

Minesh Mathew et al., “DocVQA: A Dataset for VQA on Document Images,” WACV, 2021. doi: 10.1109/WACV48630.2021.00225. https://ieeexplore.ieee.org/document/9423358

R. Tanaka, K. Nishida, K. Nishida, T. Hasegawa, I. Saito, and K. Saito, “SlideVQA: A Dataset for Document Visual Question Answering on Multiple Images,” AAAI, 2023. https://arxiv.org/pdf/2301.04883

Ahmed Masry et al., “ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning,” Findings of ACL, 2022. doi: 10.48550/arXiv.2203.10244. https://arxiv.org/abs/2203.10244

Xiang Yue et al., “MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI,” CVPR, 2024. doi: 10.48550/arXiv.2311.16502. https://arxiv.org/abs/2311.16502

Chaoyou Fu et al., “MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models,” arXiv, 2023. doi: 10.48550/arXiv.2306.13394. https://arxiv.org/abs/2306.13394

Yuan Liu et al., “MMBench: Is Your Multi-modal Model an All-Around Player?,” ECCV (LNCS), 2024. doi: 10.1007/978-3-031-72658-3_13. https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/00959.pdf

Haodong Duan et al., “VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models,” ACM MM, 2024. doi: 10.48550/arXiv.2407.11691. https://arxiv.org/abs/2407.11691

Alexander R. Fabbri et al., “SummEval: Re-evaluating Summarization Evaluation,” TACL, 2021. doi: 10.1162/tacl_a_00373. https://arxiv.org/abs/2007.12626

C.-Y. Lin, “ROUGE: A Package for Automatic Evaluation of Summaries,” in Text Summarization Branches Out (Workshop of ACL), 2004. https://aclanthology.org/W04-1013/

T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, “BERTScore: Evaluating Text Generation with BERT,” in Proc. International Conference on Learning Representations (ICLR), 2020. doi: 10.48550/arXiv.1904.09675. https://iclr.cc/virtual_2020/poster_SkeHuCVFDr.html

Thomas Scialom et al., “QuestEval: Summarization Asks for Fact-Based Evaluation,” EMNLP, 2021. doi: 10.18653/v1/2021.emnlp-main.529. https://arxiv.org/abs/2103.12693

Ming Zhong et al., “Towards a Unified Multi-Dimensional Evaluator for Text Generation,” EMNLP, 2022. doi: 10.18653/v1/2022.emnlp-main.131. https://arxiv.org/abs/2210.07197

T. Sellam, D. Das, and A. Parikh, “BLEURT: Learning Robust Metrics for Text Generation,” in Proc. 58th Annual Meeting of the Association for Computational Linguistics (ACL), 2020, pp. 7881-7892. doi: 10.18653/v1/2020.acl-main.704. https://aclanthology.org/2020.acl-main.704/

W. Yuan, G. Neubig, and P. Liu, “BARTScore: Evaluating Generated Text as Text Generation,” arXiv:2106.11520, 2021. doi: 10.48550/arXiv.2106.11520. https://arxiv.org/abs/2106.11520

Leping Qiu et al., “MaRginalia: Enabling In-person Lecture Capturing and Note-taking Through Mixed Reality,” CHI, 2025. doi: 10.1145/3706598.3714065. https://dl.acm.org/doi/10.1145/3706598.3714065

P. A. Diaz Munoz, “Interdisciplinary design practices in contemporary architectural development: Integrating creativity and functionality,” Evolutionary Studies in Imaginative Culture, vol. 5, no. 2, pp. 1–9, 2021.

D. Puthiya, “Strategic AI transformation initiatives for scalable business expansion,” Journal of Information Systems Engineering and Management, vol. 6, no. 2, pp. 1–12, 2021.

A. Kejriwal, “High-stakes negotiation frameworks in cross-functional project environments,” International Journal of Environmental Sciences, vol. 7, no. 1S, pp. 20–27, 2021.

R. Chhibber, “Strategic leadership in partner sales networks for enterprise market expansion,” Journal of International Crisis and Risk Communication Research, vol. 4, no. 3, pp. 467–475, 2021.

G. A. Ascanio, “Wellness-driven design development in luxury residential architecture: Spatial, social, and environmental dimensions,” Journal of Information Systems Engineering and Management, vol. 6, no. 1, pp. 1–10, 2021.

Published

18.03.2026

How to Cite

Soma Kiran Kumar Nellipudi. (2026). Reliable Multimodal AI for Structured Knowledge Extraction and Study Material Generation in Real Classrooms: A Transparent Scoping Survey, Taxonomy, Benchmarks, and Research Roadmap. International Journal of Intelligent Systems and Applications in Engineering, 14(1s), 212–251. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/8165

Section

Research Article
