Reliable Multimodal AI for Structured Knowledge Extraction and Study Material Generation in Real Classrooms: A Transparent Scoping Survey, Taxonomy, Benchmarks, and Research Roadmap
Keywords:
Multimodal learning, lecture understanding, automatic note generation, educational knowledge graphs, retrieval-augmented generation, factuality, verification, benchmarks, trustworthy AI.

Abstract
Educational knowledge in real classrooms is distributed across speech, slides, whiteboards, handwritten mathematics, code, and ad hoc diagrams. This makes accurate and persistent study support difficult even when recordings are available. Recent multimodal models and large language model (LLM) systems can summarize lectures and generate notes, but real deployment remains limited by alignment drift, OCR and ASR noise, incomplete extraction of formal STEM content, and hallucinations that can silently corrupt study artifacts. This paper presents a transparent scoping survey of a balanced 100-paper corpus organized into five clusters: multimodal lecture understanding, educational artifact generation, structured knowledge extraction, reliability and hallucination control, and benchmarks and evaluation. We explicitly treat the last two clusters as a transfer toolkit layer for classroom AI rather than as classroom-native systems. Beyond synthesis, the paper contributes: (1) a review protocol with an explicit audit trail and descriptive-count caveats; (2) a reliability-first classroom pipeline in which alignment is the operational core; (3) an operational intermediate representation (IR) with typed fields, evidence granularity, verification records, and abstention behavior; (4) a worked micro-example that carries a 30-second lecture snippet into evidence-linked flashcards; (5) a lecture-grounded versus resource-grounded verification matrix; and (6) a reviewer-ready multimodal faithfulness protocol for mixed evidence such as noisy board crops, OCR, and ASR. The result is a sharper, more operational roadmap for trustworthy classroom AI.
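As a concrete illustration of contribution (3), the operational IR described above (typed fields, evidence granularity, verification records, and abstention behavior) can be sketched as a small set of typed records. This is a minimal sketch for exposition only: all class names, fields, and thresholds below are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative sketch of an intermediate representation (IR) for classroom
# extraction. Names and fields are hypothetical, not the paper's schema.

@dataclass
class EvidenceSpan:
    modality: str    # evidence source, e.g. "asr", "ocr", "board_crop", "slide"
    source_id: str   # recording or frame identifier
    start_s: float   # evidence window start (seconds)
    end_s: float     # evidence window end (seconds)

@dataclass
class VerificationRecord:
    checker: str     # which verifier produced this record
    verdict: str     # "supported", "unsupported", or "unknown"
    confidence: float  # score in [0, 1]

@dataclass
class IRItem:
    item_type: str   # e.g. "definition", "formula", "claim"
    text: str        # extracted content
    evidence: List[EvidenceSpan] = field(default_factory=list)
    verification: Optional[VerificationRecord] = None

    def emit_flashcard(self, threshold: float = 0.8) -> Optional[dict]:
        """Abstain (return None) unless the item is verified above threshold."""
        v = self.verification
        if v is None or v.verdict != "supported" or v.confidence < threshold:
            return None  # abstention: unverified content never becomes study material
        return {
            "front": f"What does the lecture say about this {self.item_type}?",
            "back": self.text,
            "evidence": [(e.modality, e.start_s, e.end_s) for e in self.evidence],
        }
```

Under this sketch, an item that lacks a verification record, or whose verdict is not "supported" at sufficient confidence, abstains by emitting no flashcard, which is one way to keep silent corruption out of generated study artifacts.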
References
Xuebai Zhang, Shyan-Ming Yuan, Ming-Dao Chen, and Xiaolong Liu, “A Complete System for Analysis of Video Lecture Based on Eye Tracking,” IEEE Access, 2018. https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8438455
Dipesh Chand and Hasan Ogul, “A Framework for Lecture Video Segmentation from Extracted Speech Content,” 2021 19th World Symposium on Applied Machine Intelligence and Informatics (SAMI), 2021. https://www.researchgate.net/profile/Dipesh-Chand/publication/350294257
Alan Chern et al., “A Smartphone-Based Multi-Functional Hearing Assistive System to Facilitate Speech Recognition in the Classroom,” IEEE Access, 2023. https://ieeexplore.ieee.org/document/7938619
Mu-Chun Su et al., “A Video Analytic In-Class Student Concentration Monitoring System,” IEEE Transactions on Consumer Electronics, 2020. https://ieeexplore.ieee.org/abstract/document/9610134
Bhargava Urala Kota et al., “Automated Detection of Handwritten Whiteboard Content in Lecture Videos for Summarization,” IEEE Access, 2021. https://par.nsf.gov/servlets/purl/10113238
Nigel Bosch and Sidney K. D’Mello, “Automatic Detection of Mind Wandering from Video in the Lab and in the Classroom,” IEEE Transactions on Affective Computing, 2020. https://ieeexplore.ieee.org/document/8680698
Muhammad Bagus Andra and Tsuyoshi Usagawa, “Automatic Lecture Video Content Summarization with Attention-Based Recurrent Neural Network,” 2019 International Conference of Artificial Intelligence and Information Technology (ICAIIT), 2019. https://ieeexplore.ieee.org/abstract/document/8834514
H. Zeng, X. Shu, Y. Wang, Y. Wang, L. Zhang, T.-C. Pong, and H. Qu, “EmotionCues: Emotion-Oriented Visual Summarization of Classroom Videos,” IEEE Trans. Vis. Comput. Graph., vol. 27, no. 7, pp. 3168–3181, 2021. https://ink.library.smu.edu.sg/cgi/viewcontent.cgi?article=6366&context=sis_research
Venkatesh Jatla, Sravani Teeparthi, Ugesh Egala, Sylvia Celedon-Pattichis, and Marios S. Pattichis, “Fast and Accurate Video Analysis and Visualization of Classroom Activities Using Multiobjective Optimization of Extremely Low-Parameter Models,” IEEE Access, 2025. https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10988841
Jingen Li, Jiatian Mei, Di Wu, Mingtao Zhou, and Lin Jiang, “Multimodal Speech Recognition Assisted by Slide Information in Classroom Scenes,” 2024 7th International Conference on Video and Image Processing (ICVISP), 2025. https://ieeexplore.ieee.org/abstract/document/10959642
Shashank Shetty, Arun S. Devadiga, S. Sibi Chakkaravarthy, and K. A. Varun Kumar, “Ote-OCR Based Text Recognition and Extraction from Video Frames,” 2014 IEEE 8th International Conference on Intelligent Systems and Control (ISCO), 2014. https://www.researchgate.net/profile/Shashank-Shetty-3/publication/301405380
Md. Saifuddin Khalid and Md. Iqbal Hossan, “Usability Evaluation of a Video Conferencing System in a University’s Classroom,” in Proc. 19th Int. Conf. Comput. Inf. Technol. (ICCIT), Dhaka, Bangladesh, Dec. 2016, pp. 184–189. https://www.researchgate.net/publication/305904926
Nen-Fu Huang, Hao-Hsuan Hsu, So-Chen Chen, Chia-An Lee, Yi-Wei Huang, Po-Wen Ou, and Jian-Wei Tzeng, “VideoMark: A Video-Based Learning Analytic Technique for MOOCs,” 2017 IEEE 2nd International Conference on Big Data Analysis (ICBDA), 2017. https://ieeexplore.ieee.org/abstract/document/8078738
Kenny Davila and Richard Zanibbi, “Visual Search Engine for Handwritten and Typeset Math in Lecture Videos and LaTeX Notes,” 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), 2018. https://pdfs.semanticscholar.org/3a9e/29504ce39568ca64c6e27335aae6ce6eb751.pdf
M. R. Rahman, S. Shah, and J. Subhlok, “Visual Summarization of Lecture Video Segments for Enhanced Navigation,” in Proc. 2020 IEEE Int. Symp. Multimedia (ISM), Dec. 2020, pp. 154–157, https://arxiv.org/pdf/2006.02434
Kenny Davila and Richard Zanibbi, “Whiteboard Video Summarization via Spatio-Temporal Conflict Minimization,” 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), pp. 1727-1734, 2017. https://cs.rit.edu/~rlaz/files/Kenny_ICDAR_2017.pdf
D. Dickson, C. V. Sharma, and K. Kwok, “Whiteboard Content Extraction and Analysis for the Classroom Environment,” 2008 IEEE International Symposium on Multimedia, pp. 131-138, 2008. https://www.researchgate.net/profile/Allen-Hanson-2/publication/221558684
Z. Tang and J. R. Kender, “A Unified Text Extraction Method for Instructional Videos,” 2005 IEEE International Conference on Image Processing (ICIP), vol. 2, pp. II-1088-II-1091, 2005. https://www.researchgate.net/profile/Lijun-Tang/publication/224622476
S. Banerjee, S. Kundu, and B. B. Chaudhuri, “Automatic Detection of Handwritten Texts from Video Frames of Lectures,” 2014 14th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 479-484, 2014. https://ieeexplore.ieee.org/abstract/document/6981089
M. A. Choudary and S.-F. Liu, “Summarization of Visual Content in Instructional Videos,” IEEE Transactions on Multimedia, vol. 9, no. 7, pp. 1443-1455, 2007. https://www.researchgate.net/profile/Chekuri-Choudary/publication/3424658
Chengpei Xu, Wenjing Jia, Ruomei Wang, Xiangjian He, Baoquan Zhao, Yuanfang Zhang, “Semantic Navigation of PowerPoint-Based Lecture Video for AutoNote Generation,” IEEE Transactions on Learning Technologies, 2023. https://ieeexplore.ieee.org/abstract/document/9927330
Chengpei Xu, Ruomei Wang, Shujin Lin, Xiaonan Luo, Baoquan Zhao, Lijie Shao, Mengqiu Hu, “Lecture2Note: Automatic Generation of Lecture Notes from Slide-Based Educational Videos,” IEEE ICME 2019, 2019. https://www.researchgate.net/profile/Baoquan-Zhao/publication/334997213
A.W.R.P. Karunarathna, T.U.M.N. Premarathna, R.G.S. Dilshan, W.A.K.H.R. Wanniarachchi, Y.M.C.N. Bimsara, I.T.S. Piyatilake, “Voicense: AI-Powered Lecture Note Generation Tool,” IEEE ICITR 2024, 2024. https://ieeexplore.ieee.org/abstract/document/10857774
A. Madhavi, A. Chilakamarri, C. Jupudi, S. Madanaboina, and S. Sriram, “Automatic Running Notes Generation from Audio Lecture using NLP for Comprehensive Learning,” in Proc. 15th Int. Conf. Computing Communication and Networking Technologies (ICCCNT), 2024. https://ieeexplore.ieee.org/abstract/document/10723991
Baoquan Zhao, Songhua Xu, Shujin Lin, Ruomei Wang, Xiaonan Luo, “A New Visual Interface for Searching and Navigating Slide-Based Lecture Videos,” IEEE ICME 2019, 2019. https://www.researchgate.net/profile/Baoquan-Zhao/publication/334997587
Jin-Xia Huang, Yohan Lee, Oh-Woog Kwon, “DIRECT: Toward Dialogue-Based Reading Comprehension Tutoring,” IEEE Access, 2023. https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10003215
N. Singh, V.K. Gunjan, M.M. Nasralla, “A Parametrized Comparative Analysis of Performance Between Proposed Adaptive and Personalized Tutoring System 'Seis Tutor' With Existing Online Tutoring System,” IEEE Access, 2022. https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9755124
M.A. Hasan, N.F.M. Noor, S.S.B. Ab Rahman, M.M. Rahman, “The Transition From Intelligent to Affective Tutoring System: A Review and Open Issues,” IEEE Access, 2020. https://ieeexplore.ieee.org/document/9252896
Lijia Chen, Pingping Chen, Zhijian Lin, “Artificial Intelligence in Education: A Review,” IEEE Access, 2020. https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9069875
M. Murtaza, Y. Ahmed, J. A. Shamsi, F. Sherwani, and M. Usman, “AI-Based Personalized E-Learning Systems: Issues, Challenges, and Solutions,” IEEE Access, vol. 10, pp. 81323-81342, 2022. https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9840390
Amir Hadifar, Semere Kiros Bitew, Johannes Deleu, Chris Develder, Thomas Demeester, “EduQG: A Multi-Format Multiple-Choice Dataset for the Educational Domain,” IEEE Access, 2023. https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10051840
Tim Steuer, Anna Filighera, Thomas Tregel, “Investigating Educational and Noneducational Answer Selection for Educational Question Generation,” IEEE Access, 2022. https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9791321
Shoya Matsumori, Kohei Okuoka, Ryoichi Shibata, Minami Inoue, Yosuke Fukuchi, Michita Imai, “Mask and Cloze: Automatic Open Cloze Question Generation Using a Masked Language Model,” IEEE Access, 2023. https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10024779
Kanokwan Atchariyachanvanich, Srinual Nalintippayawong, and Thanakrit Julavanich, “Reverse SQL Question Generation Algorithm in the DBLearn Adaptive E-Learning System,” IEEE Access, 2019. https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8703745
Ming Liu, Jingxu Zhang, Lucy Michael Nyagoga, Li Liu, “Student-AI Question Co-Creation for Enhancing Reading Comprehension,” IEEE Transactions on Learning Technologies, 2024. https://ieeexplore.ieee.org/abstract/document/10321718
R. M. Elshiny and A. Hamdy, “Automatic Question Generation Using Natural Language Processing and Transformers,” in Proc. 2023 International Conference on Computer and Applications (ICCA), 2023, pp. 1-6. https://ieeexplore.ieee.org/abstract/document/10401848
Sugiyanto Yoannatan Widjaja, Alfa Yohannis, “AI-Powered Automatic Question Generation for Teachers,” IEEE SIML 2025, 2025. https://www.researchgate.net/profile/Sugiyanto-Yoannatan-W/publication/393937138
A. J. Winata, D. J. Surjawan, and V. C. Mawardi, “Utilizing Large Language Models for Developing Automatic Question Generation in Education,” in Proc. 2025 International Conference on Advancement in Data Science, E-Learning and Information System (ICADEIS), 2025. https://ieeexplore.ieee.org/abstract/document/10933227
P. Preetha, G. Sivakamasundari, and K. Srimathi, “Enhancing Assessments: A Comparative Study of T5 and BART Transformer for QG,” in Proc. 2025 International Conference on Computing, Communication, and Multimedia (ICCMC), 2025. https://ieeexplore.ieee.org/abstract/document/11140610
N. Nair, S. Pikle, S. Save, R. Varghese, and K. Sonawane, “FlashMe: Automatic Flashcard Generation,” in Proc. 14th Int. Conf. Computing Communication and Networking Technologies (ICCCNT), 2023, https://ieeexplore.ieee.org/abstract/document/10308164
Irene Li et al., “What Should I Learn First: Introducing LectureBank for NLP Education and Prerequisite Chain Learning,” AAAI, 2019. https://arxiv.org/abs/1811.12181
Sudeshna Roy et al., “Inferring Concept Prerequisite Relations from Online Educational Resources,” AAAI, 2019. https://arxiv.org/abs/1811.12640
Irene Li et al., “R-VGAE: Relational-Variational Graph Autoencoder for Unsupervised Prerequisite Chain Learning,” COLING, 2020. https://aclanthology.org/2020.coling-main.99.pdf
Jifan Yu et al., “MOOCCube: A Large-scale Data Repository for NLP Applications in MOOCs,” ACL, 2020. https://aclanthology.org/2020.acl-main.285.pdf
Fu-Rong Dang et al., “Constructing an Educational Knowledge Graph with Concepts Linked to Wikipedia,” Journal of Computer Science and Technology, 2021. https://jcst.ict.ac.cn/fileup/1000-9000/PDF/2021-5-18-0328.pdf
Mehmet Cem Aytekin and Yücel Saygın, “ACE: AI-Assisted Construction of Educational Knowledge Graphs with Prerequisite Relations,” Journal of Educational Data Mining, 2024. https://jedm.educationaldatamining.org/index.php/JEDM/article/view/737
Sasha Spala et al., “SemEval-2020 Task 6: Definition Extraction from Free Text with the DEFT Corpus,” SemEval, 2020. https://aclanthology.org/2020.semeval-1.41.pdf
Safder et al., “Deep Learning-based Extraction of Algorithmic Metadata in Full-Text Scholarly Documents,” https://e-space.mmu.ac.uk/625933/8/Deep%20Learning-based%20Extraction%20of%20Algorithmic%20Metadata%20in%20Full-Text%20Scholarly%20Documents%20e.pdf
Sarthak Jain et al., “SciREX: A Challenge Dataset for Document-Level Information Extraction,” in Proc. 58th Annual Meeting of the Association for Computational Linguistics (ACL), 2020. https://aclanthology.org/2020.acl-main.670.pdf
Kyle Lo et al., “S2ORC: The Semantic Scholar Open Research Corpus,” ACL, 2020. https://aclanthology.org/2020.acl-main.447.pdf
Yiheng Xu et al., “LayoutLM: Pre-training of Text and Layout for Document Image Understanding,” KDD, 2020. https://dl.acm.org/doi/pdf/10.1145/3394486.3403172
Yang Xu et al., “LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding,” ACL, 2021. https://aclanthology.org/2021.acl-long.201.pdf
Yupan Huang et al., “LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking,” ACM Multimedia, 2022. https://dl.acm.org/doi/pdf/10.1145/3503161.3548112
Junlong Li et al., “DiT: Self-supervised Pre-training for Document Image Transformer,” ACM Multimedia, 2022. https://dl.acm.org/doi/pdf/10.1145/3503161.3547911
Yulin Li et al., “StrucTexT: Structured Text Understanding with Multi-Modal Transformers,” ACM Multimedia, 2021. https://dl.acm.org/doi/pdf/10.1145/3474085.3475345
Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park, “OCR-Free Document Understanding Transformer,” in Proc. European Conference on Computer Vision (ECCV), 2022. doi: 10.1007/978-3-031-19815-1_29.
Minghao Li et al., “TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models,” arXiv, 2021. https://arxiv.org/pdf/2109.10282
Srikar Appalaraju et al., “DocFormer: End-to-End Transformer for Document Understanding,” ICCV, 2021. https://openaccess.thecvf.com/content/ICCV2021/papers/Appalaraju_DocFormer_End-to-End_Transformer_for_Document_Understanding_ICCV_2021_paper.pdf
Zineng Tang et al., “Unifying Vision, Text, and Layout for Universal Document Processing,” CVPR, 2023. https://openaccess.thecvf.com/content/CVPR2023/papers/Tang_Unifying_Vision_Text_and_Layout_for_Universal_Document_Processing_CVPR_2023_paper.pdf
Kenton Lee et al., “Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding,” ICML, 2023. https://proceedings.mlr.press/v202/lee23g/lee23g.pdf
Patrick Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,” NeurIPS, 2020. https://proceedings.neurips.cc/paper/2020/file/6b493230205f780e1bc26945df7481e5-Paper.pdf
K. Guu, K. Lee, Z. Tung, P. Pasupat, and M.-W. Chang, “REALM: Retrieval-Augmented Language Model Pre-Training,” ICML, 2020. https://proceedings.mlr.press/v119/guu20a/guu20a.pdf
Vladimir Karpukhin et al., “Dense Passage Retrieval for Open-Domain Question Answering (DPR),” EMNLP, 2020. https://aclanthology.org/2020.emnlp-main.550.pdf
Gautier Izacard and Edouard Grave, “Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering (Fusion-in-Decoder / FiD),” EACL, 2021. https://aclanthology.org/2021.eacl-main.74.pdf
Joshua Maynez et al., “On Faithfulness and Factuality in Abstractive Summarization,” ACL, 2020. doi: 10.18653/v1/2020.acl-main.173. https://aclanthology.org/2020.acl-main.173/
Wojciech Kryściński et al., “Evaluating the Factual Consistency of Abstractive Text Summarization (FactCC),” arXiv, 2019. doi: 10.48550/arXiv.1910.12840. https://arxiv.org/abs/1910.12840
A. Wang, K. Cho, and M. Lewis, “Asking and Answering Questions to Evaluate the Factual Consistency of Summaries (QAGS),” in Proc. 58th Annual Meeting of the Association for Computational Linguistics (ACL), 2020, pp. 5008–5020, doi: 10.18653/v1/2020.acl-main.450. https://aclanthology.org/2020.acl-main.450/
Artidoro Pagnoni et al., “Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics,” NAACL, 2021. doi: 10.18653/v1/2021.naacl-main.383. https://aclanthology.org/2021.naacl-main.383/
Philippe Laban et al., “SummaC: Re-Visiting NLI-based Models for Inconsistency Detection in Summarization,” TACL, 2022. doi: 10.1162/tacl_a_00453. https://aclanthology.org/2022.tacl-1.10/
Or Honovich et al., “TRUE: Re-evaluating Factual Consistency Evaluation,” NAACL, 2022. doi: 10.18653/v1/2022.naacl-main.287. https://aclanthology.org/2022.naacl-main.287/
Stephanie Lin et al., “TruthfulQA: Measuring How Models Mimic Human Falsehoods,” arXiv, 2022. doi: 10.48550/arXiv.2109.07958. https://arxiv.org/abs/2109.07958
Junyi Li et al., “HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models,” arXiv, 2023. doi: 10.48550/arXiv.2305.11747. https://arxiv.org/abs/2305.11747
Potsawee Manakul et al., “SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models,” EMNLP, 2023. doi: 10.18653/v1/2023.emnlp-main.557. https://aclanthology.org/2023.emnlp-main.557/
Luyu Gao et al., “RARR: Researching and Revising What Language Models Say, Using Language Models,” ACL, 2023. doi: 10.18653/v1/2023.acl-long.910.
Liyan Tang et al., “MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents,” arXiv, 2024. doi: 10.48550/arXiv.2404.10774. https://arxiv.org/abs/2404.10774
C. Dong, Y. Yuan, K. Chen, S. Cheng, and C. Wen, “How to Build an Adaptive AI Tutor for Any Course Using Knowledge Graph-Enhanced Retrieval-Augmented Generation (KG-RAG),” arXiv:2311.17696, 2023. doi: 10.48550/arXiv.2311.17696. https://arxiv.org/abs/2311.17696
Y. Hicke, A. Agarwal, Q. Ma, and P. Denny, “AI-TA: Towards an Intelligent Question-Answer Teaching Assistant Using Open-Source Large Language Models,” arXiv:2311.02775, 2023. doi: 10.48550/arXiv.2311.02775. https://arxiv.org/abs/2311.02775
D. Yang, S. Lee, M. Kim, J. Won, N. Kim, D. Lee, and J. Yeo, “YA-TA: Yet Another Teaching Assistant: A Case Study on Using Large Language Models for Learning Python,” arXiv:2409.00355, 2024. doi: 10.48550/arXiv.2409.00355. https://arxiv.org/abs/2409.00355
Zifei FeiFei Han et al., “Improving Assessment of Tutoring Practices using Retrieval-Augmented Generation,” arXiv, 2024. doi: 10.48550/arXiv.2402.14594. https://arxiv.org/abs/2402.14594
Zachary Levonian et al., “Retrieval-Augmented Generation to Improve Math Question-Answering: Trade-offs Between Groundedness and Human Preference,” arXiv, 2023. doi: 10.48550/arXiv.2310.03184. https://arxiv.org/abs/2310.03184
Dong Won Lee et al., “Lecture Presentations Multimodal Dataset: Towards Understanding Multimodality in Educational Videos,” ICCV, 2023. doi: 10.1109/ICCV51070.2023.01838. https://openaccess.thecvf.com/content/ICCV2023/papers/Lee_Lecture_Presentations_Multimodal_Dataset_Towards_Understanding_Multimodality_in_Educational_Videos_ICCV_2023_paper.pdf
Zhe Chen et al., “M3AV: A Multimodal, Multigenre, and Multipurpose Audio-Visual Academic Lecture Dataset,” ACL, 2024. doi: 10.18653/v1/2024.acl-long.489. https://aclanthology.org/2024.acl-long.489/
Haoxu Wang et al., “SlideSpeech: A Large-Scale Slide-Enriched Audio-Visual Corpus,” ICASSP, 2024. doi: 10.1109/ICASSP48485.2024.10448079. https://arxiv.org/abs/2309.05396
Katharina Anderer et al., “MaViLS: a Benchmark Dataset for Video-to-Slide Alignment, Assessing Baseline Accuracy with a Multimodal Alignment Algorithm Leveraging Speech, OCR, and Visual Features,” Interspeech, 2024. doi: 10.21437/Interspeech.2024-978. https://www.isca-archive.org/interspeech_2024/anderer24_interspeech.pdf
Pan Lu et al., “Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering,” NeurIPS, 2022. doi: 10.48550/arXiv.2209.09513. https://proceedings.neurips.cc/paper_files/paper/2022/file/11332b6b6cf4485b84afadb1352d3a9a-Paper-Conference.pdf
Minesh Mathew et al., “DocVQA: A Dataset for VQA on Document Images,” WACV, 2021. doi: 10.1109/WACV48630.2021.00225. https://ieeexplore.ieee.org/document/9423358
R. Tanaka, K. Nishida, K. Nishida, T. Hasegawa, I. Saito, and K. Saito, “SlideVQA: A Dataset for Document Visual Question Answering on Multiple Images,” AAAI, 2023. https://arxiv.org/pdf/2301.04883
Ahmed Masry et al., “ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning,” Findings of ACL, 2022. doi: 10.48550/arXiv.2203.10244. https://arxiv.org/abs/2203.10244
Xiang Yue et al., “MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI,” CVPR, 2024. doi: 10.48550/arXiv.2311.16502. https://arxiv.org/abs/2311.16502
Chaoyou Fu et al., “MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models,” arXiv, 2023. doi: 10.48550/arXiv.2306.13394. https://arxiv.org/abs/2306.13394
Yuan Liu et al., “MMBench: Is Your Multi-modal Model an All-Around Player?,” ECCV (LNCS), 2024. doi: 10.1007/978-3-031-72658-3_13. https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/00959.pdf
Haodong Duan et al., “VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models,” ACM Multimedia, 2024. doi: 10.48550/arXiv.2407.11691. https://arxiv.org/abs/2407.11691
Alexander R. Fabbri et al., “SummEval: Re-evaluating Summarization Evaluation,” TACL, 2021. doi: 10.1162/tacl_a_00373. https://arxiv.org/abs/2007.12626
C.-Y. Lin, “ROUGE: A Package for Automatic Evaluation of Summaries,” in Text Summarization Branches Out (Workshop of ACL), 2004. https://aclanthology.org/W04-1013/
T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, “BERTScore: Evaluating Text Generation with BERT,” in Proc. International Conference on Learning Representations (ICLR), 2020. doi: 10.48550/arXiv.1904.09675. https://iclr.cc/virtual_2020/poster_SkeHuCVFDr.html
Thomas Scialom et al., “QuestEval: Summarization Asks for Fact-Based Evaluation,” EMNLP, 2021. doi: 10.18653/v1/2021.emnlp-main.529. https://arxiv.org/abs/2103.12693
Ming Zhong et al., “Towards a Unified Multi-Dimensional Evaluator for Text Generation,” EMNLP, 2022. doi: 10.18653/v1/2022.emnlp-main.131. https://arxiv.org/abs/2210.07197
T. Sellam, D. Das, and A. Parikh, “BLEURT: Learning Robust Metrics for Text Generation,” in Proc. 58th Annual Meeting of the Association for Computational Linguistics (ACL), 2020, pp. 7881-7892. doi: 10.18653/v1/2020.acl-main.704. https://aclanthology.org/2020.acl-main.704/
W. Yuan, G. Neubig, and P. Liu, “BARTScore: Evaluating Generated Text as Text Generation,” arXiv:2106.11520, 2021. doi: 10.48550/arXiv.2106.11520. https://arxiv.org/abs/2106.11520
Leping Qiu et al., “MaRginalia: Enabling In-person Lecture Capturing and Note-taking Through Mixed Reality,” CHI, 2025. doi: 10.1145/3706598.3714065. https://dl.acm.org/doi/10.1145/3706598.3714065
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.