Enhancing Information Retrieval from Unstructured Data Using Lexical Chain Analysis and WordNet Integration with Lucene
Keywords:
Lucene, lexical chain analysis, WordNet, information retrieval, unstructured data, semantic search, document indexing, image-based search.
Abstract
Working with big data requires extracting relevant information from unstructured data sources accurately. This paper presents an approach that aims to improve the precision and relevance of Lucene search results by applying lexical chain analysis. The lexical chains are built with WordNet, a large lexical database of synonyms, hyponyms, hypernyms, homonyms, and meronyms. A lexical chain is a sequence of semantically related words in a text, and it gives an important indication of the context and conceptual framework of that text. The system's lexical chain analysis aims to exploit the subtle semantic associations present in the search text, providing better context and improving recall by retrieving documents that are semantically more relevant. The proposed system also supports several search modalities: document name, content, and attributes such as type, size, date, and author. The indexing mechanism is based on keyword frequency, that is, on how often keywords related to the lexical chains occur in the documents. The search engine is also intended to cover image-based documents, accommodating the diverse formats typical of contemporary data repositories. Autocompletion suggestions, drawn from past search queries and documents, are intended to enrich user interaction by helping users find what they need quickly. The system will further incorporate sentence-based searching so that users can break down text and mine it for in-depth details. This cross-disciplinary effort is expected to advance search over unstructured data.
By combining Lucene's search capabilities with the semantic insight of WordNet-driven lexical chain analysis, this paper seeks to adapt information retrieval to the changing needs of many domains, making access to knowledge repositories intuitive and efficient.
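As an illustration only, the idea of grouping related words into lexical chains and ranking documents by the frequency of chain-related keywords can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the system described in the paper: the `TOY_RELATIONS` table is a hypothetical hand-written stand-in for WordNet, the greedy chaining rule and the frequency score are simplifications chosen for brevity, and all function names are invented for this example. Lucene itself is not used here.

```python
import re
from collections import Counter

# Hypothetical stand-in for WordNet: each word maps to a few related words
# (synonyms, hypernyms, hyponyms). A real system would query WordNet instead.
TOY_RELATIONS = {
    "car": {"automobile", "vehicle"},
    "automobile": {"car", "vehicle"},
    "vehicle": {"car", "automobile", "truck"},
    "truck": {"vehicle"},
    "bank": {"lender"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def related(a, b):
    """Two words are related if they are equal or linked in the toy table."""
    return (
        a == b
        or b in TOY_RELATIONS.get(a, set())
        or a in TOY_RELATIONS.get(b, set())
    )


def build_chains(tokens):
    """Greedy chaining: add each known word to the first chain that
    already contains a related word, otherwise start a new chain."""
    chains = []
    for tok in tokens:
        if tok not in TOY_RELATIONS:
            continue  # skip words the toy lexicon does not know
        for chain in chains:
            if any(related(tok, member) for member in chain):
                chain.append(tok)
                break
        else:
            chains.append([tok])
    return chains


def expand_query(query):
    """Expand the query terms with their directly related words."""
    terms = set(tokenize(query))
    for term in list(terms):
        terms |= TOY_RELATIONS.get(term, set())
    return terms


def score(document, terms):
    """Keyword-frequency score: total occurrences of the expanded terms."""
    counts = Counter(tokenize(document))
    return sum(counts[t] for t in terms)


def search(query, documents):
    """Return names of matching documents, highest score first."""
    terms = expand_query(query)
    scored = [(score(text, terms), name) for name, text in documents.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [name for s, name in ranked if s > 0]


docs = {
    "a.txt": "The automobile is a vehicle. The truck is a vehicle too.",
    "b.txt": "The bank approved the loan.",
    "c.txt": "A car was parked.",
}
results = search("car", docs)
chains = build_chains(tokenize("A car is a vehicle. A truck too. The bank."))
```

In this toy run, the query "car" also matches a document that never contains the word "car" but does mention "automobile" and "vehicle", which is the kind of recall improvement the abstract attributes to semantic expansion.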
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
IJISAE open access articles are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. This license requires users to give appropriate credit, provide a link to the license, and indicate if changes were made; if they remix, transform, or build upon the material, they must distribute their contributions under the same license as the original.