Semantics-Based String Matching: A Review of Machine Learning Models

Authors

  • Shaik Asha, Research Scholar, Department of Computer Science and Engineering, Koneru Lakshmaiah Education Foundation, Guntur, Andhra Pradesh, India.
  • Sajja Tulasi Krishna, Assistant Professor, Department of Computer Science and Engineering, Koneru Lakshmaiah Education Foundation, Guntur, Andhra Pradesh, India.

Keywords

String Matching, Semantic Similarity, Neural Networks, Deep Learning, Natural Language Processing, Information Retrieval, Data Integration, Knowledge Graphs

Abstract

String matching is fundamental across domains including search, data integration, biology, and security. However, traditional algorithms that rely on direct character comparisons and predetermined rules fail to capture semantic similarity. Recent advances in artificial intelligence (AI) and machine learning have enabled more flexible, semantics-based string matching models. This review surveys the literature on AI techniques for string matching, focusing on neural networks, graph models, attention mechanisms, reinforcement learning, and generative models. These methods extract latent features so that strings are matched on their underlying semantics rather than on surface-form similarity. Reported benefits include an improved ability to handle real-world variability, noise, and ambiguity. However, challenges remain around computational complexity, model interpretability, and adaptation across domains. By synthesizing current advantages and limitations, the review highlights promising research directions for advancing AI-driven string matching. Built on modern statistical learning, AI promises more powerful and scalable string matching, with applications to comparisons of text, structured data, multimedia, and bioinformatics data.
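The contrast the abstract draws between surface-form and semantics-based matching can be illustrated with a minimal Python sketch. It is not taken from the reviewed models: the three-dimensional vectors in TOY_EMBEDDINGS are hypothetical, hand-assigned values standing in for learned latent features, and the function names are chosen here for illustration only.

```python
import math
from difflib import SequenceMatcher

# Hypothetical, hand-assigned 3-d "embeddings" for illustration only;
# a real model would learn such vectors from data.
TOY_EMBEDDINGS = {
    "car": (0.9, 0.1, 0.0),
    "automobile": (0.85, 0.15, 0.05),
    "cat": (0.0, 0.2, 0.95),
}


def surface_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], based on matching substrings."""
    return SequenceMatcher(None, a, b).ratio()


def semantic_similarity(a: str, b: str) -> float:
    """Cosine similarity between the toy vectors assigned to a and b."""
    u, v = TOY_EMBEDDINGS[a], TOY_EMBEDDINGS[b]
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    for pair in [("car", "cat"), ("car", "automobile")]:
        print(pair,
              "surface:", round(surface_similarity(*pair), 3),
              "semantic:", round(semantic_similarity(*pair), 3))
```

Under these assumed vectors, the character-level score ranks "car" closer to "cat" than to "automobile", while the cosine score over the toy vectors ranks them the other way round, which is the kind of gap the semantics-based approaches aim to close.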

Published

13.12.2023

How to Cite

Asha, S., & Krishna, S. T. (2023). Semantics-Based String Matching: A Review of Machine Learning Models. International Journal of Intelligent Systems and Applications in Engineering, 12(8s), 347–356. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/4126

Section

Research Article