Semantics-Based String Matching: A Review of Machine Learning Models
Keywords: String Matching, Semantic Similarity, Neural Networks, Deep Learning, Natural Language Processing, Information Retrieval, Data Integration, Knowledge Graphs

Abstract
String matching is fundamental across domains including search, data integration, biology, and security. However, traditional algorithms that rely on direct character comparisons and predetermined rules fail to capture semantic similarities. Recent advances in artificial intelligence (AI) and machine learning have enabled more flexible, semantics-based string matching models. This work reviews the literature on AI techniques for string matching, focusing on neural networks, graph models, attention mechanisms, reinforcement learning, and generative models. These methodologies extract latent features, matching strings on their underlying semantics rather than surface-form similarity. Reported benefits include an improved ability to handle real-world variability, noise, and ambiguity; however, challenges remain around computational complexity, model interpretability, and adaptation across domains. By synthesizing current advantages and limitations, this review highlights promising research directions for advancing AI-driven string matching. Enabled by modern statistical learning, AI promises more powerful and scalable string matching with versatile applications in text, structured data, multimedia, and bioinformatics comparisons.
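To make the abstract's central contrast concrete, the sketch below compares a classic character-level measure (Levenshtein edit distance) with cosine similarity over embeddings. The embedding vectors here are tiny hand-made toy values standing in for the output of a real learned encoder (e.g., a Siamese network); they are illustrative assumptions, not real model outputs.

```python
import math

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (character-level)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv)

# Toy "embeddings": semantically related strings get nearby vectors,
# unrelated ones do not (hypothetical values for illustration only).
emb = {
    "car":        [0.90, 0.10, 0.00],
    "automobile": [0.85, 0.15, 0.05],
    "cat":        [0.10, 0.90, 0.20],
}

# Surface form: "car" is one edit from "cat" but many edits from
# "automobile", so edit distance ranks the unrelated word as closer.
print(levenshtein("car", "cat"))         # 1
print(levenshtein("car", "automobile"))

# Semantics: the embedding space reverses that judgment.
print(cosine(emb["car"], emb["automobile"]) > cosine(emb["car"], emb["cat"]))  # True
```

The same failure mode motivates the learned models surveyed in the review: any rule operating only on characters will score "car"/"cat" as near-identical, while a representation trained on meaning places "car" next to "automobile" instead.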
License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.