Advancing Vulnerability Detection: An Innovative Approach to Generate Embeddings of Code Snippets
Keywords:
GNN, embeddings, integrating, underlying, exploration, potential
Abstract
This research proposes a novel approach for generating embeddings of Java code using Graph Neural Networks (GNNs). The approach processes a combined representation of code structure and functionality drawn from Abstract Syntax Trees (ASTs), Program Dependence Graphs (PDGs), and Control Flow Graphs (CFGs). The GNN leverages these graph representations to capture intricate relationships within the code, yielding more informative embeddings. Evaluation shows that the embeddings perform well in software engineering tasks such as code similarity detection, bug localization (over 90% precision for some vulnerabilities), and code classification. In addition, dimensionality reduction techniques applied to the embeddings visualize the code snippets effectively, revealing insights into their underlying structure and relationships. By capturing complex code dependencies, the approach paves the way for advances in automated code analysis, and the resulting embeddings have the potential to improve practices such as code review automation, early vulnerability detection, code refactoring, and code search. The success of this GNN-based approach also invites further exploration of GNNs in code analysis. Limitations include the focus on Java and the potential influence of training data on model performance. Future directions include investigating applicability to other languages, incorporating domain-specific knowledge, developing interpretable GNNs, and integrating the embeddings with existing tools to form a comprehensive code analysis platform. Overall, this research demonstrates the effectiveness of GNNs for code embedding generation, with the potential to significantly advance automated code analysis and software development practices.
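The general idea of embedding a snippet from a combined AST, CFG, and PDG graph can be illustrated with a minimal sketch. This is not the authors' model: the toy graph, the one-hot node features, the function names, and the choice of unweighted mean aggregation with mean pooling are all assumptions made for illustration. A real GNN would add trainable weight matrices and nonlinearities, and would typically distinguish edge types.

```python
import math

# Hypothetical toy graph for a four-node snippet. Node ids, node types,
# and edges are invented for illustration only.
NODE_FEATURES = {
    0: [1.0, 0.0, 0.0],  # e.g. a declaration node
    1: [0.0, 1.0, 0.0],  # e.g. an expression node
    2: [0.0, 1.0, 0.0],  # e.g. another expression node
    3: [0.0, 0.0, 1.0],  # e.g. a return node
}
AST_EDGES = [(0, 1), (0, 2), (2, 3)]  # syntactic parent -> child
CFG_EDGES = [(1, 2), (2, 3)]          # possible execution order
PDG_EDGES = [(1, 3)]                  # data or control dependence


def merge_edges(*edge_lists):
    """Union the edges of several graphs over the same nodes (undirected)."""
    neighbours = {}
    for edges in edge_lists:
        for src, dst in edges:
            neighbours.setdefault(src, set()).add(dst)
            neighbours.setdefault(dst, set()).add(src)
    return neighbours


def message_pass(features, neighbours):
    """One round of mean aggregation: average each node with its neighbours."""
    updated = {}
    for node, vec in features.items():
        group = [vec] + [features[n] for n in sorted(neighbours.get(node, ()))]
        updated[node] = [sum(col) / len(group) for col in zip(*group)]
    return updated


def embed(features, neighbours, rounds=2):
    """Run message passing, then mean-pool node vectors into one embedding."""
    for _ in range(rounds):
        features = message_pass(features, neighbours)
    vectors = list(features.values())
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


combined = merge_edges(AST_EDGES, CFG_EDGES, PDG_EDGES)
embedding = embed(NODE_FEATURES, combined)
```

Embeddings of two snippets produced this way could be compared with `cosine_similarity`, which is one common way to approach the code similarity task mentioned above.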
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
IJISAE open access articles are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Under this license, users must give appropriate credit, provide a link to the license, and indicate if changes were made; if they remix, transform, or build upon the material, they must distribute their contributions under the same license as the original.