Advancing Vulnerability Detection: An Innovative Approach to Generate Embeddings of Code Snippets

Authors

  • Anushka Singh

Keywords:

GNN, embeddings, integrating, underlying, exploration, potential

Abstract

This research proposes a novel approach for generating embeddings of Java code using Graph Neural Networks (GNNs). The approach processes a combined representation of code structure and functionality drawn from Abstract Syntax Trees (ASTs), Program Dependence Graphs (PDGs), and Control Flow Graphs (CFGs). The GNN leverages these graph representations to capture intricate relationships within the code, which leads to more informative embeddings. Evaluation shows that the embeddings perform well on software engineering tasks such as code similarity detection, bug localization (over 90% precision for some vulnerabilities), and code classification. In addition, dimensionality reduction techniques make it possible to visualize code snippets based on their embeddings, revealing insights into their underlying structure and relationships. By capturing complex code dependencies, the approach paves the way for advances in automated code analysis, and the resulting embeddings have the potential to improve practices such as code review automation, early vulnerability detection, code refactoring, and code search. The success of the GNN-based approach also opens the door to further exploration of GNNs in code analysis. Limitations include the focus on Java and the potential influence of training data on model performance. Future directions include investigating applicability to other languages, incorporating domain-specific knowledge, developing interpretable GNNs, and integrating the embeddings with existing tools to form a comprehensive code analysis platform. Overall, this research demonstrates the effectiveness of GNNs for code embedding generation, with the potential to revolutionize automated code analysis and software development practices.
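The idea of embedding a snippet from a combined AST, CFG, and PDG graph can be illustrated with a minimal sketch. It is not the paper's model: the node labels, the hand-written edge lists, and every function name below are hypothetical, and the learned weights and nonlinearities of a real GNN are replaced by parameter-free mean aggregation so the example runs with the Python standard library alone.

```python
import math
from collections import defaultdict

# Assumed toy vocabulary of node types; a real pipeline would take these from a parser.
NODE_TYPES = ["Decl", "BinaryExpr", "If", "Return"]

def one_hot(label, vocab):
    """Initial node feature: a one-hot vector over the node-type vocabulary."""
    return [1.0 if label == v else 0.0 for v in vocab]

def combine_edges(ast, cfg, pdg):
    """Merge the three edge lists into one undirected adjacency map."""
    adj = defaultdict(set)
    for edges in (ast, cfg, pdg):
        for src, dst in edges:
            adj[src].add(dst)
            adj[dst].add(src)
    return adj

def message_pass(features, adj, rounds=2):
    """Each round sets a node's vector to the mean of itself and its neighbours."""
    h = {node: list(vec) for node, vec in features.items()}
    for _ in range(rounds):
        nxt = {}
        for node, vec in h.items():
            group = [vec] + [h[m] for m in sorted(adj[node])]
            nxt[node] = [sum(col) / len(group) for col in zip(*group)]
        h = nxt
    return h

def graph_embedding(h):
    """Mean-pool all node vectors into a single snippet-level embedding."""
    vecs = list(h.values())
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(a, b):
    """Cosine similarity between two embeddings (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def embed(node_labels, ast, cfg, pdg, rounds=2):
    """Build features, merge the three graphs, propagate, and pool."""
    features = {n: one_hot(lbl, NODE_TYPES) for n, lbl in node_labels.items()}
    adj = combine_edges(ast, cfg, pdg)
    return graph_embedding(message_pass(features, adj, rounds))

# Hand-written graph for a snippet like `int y = x + 1; if (y > 0) return y;`
# (illustrative edges only, not the output of a real AST/CFG/PDG extractor).
labels = {0: "Decl", 1: "BinaryExpr", 2: "If", 3: "Return"}
ast_edges = [(0, 1), (2, 3)]
cfg_edges = [(0, 2), (2, 3)]
pdg_edges = [(0, 2), (0, 3)]

emb = embed(labels, ast_edges, cfg_edges, pdg_edges)
emb_no_edges = embed(labels, [], [], [])
print(emb)
print(cosine(emb, emb_no_edges))
```

Merging the edge lists before propagation lets syntactic, control-flow, and data-dependence neighbours all contribute to a node's vector. A trained model would typically keep the edge types distinct and learn separate transformations for each.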

Published

09.07.2024

How to Cite

Anushka Singh. (2024). Advancing Vulnerability Detection: An Innovative Approach to Generate Embeddings of Code Snippets. International Journal of Intelligent Systems and Applications in Engineering, 12(22s), 280–287. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/6423

Section

Research Article