features (including text-level code comments, semantic-level code tokens, function-level function names, and the structure-level AST), whose semantic vector modeling fully exploits the hidden information of each modal feature and improves the accuracy of the code representation. In addition, REcomp optimizes the AST serialization method: it preserves the complete control information of non-leaf nodes and the semantic information of leaf nodes while avoiding problems such as node redundancy and overly long sequences, greatly shortening the input feature sequence. The shorter feature sequence enables the model to converge rapidly in a short time using only minimal computational resources and small training batches, while achieving good results on public datasets.
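To make the serialization idea concrete, the sketch below is a minimal illustration in Python, assuming the standard ast module as a stand-in for the parser; the function name serialize_ast and the exact token rules are illustrative and do not reproduce REcomp's actual serialization procedure. Interior nodes emit their type names (structural control information), leaf identifiers and literals emit their values (semantic information), and pure syntax markers such as Load/Store contexts are dropped to avoid redundant tokens.

import ast

def serialize_ast(source: str) -> list[str]:
    # Flatten an AST into a compact token sequence: interior nodes
    # contribute their type names (control/structure information),
    # leaves contribute identifiers or literal values (semantic
    # information); context markers such as Load/Store are skipped
    # to avoid redundant tokens. Illustrative sketch, not REcomp.
    tokens: list[str] = []

    def visit(node: ast.AST) -> None:
        if isinstance(node, ast.Name):          # leaf: variable identifier
            tokens.append(node.id)
            return
        if isinstance(node, ast.Constant):      # leaf: literal value
            tokens.append(repr(node.value))
            return
        if isinstance(node, ast.arg):           # leaf: parameter name
            tokens.append(node.arg)
            return
        if isinstance(node, (ast.Load, ast.Store, ast.Del)):
            return                              # pure syntax, carries no information
        label = type(node).__name__
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            label += ':' + node.name            # keep the function-level name feature
        tokens.append(label)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(ast.parse(source))
    return tokens

print(serialize_ast('def add(a, b):\n    return a + b'))
# ['Module', 'FunctionDef:add', 'arguments', 'a', 'b',
#  'Return', 'BinOp', 'a', 'Add', 'b']

On this example the resulting sequence is far shorter than a full preorder dump that retains every context and wrapper node, which mirrors how a compact serialization reduces the length of the feature sequence fed to the encoder.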
