features (including text-level code comments, semantic-level code tokens, function-level function names, and the structure-level AST), whose semantic vector modeling fully exploits the hidden information of each modal feature and improves the accuracy of the code representation. In addition, REcomp optimizes the AST serialization method: it preserves the complete control information of non-leaf nodes and the semantic information of leaf nodes while avoiding problems such as node redundancy and overly long sequences, greatly shortening the input feature sequence. The shorter feature sequence enables the model to converge rapidly in a short time using only minimal computational resources and small training batches, while achieving good results on public datasets.
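To make the serialization idea concrete, the sketch below is a minimal illustration in Python, assuming the standard ast module as a stand-in for the parser; the function name serialize_ast and the exact token rules are illustrative and do not reproduce REcomp's actual serialization procedure. Interior nodes emit their type names (structural control information), leaf identifiers and literals emit their values (semantic information), and pure syntax markers such as Load/Store contexts are dropped to avoid redundant tokens.

import ast

def serialize_ast(source: str) -> list[str]:
    # Flatten an AST into a compact token sequence: interior nodes
    # contribute their type names (control/structure information),
    # leaves contribute identifiers or literal values (semantic
    # information); context markers such as Load/Store are skipped
    # to avoid redundant tokens. Illustrative sketch, not REcomp.
    tokens: list[str] = []

    def visit(node: ast.AST) -> None:
        if isinstance(node, ast.Name):          # leaf: variable identifier
            tokens.append(node.id)
            return
        if isinstance(node, ast.Constant):      # leaf: literal value
            tokens.append(repr(node.value))
            return
        if isinstance(node, ast.arg):           # leaf: parameter name
            tokens.append(node.arg)
            return
        if isinstance(node, (ast.Load, ast.Store, ast.Del)):
            return                              # pure syntax, carries no information
        label = type(node).__name__
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            label += ':' + node.name            # keep the function-level name feature
        tokens.append(label)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(ast.parse(source))
    return tokens

print(serialize_ast('def add(a, b):\n    return a + b'))
# ['Module', 'FunctionDef:add', 'arguments', 'a', 'b',
#  'Return', 'BinOp', 'a', 'Add', 'b']

On this example the resulting sequence is far shorter than a full preorder dump that retains every context and wrapper node, which mirrors how a compact serialization reduces the length of the feature sequence fed to the encoder.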
