Journal of Software ISSN 1000-9825, CODEN RUXUEW E-mail: jos@iscas.ac.cn
Journal of Software,2024,35(4):1899−1913 [doi: 10.13328/j.cnki.jos.006833] http://www.jos.org.cn
© Copyright by Institute of Software, Chinese Academy of Sciences. Tel: +86-10-62562563
RGB-D Salient Object Detection Based on Cross-modal Interactive Fusion and Global Awareness*
SUN Fu-Ming, HU Xi-Hang, WU Jing-Yu, SUN Jing, WANG Fa-Sheng
(School of Information and Communication Engineering, Dalian Minzu University, Dalian, Liaoning 116600, China)
Corresponding author: WANG Fa-Sheng, E-mail: wangfasheng@dlnu.edu.cn
Abstract: In recent years, RGB-D saliency detection methods have leveraged the rich geometric structure and spatial position information in depth maps to achieve better performance than RGB saliency detection models, attracting considerable attention from the academic community. However, existing RGB-D detection models still face the demand for continuous improvement in detection performance. The recently emerged Transformer excels at modeling global information, while convolutional neural networks (CNNs) excel at extracting local details. Therefore, effectively combining the strengths of CNN and Transformer to mine both global and local information will help improve the accuracy of salient object detection. To this end, this study proposes an RGB-D salient object detection method based on cross-modal interactive fusion and global awareness, which embeds a Transformer network into U-Net to combine the global attention mechanism with local convolution for better feature extraction. First, with the U-Net encoder-decoder structure, multi-level complementary features are efficiently extracted and decoded level by level to generate the saliency map. Then, a Transformer module is used to learn the global dependencies among high-level features to enhance the feature representation, and a progressive upsampling fusion strategy is applied to its input to reduce the introduction of noise. Next, to mitigate the negative impact of low-quality depth maps, a cross-modal interactive fusion module is designed to achieve cross-modal feature fusion. Finally, experimental results on five benchmark datasets demonstrate that the proposed algorithm has significant advantages over other state-of-the-art algorithms.
Key words: salient object detection; cross-modal; global attention mechanism; RGB-D detection model
CLC number: TP391
Chinese citation format: Sun FM, Hu XH, Wu JY, Sun J, Wang FS. RGB-D salient object detection based on cross-modal interactive fusion and global awareness. Ruan Jian Xue Bao/Journal of Software, 2024, 35(4): 1899–1913 (in Chinese). http://www.jos.org.cn/1000-9825/6833.htm
English citation format: Sun FM, Hu XH, Wu JY, Sun J, Wang FS. RGB-D Salient Object Detection Based on Cross-modal Interactive Fusion and Global Awareness. Ruan Jian Xue Bao/Journal of Software, 2024, 35(4): 1899–1913 (in Chinese). http://www.jos.org.cn/1000-9825/6833.htm
RGB-D Salient Object Detection Based on Cross-modal Interactive Fusion and Global Awareness
SUN Fu-Ming, HU Xi-Hang, WU Jing-Yu, SUN Jing, WANG Fa-Sheng
(School of Information and Communication Engineering, Dalian Minzu University, Dalian 116600, China)
Abstract: In recent years, RGB-D salient object detection methods have achieved better performance than RGB saliency detection models by exploiting the rich geometric structure and spatial position information in depth maps, and have therefore attracted great attention from the academic community. However, existing RGB-D detection models still need continuous improvement in detection performance. The emerging Transformer is good at modeling global information, while the convolutional neural network (CNN) is good at extracting local details. Therefore, effectively combining the advantages of CNN and Transformer to mine both global and local information will help improve the accuracy of salient object detection. For this purpose, an RGB-D salient object detection method based on cross-modal interactive fusion and global awareness is proposed in this study. The Transformer network is embedded into U-Net so that the global attention mechanism is combined with local convolution for better feature extraction. First, with the help of the U-Net encoder-decoder structure, this study efficiently extracts multi-level complementary features and decodes them step by step to generate the saliency map. Then, a Transformer module is used to learn the global dependencies among high-level features to enhance the feature representation, and a progressive upsampling fusion strategy is applied to its input to reduce the introduction of noise. Moreover, to reduce the negative impact of low-quality depth maps, a cross-modal interactive fusion module is designed to achieve cross-modal feature fusion. Finally, experimental results on five benchmark datasets show that the proposed algorithm has significant advantages over other state-of-the-art algorithms.
* Funding: National Natural Science Foundation of China (61976042, 61972068); Liaoning Revitalization Talents Program (XLYC2007023); Program for Innovative Talents in Higher Education Institutions of Liaoning Province (LR2019020)
Received 2022-06-29; revised 2022-09-01 and 2022-10-10; accepted 2022-11-01; published online at JOS 2023-06-14
First published online at CNKI 2023-06-15
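The three mechanisms named in the abstract — cross-modal fusion that down-weights unreliable depth, global dependency modeling via self-attention, and progressive upsampling fusion — can be sketched as follows. This is an illustrative NumPy sketch under assumed formulations (a sigmoid channel gate, single-head parameter-free self-attention, and nearest-neighbour 2× upsampling); the paper's actual modules are learned networks and differ in detail.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_fusion(rgb_feat, depth_feat):
    # Hypothetical gated residual fusion: a per-channel confidence gate is
    # computed from both modalities, so low-quality depth features
    # contribute less to the fused result.
    gap = (rgb_feat.mean(axis=(1, 2)) + depth_feat.mean(axis=(1, 2))) / 2
    gate = 1.0 / (1.0 + np.exp(-gap))                   # sigmoid -> (C,)
    return rgb_feat + gate[:, None, None] * depth_feat  # residual fusion

def global_attention(feat):
    # Single-head self-attention over spatial positions: every location
    # attends to every other, modeling global dependencies among
    # high-level features (projection weights omitted for brevity).
    c, h, w = feat.shape
    tokens = feat.reshape(c, h * w).T                   # (HW, C) tokens
    attn = softmax(tokens @ tokens.T / np.sqrt(c), axis=-1)
    out = (attn @ tokens).T.reshape(c, h, w)
    return feat + out                                   # residual connection

def progressive_upsample(low, high):
    # One step of progressive fusion: upsample the low-resolution feature
    # by 2x (nearest neighbour) and fuse it with the next-resolution
    # feature, instead of a single large upsampling jump, which reduces
    # interpolation noise.
    up = low.repeat(2, axis=1).repeat(2, axis=2)        # (C, 2h, 2w)
    return up + high

rgb = np.random.rand(8, 16, 16)    # toy high-level RGB feature map (C, H, W)
depth = np.random.rand(8, 16, 16)  # toy depth feature map
enhanced = global_attention(cross_modal_fusion(rgb, depth))
decoded = progressive_upsample(enhanced, np.random.rand(8, 32, 32))
print(enhanced.shape, decoded.shape)  # (8, 16, 16) (8, 32, 32)
```

In a full decoder this upsample-and-fuse step would be repeated level by level until the saliency map reaches input resolution.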