Journal of Software (《软件学报》), 2025, Issue 5

Journal of Software (软件学报) ISSN 1000-9825, CODEN RUXUEW E-mail: jos@iscas.ac.cn
2025,36(5):2094−2113 [doi: 10.13328/j.cnki.jos.007187] [CSTR: 32375.14.jos.007187] http://www.jos.org.cn
© Institute of Software, Chinese Academy of Sciences. All rights reserved. Tel: +86-10-62562563

Multi-modal Self-supervised Point Cloud Representation Learning Based on Bidirectional Fit Mask Reconstruction*

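CHENG Hao-Zhe, ZHU Ji-Hua, SHI Peng-Cheng, HU Nai-Wen, XIE Yi-Fan, LI Shi-Qi

(School of Software Engineering, Xi’an Jiaotong University, Xi’an, Shaanxi 710049, China)
Corresponding author: ZHU Ji-Hua, E-mail: zhujh@xjtu.edu.cn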

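Abstract: Self-supervised point cloud representation learning pre-trains on unlabeled data to explore the structural
relationships of 3D topological and geometric space and to capture feature representations, which can then be applied
to downstream tasks such as point cloud classification, segmentation, and object detection. To improve the
generalization and robustness of pre-trained models, this paper proposes a multi-modal self-supervised point cloud
representation learning method based on bidirectional fit mask reconstruction. It consists of three main parts:
(1) A “bad teacher” model guided by the inverse density scale uses a bidirectional fit strategy, based on inverse
density noise representation and global feature representation, to accelerate the approach of masked regions toward
the ground truth. (2) A StyleGAN-based auxiliary point cloud generation model uses local geometric information to
generate stylized point clouds and fuses them with the mask reconstruction results under a threshold constraint, in
order to resist the adverse effect of reconstruction noise on representation learning. (3) A multi-modal teacher model,
which aims to increase the diversity of the 3D feature space and prevent the collapse of modal information, relies on
a triple feature contrastive loss function to fully absorb the latent information contained in the point
cloud-image-text sample space. The proposed method is evaluated on fine-tuning tasks on three point cloud datasets:
ModelNet, ScanObjectNN, and ShapeNet. Experimental results show that the pre-trained model achieves leading
performance on point cloud recognition tasks, including point cloud classification, linear support vector machine
classification, few-shot classification, zero-shot classification, and part segmentation.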
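Key words: 3D point cloud; self-supervised representation learning; multi-modal features; density scale; generative adversarial network
Chinese Library Classification number: TP18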
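
Component (3) of the abstract relies on a triple feature contrastive loss over point cloud, image, and text features, but this page does not give its exact form. The following is only a minimal sketch of one common way to build such a loss, assuming a symmetric InfoNCE term for each of the three modality pairs; the function names, the temperature value, and the equal weighting of the pairs are illustrative assumptions, not the authors' formulation.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit L2 norm."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss; row i of `a` is the positive for row i of `b`."""
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature  # (N, N) scaled cosine similarities

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the a->b and b->a directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))


def triple_contrastive_loss(point, image, text, temperature=0.07):
    """Hypothetical three-way loss: mean of the three pairwise InfoNCE terms.

    Assumption: equal weights for the point-image, point-text, and
    image-text pairs; the paper may weight or combine them differently.
    """
    return (info_nce(point, image, temperature)
            + info_nce(point, text, temperature)
            + info_nce(image, text, temperature)) / 3.0


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    point = rng.normal(size=(8, 16))                  # stand-in point cloud features
    image = point + 0.01 * rng.normal(size=(8, 16))   # stand-in image features
    text = point + 0.01 * rng.normal(size=(8, 16))    # stand-in text features
    print("aligned loss:", triple_contrastive_loss(point, image, text))
    print("misaligned loss:",
          triple_contrastive_loss(point, np.roll(image, 1, axis=0), text))
```

In this sketch, matching rows across the three feature batches are pulled together and non-matching rows are pushed apart, so features that agree across modalities yield a lower loss than misaligned ones.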

Citation format (Chinese): 程浩喆, 祝继华, 史鹏程, 胡乃文, 谢奕凡, 李仕奇. 基于双向拟合掩码重建的多模态自监督点云表示学习. 软件学报, 2025, 36(5): 2094–2113. http://www.jos.org.cn/1000-9825/7187.htm
Citation format (English): Cheng HZ, Zhu JH, Shi PC, Hu NW, Xie YF, Li SQ. Multi-modal Self-supervised Point Cloud Representation Learning
Based on Bidirectional Fit Mask Reconstruction. Ruan Jian Xue Bao/Journal of Software, 2025, 36(5): 2094–2113 (in Chinese).
http://www.jos.org.cn/1000-9825/7187.htm

Multi-modal Self-supervised Point Cloud Representation Learning Based on Bidirectional Fit
Mask Reconstruction

CHENG Hao-Zhe, ZHU Ji-Hua, SHI Peng-Cheng, HU Nai-Wen, XIE Yi-Fan, LI Shi-Qi
(School of Software Engineering, Xi’an Jiaotong University, Xi’an 710049, China)
Abstract: Point cloud self-supervised representation learning is conducted in an unlabeled pre-training manner, exploring the structural
relationships of 3D topological geometric spaces and capturing feature representations. This approach can be applied to downstream tasks,
such as point cloud classification, segmentation, and object detection. To enhance the generalization and robustness of the pretrained
models, this study proposes a multi-modal self-supervised method for learning point cloud representations. The method is based on
bidirectional fit mask reconstruction and comprises three main components: (1) The “bad teacher” model, guided by the inverse density
scale, employs a bidirectional fit strategy that utilizes inverse density noise representation and global feature representation to expedite the
convergence of the mask region towards the true value. (2) The StyleGAN-based auxiliary point cloud generation model, grounded in local
geometric information, generates stylized point clouds and fuses them with mask reconstruction results while adhering to threshold
constraints. The objective is to mitigate the adverse effects of noise on representation learning during the reconstruction process. (3) The
multi-modal teacher model aims to enhance the diversity of the 3D feature space and prevent the collapse of modal information. It relies
on the triple feature contrast loss function to fully extract the latent information contained in the point cloud-image-text sample space. The

* Funding: Key Research and Development Program of Shaanxi Province (2021GY-025, 2021GXLHZ-097)
Received: 2023-11-02; Revised: 2024-03-15; Accepted: 2024-03-26; Published online by jos: 2024-09-11
CNKI online-first publication: 2024-09-12