5822  Journal of Software (软件学报), 2025, Vol. 36, No. 12
4 (National Survey Research Center, Renmin University of China, Beijing 100872, China)
5 (Intel China Research Center Ltd., Beijing 100190, China)
6 (National Satellite Meteorological Center, Beijing 100081, China)
Abstract: Non-uniform memory access (NUMA) is the mainstream memory access architecture for state-of-the-art multicore and multi-socket
processor platforms. Reducing the latency of cross-NUMA node accesses during queries is a key issue for modern in-memory database
query optimization techniques. Due to the differences in NUMA architectures and NUMA latency across various processors, NUMA
optimization techniques should be combined with hardware characteristics. This study focuses on the in-memory foreign key join
algorithm, which has high cost and strong locality of data dependency in in-memory databases, and explores different NUMA optimization
techniques, including NUMA-conscious and NUMA-oblivious implementations, on five platforms featuring ARM, Intel CLX/ICX, and
AMD Zen2/Zen3 processors. The study also compares the performance of the algorithms across different processor platforms with
strategies such as data storage, data partitioning, and join intermediate result caching. Experimental results show that the NUMA-conscious
optimization strategy requires the integration of both software and hardware. Radix Join shows moderate sensitivity to NUMA latency,
with NUMA optimization gains consistently around 30%. The NPO algorithm shows higher sensitivity to NUMA latency, with NUMA
optimization gains ranging from 38% to 57%. The Vector Join algorithm is sensitive to NUMA latency, but the impact is relatively minor,
with NUMA optimization gains varying from 1% to 25%. For algorithm performance characteristics, cache efficiency influences the Vector
Join performance more than NUMA latency. NUMA-conscious optimization techniques show significant differences on ARM platforms,
while the differences are minimal on x86 platforms. The less complex NUMA-oblivious algorithms exhibit greater generality. Given
hardware trends, reducing NUMA latency can effectively reduce performance gaps in NUMA-conscious optimization techniques, simplify
join algorithm complexity, and improve join operation performance.
Key words: NUMA architecture; NUMA-conscious optimization; NUMA-oblivious implementation; vector join; join benchmark
Mainstream servers today commonly adopt the NUMA (non-uniform memory access) architecture. Under NUMA, a CPU accesses its
local memory with low latency through an integrated memory controller, while accesses to remote memory attached to other CPUs go
through inter-socket QPI links and incur high latency. On 4-socket, 8-socket, and 16-socket servers, a remote memory access may
traverse multiple QPI hops, producing even higher latency. The large latency gap between local and remote memory makes NUMA
access locality, alongside cache locality, a major factor in in-memory query optimization, and maximizing the NUMA locality of each
thread's memory accesses is a central goal of NUMA optimization. Data layout and algorithm design therefore need to be tailored to the
NUMA architecture and workload characteristics of each CPU.
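In practice, maximizing NUMA locality means pinning worker threads to cores of the node that holds their data. The following Python sketch (an illustration, not code from this paper) uses the Linux-only `os.sched_setaffinity` call; the CPU-to-node mapping `NODE_CPUS` is a made-up example topology that a real system would obtain from the OS.

```python
import os

# Hypothetical topology: NUMA node -> CPUs it owns (example values only).
NODE_CPUS = {0: {0, 1, 2, 3}, 1: {4, 5, 6, 7}}

def bind_to_node(node):
    """Pin the calling process/thread to the CPUs of one NUMA node so that
    accesses to that node's memory stay local (Linux only)."""
    if not hasattr(os, "sched_setaffinity"):
        return None  # affinity control unavailable on this platform
    wanted = NODE_CPUS[node] & set(os.sched_getaffinity(0))
    if not wanted:
        return None  # none of the node's CPUs are available to this process
    os.sched_setaffinity(0, wanted)
    return sorted(os.sched_getaffinity(0))
```

A NUMA-conscious join would call such a routine in each worker thread before it touches its partition, pairing thread placement with node-local memory allocation.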
This paper studies optimization techniques for the foreign key join, an expensive operation in in-memory databases, on NUMA
architectures. The foreign key join is a key operator in OLAP (on-line analytical processing): it performs an equi-join
(R.keyP = S.keyF) between a primary key table R (with primary key keyP) and a foreign key table S (with foreign key keyF), where
every value in the foreign key column keyF of S matches exactly one value in the primary key column keyP of R. The foreign key join
is a fundamental operation in data warehouse workloads, where the fact table joins with multiple dimension tables through its foreign keys.
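As a concrete illustration of the equi-join R.keyP = S.keyF (a toy sketch with made-up rows, not code from the paper): a hash join builds a table on R's primary key and probes it with each foreign key of S; because every keyF value has exactly one match in R, each S row produces exactly one result row.

```python
# Dimension table R: primary key keyP -> payload.
R = {1: "Asia", 2: "Europe", 3: "America"}

# Fact table S: (foreign key keyF, measure).
S = [(2, 10.0), (1, 7.5), (2, 3.2), (3, 9.9)]

def foreign_key_join(r, s):
    """Equi-join R.keyP = S.keyF; each S row matches exactly one R row."""
    build = dict(r)                      # build phase: hash table on keyP
    return [(key_f, build[key_f], m)     # probe phase: one hit per S row
            for key_f, m in s]

result = foreign_key_join(R, S)
# Every S row appears exactly once in the join result.
assert len(result) == len(S)
```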
The main implementation techniques for foreign key joins are hash join [1], sort-merge join [2], and vector join [3]. Related studies
show that hash joins outperform sort-merge joins, so hash joins have become the primary subject of in-memory join research. Hash join
algorithms divide mainly into the hardware-conscious radix-partitioned hash join (PRO, parallel radix join optimized) and the
hardware-oblivious no-partitioning hash join (NPO, no partitioning join optimized). The former sets key parameters such as the number
of partitioning passes according to CPU hardware parameters (e.g., TLB size), splitting large tables into smaller partitions to perform an
in-cache hash join within each CPU's private cache; the latter relies on the ever-growing shared L3 cache capacity and the parallel
access capability of many cores, providing a hardware-adaptive hash join through a simple shared hash table. Besides their different
cache locality characteristics, the two algorithms also differ in memory locality. In its partitioning phase, the PRO algorithm divides the
two input tables into cache-sized partitions and incurs NUMA access latency when writing data into partitions located on other NUMA
nodes; in the subsequent build and probe phases over the small partitions, worker threads must be bound to the NUMA nodes holding
their partitions to avoid cross-node access latency. The NPO algorithm builds a shared hash table spanning NUMA nodes and accesses
it across nodes during the probe phase. Hash join is an index-free equi-join algorithm that relies entirely on the hash table to perform
the join, whereas the vector join is an index-based join algorithm designed for OLAP workloads.
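The contrast between the two hash join families above can be sketched in a few lines (an illustrative sketch, not the PRO/NPO reference implementations, which are written in C): PRO first scatters both inputs into cache-sized partitions by the low-order radix bits of the key and then joins matching partition pairs in-cache, while NPO simply shares one global hash table among all threads. The fan-out below is a made-up constant; the real PRO derives it from TLB size.

```python
RADIX_BITS = 2                 # fan-out 4; example value, not tuned to any CPU
FANOUT = 1 << RADIX_BITS

def radix_partition(rows, key):
    """One partitioning pass: scatter rows by the low-order bits of the key."""
    parts = [[] for _ in range(FANOUT)]
    for row in rows:
        parts[key(row) & (FANOUT - 1)].append(row)
    return parts

def pro_join(r, s):
    """PRO-style join: partition both inputs, then hash-join partition pairs.
    Only tuples whose keys share the same radix can match, so each pair of
    partitions is joined independently (and in-cache, in the real algorithm)."""
    r_parts = radix_partition(r, key=lambda t: t[0])
    s_parts = radix_partition(s, key=lambda t: t[0])
    out = []
    for rp, sp in zip(r_parts, s_parts):
        build = {k: v for k, v in rp}          # small, cache-resident table
        out += [(k, build[k], m) for k, m in sp if k in build]
    return out

def npo_join(r, s):
    """NPO-style join: one shared hash table over the whole of R."""
    build = {k: v for k, v in r}
    return [(k, build[k], m) for k, m in s if k in build]
```

Both functions compute the same join result; they differ only in access pattern, which is exactly where the cache and NUMA locality trade-offs discussed above arise.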
Compared with an ordinary primary key-foreign key constraint, the vector join requires the primary key table R to use a surrogate key
(i.e., the contiguous sequence 1, 2, 3, …) as its index, mapping the table onto a dimensional axis. The surrogate key extends the key
constraint with an address mapping, turning the foreign keys of table S into a join index that maps each foreign key value directly to an
offset in the primary key table, realizing a hash-free key-to-address join. The vector join algorithm maps the records of table R into a
vector indexed by the surrogate key


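The surrogate-key mapping described above can be sketched as follows (an illustrative sketch with made-up data, not the paper's implementation): because R's surrogate keys form the dense sequence 1, 2, 3, …, table R can be stored as a plain array, and each foreign key in S is itself the (1-based) offset of its matching R record, so the join becomes a direct array lookup with no hash table.

```python
# Primary key table R stored as a vector: surrogate key k (1, 2, 3, ...)
# maps record k to offset k - 1, so no separate key column is needed.
r_vector = ["Asia", "Europe", "America"]   # records for surrogate keys 1..3

# Foreign key column of S: each value doubles as a join index into r_vector.
s_fkeys = [2, 1, 2, 3]
s_measures = [10.0, 7.5, 3.2, 9.9]

def vector_join(r_vec, fkeys, measures):
    """Hash-free key-to-address join: each foreign key value is used
    directly as an offset into the primary key table's vector."""
    return [(r_vec[k - 1], m) for k, m in zip(fkeys, measures)]
```

The probe touches r_vec at arbitrary offsets, which is why, as the abstract notes, cache efficiency (and, across sockets, NUMA placement of the vector) dominates Vector Join performance.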