
5822                                                      Journal of Software (软件学报), 2025, Vol. 36, No. 12


4 (National Survey Research Center, Renmin University of China, Beijing 100872, China)
5 (Intel China Research Center Ltd., Beijing 100190, China)
6 (National Satellite Meteorological Center, Beijing 100081, China)
Abstract: Non-uniform memory access (NUMA) is the mainstream memory access architecture for state-of-the-art multicore and multi-way processor platforms. Reducing the latency of cross-NUMA-node accesses during queries is a key issue for modern in-memory database query optimization. Because NUMA topologies and NUMA latencies differ across processors, NUMA optimization techniques should be combined with hardware characteristics. This study focuses on the in-memory foreign key join algorithm, which is costly and exhibits strong data-dependency locality in in-memory databases, and explores different NUMA optimization techniques, including NUMA-conscious and NUMA-oblivious implementations, on five platforms featuring ARM, Intel CLX/ICX, and AMD Zen2/Zen3 processors. The study also compares the performance of the algorithms across processor platforms under strategies such as data storage, data partitioning, and join intermediate result caching. Experimental results show that NUMA-conscious optimization requires the integration of software and hardware. Radix Join is relatively insensitive to NUMA latency, with NUMA optimization gains consistently around 30%. The NPO algorithm shows higher sensitivity to NUMA latency, with NUMA optimization gains ranging from 38% to 57%. The Vector Join algorithm is sensitive to NUMA latency, but the impact is relatively minor, with NUMA optimization gains varying from 1% to 25%. In terms of algorithm performance characteristics, cache efficiency influences Vector Join performance more than NUMA latency does. NUMA-conscious optimization techniques show significant differences on ARM platforms, while the differences are minimal on x86 platforms. The less complex NUMA-oblivious algorithms exhibit greater generality. Given hardware trends, reducing NUMA latency can effectively narrow the performance gaps of NUMA-conscious optimization techniques, simplify join algorithm complexity, and improve join operation performance.
Key words: NUMA architecture; NUMA-conscious optimization; NUMA-oblivious implementation; vector join; join benchmark
   Mainstream servers today commonly adopt the NUMA (non-uniform memory access) architecture. Under NUMA, a CPU accesses its local memory with low latency through an integrated memory controller, while remote memory attached to other CPUs is accessed with higher latency over QPI links. In 4-, 8-, and 16-socket servers, a remote memory access must traverse more QPI links and therefore incurs even higher latency. The large latency gap between local and remote memory accesses makes NUMA locality, together with cache locality, a major factor in in-memory query optimization, and maximizing the NUMA locality of each thread's memory accesses is a central goal of NUMA optimization. Data layout and algorithm design therefore need to be tailored to the NUMA architecture of each CPU and to the workload characteristics.
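The cost of poor NUMA locality can be illustrated with a simple expected-latency model (a minimal sketch; the latency figures below are illustrative placeholders, not measurements from this study):

```python
def expected_access_latency(local_ns: float, remote_ns: float,
                            remote_fraction: float) -> float:
    """Average memory access latency when a fraction of a thread's
    accesses goes to a remote NUMA node instead of the local one."""
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns

# Illustrative figures only: 90 ns local, 140 ns one-hop remote access.
# With perfect NUMA locality a thread pays 90 ns on average; when half
# of its accesses are remote, the average rises to 115 ns.
print(expected_access_latency(90, 140, 0.0))  # 90.0
print(expected_access_latency(90, 140, 0.5))  # 115.0
```

The model makes the optimization goal concrete: driving the remote fraction toward zero, via data placement and thread binding, pulls the average access latency down to the local latency.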
   This study investigates optimization techniques for the foreign key join, an expensive operation in in-memory databases, on NUMA architectures. The foreign key join is an important operator in OLAP (on-line analytical processing): an equi-join (R.keyP = S.keyF) is performed between a primary-key table R (with primary key keyP) and a foreign-key table S (with foreign key keyF), where every value in the foreign-key column keyF of S matches exactly one value in the primary-key column keyP of R. The foreign key join is a fundamental operation in data warehouse workloads, in which the fact table joins with multiple dimension tables through foreign keys. Its main implementation techniques are hash join [1], sort-merge join [2], and vector join [3]. Prior studies have shown that hash join outperforms sort-merge join, so hash join has become the main subject of in-memory join research. Hash join algorithms fall into two main families: the hardware-conscious radix-partitioned hash join (parallel radix join optimized, PRO) and the hardware-oblivious no-partitioning hash join (no partitioning join optimized, NPO). The former sets key parameters, such as the number of partitioning passes, according to CPU hardware parameters (e.g., TLB size), splitting large tables into smaller partitions so that the hash join runs in-cache within each CPU's private cache; the latter relies on the steadily growing shared L3 cache capacity and the parallel access capability of many cores, offering a hardware-adaptive hash join built on a simple shared hash table. Beyond their different cache locality characteristics, the two algorithms also differ in memory locality. In its partitioning phase, PRO divides the two input tables into cache-sized partitions; writing data to partitions on other NUMA nodes incurs NUMA access latency, and in the subsequent in-cache build and probe phases over the small partitions, threads must be bound to the NUMA nodes holding their partitions to avoid cross-node access latency. NPO instead builds the shared hash table across NUMA nodes and probes it across nodes during the probe phase. Hash join is an index-free equi-join method that depends on the hash table to complete the equi-join. The vector join, in contrast, is an index-based join algorithm, and likewise a join algorithm oriented toward OLAP workloads.
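The contrast between the two hash join families can be sketched as follows (a simplified single-threaded Python sketch for illustration only; the algorithms studied here are multithreaded native implementations, and the function names are ours):

```python
def radix_partition(keys, bits):
    """PRO-style radix partitioning (one pass): scatter keys into
    2^bits partitions by their low-order bits, so that the later
    build/probe phases each run on a cache-sized partition."""
    mask = (1 << bits) - 1
    parts = [[] for _ in range(1 << bits)]
    for k in keys:
        parts[k & mask].append(k)
    return parts

def npo_join(r_keys, s_keys):
    """NPO-style no-partitioning hash join: build one shared hash
    table on the primary-key side R, then probe it with the
    foreign-key side S.  The foreign-key property guarantees that
    every probe finds exactly one match."""
    ht = {k: i for i, k in enumerate(r_keys)}          # build phase
    return [(ht[k], j) for j, k in enumerate(s_keys)]  # probe phase

print(radix_partition([0, 1, 2, 3, 4, 5], 1))  # [[0, 2, 4], [1, 3, 5]]
print(npo_join([10, 20, 30], [20, 20, 10]))    # [(1, 0), (1, 1), (0, 2)]
```

In a NUMA-conscious PRO implementation, the partitions produced by the first function are placed on specific nodes and the build/probe threads are bound to those nodes, whereas NPO threads on all nodes probe the single shared table across node boundaries.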
Unlike an ordinary primary key-foreign key constraint, the vector join requires the primary-key table R to be indexed by a surrogate key (a contiguous sequence 1, 2, 3, ...), mapping the primary-key table onto a dimensional axis. The surrogate key extends the primary key with an address-mapping constraint, turning the foreign-key column of table S into a join index that maps each foreign-key value directly to an offset into table R, thereby realizing a hash-free key-to-address join. The vector join maps the records of table R into a vector indexed by the surrogate key
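The key-to-address mapping described above can be sketched as follows (a minimal sketch with our own names; surrogate keys are assumed to start at 1, as in the sequence above):

```python
def vector_join(r_payload, s_fk):
    """Vector join sketch: table R is stored as a vector indexed by its
    surrogate key (1, 2, 3, ...), so each foreign key in S maps directly
    to an offset in R -- no hash table is built or probed."""
    return [r_payload[fk - 1] for fk in s_fk]  # surrogate keys are 1-based

# R's payload column at surrogate positions 1..3; S's foreign-key column.
print(vector_join(["a", "b", "c"], [2, 1, 3, 2]))  # ['b', 'a', 'c', 'b']
```

Because each probe is a single array dereference, the sketch shows why this algorithm's cost is dominated by the memory behavior of the R-side vector rather than by hashing work.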