Page 499 - 《软件学报》2024年第4期

P. 499

欧阳湘臻等: 榫卯: 一种可组合的定制化内存分配框架 2077

However, they are time-consuming and error-prone to implement. Developers often use the memory allocation framework to build special-
purpose dynamic memory allocators. However, the existing memory allocator framework has the problems of poor abstraction ability and
insufficient composability and customizability. For this reason, this study proposes a composable and customizable dynamic memory
allocator framework, namely mortise, based on function composability by reviewing the dynamic memory allocation process from the
perspective of functional programming. The framework abstracts system memory allocation as a composition of hierarchical functions of
several multiple decoupled memory allocations, and these functions can provide policies to ensure higher customizability and
composability. Mortise is implemented by using standard C. To achieve zero performance overhead of hierarchical function composition,
mortise uses the metaprogramming features offered by the C preprocessor. Developers can quickly build a memory allocator for targeted
application scenarios by composing and customizing the hierarchical function of allocators. In order to prove the effectiveness of mortise,
this study presents three different memory allocator instances, namely tlsfcc, hslab, and wfslab, by using mortise. Specifically, tlsfcc is
designed for multi-core embedded application scenarios, which improves the parallel throughput by replacing the synchronization strategy;
hslab is a core-aware slab-type allocator, which optimizes performance on heterogeneous hardware by customizing thread cache; wfslab is
a low-latency and wait-free/lock-free allocator. This study runs benchmarks to compare these allocators with several existing memory
allocators. The experiments are carried out on an 8-core x86/64 platform and an 8-core heterogeneous aarch64 embedded platform, and the
experimental results show that tlsfcc achieves a mean speedup of 1.76 and 1.59 on the two platforms compared with the original tlsf
allocator; hlsab achieves only 69.6% and 85.0% execution time compared with the tcmalloc with a similar architecture; the worst-case
memory request latency of wfslab is the smallest among all memory allocators in the experiment, including the state-of-art lock-free
memory allocators: mimalloc and snmalloc.
Key words: memory allocation; blocking synchronization; heterogeneous system; operating systems; functional programming

动态内存分配器负责管理空闲内存并在程序请求内存时返回满足需求的内存块, 它是高级编程语言, 操作系
统以及现代应用程序中十分重要的组件. 例如, Java, Python, Haskell 等高级编程语言中对象的创建与释放依赖于
运行时动态内存分配机制; Redis, V8 Javascript 引擎等应用程序运行过程中也伴随着大量内存对象的动态申请与
释放. 一般而言, 通用的动态内存分配器, 如 GNU C 标准库 malloc (glibc malloc), tcmalloc, jemalloc, 能够提供低延
时的内存分配与释放路径, 以及较低的内存占用. 然而, 不同应用场景下对内存分配器的要求并不完全相同, 采用
通用内存分配器并不是最优解. 例如, 在软实时系统中使用动态内存分配器必须确保分配器的申请与释放路径的
最差执行时间是确定的 [1,2] . 而对于硬实时系统而言, 考虑到任务超时的风险, 应当避免内存资源的并发访问. 对于
大型多核服务器, 程序访存开销与核间通信开销较为显著, 如数据仓库服务器中 CPU 等待周期主要是由 CPU data
cache 与 TLB 未命中产生的 [3] . 在这种应用场景下, 内存分配器应当具有良好的局部性, 以充分利用 TLB 和 cache.
在异构多核系统中, 一个 CPU 包含具有不同微体系结构的异构核心, 这些异构核心通常拥有不同的 cache 大小,
分支预测缓存, 指令发射窗口, 而通用动态分配器的设计无法感知到硬件结构, 从而不能充分利用硬件性能加速内
存分配及访问过程. 此外, 对于一些可能存在内存漏洞的应用, 内存分配器还需要在其管理的内存块中增加冗余信
息, 以缓解代码中存在的内存风险. 为满足不同应用场景的需求, 使用定制的专用动态内存分配器是更好的选择.
然而, 现代动态内存分配器的设计与实现较为复杂, 以 tcmalloc 为例, 该分配器包含每线程的 thread cache, 全局共
享的 central cache 和 page heap 等复杂数据结构, 数据结构之间存在代码耦合. 手动实现与维护的代价较高, 且代
[4]
码复用能力较差. 例如在 tcmalloc 基础上实现的巨页感知的加强组件 Temeraire 由于代码耦合无法简单地应用到
其他内存分配器之上, 需要大量的移植和重新编码工作. 因此, 针对传统内存分配器组合性与定制性不足, 代码复
用性较差的缺点, 内存分配框架被提出来.
内存分配框架一般将内存分配算法分为多个互不耦合的层级, 每一层都有各自的实现. Linux 内核内存分配
框架分为页分配器与对象分配器两层. 内核通过 Kconfig 配置提供多种可选的分配器与优化选项. 其中页分配器
是固定的伙伴算法, 而对象分配器有 3 种可选实现 slab, slub 和 slob, 分别针对通用系统, 大型多核系统以及嵌入
式系统优化. 然而, Linux 内存分配框架的组合性不足, 不能定制更灵活的内存分配策略, 如页分配器默认为二分
伙伴系统, 无法更换. 另一方面, Linux 内核分配框架与其他模块耦合程度较高, 例如对象分配器之一的 slub 分配
器的实现需要修改表示物理页的数据结构, 影响代码的复用和模块的隔离. 可组合的高性能内存分配器框架
HeapLayer 解决了内存分配框架组合性不足的问题 [5] . HeapLayer 基于 C++面向对象特性与元编程的模板功能, 将

494 495 496 497 498 499 500 501 502 503 504