Compute-storage separation architecture and data read performance

Reposted

mob64ca140eb362 2023-12-07 14:40:58


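Private RDS platforms built on Kubernetes and Docker commonly adopt a compute-storage separation architecture. The architecture has clear advantages, but for latency-sensitive applications such as databases, IO performance problems cannot be avoided. Below we share what we did for MySQL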

Compute-storage separation architecture

The architecture is sketched below:

[Figure: compute-storage separation architecture diagram]

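The storage layer consists of a distributed file system, integrated into Kubernetes as a Provisioner

In our view, the biggest advantage of compute-storage separation is:

Stateful data is pushed down to the storage layer. As a result, when RDS schedules an instance it does not need to know the storage medium of the compute node; it only has to pick a Node that satisfies the compute resource requirements. When the database instance starts, it only needs to mount the mapped volume from the distributed file system. This can significantly increase the deployment density of database instances and the utilization of compute resources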
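The scheduling idea above can be sketched in a few lines of Python. This is only a toy model, not the Kubernetes scheduler: the `Node`, `InstanceRequest`, `schedule` and `start_instance` names, the node sizes and the volume name are all hypothetical. The point it illustrates is that the placement decision looks at CPU and memory only and never at the node's local disk.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cpu: float       # free CPU cores
    free_mem_gib: float   # free memory in GiB
    local_disk: str       # storage medium; deliberately ignored when scheduling


@dataclass
class InstanceRequest:
    name: str
    cpu: float
    mem_gib: float
    volume: str           # hypothetical volume name on the distributed file system


def schedule(req, nodes):
    """Return the first node with enough free CPU and memory, or None."""
    for node in nodes:
        if node.free_cpu >= req.cpu and node.free_mem_gib >= req.mem_gib:
            return node
    return None


def start_instance(req, node):
    """Reserve compute on the node and pretend to mount the mapped volume."""
    node.free_cpu -= req.cpu
    node.free_mem_gib -= req.mem_gib
    return f"{req.name} on {node.name}, mounted {req.volume}"


# Hypothetical cluster: node-b has no local data disk at all, which does not matter.
nodes = [Node("node-a", 2, 8, "hdd"), Node("node-b", 16, 64, "none")]
req = InstanceRequest("mysql-1", 4, 16, "vol-mysql-1")
chosen = schedule(req, nodes)
print(start_instance(req, chosen))
```

Because the data lives in the storage layer, the same request could land on any node with spare compute, which is what makes denser packing possible.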

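There are many other benefits, such as a clearer architecture, easier scaling and simpler troubleshooting; we will not go into them here.

Drawbacks of the compute-storage separation architecture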

As the saying goes

When God closes a window for you, he closes a door as well.

As shown in the figure below

[Figure]

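Compared with local storage, network overhead becomes part of the IO overhead. We see two obvious problems this brings:

Databases are latency-sensitive applications, so network latency heavily affects database capacity (QPS, TPS)
In high-density deployments, network bandwidth becomes the bottleneck, which can leave compute and storage resources underutilized.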
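A rough back-of-the-envelope calculation shows why both points matter. The sketch below is not a measurement: the function names and every number in it (device latency, network round trip, NIC speed, per-instance throughput) are made-up assumptions, chosen only to show the shape of the arithmetic.

```python
def max_serial_iops(device_latency_us, network_rtt_us=0):
    """Upper bound on IOs per second for one thread that issues
    synchronous IOs back to back (each IO waits for the previous one)."""
    return 1_000_000 / (device_latency_us + network_rtt_us)


def max_instances_by_bandwidth(nic_gbit_per_s, per_instance_mb_per_s):
    """How many instances fit on a node before its NIC saturates."""
    nic_mb_per_s = nic_gbit_per_s * 1000 / 8  # Gbit/s -> MB/s, decimal units
    return int(nic_mb_per_s // per_instance_mb_per_s)


# Assumed numbers, for illustration only.
local_iops = max_serial_iops(device_latency_us=100)
remote_iops = max_serial_iops(device_latency_us=100, network_rtt_us=400)
density_cap = max_instances_by_bandwidth(nic_gbit_per_s=10,
                                         per_instance_mb_per_s=100)

print(f"local:  {local_iops:.0f} serial IOs/s per thread")
print(f"remote: {remote_iops:.0f} serial IOs/s per thread")
print(f"NIC allows at most {density_cap} such instances per node")
```

With these assumed inputs the per-thread bound drops from 10,000 to 2,000 synchronous IOs per second once the network round trip is added, and the NIC caps the node at 12 instances no matter how much CPU and memory remain free.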

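Kubernetes itself provides neither a Voting service nor a Fence mechanism like the one in Oracle RAC. Under compute-storage separation, when the cluster suffers a split-brain and the eviction mechanisms of the Node Controller and Kubelet are triggered, multiple database instances may access the same data files at the same time and cause data corruption. For users, such data loss is both immeasurable and intolerable. On Kubernetes 1.7.8 we can reproduce this scenario 100% of the time with both Oracle and MySQL. By adding a Fence mechanism on top of Kubernetes,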
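To make the idea of fencing concrete, here is a minimal sketch of one common approach, a monotonically increasing fencing token checked by the storage side. This is an illustration of the general technique only; it is not the Fence mechanism the authors built, and `FencedVolume`, `StaleTokenError` and the token scheme are hypothetical names introduced for this example.

```python
class StaleTokenError(Exception):
    """Raised when a writer holding an outdated token tries to write."""


class FencedVolume:
    """Toy shared volume that only accepts writes from the newest owner."""

    def __init__(self):
        self.current_token = 0
        self.pages = {}

    def acquire(self):
        """Hand out a new, strictly larger token; older tokens become stale."""
        self.current_token += 1
        return self.current_token

    def write(self, token, page_no, data):
        """Apply the write only if the caller holds the newest token."""
        if token != self.current_token:
            raise StaleTokenError(
                f"token {token} is fenced off (current {self.current_token})"
            )
        self.pages[page_no] = data


vol = FencedVolume()
old = vol.acquire()          # instance on the node that later gets partitioned
vol.write(old, 1, b"v1")

new = vol.acquire()          # replacement instance started after eviction
vol.write(new, 1, b"v2")

try:
    vol.write(old, 1, b"stale")   # the old instance is still alive and writing
    rejected = False
except StaleTokenError:
    rejected = True

print("stale write rejected:", rejected)
```

The design choice here is that the storage layer, not the old instance, enforces exclusivity, so a node that cannot be reached during a split-brain still cannot corrupt the data files.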

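Next, we need to look at this together with MySQL

The design of the following test plans and the collation of the test data come from 沃趣科技 MySQL experts @董大爷 and @波多野老师.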

DoubleWrite

In MySQL, the first thing we thought of was DoubleWrite.

The InnoDB doublewrite buffer was implemented to recover from half-written pages. This can happen when there's a power failure while InnoDB is writing a page to disk. On reading that page, InnoDB can discover the corruption from the mismatch of the page checksum. However, in order to recover, an intact copy of the page would be needed.
The double write buffer provides such a copy.
Whenever InnoDB flushes a page to disk, it is first written to the double write buffer. Only when the buffer is safely flushed to disk will InnoDB write the page to the final destination. When recovering, InnoDB scans the double write buffer and for each valid page in the buffer checks if the page in the data file is valid too.
Although data is written twice, the doublewrite buffer does not require twice as much I/O, as data is written to the buffer in a large sequential chunk with a single fsync() call. There is extra time consumed however, and the effect becomes visible with fast storage and a heavy write load.

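Put simply, DoubleWrite exists to prevent page corruption (partial write) when a failure occurs while a data page is being written. For that reason, every write to a data file first writes a copy of the data to the shared tablespace. At startup, if a data page fails its checksum verification, the copy in the shared tablespace is used to recover it. Judging from this implementation, DoubleWrite generates a certain amount of extra IO.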
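The write-twice-then-recover flow described above can be simulated in a few lines. This is a toy model under stated assumptions, not InnoDB's implementation: `ToyEngine`, the 16-byte page size, the CRC32 checksum layout and the way a torn write is faked are all hypothetical choices made for illustration.

```python
import zlib

PAGE_SIZE = 16  # toy page size for the example; real InnoDB pages are far larger


def make_page(payload: bytes) -> bytes:
    """Pad the payload to PAGE_SIZE and append a 4-byte CRC32 checksum."""
    body = payload.ljust(PAGE_SIZE, b"\0")[:PAGE_SIZE]
    return body + zlib.crc32(body).to_bytes(4, "big")


def is_valid(page: bytes) -> bool:
    """A page is valid if its stored checksum matches its body."""
    if len(page) != PAGE_SIZE + 4:
        return False
    body, stored = page[:PAGE_SIZE], page[PAGE_SIZE:]
    return zlib.crc32(body).to_bytes(4, "big") == stored


class ToyEngine:
    def __init__(self):
        self.doublewrite = {}  # stands in for the copy in the shared tablespace
        self.datafile = {}     # stands in for the real data file
        self.page_writes = 0   # counts page-sized writes to show the extra IO

    def flush(self, page_no, payload, crash_mid_write=False):
        page = make_page(payload)
        # Step 1: write the copy first (and, in a real engine, sync it to disk).
        self.doublewrite[page_no] = page
        self.page_writes += 1
        # Step 2: write the page to its final location.
        self.page_writes += 1
        if crash_mid_write:
            # Simulated torn write: only the first half of the new page lands.
            old = self.datafile.get(page_no, bytes(PAGE_SIZE + 4))
            half = len(page) // 2
            self.datafile[page_no] = page[:half] + old[half:]
            return
        self.datafile[page_no] = page

    def recover(self):
        """On startup, restore any torn page from its intact doublewrite copy."""
        repaired = []
        for page_no, copy in self.doublewrite.items():
            if is_valid(copy) and not is_valid(self.datafile.get(page_no, b"")):
                self.datafile[page_no] = copy
                repaired.append(page_no)
        return repaired


engine = ToyEngine()
engine.flush(1, b"balance=100")
engine.flush(1, b"balance=250", crash_mid_write=True)
torn_ok = is_valid(engine.datafile[1])
repaired = engine.recover()
print("torn page valid before recovery:", torn_ok)
print("pages repaired:", repaired)
print("page writes issued:", engine.page_writes)
```

Each flush in this model issues two page-sized writes instead of one, which is the extra IO the paragraph above refers to.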