How to Expand the Bluestore block.wal and block.db Partitions
iliul @ Horizon Robotics, Ceph Open Source Community
Introduction
This article describes how to expand the bluestore block.wal and block.db partitions, and how to migrate metadata back after it has spilled over to the slow device (supported since Ceph v14.1.0). The examples below operate on raw disk partitions; the procedure is similar for LVM-managed devices.
Replacing the wal and db partitions
Query the current OSD.0 information and determine the disk partitions backing block.db and block.wal
- tree -a /var/lib/ceph/osd/ceph-0
ceph-0
├── activate.monmap
├── block -> /dev/ceph-b17ef1f2-8c1a-4be1-97a7-c35b77d79e78/osd-block-45693cad-5a91-4dbc-8180-b25f4d864f33
├── block.db -> /dev/sdd2
├── block.wal -> /dev/sdd1
├── bluefs
├── ceph_fsid
├── fsid
├── keyring
├── kv_backend
├── magic
├── mkfs_done
├── osd_key
├── ready
├── require_osd_release
├── type
└── whoami
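If the OSD was deployed with ceph-volume (the block symlink above points at an LV), the device layout can also be cross-checked with:
# ceph-volume lvm list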
- fdisk -l /dev/sdd
Disk /dev/sdd: 931 GiB, 999653638144 bytes, 1952448512 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 0389A1C8-5306-AC45-8300-0B012020B755
Device Start End Sectors Size Type
/dev/sdd1 2048 2099199 2097152 1G Linux filesystem
/dev/sdd2 2099200 35653631 33554432 16G Linux filesystem
- sgdisk -p /dev/sdd
Disk /dev/sdd: 1952448512 sectors, 931.0 GiB
Model: PERC H730P Mini
Sector size (logical/physical): 512/512 bytes
Disk identifier (GUID): 0389A1C8-5306-AC45-8300-0B012020B755
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 2048, last usable sector is 1952448478
Partitions will be aligned on 2048-sector boundaries
Total free space is 1698691039 sectors (810.0 GiB)
Number Start (sector) End (sector) Size Code Name
1 2048 2099199 1024.0 MiB 8300
2 2099200 35653631 16.0 GiB 8300
- sgdisk -i 1 /dev/sdd (record the Partition unique GUID shown below; it is needed when creating the replacement partition)
Partition GUID code: 0FC63DAF-8483-4772-8E79-3D69D8477DE4 (Linux filesystem)
Partition unique GUID: 98D073A1-925D-944D-9505-7FE489848305
First sector: 2048 (at 1024.0 KiB)
Last sector: 2099199 (at 1025.0 MiB)
Partition size: 2097152 sectors (1024.0 MiB)
Attribute flags: 0000000000000000
Partition name: ''
- sgdisk -i 2 /dev/sdd (likewise, record the Partition unique GUID)
Partition GUID code: 0FC63DAF-8483-4772-8E79-3D69D8477DE4 (Linux filesystem)
Partition unique GUID: 4BBA676A-A628-904A-8C93-1B691161C5A1
First sector: 2099200 (at 1.0 GiB)
Last sector: 35653631 (at 17.0 GiB)
Partition size: 33554432 sectors (16.0 GiB)
Attribute flags: 0000000000000000
Partition name: ''
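For convenience, the two GUIDs can be captured into shell variables for reuse in the sgdisk commands below (a small sketch; the variable names WAL_GUID and DB_GUID are my own):
# WAL_GUID=$(sgdisk -i 1 /dev/sdd | awk '/unique GUID/ {print $4}')
# DB_GUID=$(sgdisk -i 2 /dev/sdd | awk '/unique GUID/ {print $4}')
# echo "$WAL_GUID $DB_GUID"
98D073A1-925D-944D-9505-7FE489848305 4BBA676A-A628-904A-8C93-1B691161C5A1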
- stop osd.0
# systemctl stop ceph-osd@0
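Optionally, set the noout flag first so the cluster does not start recovering PGs while osd.0 is down (remember to unset it after the OSD is back up):
# ceph osd set noout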
- ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-0
{
"/var/lib/ceph/osd/ceph-0/block": {
"osd_uuid": "45693cad-5a91-4dbc-8180-b25f4d864f33",
"size": 998579896320,
"btime": "2019-06-13T11:42:12.194273+0800",
"description": "main",
"bluefs": "1",
"ceph_fsid": "45b2df47-f946-43e1-9a06-4832ff3e5c24",
"kv_backend": "rocksdb",
"magic": "ceph osd volume v026",
"mkfs_done": "yes",
"osd_key": "AQASxgFdn465LBAA7x6kMuoS9UiqwqVjDmHD9A==",
"ready": "ready",
"require_osd_release": "15",
"whoami": "0"
},
"/var/lib/ceph/osd/ceph-0/block.wal": {
"osd_uuid": "45693cad-5a91-4dbc-8180-b25f4d864f33",
"size": 1073741824, // 扩容前大小,1G
"btime": "2019-06-13T11:42:12.196216+0800",
"description": "bluefs wal"
},
"/var/lib/ceph/osd/ceph-0/block.db": {
"osd_uuid": "45693cad-5a91-4dbc-8180-b25f4d864f33",
"size": 17179869184, // 扩容前大小,16G
"btime": "2019-06-13T11:42:12.195025+0800",
"description": "bluefs db"
}
}
- create new wal, db: prepare the replacement target partitions /dev/sdd3 and /dev/sdd4. Note that 35653632 is the first sector after the existing partitions (see the sgdisk -p output above), and --typecode is given the unique GUID of the partition being replaced; since these are not recognized type codes, the new partitions will show type FFFF, which is corrected with fdisk later.
Relevant options from the sgdisk help:
-c, --change-name=partnum:name change partition's name
-n, --new=partnum:start:end create new partition
-t, --typecode=partnum:{hexcode|GUID} change partition type code
-g, --mbrtogpt convert MBR to GPT
Create the new partitions sdd3 and sdd4:
# sgdisk --new=3:35653632:+4GiB --change-name="3:ceph block.wal" --typecode="3:98D073A1-925D-944D-9505-7FE489848305" --mbrtogpt /dev/sdd
# sgdisk --new=4:44042240:+100GiB --change-name="4:ceph block.db" --typecode="4:4BBA676A-A628-904A-8C93-1B691161C5A1" --mbrtogpt /dev/sdd
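As a side note, sgdisk accepts 0 for the start sector, meaning the start of the largest free block, which avoids computing 35653632 and 44042240 by hand; a sketch assuming the same free-space layout:
# sgdisk --new=3:0:+4GiB --change-name="3:ceph block.wal" /dev/sdd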
Partition layout after creation:
- sgdisk -p /dev/sdd
Number Start (sector) End (sector) Size Code Name
1 2048 2099199 1024.0 MiB 8300
2 2099200 35653631 16.0 GiB 8300
3 35653632 44042239 4.0 GiB FFFF ceph block.wal
4 44042240 253757439 100.0 GiB FFFF ceph block.db
- dd status=progress if=/dev/sdd1 of=/dev/sdd3 (copy the wal partition)
# dd status=progress if=/dev/sdd1 of=/dev/sdd3
1055113728 bytes (1.1 GB, 1006 MiB) copied, 50 s, 21.1 MB/s
2097152+0 records in
2097152+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 51.1283 s, 21.0 MB/s
- dd status=progress if=/dev/sdd2 of=/dev/sdd4 (copy the db partition)
# dd status=progress if=/dev/sdd2 of=/dev/sdd4
17179038208 bytes (17 GB, 16 GiB) copied, 997 s, 17.2 MB/s
33554432+0 records in
33554432+0 records out
17179869184 bytes (17 GB, 16 GiB) copied, 998.58 s, 17.2 MB/s
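The ~20 MB/s above is largely due to dd's default 512-byte block size; a larger block size usually makes the copy much faster (a sketch with an assumed 4 MiB block size):
# dd status=progress bs=4M if=/dev/sdd2 of=/dev/sdd4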
- Delete the old partitions and assign the recorded Partition unique GUIDs to the new partitions
# sgdisk --delete=1 --delete=2 --partition-guid="3:98D073A1-925D-944D-9505-7FE489848305" --partition-guid="4:4BBA676A-A628-904A-8C93-1B691161C5A1" /dev/sdd
The operation has completed successfully.
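Before moving on, it is worth verifying that the new partitions now carry the recorded GUIDs:
# sgdisk -i 3 /dev/sdd | grep 'unique GUID'
# sgdisk -i 4 /dev/sdd | grep 'unique GUID'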
- partprobe (update the kernel partition table)
# partprobe
- use new partition: remove the old symlinks
# cd /var/lib/ceph/osd/ceph-0/
# rm block.wal
# rm block.db
Create new symlinks for block.wal and block.db:
# ln -s /dev/sdd3 block.wal
# ln -s /dev/sdd4 block.db
Set ownership on the new partitions:
# chown -R ceph:ceph /dev/sd*
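Note that chown on device nodes does not persist across a reboot. Because these partitions use the generic Linux filesystem type rather than the Ceph partition type GUIDs matched by the packaged udev rules, one option is a custom udev rule (a sketch; the file name is arbitrary):
# cat /etc/udev/rules.d/99-ceph-osd0.rules
KERNEL=="sdd3", OWNER="ceph", GROUP="ceph", MODE="0660"
KERNEL=="sdd4", OWNER="ceph", GROUP="ceph", MODE="0660"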
Set the partition type back to Linux filesystem (entry 20 in fdisk's GPT type list):
# fdisk /dev/sdd -> p -> t (3, 4) -> 20 -> w
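The same change can be made non-interactively with sgdisk (8300 is the GPT type code for Linux filesystem):
# sgdisk --typecode=3:8300 --typecode=4:8300 /dev/sdd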
- sgdisk -p /dev/sdd (confirm the partition layout)
Disk /dev/sdd: 1952448512 sectors, 931.0 GiB
Model: PERC H730P Mini
Sector size (logical/physical): 512/512 bytes
Disk identifier (GUID): 0FC63DAF-8483-4772-8E79-3D69D8477DE4
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 2048, last usable sector is 1952448478
Partitions will be aligned on 2048-sector boundaries
Total free space is 1734342623 sectors (827.0 GiB)
Number Start (sector) End (sector) Size Code Name
3 35653632 44042239 4.0 GiB 8300 ceph block.wal
4 44042240 253757439 100.0 GiB 8300 ceph block.db
- ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-0 (expand the WAL and DB to fill their new partitions)
inferring bluefs devices from bluestore path
0 : device size 0x100000000 : own 0x[1000~3ffff000] = 0x3ffff000 : using 0x4ff000(5.0 MiB)
1 : device size 0x1900000000 : own 0x[2000~3ffffe000] = 0x3ffffe000 : using 0xb7bfe000(2.9 GiB)
2 : device size 0xe880000000 : own 0x[6f99900000~94cd00000] = 0x94cd00000 : using 0x9a200000(2.4 GiB)
Expanding...
0 : expanding from 0x40000000 to 0x100000000
0 : size label updated to 4294967296 // 4 GiB
1 : expanding from 0x400000000 to 0x1900000000
1 : size label updated to 107374182400 // 100 GiB
- ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-0 (verify the wal and db sizes after expansion)
inferring bluefs devices from bluestore path
{
"/var/lib/ceph/osd/ceph-0/block": {
"osd_uuid": "45693cad-5a91-4dbc-8180-b25f4d864f33",
"size": 998579896320,
"btime": "2019-06-13T11:42:12.194273+0800",
"description": "main",
"bluefs": "1",
"ceph_fsid": "45b2df47-f946-43e1-9a06-4832ff3e5c24",
"kv_backend": "rocksdb",
"magic": "ceph osd volume v026",
"mkfs_done": "yes",
"osd_key": "AQASxgFdn465LBAA7x6kMuoS9UiqwqVjDmHD9A==",
"ready": "ready",
"require_osd_release": "15",
"whoami": "0"
},
"/var/lib/ceph/osd/ceph-0/block.wal": {
"osd_uuid": "45693cad-5a91-4dbc-8180-b25f4d864f33",
"size": 4294967296, // WAL 大小 4G
"btime": "2019-06-13T11:42:12.196216+0800",
"description": "bluefs wal"
},
"/var/lib/ceph/osd/ceph-0/block.db": {
"osd_uuid": "45693cad-5a91-4dbc-8180-b25f4d864f33",
"size": 107374182400, // DB 大小 100G
"btime": "2019-06-13T11:42:12.195025+0800",
"description": "bluefs db"
}
}
- Start OSD.0, watch the osd log for errors, and check the cluster status
# systemctl start ceph-osd@0
- Current cluster status: the spillover warning is still present, but note that the db device now reports 100 GiB
BLUEFS_SPILLOVER BlueFS spillover detected on 1 OSD(s)
osd.0 spilled over 1.1 GiB metadata from 'db' device (4.8 GiB used of 100 GiB) to slow device
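This is the BLUEFS_SPILLOVER health check; the same per-OSD detail can be pulled up at any time with:
# ceph health detail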
- Check the BlueFS statistics. The output below shows the WAL and DB partition sizes and confirms the slow device is still in use, so a migrate operation is needed
# ceph daemon osd.0 perf dump | jq .bluefs
{
"gift_bytes": 0,
"reclaim_bytes": 0,
"db_total_bytes": 107374174208, // 100G DB 分区
"db_used_bytes": 5129625600,
"wal_total_bytes": 4294963200, // 4G WAL 分区
"wal_used_bytes": 521138176,
"slow_total_bytes": 39943405568,
"slow_used_bytes": 1176502272, // slow 已用大小,需要迁移
"num_files": 104,
"log_bytes": 1908736,
"log_compactions": 2,
"logged_bytes": 278183936,
"files_written_wal": 2,
"files_written_sst": 129,
"bytes_written_wal": 6509639753,
"bytes_written_sst": 7286006904,
"bytes_written_slow": 0,
"max_bytes_wal": 549449728,
"max_bytes_db": 6008332288,
"max_bytes_slow": 0,
"read_random_count": 57820,
"read_random_bytes": 7151101853,
"read_random_disk_count": 19252,
"read_random_disk_bytes": 6999861094,
"read_random_buffer_count": 38650,
"read_random_buffer_bytes": 151240759,
"read_count": 15253,
"read_bytes": 675829317,
"read_prefetch_count": 4902,
"read_prefetch_bytes": 337983129
}
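To track just the counters that matter here, a convenience one-liner (jq object shorthand) can be used before and after the migration:
# ceph daemon osd.0 perf dump | jq '.bluefs | {db_total_bytes, wal_total_bytes, slow_used_bytes}'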
Migrating the spilled-over metadata
Move the metadata that has spilled over back to the block.db partition.
- Stop OSD.0
# systemctl stop ceph-osd@0
- Run the migration: bluefs-bdev-migrate moves BlueFS data off the listed source devices (here the slow/main block device) onto the target device (block.db)
# ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-0 --devs-source /var/lib/ceph/osd/ceph-0/block --dev-target /var/lib/ceph/osd/ceph-0/block.db --command bluefs-bdev-migrate
inferring bluefs devices from bluestore path
- Set partition ownership
# chown -R ceph:ceph /dev/sdd*
# chown -R ceph:ceph /var/lib/ceph/osd/ceph-0/block.db
- Start OSD.0
# systemctl start ceph-osd@0
- Check the bluefs statistics
# ceph daemon osd.0 perf dump | jq .bluefs | grep slow_used_bytes
"slow_used_bytes": 0,
As shown above, the data has been migrated off the slow device and the cluster warning clears.
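If the noout flag was set at the start, clear it now:
# ceph osd unset noout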