How to Resize the BlueStore block.wal and block.db Partitions

iliul@地平线, Ceph Open Source Community

Introduction


This post describes how to resize the BlueStore block.wal and block.db partitions, and how to migrate metadata back after it has spilled over to the slow device (supported since Ceph v14.1.0). The walkthrough below uses raw disk partitions; the procedure is similar for devices managed through LVM.
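Before touching any partitions, it helps to confirm whether spillover has actually occurred. A minimal check, assuming a Nautilus or later cluster with admin access (both commands also appear later in this walkthrough):

# ceph health detail | grep -i spillover
# ceph daemon osd.0 perf dump | jq .bluefs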

Replacing the WAL and DB partitions


Query the current osd.0 layout to determine which disk partitions back block.db and block.wal:

  • tree -a /var/lib/ceph/osd/ceph-0

ceph-0
├── activate.monmap
├── block -> /dev/ceph-b17ef1f2-8c1a-4be1-97a7-c35b77d79e78/osd-block-45693cad-5a91-4dbc-8180-b25f4d864f33
├── block.db -> /dev/sdd2
├── block.wal -> /dev/sdd1
├── bluefs
├── ceph_fsid
├── fsid
├── keyring
├── kv_backend
├── magic
├── mkfs_done
├── osd_key
├── ready
├── require_osd_release
├── type
└── whoami
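The symlink targets can also be resolved directly; a quick sanity check with readlink:

# readlink -f /var/lib/ceph/osd/ceph-0/block.wal   # should print /dev/sdd1, matching the tree output
# readlink -f /var/lib/ceph/osd/ceph-0/block.db    # should print /dev/sdd2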
  • fdisk -l /dev/sdd

Disk /dev/sdd: 931 GiB, 999653638144 bytes, 1952448512 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 0389A1C8-5306-AC45-8300-0B012020B755

Device        Start       End   Sectors  Size Type
/dev/sdd1      2048   2099199   2097152    1G Linux filesystem
/dev/sdd2   2099200  35653631  33554432   16G Linux filesystem
  • sgdisk -p /dev/sdd

Disk /dev/sdd: 1952448512 sectors, 931.0 GiB
Model: PERC H730P Mini
Sector size (logical/physical): 512/512 bytes
Disk identifier (GUID): 0389A1C8-5306-AC45-8300-0B012020B755
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 2048, last usable sector is 1952448478
Partitions will be aligned on 2048-sector boundaries
Total free space is 1698691039 sectors (810.0 GiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048         2099199   1024.0 MiB  8300
   2         2099200        35653631   16.0 GiB    8300
  • sgdisk -i 1 /dev/sdd, recording the Partition unique GUID; it is needed below when creating the replacement partition

Partition GUID code: 0FC63DAF-8483-4772-8E79-3D69D8477DE4 (Linux filesystem)
Partition unique GUID: 98D073A1-925D-944D-9505-7FE489848305
First sector: 2048 (at 1024.0 KiB)
Last sector: 2099199 (at 1025.0 MiB)
Partition size: 2097152 sectors (1024.0 MiB)
Attribute flags: 0000000000000000
Partition name: ''
  • sgdisk -i 2 /dev/sdd, likewise recording the Partition unique GUID value (a sketch for capturing both GUIDs follows the output below)

Partition GUID code: 0FC63DAF-8483-4772-8E79-3D69D8477DE4 (Linux filesystem)
Partition unique GUID: 4BBA676A-A628-904A-8C93-1B691161C5A1
First sector: 2099200 (at 1.0 GiB)
Last sector: 35653631 (at 17.0 GiB)
Partition size: 33554432 sectors (16.0 GiB)
Attribute flags: 0000000000000000
Partition name: ''
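Rather than copying the GUIDs by hand, they can be captured into shell variables for the sgdisk commands below; a small sketch, assuming the "Partition unique GUID: ..." output format shown above (the variable names are only for illustration):

# WAL_GUID=$(sgdisk -i 1 /dev/sdd | awk -F': ' '/Partition unique GUID/ {print $2}')
# DB_GUID=$(sgdisk -i 2 /dev/sdd | awk -F': ' '/Partition unique GUID/ {print $2}')
# echo "$WAL_GUID $DB_GUID"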
  • Stop osd.0

# systemctl stop ceph-osd@0
  • ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-0

{
    "/var/lib/ceph/osd/ceph-0/block": {
        "osd_uuid": "45693cad-5a91-4dbc-8180-b25f4d864f33",
        "size": 998579896320,
        "btime": "2019-06-13T11:42:12.194273+0800",
        "description": "main",
        "bluefs": "1",
        "ceph_fsid": "45b2df47-f946-43e1-9a06-4832ff3e5c24",
        "kv_backend": "rocksdb",
        "magic": "ceph osd volume v026",
        "mkfs_done": "yes",
        "osd_key": "AQASxgFdn465LBAA7x6kMuoS9UiqwqVjDmHD9A==",
        "ready": "ready",
        "require_osd_release": "15",
        "whoami": "0"
    },
    "/var/lib/ceph/osd/ceph-0/block.wal": {
        "osd_uuid": "45693cad-5a91-4dbc-8180-b25f4d864f33",
        "size": 1073741824,    // 扩容前大小,1G
        "btime": "2019-06-13T11:42:12.196216+0800",
        "description": "bluefs wal"
    },
    "/var/lib/ceph/osd/ceph-0/block.db": {
        "osd_uuid": "45693cad-5a91-4dbc-8180-b25f4d864f33",
        "size": 17179869184,   // 扩容前大小,16G
        "btime": "2019-06-13T11:42:12.195025+0800",
        "description": "bluefs db"
    }
}
  • Create the new WAL and DB partitions: prepare the replacement targets /dev/sdd3 and /dev/sdd4. Note that 35653632 is the sector immediately after the existing partitions (taken from the sgdisk -p output above; it can also be queried, as sketched after the help text below), and that --typecode is given the unique GUID of the partition being replaced.

Refer to the following excerpt from the sgdisk help:


-c, --change-name=partnum:name              change partition's name
-n, --new=partnum:start:end                 create new partition
-t, --typecode=partnum:{hexcode|GUID}       change partition type code
-g, --mbrtogpt                              convert MBR to GPT
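The start sector for the next partition need not be read off manually; sgdisk can report the first aligned free sector itself, which on this disk is 35653632, right after /dev/sdd2:

# sgdisk -F /dev/sdd    # --first-aligned-in-largest: first usable sector in the largest free block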

Create the new partitions sdd3 and sdd4:



# sgdisk --new=3:35653632:+4GiB --change-name="3:ceph block.wal" --typecode="3:98D073A1-925D-944D-9505-7FE489848305" --mbrtogpt /dev/sdd
# sgdisk --new=4:44042240:+100GiB --change-name="4:ceph block.db" --typecode="4:4BBA676A-A628-904A-8C93-1B691161C5A1" --mbrtogpt /dev/sdd

Partition layout after creation:

  • sgdisk -p /dev/sdd

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048         2099199   1024.0 MiB  8300
   2         2099200        35653631   16.0 GiB    8300
   3        35653632        44042239   4.0 GiB     FFFF  ceph block.wal
   4        44042240       253757439   100.0 GiB   FFFF  ceph block.db
  • dd status=progress if=/dev/sdd1 of=/dev/sdd3, copying the WAL partition


# dd status=progress if=/dev/sdd1 of=/dev/sdd3
1055113728 bytes (1.1 GB, 1006 MiB) copied, 50 s, 21.1 MB/s
2097152+0 records in
2097152+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 51.1283 s, 21.0 MB/s
  • dd status=progress if=/dev/sdd2 of=/dev/sdd4, copying the DB partition

# dd status=progress if=/dev/sdd2 of=/dev/sdd4
17179038208 bytes (17 GB, 16 GiB) copied, 997 s, 17.2 MB/s
33554432+0 records in
33554432+0 records out
17179869184 bytes (17 GB, 16 GiB) copied, 998.58 s, 17.2 MB/s
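Before deleting the old partitions it is worth verifying the copies. Since the new partitions are larger than the old ones, the comparison must be bounded by the source size (1 GiB and 16 GiB, per the dd output above); a sketch using GNU cmp:

# cmp -n 1073741824 /dev/sdd1 /dev/sdd3 && echo "wal copy ok"
# cmp -n 17179869184 /dev/sdd2 /dev/sdd4 && echo "db copy ok"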
  • Delete the old partitions and assign the original unique GUIDs (recorded earlier) to the new partitions

# sgdisk --delete=1 --delete=2 --partition-guid="3:98D073A1-925D-944D-9505-7FE489848305" --partition-guid="4:4BBA676A-A628-904A-8C93-1B691161C5A1" /dev/sdd
The operation has completed successfully.
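To confirm the unique GUID was carried over to the new WAL partition:

# sgdisk -i 3 /dev/sdd | grep 'unique GUID'
Partition unique GUID: 98D073A1-925D-944D-9505-7FE489848305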
  • partprobe, to reload the kernel's partition table

# partprobe
  • Switch to the new partitions: remove the old symlinks

# cd /var/lib/ceph/osd/ceph-0/
# rm block.wal
# rm block.db

Create new symlinks for block.wal and block.db:


# ln -s /dev/sdd3 block.wal
# ln -s /dev/sdd4 block.db

Set ownership on the new partitions:


# chown -R ceph:ceph /dev/sdd*

Set the partition type to Linux filesystem (entry 20 in fdisk's GPT type list):



# fdisk /dev/sdd -> p -> t (3, then 4) -> 20 -> w
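The same change can be made non-interactively with sgdisk, where 8300 is the GPT type code for Linux filesystem:

# sgdisk --typecode=3:8300 --typecode=4:8300 /dev/sdd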
  • sgdisk -p /dev/sdd, to confirm the partition layout

Disk /dev/sdd: 1952448512 sectors, 931.0 GiB
Model: PERC H730P Mini
Sector size (logical/physical): 512/512 bytes
Disk identifier (GUID): 0FC63DAF-8483-4772-8E79-3D69D8477DE4
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 2048, last usable sector is 1952448478
Partitions will be aligned on 2048-sector boundaries
Total free space is 1734342623 sectors (827.0 GiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   3        35653632        44042239   4.0 GiB     8300  ceph block.wal
   4        44042240       253757439   100.0 GiB   8300  ceph block.db
  • ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-0, expanding the WAL and DB devices to their new partition sizes

inferring bluefs devices from bluestore path
0 : device size 0x100000000 : own 0x[1000~3ffff000] = 0x3ffff000 : using 0x4ff000(5.0 MiB)
1 : device size 0x1900000000 : own 0x[2000~3ffffe000] = 0x3ffffe000 : using 0xb7bfe000(2.9 GiB)
2 : device size 0xe880000000 : own 0x[6f99900000~94cd00000] = 0x94cd00000 : using 0x9a200000(2.4 GiB)
Expanding...
0 : expanding  from 0x40000000 to 0x100000000
0 : size label updated to 4294967296   // 4 GiB
1 : expanding  from 0x400000000 to 0x1900000000
1 : size label updated to 107374182400   // 100 GiB
  • ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-0, to verify the resized WAL and DB labels on the OSD

inferring bluefs devices from bluestore path
{
    "/var/lib/ceph/osd/ceph-0/block": {
        "osd_uuid": "45693cad-5a91-4dbc-8180-b25f4d864f33",
        "size": 998579896320,
        "btime": "2019-06-13T11:42:12.194273+0800",
        "description": "main",
        "bluefs": "1",
        "ceph_fsid": "45b2df47-f946-43e1-9a06-4832ff3e5c24",
        "kv_backend": "rocksdb",
        "magic": "ceph osd volume v026",
        "mkfs_done": "yes",
        "osd_key": "AQASxgFdn465LBAA7x6kMuoS9UiqwqVjDmHD9A==",
        "ready": "ready",
        "require_osd_release": "15",
        "whoami": "0"
    },
    "/var/lib/ceph/osd/ceph-0/block.wal": {
        "osd_uuid": "45693cad-5a91-4dbc-8180-b25f4d864f33",
        "size": 4294967296,       // WAL 大小 4G
        "btime": "2019-06-13T11:42:12.196216+0800",
        "description": "bluefs wal"
    },
    "/var/lib/ceph/osd/ceph-0/block.db": {
        "osd_uuid": "45693cad-5a91-4dbc-8180-b25f4d864f33",
        "size": 107374182400,        // DB 大小 100G
        "btime": "2019-06-13T11:42:12.195025+0800",
        "description": "bluefs db"
    }
}
  • Start osd.0, then watch the OSD log for errors (see the journalctl sketch below) and monitor the cluster status

# systemctl start ceph-osd@0
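One way to follow the OSD log mentioned above, assuming a systemd deployment as the unit names suggest:

# journalctl -u ceph-osd@0 -f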
  • Current cluster status: the spillover warning is still raised, but note that the db device now reports 100 GiB


BLUEFS_SPILLOVER BlueFS spillover detected on 1 OSD(s)
     osd.0 spilled over 1.1 GiB metadata from 'db' device (4.8 GiB used of 100 GiB) to slow device
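This text is the BLUEFS_SPILLOVER health check as reported by ceph health detail; it can be polled directly:

# ceph health detail | grep -A1 BLUEFS_SPILLOVER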
  • Check the BlueFS counters. The output below shows the WAL and DB sizes, and also that the slow device is still in use, so a migrate operation is needed


# ceph daemon osd.0 perf dump | jq .bluefs
{
  "gift_bytes": 0,
  "reclaim_bytes": 0,
  "db_total_bytes": 107374174208,   // 100G DB 分区
  "db_used_bytes": 5129625600,
  "wal_total_bytes": 4294963200,      // 4G WAL 分区
  "wal_used_bytes": 521138176,
  "slow_total_bytes": 39943405568,
  "slow_used_bytes": 1176502272,     // slow 已用大小,需要迁移
  "num_files": 104,
  "log_bytes": 1908736,
  "log_compactions": 2,
  "logged_bytes": 278183936,
  "files_written_wal": 2,
  "files_written_sst": 129,
  "bytes_written_wal": 6509639753,
  "bytes_written_sst": 7286006904,
  "bytes_written_slow": 0,
  "max_bytes_wal": 549449728,
  "max_bytes_db": 6008332288,
  "max_bytes_slow": 0,
  "read_random_count": 57820,
  "read_random_bytes": 7151101853,
  "read_random_disk_count": 19252,
  "read_random_disk_bytes": 6999861094,
  "read_random_buffer_count": 38650,
  "read_random_buffer_bytes": 151240759,
  "read_count": 15253,
  "read_bytes": 675829317,
  "read_prefetch_count": 4902,
  "read_prefetch_bytes": 337983129
}
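The full dump is verbose; jq can pull out just the fields that matter for spillover. A small convenience filter:

# ceph daemon osd.0 perf dump | jq '.bluefs | {db_total_bytes, db_used_bytes, wal_total_bytes, wal_used_bytes, slow_used_bytes}'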

Migrating the spilled-over metadata


Move the metadata that has spilled over back onto the block.db partition.

  • Stop osd.0

# systemctl stop ceph-osd@0
  • Run the migration: --devs-source lists the device(s) to move BlueFS data off (here the main/slow device) and --dev-target is the destination (block.db)

# ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-0 --devs-source /var/lib/ceph/osd/ceph-0/block --dev-target /var/lib/ceph/osd/ceph-0/block.db --command bluefs-bdev-migrate
inferring bluefs devices from bluestore path
  • Set partition ownership

# chown -R ceph:ceph /dev/sdd*
# chown -R ceph:ceph /var/lib/ceph/osd/ceph-0/block.db
  • Start osd.0

# systemctl start ceph-osd@0
  • Check the BlueFS statistics

# ceph daemon osd.0 perf dump | jq .bluefs | grep slow_used_bytes
  "slow_used_bytes": 0,

As shown above, the data has been migrated off the slow device, and the cluster warning is gone.