Ceph is an open-source distributed storage system that provides the following storage types:

  • Block storage (RBD)
  • Object storage (RADOS Gateway)
  • File system (CephFS)

Block storage (RBD)


A block is a sequence of bytes (for example, a 512-byte block of data). Block-based storage interfaces are the most common way to store data on rotating media such as hard disks, CDs, floppy disks, and even traditional 9-track tape.

Ceph block devices are thin-provisioned, resizable, and store their data striped across multiple OSDs in a Ceph cluster. They take advantage of RADOS capabilities such as snapshots, replication, and consistency. Ceph's RADOS Block Device (RBD) interacts with OSDs using a kernel module or the librbd library.

Ceph block devices deliver high performance and virtually unlimited scalability to kernel modules, to KVM hypervisors such as QEMU, and to cloud computing platforms such as OpenStack and CloudStack that rely on libvirt and QEMU to integrate with Ceph block devices. The same cluster can simultaneously serve the Ceph RADOS Gateway, the CephFS file system, and Ceph block devices.

On a Linux system, ls /dev/ lists many block device files; these are the devices recognized when disks are attached to the machine.

rbd is a block device exposed by a Ceph cluster. You can think of it this way: sda is connected to a physical disk through a data cable, while rbd is connected over the network to a storage area in the Ceph cluster. Data written to the rbd device file ultimately ends up in that area of the Ceph cluster.

Summary: a block device can be thought of as a hard disk. Users can use the raw block device directly without a file system, or format it with a specific file system and let the file system organize and manage the storage space, which provides richer and friendlier data operations.

Kubernetes does not support mounting an RBD image across nodes; that is, pods on different nodes cannot mount the same Ceph RBD image at the same time.

Ceph offers three storage interfaces:

  • Object storage: Ceph Object Gateway
  • Block device: RBD
  • File system: CephFS

Kubernetes supports the latter two interfaces; the supported access modes are shown in the figure below:

[Figure: access modes supported in Kubernetes for Ceph RBD and CephFS]

This article tests using Ceph RBD as the persistent storage backend.

Since a dedicated team at my company runs Ceph and I only consume it, the installation of the Ceph cluster is skipped here. For Kubernetes, the Ceph information you need is: the monitor IP addresses and port, the admin user name, and the admin keyring.

 

RBD creation test


Using RBD involves three steps:

  1. Create a block device image on the server/client
  2. On the client, map the image into the Linux kernel; once the kernel recognizes the block device, a device file appears under /dev
  3. Format the block device and mount it

The following uses a few Ceph command-line tools to interact with the Ceph cluster.

yum/apt install ceph-common ceph-fs-common -y

The ceph command line needs two configuration files; ask the Ceph team for their exact contents (a sketch of what they typically look like follows the list):

  • /etc/ceph/ceph.conf
  • /etc/ceph/ceph.client.admin.keyring
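
For reference, here is a minimal sketch of what those two files typically contain. The fsid and monitor addresses below are taken from this article's cluster (visible later in the ceph -s output); your Ceph team's values will differ, and the admin key is left as a placeholder:

# /etc/ceph/ceph.conf (minimal sketch)
[global]
fsid = cee1da3c-7be4-4f20-8256-5403deb42c1f
mon_initial_members = master1-admin, node1-monitor, node2-osd
mon_host = 192.168.0.5,192.168.0.6,192.168.0.7
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx

# /etc/ceph/ceph.client.admin.keyring (key redacted)
[client.admin]
        key = <base64 admin key>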

RBD is used like this: create a pool on Ceph (logical isolation), create an image in that pool (the actual storage unit), and then map the image and mount it on a directory of the local server.

The following few commands cover most of what you need.

# rbd list    # list the images in the default pool
# rbd list -p k8s # list the images in pool k8s
# rbd create foo -s 1024 # create an image named foo in the default pool, 1024 MB in size


# rbd map foo # map the image from the Ceph cluster to a local block device
/dev/rbd0
# ls -l /dev/rbd0 # note the "b" (block device) type
brw-rw---- 1 root disk 252, 0 May 22 20:57 /dev/rbd0
$ rbd showmapped # list the RBD images that are already mapped
id pool image snap device
0 rbd foo - /dev/rbd0


# mount /dev/rbd1 /mnt/bar/ # mounting fails at this point because the image has not been formatted with a file system
mount: /dev/rbd1 is write-protected, mounting read-only
mount: wrong fs type, bad option, bad superblock on /dev/rbd1,
missing codepage or helper program, or other error

In some cases useful info is found in syslog - try
dmesg | tail or so.


# mkfs.ext4 /dev/rbd0 # format it as ext4
...
Writing superblocks and filesystem accounting information: done
# mount /dev/rbd0 /mnt/foo/ # mount it again
# df -h |grep foo # ok
/dev/rbd0 976M 2.6M 907M 1% /mnt/foo
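
When the test is finished, the image can be unmounted and unmapped again. A quick cleanup sketch, assuming the mapping above:

# umount /mnt/foo     # unmount the file system
# rbd unmap /dev/rbd0 # remove the kernel mapping
# rbd showmapped      # foo should no longer be listed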

The Ceph cluster is currently healthy:

[root@master1-admin ceph]# ceph -s
cluster cee1da3c-7be4-4f20-8256-5403deb42c1f
health HEALTH_OK
monmap e1: 3 mons at {master1-admin=192.168.0.5:6789/0,node1-monitor=192.168.0.6:6789/0,node2-osd=192.168.0.7:6789/0}
election epoch 6, quorum 0,1,2 master1-admin,node1-monitor,node2-osd
fsmap e7: 1/1/1 up {0=node1-monitor=up:active}, 2 up:standby
osdmap e20: 3 osds: 3 up, 3 in
flags sortbitwise,require_jewel_osds
pgmap v52: 116 pgs, 3 pools, 2068 bytes data, 20 objects
323 MB used, 45723 MB / 46046 MB avail
116 active+clean

 

Testing Ceph RBD mounts from Kubernetes


For Kubernetes to use Ceph, ceph-common must be installed on every Kubernetes node (its command-line tools are used to interact with the Ceph cluster). Copy the ceph.repo file from a Ceph node into /etc/yum.repos.d/ on each Kubernetes node, then run yum install ceph-common -y on each of them. (The Ceph network and the Kubernetes network must be able to reach each other.)

[root@master1-admin ceph]# scp /etc/yum.repos.d/ceph.repo   root@192.168.0.2:/etc/yum.repos.d/
Warning: Permanently added '192.168.0.2' (ECDSA) to the list of known hosts.
root@192.168.0.2's password:
ceph.repo 100% 622 1.4MB/s 00:00

[root@master1-admin ceph]# scp /etc/yum.repos.d/ceph.repo root@192.168.0.3:/etc/yum.repos.d/
Warning: Permanently added '192.168.0.3' (ECDSA) to the list of known hosts.
root@192.168.0.3's password:
ceph.repo


[root@node1 ~]# yum install ceph-common -y
[root@node2 ~]# yum install ceph-common -y

# Copy the Ceph configuration files to the Kubernetes control-plane and worker nodes

[root@master1-admin ceph]# scp /etc/ceph/*   root@192.168.0.3:/etc/ceph/
Warning: Permanently added '192.168.0.3' (ECDSA) to the list of known hosts.
root@192.168.0.3's password:
ceph.bootstrap-mds.keyring 100% 113 271.5KB/s 00:00
ceph.bootstrap-mgr.keyring 100% 71 233.2KB/s 00:00
ceph.bootstrap-osd.keyring 100% 113 386.5KB/s 00:00
ceph.bootstrap-rgw.keyring 100% 113 391.3KB/s 00:00
ceph.client.admin.keyring 100% 63 174.2KB/s 00:00
ceph.conf 100% 343 1.1MB/s 00:00
ceph-deploy-ceph.log 100% 129KB 69.8MB/s 00:00
ceph.mon.keyring 100% 73 238.3KB/s 00:00
rbdmap 100% 92 334.5KB/s 00:00
[root@master1-admin ceph]# scp /etc/ceph/* root@192.168.0.4:/etc/ceph/
Warning: Permanently added '192.168.0.4' (ECDSA) to the list of known hosts.
root@192.168.0.4's password:
ceph.bootstrap-mds.keyring 100% 113 265.2KB/s 00:00
ceph.bootstrap-mgr.keyring 100% 71 210.5KB/s 00:00
ceph.bootstrap-osd.keyring 100% 113 437.9KB/s 00:00
ceph.bootstrap-rgw.keyring 100% 113 346.2KB/s 00:00
ceph.client.admin.keyring 100% 63 235.1KB/s 00:00
ceph.conf 100% 343 1.3MB/s 00:00
ceph-deploy-ceph.log 100% 129KB 65.0MB/s 00:00
ceph.mon.keyring 100% 73 256.5KB/s 00:00
rbdmap 100% 92 375.0KB/s 00:00

 

Creating a Ceph RBD image


First create a pool, then create an RBD image in that pool. RBD is the block device; CephFS is the file system.

[root@master1-admin ceph]# ceph osd pool create k8srbd1 56
pool 'k8srbd1' created

[root@master1-admin ceph]# rbd create rbda -s 1024 -p k8srbd1

[root@master1-admin ceph]# rbd feature disable k8srbd1/rbda object-map fast-diff deep-flatten   # these image features are not supported by the kernel RBD client, so disable them before mapping
[root@master1-admin ceph]# netstat -tpln | grep 6789
tcp 0 0 192.168.0.5:6789 0.0.0.0:* LISTEN 20773/ceph-mon

[root@node1-monitor ~]# netstat -tpln | grep 6789
tcp 0 0 192.168.0.6:6789 0.0.0.0:* LISTEN 21323/ceph-mon

[root@node2-osd ~]# netstat -tpln | grep 6789
tcp 0 0 192.168.0.7:6789 0.0.0.0:* LISTEN 22332/ceph-mon

When RBD is used as a volume, you need the monitor addresses, the pool (the one just created above), and the image (block device) inside that pool.

[root@master ~]# cat pod.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: testrbd
spec:
  containers:
  - image: nginx
    name: nginx
    imagePullPolicy: IfNotPresent
    volumeMounts:
    - name: testrbd
      mountPath: /mnt
  volumes:
  - name: testrbd
    rbd:
      monitors:
      - '192.168.0.5:6789'
      - '192.168.0.6:6789'
      - '192.168.0.7:6789'
      pool: k8srbd1
      image: rbda
      fsType: xfs
      readOnly: false
      user: admin
      keyring: /etc/ceph/ceph.client.admin.keyring

The keyring path refers to a file on whichever node the pod gets scheduled onto, so /etc/ceph/ceph.client.admin.keyring must exist on every node.
[root@master ~]# kubectl get pod
NAME READY STATUS RESTARTS AGE
dns-test 1/1 Running 6 69d
nfs-client-provisioner-867b44bd69-xscs6 1/1 Running 2 17d
testrbd 1/1 Running 0 21s


[root@master ~]# kubectl describe pod testrbd
Name: testrbd
Namespace: default
Priority: 0
Node: node1/192.168.0.3
Start Time: Wed, 18 Aug 2021 11:10:46 +0800
Labels: <none>
Annotations: cni.projectcalico.org/podIP: 10.233.90.175/32
cni.projectcalico.org/podIPs: 10.233.90.175/32
Status: Running
IP: 10.233.90.175
IPs:
IP: 10.233.90.175
Containers:
nginx:
Container ID: docker://72ba2a480a963451d5e32bf7c462a31b6f1fd3718d9405aed4a55fac5e86de5c
Image: nginx
Image ID: docker-pullable://nginx@sha256:8f335768880da6baf72b70c701002b45f4932acae8d574dedfddaf967fc3ac90
Port: <none>
Host Port: <none>
State: Running
Started: Wed, 18 Aug 2021 11:11:04 +0800
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/mnt from testrbd (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-cwbdx (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
testrbd:
Type: RBD (a Rados Block Device mount on the host that shares a pod's lifetime)
CephMonitors: [192.168.0.5:6789 192.168.0.6:6789 192.168.0.7:6789]
RBDImage: rbda
FSType: xfs
RBDPool: k8srbd1
RadosUser: admin
Keyring: /etc/ceph/ceph.client.admin.keyring
SecretRef: nil
ReadOnly: false
default-token-cwbdx:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-cwbdx
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 13m default-scheduler Successfully assigned default/testrbd to node1
Normal SuccessfulAttachVolume 13m attachdetach-controller AttachVolume.Attach succeeded for volume "testrbd"
Normal Pulled 13m kubelet, node1 Container image "nginx" already present on machine
Normal Created 13m kubelet, node1 Created container nginx
Normal Started 13m kubelet, node1 Started container nginx


[root@master ~]# kubectl exec -it testrbd bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl kubectl exec [POD] -- [COMMAND] instead.
root@testrbd:/# ls /mnt
root@testrbd:/# cd /mnt/
root@testrbd:/mnt# exit
exit

As you can see, Kubernetes can now consume Ceph RBD block storage. If you do not maintain Ceph yourself, you only need to state your requirements, for example how large the pool should be and how large the RBD image should be; then, knowing the pool name, the image name, and the monitor IP addresses, you can mount the RBD block device into a pod.

The above lets a pod use a block device, but there is a drawback: if you create another pod (on a different node), it cannot use the same block device, because the image is already in use by the first pod. You can only create additional RBD images under the pool.
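
To see this limitation for yourself, you could create a hypothetical second pod (testrbd2 below is a made-up name) forced onto another node but referencing the same pool and image; it would be expected to get stuck in ContainerCreating with volume errors instead of reaching Running:

apiVersion: v1
kind: Pod
metadata:
  name: testrbd2               # hypothetical name, not from the original walkthrough
spec:
  nodeName: node2              # force it onto a different node than testrbd
  containers:
  - image: nginx
    name: nginx
    imagePullPolicy: IfNotPresent
    volumeMounts:
    - name: testrbd
      mountPath: /mnt
  volumes:
  - name: testrbd
    rbd:
      monitors:
      - '192.168.0.5:6789'
      - '192.168.0.6:6789'
      - '192.168.0.7:6789'
      pool: k8srbd1
      image: rbda              # same image as the first pod, which causes the conflict
      fsType: xfs
      readOnly: false
      user: admin
      keyring: /etc/ceph/ceph.client.admin.keyring

So instead, a new image (rbdb) is created for the next pod: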

[root@master1-admin ceph]# rbd create rbdb -s 1024 -p k8srbd1
[root@master1-admin ceph]# rbd feature disable k8srbd1/rbdb object-map fast-diff deep-flatten

The pod above requires the keyring file to be present on every node, which is inconvenient and not very secure. For the new pod I use the recommended approach: a Secret (although it is not much more secure).

First, create a Secret.

apiVersion: v1
kind: Secret
metadata:
  name: ceph-secret
type: "kubernetes.io/rbd"
data:
  key: QVFCXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX9PQ==
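
The key field is the base64-encoded Ceph client key (the value above is redacted). Assuming you have admin access as in this article, it can be generated roughly like this; with a dedicated Ceph team you would normally ask them for a key with narrower permissions instead:

# ceph auth get-key client.admin | base64
QVFC...9PQ==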

Reference this Secret in the new pod.

apiVersion: v1
kind: Pod
metadata:
  name: rbd3
spec:
  containers:
  - image: gcr.io/nginx
    name: rbd-rw
    volumeMounts:
    - name: rbdpd
      mountPath: /mnt/rbd
  volumes:
  - name: rbdpd
    rbd:
      monitors:
      - '1.2.3.4:6789'
      pool: k8s
      image: foobar
      fsType: ext4
      readOnly: false
      user: admin
      secretRef:
        name: ceph-secret

Now let's check whether the file previously written to this image is still there.

kubectl exec rbd3 -- cat /mnt/rbd/hosts
# Kubernetes-managed hosts file.
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
fe00::0 ip6-mcastprefix
fe00::1 ip6-allnodes
fe00::2 ip6-allrouters
10.244.3.249 rbd

You will find that the file is still there! That is what RBD gives you. Note, however, that the original Volume no longer exists; the data survived, but this is not really the persistent storage pattern we want.

Using volumes directly like this is a fairly basic approach; the more complete way is to use PV/PVC.
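
As a pointer in that direction, here is a rough sketch of what a statically provisioned RBD PersistentVolume and a matching PersistentVolumeClaim could look like, reusing this article's pool, image, and Secret (the names ceph-rbd-pv and ceph-rbd-pvc are made up for illustration):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: ceph-rbd-pv            # hypothetical name
spec:
  capacity:
    storage: 1Gi
  accessModes:
  - ReadWriteOnce              # RBD cannot be mounted read-write across nodes
  persistentVolumeReclaimPolicy: Retain
  rbd:
    monitors:
    - '192.168.0.5:6789'
    - '192.168.0.6:6789'
    - '192.168.0.7:6789'
    pool: k8srbd1
    image: rbdb
    user: admin
    secretRef:
      name: ceph-secret
    fsType: xfs
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ceph-rbd-pvc           # hypothetical name
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi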