Overview of the Principles
File systems generally manage space with one of two common techniques: a bitmap or a tree.
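A bitmap divides all the space the file system manages into blocks (Windows calls them clusters). Each block maps to one bit whose two states mean free or allocated, and the collection of these bits is the bitmap. To allocate space, the file system searches the bitmap for a run of free bits and marks them allocated; to free space, it resets the corresponding bits to free.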
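The tree approach describes free or allocated space with extent records. A tree of free space, for example, may be a table of records of the form "{start of free range, length of free range}". To allocate, the file system searches for a suitable free extent according to some preference policy and then adjusts the tree; to free, it rebuilds the affected records and restructures the tree.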
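Both approaches work well enough on small file systems. Take the bitmap with a common 4K block size: on a 1T file system the bitmap is 32M (1T/4K/8), which fits in memory comfortably, and freeing space costs little. The tree approach needs somewhat more computation than the bitmap, but because the file system is small the number of fragments usually stays within an acceptable range, so the overall performance impact is also small.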
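The 1T/4K/8 arithmetic can be checked with a few lines of Python. This is only a sketch of the calculation; the function name `bitmap_bytes` is made up for illustration.

```python
def bitmap_bytes(fs_bytes, block_bytes=4096):
    """Size in bytes of a bitmap that uses one bit per block."""
    blocks = fs_bytes // block_bytes   # number of blocks to track
    return (blocks + 7) // 8           # 8 blocks per bitmap byte, rounded up

TIB = 1 << 40
MIB = 1 << 20

# 1T file system with 4K blocks
print(bitmap_bytes(1 * TIB) // MIB, "MiB")
```

The same function can be called with any capacity and block size to see how the bitmap grows linearly with the file system.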
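As file systems grow, the cost of both approaches becomes apparent. On a 16T file system with the same 4K block size, the bitmap grows to 512M (16T/4K/8). Loading the whole bitmap into memory puts heavy pressure on memory, and it is hard to write all of it back to disk promptly, so coping with events such as sudden power loss consumes a great deal of resources (and complicates the design). In addition, if a file is heavily fragmented, allocating or freeing it triggers many random-IO bitmap updates, and IO performance drops sharply. Finding a free spot in such a large space may, in the worst case, require scanning the entire 512M bitmap, which is also inefficient.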
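The tree approach is more balanced. It usually uses less memory than a bitmap, and because adjacent records can be merged, searching for free space does not slow down noticeably as the file system fills up. However, every allocation and free forces the tree to be adjusted, and the larger the tree, the heavier that load. Worst of all, neither allocation nor freeing guarantees that the affected records are contiguous, so the inefficiency of random IO on disk remains (although a well-designed tree structure can go a long way toward hiding these drawbacks).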
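The two common approaches above are, of course, a lead-in to ZFS.
ZFS uses neither a bitmap nor a tree for space management. It uses a third method that very few file systems adopt.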
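Layer 1: all the space to be managed is divided evenly into at most 200 segments (a constant defined in the source code) of equal size, and the segment size must be a power of two (2^N). In other words, the total space is divided by 200, the result is rounded up to the next 2^N, and the whole space is cut into pieces of that size. Each piece is called a metaslab. For example, 150G of total space divided by 200 is 768M, which rounds up to 2^30, that is 1GB, giving 150 metaslabs of 1GB each.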
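The sizing rule just described can be sketched as follows. This only mirrors the rule as stated in the text (divide by 200, round up to a power of two); it is not the actual ZFS code, and the names `MAX_METASLABS` and `metaslab_layout` are made up for illustration.

```python
MAX_METASLABS = 200   # the "at most 200" limit described in the text

def metaslab_layout(total_bytes):
    """Return (shift, count); each metaslab is 2**shift bytes."""
    target = -(-total_bytes // MAX_METASLABS)   # ceil(total / 200)
    shift = (target - 1).bit_length()           # smallest N with 2**N >= target
    count = total_bytes >> shift                # whole metaslabs that fit
    return shift, count

GIB = 1 << 30
shift, count = metaslab_layout(150 * GIB)
print(shift, count)   # 30 150
```

Rounding to a power of two means the metaslab that owns a given offset can be found with a plain right shift by `shift`.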
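The metaslab is the first step in breaking the whole into parts: each metaslab has its own space management records. (The benefit is that far fewer allocations and frees have to be considered globally.)
In ZFS, the allocation state of one metaslab corresponds to one object, which can simply be thought of as a file (a metadata file in the MOS) whose content is the allocation state. If nothing has been allocated from a metaslab (all of its space is free), no space record table and no metadata file are allocated for it.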
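Layer 2: each metaslab keeps a histogram-style table that records how many contiguous free segments of each size it contains: how many of 1 sector, how many of 1K, 2K, 4K, 8K, 16K, ... 1G, 2G, and so on. This lets the allocator decide quickly whether the metaslab holds a segment that fits a request well.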
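Layer 3: each metaslab records its IO log as a running ledger (the space map); every free and every allocation is appended to the tail of this ledger. When the metaslab needs to allocate or free space, the ledger is first read in chronological order. Once it has been read completely, the real allocation map of the metaslab exists in memory and allocation proceeds there. When a transaction boundary is reached, the in-memory summary (layer 2) and the ledger entries, merged along the way while they were in memory, are written back to disk. (Space map entries can be merged where appropriate, and old records may already have been merged, so the ledger does not grow especially large.)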
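The benefits include at least the following:
1. The space management units of the whole file system stay compact; they do not bloat because the file system is very large or nearly full. Space management tracks the real complexity more closely: when the file system is new, space management is very simple; when it is nearly full, it goes from complex back to simple (allocations merge and fewer records remain), which is reasonable.
2. Breaking the whole into parts: allocation is preferentially done within a single metaslab, which in effect uses a small contiguous region of disk to manage the current transaction as efficiently as possible. (By contrast, with a bitmap a user may need only 1M of space, yet the file system driver has to read in the entire bitmap, most of which is irrelevant and nearly useless load.)
3. Ledger-style records simplify tree operations. ZFS does all tree manipulation in memory; on disk, apart from the free-segment summary table, there are only sequentially written records. This reduces the number of disk IO interactions (random data changes triggered by a tree are the most damaging) and improves file system performance.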
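The only drawback I can see (note by Zhang Yu, www.datahf.net: inferred from theory only, without detailed experiments) is that the tree has to be rebuilt in memory whenever a metaslab is used for allocation, so under light workloads the computational load is higher than in a traditional file system. Then again, under light workloads a slightly higher computational load hardly matters, since the CPU and memory are probably idle anyway.
That covers the theory. The walkthrough below, with real command output, makes it concrete.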
The commands below create a zpool. Put simply, the pool consists of two vdevs, one of 30G and one of 60G.
qemu-img create -f vmdk case4.1.vmdk 30G
qemu-img create -f vmdk case4.2.vmdk 60G
qemu-nbd -c /dev/nbd0 case4.1.vmdk
qemu-nbd -c /dev/nbd1 case4.2.vmdk
zpool create -f case4 /dev/nbd0 /dev/nbd1
I. Use zdb -l to list the labels of the two vdevs. The only purpose is to obtain metaslab_array and metaslab_shift. The output shows:
vdev0 (case4.1.vmdk, 30G): /dev/nbd0:
metaslab_array: 37 ---- the metaslab table of vdev0 is managed by metadata object 37.
metaslab_shift: 28 ---- the metaslab size of vdev0 is 2^28 = 256M.
vdev1 (case4.2.vmdk, 60G): /dev/nbd1:
metaslab_array: 34 ---- the metaslab table of vdev1 is managed by metadata object 34.
metaslab_shift: 29 ---- the metaslab size of vdev1 is 2^29 = 512M.
Command output:
[root@localhost case4]# zdb -l /dev/nbd0
--------------------------------------------
LABEL 0
--------------------------------------------
    version: 5000
    name: 'case4'
    state: 0
    txg: 4
    pool_guid: 4712723554953817788
    errata: 0
    hostname: 'localhost'
    top_guid: 10025130926649767584
    guid: 10025130926649767584
    vdev_children: 2
    vdev_tree:
        type: 'disk'
        id: 0
        guid: 10025130926649767584
        path: '/dev/nbd0'
        whole_disk: 0
        metaslab_array: 37
        metaslab_shift: 28
        ashift: 9
        asize: 32207536128
        is_log: 0
        create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
--------------------------------------------
LABEL 1
--------------------------------------------
    version: 5000
    name: 'case4'
    state: 0
    txg: 4
    pool_guid: 4712723554953817788
    errata: 0
    hostname: 'localhost'
    top_guid: 10025130926649767584
    guid: 10025130926649767584
    vdev_children: 2
    vdev_tree:
        type: 'disk'
        id: 0
        guid: 10025130926649767584
        path: '/dev/nbd0'
        whole_disk: 0
        metaslab_array: 37
        metaslab_shift: 28
        ashift: 9
        asize: 32207536128
        is_log: 0
        create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
--------------------------------------------
LABEL 2
--------------------------------------------
    version: 5000
    name: 'case4'
    state: 0
    txg: 4
    pool_guid: 4712723554953817788
    errata: 0
    hostname: 'localhost'
    top_guid: 10025130926649767584
    guid: 10025130926649767584
    vdev_children: 2
    vdev_tree:
        type: 'disk'
        id: 0
        guid: 10025130926649767584
        path: '/dev/nbd0'
        whole_disk: 0
        metaslab_array: 37
        metaslab_shift: 28
        ashift: 9
        asize: 32207536128
        is_log: 0
        create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
--------------------------------------------
LABEL 3
--------------------------------------------
    version: 5000
    name: 'case4'
    state: 0
    txg: 4
    pool_guid: 4712723554953817788
    errata: 0
    hostname: 'localhost'
    top_guid: 10025130926649767584
    guid: 10025130926649767584
    vdev_children: 2
    vdev_tree:
        type: 'disk'
        id: 0
        guid: 10025130926649767584
        path: '/dev/nbd0'
        whole_disk: 0
        metaslab_array: 37
        metaslab_shift: 28
        ashift: 9
        asize: 32207536128
        is_log: 0
        create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
Label of the second vdev (partial; LABEL 1, 2, and 3 are omitted):
[root@localhost case4]# zdb -l /dev/nbd1
--------------------------------------------
LABEL 0
--------------------------------------------
    version: 5000
    name: 'case4'
    state: 0
    txg: 4
    pool_guid: 4712723554953817788
    errata: 0
    hostname: 'localhost'
    top_guid: 5318385741477692907
    guid: 5318385741477692907
    vdev_children: 2
    vdev_tree:
        type: 'disk'
        id: 1
        guid: 5318385741477692907
        path: '/dev/nbd1'
        whole_disk: 0
        metaslab_array: 34
        metaslab_shift: 29
        ashift: 9
        asize: 64419790848
        is_log: 0
        create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
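II. Use zdb -m to list a summary of every metaslab. This is simply an interpretation of the contents of metadata objects 37 and 34.
The output shows:
vdev0 (case4.1.vmdk, 30G): /dev/nbd0:
metaslab 0 (covering the 0th 256M): its allocation map is managed by metadata object 38
metaslab 23 (covering the 23rd 256M): its allocation map is managed by metadata object 39
no space has been allocated from the other metaslabs
vdev1 (case4.2.vmdk, 60G): /dev/nbd1:
metaslab 0 (covering the 0th 512M): its allocation map is managed by metadata object 36
metaslab 23 (covering the 23rd 512M): its allocation map is managed by metadata object 35
no space has been allocated from the other metaslabs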
[root@localhost case4]# zdb -m case4
Metaslabs:
vdev 0
metaslabs 119 offset spacemap free
--------------- ------------------- --------------- -------------
metaslab 0 offset 0 spacemap 38 free 256M
metaslab 1 offset 10000000 spacemap 0 free 256M
metaslab 2 offset 20000000 spacemap 0 free 256M
metaslab 3 offset 30000000 spacemap 0 free 256M
metaslab 4 offset 40000000 spacemap 0 free 256M
metaslab 5 offset 50000000 spacemap 0 free 256M
metaslab 6 offset 60000000 spacemap 0 free 256M
metaslab 7 offset 70000000 spacemap 0 free 256M
metaslab 8 offset 80000000 spacemap 0 free 256M
metaslab 9 offset 90000000 spacemap 0 free 256M
metaslab 10 offset a0000000 spacemap 0 free 256M
metaslab 11 offset b0000000 spacemap 0 free 256M
metaslab 12 offset c0000000 spacemap 0 free 256M
metaslab 13 offset d0000000 spacemap 0 free 256M
metaslab 14 offset e0000000 spacemap 0 free 256M
metaslab 15 offset f0000000 spacemap 0 free 256M
metaslab 16 offset 100000000 spacemap 0 free 256M
metaslab 17 offset 110000000 spacemap 0 free 256M
metaslab 18 offset 120000000 spacemap 0 free 256M
metaslab 19 offset 130000000 spacemap 0 free 256M
metaslab 20 offset 140000000 spacemap 0 free 256M
metaslab 21 offset 150000000 spacemap 0 free 256M
metaslab 22 offset 160000000 spacemap 0 free 256M
metaslab 23 offset 170000000 spacemap 39 free 256M
metaslab 24 offset 180000000 spacemap 0 free 256M
metaslab 25 offset 190000000 spacemap 0 free 256M
metaslab 26 offset 1a0000000 spacemap 0 free 256M
metaslab 27 offset 1b0000000 spacemap 0 free 256M
metaslab 28 offset 1c0000000 spacemap 0 free 256M
metaslab 29 offset 1d0000000 spacemap 0 free 256M
metaslab 30 offset 1e0000000 spacemap 0 free 256M
metaslab 31 offset 1f0000000 spacemap 0 free 256M
metaslab 32 offset 200000000 spacemap 0 free 256M
metaslab 33 offset 210000000 spacemap 0 free 256M
metaslab 34 offset 220000000 spacemap 0 free 256M
metaslab 35 offset 230000000 spacemap 0 free 256M
metaslab 36 offset 240000000 spacemap 0 free 256M
metaslab 37 offset 250000000 spacemap 0 free 256M
metaslab 38 offset 260000000 spacemap 0 free 256M
metaslab 39 offset 270000000 spacemap 0 free 256M
metaslab 40 offset 280000000 spacemap 0 free 256M
metaslab 41 offset 290000000 spacemap 0 free 256M
metaslab 42 offset 2a0000000 spacemap 0 free 256M
metaslab 43 offset 2b0000000 spacemap 0 free 256M
metaslab 44 offset 2c0000000 spacemap 0 free 256M
metaslab 45 offset 2d0000000 spacemap 0 free 256M
metaslab 46 offset 2e0000000 spacemap 0 free 256M
metaslab 47 offset 2f0000000 spacemap 0 free 256M
metaslab 48 offset 300000000 spacemap 0 free 256M
metaslab 49 offset 310000000 spacemap 0 free 256M
metaslab 50 offset 320000000 spacemap 0 free 256M
metaslab 51 offset 330000000 spacemap 0 free 256M
metaslab 52 offset 340000000 spacemap 0 free 256M
metaslab 53 offset 350000000 spacemap 0 free 256M
metaslab 54 offset 360000000 spacemap 0 free 256M
metaslab 55 offset 370000000 spacemap 0 free 256M
metaslab 56 offset 380000000 spacemap 0 free 256M
metaslab 57 offset 390000000 spacemap 0 free 256M
metaslab 58 offset 3a0000000 spacemap 0 free 256M
metaslab 59 offset 3b0000000 spacemap 0 free 256M
metaslab 60 offset 3c0000000 spacemap 0 free 256M
metaslab 61 offset 3d0000000 spacemap 0 free 256M
metaslab 62 offset 3e0000000 spacemap 0 free 256M
metaslab 63 offset 3f0000000 spacemap 0 free 256M
metaslab 64 offset 400000000 spacemap 0 free 256M
metaslab 65 offset 410000000 spacemap 0 free 256M
metaslab 66 offset 420000000 spacemap 0 free 256M
metaslab 67 offset 430000000 spacemap 0 free 256M
metaslab 68 offset 440000000 spacemap 0 free 256M
metaslab 69 offset 450000000 spacemap 0 free 256M
metaslab 70 offset 460000000 spacemap 0 free 256M
metaslab 71 offset 470000000 spacemap 0 free 256M
metaslab 72 offset 480000000 spacemap 0 free 256M
metaslab 73 offset 490000000 spacemap 0 free 256M
metaslab 74 offset 4a0000000 spacemap 0 free 256M
metaslab 75 offset 4b0000000 spacemap 0 free 256M
metaslab 76 offset 4c0000000 spacemap 0 free 256M
metaslab 77 offset 4d0000000 spacemap 0 free 256M
metaslab 78 offset 4e0000000 spacemap 0 free 256M
metaslab 79 offset 4f0000000 spacemap 0 free 256M
metaslab 80 offset 500000000 spacemap 0 free 256M
metaslab 81 offset 510000000 spacemap 0 free 256M
metaslab 82 offset 520000000 spacemap 0 free 256M
metaslab 83 offset 530000000 spacemap 0 free 256M
metaslab 84 offset 540000000 spacemap 0 free 256M
metaslab 85 offset 550000000 spacemap 0 free 256M
metaslab 86 offset 560000000 spacemap 0 free 256M
metaslab 87 offset 570000000 spacemap 0 free 256M
metaslab 88 offset 580000000 spacemap 0 free 256M
metaslab 89 offset 590000000 spacemap 0 free 256M
metaslab 90 offset 5a0000000 spacemap 0 free 256M
metaslab 91 offset 5b0000000 spacemap 0 free 256M
metaslab 92 offset 5c0000000 spacemap 0 free 256M
metaslab 93 offset 5d0000000 spacemap 0 free 256M
metaslab 94 offset 5e0000000 spacemap 0 free 256M
metaslab 95 offset 5f0000000 spacemap 0 free 256M
metaslab 96 offset 600000000 spacemap 0 free 256M
metaslab 97 offset 610000000 spacemap 0 free 256M
metaslab 98 offset 620000000 spacemap 0 free 256M
metaslab 99 offset 630000000 spacemap 0 free 256M
metaslab 100 offset 640000000 spacemap 0 free 256M
metaslab 101 offset 650000000 spacemap 0 free 256M
metaslab 102 offset 660000000 spacemap 0 free 256M
metaslab 103 offset 670000000 spacemap 0 free 256M
metaslab 104 offset 680000000 spacemap 0 free 256M
metaslab 105 offset 690000000 spacemap 0 free 256M
metaslab 106 offset 6a0000000 spacemap 0 free 256M
metaslab 107 offset 6b0000000 spacemap 0 free 256M
metaslab 108 offset 6c0000000 spacemap 0 free 256M
metaslab 109 offset 6d0000000 spacemap 0 free 256M
metaslab 110 offset 6e0000000 spacemap 0 free 256M
metaslab 111 offset 6f0000000 spacemap 0 free 256M
metaslab 112 offset 700000000 spacemap 0 free 256M
metaslab 113 offset 710000000 spacemap 0 free 256M
metaslab 114 offset 720000000 spacemap 0 free 256M
metaslab 115 offset 730000000 spacemap 0 free 256M
metaslab 116 offset 740000000 spacemap 0 free 256M
metaslab 117 offset 750000000 spacemap 0 free 256M
metaslab 118 offset 760000000 spacemap 0 free 256M
vdev 1
metaslabs 119 offset spacemap free
--------------- ------------------- --------------- -------------
metaslab 0 offset 0 spacemap 36 free 511M
metaslab 1 offset 20000000 spacemap 0 free 512M
metaslab 2 offset 40000000 spacemap 0 free 512M
metaslab 3 offset 60000000 spacemap 0 free 512M
metaslab 4 offset 80000000 spacemap 0 free 512M
metaslab 5 offset a0000000 spacemap 0 free 512M
metaslab 6 offset c0000000 spacemap 0 free 512M
metaslab 7 offset e0000000 spacemap 0 free 512M
metaslab 8 offset 100000000 spacemap 0 free 512M
metaslab 9 offset 120000000 spacemap 0 free 512M
metaslab 10 offset 140000000 spacemap 0 free 512M
metaslab 11 offset 160000000 spacemap 0 free 512M
metaslab 12 offset 180000000 spacemap 0 free 512M
metaslab 13 offset 1a0000000 spacemap 0 free 512M
metaslab 14 offset 1c0000000 spacemap 0 free 512M
metaslab 15 offset 1e0000000 spacemap 0 free 512M
metaslab 16 offset 200000000 spacemap 0 free 512M
metaslab 17 offset 220000000 spacemap 0 free 512M
metaslab 18 offset 240000000 spacemap 0 free 512M
metaslab 19 offset 260000000 spacemap 0 free 512M
metaslab 20 offset 280000000 spacemap 0 free 512M
metaslab 21 offset 2a0000000 spacemap 0 free 512M
metaslab 22 offset 2c0000000 spacemap 0 free 512M
metaslab 23 offset 2e0000000 spacemap 35 free 512M
metaslab 24 offset 300000000 spacemap 0 free 512M
metaslab 25 offset 320000000 spacemap 0 free 512M
metaslab 26 offset 340000000 spacemap 0 free 512M
metaslab 27 offset 360000000 spacemap 0 free 512M
metaslab 28 offset 380000000 spacemap 0 free 512M
metaslab 29 offset 3a0000000 spacemap 0 free 512M
metaslab 30 offset 3c0000000 spacemap 0 free 512M
metaslab 31 offset 3e0000000 spacemap 0 free 512M
metaslab 32 offset 400000000 spacemap 0 free 512M
metaslab 33 offset 420000000 spacemap 0 free 512M
metaslab 34 offset 440000000 spacemap 0 free 512M
metaslab 35 offset 460000000 spacemap 0 free 512M
metaslab 36 offset 480000000 spacemap 0 free 512M
metaslab 37 offset 4a0000000 spacemap 0 free 512M
metaslab 38 offset 4c0000000 spacemap 0 free 512M
metaslab 39 offset 4e0000000 spacemap 0 free 512M
metaslab 40 offset 500000000 spacemap 0 free 512M
metaslab 41 offset 520000000 spacemap 0 free 512M
metaslab 42 offset 540000000 spacemap 0 free 512M
metaslab 43 offset 560000000 spacemap 0 free 512M
metaslab 44 offset 580000000 spacemap 0 free 512M
metaslab 45 offset 5a0000000 spacemap 0 free 512M
metaslab 46 offset 5c0000000 spacemap 0 free 512M
metaslab 47 offset 5e0000000 spacemap 0 free 512M
metaslab 48 offset 600000000 spacemap 0 free 512M
metaslab 49 offset 620000000 spacemap 0 free 512M
metaslab 50 offset 640000000 spacemap 0 free 512M
metaslab 51 offset 660000000 spacemap 0 free 512M
metaslab 52 offset 680000000 spacemap 0 free 512M
metaslab 53 offset 6a0000000 spacemap 0 free 512M
metaslab 54 offset 6c0000000 spacemap 0 free 512M
metaslab 55 offset 6e0000000 spacemap 0 free 512M
metaslab 56 offset 700000000 spacemap 0 free 512M
metaslab 57 offset 720000000 spacemap 0 free 512M
metaslab 58 offset 740000000 spacemap 0 free 512M
metaslab 59 offset 760000000 spacemap 0 free 512M
metaslab 60 offset 780000000 spacemap 0 free 512M
metaslab 61 offset 7a0000000 spacemap 0 free 512M
metaslab 62 offset 7c0000000 spacemap 0 free 512M
metaslab 63 offset 7e0000000 spacemap 0 free 512M
metaslab 64 offset 800000000 spacemap 0 free 512M
metaslab 65 offset 820000000 spacemap 0 free 512M
metaslab 66 offset 840000000 spacemap 0 free 512M
metaslab 67 offset 860000000 spacemap 0 free 512M
metaslab 68 offset 880000000 spacemap 0 free 512M
metaslab 69 offset 8a0000000 spacemap 0 free 512M
metaslab 70 offset 8c0000000 spacemap 0 free 512M
metaslab 71 offset 8e0000000 spacemap 0 free 512M
metaslab 72 offset 900000000 spacemap 0 free 512M
metaslab 73 offset 920000000 spacemap 0 free 512M
metaslab 74 offset 940000000 spacemap 0 free 512M
metaslab 75 offset 960000000 spacemap 0 free 512M
metaslab 76 offset 980000000 spacemap 0 free 512M
metaslab 77 offset 9a0000000 spacemap 0 free 512M
metaslab 78 offset 9c0000000 spacemap 0 free 512M
metaslab 79 offset 9e0000000 spacemap 0 free 512M
metaslab 80 offset a00000000 spacemap 0 free 512M
metaslab 81 offset a20000000 spacemap 0 free 512M
metaslab 82 offset a40000000 spacemap 0 free 512M
metaslab 83 offset a60000000 spacemap 0 free 512M
metaslab 84 offset a80000000 spacemap 0 free 512M
metaslab 85 offset aa0000000 spacemap 0 free 512M
metaslab 86 offset ac0000000 spacemap 0 free 512M
metaslab 87 offset ae0000000 spacemap 0 free 512M
metaslab 88 offset b00000000 spacemap 0 free 512M
metaslab 89 offset b20000000 spacemap 0 free 512M
metaslab 90 offset b40000000 spacemap 0 free 512M
metaslab 91 offset b60000000 spacemap 0 free 512M
metaslab 92 offset b80000000 spacemap 0 free 512M
metaslab 93 offset ba0000000 spacemap 0 free 512M
metaslab 94 offset bc0000000 spacemap 0 free 512M
metaslab 95 offset be0000000 spacemap 0 free 512M
metaslab 96 offset c00000000 spacemap 0 free 512M
metaslab 97 offset c20000000 spacemap 0 free 512M
metaslab 98 offset c40000000 spacemap 0 free 512M
metaslab 99 offset c60000000 spacemap 0 free 512M
metaslab 100 offset c80000000 spacemap 0 free 512M
metaslab 101 offset ca0000000 spacemap 0 free 512M
metaslab 102 offset cc0000000 spacemap 0 free 512M
metaslab 103 offset ce0000000 spacemap 0 free 512M
metaslab 104 offset d00000000 spacemap 0 free 512M
metaslab 105 offset d20000000 spacemap 0 free 512M
metaslab 106 offset d40000000 spacemap 0 free 512M
metaslab 107 offset d60000000 spacemap 0 free 512M
metaslab 108 offset d80000000 spacemap 0 free 512M
metaslab 109 offset da0000000 spacemap 0 free 512M
metaslab 110 offset dc0000000 spacemap 0 free 512M
metaslab 111 offset de0000000 spacemap 0 free 512M
metaslab 112 offset e00000000 spacemap 0 free 512M
metaslab 113 offset e20000000 spacemap 0 free 512M
metaslab 114 offset e40000000 spacemap 0 free 512M
metaslab 115 offset e60000000 spacemap 0 free 512M
metaslab 116 offset e80000000 spacemap 0 free 512M
metaslab 117 offset ea0000000 spacemap 0 free 512M
metaslab 118 offset ec0000000 spacemap 0 free 512M
Taking vdev0 as an example, read metadata object 37 to verify this:
First parse the dnode of object 37. The dnode shows that its data block (DVA) is at 1:13f000:200:
[root@localhost case4]# zdb -ddddd case4 37
Dataset mos [META], ID 0, cr_txg 4, 759K, 39 objects, rootbp DVA[0]=<1:13d600:800> DVA[1]=<0:9d800:800> DVA[2]=<1:2e0049c00:800> [L0 DMU objset] fletcher4 uncompressed LE contiguous unique triple size=800L/800P birth=16L/16P fill=39 cksum=6d0c48364:c1aa77fe98b:ac992e22f766b:66eb89954167a31
    Object  lvl   iblk   dblk  dsize  lsize   %full  type
        37    1    16K    512  1.50K    512  100.00  object array
        dnode flags: USED_BYTES
        dnode maxblkid: 0
Indirect blocks:
        0 L0 1:13f000:200 200L/200P F=1 B=16/16
        segment [0000000000000000, 0000000000000200) size 512
Read the DVA region that its Indirect blocks entry points to:
[root@localhost case4]# zdb -R case4 1:13f000:200
Found vdev: /dev/nbd1
1:13f000:200
         0 1 2 3 4 5 6 7  8 9 a b c d e f  0123456789abcdef
000000:  0000000000000026 0000000000000000  &...............
000010:  0000000000000000 0000000000000000  ................
000020:  0000000000000000 0000000000000000  ................
000030:  0000000000000000 0000000000000000  ................
000040:  0000000000000000 0000000000000000  ................
000050:  0000000000000000 0000000000000000  ................
000060:  0000000000000000 0000000000000000  ................
000070:  0000000000000000 0000000000000000  ................
000080:  0000000000000000 0000000000000000  ................
000090:  0000000000000000 0000000000000000  ................
0000a0:  0000000000000000 0000000000000000  ................
0000b0:  0000000000000000 0000000000000027  ........'.......
0000c0:  0000000000000000 0000000000000000  ................
0000d0:  0000000000000000 0000000000000000  ................
0000e0:  0000000000000000 0000000000000000  ................
0000f0:  0000000000000000 0000000000000000  ................
000100:  0000000000000000 0000000000000000  ................
000110:  0000000000000000 0000000000000000  ................
000120:  0000000000000000 0000000000000000  ................
000130:  0000000000000000 0000000000000000  ................
000140:  0000000000000000 0000000000000000  ................
000150:  0000000000000000 0000000000000000  ................
000160:  0000000000000000 0000000000000000  ................
000170:  0000000000000000 0000000000000000  ................
000180:  0000000000000000 0000000000000000  ................
000190:  0000000000000000 0000000000000000  ................
0001a0:  0000000000000000 0000000000000000  ................
0001b0:  0000000000000000 0000000000000000  ................
0001c0:  0000000000000000 0000000000000000  ................
0001d0:  0000000000000000 0000000000000000  ................
0001e0:  0000000000000000 0000000000000000  ................
0001f0:  0000000000000000 0000000000000000  ................
The dump is a plain array of 64-bit integers: metaslab[0] = 0x26 (38) and metaslab[23] = 0x27 (39). This is the metaslab table of vdev0.
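Reading such an object array programmatically can be sketched as below. The block is rebuilt here from the two non-zero slots of the dump rather than read from disk, little-endian order is taken from the "LE" flag in the block pointer, and the helper name `parse_object_array` is made up for illustration.

```python
import struct

def parse_object_array(block):
    """Map metaslab index -> space map object id, skipping empty (0) slots."""
    count = len(block) // 8
    values = struct.unpack("<%dQ" % count, block)   # little-endian uint64 array
    return {i: obj for i, obj in enumerate(values) if obj != 0}

# Rebuild the 512-byte block from the dump: only slots 0 and 23 are non-zero.
words = [0] * 64
words[0] = 0x26
words[23] = 0x27
block = struct.pack("<64Q", *words)

print(parse_object_array(block))   # {0: 38, 23: 39}
```

Slot 23 sits at byte offset 23 * 8 = 0xb8, which is the second word of the 0000b0 row in the dump.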
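III. Read the space map
Taking vdev0 as an example, read the contents of metaslab[0] (text after "//" or "means:" is explanatory annotation, not zdb output):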
[root@localhost case4]# zdb -mmmm case4 /dev/nbd0 0
Metaslabs:
vdev 0
metaslabs 119 offset spacemap free
--------------- ------------------- --------------- -------------
metaslab 0 offset 0 spacemap 38 free 256M
segments 8 maxsize 255M freepct 99%
In-memory histogram:    // summary of free segments held in memory; not reflected on disk
 9: 1 *
10: 0
11: 1 *
12: 1 *
13: 1 *
14: 1 *
15: 1 *
16: 0
17: 1 *
18: 0
19: 0
20: 0
21: 0
22: 0
23: 0
24: 0
25: 0
26: 0
27: 1 *
On-disk histogram: fragmentation 0    // summary of free segments stored on disk
11: 1 *    // one contiguous free segment in the 2^11 bucket
12: 0
13: 0
14: 0
15: 0
16: 0
17: 0
18: 0
19: 0
20: 0
21: 0
22: 0
23: 0
24: 0
25: 0
26: 0
27: 1 *    // one contiguous free segment in the 2^27 bucket
[ 0] ALLOC: txg 4, pass 1    means: a debug record, allocation, txg 4, sync pass 1
[ 1] A range: 0000000000-000003d200 size: 03d200    means: an allocation record, <range>, <size>
[ 2] FREE: txg 4, pass 1    means: a debug record, free, txg 4, sync pass 1
[ 3] F range: 0000024e00-0000025600 size: 000800    means: a free record, <range>, <size>; the same applies below
[ 4] ALLOC: txg 4, pass 2
[ 5] A range: 000003d200-0000040600 size: 003400
[ 6] ALLOC: txg 5, pass 2
[ 7] A range: 0000040600-0000068e00 size: 028800
[ 8] ALLOC: txg 5, pass 3
[ 9] A range: 0000068e00-000006fe00 size: 007000
[ 10] ALLOC: txg 16, pass 1
[ 11] A range: 000006fe00-000009e000 size: 02e200
[ 12] FREE: txg 16, pass 1
[ 13] F range: 0000000600-0000000800 size: 000200
[ 14] F range: 0000000e00-0000024e00 size: 024000
[ 15] F range: 0000025600-0000026200 size: 000c00
[ 16] F range: 0000064600-0000068e00 size: 004800
[ 17] F range: 0000098400-0000098c00 size: 000800
[ 18] ALLOC: txg 16, pass 2
[ 19] A range: 000009e000-00000a6400 size: 008400
[ 20] FREE: txg 16, pass 2
[ 21] F range: 0000034a00-000003d200 size: 008800
[ 22] F range: 000003d400-000003f400 size: 002000
[ 23] F range: 000003f600-0000040600 size: 001000
[ 24] ALLOC: txg 16, pass 3
[ 25] A range: 00000a6400-00000aa400 size: 004000
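The data above has two parts: the histogram of free segments, stored in the bonus area of the dnode, and the space map itself, stored in the data blocks of the corresponding object.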
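Replaying the A/F records from the zdb output above can be sketched in a few lines. This is a simplified illustration, not ZFS code: ZFS keeps the ranges in an in-memory tree, while this sketch uses a plain list. The names `replay` and `free_histogram` are made up; the ranges and the 256M metaslab size are copied from the output.

```python
METASLAB_SIZE = 1 << 28   # 256M, from metaslab_shift 28

# (kind, start, end) copied from records [1]..[25] of the zdb output
LOG = [
    ("A", 0x000000, 0x03d200), ("F", 0x024e00, 0x025600),
    ("A", 0x03d200, 0x040600), ("A", 0x040600, 0x068e00),
    ("A", 0x068e00, 0x06fe00), ("A", 0x06fe00, 0x09e000),
    ("F", 0x000600, 0x000800), ("F", 0x000e00, 0x024e00),
    ("F", 0x025600, 0x026200), ("F", 0x064600, 0x068e00),
    ("F", 0x098400, 0x098c00), ("A", 0x09e000, 0x0a6400),
    ("F", 0x034a00, 0x03d200), ("F", 0x03d400, 0x03f400),
    ("F", 0x03f600, 0x040600), ("A", 0x0a6400, 0x0aa400),
]

def replay(log):
    """Apply the records in order; return merged allocated (start, end) ranges."""
    allocated = []
    for kind, start, end in log:
        if kind == "A":
            allocated.append((start, end))
            continue
        kept = []
        for s, e in allocated:            # carve [start, end) out of each range
            if e <= start or s >= end:
                kept.append((s, e))
            else:
                if s < start:
                    kept.append((s, start))
                if e > end:
                    kept.append((end, e))
        allocated = kept
    merged = []
    for s, e in sorted(allocated):        # merge ranges that touch
        if merged and merged[-1][1] >= s:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged

def free_histogram(allocated, size):
    """Count free gaps by power-of-two bucket: 2^b <= gap length < 2^(b+1)."""
    hist, pos = {}, 0
    for s, e in allocated + [(size, size)]:
        if s > pos:
            bucket = (s - pos).bit_length() - 1
            hist[bucket] = hist.get(bucket, 0) + 1
        pos = e
    return hist

ranges = replay(LOG)
print(hex(sum(e - s for s, e in ranges)))               # bytes still allocated
print(sorted(free_histogram(ranges, METASLAB_SIZE).items()))
```

The replay leaves eight free gaps whose buckets (9, 11, 12, 13, 14, 15, 17, 27) line up with "segments 8" and the in-memory histogram printed by zdb.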
Verify with zdb:
[root@localhost case4]# zdb -uuu case4
Uberblock:
    magic = 0000000000bab10c
    version = 5000
    txg = 16
    guid_sum = 1609496149371726663
    timestamp = 1476716014 UTC = Mon Oct 17 10:53:34 2016
    rootbp = DVA[0]=<1:13d600:800> DVA[1]=<0:9d800:800> DVA[2]=<1:2e0049c00:800> [L0 DMU objset] fletcher4 uncompressed LE contiguous unique triple size=800L/800P birth=16L/16P fill=39 cksum=6d0c48364:c1aa77fe98b:ac992e22f766b:66eb89954167a31
This gives the blkptr_t of the objset_phys_t for the metadata object set, DVA: <1:13d600:800>
Read its contents:
[root@localhost case4]# zdb -R case4 1:13d600:800
Found vdev: /dev/nbd1
1:13d600:800
         0 1 2 3 4 5 6 7  8 9 a b c d e f  0123456789abcdef
000000:  0100000003010e0a 0000000000000020  ........ .......
000010:  0000000000000001 0000000000018000  ................
000020:  0000000000000000 0000000000000000  ................
000030:  0000000000000000 0000000000000000  ................
000040:  0000000000000020 00000000000004cc   ...............
000050:  0000000100000020 00000000000009cb   ...............
000060:  0000000000000020 0000000000b80000   ...............
000070:  800a0702001f001f 0000000000000000  ................
000080:  0000000000000000 0000000000000000  ................
000090:  0000000000000010 000000000000001f  ................
0000a0:  0000003fc89656a4 00018a7e56e571ed  .V..?....q.V~...
0000b0:  076504429388c8ac f58fd3b44082deb5  ....B.e....@....
0000c0:  0000000100000020 0000000000000a11   ...............
0000d0:  0000000000000020 0000000000000512   ...............
0000e0:  0000000100000020 0000000001700274   .......t.p.....
0000f0:  800a0702001f001f 0000000000000000  ................
000100:  0000000000000000 0000000000000000  ................
000110:  0000000000000010 0000000000000008  ................
000120:  0000000fc467e7bb 0000deee52adaf6b  ..g.....k..R....
000130:  0635fd0f5dc8e276 c90e402d95a65d44  v..]..5.D]..-@..
000140:  0000000000000000 0000000000000000  ................
... (the rest is all zeros and is omitted)
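This shows that the MOS dnode array consists of two blocks of 32 sectors each, located at <0:4cc sectors:32 sectors> and <1:a11 sectors:32 sectors>.
Each dnode is 512 bytes, so each block holds 32 dnodes, and dnode 0x26 (38) is entry 6 of the second block, at <1:a11 sectors + 6 sectors:1 sector>. Converted to a standard byte-based DVA, that is <1:142e00:200>. Read it with: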
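The sector arithmetic above can be written out as a short sketch. The block start sectors are the two values quoted in the text; the function name `dnode_dva` and the constants are made up for illustration and assume 512-byte sectors (ashift 9) and 512-byte dnodes.

```python
SECTOR = 512          # ashift 9 in the labels
DNODE_SIZE = 512      # dnode size stated in the text
BLOCK_SECTORS = 32    # each MOS dnode block is 32 sectors (16K)

def dnode_dva(obj_id, block_starts):
    """block_starts: list of (vdev, start_sector), one entry per dnode block."""
    per_block = BLOCK_SECTORS * SECTOR // DNODE_SIZE        # 32 dnodes per block
    vdev, start = block_starts[obj_id // per_block]         # which block
    sector = start + (obj_id % per_block) * DNODE_SIZE // SECTOR
    return vdev, sector * SECTOR, DNODE_SIZE

blocks = [(0, 0x4cc), (1, 0xa11)]
vdev, offset, size = dnode_dva(38, blocks)
print("%d:%x:%x" % (vdev, offset, size))   # 1:142e00:200
```

Object 38 falls in the second block because 38 // 32 = 1, and it is entry 38 % 32 = 6 within that block.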
[root@localhost case4]# zdb -R case4 1:142e00:200
Found vdev: /dev/nbd1
1:142e00:200
         0 1 2 3 4 5 6 7  8 9 a b c d e f  0123456789abcdef
000000:  0100000701010e08 0000000001400008  ..........@.....
000010:  0000000000000000 0000000000003000  .........0......
000020:  0000000000000000 0000000000000000  ................
000030:  0000000000000000 0000000000000000  ................
000040:  0000000100000008 00000000000009f9  ................
000050:  0000000000000008 00000000000004fa  ................
000060:  0000000100000008 000000000170025c  ........\.p.....
000070:  8008070200070007 0000000000000000  ................
000080:  0000000000000000 0000000000000000  ................
000090:  0000000000000010 0000000000000001  ................
0000a0:  0000000554eb8757 000014dd9a66d378  W..T....x.f.....
0000b0:  0028e1062768aab4 35752d75087c35c8  ..h'..(..5|.u-u5
0000c0:  0000000000000026 00000000000000d0  &...............
0000d0:  0000000000074600 0000000000000000  .F..............
0000e0:  0000000000000000 0000000000000000  ................
0000f0:  0000000000000000 0000000000000000  ................
000100:  0000000000000000 0000000000000000  ................
000110:  0000000000000001 0000000000000000  ................
000120:  0000000000000000 0000000000000000  ................
000130:  0000000000000000 0000000000000000  ................
000140:  0000000000000000 0000000000000000  ................
000150:  0000000000000000 0000000000000000  ................
000160:  0000000000000000 0000000000000000  ................
000170:  0000000000000000 0000000000000000  ................
000180:  0000000000000000 0000000000000000  ................
000190:  0000000000000001 0000000000000000  ................
0001a0:  0000000000000000 0000000000000000  ................
0001b0:  0000000000000000 0000000000000000  ................
0001c0:  0000000000000000 0000000000000000  ................
0001d0:  0000000000000000 0000000000000000  ................
0001e0:  0000000000000000 0000000000000000  ................
0001f0:  0000000000000000 0000000000000000  ................
The corresponding source structure for the space map bonus:
typedef struct space_map_phys {
	uint64_t	smp_object;	/* on-disk space map object */
	uint64_t	smp_objsize;	/* size of the object */
	uint64_t	smp_alloc;	/* space allocated from the map */
	uint64_t	smp_pad[5];	/* reserved */
	/*
	 * The smp_histogram maintains a histogram of free regions. Each
	 * bucket, smp_histogram[i], contains the number of free regions
	 * whose size is:
	 * 2^(i+sm_shift) <= size of free region in bytes < 2^(i+sm_shift+1)
	 */
	uint64_t	smp_histogram[SPACE_MAP_HISTOGRAM_SIZE];
} space_map_phys_t;
/*
Mapped onto offset 0xC0 of the dump above:
typedef struct space_map_phys {
	uint64_t object ID: 0x26;
	uint64_t bytes used by this object: 0xD0;
	uint64_t space allocated in this metaslab: 0x74600;
	uint64_t smp_pad[5]; // reserved
	uint64_t smp_histogram[SPACE_MAP_HISTOGRAM_SIZE];
	// array of 64-bit counters: number of free segments in the 2^9 bucket, the 2^10 bucket, ...
	// the 1 at 0x110 means one contiguous free segment in the 2^11 bucket
	// the 1 at 0x190 means one contiguous free segment in the 2^27 bucket
	// no free segments of any other size
} space_map_phys_t;
*/
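Unpacking this structure from the bonus bytes can be sketched as below. The bonus buffer is rebuilt from the non-zero words in the dump; `HIST_BUCKETS = 32` and `SM_SHIFT = 9` are assumptions made for this sketch (32 buckets gives 64 + 32 * 8 = 320 bytes, and the shift matches the 512-byte sectors used here), and `parse_space_map_phys` is a made-up helper name.

```python
import struct

HIST_BUCKETS = 32   # assumed value of SPACE_MAP_HISTOGRAM_SIZE
SM_SHIFT = 9        # assumed sm_shift: bucket 0 covers 2^9-byte segments

def parse_space_map_phys(bonus):
    """Decode smp_object, smp_objsize, smp_alloc and the free histogram."""
    head = struct.unpack_from("<8Q", bonus, 0)   # object, objsize, alloc, pad[5]
    hist = struct.unpack_from("<%dQ" % HIST_BUCKETS, bonus, 64)
    return {
        "object": head[0],
        "objsize": head[1],
        "alloc": head[2],
        # key is the power of two of the bucket, value is the segment count
        "histogram": {i + SM_SHIFT: n for i, n in enumerate(hist) if n},
    }

# Rebuild the bonus area (dump offsets 0xc0 onward) from its non-zero words.
words = [0x26, 0xd0, 0x74600] + [0] * 5 + [0] * HIST_BUCKETS
words[8 + 2] = 1     # dump offset 0x110: histogram[2]
words[8 + 18] = 1    # dump offset 0x190: histogram[18]
bonus = struct.pack("<%dQ" % len(words), *words)

print(parse_space_map_phys(bonus))
```

With the bonus starting at 0xc0, the histogram starts at 0x100, so offsets 0x110 and 0x190 are indices 2 and 18, which correspond to the 2^11 and 2^27 buckets.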
That was the summary table of free segments. Now look at the space map (the allocation/free ledger):
Use zdb to read the contents of object 38:
[root@localhost case4]# zdb -ddddd case4 38
Dataset mos [META], ID 0, cr_txg 4, 759K, 39 objects, rootbp DVA[0]=<1:13d600:800> DVA[1]=<0:9d800:800> DVA[2]=<1:2e0049c00:800> [L0 DMU objset] fletcher4 uncompressed LE contiguous unique triple size=800L/800P birth=16L/16P fill=39 cksum=6d0c48364:c1aa77fe98b:ac992e22f766b:66eb89954167a31
    Object  lvl   iblk   dblk  dsize  lsize   %full  type
        38    1    16K     4K  12.0K     4K  100.00  SPA space map
                                        320   bonus  SPA space map header
        dnode flags: USED_BYTES
        dnode maxblkid: 0
Indirect blocks:
        0 L0 1:13f200:1000 1000L/1000P F=1 B=16/16
        segment [0000000000000000, 0000000000001000) size 4K
Its data is at <1:13f200:1000>, and its real size is 0xD0, as shown in space_map_phys above. Reading the first 0x200 bytes gives:
[root@localhost case4]# zdb -R case4 1:13f200:200
Found vdev: /dev/nbd1
1:13f200:200
         0 1 2 3 4 5 6 7  8 9 a b c d e f  0123456789abcdef
000000:  8004000000000004 00000000000001e8  ................
000010:  9004000000000004 0000000001278003  ..........'.....
000020:  8008000000000004 0000000001e90019  ................
000030:  8008000000000005 0000000002030143  ........C.......
000040:  800c000000000005 0000000003470037  ........7.G.....
000050:  8004000000000010 00000000037f0170  ........p.......
000060:  9004000000000010 0000000000038000  ................
000070:  000000000007811f 00000000012b8005  ..........+.....
000080:  0000000003238023 0000000004c28003  #.#.............
000090:  8008000000000010 0000000004f00041  ........A.......
0000a0:  9008000000000010 0000000001a58043  ........C.......
0000b0:  0000000001ea800f 0000000001fb8007  ................
0000c0:  800c000000000010 000000000532001f  ..........2.....
(the rest is omitted; it is unused data and reads as all zeros)
This is an array of 64-bit integers. Each 64-bit value is one space map entry, that is, one ledger record. The decoding rules come from the source code:
/*
 * debug entry
 *
 *    1      3         10                     50
 *  ,---+--------+------------+---------------------------------.
 *  | 1 | action |  syncpass  |        txg (lower bits)         |
 *  `---+--------+------------+---------------------------------'
 *   63  62    60 59        50 49                               0
 *
 *
 * non-debug entry
 *
 *    1               47                   1           15
 *  ,-----------------------------------------------------------.
 *  | 0 |   offset (sm_shift units)    | type |       run       |
 *  `-----------------------------------------------------------'
 *   63  62                          17   16   15               0
 */
typedef enum {
	SM_ALLOC,
	SM_FREE
} maptype_t;
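Interpretation (this matches the zdb output shown earlier):
Notes:
offset: the position, in units of sectors
run: the size; add 1 and multiply by the sector size to get the actual number of bytes
Entry 1: 0x8004000000000004
The top bit is 1, so this is a debug entry; action is 0, meaning allocation; the sync pass is 1 and the txg is 4
Entry 2: 0x00000000000001e8
The top bit is 0, so this is a non-debug entry; type is 0, meaning allocation; the offset is 0 and the run value is 0x1E8, that is (0x1E8+1)*512 bytes = 0x03d200 bytes
Entry 3: 0x9004000000000004
The top bit is 1, so this is a debug entry; action is 1, meaning free; the sync pass is 1 and the txg is 4
Entry 4: 0x0000000001278003
The top bit is 0, so this is a non-debug entry; type is 1, meaning free (in this dump the type flag shows up as the 0x8000 bit of the low 16 bits); the offset is 0x127, that is 0x127*512 = 0x24E00; the run value is 3, meaning 4*512 = 0x800 bytes
Replaying these space map entries in order reconstructs the space usage map of this metaslab.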
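The decoding rules above can be sketched as a small Python function applied to the 26 words from the dump. The bit positions used here (run in bits 0-14, type in bit 15, offset from bit 16; action, sync pass and txg as in the debug layout) are the ones that reproduce the zdb listing for this data; the function name `decode_entry` and the tuple layout are made up for illustration.

```python
SECTOR = 512   # offsets and runs are in 512-byte units here (ashift 9)

def decode_entry(word):
    """Decode one 64-bit space map entry into a readable tuple."""
    if word >> 63:                                    # debug entry
        action = "FREE" if (word >> 60) & 0x7 else "ALLOC"
        syncpass = (word >> 50) & 0x3ff
        txg = word & ((1 << 50) - 1)
        return ("debug", action, txg, syncpass)
    kind = "F" if (word >> 15) & 1 else "A"           # type flag
    start = ((word >> 16) & ((1 << 47) - 1)) * SECTOR
    length = ((word & 0x7fff) + 1) * SECTOR           # run is stored minus 1
    return (kind, start, length)

WORDS = [
    0x8004000000000004, 0x00000000000001e8, 0x9004000000000004, 0x0000000001278003,
    0x8008000000000004, 0x0000000001e90019, 0x8008000000000005, 0x0000000002030143,
    0x800c000000000005, 0x0000000003470037, 0x8004000000000010, 0x00000000037f0170,
    0x9004000000000010, 0x0000000000038000, 0x000000000007811f, 0x00000000012b8005,
    0x0000000003238023, 0x0000000004c28003, 0x8008000000000010, 0x0000000004f00041,
    0x9008000000000010, 0x0000000001a58043, 0x0000000001ea800f, 0x0000000001fb8007,
    0x800c000000000010, 0x000000000532001f,
]

entries = [decode_entry(w) for w in WORDS]
for i, entry in enumerate(entries):
    print(i, entry)

# Net allocation: bytes allocated minus bytes freed across all records.
net = sum(e[2] if e[0] == "A" else -e[2] for e in entries if e[0] != "debug")
print(hex(net))
```

The 26 words occupy 26 * 8 = 0xD0 bytes, which matches smp_objsize, and the net allocation computed from them can be compared with the smp_alloc value in the bonus.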