Source code and NVMe specification versions

  • SPDK : spdk-17.07.1
  • DPDK : dpdk-17.08
  • NVMe Spec: 1.2.1

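Basic analysis method

  • 01 - Download spdk-17.07.1.tar.gz from the official site http://www.spdk.io/
  • 02 - Download dpdk-17.08.tar.xz from the official site http://www.dpdk.org/
  • 03 - Create the directory nvme/src, extract spdk-17.07.1.tar.gz and dpdk-17.08.tar.xz into it, then use OpenGrok to build a browsable web view of the source tree
  • 04 - Read the SPDK NVMe driver source code, using NVMeDirect and the Linux kernel NVMe driver as references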

1. How to identify an NVMe SSD

An NVMe SSD is a PCIe device, so how is a device of this type identified? There are two methods.

Method 1: by Device ID + Vendor ID

Method 2: by Class Code

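The PCI ID table of the Linux kernel NVMe driver contains entries of the first kind (specific Vendor ID/Device ID pairs, mostly to apply per-device quirks) alongside a Class Code entry. SPDK uses the second method. Here is the code: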

  • src/spdk-17.07.1/include/spdk/pci_ids.h
53  * PCI class code for NVMe devices.
54  *
55  * Base class code 01h: mass storage
56  * Subclass code 08h: non-volatile memory
57  * Programming interface 02h: NVM Express
58  */
59 #define SPDK_PCI_CLASS_NVME          0x010802

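The NVMe Specification defines the Class Code (0x010802) in the PCI header as three one-byte fields: Base Class Code 01h (mass storage controller), Sub Class Code 08h (non-volatile memory controller), and Programming Interface 02h (NVM Express).

[Figure: Class Code register definition from the NVMe Specification]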
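The three fields of the class code can be pulled apart with shifts and masks. The following is a minimal sketch, not SPDK code: the struct and both helper functions are hypothetical names introduced for illustration, and only the value 0x010802 comes from pci_ids.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same value as SPDK_PCI_CLASS_NVME in pci_ids.h. */
#define NVME_CLASS_CODE 0x010802u

/* Hypothetical helper type: the three bytes of a PCI class code. */
struct pci_class_code {
	uint8_t base_class;	/* bits 23:16, 01h = mass storage */
	uint8_t sub_class;	/* bits 15:8,  08h = non-volatile memory */
	uint8_t prog_if;	/* bits 7:0,   02h = NVM Express */
};

/* Split a 24-bit class code into its three one-byte fields. */
static struct pci_class_code
pci_class_code_split(uint32_t class_code)
{
	struct pci_class_code cc;

	cc.base_class = (uint8_t)((class_code >> 16) & 0xff);
	cc.sub_class = (uint8_t)((class_code >> 8) & 0xff);
	cc.prog_if = (uint8_t)(class_code & 0xff);
	return cc;
}

/* True when the low 24 bits equal the NVMe class code. */
static bool
pci_class_code_is_nvme(uint32_t class_code)
{
	return (class_code & 0xffffffu) == NVME_CLASS_CODE;
}
```

Calling pci_class_code_split(0x010802) yields base_class 0x01, sub_class 0x08 and prog_if 0x02, matching the comment block in pci_ids.h.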

2. Hello World

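Learning a new language or development kit always starts with "Hello World", and SPDK is no exception. Let's start from hello_world.c and see how main() uses the SPDK NVMe driver API, which reveals the main logic for working with NVMe SSDs.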

  • src/spdk-17.07.1/examples/nvme/hello_world/hello_world.c
306 int main(int argc, char **argv)
307 {
308     int rc;
309     struct spdk_env_opts opts;
310
311     /*
312      * SPDK relies on an abstraction around the local environment
313      * named env that handles memory allocation and PCI device operations.
314      * This library must be initialized first.
315      *
316      */
317     spdk_env_opts_init(&opts);
318     opts.name = "hello_world";
319     opts.shm_id = 0;
320     spdk_env_init(&opts);
321
322     printf("Initializing NVMe Controllers\n");
323
324     /*
325      * Start the SPDK NVMe enumeration process.  probe_cb will be called
326      *  for each NVMe controller found, giving our application a choice on
327      *  whether to attach to each controller.  attach_cb will then be
328      *  called for each controller after the SPDK NVMe driver has completed
329      *  initializing the controller we chose to attach.
330      */
331     rc = spdk_nvme_probe(NULL, NULL, probe_cb, attach_cb, NULL);
332     if (rc != 0) {
333             fprintf(stderr, "spdk_nvme_probe() failed\n");
334             cleanup();
335             return 1;
336     }
337
338     if (g_controllers == NULL) {
339             fprintf(stderr, "no NVMe controllers found\n");
340             cleanup();
341             return 1;
342     }
343
344     printf("Initialization complete.\n");
345     hello_world();
346     cleanup();
347     return 0;
348 }

main() proceeds as follows:

001 - 317 spdk_env_opts_init(&opts);
002 - 320 spdk_env_init(&opts);
003 - 331 rc = spdk_nvme_probe(NULL, NULL, probe_cb, attach_cb, NULL);
004 - 345 hello_world();
005 - 346 cleanup();

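  • 001-002: initialize the SPDK runtime environment.
  • 003: call spdk_nvme_probe() to actively discover NVMe SSD devices. This is clearly the key function to analyze next.
  • 004: call hello_world() to perform simple read and write operations.
  • 005: call cleanup() to release memory, detach the NVMe SSD devices, and so on.

Before analyzing the key function spdk_nvme_probe(), let's settle two questions:

  • Question 1: every NVMe SSD contains a controller. How are all the discovered NVMe SSDs (that is, NVMe controllers) organized?
  • Question 2: every NVMe SSD can be divided into multiple namespaces (similar in concept to logical partitions). How are these namespaces organized?

For an experienced C programmer the answer to both is easy: a linked list. That is exactly what hello_world.c does. Look at the code: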

39 struct ctrlr_entry {
40      struct spdk_nvme_ctrlr  *ctrlr;
41      struct ctrlr_entry      *next;
42      char                    name[1024];
43 };
44
45 struct ns_entry {
46      struct spdk_nvme_ctrlr  *ctrlr;
47      struct spdk_nvme_ns     *ns;
48      struct ns_entry         *next;
49      struct spdk_nvme_qpair  *qpair;
50 };
51
52 static struct ctrlr_entry *g_controllers = NULL;
53 static struct ns_entry *g_namespaces = NULL;

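Here,

  • g_controllers is the head of the global list that manages all NVMe SSDs (i.e. NVMe controllers).
  • g_namespaces is the head of the global list that manages all namespaces.

With that, L338-342 of main() are easy to understand: if g_controllers is still NULL after probing, no NVMe SSD was found, so the program cleans up and exits.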

338     if (g_controllers == NULL) {
339             fprintf(stderr, "no NVMe controllers found\n");
340             cleanup();
341             return 1;
342     }
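To see why a NULL g_controllers means "nothing found", it helps to look at how such a list gets populated. The following is a minimal sketch under stated assumptions: struct fake_ctrlr stands in for the opaque struct spdk_nvme_ctrlr, and register_ctrlr(), count_ctrlrs() and free_ctrlrs() are hypothetical helpers, not functions from hello_world.c.

```c
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for struct spdk_nvme_ctrlr (assumption, for illustration only). */
struct fake_ctrlr {
	int id;
};

struct ctrlr_entry {
	struct fake_ctrlr	*ctrlr;
	struct ctrlr_entry	*next;
	char			name[1024];
};

static struct ctrlr_entry *g_controllers = NULL;

/* Prepend one controller to the global list; 0 on success, -1 if malloc fails. */
static int
register_ctrlr(struct fake_ctrlr *ctrlr, const char *name)
{
	struct ctrlr_entry *entry = malloc(sizeof(*entry));

	if (entry == NULL) {
		return -1;
	}
	entry->ctrlr = ctrlr;
	snprintf(entry->name, sizeof(entry->name), "%s", name);
	entry->next = g_controllers;
	g_controllers = entry;
	return 0;
}

/* Count the entries currently on the list. */
static int
count_ctrlrs(void)
{
	int n = 0;
	const struct ctrlr_entry *entry;

	for (entry = g_controllers; entry != NULL; entry = entry->next) {
		n++;
	}
	return n;
}

/* Free every entry and reset the list head to NULL. */
static void
free_ctrlrs(void)
{
	while (g_controllers != NULL) {
		struct ctrlr_entry *next = g_controllers->next;

		free(g_controllers);
		g_controllers = next;
	}
}
```

The list head stays NULL until the first controller is registered, so checking g_controllers against NULL after probing is enough to tell whether anything was attached.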

Now let's look at how hello_world.c uses spdk_nvme_probe():

331     rc = spdk_nvme_probe(NULL, NULL, probe_cb, attach_cb, NULL);

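Clearly, probe_cb and attach_cb are two callback functions. (There is also remove_cb, which L331 does not use; it passes NULL.)

  • probe_cb: called when an NVMe device is enumerated
  • attach_cb: called once an NVMe device has been attached to the userspace NVMe driver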
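The division of labor between the two callbacks can be sketched with a mock enumerator. Everything below is hypothetical (struct mock_dev, mock_probe() and the two example callbacks are not SPDK names), and the sketch collapses scanning and controller initialization into one loop, which the real driver performs as separate phases.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a discovered device; not an SPDK type. */
struct mock_dev {
	const char	*addr;
	bool		attached;
};

typedef bool (*mock_probe_cb)(void *cb_ctx, const struct mock_dev *dev);
typedef void (*mock_attach_cb)(void *cb_ctx, struct mock_dev *dev);

/*
 * Walk the device array: ask probe_cb whether to claim each device, then
 * report each claimed device through attach_cb. Returns the number attached.
 */
static int
mock_probe(struct mock_dev *devs, size_t n, void *cb_ctx,
	   mock_probe_cb probe_cb, mock_attach_cb attach_cb)
{
	int attached = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!probe_cb(cb_ctx, &devs[i])) {
			continue;	/* the application declined this device */
		}
		devs[i].attached = true;	/* stands in for controller initialization */
		attach_cb(cb_ctx, &devs[i]);
		attached++;
	}
	return attached;
}

/* Example probe callback: accept every device. */
static bool
accept_all(void *cb_ctx, const struct mock_dev *dev)
{
	(void)cb_ctx;
	(void)dev;
	return true;
}

/* Example attach callback: cb_ctx points to an int counter. */
static void
count_attach(void *cb_ctx, struct mock_dev *dev)
{
	(void)dev;
	(*(int *)cb_ctx)++;
}
```

The opaque cb_ctx pointer is how the application threads its own state through the enumerator without the enumerator knowing its type.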

The definitions related to probe_cb, attach_cb and remove_cb are as follows:

  • src/spdk-17.07.1/include/spdk/nvme.h
268 /**
269  * Callback for spdk_nvme_probe() enumeration.
270  *
271  * \param opts NVMe controller initialization options.  This structure will be populated with the
272  * default values on entry, and the user callback may update any options to request a different
273  * value.  The controller may not support all requested parameters, so the final values will be
274  * provided during the attach callback.
275  * \return true to attach to this device.
276  */
277 typedef bool (*spdk_nvme_probe_cb)(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
278                                struct spdk_nvme_ctrlr_opts *opts);
279
280 /**
281  * Callback for spdk_nvme_probe() to report a device that has been attached to the userspace NVMe driver.
282  *
283  * \param opts NVMe controller initialization options that were actually used.  Options may differ
284  * from the requested options from the probe call depending on what the controller supports.
285  */
286 typedef void (*spdk_nvme_attach_cb)(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
287                                 struct spdk_nvme_ctrlr *ctrlr,
288                                 const struct spdk_nvme_ctrlr_opts *opts);
289
290 /**
291  * Callback for spdk_nvme_probe() to report that a device attached to the userspace NVMe driver
292  * has been removed from the system.
293  *
294  * The controller will remain in a failed state (any new I/O submitted will fail).
295  *
296  * The controller must be detached from the userspace driver by calling spdk_nvme_detach()
297  * once the controller is no longer in use.  It is up to the library user to ensure that
298  * no other threads are using the controller before calling spdk_nvme_detach().
299  *
300  * \param ctrlr NVMe controller instance that was removed.
301  */
302 typedef void (*spdk_nvme_remove_cb)(void *cb_ctx, struct spdk_nvme_ctrlr *ctrlr);
303
304 /**
305  * \brief Enumerate the bus indicated by the transport ID and attach the userspace NVMe driver
306  * to each device found if desired.
307  *
308  * \param trid The transport ID indicating which bus to enumerate. If the trtype is PCIe or trid is NULL,
309  * this will scan the local PCIe bus. If the trtype is RDMA, the traddr and trsvcid must point at the
310  * location of an NVMe-oF discovery service.
311  * \param cb_ctx Opaque value which will be passed back in cb_ctx parameter of the callbacks.
312  * \param probe_cb will be called once per NVMe device found in the system.
313  * \param attach_cb will be called for devices for which probe_cb returned true once that NVMe
314  * controller has been attached to the userspace driver.
315  * \param remove_cb will be called for devices that were attached in a previous spdk_nvme_probe()
316  * call but are no longer attached to the system. Optional; specify NULL if removal notices are not
317  * desired.
318  *
319  * This function is not thread safe and should only be called from one thread at a time while no
320  * other threads are actively using any NVMe devices.
321  *
322  * If called from a secondary process, only devices that have been attached to the userspace driver
323  * in the primary process will be probed.
324  *
325  * If called more than once, only devices that are not already attached to the SPDK NVMe driver
326  * will be reported.
327  *
328  * To stop using the the controller and release its associated resources,
329  * call \ref spdk_nvme_detach with the spdk_nvme_ctrlr instance returned by this function.
330  */
331 int spdk_nvme_probe(const struct spdk_nvme_transport_id *trid,
332                 void *cb_ctx,
333                 spdk_nvme_probe_cb probe_cb,
334                 spdk_nvme_attach_cb attach_cb,
335                 spdk_nvme_remove_cb remove_cb);

To avoid being sidetracked by probe_cb, attach_cb and remove_cb, let's next look at struct spdk_nvme_transport_id and the main logic of spdk_nvme_probe().

  • src/spdk-17.07.1/include/spdk/nvme.h
142 /**
143  * NVMe transport identifier.
144  *
145  * This identifies a unique endpoint on an NVMe fabric.
146  *
147  * A string representation of a transport ID may be converted to this type using
148  * spdk_nvme_transport_id_parse().
149  */
150 struct spdk_nvme_transport_id {
151     /**
152      * NVMe transport type.
153      */
154     enum spdk_nvme_transport_type trtype;
155
156     /**
157      * Address family of the transport address.
158      *
159      * For PCIe, this value is ignored.
160      */
161     enum spdk_nvmf_adrfam adrfam;
162
163     /**
164      * Transport address of the NVMe-oF endpoint. For transports which use IP
165      * addressing (e.g. RDMA), this should be an IP address. For PCIe, this
166      * can either be a zero length string (the whole bus) or a PCI address
167      * in the format DDDD:BB:DD.FF or DDDD.BB.DD.FF
168      */
169     char traddr[SPDK_NVMF_TRADDR_MAX_LEN + 1];
170
171     /**
172      * Transport service id of the NVMe-oF endpoint.  For transports which use
173      * IP addressing (e.g. RDMA), this field shoud be the port number. For PCIe,
174      * this is always a zero length string.
175      */
176     char trsvcid[SPDK_NVMF_TRSVCID_MAX_LEN + 1];
177
178     /**
179      * Subsystem NQN of the NVMe over Fabrics endpoint. May be a zero length string.
180      */
181     char subnqn[SPDK_NVMF_NQN_MAX_LEN + 1];
182 };

For NVMe over PCIe, we only need to care about the "NVMe transport type" field:

154    enum spdk_nvme_transport_type trtype;

Two transport types are currently supported: PCIe and RDMA.

130 enum spdk_nvme_transport_type {
131     /**
132      * PCIe Transport (locally attached devices)
133      */
134     SPDK_NVME_TRANSPORT_PCIE = 256,
135
136     /**
137      * RDMA Transport (RoCE, iWARP, etc.)
138      */
139     SPDK_NVME_TRANSPORT_RDMA = SPDK_NVMF_TRTYPE_RDMA,
140 };

RDMA is not discussed further here, since the focus is NVMe over PCIe.

Next, the code of spdk_nvme_probe():

  • src/spdk-17.07.1/lib/nvme/nvme.c
396 int
397 spdk_nvme_probe(const struct spdk_nvme_transport_id *trid, void *cb_ctx,
398             spdk_nvme_probe_cb probe_cb, spdk_nvme_attach_cb attach_cb,
399             spdk_nvme_remove_cb remove_cb)
400 {
401     int rc;
402     struct spdk_nvme_ctrlr *ctrlr;
403     struct spdk_nvme_transport_id trid_pcie;
404
405     rc = nvme_driver_init();
406     if (rc != 0) {
407             return rc;
408     }
409
410     if (trid == NULL) {
411             memset(&trid_pcie, 0, sizeof(trid_pcie));
412             trid_pcie.trtype = SPDK_NVME_TRANSPORT_PCIE;
413             trid = &trid_pcie;
414     }
415
416     if (!spdk_nvme_transport_available(trid->trtype)) {
417             SPDK_ERRLOG("NVMe trtype %u not available\n", trid->trtype);
418             return -1;
419     }
420
421     nvme_robust_mutex_lock(&g_spdk_nvme_driver->lock);
422
423     nvme_transport_ctrlr_scan(trid, cb_ctx, probe_cb, remove_cb);
424
425     if (!spdk_process_is_primary()) {
426             TAILQ_FOREACH(ctrlr, &g_spdk_nvme_driver->attached_ctrlrs, tailq) {
427                     nvme_ctrlr_proc_get_ref(ctrlr);
428
429                     /*
430                      * Unlock while calling attach_cb() so the user can call other functions
431                      *  that may take the driver lock, like nvme_detach().
432                      */
433                     nvme_robust_mutex_unlock(&g_spdk_nvme_driver->lock);
434                     attach_cb(cb_ctx, &ctrlr->trid, ctrlr, &ctrlr->opts);
435                     nvme_robust_mutex_lock(&g_spdk_nvme_driver->lock);
436             }
437
438             nvme_robust_mutex_unlock(&g_spdk_nvme_driver->lock);
439             return 0;
440     }
441
442     nvme_robust_mutex_unlock(&g_spdk_nvme_driver->lock);
443     /*
444      * Keep going even if one or more nvme_attach() calls failed,
445      *  but maintain the value of rc to signal errors when we return.
446      */
447
448     rc = nvme_init_controllers(cb_ctx, attach_cb);
449
450     return rc;
451 }

spdk_nvme_probe() proceeds as follows:

001 405:     rc = nvme_driver_init();
002 410-414: if trid is NULL, default it to a PCIe transport ID
003 416:     check that the transport type is available via spdk_nvme_transport_available(trid->trtype)
004 423:     nvme_transport_ctrlr_scan(trid, cb_ctx, probe_cb, remove_cb);
005 425:     if this is not the primary process, call attach_cb for each already-attached controller and return (L426-440)
006 448:     rc = nvme_init_controllers(cb_ctx, attach_cb);

Next, let's look at nvme_transport_ctrlr_scan():

423     nvme_transport_ctrlr_scan(trid, cb_ctx, probe_cb, remove_cb);
/* src/spdk-17.07.1/lib/nvme/nvme_transport.c#92 */

91 int
92 nvme_transport_ctrlr_scan(const struct spdk_nvme_transport_id *trid,
93                        void *cb_ctx,
94                        spdk_nvme_probe_cb probe_cb,
95                        spdk_nvme_remove_cb remove_cb)
96 {
97      NVME_TRANSPORT_CALL(trid->trtype, ctrlr_scan, (trid, cb_ctx, probe_cb, remove_cb));
98 }

The macro NVME_TRANSPORT_CALL is defined as:

/* src/spdk-17.07.1/lib/nvme/nvme_transport.c#60 */
52 #define TRANSPORT_PCIE(func_name, args)      case SPDK_NVME_TRANSPORT_PCIE: return nvme_pcie_ ## func_name args;
..
60 #define NVME_TRANSPORT_CALL(trtype, func_name, args)         \
61      do {                                                    \
62              switch (trtype) {                               \
63              TRANSPORT_PCIE(func_name, args)                 \
64              TRANSPORT_FABRICS_RDMA(func_name, args)         \
65              TRANSPORT_DEFAULT(trtype)                       \
66              }                                               \
67              SPDK_UNREACHABLE();                             \
68      } while (0)
..
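The dispatch trick is token pasting: the macro glues a transport prefix onto the operation name inside a switch. The following is a minimal, self-contained sketch of the same technique. All demo_ and DEMO_ names, the enum values and the return values are made up for illustration, and the fallback return stands in for the TRANSPORT_DEFAULT and SPDK_UNREACHABLE handling, which is not shown in the excerpt.

```c
/* Illustrative transport identifiers (assumed values for this sketch). */
enum demo_transport {
	DEMO_TRANSPORT_PCIE = 256,
	DEMO_TRANSPORT_RDMA = 1,
};

/* Hypothetical per-transport implementations of one operation. */
static int demo_pcie_ctrlr_scan(int arg) { return arg + 1000; }
static int demo_rdma_ctrlr_scan(int arg) { return arg + 2000; }

/* Each case pastes the operation name onto a transport prefix and returns its result. */
#define DEMO_PCIE(func_name, args) case DEMO_TRANSPORT_PCIE: return demo_pcie_ ## func_name args;
#define DEMO_RDMA(func_name, args) case DEMO_TRANSPORT_RDMA: return demo_rdma_ ## func_name args;

#define DEMO_TRANSPORT_CALL(trtype, func_name, args)	\
	do {						\
		switch (trtype) {			\
		DEMO_PCIE(func_name, args)		\
		DEMO_RDMA(func_name, args)		\
		}					\
	} while (0)

/* Generic entry point: expands to a switch that calls demo_<transport>_ctrlr_scan(arg). */
static int
demo_transport_ctrlr_scan(enum demo_transport trtype, int arg)
{
	DEMO_TRANSPORT_CALL(trtype, ctrlr_scan, (arg));
	return -1;	/* reached only for an unknown transport type */
}
```

With this layout, adding a new operation only requires writing demo_pcie_<name>() and demo_rdma_<name>() plus a one-line wrapper. The cost is that the macro hides a return statement, which makes control flow harder to follow.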

So, for NVMe over PCIe, the call to nvme_transport_ctrlr_scan() becomes a call to nvme_pcie_ctrlr_scan():

/* src/spdk-17.07.1/lib/nvme/nvme_pcie.c#620 */
619 int
620 nvme_pcie_ctrlr_scan(const struct spdk_nvme_transport_id *trid,
621                  void *cb_ctx,
622                  spdk_nvme_probe_cb probe_cb,
623                  spdk_nvme_remove_cb remove_cb)
624 {
625     struct nvme_pcie_enum_ctx enum_ctx = {};
626
627     enum_ctx.probe_cb = probe_cb;
628     enum_ctx.cb_ctx = cb_ctx;
629
630     if (strlen(trid->traddr) != 0) {
631             if (spdk_pci_addr_parse(&enum_ctx.pci_addr, trid->traddr)) {
632                     return -1;
633             }
634             enum_ctx.has_pci_addr = true;
635     }
636
637     if (hotplug_fd < 0) {
638             hotplug_fd = spdk_uevent_connect();
639             if (hotplug_fd < 0) {
640                     SPDK_TRACELOG(SPDK_TRACE_NVME, "Failed to open uevent netlink socket\n");
641             }
642     } else {
643             _nvme_pcie_hotplug_monitor(cb_ctx, probe_cb, remove_cb);
644     }
645
646     if (enum_ctx.has_pci_addr == false) {
647             return spdk_pci_nvme_enumerate(pcie_nvme_enum_cb, &enum_ctx);
648     } else {
649             return spdk_pci_nvme_device_attach(pcie_nvme_enum_cb, &enum_ctx, &enum_ctx.pci_addr);
650     }
651 }
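L630-635 turn a non-empty traddr into a PCI address so that a single device can be targeted instead of the whole bus. The following sketch shows what such parsing might look like for the DDDD:BB:DD.F form. It is not the implementation of spdk_pci_addr_parse(): struct demo_pci_addr and demo_pci_addr_parse() are hypothetical, and the range limits are the usual PCI field widths.

```c
#include <stdio.h>

/* Hypothetical stand-in for a parsed PCI address; not the SPDK struct. */
struct demo_pci_addr {
	unsigned int domain;
	unsigned int bus;
	unsigned int dev;
	unsigned int func;
};

/*
 * Parse "DDDD:BB:DD.F" (hexadecimal fields). Returns 0 on success and -1 on
 * malformed input, trailing characters, or out-of-range fields.
 */
static int
demo_pci_addr_parse(struct demo_pci_addr *addr, const char *s)
{
	char trailing;

	/* Exactly four conversions must succeed; a fifth means trailing input. */
	if (sscanf(s, "%x:%x:%x.%x%c", &addr->domain, &addr->bus,
		   &addr->dev, &addr->func, &trailing) != 4) {
		return -1;
	}
	if (addr->domain > 0xffff || addr->bus > 0xff ||
	    addr->dev > 0x1f || addr->func > 0x7) {
		return -1;
	}
	return 0;
}
```

An empty traddr never reaches a parser like this one: the strlen() check at L630 skips it, has_pci_addr stays false, and the whole bus is enumerated at L647.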

Next we focus on spdk_pci_nvme_enumerate(), called at L647, because the goal is to understand how the Class Code is used to discover SSD devices.

647         return spdk_pci_nvme_enumerate(pcie_nvme_enum_cb, &enum_ctx);
/* src/spdk-17.07.1/lib/env_dpdk/pci_nvme.c */

81 int
82 spdk_pci_nvme_enumerate(spdk_pci_enum_cb enum_cb, void *enum_ctx)
83 {
84      return spdk_pci_enumerate(&g_nvme_pci_drv, enum_cb, enum_ctx);
85 }

Note: the first argument at L84 is the address of the global variable g_nvme_pci_drv. (Spotting a global structure variable is always exciting :-) )

/* src/spdk-17.07.1/lib/env_dpdk/pci_nvme.c */

38 static struct rte_pci_id nvme_pci_driver_id[] = {
39 #if RTE_VERSION >= RTE_VERSION_NUM(16, 7, 0, 1)
40      {
41              .class_id = SPDK_PCI_CLASS_NVME,
42              .vendor_id = PCI_ANY_ID,
43              .device_id = PCI_ANY_ID,
44              .subsystem_vendor_id = PCI_ANY_ID,
45              .subsystem_device_id = PCI_ANY_ID,
46      },
47 #else
48      {RTE_PCI_DEVICE(0x8086, 0x0953)},
49 #endif
50      { .vendor_id = 0, /* sentinel */ },
51 };
..
53 static struct spdk_pci_enum_ctx g_nvme_pci_drv = {
54      .driver = {
55              .drv_flags      = RTE_PCI_DRV_NEED_MAPPING,
56              .id_table       = nvme_pci_driver_id,
..
66      },
67
68      .cb_fn = NULL,
69      .cb_arg = NULL,
70      .mtx = PTHREAD_MUTEX_INITIALIZER,
71      .is_registered = false,
72 };

Aha! We have finally reached the Class Code (SPDK_PCI_CLASS_NVME=0x010802). The global variable g_nvme_pci_drv is defined at L53, and the table that g_nvme_pci_drv.driver.id_table points to is defined at L38.

38 static struct rte_pci_id nvme_pci_driver_id[] = {
..
41              .class_id = SPDK_PCI_CLASS_NVME,
..
53 static struct spdk_pci_enum_ctx g_nvme_pci_drv = {
54      .driver = {
..
56              .id_table       = nvme_pci_driver_id,
..

So we only need to dig further into spdk_pci_enumerate() to find out how SSD devices are discovered...

/* src/spdk-17.07.1/lib/env_dpdk/pci.c#150 */

149 int
150 spdk_pci_enumerate(struct spdk_pci_enum_ctx *ctx,
151                spdk_pci_enum_cb enum_cb,
152                void *enum_ctx)
153 {
...
168
169 #if RTE_VERSION >= RTE_VERSION_NUM(17, 05, 0, 4)
170     if (rte_pci_probe() != 0) {
171 #else
172     if (rte_eal_pci_probe() != 0) {
173 #endif
...
184     return 0;
185 }

Some code is omitted. The line to focus on is L170:

170     if (rte_pci_probe() != 0) {

Starting with the implementation of rte_pci_probe(), we step into the internals of DPDK. The code is as follows:

/* src/dpdk-17.08/lib/librte_eal/common/eal_common_pci.c#413 */

407 /*
408  * Scan the content of the PCI bus, and call the probe() function for
409  * all registered drivers that have a matching entry in its id_table
410  * for discovered devices.
411  */
412 int
413 rte_pci_probe(void)
414 {
415     struct rte_pci_device *dev = NULL;
416     size_t probed = 0, failed = 0;
417     struct rte_devargs *devargs;
418     int probe_all = 0;
419     int ret = 0;
420
421     if (rte_pci_bus.bus.conf.scan_mode != RTE_BUS_SCAN_WHITELIST)
422             probe_all = 1;
423
424     FOREACH_DEVICE_ON_PCIBUS(dev) {
425             probed++;
426
427             devargs = dev->device.devargs;
428             /* probe all or only whitelisted devices */
429             if (probe_all)
430                     ret = pci_probe_all_drivers(dev);
431             else if (devargs != NULL &&
432                     devargs->policy == RTE_DEV_WHITELISTED)
433                     ret = pci_probe_all_drivers(dev);
434             if (ret < 0) {
435                     RTE_LOG(ERR, EAL, "Requested device " PCI_PRI_FMT
436                              " cannot be used\n", dev->addr.domain, dev->addr.bus,
437                              dev->addr.devid, dev->addr.function);
438                     rte_errno = errno;
439                     failed++;
440                     ret = 0;
441             }
442     }
443
444     return (probed && probed == failed) ? -1 : 0;
445 }

L430 is the line of interest:

430             ret = pci_probe_all_drivers(dev);

pci_probe_all_drivers() is implemented as follows:

/* src/dpdk-17.08/lib/librte_eal/common/eal_common_pci.c#307 */

301 /*
302  * If vendor/device ID match, call the probe() function of all
303  * registered driver for the given device. Return -1 if initialization
304  * failed, return 1 if no driver is found for this device.
305  */
306 static int
307 pci_probe_all_drivers(struct rte_pci_device *dev)
308 {
309     struct rte_pci_driver *dr = NULL;
310     int rc = 0;
311
312     if (dev == NULL)
313             return -1;
314
315     /* Check if a driver is already loaded */
316     if (dev->driver != NULL)
317             return 0;
318
319     FOREACH_DRIVER_ON_PCIBUS(dr) {
320             rc = rte_pci_probe_one_driver(dr, dev);
321             if (rc < 0)
322                     /* negative value is an error */
323                     return -1;
324             if (rc > 0)
325                     /* positive value means driver doesn't support it */
326                     continue;
327             return 0;
328     }
329     return 1;
330 }
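The loop relies on a three-way return convention: negative means error, zero means the driver claimed the device, positive means "not mine, try the next driver". The following is a minimal sketch of that convention. struct demo_driver and all function names are hypothetical, and the array stands in for DPDK's registered-driver list.

```c
#include <stddef.h>

/* Hypothetical driver record: probe returns <0 error, 0 claimed, >0 not supported. */
struct demo_driver {
	int (*probe)(int dev_id);
};

/* Offer the device to each driver in turn, following the convention above. */
static int
demo_probe_all_drivers(const struct demo_driver *drivers, size_t n, int dev_id)
{
	size_t i;

	for (i = 0; i < n; i++) {
		int rc = drivers[i].probe(dev_id);

		if (rc < 0) {
			return -1;	/* hard error: stop immediately */
		}
		if (rc > 0) {
			continue;	/* this driver does not support the device */
		}
		return 0;		/* the first driver that claims the device wins */
	}
	return 1;			/* no driver matched */
}

/* Example probe functions used to exercise the three outcomes. */
static int probe_unsupported(int dev_id) { (void)dev_id; return 1; }
static int probe_claims_even(int dev_id) { return (dev_id % 2 == 0) ? 0 : 1; }
static int probe_fails(int dev_id) { (void)dev_id; return -5; }
```

With the drivers { probe_unsupported, probe_claims_even }, device 4 is claimed by the second driver (result 0) and device 3 by none (result 1). A driver list that starts with probe_fails returns -1 without trying the remaining drivers.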

L320 is the line of interest:

320             rc = rte_pci_probe_one_driver(dr, dev);
/* src/dpdk-17.08/lib/librte_eal/common/eal_common_pci.c#200 */

195 /*
196  * If vendor/device ID match, call the probe() function of the
197  * driver.
198  */
199 static int
200 rte_pci_probe_one_driver(struct rte_pci_driver *dr,
201                      struct rte_pci_device *dev)
202 {
203     int ret;
204     struct rte_pci_addr *loc;
205
206     if ((dr == NULL) || (dev == NULL))
207             return -EINVAL;
208
209     loc = &dev->addr;
210
211     /* The device is not blacklisted; Check if driver supports it */
212     if (!rte_pci_match(dr, dev))
213             /* Match of device and driver failed */
214             return 1;
215
216     RTE_LOG(INFO, EAL, "PCI device "PCI_PRI_FMT" on NUMA socket %i\n",
217                     loc->domain, loc->bus, loc->devid, loc->function,
218                     dev->device.numa_node);
219
220     /* no initialization when blacklisted, return without error */
221     if (dev->device.devargs != NULL &&
222             dev->device.devargs->policy ==
223                     RTE_DEV_BLACKLISTED) {
224             RTE_LOG(INFO, EAL, "  Device is blacklisted, not"
225                     " initializing\n");
226             return 1;
227     }
228
229     if (dev->device.numa_node < 0) {
230             RTE_LOG(WARNING, EAL, "  Invalid NUMA socket, default to 0\n");
231             dev->device.numa_node = 0;
232     }
233
234     RTE_LOG(INFO, EAL, "  probe driver: %x:%x %s\n", dev->id.vendor_id,
235             dev->id.device_id, dr->driver.name);
236
237     if (dr->drv_flags & RTE_PCI_DRV_NEED_MAPPING) {
238             /* map resources for devices that use igb_uio */
239             ret = rte_pci_map_device(dev);
240             if (ret != 0)
241                     return ret;
242     }
243
244     /* reference driver structure */
245     dev->driver = dr;
246     dev->device.driver = &dr->driver;
247
248     /* call the driver probe() function */
249     ret = dr->probe(dr, dev);
250     if (ret) {
251             dev->driver = NULL;
252             dev->device.driver = NULL;
253             if ((dr->drv_flags & RTE_PCI_DRV_NEED_MAPPING) &&
254                     /* Don't unmap if device is unsupported and
255                      * driver needs mapped resources.
256                      */
257                     !(ret > 0 &&
258                             (dr->drv_flags & RTE_PCI_DRV_KEEP_MAPPED_RES)))
259                     rte_pci_unmap_device(dev);
260     }
261
262     return ret;
263 }

L212 is the line of interest:

212     if (!rte_pci_match(dr, dev))

rte_pci_match() is implemented as follows:

/* src/dpdk-17.08/lib/librte_eal/common/eal_common_pci.c#163 */

151 /*
152  * Match the PCI Driver and Device using the ID Table
153  *
154  * @param pci_drv
155  *  PCI driver from which ID table would be extracted
156  * @param pci_dev
157  *  PCI device to match against the driver
158  * @return
159  *  1 for successful match
160  *  0 for unsuccessful match
161  */
162 static int
163 rte_pci_match(const struct rte_pci_driver *pci_drv,
164               const struct rte_pci_device *pci_dev)
165 {
166     const struct rte_pci_id *id_table;
167
168     for (id_table = pci_drv->id_table; id_table->vendor_id != 0;
169          id_table++) {
170             /* check if device's identifiers match the driver's ones */
171             if (id_table->vendor_id != pci_dev->id.vendor_id &&
172                             id_table->vendor_id != PCI_ANY_ID)
173                     continue;
174             if (id_table->device_id != pci_dev->id.device_id &&
175                             id_table->device_id != PCI_ANY_ID)
176                     continue;
177             if (id_table->subsystem_vendor_id !=
178                 pci_dev->id.subsystem_vendor_id &&
179                 id_table->subsystem_vendor_id != PCI_ANY_ID)
180                     continue;
181             if (id_table->subsystem_device_id !=
182                 pci_dev->id.subsystem_device_id &&
183                 id_table->subsystem_device_id != PCI_ANY_ID)
184                     continue;
185             if (id_table->class_id != pci_dev->id.class_id &&
186                             id_table->class_id != RTE_CLASS_ANY_ID)
187                     continue;
188
189             return 1;
190     }
191
192     return 0;
193 }

At this point we have finally found how SSD devices are discovered. L185-187 are the three lines we were hoping to see:

185             if (id_table->class_id != pci_dev->id.class_id &&
186                             id_table->class_id != RTE_CLASS_ANY_ID)
187                     continue;
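The matching rule can be exercised in isolation: every field must either equal the device's value or be a wildcard, and a zero vendor_id ends the table. The sketch below is a reduced version with hypothetical DEMO_ and demo_ names and no subsystem IDs. The wildcard values 0xffff and 0xffffff are assumed to mirror PCI_ANY_ID and RTE_CLASS_ANY_ID, and the device IDs in the usage note are example values.

```c
#include <stdint.h>

#define DEMO_PCI_ANY_ID		0xffffu		/* assumed to mirror PCI_ANY_ID */
#define DEMO_CLASS_ANY_ID	0xffffffu	/* assumed to mirror RTE_CLASS_ANY_ID */
#define DEMO_CLASS_NVME		0x010802u	/* same value as SPDK_PCI_CLASS_NVME */

/* Reduced ID record: class code plus vendor and device IDs. */
struct demo_pci_id {
	uint32_t class_id;
	uint16_t vendor_id;
	uint16_t device_id;
};

/* One entry: any vendor, any device, class code must be NVMe. */
static const struct demo_pci_id demo_nvme_ids[] = {
	{ .class_id = DEMO_CLASS_NVME, .vendor_id = DEMO_PCI_ANY_ID,
	  .device_id = DEMO_PCI_ANY_ID },
	{ .vendor_id = 0 },	/* sentinel */
};

/* Return 1 if any table entry matches the device, 0 otherwise. */
static int
demo_pci_match(const struct demo_pci_id *table, const struct demo_pci_id *dev)
{
	const struct demo_pci_id *id;

	for (id = table; id->vendor_id != 0; id++) {
		if (id->vendor_id != dev->vendor_id && id->vendor_id != DEMO_PCI_ANY_ID) {
			continue;
		}
		if (id->device_id != dev->device_id && id->device_id != DEMO_PCI_ANY_ID) {
			continue;
		}
		if (id->class_id != dev->class_id && id->class_id != DEMO_CLASS_ANY_ID) {
			continue;
		}
		return 1;
	}
	return 0;
}
```

Because the vendor and device fields of the table entry are wildcards, only the class code decides the outcome: a device with class 0x010802 matches regardless of vendor, while a device with class 0x020000 (a network controller) does not.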

The structures struct rte_pci_id, struct rte_pci_device and struct rte_pci_driver are defined as follows:

/* src/dpdk-17.08/lib/librte_eal/common/include/rte_pci.h#100 */

96  /**
97   * A structure describing an ID for a PCI driver. Each driver provides a
98   * table of these IDs for each device that it supports.
99   */
100 struct rte_pci_id {
101     uint32_t class_id;            /**< Class ID (class, subclass, pi) or RTE_CLASS_ANY_ID. */
102     uint16_t vendor_id;           /**< Vendor ID or PCI_ANY_ID. */
103     uint16_t device_id;           /**< Device ID or PCI_ANY_ID. */
104     uint16_t subsystem_vendor_id; /**< Subsystem vendor ID or PCI_ANY_ID. */
105     uint16_t subsystem_device_id; /**< Subsystem device ID or PCI_ANY_ID. */
106 };

/* src/dpdk-17.08/lib/librte_eal/common/include/rte_pci.h#120 */

120 /**
121  * A structure describing a PCI device.
122  */
123 struct rte_pci_device {
124     TAILQ_ENTRY(rte_pci_device) next;       /**< Next probed PCI device. */
125     struct rte_device device;               /**< Inherit core device */
126     struct rte_pci_addr addr;               /**< PCI location. */
127     struct rte_pci_id id;                   /**< PCI ID. */
128     struct rte_mem_resource mem_resource[PCI_MAX_RESOURCE];
129                                             /**< PCI Memory Resource */
130     struct rte_intr_handle intr_handle;     /**< Interrupt handle */
131     struct rte_pci_driver *driver;          /**< Associated driver */
132     uint16_t max_vfs;                       /**< sriov enable if not zero */
133     enum rte_kernel_driver kdrv;            /**< Kernel driver passthrough */
134     char name[PCI_PRI_STR_SIZE+1];          /**< PCI location (ASCII) */
135 };

/* src/dpdk-17.08/lib/librte_eal/common/include/rte_pci.h#178 */

175 /**
176  * A structure describing a PCI driver.
177  */
178 struct rte_pci_driver {
179     TAILQ_ENTRY(rte_pci_driver) next;       /**< Next in list. */
180     struct rte_driver driver;               /**< Inherit core driver. */
181     struct rte_pci_bus *bus;                /**< PCI bus reference. */
182     pci_probe_t *probe;                     /**< Device Probe function. */
183     pci_remove_t *remove;                   /**< Device Remove function. */
184     const struct rte_pci_id *id_table;      /**< ID table, NULL terminated. */
185     uint32_t drv_flags;                     /**< Flags contolling handling of device. */
186 };

To sum up SSD device discovery so far:

  • 01 - The Class Code (0x010802) is the basis for discovering SSD devices.
  • 02 - During discovery, control passes from SPDK into DPDK. The call stack is:
00 hello_world.c
01 -> main()
02 --> spdk_nvme_probe()
03 ---> nvme_transport_ctrlr_scan()
04 ----> nvme_pcie_ctrlr_scan()
05 -----> spdk_pci_nvme_enumerate()
06 ------> spdk_pci_enumerate(&g_nvme_pci_drv, ...)                 | SPDK |
   =========================================================================
07 -------> rte_pci_probe()                                         | DPDK |
08 --------> pci_probe_all_drivers()
09 ---------> rte_pci_probe_one_driver()
10 ----------> rte_pci_match()
  • 03 - rte_pci_match(), a function in DPDK's Environment Abstraction Layer (EAL), is the key to discovering SSD devices.
  • 04 - The figure below shows where the EAL sits in the DPDK architecture:

[Figure: position of the EAL in the DPDK architecture]

Your greatness is measured by your horizons.