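Step 1: Configure the host

Virtualization relies on the IOMMU to pass a physical device through to a VM. On Linux, the device is handed to the guest through VFIO (the vfio-pci driver).

OS: Ubuntu 20.04 LTS

GPU: NVIDIA Corporation TU104

Prerequisites: VT-d must be enabled in the BIOS, the host must give up the GPU, and every device in the GPU's IOMMU group must be bound to the vfio-pci driver together.

  1. Install the packages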
apt install qemu-kvm qemu-utils libvirt-clients bridge-utils ovmf -y
  1. Edit /etc/default/grub
GRUB_DEFAULT=0
GRUB_TIMEOUT_STYLE=hidden
GRUB_TIMEOUT=0
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
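GRUB_CMDLINE_LINUX=""
GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on iommu=pt kvm.ignore_msrs=1 vfio-pci.ids=10de:1e84,10de:10f8,10de:1ad8,10de:1ad9"

The vfio-pci.ids value is a list of vendor:device IDs (the last bracketed pair on each device line, e.g. 10de:1e84), not bus addresses such as 01:00.0. The IDs come from the following command: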

lspci -nnv |grep -i nvidia
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU104 [GeForce RTX 2070 SUPER] [10de:1e84] (rev a1) (prog-if 00 [VGA controller])
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
01:00.1 Audio device [0403]: NVIDIA Corporation TU104 HD Audio Controller [10de:10f8] (rev a1)
01:00.2 USB controller [0c03]: NVIDIA Corporation TU104 USB 3.1 Host Controller [10de:1ad8] (rev a1) (prog-if 30 [XHCI])
01:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU104 USB Type-C UCSI Controller [10de:1ad9] (rev a1)
Kernel modules: i2c_nvidia_gpu
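The ID list can be generated instead of copied by hand. The following is a minimal sketch that assumes lspci -nn output in the format shown above; extract_pci_ids is a hypothetical helper name, not part of any tool.

```shell
#!/bin/bash
# Sketch: build a vfio-pci.ids value from `lspci -nn` output read on stdin.
# extract_pci_ids is a hypothetical helper name used for illustration only.
extract_pci_ids() {
    # Match only [vvvv:dddd] pairs (class codes such as [0300] have no colon),
    # strip the brackets, drop duplicates, and join with commas.
    grep -oE '\[[0-9a-f]{4}:[0-9a-f]{4}\]' \
        | tr -d '[]' \
        | awk '!seen[$0]++' \
        | paste -sd, -
}

# Typical use on the host (01:00 selects every function of the GPU):
#   lspci -nn -s 01:00 | extract_pci_ids
```

Use lspci -nn without -v here: the verbose output adds Subsystem lines whose bracketed IDs would also match.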

Check whether these devices belong to the same IOMMU group:

bash iommu.sh

Contents of iommu.sh:

#!/bin/bash
# List every PCI device together with its IOMMU group.
# Change the upper bound (17) to the highest group number on your host.
shopt -s nullglob
for d in /sys/kernel/iommu_groups/{0..17}/devices/*; do
    n=${d#*/iommu_groups/*}; n=${n%%/*}
    printf 'IOMMU Group %s ' "$n"
    lspci -nns "${d##*/}"
done

How do you find the number of IOMMU groups? Look for the highest group number in the kernel log:

dmesg -T|grep -i iommu
(venv) root@openstack-ubuntu:~# dmesg -T|grep -i iommu
[二 11月 23 12:47:16 2021] Command line: BOOT_IMAGE=/boot/vmlinuz-5.11.0-40-generic root=/dev/mapper/vgubuntu-root ro intel_iommu=on iommu=pt kvm.ignore_msrs=1 vfio-pci.ids=01:00.0,01:00.1
[二 11月 23 12:47:16 2021] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.11.0-40-generic root=/dev/mapper/vgubuntu-root ro intel_iommu=on iommu=pt kvm.ignore_msrs=1 vfio-pci.ids=01:00.0,01:00.1
[二 11月 23 12:47:16 2021] DMAR: IOMMU enabled
[二 11月 23 12:47:16 2021] DMAR-IR: IOAPIC id 2 under DRHD base 0xfed91000 IOMMU 1
[二 11月 23 12:47:16 2021] iommu: Default domain type: Passthrough (set via kernel command line)
[二 11月 23 12:47:16 2021] pci 0000:00:00.0: Adding to iommu group 0
[二 11月 23 12:47:16 2021] pci 0000:00:01.0: Adding to iommu group 1
[二 11月 23 12:47:16 2021] pci 0000:00:02.0: Adding to iommu group 2
[二 11月 23 12:47:16 2021] pci 0000:00:14.0: Adding to iommu group 3
[二 11月 23 12:47:16 2021] pci 0000:00:14.2: Adding to iommu group 3
[二 11月 23 12:47:16 2021] pci 0000:00:15.0: Adding to iommu group 4
[二 11月 23 12:47:16 2021] pci 0000:00:15.1: Adding to iommu group 4
[二 11月 23 12:47:16 2021] pci 0000:00:16.0: Adding to iommu group 5
[二 11月 23 12:47:16 2021] pci 0000:00:17.0: Adding to iommu group 6
[二 11月 23 12:47:16 2021] pci 0000:00:1b.0: Adding to iommu group 7
[二 11月 23 12:47:16 2021] pci 0000:00:1c.0: Adding to iommu group 8
[二 11月 23 12:47:16 2021] pci 0000:00:1c.2: Adding to iommu group 9
[二 11月 23 12:47:16 2021] pci 0000:00:1c.3: Adding to iommu group 10
[二 11月 23 12:47:16 2021] pci 0000:00:1c.4: Adding to iommu group 11
[二 11月 23 12:47:16 2021] pci 0000:00:1d.0: Adding to iommu group 12
[二 11月 23 12:47:16 2021] pci 0000:00:1f.0: Adding to iommu group 13
[二 11月 23 12:47:16 2021] pci 0000:00:1f.3: Adding to iommu group 13
[二 11月 23 12:47:16 2021] pci 0000:00:1f.4: Adding to iommu group 13
[二 11月 23 12:47:16 2021] pci 0000:00:1f.5: Adding to iommu group 13
[二 11月 23 12:47:16 2021] pci 0000:01:00.0: Adding to iommu group 1
[二 11月 23 12:47:16 2021] pci 0000:01:00.1: Adding to iommu group 1
[二 11月 23 12:47:16 2021] pci 0000:01:00.2: Adding to iommu group 1
[二 11月 23 12:47:16 2021] pci 0000:01:00.3: Adding to iommu group 1
[二 11月 23 12:47:16 2021] pci 0000:02:00.0: Adding to iommu group 14
[二 11月 23 12:47:16 2021] pci 0000:04:00.0: Adding to iommu group 15
[二 11月 23 12:47:16 2021] pci 0000:05:00.0: Adding to iommu group 16
[二 11月 23 12:47:16 2021] pci 0000:06:00.0: Adding to iommu group 17
[二 11月 23 12:47:17 2021] intel_iommu=on

In this log, devices 01:00.0 to 01:00.3 all land in group 1, together with the root port 00:01.0. At this point lspci still reports the nvidia driver for the GPU:

root@openstack-ubuntu:/opt/images/packer_tutorial/centos-vanilla# lspci -nnv -s 01:00.0
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU104 [GeForce RTX 2070 SUPER] [10de:1e84] (rev a1) (prog-if 00 [VGA controller])
Subsystem: Gigabyte Technology Co., Ltd TU104 [GeForce RTX 2070 SUPER] [1458:4001]
Flags: bus master, fast devsel, latency 0, IRQ 16
Memory at a4000000 (32-bit, non-prefetchable) [size=16M]
Memory at 90000000 (64-bit, prefetchable) [size=256M]
Memory at a0000000 (64-bit, prefetchable) [size=32M]
I/O ports at 5000 [size=128]
Expansion ROM at a5000000 [virtual] [disabled] [size=512K]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Legacy Endpoint, MSI 00
Capabilities: [100] Virtual Channel
Capabilities: [250] Latency Tolerance Reporting
Capabilities: [258] L1 PM Substates
Capabilities: [128] Power Budgeting <?>
Capabilities: [420] Advanced Error Reporting
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900] Secondary PCI Express
Capabilities: [bb0] Resizable BAR <?>
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
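
The group count can also be read from sysfs instead of the kernel log. A minimal sketch, assuming the standard /sys/kernel/iommu_groups layout; the function names and the IOMMU_ROOT variable are hypothetical and exist only for this example.

```shell
#!/bin/bash
# IOMMU_ROOT can be overridden to try the functions on a copy of the tree.
IOMMU_ROOT=${IOMMU_ROOT:-/sys/kernel/iommu_groups}

# Print the number of IOMMU groups (one directory per group).
count_iommu_groups() {
    find "$IOMMU_ROOT" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' '
}

# Print the PCI addresses that share the given group number.
devices_in_group() {
    ls "$IOMMU_ROOT/$1/devices"
}

# Typical use on the host:
#   count_iommu_groups
#   devices_in_group 1
```

Groups are numbered from 0, so on a host where the numbering is contiguous the upper bound used in iommu.sh is the count minus one.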
  1. Update GRUB
update-grub
  1. Reboot
  2. Bind the devices to the vfio-pci driver by PCI bus address (vfio.sh has to be created; the script below follows examples published on other sites)
vim /etc/initramfs-tools/scripts/init-top/vfio.sh
#!/bin/sh

PREREQ=""

prereqs()
{
    echo "$PREREQ"
}

case $1 in
prereqs)
    prereqs
    exit 0
    ;;
esac

# Force every device in the GPU's IOMMU group onto vfio-pci.
for dev in 0000:01:00.0 0000:01:00.1 0000:01:00.2 0000:01:00.3
do
    echo "vfio-pci" > /sys/bus/pci/devices/$dev/driver_override
    echo "$dev" > /sys/bus/pci/drivers/vfio-pci/bind
done

exit 0
  1. Make vfio.sh executable
chmod +x /etc/initramfs-tools/scripts/init-top/vfio.sh
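  1. Add the following line to /etc/initramfs-tools/modules
options kvm ignore_msrs=1

Note: this file uses the syntax "module_name [args ...]", so a line in modprobe "options" form is normally placed in a file under /etc/modprobe.d/ instead. In any case, kvm.ignore_msrs=1 is already set on the kernel command line in /etc/default/grub.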
  1. Blacklist the host drivers in /etc/modprobe.d/blacklist.conf
blacklist snd_hda_intel
blacklist vga16fb
blacklist rivafb
blacklist nvidiafb
blacklist rivatv
  1. Edit /etc/modprobe.d/vfio.conf
options vfio-pci ids=10de:1e84,10de:10f8,10de:1ad9,10de:1ad8

The IDs above come from the output of:

lspci -nnv |grep -i nvidia
  1. Rebuild the initramfs
update-initramfs -u -k all
  1. Reboot the operating system
  2. Verify that the devices now use the vfio-pci driver
lspci -nnv

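Each of the four devices should report "Kernel driver in use: vfio-pci".

The bound devices also appear as entries under /sys/bus/pci/drivers/vfio-pci/.

Two device nodes appear under /dev/vfio as well: the control node /dev/vfio/vfio and one node named after the IOMMU group (here /dev/vfio/1).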
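
The driver check can be scripted rather than read by eye. The sketch below assumes the lspci -k output format, where each device line starts in column 0 and its detail lines are indented; not_on_vfio is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: read `lspci -k` output on stdin and print the address of every
# device whose "Kernel driver in use" is not vfio-pci (or that has no driver).
not_on_vfio() {
    awk '
        # A line starting with a hex digit opens a new device record.
        /^[0-9a-f]/ {
            if (addr != "" && drv != "vfio-pci") print addr
            addr = $1; drv = ""
        }
        /Kernel driver in use:/ { drv = $NF }
        END { if (addr != "" && drv != "vfio-pci") print addr }
    '
}

# Typical use on the host; no output means every function is on vfio-pci:
#   lspci -nnk -s 01:00 | not_on_vfio
```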

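Troubleshooting

Problem 1: the GPU is in use

Symptom: nvidia-smi lists running processes, and unbinding the device hangs:

echo "0000:01:00.0" >/sys/bus/pci/drivers/nvidia/unbind

Fix: kill the processes that are using the GPU.

If the GPU is bound to the nvidia driver, check whether any process is using it: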

nvidia-smi
Tue Nov 23 13:17:53 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.00 Driver Version: 470.82.00 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
| 30% 36C P0 43W / 235W | 0MiB / 7982MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

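This output shows that no process is using the GPU. If any are listed, kill them first. Then unbind the device from the nvidia driver:

echo -n "0000:01:00.0" >/sys/bus/pci/drivers/nvidia/unbind

Set the driver override for device 01:00.0 to vfio-pci:

echo "vfio-pci" >/sys/bus/pci/devices/0000\:01\:00.0/driver_override

Bind the device to the vfio-pci driver:

echo -n "0000:01:00.0" >/sys/bus/pci/drivers/vfio-pci/bind

Verify that it worked: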

lspci -nnv -s 01:00.0
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU104 [GeForce RTX 2070 SUPER] [10de:1e84] (rev a1) (prog-if 00 [VGA controller])
Subsystem: Gigabyte Technology Co., Ltd TU104 [GeForce RTX 2070 SUPER] [1458:4001]
Flags: fast devsel, IRQ 16
Memory at a4000000 (32-bit, non-prefetchable) [size=16M]
Memory at 90000000 (64-bit, prefetchable) [size=256M]
Memory at a0000000 (64-bit, prefetchable) [size=32M]
I/O ports at 5000 [size=128]
Expansion ROM at a5000000 [virtual] [disabled] [size=512K]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Legacy Endpoint, MSI 00
Capabilities: [100] Virtual Channel
Capabilities: [250] Latency Tolerance Reporting
Capabilities: [258] L1 PM Substates
Capabilities: [128] Power Budgeting <?>
Capabilities: [420] Advanced Error Reporting
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900] Secondary PCI Express
Capabilities: [bb0] Resizable BAR <?>
Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
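
The manual rebinding can be wrapped in a small function so that it is repeatable for each function of the card. This is a sketch under stated assumptions: the vfio-pci module is already loaded, the script runs as root, and rebind_to_vfio and PCI_ROOT are hypothetical names introduced here.

```shell
#!/bin/bash
# PCI_ROOT can be overridden to dry-run against a fake directory tree.
PCI_ROOT=${PCI_ROOT:-/sys/bus/pci}

# Move one PCI device to vfio-pci: unbind, set the override, then bind.
rebind_to_vfio() {
    local dev=$1
    local devdir="$PCI_ROOT/devices/$dev"
    # Detach from the current driver, if the device has one.
    if [ -e "$devdir/driver/unbind" ]; then
        printf '%s' "$dev" > "$devdir/driver/unbind"
    fi
    # Pin the device to vfio-pci before asking that driver to claim it.
    printf '%s' "vfio-pci" > "$devdir/driver_override"
    printf '%s' "$dev" > "$PCI_ROOT/drivers/vfio-pci/bind"
}

# Typical use on the host, once no process is using the GPU:
#   for d in 0000:01:00.0 0000:01:00.1 0000:01:00.2 0000:01:00.3; do
#       rebind_to_vfio "$d"
#   done
```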

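Problem 2: the group is not viable

vfio 0000:01:00.0: group 1 is not viable

The cause is that other devices in group 1 (01:00.*) are not bound to vfio-pci. To fix it, inspect each remaining device, for example:

lspci -nnv -s 01:00.1

Find the driver in use (here snd_hda_intel), unbind the device from it, and bind the device to vfio-pci. The sysfs files expect the full address including the PCI domain:

echo "0000:01:00.1" >/sys/bus/pci/drivers/snd_hda_intel/unbind
echo "0000:01:00.1" >/sys/bus/pci/drivers/vfio-pci/bind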
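
To see at a glance which members of a group still hold a host driver, the group's device links in sysfs can be walked. A minimal sketch, assuming the standard sysfs layout; group_drivers and the SYSFS variable are hypothetical names for this example.

```shell
#!/bin/bash
# SYSFS can be overridden to try the function on a fake tree.
SYSFS=${SYSFS:-/sys}

# Print "<address> <driver>" for every device in the given IOMMU group.
group_drivers() {
    local d link
    for d in "$SYSFS/kernel/iommu_groups/$1/devices"/*; do
        [ -e "$d" ] || continue          # group missing or empty
        link="$d/driver"
        if [ -L "$link" ]; then
            printf '%s %s\n' "${d##*/}" "$(basename "$(readlink "$link")")"
        else
            printf '%s none\n' "${d##*/}"
        fi
    done
}

# Typical use on the host:
#   group_drivers 1
```

On this host the output would also include the root port 00:01.0. PCI bridges usually do not need to be rebound, so the lines to look at are the 01:00.* endpoints.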

Step 2: Configure OpenStack

  1. Add the following to /etc/kolla/config/nova.conf:
[filter_scheduler]
enabled_filters = AvailabilityZoneFilter, ComputeFilter, ComputeCapabilitiesFilter, ImagePropertiesFilter, ServerGroupAntiAffinityFilter, ServerGroupAffinityFilter, PciPassthroughFilter
available_filters = nova.scheduler.filters.all_filters

[pci]
alias = { "vendor_id":"10de", "product_id":"1e84", "name":"a1" }
passthrough_whitelist = { "vendor_id":"10de", "product_id":"1e84" }
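
Both [pci] lines carry the same vendor and product IDs, so they can be derived from a single vendor:device pair as reported by lspci. The sketch below only prints text in the format shown above; pci_section is a hypothetical helper, and the alias name is whatever you choose.

```shell
#!/bin/bash
# Sketch: print a [pci] section for one device ID given as vendor:product.
pci_section() {
    local id=$1 name=$2
    local vendor="${id%%:*}"     # part before the colon, e.g. 10de
    local product="${id##*:}"    # part after the colon, e.g. 1e84
    printf '[pci]\n'
    printf 'alias = { "vendor_id":"%s", "product_id":"%s", "name":"%s" }\n' \
        "$vendor" "$product" "$name"
    printf 'passthrough_whitelist = { "vendor_id":"%s", "product_id":"%s" }\n' \
        "$vendor" "$product"
}

# Example:
#   pci_section 10de:1e84 a1
```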

The name parameter is user-defined.

vendor_id and product_id are the IDs shown by lspci -nnv -s 01:00.0.

  1. Apply the nova.conf change
kolla-ansible -i all-in-one reconfigure -t nova
  1. Create a flavor with the metadata pci_passthrough:alias='a1:1' (alias a1, one device)
openstack flavor create --ram 2048 --disk 10 --vcpu 2 gpu
openstack flavor set gpu --property pci_passthrough:alias='a1:1'
  1. Boot a VM to verify the passthrough (using the official CentOS 7 image)

Inside the VM, run lspci and check that the GPU is listed.