Problem: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver

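  After rebooting an Ubuntu system, running nvidia-smi to check GPU usage sometimes fails with "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver". A likely cause is a kernel update: the newly booted kernel no longer matches the GPU driver that was installed for the previous kernel.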
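A quick way to test the kernel/driver mismatch theory is sketched below. It assumes the NVIDIA driver was installed through DKMS, which may not be true on every machine; the helper name nvidia_built_for is hypothetical.

```shell
# Sketch: succeed if `dkms status` reports an installed nvidia module
# for the given kernel release.
# stdin: output of `dkms status`; $1: kernel release, e.g. "$(uname -r)"
nvidia_built_for() {
    grep -E '^nvidia' | grep -F -- ", $1," | grep -q 'installed'
}

# Usage on a real system (assumes dkms is present):
#   dkms status | nvidia_built_for "$(uname -r)" \
#       || echo "no nvidia module built for the running kernel"
```

If the check fails for the running kernel but succeeds for an older one, booting that older kernel (as described below) is a reasonable next step.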

Linux Kernel

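The kernel is the lowest level of easily replaceable software that interfaces with the computer's hardware. It connects all applications running in "user mode" to the physical hardware, and it lets processes known as servers exchange information with each other through inter-process communication (IPC).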


 

 

Solution: switch back to the previous kernel version

1. List the installed kernels

sudo dpkg --get-selections |grep linux-image
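The listing mixes packages in the install and deinstall states. As a small sketch (the helper name installed_kernel_images is hypothetical), the output can be filtered down to the kernel image packages whose selection state is install:

```shell
# Print kernel image packages whose selection state is "install".
# stdin: output of `dpkg --get-selections`
installed_kernel_images() {
    awk '$1 ~ /^linux-image-/ && $2 == "install" { print $1 }'
}

# Usage on a real system:
#   dpkg --get-selections | installed_kernel_images
```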


 

2. Check the kernel currently in use

->uname

uname -r

ubuntu查看是否支持kvm虚拟化 ubuntu查看kernel版本_ci_03

Or:

->/proc/version

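The /proc directory contains virtual files that hold information about system memory, CPU cores, mounted filesystems, and so on. Information about the running kernel is stored in the /proc/version virtual file.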

cat /proc/version
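The two methods should agree. As a minimal sketch (the helper name kernel_release_from_version is hypothetical), the kernel release is the third whitespace-separated field of the /proc/version line:

```shell
# Extract the kernel release from a line in /proc/version format.
# $1: the line, e.g. "$(cat /proc/version)"
kernel_release_from_version() {
    printf '%s\n' "$1" | awk '{ print $3 }'
}

# Usage on a real system; the result should equal the output of `uname -r`:
#   kernel_release_from_version "$(cat /proc/version)"
```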


 

3. Remove a kernel

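Tip: after you remove the current kernel, the next reboot uses the next-lower installed kernel. If you remove the last installed kernel, the system has nothing left to boot and the reboot lands in the BIOS screen.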

sudo apt-get remove linux-image-5.15.0-52-generic
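Because removing the last kernel leaves the machine unbootable, it can help to confirm beforehand that another kernel version stays installed. The following is a sketch only; the helper name another_kernel_remains is hypothetical, and it treats the signed and unsigned packages of one version as the same kernel.

```shell
# Succeed only if some kernel image of a different version is still
# selected for installation.
# stdin: output of `dpkg --get-selections`
# $1: kernel version about to be removed, e.g. 5.15.0-52-generic
another_kernel_remains() {
    awk -v ver="$1" '
        $2 == "install" && $1 ~ /^linux-image-/ {
            v = $1
            sub(/^linux-image-(unsigned-)?/, "", v)
            # Count only versioned packages, not meta-packages.
            if (v ~ /^[0-9]/ && v != ver) n++
        }
        END { exit (n > 0 ? 0 : 1) }
    '
}

# Usage on a real system:
#   dpkg --get-selections | another_kernel_remains 5.15.0-52-generic \
#       && sudo apt-get remove linux-image-5.15.0-52-generic
```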


Then clean up automatically with:

sudo apt autoremove


If you list the kernels again now, linux-image-5.15.0-52-generic shows the state deinstall:


 

Note (an additional case)

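Sometimes you must also remove the matching unsigned package in the same command. Otherwise apt installs one in place of the other, and you cannot switch to the older kernel you want: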

sudo apt remove linux-image-5.15.0-53-generic linux-image-unsigned-5.15.0-53-generic

The transcript below shows why (here the kernel I want to keep is linux-image-5.15.0-52-generic, unlike the example above):

mulan@mulan-PowerEdge-R7525:~$ sudo dpkg --get-selections |grep linux-image
linux-image-5.15.0-52-generic                   install
linux-image-5.15.0-53-generic                   install
linux-image-5.8.0-43-generic                    deinstall
linux-image-unsigned-5.15.0-53-generic          deinstall
mulan@mulan-PowerEdge-R7525:~$ sudo apt remove linux-image-5.15.0-53-generic
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
  linux-image-unsigned-5.15.0-53-generic
Suggested packages:
  fdutils linux-doc | linux-hwe-5.15-source-5.15.0 linux-hwe-5.15-tools linux-modules-extra-5.15.0-53-generic
The following packages will be REMOVED:
  linux-image-5.15.0-53-generic
The following NEW packages will be installed:
  linux-image-unsigned-5.15.0-53-generic
0 upgraded, 1 newly installed, 1 to remove and 185 not upgraded.
Need to get 0 B/11.6 MB of archives.
After this operation, 447 kB of additional disk space will be used.
Do you want to continue? [Y/n] y
dpkg: linux-image-5.15.0-53-generic: dependency problems, but removing anyway as you requested:
 linux-modules-5.15.0-53-generic depends on linux-image-5.15.0-53-generic | linux-image-unsigned-5.15.0-53-generic; however:
  Package linux-image-5.15.0-53-generic is to be removed.
  Package linux-image-unsigned-5.15.0-53-generic is not installed.

(Reading database ... 165303 files and directories currently installed.)
Removing linux-image-5.15.0-53-generic (5.15.0-53.59~20.04.1) ...
W: Removing the running kernel
I: /boot/vmlinuz is now a symlink to vmlinuz-5.15.0-52-generic
I: /boot/initrd.img is now a symlink to initrd.img-5.15.0-52-generic
/etc/kernel/postrm.d/initramfs-tools:
update-initramfs: Deleting /boot/initrd.img-5.15.0-53-generic
/etc/kernel/postrm.d/zz-update-grub:
Sourcing file `/etc/default/grub'
Sourcing file `/etc/default/grub.d/init-select.cfg'
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-5.15.0-52-generic
Found initrd image: /boot/initrd.img-5.15.0-52-generic
done
Selecting previously unselected package linux-image-unsigned-5.15.0-53-generic.
(Reading database ... 165299 files and directories currently installed.)
Preparing to unpack .../linux-image-unsigned-5.15.0-53-generic_5.15.0-53.59~20.04.1_amd64.deb ...
Unpacking linux-image-unsigned-5.15.0-53-generic (5.15.0-53.59~20.04.1) ...
Setting up linux-image-unsigned-5.15.0-53-generic (5.15.0-53.59~20.04.1) ...
I: /boot/vmlinuz is now a symlink to vmlinuz-5.15.0-53-generic
I: /boot/initrd.img is now a symlink to initrd.img-5.15.0-53-generic
Processing triggers for linux-image-unsigned-5.15.0-53-generic (5.15.0-53.59~20.04.1) ...
/etc/kernel/postinst.d/initramfs-tools:
update-initramfs: Generating /boot/initrd.img-5.15.0-53-generic
/etc/kernel/postinst.d/zz-update-grub:
Sourcing file `/etc/default/grub'
Sourcing file `/etc/default/grub.d/init-select.cfg'
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-5.15.0-53-generic
Found initrd image: /boot/initrd.img-5.15.0-53-generic
Found linux image: /boot/vmlinuz-5.15.0-52-generic
Found initrd image: /boot/initrd.img-5.15.0-52-generic
done
mulan@mulan-PowerEdge-R7525:~$ sudo dpkg --get-selections |grep linux-image
linux-image-5.15.0-52-generic                   install
linux-image-5.15.0-53-generic                   deinstall
linux-image-5.8.0-43-generic                    deinstall
linux-image-unsigned-5.15.0-53-generic          install
mulan@mulan-PowerEdge-R7525:~$ vim /etc/default/grub.d/init-select.cfg
mulan@mulan-PowerEdge-R7525:~$ sudo apt remove linux-image-5.15.0-53-generic linux-image-unsigned-5.15.0-53-generic
Reading package lists... Done
Building dependency tree
Reading state information... Done
Package 'linux-image-5.15.0-53-generic' is not installed, so not removed
The following packages will be REMOVED:
  linux-image-unsigned-5.15.0-53-generic linux-modules-5.15.0-53-generic
0 upgraded, 0 newly installed, 2 to remove and 185 not upgraded.
After this operation, 130 MB disk space will be freed.
Do you want to continue? [Y/n] y
(Reading database ... 165303 files and directories currently installed.)
Removing linux-image-unsigned-5.15.0-53-generic (5.15.0-53.59~20.04.1) ...
W: Removing the running kernel
I: /boot/vmlinuz is now a symlink to vmlinuz-5.15.0-52-generic
I: /boot/initrd.img is now a symlink to initrd.img-5.15.0-52-generic
/etc/kernel/postrm.d/initramfs-tools:
update-initramfs: Deleting /boot/initrd.img-5.15.0-53-generic
/etc/kernel/postrm.d/zz-update-grub:
Sourcing file `/etc/default/grub'
Sourcing file `/etc/default/grub.d/init-select.cfg'
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-5.15.0-52-generic
Found initrd image: /boot/initrd.img-5.15.0-52-generic
done
Removing linux-modules-5.15.0-53-generic (5.15.0-53.59~20.04.1) ...
mulan@mulan-PowerEdge-R7525:~$ sudo dpkg --get-selections |grep linux-image
linux-image-5.15.0-52-generic                   install
linux-image-5.15.0-53-generic                   deinstall
linux-image-5.8.0-43-generic                    deinstall
linux-image-unsigned-5.15.0-53-generic          deinstall
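In the transcript, removing only the signed package caused apt to install the unsigned one for the same version. A sketch of collecting both variants up front so they can be removed together (the helper name image_packages_for is hypothetical):

```shell
# Print every installed image package (signed or unsigned) for one version.
# stdin: output of `dpkg --get-selections`
# $1: kernel version, e.g. 5.15.0-53-generic
image_packages_for() {
    awk -v ver="$1" '
        $2 == "install" &&
        ($1 == ("linux-image-" ver) || $1 == ("linux-image-unsigned-" ver)) {
            print $1
        }
    '
}

# Usage on a real system: inspect the list first, then pass it to apt.
#   dpkg --get-selections | image_packages_for 5.15.0-53-generic
```

Naming both packages explicitly in one apt remove command, as the transcript eventually does, has the same effect even when only one of them is currently installed.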

  

4. Check the kernel boot order

grep menuentry /boot/grub/grub.cfg
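The raw grep output includes long option strings after each title. As a sketch (the helper name grub_entry_titles is hypothetical, and it assumes titles are single-quoted as in a grub.cfg generated by update-grub), the titles alone can be extracted in boot order:

```shell
# Print the titles of menuentry and submenu lines, in file order.
# stdin: contents of a grub.cfg file
grub_entry_titles() {
    grep -E "^[[:space:]]*(menuentry|submenu) " \
        | sed -E "s/^[^']*'([^']*)'.*/\1/"
}

# Usage on a real system:
#   grub_entry_titles < /boot/grub/grub.cfg
```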


 

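Note: if you accidentally break the configuration, the machine may reboot into the "Minimal BASH-like line editing" prompt. Entering the following command there brings up the graphical boot menu: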

grub> normal

 

5. Set the default boot kernel on Ubuntu
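A common way to do this, offered as an assumption rather than as the exact steps used here, is to set GRUB_DEFAULT in /etc/default/grub to a "submenu title>entry title" path built from the menuentry titles in grub.cfg, then run sudo update-grub. The helper name grub_default_line is hypothetical, and the submenu title "Advanced options for Ubuntu" is the usual Ubuntu default that should be checked against your own grub.cfg.

```shell
# Build a GRUB_DEFAULT line that selects one kernel inside the
# "Advanced options for Ubuntu" submenu (title assumed, verify locally).
# $1: kernel version, e.g. 5.15.0-52-generic
grub_default_line() {
    printf 'GRUB_DEFAULT="Advanced options for Ubuntu>Ubuntu, with Linux %s"\n' "$1"
}

# Usage on a real system:
#   grub_default_line 5.15.0-52-generic   # copy the line into /etc/default/grub
#   sudo update-grub                      # regenerate /boot/grub/grub.cfg
```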


 


 

6. Disable automatic updates on Ubuntu

1) Disable automatic system updates from the command line: open the file below in an editor, change every "1" inside double quotes to "0", and save:

mulan@mulan-PowerEdge-R7525:~$ sudo vim /etc/apt/apt.conf.d/10periodic
APT::Periodic::Update-Package-Lists "0";
APT::Periodic::Download-Upgradeable-Packages "0";
APT::Periodic::AutocleanInterval "0";
APT::Periodic::Unattended-Upgrade "0";

2) Disable automatic updates through the graphical interface: open Software & Updates.


 

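 3) Ubuntu updates the kernel automatically by default. To avoid rebooting into an error that locks you out of the system, you can additionally hold the kernel package so that the current kernel stays in use.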

sudo apt-mark hold linux-image-5.15.0-48-generic


To re-enable kernel updates, run the corresponding apt-mark unhold command.

On disabling system updates: the second method alone sometimes turned out not to take effect, so to be safe, apply all three settings above.
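Since one setting on its own may not stick, it can be worth verifying the result. The sketch below checks the periodic settings from method 1 and the hold from method 3; the helper names periodic_all_disabled and is_held are hypothetical.

```shell
# Succeed if every APT::Periodic::* setting in the input is "0".
# stdin: contents of an apt configuration file
periodic_all_disabled() {
    ! grep -E '^[[:space:]]*APT::Periodic::' | grep -qv '"0";'
}

# Succeed if the given package is listed as held.
# stdin: output of `apt-mark showhold`; $1: package name
is_held() {
    grep -qxF -- "$1"
}

# Usage on a real system:
#   periodic_all_disabled < /etc/apt/apt.conf.d/10periodic && echo "periodic updates off"
#   apt-mark showhold | is_held linux-image-5.15.0-48-generic && echo "kernel held"
```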

 

Reference: Fixing "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver"

 

 

  

 
