One of the GPUs in our GPU server crashed:

May 16 05:38:58 dell kernel: [14244871.006970] NVRM: Xid (PCI:0000:b1:00): 13, pid=1375637, Graphics SM Warp Exception on (GPC 0, TPC 0, SM 0): Illegal Instruction Encoding
May 16 05:38:58 dell kernel: [14244871.010256] NVRM: Xid (PCI:0000:b1:00): 13, pid=1375637, Graphics Exception: ESR 0x504730=0x30009 0x504734=0x0 0x504728=0x4c1eb72 0x50472c=0x174
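
To see which GPU is affected and how often, it can help to pull the PCI address and Xid code out of log lines like the ones above. The following is only a rough sketch: the function name `xid_summary` is made up for illustration, and it assumes the kernel log uses exactly the `NVRM: Xid (PCI:...): <code>,` layout shown here.

```shell
# Read kernel log lines on stdin and print "<count> <pci-address> <xid-code>"
# for every NVRM Xid message found. xid_summary is a hypothetical helper name.
xid_summary() {
    sed -n 's/.*NVRM: Xid (PCI:\([0-9a-fA-F:.]*\)): \([0-9]*\),.*/\1 \2/p' \
        | sort \
        | uniq -c
}

# Possible usage (needs permission to read the kernel log):
#   sudo dmesg | xid_summary
#   grep NVRM /var/log/kern.log | xid_summary
```

The PCI address it prints can then be matched against the bus IDs that `nvidia-smi` reports.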
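
My guess is that GPU overheating caused this, although Xid 13 (Graphics Engine Exception) can also be triggered by a faulty application. One workaround I found: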

sudo nvidia-smi -pl 150    # lower the power limit from the default 250 W to 150 W
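
The command above applies to every GPU in the machine. A minimal sketch for capping only the card that crashed, and for checking its temperature and power readings first, might look like this. The helper names `run` and `limit_gpu_power` and the `DRY_RUN` switch are made up for illustration, and the bus ID `0000:b1:00.0` assumes the GPU is function `.0` of the PCI address in the log.

```shell
# With DRY_RUN=1 the commands are printed rather than executed
# (run and limit_gpu_power are hypothetical helper names).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Show temperature and power readings for one GPU, then cap its power limit.
# $1 = GPU index, UUID, or PCI bus ID; $2 = new limit in watts.
limit_gpu_power() {
    local gpu="$1" watts="$2"
    run nvidia-smi -i "$gpu" \
        --query-gpu=temperature.gpu,power.draw,power.limit,power.min_limit,power.max_limit \
        --format=csv
    run sudo nvidia-smi -i "$gpu" -pl "$watts"
}

# Preview what would be executed, without touching the GPU:
#   DRY_RUN=1 limit_gpu_power 0000:b1:00.0 150
```

The requested value has to lie between `power.min_limit` and `power.max_limit`, and the setting does not survive a reboot, so it would need to be reapplied at startup.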

Reference:

[Misc notes] GPU missing from nvidia-smi and GPU Fan showing ERR! (original title: [杂记] Nvidia-smi显卡丢失以及GPU Fan显示ERR!)