This article uses a machine with a TESLA P100 GPU as the example.
GPU model: TESLA P100
Operating system: ALIOS 7U2
Kernel version: Linux artisf41sna11139225101.k2.na61 4.9.168-016.ali3000.alios7.x86_64 #1 SMP Fri Jun 21 22:41:58 CST 2019 x86_64 x86_64 x86_64 GNU/Linux
Kernel compiler: GCC 6.5
Default compiler: GCC 4.9.2
Current GPU driver: NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1
Installation
    sudo yum -y install -b current nvidia-diag-driver-local-repo-rhel7-418.40.04-1.0-1.x86_64
    sudo yum -y install -b current cuda-repo-rhel7-10-0-local-10.0.130-410.48-1.0-1.x86_64
    sudo yum -y install cuda-drivers
    sudo yum -y install cuda

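Problem encountered
- After the installation finished, /proc/driver/nvidia/version still reported the old version (418.39):
  NVRM version: NVIDIA UNIX x86_64 Kernel Module  418.39  Sat Feb  9 19:19:37 CST 2019
- Rebooting the machine did not help; the version was still 418.39.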
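
The symptom above can be checked mechanically instead of by eye. The following is a minimal sketch, assuming the target version is 418.40.04 (taken from the package name in the installation step); `nvrm_version` and `check_loaded_driver` are hypothetical helper names, not part of any NVIDIA tool.

```shell
#!/bin/sh
# Print the driver version from an "NVRM version:" line read on stdin.
nvrm_version() {
    sed -n 's/^NVRM version:.*Kernel Module  *\([0-9][0-9.]*\).*/\1/p'
}

# Compare the loaded driver version with the version that was meant to be installed.
# $1: expected version (an assumption; set it to the packaged driver version)
# $2: file to read, defaults to /proc/driver/nvidia/version
check_loaded_driver() {
    expected=$1
    file=${2:-/proc/driver/nvidia/version}
    loaded=$(nvrm_version < "$file")
    if [ "$loaded" = "$expected" ]; then
        echo "OK: loaded $loaded"
    else
        echo "MISMATCH: loaded ${loaded:-none}, expected $expected"
        return 1
    fi
}

# Run only on machines where the NVIDIA proc entry exists.
if [ -r /proc/driver/nvidia/version ]; then
    check_loaded_driver 418.40.04 || true
fi
```

Running this right after the package installation and again after a reboot shows whether the kernel is still loading the old module.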
Troubleshooting
    lspci | grep NVIDIA
    86:00.0 3D controller: NVIDIA Corporation Device 15f8 (rev a1)

    lsmod | grep nvidia
    nvidia_uvm
    nvidia_drm
    nvidia_modeset
    nvidia

    ls -lh /lib/modules/4.9.168-016.ali3000.alios7.x86_64/kernel/drivers/video
    nvidia-drm.ko
    nvidia.ko
    nvidia-modeset.ko
    nvidia-uvm.ko

    # Version string embedded in nvidia.ko:
    nvidia id: NVIDIA UNIX x86_64 Kernel Module  418.39  Sat Feb  9 19:19:37 CST 2019

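This confirms the cause: the nvidia.ko in the kernel's video driver directory is still version 418.39, so the system keeps loading that module automatically after every reboot.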
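
To see at a glance where NVIDIA modules live on disk, the directory listing can be generalized into a small search. This is a sketch only; `list_nvidia_modules` and `nvidia_module_dirs` are hypothetical helper names, and the default root assumes the standard /lib/modules/<release> layout.

```shell
#!/bin/sh
# List every nvidia*.ko under a module tree, one path per line, sorted.
# $1: module root, defaults to the running kernel's /lib/modules/<release>
list_nvidia_modules() {
    root=${1:-/lib/modules/$(uname -r)}
    find "$root" -type f -name 'nvidia*.ko' 2>/dev/null | sort
}

# Print the distinct directories that contain NVIDIA modules.
# More than one directory means two builds coexist on disk.
nvidia_module_dirs() {
    list_nvidia_modules "$1" | sed 's|/[^/]*$||' | sort -u
}

nvidia_module_dirs
```

When two builds coexist, which one gets loaded depends on the depmod search order, so `modinfo -n nvidia` (prints the file that modprobe would resolve) and `modinfo -F version nvidia` are useful follow-up checks.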
Question
- Why did installing the new driver not produce a matching nvidia.ko and the related modules?
Further troubleshooting
    cd /usr/src/nvidia-418.40.04

    make

    # The build cannot link against the Linux kernel: the kernel was compiled
    # with GCC 6.5, while the system default GCC is 4.9.2:
    t-search-gcc-4.9.2-3.alios7.x86_64

    # GCC 6.5 installation guide:
    # http://gitlab.alibaba-inc.com/gnu/gnu-manifest/wikis/how-to-install-gcc65

    sudo yum install alios7u-2_17-gcc-6-repo.noarch -y
    sudo yum install gcc gcc-c++ libstdc++-static gdb coreutils binutils bash

    # Remove the old kernel modules
    sudo rm -f /lib/modules/4.9.168-016.ali3000.alios7.x86_64/kernel/drivers/video/nvidia*.ko

    # Remove the old libraries, driver, and packages
    sudo rm -f /usr/lib64/*nvidia*418.39*
    sudo nvidia-uninstall
    sudo yum remove 'nvidia*'
    sudo yum remove 'cuda*'

    # Unload the modules that are still loaded
    sudo rmmod nvidia_drm
    sudo rmmod nvidia_modeset
    sudo rmmod nvidia_uvm
    sudo rmmod nvidia

    # Then rerun the commands from the Installation section

    # Check both module directories after reinstalling
    ls -lh /lib/modules/4.9.168-016.ali3000.alios7.x86_64/kernel/drivers/video

    ls -lh /lib/modules/4.9.168-016.ali3000.alios7.x86_64/extra
    nvidia-drm.ko
    nvidia.ko
    nvidia-modeset.ko
    nvidia-uvm.ko

    lsmod | grep nvidia
    nvidia_drm
    nvidia_modeset
    nvidia_uvm
    nvidia

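
The root cause found here, a kernel built with one GCC and a module build attempted with another, can be detected before running make. The sketch below assumes /proc/version contains a "gcc version X.Y.Z" string, as on this 4.9 kernel (newer kernels format that line differently); `gcc_major_minor` and `compare_compilers` are hypothetical helper names.

```shell
#!/bin/sh
# Extract "major.minor" from the first "gcc version X.Y.Z" found on stdin.
gcc_major_minor() {
    sed -n 's/.*gcc version \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Report whether two compiler versions agree.
# $1: compiler that built the kernel, $2: default compiler on PATH
compare_compilers() {
    if [ "$1" = "$2" ]; then
        echo "match: $1"
    else
        echo "mismatch: kernel built with $1, default gcc is $2"
        return 1
    fi
}

# Run only where both the kernel version file and gcc are available.
if [ -r /proc/version ] && command -v gcc >/dev/null 2>&1; then
    kernel_gcc=$(gcc_major_minor < /proc/version)
    default_gcc=$(gcc -dumpversion | cut -d. -f1,2)
    compare_compilers "$kernel_gcc" "$default_gcc" || true
fi
```

On the machine described in this article, such a check would be expected to report 6.5 for the kernel and 4.9 for the default compiler before the GCC 6.5 packages are installed.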
Verification
    sudo reboot

    cat /proc/driver/nvidia/version
    NVRM version: NVIDIA UNIX x86_64 Kernel Module  418.40.04  Fri Mar 15 00:59:12 CDT 2019
    GCC version:  gcc version 6.5.1 20190307 (Alibaba 6.5.1-1 2.17) (GCC)

    # nvidia-smi header:
    NVIDIA-SMI 418.40.04    Driver Version: 418.40.04    CUDA Version: 10.1
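
Besides the version string, it can be worth confirming that all four modules came back after the reboot. This is a minimal sketch that reads lsmod-style output; `missing_nvidia_modules` is a hypothetical helper name, and the module list is the one shown by lsmod earlier in this article.

```shell
#!/bin/sh
# Print each expected NVIDIA module that is absent from lsmod-style input on stdin.
# Only the first column (module name) of each input line is considered.
missing_nvidia_modules() {
    loaded=$(awk '{print $1}')
    for mod in nvidia nvidia_uvm nvidia_modeset nvidia_drm; do
        echo "$loaded" | grep -qx "$mod" || echo "$mod"
    done
}

# Run only where lsmod is available; no output means nothing is missing.
if command -v lsmod >/dev/null 2>&1; then
    lsmod | missing_nvidia_modules
fi
```

Empty output means all four modules are loaded; any name printed is a module that did not load.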