This article uses a machine with a TESLA P100 GPU as the example.
GPU model: TESLA P100
Operating system: ALIOS 7U2
Kernel version: Linux artisf41sna11139225101.k2.na61 4.9.168-016.ali3000.alios7.x86_64 #1 SMP Fri Jun 21 22:41:58 CST 2019 x86_64 x86_64 x86_64 GNU/Linux
Kernel compiler: GCC 6.5
Default compiler: GCC 4.9.2
GPU driver in the kernel: NVIDIA-SMI 418.39 Driver Version: 418.39 CUDA Version: 10.1
Installation
    $ sudo yum -y install -b current nvidia-diag-driver-local-repo-rhel7-418.40.04-1.0-1.x86_64
    $ sudo yum -y install -b current cuda-repo-rhel7-10-0-local-10.0.130-410.48-1.0-1.x86_64
    $ sudo yum -y install cuda-drivers
    $ sudo yum -y install cuda
Problem
- After the installation finishes, /proc/driver/nvidia/version still reports the old version (418.39):
- NVRM version: NVIDIA UNIX x86_64 Kernel Module 418.39 Sat Feb 9 19:19:37 CST 2019
- Rebooting the machine does not help; the version is still 418.39.
Investigation
    $ lspci | grep NVIDIA
    86:00.0 3D controller: NVIDIA Corporation Device 15f8 (rev a1)

    $ lsmod | grep nvidia
    nvidia_uvm
    nvidia_drm
    nvidia_modeset
    nvidia_drm

    $ ls -lh /lib/modules/4.9.168-016.ali3000.alios7.x86_64/kernel/drivers/video
    nvidia-drm.ko
    nvidia.ko
    nvidia-modeset.ko
    nvidia-uvm.ko

    nvidia id: NVIDIA UNIX x86_64 Kernel Module 418.39 Sat Feb 9 19:19:37 CST 2019
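This confirms the cause: the nvidia.ko under the kernel's video driver directory is still version 418.39, so the system automatically loads that old module again on every reboot.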
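The same diagnosis can be scripted. The sketch below is not part of the original procedure: the function names `nvrm_version` and `report_nvidia_module` are hypothetical, and it assumes the NVIDIA module exposes `version` and `filename` fields to `modinfo`. It prints the version of the loaded module and of the nvidia.ko that modprobe would pick up, and fails if the on-disk module is not the version that was just installed.

```shell
#!/usr/bin/env bash

# Extract the driver version from an "NVRM version:" line, such as the first
# line of /proc/driver/nvidia/version. Reads from stdin.
nvrm_version() {
    sed -n 's/^NVRM version:.*Kernel Module  *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Print the loaded version and the on-disk nvidia.ko that modprobe resolves,
# then succeed only if the on-disk module matches the expected version.
report_nvidia_module() {
    local expected=$1 loaded on_disk path
    loaded=$(nvrm_version < /proc/driver/nvidia/version)
    on_disk=$(modinfo -F version nvidia)   # assumption: module declares a version
    path=$(modinfo -F filename nvidia)
    echo "loaded version:  ${loaded}"
    echo "on-disk version: ${on_disk} (${path})"
    [ "${on_disk}" = "${expected}" ]
}

# Only touch the system when explicitly asked, so the file can be sourced.
if [ "${1:-}" = "--check" ]; then
    report_nvidia_module "${2:?usage: $0 --check EXPECTED_VERSION}"
fi
```

Invoked as `./check_module.sh --check 418.40.04` (the script name is arbitrary), a non-zero exit status indicates that a stale nvidia.ko is still the one modprobe will load.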
Open question
- Why did installing the new GPU driver not produce an nvidia.ko (and the related modules) of the matching version?
Further investigation
    $ cd /usr/src/nvidia-418.40.04
    $ make
    # The build cannot link against the Linux kernel: the kernel was compiled
    # with GCC 6.5, but the system default GCC is 4.9.2
    # (t-search-gcc-4.9.2-3.alios7.x86_64).

    # Install GCC 6.5, see:
    # http://gitlab.alibaba-inc.com/gnu/gnu-manifest/wikis/how-to-install-gcc65
    $ yum install alios7u-2_17-gcc-6-repo.noarch -y
    $ yum install gcc gcc-c++ libstdc++-static gdb coreutils binutils bash

    $ sudo rm -f /lib/modules/4.9.168-016.ali3000.alios7.x86_64/kernel/drivers/video/nvidia*.ko

    $ rm -f /usr/lib64/*nvidia*418.39*
    $ sudo nvidia-uninstall
    $ sudo yum remove nvidia*
    $ sudo yum remove cuda*

    $ sudo rmmod nvidia_drm
    $ sudo rmmod nvidia_modeset
    $ sudo rmmod nvidia_uvm
    $ sudo rmmod nvidia

    # Now rerun the commands from the "Installation" section.

    $ ls -lh /lib/modules/4.9.168-016.ali3000.alios7.x86_64/kernel/drivers/video

    $ ls -lh /lib/modules/4.9.168-016.ali3000.alios7.x86_64/extra
    nvidia-drm.ko
    nvidia.ko
    nvidia-modeset.ko
    nvidia-uvm.ko

    $ lsmod | grep nvidia
    nvidia_drm
    nvidia_modeset
    nvidia_uvm
    nvidia
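The compiler mismatch found above can be detected before running the installer. The following is a rough sketch under a few assumptions: the helper names are hypothetical, /proc/version is assumed to contain a "gcc version X.Y.Z" string (as 4.x kernels do), and `gcc -dumpversion` is assumed to print a full X.Y.Z version, which holds for GCC releases older than 7.

```shell
#!/usr/bin/env bash

# Print the major.minor GCC version found in a /proc/version-style line.
# Reads from stdin.
kernel_gcc_version() {
    sed -n 's/.*gcc version \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Reduce a dotted version such as 4.9.2 to major.minor (4.9). Reads from stdin.
major_minor() {
    cut -d. -f1,2
}

# Compare the compiler that built the running kernel with the default gcc.
check_compiler() {
    local kernel_cc default_cc
    kernel_cc=$(kernel_gcc_version < /proc/version)
    default_cc=$(gcc -dumpversion | major_minor)
    echo "kernel built with GCC ${kernel_cc}, default gcc is ${default_cc}"
    [ "${kernel_cc}" = "${default_cc}" ]
}

# Only inspect the system when explicitly asked, so the file can be sourced.
if [ "${1:-}" = "--check" ]; then
    check_compiler
fi
```

A non-zero exit status from `--check` suggests that the kernel module build is likely to fail in the way described above until a matching GCC is installed.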
Verification
    $ sudo reboot

    $ cat /proc/driver/nvidia/version
    NVRM version: NVIDIA UNIX x86_64 Kernel Module 418.40.04 Fri Mar 15 00:59:12 CDT 2019
    GCC version: gcc version 6.5.1 20190307 (Alibaba 6.5.1-1 2.17) (GCC)

    $ nvidia-smi
    NVIDIA-SMI 418.40.04 Driver Version: 418.40.04 CUDA Version: 10.1
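The post-reboot check can also be automated. This is a minimal sketch rather than the author's procedure: `versions_match` and `verify_driver` are hypothetical names, and it assumes the loaded module publishes its version in /sys/module/nvidia/version. It confirms that both the kernel module and the user-space tools report the expected driver version.

```shell
#!/usr/bin/env bash

# Succeed only if both reported versions equal the expected one.
versions_match() {
    local expected=$1 kernel=$2 userspace=$3
    [ "${kernel}" = "${expected}" ] && [ "${userspace}" = "${expected}" ]
}

# Read the version of the loaded kernel module and the version reported by
# nvidia-smi, then compare both against the expected driver version.
verify_driver() {
    local expected=$1 kernel userspace
    kernel=$(cat /sys/module/nvidia/version)   # assumption: file is present
    userspace=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
    echo "kernel module: ${kernel}"
    echo "nvidia-smi:    ${userspace}"
    versions_match "${expected}" "${kernel}" "${userspace}"
}

# Only query the system when explicitly asked, so the file can be sourced.
if [ "${1:-}" = "--verify" ]; then
    verify_driver "${2:?usage: $0 --verify EXPECTED_VERSION}"
fi
```

For the setup in this article the expected version would be 418.40.04, for example `./verify_driver.sh --verify 418.40.04` (the script name is arbitrary).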