AMD VEGA64 crashes on kernels > 4.15

When I try to run kernels 4.19.39, 5.0.13, or 5.1, the system freezes a few seconds after I launch Steam or Overwatch (Battle.net client). I am currently running 4.15, which is very stable.

This is my setup:

Hardware

AMD Ryzen 7 2700X Wraith Boxed
Asus Vega 64 Strix    
Gigabyte X470 AORUS ULTRA GAMING (AGESA 1.0.0.6)
G.Skill Ripjaws V DDR4 3200MHz, 64GB total (4 x 16GB)
Corsair CX850M 850W ATX power supply unit

screenfetch -n

OS: Ubuntu 18.04 bionic
 Kernel: x86_64 Linux 4.15.0-48-generic
 Uptime: 1h 29m
 Packages: 3497
 Shell: bash 4.4.19
 Resolution: 3840x2160
 DE: GNOME 
 WM: GNOME Shell
 WM Theme: Adwaita
 GTK Theme: Ambiance [GTK2/3]
 Icon Theme: ubuntu-mono-dark
 Font: Ubuntu 11
 CPU: AMD Ryzen 7 2700X Eight-Core @ 16x 3.7GHz [36.3°C]
 GPU: Radeon RX Vega (VEGA10, DRM 3.23.0, 4.15.0-48-generic, LLVM 9.0.0)
 RAM: 6208MiB / 64432MiB

Drivers + additional info

~$ glxinfo | grep "OpenGL version"
OpenGL version string: 4.5 (Compatibility Profile) Mesa 19.2.0-devel - padoka PPA

~$ cat /etc/apt/sources.list.d/paulo-miguel-dias-ubuntu-mesa-bionic.list
deb http://ppa.launchpad.net/paulo-miguel-dias/mesa/ubuntu bionic main
# deb-src http://ppa.launchpad.net/paulo-miguel-dias/mesa/ubuntu bionic main

~$ sudo lspci -v | grep -i vga -A 10
0c:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XT [Radeon RX Vega 64] (rev c1) (prog-if 00 [VGA controller])
    Subsystem: ASUSTeK Computer Inc. Vega 10 XT [Radeon RX Vega 64]
    Flags: bus master, fast devsel, latency 0, IRQ 114
    Memory at e0000000 (64-bit, prefetchable) [size=256M]
    Memory at f0000000 (64-bit, prefetchable) [size=2M]
    I/O ports at e000 [size=256]
    Memory at fcc00000 (32-bit, non-prefetchable) [size=512K]
    Expansion ROM at 000c0000 [disabled] [size=128K]
    Capabilities: [48] Vendor Specific Information: Len=08 <?>
    Capabilities: [50] Power Management version 3
    Capabilities: 

    ...

~$ apt show libdrm-amdgpu1 -a
Package: libdrm-amdgpu1
Version: 2.4.98+git1905192304.922d929~b~padoka0
Priority: optional
Section: libs
Source: libdrm
Maintainer: Debian X Strike Force <debian-x@lists.debian.org>
Installed-Size: 76,8 kB
Depends: libc6 (>= 2.17), libdrm2 (>= 2.4.82)
Download-Size: 26,9 kB
APT-Manual-Installed: yes
APT-Sources: http://ppa.launchpad.net/paulo-miguel-dias/mesa/ubuntu bionic/main amd64 Packages
Description: Userspace interface to amdgpu-specific kernel DRM services -- runtime
 This library implements the userspace interface to the kernel DRM
 services.  DRM stands for "Direct Rendering Manager", which is the
 kernelspace portion of the "Direct Rendering Infrastructure" (DRI).
 The DRI is currently used on Linux to provide hardware-accelerated

While testing with kernel 5.1, I found the following in the kernel log:

May 22 18:46:31 [HOST] kernel: [  256.354386] amdgpu 0000:0c:00.0: [gfxhub] no-retry page fault (src_id:0 ring:158 vmid:5 pasid:32780, for process Battle.net.exe pid 10384 thread Battle.net:cs0 pid 10575)
May 22 18:46:31 [HOST] kernel: [  256.354390] amdgpu 0000:0c:00.0:   in page starting at address 0x0000000000400000 from 27
May 22 18:46:31 [HOST] kernel: [  256.354391] amdgpu 0000:0c:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0050153D
May 22 18:46:31 [HOST] kernel: [  256.354395] amdgpu 0000:0c:00.0: [gfxhub] no-retry page fault (src_id:0 ring:158 vmid:5 pasid:32780, for process Battle.net.exe pid 10384 thread Battle.net:cs0 pid 10575)
May 22 18:46:31 [HOST] kernel: [  256.354397] amdgpu 0000:0c:00.0:   in page starting at address 0x0000000000400000 from 27
May 22 18:46:31 [HOST] kernel: [  256.354398] amdgpu 0000:0c:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
May 22 18:46:31 [HOST] kernel: [  256.354404] amdgpu 0000:0c:00.0: [gfxhub] no-retry page fault (src_id:0 ring:158 vmid:5 pasid:32780, for process Battle.net.exe pid 10384 thread Battle.net:cs0 pid 10575)
May 22 18:46:31 [HOST] kernel: [  256.354405] amdgpu 0000:0c:00.0:   in page starting at address 0x0000000000400000 from 27
May 22 18:46:31 [HOST] kernel: [  256.354407] amdgpu 0000:0c:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
May 22 18:46:31 [HOST] kernel: [  256.354411] amdgpu 0000:0c:00.0: [gfxhub] no-retry page fault (src_id:0 ring:158 vmid:5 pasid:32780, for process Battle.net.exe pid 10384 thread Battle.net:cs0 pid 10575)
May 22 18:46:31 [HOST] kernel: [  256.354412] amdgpu 0000:0c:00.0:   in page starting at address 0x0000000000400000 from 27
May 22 18:46:31 [HOST] kernel: [  256.354413] amdgpu 0000:0c:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
May 22 18:46:31 [HOST] kernel: [  256.354418] amdgpu 0000:0c:00.0: [gfxhub] no-retry page fault (src_id:0 ring:158 vmid:5 pasid:32780, for process Battle.net.exe pid 10384 thread Battle.net:cs0 pid 10575)
May 22 18:46:31 [HOST] kernel: [  256.354419] amdgpu 0000:0c:00.0:   in page starting at address 0x0000000000400000 from 27
May 22 18:46:31 [HOST] kernel: [  256.354420] amdgpu 0000:0c:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
May 22 18:46:31 [HOST] kernel: [  256.354424] amdgpu 0000:0c:00.0: [gfxhub] no-retry page fault (src_id:0 ring:158 vmid:5 pasid:32780, for process Battle.net.exe pid 10384 thread Battle.net:cs0 pid 10575)
May 22 18:46:31 [HOST] kernel: [  256.354426] amdgpu 0000:0c:00.0:   in page starting at address 0x0000000000400000 from 27
May 22 18:46:31 [HOST] kernel: [  256.354427] amdgpu 0000:0c:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
May 22 18:46:31 [HOST] kernel: [  256.354430] amdgpu 0000:0c:00.0: [gfxhub] no-retry page fault (src_id:0 ring:158 vmid:5 pasid:32780, for process Battle.net.exe pid 10384 thread Battle.net:cs0 pid 10575)
May 22 18:46:31 [HOST] kernel: [  256.354432] amdgpu 0000:0c:00.0:   in page starting at address 0x0000000000400000 from 27
May 22 18:46:31 [HOST] kernel: [  256.354433] amdgpu 0000:0c:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
May 22 18:46:31 [HOST] kernel: [  256.354437] amdgpu 0000:0c:00.0: [gfxhub] no-retry page fault (src_id:0 ring:158 vmid:5 pasid:32780, for process Battle.net.exe pid 10384 thread Battle.net:cs0 pid 10575)
May 22 18:46:31 [HOST] kernel: [  256.354438] amdgpu 0000:0c:00.0:   in page starting at address 0x0000000000400000 from 27
May 22 18:46:31 [HOST] kernel: [  256.354439] amdgpu 0000:0c:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
May 22 18:46:31 [HOST] kernel: [  256.354443] amdgpu 0000:0c:00.0: [gfxhub] no-retry page fault (src_id:0 ring:158 vmid:5 pasid:32780, for process Battle.net.exe pid 10384 thread Battle.net:cs0 pid 10575)
May 22 18:46:31 [HOST] kernel: [  256.354444] amdgpu 0000:0c:00.0:   in page starting at address 0x0000000000400000 from 27
May 22 18:46:31 [HOST] kernel: [  256.354445] amdgpu 0000:0c:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
May 22 18:46:31 [HOST] kernel: [  256.354449] amdgpu 0000:0c:00.0: [gfxhub] no-retry page fault (src_id:0 ring:158 vmid:5 pasid:32780, for process Battle.net.exe pid 10384 thread Battle.net:cs0 pid 10575)
May 22 18:46:31 [HOST] kernel: [  256.354450] amdgpu 0000:0c:00.0:   in page starting at address 0x0000000000400000 from 27
May 22 18:46:31 [HOST] kernel: [  256.354451] amdgpu 0000:0c:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
May 22 18:46:41 [HOST] kernel: [  261.469953] [drm:amdgpu_dm_commit_planes.isra.43 [amdgpu]] *ERROR* Waiting for fences timed out.
May 22 18:46:41 [HOST] kernel: [  266.593840] [drm:amdgpu_dm_commit_planes.isra.43 [amdgpu]] *ERROR* Waiting for fences timed out.
May 22 18:46:41 [HOST] kernel: [  266.599848] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=18098, emitted seq=18100
May 22 18:46:41 [HOST] kernel: [  266.599914] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Battle.net.exe pid 10384 thread Battle.net:cs0 pid 10575
May 22 18:46:41 [HOST] kernel: [  266.599918] amdgpu 0000:0c:00.0: GPU reset begin!
May 22 18:46:47 [HOST] kernel: [  271.709694] [drm:amdgpu_dm_commit_planes.isra.43 [amdgpu]] *ERROR* Waiting for fences timed out.
May 22 18:46:47 [HOST] kernel: [  272.165625] amdgpu 0000:0c:00.0: GPU BACO reset
May 22 18:46:47 [HOST] kernel: [  272.643907] amdgpu 0000:0c:00.0: GPU reset succeeded, trying to resume
May 22 18:46:47 [HOST] kernel: [  272.644035] [drm] PCIE GART of 512M enabled (table at 0x000000F400900000).
May 22 18:46:47 [HOST] kernel: [  272.644126] [drm:amdgpu_device_gpu_recover [amdgpu]] *ERROR* VRAM is lost!
May 22 18:46:47 [HOST] kernel: [  272.644277] [drm] PSP is resuming...
May 22 18:46:47 [HOST] kernel: [  272.790964] [drm] reserve 0x400000 from 0xf400d00000 for PSP TMR SIZE
May 22 18:46:47 [HOST] kernel: [  272.801714] amdgpu: [powerplay] Failed to send message: 0x46, ret value: 0xffffffff
May 22 18:46:47 [HOST] kernel: [  272.801830] amdgpu: [powerplay] Failed to send message: 0x61, ret value: 0xffffffff
May 22 18:46:48 [HOST] kernel: [  273.172332] [drm] UVD and UVD ENC initialized successfully.
May 22 18:46:48 [HOST] kernel: [  273.271995] [drm] VCE initialized successfully.
May 22 18:46:48 [HOST] kernel: [  273.273190] [drm] recover vram bo from shadow start
May 22 18:46:48 [HOST] kernel: [  273.279784] [drm] recover vram bo from shadow done
May 22 18:46:48 [HOST] kernel: [  273.279787] [drm] Skip scheduling IBs!
May 22 18:46:48 [HOST] kernel: [  273.279789] [drm] Skip scheduling IBs!
May 22 18:46:48 [HOST] kernel: [  273.279823] [drm] Skip scheduling IBs!
May 22 18:46:48 [HOST] kernel: [  273.279831] [drm] Skip scheduling IBs!
May 22 18:46:48 [HOST] kernel: [  273.279833] [drm] Skip scheduling IBs!
May 22 18:46:48 [HOST] kernel: [  273.279838] [drm] Skip scheduling IBs!
May 22 18:46:48 [HOST] kernel: [  273.279844] amdgpu 0000:0c:00.0: GPU reset(2) succeeded!
May 22 18:46:48 [HOST] kernel: [  273.279844] [drm] Skip scheduling IBs!
May 22 18:46:48 [HOST] kernel: [  273.279848] [drm] Skip scheduling IBs!
May 22 18:46:48 [HOST] kernel: [  273.279853] [drm] Skip scheduling IBs!
May 22 18:46:48 [HOST] kernel: [  273.279855] [drm] Skip scheduling IBs!
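
For anyone comparing their own logs against the excerpt above, here is a rough sketch of how the relevant events could be counted. The function names `summarize_amdgpu_log` and `faulting_processes` are made up for this example, and the patterns only match the message wording seen in this particular log; other kernel versions may phrase these messages differently.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of any amdgpu tooling). Both read kernel log
# text on stdin. Assumption: the message strings are the ones from the
# kernel 5.1 excerpt above.

# Count page faults, ring timeouts, GPU resets and VRAM-loss events.
summarize_amdgpu_log() {
    awk '
        /no-retry page fault/      { faults++ }
        /ring [a-z0-9_.]+ timeout/ { timeouts++ }
        /GPU reset begin/          { resets++ }
        /VRAM is lost/             { vram_lost++ }
        END {
            printf "page_faults=%d ring_timeouts=%d gpu_resets=%d vram_lost=%d\n",
                   faults, timeouts, resets, vram_lost
        }
    '
}

# List each distinct "process-name pid" pair named in a page fault line.
# Process names containing spaces are not handled by this simple pattern.
faulting_processes() {
    sed -n 's/.*no-retry page fault.*for process \([^ ]*\) pid \([0-9]*\) thread.*/\1 \2/p' \
        | sort -u
}

# Possible usage on a live system (kernel messages of the current boot):
#   journalctl -k -b | summarize_amdgpu_log
#   journalctl -k -b | faulting_processes
```

A non-zero `ring_timeouts` together with `gpu_resets` would correspond to the freeze described above; the second helper just shows which process submitted the faulting work.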

Kernel 5.5 works and is stable!

uname -a

Linux patrick-X470-AORUS-ULTRA-GAMING 5.5.10-050510-generic #202003180732 SMP Wed Mar 18 07:35:23 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

screenfetch -n

 patrick@patrick-X470-AORUS-ULTRA-GAMING
 OS: Ubuntu 18.04 bionic
 Kernel: x86_64 Linux 5.5.10-050510-generic
 Uptime: 17h 38m
 Packages: 3877
 Shell: bash 4.4.20
 Resolution: 3840x2160
 DE: GNOME 
 WM: GNOME Shell
 WM Theme: Adwaita
 GTK Theme: Ambiance [GTK2/3]
 Icon Theme: ubuntu-mono-dark
 Font: Ubuntu 11
 CPU: AMD Ryzen 7 2700X Eight-Core @ 16x 3.7GHz [38.8°C]
 GPU: Radeon RX Vega (VEGA10, DRM 3.36.0, 5.5.10-050510-generic, LLVM 10.0.0)
 RAM: 10126MiB / 64332MiB

Drivers + additional info

$ glxinfo | grep "OpenGL version"
OpenGL version string: 4.6 (Compatibility Profile) Mesa 20.0.0-devel - padoka PPA

$ sudo lspci -v | grep -i vga -A 10
0c:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XT [Radeon RX Vega 64] (rev c1) (prog-if 00 [VGA controller])
    Subsystem: ASUSTeK Computer Inc. Vega 10 XT [Radeon RX Vega 64]
    Flags: bus master, fast devsel, latency 0, IRQ 119
    Memory at e0000000 (64-bit, prefetchable) [size=256M]
    Memory at f0000000 (64-bit, prefetchable) [size=2M]
    I/O ports at e000 [size=256]
    Memory at fcc00000 (32-bit, non-prefetchable) [size=512K]
    Expansion ROM at 000c0000 [disabled] [size=128K]
    Capabilities: [48] Vendor Specific Information: Len=08 <?>
    Capabilities: [50] Power Management version 3
    Capabilities: [64] Express Legacy Endpoint, MSI 00

$ apt show libdrm-amdgpu1 -a
Package: libdrm-amdgpu1
Version: 2.4.100+git2001081023.9ebfac1~b~padoka0
Priority: optional
Section: libs
Source: libdrm
Maintainer: Debian X Strike Force <debian-x@lists.debian.org>
Installed-Size: 80,9 kB
Depends: libc6 (>= 2.17), libdrm2 (>= 2.4.100)
Download-Size: 28,2 kB
APT-Manual-Installed: yes
APT-Sources: http://ppa.launchpad.net/paulo-miguel-dias/mesa/ubuntu bionic/main amd64 Packages