A Deep Learning Bug Caused by a Faulty GPU

Problem Description

  1. The program produced nan out of nowhere and crashed

2021-04-19 12:50:59.493929: E tensorflow/core/kernels/gpu_utils.cc:93] Redzone mismatch in RHS redzone of buffer 0x7fe00b742600 at offset 4984856; expected ffffffffffffffff but was faffc4ffffffffff.
2021-04-19 12:50:59.494006: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at conv_grad_input_ops.cc:1264 : Not found: No algorithm worked!

....(irrelevant portion omitted)
tensorflow.python.framework.errors_impl.NotFoundError: No algorithm worked!
[[node gradient_tape/cin/conv1d_2/Conv2DBackpropInput (defined at /home/zhangxu/workspace/ctr/share/base_model.py:59) ]] [Op:__inference_fit_on_batch_72311]

Errors may have originated from an input operation.
Input Source operations connected to node gradient_tape/cin/conv1d_2/Conv2DBackpropInput:
cin/conv1d_2/ExpandDims_1 (defined at xDeepFM.py:135)

Function call stack:

Root Cause

The GPU is physically broken.

Testing the GPU

Since I don't have root access, I could only write a small program myself to verify my suspicion that the GPU was faulty.

import os

os.environ['CUDA_VISIBLE_DEVICES'] = '1'
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

import tensorflow as tf

for i in range(10):
    print(i)
    # 7 // 2 == 3 rows of 2^27 floats each, ~1.5 GiB per tensor
    a = tf.zeros((7 // 2, 1024 * 1024 * 1024 // 8), dtype=tf.float32)
    b = tf.zeros((7 // 2, 1024 * 1024 * 1024 // 8), dtype=tf.float32)
    for _ in range(100):
        a += b
    # Adding zeros to zeros must stay zero; on healthy VRAM this never fires
    assert tf.reduce_sum(a) == 0
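The test above is a pattern check: fill GPU memory with a known value (all zeros), exercise it with repeated adds, and verify the value reads back intact; on a faulty card a flipped bit makes the sum nonzero (or nan) and the assert fires. The same idea works for host RAM without TensorFlow. A hypothetical pure-Python sketch (`check_memory`, `num_floats`, and `passes` are names and sizes I'm making up for illustration):

```python
import array

def check_memory(num_floats=1024 * 1024, passes=10):
    """Fill a buffer with a known pattern and verify it reads back intact."""
    for _ in range(passes):
        # Allocate a fresh buffer of zeros each pass so different
        # physical pages get exercised.
        buf = array.array('d', [0.0]) * num_floats
        # Any nonzero (or nan) element means the memory corrupted our pattern.
        if any(x != 0.0 for x in buf):
            return False
    return True

print(check_memory())  # True on healthy memory
```

A real memory tester would also write alternating bit patterns (0x55/0xAA) to catch stuck bits, but even this crude check is enough to demonstrate reproducible corruption to an admin.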

Rant

This damn project server: only 32G of RAM with barely 16G actually usable, a tiny disk with just 300-something G left, the machine itself is unstable, and they know the GPU is broken and still won't fix it.

And why do I always run into hardware bugs? Last time it was the CPU, this time it's the GPU.