ValueError: cannot convert float NaN to integer on every training run

1111a1333 · Posted 2024-2-27 20:49:08

Beginner asking for help. Every training run hits the following problem:
ValueError: cannot convert float NaN to integer
[22:56:26][#1284][0175ms][nan][nan]
[22:58:29][#2574][0135ms][nan][nan]
[23:06:06][#6687][0128ms][nan][nan]
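For readers unfamiliar with the message itself: it is the text Python produces when `int()` is called on a NaN float. The thread does not show which line of the trainer raises it, so the sketch below only reproduces the error in isolation and shows one way to guard a conversion. The helper name `to_int_or_none` is made up for illustration and is not part of DFL.

```python
import math
from typing import Optional


def to_int_or_none(value: float) -> Optional[int]:
    """Convert to int, returning None for NaN or infinity instead of raising.

    Hypothetical helper for illustration only.
    """
    if not math.isfinite(value):
        return None
    return int(value)


# Reproduce the error from the thread title in isolation.
try:
    int(float("nan"))
except ValueError as exc:
    message = str(exc)
    print(message)

print(to_int_or_none(float("nan")))
print(to_int_or_none(3.7))
```

A guard like this only hides the symptom. A NaN loss still means the training step itself has gone wrong.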

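Training does keep running, but the loss is always NaN no matter how long it runs. There is also no training preview window, so I can't see live training progress, and I can't press Enter to stop training; the only way out is to force-quit the terminal with Ctrl+C. I looked up the original author's answers on GitHub: one suggestion was to underclock the GPU, the other was to reduce the model parameters, but neither seems to help in my case. Even training q96 directly gives the error. I also found a post on this forum saying that solid-color images in the training data are the cause, but I checked and there are none. Could something else in the training data be causing this?

Thanks, everyone.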
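The solid-color check mentioned above can be sketched roughly as follows. This is a minimal illustration, not the poster's actual check: it assumes each image has already been decoded into a sequence of RGB tuples (the loading step is left out), and `is_solid_color` is a hypothetical name.

```python
from typing import Iterable, Tuple

Pixel = Tuple[int, int, int]


def is_solid_color(pixels: Iterable[Pixel]) -> bool:
    """Return True if every pixel has the same value.

    An empty image is treated as solid so that it also gets flagged.
    """
    it = iter(pixels)
    first = next(it, None)
    if first is None:
        return True
    return all(p == first for p in it)


# Toy data standing in for decoded images (assumed, not from the thread).
black = [(0, 0, 0)] * 4
almost_black = [(0, 0, 0), (0, 0, 0), (0, 0, 1), (0, 0, 0)]

print(is_solid_color(black))
print(is_solid_color(almost_black))
```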

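wtxx8888 · Posted 2024-2-27 21:52:39

Last edited by wtxx8888 on 2024-2-27 22:44

NaN means the GPU never received any training data at all. Put bluntly, you've been training nothing: all of it is invalid, and the model is still blank.
Most likely there's a problem with your GPU driver. Or you enabled FP16 half precision and your card is too old to support it (the GTX 1660 partially supports FP16; cards below the GTX 1660 don't support FP16 half precision).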

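1111a1333 · Posted 2024-2-27 22:45:25

wtxx8888 posted on 2024-2-27 21:52
NaN means the GPU never received any image data at all. Put bluntly, you've been training nothing: black screen, no images.
Most likely there's a problem with your GPU driver. Or ...

Is there a way to verify whether the driver is the problem? When I run a simple algorithm with TensorFlow, the output seems to show that CUDA and cuDNN are both working. For example, here is a simple convolutional model: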
import tensorflow as tf

# Build a simple convolutional neural network
model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Print the model summary
model.summary()

# Check whether a CUDA GPU is available
if tf.test.is_gpu_available(cuda_only=True):
    print("CUDA is available!")
else:
    print("CUDA is not available.")

My output is:
$ python test_cudnn.py
2024-02-27 15:57:41.403509: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2024-02-27 15:57:42.018270: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2024-02-27 15:57:42.018812: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2024-02-27 15:57:42.027762: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-02-27 15:57:42.027864: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: NVIDIA GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.695GHz coreCount: 82 deviceMemorySize: 23.68GiB deviceMemoryBandwidth: 871.81GiB/s
2024-02-27 15:57:42.027875: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2024-02-27 15:57:42.029176: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2024-02-27 15:57:42.029199: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.10
2024-02-27 15:57:42.030278: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2024-02-27 15:57:42.030463: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2024-02-27 15:57:42.031427: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2024-02-27 15:57:42.031931: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2024-02-27 15:57:42.033867: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
2024-02-27 15:57:42.033946: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-02-27 15:57:42.034056: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-02-27 15:57:42.034108: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2024-02-27 15:57:42.034290: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-02-27 15:57:42.034751: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2024-02-27 15:57:42.034802: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-02-27 15:57:42.034867: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: NVIDIA GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.695GHz coreCount: 82 deviceMemorySize: 23.68GiB deviceMemoryBandwidth: 871.81GiB/s
2024-02-27 15:57:42.034878: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2024-02-27 15:57:42.034887: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2024-02-27 15:57:42.034893: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.10
2024-02-27 15:57:42.034898: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2024-02-27 15:57:42.034903: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2024-02-27 15:57:42.034909: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2024-02-27 15:57:42.034914: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2024-02-27 15:57:42.034919: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
2024-02-27 15:57:42.034942: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-02-27 15:57:42.035029: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-02-27 15:57:42.035078: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2024-02-27 15:57:42.035093: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1

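CUDA and cuDNN both look perfectly normal... The card is a 3090, the driver version is 545.29.06, CUDA is 11.8, and cuDNN is 8.9.7. It seems NVIDIA doesn't provide a way to query whether FP16 half precision is in use? Is the only option to add a conda environment activation script that forces float32?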

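wtxx8888 · Posted 2024-2-27 23:49:48

Last edited by wtxx8888 on 2024-2-28 00:13

1111a1333 posted on 2024-2-27 22:45
Is there a way to verify whether the driver is the problem? When I run a simple algorithm with TensorFlow, the output seems to show that CUDA and cuDNN are both ...

I learned C and only know a little Python. In the Python code I've seen, if nothing is specified it's generally FP32; FP16 is always specified explicitly

(after all, FP16 can only be enabled on a GTX 1660 or better)
If NaN is shown instead of a loss value, the training data the GPU receives is blank. As for what caused that, you'll need to check for yourself.

Try updating the driver to the latest version first; the current driver is 551.61.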

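1111a1333 · Posted 2024-2-28 00:43:59

Last edited by 1111a1333 on 2024-2-28 00:45

wtxx8888 posted on 2024-2-27 23:49
I learned C and only know a little Python. In the Python code I've seen, if nothing is specified it's generally FP32; FP16 is always specified explicitly (...

I solved it. I'd made a basic mistake: my CUDA version wasn't compatible with the TensorFlow that DFL uses.

Thanks for making me realize that the GPU wasn't receiving any training data at all. I reinstalled the conda environment and everything works now.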
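One plausible reading of the log posted earlier, offered as an inference rather than something the poster confirmed: TensorFlow reports loading `libcudart.so.10.1` and `libcudnn.so.7`, while the poster lists CUDA 11.8 and cuDNN 8.9.7 as installed. A rough sketch of how such a mismatch could be spotted automatically is below. The function names and the `installed` mapping are hypothetical, and the sample log is trimmed to two lines from the thread.

```python
import re
from typing import Dict, List, Set, Tuple

# Matches e.g. "... Successfully opened dynamic library libcudart.so.10.1"
LIB_RE = re.compile(r"Successfully opened dynamic library (lib\w+)\.so\.([\d.]+)")


def loaded_library_versions(log_text: str) -> Dict[str, Set[str]]:
    """Map each library name to the set of version suffixes seen in the log."""
    versions: Dict[str, Set[str]] = {}
    for name, ver in LIB_RE.findall(log_text):
        versions.setdefault(name, set()).add(ver)
    return versions


def major(version: str) -> str:
    """Return the major component of a dotted version string."""
    return version.split(".")[0]


def major_mismatches(
    loaded: Dict[str, Set[str]], installed: Dict[str, str]
) -> Dict[str, Tuple[List[str], str]]:
    """Return libraries whose loaded major version differs from the installed one."""
    result: Dict[str, Tuple[List[str], str]] = {}
    for lib, want in installed.items():
        bad = sorted(v for v in loaded.get(lib, set()) if major(v) != major(want))
        if bad:
            result[lib] = (bad, want)
    return result


SAMPLE_LOG = """\
2024-02-27 15:57:41.403509: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2024-02-27 15:57:42.033867: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
"""

# Versions the poster reported as installed (CUDA 11.8, cuDNN 8.9.7).
installed = {"libcudart": "11.8", "libcudnn": "8.9.7"}

mismatches = major_mismatches(loaded_library_versions(SAMPLE_LOG), installed)
print(mismatches)
```

A non-empty result only says that the libraries TensorFlow loaded are not the ones the poster expected. Which combination a given TensorFlow build actually requires has to be checked against that build's documentation.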
