DeepFaceLab Chinese Forum (deepfacelab中文网)

Every training run fails with ValueError: cannot convert float NaN to integer

1111a1333 (original poster) | Posted 2024-2-27 20:49:08

Newbie question: every training run hits the following problem.
ValueError: cannot convert float NaN to integer
[22:56:26][#1284][0175ms][nan][nan]
[22:58:29][#2574][0135ms][nan][nan]
[23:06:06][#6687][0128ms][nan][nan]
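
For reference, the message in the title is what Python raises when `int()` is called on a NaN float, so some step in the trainer is presumably converting a NaN-derived value to an integer (the exact location is an assumption, since no traceback is shown here). Below is a minimal sketch that reproduces the message and scans log lines in the format quoted above for the first iteration with a NaN loss. The field layout is inferred from those lines, and `first_nan_iteration` and `nan_to_int_error` are hypothetical helpers, not part of DeepFaceLab.

```python
import math
import re

# Matches lines such as "[22:56:26][#1284][0175ms][nan][nan]".
# Assumed layout: time, iteration number, step time, then two loss values.
LOG_LINE = re.compile(
    r"\[(?P<time>[\d:]+)\]\[#(?P<iter>\d+)\]\[(?P<ms>\d+)ms\]"
    r"\[(?P<loss_a>[^\]]+)\]\[(?P<loss_b>[^\]]+)\]"
)


def first_nan_iteration(lines):
    """Return the iteration number of the first line with a NaN loss, or None."""
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # not a training progress line
        losses = (float(match["loss_a"]), float(match["loss_b"]))
        if any(math.isnan(value) for value in losses):
            return int(match["iter"])
    return None


def nan_to_int_error():
    """Reproduce the error message from the thread title."""
    try:
        int(float("nan"))
    except ValueError as exc:
        return str(exc)
    return None


log = [
    "[22:56:26][#1284][0175ms][nan][nan]",
    "[22:58:29][#2574][0135ms][nan][nan]",
]
print(first_nan_iteration(log))
print(nan_to_int_error())
```

If the very first logged iteration is already NaN, the problem is more likely in the environment or the input data than in a gradual numerical blow-up.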



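Training does keep running, but the loss is NaN no matter how many iterations pass. There is also no training preview window, so I cannot watch progress in real time, and pressing Enter does not stop training; I can only force-quit the terminal with Ctrl+C. I checked the original author's answers on GitHub: one suggestion was to underclock the GPU, the other was to reduce the model parameters, but neither seems to help. Even training directly with q96 produces the error. Someone on this forum said solid-color images in the training data can cause this, but I checked and found none. Could something else in the training data be the cause?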
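
The solid-color check mentioned above can be automated. The following is only a sketch: it assumes each image has already been decoded into a flat sequence of pixel tuples (for example with Pillow or OpenCV, which is left out here), and `is_solid_color` and `find_solid_images` are hypothetical names rather than DeepFaceLab functions.

```python
def is_solid_color(pixels):
    """Return True if every pixel in the iterable is identical (or it is empty)."""
    iterator = iter(pixels)
    try:
        first = next(iterator)
    except StopIteration:
        return True  # an empty image is treated as degenerate too
    return all(pixel == first for pixel in iterator)


def find_solid_images(images):
    """Return the sorted names of images whose pixels are all one color.

    `images` maps a file name to a flat sequence of pixel tuples.
    """
    return sorted(name for name, pixels in images.items() if is_solid_color(pixels))


# Hypothetical, hand-made sample data standing in for decoded face images.
sample = {
    "black.jpg": [(0, 0, 0)] * 4,
    "face.jpg": [(12, 40, 200), (13, 41, 199), (0, 0, 0), (255, 255, 255)],
}
print(find_solid_images(sample))
```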


Thanks, everyone.
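wtxx8888 | Posted 2024-2-27 21:52:39
Last edited by wtxx8888 on 2024-2-27 22:44

NaN means the GPU never received any training data. In plain terms, the training accomplished nothing: all of it is invalid and the model is still blank.
Most likely your GPU driver has a problem. Or you enabled F16 half precision on a card too old to support it (the GTX 1660 partially supports F16; cards below the GTX 1660 do not support half precision).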
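1111a1333 (original poster) | Posted 2024-2-27 22:45:25
wtxx8888 wrote on 2024-2-27 21:52:
NaN means the GPU never received any image data. In plain terms, the training accomplished nothing: a black screen with no image.
Most likely your GPU driver has a problem. Or ...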

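Is there a way to verify whether the driver is the problem? When I run a simple algorithm with TensorFlow, the output seems to show that CUDA and cuDNN both work. For example, here is a simple convolutional model: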
import tensorflow as tf

# Build a simple convolutional neural network
model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Print the model summary
model.summary()

# Check whether a CUDA-capable GPU is available
if tf.test.is_gpu_available(cuda_only=True):
    print("CUDA is available!")
else:
    print("CUDA is not available.")

My output is:
$python test_cudnn.py
2024-02-27 15:57:41.403509: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2024-02-27 15:57:42.018270: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2024-02-27 15:57:42.018812: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2024-02-27 15:57:42.027762: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-02-27 15:57:42.027864: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: NVIDIA GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.695GHz coreCount: 82 deviceMemorySize: 23.68GiB deviceMemoryBandwidth: 871.81GiB/s
2024-02-27 15:57:42.027875: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2024-02-27 15:57:42.029176: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2024-02-27 15:57:42.029199: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.10
2024-02-27 15:57:42.030278: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2024-02-27 15:57:42.030463: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2024-02-27 15:57:42.031427: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2024-02-27 15:57:42.031931: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2024-02-27 15:57:42.033867: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
2024-02-27 15:57:42.033946: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-02-27 15:57:42.034056: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-02-27 15:57:42.034108: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2024-02-27 15:57:42.034290: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-02-27 15:57:42.034751: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2024-02-27 15:57:42.034802: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-02-27 15:57:42.034867: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: NVIDIA GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.695GHz coreCount: 82 deviceMemorySize: 23.68GiB deviceMemoryBandwidth: 871.81GiB/s
2024-02-27 15:57:42.034878: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2024-02-27 15:57:42.034887: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2024-02-27 15:57:42.034893: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.10
2024-02-27 15:57:42.034898: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2024-02-27 15:57:42.034903: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2024-02-27 15:57:42.034909: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2024-02-27 15:57:42.034914: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2024-02-27 15:57:42.034919: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
2024-02-27 15:57:42.034942: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-02-27 15:57:42.035029: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-02-27 15:57:42.035078: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2024-02-27 15:57:42.035093: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1

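So CUDA and cuDNN both look fine... The card is a 3090, the driver version is 545.29.06, CUDA is 11.8, and cuDNN is 8.9.7. It seems NVIDIA provides no way to query whether F16 half precision is in use? Is the only option to add a conda environment activation script that forces float32?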
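
One thing the log itself can answer is which libraries TensorFlow actually loaded: it names `libcudart.so.10.1` and `libcudnn.so.7`, while the versions listed above are CUDA 11.8 and cuDNN 8.9.7. Here is a minimal sketch that extracts those names from the log text and compares major versions. `loaded_library_versions` and `major_mismatches` are hypothetical helpers, and the `.so` suffix is treated as the library version, which usually but not always tracks the toolkit version. Several CUDA versions can coexist on one machine (for example one system-wide and one inside a conda environment), so a difference is a hint to investigate, not proof of a fault.

```python
import re

# Matches TensorFlow's "Successfully opened dynamic library libcudart.so.10.1" lines.
LOADED_LIB = re.compile(
    r"Successfully opened dynamic library (lib\w+)\.so\.(\d+(?:\.\d+)*)"
)


def loaded_library_versions(log_text):
    """Map each loaded library name to the set of version suffixes seen in the log."""
    versions = {}
    for name, version in LOADED_LIB.findall(log_text):
        versions.setdefault(name, set()).add(version)
    return versions


def major_mismatches(loaded, installed):
    """Return {library: (loaded_version, installed_version)} where majors differ."""
    mismatches = {}
    for name, expected in installed.items():
        expected_major = expected.split(".")[0]
        for version in loaded.get(name, ()):
            if version.split(".")[0] != expected_major:
                mismatches[name] = (version, expected)
    return mismatches


# Two lines copied from the log above; the "installed" values are the ones
# stated in the post (CUDA 11.8, cuDNN 8.9.7).
log_text = """\
2024-02-27 15:57:41.403509: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2024-02-27 15:57:42.033867: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
"""
installed = {"libcudart": "11.8", "libcudnn": "8.9.7"}

loaded = loaded_library_versions(log_text)
print(major_mismatches(loaded, installed))
```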
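wtxx8888 | Posted 2024-2-27 23:49:48
Last edited by wtxx8888 on 2024-2-28 00:13
1111a1333 wrote on 2024-2-27 22:45:
Is there a way to verify whether the driver is the problem? When I run a simple algorithm with TensorFlow, the output seems to show that CUDA and cuDNN both ...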

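My background is C and I know only a little Python. In the Python code I have seen, F32 is the default when nothing is specified, and F16 has to be set explicitly (after all, F16 can only be enabled on a GTX 1660 or newer).
Showing NaN instead of a loss value means the training data the GPU receives is blank. What causes that is something you will have to check yourself. Start by updating the driver to the latest version; the current driver is 551.61.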
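1111a1333 (original poster) | Posted 2024-2-28 00:43:59
Last edited by 1111a1333 on 2024-2-28 00:45
wtxx8888 wrote on 2024-2-27 23:49:
My background is C and I know only a little Python. In the Python code I have seen, F32 is the default when nothing is specified, and F16 has to be set explicitly ( ...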

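Solved. I made a basic mistake: my CUDA version was incompatible with the TensorFlow that DFL uses. Thanks for making me realize that the GPU was not receiving any training data at all. I reinstalled the conda environment and everything works now.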