[Dataset 0]
loading image sizes.
100%|██████████████████████████████████████████████████████████████████████████████| 150/150 [00:00<00:00, 3123.46it/s]
make buckets
number of images (including repeats) / number of images per bucket (including repeat count)
bucket 0: resolution (512, 512), count: 450
mean ar error (without repeats): 0.0
prepare accelerator
Using accelerator 0.15.0 or above.
load StableDiffusion checkpoint
loading u-net: <All keys matched successfully>
loading vae: <All keys matched successfully>
loading text encoder: <All keys matched successfully>
Replace CrossAttention.forward to use xformers
[Dataset 0]
caching latents.
100%|████████████████████████████████████████████████████████████████████████████████| 150/150 [01:59<00:00, 1.26it/s]
import network module: networks.lora
create LoRA network. base dim (rank): 32, alpha: 32.0
create LoRA for Text Encoder: 72 modules.
create LoRA for U-Net: 192 modules.
enable LoRA for text encoder
enable LoRA for U-Net
prepare optimizer, data loader etc.
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/ ... iewform?usp=sf_link
================================================================================
CUDA SETUP: Loading binary F:\AI\lora-scripts\venv\lib\site-packages\bitsandbytes\libbitsandbytes_cuda116.dll...
use 8-bit AdamW optimizer | {}
override steps. steps for 10 epochs is / number of steps up to the specified epoch: 4500
Traceback (most recent call last):
  File "F:\AI\lora-scripts\sd-scripts\train_network.py", line 699, in <module>
    train(args)
  File "F:\AI\lora-scripts\sd-scripts\train_network.py", line 216, in train
    unet, text_encoder, network, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
  File "F:\AI\lora-scripts\venv\lib\site-packages\accelerate\accelerator.py", line 876, in prepare
    result = tuple(
  File "F:\AI\lora-scripts\venv\lib\site-packages\accelerate\accelerator.py", line 877, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "F:\AI\lora-scripts\venv\lib\site-packages\accelerate\accelerator.py", line 741, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "F:\AI\lora-scripts\venv\lib\site-packages\accelerate\accelerator.py", line 912, in prepare_model
    model = model.to(self.device)
  File "F:\AI\lora-scripts\venv\lib\site-packages\transformers\modeling_utils.py", line 1749, in to
    return super().to(*args, **kwargs)
  File "F:\AI\lora-scripts\venv\lib\site-packages\torch\nn\modules\module.py", line 927, in to
    return self._apply(convert)
  File "F:\AI\lora-scripts\venv\lib\site-packages\torch\nn\modules\module.py", line 579, in _apply
    module._apply(fn)
  File "F:\AI\lora-scripts\venv\lib\site-packages\torch\nn\modules\module.py", line 579, in _apply
    module._apply(fn)
  File "F:\AI\lora-scripts\venv\lib\site-packages\torch\nn\modules\module.py", line 579, in _apply
    module._apply(fn)
  [Previous line repeated 3 more times]
  File "F:\AI\lora-scripts\venv\lib\site-packages\torch\nn\modules\module.py", line 602, in _apply
    param_applied = fn(param)
  File "F:\AI\lora-scripts\venv\lib\site-packages\torch\nn\modules\module.py", line 925, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 4.00 GiB total capacity; 3.43 GiB already allocated; 0 bytes free; 3.50 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "F:\AI\lora-scripts\venv\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "F:\AI\lora-scripts\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 45, in main
    args.func(args)
  File "F:\AI\lora-scripts\venv\lib\site-packages\accelerate\commands\launch.py", line 1104, in launch_command
    simple_launcher(args)
  File "F:\AI\lora-scripts\venv\lib\site-packages\accelerate\commands\launch.py", line 567, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['F:\\AI\\lora-scripts\\venv\\Scripts\\python.exe', './sd-scripts/train_network.py', '--enable_bucket', '--pretrained_model_name_or_path=./sd-models/model.ckpt', '--train_data_dir=./train/wang', '--output_dir=./output', '--logging_dir=./logs', '--resolution=512,512', '--network_module=networks.lora', '--max_train_epochs=10', '--learning_rate=1e-4', '--unet_lr=1e-4', '--text_encoder_lr=1e-5', '--lr_scheduler=cosine_with_restarts', '--lr_warmup_steps=0', '--lr_scheduler_num_cycles=1', '--network_dim=32', '--network_alpha=32', '--output_name=wangyuchun', '--train_batch_size=1', '--save_every_n_epochs=2', '--mixed_precision=fp16', '--save_precision=fp16', '--seed=1337', '--cache_latents', '--clip_skip=2', '--prior_loss_weight=1', '--max_token_length=225', '--caption_extension=.txt', '--save_model_as=safetensors', '--min_bucket_reso=256', '--max_bucket_reso=512', '--keep_tokens=0', '--xformers', '--shuffle_caption', '--use_8bit_adam']' returned non-zero exit status 1.
Train finished
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 4.00 GiB total capacity; 3.43 GiB already allocated; 0 bytes free; 3.50 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
Look carefully at the configuration required for training; you have run out of VRAM here.
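In case it helps, here is a rough, untested sketch of the usual memory-saving levers for kohya-ss sd-scripts on a small card. It assumes your sd-scripts checkout supports --gradient_checkpointing (current versions do); the concrete values (rank 8, max_split_size_mb:128) are illustrative guesses, not a verified fix, and even with them a 4 GB GPU may still run out of memory.

rem Sketch only. Key changes vs. the failing command:
rem   --gradient_checkpointing trades extra compute for much lower activation memory
rem   --network_dim=8 / --network_alpha=8 shrink the LoRA weights and optimizer state vs. rank 32
rem   max_split_size_mb follows the hint in the error message (it helps fragmentation, not a true capacity shortfall)
set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
venv\Scripts\accelerate.exe launch ./sd-scripts/train_network.py ^
  --pretrained_model_name_or_path=./sd-models/model.ckpt ^
  --train_data_dir=./train/wang --output_dir=./output --logging_dir=./logs ^
  --resolution=512,512 --network_module=networks.lora ^
  --network_dim=8 --network_alpha=8 ^
  --train_batch_size=1 --mixed_precision=fp16 --save_precision=fp16 ^
  --gradient_checkpointing --xformers --cache_latents --use_8bit_adam ^
  --max_train_epochs=10 --learning_rate=1e-4 --save_model_as=safetensors

If it still fails at accelerator.prepare, note that the log shows the base model already filling most of the 4 GB before training starts (3.43 GiB allocated, 0 bytes free), so lowering --resolution (and --max_bucket_reso) to 384 or moving to a GPU with 6 GB or more is the more realistic path; if your sd-scripts version supports --full_fp16, that can also reduce usage further.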