Stable Diffusion LoRA Training in Practice: Training a Custom Style Model for Our Company

2022-11-20 | Stable Diffusion LoRA Image Generation

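Our company needed an AI that can generate images in a specific IP style, so I spent two weeks studying LoRA training. This post documents the full workflow from data preparation to deployment, along with every pitfall I hit along the way.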

Project Background

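In late 2022, the company wanted to build an AI drawing tool that generates images matching our brand IP style. Images from the stock SD model were far from that style, so fine-tuning was needed.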

I evaluated several approaches:

Approach	Training time	Model size	Quality
Full fine-tuning	1-2 days	4GB	Best
DreamBooth	30 minutes	4GB	Good
LoRA	20 minutes	10-100MB	Good
Textual Inversion	1 hour	4KB	Fair

I settled on LoRA: it trains quickly, produces a small file, and several LoRAs can be stacked at inference time.
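To make the "small model" point concrete, here is a minimal sketch of why a LoRA file is so much smaller than a full checkpoint. It assumes the standard low-rank formulation, in which a dense d_out x d_in weight matrix gets two trainable factors of shapes d_out x r and r x d_in. The 768 x 768 layer size is an illustrative assumption, not a measurement of SD 1.5.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Number of weights in a dense d_out x d_in layer."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable weights LoRA adds to that layer: B (d_out x r) plus A (r x d_in)."""
    return rank * (d_out + d_in)


if __name__ == "__main__":
    d_out, d_in = 768, 768  # illustrative layer size (assumption)
    dense = full_params(d_out, d_in)
    for rank in (4, 8, 32):
        extra = lora_params(d_out, d_in, rank)
        print(f"rank={rank}: {extra} params ({extra / dense:.1%} of the dense layer)")
```

Only the two small factors are saved, which is why the file size grows roughly linearly with the rank.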

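Data Preparation

Collecting Training Images

Count: 50 images related to the brand IP
Size: cropped to a uniform 512x512 (SD 1.5) or 1024x1024 (SDXL)
Quality: sharp, watermark-free, consistent in style

Image Processing Script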

from PIL import Image
import os

def prepare_images(input_dir: str, output_dir: str, size: int = 512):
    """
    Prepare training images:
    1. Center-crop to a square
    2. Resize to size x size
    3. Save as PNG
    """
    os.makedirs(output_dir, exist_ok=True)

    for filename in os.listdir(input_dir):
        if not filename.lower().endswith(('.jpg', '.jpeg', '.png', '.webp')):
            continue

        img_path = os.path.join(input_dir, filename)
        img = Image.open(img_path).convert('RGB')

        # Center-crop to a square
        width, height = img.size
        min_dim = min(width, height)
        left = (width - min_dim) // 2
        top = (height - min_dim) // 2
        img = img.crop((left, top, left + min_dim, top + min_dim))

        # Resize to the training resolution
        img = img.resize((size, size), Image.LANCZOS)

        # Save under the original base name with a .png extension
        output_path = os.path.join(output_dir, f"{os.path.splitext(filename)[0]}.png")
        img.save(output_path, 'PNG')
        print(f"Processed: {filename}")

prepare_images('./raw_images', './train_images', size=512)
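The crop arithmetic in the script is easy to check by hand. The helper below is a small sketch that isolates it; the function name center_crop_box is my own, not part of PIL. It prints the crop box for a landscape and a portrait image so you can see which edges get trimmed.

```python
def center_crop_box(width: int, height: int) -> tuple[int, int, int, int]:
    """Return the (left, top, right, bottom) box of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)


if __name__ == "__main__":
    print(center_crop_box(1920, 1080))  # landscape: trims the left and right edges
    print(center_crop_box(600, 900))    # portrait: trims the top and bottom edges
```

If the subject of an image sits far from the center, a center crop can cut it off, so those images are better cropped by hand.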

Writing Captions

Each image needs a matching .txt file that describes its content:

# Auto-caption first, then correct the results by hand
import os

import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

class AutoCaptioner:
    def __init__(self):
        self.processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
        self.model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

    def caption(self, image_path: str) -> str:
        img = Image.open(image_path).convert('RGB')
        inputs = self.processor(img, return_tensors="pt")

        with torch.no_grad():
            output = self.model.generate(**inputs, max_new_tokens=50)

        caption = self.processor.decode(output[0], skip_special_tokens=True)
        return caption

# Usage
captioner = AutoCaptioner()
for img_file in os.listdir('./train_images'):
    if img_file.endswith('.png'):
        caption = captioner.caption(f'./train_images/{img_file}')

        # Prepend the trigger word
        caption = f"myipstyle, {caption}"

        # Save next to the image under the same base name
        txt_file = img_file.replace('.png', '.txt')
        with open(f'./train_images/{txt_file}', 'w', encoding='utf-8') as f:
            f.write(caption)

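⚠️ The trigger word matters!
Every caption should start with a unique trigger word (such as myipstyle), and the same word is used at inference time to invoke the learned style. Without one, the style is not tied to any specific token and tends to leak into ordinary prompts, degrading the model's general behavior.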
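Since captions get edited by hand after auto-captioning, it is easy to lose a trigger word or forget a file. The following is a minimal sketch of a pre-training check. It assumes the layout used above (PNG images with same-named .txt captions in one folder), and the function name find_caption_problems is hypothetical.

```python
from pathlib import Path


def find_caption_problems(image_dir: str, trigger: str = "myipstyle") -> list[str]:
    """List PNGs whose caption file is missing or does not start with the trigger word."""
    problems = []
    for image_path in sorted(Path(image_dir).glob("*.png")):
        caption_path = image_path.with_suffix(".txt")
        if not caption_path.exists():
            problems.append(f"{image_path.name}: no caption file")
            continue
        caption = caption_path.read_text(encoding="utf-8").strip()
        if not caption.startswith(trigger):
            problems.append(f"{image_path.name}: caption does not start with '{trigger}'")
    return problems


if __name__ == "__main__":
    for problem in find_caption_problems("./train_images"):
        print(problem)
```

An empty result means every image has a caption that begins with the trigger word.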

Training Configuration

Training with kohya-ss

# Install kohya-ss sd-scripts
git clone https://github.com/kohya-ss/sd-scripts
cd sd-scripts
pip install -r requirements.txt

# Training config file train_config.toml
# Key names mirror the train_network.py flags used in the training script below.
# Section names are only for grouping; check the accepted keys against your sd-scripts version.
[model]
pretrained_model_name_or_path = "runwayml/stable-diffusion-v1-5"

[dataset]
train_data_dir = "./train_images"
resolution = "512"

[output]
output_dir = "./output"

[network]
network_module = "networks.lora"
network_dim = 32      # LoRA rank
network_alpha = 16    # commonly set to half of network_dim

[training]
train_batch_size = 1
max_train_epochs = 10
learning_rate = 1e-4
unet_lr = 1e-4
text_encoder_lr = 5e-5
lr_scheduler = "cosine_with_restarts"
lr_warmup_steps = 100

[optimizer]
optimizer_type = "AdamW8bit"  # saves VRAM

[advanced]
mixed_precision = "fp16"
gradient_checkpointing = true
xformers = true

Training Script

#!/bin/bash

# Basic training command
accelerate launch --num_cpu_threads_per_process=2 train_network.py \
    --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
    --train_data_dir="./train_images" \
    --output_dir="./output" \
    --output_name="my_ip_style" \
    --save_model_as=safetensors \
    --network_module=networks.lora \
    --network_dim=32 \
    --network_alpha=16 \
    --resolution=512 \
    --train_batch_size=1 \
    --max_train_epochs=10 \
    --learning_rate=1e-4 \
    --unet_lr=1e-4 \
    --text_encoder_lr=5e-5 \
    --lr_scheduler="cosine_with_restarts" \
    --lr_warmup_steps=100 \
    --optimizer_type="AdamW8bit" \
    --mixed_precision="fp16" \
    --gradient_checkpointing \
    --xformers \
    --cache_latents \
    --save_every_n_epochs=2
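One thing worth checking before launching: as far as I know, the default (DreamBooth-style) loader in sd-scripts reads images from subfolders of train_data_dir named <repeats>_<identifier>, and skips images placed directly in the top-level folder. The sketch below copies the image/caption pairs into such a subfolder. The ./train_data destination and the function name build_kohya_folder are assumptions for illustration, and repeats=1 keeps the step count the same as plain epochs.

```python
import shutil
from pathlib import Path


def build_kohya_folder(src_dir: str, dst_root: str, repeats: int = 1, name: str = "myipstyle") -> Path:
    """Copy .png/.txt pairs into dst_root/<repeats>_<name>/ and return that folder."""
    target = Path(dst_root) / f"{repeats}_{name}"
    target.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(src_dir).iterdir()):
        if path.suffix.lower() in {".png", ".txt"}:
            shutil.copy2(path, target / path.name)
    return target


if __name__ == "__main__":
    if Path("./train_images").is_dir():
        print(build_kohya_folder("./train_images", "./train_data"))
```

With this layout, --train_data_dir would point at ./train_data rather than at the folder that holds the images directly.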

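Parameter Tuning Notes

network_dim (rank)

rank	Use case	Model size
4	Simple styles, few training images	~4MB
8-16	General use (recommended)	~10-20MB
32-64	Complex styles, character features	~40-80MB
128+	Approaches full fine-tuning	~150MB+

💡 Tip: for a style LoRA, rank=32 is usually enough. A person/character LoRA may need 64 or higher.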
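The rank is only half of the network setting. In the usual LoRA formulation, the learned update is multiplied by alpha / rank before it is added to the base weight, so network_alpha effectively scales how strongly the LoRA acts. The sketch below uses toy rank-1 matrices (made-up numbers, pure Python) to show the effect of the network_dim=32, network_alpha=16 pair used in this post.

```python
def lora_scale(network_alpha: float, network_dim: int) -> float:
    """Scaling factor applied to the low-rank update (alpha / rank)."""
    return network_alpha / network_dim


def apply_lora(base, up, down, scale):
    """Return base + scale * (up @ down) for small matrices given as nested lists."""
    rows, cols, rank = len(base), len(base[0]), len(down)
    return [
        [
            base[i][j] + scale * sum(up[i][k] * down[k][j] for k in range(rank))
            for j in range(cols)
        ]
        for i in range(rows)
    ]


if __name__ == "__main__":
    scale = lora_scale(16, 32)        # the dim/alpha pair from the config above
    print(scale)
    base = [[1.0, 0.0], [0.0, 1.0]]   # toy base weight (assumption)
    up = [[1.0], [2.0]]               # toy rank-1 factors (assumption)
    down = [[3.0, 4.0]]
    print(apply_lora(base, up, down, scale))
```

Because of this scaling, changing network_alpha at a fixed rank behaves somewhat like changing the learning rate, so the two are worth tuning together.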

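Learning Rate

# Use different learning rates for different components
unet_lr = 1e-4          # UNet (the main network)
text_encoder_lr = 5e-5  # text encoder, kept lower

# If the model overfits, lower the learning rate
# If it underfits, raise the learning rate or add epochs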
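The configured value is only the peak learning rate; with lr_scheduler="cosine_with_restarts" and lr_warmup_steps=100, the actual rate changes at every step. Below is a simplified sketch of that shape: linear warmup followed by cosine decay with hard restarts. It follows the commonly used formula rather than being copied from sd-scripts, and the 500-step total is an assumption (50 images x 10 epochs at batch size 1).

```python
import math


def lr_at_step(step: int, base_lr: float, warmup_steps: int, total_steps: int, num_cycles: int = 1) -> float:
    """Linear warmup, then cosine decay that restarts num_cycles times."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if progress >= 1.0:
        return 0.0
    cycle_pos = (num_cycles * progress) % 1.0  # position inside the current cycle
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * cycle_pos))


if __name__ == "__main__":
    for step in (0, 50, 100, 300, 499):
        print(step, f"{lr_at_step(step, 1e-4, 100, 500):.2e}")
```

With these numbers, a fifth of the run is spent warming up, so on small datasets it may be worth lowering lr_warmup_steps.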

Number of Epochs

50 images: 8-15 epochs
100 images: 5-10 epochs
200+ images: 3-8 epochs
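These ranges make more sense when converted to optimizer steps. The sketch below does that arithmetic for batch size 1, assuming no dataset repeats and no gradient accumulation; the function name total_steps is my own.

```python
import math


def total_steps(num_images: int, epochs: int, batch_size: int = 1, repeats: int = 1) -> int:
    """Optimizer steps for a run: batches per epoch times the number of epochs."""
    steps_per_epoch = math.ceil(num_images * repeats / batch_size)
    return steps_per_epoch * epochs


if __name__ == "__main__":
    # (image count, low epoch count, high epoch count) from the guideline above
    for images, low, high in ((50, 8, 15), (100, 5, 10), (200, 3, 8)):
        print(f"{images} images: {total_steps(images, low)}-{total_steps(images, high)} steps")
```

The three rows land in overlapping step ranges, which suggests the guideline is really about a total step budget rather than a fixed epoch count.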

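⚠️ Signs of overfitting:

Generated images are nearly identical to the training images
Different prompts produce very similar images
Watermarks, signatures, or similar artifacts from the training images show up

Fixes: train for fewer epochs, lower the learning rate, or add more training images.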

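VRAM Optimization

Configuration for a 3090 (24GB)

# Enable every memory-saving option (append these flags to the training command)
--gradient_checkpointing       # saves ~30% VRAM
--xformers                     # saves ~20% VRAM
--cache_latents                # precomputes latents, saving VRAM
--optimizer_type="AdamW8bit"   # 8-bit optimizer

# At batch_size=1, 512 resolution needs about 12GB of VRAM
# 1024 resolution needs ~20GB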

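Strategies When VRAM Runs Out

# Option 1: lower the resolution
--resolution=384  # or 448

# Option 2: keep the batch at 1 and use gradient accumulation for a larger effective batch
--gradient_accumulation_steps=4
--train_batch_size=1

# Option 3: train only the UNet, not the text encoder
--network_train_unet_only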
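Gradient accumulation works because averaging the gradients of several small batches gives the same result as one larger batch, while only one small batch is in memory at a time. The toy example below checks this on a one-parameter least-squares problem. The data and function names are made up for illustration; real training does the same thing per weight tensor.

```python
def grad_mse(w: float, xs: list[float], ys: list[float]) -> float:
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w: float, xs: list[float], ys: list[float], micro_batch: int = 1) -> float:
    """Average the gradients of consecutive micro-batches, as accumulation does."""
    chunks = [
        (xs[i:i + micro_batch], ys[i:i + micro_batch])
        for i in range(0, len(xs), micro_batch)
    ]
    return sum(grad_mse(w, cx, cy) for cx, cy in chunks) / len(chunks)


if __name__ == "__main__":
    xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
    print(grad_mse(0.5, xs, ys))          # one batch of 4
    print(accumulated_grad(0.5, xs, ys))  # 4 micro-batches of 1, accumulated
```

The two values match when the micro-batches are the same size, so --train_batch_size=1 with --gradient_accumulation_steps=4 behaves like an effective batch of 4, at the cost of four forward/backward passes per optimizer step.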

Inference Test

from diffusers import StableDiffusionPipeline
import torch

# Load the base model
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

# Load the LoRA
pipe.load_lora_weights("./output/my_ip_style.safetensors")

# Generate an image (the prompt includes the trigger word)
prompt = "myipstyle, a cute cat sitting on a sofa, high quality"
negative_prompt = "low quality, blurry, ugly"

image = pipe(
    prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=30,
    guidance_scale=7.5,
    width=512,
    height=512
).images[0]

image.save("output.png")

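Adjusting LoRA Strength

# Too high and the style overpowers the prompt; too low and it barely shows
# The default is 1.0

# Method 1: name the adapter at load time, then set its weight
pipe.load_lora_weights("./output/my_ip_style.safetensors", adapter_name="style")
pipe.set_adapters(["style"], adapter_weights=[0.8])

# Method 2: set it in the prompt
# With AUTOMATIC1111 WebUI, add a tag such as <lora:my_ip_style:0.8> to the prompt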
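For the WebUI route, the strength lives in the prompt itself: AUTOMATIC1111 reads tags of the form <lora:NAME:WEIGHT>, where NAME is the LoRA file name without its extension. The helper below is a small sketch for building such prompts when you generate them from code; with_lora_tag is a hypothetical name, not part of any WebUI API.

```python
def with_lora_tag(prompt: str, lora_name: str, weight: float = 1.0) -> str:
    """Append a WebUI-style <lora:NAME:WEIGHT> tag to a prompt."""
    return f"{prompt} <lora:{lora_name}:{weight:g}>"


if __name__ == "__main__":
    print(with_lora_tag("myipstyle, a cute cat sitting on a sofa", "my_ip_style", 0.8))
```

Sweeping the weight over a few values (for example 0.6, 0.8, 1.0) with a fixed seed is a quick way to find the point where the style shows without overpowering the prompt.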

Stacking Multiple LoRAs

# Several LoRAs can be active at the same time
pipe.load_lora_weights("./style_lora.safetensors", adapter_name="style")
pipe.load_lora_weights("./character_lora.safetensors", adapter_name="character")

# Set each weight separately
pipe.set_adapters(
    ["style", "character"], 
    adapter_weights=[0.7, 0.5]
)

# Note: stacking too many LoRAs can produce muddled results
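Conceptually, stacking works because each adapter contributes its own low-rank update to the same base weight. Assuming every adapter follows the standard LoRA form, with factors B_i and A_i, rank r_i, alpha α_i, and a user-chosen adapter weight w_i, the effective weight of a layer can be written as:

```latex
W' = W_0 + \sum_{i} w_i \, \frac{\alpha_i}{r_i} \, B_i A_i
```

In the example above, w_style = 0.7 and w_character = 0.5. The updates simply add up, so two adapters that modify the same layers in conflicting directions will interfere, which is one plausible reason results get muddled as more LoRAs are stacked.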

Training Log Analysis

import matplotlib.pyplot as plt

def plot_training_loss(log_file: str):
    """Plot the training loss curve."""
    steps = []
    losses = []

    with open(log_file, 'r') as f:
        for line in f:
            if 'loss=' in line:
                # Parse log lines that contain fields like "step=120, loss=0.0831,"
                step = int(line.split('step=')[1].split(',')[0])
                loss = float(line.split('loss=')[1].split(',')[0])
                steps.append(step)
                losses.append(loss)

    plt.figure(figsize=(10, 5))
    plt.plot(steps, losses)
    plt.xlabel('Step')
    plt.ylabel('Loss')
    plt.title('Training Loss')
    plt.savefig('loss_curve.png')

# A healthy loss curve drops quickly at first, then levels off
# Warning signs:
# - loss does not fall: learning rate too low, or a problem with the data
# - loss oscillates: learning rate too high
# - loss falls, then rises: overfitting
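Raw per-step loss is noisy, so the "falls, then rises" pattern is easier to spot on a smoothed curve. The sketch below applies a trailing moving average and flags a run whose smoothed loss ends noticeably above its minimum. The window size, the 5% tolerance, and the function names are assumptions to tune, and the check assumes positive loss values.

```python
def moving_average(values: list[float], window: int = 5) -> list[float]:
    """Trailing moving average; early points average over the values seen so far."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def rises_after_minimum(values: list[float], window: int = 5, tolerance: float = 0.05) -> bool:
    """True if the smoothed loss ends more than `tolerance` (relative) above its minimum."""
    smoothed = moving_average(values, window)
    if not smoothed:
        return False
    return smoothed[-1] > min(smoothed) * (1 + tolerance)


if __name__ == "__main__":
    falling = [0.20, 0.15, 0.12, 0.11, 0.10, 0.10, 0.10]    # made-up healthy run
    u_shaped = [0.20, 0.12, 0.10, 0.10, 0.12, 0.16, 0.20]   # made-up run that turns upward
    print(rises_after_minimum(falling, window=3))
    print(rises_after_minimum(u_shaped, window=3))
```

A flagged run is a hint to compare the checkpoints saved by --save_every_n_epochs and pick an earlier one, not proof of overfitting on its own.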

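Common Problems

Problem 1: Generated images are blurry

Cause: low-quality training images, or a resolution mismatch

Fix: use high-resolution training images and keep the training and inference resolutions the same

Problem 2: Colors look wrong

Cause: possibly an incompatible VAE

Fix: use the same VAE for training and inference

Problem 3: A specific concept is not learned

Cause: too few training images, or inaccurate captions

Fix: add training images and improve the caption descriptions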

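Deployment

# Speed up inference: fuse the LoRA, then export to ONNX (the TensorRT build is not shown here)
from diffusers import StableDiffusionPipeline
from optimum.onnxruntime import ORTStableDiffusionPipeline

# Load the base model and the trained LoRA
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe.load_lora_weights("./my_lora.safetensors")

# Fuse the LoRA into the base weights (faster inference), then drop the adapter
pipe.fuse_lora()
pipe.unload_lora_weights()

# Save the fused pipeline in diffusers format
pipe.save_pretrained("./sd_with_lora", safe_serialization=True)

# Export the fused pipeline to ONNX with Optimum
ort_pipe = ORTStableDiffusionPipeline.from_pretrained("./sd_with_lora", export=True)
ort_pipe.save_pretrained("./sd_with_lora_onnx")

# Inference speed comparison (3090):
# - Plain PyTorch: 3.2s/image
# - After fusing the LoRA: 2.8s/image
# - ONNX optimized: 2.0s/image
# - TensorRT: 1.2s/image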
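To put the timings in perspective, the sketch below turns the seconds-per-image figures above into speedup factors relative to plain PyTorch. The numbers are copied from the comparison; the dictionary keys and the function name are my own labels.

```python
TIMINGS = {  # seconds per image, copied from the comparison above
    "PyTorch": 3.2,
    "Fused LoRA": 2.8,
    "ONNX": 2.0,
    "TensorRT": 1.2,
}


def speedups(seconds_per_image: dict[str, float], baseline: str) -> dict[str, float]:
    """Speedup of each setup relative to the baseline, rounded to two decimals."""
    base = seconds_per_image[baseline]
    return {name: round(base / secs, 2) for name, secs in seconds_per_image.items()}


if __name__ == "__main__":
    for name, factor in speedups(TIMINGS, "PyTorch").items():
        print(f"{name}: {factor}x")
```

Fusing alone is a modest gain; most of the improvement comes from the runtime change.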

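Results

Metric	Stock SD	+LoRA
Style match (human rating)	2.1/5	4.3/5
Inference time	2.8s	3.0s
Added model size	-	35MB

✅ Project payoff: LoRA training gave us a model that generates images in the brand style. Designers use it to produce concept art quickly, with roughly a 3x gain in efficiency.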

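Lessons Learned

Data quality over quantity: 20 high-quality images beat 100 mixed-quality ones
Captions must be accurate: imprecise descriptions lead to confused concepts
Always add a trigger word: it keeps the style from bleeding into the model's general behavior
Start with a small rank: try rank=8 first and increase it only if needed
Stop early: mild underfitting is better than overfitting

Changelog:
2022-11-20: Initial version published
2023-04-15: Added SDXL LoRA training method
2023-09-10: Updated to the latest kohya-ss configuration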