Train your WAN 2.1 LoRA model on Windows/Linux


Fine-tuning your own LoRA model with WAN 2.1 locally is not that difficult. The process is very similar to other LoRA trainings. We will walk you through the step-by-step process whether you are training locally or using third-party cloud services (like RunPod, Hugging Face Spaces, or Replit).

Currently the script supports both the TextToVideo and ImageToVideo variants. For TextToVideo you can use an image/video dataset with a minimum of 24GB VRAM, but for ImageToVideo only video samples will work, and a minimum of 48GB VRAM is required. Make sure you use the 14B model for ImageToVideo.

We will cover everything from setting up the dataset to running your training and testing the results. 




Requirements

- A machine with a powerful GPU (an RTX 4000/5000 series card with 24GB VRAM). Third-party platforms are also a great option if your local machine is not powerful enough.

- ComfyUI (pre-installed on your system)

- Diffusion-Pipe (a great tool for fine-tuning models)

- A well-structured dataset (more on that below)


Step 1. Setting Up Your Training Environment

Before we start training, we need to set up the environment properly on a Windows system. To do this, first install WSL2 and Diffusion-Pipe, which we have already explained in our HunyuanVideo training tutorial.

Linux users do not need the WSL2 setup. Users who have already installed everything (for HunyuanVideo training) can skip to Step 2.


Step 2. Preparing Your Dataset

Your dataset is the key to your training process. Here is how to set up your dataset structure:

(a) Images: At least 10 to 15 images (even 7 to 8 can work) in JPEG or PNG format. You can use short (2-3 second) MP4 videos as well, but they will consume a lot more VRAM.

(b) Text Prompts: Each image must have a corresponding '.txt' file with a description of the image, as shown in the example below.

(c) Trigger Word: Include a unique keyword in every caption so that your model learns to associate it with a specific style or character.



[Image: example image dataset for the WAN model]

Example Dataset Structure:

|---image_1.jpg

|---image_1.txt   

Caption: "A NO2VA model wearing a white tank top, posing for a professional shoot, with messy brown hair."

|---image_2.jpg

|---image_2.txt  

Caption: "A close-up shot of a NO2VA woman in a white dress."

Note: Make sure you have consent to use the data in your dataset. We are showing this for educational and research purposes only.


Prompt Writing Tips

(a) Be descriptive but concise (avoid overly long or too short prompts).

(b) Include details about the background, clothing, actions etc.

(c) Use the same trigger word consistently across all '.txt' files.

Once your dataset is ready, upload it into your "/diffusion-pipe/data/input" folder (we will create this folder in Step 3).


Step 3. Downloading WAN models from Hugging Face

We will download the models from WAN's official Hugging Face repository.

1. Navigate to the folder:

cd diffusion-pipe

2. Create a new directory named "data" inside diffusion-pipe:

mkdir data

Move into the data folder:

cd data

Create a new directory named "output":

mkdir output

Then create a new directory named "input":

mkdir input

Now upload your entire dataset (prepared in Step 2) into this "input" directory.
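For example, if you prepared your images and captions in a folder called my_dataset in your home directory (both the folder name and the ~/diffusion-pipe install location here are just examples, adjust them to your setup), copying everything in could look like this:

cp ~/my_dataset/* ~/diffusion-pipe/data/input/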


3. Create a new directory with a relatable name (e.g., WAN2.1_1.3B):

mkdir -p models/WAN2.1_1.3B

Move into the models folder:

cd models

4. Install the Hugging Face CLI to download the WAN models:

pip install -U "huggingface_hub[cli]"

5. Download the WAN 2.1 model (the 1.3B version, or 14B for better quality), along with the text encoders, VAE, and CLIP vision model:

huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./wan


[Image: Hugging Face repository ID for Wan2.1-T2V-1.3B]

Here, 

huggingface-cli download: Uses the Hugging Face CLI to download a model or dataset. 

Wan-AI/Wan2.1-T2V-1.3B: Specifies the organization (Wan-AI) and the model repository (Wan2.1-T2V-1.3B) on Hugging Face. You can copy the repo ID as shown above. If you are using the 14B TextToVideo variant, use "Wan-AI/Wan2.1-T2V-14B" (without quotes).

--local-dir ./wan: Saves the downloaded files into a "wan" folder inside the current directory (the models folder we just created) on your local machine. You can point this at the directory you created earlier (e.g., WAN2.1_1.3B) instead; just remember the path for Step 4.

If you are already using WAN models for video generation, you can reuse those files; there is no need to download them again.
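For example, if you want the 14B TextToVideo variant instead, the download command uses the 14B repo ID (the ./wan14b folder name below is just an example; run it from the same models folder):

huggingface-cli download Wan-AI/Wan2.1-T2V-14B --local-dir ./wan14b

Whichever folder you choose, note its path, since you will need it for the ckpt_path parameter in Step 4.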



Step 4. Configuring the files

Now we have to make a few changes to the configuration files. To do this, first configure the basic training script, which is quite simple.

[Image: changing the dataset path location in dataset.toml]

1. Open the "dataset.toml" file, available inside the 'diffusion-pipe/examples' directory.

2. Set your dataset path:

dataset_path = "<<your-root-folder>>/diffusion-pipe/data/input"

In our case, the dataset path is "/home/SDT/diffusion-pipe/data/input". Yours will be different if your username is different.

3. Adjust the epoch settings:

num_epochs = 100

save_every_n_epochs = 10     

This saves a checkpoint every 10 epochs.

[Image: changing the LoRA output path location]

4. Modify the output directory, replacing <<your-root-folder>> with your directory path:

output_path = "<<your-root-folder>>/diffusion-pipe/data/output"
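Putting steps 2 to 4 together, the edited values might look like the sketch below (the key names follow this guide and may differ between diffusion-pipe versions, so keep whatever names your config files already use; /home/SDT is just our example root folder):

dataset_path = "/home/SDT/diffusion-pipe/data/input"    # folder containing your images and .txt captions
num_epochs = 100                                        # total number of training epochs
save_every_n_epochs = 10                                # write a LoRA checkpoint every 10 epochs
output_path = "/home/SDT/diffusion-pipe/data/output"    # where the trained checkpoints are saved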


[Image: creating a copy of the config file]

[Image: renaming the copy to wan_video.toml]

5. Create a new copy of the "hunyuan_video.toml" file.

After copying, your new file will be named "hunyuan_video - Copy.toml". Rename it to something relevant for WAN (e.g., wan_video.toml), or leave it as it is. Renaming is not strictly necessary, but it keeps things simple (a one-line terminal equivalent is shown below).
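If you prefer doing this from the terminal, copying and renaming in one step might look like this (run from inside the diffusion-pipe folder; wan_video.toml is simply the name we use in this guide):

cp examples/hunyuan_video.toml examples/wan_video.toml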

[Image: configuring the settings for the WAN video model]

6. Open the "wan_video.toml" file and remove the settings shown above. Here, we are using VS Code to edit the file.

Now copy and paste the settings provided below into the "wan_video.toml" file, and set the "ckpt_path" parameter to your WAN model folder path:

#########################

[model]

type = 'wan'

ckpt_path = '/models/Wan2.1-T2V-1.3B'

dtype = 'bfloat16'

# You can use fp8 for the transformer when training LoRA.

#transformer_dtype = 'float8'

timestep_sample_method = 'logit_normal'

#########################

If you have already downloaded the WAN model for video generation, you can point ckpt_path to that existing model folder instead.


Step 5. Running the Training Command

1. Move into the diffusion-pipe folder:

cd diffusion-pipe

2. Run the following command inside your 'diffusion-pipe' folder, where "wan_video.toml" is the config file we created:

NCCL_P2P_DISABLE="1" NCCL_IB_DISABLE="1" deepspeed --num_gpus=1 train.py --deepspeed --config examples/wan_video.toml

This will start training your LoRA model. Training speed depends on your GPU, dataset size (images/videos), and number of epochs.

If you get a ModuleNotFoundError for 'hyvideo', use this command to install it:

pip install git+https://github.com/ollanoinc/hyvideo.git


Step 6. Testing and Using Your LoRA Model

Once training is complete, your trained LoRA model will be saved into the "output" folder. Here is how to test it:

1. Move the trained model (.safetensors file) into the "ComfyUI/models/loras" folder (an example command is shown after this list).

2. Open ComfyUI and load the LoRA model with a LoRA loader node. To use the generated LoRA, take the default WAN TextToVideo workflow, add a LoraLoaderModelOnly node, and connect it between the "Load Diffusion Model" node and the "KSampler" node.

3. Set your trigger word in the prompt.

4. Adjust the weight (start with 0.8-1.0).

5. Generate images and compare results across different epochs.
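As a sketch of step 1 above, moving a checkpoint from the command line could look like the following (the run and epoch folder names are placeholders; check your output folder for the actual names):

cp ~/diffusion-pipe/data/output/<your-run-folder>/epoch100/*.safetensors ~/ComfyUI/models/loras/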

Fine-Tuning for Better Results

If your LoRA is not performing as expected, here are a few tweaks:

- Increase the number of epochs (try 150-200 for better convergence).

- Improve dataset quality (use more high-quality images).

- Tweak learning rates (lower values help avoid overfitting; see the sketch after this list).

- Test different trigger words to see what works best.
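For the learning-rate tweak, if your config file exposes an [optimizer] block like diffusion-pipe's example configs do (the exact keys below are an assumption, so keep whatever your file already contains), lowering the value is a one-line change:

[optimizer]
type = 'adamw_optimi'    # optimizer type used in the example configs
lr = 1e-5                # a lower learning rate can help avoid overfitting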

We trained an anime-style LoRA using WAN 2.1 (1.3B) with 14 images. For this, we used an RTX 3090 at 512 resolution, and the whole training took about 2.5 hours for a total of 3,500 steps. This is much faster compared to HunyuanVideo LoRA training. Of course, you can also use the larger Wan2.1 (14B), which will yield comparatively better results but will be slower.

If you have an RTX 4090 and are working with Wan2.1 (14B), set the parameter transformer_dtype = 'float8' in the wan_video.toml config file, which can speed up your training by as much as 30x-40x.
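As a rough sketch, the [model] section for the 14B variant with fp8 enabled might look like this (the ckpt_path below assumes you downloaded the 14B checkpoint into a wan14b folder as in Step 3; point it to wherever your files actually live):

[model]
type = 'wan'
ckpt_path = '<<your-root-folder>>/diffusion-pipe/models/wan14b'    # assumed 14B download location
dtype = 'bfloat16'
transformer_dtype = 'float8'    # fp8 transformer weights for faster LoRA training
timestep_sample_method = 'logit_normal'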