Alibaba Cloud has released another diffusion-based video generation model: Wan2.1, an open-source suite of video foundation models licensed under Apache 2.0. It delivers state-of-the-art performance while remaining accessible on consumer hardware. You can read more in their research paper.
It outperforms existing open-source models and rivals commercial solutions on the market. The TextToVideo model generates a 5-second 480P video from Chinese or English text prompts in about 4 minutes on an RTX 4090, using 8.19 GB of VRAM without optimization.
Model | Resolution | Features
T2V-14B | 480P & 720P | Best overall quality
I2V-14B-720P | 720P | Higher-resolution image-to-video
I2V-14B-480P | 480P | Standard-resolution image-to-video
T2V-1.3B | 480P | Lightweight for consumer hardware
Installation
Whichever workflow you plan to use, first install ComfyUI if you are new to it. Existing users need to update ComfyUI from the Manager section by selecting "Update ComfyUI".
Type A: Native Support
1. Download the model (TextToVideo or ImageToVideo) from Hugging Face and save it into your "ComfyUI/models/diffusion_models" directory.
2. Download the text encoder (umt5_xxl_fp8_e4m3fn_scaled.safetensors) and save it into the "ComfyUI/models/text_encoders" folder.
3. Download the CLIP Vision model and put it into your "ComfyUI/models/clip_vision" folder.
4. Finally, download the VAE model and put it into your "ComfyUI/models/vae" folder.
5. Restart ComfyUI for the changes to take effect.
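After these downloads, the relevant part of your models folder should look roughly like this (only the text encoder filename is given above; the other names depend on the exact variants you downloaded):
ComfyUI/models/diffusion_models/ - Wan2.1 TextToVideo or ImageToVideo model
ComfyUI/models/text_encoders/umt5_xxl_fp8_e4m3fn_scaled.safetensors
ComfyUI/models/clip_vision/ - CLIP Vision model
ComfyUI/models/vae/ - Wan2.1 VAE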
Workflow
1. Download the Wan 2.1 workflow (TextToVideo or ImageToVideo).
2. Drag and drop it into ComfyUI.
(a) Load the Wan model (TextToVideo or ImageToVideo) in the UNet loader node.
(b) Load the text encoder in the CLIP node.
(c) Select the VAE model.
(d) Enter your positive/negative prompts.
(e) Set the KSampler settings.
(f) Click "Queue" to start generation.
ImgToVid 14B 480P (Shift Test)
ImgToVid 14B 480P (CFG Test)
Type B: Wan Wrapper (Quantized by Kijai)
1. Clone the Wan Wrapper repository inside your "ComfyUI/custom_nodes" folder by typing the following command into the command prompt:
git clone https://github.com/kijai/ComfyUI-WanVideoWrapper.git
2. Now install the required dependencies by typing the relevant command provided below. For the portable build, open a command prompt inside the "ComfyUI_windows_portable" folder; for a normal install, open it inside the cloned "ComfyUI-WanVideoWrapper" folder so that "requirements.txt" can be found.
For normal ComfyUI users:
pip install -r requirements.txt
For portable ComfyUI users:
python_embeded\python.exe -m pip install -r ComfyUI\custom_nodes\ComfyUI-WanVideoWrapper\requirements.txt
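Putting steps 1 and 2 together, the full sequence for the portable build, run from inside the "ComfyUI_windows_portable" folder, looks something like this:
cd ComfyUI\custom_nodes
git clone https://github.com/kijai/ComfyUI-WanVideoWrapper.git
cd ..\..
python_embeded\python.exe -m pip install -r ComfyUI\custom_nodes\ComfyUI-WanVideoWrapper\requirements.txt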
3. Download the models (TextToVideo or ImageToVideo) from Hugging Face and put them into the "ComfyUI/models/diffusion_models" folder.
Here, there are two options (BF16 and FP8) to choose from, for the different video resolutions (480P and 720P). Select the one that suits your machine and use case: BF16 is for higher-VRAM users (more than 12GB) and FP8 for lower-VRAM users (12GB or less).
4. Download the relevant text encoder and save it into the "ComfyUI/models/text_encoders" folder. Select the BF16 or FP8 variant.
5. Then download the relevant VAE model and place it into your "ComfyUI/models/vae" directory. Select the BF16 or FP32 variant.
6. Restart ComfyUI.
Workflow
1. You can find the workflows inside your "ComfyUI/custom_nodes/ComfyUI-WanVideoWrapper/example_workflows" folder.
2. Drag and drop one into ComfyUI.
Text To Video 14B (CFG Test)
Text To Video 14B (Steps Test)
Text To Video 14B (Shift Test)
We tested this with ImageToVideo on an RTX 3080 with 10GB VRAM, with sage attention enabled, and the generation time was around 467 seconds.
Note: This workflow uses Triton and SageAttention in the background to speed up inference, but they are optional; you can enable or disable them as needed. To enable, add "--use-sage-attention" as a startup argument for ComfyUI.
If you want to install these, make sure you have the Visual C++ Redistributable, CUDA 12.x, and Visual Studio installed on your system. There is a lot of confusion around setting up Triton on Windows machines; you can find a detailed explanation in the Triton-windows GitHub repository.
Install the Windows Triton .whl file for your Python version. To check your Python version, run "python --version" (without quotes) in the command prompt. We have Python 3.10 installed. For other Python versions, check the Windows Triton releases section.
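As an example, assuming Python 3.10 and a Triton wheel downloaded from the Triton-windows releases page (the wheel filename below is a placeholder for the file you actually download, and "sageattention" is the package name as published on PyPI at the time of writing):
python -m pip install triton-3.x.x-cp310-cp310-win_amd64.whl
python -m pip install sageattention
Then start ComfyUI with sage attention enabled:
python main.py --use-sage-attention
Portable users should run the pip commands with "python_embeded\python.exe -m pip" as in the earlier steps, and add the startup argument to their launch .bat file.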
Type C: GGUF variant (By City96)
Users with lower VRAM can use these quantized models. They range from Q2 (faster inference with lower precision and lower quality) to Q8 (slower inference with higher precision and higher-quality generation).
1. Install the GGUF custom nodes from the Manager section by selecting the "Custom nodes manager" option. Now search for "ComfyUI-GGUF" (by author City96) and hit install.
Users who are already using City96's Flux GGUF, Stable Diffusion 3.5 GGUF, or HunyuanVideo GGUF variants only need to update this custom node from the Manager by selecting the "Update" option.
2. Download any of the relevant models from the Hugging Face repository:
Save the Img2Vid model into the "ComfyUI/models/unet" folder, the CLIP Vision model into "ComfyUI/models/clip_vision", the text encoder into "ComfyUI/models/text_encoders", and the VAE into the "ComfyUI/models/vae" folder.
Download the rest of the model files (text encoder, VAE, etc.) from the Kijai Wan repository described above.
Save the Txt2Vid model into the "ComfyUI/models/unet" folder, with the CLIP Vision model, text encoder, and VAE placed in the same folders as above.
Here, the model types range from Q2 (very lightweight, faster, with lower-quality generation) to Q8 (very heavyweight, slower, with higher precision). Choose according to your system VRAM and use case.
3. Restart ComfyUI for the changes to take effect.
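For a quick check, the GGUF setup ends up laid out like this; note that the GGUF model goes into "ComfyUI/models/unet" rather than the "ComfyUI/models/diffusion_models" folder used in Types A and B:
ComfyUI/models/unet/ - Wan2.1 GGUF model (Q2 to Q8, as chosen above)
ComfyUI/models/clip_vision/ - CLIP Vision model
ComfyUI/models/text_encoders/ - text encoder
ComfyUI/models/vae/ - VAE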
Workflow
1. Download the same workflow from ComfyUI's repository as used in Type B's workflow section.
2. Everything else stays the same; just replace the "Load Diffusion Model" node with the "UNet Loader (GGUF)" node.