Alibaba Cloud has released another diffusion-based video generation model: Wan2.1, an open-source suite of video foundation models licensed under Apache 2.0. It delivers state-of-the-art performance while remaining accessible on consumer hardware. You can read more in their research paper.
It outperforms existing open-source models and rivals commercial solutions on the market. The TextToVideo model generates a 5-second 480P video from Chinese or English text prompts in about 4 minutes on an RTX 4090, using 8.19 GB of VRAM without optimization.
Model | Resolution | Features
T2V-14B | 480P & 720P | Best overall quality
I2V-14B-720P | 720P | Higher-resolution image-to-video
I2V-14B-480P | 480P | Standard-resolution image-to-video
T2V-1.3B | 480P | Lightweight for consumer hardware
Installation
Whichever workflow you plan to use, first install ComfyUI if you are new to it. Existing users need to update ComfyUI from the Manager section by selecting "Update ComfyUI".
Type A: Native Support
1. Download the model (TextToVideo or ImageToVideo) from Hugging Face and save it into your "ComfyUI/models/diffusion_models" directory.
2. Download the text encoder (umt5_xxl_fp8_e4m3fn_scaled.safetensors) and save it into the "ComfyUI/models/text_encoders" folder.
3. Download the CLIP Vision model and put it into your "ComfyUI/models/clip_vision" folder.
4. Finally, download the VAE model and put it into your "ComfyUI/models/vae" folder.
5. Restart ComfyUI for the changes to take effect.
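After these downloads, the relevant part of your models folder should look roughly like this (only the text encoder filename is given above; the other names depend on the exact variants you downloaded):
ComfyUI/models/diffusion_models/ - Wan2.1 TextToVideo or ImageToVideo model
ComfyUI/models/text_encoders/umt5_xxl_fp8_e4m3fn_scaled.safetensors
ComfyUI/models/clip_vision/ - CLIP Vision model
ComfyUI/models/vae/ - Wan2.1 VAE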
Workflow
1. Download the Wan 2.1 workflow (TextToVideo or ImageToVideo).
2. Drag and drop it into ComfyUI.
(a) Load the Wan model (TextToVideo or ImageToVideo) in the UNet loader node.
(b) Load the text encoder in the CLIP node.
(c) Select the VAE model.
(d) Enter your positive/negative prompts.
(e) Set the KSampler settings.
(f) Click "Queue" to start generation.
ImgToVid 14B 480P (Shift Test)
ImgToVid 14B 480P (CFG Test)
Type B: Wan Wrapper (Quantized by Kijai)
1. Clone the Wan Wrapper repository inside your "ComfyUI/custom_nodes" folder by typing the following command into the command prompt:
git clone https://github.com/kijai/ComfyUI-WanVideoWrapper.git
2. Now install the required dependencies by typing the relevant command provided below. For the portable build, open a command prompt inside the "ComfyUI_windows_portable" folder; for a normal install, open it inside the cloned "ComfyUI-WanVideoWrapper" folder so that "requirements.txt" can be found.
For normal ComfyUI users:
pip install -r requirements.txt
For portable ComfyUI users:
python_embeded\python.exe -m pip install -r ComfyUI\custom_nodes\ComfyUI-WanVideoWrapper\requirements.txt
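Putting steps 1 and 2 together, the full sequence for the portable build, run from inside the "ComfyUI_windows_portable" folder, looks something like this:
cd ComfyUI\custom_nodes
git clone https://github.com/kijai/ComfyUI-WanVideoWrapper.git
cd ..\..
python_embeded\python.exe -m pip install -r ComfyUI\custom_nodes\ComfyUI-WanVideoWrapper\requirements.txt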
3. Download the models (TextToVideo or ImageToVideo) from Hugging Face and put them into the "ComfyUI/models/diffusion_models" folder.
Here, there are two options (BF16 and FP8) to choose from, for the different video resolutions (480P and 720P). Select the one that suits your machine and use case: BF16 is for higher-VRAM users (more than 12GB) and FP8 for lower-VRAM users (12GB or less).
4. Download the relevant text encoder and save it into the "ComfyUI/models/text_encoders" folder. Select the BF16 or FP8 variant.
5. Then download the relevant VAE model and place it into your "ComfyUI/models/vae" directory. Select the BF16 or FP32 variant.
6. Restart ComfyUI.
Workflow
1. You can find the workflows inside your "ComfyUI/custom_nodes/ComfyUI-WanVideoWrapper/example_workflows" folder.
2. Drag and drop one into ComfyUI.
Text To Video 14B (CFG Test)
Text To Video 14B (Steps Test)
Text To Video 14B (Shift Test)
We tested this with ImageToVideo on an RTX 3080 with 10GB VRAM, with sage attention enabled, and the generation time was around 467 seconds.
Note: This workflow uses Triton and SageAttention in the background to speed up inference, but they are optional; you can enable or disable them as needed. To enable, add "--use-sage-attention" as a startup argument for ComfyUI.
If you want to install these, make sure you have the Visual C++ Redistributable, CUDA 12.x, and Visual Studio installed on your system. There is a lot of confusion around setting up Triton on Windows machines; you can find a detailed explanation in the Triton-windows GitHub repository.
Install the Windows Triton .whl file for your Python version. To check your Python version, run "python --version" (without quotes) in the command prompt. We have Python 3.10 installed. For other Python versions, check the Windows Triton releases section.
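As an example, assuming Python 3.10 and a Triton wheel downloaded from the Triton-windows releases page (the wheel filename below is a placeholder for the file you actually download, and "sageattention" is the package name as published on PyPI at the time of writing):
python -m pip install triton-3.x.x-cp310-cp310-win_amd64.whl
python -m pip install sageattention
Then start ComfyUI with sage attention enabled:
python main.py --use-sage-attention
Portable users should run the pip commands with "python_embeded\python.exe -m pip" as in the earlier steps, and add the startup argument to their launch .bat file.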
Type C: GGUF variant (By City96)
Users with lower VRAM can use these quantized models. They range from Q2 (faster inference with lower precision and lower quality) to Q8 (slower inference with higher precision and higher-quality generation).
1. Install the GGUF custom nodes from the Manager section by selecting the "Custom nodes manager" option. Now search for "ComfyUI-GGUF" (by author City96) and hit install.
Users who are already using City96's Flux GGUF, Stable Diffusion 3.5 GGUF, or HunyuanVideo GGUF variants only need to update this custom node from the Manager by selecting the "Update" option.
2. Download any of the relevant models from the Hugging Face repository:
Save the Img2Vid model into the "ComfyUI/models/unet" folder, the CLIP Vision model into "ComfyUI/models/clip_vision", the text encoder into "ComfyUI/models/text_encoders", and the VAE into the "ComfyUI/models/vae" folder.
Download the rest of the model files (text encoder, VAE, etc.) from the Kijai Wan repository described above.
Save the Txt2Vid model into the "ComfyUI/models/unet" folder, with the CLIP Vision model, text encoder, and VAE placed in the same folders as above.
Here, the model types range from Q2 (very lightweight, faster, with lower-quality generation) to Q8 (very heavyweight, slower, with higher precision). Choose according to your system VRAM and use case.
3. Restart ComfyUI for the changes to take effect.
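For a quick check, the GGUF setup ends up laid out like this; note that the GGUF model goes into "ComfyUI/models/unet" rather than the "ComfyUI/models/diffusion_models" folder used in Types A and B:
ComfyUI/models/unet/ - Wan2.1 GGUF model (Q2 to Q8, as chosen above)
ComfyUI/models/clip_vision/ - CLIP Vision model
ComfyUI/models/text_encoders/ - text encoder
ComfyUI/models/vae/ - VAE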
Workflow
1. Download the same workflow from ComfyUI's repository as used in Type B's workflow section.
2. Everything else stays the same; just replace the "Load Diffusion Model" node with the "UNet Loader (GGUF)" node.