CogVideoX, a text-to-video diffusion model, has been released by the Knowledge Engineering Group (KEG) & Data Mining team (THUDM) at Tsinghua University.
The model was trained on long, detailed prompts of the kind produced by ChatGLM4 or ChatGPT-4. For a detailed overview of CogVideoX, see the research paper. For commercial use, check out ChatGLM and the API platform.
Unlike many diffusion video models that struggle to generate longer clips, CogVideoX can generate 6-second videos, and it can now run on GPUs with less than 12GB of VRAM.
Currently, these variants have been released:
- CogVideoX-5B (Text-to-Video), released under the CogVideoX license.
- CogVideoX-2B (Text-to-Video), released under the Apache 2.0 license.
- CogVideoX-5B-I2V (Image-to-Video), released under the CogVideoX license.
Let's move on to the installation and the workflow.
Installation
1. Make sure ComfyUI is installed and up to date.
2. Clone the CogVideoX wrapper (custom nodes). Move into the "ComfyUI/custom_nodes" folder, click the folder's address bar, and type "cmd" to open a command prompt.
Then, just paste this command in the command prompt to install the wrapper:
git clone https://github.com/kijai/ComfyUI-CogVideoXWrapper.git
3. You also need a few other dependencies to speed up video rendering.
For ComfyUI portable users:
Move into the "ComfyUI_windows_portable" folder, click the folder's address bar, type "cmd" to open a command prompt, and run this command:
python_embeded\python.exe -m pip install --pre onediff onediffx nexfort
For regular ComfyUI users:
Open a command prompt and run these commands:
pip install --pre onediff onediffx
pip install nexfort
All the required models are downloaded automatically from THUDM's Hugging Face repository, so you don't need to download them manually.
The first run of the workflow will take some time because the models are downloaded in the background. To see the real-time progress, switch to the command prompt window running in the background. If you prefer to fetch the weights ahead of time, a small sketch follows below.
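If you would rather pre-download the weights than wait on the first run, here is a minimal sketch using the huggingface_hub library. The repo ID is the official THUDM one, but the target folder is our own assumption; the wrapper may organize its downloads differently, so adjust the path as needed.

```python
# Optional: pre-download the CogVideoX-5B weights from Hugging Face.
# The local_dir below is an assumption; point it wherever your setup expects,
# or simply let the wrapper download the models on first run.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="THUDM/CogVideoX-5b",
    local_dir="ComfyUI/models/CogVideo/CogVideoX-5b",
)
```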
Optional (for Windows users): You can also install Triton for Windows and SageAttention, which community reports say significantly reduces video rendering time (by roughly 25%). A quick import check is sketched after the commands below.
Install the Triton for Windows .whl file that matches your Python version. We have Python 3.10 installed; for other Python versions, check the Triton for Windows releases section.
For regular ComfyUI users:
pip install triton-3.1.0-cp310-cp310-win_amd64.whl
pip install sageattention
For ComfyUI portable users (move into the "ComfyUI_windows_portable" folder and open a command prompt there):
.\python_embeded\python.exe -m pip install triton-3.1.0-cp310-cp310-win_amd64.whl
.\python_embeded\python.exe -m pip install sageattention
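To confirm that the optional packages installed correctly, a quick sanity check like the following can be run with the same Python interpreter you installed into (python_embeded\python.exe for portable installs). This is just a hedged import test, not part of the wrapper itself.

```python
# Quick sanity check that the optional speed-up packages can be imported.
import triton
import sageattention

print("triton version:", triton.__version__)
print("sageattention imported successfully")
```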
Workflow
1. The example workflows can be found inside the "ComfyUI/custom_nodes/ComfyUI-CogVideoXWrapper/examples" folder. Drag and drop one directly into ComfyUI.
| Workflow | Description |
| --- | --- |
| cogvideo_2b_context_schedule_test_01.json | update workflows |
| cogvideo_5b_vid2vid_example_01.json | update workflows |
| cogvideo_2b_controlnet_example_01.json | Update cogvideo_2b_controlnet_example_01.json |
| cogvideo_5b_example_01.json | Update cogvideo_5b_example_01.json |
| cogvideo_I2V_example_01.json | update workflows |
| cogvideo_fun_pose_example_01.json | Add context schedules for control pipeline |
| cogvideo_fun_5b_GGUF_10GB_VRAM_example_01.json | Create cogvideo_fun_5b_GGUF_10GB_VRAM_example_01.json |
| cogvideo_fun_i2v_example_01.json | update vae tile defaults |
Multiple workflows are available; choose one according to your requirements. For illustration, we are showcasing the basic one.
2. In the CogVideo model node, there are three variants you can choose from:
(a) CogVideoX-5B (Text-to-Video, for higher VRAM)
(b) CogVideoX-2B (Text-to-Video, for lower VRAM)
(c) CogVideoX-5B-I2V (for Image-to-Video)
Source: CogVideoX
Use the recommended settings for your system. The CogVideoX team has shared more detailed information; go through the table below to get a better understanding, and see the standalone diffusers sketch that follows it.
CogVideoX model details:
| Model Type | CogVideoX-2B | CogVideoX-5B | CogVideoX-5B-I2V |
| --- | --- | --- | --- |
| Model Description | Entry-level model, balancing compatibility. Low cost for running and secondary development. | Larger model with higher video generation quality and better visual effects. | CogVideoX-5B image-to-video version. |
| Inference Precision | FP16* (recommended), BF16, FP32, FP8*, INT8; INT4 not supported | BF16 (recommended), FP16, FP32, FP8*, INT8; INT4 not supported | Same as CogVideoX-5B |
| Single GPU Memory Usage | SAT FP16: 18 GB; diffusers FP16: from 4 GB*; diffusers INT8 (torchao): from 3.6 GB* | SAT BF16: 26 GB; diffusers BF16: from 5 GB*; diffusers INT8 (torchao): from 4.4 GB* | Same as CogVideoX-5B |
| Multi-GPU Inference Memory Usage | FP16: 10 GB* using diffusers | BF16: 15 GB* using diffusers | Same as CogVideoX-5B |
| Inference Speed (Step = 50, FP/BF16) | Single A100: ~90 seconds; single H100: ~45 seconds | Single A100: ~180 seconds; single H100: ~90 seconds | Same as CogVideoX-5B |
| Fine-tuning Precision | FP16 | BF16 | Same as CogVideoX-5B |
| Fine-tuning Memory Usage | 47 GB (bs=1, LoRA); 61 GB (bs=2, LoRA); 62 GB (bs=1, SFT) | 63 GB (bs=1, LoRA); 80 GB (bs=2, LoRA); 75 GB (bs=1, SFT) | 78 GB (bs=1, LoRA); 75 GB (bs=1, SFT, 16 GPUs) |
| Prompt Language | English* | Same as CogVideoX-2B | Same as CogVideoX-2B |
| Maximum Prompt Length | 226 tokens | Same as CogVideoX-2B | Same as CogVideoX-2B |
| Video Length | 6 seconds | Same as CogVideoX-2B | Same as CogVideoX-2B |
| Frame Rate | 8 frames / second | Same as CogVideoX-2B | Same as CogVideoX-2B |
| Video Resolution | 720 x 480; other resolutions not supported (including fine-tuning) | Same as CogVideoX-2B | Same as CogVideoX-2B |
| Position Embedding | 3d_sincos_pos_embed | 3d_rope_pos_embed | 3d_rope_pos_embed + learnable_pos_embed |
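Outside of ComfyUI, the same recommended settings (BF16 for the 5B model, 50 steps, 49 frames at 8 fps) can be exercised directly with the diffusers library. This is only an illustrative sketch; the memory-saving calls (CPU offload, VAE tiling) are optional and depend on your GPU, and the output filename is our own placeholder.

```python
# Minimal diffusers sketch for CogVideoX-5B using the recommended BF16 precision.
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
)
# Optional memory savers for GPUs with limited VRAM.
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_tiling()

prompt = "A professional photoshoot scene set in the ocean..."  # use a long, detailed prompt

video = pipe(
    prompt=prompt,
    num_inference_steps=50,  # Step = 50, matching the table above
    num_frames=49,           # roughly 6 seconds at 8 fps
    guidance_scale=6.0,
).frames[0]

export_to_video(video, "cogvideox_output.mp4", fps=8)
```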
3. Load the CLIP (text encoder) model: FP16 for higher-end GPUs and FP8 for lower-end GPUs.
4. As officially recommended, the model has been trained on long, detailed prompts (encoded with a T5 text encoder from transformers), so we used detailed prompts generated with ChatGPT to help CogVideoX understand the scene better. If you want to verify that a prompt fits within the token limit, see the sketch below.
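Since the maximum prompt length is 226 tokens, it can help to check a ChatGPT-generated prompt against the T5 tokenizer before running the workflow. A rough sketch follows; google/t5-v1_1-xxl is our assumption of the tokenizer family the pipeline's text encoder is based on.

```python
# Rough check of how many T5 tokens a prompt uses (CogVideoX's limit is 226).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-xxl")
prompt = "A professional photoshoot scene set in the ocean, featuring a model ..."
num_tokens = len(tokenizer(prompt).input_ids)
print(f"{num_tokens} tokens (maximum is 226)")
```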
First test:
We generated a professional model photoshoot clip in the ocean.
Prompt used: A professional photoshoot scene set in the ocean, featuring a model standing confidently in shallow water. The model is dressed in a sleek, elegant outfit, with a flowing fabric that moves gracefully with the ocean breeze. The scene is captured during the golden hour, with the sun setting on the horizon, casting a warm glow on the water's surface. Gentle waves lap around the model’s feet, creating a dynamic and serene atmosphere. A professional photographer is seen on the shore, using a high-end camera with a large lens, capturing the moment. Reflective equipment and light modifiers are strategically placed to enhance the lighting, with an assistant holding a reflector to direct sunlight onto the model. The overall mood is glamorous, serene, and professional, emphasizing the beauty of the ocean backdrop and the skill of the photoshoot crew.
Here is our first result. You can observe that the woman's right hand is slightly deformed, but camera movement and panning have been added, creating a professional effect with realistic ocean waves.
Of course, the video frame quality is low, but that is not a big deal; our focus is on generating consistent video frames without defects. The output can be upscaled with other techniques in ComfyUI (e.g., neural-network latent upscaling), or you can split the video into individual frames and run them through the SUPIR upscaler, as sketched below.
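For the frame-splitting route, a simple OpenCV sketch is shown below; the input filename is just a placeholder for whatever ComfyUI saved. The resulting PNGs can then be fed to an upscaler such as SUPIR.

```python
# Split a generated video into individual PNG frames for per-frame upscaling.
import os
import cv2

os.makedirs("frames", exist_ok=True)
cap = cv2.VideoCapture("cogvideox_output.mp4")  # placeholder filename
index = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    cv2.imwrite(f"frames/frame_{index:04d}.png", frame)
    index += 1
cap.release()
print(f"Wrote {index} frames")
```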
Second test:
Let's challenge the model and see how well it holds up with a more complex prompt.
Prompt used: An action-packed scene set in a futuristic cityscape at night, inspired by an Iron Man movie. The central figure is a superhero in a high-tech, red and gold metallic suit with glowing blue eyes and arc reactor on the chest, hovering in mid-air with jet thrusters blazing from his hands and feet. The suit is sleek, with intricate details and panels that reflect the city lights. In the background, towering skyscrapers with neon signs and holographic billboards illuminate the night sky. The superhero is in a dynamic pose, dodging a barrage of energy blasts from a formidable enemy robot flying nearby, which is large, menacing,
and armed with glowing red weaponry. Sparks fly and smoke trails in the air, adding to the intensity of the battle. The scene captures a sense of speed, power, and heroism, with a dramatic sky filled with dark clouds and flashes of lightning, amplifying the urgency and high stakes of the confrontation.
Here the model is somewhat confused about what to generate, and there is a lot of morphing across the batch of video frames. Still, overall it is much better than many other diffusion-based models, where you need multiple attempts to get a single usable video clip.
Conclusion
After some testing, we can conclude that CogVideoX is considerably more capable than other diffusion-based video generation models. It can now run on lower-end GPUs as well by using a quantized model; a rough sketch of what that looks like follows below.
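As an illustration of the "diffusers INT8 (torchao)" option mentioned in the memory table above, a hedged sketch of weight-only INT8 quantization of the transformer might look like this; exact APIs can vary between torchao and diffusers versions.

```python
# Sketch: INT8 weight-only quantization of the CogVideoX transformer via torchao.
import torch
from diffusers import CogVideoXPipeline
from torchao.quantization import quantize_, int8_weight_only

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
)
quantize_(pipe.transformer, int8_weight_only())  # shrink the heaviest component
pipe.enable_sequential_cpu_offload()             # further reduce peak VRAM
```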