Install CogVideoX: Text-to-Video and Image-to-Video in ComfyUI


CogVideoX is a diffusion-based text-to-video model released by the Knowledge Engineering Group (KEG) & Data Mining group (THUDM) at Tsinghua University.

The model was trained on long, detailed prompts, like those generated by GLM-4 or GPT-4. For a detailed overview of CogVideoX, see the respective research paper. For commercial use, check out ChatGLM and its API platform.

Unlike many diffusion-based video generation models that are limited to shorter clips, CogVideoX can generate 6-second videos. It can also now run on GPUs with less than 12GB of VRAM.

Currently, these variants have been released:

  •  CogVideoX-5B (Text-to-Video), released under the CogVideoX license.
  •  CogVideoX-2B (Text-to-Video), released under the Apache 2.0 license.
  •  CogVideoX-5B-I2V (Image-to-Video), released under the CogVideoX license.

Let's move on to the installation section and the workflow.




Installation:

1. First, install ComfyUI if you are new to it.

2. Now, clone the CogVideoX wrapper (custom nodes). Move into the "ComfyUI/custom_nodes" folder, click the folder address bar, type "cmd", and press Enter to open a command prompt.

Then paste this command into the command prompt to install the wrapper:

git clone https://github.com/kijai/ComfyUI-CogVideoXWrapper.git

3. You also need a few other dependencies to speed up video rendering.

For ComfyUI portable users:

Move into the "ComfyUI_windows_portable" folder, click the folder address bar, type "cmd" to open a command prompt, and run:

python_embeded\python.exe -m pip install --pre onediff onediffx nexfort


For standard ComfyUI users:

Open the command prompt and use these commands:

pip install --pre onediff onediffx

pip install nexfort


All required models are downloaded automatically from THUDM's Hugging Face repositories, so you don't need to download them manually.

The first run of the workflow will take a while as the models download in the background. To see real-time progress, switch to the command prompt window running in the background.


Workflow:

1. The workflow can be found inside the "ComfyUI/custom_nodes/ComfyUI-CogVideoXWrapper/examples" folder. Drag and drop it directly into ComfyUI.



Load CogVideoX model


2. Load the relevant CogVideoX model node. There are two variants you can choose from:
(a) CogVideoX-5B
(b) CogVideoX-2B


CogVideoX configuration (Source: CogVideoX)

Use the recommended settings for your system. The CogVideoX team has shared more detailed information; go through it for a better understanding.

CogVideoX model details:

| Model Type | CogVideoX-2B | CogVideoX-5B | CogVideoX-5B-I2V |
|---|---|---|---|
| Model Description | Entry-level model, balancing compatibility. Low cost for running and secondary development. | Larger model with higher video generation quality and better visual effects. | CogVideoX-5B image-to-video version. |
| Inference Precision | FP16* (recommended), BF16, FP32, FP8*, INT8; INT4 not supported | BF16 (recommended), FP16, FP32, FP8*, INT8; INT4 not supported | Same as CogVideoX-5B |
| Single GPU Memory Usage | SAT FP16: 18 GB; diffusers FP16: from 4 GB*; diffusers INT8 (torchao): from 3.6 GB* | SAT BF16: 26 GB; diffusers BF16: from 5 GB*; diffusers INT8 (torchao): from 4.4 GB* | Same as CogVideoX-5B |
| Multi-GPU Inference Memory Usage | FP16: 10 GB* (diffusers) | BF16: 15 GB* (diffusers) | Same as CogVideoX-5B |
| Inference Speed (Step = 50, FP/BF16) | Single A100: ~90 seconds; single H100: ~45 seconds | Single A100: ~180 seconds; single H100: ~90 seconds | Same as CogVideoX-5B |
| Fine-tuning Precision | FP16 | BF16 | Same as CogVideoX-5B |
| Fine-tuning Memory Usage | 47 GB (bs=1, LoRA); 61 GB (bs=2, LoRA); 62 GB (bs=1, SFT) | 63 GB (bs=1, LoRA); 80 GB (bs=2, LoRA); 75 GB (bs=1, SFT) | 78 GB (bs=1, LoRA); 75 GB (bs=1, SFT, 16 GPU) |
| Prompt Language | English* | Same as CogVideoX-2B | Same as CogVideoX-2B |
| Maximum Prompt Length | 226 tokens | Same as CogVideoX-2B | Same as CogVideoX-2B |
| Video Length | 6 seconds | Same as CogVideoX-2B | Same as CogVideoX-2B |
| Frame Rate | 8 frames / second | Same as CogVideoX-2B | Same as CogVideoX-2B |
| Video Resolution | 720 × 480; no support for other resolutions (including fine-tuning) | Same as CogVideoX-2B | Same as CogVideoX-2B |
| Position Embedding | 3d_sincos_pos_embed | 3d_rope_pos_embed | 3d_rope_pos_embed + learnable_pos_embed |
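The table's fixed output format is worth keeping in mind when you set up downstream nodes (frame interpolation, video combine, etc.). A quick sanity check with plain Python arithmetic, using only the numbers listed above:

```python
# Back-of-the-envelope check of CogVideoX's fixed output size,
# using the duration, frame rate, and resolution from the model table.

fps = 8                    # frames per second (fixed)
duration_s = 6             # clip length in seconds (fixed)
width, height = 720, 480   # only supported resolution

frames = fps * duration_s  # total frames per clip
print(f"{frames} frames at {width}x{height}")
```

This is why every CogVideoX clip has the same frame count and resolution regardless of prompt; only the content of those frames changes.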

Load CLIP model

3. Load the CLIP model. FP16 is for higher-end GPUs and FP8 for lower-end GPUs.

As officially instructed, the model was trained on long, detailed prompts encoded with a T5 text encoder (from the transformers library), so we used detailed prompts generated with ChatGPT to help CogVideoX understand the scene better.
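One way to keep prompts consistently long and detailed is to assemble them from a fixed template. The helper below is purely illustrative (it is not part of the CogVideoX wrapper); it just mirrors the subject/setting/lighting/mood structure of the prompts used in the tests below:

```python
# Illustrative prompt builder (hypothetical helper, not part of any library):
# assembles a long, detailed prompt of the kind CogVideoX responds well to.

def build_prompt(subject: str, setting: str, lighting: str, mood: str) -> str:
    parts = [
        f"A professional scene featuring {subject}, set in {setting}.",
        f"The scene is lit by {lighting}.",
        f"The overall mood is {mood}.",
    ]
    return " ".join(parts)

prompt = build_prompt(
    subject="a model standing confidently in shallow water",
    setting="the ocean during golden hour",
    lighting="a warm sunset glow reflecting off the water",
    mood="glamorous, serene, and professional",
)
print(prompt)
```

Remember the 226-token prompt limit from the table above: detail helps, but anything past the limit is truncated by the text encoder.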


First test: 

We generated a professional model photoshoot clip in the ocean.

Prompt used: A professional photoshoot scene set in the ocean, featuring a model standing confidently in shallow water. The model is dressed in a sleek, elegant outfit, with a flowing fabric that moves gracefully with the ocean breeze. The scene is captured during the golden hour, with the sun setting on the horizon, casting a warm glow on the water's surface. Gentle waves lap around the model’s feet, creating a dynamic and serene atmosphere. A professional photographer is seen on the shore, using a high-end camera with a large lens, capturing the moment. Reflective equipment and light modifiers are strategically placed to enhance the lighting, with an assistant holding a reflector to direct sunlight onto the model. The overall mood is glamorous, serene, and professional, emphasizing the beauty of the ocean backdrop and the skill of the photoshoot crew.


Video generated using CogVideoX

Here is our first result. You can see that the woman's left hand is slightly deformed, but the camera movement and panning add a professional feel, with realistic ocean waves.

Of course, the frame quality is low, but that is not a big deal. Our focus is on generating consistent video frames without defects; the result can be upscaled using other techniques in ComfyUI or any third-party tool.


Second test:

Let's challenge the model and see how well it handles a more complex scene.

Prompt used: An action-packed scene set in a futuristic cityscape at night, inspired by an Iron Man movie. The central figure is a superhero in a high-tech, red and gold metallic suit with glowing blue eyes and arc reactor on the chest, hovering in mid-air with jet thrusters blazing from his hands and feet. The suit is sleek, with intricate details and panels that reflect the city lights. In the background, towering skyscrapers with neon signs and holographic billboards illuminate the night sky. The superhero is in a dynamic pose, dodging a barrage of energy blasts from a formidable enemy robot flying nearby, which is large, menacing, and armed with glowing red weaponry. Sparks fly and smoke trails in the air, adding to the intensity of the battle. The scene captures a sense of speed, power, and heroism, with a dramatic sky filled with dark clouds and flashes of lightning, amplifying the urgency and high stakes of the confrontation.


Video generated using CogVideoX

Here the model gets somewhat confused about what to generate, and there is a lot of morphing across the batch of video frames. Still, overall it performs much better than other diffusion-based models, which often require multiple attempts to generate a single usable clip.


Conclusion:

After this testing, we can conclude that CogVideoX is more capable than other diffusion-based video generation models, and it can now run on lower-end GPUs.
For further help, you can raise issues in the project's respective GitHub issues section.