Install CogVideoX: Text-to-Video model in ComfyUI

install cog video x

CogVideoX, a text-to-video diffusion-based model, has been released by the Knowledge Engineering Group (KEG) & Data Mining group (THUDM) at Tsinghua University.

The model has been trained on long, detailed prompts of the kind produced by ChatGLM4 or ChatGPT-4. For a detailed overview of CogVideoX, refer to the relevant research paper. For commercial use, you can also check out ChatGLM and its API platform.

Unlike other diffusion video generation models that are unable to generate longer videos, CogVideoX can generate videos up to 6 seconds long. It is now also capable of running on GPUs with less than 12 GB of VRAM.

Currently, these variants have been released:

  •  CogVideoX-5B (trained with 5B parameters), released under the CogVideoX license.
  •  CogVideoX-2B (trained with 2B parameters), released under the Apache 2.0 license.

Let's get into the installation and the workflow.


Installation:

1. First, install ComfyUI if you are new to it.
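
If you are setting up ComfyUI from source, a minimal sketch (assuming git and Python are already installed; portable-build users can skip this) looks like this:

git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI
pip install -r requirements.txt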

2. Now, clone the CogVideoX wrapper (custom nodes). Move into the "ComfyUI/custom_nodes" folder, click into the folder's address bar, and type "cmd" to open a Command Prompt there.

Then paste this command into the Command Prompt to install the wrapper:

git clone https://github.com/kijai/ComfyUI-CogVideoXWrapper.git
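
Like most ComfyUI custom node packs, the wrapper ships a requirements.txt; assuming that file is present, install it from the same Command Prompt (portable users should substitute python_embeded\python.exe -m pip, as shown in the next step):

cd ComfyUI-CogVideoXWrapper
pip install -r requirements.txt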

3. You also need a few other dependencies to speed up video rendering.

For ComfyUI portable users:

Move inside "ComfyUI_windows_portable" folder. Navigate to the folder address bar and type "cmd" to open command prompt and again use these command:

python_embeded\python.exe -m pip install --pre onediff onediffx nexfort


For normal comfy users:

Open a Command Prompt and use these commands:

pip install --pre onediff onediffx

pip install nexfort
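
To quickly confirm that the acceleration packages installed correctly, you can try importing them (a simple sanity check; the top-level module names onediff, onediffx, and nexfort are assumed here, and portable users should run it with python_embeded\python.exe instead):

python -c "import onediff, onediffx, nexfort"

If the command returns without an error, the packages are available to ComfyUI's Python environment.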


All the required models get downloaded automatically from THUDM's Hugging Face repository, so you don't need to download them manually.

The first run of the workflow will take time because the models are downloaded in the background. To see the real-time status, switch to the Command Prompt running in the background.
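
If you prefer to warm up the download before the first run, one hedged option is to pull the Hugging Face repository yourself with huggingface_hub (THUDM/CogVideoX-5b is the 5B variant; swap in THUDM/CogVideoX-2b for the 2B model; the wrapper may still place files in its own folders, so treat this only as a cache warm-up):

python -c "from huggingface_hub import snapshot_download; snapshot_download('THUDM/CogVideoX-5b')"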


Workflow:

1. The example workflow can be found inside the "ComfyUI-CogVideoXWrapper/examples" folder. Drag and drop it directly into ComfyUI.



load cogVideoX model


2. Load the relevant CogVideoX model. There are two variants you can choose from:
(a) CogVideoX-5B
(b) CogVideoX-2B


cogVideoX configuration
Source: CogVideoX

Use the recommended settings for your system. The CogVideoX team has shared more detailed information, which is shown above.


load clip model

3. Load the CLIP model. FP16 is for higher-end GPUs and FP8 for lower-end GPUs.

As officially noted, the model was trained on long prompts encoded with a Transformers T5 text encoder, so we used detailed prompts generated with ChatGPT to help CogVideoX understand the scene better.


First test: 

We generated a professional model photoshoot clip in the ocean.

Prompt used: A professional photoshoot scene set in the ocean, featuring a model standing confidently in shallow water. The model is dressed in a sleek, elegant outfit, with a flowing fabric that moves gracefully with the ocean breeze. The scene is captured during the golden hour, with the sun setting on the horizon, casting a warm glow on the water's surface. Gentle waves lap around the model’s feet, creating a dynamic and serene atmosphere. A professional photographer is seen on the shore, using a high-end camera with a large lens, capturing the moment. Reflective equipment and light modifiers are strategically placed to enhance the lighting, with an assistant holding a reflector to direct sunlight onto the model. The overall mood is glamorous, serene, and professional, emphasizing the beauty of the ocean backdrop and the skill of the photoshoot crew.


video generated using cogvideoX

Here is our first result. You can observe that the woman's left hand is slightly deformed, but the camera movement and panning add a professional feel, along with realistic ocean waves.

Of course, the video frame quality is low, but that is not a big deal. Our focus is on generating consistent video frames without defects; the output can be upscaled using other techniques in ComfyUI or any third-party tool.


Second test:

Let's challenge the model and see how intelligently it handles a more demanding scene.

Prompt used: An action-packed scene set in a futuristic cityscape at night, inspired by an Iron Man movie. The central figure is a superhero in a high-tech, red and gold metallic suit with glowing blue eyes and arc reactor on the chest, hovering in mid-air with jet thrusters blazing from his hands and feet. The suit is sleek, with intricate details and panels that reflect the city lights. In the background, towering skyscrapers with neon signs and holographic billboards illuminate the night sky. The superhero is in a dynamic pose, dodging a barrage of energy blasts from a formidable enemy robot flying nearby, which is large, menacing, and armed with glowing red weaponry. Sparks fly and smoke trails in the air, adding to the intensity of the battle. The scene captures a sense of speed, power, and heroism, with a dramatic sky filled with dark clouds and flashes of lightning, amplifying the urgency and high stakes of the confrontation.


video generated using cogvideoX

Now the model seems somewhat confused about what to generate, and there is a lot of morphing across the video frames. Still, overall it does much better than other diffusion-based models, where you need multiple attempts to generate a single usable clip.


Conclusion:

After some testing, we can conclude that CogVideoX is more capable than other diffusion-based video generation models, and it can now run on lower-end GPUs.
For further help, you can raise your issues in the project's GitHub issues section.