Install CogVideoX: Text-to-Video and Image-to-Video in ComfyUI


CogVideoX is a diffusion-based text-to-video model released by the Knowledge Engineering Group (KEG) & Data Mining group (THUDM) at Tsinghua University.

The model was trained on long, detailed prompts, like those generated by GLM-4 or GPT-4. For a detailed overview of CogVideoX, see the respective research paper. For commercial use, check out ChatGLM and its API platform.

Unlike many diffusion-based video generation models that are limited to shorter clips, CogVideoX can generate 6-second videos. It can also now run on GPUs with less than 12GB of VRAM.

Currently, these variants have been released:

  •  CogVideoX-5B (Text-to-Video), released under the CogVideoX license.
  •  CogVideoX-2B (Text-to-Video), released under the Apache 2.0 license.
  •  CogVideoX-5B-I2V (Image-to-Video), released under the CogVideoX license.

Let's move on to the installation section and the workflow.




Installation:

1. First, install ComfyUI if you are new to it.

2. Now, clone the CogVideoX wrapper (custom nodes). Move into the "ComfyUI/custom_nodes" folder, click the folder address bar, type "cmd", and press Enter to open a command prompt.

Then paste this command into the command prompt to install the wrapper:

git clone https://github.com/kijai/ComfyUI-CogVideoXWrapper.git

3. You also need a few other dependencies to speed up video rendering.

For ComfyUI portable users:

Move into the "ComfyUI_windows_portable" folder, click the folder address bar, type "cmd" to open a command prompt, and run:

python_embeded\python.exe -m pip install --pre onediff onediffx nexfort


For standard ComfyUI users:

Open the command prompt and use these commands:

pip install --pre onediff onediffx

pip install nexfort


All required models are downloaded automatically from THUDM's Hugging Face repositories, so you don't need to download them manually.

The first run of the workflow will take a while as the models download in the background. To see real-time progress, switch to the command prompt window running in the background.


Workflow:

1. The workflow can be found inside the "ComfyUI/custom_nodes/ComfyUI-CogVideoXWrapper/examples" folder. Drag and drop it directly into ComfyUI.



Load CogVideoX model


2. Load the relevant CogVideoX model node. There are two variants you can choose from:
(a) CogVideoX-5B
(b) CogVideoX-2B


CogVideoX configuration (Source: CogVideoX)

Use the recommended settings for your system. The CogVideoX team has shared more detailed information; go through it for a better understanding.

CogVideoX model details:

| Model Type | CogVideoX-2B | CogVideoX-5B | CogVideoX-5B-I2V |
|---|---|---|---|
| Model Description | Entry-level model, balancing compatibility. Low cost for running and secondary development. | Larger model with higher video generation quality and better visual effects. | CogVideoX-5B image-to-video version. |
| Inference Precision | FP16* (recommended), BF16, FP32, FP8*, INT8; INT4 not supported | BF16 (recommended), FP16, FP32, FP8*, INT8; INT4 not supported | Same as CogVideoX-5B |
| Single GPU Memory Usage | SAT FP16: 18 GB; diffusers FP16: from 4 GB*; diffusers INT8 (torchao): from 3.6 GB* | SAT BF16: 26 GB; diffusers BF16: from 5 GB*; diffusers INT8 (torchao): from 4.4 GB* | Same as CogVideoX-5B |
| Multi-GPU Inference Memory Usage | FP16: 10 GB* (diffusers) | BF16: 15 GB* (diffusers) | Same as CogVideoX-5B |
| Inference Speed (Step = 50, FP/BF16) | Single A100: ~90 seconds; single H100: ~45 seconds | Single A100: ~180 seconds; single H100: ~90 seconds | Same as CogVideoX-5B |
| Fine-tuning Precision | FP16 | BF16 | Same as CogVideoX-5B |
| Fine-tuning Memory Usage | 47 GB (bs=1, LoRA); 61 GB (bs=2, LoRA); 62 GB (bs=1, SFT) | 63 GB (bs=1, LoRA); 80 GB (bs=2, LoRA); 75 GB (bs=1, SFT) | 78 GB (bs=1, LoRA); 75 GB (bs=1, SFT, 16 GPU) |
| Prompt Language | English* | Same as CogVideoX-2B | Same as CogVideoX-2B |
| Maximum Prompt Length | 226 tokens | Same as CogVideoX-2B | Same as CogVideoX-2B |
| Video Length | 6 seconds | Same as CogVideoX-2B | Same as CogVideoX-2B |
| Frame Rate | 8 frames / second | Same as CogVideoX-2B | Same as CogVideoX-2B |
| Video Resolution | 720 × 480; no support for other resolutions (including fine-tuning) | Same as CogVideoX-2B | Same as CogVideoX-2B |
| Position Embedding | 3d_sincos_pos_embed | 3d_rope_pos_embed | 3d_rope_pos_embed + learnable_pos_embed |
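The table's fixed output format is worth keeping in mind when you set up downstream nodes (frame interpolation, video combine, etc.). A quick sanity check with plain Python arithmetic, using only the numbers listed above:

```python
# Back-of-the-envelope check of CogVideoX's fixed output size,
# using the duration, frame rate, and resolution from the model table.

fps = 8                    # frames per second (fixed)
duration_s = 6             # clip length in seconds (fixed)
width, height = 720, 480   # only supported resolution

frames = fps * duration_s  # total frames per clip
print(f"{frames} frames at {width}x{height}")
```

This is why every CogVideoX clip has the same frame count and resolution regardless of prompt; only the content of those frames changes.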

Load CLIP model

3. Load the CLIP model. FP16 is for higher-end GPUs and FP8 for lower-end GPUs.

As officially instructed, the model was trained on long, detailed prompts encoded with a T5 text encoder (from the transformers library), so we used detailed prompts generated with ChatGPT to help CogVideoX understand the scene better.
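One way to keep prompts consistently long and detailed is to assemble them from a fixed template. The helper below is purely illustrative (it is not part of the CogVideoX wrapper); it just mirrors the subject/setting/lighting/mood structure of the prompts used in the tests below:

```python
# Illustrative prompt builder (hypothetical helper, not part of any library):
# assembles a long, detailed prompt of the kind CogVideoX responds well to.

def build_prompt(subject: str, setting: str, lighting: str, mood: str) -> str:
    parts = [
        f"A professional scene featuring {subject}, set in {setting}.",
        f"The scene is lit by {lighting}.",
        f"The overall mood is {mood}.",
    ]
    return " ".join(parts)

prompt = build_prompt(
    subject="a model standing confidently in shallow water",
    setting="the ocean during golden hour",
    lighting="a warm sunset glow reflecting off the water",
    mood="glamorous, serene, and professional",
)
print(prompt)
```

Remember the 226-token prompt limit from the table above: detail helps, but anything past the limit is truncated by the text encoder.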


First test: 

We generated a professional model photoshoot clip in the ocean.

Prompt used: A professional photoshoot scene set in the ocean, featuring a model standing confidently in shallow water. The model is dressed in a sleek, elegant outfit, with a flowing fabric that moves gracefully with the ocean breeze. The scene is captured during the golden hour, with the sun setting on the horizon, casting a warm glow on the water's surface. Gentle waves lap around the model’s feet, creating a dynamic and serene atmosphere. A professional photographer is seen on the shore, using a high-end camera with a large lens, capturing the moment. Reflective equipment and light modifiers are strategically placed to enhance the lighting, with an assistant holding a reflector to direct sunlight onto the model. The overall mood is glamorous, serene, and professional, emphasizing the beauty of the ocean backdrop and the skill of the photoshoot crew.


Video generated using CogVideoX

Here is our first result. You can see that the woman's left hand is slightly deformed, but the camera movement and panning add a professional feel, with realistic ocean waves.

Of course, the frame quality is low, but that is not a big deal. Our focus is on generating consistent video frames without defects; the result can be upscaled using other techniques in ComfyUI or any third-party tool.


Second test:

Let's challenge the model and see how well it handles a more complex scene.

Prompt used: An action-packed scene set in a futuristic cityscape at night, inspired by an Iron Man movie. The central figure is a superhero in a high-tech, red and gold metallic suit with glowing blue eyes and arc reactor on the chest, hovering in mid-air with jet thrusters blazing from his hands and feet. The suit is sleek, with intricate details and panels that reflect the city lights. In the background, towering skyscrapers with neon signs and holographic billboards illuminate the night sky. The superhero is in a dynamic pose, dodging a barrage of energy blasts from a formidable enemy robot flying nearby, which is large, menacing, and armed with glowing red weaponry. Sparks fly and smoke trails in the air, adding to the intensity of the battle. The scene captures a sense of speed, power, and heroism, with a dramatic sky filled with dark clouds and flashes of lightning, amplifying the urgency and high stakes of the confrontation.


Video generated using CogVideoX

Here the model gets somewhat confused about what to generate, and there is a lot of morphing across the batch of video frames. Still, overall it performs much better than other diffusion-based models, which often require multiple attempts to generate a single usable clip.


Conclusion:

After this testing, we can conclude that CogVideoX is more capable than other diffusion-based video generation models, and it can now run on lower-end GPUs.
For further help, you can raise issues in the project's respective GitHub issues section.