Pyramid Flow: Generate Longer Videos from Image/Text in ComfyUI

generate videos using text and images

Generating longer videos is one of the most challenging tasks for any diffusion-based model, but it is now possible with Pyramid Flow: an open-source text-to-video model that builds on Stable Diffusion 3 Medium, is trained on open datasets such as WebVid-10M and OpenVid-1M, and draws on ideas from CogVideoX, Flux 1.0, Diffusion Forcing, GameNGen, Open-Sora Plan, and VideoLLaMA2.

The entire framework is optimized end-to-end with a single unified Diffusion Transformer (DiT). Trained in only 20.7k A100 GPU hours, the method generates high-quality 5-second (up to 10-second) videos at 768p resolution and 24 FPS, as demonstrated by extensive experiments. Interested readers can consult the research paper for an in-depth understanding.

It is now supported in ComfyUI. Let's dive into the installation.

Installation:

1. Install ComfyUI on your machine.
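
If you are starting from scratch, one common way (assuming Git and Python are already installed) is to clone the ComfyUI repository and install its requirements:

git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI
pip install -r requirements.txt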

2. If it is already installed, update it by selecting "Update All" in ComfyUI Manager.

3. Set up the necessary files:

Automatic Download:

Move to your "ComfyUI/custom_nodes" folder, type "cmd" into the folder's address bar to open a command prompt, and clone Kijai's repository with the following command:

git clone https://github.com/kijai/ComfyUI-PyramidFlowWrapper.git

All the required models are downloaded automatically from Pyramid Flow's Hugging Face repository. Note that these are the raw, unoptimized variants, so you will have to wait for GPU-optimized versions.

Manual Download:

If you would rather not wait for the automatic download, you can create the directory structure yourself.

Search for "Pyramid flow wrapper" by Kijai in ComfyUI Manager and click Install. This creates the node's folder structure inside your custom_nodes folder.

Now download all the required models from Pyramid Flow's repository and save them inside your "ComfyUI/models/pyramidflow/pyramid-flow-sd3" folder, following the folder structure and file names below (a command-line sketch for fetching everything follows the table):

Folder name                | Models
---------------------------|-----------------------------------------------------------------------------------------------------
causal_video_vae           | config.json, diffusion_pytorch_model.safetensors
diffusion_transformer_384p | config.json, diffusion_pytorch_model.safetensors
diffusion_transformer_768p | config.json, diffusion_pytorch_model.safetensors
text_encoder               | config.json, model.safetensors
text_encoder_2             | config.json, model.safetensors
text_encoder_3             | config.json, model-00001-of-00002.safetensors, model-00002-of-00002.safetensors, model.safetensors.index.json
tokenizer                  | merges.txt, special_tokens_map.json, tokenizer_config.json, vocab.json
tokenizer_2                | merges.txt, special_tokens_map.json, tokenizer_config.json, vocab.json
tokenizer_3                | special_tokens_map.json, spiece.model, tokenizer.json, tokenizer_config.json
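
If you prefer the command line, the whole structure can be mirrored with a single huggingface-cli call. A minimal sketch, assuming the weights are hosted in the rain1011/pyramid-flow-sd3 Hugging Face repository (check the wrapper's README for the exact source) and that the Hugging Face CLI is installed:

pip install -U "huggingface_hub[cli]"
huggingface-cli download rain1011/pyramid-flow-sd3 --local-dir ComfyUI/models/pyramidflow/pyramid-flow-sd3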


Workflow:

1. The example workflows can be found inside your "ComfyUI/custom_nodes/ComfyUI-PyramidFlowWrapper/examples" folder.

These are the workflows you can choose from:

(a) Text to Video generation 

(b) Image to Video generation

(c) Text to Video with multiple prompts

2. There are two checkpoints for different video generation lengths:

(a) 384p checkpoint - generates up to 5 seconds of 24 FPS video at 640x384 resolution and runs in under 10 GB of VRAM.

(b) 768p checkpoint - generates up to 10 seconds of 24 FPS video at 1280x768 resolution and needs around 10-12 GB of VRAM.

We tested the 384p checkpoint with the following prompt:

a Lamborghini car drifting, at night show, highly realistic

Generated output using the Pyramid Flow model

Here is the result: a 3-second video at a frame rate of 8. There is some deformation in the frames, and while the car is moving, there is no real drifting going on. The night view is captured reasonably well, but it is not especially impressive.

Compared with our earlier test of CogVideoX, this model is less capable; the output suggests the model is not yet well optimized.


Recommended settings:

(a) Text to Video generation 

num inference steps=20, 20, 20

video num inference steps=10, 10, 10

height=768, width=1280

guidance scale=9.0

video guidance scale=5.0

temp=16 (per the official repository, temp=16 corresponds to a 5-second video and temp=31 to 10 seconds)
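
For reference, these settings map one-to-one onto the parameters of the official Pyramid Flow Python API; the ComfyUI wrapper exposes the same values as node widgets. A minimal text-to-video sketch based on the usage example in the official repository, assuming the standalone pyramid_dit package is installed and the checkpoints sit in the folder from the installation section:

import torch
from pyramid_dit import PyramidDiTForVideoGeneration
from diffusers.utils import export_to_video

# Load the 768p checkpoint in bf16 (the repo notes fp16 is not supported yet)
model = PyramidDiTForVideoGeneration(
    "ComfyUI/models/pyramidflow/pyramid-flow-sd3",
    "bf16",
    model_variant="diffusion_transformer_768p",
)
model.vae.to("cuda")
model.dit.to("cuda")
model.text_encoder.to("cuda")
model.vae.enable_tiling()  # lowers VRAM use during VAE decoding

with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch.bfloat16):
    frames = model.generate(
        prompt="a Lamborghini car drifting, at night show, highly realistic",
        num_inference_steps=[20, 20, 20],        # one value per pyramid stage
        video_num_inference_steps=[10, 10, 10],
        height=768,
        width=1280,
        temp=16,                                 # 5-second video
        guidance_scale=9.0,                      # guidance for the first frame
        video_guidance_scale=5.0,                # guidance for the remaining video latents
        output_type="pil",
        save_memory=True,
    )

export_to_video(frames, "./text_to_video_sample.mp4", fps=24)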

(b) Image to Video generation

num inference steps=10, 10, 10

temp=16

video guidance scale=4.0
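
The image-to-video settings map to the generate_i2v call of the same API. A continuation of the sketch above, with a hypothetical input image path:

from PIL import Image

# Hypothetical input image, resized to the 768p checkpoint's resolution
image = Image.open("input.jpg").convert("RGB").resize((1280, 768))

with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch.bfloat16):
    frames = model.generate_i2v(
        prompt="a Lamborghini car drifting, at night show, highly realistic",
        input_image=image,
        num_inference_steps=[10, 10, 10],
        temp=16,
        video_guidance_scale=4.0,
        output_type="pil",
        save_memory=True,
    )

export_to_video(frames, "./image_to_video_sample.mp4", fps=24)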


Conclusion:

Despite the claims in the official paper, we are not entirely satisfied with the results. We conclude that the model is far from perfect, especially for human characters and fast-moving objects. But if a fine-tuned variant is released, we can definitely expect some major improvements.