Pyramid Flow: Generate Long Videos in ComfyUI


Generating longer videos is one of the most challenging tasks for any diffusion-based model. But now it is possible with Pyramid Flow: an open-source text-to-video model built on Stable Diffusion 3 Medium, drawing on related work such as CogVideoX, Flux 1.0, WebVid-10M, OpenVid-1M, Diffusion Forcing, GameNGen, Open-Sora Plan, and VideoLLaMA2.

The entire framework is optimized end-to-end with a single unified Diffusion Transformer (DiT). As demonstrated by extensive experiments, the method can generate high-quality 5-second (up to 10-second) videos at 768p resolution and 24 FPS within 20.7k A100 GPU training hours. Interested readers can consult the research paper for an in-depth understanding.

Now it is supported in ComfyUI. Let's dive into the installation section.

Installation:

1. Install ComfyUI on your machine.

2. Update it if already installed by selecting "Update all" in ComfyUI Manager.

3. Setup the necessary files:

Automatic Download:

Go to your "ComfyUI/custom_nodes" folder, click into the folder address bar, type "cmd", and press Enter to open a command prompt there. Then clone Kijai's repository with the following command:

git clone https://github.com/kijai/ComfyUI-PyramidFlowWrapper.git

When you run the workflow for the first time, all the required models are downloaded automatically from Pyramid Flow's Hugging Face repository. You can watch the real-time download status in ComfyUI's console window.

Note that the auto-downloaded models are the raw, unoptimized variants, so you will need to wait a while longer for them to load and run on your GPU.

Manual Download:

If you would rather not wait for the automatic download, you can manage everything yourself by creating the directory structure manually.

To do this, first search for "Pyramid flow wrapper" by author Kijai in ComfyUI Manager and click "Install". This creates the wrapper's folder structure inside your custom_nodes folder.

Next, download all the required models from Pyramid Flow's repository and save them inside your "ComfyUI/models/pyramidflow/pyramid-flow-sd3" folder, following the folder structure and file names below:

| Folder name | Models |
| --- | --- |
| causal_video_vae | config.json, diffusion_pytorch_model.safetensors |
| diffusion_transformer_384p | config.json, diffusion_pytorch_model.safetensors |
| diffusion_transformer_768p | config.json, diffusion_pytorch_model.safetensors |
| text_encoder | config.json, model.safetensors |
| text_encoder_2 | config.json, model.safetensors |
| text_encoder_3 | config.json, model-00001-of-00002.safetensors, model-00002-of-00002.safetensors, model.safetensors.index.json |
| tokenizer | merges.txt, special_tokens_map.json, tokenizer_config.json, vocab.json |
| tokenizer_2 | merges.txt, special_tokens_map.json, tokenizer_config.json, vocab.json |
| tokenizer_3 | special_tokens_map.json, spiece.model, tokenizer.json, tokenizer_config.json |
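If you are setting up the folders by hand, a small script can create the layout and report any files still missing. This is a sketch, not part of the wrapper: the folder and file lists simply mirror the table above, and the base path is whatever your ComfyUI install uses.

```python
from pathlib import Path

# Expected layout under ComfyUI/models/pyramidflow/pyramid-flow-sd3,
# mirroring the table above. Adjust the base path to your own install.
LAYOUT = {
    "causal_video_vae": ["config.json", "diffusion_pytorch_model.safetensors"],
    "diffusion_transformer_384p": ["config.json", "diffusion_pytorch_model.safetensors"],
    "diffusion_transformer_768p": ["config.json", "diffusion_pytorch_model.safetensors"],
    "text_encoder": ["config.json", "model.safetensors"],
    "text_encoder_2": ["config.json", "model.safetensors"],
    "text_encoder_3": [
        "config.json",
        "model-00001-of-00002.safetensors",
        "model-00002-of-00002.safetensors",
        "model.safetensors.index.json",
    ],
    "tokenizer": ["merges.txt", "special_tokens_map.json", "tokenizer_config.json", "vocab.json"],
    "tokenizer_2": ["merges.txt", "special_tokens_map.json", "tokenizer_config.json", "vocab.json"],
    "tokenizer_3": ["special_tokens_map.json", "spiece.model", "tokenizer.json", "tokenizer_config.json"],
}

def create_folders(base: Path) -> None:
    """Create the expected subfolders (the model files themselves still
    need to be downloaded into them)."""
    for folder in LAYOUT:
        (base / folder).mkdir(parents=True, exist_ok=True)

def missing_files(base: Path) -> list:
    """Return 'folder/file' entries from the table that are not on disk."""
    return [
        f"{folder}/{name}"
        for folder, names in LAYOUT.items()
        for name in names
        if not (base / folder / name).is_file()
    ]
```

Running `missing_files(Path("ComfyUI/models/pyramidflow/pyramid-flow-sd3"))` after your downloads finish gives a quick checklist of anything you forgot.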


Workflow:

1. The example workflows can be found inside your "ComfyUI/custom_nodes/ComfyUI-PyramidFlowWrapper/examples" folder. If any node shows up as a red error node, install it directly from the Manager by clicking "Install missing custom nodes".

These are the workflows you can choose from:

(a) Text to Video generation 

(b) Image to Video generation

(c) Text to Video Multi prompts

2. There are two models for different video generation length:

(a) 384p checkpoint - supports up to 5 seconds of 24 FPS video generation and runs under 10 GB VRAM. It generates videos at 640x384 resolution.

(b) 768p checkpoint - supports up to 10 seconds of 24 FPS video generation and needs 10-12 GB VRAM. It is capable of generating videos at 1280x768 resolution.
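The choice between the two checkpoints essentially follows from available VRAM. A tiny helper sketch makes the rule of thumb explicit; the threshold and figures are just the numbers quoted above, not anything the wrapper itself exposes:

```python
def pick_checkpoint(vram_gb: float) -> dict:
    """Suggest a Pyramid Flow checkpoint from the VRAM figures quoted
    above. The 10 GB threshold is an assumption based on this article,
    not a hard limit enforced by the model."""
    if vram_gb >= 10:
        # 768p checkpoint: up to 10 s at 24 FPS, 1280x768, 10-12 GB VRAM
        return {"checkpoint": "diffusion_transformer_768p",
                "width": 1280, "height": 768, "max_seconds": 10}
    # 384p checkpoint: up to 5 s at 24 FPS, 640x384, under 10 GB VRAM
    return {"checkpoint": "diffusion_transformer_384p",
            "width": 640, "height": 384, "max_seconds": 5}
```

For example, an 8 GB card lands on the 384p checkpoint at 640x384, while a 12 GB card can try the 768p checkpoint.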

We tested the 384p checkpoint with the following prompts:

First Try


Prompt used: a Lamborghini car drifting, at night show, highly realistic

Here is the result: a 3-second video at a frame rate of 8. You can see some deformation in the frames and a lot of artifacts. The car is moving, but there is no real drifting going on. The night view is captured reasonably well, but it is still not that impressive.


Second try

This time we used a natural language prompting technique.



The output is almost static, which we did not expect it to generate. Again, we were unsatisfied with the result.

Compared with our earlier test of CogVideoX, this model is not as capable, and its output is not as well optimized.


Recommended settings:

(a) Text to Video generation 

num inference steps=20, 20, 20

video num inference steps=10, 10, 10

height=768, width=1280

guidance scale=9.0

video guidance scale=5.0

temp=16

(b) Image to Video generation

num inference steps=10, 10, 10

temp=16

video guidance scale=4.0
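The recommended values above can be kept as plain presets, e.g. in a dictionary alongside any scripting you do around the wrapper. The key names here are descriptive placeholders, not necessarily the exact input names on the nodes, so match them to your workflow yourself:

```python
# Recommended Pyramid Flow presets, copied from the lists above.
# Key names are illustrative; the triple step values are one per
# pyramid stage, as written in the settings above.
TEXT_TO_VIDEO = {
    "num_inference_steps": [20, 20, 20],
    "video_num_inference_steps": [10, 10, 10],
    "height": 768,
    "width": 1280,
    "guidance_scale": 9.0,
    "video_guidance_scale": 5.0,
    "temp": 16,
}

IMAGE_TO_VIDEO = {
    "num_inference_steps": [10, 10, 10],
    "temp": 16,
    "video_guidance_scale": 4.0,
}
```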


Conclusion:

Despite what the official paper suggests, we are not particularly satisfied with the results. We conclude that the model is far from perfect, especially for human characters and fast-moving objects. But if a fine-tuned variant is released, we may well see some major improvements.