Pyramid Flow: Generate Longer Videos from Image/Text in ComfyUI

generate videos using text and images

Generating longer videos is one of the most challenging tasks for any diffusion-based model, but it is now possible with Pyramid Flow: an open-source text-to-video model that builds on Stable Diffusion 3 Medium, is trained on open datasets such as WebVid-10M and OpenVid-1M, and draws on ideas from CogVideoX, Flux 1.0, Diffusion Forcing, GameNGen, Open-Sora Plan, and VideoLLaMA2.

The entire framework is optimized end-to-end with a single unified Diffusion Transformer (DiT). Trained in only 20.7k A100 GPU hours, the method generates high-quality 5-second (up to 10-second) videos at 768p resolution and 24 FPS, as demonstrated by extensive experiments. Interested readers can consult the research paper for an in-depth understanding.

It is now supported in ComfyUI. Let's dive into the installation.

Installation:

1. Install ComfyUI on your machine.
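
If you are starting from scratch, one common way (assuming Git and Python are already installed) is to clone the ComfyUI repository and install its requirements:

git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI
pip install -r requirements.txt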

2. If it is already installed, update it by selecting "Update All" in ComfyUI Manager.

3. Set up the necessary files:

Automatic Download:

Move to your "ComfyUI/custom_nodes" folder, type "cmd" into the folder's address bar to open a command prompt, and clone Kijai's repository with the following command:

git clone https://github.com/kijai/ComfyUI-PyramidFlowWrapper.git

All the required models are downloaded automatically from Pyramid Flow's Hugging Face repository. Note that these are the raw, unoptimized variants, so you will have to wait for GPU-optimized versions.

Manual Download:

If you would rather not wait for the automatic download, you can create the directory structure yourself.

Search for "Pyramid flow wrapper" by Kijai in ComfyUI Manager and click Install. This creates the node's folder structure inside your custom_nodes folder.

Now download all the required models from Pyramid Flow's repository and save them inside your "ComfyUI/models/pyramidflow/pyramid-flow-sd3" folder, following the folder structure and file names below (a command-line sketch for fetching everything follows the table):

Folder name                | Models
---------------------------|-----------------------------------------------------------------------------------------------------
causal_video_vae           | config.json, diffusion_pytorch_model.safetensors
diffusion_transformer_384p | config.json, diffusion_pytorch_model.safetensors
diffusion_transformer_768p | config.json, diffusion_pytorch_model.safetensors
text_encoder               | config.json, model.safetensors
text_encoder_2             | config.json, model.safetensors
text_encoder_3             | config.json, model-00001-of-00002.safetensors, model-00002-of-00002.safetensors, model.safetensors.index.json
tokenizer                  | merges.txt, special_tokens_map.json, tokenizer_config.json, vocab.json
tokenizer_2                | merges.txt, special_tokens_map.json, tokenizer_config.json, vocab.json
tokenizer_3                | special_tokens_map.json, spiece.model, tokenizer.json, tokenizer_config.json
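
If you prefer the command line, the whole structure can be mirrored with a single huggingface-cli call. A minimal sketch, assuming the weights are hosted in the rain1011/pyramid-flow-sd3 Hugging Face repository (check the wrapper's README for the exact source) and that the Hugging Face CLI is installed:

pip install -U "huggingface_hub[cli]"
huggingface-cli download rain1011/pyramid-flow-sd3 --local-dir ComfyUI/models/pyramidflow/pyramid-flow-sd3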


Workflow:

1. The example workflows can be found inside your "ComfyUI/custom_nodes/ComfyUI-PyramidFlowWrapper/examples" folder.

These are the workflows you can choose from:

(a) Text to Video generation 

(b) Image to Video generation

(c) Text to Video with multiple prompts

2. There are two checkpoints for different video generation lengths:

(a) 384p checkpoint - generates up to 5 seconds of 24 FPS video at 640x384 resolution and runs in under 10 GB of VRAM.

(b) 768p checkpoint - generates up to 10 seconds of 24 FPS video at 1280x768 resolution and needs around 10-12 GB of VRAM.

We tested the 384p checkpoint with the following prompt:

a Lamborghini car drifting, at night show, highly realistic

Generated output using the Pyramid Flow model

Here is the result: a 3-second video at a frame rate of 8. There is some deformation in the frames, and while the car is moving, there is no real drifting going on. The night view is captured reasonably well, but it is not especially impressive.

Compared with our earlier test of CogVideoX, this model is less capable; the output suggests the model is not yet well optimized.


Recommended settings:

(a) Text to Video generation 

num inference steps=20, 20, 20

video num inference steps=10, 10, 10

height=768, width=1280

guidance scale=9.0

video guidance scale=5.0

temp=16 (per the official repository, temp=16 corresponds to a 5-second video and temp=31 to 10 seconds)
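
For reference, these settings map one-to-one onto the parameters of the official Pyramid Flow Python API; the ComfyUI wrapper exposes the same values as node widgets. A minimal text-to-video sketch based on the usage example in the official repository, assuming the standalone pyramid_dit package is installed and the checkpoints sit in the folder from the installation section:

import torch
from pyramid_dit import PyramidDiTForVideoGeneration
from diffusers.utils import export_to_video

# Load the 768p checkpoint in bf16 (the repo notes fp16 is not supported yet)
model = PyramidDiTForVideoGeneration(
    "ComfyUI/models/pyramidflow/pyramid-flow-sd3",
    "bf16",
    model_variant="diffusion_transformer_768p",
)
model.vae.to("cuda")
model.dit.to("cuda")
model.text_encoder.to("cuda")
model.vae.enable_tiling()  # lowers VRAM use during VAE decoding

with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch.bfloat16):
    frames = model.generate(
        prompt="a Lamborghini car drifting, at night show, highly realistic",
        num_inference_steps=[20, 20, 20],        # one value per pyramid stage
        video_num_inference_steps=[10, 10, 10],
        height=768,
        width=1280,
        temp=16,                                 # 5-second video
        guidance_scale=9.0,                      # guidance for the first frame
        video_guidance_scale=5.0,                # guidance for the remaining video latents
        output_type="pil",
        save_memory=True,
    )

export_to_video(frames, "./text_to_video_sample.mp4", fps=24)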

(b) Image to Video generation

num inference steps=10, 10, 10

temp=16

video guidance scale=4.0
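
The image-to-video settings map to the generate_i2v call of the same API. A continuation of the sketch above, with a hypothetical input image path:

from PIL import Image

# Hypothetical input image, resized to the 768p checkpoint's resolution
image = Image.open("input.jpg").convert("RGB").resize((1280, 768))

with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch.bfloat16):
    frames = model.generate_i2v(
        prompt="a Lamborghini car drifting, at night show, highly realistic",
        input_image=image,
        num_inference_steps=[10, 10, 10],
        temp=16,
        video_guidance_scale=4.0,
        output_type="pil",
        save_memory=True,
    )

export_to_video(frames, "./image_to_video_sample.mp4", fps=24)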


Conclusion:

Despite the claims in the official paper, we are not entirely satisfied with the results. We conclude that the model is far from perfect, especially for human characters and fast-moving objects. But if a fine-tuned variant is released, we can definitely expect some major improvements.