If you have ever imagined generating high-quality videos faster than you can watch them, LTX-Video aims to make that possible. Developed by Lightricks, it is described as the first DiT-based (Diffusion Transformer) video generation model capable of producing 24 FPS videos at a resolution of 768x512 pixels in real time.
The model was trained on a large-scale dataset of diverse video content, giving it the ability to generate varied and realistic scenes. From nature-inspired visuals to urban settings, the possibilities are nearly endless.
Installation
1. Install ComfyUI if you are a new user.
2. Existing users should select "Update All" in the Manager to update ComfyUI.
3. Install the LTX custom nodes from ComfyUI Manager: search for "ComfyUI-LTXVideo" and click "Install".
All the necessary files are installed automatically when you run the workflow for the first time. You can follow the progress in real time in the ComfyUI terminal.
Alternative (manual):
Go to "ComfyUI/custom_nodes" and open a command prompt there by typing "cmd" in the folder's address bar. Clone the repository using the following command:
git clone https://github.com/Lightricks/ComfyUI-LTXVideo.git
Install the dependencies:
For a regular ComfyUI installation, run this from inside the cloned ComfyUI-LTXVideo folder:
pip install -r requirements.txt
For portable ComfyUI:
Go to the ComfyUI_windows_portable folder, open a command prompt again, and run the command below:
python_embeded\python.exe -m pip install -r ComfyUI\custom_nodes\ComfyUI-LTXVideo\requirements.txt
Download the model (ltx-video-2b-v0.9.safetensors) from the Hugging Face repository and save it in the "models/checkpoints" folder.
You also need to download the text encoders (t5xxl_fp16.safetensors, t5xxl_fp8_e4m3fn.safetensors and t5xxl_fp8_e4m3fn_scaled.safetensors) from the Hugging Face repository and save them in the "models/clip" folder. If you have already used Flux workflows, you should have these files and can skip this step.
The fp8 variants are intended for GPUs with 12GB of VRAM or less, whereas fp16 is for higher-end GPUs.
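As a rough illustration of that rule of thumb, the sketch below picks a T5 encoder file from the amount of VRAM. The helper name `pick_t5_encoder` is hypothetical, the hard 12 GB cut-off is an assumption taken from the sentence above, and returning the unscaled fp8 file rather than the scaled one is an arbitrary choice for the example.

```python
def pick_t5_encoder(vram_gb: float) -> str:
    """Return a T5 text-encoder filename for the given VRAM size in GB.

    Assumption: 12 GB or less -> fp8, anything above -> fp16.
    This helper is illustrative and not part of ComfyUI or LTX-Video.
    """
    if vram_gb <= 12:
        # Arbitrary pick between the two fp8 files listed in this article.
        return "t5xxl_fp8_e4m3fn.safetensors"
    return "t5xxl_fp16.safetensors"


if __name__ == "__main__":
    for vram in (8, 12, 24):
        print(f"{vram} GB VRAM -> {pick_t5_encoder(vram)}")
```

If PyTorch is installed, the VRAM figure could be read with `torch.cuda.get_device_properties(0).total_memory / 1024**3` instead of being typed in by hand.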
Next, clone the PixArt-Alpha model into your "ComfyUI/models/text_encoders" folder. If the folder does not exist, create it. To clone, open a command prompt by typing "cmd" in the folder's address bar and run:
git clone https://huggingface.co/PixArt-alpha/PixArt-XL-2-1024-MS
4. Restart ComfyUI for the changes to take effect.
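Before restarting, it can help to confirm that the manually downloaded files ended up in the right folders. The following is a minimal sketch that checks the paths named in the steps above; the function name `find_missing` is hypothetical, and treating a single T5 encoder file as sufficient is an assumption.

```python
from pathlib import Path

# Paths are relative to the ComfyUI root, as described in this article.
CHECKPOINT = "models/checkpoints/ltx-video-2b-v0.9.safetensors"
T5_ENCODERS = [
    "models/clip/t5xxl_fp16.safetensors",
    "models/clip/t5xxl_fp8_e4m3fn.safetensors",
    "models/clip/t5xxl_fp8_e4m3fn_scaled.safetensors",
]
PIXART_DIR = "models/text_encoders/PixArt-XL-2-1024-MS"


def find_missing(comfy_root):
    """Return a list of expected files or folders that were not found."""
    root = Path(comfy_root)
    missing = []
    if not (root / CHECKPOINT).is_file():
        missing.append(CHECKPOINT)
    # Assumption: one T5 encoder variant is enough to run the workflow.
    if not any((root / name).is_file() for name in T5_ENCODERS):
        missing.append("models/clip/t5xxl_*.safetensors (at least one)")
    if not (root / PIXART_DIR).is_dir():
        missing.append(PIXART_DIR)
    return missing


if __name__ == "__main__":
    problems = find_missing("ComfyUI")  # adjust to your install location
    if problems:
        print("Missing:")
        for item in problems:
            print("  " + item)
    else:
        print("All expected model files were found.")
```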
Workflow
1. Get the workflows from your "ComfyUI/custom_nodes/ComfyUI-LTXVideo/assets" folder; alternatively, download them from the GitHub repository. There are three different workflows:
(a) Text-to-video
(b) Image-to-video
(c) Video-to-video
2. Drag and drop the workflow you want into ComfyUI.
Let's test the model and see how it performs with detailed text prompts and an input image. First, we try to generate a horror movie scene using the text-to-video workflow.
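To see which workflow files are available before dragging one in, a short script can list the assets folder. This is a sketch under the assumption that the workflows are stored as .json files (the usual ComfyUI format); the actual filenames are not given here, and `list_workflows` is a hypothetical helper.

```python
from pathlib import Path

# Folder named in step 1, relative to the ComfyUI root.
ASSETS = "custom_nodes/ComfyUI-LTXVideo/assets"


def list_workflows(comfy_root):
    """Return the sorted names of .json files in the LTX assets folder."""
    assets = Path(comfy_root) / ASSETS
    if not assets.is_dir():
        return []
    return sorted(p.name for p in assets.glob("*.json"))


if __name__ == "__main__":
    names = list_workflows("ComfyUI")  # adjust to your install location
    if names:
        for name in names:
            print(name)
    else:
        print("No workflow files found; check the path.")
```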
Text to Video
We used this positive prompt:
A woman with a haunting presence stands atop the weathered roof of a dilapidated, rust-streaked trailer in a desolate environment. She wears a long, flowing dress that sways gently in the cold wind, its fabric aged and tattered at the edges, blending with the gloom of her surroundings. Her posture is both eerie and commanding, shoulders slightly hunched yet exuding an unsettling authority. Her piercing gaze is locked forward, unyielding and distant, as though peering into another realm. The sky above is dominated by dark, churning clouds, foretelling an impending storm, with streaks of lightning faintly illuminating the horizon. The atmosphere is heavy with tension, shadows from the trailer and nearby debris stretching unnaturally long under the dim, uneven lighting. The scene is captured in hyper-realistic detail, with every element from the grime on the trailer to the strands of her unkempt hair rendered in ultra-high definition, creating a cinematic and chilling portrayal of foreboding.
Negative prompt:
low quality, worst quality, deformed, distorted, disfigured, motion smear, motion artifacts, fused fingers, bad anatomy, weird hand, ugly
Resolution: 768x512
Frame rate (FPS): 25
CFG: 3
Steps: 30
Sampler: Euler
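The settings above can be kept together in a small structure, which also makes it easy to work out how many frames a clip needs. The sketch below is illustrative only: `RunSettings` and `frame_count` are hypothetical names, and the two constraints it encodes (width and height divisible by 32, frame count of the form 8n + 1) reflect my understanding of the LTX-Video README and should be verified against it. The 4-second duration matches the clip length used in these tests.

```python
from dataclasses import dataclass


@dataclass
class RunSettings:
    """Generation settings used for the text-to-video test in this article."""
    width: int = 768
    height: int = 512
    fps: int = 25
    cfg: float = 3.0
    steps: int = 30
    sampler: str = "euler"

    def resolution_ok(self) -> bool:
        # Assumption: both dimensions should be divisible by 32.
        return self.width % 32 == 0 and self.height % 32 == 0


def frame_count(fps: int, seconds: float) -> int:
    """Nearest frame count of the form 8*n + 1 for the requested duration.

    Assumption: the model expects frame counts such as 97 or 121.
    """
    target = fps * seconds
    n = max(1, round((target - 1) / 8))
    return 8 * n + 1


if __name__ == "__main__":
    settings = RunSettings()
    print("resolution ok:", settings.resolution_ok())
    print("frames for 4 s:", frame_count(settings.fps, 4))  # 97
```

At 25 FPS a 4-second clip would be 100 frames, which the helper snaps to 97; at 24 FPS the target of 96 frames also snaps to 97.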
Finally, here is our result.
The result looks somewhat like camera footage and is fairly detailed, but the face looks slightly deformed and some artifacts remain. Keep in mind that the output is only 768x512, so the quality is limited, but it can be improved with separate upscaling techniques.
Second try: This time we generate a video for an e-commerce use case. Let's see how it performs.
Positive Prompt used:
Ultra-high-definition close-up of a stunning editorial female model with flawless skin, striking symmetrical facial features, and piercing eyes. She is wearing a structured outfit featuring a vibrant, colorful Balmain-inspired short skirt paired with a chic, form-fitting top adorned with intricate patterns. The look is completed with bold platform shoes that add a statement edge. The model stands confidently, exuding poise and elegance, illuminated by soft, diffused studio lighting that highlights every texture and detail. The background is minimalist, allowing the outfit’s vibrant colors and the model's elegance to dominate the frame. Rendered in 8k resolution for sharp, lifelike detail, ensuring a highly polished and professional aesthetic suitable for a high-fashion editorial spread.
Negative prompt:
low quality, worst quality, deformed, distorted, disfigured, motion smear, motion artifacts, fused fingers, bad anatomy, weird hand, ugly
Resolution: 768x512
Frame rate (FPS): 25
CFG: 3
Steps: 30
Image to Video
Uploaded image:
Positive Prompt used:
An untouched sandy beach with a small, white boat resting on the shore. The scene features footprints scattered on the sand, gentle ocean waves rolling onto the beach, and a horizon filled with sparse vegetation and a partly cloudy blue sky. Driftwood and natural debris are scattered along the coastline, capturing a peaceful, rustic, and natural atmosphere.
Negative prompt:
low quality, worst quality, deformed, distorted, disfigured, motion smear, motion artifacts, fused fingers, bad anatomy, weird hand, ugly
Resolution: 768x512
Frame rate (FPS): 25
CFG: 3
Steps: 30
Here is the output.
Second try: Let's try generating a haunted movie scene. For this, we used an input image generated with Flux Schnell, together with the following positive prompt:
Capturing a haunted movie scene where a women standing alone in untouched sandy beach with a small, white boat resting on the shore. The scene features footprints scattered on the sand, gentle ocean waves rolling onto the beach, and a horizon filled with sparse vegetation and a partly cloudy blue sky. Driftwood and natural debris are scattered along the coastline, capturing a peaceful, rustic, and natural atmosphere.
Negative prompt:
low quality, worst quality, deformed, distorted, disfigured, motion smear, motion artifacts, fused fingers, bad anatomy, weird hand, ugly
Resolution: 768x512
Frame rate (FPS): 24
CFG: 3
Steps: 40
Here is the output.
We ran the model on an RTX 4090, and each 4-second video took 15-23 seconds to render. That is better than CogVideoX and Mochi 1 in terms of VRAM usage and rendering time, but the model appears to need more training data to produce consistently good results.
As noted on the official page, this model needs a detailed prompt, which helps it understand the scene and produce a more refined result with better prompt adherence. You can also enhance your prompts with other tools running in the background, such as LLMs (large language models) like DanTag-Tipo or VLMs (vision language models) like Florence2.
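One way to put those timings in perspective is to divide the clip length by the render time. The sketch below does that arithmetic for the figures from this test; `realtime_factor` is a hypothetical helper, and a value of 1.0 or more would mean rendering keeps pace with playback.

```python
def realtime_factor(video_seconds: float, render_seconds: float) -> float:
    """Seconds of video produced per second of rendering."""
    if render_seconds <= 0:
        raise ValueError("render_seconds must be positive")
    return video_seconds / render_seconds


if __name__ == "__main__":
    # Figures from this test: a 4-second clip, 15 to 23 seconds to render.
    for render in (15, 23):
        factor = realtime_factor(4, render)
        print(f"{render} s render -> {factor:.2f}x real time")
```

On this setup the factor works out to roughly 0.17x to 0.27x, so these measurements apply only to the hardware and settings used here.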