The Cosmos diffusion models released by the NVIDIA team are capable of generating dynamic, high-quality videos from text, images, or even other videos, as we explain below.
These pre-trained models are like generalists. They have been trained on massive video datasets that cover a wide range of real-world physical scenarios. This makes them incredibly versatile for tasks that require an understanding of physics.
These models are released under the NVIDIA Open Model License, which allows commercial use as long as you stay within its limitations. For deeper insights, see their research paper.
Installation
1. First, install ComfyUI if you are new to it.
2. Existing users should update ComfyUI from the Manager section.
3. Now, download the NVIDIA Cosmos models from the Hugging Face repository and save them into your "ComfyUI/models/diffusion_models" folder. Make sure to use the correct model variant: the 7-billion-parameter (7B) variant is for lower-end GPUs, and the 14-billion-parameter (14B) variant is for higher-end GPUs.
We have observed that many people are confused by the naming convention. Here, "Text-to-World" simply means a Text-to-Video workflow, and "Video-to-World" means an Image/Video-to-Video workflow. The raw models can also be obtained from their GitHub repository.
4. Download the text encoder (oldt5_xxl_fp8_e4m3fn_scaled.safetensors) from Hugging Face and save it into your "ComfyUI/models/text_encoders" folder.
5. Download the VAE (cosmos_cv8x8x8_1.0.safetensors) from Hugging Face and place it inside your "ComfyUI/models/vae" folder.
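The folder layout above can be prepared from the command line. This is a minimal sketch: `COMFY_DIR` is an assumed location for your ComfyUI install, and the commented `wget` lines are placeholders where you would paste the actual Hugging Face download URLs (not verified here).

```shell
# Sketch: create the ComfyUI model folders the steps above expect.
# COMFY_DIR is an assumption; point it at your actual ComfyUI checkout.
COMFY_DIR="${COMFY_DIR:-$HOME/ComfyUI}"

mkdir -p "$COMFY_DIR/models/diffusion_models" \
         "$COMFY_DIR/models/text_encoders" \
         "$COMFY_DIR/models/vae"

# Example downloads (uncomment and substitute the real Hugging Face URLs):
# wget -P "$COMFY_DIR/models/diffusion_models" <URL to the 7B or 14B Cosmos model>
# wget -P "$COMFY_DIR/models/text_encoders"    <URL to oldt5_xxl_fp8_e4m3fn_scaled.safetensors>
# wget -P "$COMFY_DIR/models/vae"              <URL to cosmos_cv8x8x8_1.0.safetensors>

# List the folders so you can confirm the layout.
ls "$COMFY_DIR/models"
```

After the downloads finish, restart ComfyUI (or refresh the node definitions) so the new models show up in the loader nodes.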