Mochi 1: Generate High-Quality, Consistent Videos

Mochi 1 text-to-video model

Genmo has released Mochi 1, an open-source text-to-video diffusion model. It has 10 billion parameters and is built on the novel Asymmetric Diffusion Transformer (AsymmDiT) architecture, which is also straightforward to fine-tune. The model is capable of generating high-fidelity output with strong prompt adherence.

The model is released under the Apache 2.0 license, which means it can be used for research, educational, and commercial purposes.

Currently it needs a minimum of four H100 GPUs, which is out of reach for most individuals, but Genmo is also inviting the community to release quantized models so that it becomes accessible to users with lower-end hardware.

It can be run in ComfyUI, where it consumes about 20 GB of VRAM during the VAE (Variational Auto-Encoder) decoding stage.


Installation

1. Install ComfyUI on your machine.

2. Navigate to the "ComfyUI/custom_nodes" folder and open a command prompt there (e.g. by typing "cmd" in the address bar on Windows). Clone the repository with the following command:

git clone https://github.com/kijai/ComfyUI-MochiWrapper.git

All the required models are downloaded automatically from Kijai's Hugging Face page the first time you run the workflow.

If you want to work with the raw model, you can access it directly from Genmo's Hugging Face page.

Keep in mind that the model weights are quite large, so be patient while they download. You can track the real-time progress in the terminal where ComfyUI is running in the background.

The models are saved to the "ComfyUI/models/diffusion_models/mochi" folder and the VAE to the "ComfyUI/models/vae/mochi" folder.
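If you prefer to download the weights manually and place them yourself, you can create that folder layout up front. A minimal sketch, assuming the default paths above and that your ComfyUI install sits in the current directory:

```shell
# Create the folders the Mochi wrapper expects (assumed defaults;
# adjust if your ComfyUI install lives elsewhere).
mkdir -p ComfyUI/models/diffusion_models/mochi
mkdir -p ComfyUI/models/vae/mochi

# Verify the layout
ls ComfyUI/models
```

Place the downloaded diffusion weights in the first folder and the VAE in the second, then restart ComfyUI so the loader nodes can pick them up.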


Workflow

1. You can find the example workflow in the "ComfyUI-MochiWrapper/examples" folder.

2. Drag and drop it into ComfyUI.

3. Write a detailed positive prompt for better results.

Mochi 1 generation


We do not have H100 stacks, but we tested the model on an RTX 4090. The video consistency was really impressive compared to CogVideoX, but the model is massive and eats a lot of your VRAM.
 
With torch compile support and GGUF Q8 quantization enabled, generation time drops considerably, to about 40 minutes for 200 steps. We hope there will be better quantization support in the future.

Apart from this, they are also going to add support for image-to-video.