There are many open-source video diffusion models on the market, but when it comes to generating audio you do not have many options. MMAudio solves this by generating audio that is synchronized with your reference video, guided by a text prompt.
The model is trained on a large dataset of audio-text and audio-visual pairs, which gives you better synchronized output for your input video.
The model was released by a team from the University of Illinois Urbana-Champaign, Sony AI, and Sony Group Corporation under the MIT license, which means it can be used for both research and commercial purposes. For a detailed understanding, you can go through their research paper.
Installing custom nodes
1. Install ComfyUI on your machine. Existing users should update it from the Manager.
2. Inside the "ComfyUI/custom_nodes" folder, open a command prompt by typing "cmd" in the folder's address bar, then clone the repository with the command below:
git clone https://github.com/kijai/ComfyUI-MMAudio.git
3. Next, install the required dependencies. For regular (non-portable) ComfyUI users, type this into the command prompt:
pip install -r requirements.txt
For ComfyUI portable users, go inside the ComfyUI_windows_portable folder, open the command prompt as before, and type this command:
python_embeded\python.exe -m pip install -r ComfyUI\custom_nodes\ComfyUI-MMAudio\requirements.txt
4. Now download the Synchformer, CLIP, VAE, and base models from the Hugging Face repository.
Get either the fp16 or fp32 variant. If you use the fp16 base model, make sure to use the fp16-based VAE and CLIP models as well.
Go to the "ComfyUI/models" folder and create a new "mmaudio" folder. Save the downloaded models inside it.
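If you prefer the command line, here is a minimal sketch of the folder setup and downloads using huggingface-cli (which ships with the huggingface_hub package). The repo id and filenames below are assumptions based on Kijai's model repackages; verify the exact names on the Hugging Face page before running:
REM Run this from inside your ComfyUI folder; create the target directory first
mkdir models\mmaudio
REM Repo id and filenames are assumptions - check the Hugging Face repo for the exact names
huggingface-cli download Kijai/MMAudio_safetensors mmaudio_large_44k_v2_fp16.safetensors --local-dir models\mmaudio
huggingface-cli download Kijai/MMAudio_safetensors mmaudio_vae_44k_fp16.safetensors --local-dir models\mmaudio
huggingface-cli download Kijai/MMAudio_safetensors mmaudio_synchformer_fp16.safetensors --local-dir models\mmaudio
huggingface-cli download Kijai/MMAudio_safetensors apple_DFN5B-CLIP-ViT-H-14-384_fp16.safetensors --local-dir models\mmaudio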
5. You also need NVIDIA's BigVGAN v2 (used with the 44k mode), which is downloaded automatically from its Hugging Face repository the first time you run the workflow. The files are saved into the "ComfyUI/models/mmaudio/nvidia" folder. To watch the download progress in real time, switch to ComfyUI's backend terminal.
You can also download all the files manually from the respective repository, but it contains many files, which is tedious for a newbie, so it's not recommended.
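That said, if you do want to pre-fetch it, huggingface-cli can pull an entire repository in one command. This is only a sketch: the repo id and target folder below are assumptions, so confirm both against the download messages in ComfyUI's backend terminal on first run:
REM Repo id and folder layout are assumptions - confirm them in ComfyUI's terminal log
huggingface-cli download nvidia/bigvgan_v2_44khz_128band_512x --local-dir models\mmaudio\nvidia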
6. Restart and refresh ComfyUI for the changes to take effect.
Workflow
1. Get the workflow (the mmaudio_test.json file) from your "ComfyUI/custom_nodes/ComfyUI-MMAudio/examples" folder.
2. Drag and drop it onto the ComfyUI canvas.
3. Follow the steps given below:
(a) Load the MMAudio model into the MMAudio node.
(b) Load the required VAE, Synchformer, and CLIP models into the MMAudio feature node.
(c) Load your AI video (without audio). The video should be shorter than one minute; longer videos will give an out-of-memory error if you are using 12GB of VRAM or less. We uploaded an 8-second AI-generated video (made with Google Veo) with no audio track.
(d) Now enter positive and negative prompts relevant to your video.
We used the prompt: "a human skating on moon surface".
CFG: 4.5
Steps: 25
(e) Generate the video by clicking the "Queue" button.
The generated output is shown below; we have also posted it on our X (Twitter) account.
[Embedded X post: "MMAudio - Generate Background Audio for your AI videos" by Stable Diffusion Tutorials (@SD_Tutorial), December 24, 2024, with the install/workflow link and the output video.]