Animating Objects using AnimateDiff and SD1.5


Nowadays, creating viral AI social media content is not that hard, but creating something genuinely creative has always been a headache for AI creators. Here, we present a Video-To-Video workflow for dancing objects using Stable Diffusion 1.5 and AnimateDiff. Everything is explained in detail and in an easy-to-understand way.

Make sure you have a basic understanding of working with Stable Diffusion 1.5 and AnimateDiff.



Installation

1. First, install ComfyUI on your machine and learn the ComfyUI basics.


download workflow

2. Next, download the Video2Video workflow from CivitAI. Then, drag and drop it into ComfyUI.


missing error node

3. You will likely see a bunch of nodes highlighted in red (missing-node errors). Navigate to the Manager and click "Update All". Then, again from the Manager, click "Install missing custom nodes", select all, and install them.

4. Restart and refresh ComfyUI for the changes to take effect.


Downloading the models

1. Install the CLIP Vision models and save them inside the "ComfyUI/models/clip_vision" folder. A CLIP Vision model error means the respective models are not installed.

(a) Download CLIP ViT-H and rename it to "CLIP-ViT-H-14-laion2B-s32B-b79K.safetensors" (without quotes).
(b) Download CLIP ViT-bigG and rename it to "CLIP-ViT-bigG-14-laion2B-39B-b160k.safetensors" (without quotes).


2. If you get the error message "IPAdapter model not found", it means you do not have the IPAdapter model. Navigate to the ComfyUI Manager, click "Install Models", search for "ip-adapter_sd15_vit-G.safetensors", and select "Install".
If you haven't installed the IP Adapter yet, you can also go through the IP Adapter installation tutorial.

You can also install it manually instead of using the Manager. To do this, download the model from Hugging Face and save it into the "ComfyUI/models/ipadapter" folder.
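If you prefer to double-check the file placement from a script, a minimal sketch like the one below can confirm that the renamed files sit where the workflow expects them. The folder and file names come from the steps above; adjust the ComfyUI root path to your installation.

```python
from pathlib import Path

# Adjust this to point at your ComfyUI installation.
COMFYUI_ROOT = Path("ComfyUI")

# Expected locations of the models described in the steps above.
expected_files = [
    COMFYUI_ROOT / "models/clip_vision/CLIP-ViT-H-14-laion2B-s32B-b79K.safetensors",
    COMFYUI_ROOT / "models/clip_vision/CLIP-ViT-bigG-14-laion2B-39B-b160k.safetensors",
    COMFYUI_ROOT / "models/ipadapter/ip-adapter_sd15_vit-G.safetensors",
]

for path in expected_files:
    status = "found" if path.exists() else "MISSING"
    print(f"{status}: {path}")
```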


Workflow Explanation

1. To work with this workflow, you need to create a video mask with a white foreground and a black background. There are multiple ways to do this, for example the image-to-alpha masking node in ComfyUI, but it does not generate consistent results.

To fix this problem, you can use the DepthCrafter model, which generates a more consistent estimated depth-map video. You can quickly follow our DepthCrafter tutorial. For better output, use a video with a clean background when generating the mask.
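If your depth-map video comes out as grayscale rather than pure black and white, thresholding each frame is one simple way to convert it. The sketch below is only an illustration and assumes you have exported the depth frames as individual images; the file paths and threshold value are placeholders to tune.

```python
import numpy as np
from PIL import Image

def depth_frame_to_mask(depth_path: str, mask_path: str, threshold: int = 128) -> None:
    """Turn a grayscale depth frame into a white-foreground / black-background mask."""
    depth = np.array(Image.open(depth_path).convert("L"))
    mask = np.where(depth >= threshold, 255, 0).astype(np.uint8)  # near (bright) pixels become white
    Image.fromarray(mask).save(mask_path)

# Example with hypothetical file names: convert a single exported depth frame.
depth_frame_to_mask("depth_frames/frame_0001.png", "mask_frames/frame_0001.png")
```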


video upload node

2. Video Upload - After creating the black-and-white masked video, load it into the "Video Upload" node.

This is a tedious process, and you will not get the perfect output in one go, so trying multiple times will give you a better sense of direction. You can play with a few options (a small sketch of how they interact follows the list below):

Frame load cap - how many frames of your target video are loaded for processing (0 means every frame will be rendered). A lower value gives a faster generation time but a shorter clip, and vice versa. It is recommended to use a low value for trial and error.

Skip first frames - how many frames to skip from the start of the video before generation begins.

Skip every nth frame - keeps only every nth frame. A value of 2 renders half of your video's frames, while 1 renders them all.
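As a rough mental model of how these three options interact (an illustrative sketch with descriptive parameter names, not the node's actual implementation):

```python
def select_frames(frames, frame_load_cap=0, skip_first_frames=0, select_every_nth=1):
    """Illustrative frame selection: skip the start, thin out the frames, then cap the count."""
    selected = frames[skip_first_frames::select_every_nth]
    if frame_load_cap > 0:  # 0 means no cap: keep everything that remains
        selected = selected[:frame_load_cap]
    return selected

# Example: a 300-frame clip, skipping the first 30 frames, keeping every 2nd frame, capped at 64 frames.
frames = list(range(300))
print(len(select_frames(frames, frame_load_cap=64, skip_first_frames=30, select_every_nth=2)))  # -> 64
```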


upload your background image

3. Load Bg Image - Drop your background image into the node. This gives your video a proper background so it looks more natural. For instance, if you want to make a cactus dance, use a relevant background such as a desert.

4. Load Image - Next, add your object image (octopus, rock, water, cactus, starfish, etc.) to the "Load Image" node. The possibilities are endless; it simply depends on your creativity and experience.

LoRA stack node

5. LoRA Stacker - This node is used to set up the LoRA model you want to use. There are plenty of LoRA models (ToonYou, DreamShaper, etc.) on CivitAI if you haven't checked it yet.


IP Adapter Unified Loader node

6. IP Adapter Unified Loader - You can set a higher or lower strength preset to control how strongly the reference image influences the video result.


IP Adapter Advanced

7. IP Adapter Advanced - This controls how much weight is given to your object image. A higher value means a stronger influence, a lower value a weaker one.


setup lora animatediff

8. Load AnimateDiff LoRA - Select your AnimateDiff LoRA model here.

9. Motion Scale - Sets the amount of motion applied to your object in the generated video. Generally, a value between 0.5 and 1 works well.

10. Load AnimateDiff Model - Select your AnimateDiff motion model. Make sure you use a model trained on Stable Diffusion 1.5 only.

set your checkpoint

11. Efficient Loader - Select your Stable Diffusion 1.5 checkpoint. Examples include DreamShaperLCM, ToonYou, Cyberrealism, Juggernaut, etc. Choose whichever style suits your video; there are plenty of options on Hugging Face, CivitAI, GitHub, etc.

Then select the relevant pruned VAE (Variational Auto-Encoder) model for SD1.5.

Inside the prompt boxes, use both a positive and a negative prompt. Make sure the prompt is relevant and detailed; the more detailed the prompt, the more strongly it influences your output video.

If you do not know how to do this yet, it is worth mastering prompting techniques first. Apart from this, negative prompt techniques also add more detail. More information can be found in our negative prompts tutorial.
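Purely as an illustration (these prompts are made up for the dancing-cactus example above, not taken from the workflow), a prompt pair might look like this:

```python
# Illustrative prompts only; tune them for your own object and style.
positive_prompt = (
    "a cheerful cartoon cactus dancing in a sunny desert, smooth motion, "
    "vibrant colors, detailed background, high quality, best quality"
)
negative_prompt = (
    "blurry, low quality, deformed, extra limbs, watermark, text, flickering"
)
```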



KSampler Efficient node

KSampler (Efficient) - This node is responsible for generating the video animation preview in real time. The settings used here (also summarized in the sketch after this list):

Steps = 10 (higher is better quality but slows down rendering),

CFG = 1.5 (how strongly the prompts influence the output),

Sampler = LCM,

Scheduler = sgm_uniform (others are experimental),

Denoise = 1
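For quick reference, the same values as a plain Python dictionary (just a summary of the settings above, with descriptive key names rather than anything the workflow actually reads):

```python
# Summary of the KSampler (Efficient) settings used above.
ksampler_settings = {
    "steps": 10,             # higher = better quality but slower rendering
    "cfg": 1.5,              # prompt influence; LCM-style models work best with a low CFG
    "sampler": "lcm",
    "scheduler": "sgm_uniform",
    "denoise": 1.0,
}
```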

 

set the seed value

12. Seed - Set this to "randomize".

controlnet configuration

13. ControlNet - In this group, in the first "Load Advanced ControlNet Model" node, select the SD1.5 QR Code Monster checkpoint.

ControlNet Stacker - strength = 0.5, start percent = 0, end percent = 0.6.

Second Load Advanced ControlNet Model node - load the SD1.5 lineart checkpoint. This is used to follow the outlines of each input frame.

Another ControlNet Stacker - strength = 1, start percent = 0, end percent = 0.75.

Realistic Lineart - For initial testing, set the resolution to the minimum, i.e., 512.

These settings are experimental; running the workflow a few times will give you a clearer understanding of how they behave. They are summarized in the sketch below.
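The ControlNet values, summarized the same way (descriptive key names only, values copied from the settings above):

```python
# Summary of the two ControlNet branches configured above.
controlnet_settings = [
    {"model": "SD1.5 QR Code Monster", "strength": 0.5, "start_percent": 0.0, "end_percent": 0.6},
    {"model": "SD1.5 Lineart",         "strength": 1.0, "start_percent": 0.0, "end_percent": 0.75},
]
```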

14. Preview (Video Combine) - This combines all the generated frames into the resulting video. Set the frame rate to the same value as your input video.


Finally, click the "Queue" button to start the rendering process. Generally, this takes a couple of minutes; the more VRAM your machine has, the faster the generation will be.


Upscaling node

15. Upscale node - The video output you get will be somewhat low resolution, so you can use any video upscaling technique to upscale it. This can be time consuming, and there are multiple third-party applications you can try as alternatives.


video combine node

Set the scale value to 2.5 if you want the video at around 720 pixels. The most crucial setting is the denoise value: raising it too much will distort your result, so keeping it in a lower range (0.2-0.5) is the better approach.
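To sanity-check the scale factor against your own clip, the arithmetic is just a multiplication. The 288x512 input below is a hypothetical example; 2.5x brings 288 up to 720.

```python
def upscaled_size(width: int, height: int, scale: float) -> tuple[int, int]:
    """Compute the output resolution for a given upscale factor."""
    return round(width * scale), round(height * scale)

# Example: a hypothetical 288x512 working resolution scaled by 2.5 gives 720x1280.
print(upscaled_size(288, 512, 2.5))  # (720, 1280)
```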

Tip: The upscaling pipeline is also included, but if you do not want it, simply select its nodes by dragging the mouse cursor over them and mute them with Ctrl+M (press Ctrl+M again to unmute).

For a broader overview and further learning, you can dive into our Video-To-Video tutorial using AnimateDiff and SDXL.