Tracking objects precisely in images and videos is a challenging task. SAM2 (Segment Anything Model 2) is an open-source model released by Meta AI under the Apache 2.0 license. It segments objects in images and videos more accurately than the original SAM.
Source: Meta AI
The model is trained on ~51,000 real-world videos and ~600,000 masklets (spatio-temporal masks). You can learn more from the officially released paper.
Installation:
1. First, complete the ComfyUI installation setup if you are new to Comfy.
2. Install the custom nodes by Kijai. Navigate to the "ComfyUI/custom_nodes" folder and open a command prompt by typing "cmd" in the folder's address bar.
Clone the SAM2 repository using the following command:
git clone https://github.com/kijai/ComfyUI-segment-anything-2.git
Download any of the models from the Hugging Face repository. There are multiple variants to choose from: Tiny, Small, Base, and Large.
Save the models inside the "ComfyUI/models/sam2" folder, creating the "sam2" folder if it does not exist.
Alternative:
Navigate to the ComfyUI Manager and select "Custom Nodes Manager".
Search for the "Segment Anything 2" custom nodes by Kijai. All the required models will be downloaded automatically when you run the workflow for the first time.
Remember that this takes some time while the relevant models download in the background; you can check progress in the command prompt running behind ComfyUI.
3. Restart ComfyUI for the changes to take effect.
Workflow:
1. Get a workflow from your "ComfyUI-segment-anything-2/examples" folder. Alternatively, you can download one from the GitHub repository. These are the workflows available:
(a) florence_segment_2 - detects individual objects and bounding boxes in a single image with the Florence model.
(b) image_batch_bbox_segment - helpful for processing batches and masks with the single-image segmentor.
(c) points_segment_video - segments videos with point prompts; it extends negative points in individual mode when there are too few.
Choose the one you want to work with, then drag and drop it into ComfyUI.
Workflow 1: With Florence2
Workflow 2: Selecting a specific object from the entire subject
2. Load your image/video into the Load node.
3. Load the relevant SAM2 model from the SAM2 node.
4. Perform the segmentation by selecting objects. The selection index starts at 0, which denotes the first selection, and increments from there.
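To illustrate the 0-based selection, here is a small sketch of how a selection index could be turned into a binary mask from a per-pixel label map. The `label_map` layout and the `mask_for` helper are hypothetical illustrations, not part of the SAM2 nodes.

```python
# hypothetical label map: each pixel stores the index of the object it belongs to
label_map = [
    [0, 0, 1],
    [0, 2, 1],
    [2, 2, 1],
]

def mask_for(selection, labels):
    """Binary mask for the object at the given 0-based selection index."""
    return [[1 if px == selection else 0 for px in row] for row in labels]
```

Selection 0 picks out the first object, selection 1 the second, and so on, matching the node's 0-based numbering.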
Now, you may be wondering why we even need this. It is a great approach if you are working with inpainting or face swapping without losing precision. It also helps with mask creation in real-time video, which supports other workflows such as adding VFX to AI videos.
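As a rough illustration of why such masks help downstream, here is a pure-Python sketch of using a binary mask to gate which pixels a later step (inpainting, face swapping, a VFX overlay) is allowed to touch. The `apply_mask` helper and the list-of-lists representation are simplifications; real pipelines operate on tensors produced by the SAM2 nodes.

```python
def apply_mask(image, mask, fill):
    """Replace pixels where mask == 1 with `fill`; leave the rest unchanged.

    `image` and `mask` are equally sized 2-D lists of pixel values and 0/1 flags.
    """
    return [
        [fill if m else px for px, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]
```

Because only the masked region changes, the surrounding pixels keep their original precision, which is exactly what inpainting and face-swap workflows rely on.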
Some Limitations:
1. Loses track of objects in challenging scenarios (viewpoint changes, occlusions, crowded scenes, extended videos)
2. Confuses similar-looking objects in crowded scenes
3. Decreased efficiency when segmenting multiple objects simultaneously
4. Misses fine details in fast-moving objects
5. Lacks temporal smoothness in predictions
6. Requires human verification of masklet quality
7. Requires manual selection of frames that need correction