Tracking objects precisely in images and videos is a challenging task. SAM2 (Segment Anything Model 2) is an open-source model released by Meta AI under the Apache 2.0 license. It delivers more accurate object segmentation on images and videos than the original SAM.
Source: Meta AI
The model is trained on ~51,000 real-world videos and ~600,000 masklets (spatio-temporal masks). You can get more in-depth knowledge from the officially released paper.
Installation:
1. First, complete the ComfyUI installation setup if you are new to Comfy.
2. Install the custom nodes by Kijai. Navigate to the "ComfyUI/custom_nodes" folder and open a command prompt by typing "cmd" in the folder's address bar.
Clone the SAM2 repository using the following command:
git clone https://github.com/kijai/ComfyUI-segment-anything-2.git
Download any of the models from the Hugging Face repository. There are multiple variants to choose from: Tiny, Small, Base, and Large. Save the chosen model inside the "ComfyUI/models/sam2" folder, creating the "sam2" folder if it does not exist. If you prefer scripting the download, see the sketch below.
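Here is a minimal Python sketch using the huggingface_hub client. The repo id and filename below are placeholders, not confirmed values; copy the exact ones from the Hugging Face page of the variant you chose.

import os
from huggingface_hub import hf_hub_download

# Create the target folder if it does not exist yet.
os.makedirs("ComfyUI/models/sam2", exist_ok=True)

# repo_id and filename are placeholders; use the exact values from Hugging Face.
hf_hub_download(
    repo_id="Kijai/sam2-safetensors",
    filename="sam2_hiera_base_plus.safetensors",
    local_dir="ComfyUI/models/sam2",
)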
Alternative:
Navigate to the ComfyUI Manager and select "Custom Nodes Manager".
Search for the "Segment Anything 2" custom nodes by Kijai. All the models will be downloaded automatically when you run the workflow for the first time.
Remember that this will take some time, as the relevant models are downloaded in the background; you can monitor the progress in the command prompt running behind ComfyUI.
3. Restart ComfyUI for the changes to take effect.
Workflow:
1. Get the workflow from your "ComfyUI-segment-anything-2/examples" folder. Alternatively, you can download it from the GitHub repository. These are the example workflows you get:
(a) florence_segment_2 - Detects individual objects and their bounding boxes in a single image with the Florence2 model.
(b) image_batch_bbox_segment - Helpful for segmenting image batches and creating masks with the single-image segmentor.
(c) points_segment_video - Segments videos from point prompts; in individual mode you can add further negative points if the initial points are too few.
Choose the one you want to work with and drag and drop it into ComfyUI.
Workflow 1: With Florence2
Workflow 2: Selecting a specific object from the entire subject
2. Load your image/video into the Load node.
3. Load your relevant SAM2 model from the SAM2 model loader node.
4. Do the segmentation by selecting objects. Object indices start from 0, so 0 selects the first object, 1 the second, and so on; a Python sketch of the same point-selection idea follows below.
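Outside ComfyUI, the same idea can be sketched with Meta's sam2 Python package (installed from the official GitHub repository). The names below follow that repo's README and may differ across versions; the image path and click coordinates are made-up examples.

import numpy as np
import torch
from PIL import Image
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Load a pretrained checkpoint from Hugging Face (per the official README).
predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")

image = np.array(Image.open("input.jpg").convert("RGB"))
with torch.inference_mode():
    predictor.set_image(image)
    # One positive click (label 1) on the object; label 0 would mark background.
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[480, 320]]),  # (x, y) pixel inside the object
        point_labels=np.array([1]),
    )
print(masks.shape)  # (num_candidates, H, W) candidate masks for that click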
Now, you may be wondering why we even need this. It is a great approach when you are working on inpainting or face swapping without losing precision. It also helps with mask creation in video, which supports other workflows such as adding VFX to AI videos; a rough sketch of the video side follows below.
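For video, the underlying tracking API (again Meta's sam2 package; names per its README and subject to change across versions) looks roughly like this. The frames folder and click coordinates are hypothetical.

import numpy as np
import torch
from sam2.sam2_video_predictor import SAM2VideoPredictor

predictor = SAM2VideoPredictor.from_pretrained("facebook/sam2-hiera-large")

with torch.inference_mode():
    # "clip_frames/" is a hypothetical folder of extracted JPEG frames.
    state = predictor.init_state("clip_frames/")

    # One positive click on frame 0 defines object 0; SAM2 tracks it onward.
    predictor.add_new_points_or_box(
        inference_state=state,
        frame_idx=0,
        obj_id=0,  # object indices start at 0, as in the ComfyUI node
        points=np.array([[480, 320]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),  # 1 = positive, 0 = negative
    )

    # Propagate the mask through the rest of the clip, frame by frame.
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu().numpy()  # boolean mask per object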
Some Limitations:
1. Loses track of objects in challenging scenarios (viewpoint changes, occlusions, crowded scenes, extended videos)
2. Confuses similar-looking objects in crowded scenes
3. Decreased efficiency when segmenting multiple objects simultaneously
4. Misses fine details in fast-moving objects
5. Lacks temporal smoothness in predictions
6. Requires human verification of masklet quality
7. Requires manual selection of frames that need correction