
Same prompt input gives different results for video vs. single-image prediction #533

Open
tissuemother opened this issue Jan 14, 2025 · 3 comments

Comments

@tissuemother

```python
import os

import cv2
import numpy as np

# Add a box prompt for a single object on the first frame
_, _, masks = predictor.add_new_points_or_box(
    inference_state=state, frame_idx=0, obj_id=0, box=input_box
)

# Iterate over frames and save masked images
for local_frame_idx, (frame_idx, object_ids, masks) in enumerate(predictor.propagate_in_video(state)):
    mask_to_vis = {}
    bbox_to_vis = {}

    for obj_id, mask in zip(object_ids, masks):
        # Threshold the mask logits at 0 to get a binary mask
        mask = mask[0].cpu().numpy()
        mask = mask > 0.0

        # Derive a bounding box from the mask, or [0, 0, 0, 0] if it is empty
        non_zero_indices = np.argwhere(mask)
        if len(non_zero_indices) == 0:
            bbox = [0, 0, 0, 0]
        else:
            y_min, x_min = non_zero_indices.min(axis=0).tolist()
            y_max, x_max = non_zero_indices.max(axis=0).tolist()
            bbox = [y_min, x_min, y_max, x_max]
        bbox_to_vis[obj_id] = bbox
        mask_to_vis[obj_id] = mask

        # Load the original frame image
        original_image_path = os.path.join(current_images_dir, image_names[local_frame_idx])
        original_image = cv2.imread(original_image_path)

        # Ensure the mask is the same size as the original image
        mask_resized = cv2.resize(
            mask.astype(np.uint8) * 255,
            (original_image.shape[1], original_image.shape[0]),
            interpolation=cv2.INTER_NEAREST,
        )

        # Apply the mask to the original image
        masked_image = cv2.bitwise_and(original_image, original_image, mask=mask_resized)

        # Save the masked image
        masked_filename = f"masked_frame_{local_frame_idx + 400 * batch_idx:04d}.png"
        cv2.imwrite(os.path.join(output_path, masked_filename), masked_image)
```

I'm currently using SAM2 to segment my custom dataset, which contains both videos and single images. With the same prompt, segmentation works fine on a single image, just like in the demo, but when applied to a video the result doesn't come out as expected, as shown in the image below. When I checked the internal mask data, the values are between -0.05 and 0.05, so I thought it might be a confidence issue. I tried adjusting the threshold, but the result still isn't as high quality as with a single image and instead shows strange patterned artifacts. I don't know what is causing this and would appreciate help troubleshooting.

(attached image: frame_0121)
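For reference, one way to check whether the propagated masks are genuinely low-confidence (rather than just mis-thresholded) is to print the raw logit range per frame. This is a minimal sketch that assumes the same `predictor` and `state` objects as in the code above:

```python
import torch

# Inspect the raw mask logits yielded during propagation. Values clustered
# tightly around zero on both sides (e.g. -0.05 to 0.05) suggest the model is
# genuinely uncertain, rather than a thresholding problem.
for frame_idx, object_ids, mask_logits in predictor.propagate_in_video(state):
    for obj_id, logits in zip(object_ids, mask_logits):
        logits = logits[0].cpu()
        confidence = torch.sigmoid(logits)  # map logits to [0, 1]
        print(
            f"frame {frame_idx}, obj {obj_id}: "
            f"logits [{logits.min():.3f}, {logits.max():.3f}], "
            f"max confidence {confidence.max():.3f}"
        )
```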

@heyoeyo

heyoeyo commented Jan 14, 2025

Box selection tends to create lots of errors/artifacts if the box isn't extremely tight to the object that you want to segment. Here's an example of a 'loose' box around an object:

Image

You can see it seems to try to segment stuff around/behind the main object, making a mess of things. The right side shows all other mask predictions, none of which are useful in this case.
By comparison, if the box is very tight to the object, it gives a good segmentation (only one of the masks on the right is bad in this case):

Image

So the errors you're getting may just be due to a loose-fitting box (and maybe the box used when running single images fits tighter?). If tightening the boxes doesn't help, you could also try switching models, since the different model sizes can behave differently. If none of that helps, then using point prompts might be the best option, if possible, since the v2 models seem to handle points much better than boxes.
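If it helps, here's a minimal sketch of what swapping the box prompt for a point prompt on the first frame might look like, using the same `add_new_points_or_box` call as in the original code. The click coordinates here are hypothetical placeholders:

```python
import numpy as np

# Hypothetical click near the center of the object on the first frame;
# replace with coordinates from your own data. Label 1 marks a positive
# (foreground) point, label 0 would mark a background point.
point_coords = np.array([[480, 320]], dtype=np.float32)  # (x, y) pixel coords
point_labels = np.array([1], dtype=np.int32)

_, object_ids, mask_logits = predictor.add_new_points_or_box(
    inference_state=state,
    frame_idx=0,
    obj_id=0,
    points=point_coords,
    labels=point_labels,
)
```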

@tissuemother
Author

Could either of these be the cause of the problem?

- The object is stationary, and the camera moves around it while capturing the dataset. (Is SAM2 only trained on fixed-camera footage?)
- There is significant camera vibration, noise, or blur.

@heyoeyo

heyoeyo commented Jan 15, 2025

> The object is stationary, and the camera moves around it while capturing the dataset

From what I've seen, there aren't any issues with moving cameras, though maybe if the camera is moving very fast it could become a problem due to blurring? I don't have any rotating camera examples of my own, but trying it on a video from pexels, it seems to work (this is the tiny v2.1 model), though the rotation isn't fast enough to blur:

Image

> There is significant camera vibration, noise, or blur

Strong blurring could probably break the tracking over time. For your example, do the problems only appear after some number of frames? I was assuming the issue happens on the first frame, but if it's something that builds up over time, then yes, it could be a blurring/movement issue. Again, I don't have any extreme blurring examples, but using a section of the crab rave video (around 1:15 into the video), the tracking seems to hold on even though almost every frame has severe blurring (of the object, not due to camera movement, to be fair):

Image

If your video has a section that is slower/less blurry, maybe it's worth trying the segmentation there first to see if the problem goes away?
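A minimal sketch of prompting a sharper frame partway through the clip instead of frame 0 could look like the following; it assumes the `reset_state` method and the `start_frame_idx`/`reverse` arguments of the SAM2 video predictor, and the frame index and box are placeholders:

```python
# Prompt a sharper frame instead of frame 0 (placeholder index).
sharp_frame_idx = 60
predictor.reset_state(state)  # clear any earlier prompts
_, object_ids, mask_logits = predictor.add_new_points_or_box(
    inference_state=state,
    frame_idx=sharp_frame_idx,
    obj_id=0,
    box=input_box,
)

# Propagate forward from the prompted frame...
for frame_idx, object_ids, mask_logits in predictor.propagate_in_video(
    state, start_frame_idx=sharp_frame_idx
):
    ...  # handle masks as before

# ...and optionally backward to cover the earlier frames as well.
for frame_idx, object_ids, mask_logits in predictor.propagate_in_video(
    state, start_frame_idx=sharp_frame_idx, reverse=True
):
    ...  # handle masks as before
```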
