
Same prompt input gives different results for video vs. single-image prediction #533

Open
tissuemother opened this issue Jan 14, 2025 · 3 comments

Comments

@tissuemother

```python
import os

import cv2
import numpy as np

# Add a box prompt for a single object on the first frame
_, _, masks = predictor.add_new_points_or_box(
    inference_state=state, frame_idx=0, obj_id=0, box=input_box
)

# Iterate over frames and save masked images
for local_frame_idx, (frame_idx, object_ids, masks) in enumerate(predictor.propagate_in_video(state)):
    mask_to_vis = {}
    bbox_to_vis = {}

    for obj_id, mask in zip(object_ids, masks):
        # Threshold the mask logits at 0 to get a binary mask
        mask = mask[0].cpu().numpy()
        mask = mask > 0.0

        # Derive a bounding box from the mask, or [0, 0, 0, 0] if it is empty
        non_zero_indices = np.argwhere(mask)
        if len(non_zero_indices) == 0:
            bbox = [0, 0, 0, 0]
        else:
            y_min, x_min = non_zero_indices.min(axis=0).tolist()
            y_max, x_max = non_zero_indices.max(axis=0).tolist()
            bbox = [y_min, x_min, y_max, x_max]
        bbox_to_vis[obj_id] = bbox
        mask_to_vis[obj_id] = mask

        # Load the original frame image
        original_image_path = os.path.join(current_images_dir, image_names[local_frame_idx])
        original_image = cv2.imread(original_image_path)

        # Ensure the mask is the same size as the original image
        mask_resized = cv2.resize(
            mask.astype(np.uint8) * 255,
            (original_image.shape[1], original_image.shape[0]),
            interpolation=cv2.INTER_NEAREST,
        )

        # Apply the mask to the original image
        masked_image = cv2.bitwise_and(original_image, original_image, mask=mask_resized)

        # Save the masked image
        masked_filename = f"masked_frame_{local_frame_idx + 400 * batch_idx:04d}.png"
        cv2.imwrite(os.path.join(output_path, masked_filename), masked_image)
```

I'm currently using SAM2 to segment my custom dataset, which contains both videos and single images. With the same prompt, segmentation works fine on a single image, just like in the demo, but when applied to a video the result doesn't come out as expected, as shown in the image below. When I checked the internal mask data, the values are between -0.05 and 0.05, so I thought it might be a confidence issue. I tried adjusting the threshold, but the result still isn't as high quality as with a single image and instead shows strange patterned artifacts. I don't know what is causing this and would appreciate help troubleshooting.

(attached image: frame_0121)
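For reference, one way to check whether the propagated masks are genuinely low-confidence (rather than just mis-thresholded) is to print the raw logit range per frame. This is a minimal sketch that assumes the same `predictor` and `state` objects as in the code above:

```python
import torch

# Inspect the raw mask logits yielded during propagation. Values clustered
# tightly around zero on both sides (e.g. -0.05 to 0.05) suggest the model is
# genuinely uncertain, rather than a thresholding problem.
for frame_idx, object_ids, mask_logits in predictor.propagate_in_video(state):
    for obj_id, logits in zip(object_ids, mask_logits):
        logits = logits[0].cpu()
        confidence = torch.sigmoid(logits)  # map logits to [0, 1]
        print(
            f"frame {frame_idx}, obj {obj_id}: "
            f"logits [{logits.min():.3f}, {logits.max():.3f}], "
            f"max confidence {confidence.max():.3f}"
        )
```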

@heyoeyo

heyoeyo commented Jan 14, 2025

Box selection tends to create lots of errors/artifacts if the box isn't extremely tight to the object that you want to segment. Here's an example of a 'loose' box around an object:

Image

You can see it seems to try to segment stuff around/behind the main object, making a mess of things. The right side shows all other mask predictions, none of which are useful in this case.
By comparison, if the box is very tight to the object, it gives a good segmentation (only one of the masks on the right is bad in this case):

Image

So the errors you're getting may just be due to a loose-fitting box (and maybe the box used when running single images fits tighter?). If tightening the boxes doesn't help, you could also try switching models, since the different model sizes can behave differently. If none of that helps, then using point prompts might be the best option, if possible, since the v2 models seem to handle points much better than boxes.
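If it helps, here's a minimal sketch of what swapping the box prompt for a point prompt on the first frame might look like, using the same `add_new_points_or_box` call as in the original code. The click coordinates here are hypothetical placeholders:

```python
import numpy as np

# Hypothetical click near the center of the object on the first frame;
# replace with coordinates from your own data. Label 1 marks a positive
# (foreground) point, label 0 would mark a background point.
point_coords = np.array([[480, 320]], dtype=np.float32)  # (x, y) pixel coords
point_labels = np.array([1], dtype=np.int32)

_, object_ids, mask_logits = predictor.add_new_points_or_box(
    inference_state=state,
    frame_idx=0,
    obj_id=0,
    points=point_coords,
    labels=point_labels,
)
```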

@tissuemother
Author

Could either of these be the cause of the problem?

- The object is stationary, and the camera moves around it while capturing the dataset. (Is SAM2 only trained on fixed-camera footage?)
- There is significant camera vibration, noise, or blur.

@heyoeyo

heyoeyo commented Jan 15, 2025

> The object is stationary, and the camera moves around it while capturing the dataset

From what I've seen, there aren't any issues with moving cameras, though maybe if the camera is moving very fast it could become a problem due to blurring? I don't have any rotating camera examples of my own, but trying it on a video from pexels, it seems to work (this is the tiny v2.1 model), though the rotation isn't fast enough to blur:

Image

> There is significant camera vibration, noise, or blur

Strong blurring could probably break the tracking over time. For your example, do the problems only appear after some number of frames? I was assuming the issue happens on the first frame, but if it's something that builds up over time, then yes, it could be a blurring/movement issue. Again, I don't have any extreme blurring examples, but using a section of the crab rave video (around 1:15 into the video), the tracking seems to hold on even though almost every frame has severe blurring (of the object, not due to camera movement, to be fair):

Image

If your video has a section that is slower/less blurry, maybe it's worth trying the segmentation there first to see if the problem goes away?
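A minimal sketch of prompting a sharper frame partway through the clip instead of frame 0 could look like the following; it assumes the `reset_state` method and the `start_frame_idx`/`reverse` arguments of the SAM2 video predictor, and the frame index and box are placeholders:

```python
# Prompt a sharper frame instead of frame 0 (placeholder index).
sharp_frame_idx = 60
predictor.reset_state(state)  # clear any earlier prompts
_, object_ids, mask_logits = predictor.add_new_points_or_box(
    inference_state=state,
    frame_idx=sharp_frame_idx,
    obj_id=0,
    box=input_box,
)

# Propagate forward from the prompted frame...
for frame_idx, object_ids, mask_logits in predictor.propagate_in_video(
    state, start_frame_idx=sharp_frame_idx
):
    ...  # handle masks as before

# ...and optionally backward to cover the earlier frames as well.
for frame_idx, object_ids, mask_logits in predictor.propagate_in_video(
    state, start_frame_idx=sharp_frame_idx, reverse=True
):
    ...  # handle masks as before
```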
