Add support for Apple's Depth-Pro #34583

Open · geetu040 wants to merge 169 commits into main
Conversation

@geetu040 commented Nov 3, 2024

What does this PR do?

Fixes #34020

This PR adds Apple's Depth Pro model to Hugging Face Transformers. Depth Pro is a foundation model for zero-shot metric monocular depth estimation. It uses a multi-scale vision transformer optimized for dense predictions: the input image is downsampled to several scales, each scale is split into patches, and the patches are processed by a Dinov2-based patch encoder whose weights are shared across scales. The patch features are then merged into feature maps, upsampled, and fused via a DPT decoder.
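
For context, a minimal usage sketch of the intended API (the checkpoint name comes from later in this thread; exact class names and post-processing may differ):

import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForDepthEstimation

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# checkpoint name taken from the discussion below; it may differ on the Hub
processor = AutoImageProcessor.from_pretrained("apple/depth-pro-hf")
model = AutoModelForDepthEstimation.from_pretrained("apple/depth-pro-hf")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# metric depth map (and, depending on the config, a field-of-view estimate)
predicted_depth = outputs.predicted_depth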

Relevant Links

Before submitting

Who can review?

@amyeroberts, @qubvel

@geetu040 (Author) commented Nov 3, 2024

I have implemented the foundational components of the model and manually loaded the weights to ensure that the architecture aligns with the original design and produces consistent output.

Below is a concise overview of the class hierarchy. I would greatly appreciate your feedback or any suggestions for improvements:

DepthProForDepthEstimation
├── depth_pro: DepthProModel
│   ├── encoder: DepthProEncoder
│   │   ├── patch_encoder: DepthProViT
│   │   │   ├── embeddings: DepthProViTEmbeddings
│   │   │   └── encoder: DepthProViTEncoder
│   │   ├── image_encoder: DepthProViT
│   │   │   ├── embeddings: DepthProViTEmbeddings
│   │   │   └── encoder: DepthProViTEncoder
│   ├── decoder: DepthProDecoder
│   └── fov_model: DepthProFOVModel
│       ├── encoder: DepthProViT
│       │   ├── embeddings: DepthProViTEmbeddings
│       │   └── encoder: DepthProViTEncoder
└── head: DepthProDepthEstimationHead

I have a couple of questions:

  1. The encoder: DepthProEncoder outputs features processed at various scales, including hidden states from the intermediate layers of ViTEncoder. Currently, I use BaseModelOutput, returning all features in the last_hidden_state argument. Should I create a dedicated ModelOutput class for DepthProEncoder? If so, it should reside in the same file as the DepthPro classes since it is specific to them.

  2. For handling the FOV (Field of View) output, would it be appropriate to create a class named DepthEstimatorOutputWithFOV in transformers.modeling_outputs, or should it also remain within the DepthPro context?

@Rocketknight1 (Member) commented

cc @pcuenca as well!

@qubvel (Member) commented Nov 5, 2024

Hi @geetu040! Thanks for working on this model!

Regarding model outputs: a new output class should only be written if you want to add a new argument or write better docs. For intermediate outputs, you can store them in BaseModelOutput.hidden_states; for example, mllama sets output_hidden_states=True by default and then selects the required hidden states from the vision transformer.
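
For illustration, a minimal sketch of "selecting the required hidden states" from a ViT run with output_hidden_states=True (layer indices are illustrative, not the ones used in this PR):

from typing import List, Sequence

import torch
from transformers.modeling_outputs import BaseModelOutput


def select_intermediate_states(
    vit_outputs: BaseModelOutput, layer_ids: Sequence[int]
) -> List[torch.Tensor]:
    """Pick intermediate features from a ViT run with output_hidden_states=True.

    hidden_states is a tuple of (num_layers + 1) tensors: the embedding output
    followed by the output of each transformer layer.
    """
    return [vit_outputs.hidden_states[i] for i in layer_ids]


# e.g. select_intermediate_states(vit_outputs, layer_ids=[3, 6, 9, 12])  # indices are illustrative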

@geetu040 (Author) commented

@qubvel @pcuenca Thanks, I have updated the code for hidden_states.

I still need an opinion on fov (field of view).
DepthPro returns the predicted_depth as well as the fov, which is a scalar value.

The existing DepthEstimatorOutput class in transformers/src/transformers/modeling_outputs.py looks like this:

class DepthEstimatorOutput(ModelOutput):
    loss: Optional[torch.FloatTensor] = None
    predicted_depth: torch.FloatTensor = None
    hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
    attentions: Optional[Tuple[torch.FloatTensor, ...]] = None

Q1: Should I create a new class DepthEstimatorOutputWithFOV or update the existing class?
Q2: The user should be given the option to turn FOV on or off, because calculating FOV requires extra processing. In this case, should this parameter be part of model initialization, e.g. DepthProForDepthEstimation(config, return_fov=True), or should it be kept inside the config?

@qubvel (Member) commented Nov 11, 2024

Thanks @geetu040

Q1:

class DepthProDepthEstimatorOutput(DepthEstimatorOutput):
    fov: Optional[torch.FloatTensor] = None

This output can be returned in both cases, whether fov is None or not.

Q2:

Yeah, this can be a parameter of the config, but it should also be an argument of the forward method to override the config parameter (similar to output_hidden_states).

Please, let me know if you have more questions!

@geetu040 (Author) commented

> Yeah, this can be a parameter of the config, but it should also be an argument of the forward method to override the config parameter (similar to output_hidden_states)

This needs to be done during __init__, because it requires fov_model (another vision transformer) to be initialized.

@qubvel (Member) commented Nov 15, 2024

OK, got it! Then it should be done with the config! And anyone can just load a model as follows:

model = DepthProForDepthEstimation(checkpoint, fov_model=True)
# or
model = DepthProForDepthEstimation(checkpoint, fov_model=False)

With such initialization, the fov_model param will be overridden in the config.
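
In practice this would look roughly as follows (the config attribute name is illustrative, and loading pretrained weights goes through from_pretrained rather than the constructor):

from transformers import DepthProConfig, DepthProForDepthEstimation

# build the model with the FOV head (attribute name is illustrative)
config = DepthProConfig(use_fov_model=True)
model = DepthProForDepthEstimation(config)

# or override the config flag while loading pretrained weights
model = DepthProForDepthEstimation.from_pretrained("apple/depth-pro-hf", use_fov_model=False)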

@geetu040 (Author) commented

  • currently an image is down-scaled to medium resolution (high / 2) and low resolution (high / 4)
  • then patches are created from high, medium and low and concatenated.

I was wondering whether we can also give users the option to decide which scales to use; for example, a user specifies custom scales in the config: image_scales=[0.6, 0.4, 0.3]

  • now an image will be downscaled to these 3 scales
  • then patches are created from the high-resolution and scaled images and concatenated.

@qubvel I have looked into how this could be implemented in the code; it is doable, and I could easily make this option available (and I would prefer that), but I have to ask you as well: do you think this option should be given to the users?
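
A rough sketch of the downscaling step under discussion (function name and default scale values are illustrative, not this PR's implementation):

from typing import List, Sequence

import torch
import torch.nn.functional as F


def multi_scale_images(
    pixel_values: torch.Tensor, image_scales: Sequence[float] = (0.5, 0.25)
) -> List[torch.Tensor]:
    """Downscale a (B, C, H, W) batch to each ratio, keeping the full-resolution image first.

    (0.5, 0.25) reproduces the default medium/low resolutions; something like
    (0.6, 0.4, 0.3) would be the user-configurable variant discussed above.
    """
    scaled = [
        F.interpolate(pixel_values, scale_factor=s, mode="bilinear", align_corners=False)
        for s in image_scales
    ]
    return [pixel_values] + scaled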

@qubvel (Member) commented Nov 18, 2024

Hi @geetu040, we try to avoid overcomplicated code with lots of parameters; the general rule is to get rid of different code paths and unused params that do not differ across pretrained checkpoints. For this particular case, feel free to add it, but only if it does not introduce extra complexity to the modeling code.

@geetu040 (Author) commented Nov 25, 2024

Hi @qubvel I have a question about the image processor.

The source code from apple/depth-pro preprocesses the image in the sequence normalize -> resize; however, the conventional image processors for ViT and DPT use the sequence resize -> normalize.

This causes the two outputs to be slightly different from each other.

Do you suggest I stay with the convention and ignore the minor difference in output, or should I make the implementation exactly like the source code? I am not sure how to do the latter: the original resize function gives an error if it is simply moved above the normalization code, and using torch.nn.functional.interpolate is also not optimal, since it requires data conversions.
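
For reference, the two orderings reduced to a minimal sketch on a (B, C, H, W) float tensor, with mean/std as broadcastable tensors (illustrative only, not the PR's implementation):

import torch
import torch.nn.functional as F


def resize_then_normalize(x: torch.Tensor, size, mean, std) -> torch.Tensor:
    # conventional ViT/DPT order
    x = F.interpolate(x, size=size, mode="bilinear", align_corners=False)
    return (x - mean) / std


def normalize_then_resize(x: torch.Tensor, size, mean, std) -> torch.Tensor:
    # order used by the original Depth Pro code
    x = (x - mean) / std
    return F.interpolate(x, size=size, mode="bilinear", align_corners=False)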

Here are the outputs:

Difference in outputs

There is a slight difference; this happens because of how the image is pre-processed before being given to the model.

Source code results

ic| depth: tensor([[0.9604, 0.9329, 0.8837,  ..., 3.0123, 2.9720, 2.9517],
                   [0.9210, 0.8995, 0.8605,  ..., 3.0148, 3.0120, 3.0106],
                   [0.8811, 0.8655, 0.8366,  ..., 3.0245, 3.0473, 3.0592],
                   ...,
                   [1.2283, 1.2263, 1.2225,  ..., 1.2698, 1.2818, 1.2881],
                   [1.2228, 1.2241, 1.2266,  ..., 1.2679, 1.2806, 1.2872],
                   [1.2167, 1.2223, 1.2333,  ..., 1.2655, 1.2757, 1.2810]])
ic| depth.shape: torch.Size([2268, 3024])
ic| focallength_px: tensor(3362.0200)

HF code results

ic| predicted_depth: [tensor([[0.9727, 0.9443, 0.8937,  ..., 3.0023, 2.9608, 2.9399],
                             [0.9320, 0.9097, 0.8693,  ..., 3.0045, 3.0006, 2.9987],
                             [0.8899, 0.8737, 0.8439,  ..., 3.0129, 3.0352, 3.0469],
                             ...,
                             [1.2393, 1.2373, 1.2334,  ..., 1.2805, 1.2934, 1.3001],
                             [1.2344, 1.2356, 1.2379,  ..., 1.2802, 1.2935, 1.3004],
                             [1.2286, 1.2341, 1.2447,  ..., 1.2788, 1.2892, 1.2947]])]
ic| fov: [tensor(3383.9839)]

Difference in output images

Visually there is no difference between the two images (the input image and the rendered source-code and HF results were attached as figures).

@geetu040 (Author) commented

Also, how does the weight conversion work?

I have created the script for weight conversion, but when, and by whom, are the converted weights uploaded to the Hugging Face Hub? I would need these converted weights for the examples in the docstrings.

@geetu040 (Author) commented Feb 5, 2025

  • I have updated the code to use the latest BaseImageProcessorFast class.
  • Failing tests are unrelated.
  • PR ready for review again

@qubvel (Member) left a review comment

Thanks a lot for updating to the new Fast image processor! A few more comments:

Comment on lines +128 to +144
# DepthPro resizes image after rescaling and normalizing,
# which makes it different from BaseImageProcessorFast._preprocess
def _preprocess(
    self,
    images: List["torch.Tensor"],
    do_resize: bool,
    size: SizeDict,
    interpolation: Optional["F.InterpolationMode"],
    do_center_crop: bool,
    crop_size: SizeDict,
    do_rescale: bool,
    rescale_factor: float,
    do_normalize: bool,
    image_mean: Optional[Union[float, List[float]]],
    image_std: Optional[Union[float, List[float]]],
    return_tensors: Optional[Union[str, TensorType]],
) -> BatchFeature:
A reviewer (Member) commented:

cc @yonigozlan on a different order of preprocessing (we discussed it yesterday)

tests/models/depth_pro/test_image_processing_depth_pro.py (review thread outdated, resolved)
@qubvel (Member) commented Feb 5, 2025

@geetu040 please see #34583 (comment) re adding antialias

@geetu040 (Author) commented Feb 5, 2025

@qubvel grouping in one loop was a great suggestion here, glad you noticed that. Grouping them twice was unnecessary, since the size of the images did not change after the first grouping. Thanks for that!

This did speed up the fast processor, and in fact it's faster than the slow processor now, so I am not skipping the test anymore. I was using is_flaky before because locally the test does sometimes fail sporadically, but it seems to fail sporadically for other processors locally as well. It's working fine in the CI tests (at least for the last 2/2 workflows), so I'll avoid the flaky tag.
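
For readers of this thread, a rough sketch of the single-pass grouping idea (illustrative only; the actual fast processor uses its own helpers):

from collections import defaultdict
from typing import Dict, List, Tuple

import torch


def group_images_by_shape(
    images: List[torch.Tensor],
) -> Dict[Tuple[int, ...], Tuple[torch.Tensor, List[int]]]:
    """Group same-shaped images so each group can be stacked and processed in one batched call."""
    buckets = defaultdict(list)
    for idx, img in enumerate(images):
        buckets[tuple(img.shape)].append((idx, img))
    return {
        shape: (torch.stack([img for _, img in items]), [idx for idx, _ in items])
        for shape, items in buckets.items()
    }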

@geetu040 (Author) commented Feb 5, 2025

Everything is up to date now and ready for review again. Failing tests are unrelated.

@qubvel (Member) commented Feb 5, 2025

Thanks for applying the changes! 🤗 We’re all set to proceed with the model.

Before merging, there's just one thing left: we usually transfer the checkpoint to the organization that released the original model, in this case Apple, but only if you're okay with that.

If you're fine with it, we can go ahead with the transfer. However, before that we’ll need to:

  • Change the Hub repository name from "DepthPro" to "depth-pro-hf"
  • Update the checkpoint name in the model card snippet
  • Update the checkpoint names in the code

Let us know how you’d like to proceed! 😊

@geetu040 (Author) commented Feb 5, 2025

> Apple, but only if you're okay with that.

I am okay with that.

> If you're fine with it, we can go ahead with the transfer. However, before that we'll need to:

I have updated the Hub repository name: https://huggingface.co/geetu040/depth-pro-hf
and updated the checkpoint in the model card and code.

@geetu040 (Author) commented Feb 5, 2025

And thank you @qubvel for all the fine, detailed reviews.

@qubvel (Member) commented Feb 5, 2025

Thanks, can you please also update repo_id to "apple" everywhere? Like "apple/depth-pro-hf"

@geetu040 (Author) commented Feb 6, 2025

> Thanks, can you please also update repo_id to "apple" everywhere? Like "apple/depth-pro-hf"

updated

@geetu040 (Author) commented Feb 6, 2025

@qubvel, DepthProImageProcessingTest::test_fast_is_faster_than_slow has failed in 2 of the last 5 runs; do you think I should put the is_flaky decorator on it? It passes with very high probability, just not every time.

@qubvel (Member) commented Feb 6, 2025

Yes, let's add @is_flaky()
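
For reference, a minimal sketch of where the decorator goes (test body elided; in the real file the test comes from the image-processing test mixin):

import unittest

from transformers.testing_utils import is_flaky


class DepthProImageProcessingTest(unittest.TestCase):
    @is_flaky()
    def test_fast_is_faster_than_slow(self):
        # in the real test file this test is inherited from the shared mixin;
        # shown here only to illustrate where the decorator is applied
        ...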

@geetu040 (Author) commented Feb 6, 2025

> Yes, let's add @is_flaky()

I've updated the test and it's working fine now.

@geetu040 requested a review from qubvel on February 6, 2025, 12:14
@qubvel (Member) commented Feb 6, 2025

Thanks! Working on the checkpoint transfer.

Successfully merging this pull request may close these issues: Add support for Apple's Depth-Pro

6 participants