Add support for Apple's Depth-Pro #34583
base: main
Conversation
I have implemented the foundational components of the model and manually loaded the weights to ensure that the architecture aligns with the original design and produces consistent output. Below is a concise overview of the class hierarchy. I would greatly appreciate your feedback or any suggestions for improvements:
I have a couple of questions:
|
cc @pcuenca as well! |
Hi @geetu040! Thanks for working on this model! Regarding model outputs: they should be rewritten if you want to add a new argument or write better docs. In case of intermediate outputs, you can store them in `hidden_states`. |
@qubvel @pcuenca Thanks, I have updated the code for hidden_states. I still need an opinion on the model output. The existing class:

```python
class DepthEstimatorOutput(ModelOutput):
    loss: Optional[torch.FloatTensor] = None
    predicted_depth: torch.FloatTensor = None
    hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
    attentions: Optional[Tuple[torch.FloatTensor, ...]] = None
```

Q1: Do I create a new class for the `fov` output? |
Thanks @geetu040

Q1:

```python
class DepthProDepthEstimatorOutput(DepthEstimatorOutput):
    fov: Optional[torch.FloatTensor] = None
```

This output can be returned in both cases.

Q2: Yeah, this can be a parameter of the config, but it should also be an argument in

Please, let me know if you have more questions! |
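For concreteness, here is a runnable version of the subclass above, showing that the same class covers both the with-FOV and without-FOV cases (a sketch inferred from this thread, not the PR's exact code):

```python
from dataclasses import dataclass
from typing import Optional

import torch
from transformers.modeling_outputs import DepthEstimatorOutput


@dataclass
class DepthProDepthEstimatorOutput(DepthEstimatorOutput):
    # Extra field on top of the standard depth-estimation output;
    # left as None when the FOV head is disabled.
    fov: Optional[torch.FloatTensor] = None


out = DepthProDepthEstimatorOutput(predicted_depth=torch.rand(1, 224, 224))
print(out.fov is None)  # True when the FOV head did not run
```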
This needs to be done during |
OK, got it! Then it should be done with the config! And anyone can just load a model as follows:

```python
model = DepthProForDepthEstimation(checkpoint, fov_model=True)
# or
model = DepthProForDepthEstimation(checkpoint, fov_model=False)
```

With such initialization |
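If it goes through the config instead, the usual transformers pattern would look roughly like this (a sketch; the `use_fov_model` flag name and checkpoint id are assumptions for illustration):

```python
from transformers import DepthProForDepthEstimation  # class proposed in this PR

# `use_fov_model` is an assumed flag name; from_pretrained forwards
# unrecognized kwargs to the config, so an optional submodule can be
# toggled at load time without a custom constructor signature.
model = DepthProForDepthEstimation.from_pretrained("apple/depth-pro-hf", use_fov_model=False)
```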
I was wondering, can we also give users the option to decide which scales to use? For example, a user specifies in the config to use these custom scales
@qubvel I have looked into how this can be implemented in the code; it is doable, I can easily make this option available, and I would prefer that. But I have to ask you as well: do you think this option should be given to the users? |
Hi @geetu040, we try to avoid overcomplicated code with lots of parameters; the general rule is to get rid of different code paths / unused params that do not differ across pretrained checkpoints. For this particular case, feel free to add it, but only if it will not introduce extra complexity to the modeling code. |
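For illustration, exposing the scales as a plain config field might look like this (a hypothetical sketch; the field name and default ratios are placeholders, not the checkpoint's actual values):

```python
from transformers import PretrainedConfig


class DepthProConfig(PretrainedConfig):
    model_type = "depth_pro"

    def __init__(self, scaled_images_ratios=(0.25, 0.5, 1.0), **kwargs):
        super().__init__(**kwargs)
        # Ratios at which the input image is downsampled before patching;
        # users can override these when building the config.
        self.scaled_images_ratios = list(scaled_images_ratios)
```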
Hi @qubvel, I have a question about the image processor. The original source pre-processes the image differently, which causes the two outputs to be slightly different from each other. Do you suggest I stay with the convention and ignore the minor difference in output, or should I make the implementation exactly like the source code? I am not very sure how to do this, because the original

Here are the outputs [images: source code results, HF code results, difference in output image]. There is a slight difference, which happens because of how the image is pre-processed before being given to the model; visually there is no difference between the two images. |
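To illustrate why the outputs diverge, here is a small, self-contained comparison of two pre-processing orders; the sizes and pipelines are assumptions for demonstration, not DepthPro's exact values:

```python
import numpy as np
import torch
import torch.nn.functional as F
from PIL import Image

# Pipeline A (common convention): resize the uint8 image first,
# then rescale to [0, 1].
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(768, 768, 3), dtype=np.uint8)
a = np.asarray(Image.fromarray(img).resize((1536, 1536), Image.BILINEAR)) / 255.0

# Pipeline B (resize last): rescale first, then resize the float tensor.
t = torch.from_numpy(img).permute(2, 0, 1).unsqueeze(0).float() / 255.0
b = F.interpolate(t, size=(1536, 1536), mode="bilinear", align_corners=False)
b = b.squeeze(0).permute(1, 2, 0).numpy()

# Small but nonzero: PIL quantizes back to uint8 and filters differently,
# so the two pipelines never match bit-for-bit.
print(np.abs(a - b).max())
```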
Also, how does the weight conversion work? I have created the script for weight conversion, but when and by whom are the converted weights uploaded to the Hugging Face Hub? I ask because I would need these converted weights for the examples in docstrings. |
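For reference, conversion scripts in transformers typically follow this shape (a sketch; the key renaming below is illustrative, not DepthPro's real mapping):

```python
import torch
from transformers import DepthProConfig, DepthProForDepthEstimation  # classes from this PR

# Load the original checkpoint and rename keys to the HF layout
# (the mapping shown is a placeholder for the real renames).
state_dict = torch.load("depth_pro.pt", map_location="cpu")
renamed = {key.replace("encoder.", "depth_pro.encoder."): value for key, value in state_dict.items()}

model = DepthProForDepthEstimation(DepthProConfig())
model.load_state_dict(renamed, strict=True)

model.save_pretrained("depth-pro-hf")          # local folder with config + weights
# model.push_to_hub("geetu040/depth-pro-hf")   # then upload for docstring examples
```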
Thanks a lot for updating to a new Fast image processor!! A few more comments:
src/transformers/models/depth_pro/image_processing_depth_pro_fast.py
```python
# DepthPro resizes image after rescaling and normalizing,
# which makes it different from BaseImageProcessorFast._preprocess
def _preprocess(
    self,
    images: List["torch.Tensor"],
    do_resize: bool,
    size: SizeDict,
    interpolation: Optional["F.InterpolationMode"],
    do_center_crop: bool,
    crop_size: SizeDict,
    do_rescale: bool,
    rescale_factor: float,
    do_normalize: bool,
    image_mean: Optional[Union[float, List[float]]],
    image_std: Optional[Union[float, List[float]]],
    return_tensors: Optional[Union[str, TensorType]],
) -> BatchFeature:
```
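For readers following along, the point of the override is the operation order; a simplified body continuing the signature above might look like this (a sketch, not the PR's exact code):

```python
# Sketch: rescale and normalize first, resize last, matching the
# original DepthPro order noted in the comment above the signature.
images = [image.to(torch.float32) for image in images]
if do_rescale:
    images = [image * rescale_factor for image in images]
if do_normalize:
    mean = torch.tensor(image_mean).view(-1, 1, 1)
    std = torch.tensor(image_std).view(-1, 1, 1)
    images = [(image - mean) / std for image in images]
if do_resize:
    images = [
        torch.nn.functional.interpolate(
            image.unsqueeze(0), size=(size.height, size.width), mode="bilinear"
        ).squeeze(0)
        for image in images
    ]
return BatchFeature(data={"pixel_values": torch.stack(images)}, tensor_type=return_tensors)
```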
cc @yonigozlan on a different order of preprocessing (we discussed it yesterday)
src/transformers/models/depth_pro/image_processing_depth_pro_fast.py
@geetu040 please see #34583 (comment) re adding |
@qubvel grouping in one loop was a great suggestion here; glad you noticed that. Grouping them twice was unnecessary, since the size of the images did not change after the first grouping. Thanks for that! This did speed up the fast processor, and in fact it's faster than the slow processor now, so I am not skipping the test anymore. I was using |
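The grouping idea itself is simple enough to sketch on its own (helper names here are made up for illustration, not the processor's actual API):

```python
from collections import defaultdict

import torch


def group_by_shape(images):
    # Bucket images by shape so each bucket can be stacked into one tensor.
    buckets = defaultdict(list)
    for idx, img in enumerate(images):
        buckets[tuple(img.shape)].append((idx, img))
    return buckets


def process_grouped(images, fn):
    # Apply `fn` once per shape bucket as a batched op, then restore order.
    out = [None] * len(images)
    for group in group_by_shape(images).values():
        indices, imgs = zip(*group)
        batch = fn(torch.stack(imgs))
        for i, res in zip(indices, batch):
            out[i] = res
    return out


images = [torch.rand(3, 64, 64), torch.rand(3, 32, 32), torch.rand(3, 64, 64)]
normalized = process_grouped(images, lambda x: (x - 0.5) / 0.5)
```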
And everything is up-to-date now, ready for review again. Failing tests are unrelated. |
Thanks for applying the changes! 🤗 We’re all set to proceed with the model. Before merging, there’s just one thing left: we usually transfer the checkpoint to the organization that released the original model, in this case Apple, but only if you’re okay with that. If you're fine with it, we can go ahead with the transfer. However, before that we’ll need to:
Let us know how you’d like to proceed! 😊 |
I am okay with that.
I have updated the hub repository name: https://huggingface.co/geetu040/depth-pro-hf |
And thank you @qubvel for all the fine, detailed reviews. |
Thanks, can you please also update repo_id to "apple" everywhere? Like "apple/depth-pro-hf" |
updated |
@qubvel, |
Yes, let's add |
I've updated the test and it's working fine now |
Thanks! Working on the checkpoint transfer |
What does this PR do?
Fixes #34020
This PR adds Apple's Depth Pro model to Hugging Face Transformers. Depth Pro is a foundation model for zero-shot metric monocular depth estimation. It leverages a multi-scale vision transformer optimized for dense predictions: the image is downsampled at several scales, and at each scale it is split into patches, which are processed by a ViT-based (Dinov2) patch encoder with weights shared across scales. The patches are then merged into feature maps, upsampled, and fused via a DPT decoder.
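As a rough illustration of the multi-scale patching described above (the ratios and patch size are assumed values, and the real model uses overlapping patches; this simplified sketch uses non-overlapping ones):

```python
import torch
import torch.nn.functional as F


def multiscale_patches(image, ratios=(0.25, 0.5, 1.0), patch=384):
    # Downsample the image at several scales, then split each scale into
    # fixed-size patches so a shared ViT patch encoder can process them all.
    patches = []
    for r in ratios:
        scaled = F.interpolate(image, scale_factor=r, mode="bilinear", align_corners=False)
        _, _, h, w = scaled.shape
        for top in range(0, h - patch + 1, patch):
            for left in range(0, w - patch + 1, patch):
                patches.append(scaled[:, :, top : top + patch, left : left + patch])
    return torch.cat(patches)  # all patches share the encoder's input size


img = torch.rand(1, 3, 1536, 1536)
print(multiscale_patches(img).shape)  # torch.Size([21, 3, 384, 384])
```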
Relevant Links
Before submitting
Pull Request section?
to it if that's the case.
documentation guidelines, and
here are tips on formatting docstrings.
Who can review?
@amyeroberts, @qubvel