Add support for Apple's Depth-Pro #34583

Open · geetu040 wants to merge 169 commits into main
Conversation

@geetu040 commented Nov 3, 2024

What does this PR do?

Fixes #34020

This PR adds Apple's Depth Pro model to Hugging Face Transformers. Depth Pro is a foundation model for zero-shot metric monocular depth estimation. It uses a multi-scale vision transformer optimized for dense predictions: the input image is downsampled to several scales, each scale is split into patches, and the patches are processed by a Dinov2-based patch encoder whose weights are shared across scales. The patch features are then merged into feature maps, upsampled, and fused via a DPT decoder.
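
For context, a minimal usage sketch of the intended API (the checkpoint name comes from later in this thread; exact class names and post-processing may differ):

import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForDepthEstimation

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# checkpoint name taken from the discussion below; it may differ on the Hub
processor = AutoImageProcessor.from_pretrained("apple/depth-pro-hf")
model = AutoModelForDepthEstimation.from_pretrained("apple/depth-pro-hf")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# metric depth map (and, depending on the config, a field-of-view estimate)
predicted_depth = outputs.predicted_depth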

Relevant Links

Before submitting

Who can review?

@amyeroberts, @qubvel

@geetu040 (Author) commented Nov 3, 2024

I have implemented the foundational components of the model and manually loaded the weights to ensure that the architecture aligns with the original design and produces consistent output.

Below is a concise overview of the class hierarchy. I would greatly appreciate your feedback or any suggestions for improvements:

DepthProForDepthEstimation
├── depth_pro: DepthProModel
│   ├── encoder: DepthProEncoder
│   │   ├── patch_encoder: DepthProViT
│   │   │   ├── embeddings: DepthProViTEmbeddings
│   │   │   └── encoder: DepthProViTEncoder
│   │   ├── image_encoder: DepthProViT
│   │   │   ├── embeddings: DepthProViTEmbeddings
│   │   │   └── encoder: DepthProViTEncoder
│   ├── decoder: DepthProDecoder
│   └── fov_model: DepthProFOVModel
│       ├── encoder: DepthProViT
│       │   ├── embeddings: DepthProViTEmbeddings
│       │   └── encoder: DepthProViTEncoder
└── head: DepthProDepthEstimationHead

I have a couple of questions:

  1. The encoder: DepthProEncoder outputs features processed at various scales, including hidden states from the intermediate layers of ViTEncoder. Currently, I use BaseModelOutput, returning all features in the last_hidden_state argument. Should I create a dedicated ModelOutput class for DepthProEncoder? If so, it should reside in the same file as the DepthPro classes since it is specific to them.

  2. For handling the FOV (Field of View) output, would it be appropriate to create a class named DepthEstimatorOutputWithFOV in transformers.modeling_outputs, or should it also remain within the DepthPro context?

@Rocketknight1 (Member) commented

cc @pcuenca as well!

@qubvel (Member) commented Nov 5, 2024

Hi @geetu040! Thanks for working on this model!

Regarding model outputs: a new output class should only be written if you want to add a new argument or write better docs. For intermediate outputs, you can store them in BaseModelOutput.hidden_states; for example, mllama sets output_hidden_states=True by default and then selects the required hidden states from the vision transformer.
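
For illustration, a minimal sketch of "selecting the required hidden states" from a ViT run with output_hidden_states=True (layer indices are illustrative, not the ones used in this PR):

from typing import List, Sequence

import torch
from transformers.modeling_outputs import BaseModelOutput


def select_intermediate_states(
    vit_outputs: BaseModelOutput, layer_ids: Sequence[int]
) -> List[torch.Tensor]:
    """Pick intermediate features from a ViT run with output_hidden_states=True.

    hidden_states is a tuple of (num_layers + 1) tensors: the embedding output
    followed by the output of each transformer layer.
    """
    return [vit_outputs.hidden_states[i] for i in layer_ids]


# e.g. select_intermediate_states(vit_outputs, layer_ids=[3, 6, 9, 12])  # indices are illustrative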

@geetu040 (Author) commented

@qubvel @pcuenca Thanks, I have updated the code for hidden_states.

I still need an opinion on fov (field of view).
DepthPro returns the predicted_depth as well as the fov, which is a scalar value.

The existing DepthEstimatorOutput class in transformers/src/transformers/modeling_outputs.py looks like this:

class DepthEstimatorOutput(ModelOutput):
    loss: Optional[torch.FloatTensor] = None
    predicted_depth: torch.FloatTensor = None
    hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
    attentions: Optional[Tuple[torch.FloatTensor, ...]] = None

Q1: Should I create a new class DepthEstimatorOutputWithFOV or update the existing class?
Q2: The user should be given the option to turn FOV on or off, because calculating FOV requires extra processing. In this case, should this parameter be part of model initialization, e.g. DepthProForDepthEstimation(config, return_fov=True), or should it be kept inside the config?

@qubvel (Member) commented Nov 11, 2024

Thanks @geetu040

Q1:

class DepthProDepthEstimatorOutput(DepthEstimatorOutput):
    fov: Optional[torch.FloatTensor] = None

This output can be returned in both cases, whether fov is None or not.

Q2:

Yeah, this can be a parameter of the config, but it should also be an argument of the forward method to override the config parameter (similar to output_hidden_states).

Please, let me know if you have more questions!

@geetu040 (Author) commented

> Yeah, this can be a parameter of the config, but it should also be an argument of the forward method to override the config parameter (similar to output_hidden_states)

This needs to be done during __init__, because it requires fov_model (another vision transformer) to be initialized.

@qubvel (Member) commented Nov 15, 2024

OK, got it! Then it should be done with the config! And anyone can just load a model as follows:

model = DepthProForDepthEstimation(checkpoint, fov_model=True)
# or
model = DepthProForDepthEstimation(checkpoint, fov_model=False)

With such initialization, the fov_model param will be overridden in the config.
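
In practice this would look roughly as follows (the config attribute name is illustrative, and loading pretrained weights goes through from_pretrained rather than the constructor):

from transformers import DepthProConfig, DepthProForDepthEstimation

# build the model with the FOV head (attribute name is illustrative)
config = DepthProConfig(use_fov_model=True)
model = DepthProForDepthEstimation(config)

# or override the config flag while loading pretrained weights
model = DepthProForDepthEstimation.from_pretrained("apple/depth-pro-hf", use_fov_model=False)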

@geetu040 (Author) commented

  • currently an image is down-scaled to medium resolution (high / 2) and low resolution (high / 4)
  • then patches are created from high, medium and low and concatenated.

I was wondering whether we can also give users the option to decide which scales to use; for example, a user specifies custom scales in the config: image_scales=[0.6, 0.4, 0.3]

  • now an image will be downscaled to these 3 scales
  • then patches are created from the high-resolution and scaled images and concatenated.

@qubvel I have looked into how this could be implemented in the code; it is doable, and I could easily make this option available (and I would prefer that), but I have to ask you as well: do you think this option should be given to the users?
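
A rough sketch of the downscaling step under discussion (function name and default scale values are illustrative, not this PR's implementation):

from typing import List, Sequence

import torch
import torch.nn.functional as F


def multi_scale_images(
    pixel_values: torch.Tensor, image_scales: Sequence[float] = (0.5, 0.25)
) -> List[torch.Tensor]:
    """Downscale a (B, C, H, W) batch to each ratio, keeping the full-resolution image first.

    (0.5, 0.25) reproduces the default medium/low resolutions; something like
    (0.6, 0.4, 0.3) would be the user-configurable variant discussed above.
    """
    scaled = [
        F.interpolate(pixel_values, scale_factor=s, mode="bilinear", align_corners=False)
        for s in image_scales
    ]
    return [pixel_values] + scaled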

@qubvel (Member) commented Nov 18, 2024

Hi @geetu040, we try to avoid overcomplicated code with lots of parameters; the general rule is to get rid of different code paths and unused params that do not differ across pretrained checkpoints. For this particular case, feel free to add it, but only if it does not introduce extra complexity to the modeling code.

@geetu040 (Author) commented Nov 25, 2024

Hi @qubvel I have a question about the image processor.

The source code from apple/depth-pro preprocesses the image in the sequence normalize -> resize; however, the conventional image processors for ViT and DPT use the sequence resize -> normalize.

This causes the two outputs to be slightly different from each other.

Do you suggest I stay with the convention and ignore the minor difference in output, or should I make the implementation exactly like the source code? I am not sure how to do the latter: the original resize function gives an error if it is simply moved above the normalization code, and using torch.nn.functional.interpolate is also not optimal, since it requires data conversions.
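
For reference, the two orderings reduced to a minimal sketch on a (B, C, H, W) float tensor, with mean/std as broadcastable tensors (illustrative only, not the PR's implementation):

import torch
import torch.nn.functional as F


def resize_then_normalize(x: torch.Tensor, size, mean, std) -> torch.Tensor:
    # conventional ViT/DPT order
    x = F.interpolate(x, size=size, mode="bilinear", align_corners=False)
    return (x - mean) / std


def normalize_then_resize(x: torch.Tensor, size, mean, std) -> torch.Tensor:
    # order used by the original Depth Pro code
    x = (x - mean) / std
    return F.interpolate(x, size=size, mode="bilinear", align_corners=False)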

Here are the outputs:

Difference in outputs

There is a slight difference; this happens because of how the image is pre-processed before being given to the model.

Source code results

ic| depth: tensor([[0.9604, 0.9329, 0.8837,  ..., 3.0123, 2.9720, 2.9517],
                   [0.9210, 0.8995, 0.8605,  ..., 3.0148, 3.0120, 3.0106],
                   [0.8811, 0.8655, 0.8366,  ..., 3.0245, 3.0473, 3.0592],
                   ...,
                   [1.2283, 1.2263, 1.2225,  ..., 1.2698, 1.2818, 1.2881],
                   [1.2228, 1.2241, 1.2266,  ..., 1.2679, 1.2806, 1.2872],
                   [1.2167, 1.2223, 1.2333,  ..., 1.2655, 1.2757, 1.2810]])
ic| depth.shape: torch.Size([2268, 3024])
ic| focallength_px: tensor(3362.0200)

HF code results

ic| predicted_depth: [tensor([[0.9727, 0.9443, 0.8937,  ..., 3.0023, 2.9608, 2.9399],
                             [0.9320, 0.9097, 0.8693,  ..., 3.0045, 3.0006, 2.9987],
                             [0.8899, 0.8737, 0.8439,  ..., 3.0129, 3.0352, 3.0469],
                             ...,
                             [1.2393, 1.2373, 1.2334,  ..., 1.2805, 1.2934, 1.3001],
                             [1.2344, 1.2356, 1.2379,  ..., 1.2802, 1.2935, 1.3004],
                             [1.2286, 1.2341, 1.2447,  ..., 1.2788, 1.2892, 1.2947]])]
ic| fov: [tensor(3383.9839)]

Difference in output images

Visually there is no difference between the two images (the input image and the rendered source-code and HF results were attached as figures).

@geetu040 (Author) commented

Also, how does the weight conversion work?

I have created the script for weight conversion, but when, and by whom, are the converted weights uploaded to the Hugging Face Hub? I would need these converted weights for the examples in the docstrings.

@geetu040 (Author) commented Feb 5, 2025

  • I have updated the code to use the latest BaseImageProcessorFast class.
  • Failing tests are unrelated.
  • PR ready for review again

@qubvel (Member) left a review comment

Thanks a lot for updating to the new Fast image processor! A few more comments:

Comment on lines +128 to +144
# DepthPro resizes image after rescaling and normalizing,
# which makes it different from BaseImageProcessorFast._preprocess
def _preprocess(
    self,
    images: List["torch.Tensor"],
    do_resize: bool,
    size: SizeDict,
    interpolation: Optional["F.InterpolationMode"],
    do_center_crop: bool,
    crop_size: SizeDict,
    do_rescale: bool,
    rescale_factor: float,
    do_normalize: bool,
    image_mean: Optional[Union[float, List[float]]],
    image_std: Optional[Union[float, List[float]]],
    return_tensors: Optional[Union[str, TensorType]],
) -> BatchFeature:
A reviewer (Member) commented:

cc @yonigozlan on a different order of preprocessing (we discussed it yesterday)

tests/models/depth_pro/test_image_processing_depth_pro.py (review thread outdated, resolved)
@qubvel (Member) commented Feb 5, 2025

@geetu040 please see #34583 (comment) re adding antialias

@geetu040 (Author) commented Feb 5, 2025

@qubvel grouping in one loop was a great suggestion here, glad you noticed that. Grouping them twice was unnecessary, since the size of the images did not change after the first grouping. Thanks for that!

This did speed up the fast processor, and in fact it's faster than the slow processor now, so I am not skipping the test anymore. I was using is_flaky before because locally the test does sometimes fail sporadically, but it seems to fail sporadically for other processors locally as well. It's working fine in the CI tests (at least for the last 2/2 workflows), so I'll avoid the flaky tag.
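
For readers of this thread, a rough sketch of the single-pass grouping idea (illustrative only; the actual fast processor uses its own helpers):

from collections import defaultdict
from typing import Dict, List, Tuple

import torch


def group_images_by_shape(
    images: List[torch.Tensor],
) -> Dict[Tuple[int, ...], Tuple[torch.Tensor, List[int]]]:
    """Group same-shaped images so each group can be stacked and processed in one batched call."""
    buckets = defaultdict(list)
    for idx, img in enumerate(images):
        buckets[tuple(img.shape)].append((idx, img))
    return {
        shape: (torch.stack([img for _, img in items]), [idx for idx, _ in items])
        for shape, items in buckets.items()
    }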

@geetu040 (Author) commented Feb 5, 2025

Everything is up to date now and ready for review again. Failing tests are unrelated.

@qubvel (Member) commented Feb 5, 2025

Thanks for applying the changes! 🤗 We’re all set to proceed with the model.

Before merging, there's just one thing left: we usually transfer the checkpoint to the organization that released the original model, in this case Apple, but only if you're okay with that.

If you're fine with it, we can go ahead with the transfer. However, before that we’ll need to:

  • Change the Hub repository name from "DepthPro" to "depth-pro-hf"
  • Update the checkpoint name in the model card snippet
  • Update the checkpoint names in the code

Let us know how you’d like to proceed! 😊

@geetu040 (Author) commented Feb 5, 2025

> Apple, but only if you're okay with that.

I am okay with that.

> If you're fine with it, we can go ahead with the transfer. However, before that we'll need to:

I have updated the Hub repository name: https://huggingface.co/geetu040/depth-pro-hf
and updated the checkpoint in the model card and code.

@geetu040 (Author) commented Feb 5, 2025

And thank you @qubvel for all the fine, detailed reviews.

@qubvel (Member) commented Feb 5, 2025

Thanks, can you please also update repo_id to "apple" everywhere? Like "apple/depth-pro-hf"

@geetu040 (Author) commented Feb 6, 2025

> Thanks, can you please also update repo_id to "apple" everywhere? Like "apple/depth-pro-hf"

updated

@geetu040 (Author) commented Feb 6, 2025

@qubvel, DepthProImageProcessingTest::test_fast_is_faster_than_slow has failed in 2 of the last 5 runs; do you think I should put the is_flaky decorator on it? It passes with very high probability, just not every time.

@qubvel (Member) commented Feb 6, 2025

Yes, let's add @is_flaky()
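
For reference, a minimal sketch of where the decorator goes (test body elided; in the real file the test comes from the image-processing test mixin):

import unittest

from transformers.testing_utils import is_flaky


class DepthProImageProcessingTest(unittest.TestCase):
    @is_flaky()
    def test_fast_is_faster_than_slow(self):
        # in the real test file this test is inherited from the shared mixin;
        # shown here only to illustrate where the decorator is applied
        ...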

@geetu040 (Author) commented Feb 6, 2025

> Yes, let's add @is_flaky()

I've updated the test and it's working fine now.

@geetu040 requested a review from qubvel on February 6, 2025, 12:14
@qubvel (Member) commented Feb 6, 2025

Thanks! Working on the checkpoint transfer.

Successfully merging this pull request may close these issues: Add support for Apple's Depth-Pro

6 participants