[ICLR 2025] SPA: 3D Spatial-Awareness Enables Effective Embodied Representation

HaoyiZhu/SPA

SPA: 3D SPatial-Awareness Enables Effective Embodied Representation

Project Page | Paper | arXiv | HuggingFace Model | Real-World Codebase | Twitter/X

Haoyi Zhu, Honghui Yang, Yating Wang, Jiange Yang, Limin Wang, Tong He

SPA is a novel representation learning framework that emphasizes the importance of 3D spatial awareness in embodied AI. It leverages differentiable neural rendering on multi-view images to endow a vanilla Vision Transformer (ViT) with intrinsic spatial understanding. We also present the most comprehensive evaluation of embodied representation learning to date, covering 268 tasks across 8 simulators with diverse policies in both single-task and language-conditioned multi-task scenarios.

🥳 NEWS:

  • Jan. 2025: SPA is accepted by ICLR 2025!

  • Oct. 2024: Codebase and pre-trained checkpoints are released! Paper is available on arXiv.

🔭 Project Structure

Our codebase draws significant inspiration from the excellent Lightning Hydra Template. The directory structure of this project is organized as follows:

├── .github                   <- GitHub Actions workflows
│
├── configs                   <- Hydra configs
│   ├── callbacks                 <- Callbacks configs
│   ├── data                      <- Data configs
│   ├── debug                     <- Debugging configs
│   ├── experiment                <- Experiment configs
│   ├── extras                    <- Extra utilities configs
│   ├── hydra                     <- Hydra configs
│   ├── local                     <- Local configs
│   ├── logger                    <- Logger configs
│   ├── model                     <- Model configs
│   ├── paths                     <- Project paths configs
│   ├── trainer                   <- Trainer configs
│   │
│   └── train.yaml                <- Main config for training
│
├── data                      <- Project data
│
├── logs                      <- Logs generated by hydra and lightning loggers
│
├── scripts                   <- Shell or Python scripts
│
├── spa                       <- Source code of SPA
│   ├── data                      <- Data scripts
│   ├── models                    <- Model scripts
│   ├── utils                     <- Utility scripts
│   │
│   └── train.py                  <- Run SPA pre-training
│
├── .gitignore                <- List of files ignored by git
├── .project-root             <- File for inferring the position of project root directory
├── requirements.txt          <- File for installing python dependencies
├── setup.py                  <- File for installing project as a package
└── README.md

🔨 Installation

Basics

# clone project
git clone https://github.com/HaoyiZhu/SPA.git
cd SPA

# create conda environment
conda create -n spa python=3.11 -y
conda activate spa

# install PyTorch, please refer to https://pytorch.org/ for other CUDA versions
# e.g. cuda 11.8:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# install basic packages
pip3 install -r requirements.txt

SPA

# (optional) if you want to use SPA's volume decoder
cd libs/spa-ops
pip install -e .
cd ../..

# install SPA, so that you can import from anywhere
pip install -e .

🌟 Usage

Example of Using SPA Pre-trained Encoder

We provide pre-trained SPA weights for feature extraction. The checkpoints are available on 🤗 Hugging Face. You don't need to download the weights manually, as SPA will handle this automatically if needed.

import torch

from spa.models import spa_vit_base_patch16, spa_vit_large_patch16

image = torch.rand((1, 3, 224, 224))  # range in [0, 1]

# Example usage of SPA-Large (recommended)
# or you can use `spa_vit_base_patch16` for SPA-base
model = spa_vit_large_patch16(pretrained=True)
model.eval()

# Freeze the model
model.freeze()

# (Recommended) move to CUDA
image = image.cuda()
model = model.cuda()

# Obtain the [CLS] token
cls_token = model(image)  # torch.Size([1, 1024])

# Obtain the reshaped feature map concatenated with [CLS] token
feature_map_cat_cls = model(
    image, feature_map=True, cat_cls=True
)  # torch.Size([1, 2048, 14, 14])

# Obtain the reshaped feature map without [CLS] token
feature_map_wo_cls = model(
    image, feature_map=True, cat_cls=False
)  # torch.Size([1, 1024, 14, 14])

Note: The inputs will be automatically resized to 224 x 224 and normalized within the SPA ViT encoder.
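The output shapes in the example above follow from simple patch arithmetic. The sketch below is not part of the SPA API: `spa_output_shape` is a hypothetical helper, and it assumes that `cat_cls=True` doubles the channel count by concatenating the [CLS] embedding along the channel axis, as the 1024 and 2048 channel counts in the comments suggest. The width of 1024 for SPA-Large is taken from those comments; any other width you pass in (for example 768, the usual ViT-Base width) is an assumption.

```python
def spa_output_shape(batch, embed_dim, img_size=224, patch_size=16,
                     feature_map=False, cat_cls=False):
    """Predict the shape returned by the encoder call (hypothetical helper)."""
    if not feature_map:
        # Default call: one [CLS] embedding per image.
        return (batch, embed_dim)
    grid = img_size // patch_size  # 224 // 16 = 14 patches per side
    # Assumption: cat_cls concatenates the [CLS] embedding onto every patch feature.
    channels = 2 * embed_dim if cat_cls else embed_dim
    return (batch, channels, grid, grid)


print(spa_output_shape(1, 1024))                                    # (1, 1024)
print(spa_output_shape(1, 1024, feature_map=True, cat_cls=True))    # (1, 2048, 14, 14)
print(spa_output_shape(1, 1024, feature_map=True, cat_cls=False))   # (1, 1024, 14, 14)
```

Because inputs are resized to 224 x 224 inside the encoder, the 14 x 14 grid should hold regardless of the resolution you feed in.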

🚀 Pre-Training

Example of Pre-Training on ScanNet

We give an example of pre-training SPA on the ScanNet v2 dataset.

  1. Prepare the dataset

    • Download the ScanNet v2 dataset.
    • Pre-process and extract RGB-D images following PonderV2. The preprocessed data should be put under data/scannet/.
    • Pre-generate metadata for fast data loading. The following command will generate metadata under data/scannet/metadata.
      python scripts/generate_scannet_metadata.py
  2. Run the following command for pre-training. Remember to modify hyper-parameters such as number of nodes and GPU devices according to your machines.

    python spa/train.py experiment=spa_pretrain_vitl trainer.num_nodes=5 trainer.devices=8
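To adapt the command to a different cluster, only the two trainer overrides need to change. A minimal sketch follows; `pretrain_command` is a hypothetical helper, not part of the repository. It assembles the same command line and reports the total number of GPU processes, assuming one process per device, which is how Lightning's DDP strategy normally launches.

```python
import shlex


def pretrain_command(num_nodes, devices, experiment="spa_pretrain_vitl"):
    """Build the pre-training command and the resulting world size."""
    args = [
        "python", "spa/train.py",
        f"experiment={experiment}",
        f"trainer.num_nodes={num_nodes}",
        f"trainer.devices={devices}",
    ]
    world_size = num_nodes * devices  # total GPU processes across all nodes
    return shlex.join(args), world_size


cmd, world_size = pretrain_command(num_nodes=5, devices=8)
print(cmd)         # python spa/train.py experiment=spa_pretrain_vitl trainer.num_nodes=5 trainer.devices=8
print(world_size)  # 40
```

If you shrink the world size, you may also want to revisit batch-size and learning-rate settings in the experiment config, since the effective batch size scales with the number of processes.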

💡 SPA Large-Scale Evaluation

VC-1 Evaluation

We evaluate on VC-1's MetaWorld, Adroit, DMControl, and TriFinger benchmarks. We also maintain a forked version of the VC-1 repository that includes the code and configuration for evaluating SPA.

  1. Clone the forked VC-1 repo and follow the instructions in the CortexBench README to set up the MuJoCo and TriFinger environments and to download the required datasets.

  2. Create a configuration file <spa_model>.yaml for SPA (e.g., spa_vit_large.yaml for SPA-Large) in <vc-1_path>/vc_models/src/vc_models/conf/model.

  3. To run the VC-1 evaluation for SPA, pass the model config as a parameter (embedding=<spa_model>) to each of the benchmarks in cortexbench.

🎉 Gotchas

Override any config parameter from command line

This codebase is based on Hydra, which allows for convenient configuration overriding:

python spa/train.py trainer.max_epochs=20 seed=300

Note: You can also add new parameters with the + prefix.

python spa/train.py +some_new_param=some_new_value
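The difference between a plain override and a `+` override can be illustrated with a toy model. The sketch below is not Hydra code, and `apply_override` is a hypothetical function; it only mimics the rule that `key=value` changes a key that already exists in the config, while `+key=value` adds a key that is not there yet. Values are kept as strings for simplicity.

```python
def apply_override(cfg, override):
    """Apply one 'a.b=value' or '+a.b=value' override to a nested dict (toy model)."""
    adding = override.startswith("+")
    key_path, value = override.lstrip("+").split("=", 1)
    *parents, leaf = key_path.split(".")
    node = cfg
    for name in parents:
        # When adding, create missing parent groups; otherwise they must exist.
        node = node.setdefault(name, {}) if adding else node[name]
    if adding and leaf in node:
        raise KeyError(f"'{key_path}' already exists; drop the '+' to override it")
    if not adding and leaf not in node:
        raise KeyError(f"'{key_path}' is not in the config; use '+{key_path}=...' to add it")
    node[leaf] = value
    return cfg


cfg = {"trainer": {"max_epochs": "10"}, "seed": "42"}
apply_override(cfg, "trainer.max_epochs=20")           # existing key: plain override
apply_override(cfg, "+some_new_param=some_new_value")  # new key: needs '+'
print(cfg)
```

This is why several of the trainer flags in this section are written with a leading `+`: they are not present in the default trainer config and have to be added rather than overridden.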

Train on CPU, GPU, multi-GPU and TPU

# train on CPU
python spa/train.py trainer=cpu

# train on 1 GPU
python spa/train.py trainer=gpu

# train on TPU
python spa/train.py +trainer.tpu_cores=8

# train with DDP (Distributed Data Parallel) (4 GPUs)
python spa/train.py trainer=ddp trainer.devices=4

# train with DDP (Distributed Data Parallel) (8 GPUs in total: 4 per node, 2 nodes)
python spa/train.py trainer=ddp trainer.devices=4 trainer.num_nodes=2

# simulate DDP on CPU processes
python spa/train.py trainer=ddp_sim trainer.devices=2

# accelerate training on mac
python spa/train.py trainer=mps

Train with mixed precision

# train with PyTorch native automatic mixed precision (AMP)
python spa/train.py trainer=gpu +trainer.precision=16

Use different tricks available in PyTorch Lightning

# gradient clipping may be enabled to avoid exploding gradients
python spa/train.py trainer.gradient_clip_val=0.5

# run validation loop 4 times during a training epoch
python spa/train.py +trainer.val_check_interval=0.25

# accumulate gradients
python spa/train.py trainer.accumulate_grad_batches=10

# terminate training after 12 hours
python spa/train.py +trainer.max_time="00:12:00:00"

Note: PyTorch Lightning provides about 40+ useful trainer flags.
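The `max_time` value used above follows Lightning's `DD:HH:MM:SS` string format (days, hours, minutes, seconds). A small sketch of how such a string maps to a duration; `parse_max_time` is a hypothetical helper written for illustration, not a Lightning function.

```python
from datetime import timedelta


def parse_max_time(value):
    """Convert a 'DD:HH:MM:SS' string into a timedelta."""
    days, hours, minutes, seconds = (int(part) for part in value.split(":"))
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


limit = parse_max_time("00:12:00:00")
print(limit)                         # 12:00:00
print(limit.total_seconds() / 3600)  # 12.0
```

Note that the first field is days, so `"12:00:00:00"` would mean twelve days rather than twelve hours.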

Easily debug

# runs 1 epoch in default debugging mode
# changes logging directory to `logs/debugs/...`
# sets level of all command line loggers to 'DEBUG'
# enforces debug-friendly configuration
python spa/train.py debug=default

# run 1 train, val and test loop, using only 1 batch
python spa/train.py debug=fdr

# print execution time profiling
python spa/train.py debug=profiler

# try overfitting to 1 batch
python spa/train.py debug=overfit

# raise exception if there are any numerical anomalies in tensors, like NaN or +/-inf
python spa/train.py +trainer.detect_anomaly=true

# use only 20% of the data
python spa/train.py +trainer.limit_train_batches=0.2 \
+trainer.limit_val_batches=0.2 +trainer.limit_test_batches=0.2

Note: Visit configs/debug/ for different debugging configs.

Resume training from checkpoint

python spa/train.py ckpt_path="/path/to/ckpt/name.ckpt"

Note: Checkpoint can be either path or URL.

Note: Currently loading ckpt doesn't resume logger experiment, but it will be supported in future Lightning release.

Create a sweep over hyperparameters

# this will run 9 experiments one after the other,
# each with a different combination of seed and learning rate
python spa/train.py -m seed=100,200,300 model.optimizer.lr=0.0001,0.00005,0.00001

Note: Hydra composes configs lazily at job launch time. If you change code or configs after launching a job/sweep, the final composed configs might be impacted.
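The count of 9 experiments in the sweep above is the Cartesian product of the two comma-separated lists. The sketch below enumerates the combinations with the standard library only; it illustrates the grid, not Hydra's sweeper itself, and the order in which Hydra launches the jobs is not asserted here.

```python
from itertools import product

# Swept values, kept as strings exactly as they appear on the command line.
seeds = ["100", "200", "300"]
learning_rates = ["0.0001", "0.00005", "0.00001"]

# One job per (seed, learning rate) pair: 3 x 3 combinations.
jobs = [
    f"seed={seed} model.optimizer.lr={lr}"
    for seed, lr in product(seeds, learning_rates)
]

for index, overrides in enumerate(jobs):
    print(index, overrides)
print(len(jobs))  # 9
```

Adding a third swept parameter multiplies the job count again, so grids grow quickly.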

Execute all experiments from folder

python spa/train.py -m 'exp_maniskill2_act_policy/maniskill2_task@maniskill2_task=glob(*)'

Note: Hydra provides special syntax for controlling behavior of multiruns. Learn more here. The command above executes all task experiments from configs/exp_maniskill2_act_policy/maniskill2_task.

Execute run for multiple different seeds

python spa/train.py -m seed=100,200,300 trainer.deterministic=True

Note: trainer.deterministic=True makes PyTorch more deterministic, at some cost to training speed.

For more instructions, refer to the official documentation for PyTorch Lightning, Hydra, and Lightning Hydra Template.

📚 License

This repository is released under the MIT license.

✨ Acknowledgement

Our work is primarily built upon PointCloudMatters, PonderV2, UniPAD, PyTorch Lightning, Hydra, Lightning Hydra Template, RLBench, PerAct, LIBERO, Meta-World, ACT, Diffusion Policy, DP3, TIMM, VC1, and R3M. We extend our gratitude to all these authors for their generously open-sourced code and their significant contributions to the community.

Contact Haoyi Zhu if you have any questions or suggestions.

📝 Citation

@article{zhu2024spa,
    title = {SPA: 3D Spatial-Awareness Enables Effective Embodied Representation},
    author = {Zhu, Haoyi and Yang, Honghui and Wang, Yating and Yang, Jiange and Wang, Limin and He, Tong},
    journal = {arXiv preprint arXiv:2410.08208},
    year = {2024},
}