Project Page | Paper | arXiv | HuggingFace Model | Real-World Codebase | Twitter/X
Haoyi Zhu, Honghui Yang, Yating Wang, Jiange Yang, Limin Wang, Tong He
SPA is a novel representation learning framework that emphasizes the importance of 3D spatial awareness in embodied AI. It leverages differentiable neural rendering on multi-view images to endow a vanilla Vision Transformer (ViT) with intrinsic spatial understanding. We also present the most comprehensive evaluation of embodied representation learning to date, covering 268 tasks across 8 simulators with diverse policies in both single-task and language-conditioned multi-task scenarios.
🥳 NEWS:
- Jan. 2025: SPA is accepted by ICLR 2025!
- Oct. 2024: Codebase and pre-trained checkpoints are released! Paper is available on arXiv.
- Project Structure
- Installation
- Usage
- Pre-Training
- SPA Large-Scale Evaluation
- Gotchas
- License
- Acknowledgement
- Citation
Our codebase draws significant inspiration from the excellent Lightning Hydra Template. The directory structure of this project is organized as follows:
├── .github                   <- GitHub Actions workflows
│
├── configs                   <- Hydra configs
│   ├── callbacks                <- Callbacks configs
│   ├── data                     <- Data configs
│   ├── debug                    <- Debugging configs
│   ├── experiment               <- Experiment configs
│   ├── extras                   <- Extra utilities configs
│   ├── hydra                    <- Hydra configs
│   ├── local                    <- Local configs
│   ├── logger                   <- Logger configs
│   ├── model                    <- Model configs
│   ├── paths                    <- Project paths configs
│   ├── trainer                  <- Trainer configs
│   │
│   └── train.yaml               <- Main config for training
│
├── data                      <- Project data
│
├── logs                      <- Logs generated by hydra and lightning loggers
│
├── scripts                   <- Shell or Python scripts
│
├── spa                       <- Source code of SPA
│   ├── data                     <- Data scripts
│   ├── models                   <- Model scripts
│   ├── utils                    <- Utility scripts
│   │
│   └── train.py                 <- Run SPA pre-training
│
├── .gitignore                <- List of files ignored by git
├── .project-root             <- File for inferring the position of project root directory
├── requirements.txt          <- File for installing python dependencies
├── setup.py                  <- File for installing project as a package
└── README.md
Basics
# clone project
git clone https://github.com/HaoyiZhu/SPA.git
cd SPA
# create conda environment
conda create -n spa python=3.11 -y
conda activate spa
# install PyTorch, please refer to https://pytorch.org/ for other CUDA versions
# e.g. cuda 11.8:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# install basic packages
pip3 install -r requirements.txt
SPA
# (optional) if you want to use SPA's volume decoder
cd libs/spa-ops
pip install -e .
cd ../..
# install SPA, so that you can import from anywhere
pip install -e .
Example of Using SPA Pre-trained Encoder
We provide pre-trained SPA weights for feature extraction. The checkpoints are available on 🤗 Hugging Face. You don't need to download the weights manually, as SPA will handle this automatically if needed.
import torch
from spa.models import spa_vit_base_patch16, spa_vit_large_patch16
image = torch.rand((1, 3, 224, 224)) # range in [0, 1]
# Example usage of SPA-Large (recommended)
# or you can use `spa_vit_base_patch16` for SPA-base
model = spa_vit_large_patch16(pretrained=True)
model.eval()
# Freeze the model
model.freeze()
# (Recommended) move to CUDA
image = image.cuda()
model = model.cuda()
# Obtain the [CLS] token
cls_token = model(image) # torch.Size([1, 1024])
# Obtain the reshaped feature map concatenated with [CLS] token
feature_map_cat_cls = model(
image, feature_map=True, cat_cls=True
) # torch.Size([1, 2048, 14, 14])
# Obtain the reshaped feature map without [CLS] token
feature_map_wo_cls = model(
image, feature_map=True, cat_cls=False
) # torch.Size([1, 1024, 14, 14])
Note: The inputs will be automatically resized to 224 x 224 and normalized within the SPA ViT encoder.
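The output shapes noted in the comments above follow from ordinary ViT patch arithmetic. The sketch below is a minimal illustration under stated assumptions: a patch size of 16 (as the `patch16` suffix suggests), a 224 x 224 input after resizing, and `cat_cls=True` doubling the channel count by concatenating the [CLS] embedding onto each patch feature (inferred from the 2048-channel shape above). `expected_output_shape` is a hypothetical helper, not part of the SPA API.

```python
def expected_output_shape(embed_dim, batch=1, image_size=224, patch_size=16,
                          feature_map=False, cat_cls=False):
    """Predict the encoder output shape from ViT patch arithmetic (hypothetical helper)."""
    if not feature_map:
        # Only the [CLS] embedding is returned.
        return (batch, embed_dim)
    grid = image_size // patch_size  # 224 // 16 = 14 patches per side
    # Assumption: cat_cls appends the [CLS] embedding along the channel axis.
    channels = embed_dim * 2 if cat_cls else embed_dim
    return (batch, channels, grid, grid)


# SPA-Large produces 1024-dimensional embeddings, per the shapes listed above.
print(expected_output_shape(1024))                                   # (1, 1024)
print(expected_output_shape(1024, feature_map=True, cat_cls=True))   # (1, 2048, 14, 14)
print(expected_output_shape(1024, feature_map=True, cat_cls=False))  # (1, 1024, 14, 14)
```

This can be handy for sizing the first layer of a downstream policy head before loading the actual checkpoint.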
Example of Pre-Training on ScanNet
We give an example of pre-training SPA on the ScanNet v2 dataset.
- Prepare the dataset
  - Download the ScanNet v2 dataset.
  - Pre-process and extract RGB-D images following PonderV2. The preprocessed data should be put under data/scannet/.
  - Pre-generate metadata for fast data loading. The following command will generate metadata under data/scannet/metadata.
    python scripts/generate_scannet_metadata.py
- Run the following command for pre-training. Remember to adjust hyper-parameters such as the number of nodes and GPU devices to match your machines.
python spa/train.py experiment=spa_pretrain_vitl trainer.num_nodes=5 trainer.devices=8
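When adapting `trainer.num_nodes` and `trainer.devices` to your own cluster, it helps to keep track of how many GPU processes the command requests in total, since DDP runs one process per device on each node. The sketch below is only illustrative: the helper names are hypothetical and the per-GPU batch size of 4 is a made-up placeholder, not a value taken from the SPA configs.

```python
def world_size(num_nodes: int, devices: int) -> int:
    """Total number of DDP processes (one per GPU across all nodes)."""
    return num_nodes * devices


def global_batch_size(per_gpu_batch: int, num_nodes: int, devices: int,
                      accumulate_grad_batches: int = 1) -> int:
    """Number of samples that contribute to a single optimizer step."""
    return per_gpu_batch * world_size(num_nodes, devices) * accumulate_grad_batches


# The example command above requests 5 nodes with 8 GPUs each.
print(world_size(num_nodes=5, devices=8))            # 40
# Placeholder per-GPU batch size of 4 (an assumption for illustration only).
print(global_batch_size(4, num_nodes=5, devices=8))  # 160
```

If you train on fewer GPUs, the global batch size shrinks accordingly, so the learning rate or gradient accumulation may need adjusting.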
VC-1 Evaluation
We evaluate on VC-1's MetaWorld, Adroit, DMControl, and TriFinger benchmarks. We also maintain a forked version of the VC-1 repository that includes code and configuration for evaluating SPA.
- Clone the forked VC-1 repo, and follow the instructions in the CortexBench README to set up the MuJoCo and TriFinger environments and download the required datasets.
- Create a configuration for SPA, <spa_model>.yaml (e.g., using SPA-Large as in spa_vit_large.yaml), in <vc-1_path>/vc_models/src/vc_models/conf/model.
- To run the VC-1 evaluation for SPA, specify the model config as a parameter (embedding=<spa_model>) for each of the benchmarks in cortexbench.
Override any config parameter from command line
This codebase is based on Hydra, which allows for convenient configuration overriding:
python spa/train.py trainer.max_epochs=20 seed=300
Note: You can also add new parameters with the + sign.
python spa/train.py +some_new_param=some_new_value
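If you launch runs from a script, the two override forms can be assembled programmatically: `key=value` changes a parameter that already exists in the config, while `+key=value` adds a new one. The following is a minimal sketch; `build_overrides` is a hypothetical helper, and the entry point `spa/train.py` is taken from the project structure listed earlier.

```python
import shlex


def build_overrides(changed, added=None):
    """Turn dicts into Hydra command-line overrides (hypothetical helper).

    `changed` holds keys that already exist in the config;
    `added` holds new keys, which Hydra expects to be prefixed with '+'.
    """
    args = [f"{key}={value}" for key, value in changed.items()]
    args += [f"+{key}={value}" for key, value in (added or {}).items()]
    return args


cmd = ["python", "spa/train.py"] + build_overrides(
    {"trainer.max_epochs": 20, "seed": 300},
    {"some_new_param": "some_new_value"},
)
print(shlex.join(cmd))
# python spa/train.py trainer.max_epochs=20 seed=300 +some_new_param=some_new_value
```

The resulting list can be passed directly to `subprocess.run(cmd)` without going through a shell.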
Train on CPU, GPU, multi-GPU and TPU
# train on CPU
python spa/train.py trainer=cpu
# train on 1 GPU
python spa/train.py trainer=gpu
# train on TPU
python spa/train.py +trainer.tpu_cores=8
# train with DDP (Distributed Data Parallel) (4 GPUs)
python spa/train.py trainer=ddp trainer.devices=4
# train with DDP (Distributed Data Parallel) (8 GPUs, 2 nodes)
python spa/train.py trainer=ddp trainer.devices=4 trainer.num_nodes=2
# simulate DDP on CPU processes
python spa/train.py trainer=ddp_sim trainer.devices=2
# accelerate training on mac
python spa/train.py trainer=mps
Train with mixed precision
# train with pytorch native automatic mixed precision (AMP)
python spa/train.py trainer=gpu +trainer.precision=16
Use different tricks available in PyTorch Lightning
# gradient clipping may be enabled to avoid exploding gradients
python spa/train.py trainer.gradient_clip_val=0.5
# run validation loop 4 times during a training epoch
python spa/train.py +trainer.val_check_interval=0.25
# accumulate gradients
python spa/train.py trainer.accumulate_grad_batches=10
# terminate training after 12 hours
python spa/train.py +trainer.max_time="00:12:00:00"
Note: PyTorch Lightning provides 40+ useful trainer flags.
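The `max_time` string in the last command uses Lightning's `DD:HH:MM:SS` format, which is easy to misread as hours first. The sketch below shows how such a string maps to a duration; `parse_max_time` is a hypothetical helper written for illustration and is not how Lightning parses the value internally.

```python
from datetime import timedelta


def parse_max_time(value: str) -> timedelta:
    """Convert a 'DD:HH:MM:SS' string into a timedelta (illustrative helper)."""
    days, hours, minutes, seconds = (int(part) for part in value.split(":"))
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


print(parse_max_time("00:12:00:00"))                          # 12:00:00
print(parse_max_time("01:06:30:00") == timedelta(hours=30, minutes=30))  # True
```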
Easily debug
# runs 1 epoch in default debugging mode
# changes logging directory to `logs/debugs/...`
# sets level of all command line loggers to 'DEBUG'
# enforces debug-friendly configuration
python spa/train.py debug=default
# run 1 train, val and test loop, using only 1 batch
python spa/train.py debug=fdr
# print execution time profiling
python spa/train.py debug=profiler
# try overfitting to 1 batch
python spa/train.py debug=overfit
# raise exception if there are any numerical anomalies in tensors, like NaN or +/-inf
python spa/train.py +trainer.detect_anomaly=true
# use only 20% of the data
python spa/train.py +trainer.limit_train_batches=0.2 \
+trainer.limit_val_batches=0.2 +trainer.limit_test_batches=0.2
Note: Visit configs/debug/ for different debugging configs.
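To estimate how many batches a run with `limit_train_batches` will actually see, the following sketch may help. It assumes that a float is treated as a fraction of the dataloader length (truncated toward zero) and an integer as an absolute batch count; `limited_batches` is a hypothetical helper, and the dataloader length of 1000 is a placeholder.

```python
def limited_batches(total_batches: int, limit) -> int:
    """Estimate how many batches run under a limit_*_batches setting.

    Assumption: floats are fractions of the dataloader length,
    integers are absolute batch counts.
    """
    if isinstance(limit, float):
        return int(total_batches * limit)
    return min(total_batches, limit)


print(limited_batches(1000, 0.2))  # 200
print(limited_batches(1000, 50))   # 50
```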
Resume training from checkpoint
python spa/train.py ckpt_path="/path/to/ckpt/name.ckpt"
Note: Checkpoint can be either path or URL.
Note: Currently loading ckpt doesn't resume logger experiment, but it will be supported in future Lightning release.
Create a sweep over hyperparameters
# this will run 9 experiments one after the other,
# each with different combination of seed and learning rate
python spa/train.py -m seed=100,200,300 model.optimizer.lr=0.0001,0.00005,0.00001
Note: Hydra composes configs lazily at job launch time. If you change code or configs after launching a job/sweep, the final composed configs might be impacted.
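The count of 9 experiments comes from the Cartesian product of the two comma-separated lists passed to `-m`. The sketch below reproduces that grid in plain Python as a sanity check before launching a sweep; it only mirrors the combinatorics and does not call Hydra itself.

```python
from itertools import product

seeds = [100, 200, 300]
learning_rates = [0.0001, 0.00005, 0.00001]

# One job per (seed, learning rate) pair, as in a Hydra multirun grid.
jobs = [
    {"seed": seed, "model.optimizer.lr": lr}
    for seed, lr in product(seeds, learning_rates)
]

print(len(jobs))  # 9
for job in jobs:
    print(" ".join(f"{key}={value}" for key, value in job.items()))
```

Adding a third swept parameter multiplies the job count again, so it is worth checking the grid size before submitting.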
Execute all experiments from folder
python spa/train.py -m 'exp_maniskill2_act_policy/maniskill2_task@maniskill2_task=glob(*)'
Note: Hydra provides special syntax for controlling behavior of multiruns. Learn more here. The command above executes all task experiments from configs/exp_maniskill2_act_policy/maniskill2_task.
Execute run for multiple different seeds
python spa/train.py -m seed=100,200,300 trainer.deterministic=True
Note: trainer.deterministic=True makes PyTorch more deterministic but may slow down training.
For more instructions, refer to the official documentation for PyTorch Lightning, Hydra, and Lightning Hydra Template.
This repository is released under the MIT license.
Our work is primarily built upon PointCloudMatters, PonderV2, UniPAD, PyTorch Lightning, Hydra, Lightning Hydra Template, RLBench, PerAct, LIBERO, Meta-World, ACT, Diffusion Policy, DP3, TIMM, VC1, R3M. We extend our gratitude to all these authors for their generously open-sourced code and their significant contributions to the community.
Contact Haoyi Zhu if you have any questions or suggestions.
@article{zhu2024spa,
title = {SPA: 3D Spatial-Awareness Enables Effective Embodied Representation},
author = {Zhu, Haoyi and Yang, Honghui and Wang, Yating and Yang, Jiange and Wang, Limin and He, Tong},
journal = {arXiv preprint arXiv:2410.08208},
year = {2024},
}