Dynamem is an open-vocabulary mobile manipulation system that works lifelong in unseen environments. It can continuously process text queries of the form "pick up A and place it on B" (e.g. "pick up the apple and place it on the plate").
Compared to the Stretch AI agent mentioned here, Dynamem can
- continuously update its semantic memory when it observes changes in the environment, which allows the system to work lifelong in homes without rescanning the environment;
- pick up more objects, especially bowl-like objects.
However, there are still reasons why you might sometimes prefer the AI agent:
- Dynamem does an open-loop pick, which requires the robot URDF to be very well calibrated, as the robot does not take new observations to correct itself once the action plan is generated.
- Dynamem uses AnyGrasp, a closed-source gripper pose prediction model. Some researchers or companies might not be allowed, or able, to use it.
Click to follow the link to YouTube:
The video above shows Dynamem running in the NYU kitchen.
Dynamem consists of three components: navigation, picking, and placing. To complete "pick up A and place it on B", it calls four commands sequentially (a sketch of the full flow follows the command list):
navigate(A)
pick(A)
navigate(B)
place(B)
Besides these commands, Dynamem also provides an exploration module:
explore()
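Below is a minimal, hypothetical sketch (not the shipped implementation) of how a single "pick up A and place it on B" query maps onto these calls; the robot object and its methods are placeholder names for illustration only.

# Hypothetical sketch of Dynamem's command sequence; `robot` and its methods
# are placeholder names, not the real Stretch AI API.
def run_ovmm_task(robot, object_name, receptacle_name):
    robot.navigate(object_name)      # navigate(A): find and approach the object
    robot.pick(object_name)          # pick(A): grasp it
    robot.navigate(receptacle_name)  # navigate(B): approach the receptacle
    robot.place(receptacle_name)     # place(B): drop the object on it

def run_exploration(robot, num_iters=5):
    # Exploration can be run on its own to build or refresh the memory.
    for _ in range(num_iters):
        robot.explore()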
Dynamem stores two voxelized point clouds for navigation and exploration. The first point cloud is used to generate the obstacle map for A* path planning, while the other stores vision-language features for visual grounding and for generating the value map used in exploration.
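As a rough illustration of this split, here is a hedged sketch of a memory holding both maps; the real Dynamem data structures differ, and the names below are assumptions.

# Illustrative only: one voxel grid for occupancy (obstacle map for A*) and
# one for vision-language features (visual grounding and the exploration
# value map). Not the actual Dynamem classes.
import numpy as np

class DynamemMemorySketch:
    def __init__(self, voxel_size=0.05):
        self.voxel_size = voxel_size
        self.occupancy = {}  # {(i, j, k): True}, collapsed to 2D for A*
        self.features = {}   # {(i, j, k): np.ndarray of VL features}

    def voxelize(self, xyz):
        return tuple(np.floor(np.asarray(xyz) / self.voxel_size).astype(int))

    def add_point(self, xyz, feature):
        key = self.voxelize(xyz)
        self.occupancy[key] = True
        self.features[key] = feature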
The Dynamem paper introduces three ways to query the semantic memory for visual grounding; in this stack we only set up querying with vision-language feature similarity and querying with a hybrid of mLLMs and vision-language feature similarity. The first strategy is faster, while the second has better performance. By default, the stack uses VL feature similarity for visual grounding.
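The default strategy amounts to a nearest-neighbor search in feature space. A minimal sketch, assuming voxel features and a text embedding from the same vision-language model (function names are placeholders):

# Rank stored vision-language features by cosine similarity to the query
# embedding and return the best-matching voxel. Illustrative only.
import numpy as np

def query_memory(text_feature, voxel_features):
    best_voxel, best_score = None, -np.inf
    for voxel, feat in voxel_features.items():
        score = float(np.dot(text_feature, feat) /
                      (np.linalg.norm(text_feature) * np.linalg.norm(feat) + 1e-8))
        if score > best_score:
            best_voxel, best_score = voxel, score
    return best_voxel, best_score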
For exploration, we found that commonly used frontier-based exploration (FBE) is not suitable for dynamic environments, because obstacles might be moved around, creating new frontiers, and already scanned portions of the room might also change. We therefore introduced a value-based exploration that assigns every point in the 2D map a heuristic value estimating how worthwhile it is to explore that point. A detailed analysis is given in the Dynamem paper.
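The exact heuristic is defined in the paper; the sketch below only conveys the general idea under assumed terms, e.g. rewarding cells that have not been observed recently and preferring cells with some clearance from obstacles.

# Illustrative value map, not the paper's actual formula: older observations
# and open space score higher; the robot explores toward the argmax.
import numpy as np

def exploration_value(last_seen, obstacle_dist, now, w_stale=1.0, w_clear=0.5):
    # last_seen[i, j]: timestamp when cell (i, j) was last observed
    # obstacle_dist[i, j]: distance from cell (i, j) to the nearest obstacle
    staleness = now - last_seen
    clearance = np.clip(obstacle_dist, 0.0, 2.0)
    return w_stale * staleness + w_clear * clearance

def pick_exploration_target(value_map):
    return np.unravel_index(np.argmax(value_map), value_map.shape)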
Dynamem has two manipulation systems: the Stretch AI visual servoing code described in the LLM agent documentation, and OK-Robot manipulation.
Instructions for AnyGrasp manipulation are provided here and instructions for visual servoing manipulation are provided here.
The high-level idea of AnyGrasp picking is as follows (a sketch of the filtering step follows the list):
- Transform the RGBD image from the Stretch head camera into an RGB point cloud.
- AnyGrasp proposes a set of collision-free gripper poses given the RGB point cloud.
- Use OWLv2 and SAMv2 to keep only the gripper poses that actually manipulate the target object.
- Transform the selected 6-DoF pose into gripper actions using the URDF.
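Here is a hedged sketch of the third step, keeping only grasp proposals whose grasp point projects into the target object's segmentation mask; the grasp and camera interfaces are placeholders, not AnyGrasp's or Stretch AI's real APIs.

# Keep only grasps whose 3D grasp point lands inside the object mask when
# projected into the image. Illustrative stand-in code.
import numpy as np

def project_to_image(xyz, K):
    # Pinhole projection of a camera-frame point to pixel coordinates.
    x, y, z = xyz
    return int(K[0, 0] * x / z + K[0, 2]), int(K[1, 1] * y / z + K[1, 2])

def filter_grasps_by_mask(grasps, mask, K):
    # grasps: objects with a .translation attribute (3,) in the camera frame
    kept = []
    for g in grasps:
        u, v = project_to_image(g.translation, K)
        if 0 <= v < mask.shape[0] and 0 <= u < mask.shape[1] and mask[v, u]:
            kept.append(g)
    return kept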
Placing is comparatively simple: all you need to do is segment the target receptacle in the image and select a middle point of the mask to drop the object on.
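A minimal sketch of that placing heuristic, assuming a binary receptacle mask, a depth image, and camera intrinsics K (all names are placeholders):

# Take a central pixel of the receptacle mask and back-project it with depth
# to get a 3D drop point in the camera frame. Illustrative only.
import numpy as np

def pick_drop_point(mask, depth, K):
    vs, us = np.nonzero(mask)                       # pixels of the receptacle
    v, u = int(np.median(vs)), int(np.median(us))   # a "middle" pixel
    z = float(depth[v, u])
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.array([x, y, z])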
Compared to the visual servoing manipulation in the LLM agent, the advantages of the AnyGrasp manipulation system include:
- More general purpose, handling objects of different shapes, such as bowls and bananas.
The disadvantages include:
- Open loop, so it cannot recover from controller errors.
- Reliance on accurate robot calibration and URDF.
Follow these instructions to run Dynamem. SLAM and control code is supposed to run on the robot, while the perception models are supposed to run on the workstation (e.g. a laptop or a Lambda machine; they can also run on the robot, but this is not recommended).
Clone the Stretch AI repo on BOTH your robot and your workstation:
git clone https://github.com/hello-robot/stretch_ai.git --recursive
cd stretch_ai
Once you turn on the Stretch robot, you should first free the robot process and home it:
stretch_free_robot_process.py
stretch_robot_home.py
If you have already run these commands to start up the robot, you may move on to the next step.
To run the navigation system of Dynamem, you first need to install the environment with this command:
bash ./install.sh
Next, set up your robot launch files. Please follow the instructions in the Stretch AI startup guide to set up either Docker or ROS2.
Then we launch SLAM on the robot.
If you choose to install with ROS2, run
ros2 launch stretch_ros2_bridge server.launch.py
Or, if you choose to use Docker, run
bash ./scripts/run_stretch_ai_ros2_bridge_server.sh --update
For more information on how to launch your robot, see the Stretch AI startup guide.
Most of the AI code (e.g. VLMs, mLLMs) should run on the workstation.
You first need to install the conda environment on the workstation; we recommend running:
./install.sh --no-version
mamba activate stretch_ai
If you use visual servoing manipulation, you will also need to install SAM2:
cd third_party/segment-anything-2
pip install -e .
If you use AnyGrasp manipulation, please refer to these instructions for installation; you will need to create a new conda environment on your workstation.
No matter which manipulation system you choose to run, a well-calibrated robot URDF is important. Follow these steps to set up the robot URDF (while visual servoing picking does not require an accurate robot URDF, the placing heuristic is shared between the two systems):
- On your robot, follow the instructions described in Stretch ROS2 to calibrate your robot.
- Once you have a well-calibrated URDF (in ~/ament_ws/src/stretch_ros2/stretch_description/urdf/stretch.urdf on your Stretch robot), copy it to src/stretch/config/urdf/stretch.urdf on your workstation. It is recommended to run the following command on your workstation:
scp hello-robot@[ROBOT IP]:~/ament_ws/src/stretch_ros2/stretch_description/urdf/stretch.urdf stretch_ai/src/stretch/config/urdf/
- Run the following Python script to apply the URDF modifications described in the OK-Robot calibration docs:
python src/stretch/config/dynamem_urdf.py --urdf-path src/stretch/config/urdf/stretch.urdf
Note that while URDF calibration is important for both manipulation systems, AnyGrasp manipulation has a much higher requirement on calibration accuracy. With visual servoing manipulation, even if the calibration is not perfect, in most cases the robot will still complete the task.
You might want to check your calibration if any of the following happens (a quick sanity check is sketched after this list):
- The floor in the navigation point cloud does not fall on the z=0 plane.
- Manipulation does not follow the AnyGrasp predictions.
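For the first symptom, a quick, hypothetical sanity check is to look at where the lowest points of the map sit after the URDF transform; the threshold below is an arbitrary assumption, not a Dynamem parameter.

# Rough floor-height check on points already transformed to the base frame.
import numpy as np

def check_floor_height(points_base_frame, tol=0.03):
    z = np.sort(points_base_frame[:, 2])
    floor_z = z[: max(1, len(z) // 10)].mean()  # lowest ~10% as a floor proxy
    if abs(floor_z) > tol:
        print(f"Floor appears at z = {floor_z:.3f} m; check your URDF calibration.")
    else:
        print("Floor height looks consistent with z = 0.")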
First, find the IP addresses of your robot and workstation by running ifconfig on both machines. Continuously tracking the IPs of different machines is annoying, so we recommend using Tailscale to manage a set of virtual IP addresses. Run the following command on the workstation to run Dynamem:
python -m stretch.app.run_dynamem --robot_ip $ROBOT_IP --server_ip $WORKSTATION_SERVER_IP -S
robot_ip is used to communicate with the robot, and server_ip is used to communicate with the server where AnyGrasp runs. If you don't run AnyGrasp (e.g. navigation only, or running Stretch AI visual servoing manipulation instead), set server_ip to 127.0.0.1 or just leave it blank.
If you plan to run AnyGrasp on the same workstation, we highly recommend using the actual IP of this workstation instead of naively setting server_ip to 127.0.0.1.
Once the robot starts doing OVMM, a Rerun window will pop up to visualize the robot's thoughts.
The very first thing is to make sure the OK-Robot repo is a submodule of your Stretch AI repo in third_party/.
If it is not, run
git submodule update --init --recursive
to update all submodules.
Next, please strictly follow the steps described above to prepare an accurate robot URDF!
A few steps are needed before you can try AnyGrasp:
- Since AnyGrasp is a closed-source model, you first need to request an AnyGrasp license following these instructions.
- Install a new conda environment for running AnyGrasp following the OK-Robot environment installation instructions. NOTE that the stretch_ai environment does not support AnyGrasp because the AnyGrasp packages conflict with stretch_ai's Python version.
- Run AnyGrasp with the following commands in a new terminal window:
# If you have not yet activated anygrasp conda environment, do so.
conda activate ok-robot-env
# Assume you are in stretch_ai folder in the new window.
cd third_party/ok-robot/ok-robot-manipulation/src/
python demo.py --open_communication --port 5557
To learn about more options for running AnyGrasp, please read the OK-Robot Manipulation documentation.
After AnyGrasp is launched, you can run the default Dynamem command as described above:
python -m stretch.app.run_dynamem --robot_ip $ROBOT_IP --server_ip $WORKSTATION_SERVER_IP
Dynamem supports both exploration & mapping and OVMM tasks, so before each task it will ask whether you want to run E (for exploration) or M (for OVMM).
One exploration iteration includes (a sketch follows this list):
- Looking around and scanning new RGBD images;
- Moving towards the point of interest in the map.
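In pseudocode, one iteration might look like the hedged sketch below; the robot and memory methods are assumed names, not the real interfaces.

# One exploration iteration: scan, update memory, move to the best point.
def exploration_iteration(robot, memory):
    for rgbd in robot.look_around():          # 1. scan new RGBD images
        memory.update(rgbd)
    target = memory.best_exploration_point()  # 2. point of interest in the map
    robot.navigate_to(target)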
To specify how many exploration iterations the robot should run after selecting exploration, set --explore-iter. For example, if you want the robot to explore for 5 iterations, use the following command:
python -m stretch.app.run_dynamem --robot_ip $ROBOT_IP --server_ip $WORKSTATION_SERVER_IP -S --explore-iter 5
As mentioned previously, by default we run visual grounding by doing object detection on the robot observation with the highest cosine similarity. While this strategy is fast, another querying strategy, prompting GPT-4o with the top-k robot observations, has better accuracy.
To try this GPT-4o querying strategy and boost your navigation accuracy, you first need to follow OpenAI's instructions to create an API key. After that, you can try this version by turning on the mLLM option (-M) in your command:
OPENAI_API_KEY=$YOUR_API_KEY python -m stretch.app.run_dynamem --robot_ip $ROBOT_IP --server_ip $WORKSTATION_SERVER_IP -S -M
Dynamem stores the semantic memory as a pickle file after the initial rotate-in-place and every time navigate(A) is executed. Dynamem can then read the saved pickle file and load the semantic memory from previous runs directly, without rotating in place and scanning the surroundings again.
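Conceptually, the save/load behavior is just pickling the memory object, as in the hedged sketch below; the actual file contents and helper names are Dynamem internals, not what is shown here.

# Persist the semantic memory after a scan and restore it on the next run so
# the robot can skip rotating in place. Illustrative helpers only.
import pickle
from pathlib import Path

def save_memory(memory, output_path):
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(memory, f)

def load_memory(input_path):
    with open(input_path, "rb") as f:
        return pickle.load(f)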
You can control memory saving and reading by specifying input-path and output-path.
If output-path is specified, the semantic memory will be saved to dynamem_log/ + specified-output-path + .pkl; otherwise, it will be saved to a pickle file in dynamem_log/ named by the current datetime.
If input-path is specified, the robot will first read the semantic memory from the specified pickle file and skip rotating in place.
The command looks like this:
python -m stretch.app.run_dynamem --robot_ip $ROBOT_IP --server_ip $WORKSTATION_SERVER_IP --output-path $PICKLE_FILE_PATH --input-path $PICKLE_FILE_PATH -S
The Dynamem OVMM task implementation hard-codes the following API calling sequence: navigating to the target object navigate(A), picking up the object pick(A), navigating to the target receptacle navigate(B), and placing the object on the receptacle place(B). However, sometimes we might want to intervene in the robot's task planning. For example, if the first pick attempt fails, we might want the robot to try again.
So how can we steer the robot's actions? One functionality we provide is asking for human confirmation. That is, even though by default the system still calls navigate(A), pick(A), navigate(B), place(B) in sequence, before it runs each module, a human can explicitly tell the robot whether it should make that API call.
How is that functionality helpful? Sometimes the robot is already facing the object and we do not want to waste time on navigation; by answering N (no) when asked "Do you want to run navigation?", the robot skips navigation and directly picks up the object.
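Conceptually, the confirmation flow is a simple prompt before each API call, as in this hedged sketch (not the shipped code); the -S behavior corresponds to skip_confirmations=True.

# Ask before each step unless confirmations are skipped (the -S flag).
def confirm_and_run(step_name, fn, skip_confirmations=False):
    if not skip_confirmations:
        answer = input(f"Do you want to run {step_name}? [Y/n] ").strip().lower()
        if answer == "n":
            return  # e.g. skip navigation if the robot already faces the object
    fn()

def run_task(robot, obj, receptacle, skip_confirmations=False):
    confirm_and_run("navigation", lambda: robot.navigate(obj), skip_confirmations)
    confirm_and_run("picking", lambda: robot.pick(obj), skip_confirmations)
    confirm_and_run("navigation", lambda: robot.navigate(receptacle), skip_confirmations)
    confirm_and_run("placing", lambda: robot.place(receptacle), skip_confirmations)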
The -S flag in the previous commands configures Dynamem to skip these human confirmations. To enable the confirmations, run without it:
python -m stretch.app.run_dynamem --robot_ip $ROBOT_IP --server_ip $WORKSTATION_SERVER_IP
If you do not have access to AnyGrasp, you can run with the Stretch AI Visual Servoing code, as described in the LLM Agent documentation. In this case, you can run Dynamem with the following command:
python -m stretch.app.run_dynamem --robot_ip $ROBOT_IP --server_ip $WORKSTATION_SERVER_IP --visual-servo
You can also run an equivalent of the LLM agent with Dynamem. In this case, you can run Dynamem with the following command:
python -m stretch.app.run_dynamem --use-llm
All of the flags in the agent documentation are also available in Dynamem:
# Start with voice chat
python -m stretch.app.run_dynamem --use-llm --use-voice
You can specify an LLM, e.g.:
# Run Gemma 2B from Google locally
python -m stretch.app.run_dynamem --use-llm --llm gemma2b
# Run OpenAI GPT-4o-mini in the cloud, using an OpenAI API key
OPENAI_API_KEY=your_key_here python -m stretch.app.run_dynamem --use-llm --llm openai
If you find Dynamem useful in your research, please consider citing:
@article{liu2024dynamem,
title={DynaMem: Online Dynamic Spatio-Semantic Memory for Open World Mobile Manipulation},
author={Liu, Peiqi and Guo, Zhanqiu and Warke, Mohit and Chintala, Soumith and Paxton, Chris and Shafiullah, Nur Muhammad Mahi and Pinto, Lerrel},
journal={arXiv preprint arXiv:2411.04999},
year={2024}
}