From 2816b3f60c568ed48dc497ef213db5d54b2ba595 Mon Sep 17 00:00:00 2001 From: shantanuparab-tr Date: Fri, 11 Oct 2024 13:57:17 -0500 Subject: [PATCH] Adding Documentation for Google Colab. (#28) * Troubleshoot Lag in Follower Arms * Reviewed Changes * Apply suggestions from code review * Test Github URL * Added documentation for Google Colab Notebook * PR Review Changes * Added video link * Fixed FAQ Link --------- Co-authored-by: lukeschmitt-tr <85308904+lukeschmitt-tr@users.noreply.github.com> --- docs/files/LeRobot_Notebook.ipynb | 373 ++++++++++++++++++++++++++++++ docs/operation/lerobot_guide.rst | 43 +++- docs/operation/test.txt | 3 + 3 files changed, 418 insertions(+), 1 deletion(-) create mode 100644 docs/files/LeRobot_Notebook.ipynb create mode 100644 docs/operation/test.txt diff --git a/docs/files/LeRobot_Notebook.ipynb b/docs/files/LeRobot_Notebook.ipynb new file mode 100644 index 0000000..3c111f9 --- /dev/null +++ b/docs/files/LeRobot_Notebook.ipynb @@ -0,0 +1,373 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "ff7eafe4", + "metadata": { + "id": "ff7eafe4" + }, + "source": [ + "\n", + "# LeRobot Training Instruction Manual\n", + "\n", + "# πŸŽ‰ Welcome to the LeRobot Training Notebook!\n", + "\n", + "This guide will help you set up and train a model on a cloud-based platform, such as **Google Colab**, using **LeRobot** with **Hugging Face**.\n", + "\n", + "---\n", + "\n", + "## ⚠️ **Disclaimers:**\n", + "\n", + "- **GPU Subscription**: πŸ”‘ Make sure you have the appropriate subscription plan that provides access to the necessary GPU (e.g., **A100**, **T4**). Review pricing and benefits on the cloud provider's website before proceeding.\n", + " \n", + "- **Checkpoint Requirement**: ⏳ If resuming training, ensure that you have the previous training checkpoint available in your session. Without the checkpoint, the training **cannot be resumed**.\n", + "\n", + "---\n", + "\n", + "## πŸ“ **Important Instructions:**\n", + "\n", + "- **Run All Cells Together**: πŸ”„ It is recommended to run all the cells in one go if you plan to leave the session **unmonitored**. This helps avoid session timeouts or disruptions.\n", + "\n", + "- **GPU & Compute Units**: πŸŽ›οΈ Ensure you select a suitable GPU (e.g., **A100**, **T4**) and have enough compute units for your session. A typical 5-hour training session requires approximately **70 compute units**.\n", + "\n", + "- **Monitor Training**: πŸ‘€ It’s advisable to monitor the **first few epochs** to ensure that the training is running smoothly before leaving the session unattended.\n", + "\n", + "- **Local Storage**: πŸ’Ύ You will be prompted to choose whether you want to store the training outputs **locally** at the end of the process.\n", + "\n", + "---\n", + "\n", + "Now, let’s begin the setup process! πŸš€\n", + "\n", + "\n", + "---\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e65a27ba", + "metadata": { + "id": "e65a27ba" + }, + "outputs": [], + "source": [ + "# Collect all necessary inputs from the user\n", + "import os\n", + "import subprocess\n", + "\n", + "# GPU selection (Reminder: Ensure enough compute units for smooth training)\n", + "print(\"Please select a suitable GPU type (e.g., A100, T4) for cloud-based training.\")\n", + "\n", + "# Hugging Face login token\n", + "print(\"Generate a Hugging Face token from: https://huggingface.co/settings/tokens\")\n", + "hf_token = input(\"Please enter your Hugging Face token: \")\n", + "\n", + "# Link to Trossen Robotics Community datasets\n", + "print(\"You can explore datasets from Trossen Robotics Community here: https://huggingface.co/TrossenRoboticsCommunity\")\n", + "\n", + "# Dataset and job details\n", + "dataset_repo_id = input(\"Please enter the dataset repo ID from Hugging Face (e.g., TrossenRoboticsCommunity/aloha_static_logo_assembly): \")\n", + "\n", + "\n", + "print(\"**Important**: Use a valid directory/jobs name. Avoid numbers or special characters other than '_'.\")\n", + "print(\"Example: 'training_results_aloha' or 'aloha_training_output'\")\n", + "\n", + "job_name = input(\"Please enter the job name for this training session: \")\n", + "\n", + "# Output directory with naming format instructions\n", + "\n", + "output_dir = input(\"Enter the output directory name: \")\n", + "\n", + "# Resume flag with disclaimer\n", + "resume_flag = input(\"Do you want to resume training from a previous checkpoint? (yes/no): \")\n", + "resume_cmd = \"--resume\" if resume_flag.lower() == 'yes' else \"\"\n", + "\n", + "# Model upload flag\n", + "upload_choice = input(\"Do you want to upload the model to Hugging Face after training? (yes/no): \")\n", + "model_repo_id = \"\"\n", + "if upload_choice.lower() == 'yes':\n", + " model_repo_id = input(\"Please enter the model repo ID to store your trained model to Hugging Face (e.g., TrossenRoboticsCommunity/aloha_static_logo_assembly): \")\n", + "\n", + "# Local storage flag\n", + "store_locally = input(\"Do you want to store the training outputs locally? (yes/no): \")\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "6350b26b", + "metadata": { + "id": "6350b26b" + }, + "source": [ + "\n", + "## Step 1: GPU Setup & Compute Units\n", + "\n", + "The GPU type you selected earlier will now be configured for this cloud-based training session. Make sure to have enough compute units to support long training sessions, and monitor the first few epochs to ensure smooth execution.\n", + "\n", + "---\n" + ] + }, + { + "cell_type": "markdown", + "id": "46a58617", + "metadata": { + "id": "46a58617" + }, + "source": [ + "\n", + "## Step 2: Installing Dependencies\n", + "\n", + "In this step, we'll install all the necessary dependencies for running LeRobot and performing model training.\n", + "\n", + "Ensure that these packages are successfully installed before proceeding to the next steps.\n", + "\n", + "---\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "49888a14", + "metadata": { + "id": "49888a14" + }, + "outputs": [], + "source": [ + "\n", + "# Install required dependencies\n", + "print(\"Installing required dependencies...\")\n", + "\n", + "!sudo apt-get install libusb-1.0-0-dev\n", + "!pip install --upgrade pyrealsense2 dynamixel-sdk rerun-sdk blinker wandb datasets huggingface-hub hydra-core gitpython flask diffusers InquirerPy\n", + "\n", + "# Install blinker if needed\n", + "!pip install --ignore-installed blinker\n", + "\n", + "\n", + "# Install LeRobot repository\n", + "!git clone https://github.com/huggingface/lerobot.git\n", + "%cd /content/lerobot\n", + "!ls\n", + "!pip install -e .\n", + "!pip install .[intelrealsense,dynamixel]\n", + "\n", + "print(\"Dependencies installed successfully.\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "9000e90a", + "metadata": { + "id": "9000e90a" + }, + "source": [ + "\n", + "## Step 3: Hugging Face Login & Dataset Setup\n", + "\n", + "We will now log into Hugging Face using the token provided. After login, the dataset repo, job name, and output directory that you specified will be configured for the training session.\n", + "\n", + "---\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e5f45843", + "metadata": { + "id": "e5f45843" + }, + "outputs": [], + "source": [ + "# Log in to Hugging Face and verify login\n", + "print(\"Logging into Hugging Face...\")\n", + "!huggingface-cli login --token {hf_token}\n", + "\n", + "# Verify the login by checking the user information\n", + "user_info = !huggingface-cli whoami\n", + "print(f\"Logged in as: {user_info[0]}\")\n" + ] + }, + { + "cell_type": "markdown", + "source": [ + "> **⚠️ Important Notice:**\n", + ">\n", + "> Before you start the training, make sure to edit the `act_aloha_real.yaml` file located at:\n", + ">\n", + "> **Click Here** >> `/content/lerobot/lerobot/configs/policy/act_aloha_real.yaml`\n", + ">\n", + "> This file contains crucial parameters such as `batch_size`, `offline_steps`, and `learning_rate`. You should update these parameters based on your training needs. For example, you can modify:\n", + ">\n", + "> - **Batch Size** (`training.batch_size`): Adjust the number of samples processed in each training step.\n", + "> - **Offline Training Steps** (`training.offline_steps`): Define how many steps to run during offline training.\n", + "> - **Learning Rate** (`training.lr`): Set the learning rate to control how quickly the model learns.\n", + ">\n", + "> Once the file is updated, you can proceed with training.\n" + ], + "metadata": { + "id": "BONV4GelxZla" + }, + "id": "BONV4GelxZla" + }, + { + "cell_type": "markdown", + "id": "dbac0593", + "metadata": { + "id": "dbac0593" + }, + "source": [ + "## Step 5: Model Training or Resumption\n", + "\n", + "Now that everything is set up, we can either begin training the model or resume training from the last checkpoint, depending on your input.\n", + "\n", + "If resuming, make sure the checkpoint is available in your session. The training will continue from the last checkpoint if found.\n", + "\n", + "> **⚠️ Important: GPU Usage**\n", + ">\n", + "> By default, the training is configured to use a **GPU** for faster computation. If the runtime does not have access to a GPU, the training will fail.\n", + ">\n", + "> To avoid this issue:\n", + ">\n", + "> - **Ensure GPU is enabled** in your Colab runtime. You can check this by navigating to **Runtime > Change runtime type > Hardware accelerator** and selecting **GPU**.\n", + "> - If you prefer to use a **CPU** instead, update the `device` argument to `device=cpu` in the training command in the next cell.\n", + "\n", + "---" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ea8896f6", + "metadata": { + "id": "ea8896f6" + }, + "outputs": [], + "source": [ + "# Start or resume training depending on user choice\n", + "%cd /content/lerobot/\n", + "\n", + "if resume_flag.lower() == \"no\":\n", + " print(f\"Starting new training on {dataset_repo_id}...\")\n", + " !python lerobot/scripts/train.py dataset_repo_id={dataset_repo_id} policy=act_aloha_real env=aloha_real hydra.run.dir=outputs/train/{output_dir} hydra.job.name={job_name} device=cuda wandb.enable=false\n", + "else:\n", + " print(f\"Resuming training from {output_dir}... (ensure checkpoint is available)\")\n", + " !python lerobot/scripts/train.py hydra.run.dir={output_dir} resume=true\n" + ] + }, + { + "cell_type": "markdown", + "id": "79fe95cd", + "metadata": { + "id": "79fe95cd" + }, + "source": [ + "\n", + "## Step 5: Uploading the Model (Recommended)\n", + "\n", + "Once the model is trained, you can choose to upload it to Hugging Face for safekeeping. This is **highly recommended** if you are running long sessions or training a valuable model.\n", + "\n", + "Uploading the model will help protect against potential session interruptions or failures.\n", + "\n", + "---\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "10eec178", + "metadata": { + "id": "10eec178" + }, + "outputs": [], + "source": [ + "print(model_repo_id)\n", + "# Model upload step if chosen\n", + "if upload_choice.lower() == \"yes\":\n", + " print(\"Uploading the model to Hugging Face...\")\n", + " !huggingface-cli upload {model_repo_id} outputs/train/{output_dir}/checkpoints/last/pretrained_model\n", + " print(\"Model uploaded to Hugging Face successfully.\")\n", + "else:\n", + " print(\"Model upload skipped.\")\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "061e20bd", + "metadata": { + "id": "061e20bd" + }, + "source": [ + "\n", + "## Step 6: Safeguarding Session Data and Local Storage\n", + "\n", + "To prevent data loss in case of session termination, you can zip the output directory and download it locally. If you selected local storage, the outputs will be saved to your local machine.\n", + "\n", + "Make sure to run this step to save all training outputs before closing your session.\n", + "\n", + "---\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "327f72e0", + "metadata": { + "id": "327f72e0" + }, + "outputs": [], + "source": [ + "# Zip the output directory and download it if local storage is chosen\n", + "%cd /content/lerobot/outputs/train/\n", + "!ls\n", + "if store_locally.lower() == \"yes\":\n", + " print(\"Zipping outputs for download...\")\n", + " !zip -r trained.zip {output_dir}\n", + "\n", + " # Download the zipped file\n", + " from google.colab import files\n", + " files.download('/content/lerobot/outputs/train/trained.zip')\n", + "else:\n", + " print(\"Local storage not selected, skipping download.\")" + ] + }, + { + "cell_type": "markdown", + "id": "ba7bbcc4", + "metadata": { + "id": "ba7bbcc4" + }, + "source": [ + "\n", + "# Troubleshooting and Recommendations\n", + "\n", + "1. **GPU Availability**: Ensure the selected GPU is available on your cloud platform (e.g., Colab).\n", + "2. **Compute Units**: Ensure you have sufficient compute units. Each 5-hour session requires ~70 units.\n", + "3. **Hugging Face Token**: You can generate a token [here](https://huggingface.co/settings/tokens).\n", + "4. **Session Safeguards**: Always download your results (output files) to prevent data loss if the session terminates.\n", + "5. **Checkpoint Reminder**: If resuming training, ensure that the checkpoint file from the previous session is present in the session.\n", + "\n", + "---\n" + ] + } + ], + "metadata": { + "language_info": { + "name": "python" + }, + "colab": { + "provenance": [], + "machine_shape": "hm", + "gpuType": "A100", + "private_outputs": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "accelerator": "GPU" + }, + "nbformat": 4, + "nbformat_minor": 5 +} \ No newline at end of file diff --git a/docs/operation/lerobot_guide.rst b/docs/operation/lerobot_guide.rst index 27f1a84..0df1daa 100644 --- a/docs/operation/lerobot_guide.rst +++ b/docs/operation/lerobot_guide.rst @@ -229,9 +229,10 @@ To train a policy for controlling your robot, use the following command: #. The environment is set with :guilabel:`env=aloha_real`. This configuration is loaded from :file:`lerobot/configs/env/aloha_real.yaml`. #. The device is set to :guilabel:`cuda` to utilize an NVIDIA GPU for training. - #. :guilabel:`wandb.enable=true` is used for visualizing training plots via [Weights and Biases](https://docs.wandb.ai/quickstart). + #. :guilabel:`wandb.enable=true` is used for visualizing training plots via `Weights and Biases `_. Ensure you are logged in by running `wandb login`. + Upload Policy Checkpoints ========================= @@ -250,6 +251,46 @@ To upload intermediate checkpoints: $ huggingface-cli upload ${HF_USER}/act_aloha_test_${CKPT} \ outputs/train/act_aloha_test/checkpoints/${CKPT}/pretrained_model +Google Colab for Training +=============================== + +If you would like to speed up the training process or do not have access to a powerful local machine, you can use the **Google Colab Notebook** that we have prepared for training LeRobot models on a cloud platform. +Colab provides free access to GPUs, which can significantly reduce training time. + +To access and use the Colab notebook, follow these steps: + +1. **Download or Open the Colab Notebook**: You can either download the Colab notebook to your local machine or open it directly in Google Colab for instant use. + + **Options**: + + - :download:`Download the Notebook <../files/LeRobot_Notebook.ipynb>` + + - .. raw:: html + + + Open In Colab + + +2. **GPU Setup**: Colab allows you to leverage powerful GPUs (e.g., T4, A100) to accelerate the training process. + Ensure you have enabled GPU by navigating to **Runtime** > **Change runtime type** > **GPU**. + + If you're new to Google Colab or need more information on how it works, check out the `Google Colab FAQ `_ for answers to common questions. + +3. **Install Dependencies**: The notebook will automatically install all necessary dependencies such as `pyrealsense2`, `dynamixel-sdk`, and other tools required for the LeRobot framework. +4. **Log in to Hugging Face**: Follow the instructions to log in with your Hugging Face token for seamless access to datasets and model uploads. +5. **Start Training**: The notebook is pre-configured with commands to start training with the Aloha policy and datasets. +6. **Monitor Progress**: Keep an eye on the first few training epochs to ensure everything runs smoothly. + +For additional step-by-step instructions, check out our `instructional video `_. + +Benefits of Using Colab +----------------------- + +- **GPU Acceleration**: Google Colab provides free access to NVIDIA GPUs, which can dramatically reduce the time needed for model training. +- **Cloud-Based**: You don’t need to rely on your local machine for heavy computation, and the training session can run in the background. +- **Seamless Integration**: The notebook is integrated with Hugging Face, allowing you to easily access datasets and upload trained models. +- **No Setup Hassle**: All the necessary dependencies and configurations are handled within the notebook, making the setup easy and quick. + Evaluation ========== diff --git a/docs/operation/test.txt b/docs/operation/test.txt new file mode 100644 index 0000000..c0570ca --- /dev/null +++ b/docs/operation/test.txt @@ -0,0 +1,3 @@ +https://github.com/TrossenRobotics/aloha_docs/blob/main/docs/images/LeRobot_Notebook.ipynb + +https://github.com/shantanuparab-tr/aloha_docs/blob/main/docs/images/LeRobot_Notebook.ipynb \ No newline at end of file