Adding Documentation for Google Colab. (#28)

* Troubleshoot Lag in Follower Arms * Reviewed Changes * Apply suggestions from code review * Test Github URL * Added documentation for Google Colab Notebook * PR Review Changes * Added video link * Fixed FAQ Link --------- Co-authored-by: lukeschmitt-tr <[email protected]>
TrossenRobotics · Oct 11, 2024 · 2816b3f · 2816b3f
1 parent d8ccc22
commit 2816b3f
Show file tree

Hide file tree

Showing 3 changed files with 418 additions and 1 deletion.
diff --git a/docs/files/LeRobot_Notebook.ipynb b/docs/files/LeRobot_Notebook.ipynb
@@ -0,0 +1,373 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "id": "ff7eafe4",
+      "metadata": {
+        "id": "ff7eafe4"
+      },
+      "source": [
+        "\n",
+        "# LeRobot Training Instruction Manual\n",
+        "\n",
+        "# 🎉 Welcome to the LeRobot Training Notebook!\n",
+        "\n",
+        "This guide will help you set up and train a model on a cloud-based platform, such as **Google Colab**, using **LeRobot** with **Hugging Face**.\n",
+        "\n",
+        "---\n",
+        "\n",
+        "## ⚠️ **Disclaimers:**\n",
+        "\n",
+        "- **GPU Subscription**: 🔑 Make sure you have the appropriate subscription plan that provides access to the necessary GPU (e.g., **A100**, **T4**). Review pricing and benefits on the cloud provider's website before proceeding.\n",
+        "  \n",
+        "- **Checkpoint Requirement**: ⏳ If resuming training, ensure that you have the previous training checkpoint available in your session. Without the checkpoint, the training **cannot be resumed**.\n",
+        "\n",
+        "---\n",
+        "\n",
+        "## 📝 **Important Instructions:**\n",
+        "\n",
+        "- **Run All Cells Together**: 🔄 It is recommended to run all the cells in one go if you plan to leave the session **unmonitored**. This helps avoid session timeouts or disruptions.\n",
+        "\n",
+        "- **GPU & Compute Units**: 🎛️ Ensure you select a suitable GPU (e.g., **A100**, **T4**) and have enough compute units for your session. A typical 5-hour training session requires approximately **70 compute units**.\n",
+        "\n",
+        "- **Monitor Training**: 👀 It’s advisable to monitor the **first few epochs** to ensure that the training is running smoothly before leaving the session unattended.\n",
+        "\n",
+        "- **Local Storage**: 💾 You will be prompted to choose whether you want to store the training outputs **locally** at the end of the process.\n",
+        "\n",
+        "---\n",
+        "\n",
+        "Now, let’s begin the setup process! 🚀\n",
+        "\n",
+        "\n",
+        "---\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "e65a27ba",
+      "metadata": {
+        "id": "e65a27ba"
+      },
+      "outputs": [],
+      "source": [
+        "# Collect all necessary inputs from the user\n",
+        "import os\n",
+        "import subprocess\n",
+        "\n",
+        "# GPU selection (Reminder: Ensure enough compute units for smooth training)\n",
+        "print(\"Please select a suitable GPU type (e.g., A100, T4) for cloud-based training.\")\n",
+        "\n",
+        "# Hugging Face login token\n",
+        "print(\"Generate a Hugging Face token from: https://huggingface.co/settings/tokens\")\n",
+        "hf_token = input(\"Please enter your Hugging Face token: \")\n",
+        "\n",
+        "# Link to Trossen Robotics Community datasets\n",
+        "print(\"You can explore datasets from Trossen Robotics Community here: https://huggingface.co/TrossenRoboticsCommunity\")\n",
+        "\n",
+        "# Dataset and job details\n",
+        "dataset_repo_id = input(\"Please enter the dataset repo ID from Hugging Face (e.g., TrossenRoboticsCommunity/aloha_static_logo_assembly): \")\n",
+        "\n",
+        "\n",
+        "print(\"**Important**: Use a valid directory/jobs name. Avoid numbers or special characters other than '_'.\")\n",
+        "print(\"Example: 'training_results_aloha' or 'aloha_training_output'\")\n",
+        "\n",
+        "job_name = input(\"Please enter the job name for this training session: \")\n",
+        "\n",
+        "# Output directory with naming format instructions\n",
+        "\n",
+        "output_dir = input(\"Enter the output directory name: \")\n",
+        "\n",
+        "# Resume flag with disclaimer\n",
+        "resume_flag = input(\"Do you want to resume training from a previous checkpoint? (yes/no): \")\n",
+        "resume_cmd = \"--resume\" if resume_flag.lower() == 'yes' else \"\"\n",
+        "\n",
+        "# Model upload flag\n",
+        "upload_choice = input(\"Do you want to upload the model to Hugging Face after training? (yes/no): \")\n",
+        "model_repo_id = \"\"\n",
+        "if upload_choice.lower() == 'yes':\n",
+        "    model_repo_id = input(\"Please enter the model repo ID to store your trained model to Hugging Face (e.g., TrossenRoboticsCommunity/aloha_static_logo_assembly): \")\n",
+        "\n",
+        "# Local storage flag\n",
+        "store_locally = input(\"Do you want to store the training outputs locally? (yes/no): \")\n",
+        "\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "6350b26b",
+      "metadata": {
+        "id": "6350b26b"
+      },
+      "source": [
+        "\n",
+        "## Step 1: GPU Setup & Compute Units\n",
+        "\n",
+        "The GPU type you selected earlier will now be configured for this cloud-based training session. Make sure to have enough compute units to support long training sessions, and monitor the first few epochs to ensure smooth execution.\n",
+        "\n",
+        "---\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "46a58617",
+      "metadata": {
+        "id": "46a58617"
+      },
+      "source": [
+        "\n",
+        "## Step 2: Installing Dependencies\n",
+        "\n",
+        "In this step, we'll install all the necessary dependencies for running LeRobot and performing model training.\n",
+        "\n",
+        "Ensure that these packages are successfully installed before proceeding to the next steps.\n",
+        "\n",
+        "---\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "49888a14",
+      "metadata": {
+        "id": "49888a14"
+      },
+      "outputs": [],
+      "source": [
+        "\n",
+        "# Install required dependencies\n",
+        "print(\"Installing required dependencies...\")\n",
+        "\n",
+        "!sudo apt-get install libusb-1.0-0-dev\n",
+        "!pip install --upgrade  pyrealsense2 dynamixel-sdk rerun-sdk blinker wandb datasets huggingface-hub hydra-core gitpython flask diffusers InquirerPy\n",
+        "\n",
+        "# Install blinker if needed\n",
+        "!pip install --ignore-installed blinker\n",
+        "\n",
+        "\n",
+        "# Install LeRobot repository\n",
+        "!git clone https://github.com/huggingface/lerobot.git\n",
+        "%cd /content/lerobot\n",
+        "!ls\n",
+        "!pip install -e .\n",
+        "!pip install .[intelrealsense,dynamixel]\n",
+        "\n",
+        "print(\"Dependencies installed successfully.\")\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "9000e90a",
+      "metadata": {
+        "id": "9000e90a"
+      },
+      "source": [
+        "\n",
+        "## Step 3: Hugging Face Login & Dataset Setup\n",
+        "\n",
+        "We will now log into Hugging Face using the token provided. After login, the dataset repo, job name, and output directory that you specified will be configured for the training session.\n",
+        "\n",
+        "---\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "e5f45843",
+      "metadata": {
+        "id": "e5f45843"
+      },
+      "outputs": [],
+      "source": [
+        "# Log in to Hugging Face and verify login\n",
+        "print(\"Logging into Hugging Face...\")\n",
+        "!huggingface-cli login --token {hf_token}\n",
+        "\n",
+        "# Verify the login by checking the user information\n",
+        "user_info = !huggingface-cli whoami\n",
+        "print(f\"Logged in as: {user_info[0]}\")\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "> **⚠️ Important Notice:**\n",
+        ">\n",
+        "> Before you start the training, make sure to edit the `act_aloha_real.yaml` file located at:\n",
+        ">\n",
+        "> **Click Here** >> `/content/lerobot/lerobot/configs/policy/act_aloha_real.yaml`\n",
+        ">\n",
+        "> This file contains crucial parameters such as `batch_size`, `offline_steps`, and `learning_rate`. You should update these parameters based on your training needs. For example, you can modify:\n",
+        ">\n",
+        "> - **Batch Size** (`training.batch_size`): Adjust the number of samples processed in each training step.\n",
+        "> - **Offline Training Steps** (`training.offline_steps`): Define how many steps to run during offline training.\n",
+        "> - **Learning Rate** (`training.lr`): Set the learning rate to control how quickly the model learns.\n",
+        ">\n",
+        "> Once the file is updated, you can proceed with training.\n"
+      ],
+      "metadata": {
+        "id": "BONV4GelxZla"
+      },
+      "id": "BONV4GelxZla"
+    },
+    {
+      "cell_type": "markdown",
+      "id": "dbac0593",
+      "metadata": {
+        "id": "dbac0593"
+      },
+      "source": [
+        "## Step 5: Model Training or Resumption\n",
+        "\n",
+        "Now that everything is set up, we can either begin training the model or resume training from the last checkpoint, depending on your input.\n",
+        "\n",
+        "If resuming, make sure the checkpoint is available in your session. The training will continue from the last checkpoint if found.\n",
+        "\n",
+        "> **⚠️ Important: GPU Usage**\n",
+        ">\n",
+        "> By default, the training is configured to use a **GPU** for faster computation. If the runtime does not have access to a GPU, the training will fail.\n",
+        ">\n",
+        "> To avoid this issue:\n",
+        ">\n",
+        "> - **Ensure GPU is enabled** in your Colab runtime. You can check this by navigating to **Runtime > Change runtime type > Hardware accelerator** and selecting **GPU**.\n",
+        "> - If you prefer to use a **CPU** instead, update the `device` argument to `device=cpu` in the training command in the next cell.\n",
+        "\n",
+        "---"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "ea8896f6",
+      "metadata": {
+        "id": "ea8896f6"
+      },
+      "outputs": [],
+      "source": [
+        "# Start or resume training depending on user choice\n",
+        "%cd /content/lerobot/\n",
+        "\n",
+        "if resume_flag.lower() == \"no\":\n",
+        "    print(f\"Starting new training on {dataset_repo_id}...\")\n",
+        "    !python lerobot/scripts/train.py dataset_repo_id={dataset_repo_id} policy=act_aloha_real env=aloha_real hydra.run.dir=outputs/train/{output_dir} hydra.job.name={job_name} device=cuda wandb.enable=false\n",
+        "else:\n",
+        "    print(f\"Resuming training from {output_dir}... (ensure checkpoint is available)\")\n",
+        "    !python lerobot/scripts/train.py hydra.run.dir={output_dir} resume=true\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "79fe95cd",
+      "metadata": {
+        "id": "79fe95cd"
+      },
+      "source": [
+        "\n",
+        "## Step 5: Uploading the Model (Recommended)\n",
+        "\n",
+        "Once the model is trained, you can choose to upload it to Hugging Face for safekeeping. This is **highly recommended** if you are running long sessions or training a valuable model.\n",
+        "\n",
+        "Uploading the model will help protect against potential session interruptions or failures.\n",
+        "\n",
+        "---\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "10eec178",
+      "metadata": {
+        "id": "10eec178"
+      },
+      "outputs": [],
+      "source": [
+        "print(model_repo_id)\n",
+        "# Model upload step if chosen\n",
+        "if upload_choice.lower() == \"yes\":\n",
+        "    print(\"Uploading the model to Hugging Face...\")\n",
+        "    !huggingface-cli upload {model_repo_id}  outputs/train/{output_dir}/checkpoints/last/pretrained_model\n",
+        "    print(\"Model uploaded to Hugging Face successfully.\")\n",
+        "else:\n",
+        "    print(\"Model upload skipped.\")\n",
+        "\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "061e20bd",
+      "metadata": {
+        "id": "061e20bd"
+      },
+      "source": [
+        "\n",
+        "## Step 6: Safeguarding Session Data and Local Storage\n",
+        "\n",
+        "To prevent data loss in case of session termination, you can zip the output directory and download it locally. If you selected local storage, the outputs will be saved to your local machine.\n",
+        "\n",
+        "Make sure to run this step to save all training outputs before closing your session.\n",
+        "\n",
+        "---\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "327f72e0",
+      "metadata": {
+        "id": "327f72e0"
+      },
+      "outputs": [],
+      "source": [
+        "# Zip the output directory and download it if local storage is chosen\n",
+        "%cd /content/lerobot/outputs/train/\n",
+        "!ls\n",
+        "if store_locally.lower() == \"yes\":\n",
+        "    print(\"Zipping outputs for download...\")\n",
+        "    !zip -r trained.zip {output_dir}\n",
+        "\n",
+        "    # Download the zipped file\n",
+        "    from google.colab import files\n",
+        "    files.download('/content/lerobot/outputs/train/trained.zip')\n",
+        "else:\n",
+        "    print(\"Local storage not selected, skipping download.\")"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "ba7bbcc4",
+      "metadata": {
+        "id": "ba7bbcc4"
+      },
+      "source": [
+        "\n",
+        "# Troubleshooting and Recommendations\n",
+        "\n",
+        "1. **GPU Availability**: Ensure the selected GPU is available on your cloud platform (e.g., Colab).\n",
+        "2. **Compute Units**: Ensure you have sufficient compute units. Each 5-hour session requires ~70 units.\n",
+        "3. **Hugging Face Token**: You can generate a token [here](https://huggingface.co/settings/tokens).\n",
+        "4. **Session Safeguards**: Always download your results (output files) to prevent data loss if the session terminates.\n",
+        "5. **Checkpoint Reminder**: If resuming training, ensure that the checkpoint file from the previous session is present in the session.\n",
+        "\n",
+        "---\n"
+      ]
+    }
+  ],
+  "metadata": {
+    "language_info": {
+      "name": "python"
+    },
+    "colab": {
+      "provenance": [],
+      "machine_shape": "hm",
+      "gpuType": "A100",
+      "private_outputs": true
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "accelerator": "GPU"
+  },
+  "nbformat": 4,
+  "nbformat_minor": 5
+}