From 6c8a8a942bee6226881c44d51868b56f246eeb9a Mon Sep 17 00:00:00 2001 From: HubGab-Git <97609337+HubGab-Git@users.noreply.github.com> Date: Sun, 6 Oct 2024 15:33:48 +0200 Subject: [PATCH] finetune_flan_t5_with_tensorboard #4667 --- .../finetune_flan_t5_with_tensorboard.ipynb | 560 ++++++++++++++++++ 1 file changed, 560 insertions(+) create mode 100644 build_and_train_models/sm-finetune_flan_t5_with_tensorboard/finetune_flan_t5_with_tensorboard.ipynb diff --git a/build_and_train_models/sm-finetune_flan_t5_with_tensorboard/finetune_flan_t5_with_tensorboard.ipynb b/build_and_train_models/sm-finetune_flan_t5_with_tensorboard/finetune_flan_t5_with_tensorboard.ipynb new file mode 100644 index 0000000000..31315c5d0a --- /dev/null +++ b/build_and_train_models/sm-finetune_flan_t5_with_tensorboard/finetune_flan_t5_with_tensorboard.ipynb @@ -0,0 +1,560 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "48b9f283-12e1-4c30-924d-d6bac1f14d6a", + "metadata": {}, + "source": [ + "# Fine-tuning a HuggingFace FLAN-T5 Model on Amazon SageMaker with TensorBoard Integration\n", + "\n", + "**Author**: Hubert Gabryel\n", + "\n", + "**Date**: 2024-10-05\n", + "\n", + "## Table of Contents\n", + "\n", + "1. [Introduction](#1-introduction)\n", + "\n", + "    1.1 [Background](#11-background)\n", + "\n", + "    1.2 [Objective](#12-objective)\n", + "\n", + "2. [Setup](#2-setup)\n", + "\n", + "    2.1 [Import Libraries](#21-import-libraries)\n", + "\n", + "    2.2 [Initialize SageMaker Session and Role](#22-initialize-sagemaker-session-and-role)\n", + "\n", + "    2.3 [Model Configuration](#23-model-configuration)\n", + "\n", + "3. [Data Preparation](#3-data-preparation)\n", + "\n", + "    3.1 [Download and Prepare the Dataset](#31-download-and-prepare-the-dataset)\n", + "\n", + "    3.2 [Load and Preprocess the Data](#32-load-and-preprocess-the-data)\n", + "\n", + "    3.3 [Prepare the Data for Training](#33-prepare-the-data-for-training)\n", + "\n", + "    3.4 [Visualize Sample Data](#34-visualize-sample-data)\n", + "\n", + "    3.5 [Upload Data to S3](#35-upload-data-to-s3)\n", + "\n", + "4. [Training Script Modification](#4-training-script-modification)\n", + "\n", + "    4.1 [Download the Training Script](#41-download-the-training-script)\n", + "\n", + "    4.2 [Modify the Training Script for TensorBoard Integration](#42-modify-the-training-script-for-tensorboard-integration)\n", + "\n", + "5. [Model Training with TensorBoard Integration](#5-model-training-with-tensorboard-integration)\n", + "\n", + "    5.1 [Set Up TensorBoard Output Configuration](#51-set-up-tensorboard-output-configuration)\n", + "\n", + "    5.2 [Define Hyperparameters](#52-define-hyperparameters)\n", + "\n", + "    5.3 [Create and Fit the Estimator](#53-create-and-fit-the-estimator)\n", + "\n", + "6. [TensorBoard Visualization](#6-tensorboard-visualization)\n", + "\n", + "    6.1 [Start TensorBoard from the SageMaker Console](#61-start-tensorboard-from-the-sagemaker-console)\n", + "\n", + "7. [Conclusion](#7-conclusion)\n", + "\n", + "8. [References](#8-references)\n", + "\n", + "\n", + "## 1. Introduction\n", + "\n", + "### 1.1 Background\n", + "\n", + "In this notebook, we demonstrate how to fine-tune a HuggingFace FLAN-T5 model from Amazon SageMaker JumpStart with TensorBoard integration.
This integration allows us to monitor and visualize the training process in real time, providing valuable insights into model performance.\n", + "\n", + "### 1.2 Objective\n", + "\n", + "Our goal is to fine-tune the FLAN-T5 small model on a subset of the Tiny Shakespeare dataset and visualize the training metrics using TensorBoard. We will:\n", + "\n", + "- Set up the SageMaker environment and import necessary libraries.\n", + "- Prepare the dataset for training.\n", + "- Modify the training script to include TensorBoard logging.\n", + "- Train the model with TensorBoard integration.\n", + "- Visualize the training metrics using TensorBoard.\n", + "\n", + "## 2. Setup\n", + "\n", + "### 2.1 Import Libraries" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "c3644c9c-adb0-4eb9-9d95-89120ab22dde", + "metadata": {}, + "outputs": [], + "source": [ + "# Install or upgrade the SageMaker Python SDK\n", + "!pip install -U sagemaker --quiet" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0d6bcbc8-6abf-4b62-80b3-3e58ece51b66", + "metadata": {}, + "outputs": [], + "source": [ + "# Import necessary libraries\n", + "import os\n", + "import random\n", + "import numpy as np\n", + "import pandas as pd\n", + "\n", + "import boto3\n", + "import sagemaker\n", + "from sagemaker import get_execution_role, script_uris\n", + "from sagemaker.s3 import S3Downloader\n", + "from sagemaker.jumpstart.estimator import JumpStartEstimator\n", + "from sagemaker.debugger import TensorBoardOutputConfig\n", + "\n", + "# Set random seeds for reproducibility\n", + "RANDOM_SEED = 42\n", + "random.seed(RANDOM_SEED)\n", + "np.random.seed(RANDOM_SEED)" + ] + }, + { + "cell_type": "markdown", + "id": "2b572c67-a33c-4b2d-b1dc-aa71192a9682", + "metadata": {}, + "source": [ + "### 2.2 Initialize SageMaker Session and Role" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e462af99-d22a-4fb8-93e0-d03edb4351eb", + "metadata": {}, + "outputs": [], + "source": [ + "sagemaker_session = sagemaker.Session()\n", + "role = get_execution_role()\n", + "\n", + "# Verify S3 access\n", + "try:\n", + "    s3_client = sagemaker_session.boto_session.client('s3')\n", + "    s3_client.head_bucket(Bucket=sagemaker_session.default_bucket())\n", + "    print(\"S3 access confirmed.\")\n", + "except Exception as e:\n", + "    print(f\"Unable to access S3 bucket: {e}\")" + ] + }, + { + "cell_type": "markdown", + "id": "a46f6523-ddb1-489f-9757-dc9543f475de", + "metadata": {}, + "source": [ + "### 2.3 Model Configuration" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "ea3f3cda-803f-4bea-a678-f4284efbbb4b", + "metadata": {}, + "outputs": [], + "source": [ + "# Model configuration\n", + "MODEL_ID = 'huggingface-text2text-flan-t5-small'  # Small model to keep training cost low\n", + "MODEL_VERSION = '2.1.2'  # Latest model version at the time of writing" + ] + }, + { + "cell_type": "markdown", + "id": "ab6ad668-ac5a-4b88-99c6-17b8c962b69a", + "metadata": {}, + "source": [ + "## 3. Data Preparation\n", + "\n", + "### 3.1 Download and Prepare the Dataset\n", + "\n", + "We will use the Tiny Shakespeare dataset for this example.\n",
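+ "\n", + "If `wget` is unavailable in your environment, the same file can be fetched with Python's standard library instead; this is a minimal, equivalent alternative to the download cell below:\n", + "\n", + "```python\n", + "# Pure-Python alternative to the wget cell below\n", + "import urllib.request\n", + "\n", + "url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'\n", + "urllib.request.urlretrieve(url, 'input.txt')\n", + "```"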
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9a444e79-dd2c-4f3f-9a64-33c6d6f713a6", + "metadata": {}, + "outputs": [], + "source": [ + "# Download the Tiny Shakespeare dataset\n", + "!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt --no-check-certificate" + ] + }, + { + "cell_type": "markdown", + "id": "40d302be-4997-42cd-b83f-5fe732ac0fa8", + "metadata": {}, + "source": [ + "### 3.2 Load and Preprocess the Data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b3c61092-8b0c-4104-8943-bfe8098160d3", + "metadata": {}, + "outputs": [], + "source": [ + "# Read the data\n", + "with open('input.txt', 'r') as f:\n", + "    data = f.read()\n", + "\n", + "# Limit the data to the first MAX_DATA_LENGTH characters\n", + "MAX_DATA_LENGTH = 10000\n", + "data = data[:MAX_DATA_LENGTH]\n", + "\n", + "# Split the data into training and validation sets.\n", + "# A sequential slice (rather than a shuffled split) keeps each side contiguous, readable text.\n", + "TEST_SIZE = 0.2\n", + "split_index = int(len(data) * (1 - TEST_SIZE))\n", + "train_text, val_text = data[:split_index], data[split_index:]\n", + "\n", + "print(f\"Training data length: {len(train_text)}\")\n", + "print(f\"Validation data length: {len(val_text)}\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "2db6e736-4088-4518-8515-efd31bbef8ef", + "metadata": {}, + "source": [ + "### 3.3 Prepare the Data for Training\n", + "\n", + "We need to format the data into prompts and completions." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "21329da7-f9ed-4869-b8ab-88abdb6f5255", + "metadata": {}, + "outputs": [], + "source": [ + "def prepare_data(text, sequence_length=256, prompt_length=128):\n", + "    \"\"\"Slide over the text in prompt_length steps, pairing each prompt_length-character\n", + "    prompt with the next (sequence_length - prompt_length) characters as its completion.\"\"\"\n", + "    data = []\n", + "    max_index = len(text) - sequence_length + 1\n", + "    for i in range(0, max_index, prompt_length):\n", + "        prompt = text[i:i+prompt_length]\n", + "        completion = text[i+prompt_length:i+sequence_length]\n", + "        if len(completion) == (sequence_length - prompt_length):\n", + "            data.append({'prompt': prompt, 'completion': completion})\n", + "    return data\n", + "\n", + "# Prepare the training and validation data\n", + "train_data = prepare_data(train_text)\n", + "val_data = prepare_data(val_text)\n", + "\n", + "print(f\"Number of training samples: {len(train_data)}\")\n", + "print(f\"Number of validation samples: {len(val_data)}\")" + ] + }, + { + "cell_type": "markdown", + "id": "854bc82d-a469-41b0-a470-da8f7d938101", + "metadata": {}, + "source": [ + "### 3.4 Visualize Sample Data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8ebdcd9c-b32a-45e6-9f9f-8d1d116318b3", + "metadata": {}, + "outputs": [], + "source": [ + "# Display a sample from the training data\n", + "pd.DataFrame(train_data).head()" + ] + }, + { + "cell_type": "markdown", + "id": "517c62b7-cc20-4c2a-b338-880d488a70db", + "metadata": {}, + "source": [ + "### 3.5 Upload Data to S3\n",
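+ "\n", + "Each line of the JSONL files we upload below holds one prompt/completion record produced by `prepare_data`. To inspect a record exactly as it will be serialized, you can run (a quick sketch using `train_data` from above):\n", + "\n", + "```python\n", + "import json\n", + "\n", + "# Print the first training record as a single JSONL line (truncated for readability)\n", + "print(json.dumps(train_data[0])[:200])\n", + "```"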
key_prefix=f\"{data_prefix}/train.jsonl\"\n", + ")\n", + "\n", + "# Upload validation data\n", + "val_s3_uri = sagemaker_session.upload_data(\n", + " path='val.jsonl',\n", + " bucket=bucket,\n", + " key_prefix=f\"{data_prefix}/val.jsonl\"\n", + ")\n", + "\n", + "print(f\"Training data uploaded to: {train_s3_uri}\")\n", + "print(f\"Validation data uploaded to: {val_s3_uri}\")" + ] + }, + { + "cell_type": "markdown", + "id": "2e75bf64-0948-4a74-bd67-b1441d00917d", + "metadata": {}, + "source": [ + "## 4. Training Script Modification\n", + "\n", + "### 4.1 Download the Training Script\n", + "\n", + "We need to obtain the default training script provided by the JumpStart model and modify it to integrate TensorBoard." + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "a5191d10-8dac-4a3a-9ee1-597a3208bcb2", + "metadata": {}, + "outputs": [], + "source": [ + "from sagemaker.s3 import S3Downloader\n", + "\n", + "# Retrieve the training script URI\n", + "train_script_uri = script_uris.retrieve(\n", + " model_id=MODEL_ID, model_version=MODEL_VERSION, script_scope=\"training\"\n", + ")\n", + "\n", + "# Download the training script\n", + "S3Downloader.download(train_script_uri, \"training_script\")\n", + "\n", + "# Unpack the training script\n", + "import tarfile\n", + "\n", + "with tarfile.open('training_script/sourcedir.tar.gz') as tar:\n", + " tar.extractall('./training_script')\n", + "\n", + "\n", + "with tarfile.open('training_script.tar.gz/sourcedir.tar.gz') as tar:\n", + " tar.extractall('./training_script')" + ] + }, + { + "cell_type": "markdown", + "id": "87befce4-deb1-426f-acbf-803099438ac2", + "metadata": {}, + "source": [ + "### 4.2 Modify the Training Script for TensorBoard Integration\n", + "\n", + "We need to modify the train.py script to include TensorBoard logging.\n", + "\n", + "- Import the TensorBoardCallback:\n", + " In train.py, add:\n", + "\n", + "```python\n", + "from transformers.integrations import TensorBoardCallback\n", + "```\n", + "\n", + "- Modify the Seq2SeqTrainingArguments to include TensorBoard parameters:\n", + "\n", + " ```python\n", + " training_args = Seq2SeqTrainingArguments(\n", + " # ... other arguments ...\n", + " logging_dir=\"/opt/ml/output/tensorboard\",\n", + " report_to=['tensorboard'],\n", + " # ... other arguments ...\n", + ")\n", + "```\n", + "\n", + "\n", + "- Add the TensorBoardCallback to the trainer:\n", + "\n", + "```python\n", + "if callbacks is None: # Added line\n", + " callbacks = [] # Added line\n", + "callbacks.append(TensorBoardCallback()) # Added line\n", + "\n", + "# Create Trainer instance\n", + " trainer = Seq2SeqTrainer(\n", + " model=model,\n", + " args=training_args,\n", + " train_dataset=dataset[constants.TRAIN],\n", + " eval_dataset=dataset[constants.VALIDATION],\n", + " data_collator=data_collator,\n", + " callbacks=callbacks,\n", + " )\n", + "```\n", + "\n", + "Note: Ensure that the \"/opt/ml/output/tensorboard\" in the training script matches the container_local_output_path in the TensorBoardOutputConfig.\n", + "\n", + "## 5. 
+ "\n", + "## 5. Model Training with TensorBoard Integration\n", + "\n", + "### 5.1 Set Up TensorBoard Output Configuration" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8fdbc2e0-9b81-420d-bbe2-01dba31aa103", + "metadata": {}, + "outputs": [], + "source": [ + "tensorboard_output_config = TensorBoardOutputConfig(\n", + "    s3_output_path=f's3://{bucket}/tensorboard-output',\n", + "    container_local_output_path='/opt/ml/output/tensorboard'  # Must match logging_dir in the training script\n", + ")\n", + "\n", + "print(f\"TensorBoard logs will be saved to: s3://{bucket}/tensorboard-output\")" + ] + }, + { + "cell_type": "markdown", + "id": "e7e316f7-6a1e-4f68-932b-9a495d3a56e5", + "metadata": {}, + "source": [ + "### 5.2 Define Hyperparameters" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "d4c9fe44-cb9c-407b-a80c-5577105b18f6", + "metadata": {}, + "outputs": [], + "source": [ + "hyperparameters = {\n", + "    \"epochs\": \"5\",\n", + "    \"batch_size\": \"4\",\n", + "    \"learning_rate\": \"5e-5\",\n", + "    \"logging_strategy\": \"steps\",\n", + "    \"logging_steps\": \"5\",\n", + "    \"evaluation_strategy\": \"steps\",\n", + "    \"save_strategy\": \"steps\",\n", + "    \"eval_steps\": \"25\",\n", + "    \"save_steps\": \"25\",\n", + "    \"gradient_accumulation_steps\": \"1\",\n", + "    \"fp16\": \"true\",\n", + "    \"bf16\": \"false\"\n", + "}" + ] + }, + { + "cell_type": "markdown", + "id": "137786e8-56f4-4acc-b4cb-3bf2e2f219fc", + "metadata": {}, + "source": [ + "### 5.3 Create and Fit the Estimator" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ebc3d8f5-a505-4754-a578-7de4c7221f28", + "metadata": {}, + "outputs": [], + "source": [ + "estimator = JumpStartEstimator(\n", + "    model_id=MODEL_ID,\n", + "    model_version=MODEL_VERSION,\n", + "    role=role,\n", + "    instance_type='ml.g5.xlarge',\n", + "    hyperparameters=hyperparameters,\n", + "    entry_point='transfer_learning.py',  # Name of the main script\n", + "    source_dir='training_script',  # Directory containing the modified scripts\n", + "    tensorboard_output_config=tensorboard_output_config\n", + ")\n", + "\n", + "# Start the training job\n", + "estimator.fit(\n", + "    {\"train\": train_s3_uri, \"validation\": val_s3_uri}\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "ac2186f6-8767-4773-ad21-c7e3458ae818", + "metadata": {}, + "source": [ + "## 6. TensorBoard Visualization\n", + "\n", + "### 6.1 Start TensorBoard from the SageMaker Console\n", + "\n", + "1. **Navigate to the SageMaker Console**: Go to the Amazon SageMaker Console.\n", + "2. **Access TensorBoard**: In the left-hand navigation pane, click **Applications and IDEs**, then select **TensorBoard**.\n", + "3. **Open TensorBoard**: Click **Open TensorBoard** to launch the TensorBoard landing page.\n", + "4. **Add Your Training Job**: On the TensorBoard page, click **Add job** and select your most recent completed training job from the list.\n", + "5. **View Training Metrics**: After the data loads, navigate to the **Scalars** tab to see charts and graphs of your training metrics.\n",
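+ "\n", + "Alternatively, you can pull the event files down from S3 and run TensorBoard locally (a sketch, assuming the training job has written its logs to the S3 path configured in section 5.1):\n", + "\n", + "```python\n", + "# Download the TensorBoard event files written by the training job\n", + "S3Downloader.download(f's3://{bucket}/tensorboard-output', 'tb_logs')\n", + "\n", + "# Then, from a terminal: tensorboard --logdir tb_logs\n", + "```"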
+ ] + }, + { + "cell_type": "markdown", + "id": "731283bc-212b-4810-9e3c-d6f84368b960", + "metadata": {}, + "source": [ + "## 7. Conclusion\n", + "\n", + "In this notebook, we demonstrated how to fine-tune a HuggingFace FLAN-T5 model using Amazon SageMaker with TensorBoard integration. We prepared a subset of the Tiny Shakespeare dataset, modified the training script to include TensorBoard logging, and visualized the training metrics.\n", + "\n", + "**Next Steps**:\n", + "\n", + "- Experiment with Hyperparameters: Adjust learning rates, batch sizes, and other hyperparameters to improve model performance.\n", + "- Use a Larger Dataset: Train on the full Tiny Shakespeare corpus, or another larger corpus, instead of the 10,000-character subset used here.\n", + "- Deploy the Model: After training, deploy the model using SageMaker’s deployment capabilities for inference.\n", + "\n", + "## 8. References\n", + "\n", + "- [Amazon SageMaker Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-htb-prepare-training-job.html)\n", + "- [TensorBoard Documentation](https://www.tensorflow.org/tensorboard/get_started)\n", + "- [Tiny Shakespeare Dataset](https://github.com/karpathy/char-rnn/tree/master/data/tinyshakespeare)\n", + "- [HuggingFace Transformers Documentation](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Seq2SeqTrainingArguments)\n", + "- [SageMaker JumpStart Models](https://sagemaker.readthedocs.io/en/stable/doc_utils/pretrainedmodels.html)\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}