diff --git a/One_Prompt___Fine_Tuned_LLaMA_2.ipynb b/One_Prompt___Fine_Tuned_LLaMA_2.ipynb
index 8349ac0..07f986b 100644
--- a/One_Prompt___Fine_Tuned_LLaMA_2.ipynb
+++ b/One_Prompt___Fine_Tuned_LLaMA_2.ipynb
@@ -2,6 +2,9 @@
  "cells": [
   {
    "cell_type": "markdown",
+   "metadata": {
+    "id": "wM8MRkf8Dr94"
+   },
    "source": [
     "## Describe your model -> fine-tuned LLaMA 2\n",
     "By Matt Shumer (https://twitter.com/mattshumer_)\n",
@@ -15,68 +18,83 @@
     "Select a temperature (high=creative, low=precise), and the number of training examples to generate to train the model. From there, just run all the cells.\n",
     "\n",
     "You can change the model you want to fine-tune by changing `model_name` in the `Define Hyperparameters` cell."
-   ],
-   "metadata": {
-    "id": "wM8MRkf8Dr94"
-   }
+   ]
   },
   {
    "cell_type": "markdown",
-   "source": [
-    "#Data generation step"
-   ],
    "metadata": {
     "id": "Way3_PuPpIuE"
-   }
+   },
+   "source": [
+    "#Data generation step"
+   ]
   },
   {
    "cell_type": "markdown",
+   "metadata": {
+    "id": "lY-3DvlIpVSl"
+   },
    "source": [
     "Write your prompt here. Make it as descriptive as possible!\n",
     "\n",
     "Then, choose the temperature (between 0 and 1) to use when generating data. Lower values are great for precise tasks, like writing code, whereas larger values are better for creative tasks, like writing stories.\n",
     "\n",
     "Finally, choose how many examples you want to generate. The more you generate, a) the longer it takes and b) the more expensive data generation will be. But generally, more examples will lead to a higher-quality model. 100 is usually the minimum to start."
-   ],
-   "metadata": {
-    "id": "lY-3DvlIpVSl"
-   }
+   ]
   },
   {
    "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "R7WKZyxtpUPS"
+   },
+   "outputs": [],
    "source": [
     "prompt = \"A model that takes in a puzzle-like reasoning-heavy question in English, and responds with a well-reasoned, step-by-step thought out response in Spanish.\"\n",
     "temperature = .4\n",
     "number_of_examples = 100"
-   ],
-   "metadata": {
-    "id": "R7WKZyxtpUPS"
-   },
-   "execution_count": null,
-   "outputs": []
+   ]
   },
   {
    "cell_type": "markdown",
-   "source": [
-    "Run this to generate the dataset."
-   ],
    "metadata": {
     "id": "1snNou5PrIci"
-   }
+   },
+   "source": [
+    "Run this to generate the dataset."
+   ]
   },
   {
    "cell_type": "code",
-   "source": [
-    "!pip install openai"
-   ],
+   "execution_count": null,
    "metadata": {
     "id": "zuL2UaqlsmBD"
    },
+   "outputs": [],
+   "source": [
+    "!pip install openai tenacity"
+   ]
+  },
+  {
+   "cell_type": "code",
    "execution_count": null,
-   "outputs": []
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from tenacity import (\n",
+    "    retry,\n",
+    "    stop_after_attempt,\n",
+    "    wait_random_exponential,\n",
+    ")"
+   ]
   },
   {
    "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "Rdsd82ngpHCG"
+   },
+   "outputs": [],
    "source": [
     "import os\n",
     "import openai\n",
@@ -84,6 +102,7 @@
     "\n",
     "openai.api_key = \"YOUR KEY HERE\"\n",
     "\n",
+    "@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))\n",
     "def generate_example(prompt, prev_examples, temperature=.5):\n",
     "    messages=[\n",
     "        {\n",
@@ -118,24 +137,24 @@
     "    prev_examples.append(example)\n",
     "\n",
     "print(prev_examples)"
-   ],
-   "metadata": {
-    "id": "Rdsd82ngpHCG"
-   },
-   "execution_count": null,
-   "outputs": []
+   ]
   },
   {
    "cell_type": "markdown",
-   "source": [
-    "We also need to generate a system message."
-   ],
    "metadata": {
     "id": "KC6iJzXjugJ-"
-   }
+   },
+   "source": [
+    "We also need to generate a system message."
+   ]
   },
   {
    "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "xMcfhW6Guh2E"
+   },
+   "outputs": [],
    "source": [
     "def generate_system_message(prompt):\n",
     "\n",
@@ -160,24 +179,24 @@
     "system_message = generate_system_message(prompt)\n",
     "\n",
     "print(f'The system message is: `{system_message}`. Feel free to re-run this cell if you want a better result.')"
-   ],
-   "metadata": {
-    "id": "xMcfhW6Guh2E"
-   },
-   "execution_count": null,
-   "outputs": []
+   ]
   },
   {
    "cell_type": "markdown",
-   "source": [
-    "Now let's put our examples into a dataframe and turn them into a final pair of datasets."
-   ],
    "metadata": {
     "id": "G6BqZ-hjseBF"
-   }
+   },
+   "source": [
+    "Now let's put our examples into a dataframe and turn them into a final pair of datasets."
+   ]
   },
   {
    "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "7CEdkYeRsdmB"
+   },
+   "outputs": [],
    "source": [
     "import pandas as pd\n",
     "\n",
@@ -206,24 +225,24 @@
     "print('There are ' + str(len(df)) + ' successfully-generated examples. Here are the first few:')\n",
     "\n",
     "df.head()"
-   ],
-   "metadata": {
-    "id": "7CEdkYeRsdmB"
-   },
-   "execution_count": null,
-   "outputs": []
+   ]
   },
   {
    "cell_type": "markdown",
-   "source": [
-    "Split into train and test sets."
-   ],
    "metadata": {
     "id": "A-8dt5qqtpgM"
-   }
+   },
+   "source": [
+    "Split into train and test sets."
+   ]
   },
   {
    "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "GFPEn1omtrXM"
+   },
+   "outputs": [],
    "source": [
     "# Split the data into train and test sets, with 90% in the train set\n",
     "train_df = df.sample(frac=0.9, random_state=42)\n",
@@ -232,24 +251,24 @@
     "# Save the dataframes to .jsonl files\n",
     "train_df.to_json('train.jsonl', orient='records', lines=True)\n",
     "test_df.to_json('test.jsonl', orient='records', lines=True)"
-   ],
-   "metadata": {
-    "id": "GFPEn1omtrXM"
-   },
-   "execution_count": null,
-   "outputs": []
+   ]
   },
   {
    "cell_type": "markdown",
-   "source": [
-    "# Install necessary libraries"
-   ],
    "metadata": {
     "id": "AbrFgrhG_xYi"
-   }
+   },
+   "source": [
+    "# Install necessary libraries"
+   ]
   },
   {
    "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "lPG7wEPetFx2"
+   },
+   "outputs": [],
    "source": [
     "!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7\n",
     "\n",
@@ -267,24 +286,24 @@
     ")\n",
     "from peft import LoraConfig, PeftModel\n",
     "from trl import SFTTrainer"
-   ],
-   "metadata": {
-    "id": "lPG7wEPetFx2"
-   },
-   "execution_count": null,
-   "outputs": []
+   ]
   },
   {
    "cell_type": "markdown",
-   "source": [
-    "# Define Hyperparameters"
-   ],
    "metadata": {
     "id": "moVo0led-6tu"
-   }
+   },
+   "source": [
+    "# Define Hyperparameters"
+   ]
   },
   {
    "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "bqfbhUZI-4c_"
+   },
+   "outputs": [],
    "source": [
     "model_name = \"NousResearch/llama-2-7b-chat-hf\" # use this if you have access to the official LLaMA 2 model \"meta-llama/Llama-2-7b-chat-hf\", though keep in mind you'll need to pass a Hugging Face key argument\n",
     "dataset_name = \"/content/train.jsonl\"\n",
@@ -317,24 +336,24 @@
     "max_seq_length = None\n",
     "packing = False\n",
     "device_map = {\"\": 0}"
-   ],
-   "metadata": {
-    "id": "bqfbhUZI-4c_"
-   },
-   "execution_count": null,
-   "outputs": []
+   ]
   },
   {
    "cell_type": "markdown",
-   "source": [
-    "#Load Datasets and Train"
-   ],
    "metadata": {
     "id": "F-J5p5KS_MZY"
-   }
+   },
+   "source": [
+    "#Load Datasets and Train"
+   ]
   },
   {
    "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "qf1qxbiF-x6p"
+   },
+   "outputs": [],
    "source": [
     "# Load datasets\n",
     "train_dataset = load_dataset('json', data_files='/content/train.jsonl', split=\"train\")\n",
@@ -411,24 +430,24 @@
     "pipe = pipeline(task=\"text-generation\", model=model, tokenizer=tokenizer, max_length=200)\n",
     "result = pipe(prompt)\n",
     "print(result[0]['generated_text'])"
-   ],
-   "metadata": {
-    "id": "qf1qxbiF-x6p"
-   },
-   "execution_count": null,
-   "outputs": []
+   ]
   },
   {
    "cell_type": "markdown",
-   "source": [
-    "#Run Inference"
-   ],
    "metadata": {
     "id": "F6fux9om_c4-"
-   }
+   },
+   "source": [
+    "#Run Inference"
+   ]
   },
   {
    "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "7hxQ_Ero2IJe"
+   },
+   "outputs": [],
    "source": [
     "from transformers import pipeline\n",
     "\n",
@@ -444,24 +463,24 @@
     "gen = pipeline('text-generation', model=model, tokenizer=tokenizer, max_length=max_length)\n",
     "result = gen(prompt)\n",
     "print(result[0]['generated_text'].replace(prompt, ''))"
-   ],
-   "metadata": {
-    "id": "7hxQ_Ero2IJe"
-   },
-   "execution_count": null,
-   "outputs": []
+   ]
   },
   {
    "cell_type": "markdown",
-   "source": [
-    "#Merge the model and store in Google Drive"
-   ],
    "metadata": {
     "id": "Ko6UkINu_qSx"
-   }
+   },
+   "source": [
+    "#Merge the model and store in Google Drive"
+   ]
   },
   {
    "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "AgKCL7fTyp9u"
+   },
+   "outputs": [],
    "source": [
     "# Merge and save the fine-tuned model\n",
     "from google.colab import drive\n",
@@ -488,24 +507,24 @@
     "# Save the merged model\n",
     "model.save_pretrained(model_path)\n",
     "tokenizer.save_pretrained(model_path)"
-   ],
-   "metadata": {
-    "id": "AgKCL7fTyp9u"
-   },
-   "execution_count": null,
-   "outputs": []
+   ]
   },
   {
    "cell_type": "markdown",
-   "source": [
-    "# Load a fine-tuned model from Drive and run inference"
-   ],
    "metadata": {
     "id": "do-dFdE5zWGO"
-   }
+   },
+   "source": [
+    "# Load a fine-tuned model from Drive and run inference"
+   ]
   },
   {
    "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "xg6nHPsLzMw-"
+   },
+   "outputs": [],
    "source": [
     "from google.colab import drive\n",
     "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
@@ -516,15 +535,15 @@
     "\n",
     "model = AutoModelForCausalLM.from_pretrained(model_path)\n",
     "tokenizer = AutoTokenizer.from_pretrained(model_path)"
-   ],
-   "metadata": {
-    "id": "xg6nHPsLzMw-"
-   },
-   "execution_count": null,
-   "outputs": []
+   ]
   },
   {
    "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "fBK2aE2KzZ05"
+   },
+   "outputs": [],
    "source": [
     "from transformers import pipeline\n",
     "\n",
@@ -532,20 +551,15 @@
     "gen = pipeline('text-generation', model=model, tokenizer=tokenizer)\n",
     "result = gen(prompt)\n",
     "print(result[0]['generated_text'])"
-   ],
-   "metadata": {
-    "id": "fBK2aE2KzZ05"
-   },
-   "execution_count": null,
-   "outputs": []
+   ]
   }
  ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
+  "gpuType": "V100",
   "machine_shape": "hm",
-  "provenance": [],
-  "gpuType": "V100"
+  "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
@@ -557,4 +571,4 @@
  },
 "nbformat": 4,
 "nbformat_minor": 0
-}
\ No newline at end of file
+}