Merge pull request microsoft#300 from carlotta94c/openai-support-ch8

Adding non-Azure Openai endpoints support to ch8
Thellton · Feb 16, 2024 · 118b4ca · 118b4ca
2 parents 36f7c74 + 1a8e759
commit 118b4ca
Showing 10 changed files with 327 additions and 9 deletions.
diff --git a/04-prompt-engineering-fundamentals/SETUP.md → 00-course-setup/SETUP.md b/04-prompt-engineering-fundamentals/SETUP.md → 00-course-setup/SETUP.md
@@ -4,7 +4,9 @@ We have instrumented this repository with a _dev container_ that comes with a Py
 
 ## 1. Create `.env` file
 
-The default notebook is set up for use with an [Azure OpenAI service resource](https://learn.microsoft.com/azure/ai-services/openai?WT.mc_id=academic-105485-koreyst). To configure this, we need to setup local environment variables for Azure as follows:
+The default notebook (identified by the 'aoai-' prefix) is set up for use with an [Azure OpenAI service resource](https://learn.microsoft.com/azure/ai-services/openai?WT.mc_id=academic-105485-koreyst). However, you can also run your assignments against non-Azure OpenAI endpoints (choose the 'oai-' prefixed notebooks in this case).
+
+To configure this, we need to set up local environment variables for Azure as follows:
 
 1. Look in the root folder for a `.env.copy` file. It should contain a list of name-value pairs like this:
 
@@ -23,6 +25,12 @@ The default notebook is set up for use with an [Azure OpenAI service resource](h
 
 3. (Optional) If you use GitHub Codespaces, you have the option to save environment variables as _Codespaces secrets_ associated with this repository. In that case, you won't need to set up a local .env file. **However, note that this option works only if you use GitHub Codespaces.** You will still need to set up the .env file if you use Docker Desktop instead.
 
+The steps above also apply if you are using non-Azure OpenAI endpoints. In that case, populate the .env file with the appropriate values for the OpenAI service.
+
+```bash
+OPENAI_API_KEY='<add your OpenAI key here>'
+```
+
 
 ## 2. Populate `.env` file
 
@@ -32,6 +40,7 @@ Let's take a quick look at the variable names to understand what they represent:
 |:---|:---|
 |AZURE_OPENAI_ENDPOINT| This is the deployed endpoint for an Azure OpenAI resource|
 |AZURE_OPENAI_KEY | This is the authorization key for using that service  |
+|OPENAI_API_KEY | This is the authorization key for non-Azure OpenAI endpoints |
 |AZURE_OPENAI_DEPLOYMENT| This is the _text generation_ model deployment endpoint |
 |AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT | This is the _text embeddings_ model deployment endpoint |
 | | | 
@@ -69,4 +78,7 @@ AZURE_OPENAI_DEPLOYMENT='gpt-35-turbo'
 AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT='text-embedding-ada-002'
 ```
 
-**Don't forget to save the .env file when done**. You can now exit the file and return to the instructions for running the notebook.
+**Don't forget to save the .env file when done**. You can now exit the file and return to the instructions for running the notebook.
+
+### 2.3 Use OpenAI Public API
+Your OpenAI API key can be found in your [OpenAI account](https://platform.openai.com/api-keys?WT.mc_id=academic-105485-koreyst). If you don't have one, you can sign up for an account and create an API key. Once you have the key, you can use it to populate the `OPENAI_API_KEY` variable in the `.env` file.
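
Review note: the new `oai-` notebooks read `OPENAI_API_KEY` from the environment and stop with an assertion when it is empty. The check can be sketched with the standard library alone, as below. `require_env` is a hypothetical helper for illustration and is not part of this PR; the notebooks call `load_dotenv()` first so that values from `.env` are visible in the environment.

```python
import os
from typing import Mapping, Optional


def require_env(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the value of an environment variable or fail with a clear message.

    Hypothetical helper for illustration. `env` defaults to os.environ and can
    be replaced with a plain dict when trying the function out.
    """
    source = os.environ if env is None else env
    value = source.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is missing or empty; add it to your .env file")
    return value


# Example with an explicit mapping, so no real key is needed to run it.
sample_env = {"OPENAI_API_KEY": "  sk-placeholder  "}
api_key = require_env("OPENAI_API_KEY", sample_env)
print(len(api_key))
```

Raising an exception instead of using `assert` is a design choice in this sketch: assertions are skipped when Python runs with `-O`, while an explicit exception is always raised.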
diff --git a/08-building-search-applications/README.md b/08-building-search-applications/README.md
@@ -150,7 +150,7 @@ az cognitiveservices account deployment create \
 
 ## Solution
 
-Open the [solution notebook](./solution.ipynb?WT.mc_id=academic-105485-koreyst) in GitHub Codespaces and follow the instructions in the Jupyter Notebook.
+Open the [solution notebook](./python/aoai-solution.ipynb?WT.mc_id=academic-105485-koreyst) in GitHub Codespaces and follow the instructions in the Jupyter Notebook.
 
 When you run the notebook, you'll be prompted to enter a query. The input box will look like this:
 

diff --git a/...ations/notebook-azure-openai-simple.ipynb → ...applications/python/aoai-assignment.ipynb b/...ations/notebook-azure-openai-simple.ipynb → ...applications/python/aoai-assignment.ipynb
@@ -6,7 +6,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "!pip install openai dotenv"
+    "%pip install openai python-dotenv"
    ]
   },
   {
@@ -35,7 +35,7 @@
    "outputs": [],
    "source": [
     "# Dependencies for embeddings_utils\n",
-    "!pip install matplotlib plotly scikit-learn pandas"
+    "%pip install matplotlib plotly scikit-learn pandas"
    ]
   },
   {

diff --git a/...ilding-search-applications/solution.ipynb → ...h-applications/python/aoai-solution.ipynb b/...ilding-search-applications/solution.ipynb → ...h-applications/python/aoai-solution.ipynb
@@ -29,7 +29,7 @@
     "model = os.environ['AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT']\n",
     "\n",
     "SIMILARITIES_RESULTS_THRESHOLD = 0.75\n",
-    "DATASET_NAME = \"embedding_index_3m.json\""
+    "DATASET_NAME = \"../embedding_index_3m.json\""
    ]
   },
   {

diff --git a/08-building-search-applications/python/oai-assignment.ipynb b/08-building-search-applications/python/oai-assignment.ipynb
@@ -0,0 +1,106 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%pip install openai python-dotenv"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "from openai import OpenAI\n",
+    "from dotenv import load_dotenv\n",
+    "import numpy as np\n",
+    "load_dotenv()\n",
+    "\n",
+    "API_KEY = os.getenv(\"OPENAI_API_KEY\",\"\")\n",
+    "assert API_KEY, \"ERROR: OpenAI Key is missing\"\n",
+    "\n",
+    "client = OpenAI(\n",
+    "    api_key=API_KEY\n",
+    "    )"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Dependencies for embeddings_utils\n",
+    "%pip install matplotlib plotly scikit-learn pandas"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def cosine_similarity(a, b):\n",
+    "    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "text = 'the quick brown fox jumped over the lazy dog'\n",
+    "model = 'text-embedding-ada-002'\n",
+    "\n",
+    "client.embeddings.create(input = [text], model=model).data[0].embedding"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# compare several words\n",
+    "automobile_embedding    = client.embeddings.create(input = 'automobile', model=model).data[0].embedding\n",
+    "vehicle_embedding       = client.embeddings.create(input = 'vehicle', model=model).data[0].embedding\n",
+    "dinosaur_embedding      = client.embeddings.create(input = 'dinosaur', model=model).data[0].embedding\n",
+    "stick_embedding         = client.embeddings.create(input = 'stick', model=model).data[0].embedding\n",
+    "\n",
+    "# Compare cosine similarity: automobile vs. automobile should be 1.0, i.e. exactly the same, while automobile vs. dinosaur should be between 0 and 1, i.e. not the same\n",
+    "print(cosine_similarity(automobile_embedding, automobile_embedding))\n",
+    "print(cosine_similarity(automobile_embedding, vehicle_embedding))\n",
+    "print(cosine_similarity(automobile_embedding, dinosaur_embedding))\n",
+    "print(cosine_similarity(automobile_embedding, stick_embedding))"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.13"
+  },
+  "orig_nbformat": 4
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
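
Review note: the `cosine_similarity` cell in the assignment notebook relies on NumPy. The same formula can be checked by hand on toy vectors, as in the minimal sketch below. It uses only the standard library, and the 2-D vectors are illustrative stand-ins rather than real embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 2-D vectors standing in for real embeddings (illustrative values only).
same = cosine_similarity([3.0, 4.0], [3.0, 4.0])        # identical direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # unrelated direction
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])   # opposite direction
print(same, orthogonal, opposite)
```

In general the value lies in the range [-1, 1]: identical directions give 1, orthogonal vectors give 0, and opposite directions give -1.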
diff --git a/08-building-search-applications/python/oai-solution.ipynb b/08-building-search-applications/python/oai-solution.ipynb
@@ -0,0 +1,200 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Before running this notebook, set your OpenAI key in the .env file as `OPENAI_API_KEY`, if you haven't done so already."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "from openai import OpenAI\n",
+    "from dotenv import load_dotenv\n",
+    "\n",
+    "load_dotenv()\n",
+    "\n",
+    "API_KEY = os.getenv(\"OPENAI_API_KEY\",\"\")\n",
+    "assert API_KEY, \"ERROR: OpenAI Key is missing\"\n",
+    "\n",
+    "client = OpenAI(\n",
+    "    api_key=API_KEY\n",
+    "    )\n",
+    "\n",
+    "model = 'text-embedding-ada-002'\n",
+    "\n",
+    "SIMILARITIES_RESULTS_THRESHOLD = 0.75\n",
+    "DATASET_NAME = \"../embedding_index_3m.json\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Next, we are going to load the Embedding Index into a Pandas Dataframe. The Embedding Index is stored in a JSON file called `embedding_index_3m.json`. The Embedding Index contains the Embeddings for each of the YouTube transcripts up until late Oct 2023."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def load_dataset(source: str) -> pd.core.frame.DataFrame:\n",
+    "    # Load the video session index\n",
+    "    pd_vectors = pd.read_json(source)\n",
+    "    return pd_vectors.drop(columns=[\"text\"], errors=\"ignore\").fillna(\"\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Next, we are going to create a function called `get_videos` that will search the Embedding Index for the query. The function will return the top 5 videos that are most similar to the query. The function works as follows:\n",
+    "\n",
+    "1. First, a copy of the Embedding Index is created.\n",
+    "2. Next, the Embedding for the query is calculated using the OpenAI Embedding API.\n",
+    "3. Then a new column is created in the Embedding Index called `similarity`. The `similarity` column contains the cosine similarity between the query Embedding and the Embedding for each video segment.\n",
+    "4. Next, the Embedding Index is filtered by the `similarity` column to include only videos that have a cosine similarity greater than or equal to 0.75.\n",
+    "5. Finally, the Embedding Index is sorted by the `similarity` column and the top 5 videos are returned."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def cosine_similarity(a, b):\n",
+    "    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))\n",
+    "\n",
+    "def get_videos(\n",
+    "    query: str, dataset: pd.core.frame.DataFrame, rows: int\n",
+    ") -> pd.core.frame.DataFrame:\n",
+    "    # create a copy of the dataset\n",
+    "    video_vectors = dataset.copy()\n",
+    "\n",
+    "    # get the embeddings for the query    \n",
+    "    query_embeddings = client.embeddings.create(input=query, model=model).data[0].embedding\n",
+    "\n",
+    "    # create a new column with the calculated similarity for each row\n",
+    "    video_vectors[\"similarity\"] = video_vectors[\"ada_v2\"].apply(\n",
+    "        lambda x: cosine_similarity(np.array(query_embeddings), np.array(x))\n",
+    "    )\n",
+    "\n",
+    "    # filter the videos by similarity\n",
+    "    mask = video_vectors[\"similarity\"] >= SIMILARITIES_RESULTS_THRESHOLD\n",
+    "    video_vectors = video_vectors[mask].copy()\n",
+    "\n",
+    "    # sort the videos by similarity\n",
+    "    video_vectors = video_vectors.sort_values(by=\"similarity\", ascending=False).head(\n",
+    "        rows\n",
+    "    )\n",
+    "\n",
+    "    # return the top rows\n",
+    "    return video_vectors.head(rows)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This function is very simple: it just prints out the results of the search query."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def display_results(videos: pd.core.frame.DataFrame, query: str):\n",
+    "    def _gen_yt_url(video_id: str, seconds: int) -> str:\n",
+    "        \"\"\"Build a YouTube URL that starts playback at the given offset in seconds\"\"\"\n",
+    "        return f\"https://youtu.be/{video_id}?t={seconds}\"\n",
+    "\n",
+    "    print(f\"\\nVideos similar to '{query}':\")\n",
+    "    for _, row in videos.iterrows():\n",
+    "        youtube_url = _gen_yt_url(row[\"videoId\"], row[\"seconds\"])\n",
+    "        print(f\" - {row['title']}\")\n",
+    "        print(f\"   Summary: {' '.join(row['summary'].split()[:15])}...\")\n",
+    "        print(f\"   YouTube: {youtube_url}\")\n",
+    "        print(f\"   Similarity: {row['similarity']}\")\n",
+    "        print(f\"   Speakers: {row['speaker']}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "1. First, the Embedding Index is loaded into a Pandas Dataframe.\n",
+    "2. Next, the user is prompted to enter a query.\n",
+    "3. Then the `get_videos` function is called to search the Embedding Index for the query.\n",
+    "4. Finally, the `display_results` function is called to display the results to the user.\n",
+    "5. The user is then prompted to enter another query. This process continues until the user enters `exit`.\n",
+    "\n",
+    "![](media/notebook_search.png)\n",
+    "\n",
+    "You will be prompted to enter a query. Enter a query and press enter. The application will return a list of videos that are relevant to the query. The application will also return a link to the place in the video where the answer to the question is located.\n",
+    "\n",
+    "Here are some queries to try out:\n",
+    "\n",
+    "- What is Azure Machine Learning?\n",
+    "- How do convolutional neural networks work?\n",
+    "- What is a neural network?\n",
+    "- Can I use Jupyter Notebooks with Azure Machine Learning?\n",
+    "- What is ONNX?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pd_vectors = load_dataset(DATASET_NAME)\n",
+    "\n",
+    "# get user query from input\n",
+    "while True:\n",
+    "    query = input(\"Enter a query: \")\n",
+    "    if query == \"exit\":\n",
+    "        break\n",
+    "    videos = get_videos(query, pd_vectors, 5)\n",
+    "    display_results(videos, query)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "venv",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.13"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
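
Review note: the filter-then-rank steps that the solution notebook describes for `get_videos` can be sketched without pandas or an API call, as below. The function name `top_matches` and the hand-written scores are hypothetical; in the notebook the scores come from the cosine similarity between the query embedding and each segment's `ada_v2` vector.

```python
from typing import Dict, List

THRESHOLD = 0.75  # same value as SIMILARITIES_RESULTS_THRESHOLD in the notebook


def top_matches(scored: List[Dict], rows: int, threshold: float = THRESHOLD) -> List[Dict]:
    """Keep entries at or above the threshold, best first, at most `rows` of them."""
    kept = [item for item in scored if item["similarity"] >= threshold]
    kept.sort(key=lambda item: item["similarity"], reverse=True)
    return kept[:rows]


# Hypothetical, hand-written scores used only to exercise the function.
segments = [
    {"title": "A", "similarity": 0.81},
    {"title": "B", "similarity": 0.74},
    {"title": "C", "similarity": 0.90},
    {"title": "D", "similarity": 0.75},
]
for match in top_matches(segments, rows=2):
    print(match["title"], match["similarity"])
```

A single slice is enough to cap the result size. The notebook applies `.head(rows)` twice, which is harmless but redundant.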
diff --git a/...ding-search-applications/requirements.txt → ...arch-applications/python/requirements.txt b/...ding-search-applications/requirements.txt → ...arch-applications/python/requirements.txt
diff --git a/08-building-search-applications/translations/cn/README.md b/08-building-search-applications/translations/cn/README.md
@@ -154,7 +154,7 @@ az cognitiveservices account deployment create \
 
 ## Solution
 
-Open the [solution notebook](../../solution.ipynb?WT.mc_id=academic-105485-koreyst) in GitHub Codespaces and follow the instructions in the Jupyter Notebook.
+Open the [solution notebook](../../python/aoai-solution.ipynb?WT.mc_id=academic-105485-koreyst) in GitHub Codespaces and follow the instructions in the Jupyter Notebook.
 
 When you run the notebook, you'll be prompted to enter a query. The input box will look like this: