diff --git a/.env.example b/.env.example
new file mode 100644
index 0000000..47efcc7
--- /dev/null
+++ b/.env.example
@@ -0,0 +1,4 @@
+# API Keys and Other Secrets
+OPENAI_API_KEY="something"
+MISTRAL_API_KEY="another-secret"
+HUGGINGFACE_TOKEN="and-another"
diff --git a/.gitignore b/.gitignore
index 5eff2a9..691c57a 100644
--- a/.gitignore
+++ b/.gitignore
@@ -13,7 +13,7 @@ slides/
# Distribution / packaging
.Python
-env/
+.venv/
build/
develop-eggs/
dist/
diff --git a/data/docs/who-docs/Cholera-Report.pdf b/data/docs/who-docs/Cholera-Report.pdf
new file mode 100644
index 0000000..7ecf47d
Binary files /dev/null and b/data/docs/who-docs/Cholera-Report.pdf differ
diff --git a/data/docs/who-docs/Dengue-Global-situation.pdf b/data/docs/who-docs/Dengue-Global-situation.pdf
new file mode 100644
index 0000000..defa68c
Binary files /dev/null and b/data/docs/who-docs/Dengue-Global-situation.pdf differ
diff --git a/data/docs/who-docs/Hepatitis-Chad.pdf b/data/docs/who-docs/Hepatitis-Chad.pdf
new file mode 100644
index 0000000..502796a
Binary files /dev/null and b/data/docs/who-docs/Hepatitis-Chad.pdf differ
diff --git a/data/docs/who-docs/MidEast-COVID.pdf b/data/docs/who-docs/MidEast-COVID.pdf
new file mode 100644
index 0000000..0a76ac4
Binary files /dev/null and b/data/docs/who-docs/MidEast-COVID.pdf differ
diff --git a/notebooks/malawi-nov-24/2-document-classification-with-sklearn.ipynb b/notebooks/malawi-nov-24/2-document-classification-with-sklearn.ipynb
deleted file mode 100644
index 20c458b..0000000
--- a/notebooks/malawi-nov-24/2-document-classification-with-sklearn.ipynb
+++ /dev/null
@@ -1,575 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Building a Document Classification System\n",
- "The NumPy (Numerical Python) library used for working iwith arrays, and the Scikit-learn library is a python library built on NumPy, SciPy and matplotlib for data analytics and machine learning. The NLTK (Natural Language Toolkit) provides access to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Ensuring that you have the necessary libraries\n",
- "# !pip install nltk\n",
- "# !pip install numpy\n",
- "# !pip install scikit-learn"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {},
- "outputs": [],
- "source": [
- "import numpy as np\n",
- "import nltk\n",
- "from nltk.corpus import reuters\n",
- "from sklearn.feature_extraction.text import TfidfVectorizer\n",
- "from sklearn.model_selection import train_test_split\n",
- "from sklearn.svm import LinearSVC\n",
- "from sklearn.metrics import accuracy_score, classification_report\n",
- "\n",
- "from sklearn.feature_extraction.text import CountVectorizer\n",
- "from sklearn.naive_bayes import MultinomialNB"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### 1. Load your data"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The Reuters-21578 dataset is one of the most widely used data collections for text categorization research. It is a collection of documents with news articles and the original corpus has 10,369 documents and a vocabulary of 29,930 word and has labeled categories such as \"earnings\", \"acquisitions\".. etc. You can read metadata about the dataset on [Hugging Face](https://huggingface.co/datasets/ucirvine/reuters21578)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {},
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "[nltk_data] Downloading package reuters to\n",
- "[nltk_data] /Users/dunstanmatekenya/nltk_data...\n",
- "[nltk_data] Package reuters is already up-to-date!\n"
- ]
- },
- {
- "data": {
- "text/plain": [
- "True"
- ]
- },
- "execution_count": 2,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# download the dataset\n",
- "nltk.download('reuters')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Load the Reuters-21578 dataset\n",
- "documents = reuters.fileids()\n",
- "train_docs = list(filter(lambda doc: doc.startswith(\"train\"), documents))\n",
- "test_docs = list(filter(lambda doc: doc.startswith(\"test\"), documents))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### 2. Prepare your data"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Prepare the data by extracting the raw text and category labels for both the training and testing documents. Assumption is that each document has only one category label, so we take only the first category label for each document."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 10,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Prepare the data\n",
- "train_data = [reuters.raw(doc_id) for doc_id in train_docs]\n",
- "train_labels = [reuters.categories(doc_id)[0] for doc_id in train_docs]\n",
- "test_data = [reuters.raw(doc_id) for doc_id in test_docs]\n",
- "test_labels = [reuters.categories(doc_id)[0] for doc_id in test_docs]"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Question-How many different classes are in the training data?"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Explore some of the training examples"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 17,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Article content: COMPUTER TERMINAL SYSTEMS <CPML> COMPLETES SALE\n",
- " Computer Terminal Systems Inc said\n",
- " it has completed the sale of 200,000 shares of its common\n",
- " stock, and warrants to acquire an additional one mln shares, to\n",
- " <Sedio N.V.> of Lugano, Switzerland for 50,000 dlrs.\n",
- " The company said the warrants are exercisable for five\n",
- " years at a purchase price of .125 dlrs per share.\n",
- " Computer Terminal said Sedio also has the right to buy\n",
- " additional shares and increase its total holdings up to 40 pct\n",
- " of the Computer Terminal's outstanding common stock under\n",
- " certain circumstances involving change of control at the\n",
- " company.\n",
- " The company said if the conditions occur the warrants would\n",
- " be exercisable at a price equal to 75 pct of its common stock's\n",
- " market price at the time, not to exceed 1.50 dlrs per share.\n",
- " Computer Terminal also said it sold the technolgy rights to\n",
- " its Dot Matrix impact technology, including any future\n",
- " improvements, to <Woodco Inc> of Houston, Tex. for 200,000\n",
- " dlrs. But, it said it would continue to be the exclusive\n",
- " worldwide licensee of the technology for Woodco.\n",
- " The company said the moves were part of its reorganization\n",
- " plan and would help pay current operation costs and ensure\n",
- " product delivery.\n",
- " Computer Terminal makes computer generated labels, forms,\n",
- " tags and ticket printers and terminals.\n",
- " \n",
- "\n",
- " n\\, Label: acq\n"
- ]
- }
- ],
- "source": [
- "print(\"Article content: {} n\\, Label: {}\".format(train_data[1], train_labels[1]))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### 3. Vectorizing the text data"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "- Vectorize the text data using the TfidVectorizer from scikit-learn. TF-IDF is an abbreviation for Term Frequency Inverse Document Frequency. This is very common algorithm to transform text into a meaningful representation of numbers which is used to fit machine algorithm for prediction. \n",
- "- Its worth noting that nowadays, this vectorization approach is not commonly used. We will cover **word embeddings** tomorrow which is a better approach to represent words as numbers because **vector embeddings** can capture semantic meanings better.\n",
- "\n",
- "For the sklearn TF-IDF vectorizer, you can learn more about it [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 18,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Vectorize the text data\n",
- "vectorizer = TfidfVectorizer(stop_words=\"english\", max_features=1000)\n",
- "X_train = vectorizer.fit_transform(train_data)\n",
- "X_test = vectorizer.transform(test_data)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "#### Question: What role are the ```stop words``` playing in the code above? You might have learned this from Prof. Mohamad Ali already."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### 4. Training a Linear Support Vector Machine (LinearSVC) classifier using the vectorized training data and corresponding label"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 20,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "LinearSVC()"
- ],
- "text/plain": [
- "LinearSVC()"
- ]
- },
- "execution_count": 20,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# Train the classifier\n",
- "classifier = LinearSVC()\n",
- "classifier.fit(X_train, train_labels)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "classifier."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### 5. Evaluate the classifier used and calculate the accuracy score as well as some other metrics (Precision, Recall and F-1 score)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 21,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Accuracy: 0.876117919841007\n",
- " precision recall f1-score support\n",
- "\n",
- " acq 0.95 0.96 0.96 719\n",
- " alum 0.33 0.18 0.24 22\n",
- " barley 1.00 0.71 0.83 14\n",
- " bop 0.77 0.80 0.79 30\n",
- " carcass 0.79 0.65 0.71 17\n",
- " castor-oil 0.00 0.00 0.00 1\n",
- " cocoa 0.94 1.00 0.97 17\n",
- " coconut 0.00 0.00 0.00 2\n",
- " coconut-oil 0.00 0.00 0.00 2\n",
- " coffee 0.89 0.96 0.92 25\n",
- " copper 0.93 0.93 0.93 15\n",
- " corn 0.85 0.81 0.83 48\n",
- " cotton 1.00 0.86 0.92 14\n",
- " cpi 0.62 0.62 0.62 24\n",
- " cpu 0.00 0.00 0.00 1\n",
- " crude 0.79 0.93 0.86 182\n",
- " dfl 0.00 0.00 0.00 1\n",
- " dlr 0.70 0.72 0.71 43\n",
- " dmk 0.00 0.00 0.00 1\n",
- " earn 0.98 0.99 0.98 1083\n",
- " fuel 1.00 0.22 0.36 9\n",
- " gas 0.75 0.33 0.46 9\n",
- " gnp 0.59 0.89 0.71 19\n",
- " gold 0.96 0.96 0.96 26\n",
- " grain 0.71 0.77 0.74 77\n",
- " groundnut 0.00 0.00 0.00 3\n",
- " heat 1.00 0.75 0.86 4\n",
- " hog 1.00 0.50 0.67 4\n",
- " housing 1.00 0.67 0.80 3\n",
- " income 1.00 0.80 0.89 5\n",
- " instal-debt 1.00 1.00 1.00 1\n",
- " interest 0.78 0.76 0.77 124\n",
- " ipi 1.00 1.00 1.00 11\n",
- " iron-steel 0.69 0.64 0.67 14\n",
- " jet 0.00 0.00 0.00 1\n",
- " jobs 0.73 0.85 0.79 13\n",
- " l-cattle 0.00 0.00 0.00 2\n",
- " lead 0.83 0.42 0.56 12\n",
- " lei 1.00 1.00 1.00 3\n",
- " livestock 0.50 0.50 0.50 6\n",
- " lumber 0.00 0.00 0.00 5\n",
- " meal-feed 0.20 0.17 0.18 6\n",
- " money-fx 0.65 0.65 0.65 96\n",
- " money-supply 0.80 0.83 0.81 29\n",
- " naphtha 0.00 0.00 0.00 1\n",
- " nat-gas 0.64 0.54 0.58 13\n",
- " nickel 0.00 0.00 0.00 1\n",
- " oilseed 0.54 0.54 0.54 13\n",
- " orange 0.75 0.33 0.46 9\n",
- " palladium 0.00 0.00 0.00 1\n",
- " palm-oil 0.67 1.00 0.80 4\n",
- " pet-chem 1.00 0.50 0.67 6\n",
- " platinum 0.00 0.00 0.00 3\n",
- " potato 1.00 0.67 0.80 3\n",
- " propane 0.00 0.00 0.00 2\n",
- " rape-oil 0.00 0.00 0.00 1\n",
- " reserves 1.00 0.64 0.78 14\n",
- " retail 1.00 1.00 1.00 1\n",
- " rice 0.00 0.00 0.00 1\n",
- " rubber 0.69 1.00 0.82 9\n",
- " ship 0.39 0.41 0.40 39\n",
- " silver 0.00 0.00 0.00 0\n",
- " soy-oil 0.00 0.00 0.00 2\n",
- " soybean 0.00 0.00 0.00 2\n",
- "strategic-metal 0.00 0.00 0.00 6\n",
- " sugar 0.71 0.96 0.81 25\n",
- " tea 0.00 0.00 0.00 3\n",
- " tin 0.71 0.50 0.59 10\n",
- " trade 0.70 0.93 0.80 76\n",
- " veg-oil 0.54 0.64 0.58 11\n",
- " wpi 0.62 0.56 0.59 9\n",
- " yen 0.00 0.00 0.00 6\n",
- " zinc 0.00 0.00 0.00 5\n",
- "\n",
- " accuracy 0.88 3019\n",
- " macro avg 0.53 0.48 0.49 3019\n",
- " weighted avg 0.86 0.88 0.87 3019\n",
- "\n"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "/Users/dunstanmatekenya/anaconda3/lib/python3.10/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.\n",
- " _warn_prf(average, modifier, msg_start, len(result))\n",
- "/Users/dunstanmatekenya/anaconda3/lib/python3.10/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.\n",
- " _warn_prf(average, modifier, msg_start, len(result))\n",
- "/Users/dunstanmatekenya/anaconda3/lib/python3.10/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.\n",
- " _warn_prf(average, modifier, msg_start, len(result))\n",
- "/Users/dunstanmatekenya/anaconda3/lib/python3.10/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.\n",
- " _warn_prf(average, modifier, msg_start, len(result))\n",
- "/Users/dunstanmatekenya/anaconda3/lib/python3.10/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.\n",
- " _warn_prf(average, modifier, msg_start, len(result))\n",
- "/Users/dunstanmatekenya/anaconda3/lib/python3.10/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.\n",
- " _warn_prf(average, modifier, msg_start, len(result))\n"
- ]
- }
- ],
- "source": [
- "# Evaluate the classifier\n",
- "y_pred = classifier.predict(X_test)\n",
- "accuracy = accuracy_score(test_labels, y_pred)\n",
- "print(\"Accuracy:\", accuracy)\n",
- "print(classification_report(test_labels, y_pred))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### 6. Classify new documents (new BBC headlines) by vectorizing them using the same TfidfVectorizer and predicting their labels using the trained classifier"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 22,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Predicted labels: ['ship' 'ship' 'acq']\n"
- ]
- }
- ],
- "source": [
- "# Classify new documents (recent headlines obtained from BBC news regarding Tunisia)\n",
- "new_docs = [\n",
- " \"Tunisia says 23 people missing in Mediterranean sea.\",\n",
- " \"Tunisia officials arrested in dispute over flag display.\",\n",
- " \"Tunisia lawyer arrested during live news broadcast.\"\n",
- "]\n",
- "new_docs_vectors = vectorizer.transform(new_docs)\n",
- "predicted_labels = classifier.predict(new_docs_vectors)\n",
- "print(\"Predicted labels:\", predicted_labels)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Discussion"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "How did this classifier fare? What can you do to improve the model? \n",
- "Ans: Experimenting with different preprocessing techniques, feature extraction models and classification algorithms."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Trying with a different classifier"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Steps 1 - 3 will be the same."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Load the Reuters-21578 dataset\n",
- "documents = reuters.fileids()\n",
- "train_docs = list(filter(lambda doc: doc.startswith(\"train\"), documents))\n",
- "test_docs = list(filter(lambda doc: doc.startswith(\"test\"), documents))\n",
- "\n",
- "# Prepare the data\n",
- "train_data = [reuters.raw(doc_id) for doc_id in train_docs]\n",
- "train_labels = [reuters.categories(doc_id)[0] for doc_id in train_docs]\n",
- "test_data = [reuters.raw(doc_id) for doc_id in test_docs]\n",
- "test_labels = [reuters.categories(doc_id)[0] for doc_id in test_docs]\n",
- "\n",
- "# Vectorize the text data\n",
- "vectorizer = CountVectorizer(stop_words=\"english\", max_features=1000)\n",
- "X_train = vectorizer.fit_transform(train_data)\n",
- "X_test = vectorizer.transform(test_data)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Different Classifier (Multinomial Naive Bayes)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "classifier = MultinomialNB()\n",
- "classifier.fit(X_train, train_labels)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Evaluate the classifier\n",
- "y_pred = classifier.predict(X_test)\n",
- "accuracy = accuracy_score(test_labels, y_pred)\n",
- "print(\"Accuracy:\", accuracy)\n",
- "print(classification_report(test_labels, y_pred))"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Classify new documents (recent headlines obtained from BBC news regarding Tunisia)\n",
- "new_docs = [\n",
- " \"Tunisia says 23 people missing in Mediterranean sea.\",\n",
- " \"Tunisia officials arrested in dispute over flag display.\",\n",
- " \"Tunisia lawyer arrested during live news broadcast.\"\n",
- "]\n",
- "new_docs_vectors = vectorizer.transform(new_docs)\n",
- "predicted_labels = classifier.predict(new_docs_vectors)\n",
- "print(\"Predicted labels:\", predicted_labels)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Discussion: Compare the results"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The choice of classifier depends on the specific characteristics of your dataset and the problem at hand. Multinomial Naive Bayes is known to work well with text data and can handle high-dimensional feature spaces efficiently. However, it assumes that the features are independent of each other, which may not always be the case in real-world scenarios.\n",
- "\n",
- "You can also experiment with different classifiers, such as Logistic Regression, Random Forest, or Gradient Boosting, and compare their performance to find the best fit for your dataset. You can also refine the model by trying different feature extraction techniques and hyperparameters."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### There are also other ways you can approach this, for example, Document Classification using BERT. Here is a notebook example on Kaggle that you can explore: https://www.kaggle.com/code/merishnasuwal/document-classification-using-bert"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "BERT (Bidirectional Encoder Representations from Transformers) and other Transformer encoder architectures can also be used on a variety of tasks in NLP (natural language processing). They compute vector-space representations of natural language that are suitable for use in deep learning models. The BERT family of models uses the Transformer encoder architecture to process each token of input text in the full context of all tokens before and after. BERT models are usually pre-trained on a large corpus of text, then fine-tuned for specific tasks."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3 (ipykernel)",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.10.8"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
diff --git a/notebooks/malawi-nov-24/3-intro-langchain.ipynb b/notebooks/malawi-nov-24/3-intro-langchain.ipynb
index 82950fc..dedc751 100644
--- a/notebooks/malawi-nov-24/3-intro-langchain.ipynb
+++ b/notebooks/malawi-nov-24/3-intro-langchain.ipynb
@@ -1,14 +1,6 @@
{
"cells": [
{
- "attachments": {
- "7153af0c-fb8b-4b47-826e-57ac60696e0c.png": {
- "image/png": ""
- },
- "faf11697-6be8-49bc-ab24-b3c4385b8a67.png": {
- "image/png": ""
- }
- },
"cell_type": "markdown",
"id": "740ffa74-4eda-4843-9b5b-486caab1153b",
"metadata": {
@@ -17,15 +9,14 @@
"source": [
"# Introduction to LangChain\n",
"---------\n",
- "![image.png](attachment:faf11697-6be8-49bc-ab24-b3c4385b8a67.png)![image.png](attachment:7153af0c-fb8b-4b47-826e-57ac60696e0c.png)\n",
"\n",
- "**DIHPA'24**\n",
+ "**ICTAM, AI and LLM Training, Nkopola, November 2024**\n",
"\n",
"**Author:** Dunstan Matekenya \n",
"\n",
"**Affiliation:** DECAT, The World Bank Group \n",
"\n",
- "**Date:** May 30, 2024\n",
+ "**Date:** November 2024\n",
"\n",
"\n",
"## What you will learn \n",
@@ -65,15 +56,150 @@
},
{
"cell_type": "code",
- "execution_count": 1,
+ "execution_count": 7,
"id": "b9a13ee9-f3d9-4141-a1a2-929cdc1b5113",
"metadata": {},
- "outputs": [],
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "True"
+ ]
+ },
+ "execution_count": 7,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
"source": [
"import os\n",
- "from pathlib import Path"
+ "from pathlib import Path\n",
+ "\n",
+ "\n",
+ "# =======================\n",
+ "# ENVIRONMENT HANDLING\n",
+ "# =======================\n",
+ "from dotenv import load_dotenv\n",
+ "load_dotenv()"
]
},
+ {
+ "cell_type": "markdown",
+ "id": "51203f5f",
+ "metadata": {},
+ "source": [
+ "# Test OpenAI and HuggingFace"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "id": "28693074",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "44942.48999999999"
+ ]
+ },
+ "execution_count": 16,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "(1498083*0.15) - (1498083*0.12)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "id": "4ee60ee5",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0.75"
+ ]
+ },
+ "execution_count": 18,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "9/12"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "id": "8d810eca",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "44983.889999999985"
+ ]
+ },
+ "execution_count": 11,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "224919.44999999998 - 179935.56"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "9b2abf53",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from langchain_openai import ChatOpenAI\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "id": "98bc75bc",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Result : Malawi is located in southeastern Africa, bordered by Tanzania to the north and northeast, Zambia to the west, and Mozambique to the east, south and southwest.\n"
+ ]
+ }
+ ],
+ "source": [
+ "llm = ChatOpenAI()\n",
+ "res = llm.invoke(\"Where is Malawi?\").content\n",
+ "print(f\"Result : {res}\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "049080c6",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Result : I'm just a computer program, so I don't have feelings or emotions. How can I assist you today?\n"
+ ]
+ }
+ ],
+ "source": []
+ },
{
"cell_type": "markdown",
"id": "1e1dad1b-4014-48e9-b911-2095c9864a84",
@@ -521,27 +647,6 @@
"print(output)"
]
},
- {
- "cell_type": "code",
- "execution_count": 1,
- "id": "e9bda8e1-dce1-49c1-b308-82063fa53e6a",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "3544"
- ]
- },
- "execution_count": 1,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "2*1772"
- ]
- },
{
"cell_type": "markdown",
"id": "7381c4a5-5a33-405c-b82a-baacfebe6e56",
@@ -559,7 +664,7 @@
"id": "f4edd5e5-44fa-49b0-b150-64a580da8f66",
"metadata": {},
"source": [
- "### . Prompt templates\n",
+ "### Prompt templates\n",
"Prompt templates are used for creating prompts in a more modular way, so they can be reused and built on. Chains act as the glue in LangChain; bringing the other components together into workflows that pass inputs and outputs between the different components\n",
"- They are recipes for generating prompts\n",
"- Flexible and modular\n",
@@ -659,14 +764,6 @@
"llm(full_prompt)"
]
},
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "e11ea305-a3c9-439f-b135-236e26c39ac1",
- "metadata": {},
- "outputs": [],
- "source": []
- },
{
"cell_type": "markdown",
"id": "c3e8c433-8672-4ae8-9579-66d5201bc657",
@@ -1876,9 +1973,9 @@
],
"metadata": {
"kernelspec": {
- "display_name": "Python3.12-audio",
+ "display_name": ".venv",
"language": "python",
- "name": "audio"
+ "name": "python3"
},
"language_info": {
"codemirror_mode": {
diff --git a/notebooks/malawi-nov-24/LLM_Teaching_Notebook_with_Conclusion.ipynb b/notebooks/malawi-nov-24/LLM_Teaching_Notebook_with_Conclusion.ipynb
index 3b556e5..21253b9 100644
--- a/notebooks/malawi-nov-24/LLM_Teaching_Notebook_with_Conclusion.ipynb
+++ b/notebooks/malawi-nov-24/LLM_Teaching_Notebook_with_Conclusion.ipynb
@@ -1,3915 +1,3929 @@
{
- "cells": [
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "L-hCapPatC1R"
+ },
+ "source": [
+ "# Introduction to LLMs\n",
+ "Large Language Models (LLMs) are AI models that can generate human-like text based on given prompts. They work by:\n",
+ "\n",
+ "\n",
+ "* Understanding and tokenizing the input text.\n",
+ "* Predicting the most likely sequence of words (tokens) to follow.\n",
+ "* Using probabilities to generate coherent and contextually relevant outputs.\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "VVoiHfQdt0rp"
+ },
+ "source": [
+ "## Setting Up the Environment\n",
+ "Below is the code to set up the environment using the Hugging Face transformers library:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "HvKJfKCws8pQ",
+ "outputId": "133950ce-ede3-4435-c477-17da1f1ce4d4"
+ },
+ "outputs": [
{
- "cell_type": "markdown",
- "metadata": {
- "id": "L-hCapPatC1R"
- },
- "source": [
- "# Introduction to LLMs\n",
- "Large Language Models (LLMs) are AI models that can generate human-like text based on given prompts. They work by:\n",
- "\n",
- "\n",
- "* Understanding and tokenizing the input text.\n",
- "* Predicting the most likely sequence of words (tokens) to follow.\n",
- "* Using probabilities to generate coherent and contextually relevant outputs.\n",
- "\n"
- ]
+ "ename": "ModuleNotFoundError",
+ "evalue": "No module named 'torch'",
+ "output_type": "error",
+ "traceback": [
+ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+ "\u001b[0;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)",
+ "Cell \u001b[0;32mIn[1], line 2\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;66;03m# Importing necessary libraries\u001b[39;00m\n\u001b[0;32m----> 2\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m \u001b[38;5;21;01mtorch\u001b[39;00m\n\u001b[1;32m 3\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mtransformers\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m AutoTokenizer, AutoModelForCausalLM, pipeline, set_seed\n\u001b[1;32m 4\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m \u001b[38;5;21;01mmatplotlib\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mpyplot\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m \u001b[38;5;21;01mplt\u001b[39;00m\n",
+ "\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'torch'"
+ ]
+ }
+ ],
+ "source": [
+ "# Importing necessary libraries\n",
+ "import torch\n",
+ "from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, set_seed\n",
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "# Checking if GPU is available\n",
+ "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
+ "print(f\"Using device: {device}\")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "eUs4IPMTuBof"
+ },
+ "source": [
+ "# Tokenization\n",
+ "In this section, we'll explain tokenization and demonstrate it using a sample prompt."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 397,
+ "referenced_widgets": [
+ "ed52f970e5ae4c11863e8e46f9ae828f",
+ "d54161703d0c489e9cc2e0c2bc8cdde0",
+ "dd73d4c00bcb4815a47e466ffeb764f7",
+ "d0f03e93e94a48e88bb807ce9fcd3626",
+ "247cef68024a44d792583d015eee698f",
+ "eedbbeb237564aa6bf1024699404a0d8",
+ "da9b0dfeb44b46bb81534080eaf0047f",
+ "d3da6b9fec5149f596dd627fdc0ddf28",
+ "7db8af670bd5408ea41df7b09762233d",
+ "909e5b085ea848b29d1aac95c4dd1ee7",
+ "0023a18e9bf7450997c5af11070b5cc2",
+ "643145af92da4183b9607eb7e2e6d0f7",
+ "9d78f9dff3dd45218b53496767b00c70",
+ "66f3d33e3f514f31bf0e6b2ce35801be",
+ "836388e618d944aebb9dce0dea9c56e1",
+ "39cbea48cd8848b19b9dc20736708df1",
+ "ab0a2da0da2e4b3eb966395e54729ffa",
+ "786d8c0044ba4e02809f8edb03af9c2a",
+ "83f9d9548db2456ab0a26a96e03e9581",
+ "58b715f058384647b70276588eb2bf92",
+ "1dd3f850d0d9484bb5313a067392a139",
+ "995591cb74bc4395813e80fe8087f623",
+ "a98d240475f549a894b283fcec9097a8",
+ "f41832a72fdb4cb79530c9fc44693185",
+ "00ad49ed4b5e447eb6915708c628f491",
+ "a4f33b68497e470593bb3c53e2e73703",
+ "83c8a96b28a3406b86167b22a3bf1c7c",
+ "427a17311b0c4419897745ccb319fc0c",
+ "1d77d79208c54108bbe8ef2c72fcbe4c",
+ "5e3f481ead9b4dffb22b019fb7286a75",
+ "9a1af9363cc44ec5a89ddda9a94466fd",
+ "4d2fa322281a45ab8a1464be57439749",
+ "ff32f023a7c043599760bf3b3bf83cb2",
+ "97784968743c46a3b094f942a617f505",
+ "3a22215d257e44da912f098e75bcad67",
+ "459cba8b3a834331985dffea92c541bd",
+ "b155188a446441348e47eeefee321a7b",
+ "e2f11ebb13c24dcd81cdba4bedf3ded4",
+ "355db76789934773b18bbba7be586bf3",
+ "c32dd63b33a147edb63e8b2ef0da5376",
+ "2817038686c042aa836954a2c1bdd314",
+ "59935aff77684051ad3eef45af4a8d39",
+ "4eb227bcee9e409199cc51caab0c7696",
+ "bef5ad3f28674648a365a6f60eb21371",
+ "eb5e86fa57df476ea37013c7912a743e",
+ "7ccbeb6d75264e489e768710b10e0ab0",
+ "5c159af238114e4abe4bd36cf482d119",
+ "2ef6a55cd791473992084f043bd9226c",
+ "f6744817df8a4006bd0ff225aa9e4aab",
+ "7adf04d2c6c04e9b8246c70dfc58fdcc",
+ "bb30e9fd347f4166b1966997d462c5cd",
+ "5af88d5e9e1041b7ab5a8826e3ed1fa5",
+ "20446918559f460da71d704b4a3816c5",
+ "ceee626d21094ac486ddbb88634512ad",
+ "272d275596eb495a889496b640279965",
+ "a8033e173deb48448ab9f3d78bd0339c",
+ "fc63e2a321524fdb85fd891bb611246a",
+ "f04694026e2a41e391281b788248c2a2",
+ "d422c21430884a37b730da4955f3a574",
+ "05cf4d5fea694221aa439a784f45ac94",
+ "a646d1d0286d489c98c81bdffb2559ed",
+ "32ef2f84eb174debae5f2a2e6c8011dd",
+ "fd9e1e0e51974c8a8c473c464f99d018",
+ "c56b0483ad8549089f908989e5616dd1",
+ "d595683824734914a9f67634d4e08f98",
+ "6050b0136e754378b71b1295dab6c2c8",
+ "8832939841fc4180a3aeafdc7aaafb63",
+ "964d2abc5311445d8c9e515fee2ec571",
+ "c8d64d6591db4bec9fb752345e249530",
+ "2ea9aa390af84712893e431108ba80d2",
+ "e5c16313f7c94eb7a390a99f5d09ef77",
+ "9e874cda300f4056807b378d386789cb",
+ "85baaaffafea4de693f7637e5e1f09c8",
+ "6377db896bed44299ffdf88704c925ff",
+ "c8b3fb7df3fa4d0f8d57efe9721d3a67",
+ "6e3b8ae4aa044e8e878c3f3ea1e76887",
+ "e3d9044e0c0f4cf8b3783d617b0e4649"
+ ]
},
+ "id": "Ea5dAG_KuDyl",
+ "outputId": "4a40cd1d-2e02-4999-aa04-398ed4f1cf6e"
+ },
+ "outputs": [
{
- "cell_type": "markdown",
- "metadata": {
- "id": "VVoiHfQdt0rp"
- },
- "source": [
- "##Setting Up the Environment\n",
- "Below is the code to set up the environment using the Hugging Face transformers library:"
- ]
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning: \n",
+ "The secret `HF_TOKEN` does not exist in your Colab secrets.\n",
+ "To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.\n",
+ "You will be able to reuse this secret in all of your notebooks.\n",
+ "Please note that authentication is recommended but still optional to access public models or datasets.\n",
+ " warnings.warn(\n"
+ ]
},
{
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "id": "HvKJfKCws8pQ",
- "outputId": "133950ce-ede3-4435-c477-17da1f1ce4d4"
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "ed52f970e5ae4c11863e8e46f9ae828f",
+ "version_major": 2,
+ "version_minor": 0
},
- "outputs": [
- {
- "output_type": "stream",
- "name": "stdout",
- "text": [
- "Using device: cpu\n"
- ]
- }
- ],
- "source": [
- "# Importing necessary libraries\n",
- "import torch\n",
- "from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, set_seed\n",
- "import matplotlib.pyplot as plt\n",
- "\n",
- "# Checking if GPU is available\n",
- "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
- "print(f\"Using device: {device}\")\n"
+ "text/plain": [
+ "tokenizer_config.json: 0%| | 0.00/26.0 [00:00, ?B/s]"
]
+ },
+ "metadata": {},
+ "output_type": "display_data"
},
{
- "cell_type": "markdown",
- "metadata": {
- "id": "eUs4IPMTuBof"
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "643145af92da4183b9607eb7e2e6d0f7",
+ "version_major": 2,
+ "version_minor": 0
},
- "source": [
- "# Tokenization\n",
- "In this section, we'll explain tokenization and demonstrate it using a sample prompt."
+ "text/plain": [
+ "config.json: 0%| | 0.00/665 [00:00, ?B/s]"
]
+ },
+ "metadata": {},
+ "output_type": "display_data"
},
{
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 397,
- "referenced_widgets": [
- "ed52f970e5ae4c11863e8e46f9ae828f",
- "d54161703d0c489e9cc2e0c2bc8cdde0",
- "dd73d4c00bcb4815a47e466ffeb764f7",
- "d0f03e93e94a48e88bb807ce9fcd3626",
- "247cef68024a44d792583d015eee698f",
- "eedbbeb237564aa6bf1024699404a0d8",
- "da9b0dfeb44b46bb81534080eaf0047f",
- "d3da6b9fec5149f596dd627fdc0ddf28",
- "7db8af670bd5408ea41df7b09762233d",
- "909e5b085ea848b29d1aac95c4dd1ee7",
- "0023a18e9bf7450997c5af11070b5cc2",
- "643145af92da4183b9607eb7e2e6d0f7",
- "9d78f9dff3dd45218b53496767b00c70",
- "66f3d33e3f514f31bf0e6b2ce35801be",
- "836388e618d944aebb9dce0dea9c56e1",
- "39cbea48cd8848b19b9dc20736708df1",
- "ab0a2da0da2e4b3eb966395e54729ffa",
- "786d8c0044ba4e02809f8edb03af9c2a",
- "83f9d9548db2456ab0a26a96e03e9581",
- "58b715f058384647b70276588eb2bf92",
- "1dd3f850d0d9484bb5313a067392a139",
- "995591cb74bc4395813e80fe8087f623",
- "a98d240475f549a894b283fcec9097a8",
- "f41832a72fdb4cb79530c9fc44693185",
- "00ad49ed4b5e447eb6915708c628f491",
- "a4f33b68497e470593bb3c53e2e73703",
- "83c8a96b28a3406b86167b22a3bf1c7c",
- "427a17311b0c4419897745ccb319fc0c",
- "1d77d79208c54108bbe8ef2c72fcbe4c",
- "5e3f481ead9b4dffb22b019fb7286a75",
- "9a1af9363cc44ec5a89ddda9a94466fd",
- "4d2fa322281a45ab8a1464be57439749",
- "ff32f023a7c043599760bf3b3bf83cb2",
- "97784968743c46a3b094f942a617f505",
- "3a22215d257e44da912f098e75bcad67",
- "459cba8b3a834331985dffea92c541bd",
- "b155188a446441348e47eeefee321a7b",
- "e2f11ebb13c24dcd81cdba4bedf3ded4",
- "355db76789934773b18bbba7be586bf3",
- "c32dd63b33a147edb63e8b2ef0da5376",
- "2817038686c042aa836954a2c1bdd314",
- "59935aff77684051ad3eef45af4a8d39",
- "4eb227bcee9e409199cc51caab0c7696",
- "bef5ad3f28674648a365a6f60eb21371",
- "eb5e86fa57df476ea37013c7912a743e",
- "7ccbeb6d75264e489e768710b10e0ab0",
- "5c159af238114e4abe4bd36cf482d119",
- "2ef6a55cd791473992084f043bd9226c",
- "f6744817df8a4006bd0ff225aa9e4aab",
- "7adf04d2c6c04e9b8246c70dfc58fdcc",
- "bb30e9fd347f4166b1966997d462c5cd",
- "5af88d5e9e1041b7ab5a8826e3ed1fa5",
- "20446918559f460da71d704b4a3816c5",
- "ceee626d21094ac486ddbb88634512ad",
- "272d275596eb495a889496b640279965",
- "a8033e173deb48448ab9f3d78bd0339c",
- "fc63e2a321524fdb85fd891bb611246a",
- "f04694026e2a41e391281b788248c2a2",
- "d422c21430884a37b730da4955f3a574",
- "05cf4d5fea694221aa439a784f45ac94",
- "a646d1d0286d489c98c81bdffb2559ed",
- "32ef2f84eb174debae5f2a2e6c8011dd",
- "fd9e1e0e51974c8a8c473c464f99d018",
- "c56b0483ad8549089f908989e5616dd1",
- "d595683824734914a9f67634d4e08f98",
- "6050b0136e754378b71b1295dab6c2c8",
- "8832939841fc4180a3aeafdc7aaafb63",
- "964d2abc5311445d8c9e515fee2ec571",
- "c8d64d6591db4bec9fb752345e249530",
- "2ea9aa390af84712893e431108ba80d2",
- "e5c16313f7c94eb7a390a99f5d09ef77",
- "9e874cda300f4056807b378d386789cb",
- "85baaaffafea4de693f7637e5e1f09c8",
- "6377db896bed44299ffdf88704c925ff",
- "c8b3fb7df3fa4d0f8d57efe9721d3a67",
- "6e3b8ae4aa044e8e878c3f3ea1e76887",
- "e3d9044e0c0f4cf8b3783d617b0e4649"
- ]
- },
- "id": "Ea5dAG_KuDyl",
- "outputId": "4a40cd1d-2e02-4999-aa04-398ed4f1cf6e"
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "a98d240475f549a894b283fcec9097a8",
+ "version_major": 2,
+ "version_minor": 0
},
- "outputs": [
- {
- "output_type": "stream",
- "name": "stderr",
- "text": [
- "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning: \n",
- "The secret `HF_TOKEN` does not exist in your Colab secrets.\n",
- "To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.\n",
- "You will be able to reuse this secret in all of your notebooks.\n",
- "Please note that authentication is recommended but still optional to access public models or datasets.\n",
- " warnings.warn(\n"
- ]
- },
- {
- "output_type": "display_data",
- "data": {
- "text/plain": [
- "tokenizer_config.json: 0%| | 0.00/26.0 [00:00, ?B/s]"
- ],
- "application/vnd.jupyter.widget-view+json": {
- "version_major": 2,
- "version_minor": 0,
- "model_id": "ed52f970e5ae4c11863e8e46f9ae828f"
- }
- },
- "metadata": {}
- },
- {
- "output_type": "display_data",
- "data": {
- "text/plain": [
- "config.json: 0%| | 0.00/665 [00:00, ?B/s]"
- ],
- "application/vnd.jupyter.widget-view+json": {
- "version_major": 2,
- "version_minor": 0,
- "model_id": "643145af92da4183b9607eb7e2e6d0f7"
- }
- },
- "metadata": {}
- },
- {
- "output_type": "display_data",
- "data": {
- "text/plain": [
- "vocab.json: 0%| | 0.00/1.04M [00:00, ?B/s]"
- ],
- "application/vnd.jupyter.widget-view+json": {
- "version_major": 2,
- "version_minor": 0,
- "model_id": "a98d240475f549a894b283fcec9097a8"
- }
- },
- "metadata": {}
- },
- {
- "output_type": "display_data",
- "data": {
- "text/plain": [
- "merges.txt: 0%| | 0.00/456k [00:00, ?B/s]"
- ],
- "application/vnd.jupyter.widget-view+json": {
- "version_major": 2,
- "version_minor": 0,
- "model_id": "97784968743c46a3b094f942a617f505"
- }
- },
- "metadata": {}
- },
- {
- "output_type": "display_data",
- "data": {
- "text/plain": [
- "tokenizer.json: 0%| | 0.00/1.36M [00:00, ?B/s]"
- ],
- "application/vnd.jupyter.widget-view+json": {
- "version_major": 2,
- "version_minor": 0,
- "model_id": "eb5e86fa57df476ea37013c7912a743e"
- }
- },
- "metadata": {}
- },
- {
- "output_type": "display_data",
- "data": {
- "text/plain": [
- "model.safetensors: 0%| | 0.00/548M [00:00, ?B/s]"
- ],
- "application/vnd.jupyter.widget-view+json": {
- "version_major": 2,
- "version_minor": 0,
- "model_id": "a8033e173deb48448ab9f3d78bd0339c"
- }
- },
- "metadata": {}
- },
- {
- "output_type": "display_data",
- "data": {
- "text/plain": [
- "generation_config.json: 0%| | 0.00/124 [00:00, ?B/s]"
- ],
- "application/vnd.jupyter.widget-view+json": {
- "version_major": 2,
- "version_minor": 0,
- "model_id": "8832939841fc4180a3aeafdc7aaafb63"
- }
- },
- "metadata": {}
- },
- {
- "output_type": "stream",
- "name": "stdout",
- "text": [
- "Tokens: tensor([[ 8001, 9542, 9345, 318, 25449, 262, 995, 13]])\n",
- "Decoded Text: Artificial Intelligence is transforming the world.\n"
- ]
- }
- ],
- "source": [
- "# Load a pre-trained GPT model and its tokenizer\n",
- "model_name = \"gpt2\"\n",
- "tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
- "model = AutoModelForCausalLM.from_pretrained(model_name).to(device)\n",
- "\n",
- "# Sample text for tokenization\n",
- "text = \"Artificial Intelligence is transforming the world.\"\n",
- "\n",
- "# Tokenizing the text\n",
- "tokens = tokenizer.encode(text, return_tensors=\"pt\").to(device)\n",
- "print(f\"Tokens: {tokens}\")\n",
- "\n",
- "# Decoding tokens back to text\n",
- "decoded_text = tokenizer.decode(tokens[0])\n",
- "print(f\"Decoded Text: {decoded_text}\")\n"
+ "text/plain": [
+ "vocab.json: 0%| | 0.00/1.04M [00:00, ?B/s]"
]
+ },
+ "metadata": {},
+ "output_type": "display_data"
},
{
- "cell_type": "markdown",
- "metadata": {
- "id": "Swo_D3eOuVO0"
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "97784968743c46a3b094f942a617f505",
+ "version_major": 2,
+ "version_minor": 0
},
- "source": [
- "Tokenization converts text into numerical tokens that the model can understand.\n",
- "Each token represents a word, part of a word, or a special character."
+ "text/plain": [
+ "merges.txt: 0%| | 0.00/456k [00:00, ?B/s]"
]
+ },
+ "metadata": {},
+ "output_type": "display_data"
},
{
- "cell_type": "markdown",
- "metadata": {
- "id": "7lBpkJvRuZdv"
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "eb5e86fa57df476ea37013c7912a743e",
+ "version_major": 2,
+ "version_minor": 0
},
- "source": [
- "# Understanding Token Probabilities\n",
- "We'll explore how LLMs generate text using token probabilities."
+ "text/plain": [
+ "tokenizer.json: 0%| | 0.00/1.36M [00:00, ?B/s]"
]
+ },
+ "metadata": {},
+ "output_type": "display_data"
},
{
- "cell_type": "code",
- "execution_count": 4,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "id": "sn17gwMcuRb7",
- "outputId": "157692d5-0e73-48f5-b8ce-5114ad621423"
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "a8033e173deb48448ab9f3d78bd0339c",
+ "version_major": 2,
+ "version_minor": 0
},
- "outputs": [
- {
- "output_type": "stream",
- "name": "stdout",
- "text": [
- "Token: is, Probability: 0.2314\n",
- "Token: in, Probability: 0.0686\n",
- "Token: and, Probability: 0.0668\n",
- "Token: will, Probability: 0.0649\n",
- "Token: ,, Probability: 0.0427\n"
- ]
- }
- ],
- "source": [
- "# Generate output with token probabilities\n",
- "input_ids = tokenizer.encode(\"The future of AI\", return_tensors=\"pt\").to(device)\n",
- "output = model(input_ids)\n",
- "\n",
- "# Extract logits (raw scores) for the next token prediction\n",
- "logits = output.logits[0, -1, :]\n",
- "probs = torch.softmax(logits, dim=0)\n",
- "\n",
- "# Show the top 5 probable tokens\n",
- "top_k = 5\n",
- "top_k_indices = torch.topk(probs, top_k).indices\n",
- "top_k_probs = torch.topk(probs, top_k).values\n",
- "\n",
- "# Display top-k tokens with their probabilities\n",
- "for i in range(top_k):\n",
- " token = tokenizer.decode(top_k_indices[i].item())\n",
- " print(f\"Token: {token}, Probability: {top_k_probs[i].item():.4f}\")\n"
+ "text/plain": [
+ "model.safetensors: 0%| | 0.00/548M [00:00, ?B/s]"
]
+ },
+ "metadata": {},
+ "output_type": "display_data"
},
{
- "cell_type": "markdown",
- "metadata": {
- "id": "edrAVmvgujfh"
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "8832939841fc4180a3aeafdc7aaafb63",
+ "version_major": 2,
+ "version_minor": 0
},
- "source": [
- "#Text Generation\n",
- "We'll demonstrate how LLMs generate text and how different parameters affect the output."
+ "text/plain": [
+ "generation_config.json: 0%| | 0.00/124 [00:00, ?B/s]"
]
+ },
+ "metadata": {},
+ "output_type": "display_data"
},
{
- "cell_type": "code",
- "execution_count": 5,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "id": "Xa9UVyBMugwg",
- "outputId": "00883b42-8a5c-45c3-f1ba-4d544c4d383f"
- },
- "outputs": [
- {
- "output_type": "stream",
- "name": "stderr",
- "text": [
- "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
- "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n",
- "The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n"
- ]
- },
- {
- "output_type": "stream",
- "name": "stdout",
- "text": [
- "The advancements in artificial intelligence and artificial intelligence technologies have enabled us to build new technologies to help us solve many of the challenges of our time. However, many of our technological achievements have been based on the assumption that humans will eventually become super-intelligent\n"
- ]
- }
- ],
- "source": [
- "# Generating text using different decoding methods\n",
- "def generate_text(prompt, max_length=50, temperature=0.7, top_k=50, top_p=0.9):\n",
- " input_ids = tokenizer(prompt, return_tensors=\"pt\").input_ids.to(device)\n",
- " output = model.generate(\n",
- " input_ids,\n",
- " max_length=max_length,\n",
- " temperature=temperature,\n",
- " top_k=top_k,\n",
- " top_p=top_p,\n",
- " do_sample=True,\n",
- " )\n",
- " return tokenizer.decode(output[0], skip_special_tokens=True)\n",
- "\n",
- "# Example prompt\n",
- "prompt = \"The advancements in artificial intelligence\"\n",
- "print(generate_text(prompt))\n"
- ]
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Tokens: tensor([[ 8001, 9542, 9345, 318, 25449, 262, 995, 13]])\n",
+ "Decoded Text: Artificial Intelligence is transforming the world.\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Load a pre-trained GPT model and its tokenizer\n",
+ "model_name = \"gpt2\"\n",
+ "tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
+ "model = AutoModelForCausalLM.from_pretrained(model_name).to(device)\n",
+ "\n",
+ "# Sample text for tokenization\n",
+ "text = \"Artificial Intelligence is transforming the world.\"\n",
+ "\n",
+ "# Tokenizing the text\n",
+ "tokens = tokenizer.encode(text, return_tensors=\"pt\").to(device)\n",
+ "print(f\"Tokens: {tokens}\")\n",
+ "\n",
+ "# Decoding tokens back to text\n",
+ "decoded_text = tokenizer.decode(tokens[0])\n",
+ "print(f\"Decoded Text: {decoded_text}\")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Swo_D3eOuVO0"
+ },
+ "source": [
+ "Tokenization converts text into numerical token IDs that the model can process.\n",
+ "Each token represents a word, part of a word, or a special character."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "7lBpkJvRuZdv"
+ },
+ "source": [
+ "# Understanding Token Probabilities\n",
+ "We'll explore how LLMs generate text using token probabilities."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
},
+ "id": "sn17gwMcuRb7",
+ "outputId": "157692d5-0e73-48f5-b8ce-5114ad621423"
+ },
+ "outputs": [
{
- "cell_type": "markdown",
- "metadata": {
- "id": "y39j59swuueC"
- },
- "source": [
- "Temperature: Controls randomness in generation. Higher values (e.g., 1.0) make output more random.\n",
- "\n",
- "Top-k Sampling: Limits selection to the top-k probable tokens.\n",
- "\n",
- "Top-p Sampling (Nucleus Sampling): Limits selection to tokens that cover a cumulative probability of p."
- ]
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Token: is, Probability: 0.2314\n",
+ "Token: in, Probability: 0.0686\n",
+ "Token: and, Probability: 0.0668\n",
+ "Token: will, Probability: 0.0649\n",
+ "Token: ,, Probability: 0.0427\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Generate output with token probabilities\n",
+ "input_ids = tokenizer.encode(\"The future of AI\", return_tensors=\"pt\").to(device)\n",
+ "output = model(input_ids)\n",
+ "\n",
+ "# Extract logits (raw scores) for the next token prediction\n",
+ "logits = output.logits[0, -1, :]\n",
+ "probs = torch.softmax(logits, dim=0)\n",
+ "\n",
+ "# Show the 5 most probable next tokens\n",
+ "top_k = 5\n",
+ "# torch.topk returns (values, indices), so a single call provides both\n",
+ "top_k_probs, top_k_indices = torch.topk(probs, top_k)\n",
+ "\n",
+ "# Display top-k tokens with their probabilities\n",
+ "for i in range(top_k):\n",
+ " token = tokenizer.decode(top_k_indices[i].item())\n",
+ " print(f\"Token: {token}, Probability: {top_k_probs[i].item():.4f}\")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "edrAVmvgujfh"
+ },
+ "source": [
+ "# Text Generation\n",
+ "We'll demonstrate how LLMs generate text and how different parameters affect the output."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
},
+ "id": "Xa9UVyBMugwg",
+ "outputId": "00883b42-8a5c-45c3-f1ba-4d544c4d383f"
+ },
+ "outputs": [
{
- "cell_type": "markdown",
- "metadata": {
- "id": "txvqGkF1u0UV"
- },
- "source": [
- "#Stochastic Nature of LLMs\n",
- "Here, we'll show how randomness affects outputs by generating multiple responses for the same prompt."
- ]
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
+ "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n",
+ "The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n"
+ ]
},
{
- "cell_type": "code",
- "execution_count": 6,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "id": "xcnsCdHlu4v6",
- "outputId": "28c11c0e-20ef-4469-a00e-641770274de3"
- },
- "outputs": [
- {
- "output_type": "stream",
- "name": "stderr",
- "text": [
- "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
- "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n",
- "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
- "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n"
- ]
- },
- {
- "output_type": "stream",
- "name": "stdout",
- "text": [
- "Output 1: The advancements in artificial intelligence have allowed researchers to create more accurate models of human behavior, including how people behave in the world. The results are presented at the International Conference on Artificial Intelligence (ICAI) in Seoul, South Korea, on October 7-\n",
- "\n"
- ]
- },
- {
- "output_type": "stream",
- "name": "stderr",
- "text": [
- "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
- "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n"
- ]
- },
- {
- "output_type": "stream",
- "name": "stdout",
- "text": [
- "Output 2: The advancements in artificial intelligence have been made possible by the development of new algorithms that are able to recognize people's faces in a variety of ways, from the way they look to how they behave.\n",
- "\n",
- "The first such algorithm, known as the Face\n",
- "\n",
- "Output 3: The advancements in artificial intelligence and robotics have made it possible to create a computer that is capable of performing complex tasks that have never been done before.\n",
- "\n",
- "\"We're doing things that will not be possible before,\" said Dr. Gail M.\n",
- "\n"
- ]
- }
- ],
- "source": [
- "# Generate multiple outputs for the same prompt\n",
- "set_seed(42) # Setting a seed for reproducibility\n",
- "for i in range(3):\n",
- " print(f\"Output {i+1}: {generate_text(prompt)}\\n\")\n"
- ]
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "The advancements in artificial intelligence and artificial intelligence technologies have enabled us to build new technologies to help us solve many of the challenges of our time. However, many of our technological achievements have been based on the assumption that humans will eventually become super-intelligent\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Generating text using different decoding methods\n",
+ "def generate_text(prompt, max_length=50, temperature=0.7, top_k=50, top_p=0.9):\n",
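+ " # Tokenize the prompt and keep the attention mask so generate() does not have to infer it\n",
+ " inputs = tokenizer(prompt, return_tensors=\"pt\").to(device)\n",
+ " output = model.generate(\n",
+ " **inputs,\n",
+ " max_length=max_length,\n",
+ " temperature=temperature,\n",
+ " top_k=top_k,\n",
+ " top_p=top_p,\n",
+ " do_sample=True,\n",
+ " pad_token_id=tokenizer.eos_token_id, # GPT-2 defines no pad token, so reuse EOS\n",
+ " )\n",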
+ " return tokenizer.decode(output[0], skip_special_tokens=True)\n",
+ "\n",
+ "# Example prompt\n",
+ "prompt = \"The advancements in artificial intelligence\"\n",
+ "print(generate_text(prompt))\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "y39j59swuueC"
+ },
+ "source": [
+ "Temperature: Controls randomness in generation. Higher values (e.g., 1.0) make the output more random; lower values (e.g., 0.7) concentrate it on the most likely tokens.\n",
+ "\n",
+ "Top-k Sampling: Limits selection to the k most probable tokens.\n",
+ "\n",
+ "Top-p Sampling (Nucleus Sampling): Limits selection to the smallest set of most probable tokens whose cumulative probability reaches p."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "txvqGkF1u0UV"
+ },
+ "source": [
+ "# Stochastic Nature of LLMs\n",
+ "Here, we'll show how randomness affects outputs by generating multiple responses for the same prompt."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
},
+ "id": "xcnsCdHlu4v6",
+ "outputId": "28c11c0e-20ef-4469-a00e-641770274de3"
+ },
+ "outputs": [
{
- "cell_type": "markdown",
- "metadata": {
- "id": "9H14GOeivCXC"
- },
- "source": [
- " # Prompt Engineering\n",
- "\n",
- "This section covers how modifying prompts can change the outputs generated."
- ]
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
+ "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n",
+ "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
+ "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n"
+ ]
},
{
- "cell_type": "code",
- "execution_count": 7,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "id": "EQTyEIbcu_TI",
- "outputId": "1470032c-e517-49ed-9daf-2cc382fb33ed"
- },
- "outputs": [
- {
- "output_type": "stream",
- "name": "stderr",
- "text": [
- "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
- "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n"
- ]
- },
- {
- "output_type": "stream",
- "name": "stdout",
- "text": [
- "Prompt: Explain the concept of machine learning.\n"
- ]
- },
- {
- "output_type": "stream",
- "name": "stderr",
- "text": [
- "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
- "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n"
- ]
- },
- {
- "output_type": "stream",
- "name": "stdout",
- "text": [
- "Explain the concept of machine learning.\n",
- "\n",
- "This is the third article in a series of articles in Computer Vision and Machine Learning.\n",
- "\n",
- "The first article is from the Computer Vision and Machine Learning Institute.\n",
- "\n",
- "The second article is from the\n",
- "\n",
- "--------------------------------------------------\n",
- "\n",
- "Prompt: Explain the concept of machine learning in simple terms.\n"
- ]
- },
- {
- "output_type": "stream",
- "name": "stderr",
- "text": [
- "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
- "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n"
- ]
- },
- {
- "output_type": "stream",
- "name": "stdout",
- "text": [
- "Explain the concept of machine learning in simple terms.\n",
- "\n",
- "The machine learning is a system of learning that can be applied to any task. In particular, machine learning is a system that can be applied to any task in the world.\n",
- "\n",
- "\n",
- "\n",
- "--------------------------------------------------\n",
- "\n",
- "Prompt: Explain machine learning to a 10-year-old.\n",
- "Explain machine learning to a 10-year-old.\n",
- "\n",
- "The next step is to find a way to teach students the basics of machine learning.\n",
- "\n",
- "\"We have some of the most advanced systems available,\" said Chris Binder, a\n",
- "\n",
- "--------------------------------------------------\n",
- "\n"
- ]
- }
- ],
- "source": [
- "# Testing prompt engineering\n",
- "prompts = [\n",
- " \"Explain the concept of machine learning.\",\n",
- " \"Explain the concept of machine learning in simple terms.\",\n",
- " \"Explain machine learning to a 10-year-old.\",\n",
- "]\n",
- "\n",
- "for p in prompts:\n",
- " print(f\"Prompt: {p}\")\n",
- " print(generate_text(p))\n",
- " print(\"\\n\" + \"-\"*50 + \"\\n\")\n"
- ]
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Output 1: The advancements in artificial intelligence have allowed researchers to create more accurate models of human behavior, including how people behave in the world. The results are presented at the International Conference on Artificial Intelligence (ICAI) in Seoul, South Korea, on October 7-\n",
+ "\n"
+ ]
},
{
- "cell_type": "markdown",
- "metadata": {
- "id": "2fa5476a"
- },
- "source": [
- "\n",
- "## Tokenization Explained in Detail\n",
- "\n",
- "In this section, we'll dive deeper into tokenization. The process of tokenization is essential for transforming raw text into tokens that the model can understand. Each token represents a word, part of a word, or even special characters.\n",
- "\n",
- "Let's see a practical demonstration of how a sentence is tokenized using a pre-trained GPT model. We will also visualize how text is broken down into tokens.\n"
- ]
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
+ "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n"
+ ]
},
{
- "cell_type": "code",
- "execution_count": 8,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 444
- },
- "id": "a0c8b391",
- "outputId": "6bd7567c-5647-4d80-8918-abb3e72b2fc2"
- },
- "outputs": [
- {
- "output_type": "stream",
- "name": "stdout",
- "text": [
- "Tokens: tensor([[ 8001, 9542, 9345, 318, 25449, 262, 995, 13]])\n",
- "Decoded Text: Artificial Intelligence is transforming the world.\n"
- ]
- },
- {
- "output_type": "display_data",
- "data": {
- "text/plain": [
- "