chore: Initialize repository
LyubomirT committed Jul 27, 2024
1 parent 2768613 commit 3fcb155
Showing 2 changed files with 561,810 additions and 0 deletions.
1 change: 1 addition & 0 deletions lyubomirt-toxicity-detector-nb.ipynb
@@ -0,0 +1 @@
{"cells":[{"cell_type":"markdown","metadata":{},"source":["# Welcome to the LyubomirT's Toxicity Detector model notebook!\n","\n","This notebook trains and runs a lightweight BERT-based model for detecting toxicity in text. The model is trained on the Severity of Toxic Comments dataset, which is a binary classification task. The model is trained using the `transformers` library and `torch`.\n","\n","To train the model yourself, please run all the cells from the beginning. Note that you will need a GPU to train the model in a reasonable amount of time.\n","\n","## Quick inference\n","\n","If you just want to see the model in action, you can skip the training and jump straight to the \"Inference\" section. There, you can input your own text and see the model's predictions. Run all cells except the \"Training\" section."]},{"cell_type":"code","execution_count":null,"metadata":{"_cell_guid":"b1076dfc-b9ad-4769-8c92-a6c4dae69d19","_uuid":"8f2839f25d086af736a60e9eeb907d3b93b6e0e5","execution":{"iopub.execute_input":"2024-07-27T04:33:59.593672Z","iopub.status.busy":"2024-07-27T04:33:59.592911Z","iopub.status.idle":"2024-07-27T04:33:59.599547Z","shell.execute_reply":"2024-07-27T04:33:59.598629Z","shell.execute_reply.started":"2024-07-27T04:33:59.593612Z"},"trusted":true},"outputs":[],"source":["!pip install pandas\n","!pip install transformers\n","!pip install torch\n","!pip install tqdm\n","!pip install scikit-learn\n","\n","# Cell 1: Import necessary libraries\n","import pandas as pd\n","import numpy as np\n","import torch\n","from torch.utils.data import Dataset, DataLoader\n","from transformers import BertTokenizer, BertForSequenceClassification\n","from sklearn.model_selection import train_test_split\n","from sklearn.metrics import classification_report\n","import torch.nn.functional as F\n","from tqdm import tqdm\n","import os"]},{"cell_type":"markdown","metadata":{},"source":["# Read and preprocess the data\n","\n","First, we need to load the data and preprocess it. 
We will use the `pandas` library for both steps. In the dataset, we have labels for toxicity such as `toxic`, `severe_toxic`, `obscene`, `threat`, and `insult`. We will use all of these labels to train the model."]},{"cell_type":"code","execution_count":12,"metadata":{"execution":{"iopub.execute_input":"2024-07-27T04:33:59.602304Z","iopub.status.busy":"2024-07-27T04:33:59.601868Z","iopub.status.idle":"2024-07-27T04:34:00.934430Z","shell.execute_reply":"2024-07-27T04:34:00.933663Z","shell.execute_reply.started":"2024-07-27T04:33:59.602268Z"},"trusted":true},"outputs":[],"source":["# Cell 2: Data Loading and Preprocessing\n","# Load the dataset\n","df = pd.read_csv('train.csv')\n","\n","# Define the labels we want to predict\n","labels = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult']\n","\n","# Split the data into training and validation sets\n","train_texts, val_texts, train_labels, val_labels = train_test_split(\n"," df['comment_text'].tolist(),\n"," df[labels].values,\n"," test_size=0.2,\n"," random_state=42\n",")\n","\n","# Initialize the BERT tokenizer\n","tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')"]},{"cell_type":"markdown","metadata":{},"source":["# The dataset\n","\n","The dataset we will use is the \"Severity of Toxic Comments\" dataset. It is a multi-label classification task where the goal is to predict, for each comment, which toxicity labels apply. The dataset contains comments from Wikipedia's talk page edits. 
Each comment is labeled with one or more of the following labels: `toxic`, `severe_toxic`, `obscene`, `threat`, `insult`, and `identity_hate`."]},{"cell_type":"code","execution_count":8,"metadata":{"execution":{"iopub.execute_input":"2024-07-27T04:34:00.935832Z","iopub.status.busy":"2024-07-27T04:34:00.935562Z","iopub.status.idle":"2024-07-27T04:34:00.945295Z","shell.execute_reply":"2024-07-27T04:34:00.944474Z","shell.execute_reply.started":"2024-07-27T04:34:00.935809Z"},"trusted":true},"outputs":[{"ename":"NameError","evalue":"name 'train_texts' is not defined","output_type":"error","traceback":["\u001b[1;31m---------------------------------------------------------------------------\u001b[0m","\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)","Cell \u001b[1;32mIn[8], line 35\u001b[0m\n\u001b[0;32m 27\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m {\n\u001b[0;32m 28\u001b[0m \u001b[38;5;124m'\u001b[39m\u001b[38;5;124mtext\u001b[39m\u001b[38;5;124m'\u001b[39m: text,\n\u001b[0;32m 29\u001b[0m \u001b[38;5;124m'\u001b[39m\u001b[38;5;124minput_ids\u001b[39m\u001b[38;5;124m'\u001b[39m: encoding[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124minput_ids\u001b[39m\u001b[38;5;124m'\u001b[39m]\u001b[38;5;241m.\u001b[39mflatten(),\n\u001b[0;32m 30\u001b[0m \u001b[38;5;124m'\u001b[39m\u001b[38;5;124mattention_mask\u001b[39m\u001b[38;5;124m'\u001b[39m: encoding[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mattention_mask\u001b[39m\u001b[38;5;124m'\u001b[39m]\u001b[38;5;241m.\u001b[39mflatten(),\n\u001b[0;32m 31\u001b[0m \u001b[38;5;124m'\u001b[39m\u001b[38;5;124mlabels\u001b[39m\u001b[38;5;124m'\u001b[39m: torch\u001b[38;5;241m.\u001b[39mFloatTensor(label)\n\u001b[0;32m 32\u001b[0m }\n\u001b[0;32m 34\u001b[0m \u001b[38;5;66;03m# Create Dataset objects\u001b[39;00m\n\u001b[1;32m---> 35\u001b[0m train_dataset \u001b[38;5;241m=\u001b[39m ToxicDataset(\u001b[43mtrain_texts\u001b[49m, train_labels, tokenizer)\n\u001b[0;32m 36\u001b[0m val_dataset \u001b[38;5;241m=\u001b[39m 
ToxicDataset(val_texts, val_labels, tokenizer)\n\u001b[0;32m 38\u001b[0m \u001b[38;5;66;03m# Create DataLoader objects\u001b[39;00m\n","\u001b[1;31mNameError\u001b[0m: name 'train_texts' is not defined"]}],"source":["# Cell 3: Dataset and DataLoader\n","class ToxicDataset(Dataset):\n"," def __init__(self, texts, labels, tokenizer, max_len=128):\n"," self.texts = texts\n"," self.labels = labels\n"," self.tokenizer = tokenizer\n"," self.max_len = max_len\n","\n"," def __len__(self):\n"," return len(self.texts)\n","\n"," def __getitem__(self, item):\n"," text = str(self.texts[item])\n"," label = self.labels[item]\n","\n"," encoding = self.tokenizer.encode_plus(\n"," text,\n"," add_special_tokens=True,\n"," max_length=self.max_len,\n"," return_token_type_ids=False,\n"," padding='max_length',\n"," truncation=True,\n"," return_attention_mask=True,\n"," return_tensors='pt',\n"," )\n","\n"," return {\n"," 'text': text,\n"," 'input_ids': encoding['input_ids'].flatten(),\n"," 'attention_mask': encoding['attention_mask'].flatten(),\n"," 'labels': torch.FloatTensor(label)\n"," }\n","\n","# Create Dataset objects\n","train_dataset = ToxicDataset(train_texts, train_labels, tokenizer)\n","val_dataset = ToxicDataset(val_texts, val_labels, tokenizer)\n","\n","# Create DataLoader objects\n","train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)\n","val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)"]},{"cell_type":"markdown","metadata":{},"source":["# Model Definition\n","\n","This notebook uses the `transformers` library to load a pre-trained BERT model and fine-tune it on the Severity of toxic comments dataset. The model is defined as a class `ToxicClassifier` that inherits from `torch.nn.Module`. The model uses the `BertForSequenceClassification` class from the `transformers` library to load a pre-trained BERT model with a classification head. 
The model is initialized with the number of classes in the dataset."]},{"cell_type":"code","execution_count":7,"metadata":{"execution":{"iopub.execute_input":"2024-07-27T04:34:00.946698Z","iopub.status.busy":"2024-07-27T04:34:00.946410Z","iopub.status.idle":"2024-07-27T04:34:01.356857Z","shell.execute_reply":"2024-07-27T04:34:01.356093Z","shell.execute_reply.started":"2024-07-27T04:34:00.946675Z"},"trusted":true},"outputs":[{"ename":"NameError","evalue":"name 'labels' is not defined","output_type":"error","traceback":["\u001b[1;31m---------------------------------------------------------------------------\u001b[0m","\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)","Cell \u001b[1;32mIn[7], line 12\u001b[0m\n\u001b[0;32m 9\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m output\u001b[38;5;241m.\u001b[39mlogits\n\u001b[0;32m 11\u001b[0m \u001b[38;5;66;03m# Initialize the model\u001b[39;00m\n\u001b[1;32m---> 12\u001b[0m model \u001b[38;5;241m=\u001b[39m ToxicClassifier(\u001b[38;5;28mlen\u001b[39m(\u001b[43mlabels\u001b[49m))\n\u001b[0;32m 14\u001b[0m \u001b[38;5;66;03m# Move the model to GPU if available\u001b[39;00m\n\u001b[0;32m 15\u001b[0m device \u001b[38;5;241m=\u001b[39m torch\u001b[38;5;241m.\u001b[39mdevice(\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mcuda\u001b[39m\u001b[38;5;124m'\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m torch\u001b[38;5;241m.\u001b[39mcuda\u001b[38;5;241m.\u001b[39mis_available() \u001b[38;5;28;01melse\u001b[39;00m \u001b[38;5;124m'\u001b[39m\u001b[38;5;124mcpu\u001b[39m\u001b[38;5;124m'\u001b[39m)\n","\u001b[1;31mNameError\u001b[0m: name 'labels' is not defined"]}],"source":["# Cell 4: Model Definition\n","class ToxicClassifier(torch.nn.Module):\n"," def __init__(self, n_classes):\n"," super(ToxicClassifier, self).__init__()\n"," self.bert = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=n_classes)\n"," \n"," def forward(self, input_ids, attention_mask):\n"," output = 
self.bert(input_ids=input_ids, attention_mask=attention_mask)\n"," return output.logits\n","\n","# Initialize the model\n","model = ToxicClassifier(len(labels))\n","\n","# Move the model to GPU if available\n","device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n","model = model.to(device)"]},{"cell_type":"markdown","metadata":{},"source":["# Training Loop\n","\n","This is the training loop of the LyubomirT-Toxic-Detector model. We train the model using the AdamW optimizer with a learning rate of 2e-5 and a StepLR scheduler with a step size of 1 and a gamma of 0.1, so the learning rate is multiplied by 0.1 after every epoch. We use the binary cross-entropy loss function (applied to the logits) and mixed precision training to speed up the training process. We also use gradient accumulation to reduce memory usage during training."]},{"cell_type":"code","execution_count":null,"metadata":{"execution":{"iopub.execute_input":"2024-07-27T04:34:01.359409Z","iopub.status.busy":"2024-07-27T04:34:01.359065Z","iopub.status.idle":"2024-07-27T05:19:09.894324Z","shell.execute_reply":"2024-07-27T05:19:09.893218Z","shell.execute_reply.started":"2024-07-27T04:34:01.359375Z"},"trusted":true},"outputs":[],"source":["# Cell 5: Training Loop\n","def train_model(model, train_loader, val_loader, epochs=3, accumulation_steps=2):\n"," optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)\n"," scaler = torch.cuda.amp.GradScaler() # Mixed Precision Training\n"," scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1) # Learning Rate Scheduler\n","\n"," for epoch in range(epochs):\n"," model.train()\n"," train_loss = 0\n"," optimizer.zero_grad()\n"," for i, batch in enumerate(tqdm(train_loader, desc=f'Epoch {epoch + 1}/{epochs}')):\n"," input_ids = batch['input_ids'].to(device)\n"," attention_mask = batch['attention_mask'].to(device)\n"," labels = batch['labels'].to(device)\n","\n"," with torch.cuda.amp.autocast(): # Mixed Precision Training\n"," outputs = model(input_ids, attention_mask)\n"," loss = 
F.binary_cross_entropy_with_logits(outputs, labels)\n"," \n"," scaler.scale(loss).backward()\n"," \n"," # Gradient accumulation\n"," if (i + 1) % accumulation_steps == 0:\n"," scaler.step(optimizer)\n"," scaler.update()\n"," optimizer.zero_grad()\n","\n"," train_loss += loss.item()\n","\n"," # Validation\n"," model.eval()\n"," val_loss = 0\n"," predictions = []\n"," true_labels = []\n"," with torch.no_grad():\n"," for batch in tqdm(val_loader, desc='Validation'):\n"," input_ids = batch['input_ids'].to(device)\n"," attention_mask = batch['attention_mask'].to(device)\n"," labels = batch['labels'].to(device)\n","\n"," with torch.cuda.amp.autocast(): # Mixed Precision Training\n"," outputs = model(input_ids, attention_mask)\n"," loss = F.binary_cross_entropy_with_logits(outputs, labels)\n"," \n"," val_loss += loss.item()\n","\n"," predictions.extend(torch.sigmoid(outputs).cpu().numpy())\n"," true_labels.extend(labels.cpu().numpy())\n","\n"," avg_train_loss = train_loss / len(train_loader)\n"," avg_val_loss = val_loss / len(val_loader)\n"," print(f'Epoch {epoch + 1}/{epochs}:')\n"," print(f'Train Loss: {avg_train_loss:.4f}')\n"," print(f'Validation Loss: {avg_val_loss:.4f}')\n","\n"," scheduler.step() # Update the learning rate\n","\n","# Adjust the DataLoader for better performance\n","train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True, num_workers=4, pin_memory=True)\n","val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False, num_workers=4, pin_memory=True)\n","\n","# Train the model\n","train_model(model, train_loader, val_loader)\n"]},{"cell_type":"markdown","metadata":{},"source":["# Inference\n","\n","To run inference without training, just run all cells except the one with the training loop. The model is already trained and saved as `lyubomirt-toxicity-detector.pth`. 
Feel free to input your own text in the `text` variable to see the model's output.\n","\n","Also note that the model doesn't require a GPU to run inference, so you can run the code on a CPU. It's still pretty fast: my 14-core CPU takes around 0.1 seconds to process a single text."]},{"cell_type":"code","execution_count":16,"metadata":{"execution":{"iopub.execute_input":"2024-07-27T05:24:44.049351Z","iopub.status.busy":"2024-07-27T05:24:44.048691Z","iopub.status.idle":"2024-07-27T05:24:59.528859Z","shell.execute_reply":"2024-07-27T05:24:59.527847Z","shell.execute_reply.started":"2024-07-27T05:24:44.049318Z"},"trusted":true},"outputs":[{"name":"stderr","output_type":"stream","text":["Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']\n","You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"]},{"name":"stdout","output_type":"stream","text":["Model loaded from file.\n","Input text: Hello!\n","toxic: Probability = 0.0008, Prediction = 0\n","severe_toxic: Probability = 0.0004, Prediction = 0\n","obscene: Probability = 0.0005, Prediction = 0\n","threat: Probability = 0.0003, Prediction = 0\n","insult: Probability = 0.0004, Prediction = 0\n"]}],"source":["# Define the path to save/load the model\n","MODEL_PATH = 'lyubomirt-toxicity-detector.pth'\n","model = ToxicClassifier(len(labels))\n","device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n","model = model.to(device)\n","\n","# Cell 6: Model Initialization\n","def initialize_model():\n"," # Check if the model file exists\n"," if os.path.exists(MODEL_PATH):\n"," # Load the existing model\n"," model.load_state_dict(torch.load(MODEL_PATH, map_location=device))\n"," model.eval()\n"," print(\"Model loaded from file.\")\n"," else:\n"," # Initialize and train your model here (this is just a placeholder)\n"," # For example: 
model = YourModelClass()\n"," # Train the model\n"," # Save the model after training\n"," torch.save(model.state_dict(), MODEL_PATH)\n"," print(\"Model saved to file.\")\n","\n","# Call this function once to initialize the model\n","initialize_model()\n","\n","# Cell 6: Inference Function\n","def predict_toxicity(text):\n"," model.eval()\n"," encoding = tokenizer.encode_plus(\n"," text,\n"," add_special_tokens=True,\n"," max_length=128,\n"," return_token_type_ids=False,\n"," padding='max_length',\n"," truncation=True,\n"," return_attention_mask=True,\n"," return_tensors='pt',\n"," )\n"," \n"," input_ids = encoding['input_ids'].to(device)\n"," attention_mask = encoding['attention_mask'].to(device)\n"," \n"," with torch.no_grad():\n"," outputs = model(input_ids, attention_mask)\n"," \n"," probabilities = torch.sigmoid(outputs).cpu().numpy()[0]\n"," predictions = (probabilities > 0.5).astype(int)\n"," \n"," result = {}\n"," for label, prob, pred in zip(labels, probabilities, predictions):\n"," result[label] = {'probability': float(prob), 'prediction': int(pred)}\n"," \n"," return result\n","\n","# Example usage\n","text = input(\"Enter text: \")\n","result = predict_toxicity(text)\n","print(f\"Input text: {text}\")\n","for label, values in result.items():\n"," print(f\"{label}: Probability = {values['probability']:.4f}, Prediction = {values['prediction']}\")"]}],"metadata":{"kaggle":{"accelerator":"nvidiaTeslaT4","dataSources":[{"datasetId":1840062,"sourceId":3003760,"sourceType":"datasetVersion"}],"dockerImageVersionId":30747,"isGpuEnabled":true,"isInternetEnabled":true,"language":"python","sourceType":"notebook"},"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.8.6"}},"nbformat":4,"nbformat_minor":4}
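The inference cell in the notebook above turns the model's raw outputs into one independent yes/no decision per label by applying a sigmoid and a 0.5 cutoff. A minimal sketch of just that post-processing step, in plain Python without `torch`, might look like the following. The function name `logits_to_predictions`, the `LABELS` constant, and the example logits are hypothetical and chosen only for illustration; they are not part of the committed notebook.

```python
import math

# Same five labels the notebook trains on (copied from its `labels` list).
LABELS = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult']


def sigmoid(x: float) -> float:
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def logits_to_predictions(logits, threshold=0.5):
    """Convert one logit per label into a probability and a 0/1 decision.

    Each label is decided independently (multi-label), mirroring the
    `probabilities > 0.5` comparison in the notebook's inference cell.
    """
    result = {}
    for label, logit in zip(LABELS, logits):
        prob = sigmoid(logit)
        result[label] = {'probability': prob, 'prediction': int(prob > threshold)}
    return result


# Made-up logits, purely illustrative (not real model output).
example = logits_to_predictions([2.0, -3.0, 0.5, -6.0, 0.0])
for label, values in example.items():
    print(f"{label}: Probability = {values['probability']:.4f}, Prediction = {values['prediction']}")
```

Because the comparison is strict (`>`), a probability of exactly 0.5 counts as not toxic. Lowering `threshold` for rare labels such as `threat` is one possible way to trade precision for recall, though the notebook itself uses 0.5 for every label.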