UTA-DataScience · rcghpge · Feb 24, 2024 · Feb 28, 2024 · Mar 1, 2024 · Mar 1, 2024
diff --git a/Kaggle Project/Kaggle Tabular Data.ipynb b/Kaggle Project/Kaggle Tabular Data.ipynb
@@ -0,0 +1,146 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "e64a1b88",
+   "metadata": {},
+   "source": [
+    "# Tabular Kaggle Project\n",
+    "\n",
+    "Guidelines for the steps of the Kaggle Tabular Project. You will \"turn in\" a GitHub repository, modeled after [Project Template](https://github.com/UTA-DataScience/ProjectTempate), on the day of the final, May 3rd at 1:30 pm. During the final period we will have about 5 minutes to go over your project and your results.\n",
+    "\n",
+    "You can find a list of possible Tabular datasets here on [Excel File in Teams](https://mavsuta.sharepoint.com/:x:/r/sites/Course_2242_data_3402_001-vUhPXzAGLgTnk/Shared%20Documents/General/TabularDatasets.xlsx?d=w17e157db75904dfcb03a78c84f10e2e6&csf=1&web=1&e=KHi7m9). You are not limited to these datasets. If you find a Kaggle challenge not listed that you would like to attempt, please go check with Dr. Farbin to make sure it is viable.\n",
+    "\n",
+    "This notebook outlines the steps you should follow. The file(s) in the GitHub repository should contain these steps. Note that you will only be considering classification projects.\n",
+    "\n",
+    "## Define Project\n",
+    "\n",
+    "* Provide Project link.\n",
+    "* Short paragraph describing the challenge. \n",
+    "* Briefly describe the data.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a65cd3e3",
+   "metadata": {},
+   "source": [
+    "## Data Loading and Initial Look\n",
+    "\n",
+    "* Load the data. \n",
+    "* Count the number of rows (data points) and features.\n",
+    "* Any missing values? \n",
+    "* Make a table, where each row is a feature or collection of features:\n",
+    "    * Is the feature categorical or numerical\n",
+    "    * What values? \n",
+    "        * e.g. for categorical: \"0,1,2\"\n",
+    "        * e.g. for numerical specify the range\n",
+    "    * How many missing values\n",
+    "    * Do you see any outliers?\n",
+    "        * Define outlier.\n",
+    "* For classification is there class imbalance?\n",
+    "* What is the target:\n",
+    "    * Classification: how is the target encoded (e.g. 0 and 1)?\n",
+    "    * Regression: what is the range?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "27c59841",
+   "metadata": {},
+   "source": [
+    "## Data Visualization\n",
+    "\n",
+    "* For classification: compare histograms of every feature between the classes. There are lots of examples of this from class.\n",
+    "* For regression: \n",
+    "    * Define 2 or more classes based on the value of the regression target.\n",
+    "        * For example: if the regression target is between 0 and 1:\n",
+    "            * 0.0-0.25: Class 1\n",
+    "            * 0.25-0.5: Class 2\n",
+    "            * 0.5-0.75: Class 3\n",
+    "            * 0.75-1.0: Class 4\n",
+    "    * Compare histograms of the features between the classes.\n",
+    "        \n",
+    "* Note that for categorical features, often times the information in the histogram could be better presented in a table.    \n",
+    "* Comment on which features look most promising for the ML task."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ba73f3b0",
+   "metadata": {},
+   "source": [
+    "## Data Cleaning and Preparation for Machine Learning\n",
+    "\n",
+    "* Perform any data cleaning. Be clear about what you are doing and for which feature.\n",
+    "* Determine if rescaling is important for your Machine Learning model.\n",
+    "    * If so, select a strategy for each feature.\n",
+    "    * Apply rescaling.\n",
+    "* Visualize the features before and after cleaning and rescaling.\n",
+    "* One-hot encode your categorical features."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "39c8d295",
+   "metadata": {},
+   "source": [
+    "## Machine Learning\n",
+    "\n",
+    "\n",
+    "### Problem Formulation\n",
+    "\n",
+    "* Remove unneeded columns, for example:\n",
+    "    * duplicated\n",
+    "    * categorical features that were turned into one-hot.\n",
+    "    * features that identify specific rows, like ID number.\n",
+    "    * make sure your target is properly encoded also.\n",
+    "* Split training sample into train, validation, and test sub-samples.\n",
+    "\n",
+    "### Train ML Algorithm\n",
+    "\n",
+    "* You only need one algorithm to work. You can do more if you like.\n",
+    "* For now, focus on making it work, rather than on the best result.\n",
+    "* Try to get a non-trivial result.\n",
+    "\n",
+    "### Evaluate Performance on Validation Sample\n",
+    "\n",
+    "* Compute the usual metric for your ML task.\n",
+    "* Compute the score for the Kaggle challenge.\n",
+    "\n",
+    "### Apply ML to the challenge test set\n",
+    "\n",
+    "* Once trained, apply the ML algorithm to the test dataset and generate the submission file.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "12b0e44d",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.6"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
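The "Data Loading and Initial Look" checklist in the notebook above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not part of the course template: the toy CSV, its column names (`id`, `age`, `color`, `target`), and the helpers `is_number` and `summarize_features` are all hypothetical, and a real project would more likely load the Kaggle file with a data-frame library.

```python
import csv
import io
from collections import Counter

# Hypothetical toy data standing in for a Kaggle training file;
# the column names and values are made up for illustration.
TOY_CSV = """id,age,color,target
1,34,red,0
2,,blue,1
3,51,red,0
4,29,,0
5,40,green,1
"""


def is_number(text):
    """Return True if the string parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def summarize_features(rows, columns):
    """Build one record per column: kind, values or range, missing count."""
    summary = {}
    for col in columns:
        # Empty strings are treated as missing values (an assumption).
        present = [r[col] for r in rows if r[col] != ""]
        missing = len(rows) - len(present)
        if present and all(is_number(v) for v in present):
            nums = [float(v) for v in present]
            summary[col] = {
                "kind": "numerical",
                "values": (min(nums), max(nums)),  # observed range
                "missing": missing,
            }
        else:
            summary[col] = {
                "kind": "categorical",
                "values": sorted(set(present)),  # distinct categories
                "missing": missing,
            }
    return summary


rows = list(csv.DictReader(io.StringIO(TOY_CSV)))
columns = list(rows[0].keys())
print(f"{len(rows)} rows, {len(columns)} columns")

summary = summarize_features(rows, columns)
for col, info in summary.items():
    print(col, info)

# Class balance: share of rows per target class.
counts = Counter(r["target"] for r in rows)
balance = {label: n / len(rows) for label, n in counts.items()}
print(balance)
```

Note that this heuristic flags integer-coded categorical columns (including a 0/1 target or an ID column) as numerical, so the resulting table still needs a manual pass before it goes into the notebook.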
diff --git a/Kaggle Project/README.md b/Kaggle Project/README.md
@@ -0,0 +1,121 @@
+![](UTA-DataScience-Logo.png)
+
+# Project Title
+
+* **One Sentence Summary** Ex: This repository holds an attempt to apply LSTMs to the stock market using data from
+the "Get Rich" Kaggle challenge (provide link).
+
+## Overview
+
+* This section could contain a short paragraph which includes the following:
+  * **Definition of the tasks / challenge**  Ex: The task, as defined by the Kaggle challenge, is to use a time series of 12 features, sampled daily for 1 month, to predict the next day's price of a stock.
+  * **Your approach** Ex: The approach in this repository formulates the problem as a regression task, using deep recurrent neural networks as the model with the full time series of features as input. We compared the performance of 3 different network architectures.
+  * **Summary of the performance achieved** Ex: Our best model was able to predict the next day's stock price within 23%, 90% of the time. At the time of writing, the best performance on Kaggle on this metric is 18%.
+
+## Summary of Work Done
+
+Include only the sections that are relevant and appropriate.
+
+### Data
+
+* Data:
+  * Type: For example
+    * Input: medical images (1000x1000 pixel jpegs), CSV file: image filename -> diagnosis
+    * Input: CSV file of features, output: signal/background flag in 1st column.
+  * Size: How much data?
+  * Instances (Train, Test, Validation Split): how many data points? Ex: 1000 patients for training, 200 for testing, none for validation
+
+#### Preprocessing / Clean up
+
+* Describe any manipulations you performed to the data.
+
+#### Data Visualization
+
+Show a few visualizations of the data and say a few words about what you see.
+
+### Problem Formulation
+
+* Define:
+  * Input / Output
+  * Models
+    * Describe the different models you tried and why.
+  * Loss, Optimizer, other Hyperparameters.
+
+### Training
+
+* Describe the training:
+  * How you trained: software and hardware.
+  * How long did training take.
+  * Training curves (loss vs epoch for test/train).
+  * How did you decide to stop training.
+  * Any difficulties? How did you resolve them?
+
+### Performance Comparison
+
+* Clearly define the key performance metric(s).
+* Show/compare results in one table.
+* Show one (or few) visualization(s) of results, for example ROC curves.
+
+### Conclusions
+
+* State any conclusions you can infer from your work. Example: LSTMs work better than GRUs.
+
+### Future Work
+
+* What would be the next thing that you would try.
+* What are some other studies that can be done starting from here.
+
+## How to reproduce results
+
+* In this section, provide instructions for at least one of the following:
+   * Reproduce your results fully, including training.
+   * Apply this package to other data. For example, how to use the model you trained.
+   * Use this package to perform their own study.
+* Also describe what resources to use for this package, if appropriate. For example, point them to Colab and TPUs.
+
+### Overview of files in repository
+
+* Describe the directory structure, if any.
+* List all relevant files and describe their role in the package.
+* An example:
+  * utils.py: various functions that are used in cleaning and visualizing data.
+  * preprocess.ipynb: Takes input data in CSV and writes out data frame after cleanup.
+  * visualization.ipynb: Creates various visualizations of the data.
+  * models.py: Contains functions that build the various models.
+  * training-model-1.ipynb: Trains the first model and saves model during training.
+  * training-model-2.ipynb: Trains the second model and saves model during training.
+  * training-model-3.ipynb: Trains the third model and saves model during training.
+  * performance.ipynb: loads multiple trained models and compares results.
+  * inference.ipynb: loads a trained model and applies it to test data to create the Kaggle submission.
+
+* Note that all of these notebooks should contain enough text for someone to understand what is happening.
+
+### Software Setup
+* List all of the required packages.
+* If not standard, provide or point to instructions for installing the packages.
+* Describe how to install your package.
+
+### Data
+
+* Point to where they can download the data.
+* Lead them through preprocessing steps, if necessary.
+
+### Training
+
+* Describe how to train the model.
+
+#### Performance Evaluation
+
+* Describe how to run the performance evaluation.
+
+
+## Citations
+
+* Provide any references.
+
+
+
+
+
+
+
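The README's "Performance Comparison" items (define the key metrics, then show results in one table) could look something like the sketch below for a binary classification task. It assumes labels encoded as 0 and 1; the validation labels, the two model names, their predictions, and the helper functions are all hypothetical, and the metrics shown are common choices rather than the score any particular Kaggle challenge uses.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    return tp, fp, fn, tn


def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, and recall from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    return {
        "accuracy": (tp + tn) / len(y_true),
        # Fall back to 0.0 when a denominator would be zero.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def results_table(results):
    """Render {model name: metrics} as a Markdown table for the README."""
    lines = [
        "| Model | Accuracy | Precision | Recall |",
        "| --- | --- | --- | --- |",
    ]
    for name, m in results.items():
        lines.append(
            f"| {name} | {m['accuracy']:.2f} | {m['precision']:.2f} | {m['recall']:.2f} |"
        )
    return "\n".join(lines)


# Hypothetical validation labels and predictions from two made-up models.
y_val = [1, 0, 1, 1, 0, 0, 1, 0]
preds = {
    "model-1": [1, 0, 0, 1, 0, 1, 1, 0],
    "model-2": [1, 1, 1, 1, 0, 1, 1, 0],
}

results = {name: binary_metrics(y_val, p) for name, p in preds.items()}
table = results_table(results)
print(table)
```

The printed table can be pasted directly into the README. In this toy example both models reach the same accuracy while differing in precision and recall, which is one reason to state the key metric explicitly.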
diff --git a/Kaggle Project/UTA-DataScience-Logo.png b/Kaggle Project/UTA-DataScience-Logo.png