pushing assignments #2

Open
wants to merge 44 commits into
base: main
44 commits
de60a2e
pushing to Quiz folder
rcghpge Feb 24, 2024
e748e64
pushing Lab 4 to Github repo
rcghpge Feb 28, 2024
6395160
Add files via upload
rcghpge Mar 1, 2024
3556674
Add files via upload
rcghpge Mar 1, 2024
a49fb70
Add files via upload
rcghpge Mar 1, 2024
3f3c25f
Add files via upload
rcghpge Mar 7, 2024
1f7d6df
Create t.txt
rcghpge Mar 13, 2024
c1e2890
Add files via upload
rcghpge Mar 13, 2024
2b1b3dc
Lab 5
rcghpge Mar 26, 2024
743218a
Lab 6
rcghpge Mar 26, 2024
b183184
UML Diagram
rcghpge Mar 26, 2024
eded6a3
UML Diagram
rcghpge Mar 26, 2024
7c873af
UML Diagram
rcghpge Mar 26, 2024
3428cf3
Rename Lab.6 - RobertCocker .ipynb to Lab.6 - RobertCocker.ipynb
rcghpge Mar 26, 2024
f2113d7
quiz 2
rcghpge Apr 1, 2024
f142783
Create Lab.7
rcghpge Apr 6, 2024
9e38d8d
Delete Labs/Lab.7
rcghpge Apr 6, 2024
864420f
Create t.txt
rcghpge Apr 6, 2024
04df9f6
Delete Labs/Lab.7/t directory
rcghpge Apr 6, 2024
1ea722b
Create t.txt
rcghpge Apr 6, 2024
05187fb
Delete Labs/Lab.7/t.txt
rcghpge Apr 6, 2024
ea3c4f6
Create t.txt
rcghpge Apr 6, 2024
fee3fcb
Lab 7 materials
rcghpge Apr 6, 2024
31513b0
Lab 7
rcghpge Apr 6, 2024
6a2b3a4
Delete Labs/Lab.7/t.txt
rcghpge Apr 6, 2024
946d58a
Lab 7
rcghpge Apr 6, 2024
0df581f
Lab 7
rcghpge Apr 6, 2024
1cae978
Lab 7 updated
rcghpge Apr 7, 2024
615526d
Create t.txt
rcghpge Apr 18, 2024
73411e6
Delete Labs/Lab.8/t.txt
rcghpge Apr 18, 2024
d3040d8
Create t.txt
rcghpge Apr 18, 2024
2718fdf
Add files via upload
rcghpge Apr 18, 2024
79a6d95
Delete Labs/Lab.8/README.md
rcghpge Apr 18, 2024
aabaff0
Kaggle instructions
rcghpge Apr 18, 2024
db91a9a
Create t.txt
rcghpge Apr 18, 2024
7dc8783
Kaggle project instructions
rcghpge Apr 18, 2024
9acdc1c
Delete Labs/Kaggle Tabular Data.ipynb
rcghpge Apr 18, 2024
efa706b
Delete Kaggle Project/t.txt
rcghpge Apr 18, 2024
974e54c
Project template
rcghpge Apr 18, 2024
7a3c608
Lab 8
rcghpge Apr 19, 2024
00cd960
Delete Labs/Lab.8/t.txt
rcghpge Apr 19, 2024
0113f9d
Lab 8
rcghpge Apr 19, 2024
563ef0b
Update README.md
rcghpge Apr 23, 2024
94545eb
Update README.md
rcghpge May 13, 2024
146 changes: 146 additions & 0 deletions Kaggle Project/Kaggle Tabular Data.ipynb
@@ -0,0 +1,146 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "e64a1b88",
"metadata": {},
"source": [
"# Tabular Kaggle Project\n",
"\n",
"Guidelines for the steps of the Kaggle Tabular Project. You will \"turn in\" a GitHub repository, modeled after the [Project Template](https://github.com/UTA-DataScience/ProjectTempate), on the day of the final, May 3rd at 1:30 pm. During the final period we will have about 5 minutes to go over your project and your results.\n",
"\n",
"You can find a list of possible tabular datasets in this [Excel File in Teams](https://mavsuta.sharepoint.com/:x:/r/sites/Course_2242_data_3402_001-vUhPXzAGLgTnk/Shared%20Documents/General/TabularDatasets.xlsx?d=w17e157db75904dfcb03a78c84f10e2e6&csf=1&web=1&e=KHi7m9). You are not limited to these datasets. If you find a Kaggle challenge not listed that you would like to attempt, please check with Dr. Farbin to make sure it is viable.\n",
"\n",
"This notebook outlines the steps you should follow. The file(s) in the GitHub repository should contain these steps. Note that you will only be considering classification projects.\n",
"\n",
"## Define Project\n",
"\n",
"* Provide Project link.\n",
"* Short paragraph describing the challenge. \n",
"* Briefly describe the data.\n"
]
},
{
"cell_type": "markdown",
"id": "a65cd3e3",
"metadata": {},
"source": [
"## Data Loading and Initial Look\n",
"\n",
"* Load the data. \n",
"* Count the number of rows (data points) and features.\n",
"* Any missing values? \n",
"* Make a table, where each row is a feature or collection of features:\n",
" * Is the feature categorical or numerical\n",
" * What values? \n",
" * e.g. for categorical: \"0,1,2\"\n",
" * e.g. for numerical specify the range\n",
" * How many missing values\n",
" * Do you see any outliers?\n",
" * Define outlier.\n",
"* For classification is there class imbalance?\n",
"* What is the target:\n",
" * Classification: how is the target encoded (e.g. 0 and 1)?\n",
" * Regression: what is the range?"
]
},
{
"cell_type": "markdown",
"id": "27c59841",
"metadata": {},
"source": [
"## Data Visualization\n",
"\n",
"* For classification: compare histograms of every feature between the classes. There are many examples of this in class.\n",
"* For regression: \n",
" * Define 2 or more classes based on the value of the regression target.\n",
" * For example: if regression target is between 0 and 1:\n",
" * 0.0-0.25: Class 1\n",
" * 0.25-0.5: Class 2\n",
" * 0.5-0.75: Class 3\n",
" * 0.75-1.0: Class 4\n",
" * Compare histograms of the features between the classes.\n",
" \n",
"* Note that for categorical features, the information in the histogram can often be better presented in a table. \n",
"* Comment on which features look most promising for the ML task."
]
},
{
"cell_type": "markdown",
"id": "ba73f3b0",
"metadata": {},
"source": [
"## Data Cleaning and Preparation for Machine Learning\n",
"\n",
"* Perform any data cleaning. Be clear about what you are doing and for which feature. \n",
"* Determine if rescaling is important for your Machine Learning model.\n",
" * If so, select a strategy for each feature.\n",
" * Apply rescaling.\n",
"* Visualize the features before and after cleaning and rescaling.\n",
"* One-hot encode your categorical features."
]
},
{
"cell_type": "markdown",
"id": "39c8d295",
"metadata": {},
"source": [
"## Machine Learning\n",
"\n",
"\n",
"### Problem Formulation\n",
"\n",
"* Remove unneeded columns, for example:\n",
" * duplicated\n",
" * categorical features that were turned into one-hot.\n",
" * features that identify specific rows, like ID number.\n",
" * make sure your target is properly encoded also.\n",
"* Split training sample into train, validation, and test sub-samples.\n",
"\n",
"### Train ML Algorithm\n",
"\n",
"* You only need one algorithm to work. You can do more if you like.\n",
"* For now, focus on making it work, rather than best result.\n",
"* Try to get a non-trivial result.\n",
"\n",
"### Evaluate Performance on Validation Sample\n",
"\n",
"* Compute the usual metric for your ML task.\n",
"* Compute the score for the Kaggle challenge.\n",
"\n",
"### Apply ML to the challenge test set\n",
"\n",
"* Once trained, apply the ML algorithm to the test dataset and generate the submission file.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "12b0e44d",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
121 changes: 121 additions & 0 deletions Kaggle Project/README.md
@@ -0,0 +1,121 @@
![](UTA-DataScience-Logo.png)

# Project Title

* **One Sentence Summary** Ex: This repository holds an attempt to apply LSTMs to stock market data from the
"Get Rich" Kaggle challenge (provide link).

## Overview

* This section could contain a short paragraph that includes the following:
  * **Definition of the tasks / challenge** Ex: The task, as defined by the Kaggle challenge, is to use a time series of 12 features, sampled daily for 1 month, to predict the next day's price of a stock.
  * **Your approach** Ex: The approach in this repository formulates the problem as a regression task, using deep recurrent neural networks as the model with the full time series of features as input. We compared the performance of 3 different network architectures.
  * **Summary of the performance achieved** Ex: Our best model was able to predict the next day stock price within 23%, 90% of the time. At the time of writing, the best performance on Kaggle of this metric is 18%.

## Summary of Work Done

Include only the sections that are relevant and appropriate.

### Data

* Data:
  * Type: For example
    * Input: medical images (1000x1000 pixel jpegs), CSV file: image filename -> diagnosis
    * Input: CSV file of features, output: signal/background flag in 1st column.
  * Size: How much data?
  * Instances (Train, Test, Validation Split): how many data points? Ex: 1000 patients for training, 200 for testing, none for validation
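
To illustrate the train/validation/test split mentioned above, here is a minimal sketch that divides labelled data points into three subsets while roughly preserving class proportions. It uses only the Python standard library. The helper name `stratified_split`, the 70/15/15 fractions, and the toy data are assumptions made for this example and are not part of the template. In practice, scikit-learn's `train_test_split` with its `stratify` argument is a common choice.

```python
import random
from collections import defaultdict


def stratified_split(rows, labels, fractions=(0.7, 0.15, 0.15), seed=0):
    """Split (row, label) pairs into train/validation/test subsets.

    Hypothetical helper: each class is shuffled and split separately so
    that class proportions stay roughly equal across the three subsets.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append((row, label))

    train, val, test = [], [], []
    for items in by_class.values():
        rng.shuffle(items)
        n_train = round(len(items) * fractions[0])
        n_val = round(len(items) * fractions[1])
        train.extend(items[:n_train])
        val.extend(items[n_train:n_train + n_val])
        test.extend(items[n_train + n_val:])  # whatever remains goes to test
    return train, val, test


# Toy data (made up): 100 points, 80 of class 0 and 20 of class 1.
rows = list(range(100))
labels = [0] * 80 + [1] * 20
train, val, test = stratified_split(rows, labels)
print(len(train), len(val), len(test))
```

Reporting the three subset sizes, as in the last line, gives the numbers asked for in the "Instances" bullet.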

#### Preprocessing / Clean up

* Describe any manipulations you performed to the data.
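
As one possible example of such manipulations, the sketch below rescales a numerical feature to the range [0, 1] and one-hot encodes a categorical feature. It is a dependency-free illustration under stated assumptions: the function names and the toy columns are invented for this example. Libraries such as pandas (`get_dummies`) and scikit-learn (`MinMaxScaler`) provide equivalent functionality.

```python
def min_max_rescale(values):
    """Rescale numerical values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant feature: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot_encode(values):
    """Turn a categorical column into one 0/1 column per category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded


ages = [20, 30, 40]              # made-up numerical feature
colors = ["red", "blue", "red"]  # made-up categorical feature

print(min_max_rescale(ages))     # [0.0, 0.5, 1.0]
categories, encoded = one_hot_encode(colors)
print(categories)                # ['blue', 'red']
print(encoded)                   # [[0, 1], [1, 0], [0, 1]]
```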

#### Data Visualization

Show a few visualizations of the data and say a few words about what you see.

### Problem Formulation

* Define:
  * Input / Output
  * Models
    * Describe the different models you tried and why.
  * Loss, Optimizer, other Hyperparameters.

### Training

* Describe the training:
  * How you trained: software and hardware.
  * How long did training take?
  * Training curves (loss vs epoch for test/train).
  * How did you decide to stop training?
  * Any difficulties? How did you resolve them?

### Performance Comparison

* Clearly define the key performance metric(s).
* Show/compare results in one table.
* Show one (or few) visualization(s) of results, for example ROC curves.
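
For a binary classification task, two commonly reported metrics are accuracy and the area under the ROC curve (AUC). The sketch below computes both with the standard library only, assuming labels encoded as 0 and 1 and a model that outputs a score per data point. The function names, the 0.5 threshold, and the toy scores are assumptions for illustration. scikit-learn's `accuracy_score` and `roc_auc_score` are the usual tools.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """ROC AUC as the probability that a random positive example
    scores higher than a random negative one (ties count as half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]            # made-up labels
scores = [0.1, 0.4, 0.35, 0.8]   # made-up model scores
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(accuracy(y_true, y_pred))  # 0.75
print(roc_auc(y_true, scores))   # 0.75
```

The pairwise comparison is quadratic in the number of data points, which is fine for a small validation sample but slow for large ones.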

### Conclusions

* State any conclusions you can infer from your work. Example: LSTMs work better than GRUs.

### Future Work

* What would be the next thing that you would try?
* What are some other studies that can be done starting from here?

## How to reproduce results

* In this section, provide instructions for at least one of the following:
  * Reproduce your results fully, including training.
  * Apply this package to other data. For example, how to use the model you trained.
  * Use this package to perform their own study.
* Also describe what resources to use for this package, if appropriate. For example, point them to Colab and TPUs.

### Overview of files in repository

* Describe the directory structure, if any.
* List all relevant files and describe their role in the package.
* An example:
  * utils.py: various functions that are used in cleaning and visualizing data.
  * preprocess.ipynb: Takes input data in CSV and writes out data frame after cleanup.
  * visualization.ipynb: Creates various visualizations of the data.
  * models.py: Contains functions that build the various models.
  * training-model-1.ipynb: Trains the first model and saves model during training.
  * training-model-2.ipynb: Trains the second model and saves model during training.
  * training-model-3.ipynb: Trains the third model and saves model during training.
  * performance.ipynb: loads multiple trained models and compares results.
  * inference.ipynb: loads a trained model and applies it to test data to create Kaggle submission.

* Note that all of these notebooks should contain enough text for someone to understand what is happening.

### Software Setup
* List all of the required packages.
* If not standard, provide or point to instructions for installing the packages.
* Describe how to install your package.
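
One way to produce the package list is to record the installed version of each package you import. The sketch below does this with `importlib.metadata` from the standard library (Python 3.8 or later). The helper name `installed_versions` and the package names in the example are assumptions; replace them with whatever your project actually uses.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(package_names):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = version(name)
        except PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


# Hypothetical package list for a tabular-data project.
for name, ver in installed_versions(["numpy", "pandas", "scikit-learn"]).items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```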

### Data

* Point to where they can download the data.
* Lead them through preprocessing steps, if necessary.

### Training

* Describe how to train the model.

#### Performance Evaluation

* Describe how to run the performance evaluation.


## Citations

* Provide any references.







Binary file added Kaggle Project/UTA-DataScience-Logo.png