[Grupa 3] Michał Piasecki + Bartłomiej Siński [milestone2] Feature Engineering #132

Open · wants to merge 20 commits into main
Empty file.
286 changes: 286 additions & 0 deletions Prace_domowe/Praca_domowa2/[Grupa 3] PD 2 Michał Piasecki/PD2.ipynb
@@ -0,0 +1,286 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import sklearn\n",
"import category_encoders as ce\n",
"from sklearn.impute import KNNImputer\n",
"import math\n",
"import matplotlib.pyplot as plt\n",
"from IPython.display import clear_output\n",
"\n",
"import warnings\n",
"warnings.filterwarnings('ignore')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"allegro = pd.read_csv(\"allegro-api-transactions.csv\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"allegro.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Target encoding na allegro_data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"allegro_target_encoding = allegro.copy(deep = True)\n",
"means = allegro.groupby(\"it_location\")['price'].mean()\n",
"allegro_target_encoding['it_location'] = allegro_target_encoding['it_location'].map(means)\n",
"allegro_target_encoding.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"W przypadku onehot encodingu, ze względu na to że mamy bardzo dużo wartości w kolumnie 'it_location', pojawi się bardzo dużo nowych kolumn. Może okazać się, że będziemy mieli za dużą ilość kolumn żeby wytrenować nasz model. One hot encoding będzie dobrym rozwiązaniem jeśli tabela kategoryczna, która przekształcamy ma mało różnych wartości."
]
},
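{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick way to check whether one-hot encoding is feasible is to look at the cardinality of the candidate columns. This is only an illustrative sketch (the column names come from the cells above); the exact counts depend on the dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# number of distinct values per categorical column;\n",
"# one-hot encoding adds one new column per distinct value\n",
"print(allegro['it_location'].nunique())\n",
"print(allegro['main_category'].nunique())"
]
},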
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# One hot encoding"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"allegro_one_hot = allegro.copy(deep = True)\n",
"main_categories = pd.get_dummies(allegro['main_category'])\n",
"allegro_one_hot = pd.concat([allegro_one_hot, main_categories], axis=1)\n",
"allegro_one_hot.drop(labels = ['main_category'], axis = 1,inplace = True)\n",
"allegro_one_hot"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Metoda one hot encoding tworzy nam z kolumny o wyrazach kategorycznych macierz ( n x m ), gdzie n - liczba wierszy, m - liczba kategorii w kolumnie. Każdy wiersz ma jedynkę w tej kolumnie z której miał wartośc i zera w innych kolumnach. Dobra metoda gdy mamy mało kategorii."
]
},
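{
"cell_type": "markdown",
"metadata": {},
"source": [
"A tiny toy example of the (n x m) indicator matrix described above, using a made-up column rather than the allegro data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# three rows, two distinct categories -> a 3 x 2 indicator matrix\n",
"toy = pd.Series(['book', 'phone', 'book'], name='category')\n",
"pd.get_dummies(toy)"
]
},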
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"allegro_cat = allegro.copy(deep = True)\n",
"CatBoostEncoder = ce.CatBoostEncoder()\n",
"CatBoostEncoder = CatBoostEncoder.fit(allegro['main_category'], y = allegro['price'])\n",
"allegro_cat['main_category'] = CatBoostEncoder.transform(allegro_cat[\"main_category\"])\n",
"allegro_cat"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Metoda działa tak samo jak target encoding z tą różnicą, że pomija wartość targetu aktualnego wiersza, aby zminimalizować efekt outlierów. Metoda działa w locie."
]
},
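{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal hand-written pandas sketch of the idea of leaving the current row's target out. Note that ce.CatBoostEncoder additionally processes rows in order and mixes in a prior, so the numbers below only illustrate the principle, not the exact values the library returns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"toy = pd.DataFrame({'cat': ['a', 'a', 'b', 'a', 'b'],\n",
"                    'price': [10.0, 20.0, 5.0, 30.0, 15.0]})\n",
"grp = toy.groupby('cat')['price']\n",
"# leave-one-out mean: (group sum - own target) / (group count - 1)\n",
"toy['loo_mean'] = (grp.transform('sum') - toy['price']) / (grp.transform('count') - 1)\n",
"toy"
]
},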
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"allegro_count = allegro.copy(deep = True)\n",
"CountEncoder = ce.CountEncoder()\n",
"CountEncoder = CountEncoder.fit(allegro_count['main_category'])\n",
"allegro_count['main_category'] = CountEncoder.transform(allegro_count[\"main_category\"])\n",
"allegro_count"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Metoda zamienia każdą wartość na ilość wystąpień danej kategorii w kolumnie."
]
},
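{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same counting can be done directly in pandas, which makes the behaviour easy to see; this is only an equivalent sketch of what ce.CountEncoder does for a single column:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# map each category to how many times it appears in the column\n",
"counts = allegro['main_category'].value_counts()\n",
"allegro['main_category'].map(counts)"
]
},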
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"allegro_reduced = allegro[['price','it_seller_rating', 'it_quantity']]\n",
"allegro_reduced = allegro_reduced[0:40000]\n",
"n = len(allegro_reduced['price'])\n",
"allegro_reduced"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Zmniejszylem liczbe rekordow bo bardzo wolno mi dzialal KKNIImputer :("
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"results = [0 for i in range(10)]\n",
"for i in range(10):\n",
" X = allegro_reduced.copy(deep=True)\n",
" # zmieniam ziarno za kazdym razem zeby miec rozne wiersze w roznich eksperymentach\n",
" np.random.seed((i+1) * 10)\n",
" chosen_idx = np.random.choice(n , replace = False, size = 4000)\n",
" X.it_seller_rating[chosen_idx] = np.nan\n",
" imputer = KNNImputer(n_neighbors=5, weights=\"uniform\")\n",
" X = imputer.fit_transform(X)\n",
" results[i] = math.sqrt(np.sum(pow(X[:,1] - allegro_reduced[\"it_seller_rating\"], 2)) / n)\n",
" "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"results "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"x = [i for i in range(10)]\n",
"plt.scatter(x,results)\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"standard_deviation = math.sqrt(np.sum(pow(results - results.mean(),2)))\n",
"standard_deviation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To odchylenie wyszlo bardzo duże i wgl jakoś kiepsko to wygląda"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"second_results = [0 for i in range(10)]\n",
"for i in range(10):\n",
" X = allegro_reduced.copy(deep=True)\n",
" # zmieniam ziarno za kazdym razem zeby miec rozne wiersze w roznich eksperymentach\n",
" np.random.seed((i+1) * 10)\n",
" chosen_idx = np.random.choice(n , replace = False, size = 4000)\n",
" X.it_seller_rating[chosen_idx] = np.nan\n",
" X.it_quantity[chosen_idx] = np.nan\n",
" imputer = KNNImputer(n_neighbors=5, weights=\"uniform\")\n",
" X = imputer.fit_transform(X)\n",
" second_results[i] = math.sqrt(np.sum(pow(X[:,1] - allegro_reduced[\"it_seller_rating\"], 2)) / n)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"second_results"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"second_standard_deviation = math.sqrt(np.sum(pow(second_results - results.mean(),2)))\n",
"second_standard_deviation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"x = [i for i in range(10)]\n",
"plt.scatter(x,second_results)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Po usunięciu wartości z obu kolumn różnice w wynikach pomiędzy prawdziwą kolumną it_seller_rating, a to stworzoną po algorytmie k-nearest-neighbours są bardzo duże, również odchylenie standardowe jest znacznie większe. Wydaje mi się, że w obydwu przypadkach błąd jest stosunkowo duży."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.7"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
Binary file not shown.
@@ -0,0 +1 @@
