[Grupa 3] Michał Piasecki + Bartłomiej Siński [milestone2] Feature Engineering #132

Open · wants to merge 20 commits into main
Empty file.
286 changes: 286 additions & 0 deletions Prace_domowe/Praca_domowa2/[Grupa 3] PD 2 Michał Piasecki/PD2.ipynb
@@ -0,0 +1,286 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import sklearn\n",
"import category_encoders as ce\n",
"from sklearn.impute import KNNImputer\n",
"import math\n",
"import matplotlib.pyplot as plt\n",
"from IPython.display import clear_output\n",
"\n",
"import warnings\n",
"warnings.filterwarnings('ignore')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"allegro = pd.read_csv(\"allegro-api-transactions.csv\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"allegro.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Target encoding na allegro_data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"allegro_target_encoding = allegro.copy(deep = True)\n",
"means = allegro.groupby(\"it_location\")['price'].mean()\n",
"allegro_target_encoding['it_location'] = allegro_target_encoding['it_location'].map(means)\n",
"allegro_target_encoding.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"W przypadku onehot encodingu, ze względu na to że mamy bardzo dużo wartości w kolumnie 'it_location', pojawi się bardzo dużo nowych kolumn. Może okazać się, że będziemy mieli za dużą ilość kolumn żeby wytrenować nasz model. One hot encoding będzie dobrym rozwiązaniem jeśli tabela kategoryczna, która przekształcamy ma mało różnych wartości."
]
},
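{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick way to check whether one-hot encoding is feasible is to look at the cardinality of the candidate columns. This is only an illustrative sketch (the column names come from the cells above); the exact counts depend on the dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# number of distinct values per categorical column;\n",
"# one-hot encoding adds one new column per distinct value\n",
"print(allegro['it_location'].nunique())\n",
"print(allegro['main_category'].nunique())"
]
},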
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# One hot encoding"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"allegro_one_hot = allegro.copy(deep = True)\n",
"main_categories = pd.get_dummies(allegro['main_category'])\n",
"allegro_one_hot = pd.concat([allegro_one_hot, main_categories], axis=1)\n",
"allegro_one_hot.drop(labels = ['main_category'], axis = 1,inplace = True)\n",
"allegro_one_hot"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Metoda one hot encoding tworzy nam z kolumny o wyrazach kategorycznych macierz ( n x m ), gdzie n - liczba wierszy, m - liczba kategorii w kolumnie. Każdy wiersz ma jedynkę w tej kolumnie z której miał wartośc i zera w innych kolumnach. Dobra metoda gdy mamy mało kategorii."
]
},
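{
"cell_type": "markdown",
"metadata": {},
"source": [
"A tiny toy example of the (n x m) indicator matrix described above, using a made-up column rather than the allegro data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# three rows, two distinct categories -> a 3 x 2 indicator matrix\n",
"toy = pd.Series(['book', 'phone', 'book'], name='category')\n",
"pd.get_dummies(toy)"
]
},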
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"allegro_cat = allegro.copy(deep = True)\n",
"CatBoostEncoder = ce.CatBoostEncoder()\n",
"CatBoostEncoder = CatBoostEncoder.fit(allegro['main_category'], y = allegro['price'])\n",
"allegro_cat['main_category'] = CatBoostEncoder.transform(allegro_cat[\"main_category\"])\n",
"allegro_cat"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Metoda działa tak samo jak target encoding z tą różnicą, że pomija wartość targetu aktualnego wiersza, aby zminimalizować efekt outlierów. Metoda działa w locie."
]
},
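{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal hand-written pandas sketch of the idea of leaving the current row's target out. Note that ce.CatBoostEncoder additionally processes rows in order and mixes in a prior, so the numbers below only illustrate the principle, not the exact values the library returns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"toy = pd.DataFrame({'cat': ['a', 'a', 'b', 'a', 'b'],\n",
"                    'price': [10.0, 20.0, 5.0, 30.0, 15.0]})\n",
"grp = toy.groupby('cat')['price']\n",
"# leave-one-out mean: (group sum - own target) / (group count - 1)\n",
"toy['loo_mean'] = (grp.transform('sum') - toy['price']) / (grp.transform('count') - 1)\n",
"toy"
]
},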
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"allegro_count = allegro.copy(deep = True)\n",
"CountEncoder = ce.CountEncoder()\n",
"CountEncoder = CountEncoder.fit(allegro_count['main_category'])\n",
"allegro_count['main_category'] = CountEncoder.transform(allegro_count[\"main_category\"])\n",
"allegro_count"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Metoda zamienia każdą wartość na ilość wystąpień danej kategorii w kolumnie."
]
},
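{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same counting can be done directly in pandas, which makes the behaviour easy to see; this is only an equivalent sketch of what ce.CountEncoder does for a single column:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# map each category to how many times it appears in the column\n",
"counts = allegro['main_category'].value_counts()\n",
"allegro['main_category'].map(counts)"
]
},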
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"allegro_reduced = allegro[['price','it_seller_rating', 'it_quantity']]\n",
"allegro_reduced = allegro_reduced[0:40000]\n",
"n = len(allegro_reduced['price'])\n",
"allegro_reduced"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Zmniejszylem liczbe rekordow bo bardzo wolno mi dzialal KKNIImputer :("
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"results = [0 for i in range(10)]\n",
"for i in range(10):\n",
" X = allegro_reduced.copy(deep=True)\n",
" # zmieniam ziarno za kazdym razem zeby miec rozne wiersze w roznich eksperymentach\n",
" np.random.seed((i+1) * 10)\n",
" chosen_idx = np.random.choice(n , replace = False, size = 4000)\n",
" X.it_seller_rating[chosen_idx] = np.nan\n",
" imputer = KNNImputer(n_neighbors=5, weights=\"uniform\")\n",
" X = imputer.fit_transform(X)\n",
" results[i] = math.sqrt(np.sum(pow(X[:,1] - allegro_reduced[\"it_seller_rating\"], 2)) / n)\n",
" "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"results "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"x = [i for i in range(10)]\n",
"plt.scatter(x,results)\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"standard_deviation = math.sqrt(np.sum(pow(results - results.mean(),2)))\n",
"standard_deviation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To odchylenie wyszlo bardzo duże i wgl jakoś kiepsko to wygląda"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"second_results = [0 for i in range(10)]\n",
"for i in range(10):\n",
" X = allegro_reduced.copy(deep=True)\n",
" # zmieniam ziarno za kazdym razem zeby miec rozne wiersze w roznich eksperymentach\n",
" np.random.seed((i+1) * 10)\n",
" chosen_idx = np.random.choice(n , replace = False, size = 4000)\n",
" X.it_seller_rating[chosen_idx] = np.nan\n",
" X.it_quantity[chosen_idx] = np.nan\n",
" imputer = KNNImputer(n_neighbors=5, weights=\"uniform\")\n",
" X = imputer.fit_transform(X)\n",
" second_results[i] = math.sqrt(np.sum(pow(X[:,1] - allegro_reduced[\"it_seller_rating\"], 2)) / n)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"second_results"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"second_standard_deviation = math.sqrt(np.sum(pow(second_results - results.mean(),2)))\n",
"second_standard_deviation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"x = [i for i in range(10)]\n",
"plt.scatter(x,second_results)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Po usunięciu wartości z obu kolumn różnice w wynikach pomiędzy prawdziwą kolumną it_seller_rating, a to stworzoną po algorytmie k-nearest-neighbours są bardzo duże, również odchylenie standardowe jest znacznie większe. Wydaje mi się, że w obydwu przypadkach błąd jest stosunkowo duży."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.7"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
Binary file not shown.
@@ -0,0 +1 @@
