diff --git a/colab_notebooks/zillow_kaggle_zestimate_comp.ipynb b/colab_notebooks/zillow_kaggle_zestimate_comp.ipynb new file mode 100644 index 00000000..4dfce9de --- /dev/null +++ b/colab_notebooks/zillow_kaggle_zestimate_comp.ipynb @@ -0,0 +1,4030 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "scfLT2i0MLyD" + }, + "source": [ + "# Environment Sanity Check #\n", + "\n", + "Click the _Runtime_ dropdown at the top of the page, then _Change Runtime Type_ and confirm the instance type is _GPU_.\n", + "\n", + "Check the output of `!nvidia-smi` to make sure you've been allocated a Tesla T4.\n", + "\n", + "#Setup:\n", + "\n", + "1. Install most recent Miniconda release compatible with Google Colab's Python install (3.6.7)\n", + "2. Install RAPIDS libraries\n", + "3. Set necessary environment variables\n", + "4. Copy RAPIDS .so files into current working directory, a workaround for conda/colab interactions\n", + "- **TLDR**\n", + " - Hit `Shift` + `Enter`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 312 + }, + "colab_type": "code", + "id": "W-um5d-x7o46", + "outputId": "a604e66b-95d7-44fb-f8d3-848fcedaf796" + }, + "outputs": [], + "source": [ + "\"\"\"make sure we have the right GPU\n", + "> column 1 row 3 == Tesla T4\n", + "\"\"\"\n", + "# display gpu specs\n", + "!nvidia-smi" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "kkEdr1VmigyU" + }, + "source": [ + "### Install RAPIDS AI" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "p129YxxnihcV" + }, + "outputs": [], + "source": [ + "!wget -nc https://raw.githubusercontent.com/rapidsai/notebooks-contrib/master/utils/rapids-colab.sh\n", + "# RAPIDS 0.10 nightly\n", + "!bash rapids-colab.sh \n", + "\n", + "import sys, os\n", + "\n", + 
"sys.path.append('/usr/local/lib/python3.6/site-packages/')\n", + "os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda/nvvm/lib64/libnvvm.so'\n", + "os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda/nvvm/libdevice/'" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "1CsdVW7SU9Li" + }, + "source": [ + "# Zillow Kaggle Competition on RAPIDS AI\n", + "- initially based off eswar3's [Zillow prediction models]( https://github.com/eswar3/Zillow-prediction-models) repo\n", + "## Download Data\n", + "- to download the data, please plug in your kaggle api username & key\n", + " - you can set up your kaggle api at `https://www.kaggle.com/YOUR USERNAME HERE/account`\n", + " - learn more: https://github.com/Kaggle/kaggle-api#api-credentials" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "x1dLRTm168Tk" + }, + "outputs": [], + "source": [ + "!pip install kaggle\n", + "!mkdir /root/.kaggle\n", + "\n", + "# plug api -- get your own API key\n", + "!echo '{\"username\":\"warobson\",\"key\":\"\"}' > /root/.kaggle/kaggle.json\n", + "!chmod 600 /root/.kaggle/kaggle.json\n", + "\n", + "# !kaggle datasets download\n", + "!kaggle competitions download -c zillow-prize-1\n", + "\n", + "# unzip kaggle data\n", + "!unzip -q \"/content/sample_submission.csv.zip\"\n", + "!unzip -q \"/content/train_2016_v2.csv.zip\"\n", + "!unzip -q \"/content/properties_2016.csv.zip\"\n", + "!unzip -q \"/content/train_2017.csv.zip\"\n", + "!unzip -q \"/content/properties_2017.csv.zip\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "LICr9uz8do9K" + }, + "source": [ + "#### How is the data saved?\n", + "- inside content directory " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 173 + }, + "colab_type": "code", + "id": "6n75DyJ-dm4B", + "outputId": 
"64ac687e-39d6-4bb1-f4b7-5476c9de3b84" + }, + "outputs": [], + "source": [ + "# display content folder contents\n", + "!ls \"/content/\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "Lpa1b4edIXuT" + }, + "source": [ + "# Imports\n", + "### RAPIDS\n", + "* `cuDf`\n", + " - words here\n", + "* `cuML`\n", + " - words here\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "ZKN5zuROroJD" + }, + "outputs": [], + "source": [ + "# rapids \n", + "import cudf, cuml \n", + "# switch to cupy next update (once docker has it)\n", + "import numpy as np\n", + "# general \n", + "import seaborn as sns\n", + "import matplotlib.pyplot as plt" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "YJeywzd2efw7" + }, + "source": [ + "## Data\n", + "* `properties_2016`\n", + " - aprox. 27,000,000 residential properties \n", + " - 58 attributes each\n", + "* `train_2016_v2`\n", + " - 90,000 transaction records for closings in the year 2016\n", + " * Merge datasets on `property_id`" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 156 + }, + "colab_type": "code", + "id": "2EfApIzCfEtr", + "outputId": "bc1e37d1-9ab8-4561-fa39-5af420480a72" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
parcelidairconditioningtypeidarchitecturalstyletypeidbasementsqftbathroomcntbedroomcntbuildingclasstypeidbuildingqualitytypeidcalculatedbathnbrdecktypeid...numberofstoriesfireplaceflagstructuretaxvaluedollarcnttaxvaluedollarcntassessmentyearlandtaxvaluedollarcnttaxamounttaxdelinquencyflagtaxdelinquencyyearcensustractandblock
010754147nullnullnull0.00.0nullnullnullnull...nullnullnull9.02015.09.0nullNonenullnull
110759547nullnullnull0.00.0nullnullnullnull...nullnullnull27516.02015.027516.0nullNonenullnull
210843547nullnullnull0.00.0nullnullnullnull...nullnull650756.01413387.02015.0762631.020800.37Nonenullnull
310859147nullnullnull0.00.03.07.0nullnull...1.0null571346.01156834.02015.0585488.014557.57Nonenullnull
410879947nullnullnull0.00.04.0nullnullnull...nullnull193796.0433491.02015.0239695.05725.17Nonenullnull
\n", + "

5 rows × 58 columns

\n", + "
" + ], + "text/plain": [ + " parcelid airconditioningtypeid architecturalstyletypeid basementsqft \\\n", + "0 10754147 null null null \n", + "1 10759547 null null null \n", + "2 10843547 null null null \n", + "3 10859147 null null null \n", + "4 10879947 null null null \n", + "\n", + " bathroomcnt bedroomcnt buildingclasstypeid buildingqualitytypeid \\\n", + "0 0.0 0.0 null null \n", + "1 0.0 0.0 null null \n", + "2 0.0 0.0 null null \n", + "3 0.0 0.0 3.0 7.0 \n", + "4 0.0 0.0 4.0 null \n", + "\n", + " calculatedbathnbr decktypeid ... numberofstories fireplaceflag \\\n", + "0 null null ... null null \n", + "1 null null ... null null \n", + "2 null null ... null null \n", + "3 null null ... 1.0 null \n", + "4 null null ... null null \n", + "\n", + " structuretaxvaluedollarcnt taxvaluedollarcnt assessmentyear \\\n", + "0 null 9.0 2015.0 \n", + "1 null 27516.0 2015.0 \n", + "2 650756.0 1413387.0 2015.0 \n", + "3 571346.0 1156834.0 2015.0 \n", + "4 193796.0 433491.0 2015.0 \n", + "\n", + " landtaxvaluedollarcnt taxamount taxdelinquencyflag taxdelinquencyyear \\\n", + "0 9.0 null None null \n", + "1 27516.0 null None null \n", + "2 762631.0 20800.37 None null \n", + "3 585488.0 14557.57 None null \n", + "4 239695.0 5725.17 None null \n", + "\n", + " censustractandblock \n", + "0 null \n", + "1 null \n", + "2 null \n", + "3 null \n", + "4 null \n", + "\n", + "[5 rows x 58 columns]" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# import 2016 properties\n", + "prop2016 = cudf.read_csv('zillow/properties_2016.csv')\n", + "\n", + "# peek display 2016 properties\n", + "prop2016.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 121 + }, + "colab_type": "code", + "id": "uynoUxpx8Xsn", + "outputId": "b64b7b32-c1f9-4cf3-c50d-36e90dc51a64" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
parcelidlogerrortransactiondate
0110165940.02762016-01-01
114366692-0.16842016-01-01
212098116-0.00402016-01-01
3126434130.02182016-01-02
414432541-0.00502016-01-02
\n", + "
" + ], + "text/plain": [ + " parcelid logerror transactiondate\n", + "0 11016594 0.0276 2016-01-01\n", + "1 14366692 -0.1684 2016-01-01\n", + "2 12098116 -0.0040 2016-01-01\n", + "3 12643413 0.0218 2016-01-02\n", + "4 14432541 -0.0050 2016-01-02" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# import train 2016 data\n", + "train2016 = cudf.read_csv('zillow/train_2016_v2.csv',\n", + " parse_dates=[\"transactiondate\"])\n", + "\n", + "# peek display 2016 train\n", + "train2016.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "gGiscxESJDrl" + }, + "source": [ + "## [Zillow Prediction Model](https://colab.research.google.com/github/eswar3/Zillow-prediction-models/blob/master/Step%202a-Approach1.ipynb)\n", + "\n", + " In this approach the properties data and transaction data are merged together before adressing any missing values\n", + "\n", + "\n", + "#### Merging Data \n", + " - we will start by merging the two dataframes\n", + " - then rename the new dataframe's attributes to be meaningful \n", + " - e.g. from `pooltypeid7` to `pool_with_spa_tub_no` and `structuretaxvaluedollarcnt` to `structure_tax`" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 156 + }, + "colab_type": "code", + "id": "o4CvSIcwm4B2", + "outputId": "4e59a51a-ebd6-4fe5-b037-3165e57e3b85" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
parcelidlogerrortransactiondateac_idarchitecturalstyletypeidbasement_sqfttotal_bathbedroomcntbuildingclasstypeidbuildingqualitytypeid...fireplaceflagstructure_taxtotal_parcel_taxassessmentyearland_taxtotal_property_tax_2016taxdelinquencyflagtaxdelinquencyyearcensustractandblocktransaction_month
0171299710.04212016-01-25nullnullnull3.04.0nullnull...null266718.0444528.02015.0177810.05108.38Nonenull6.111005e+131
1129219490.02662016-01-251.0nullnull3.04.0null4.0...null361522.0506127.02015.0144605.06150.23Nonenull6.037404e+131
214502581-0.00602016-01-25nullnullnull2.53.0nullnull...null170960.0339273.02015.0168313.05487.92Nonenull6.059032e+131
310946127-0.10202016-01-251.0nullnull3.02.0null4.0...null144440.0389200.02015.0244760.04326.54Nonenull6.037311e+131
411835451-0.00302016-01-25nullnullnull3.05.0null7.0...null144020.0235739.02015.091719.03698.87Nonenull6.037530e+131
\n", + "

5 rows × 61 columns

\n", + "
" + ], + "text/plain": [ + " parcelid logerror transactiondate ac_id architecturalstyletypeid \\\n", + "0 17129971 0.0421 2016-01-25 null null \n", + "1 12921949 0.0266 2016-01-25 1.0 null \n", + "2 14502581 -0.0060 2016-01-25 null null \n", + "3 10946127 -0.1020 2016-01-25 1.0 null \n", + "4 11835451 -0.0030 2016-01-25 null null \n", + "\n", + " basement_sqft total_bath bedroomcnt buildingclasstypeid \\\n", + "0 null 3.0 4.0 null \n", + "1 null 3.0 4.0 null \n", + "2 null 2.5 3.0 null \n", + "3 null 3.0 2.0 null \n", + "4 null 3.0 5.0 null \n", + "\n", + " buildingqualitytypeid ... fireplaceflag structure_tax total_parcel_tax \\\n", + "0 null ... null 266718.0 444528.0 \n", + "1 4.0 ... null 361522.0 506127.0 \n", + "2 null ... null 170960.0 339273.0 \n", + "3 4.0 ... null 144440.0 389200.0 \n", + "4 7.0 ... null 144020.0 235739.0 \n", + "\n", + " assessmentyear land_tax total_property_tax_2016 taxdelinquencyflag \\\n", + "0 2015.0 177810.0 5108.38 None \n", + "1 2015.0 144605.0 6150.23 None \n", + "2 2015.0 168313.0 5487.92 None \n", + "3 2015.0 244760.0 4326.54 None \n", + "4 2015.0 91719.0 3698.87 None \n", + "\n", + " taxdelinquencyyear censustractandblock transaction_month \n", + "0 null 6.111005e+13 1 \n", + "1 null 6.037404e+13 1 \n", + "2 null 6.059032e+13 1 \n", + "3 null 6.037311e+13 1 \n", + "4 null 6.037530e+13 1 \n", + "\n", + "[5 rows x 61 columns]" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# merge 2016 train and property dataframes by parcel id\n", + "df_train=''\n", + "df_train = train2016.merge(prop2016, how='left', on='parcelid')\n", + "\n", + "# add column inidcaticating month of transaction\n", + "df_train['transaction_month'] = df_train['transactiondate'].dt.month\n", + "\n", + "# set colums to be renamed for general english understandability \n", + "rename_these = {\"bathroomcnt\": \"total_bath\",\n", + " \"fullbathcnt\": \"full_bath\",\n", + " \"threequarterbathnbr\": 
\"half_bath\",\n", + " \"yardbuildingsqft17\": \"patio_sqft\",\n", + " \"yardbuildingsqft26\":\"storage_sqft\",\n", + " \"decktypeid\": \"deck_flag\",\n", + " \"pooltypeid7\": \"pool_with_spa_tub_no\", \n", + " \"pooltypeid2\": \"pool_with_spa_tub_yes\",\n", + " \"hashottuborspa\": \"has_hottub_or_spa\", \n", + " \"pooltypeid10\": \"just_hottub_or_spa\",\n", + " \"calculatedfinishedsquarefeet\":\"total_finished_living_area_sqft\", \n", + " \"finishedsquarefeet12\": \"finished_living_area_sqft\",\n", + " \"lotsizesquarefeet\": \"lot_area_sqft\",\n", + " \"finishedsquarefeet50\":\"finished_living_area_entryfloor_sqft1\",\n", + " \"finishedfloor1squarefeet\":\"finished_living_area_entryfloor_sqft2\",\n", + " \"finishedsquarefeet6\": \"base_unfinished_and_finished_area_sqft\",\n", + " \"finishedsquarefeet15\": \"total_area_sqft\",\n", + " \"finishedsquarefeet13\": \"preimeter_living_area_sqft\",\n", + " \"taxvaluedollarcnt\":\"total_parcel_tax\",\n", + " \"landtaxvaluedollarcnt\":\"land_tax\",\n", + " \"taxamount\":\"total_property_tax_2016\",\n", + " \"structuretaxvaluedollarcnt\":\"structure_tax\",\n", + " \"garagetotalsqft\":\"garage_sqft\",\n", + " \"fireplacecnt\":\"fireplace_count\",\n", + " \"buildingqualitytypeid \":\"building_quality_id\",\n", + " \"heatingorsystemtypeid\":\"heating_system_id\",\n", + " \"airconditioningtypeid\":\"ac_id\",\n", + " \"storytypeid\": \"basement_flag\",\n", + " \"basementsqft\": \"basement_sqft\",\n", + " \"poolsizesum\": \"pool_sqft\",\n", + " \"poolcnt\": \"pool_count\"}\n", + "# rename columns \n", + "df_train = df_train.rename(columns = rename_these)\n", + "\n", + "# what's the data frame look like?\n", + "df_train.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "YdtyBI2jFnJv" + }, + "source": [ + "## Conforming Attribute Values\n", + "### #0 boolean columns & null = 0s cases \n", + "* `pool_count`, `pool_with_spa_tub_no` and `pool_with_spa_tub_yes` are all binary variables, 
replace all NULL values with zero\n", + "* `basement_flag` has values 7 & `Null` but is supposed to be bool, convert the `7`s to `1`s and the `Null`s to `0`s \n", + "* patio and shed variables with null values are assumed to have none\n", + "* deck_flag has only 2 values, `66` and `null`\n", + " - convert it into binary flag\n" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "z3bPdNONHTYI" + }, + "outputs": [], + "source": [ + "# replace missing pool count values so we booling\n", + "the_bool_club = ['pool_count','pool_with_spa_tub_no','pool_with_spa_tub_yes',\n", + " 'basement_flag','patio_sqft','storage_sqft', 'deck_flag']\n", + "\n", + "for col in the_bool_club:\n", + " # convert null values to 0\n", + " df_train[col]=df_train[col].fillna(0)\n", + "\n", + "# convert 7s and 66s to 1s\n", + "df_train['basement_flag'] = df_train['basement_flag'].replace(7, 1)\n", + "df_train['deck_flag'] = df_train['deck_flag'].replace(66, 1)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "5MbGy6r7JLLD" + }, + "source": [ + "### #1 The pool\n", + "* When pool is present and if it has tub/spa then `just_hottub_or_spa` = 0" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 156 + }, + "colab_type": "code", + "id": "B3-1V93smA9A", + "outputId": "52e1a5d7-869a-443f-ac2d-40504992dc14" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "before\n", + "1.0 1161\n", + "Name: just_hottub_or_spa, dtype: int32\n", + "\n", + "after\n", + "0.0 1204\n", + "1.0 1161\n", + "Name: just_hottub_or_spa, dtype: int32\n" + ] + } + ], + "source": [ + "print(f'before\\n{df_train.just_hottub_or_spa.value_counts()}\\n')\n", + "\n", + "# if poolcnt=1 and has_hottub_or_spa=1 and just_hottub_or_spa is null\n", + "conditions = ((df_train['pool_count'] == 1) \n", + " & 
(df_train['has_hottub_or_spa'] == 1) \n", + " & (df_train['just_hottub_or_spa'].isna() == True))\n", + "# then just_hottub_or_spa = 0\n", + "df_train.just_hottub_or_spa.loc[conditions] = 0\n", + "\n", + "print(f'after\\n{df_train.just_hottub_or_spa.value_counts()}')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "v6E3-_XlSGBs" + }, + "source": [ + "\n", + "- when `has_hottub_or_spa` is null and `just_hottub_or_spa` is null\n", + " - both should be zero\n" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "Xa12WFccSGM6" + }, + "outputs": [], + "source": [ + "# if both has hottub and just hottub are null\n", + "conditions = ((df_train['has_hottub_or_spa'].isna() == True) \n", + " & (df_train['just_hottub_or_spa'].isna() == True))\n", + "# just hottub or spa = 0 \n", + "df_train.just_hottub_or_spa.loc[conditions] = 0\n", + "\n", + "# now, if has hottub is null and just hottub is 0 \n", + "conditions = ((df_train['has_hottub_or_spa'].isna() == True) \n", + " & (df_train['just_hottub_or_spa'] == 0))\n", + "# has hottub or spa = 0 \n", + "df_train.has_hottub_or_spa.loc[conditions] = 0" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "5umCCWN73qxw" + }, + "source": [ + "- when there is no pool\n", + " - if there is tub/spa \n", + " - then `just_hottub_or_spa` = 1" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 69 + }, + "colab_type": "code", + "id": "FBgs7zJm3qk-", + "outputId": "78c76ac5-2b7f-4f98-9615-8a335bc3214e" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "0.0 89114\n", + "1.0 1161\n", + "Name: just_hottub_or_spa, dtype: int32" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# when poolcnt=0, has_hottub_or_spa=1\n", + "conditions = 
((df_train['pool_count'] == 0) \n", + " & (df_train['has_hottub_or_spa'] == 1))\n", + "# just_hottub_or_spa=1\n", + "df_train.just_hottub_or_spa.loc[conditions] = 1\n", + "\n", + "# let's check the values\n", + "df_train.just_hottub_or_spa.value_counts()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "3LsRr1aoSCVx" + }, + "source": [ + "* When there is no pool, set pool size to zero instead of na" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "NtdyXCbx0TKx" + }, + "outputs": [], + "source": [ + "# where there is no pool\n", + "conditions = df_train['pool_count']==0\n", + "# square footage of nonexistent pool is 0 \n", + "df_train.pool_sqft.loc[conditions] = 0" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "3hQFkXmAgQPY" + }, + "source": [ + "### #2 The basement\n", + "* Where `basement_flag` is zero, `basement_sqft` should also be zero\n" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "kMuCOqAmLTmY" + }, + "outputs": [], + "source": [ + "# where there is no basement\n", + "conditions = df_train['basement_flag'] == 0\n", + "# fun fact: we just did this with the pool\n", + "df_train.basement_sqft.loc[conditions] = 0" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "wU6Uohb-PDYB" + }, + "source": [ + "### #3 The fireplace\n", + "There seems to be an inconsistency between `fireplaceflag` and `fireplace_count`\n", + "- 90,053 `fireplaceflag` values are null\n", + "- 80,668 `fireplace_count` values are null\n", + " * a difference of 9,385 (about 10% of the flag nulls), but a boatload either way" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 52 + }, + "colab_type": "code", + "id": "OZM6lXmmpj5k", + "outputId": 
"ecf62d1d-b036-41ad-8052-a3090ae590ef" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "there are 80668 fireplace_count nulls\n", + "there are 90053 fireplaceflag nulls\n" + ] + } + ], + "source": [ + "print(f\"there are {df_train['fireplace_count'].isna().sum()} fireplace_count \\\n", + "nulls\\nthere are {df_train['fireplaceflag'].isna().sum()} fireplaceflag nulls\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "v9ZAzFoIpkSF" + }, + "source": [ + "* context driven solutions\n", + " * where neither flag nor count exists, `fireplaceflag == False`\n", + " * when `fireplace_count` is more than zero `fireplaceflag` should be `True`\n", + " * if `fireplaceflag == False`, the `fireplace_count` is logically `0`" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 52 + }, + "colab_type": "code", + "id": "i3YRZgU_qZhA", + "outputId": "e45a7a96-2e1d-47d2-a0bd-48ece42cbb6e" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "there are 222 fireplace_count nulls\n", + "there are 0 fireplaceflag nulls\n" + ] + } + ], + "source": [ + "# null flags with null counts are zero\n", + "conditions = ((df_train['fireplace_count'].isna()==True) \n", + " & (df_train['fireplaceflag'].isna()==True))\n", + "df_train.fireplaceflag.loc[conditions] = False\n", + "\n", + "# true flags for positive fireplace counts\n", + "conditions = df_train['fireplace_count'] > 0\n", + "df_train.fireplaceflag.loc[conditions] = True\n", + "\n", + "# set fireplace count nulls to 0 where false flags are\n", + "conditions = ((df_train['fireplace_count'].isna()==True) \n", + " & (df_train['fireplaceflag']==False))\n", + "df_train.fireplace_count.loc[conditions] = 0\n", + "\n", + "print(f\"there are {df_train['fireplace_count'].isna().sum()} fireplace_count \\\n", + "nulls\\nthere are 
{df_train['fireplaceflag'].isna().sum()} fireplaceflag nulls\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "pYntUejosOn3" + }, + "source": [ + "### #4 The garage\n", + "* Properties with no garages would have NA values for both `garagecarcnt` and `garage_sqft`" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "L9mGs-mK9E0Q" + }, + "outputs": [], + "source": [ + "garage = ['garagecarcnt', 'garage_sqft']\n", + "# where garage car count and garage square feet are null\n", + "conditions = ((df_train['garagecarcnt'].isna()==True) \n", + " & (df_train['garage_sqft'].isna()==True))\n", + "# set both to 0, one column at a time\n", + "# (df_train[garage] is a copy, so assigning through it would not update df_train)\n", + "for col in garage:\n", + "    df_train[col].loc[conditions] = 0" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "0uV115W6-ohW" + }, + "source": [ + "Exploring the data further, we see\n", + "- `garage_sqft` holds over 8,900 measurements of 0 despite the garage's car count being 1 or more \n" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 121 + }, + "colab_type": "code", + "id": "gbbUIbwJ-ouS", + "outputId": "310a4cdf-01a0-4fc3-ed1b-0e2f5e668518" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
garagecarcntgarage_sqft
182.00.0
201.00.0
321.00.0
362.00.0
421.00.0
\n", + "
" + ], + "text/plain": [ + " garagecarcnt garage_sqft\n", + "18 2.0 0.0\n", + "20 1.0 0.0\n", + "32 1.0 0.0\n", + "36 2.0 0.0\n", + "42 1.0 0.0" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# show rows where garage count and square feet don't add up\n", + "conditions = (df_train.garagecarcnt > 0) & (df_train.garage_sqft == 0)\n", + "\n", + "# give a display\n", + "df_train.loc[conditions][garage].head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "5I1O76QKA8Cb" + }, + "source": [ + "- these 0 values need to be null\n", + " - because no garage holding 1 or more cars in 2016 measured 0sqft" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "eWVtoty0A9Jt" + }, + "outputs": [], + "source": [ + "# where garage count and square feet don't add up\n", + "conditions = (df_train.garagecarcnt>0) & (df_train.garage_sqft==0)\n", + "# insert a NaN value\n", + "df_train.garage_sqft.loc[conditions] = np.nan" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "seb6r5wx5Bbz" + }, + "source": [ + "### #5 The bath\n", + "* `total_bath` & `calculatedbathnbr` are near-duplicates w/ `calculated` having more nulls\n", + " - let's drop it\n", + "* if `full_bath` is null and `half_bath` is also null\n", + " - let's make `total_bath` = 0 \n", + " - because we can't truthfully assume it's any more " + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "EgMNToed5BMu" + }, + "outputs": [], + "source": [ + "# drop calculated bath column\n", + "df_train = df_train.drop('calculatedbathnbr', axis=1)\n", + "\n", + "# if full_bath is null & half_bath is null\n", + "conditions = ((df_train['full_bath'].isnull()==True) \n", + " & (df_train['half_bath'].isnull()==True) \n", + " & (df_train['total_bath']==0))\n", + 
"# total_bath=0\n", + "df_train.total_bath.loc[conditions] = np.nan\n", + "\n", + "# when full_bath==total_bath, half_bath=0 \n", + "df_train.half_bath.loc[df_train.full_bath == df_train.total_bath] = 0" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "Sh8cG0pr4_hl" + }, + "source": [ + "### #6 Mode Imputation \n", + "* scaling down the latitude and longitide\n", + " - knn imput takes more time due to the larger numbers\n", + " - standardizing gives better results on most algorithms\n", + " - this is a competition, we came to win" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "kitrNxKgLWUd" + }, + "outputs": [], + "source": [ + "df_train['latitude'] = df_train.latitude / 100000\n", + "df_train['longitude'] = df_train.longitude / 100000" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "y6bhRhu5YZ1d" + }, + "source": [ + "### #7 numberofstories & unitcnt & roomcnt\n", + "* we can devise unit count based on property land type\n", + " - so we can now go ahead and correct the unit counts for each given property" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 208 + }, + "colab_type": "code", + "id": "yHZH4rMNLfBA", + "outputId": "97106bb4-10f2-49a9-f821-03a3972db136" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "1.0 86035\n", + "2.0 2372\n", + "4.0 884\n", + "3.0 622\n", + "5.0 1\n", + "6.0 1\n", + "9.0 1\n", + "11.0 1\n", + "70.0 1\n", + "143.0 1\n", + "Name: unitcnt, dtype: int32" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# where room count is 0, go ahead and NaN it\n", + "df_train.roomcnt.loc[df_train['roomcnt'] == 0] = np.nan\n", + "\n", + "\"\"\"\n", + "propertylandusetypeid & unitcnt are related \n", + " these are the 
propertylandusetypeid codes & their definitions\n", + " \n", + "#246 -Duplex (2 Units, Any Combination)\n", + "#247 -Triplex (3 Units, Any Combination)\n", + "#248 -Quadruplex (4 Units, Any Combination)\n", + "#260 -Residential General\n", + "#261 -Single Family Residential\n", + "#263 -Mobile Home\n", + "#264 -Townhouse\n", + "#266 -Condominium\n", + "#267 -Cooperative\n", + "#269 -Planned Unit Development\n", + "#275 -Residential Common Area \n", + "#31 - Commercial/Office/Residential Mixed Use\n", + "#47 -Store/Office (Mixed Use)\n", + "#265 -Cluster Home\n", + "\"\"\"\n", + "\n", + "# one unit \n", + "ones = [260,261,263,264,266,267,269,275]\n", + "for one in ones:\n", + " # adjust conditions to one unit indicator\n", + " conditions = ((df_train['propertylandusetypeid'] == one) \n", + " & (df_train['unitcnt'].isna()))\n", + " df_train.unitcnt.loc[conditions] = 1\n", + "\n", + "# two units \n", + "twos = [31,47,246]\n", + "for two in twos:\n", + " # adjust conditions to two unit indicator\n", + " conditions = ((df_train['propertylandusetypeid'] == two) \n", + " & (df_train['unitcnt'].isna()))\n", + " df_train.unitcnt.loc[conditions] = 2\n", + "\n", + "# three units\n", + "conditions = ((df_train['propertylandusetypeid'] == 247) \n", + " & (df_train['unitcnt'].isna()))\n", + "df_train.unitcnt.loc[conditions] = 3\n", + "\n", + "# four units\n", + "conditions = ((df_train['propertylandusetypeid'] == 248) \n", + " & (df_train['unitcnt'].isna()))\n", + "df_train.unitcnt.loc[conditions] = 4\n", + "\n", + "# let's see how our unit counts look\n", + "df_train.unitcnt.value_counts()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "02yLicmxLs3C" + }, + "source": [ + "### #8 Time to Cut\n", + "**Because of the adjustments made so far a number of columns are no longer needed**\n", + "* transaction date column is no longer of use\n", + " - and can be dropped \n", + "* `preimeter_living_area_sqft` and `total_finished_living_area_sqft` 
have the same values \n", + " - except that `preimeter_living_area_sqft` has more duplicates\n", + "* `total_area_sqft` and `total_finished_living_area_sqft` have the same values \n", + " - except that `total_area_sqft` has more duplicates\n", + "* `total_finished_living_area_sqft` and `finished_living_area_sqft` have the same values \n", + " - except that `finished_living_area_sqft` has more duplicates\n", + "* `base_unfinished_and_finished_area_sqft` and `total_finished_living_area_sqft` have the same values \n", + " - except that `base_unfinished_and_finished_area_sqft` has more duplicates\n", + "* different counties follow different land use codes\n", + " - to compare different counties, Zillow has created its own `propertylandusetypeid`\n", + " - hence we can drop `propertycountylandusecode`\n", + " - the same applies to `propertyzoningdesc`\n", + "* Most zip IDs are either invalid or out of city\n", + " - since enough information about location is given in latitude and longitude \n", + " - let's drop other location related fields\n", + " - `regionidcity`\n", + " - `regionidzip`\n", + " - `regionidneighborhood`\n", + "* `censustractandblock` is just a duplicate of `rawcensustractandblock`\n", + " - let's drop it too\n", + "* `assessmentyear` has a constant value for all rows\n", + " - let's drop it" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "OtOgzOqHLyid" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "BEFORE: (90275, 60)\n", + "AFTER: (90275, 48)\n" + ] + } + ], + "source": [ + "print(f\"BEFORE: {df_train.shape}\")\n", + "\n", + "# collect columns to drop\n", + "cut = ['propertyzoningdesc','propertycountylandusecode',\n", + " 'base_unfinished_and_finished_area_sqft','finished_living_area_sqft',\n", + " 'total_area_sqft','preimeter_living_area_sqft','regionidzip',\n", + " 'regionidcity','regionidneighborhood','assessmentyear','transactiondate',\n", + " 'censustractandblock']\n", + "# cut columns from dataframe\n", + "df_train = df_train.drop(cut,
axis=1)\n", + "\n", + "print(f\"AFTER: {df_train.shape}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "icDvpvSD6BSb" + }, + "source": [ + "### #9 Tax, Year, & Census\n", + "- if tax delinquency flag is null, assume there is no unpaid tax on the property\n", + " - an issue arises here because `taxdelinquencyflag` is a `StringColumn`\n", + " - i.e. null values indicate no tax delinquency, all other values are `Y` for yes\n", + " - because of this, the normal method of.." + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 311 + }, + "colab_type": "code", + "id": "8lYcO_T5XKNN", + "outputId": "596cfad3-890d-4241-b8b8-347673082a7f" + }, + "outputs": [ + { + "ename": "TypeError", + "evalue": "fill_value must be a string or a string series", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# how we'd normally take care of this\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mdf_train\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'taxdelinquencyflag'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfillna\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m~/anaconda3/envs/rapidsenv/lib/python3.7/site-packages/cudf/core/series.py\u001b[0m in \u001b[0;36mfillna\u001b[0;34m(self, value, method, axis, inplace, limit)\u001b[0m\n\u001b[1;32m 1186\u001b[0m \u001b[0;32mraise\u001b[0m \u001b[0mNotImplementedError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"The axis keyword is not
supported\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1187\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1188\u001b[0;31m \u001b[0mdata\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_column\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfillna\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvalue\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minplace\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0minplace\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1189\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1190\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0minplace\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/anaconda3/envs/rapidsenv/lib/python3.7/site-packages/cudf/core/column/string.py\u001b[0m in \u001b[0;36mfillna\u001b[0;34m(self, fill_value, inplace)\u001b[0m\n\u001b[1;32m 719\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfill_value\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_column\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mStringColumn\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 720\u001b[0m ):\n\u001b[0;32m--> 721\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mTypeError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"fill_value must be a string or a string series\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 722\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 723\u001b[0m \u001b[0;31m# replace fill_value with nvstrings\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mTypeError\u001b[0m: fill_value must be a string or a string series" + ] + } + ], + "source": [ + "# how we'd normally take care of this\n", + "df_train['taxdelinquencyflag'].fillna(0)" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "tA6xG6h59rLi" + }, + "source": [ + "- ...raises an error. \n", + " - Why?\n", + " - the series we are trying to fill the null values of is a string series\n", + " - because of this `.fillna()` requires a string value (e.g. '0') instead of an int value (e.g. 0)\n", + " - So, what now?\n", + " - there is an easy and straightforward solution: flag the nulls with `.isna()`, then swap the values with `.replace()` \n", + " - First\n", + " - switch 1 (current True, actual False) to -1\n", + " - Then\n", + " - switch 0 (current False, actual True) to 1 to reflect True status\n", + " - Finally\n", + " - switch -1 (old True, actual False) to 0 to reflect False status" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 69 + }, + "colab_type": "code", + "id": "Svp6J0cJ5dL0", + "outputId": "03862711-e104-4954-bf9c-61bd51b3a9e3" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "0 88492\n", + "1 1783\n", + "Name: taxdelinquencyflag, dtype: int32" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# the column holds 'Y' or null, so .isna() turns it into an int bool column (null -> 1, 'Y' -> 0)\n", + "df_train['taxdelinquencyflag'] = df_train['taxdelinquencyflag'].isna()\n", + "\n", + "# next we must correct the values, with 1 (True) for 'Y' and 0 for no\n", + "switcharoo = [(1,-1),(0,1),(-1,0)]\n", + "# switch values in order\n", + "for pair in switcharoo:\n", + " # tag old value and new value it will be replaced with\n", + " old, new = pair\n", + " # replace old value with new value\n", + " df_train['taxdelinquencyflag'] = df_train['taxdelinquencyflag'].replace(old, new)\n", + " \n", + "# display values in tax delinquency flag column\n", + "df_train['taxdelinquencyflag'].value_counts()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "w5EAdWXaCTRU" + }, +
"source": [ + "- Convert years\n", + " - from yy\n", + " - to 2016 - yyyy \n" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 243 + }, + "colab_type": "code", + "id": "6Bic66I9LfGC", + "outputId": "baaa5387-bbd7-4242-a336-0b6b90606935" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "0.0 88492\n", + "2.0 628\n", + "1.0 518\n", + "3.0 210\n", + "4.0 154\n", + "6.0 89\n", + "5.0 85\n", + "7.0 63\n", + "8.0 24\n", + "9.0 8\n", + "10.0 3\n", + "17.0 1\n", + "Name: taxdelinquencyyear, dtype: int32" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# no delinquency? set year to 0\n", + "df_train.taxdelinquencyyear.loc[df_train.taxdelinquencyflag == 0] = 0\n", + "\n", + "# collect x and xx formatted delinquency years w/ matching xxxx year format pair\n", + "year_pairs = [(99,1999), (6,2006), (7,2007), (8,2008), (9,2009), (10,2010),\n", + " (11,2011), (12,2012), (13,2013), (14,2014), (15,2015)]\n", + "# go through the pairs individually \n", + "for year in year_pairs:\n", + " # split the pair in question \n", + " old, new = year\n", + " # replace old year (e.g. 99) with new year (e.g. 1999)\n", + " df_train.taxdelinquencyyear.loc[df_train.taxdelinquencyyear == old] = new\n", + "\n", + "# adjust delinquency year relative to training year (2016) \n", + "df_train.taxdelinquencyyear.loc[df_train.taxdelinquencyyear>0] = 2016 - df_train.taxdelinquencyyear.loc[df_train.taxdelinquencyyear>0]\n", + "\n", + "# what've we got? 
\n", + "df_train.taxdelinquencyyear.value_counts()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "ya7xLHzdGVcs" + }, + "source": [ + "- values in `rawcensustractandblock` represent multiple fields concatenated together as float values\n", + " - by converting those values to string we can split each and build new columns:\n", + " - `census_tractnumber`\n", + " - `block_number`" + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "metadata": {}, + "outputs": [], + "source": [ + "# ttt=df_train.copy()\n", + "df_train=ttt.copy()\n", + "\n", + "# original column\n", + "\"\"\"\n", + "\n", + "# both are float columns now\n", + "#rawcensustractandblock\n", + "s_rawcensustractandblock=df_train.rawcensustractandblock.apply(lambda x: str(x))\n", + "\n", + "df_train['census_tractnumber']=s_rawcensustractandblock.str.slice(4,11)\n", + "df_train['block_number']=s_rawcensustractandblock.str.slice(start=11)\n", + "df_train['block_number']=df_train['block_number'].apply(lambda x: x[:4]+'.'+x[4:]+'0' )\n", + "df_train['block_number']=df_train['block_number'].apply(lambda x: int(round(float(x),0)) )\n", + "df_train['block_number']=df_train['block_number'].apply(lambda x: str(x).ljust(4,'0') )\n", + "\n", + "# dropping censustractandblock since this is just a duplicate of rawcensustractandblock\n", + "df_train=df_train.drop('censustractandblock', axis=1)\n", + "\n", + "# dropping rawcensustractandblock, since it's already stored as substrings in different column names\n", + "df_train=df_train.drop('rawcensustractandblock', axis=1)\n", + "\n", + "\"\"\"\n", + "pass" + ] + }, + { + "cell_type": "code", + "execution_count": 46, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 489 + }, + "colab_type": "code", + "id": "Sg0eN-K1QdZy", + "outputId": "a90de47f-5c88-4834-df44-75a9dedcd07c" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
<div>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\"><th></th><th>census_tractnumber</th><th>block_number</th></tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr><th>0</th><td>0053.03</td><td>2043</td></tr>\n", + " <tr><th>1</th><td>4037.03</td><td>2000</td></tr>\n", + " <tr><th>2</th><td>0320.48</td><td>3000</td></tr>\n", + " <tr><th>3</th><td>3107.02</td><td>1000</td></tr>\n", + " <tr><th>4</th><td>5303.01</td><td>2001</td></tr>\n", + " </tbody>\n", + "</table>\n", + "</div>
" + ], + "text/plain": [ + " census_tractnumber block_number\n", + "0 0053.03 2043\n", + "1 4037.03 2000\n", + "2 0320.48 3000\n", + "3 3107.02 1000\n", + "4 5303.01 2001" + ] + }, + "execution_count": 46, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# copy rawcensustractandblock with values as string instead of float\n", + "string_data = cudf.Series(df_train['rawcensustractandblock'].values_to_string())\n", + "\n", + "# print(type(string_data))\n", + "# print(len(string_data))\n", + "# print(string_data)\n", + "\n", + "# \"\"\"\n", + "# CURRENT ERROR IN CONVERSION OF VALUES\n", + "# \"\"\"\n", + "# print(f\"\\nNOTE: THERE APPEARS TO BE AN ERROR WHEN CONVERTING TO STRING\\n\"\n", + "# f\" > somewhat random numbers added to end of some values\\n >> e.g. 004, 006\"\n", + "# f\"\\n\\n\\ndf_train['rawcensustractandblock'].head(10).values\\n\"\n", + "# f\"{df_train['rawcensustractandblock'].head(10).values}\\n\\n\"\n", + "# f\"data.head(10).values\\n{string_data.head(10).values}\\n\\n\\n\"\n", + "# f\"THE SAME NUMBERS OCCOUR IN THE FIRST WHEN PUT INTO A LIST\\n\"\n", + "# f\" > not sure how to deal with this now\\n\"\n", + "# f\" >> difficult to reproduce without data\\n\\n\")\n", + "# \"\"\"\n", + "# CURRENT ERROR IN CONVERSION OF VALUES\n", + "# \"\"\"\n", + "\n", + "# set new tract number \n", + "df_train['census_tractnumber'] = string_data.str.slice(4, 11)\n", + "\n", + "# set/adjust block number\n", + "df_train['block_number'] = string_data.str.slice(11)\n", + "df_train['block_number'] = df_train.block_number.str.slice(0,4).str.cat(df_train.block_number.str.slice(4), '.')\n", + "df_train['block_number'] = df_train.block_number.astype('float').round(0).astype('int')\n", + "df_train['block_number'] = df_train.block_number.astype('str').str.ljust(4, '0')\n", + "\n", + "# drop raw census tract and block column, no longer needed\n", + "df_train = df_train.drop('rawcensustractandblock', axis=1)\n", + "\n", + "\"\"\"\n", + "CORRECT NUMBERS 
THAT SHOULD BE DISPLAYED BY BELOW PRINT STATEMENT\n", + " > currently not being seen due to prior mentioned error\n", + "\n", + "tractnumber\n", + "0 1066.46\n", + "1 0524.22\n", + "2 4638.00\n", + "3 2963.00\n", + "4 0423.38\n", + "dtype: object\n", + "\n", + "blocknumber\n", + "0 1001\n", + "1 2024\n", + "2 3004\n", + "3 2002\n", + "4 1006\n", + "dtype: object\n", + "\"\"\"\n", + "df_train[['census_tractnumber', 'block_number']].head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "T71orw51lpTN" + }, + "source": [ + "## Dealing with Missing Values\n", + "### #1 Setting standards\n", + "- Despite correcting and adjusting the data to this point, some columns still consist mostly of null values\n", + "- For some columns, null values make up over 95% of the rows\n", + " - Let's identify those columns\n", + " - And drop columns with more than 95% null values \n" + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 86 + }, + "colab_type": "code", + "id": "xhCosNpXvTVU", + "outputId": "2d969756-decb-4912-94f6-19836eb0323a" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " field percentage\n", + "7 buildingclasstypeid 0.999823\n", + "3 architecturalstyletypeid 0.997109\n", + "33 typeconstructiontypeid 0.996688\n" + ] + } + ], + "source": [ + "# calculate null value % for each column & frame it\n", + "missingvalues_prop = (df_train.isnull().sum()/len(df_train)).reset_index()\n", + "missingvalues_prop.columns = ['field','percentage']\n", + "\n", + "# sort by null values percentage, from highest % to lowest\n", + "missingvalues_prop = missingvalues_prop.sort_values(by='percentage', \n", + " ascending=False)\n", + "# identify columns with > 95% of values null\n", + "missingvaluescols = missingvalues_prop.loc[missingvalues_prop['percentage'] > 0.95]\n", + "\n", + "# display columns with
highest % null values\n", + "print(missingvaluescols)\n", + "\n", + "# drop columns with more than 95% null values\n", + "df_train = df_train.drop(missingvaluescols['field'], axis=1)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "az6t2ntBCMRe" + }, + "source": [ + "### #2 Working with Remaining Values\n", + "- the majority of values still missing in `unitcnt` are in rows where `propertylandusetypeid` = 265, \n", + " - which is Cluster Home (i.e. a group of houses with shared walls)\n", + " - each cluster is anywhere from 5 to 25 units\n", + " - here we will assume 10 units as a reasonable count" + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 225 + }, + "colab_type": "code", + "id": "yB2lzAyopS_S", + "outputId": "db6c7add-5452-4535-8948-a426654851b7" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "1.0 86035\n", + "2.0 2372\n", + "4.0 884\n", + "3.0 622\n", + "10.0 356\n", + "5.0 1\n", + "6.0 1\n", + "9.0 1\n", + "11.0 1\n", + "70.0 1\n", + "143.0 1\n", + "Name: unitcnt, dtype: int32" + ] + }, + "execution_count": 48, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# cluster homes (propertylandusetypeid 265): assume 10 units\n", + "df_train['unitcnt'].loc[df_train['propertylandusetypeid'] == 265] = 10\n", + "\n", + "# let's see what we've got\n", + "df_train['unitcnt'].value_counts()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "iR1rBlz-dOdH" + }, + "source": [ + "- a number of pool sizes are null despite there being a pool\n", + " - let's calculate the average pool size\n", + " - and assume those null values are pools of average size" + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + }, + "colab_type": "code", + "id": "-icFDeLSoJwl", + "outputId":
"b1ed39c3-3a14-4dc1-eb48-b3429da5cffe" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "16932\n" + ] + }, + { + "data": { + "text/plain": [ + "0" + ] + }, + "execution_count": 49, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# how's it look before?\n", + "print(df_train.pool_sqft.isna().sum())\n", + "\n", + "# calculate the average pool square footage for properties with a pool(s)\n", + "poolsizesum_mean = df_train.pool_sqft.loc[df_train['pool_count'] > 0].mean()\n", + "\n", + "# where the property has a pool(s) but pool square feet is null\n", + "conditions = ((df_train['pool_count'] > 0) \n", + " & (df_train['pool_sqft'].isna()==True))\n", + "\n", + "# set pool square feet to the average pool square footage of pool properties\n", + "df_train['pool_sqft'].loc[conditions] = poolsizesum_mean\n", + "\n", + "# display new null count\n", + "df_train.pool_sqft.isna().sum()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "AyGeXJfEmJBU" + }, + "source": [ + "- reconcile the tax columns, which should satisfy `total_parcel_tax = structure_tax + land_tax`\n", + " - total parcel tax\n", + " - structure tax\n", + " - land tax" + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "393" + ] + }, + "execution_count": 50, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# how many rows have values in total parcel tax that do not add up given land tax and structure tax\n", + "len(df_train.loc[df_train['total_parcel_tax'] != df_train['land_tax'] + df_train['structure_tax']])" + ] + }, + { + "cell_type": "code", + "execution_count": 51, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "6\n", + "380\n", + "1\n", + "1\n", + "\n", + "6\n", + "380\n", + "1\n", + "1\n" + ] + } + ], + "source": [ + "print(df_train.total_property_tax_2016.isnull().sum())\n", + "print(df_train.structure_tax.isnull().sum())\n", + 
"print(df_train.total_parcel_tax.isnull().sum())\n", + "print(df_train.land_tax.isnull().sum())\n", + "print()\n", + "\n", + "# where land tax is not a null value\n", + "condition_1 = df_train.land_tax.isnull() == False\n", + "# where total parcel tax is not a null value\n", + "condition_2 = df_train.total_parcel_tax.isnull()==False\n", + "\n", + "# pull the total parcel tax column\n", + "total_parcel_tax_not_null = df_train.loc[condition_1 & condition_2, 'total_parcel_tax']\n", + "# pull the land tax column\n", + "land_tax_not_null = df_train.loc[condition_1 & condition_2, 'land_tax']\n", + "\n", + "# total_parcel_tax = structure_tax + land_tax\n", + "# -> structure_tax = total_parcel_tax - land_tax\n", + "correct_structure_tax = total_parcel_tax_not_null - land_tax_not_null\n", + "\n", + "# set the structure_tax values in rows where total and land taxes are not null to these correct values \n", + "df_train['structure_tax'].loc[condition_1 & condition_2] = correct_structure_tax\n", + "\n", + "# where structure tax is still 0, there isn't structure tax\n", + "df_train['structure_tax'].loc[df_train['structure_tax'] == 0] = np.nan\n", + "\n", + "print(df_train.total_property_tax_2016.isnull().sum())\n", + "print(df_train.structure_tax.isnull().sum())\n", + "print(df_train.total_parcel_tax.isnull().sum())\n", + "print(df_train.land_tax.isnull().sum())" + ] + }, + { + "cell_type": "code", + "execution_count": 52, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "380" + ] + }, + "execution_count": 52, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# how many rows have values in total parcel tax that do not add up given land tax and structure tax\n", + "len(df_train.loc[df_train['total_parcel_tax'] != df_train['land_tax'] + df_train['structure_tax']])" + ] + }, + { + "cell_type": "code", + "execution_count": 53, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + }, + "colab_type":
"code", + "id": "8SID48LOpYvu", + "outputId": "6d20a3ba-4360-4554-908d-f6d673aece12" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "(90275, 45)" + ] + }, + "execution_count": 53, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# regionidcounty is an exact copy of the fips code, so drop the duplicate column\n", + "df_train = df_train.drop(['regionidcounty'], axis=1)\n", + "df_train.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 54, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + }, + "colab_type": "code", + "id": "tWmM2J8_pkg1", + "outputId": "6362e07f-e363-4884-b0c5-9380b5fee956" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1421\n", + "0\n", + "0\n", + "1421\n" + ] + } + ], + "source": [ + "#*******************************\n", + "# bedroomcnt: 1421 houses report zero bedrooms; in those rows the other room counts are missing as well\n", + "# where there is no bedroom, null is a better representation \n", + "\n", + "# before\n", + "print(len(df_train['bedroomcnt'].loc[df_train['bedroomcnt'] == 0]))\n", + "print(df_train.bedroomcnt.isnull().sum())\n", + "\n", + "df_train['bedroomcnt'].loc[df_train['bedroomcnt'] == 0] = np.nan\n", + "\n", + "# after\n", + "print(len(df_train['bedroomcnt'].loc[df_train['bedroomcnt'] == 0]))\n", + "print(df_train.bedroomcnt.isnull().sum())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Room Count\n", + "calculate full bath and half bath again from total bath, as it has a few extra columns (fixes 500 missing values in roomcnt)" + ] + }, + { + "cell_type": "code", + "execution_count": 55, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 208 + }, + "colab_type": "code", + "id": "3qnP2L9LpmeJ", + "outputId": "c0eabce4-3232-4435-8733-779526f18c57" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1165\n", + "1182\n", + 
"1182\n", + "1421\n", + "69700\n", + "\n", + "1165\n", + "1182\n", + "1182\n", + "1421\n", + "1416\n" + ] + } + ], + "source": [ + "# propertylandusetypeid & total living area\n", + "# total_bath 1165\n", + "# full_bath 1182\n", + "# half_bath 1182\n", + "# bedroomcnt 1421\n", + "# roomcnt 1416\n", + "\n", + "print(df_train.total_bath.isna().sum())\n", + "print(df_train.full_bath.isnull().sum())\n", + "print(df_train.half_bath.isnull().sum())\n", + "print(df_train.bedroomcnt.isnull().sum())\n", + "print(df_train.roomcnt.isnull().sum())\n", + "print()\n", + "\n", + "# roomcnt = (full_bath + half_bath) + bedroomcnt\n", + "# total_bath = full_bath + 0.5 * half_bath\n", + "\n", + "# where full & half bath and bedroom count are not null, but room count is null\n", + "conditions = ((df_train['full_bath'].isna() == False) \n", + " & (df_train['half_bath'].isna() == False) \n", + " & (df_train['bedroomcnt'].isna() == False) \n", + " & (df_train['roomcnt'].isna() == True))\n", + "\n", + "# calculate room count including all full & half baths along with bedroom count\n", + "new_values = df_train.full_bath.loc[conditions] + df_train.half_bath.loc[conditions] + df_train.bedroomcnt.loc[conditions]\n", + "\n", + "# df_train['roomcnt'] = df_train['roomcnt'].masked_assign(new_values, conditions)\n", + "df_train.roomcnt.loc[conditions] = new_values\n", + "\n", + "\n", + "# most null bedroom counts and null room counts fall in the same rows\n", + "# in 1133 rows, all of these columns are null\n", + "\n", + "print(df_train.total_bath.isna().sum())\n", + "print(df_train.full_bath.isnull().sum())\n", + "print(df_train.half_bath.isnull().sum())\n", + "print(df_train.bedroomcnt.isnull().sum())\n", + "print(df_train.roomcnt.isnull().sum())" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "Mvy51Ckev9CX" + }, + "source": [ + "- correct number of stories by Zillow's `propertylandusetypeid` indicator\n", + " - where values are not null\n", + " - number of stories
can be set to mode\n", + " - where there are null values\n", + " - number of stories can be set to the generally accepted number of stories" + ] + }, + { + "cell_type": "code", + "execution_count": 56, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 260 + }, + "colab_type": "code", + "id": "IW4CG2InpolD", + "outputId": "02375307-54e2-432b-8b87-1397c73d56b2" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "BEFORE\n", + "1.0 12016\n", + "2.0 8044\n", + "3.0 508\n", + "4.0 2\n", + "Name: numberofstories, dtype: int32\n", + "69705 remaining null values\n", + "\n", + "AFTER\n", + "1.0 20154\n", + "2.0 423\n", + "3.0 4\n", + "Name: numberofstories, dtype: int32\n", + "69694 remaining null values\n" + ] + } + ], + "source": [ + "# before (what's it look like?)\n", + "print(f'BEFORE\\n{df_train.numberofstories.value_counts()}\\n'\n", + " f'{df_train.numberofstories.isnull().sum()} remaining null values\\n')\n", + "\n", + "#numberofstories\t69705\n", + "\n", + "# store ids and general number of stories \n", + "zillow_type_ids = [(31,2), (246,2), (247,2), (248,2), (260,2), (261,1), \n", + " (263,1), (266,1), (267,1), (269, 2), (275,1)]\n", + "\n", + "# go through each id pair \n", + "for type_id in zillow_type_ids:\n", + " # split the pair into type id and number of stories\n", + " t_id, n_stories = type_id\n", + "\n", + " # when type id matches and story count is not null\n", + " conditions = ((df_train['propertylandusetypeid'] == t_id) \n", + " & (df_train['numberofstories'].isna() == False))\n", + "\n", + " # count the story values for matching id properties (sorted, most common first)\n", + " mode_stories = df_train.numberofstories.loc[conditions].value_counts()\n", + " \n", + " # when there is at least one value in the value_counts of this property type\n", + " if len(mode_stories) > 0:\n", + " # set mode stories to the most popular value (the index label, not its count)\n", + " mode_stories = mode_stories.index[0]\n", + " # otherwise\n", + " else:\n", + " # 
set mode stories to the general average for this property type\n", + " mode_stories = n_stories\n", + "\n", + " # and set those non null values to the most common value seen\n", + " df_train['numberofstories'].loc[conditions] = mode_stories\n", + "\n", + " # when type id matches and story count is null\n", + " conditions = ((df_train['propertylandusetypeid'] == t_id) \n", + " & (df_train['numberofstories'].isna() == True))\n", + " # set null values to the common number of stories seen in that type id\n", + " df_train['numberofstories'].loc[conditions] = n_stories\n", + "\n", + "# edge cases\n", + "conditions = ((df_train.propertylandusetypeid==264) \n", + " & (df_train.numberofstories.isnull()))\n", + "df_train.numberofstories.loc[conditions] = 2\n", + "\n", + "# what's it looking like? \n", + "print(f'AFTER\\n{df_train.numberofstories.value_counts()}\\n'\n", + " f'{df_train.numberofstories.isnull().sum()} remaining null values')" + ] + }, + { + "cell_type": "code", + "execution_count": 57, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 295 + }, + "colab_type": "code", + "id": "AHcMsDCxprd4", + "outputId": "30481b2c-e035-4478-d62f-63e10a09c17e" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "BEFORE\n", + "0.0 80446\n", + "1.0 8165\n", + "2.0 1106\n", + "3.0 312\n", + "4.0 21\n", + "5.0 3\n", + "Name: fireplace_count, dtype: int32\n", + "222 remaining null values\n", + "\n", + "AFTER\n", + "0.0 80446\n", + "8165.0 9607\n", + "1.0 222\n", + "Name: fireplace_count, dtype: int32\n", + "0 remaining null values\n" + ] + } + ], + "source": [ + "# before (what's it looking like?)
\n", + "print(f'BEFORE\\n{df_train.fireplace_count.value_counts()}\\n'\n", + " f'{df_train.fireplace_count.isnull().sum()} remaining null values\\n')\n", + "\n", + "# where there is a fireplace, and count is not null\n", + "conditions = ((df_train.fireplaceflag==1) \n", + " & (df_train.fireplace_count.isna() == False))\n", + "# calculate the mode fireplace count (the most common value, i.e. the index label, not its count)\n", + "mode_fire_count = df_train.loc[conditions, 'fireplace_count'].value_counts().index[0]\n", + "# and set those non null values to the most common fireplace count\n", + "df_train['fireplace_count'].loc[conditions] = mode_fire_count\n", + "\n", + "# where there is a fireplace, and count is null\n", + "conditions = ((df_train.fireplaceflag==1) \n", + " & (df_train.fireplace_count.isna() == True))\n", + "# set null values to 1, assuming a single fireplace\n", + "df_train.fireplace_count.loc[conditions] = 1\n", + "\n", + "# after\n", + "print(f'AFTER\\n{df_train.fireplace_count.value_counts()}\\n'\n", + " f'{df_train.fireplace_count.isnull().sum()} remaining null values')" + ] + }, + { + "cell_type": "code", + "execution_count": 58, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 317 + }, + "colab_type": "code", + "id": "FIuSWoJspt3H", + "outputId": "cb11c3a1-1658-4bce-cbde-a1a47ccdc0a8" + }, + "outputs": [ + { + "data": { + "image/png":
"iVBORw0KGgoAAAANSUhEUgAAAZoAAAEsCAYAAAD6lXULAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAgAElEQVR4nO3deVgT594+8DsEgqggBgHjcqriAVMVURBcsFYURYxrbVXcrVqta9UqKq94UPRg8eC+tPbVWqvnqFVRRMXlfWvrcelpa62ivsrBlQCagODGkszvD3/kGFkMyCQE7891cV1knknmO0+We+bJZEYiCIIAIiIikdhYugAiIqreGDRERCQqBg0REYmKQUNERKJi0BARkagYNEREJCoGzVto7dq18PLyKvY3ZswYS5dWrSUlJSEkJAStWrVCcHBwifPcvn3b6DlRKpV4//33sWjRImRlZZV7mYcPH8aBAweKTR82bBg+++yzcj8elc/Fixexbt06S5dhcbaWLoAsw9HREVu2bCk2jcRRWFiIefPmISgoCEuXLkWtWrXKnH/+/Pnw8fGBTqdDSkoK4uLioFar8dVXX5VruYcPH8bTp08xYMCANymfKujixYvYsGEDpk6daulSLIpB85aSSqXw8fExef7nz5+jRo0aIlZUvaWnp+Pp06fo168f/Pz8Xjt/s2bNDM+Pr68vnj9/juXLl1v8ebD08sk6ceiMiiksLISXlxe++eYbLF26FB06dDDaIk5KSsKgQYPQunVrBAYGIjY2FoWFhUaPkZiYiJ49e8Lb2xsjR47EpUuX4OXlhfj4eKNl7Nq1y+h+cXFx6Ny5s9G0+/fvY+bMmWjfvj3atGmD8ePH49atW4b2ouGmY8eOISIiAr6+vnjvvfewbt06vHrii2vXrmHixInw9fVF27Zt8dFHH+Hs2bMoKChAp06dsHHjxmL9MXToUMyYMaPMPjt8+DBUKhVatWqF999/H6tXr4ZOpwMA7NmzB927dwcATJw4EV5eXtiwYUOZj/eqWrVqQa/XQ6/XG6Z9//33GDp0KPz9/eHv74/Ro0fjypUrhvY5c+bg5MmTOHv2rGEo7tXlHjhwAD169EC7du0wYcIEZGRkGNqK+vXw4cOYM2cO/Pz8MGXKFACATqfDqlWr0LVrV7Rq1QoqlQqHDx8uV78U9Y2XlxeuXr2K4cOHo02bNhg4cCCuXr2KJ0+eYN68eWjXrh169OiBxMTE1/aTTqfDxo0b0bNnT7Rq1QrvvfceFixYYDTP9u3bERwcjFatWqFnz57Yvn27UfucOXPw0UcfGU0r6ovTp08D+M/rd8eOHYiNjUVAQAA6deqEJUuWID8/37Buy5cvh06ne+uHp7lH8xZ7NRykUikkEonh9ldffYWAgACsWLHC8IF96NAhzJ07F8OGDcOsWbNw+/ZtrFy5EsCLNygAXLp0CbNnz0avXr0QERGBa9euYebMmRWqUavVYtiwYahXrx6ioqJgb2+PzZs3Y9y4cTh69ChkMplh3piYGPTq1Qtr1qzBTz/9hLVr18LT0xM9e/YEANy4cQPDhg2Dh4cHoqKiUKdOHVy+fBlqtRp2dnbo378/9u/fj8mTJxse89atW/jtt9/w5ZdfllrjDz/8gFmzZmHQoEGYO3curl27hjVr1uDRo0dYtGgRunfvDkdHR8yYMcMwJKZQKMpcb0EQUFhYCL1ej5s3b2Lr1q3o3LkzatasaZgnLS0NgwYNQuPGjZGfn4+DBw9i+PDhOHz4MBo2bIjp06cjPT0dz58/R0REBAAYLffXX39Feno65s+fj6dPn2LZsmWIjIzEpk2bjGpZvny5oV9tbF5sm/7tb3/DN998g6lTp6Jly5Y4cuQIZs2aBRsbG/Tu3dukfnnZ3LlzMWLECEycOBGxsbGYMWMGWrRogWbNmmHt2rXYvXs35s6dC19fX7i7u5fabwsXLkRCQgImTJgAPz8/ZGdn48SJE4b
2nTt3YtmyZRg7diw6d+6Ms2fPYtmyZSgoKMDHH39c5nNSki1btqBTp06IjY3F1atXERcXh0aNGmHs2LHo3r07bty4gR07dmDnzp0A3uLhaYHeOmvWrBE8PT2L/Z05c0YQBEEoKCgQPD09hUGDBhndT6fTCV26dBEWLlxoNP3vf/+74O3tLWRnZwuCIAhTpkwRVCqVoNfrDfOsXbtW8PT0FA4cOGC0jJ07dxo91t/+9jehU6dOhtuxsbFCQECA8OjRI8M0rVYr+Pj4CLt27RIEQRBu3boleHp6CuHh4UaP1adPH2H27NmG29OmTRO6du0qPH/+vMR+uXnzpuDp6Sn8/PPPhmkrV64UAgMDhcLCwhLvIwiCMHDgQGHMmDFG0zZu3CgolUohIyPDqMYffvih1Md5eb5X/1QqlZCenl7q/XQ6nVBQUCD06NFD2Lhxo2H65MmThdGjRxebf+jQoYKfn5+Qk5NjmLZlyxbBy8tLyMvLM6pl2rRpRvfVaDRC69athQ0bNhhNHzt2rBAaGlquftm9e7fg6ekpxMfHG+Y5ceKE4OnpKURERBimZWdnCy1atBD+8Y9/lNoH169fFzw9PYUdO3aU2F5QUCB06tSp2Os3IiJC8PPzM6z37NmzhQ8//NBonlefv6LX78iRI43mmzhxojB06FDD7a1btwpKpbLUmt8WHDp7Szk6OmLv3r1Gf97e3kbzvP/++0a3U1JSkJGRgd69e6OwsNDw16FDBzx//hw3b94E8GKPJigoyGjvqLSjrF7nn//8JwIDA1GzZk3D8hwdHdGyZUtcvnzZaN7AwECj2x4eHkZDQefPn0efPn1gb29f4rI8PDzQtm1b7Nu3DwCg1+sRHx+PAQMGQCqVlnifgoICXLt2DSEhIUbTQ0NDodPp8Pvvv5d7nQEgIiICe/fuxZ49e7Bu3TrUqFEDEydOxLNnzwzz3LhxA59++ik6deoEpVKJli1b4s6dO0bDimXx9vY22sJu3rw5BEFAZmam0Xyvvg6uX7+OvLy8Yuvcu3dv3Lx5E9nZ2eXul44dOxr+f+eddwAAHTp0MEyrU6cOnJ2djZ7PV50/fx4AMHDgwBLb1Wo1Hj58WGJNOTk5htdvebzuNUcvcOjsLSWVStG6desy53FxcTG6XXR47bhx40qcPz09HQCg0WiK3ffV26bKysrC5cuXcejQoWJtrwbGq8MSdnZ2yMvLM9x+9OgRXF1dy1ze4MGDER0djYiICPzyyy9IT0/HoEGDSp1fq9VCp9OVur7Z2dllLq8077zzjtHz4+Pjgy5duuDAgQMYNmwYcnNzMW7cOLi7u2P+/PlQKBSwt7fHggULjNa5LE5OTka37ezsAKDY/V9dtwcPHgAA6tWrZzS96HZOTg7y8vLK1S8v11JUx+uez1dlZ2fD0dHRaHjxZUUB+mrdRTU9evSo1McuTXlrfFsxaKhUL++RAC+2KgFg2bJl8PT0LDZ/48aNAbx442o0GqO2V29LpVLY2tqioKDAaPqrb3ZnZ2e8++67+OSTT4otr3bt2iauyX/qL/qQLE1oaCiWLVuGpKQknD59Gu3atUPTpk1LnV8ul0MqlUKr1RpNL1pfZ2fnctVYGldXV9SpUwcpKSkAXny/kpmZiR07dhj2AIAXH/KV7dXXQVFYazQaow/ahw8fAngRGrVq1TJLv7zM2dkZubm5ePbsGRwcHIq1u7m5GdXwak1Fr297e/vXvi6pfDh0RiZr3rw5XF1dcf/+fbRu3brYX9GHR+vWrXHq1CmjI76OHz9u9FgSiQTu7u6GD07gxRFD586dM5qvQ4cOuHHjBry8vIotr6wAKEmHDh2QmJhoOCqoJDVr1kRoaCi+/fZbnDhxosy9GeDFFqxSqcTRo0eNph85cgRSqRRt2rQpV42lycjIQHZ2tuHL/OfPnwOA0cEQP//8s2Gv8uX6KnsL28vLC/b29sXW+ejRo2jevDmcnZ3N1i8vKxp+K+kHqgDQoEED1KtXr8S
a6tSpg+bNmwMA6tevj3v37hm9Ts6cOVOhmuzs7KDT6YodePO24R4NmUwqlWLu3LlYsGABcnJy0KVLF9ja2uLu3bs4fvw4Nm7cCJlMhgkTJmDo0KGYNWsWBg4ciOvXrxu+93hZjx49sHv3brRo0QINGjTAnj17DB+gRT7++GMkJCRg1KhRGDFiBNzc3PDw4UNcuHAB/v7+CA0NNbn+6dOnY/DgwRgxYgTGjBkDZ2dnXLlyBfXq1TMa1x88eDCGDBmCmjVrGo6get3jTpw4EQsXLkRISAiuXbuGtWvXYujQoYat6PL697//DScnJwiCgPT0dGzZsgVOTk6G9W3bti0cHBwQERGBcePGIS0tDevXry+2vGbNmuH06dM4ceIE3N3d4e7uXuGaisjlcowcORLr1q2DjY0N3n33XRw9ehQ//fQTVq1aZZhPjH4pS/PmzfHBBx8gOjoaDx8+hK+vLx49eoQTJ05g5cqVkEqlmDJlCqKiouDk5ISOHTvi/Pnz2L17Nz7//HNDaAcHB2PdunWIiIjAgAEDcPny5VLD63WaNWsGANi2bRv8/f3h6OhY7g2k6oBBQ+XSr18/ODk5YfPmzdi7dy9sbGzwpz/9Cd26dYOt7YuXk4+PD1auXIm4uDicOHEC3t7eiIuLK/bbhOnTpyMrKwtxcXGws7PDyJEj4eHhgb179xrmcXFxwe7duxEXF4dly5YhJycHbm5u8PX1hZeXV7lq9/DwwM6dOxEbG4uFCxdCIpHgz3/+c7FTsfj4+KBevXro0qWLScNzXbt2xcqVK7Fp0ybEx8dDLpdj/PjxmDZtWrnqe9ny5csN/9erVw+tW7dGdHS0YY/Gzc0Nq1evRkxMDCZNmoSmTZsiKiqq2O+ARowYgevXr2P+/PnIycnBjBkz8Omnn1a4riKfffYZ7OzssGPHDmi1WjRp0gQrV640CmYx+uV1lixZgoYNG2Lfvn3YvHkzXFxc0KVLF0N7WFgYCgoK8O233+Kbb76BQqHA/PnzMXr0aMM8LVq0wNKlS7F582YkJSWhQ4cOiI6OxvDhw8tdT4cOHTB27Fhs27YNsbGx6NChA7Zt21YZq2pVJILASzmT+HJyctC+fXusWLEC/fv3t3Q5Zbp27Rr69++Pb7/9Fv7+/pYuh8jqcY+G6P/TarVITU3FqlWr0KJFC4YMUSXhwQBE/9/JkycxfPhwaLVao6ErInozHDojIiJRcY+GiIhExaAhIiJR8WCAUmRlPYFez1FFIiJT2NhIULduyRf0Y9CUQq8XGDRERJXAbENnn376Kfr164cBAwYgLCwMV69eBQCkpqZiyJAh6NWrF4YMGWJ05lkx2oiIyLzMdtRZbm6u4QR8J06cwPr167F//36MGjUKH3zwAfr374/4+Hh8//33hiveidFmKo3mMfdoiIhMZGMjgYtLyWfSMNsezctneX38+DEkEgk0Gg2Sk5OhUqkAACqVCsnJydBqtaK0ERGR+Zn1O5qFCxfizJkzEAQBW7ZsgVqthru7u+GiUlKpFG5ublCr1RAEodLb5HK5ybWWlsxERFQ+Zg2a6OhoAC9O471ixQrMmDHDnIsvFw6dERGZrkoMnb1swIABOH/+POrXr4+MjAzodDoAL65HkpmZCYVCAYVCUeltRERkfmYJmidPnkCtVhtunzp1CnXq1IGLiwuUSiUSEhIAAAkJCVAqlZDL5aK0ERGR+ZnlqLOHDx/i008/xbNnz2BjY4M6depg3rx5aNmyJVJSUhAeHo6cnBw4OTkhJibGcLEgMdpMxaEzIiLTlTV0xpNqloJB83ZzdpTBroa9pctAwfM8ZOeWfulpoqqirKDhmQGISmBXwx6Jo8ZaugyEbt8KMGjIyvGkmkREJCoGDRERiYpBQ0REomLQEBGRqBg0REQkKgYNERGJikFDRESiYtAQEZGoGDRERCQqBg0REYmKQUNERKJi0BARkagYNEREJCoGDRERiYpBQ0REomLQEBGRqBg0REQkKgY
NERGJikFDRESiYtAQEZGoGDRERCQqBg0REYmKQUNERKJi0BARkagYNEREJCpbcywkKysLc+fOxZ07dyCTyfDOO+8gKioKcrkcQUFBkMlksLe3BwDMmTMHXbp0AQCkpqYiPDwc2dnZcHZ2RkxMDJo0afJGbUREZF5m2aORSCQYP348jh07hkOHDqFx48aIjY01tK9Zswbx8fGIj483hAwAREZGIiwsDMeOHUNYWBgWLVr0xm1ERGReZgkaZ2dnBAQEGG77+PggLS2tzPtoNBokJydDpVIBAFQqFZKTk6HVaivcRkRE5meWobOX6fV67Nq1C0FBQYZpc+bMgSAI8PX1xaxZs+Dk5AS1Wg13d3dIpVIAgFQqhZubG9RqNQRBqFCbXC43uU4Xl9qVuNZEFefq6mjpEojeiNmDZsmSJahZsyZGjBgBAPjuu++gUCiQn5+P6OhoREVFGQ2rWYpG8xh6vWDpMshCqtKH+4MHuZYugei1bGwkpW6gm/Wos5iYGNy+fRurVq2Cjc2LRSsUCgCATCZDWFgYfv31V8P0jIwM6HQ6AIBOp0NmZiYUCkWF24iIyPzMFjRxcXG4fPky1q9fD5lMBgB4+vQpcnNfbK0JgoDExEQolUoAgIuLC5RKJRISEgAACQkJUCqVkMvlFW4jIiLzkwiCIPr40I0bN6BSqdCkSRPUqFEDANCoUSOEh4dj2rRp0Ol00Ov18PDwQEREBNzc3AAAKSkpCA8PR05ODpycnBATE4NmzZq9UZupOHT2dnN1dUTiqLGWLgOh27dy6IysQllDZ2YJGmvEoHm7MWiIyqfKfEdDRERvHwYNERGJyuyHN5N46taRwVZmb9EaCvPzkPUo36I1EFHVwqCpRmxl9vhlxXiL1uA7dwsABg0R/QeHzoiISFQMGiIiEhWDhoiIRMWgISIiUTFoiIhIVAwaIiISFYOGiIhExaAhIiJRMWiIiEhUDBoiIhIVg4aIiETFoCEiIlExaIiISFQMGiIiEhWDhoiIRMWgISIiUTFoiIhIVAwaIiISFYOGiIhExaAhIiJRMWiIiEhUDBoiIhKVWYImKysLEyZMQK9evdC3b19MnToVWq0WAJCamoohQ4agV69eGDJkCG7dumW4nxhtRERkXmYJGolEgvHjx+PYsWM4dOgQGjdujNjYWABAZGQkwsLCcOzYMYSFhWHRokWG+4nRRkRE5mWWoHF2dkZAQIDhto+PD9LS0qDRaJCcnAyVSgUAUKlUSE5OhlarFaWNiIjMz9bcC9Tr9di1axeCgoKgVqvh7u4OqVQKAJBKpXBzc4NarYYgCJXeJpfLzb26RERvPbMHzZIlS1CzZk2MGDECycnJ5l68yVxcalu6BKvl6upo6RKqFfYnWTuzBk1MTAxu376NTZs2wcbGBgqFAhkZGdDpdJBKpdDpdMjMzIRCoYAgCJXeVh4azWPo9YJIPSGOqvKB9OBBrqVLeGNVpS+B6tGfVP3Z2EhK3UA32+HNcXFxuHz5MtavXw+ZTAYAcHFxgVKpREJCAgAgISEBSqUScrlclDYiIjI/iSAIom+237hxAyqVCk2aNEGNGjUAAI0aNcL69euRkpKC8PBw5OTkwMnJCTExMWjWrBkAiNJmKmvdo/llxXiL1uA7d0u12AJ3dXVE4qixli4Dodu3Vov+pOqvrD0aswSNNWLQVAyDpnIxaMhaVImhMyIiejsxaIiISFQMGiIiEhWDhoiIRMWgISIiUTFoiIhIVAwaIiISFYOGiIhExaAhIiJRMWiIiEhUDBoiIhIVg4aIiETFoCEiIlGZHDRff/11idO3bt1aacUQEVH1Y3LQrF+/vsTpGzdurLRiiIio+nntpZzPnj0LANDr9Th37hxevnzNvXv3UKtWLfGqIyIiq/faoFm4cCEAIC8vDwsWLDBMl0gkcHV1RUREhHjVERGR1Xtt0Jw6dQoAMHfuXKxYsUL0goiIqHp5bdAUeTlk9Hq9UZuNDQ9eIyKikpkcNFe
uXEFUVBSuX7+OvLw8AIAgCJBIJLh69apoBRIRkXUzOWjCw8PRrVs3LFu2DDVq1BCzJiIiqkZMDpr79+/js88+g0QiEbMeIiKqZkz+ciU4OBg//fSTmLUQEVE1ZPIeTV5eHqZOnQpfX1/Uq1fPqI1HoxERUWlMDprmzZujefPmYtZCRETVkMlBM3XqVDHrICKiasrkoCk6FU1JOnbsWCnFEBFR9WNy0BSdiqZIVlYWCgoK4O7ujpMnT5Z535iYGBw7dgz379/HoUOH4OnpCQAICgqCTCaDvb09AGDOnDno0qULACA1NRXh4eHIzs6Gs7MzYmJi0KRJkzdqIyIi8zM5aIpORVNEp9Nh48aNJp1Us3v37hg1ahSGDx9erG3NmjWG4HlZZGQkwsLC0L9/f8THx2PRokXYvn37G7UREZH5VfjcMVKpFJMmTcKWLVteO6+fnx8UCoXJj63RaJCcnAyVSgUAUKlUSE5OhlarrXAbERFZhsl7NCU5c+bMG/+Ac86cORAEAb6+vpg1axacnJygVqvh7u4OqVQK4EWoubm5Qa1WQxCECrXJ5fJy1eXiUvuN1utt5urqaOkSqhX2J1k7k4Oma9euRqHy7Nkz5OfnIzIyssIL/+6776BQKJCfn4/o6GhERUUhNja2wo9XmTSax9DrhdfPWIVUlQ+kBw9yLV3CG6sqfQlUj/6k6s/GRlLqBrrJQfPFF18Y3XZwcEDTpk1Ru3bFt/yLhtNkMhnCwsIwefJkw/SMjAzodDpIpVLodDpkZmZCoVBAEIQKtRERkWWY/B2Nv78//P394efnhyZNmqBly5ZvFDJPnz5Fbu6LLTVBEJCYmAilUgkAcHFxgVKpREJCAgAgISEBSqUScrm8wm1ERGQZEuHlazOX4fHjx4iKikJiYiIKCwtha2uLPn36ICIiAo6OZQ8zLF26FElJSXj48CHq1q0LZ2dnbNq0CdOmTYNOp4Ner4eHhwciIiLg5uYGAEhJSUF4eDhycnLg5OSEmJgYNGvW7I3aysNah85+WTHeojX4zt1SLYZ6XF0dkThqrKXLQOj2rdWiP6n6K2vozOSgCQ8Px5MnTzBr1iw0bNgQ9+/fR1xcHBwcHBATE1OpBVcFDJqKYdBULgYNWYtK+Y7mxx9/xIkTJ+Dg4AAAaNq0KZYvX47g4ODKqZKIiKolk7+jsbe3L/Z7lKysLMhkskovioiIqg+T92gGDx6McePGYcyYMWjQoAHS0tKwbds2fPjhh2LWR0REVs7koJk8eTLc3d1x6NAhZGZmws3NDePHj2fQEBFRmUweOouOjkbTpk2xbds2JCYmYtu2bfDw8EB0dLSY9RERkZUzOWgSEhLQqlUro2mtWrUy/GaFiIioJCYHjUQigV6vN5pW9BsYIiKi0pgcNH5+fli9erUhWPR6PdauXQs/Pz/RiiMiIutXrgufffLJJwgMDESDBg2gVqvh6uqKTZs2iVkfERFZOZODpn79+ti/fz8uXboEtVoNhUIBb29v2NhU+JI2RET0FijX9WhsbGzg4+MDHx8fseohIqJqhrsjREQkKgYNERGJikFDRESiYtAQEZGoGDRERCQqBg0REYmKQUNERKJi0BARkagYNEREJCoGDRERiYpBQ0REomLQEBGRqBg0REQkKgYNERGJikFDRESiMkvQxMTEICgoCF5eXvi///s/w/TU1FQMGTIEvXr1wpAhQ3Dr1i1R24iIyPzMEjTdu3fHd999h4YNGxpNj4yMRFhYGI4dO4awsDAsWrRI1DYiIjI/swSNn58fFAqF0TSNRoPk5GSoVCoAgEqlQnJyMrRarShtRERkGeW6lHNlUqvVcHd3h1QqBQBIpVK4ublBrVZDEIRKb5PL5eWqz8WldiWu7dvF1dXR0iVUK+xPsnYWC5qqTqN5DL1esHQZ5VJVPpAePMi1dAlvrKr0JVA9+pOqPxsbSakb6BYLGoVCgYyMDOh0OkilUuh0OmRmZkKhUEAQhEpvIyIiy7D
Y4c0uLi5QKpVISEgAACQkJECpVEIul4vSRkREliERBEH08aGlS5ciKSkJDx8+RN26deHs7IzDhw8jJSUF4eHhyMnJgZOTE2JiYtCsWTMAEKWtPKx16OyXFeMtWoPv3C3VYqjH1dURiaPGWroMhG7fWi36k6q/sobOzBI01ohBUzEMmsrFoCFrUVbQ8MwAREQkKgYNERGJikFDRESiYtAQEZGoGDRERCQqBg0REYmKQUNERKJi0BARkagYNEREJCoGDRERiYpBQ0REomLQEBGRqBg0REQkKgYNERGJikFDRESiYtAQEZGoGDRERCQqBg0REYmKQUNERKJi0BARkagYNEREJCoGDRERiYpBQ0REomLQEBGRqBg0REQkKgYNERGJytbSBQBAUFAQZDIZ7O3tAQBz5sxBly5dkJqaivDwcGRnZ8PZ2RkxMTFo0qQJAFS4jYiIzKvK7NGsWbMG8fHxiI+PR5cuXQAAkZGRCAsLw7FjxxAWFoZFixYZ5q9oGxERmVeVCZpXaTQaJCcnQ6VSAQBUKhWSk5Oh1Wor3EZEROZXJYbOgBfDZYIgwNfXF7NmzYJarYa7uzukUikAQCqVws3NDWq1GoIgVKhNLpdbbP2IiN5WVSJovvvuOygUCuTn5yM6OhpRUVEYM2aMRWtycalt0eVbM1dXR0uXUK2wP8naVYmgUSgUAACZTIawsDBMnjwZ8+fPR0ZGBnQ6HaRSKXQ6HTIzM6FQKCAIQoXaykOjeQy9XhBjdUVTVT6QHjzItXQJb6yq9CVQPfqTqj8bG0mpG+gW/47m6dOnyM198UYSBAGJiYlQKpVwcXGBUqlEQkICACAhIQFKpRJyubzCbUREZH4W36PRaDSYNm0adDod9Ho9PDw8EBkZCQBYvHgxwsPDsWHDBjg5OSEmJsZwv4q2ERGReUkEQbCu8SEzsdahs19WjLdoDb5zt1SLoR5XV0ckjhpr6TIQun1rtehPqv6q9NAZERFVbwwaIiISFYOGiIhExaAhIiJRMWiIiEhUDBoiIhIVg4aIiETFoCEiIlExaIiISFQMGiIiEhWDhoiIRGXxk2paA0enGqhhb2fRGp7nFSA357lFayAiqggGjQlq2NshbJ/nB7cAABDQSURBVO53Fq1h54rhyAWDhoisD4fOiIhIVAwaIiISFYOGiIhExaAhIiJRMWiIiEhUDBoiIhIVg4aIiETF39EQkejqOMkgs7e3dBnIz8vDo5x8S5fx1mHQEJHoZPb2+Nv8TyxdBmYt3wyAQWNuHDojIiJRMWiIiEhUDBoiIhIVg4aIiETFoCEiIlFV26BJTU3FkCFD0KtXLwwZMgS3bt2ydElERG+lahs0kZGRCAsLw7FjxxAWFoZFixZZuiQiordStfwdjUajQXJyMrZu3QoAUKlUWLJkCbRaLeRyuUmPYWMjMbpdr26tSq+zvF6tqSQyJxczVFI2U+q0Bg71LN+XQPXpTydn9mdlcXS0h0xm2av+5ucXIDc3z3C7rH6VCIIgmKMoc7p8+TLmzZuHw4cPG6aFhobiiy++QMuWLS1YGRHR26faDp0REVHVUC2DRqFQICMjAzqdDgCg0+mQmZkJhUJh4cqIiN4+1TJoXFxcoFQqkZCQAABISEiAUqk0+fsZIiKqPNXyOxoASElJQXh4OHJycuDk5ISYmBg0a9bM0mUREb11qm3QEBFR1VAth86IiKjqYNAQEZGoGDRERCQqBg0REYmKQUNERKJi0BARkagYNEREJKpqefZmMk1aWhqOHj0KtVoN4MWpe3r27IlGjRpZuDJjrLNyWUudVH1IFy9evNjSRVRHaWlp2LNnDxITE/Hjjz/i5s2bcHV1hZOTk6VLAwDs2bMH8+bNg6urK+rXr4/atWsjLS0NK1euRK1atarMWa5ZZ+WyljqBqv8eKsI6X49nBhDBnj17sG7dOvTo0cNwIk+1Wo2TJ09iypQp+PDDDy1cIdCrVy/s2rWr2PnftFothg4diqSkJAt
VZox1Vi5rqdMa3kMA6zSZQJWuZ8+egkajKTZdo9EIwcHBFqiouB49epQ4Xa/Xl9pmCayzcllLndbwHhIE1mkqfkcjAr1eX+KZouvWrQuhiuxABgYGYvz48fjoo4/QoEEDAC92rXfv3o3OnTtbuLr/YJ2Vy1rqtIb3EMA6TcWhMxH85S9/wd27d0t8Mzdq1AhV4WsxvV6PgwcP4siRI0hLSwMANGjQACEhIejfvz9sbKrGAYmss3JZS53W8B4CWKepGDQisJY3M1FVZS3vIdZpGgYNFXPlypUqdfRRaVhn5bKWOsn6VI24fYtcuXLF0iW81urVqy1dgklYZ+Wyljqt4T0EsM6XMWjMrCq/mf/5z38CAL788ksLV1K6J0+e4MqVK3j8+HGVrvNlVbnOZ8+e4fLly8jJyanSdb6sKr+HXsY6/4NDZ2+pmzdvFpv28ccf47//+78hCAKaN29ugaqKW7RoEWbOnAm5XI5ffvkF06ZNQ926daHVavHFF18gMDDQ0iUCAAICAtC3b1988MEHUCqVli6nVMePH8e8efPg5uaGmJgYzJw5Ew4ODtBoNFi+fDmCgoIsXSJVQzy82cz69u2LQ4cOWboMqFQqw9EnRR4+fIgJEyZAIpHg5MmTFqrM2MWLFw2HZa5evRqbNm2Ct7c3UlNTMXv27CoTNLVq1YKNjQ3GjRuH+vXr44MPPkDfvn1Rp04dS5dmZN26ddi1axdycnIwceJEbNy4Ee3atUNKSgpmz55d5YImKysL6enpAID69eujbt26Fq6IKoJBI4KS9haKZGVlmbGS0k2dOhW///47Fi9ejIYNGwIAgoKCcOrUKQtXZiwvL8/w/5MnT+Dt7Q0AaNq0KQoKCixVVjF16tTBggUL8Pnnn+PkyZPYt28fVq5ciffffx+DBw+uMr9RkUgk8PLyAvAiHNu1awcA8PDwsGRZxdy5cwf/9V//heTkZLi5uQEAMjMz8e677+Ivf/kLmjRpYtkCTVBVNiqBF587sbGxUKvV6N69O4YPH25omzZtGtauXSvq8hk0IlCpVGjYsGGJP4TKzs62QEXFTZ06FcnJyZg9ezb69++PYcOGQSKRWLqsYjp27Ii//vWvmDFjBgICApCYmIjQ0FCcOXMGzs7Oli6vGDs7O4SEhCAkJASZmZnYt28flixZgqNHj1q6NAAvgiYlJQU5OTl4+vQpLl68CB8fH6SmpkKn01m6PIO5c+ciLCwMW7duNRx6q9frcejQIcybNw//+Mc/LFzhC9awUQkAkZGRaNSoEbp27Ypdu3bh7NmzWLVqFWxtbXH37l3Rl8/vaETQvXt37Ny5E+7u7sXaunbtih9++MECVZUsPz8fa9aswR9//IHU1FScPn3a0iUZyc/Px4oVKxAfHw9nZ2fcvXsXtra2CAgIwOLFi9G4cWNLlwgAGDBgAA4cOGDpMl7rf/7nfzBv3jzY2NggLi4OX375JR48eID09HQsXrwYKpXK0iUCAEJCQkoN57LazK1FixalblRmZmbi8uXLFqiquP79+yM+Ph4AIAgCoqKicOfOHWzYsAFDhgwR/bXLPRoR9OzZE/fv3y8xaIKDgy1QUelkMhnmzJmDixcv4sKFC5YupxiZTIaIiAjMmjULd+7cgU6nQ4MGDarcWP369estXYJJunXrZvQ8+/v74+rVq6hfvz7q1atnwcqMOTs7IyEhAX369DHsaQuCgEOHDlWpsyI3bNiwzI3KqiI/P9/wv0QiQWRkJGJiYjBx4kSj4WmxcI+GiKqcW7duITIyElevXjV8iGdkZKBFixZYvHgxmjVrZuEKX4iJiUFwcLDhu66XLV26FBERERaoqriJEydiwoQJaN++vdH0uLg4bN68GdeuXRN1+QwaIqqytFqt0QXaSjoxJL1ednY2JBJJiUdB3rx5U/SfM/AHm0RUZcnlcrRs2RItW7Y0hEzfvn0tXJVpqlKdzs7OpR5q/9lnn4m+fH5HQ0RVTmlHcwmCUKWO5rK
Wo84sXSeDhoiqHGv4iQDAOk3FoCGiKsdajuZinabhdzREVOUU/USgJFXpJwKs0zQ86oyIiETFPRoiIhIVg4aIiETFoCGrExQUZLhIW3ksWrTIcKqY8+fP47333it13vDwcMTFxQEA/vWvf6FXr14VK9aMXu6XTZs2YeHChRau6PUOHjyIcePGldo+cuRI7Nmzx4wVkRh41Bm9NaKioip0Pz8/Pxw7dqySqxHXpEmTDP/fu3cP3bt3x5UrV2Br+/q3/Pnz5/H555+b5QSr/fr1Q79+/URfDlkW92iIiEhUDBqySn/88QdCQ0PRvn17zJ8/H3l5edi3bx+GDRtmNJ+Xlxdu374NwHg47FXJyckYOHAg2rZti5kzZxqd0fbVYbagoCB8/fXX6Nu3L3x9fYvN/9VXXyEwMBCBgYHYs2ePUQ1ZWVmYNGkS2rVrh8GDB2PVqlWGmu/duwcvLy8UFhYaHuvloaM7d+5g1KhRCAgIQEBAAGbPno2cnJwS12ft2rWYM2cOAGDEiBEAgPbt26Nt27a4cOEC/P39cf36dcP8Go0G3t7euH//PiZMmIDMzEy0bdsWbdu2RUZGBtq0aWP0C/LLly+jQ4cOKCgowL59+zB06FAsWbIEvr6+CAkJwdmzZw3z5ubmYsGCBQgMDESXLl0QFxdnuPbNq8/ZmTNnEBISAl9fX0RFRZX4A0OyPgwaskqHDh3C119/jePHjyM1NRUbNmyo8GPl5+djypQp6N+/Py5cuICQkBAkJSWVeZ8jR45gy5YtOHnyJK5fv459+/YBAE6fPo1t27Zh69atOH78eLFLL0RFRcHe3h4//fQTli1bhu+//97kOgVBwCeffIIff/wRR44cQXp6uklXRtyxYwcA4Oeff8Zvv/0Gf39/hIaG4uDBg4Z5EhIS0KlTJzRs2BBfffUV3Nzc8Ntvv+G3336Du7s7/P39ceTIEcP8Bw8eRJ8+fWBnZwcAuHTpEho3boxz585h+vTpmDp1quEX5/PmzYOtrS2SkpJw4MABnDlzpsTvXbRaLaZNm4aZM2fi3Llz+NOf/oRff/3V5P6hqotBQ1Zp+PDhUCgUcHZ2xuTJk3H48OEKP9bvv/+OgoICjB492nCFzNatW5d5n5EjR8Ld3R3Ozs7o1q0brl69CuBFAA0aNAh//vOf4eDggKlTpxruo9PpkJSUhOnTp6NmzZrw9PTEwIEDTa7znXfeQefOnSGTySCXyzF27Fj8/PPPFVrngQMHIiEhAXq9HgAQHx9f5nclAwcONASTTqfD4cOH0b9/f0O7XC439F9oaCiaNm2K//3f/8XDhw9x+vRpLFiwADVr1oSLiwvGjBlT4vN1+vRpNG/eHCEhIbCzs8Po0aOr1DVyqOJ4MABZJYVCYfi/QYMGyMzMrPBjZWZmwt3d3ehS1g0aNCjzPq6urob/HRwcDMvPzMxEq1atSqxTq9WisLCwWO2m0mg0WLp0Kf71r3/hyZMnEAShwhcBa9OmDRwcHHDhwgW4urrizp076N69e6nzd+/eHZGRkbh79y5SU1NRu3ZteHt7G9pL6r/MzEykpaWhsLAQgYGBhja9Xm/UB0UyMzNRv359w22JRFLifGR9GDRklYquUQIAaWlpcHNzg4ODA54/f26Y/uDBA5Mey9XVFRkZGRAEwfBhmZaWVqHLRLu5uSEjI6PEOuVyOWxtbaFWq+Hh4VGsvWbNmgCA58+fo3bt2sXWYeXKlZBIJDh48CDq1q2LEydOmHQk3csB8LKivRRXV1f06tUL9vb2pc5vb2+P3r174+DBg/j3v/9ttDcDoFj/qdVqBAUFoX79+pDJZDh37txrj3hzdXVFenq64bYgCEb9Q9aLQ2dklXbu3In09HRkZ2dj8+bNCA0NRYsWLXDjxg1cvXoVeXl5Jn1/AQA+Pj6wtbXF9u3bUVhYiKSkJPzxxx8VqiskJAT79u1DSkoKnj17ZnSJZ6lUiuDgYKxbtw7Pnj3DzZs3sX//fkO7XC6
Hu7s74uPjodPpsHfvXty9e9fQ/uTJE9SsWRNOTk7IyMjAli1bTKpJLpfDxsbG6LGAF9eRP3HiBA4ePIgBAwYYpru4uCA7Oxu5ubnF5t+/fz9OnTpVbJhNq9Vi+/btKCgowJEjR5CSkoKuXbvCzc0NnTt3xl//+lc8fvwYer0ed+7cKfGy4V27dsWNGzeQlJSEwsJCbN++HQ8fPjRpHalqY9CQVVKpVBg3bhx69OiBxo0bY/LkyWjatCmmTJmCMWPGoGfPnvD19TXpsWQyGdauXYv9+/ejffv2SExMrPCJBrt27YqRI0di1KhRCA4Oho+Pj2EZwIsfjT59+hSdO3dGeHg4Bg0aZHT/JUuW4Ouvv0ZAQABu3ryJtm3bGtqmTp2K5ORk+Pn5YeLEiejZs6dJNTk4OGDSpEkYNmwY/Pz8cPHiRQBA/fr18e6770IikcDPz88wv4eHB/r06YMePXrAz8/PsIfm6+sLGxsbtGzZEo0aNTJahre3N27fvo0OHTpg1apVWLNmDerWrQsAWLFiBQoKCgxHCU6fPr3EvU25XI7Vq1dj5cqVCAgIwO3bt0u8RDJZH55Uk0hEKSkpUKlU+OOPP0ocOtq3bx/27NmDXbt2WaA6YP78+XBzczP5KoujRo1C37598eGHHxqmWXodqOrjHg1RJTt+/Djy8/Px6NEjfPHFF+jWrZtJv8g3t3v37uH48eMYPHiwSfNfunQJycnJ6N27t8iVUXXDoCGqZH//+9/RsWNHBAcHQyqVYvHixZYuqZhVq1ahb9+++Pjjj0066GHevHkYO3YsFixYYDhQgchUHDojIiJRcY+GiIhExaAhIiJRMWiIiEhUDBoiIhIVg4aIiET1/wDlbaau8oEbuQAAAABJRU5ErkJggg==\n", + "text/plain": [ + "
" ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# set basic sns \n", + "color = sns.color_palette()\n", + "sns.set(style=\"darkgrid\")\n", + "# convert dataframe to pandas for ease of use with sns\n", + "pd_train = df_train.to_pandas()\n", + "# set ax plot\n", + "ax = sns.countplot(x=\"buildingqualitytypeid\", data=pd_train)\n", + "# adjust fringe aesthetics\n", + "plt.xticks(rotation='vertical')\n", + "plt.title(\"Frequency of building quality type\", fontsize=15)\n", + "# display the graph\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": 59, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 274 + }, + "colab_type": "code", + "id": "KOHPCFRSp5y9", + "outputId": "e0f3fe2e-a82a-49e8-a798-a3f79a30bcee" + }, + "outputs": [ + { + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAX8AAAD7CAYAAACCEpQdAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAZbklEQVR4nO3dfXBU9aHG8Wc360YC7CS7hLcasc5tuHFGpBMHGOrwErxkpgVtYaZYedERday1tfY6aumLVqiYlltkCt7QXm4ptUP/IUMrzC04g+JFilVrUVpqbEBgRhLYTbwBxITNnvsHbMzL7mbfzzn5fT9/hXP2/M5zTvY8uzl72OOxLMsSAMAoXrsDAACKj/IHAANR/gBgIMofAAxE+QOAgSh/ADAQ5Q8ABqL8AcBAlD8AGIjyBwADUf4AYCDKHwAMRPkDgIEofwAwEOUPAAai/AHAQJQ
fK9dsVne3Tv7mtxmNkw/pbtNQ+yqpFIWV9Xp6ehIXf7J5STJ0vrpfbXteSr0up3BA8UuDjwknyfk6/9WrV6usrEzLli3Lavm+r2CVlaNzjWMbt2dv7kj9mU10iPmSpFis335INma0oz2j/ZVsnK5wpOj7Pd1tGmp/ZqtY60kom9M0phtwTOQi38/1nMq/oaFBJ06cUGNjo7ze7P6I4LSPveLZfRXBhOem43wVQUlK+Rh5vf32Q7IxfRXBjPZXsnFKx4SKvt/T3aah9mcu6y/GehKKH+O8CKRvwDGRLUed9lm/fr2OHDmiTZs2ye/35xwE9hqzaLE8SX6PHr9fYxYtTvkYSQrMmj3kmPGxcs3m8ft17fKlSZYonHS3aah9lZTHk3xWtuspKZFKkrzPSzQvSYbArNkaV/9vqdflFCn2YzENPCacpOSpp556KtUD1qxZo+9///s6c+aM9uzZo6amJk2bNk3f+c53dNVVV2nnzp363e9+pwMHDuhLX/pSxgEuXuyWZUkjR5bq44+zOEfqAMMhe+k1VboqFNInH3yg2MWLl9/lWZZ8wZDG3nGnAjNmDn5MnNerwOw5gz7YGvj4vmNlItk4VfPnFX2/p7tNSfdVCr5gSGOXLlP04seKhsOD56WxHl8wpFHTZ6in89yn+e5cqlGf/3zi3+2AefEM3kBAXSdPXj533uf3WzV7pv6v9eyn8wZKVLpX1ucdNUq66irp0qXeaRnzenX1v/6rFLMub0tp6aC/RJLtx8CUG+UJhgbt27wYuN1Jjols9e0Yj8ejsrLc33CnfbVPoXDax15kt4dbs7s1tzR8stt+2gcA4F6UPwAYiPIHAANR/gBgIMofAAxE+QOAgSh/ADAQ5Q8ABqL8AcBAlD8AGCjnr3QGgGzk6y5vyA7lD6DoBt6BLNoeUdu2rZLEC0CRcNoHQNHl6y5vyB7lD6Dokt2Apmg3pgHlD6D4fMFQRtORf5Q/gKLL113ekD0+8AVQdPEPdbnaxz6UPwBbBGbMpOxtxGkfADAQ5Q8ABqL8AcBAlD8AGIjyBwADUf4AYCDKHwAMRPkDgIEofwAw0JDl39DQoLq6Ok2ePFnNzc29048fP64lS5aovr5eS5Ys0QcffFDInACAPBry6x3mzZunFStWaOnSpf2mP/nkk7rzzjt1++236/e//71++MMfatu2bQULarfWF36tzlf3S7GY5PUqMGu2yv7lcwo37VBze0TyeqVYrKDfUdL3zke9PB7J75e6utLKMHCM5kGPKJ54ztZf/0q6dCnj5ZNlv7qmRtf+++ODpp/8jwZ9cvTopxNGjJAuXsx4vflg537PhVtzSw7LPmKEqn/+n7ZG8FiWZaXzwLq6OjU2Nqq6ulqRSET19fV6/fXXVVJSop6eHk2fPl179+5VMBjMKEAkcl6xmKXKytE6e/ZcVhtRaK0v/Fqdr7w8eMaVsh3I4/dr3Iq78/oCMPDOR0NJlCHTMdxs4AvAoOIH7JbBC0DffvR6PQqFRuW8+qzO+Z8+fVrjxo1TSUmJJKmkpERjx47V6dOncw7kRJ2v7k88I0HxS4W5I1GiOx+lkihDpmO42cCip/jhODb91Rln+7d69n0Fq6wcbWOS5JqTlHwq0Y72vG5Pc0d7zhmyGcPN+m27jTmAZDLpiHz3Y1blP2HCBLW1tamnp6f3tM+ZM2c0YcKEjMdyw2mfZKd3UvFVBPO6Pb6KYMa3uBuYIZsx3MyxzyfginSfo4457RMKhVRTU6Ndu3ZJknbt2qWampqMz/e7RWDW7MQzvIl3XyHuSJTozkepJMqQ6RhudnVNTcp/A7YbMcLW1Q/5ge+aNWu0d+9ehcNhVVRUqLy8XLt371ZLS4ueeOIJdXZ2KhAIqKGhQddff33GAVzxzl+pr/aJuvhqHzvlerVPMm642geGy/Bqn0K880/7ap9CcUv5p0J2e5C9+NyaWxo+2W097QMAcDfKHwAMRPkDgIEofwAwEOUPAAai/AHAQJQ
/ABiI8gcAA1H+AGAgyh8ADET5A4CBKH8AMBDlDwAGovwBwECUPwAYiPIHAANR/gBgIMofAAxE+QOAgSh/ADAQ5Q8ABqL8AcBAlD8AGIjyBwADUf4AYCDKHwAMRPkDgIF8uQ7w8ssva8OGDbIsS7FYTN/85jc1f/78fGQDABRITuVvWZYee+wx/fa3v1V1dbX+8Y9/6Gtf+5puvfVWeb38UWGKzkMHFW7aoWh7RL5gSGMWLVZgxky7Y2VlOG2Lmwyn/Z5oWyQ5bvtyfufv9Xp17tw5SdK5c+c0duxYit8gnYcOqm3bVlnd3ZKkaHtEbdu2SpLtT+5MDadtcZPhtN8TbUvrr7ZI8kg90d5pTti+nFra4/Houeee04MPPqi5c+fqG9/4hp599tl8ZYMLhJt29D7R46zuboWbdtiUKHvDaVvcZDjt90Tbop6e3uKPc8L25fTOPxqNavPmzXr++edVW1urt956S4888oh2796tkSNHpjVGKDSq9+fKytG5xLGVqdmbO9oTTo92tBdln+RzHcXeFrc+Z/Kdu5j7vdD7PNm2JJLp9uU7e07lf/ToUZ05c0a1tbWSpNraWo0YMUItLS2aMmVKWmNEIucVi1mqrByts2fP5RLHNiZn91UEFW2PJJxe6H2S7/1ezG1x63OmELmLtd+Lsc+TbUuyx6abp292r9fT701ztnI67TN+/Hi1trbq2LFjkqSWlhaFw2Fde+21OQeDO4xZtFgev7/fNI/f3/shl5sMp21xk+G03xNti0pKpJL+77OdsH05vfOvrKzUU089pYcfflgej0eStHbtWpWXl+clHJwv/oGV065kyMZw2hY3GU77Pdm2JJpm9/Z5LMuy7AzAaR97kd0ebs3u1tzS8MnuiNM+AAB3ovwBwECUPwAYiPIHAANR/gBgIMofAAxE+QOAgSh/ADAQ5Q8ABqL8AcBAOd/Mpdiav/l16eLFftPi35Xx8T/fV+er+6VYTPJ6FZg1W52vvFz4TAVfQ+GQ3R5uze7W3JKN2T0eybIkr1eKxfhun7hMvtsnUfH3iu9gAHA4j9+vcSvuTvsFgO/2SVb8EsUPwDWccCcvd5U/AAwT6d70pVAofwCwgS8YsnX97ir/ESOSz7tyMxkAcDon3MnLVeVf/fP/TPgC4AuGNH7lfQrMmXv5E3Xp8tU+c+YWOSEADBB/Y3qlm3zBUEYf9haKq672cSqy24PsxefW3NLwyW7m1T4AgLyg/AHAQJQ/ABiI8gcAA1H+AGAgyh8ADET5A4CBKH8AMBDlDwAGyvlmLl1dXXrmmWf0pz/9SaWlpZo6dapWr16dj2wAgALJufx/+tOfqrS0VHv27JHH41E4HM5HLgBwvc5DBxVu2qFoe8Qxd/CKy6n8L1y4oJ07d2r//v3yXPnyojFjxuQlGAC4Weehg2rbtlVWd7eky9/f37ZtqyQ54gUgp3P+p06dUnl5uTZu3KhFixZp+fLlevPNN/OVDQBcK9y0o7f445xwB6+4nN75R6NRnTp1SjfccIMef/xxHT58WA888IBeeukljRqV3rfO9f12usrK0bnEsRXZ7UH24nNrbqm42Zs72hNOj3a0Z5Uj39lzKv+JEyfK5/NpwYIFkqSbbrpJFRUVOn78uG688ca0xuArne1Fdnu4Nbtbc0vFz+6rCCa8VaOvIphxDsd9pXMwGNT06dP12muvSZKOHz+uSCSiSZMm5RwMANxszKLF8vj9/aY54Q5ecTlf7fOjH/1Iq1atUkNDg3w+n37yk58oEAjkIxsAuFb8Q91hebWPJFVVVek3v/lNPrIAwLASmDHTMWU/EP/DFwAMRPkDgIEofwAwEOUPAAai/AHAQJQ/ABiI8gcAA1H+AGAgyh8ADET5A4CBKH8AMBDlDwAGovwBwECUPwAYiPIHAANR/gBgIMofAAxE+QOAgSh/ADAQ5Q8ABqL8AcBAlD8AGIjyBwADUf4AYCDKHwAMRPkDgIEofwAwUN7Kf+PGjZo8ebK
am5vzNSQAoEB8+Rjkb3/7m/76179q4sSJ+RguY52HDirctEPR9oh8wZDGLFqswIyZQ87LdsyT/9GgT44e7X1sv5c7j0e66iqpu1vyeqVYrHd5Sb1jxufZzc0v1WQvPrfmlpybvfq/ttqy3pzLv7u7W08//bTWrVunu+66Kx+ZMtJ56KDatm2V1d0tSYq2R9S2bWvv/GTzUr0ApBrzo9f+t1/xD2JZl4tf6i33aHtErb/6b0mW1NPTbx4AszXfe7ctLwA5l/+GDRt02223qaqqKh95MhZu2tFb0nFWd7fCTTt6f040L1X5pxoz2h7JLmhPNLvlAKAAcir/t99+W++++64effTRrMcIhUb1/lxZOTrj5Zs72hNOjyaZHp+Xal3ZjAkA2Uqn+7Lpx1RyKv833nhDx44d07x58yRJra2tWrlypdauXatbbrklrTEikfOKxSxVVo7W2bPnMs7gqwgmfDfuqwhKUtJ5qdaVasys3/kDQBJDdV/ffvR6Pf3eNGcrp6t97r//fh04cED79u3Tvn37NH78eG3ZsiXt4s+HMYsWy+P395vm8fs1ZtHilPOyHfPqmprsgpb4pJKS7JYFgDzLy9U+doqfu091RU+mV/ukGjMwY+agq336cdnVPgDsZdfVPh7Lsixb1nxFrqd9nIDs9iB78bk1tzR8sjvitA8AwJ0ofwAwEOUPAAai/AHAQJQ/ABiI8gcAA1H+AGAgyh8ADET5A4CBKH8AMBDlDwAGovwBwECUPwAYiPIHAANR/gBgIMofAAxE+QOAgSh/ADAQ5Q8ABqL8AcBAlD8AGIjyBwADUf4AYCDKHwAMRPkDgIEofwAwEOUPAAby5bJwR0eHHnvsMZ08eVJ+v1+TJk3S008/rWAwmK98AIACyOmdv8fj0b333qs9e/boxRdfVFVVldatW5evbACAAsmp/MvLyzV9+vTef0+dOlUffvhhzqEAAIXlsSzLysdAsVhM99xzj+rq6rRixYp8DAkAKJCczvn3tXr1apWVlWnZsmUZLReJnFcsZqmycrTOnj2XrzhFRXZ7kL343JpbGj7ZvV6PQqFROY+Zl/JvaGjQiRMn1NjYKK+XC4gAwOlyLv/169fryJEj+sUvfiG/35+PTACAAsup/N9//301Njbquuuu0x133CFJuuaaa7Rp06a8hAMAFEZO5f+5z31O7733Xr6yAACKhBP0AGAgyh8ADET5A4CBKH8AMBDlDwAGovwBwECUPwAYiPIHAANR/gBgoLx9qycu6zx0UOGmHYq2R+QLhjRm0WIFZszMeLmyKVP08TvvJByn72O9o0bJsixZFy4MuVyi9TW3RySvV4rF8r4vCq05yXRPaamsri75giH5xo3VJ++957jtS5bd6dyaW3Jgdq9XgVmzNX7ZXbasPm/f55+t4fSVzp2HDqpt21ZZ3d298zx+v8atuDvlC0Ci5QaKjyNpyMcmWq7v+tNZH4DiCMyZO+QLQCG+0pnTPnkUbtoxqFCt7m6Fm3ZkvNxA8XHSeexQ6890DACF0/nqflvWy2mfPIq2RzKanu78TB831HLZjgOgAGw6Jck7/zzyBUMZTU93ft/HpfvYVONnMwaAArHpBliUfx6NWbRYngE3tPH4/RqzaHHGyw0UHyedxw61/kzHAFA4gVmzbVkvp33yKP6haqZX+yRabqirdnK52mfg+tx6tU8ybrjaB+Bqn2F0tY8bkd0ebs3u1tzS8MnO1T4AgKxR/gBgIMofAAxE+QOAgWy/2sfr9ST82W3Ibg+yF59bc0vDI3u+tsH2q30AAMXHaR8AMBDlDwAGovwBwECUPwAYiPIHAANR/gBgIMofAAxE+QOAgSh/ADBQQcq/oaFBdXV1mjx5spqbm3unv/zyy/ryl7+s22+/XQsXLtTevXvTmnf8+HEtWbJE9fX1WrJkiT744INCxE6Z/ZVXXtFXvvIVLVy4UMuWLdOpU6fSyufk7B0dHbrvvvtUX1+vhQsX6qGHHlJ7e7s
rsve1cePGQcs5PXtXV5eefPJJzZ8/XwsXLtQPfvAD12R3wrGa6rmb7fHo5OwFOVatAnjjjTesDz/80Jo7d6713nvvWZZlWbFYzLr55pt7/3306FFr6tSpVk9PT8p5lmVZy5cvt3bu3GlZlmXt3LnTWr58eSFiJ83+0UcfWdOmTbOOHTvWm+Gee+7pXSZVPidn7+josA4dOtS7/LPPPmt997vfdUX2uCNHjlgrV6605syZ07ucG7KvXr3a+vGPf2zFYjHLsizr7NmzrsjulGM11XM32+PRydkLcawWpPzjBpb/tGnTrDfffNOyLMv685//bM2fP3/IeeFw2KqtrbWi0ahlWZYVjUat2tpaKxKJFDJ6v+yHDx+2vvjFL/bO6+josKqrq61IJJIyn9OzD/THP/7RuuuuuyzLcv5+tyzL6urqsr761a9aJ0+e7Lec07OfP3/eqq2ttc6fPz9oDKdnd+KxalmfPnezPR6dnj3ZMpaV/X4v2rd6ejwePffcc3rwwQdVVlamCxcuaPPmzUPOO336tMaNG6eSkhJJUklJicaOHavTp08rGAwWJftnP/tZhcNhvfPOO5oyZYpefPHF3myWZSXNl2qeE7L3zRCLxbR9+3bV1dX1znd69g0bNui2225TVVVVv+Wcnr2kpETl5eXauHGjXn/9dY0cOVIPP/ywbr75ZsdnDwaDjjtW+z53U2Vw4rGabvZCHKtFK/9oNKrNmzfr+eefV21trd566y098sgj2r17t0pLS5POc4LRo0dr/fr1Wrt2rbq6ujRr1iwFAgH5fD5dunTJ7ngppcre1+rVq1VWVqZly5bZlHSwVNnffvttvfvuu3r00UftjpnQUM+ZU6dO6YYbbtDjjz+uw4cP64EHHtBLL71kd2xJqbOnOo7t0ve5+/e//922HNnIJnu+jtWilf/Ro0d15swZ1dbWSpJqa2s1YsQItbS0yOPxJJ33mc98Rm1tberp6VFJSYl6enp05swZTZgwoVjRJUkzZ87UzJkzJUnhcFhbtmxRVVWVLl68mDSfZVmOzh7X0NCgEydOqLGxUV7v5WsAJkyY4OjsL7zwgo4dO6Z58+ZJklpbW7Vy5UqtXbtWNTU1js7+ySefyOfzacGCBZKkm266SRUVFTp+/LgmTpzo6OypjmM7jtWBz91Uz9tUx6Mdx2om2ZMtI2V/rBbtUs/x48ertbVVx44dkyS1tLQoHA7r2muvTTkvFAqppqZGu3btkiTt2rVLNTU1RfsTOO7s2bOSLv/J9bOf/Ux33HGHysrKUuZzenZJWr9+vY4cOaJNmzbJ7/f3LuP07Pfff78OHDigffv2ad++fRo/fry2bNmiW265xfHZg8Ggpk+frtdee03S5Ss1IpGIJk2a5PjsTjpWEz13sz0enZ492TLpLJdMQW7msmbNGu3du1fhcFgVFRUqLy/X7t279Yc//EG//OUv5fFcvhPNt771Ld16662SlHJeS0uLnnjiCXV2dioQCKihoUHXX399vmOnzP69731Pf/nLX3Tp0iV94Qtf0KpVq1RaWjpkPidnf//997VgwQJdd911uvrqqyVJ11xzjTZt2uT47APV1dWpsbFR1dXVrsh+6tQprVq1Sh999JF8Pp++/e1va/bs2a7I7oRjNdVzN9vj0cnZC3GscicvADAQ/8MXAAxE+QOAgSh/ADAQ5Q8ABqL8AcBAlD8AGIjyBwADUf4AYKD/B0l3Ui3/skQ5AAAAAElFTkSuQmCC\n", + "text/plain": [ + "
" ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# plot year built against building quality type\n", + "plt.plot(pd_train.yearbuilt, pd_train.buildingqualitytypeid, 'ro')\n", + "# display the graph\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "_647tI5Lp94v" + }, + "source": [ + "### Final adjustments\n", + "- filling NaNs" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "ofZIC0EdKJ0Y" + }, + "source": [ + "# -----current: test ready-----\n", + "- converting to pandas \n", + " - to see what's going on\n", + " - figuring out what can and what can't be replicated in cuML" + ] + }, + { + "cell_type": "code", + "execution_count": 60, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "-4A3-sjRp8AE" + }, + "outputs": [], + "source": [ + "from sklearn import neighbors\n", + "# from cuml.preprocessing.model_selection import train_test_split\n", + "from sklearn.model_selection import StratifiedKFold,GridSearchCV,train_test_split\n", + "# location seems to be related to building quality, so impute missing values with a KNN classifier\n", + "\n", + "def fillna_knn(df, base, target):\n", + " data_colnames = [target] + base\n", + " #print(\"data_colnames\",data_colnames)\n", + " missing_values_boolflag = df[target].isnull() # True for rows missing the target, False for rows with a value\n", + " #print(\"miss\",missing_values_boolflag.head())\n", + " not_missing_boolflag = ~missing_values_boolflag \n", + " #print(\"not miss\",not_missing_boolflag.head())\n", + " number_of_missing_val = missing_values_boolflag.sum()\n", + " print(\"# of miss\",number_of_missing_val)\n", + " not_missing_rows = df.loc[not_missing_boolflag, data_colnames]\n", + " #print(not_missing_rows.head())\n", + " Y = not_missing_rows[target]\n", + " X = not_missing_rows[base]\n", + " X_train, X_test, Y_train, Y_test = train_test_split(X, Y, \n", + " test_size=0.20,\n", + " random_state=3192,\n", + "
stratify=Y)\n", + " metrics = ['euclidean'] \n", + " weights = ['distance'] \n", + " numNeighbors = [5,10,15,20,25]\n", + " param_grid = dict(metric=metrics,weights=weights,n_neighbors=numNeighbors)\n", + " # shuffle=True so that random_state takes effect; recent scikit-learn rejects random_state with shuffle=False\n", + " cv = StratifiedKFold(n_splits=3,shuffle=True,random_state=3192)\n", + " grid = GridSearchCV(neighbors.KNeighborsClassifier(n_jobs=-1),param_grid=param_grid,cv=cv,scoring='f1_weighted',refit=True,return_train_score=True,verbose=1,n_jobs=-1,pre_dispatch='n_jobs')\n", + " grid.fit(X_train ,Y_train)\n", + " #print(\"grid.cv_results_\",grid.cv_results_)\n", + " print(\"grid.best_estimator_\",grid.best_estimator_)\n", + " print(\"grid.best_params_\",grid.best_params_)\n", + " print(\"grid.scorer_\",grid.scorer_)\n", + " #print(\"grid.n_splits_\",grid.n_splits_)\n", + " y_true, y_pred = Y_test, grid.predict(X_test)\n", + " \n", + " Z = grid.predict(df.loc[missing_values_boolflag, base])\n", + " #df.loc[ missing_values_boolflag, target ] = Z\n", + " return Z" + ] + }, + { + "cell_type": "code", + "execution_count": 63, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 573 + }, + "colab_type": "code", + "id": "AT8Osn51lD9v", + "outputId": "8ab0690a-2e06-468e-b7ce-f4d051a3ce83" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CURRENT DF SITUATION\n", + "\n", + "SHAPE = (90275, 45)\n", + "NULL COUNT = 32911\n", + "VALUE COUNTS\n", + "7.0 29310\n", + "4.0 23839\n", + "1.0 2627\n", + "10.0 1461\n", + "12.0 119\n", + "8.0 5\n", + "6.0 2\n", + "11.0 1\n", + "Name: buildingqualitytypeid, dtype: int32\n", + "\n" + ] + }, + { + "data": { + "text/plain": [ + "0 null\n", + "1 4.0\n", + "2 null\n", + "3 4.0\n", + "4 7.0\n", + "Name: buildingqualitytypeid, dtype: float64" + ] + }, + "execution_count": 63, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "print('CURRENT DF SITUATION\\n')\n", + "\n", + "print(f'SHAPE = {df_train.shape}')\n", + "print(f'NULL COUNT = 
{df_train.buildingqualitytypeid.isnull().sum()}\\nVALUE COUNTS\\n{df_train.buildingqualitytypeid.value_counts()}\\n')\n", + "\n", + "df_train['buildingqualitytypeid'].head()" + ] + }, + { + "cell_type": "code", + "execution_count": 64, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 225 + }, + "colab_type": "code", + "id": "79bB7JKdAEtX", + "outputId": "32b79160-fd19-4d39-988a-fc5fcd7c3284" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "NULL COUNT = 0\n", + "VALUE COUNTS\n", + "-1.0 32911\n", + " 7.0 29310\n", + " 4.0 23839\n", + " 1.0 2627\n", + " 10.0 1461\n", + " 12.0 119\n", + " 8.0 5\n", + " 6.0 2\n", + " 11.0 1\n", + "Name: buildingqualitytypeid, dtype: int32\n" + ] + } + ], + "source": [ + "df_train['buildingqualitytypeid'] = df_train['buildingqualitytypeid'].fillna(-1)\n", + "\n", + "print(f'NULL COUNT = {df_train.buildingqualitytypeid.isnull().sum()}\\nVALUE COUNTS\\n{df_train.buildingqualitytypeid.value_counts()}')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "DVgF1c_p_bN1" + }, + "source": [ + "# -----current: break-----\n", + "- break 1 of 2" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 903 + }, + "colab_type": "code", + "id": "6eES-hq--NKZ", + "outputId": "2bc86856-507d-47bf-cfab-d29649cba819" + }, + "outputs": [], + "source": [ + "# make safe copy\n", + "test = df_train.copy()\n", + "df_train = test.copy()\n", + "# switch to pandas (figuring out what's going on)\n", + "df_train = df_train.to_pandas()\n", + "\n", + "print(df_train.info())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 762 + }, + "colab_type": "code", + "id": "mAB9bsrPAGzQ", + "outputId": "d847758e-212e-4de8-85c4-89b469b71c48" + }, + "outputs": [], + "source": [ + "# say we 
run this whole thing by buildingqualitytypeid\n", + "# drop building types that aren't seen at least 3 times in the data\n", + "# df_train = df_train.groupby(\"buildingqualitytypeid\").filter(lambda x: x.buildingqualitytypeid.size > 3)\n", + "\n", + "# BACK TO cuDF\n", + "df_train = cudf.from_pandas(df_train)\n", + "\n", + "print(df_train.buildingqualitytypeid.value_counts())\n", + "print()\n", + "print(df_train.buildingqualitytypeid.isnull().sum())\n", + "print(df_train.shape)\n", + "print()\n", + "\n", + "type_ids = list(set(df_train.buildingqualitytypeid.values))\n", + "from time import sleep\n", + "safe = []\n", + "for tid in type_ids:\n", + " print(tid)\n", + " sleep(5)\n", + " t = len(df_train.loc[df_train.buildingqualitytypeid == tid])\n", + " if t > 3:\n", + " safe.append(tid)\n", + " else:\n", + " print(f'{tid} count too low @ {t}')\n", + "for tid in type_ids:\n", + " if tid not in safe:\n", + " df_train = df_train.loc[df_train.buildingqualitytypeid != tid]\n", + "\n", + "print()\n", + "print(df_train.buildingqualitytypeid.value_counts())\n", + "print()\n", + "\n", + "df_train['buildingqualitytypeid'] = df_train['buildingqualitytypeid'].replace(-1,np.nan)\n", + "print(df_train.buildingqualitytypeid.isnull().sum())\n", + "print(df_train.shape)\n", + "\n", + "# BACK TO PANDAS\n", + "df_train = df_train.to_pandas()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "Zl7eXGt_g1uU" + }, + "source": [ + "# -----current: break-----\n", + "- break 2 of 2\n", + " - below is last cell run" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 557 + }, + "colab_type": "code", + "id": "Q3ZBSOHm-79A", + "outputId": "e9ddb9b3-0bb0-4cf7-fa8e-ca35b9ea7f46" + }, + "outputs": [], + "source": [ + "# run cell above (currently broken) as would be in pandas\n", + "not_df_train = df_train.to_pandas()\n", + "not_df_train = 
not_df_train.groupby(\"buildingqualitytypeid\").filter(lambda x: x.buildingqualitytypeid.size > 3)\n", + "\n", + "missing_values = fillna_knn(not_df_train, \n", + " base = ['latitude', 'longitude'], \n", + " target = 'buildingqualitytypeid')\n", + "\n", + "print(\"predicted output shape\",missing_values.shape)\n", + "missing_values_boolflag = not_df_train['buildingqualitytypeid'].isnull()\n", + "not_df_train.loc[missing_values_boolflag, 'buildingqualitytypeid'] = missing_values\n", + "\n", + "print(not_df_train.buildingqualitytypeid.isnull().sum())" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "bgXh5OATEacY" + }, + "source": [ + "# BELOW NOT (really) RUN\n", + "- if run, was in pandas" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 278 + }, + "colab_type": "code", + "id": "oTh_XPErqkHf", + "outputId": "3e667bca-70c5-4b66-c7d2-12d171cb140b" + }, + "outputs": [], + "source": [ + "print(df_train.heating_system_id.isnull().sum())\n", + "print(df_train.shape)\n", + "temp=df_train.copy()\n", + "temp['heating_system_id']=temp['heating_system_id'].fillna(-1)\n", + "temp=temp.groupby(\"heating_system_id\").filter(lambda x: x.heating_system_id.size > 3)\n", + "temp['heating_system_id'] = temp['heating_system_id'].replace(-1,np.nan)\n", + "print(temp.heating_system_id.isnull().sum())\n", + "print(temp.shape)\n", + "\n", + "missing_values=fillna_knn(temp,\n", + " base = [ 'latitude', 'longitude' ] ,\n", + " target = 'heating_system_id')\n", + "\n", + "print(\"predicted output shape\",missing_values.shape)\n", + "missing_values_boolflag = df_train['heating_system_id'].isnull()\n", + "df_train.loc[ missing_values_boolflag, 'heating_system_id' ] = missing_values\n", + "\n", + "\n", + "print(df_train.heating_system_id.isnull().sum())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": 
"https://localhost:8080/", + "height": 278 + }, + "colab_type": "code", + "id": "oVjNSkUYqnCt", + "outputId": "80fc7e87-36cd-44b7-96e9-ef0631c7d10c" + }, + "outputs": [], + "source": [ + "print(df_train.ac_id.isnull().sum())\n", + "print(df_train.shape)\n", + "temp=df_train.copy()\n", + "temp['ac_id']=temp['ac_id'].fillna(-1)\n", + "temp=temp.groupby(\"ac_id\").filter(lambda x: x.ac_id.size > 3)\n", + "temp['ac_id'] = temp['ac_id'].replace(-1,np.nan)\n", + "print(temp.ac_id.isnull().sum())\n", + "print(temp.shape)\n", + "\n", + "missing_values=fillna_knn(temp,\n", + " base = [ 'latitude', 'longitude' ] ,\n", + " target = 'ac_id')\n", + "\n", + "print(\"predicted output shape\",missing_values.shape)\n", + "missing_values_boolflag = df_train['ac_id'].isnull()\n", + "df_train.loc[ missing_values_boolflag, 'ac_id' ] = missing_values\n", + "\n", + "print(df_train.ac_id.isnull().sum())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 278 + }, + "colab_type": "code", + "id": "qTbcYbexqr0Y", + "outputId": "3459affa-a41a-4241-ab62-f0dfcadda039" + }, + "outputs": [], + "source": [ + "#yearbuilt\n", + "print(df_train.yearbuilt.isnull().sum())\n", + "print(df_train.shape)\n", + "temp=df_train.copy()\n", + "temp['yearbuilt']=temp['yearbuilt'].fillna(-1)\n", + "temp=temp.groupby(\"yearbuilt\").filter(lambda x: x.yearbuilt.size > 3)\n", + "temp['yearbuilt'] = temp['yearbuilt'].replace(-1,np.nan)\n", + "print(temp.yearbuilt.isnull().sum())\n", + "print(temp.shape)\n", + "\n", + "missing_values=fillna_knn(temp,\n", + " base = [ 'latitude', 'longitude','buildingqualitytypeid','propertylandusetypeid' ] ,\n", + " target = 'yearbuilt')\n", + "\n", + "print(\"predicted output shape\",missing_values.shape)\n", + "missing_values_boolflag = df_train['yearbuilt'].isnull()\n", + "df_train.loc[ missing_values_boolflag, 'yearbuilt' ] = missing_values\n", + "print(df_train.yearbuilt.isnull().sum())" 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "Gx1LYGmfqxLk" + }, + "outputs": [], + "source": [ + "#location seems to be related to building quality, (knnregressor)\n", + "from sklearn.model_selection import KFold\n", + "\n", + "def fillna_knnr( df, base, target):\n", + " data_colnames = [ target ] + base\n", + " #print(\"data_colnames\",data_colnames)\n", + " missing_values_boolflag = df[target].isnull() #true for rows missing the target, false for rows with values\n", + " #print(\"miss\",missing_values_boolflag.head())\n", + " not_missing_boolflag = ~missing_values_boolflag \n", + " #print(\"not miss\",not_missing_boolflag.head())\n", + " number_of_missing_val = missing_values_boolflag.sum()\n", + " print(\"# of miss\",number_of_missing_val)\n", + " not_missing_rows = df.loc[ not_missing_boolflag, data_colnames]\n", + " #print(not_missing_rows.head())\n", + " Y = not_missing_rows[target]\n", + " X = not_missing_rows[base]\n", + " X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20, random_state=3192)\n", + " metrics = ['euclidean'] \n", + " weights = ['distance'] \n", + " numNeighbors = [5,10,15,20,25]\n", + " param_grid = dict(metric=metrics,weights=weights,n_neighbors=numNeighbors)\n", + " # random_state has no effect when shuffle=False (and newer scikit-learn rejects the combination)\n", + " cv = KFold(n_splits=3,shuffle=False) \n", + " grid = GridSearchCV(neighbors.KNeighborsRegressor(n_jobs=-1),param_grid=param_grid,cv=cv,scoring='neg_mean_absolute_error',refit=True,return_train_score=True,verbose=1,n_jobs=-1,pre_dispatch='n_jobs')\n", + " grid.fit(X_train ,Y_train)\n", + " #print(\"grid.cv_results_\",grid.cv_results_)\n", + " print(\"grid.best_estimator_\",grid.best_estimator_)\n", + " print(\"grid.best_params_\",grid.best_params_)\n", + " print(\"grid.scorer_\",grid.scorer_)\n", + " #print(\"grid.n_splits_\",grid.n_splits_)\n", + " y_true, y_pred = Y_test, grid.predict(X_test) \n", + " Z = grid.predict(df.loc[missing_values_boolflag, base])\n", + " 
#df.loc[ missing_values_boolflag, target ] = Z\n", + " return Z" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 606 + }, + "colab_type": "code", + "id": "pj5PXm7ozg5l", + "outputId": "3d42279f-221c-444c-8795-05a0832f97cd" + }, + "outputs": [], + "source": [ + "#garage_sqft\n", + "print(df_train.garage_sqft.isnull().sum())\n", + "print(df_train.shape)\n", + "temp=df_train.loc[df_train.garagecarcnt>0,df_train.columns].copy()\n", + "\n", + "print(temp.garage_sqft.isnull().sum())\n", + "print(temp.shape)\n", + "\n", + "missing_values=fillna_knnr(temp,\n", + " base = [ 'latitude', 'longitude','garagecarcnt'] ,\n", + " target = 'garage_sqft')\n", + "\n", + "print(\"predicted output shape\",missing_values.shape)\n", + "# only rows with a garage were imputed above, so restrict the mask to the same rows\n", + "missing_values_boolflag = df_train['garage_sqft'].isnull() & (df_train.garagecarcnt > 0)\n", + "df_train.loc[missing_values_boolflag, 'garage_sqft'] = missing_values\n", + "print(df_train.garage_sqft.isnull().sum())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "b7e5CFTyzg_M" + }, + "outputs": [], + "source": [ + "df_train = df_train.drop('parcelid', axis=1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "YxGquCOOzhD7" + }, + "outputs": [], + "source": [ + "#All the other columns with missing values seem to be integer-valued and will need regression to be imputed,\n", + "#time to one-hot encode the categorical variables\n", + "\n", + "#Identify numerical columns to produce a heatmap\n", + "catcols = ['ac_id','buildingqualitytypeid','deck_flag','fips', 'heating_system_id','has_hottub_or_spa',\n", + " 'just_hottub_or_spa', 'pool_with_spa_tub_yes','pool_with_spa_tub_no','propertylandusetypeid','basement_flag'\n", + " ,'fireplaceflag','taxdelinquencyflag']\n", + "numcols = [x for x in df_train.columns if x not in catcols]" + ] + }, + { + 
"cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "uVZkszJEzhHj" + }, + "outputs": [], + "source": [ + "#total_finished_living_area_sqft\n", + "\n", + "print(df_train.total_finished_living_area_sqft.isnull().sum())\n", + "print(df_train.shape)\n", + "temp=df_train.copy()\n", + "print(temp.total_finished_living_area_sqft.isnull().sum())\n", + "print(temp.shape)\n", + "missing_values=fillna_knnr(temp,\n", + " base = [ 'latitude', 'longitude','basementsqft','numberofstories','poolcnt','garagecarcnt','garage_sqft','propertylandusetypeid'] ,\n", + " target = 'total_finished_living_area_sqft')\n", + "\n", + "print(\"predicted output shape\",missing_values.shape)\n", + "missing_values_boolflag = df_train['total_finished_living_area_sqft'].isnull()\n", + "df_train.loc[ missing_values_boolflag, 'total_finished_living_area_sqft' ] = missing_values\n", + "print(df_train.total_finished_living_area_sqft.isnull().sum())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "CVrTMb92zhLX" + }, + "outputs": [], + "source": [ + "#total_bath\t1165\n", + "#full_bath\t1182\n", + "#half_bath\t1182\n", + "#roomcnt\t1416\n", + "#bedroomcnt\t1421\n", + "\n", + "#total_finished_living_area_sqft\n", + "\n", + "print(df_train.total_bath.isnull().sum())\n", + "print(df_train.shape)\n", + "temp=df_train.copy()\n", + "print(temp.total_bath.isnull().sum())\n", + "print(temp.shape)\n", + "missing_values=fillna_knnr(temp,\n", + " base = ['propertylandusetypeid','total_finished_living_area_sqft' ] ,\n", + " target = 'total_bath')\n", + "\n", + "print(\"predicted output shape\",missing_values.shape)\n", + "missing_values_boolflag = df_train['total_bath'].isnull()\n", + "df_train.loc[ missing_values_boolflag, 'total_bath' ] = missing_values\n", + "print(df_train.total_bath.isnull().sum())" + ] + }, + { + 
"cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "BjIKlu-tzhPI" + }, + "outputs": [], + "source": [ + "# drop half_bath and full_bath, as they are redundant with total_bath\n", + "df_train = df_train.drop(['full_bath','half_bath'], axis=1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "02X1y6EBzhT9" + }, + "outputs": [], + "source": [ + 
"df_train['total_bath']=df_train.total_bath.round(1)\n", + "df_train['bedroomcnt']=df_train.bedroomcnt.round(1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "QF9DtDAczhaW" + }, + "outputs": [], + "source": [ + "#recalculate roomcnt\t1416 as we have used imputation for total_bath and bedroomcnt\n", + "\n", + "df_train.loc[(df_train.roomcnt.isnull()),['roomcnt']]=df_train.total_bath + df_train.bedroomcnt" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "U5N41TBlz60W" + }, + "outputs": [], + "source": [ + "print(df_train.shape)\n", + "df_train =df_train.loc[(df_train.total_parcel_tax.notnull()) & (df_train.land_tax.notnull()),df_train.columns]\n", + "\n", + "print(df_train.shape)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "kv9h5yL3z64Q" + }, + "outputs": [], + "source": [ + "#lot_area_sqft\n", + "print(df_train.lot_area_sqft.isnull().sum())\n", + "print(df_train.shape)\n", + "temp=df_train.copy()\n", + "print(temp.lot_area_sqft.isnull().sum())\n", + "print(temp.shape)\n", + "missing_values=fillna_knnr(temp,\n", + " base = ['latitude','longitude','propertylandusetypeid','total_finished_living_area_sqft','roomcnt','numberofstories' ] ,\n", + " target = 'lot_area_sqft')\n", + "\n", + "print(\"predicted output shape\",missing_values.shape)\n", + "missing_values_boolflag = df_train['lot_area_sqft'].isnull()\n", + "df_train.loc[ missing_values_boolflag, 'lot_area_sqft' ] = missing_values.round(2)\n", + "print(df_train.lot_area_sqft.isnull().sum())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "GYJLHrR4z68f" + }, + "outputs": [], + "source": [ + "# predict structure_tax and recalculate total_parcel_tax = land_tax + structure_tax\n", + "\n", + "\n", + 
"print(df_train.structure_tax.isnull().sum())\n", + "print(df_train.shape)\n", + "temp=df_train.copy()\n", + "print(temp.structure_tax.isnull().sum())\n", + "print(temp.shape)\n", + "missing_values=fillna_knnr(temp,\n", + " base = ['latitude','longitude','lot_area_sqft','propertylandusetypeid','total_finished_living_area_sqft','roomcnt','numberofstories' ] ,\n", + " target = 'structure_tax')\n", + "\n", + "print(\"predicted output shape\",missing_values.shape)\n", + "missing_values_boolflag = df_train['structure_tax'].isnull()\n", + "df_train.loc[ missing_values_boolflag, 'structure_tax' ] = missing_values.round(2)\n", + "print(df_train.structure_tax.isnull().sum())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "Ya-3K06Zz6_y" + }, + "outputs": [], + "source": [ + "#36 total_property_tax_2016 \n", + "\n", + "#total_parcel_tax = land_tax + structure_tax\n", + " \n", + "df_train['total_parcel_tax']=df_train['structure_tax']+df_train['land_tax']" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "8Fvr7voVz7DX" + }, + "outputs": [], + "source": [ + "#age of the property\n", + "df_train['age'] = 2016 - df_train['yearbuilt']\n", + "df_train=df_train.drop(['yearbuilt'],axis=1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "xl0EOIT-z7Gl" + }, + "outputs": [], + "source": [ + "#total_property_tax_2016\n", + "\n", + "\n", + "print(df_train.total_property_tax_2016.isnull().sum())\n", + "print(df_train.shape)\n", + "temp=df_train.copy()\n", + "print(temp.total_property_tax_2016.isnull().sum())\n", + "print(temp.shape)\n", + "missing_values=fillna_knnr(temp,\n", + " base = ['latitude','longitude','lot_area_sqft','propertylandusetypeid','total_finished_living_area_sqft','roomcnt','numberofstories' ] ,\n", + " target = 
'total_property_tax_2016')\n", + "\n", + "print(\"predicted output shape\",missing_values.shape)\n", + "missing_values_boolflag = df_train['total_property_tax_2016'].isnull()\n", + "df_train.loc[ missing_values_boolflag, 'total_property_tax_2016' ] = missing_values.round(2)\n", + "print(df_train.total_property_tax_2016.isnull().sum())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "YlaxWegqz7I-" + }, + "outputs": [], + "source": [ + "#check missing values\n", + "\n", + "missing_df = df_train.isnull().sum(axis=0).reset_index()\n", + "missing_df.columns = ['column_name', 'missing_count']\n", + "missing_df = missing_df.loc[missing_df['missing_count']>0]\n", + "missing_df = missing_df.sort_values(by='missing_count')\n", + "print(missing_df)\n", + "print(missing_df.shape)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "dIl_nqKVz7NQ" + }, + "outputs": [], + "source": [ + "#both columns above are missing 92% of their data and there is no related variable to impute them from, so drop them at this point\n", + "\n", + "df_train = df_train.drop(['finished_living_area_entryfloor_sqft2','finished_living_area_entryfloor_sqft1'], axis=1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "HQJd7rgKz7Qq" + }, + "outputs": [], + "source": [ + "#Identify numerical columns to produce a heatmap\n", + "catcols = ['ac_id','buildingqualitytypeid','deck_flag','fips','pool_with_spa_tub_no','pool_with_spa_tub_yes','has_hottub_or_spa',\n", + " 'just_hottub_or_spa','heating_system_id','propertylandusetypeid','basement_flag','fireplaceflag','taxdelinquencyflag']\n", + "numcols = [x for x in df_train.columns if x not in catcols]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "VUN3a6uJz7Ut" + 
}, + "outputs": [], + "source": [ + "# 2 variables are in object datatype, converting them to numeric\n", + "df_train[['census_tractnumber','block_number']] = df_train[['census_tractnumber','block_number']].apply(pd.to_numeric)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "zGx77rRAz7ZZ" + }, + "outputs": [], + "source": [ + "# dropping categorical columns as xgboost feature selection cannot handle them\n", + "\n", + "train_x = df_train.drop(catcols+['logerror'], axis=1)\n", + "\n", + "train_y=df_train['logerror']\n", + "\n", + "train_x = train_x.astype(float) \n", + "train_y = train_y.astype(float)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "es_Ew2YJz7dT" + }, + "outputs": [], + "source": [ + "pd.options.display.max_rows = 65\n", + "\n", + "dtype_df = train_x.dtypes.reset_index()\n", + "dtype_df.columns = [\"Count\", \"Column Type\"]\n", + "#dtype_df" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "bvWIhR38z7fW" + }, + "outputs": [], + "source": [ + "df_train.loc[df_train.has_hottub_or_spa==True,'has_hottub_or_spa']=\"Yes\"\n", + "df_train.loc[df_train.has_hottub_or_spa==0,'has_hottub_or_spa']=\"No\"\n", + "\n", + "df_train.loc[df_train.just_hottub_or_spa==0,'just_hottub_or_spa']=\"No\"\n", + "df_train.loc[df_train.just_hottub_or_spa==1,'just_hottub_or_spa']=\"Yes\"\n", + "\n", + "df_train.loc[df_train.deck_flag==0,'deck_flag']=\"No\"\n", + "df_train.loc[df_train.deck_flag==1,'deck_flag']=\"Yes\"\n", + "\n", + "df_train.loc[df_train.basement_flag==0,'basement_flag']=\"No\"\n", + "df_train.loc[df_train.basement_flag==1,'basement_flag']=\"Yes\"\n", + "\n", + "df_train.loc[df_train.fireplaceflag==False,'fireplaceflag']=\"No\"\n", + "df_train.loc[df_train.fireplaceflag==True,'fireplaceflag']=\"Yes\"\n", + "#" + ] + }, + { + 
"cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "Ef9JjrmMz7jw" + }, + "outputs": [], + "source": [ + "#ac_id,heating_system_id,propertylandusetypeid\n", + "dummieslist=['has_hottub_or_spa','just_hottub_or_spa',\n", + " 'deck_flag','fips','basement_flag','fireplaceflag','taxdelinquencyflag']" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "Z51Zrt2Uz7oD" + }, + "outputs": [], + "source": [ + "df_train[dummieslist] = df_train[dummieslist].astype(object)\n", + "dummies = pd.get_dummies(df_train[dummieslist], prefix= dummieslist)\n", + "dummies.shape" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "VHBi5Gg6z7tu" + }, + "outputs": [], + "source": [ + "dummies2=['pool_with_spa_tub_no','pool_with_spa_tub_yes']\n", + "df_train[dummies2] = df_train[dummies2].astype(int)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "oocTPKI9z7rk" + }, + "outputs": [], + "source": [ + "import MySQLdb\n", + "from sqlalchemy import create_engine\n", + "engineString = 'mysql+mysqldb://root:MyNewPass@localhost/sakila'\n", + "engine = create_engine(engineString)\n", + "con=engine.connect()\n", + "\n", + "with engine.connect() as con, con.begin():\n", + " df_train.to_sql('df_train_f1', engine, chunksize=10000, index =False,if_exists ='replace')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "zj5ZLSPlz7XC" + }, + "outputs": [], + "source": [ + "numcols2=['basementsqft','total_bath','bedroomcnt','total_finished_living_area_sqft','fireplace_count','garagecarcnt',\n", + " 'garage_sqft','latitude','longitude','lot_area_sqft','poolcnt','pool_sqft','roomcnt','unitcnt','patio_sqft','storage_sqft',\n", + " 
'numberofstories','structure_tax','total_parcel_tax','land_tax','total_property_tax_2016','taxdelinquencyyear','transaction_month',\n", + " 'census_tractnumber','block_number','age']" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "fp53dotszhgA" + }, + "outputs": [], + "source": [ + "Y=df_train['logerror']" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "O0Uaei4rzhj6" + }, + "outputs": [], + "source": [ + "#buildingqualitytypeid ->has order\n", + "le = LabelEncoder()\n", + "df_train['buildingqualitytypeid']=le.fit_transform(df_train.buildingqualitytypeid)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "g4-g-uvtzhds" + }, + "outputs": [], + "source": [ + "#df_train.ac_id.value_counts()\n", + "#df_train.propertylandusetypeid.value_counts()\n", + "#'buildingqualitytypeid','ac_id','heating_system_id','propertylandusetypeid'" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "SzliXafdzhRd" + }, + "outputs": [], + "source": [ + "X=pd.concat([dummies,df_train[dummies2],df_train[numcols2],df_train[['buildingqualitytypeid','ac_id','heating_system_id','propertylandusetypeid']]],axis=1)\n", + "X.shape" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "DBsZjyQd0W1N" + }, + "outputs": [], + "source": [ + "X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.10, random_state=3192)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "ihXFZWcn0W5D" + }, + "outputs": [], + "source": [ + "# top features\n", + "import xgboost as xgb\n", + "xgb_params = {\n", + " 'eta': 0.05,\n", + " 'max_depth': 8,\n", + " 
'subsample': 0.7,\n", + " 'colsample_bytree': 0.7,\n", + " 'objective': 'reg:linear',\n", + " 'silent': 1,\n", + " 'seed' : 0\n", + "}\n", + "dtrain = xgb.DMatrix(X_train, Y_train, feature_names=X_train.columns.values)\n", + "model = xgb.train(dict(xgb_params, silent=0), dtrain, num_boost_round=50)\n", + "# plot the important features #\n", + "fig, ax = plt.subplots(figsize=(12,18))\n", + "#max_num_features=50, error for no reason \n", + "xgb.plot_importance(model, height=0.8, ax=ax)\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "TQEEzNkX0W9w" + }, + "outputs": [], + "source": [ + "#top features\n", + "xgboost_selection=['total_finished_living_area_sqft','latitude','structure_tax','total_property_tax_2016',\n", + "'total_parcel_tax','land_tax','longitude','lot_area_sqft','census_tractnumber','age','total_bath','bedroomcnt',\n", + "'block_number','transaction_month','roomcnt','taxdelinquencyyear','unitcnt','taxdelinquencyflag_No',\n", + "'fips_LA','garage_sqft','pool_with_spa_tub_no','has_hottub_or_spa_No','garagecarcnt','deck_flag_No',\n", + "'poolcnt','pool_sqft'\n", + "]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "Rr_6EO4G0XEj" + }, + "outputs": [], + "source": [ + "# feature selection\n", + "#c_id,heating_system_id,propertylandusetypeid\n", + "from sklearn.ensemble import ExtraTreesRegressor\n", + "from sklearn.feature_selection import SelectFromModel\n", + "reg = ExtraTreesRegressor(n_estimators=500, max_depth=8, max_features='sqrt',\n", + " min_samples_split=100 ,min_samples_leaf=10, bootstrap=True,n_jobs=-1, random_state=3192)\n", + "reg = reg.fit(X_train, Y_train)\n", + "#print(\"importance\",reg.feature_importances_) \n", + "model = SelectFromModel(reg, prefit=True)\n", + "X_new = model.transform(X_train)\n", + "print(X_train.shape)\n", + "print(X_new.shape) \n", + "\n", + 
"feat_names = X.columns.values\n", + "importances = reg.feature_importances_\n", + "std = np.std([tree.feature_importances_ for tree in reg.estimators_], axis=0)\n", + "indices = np.argsort(importances)[::-1][:26]\n", + "plt.figure(figsize=(12,12))\n", + "plt.title(\"Feature importances\")\n", + "plt.bar(range(len(indices)), importances[indices], color=\"r\", yerr=std[indices], align=\"center\")\n", + "plt.xticks(range(len(indices)), feat_names[indices], rotation='vertical')\n", + "plt.xlim([-1, len(indices)])\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "i4FCNOG70XIU" + }, + "outputs": [], + "source": [ + "tree_selection=[\n", + " 'total_finished_living_area_sqft','structure_tax','total_property_tax_2016','total_bath','total_parcel_tax',\n", + " 'age','latitude','census_tractnumber','bedroomcnt','longitude','land_tax','propertylandusetypeid','block_number',\n", + " 'buildingqualitytypeid','numberofstories','heating_system_id','unitcnt','transaction_month','lot_area_sqft','roomcnt',\n", + " 'garage_sqft','garagecarcnt','pool_with_spa_tub_no','poolcnt','fips_LA','taxdelinquencyyear','patio_sqft',\n", + " 'taxdelinquencyflag_No','taxdelinquencyflag_Yes'\n", + "]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "TmIS1WAS0XMW" + }, + "outputs": [], + "source": [ + "import matplotlib.pyplot as plt\n", + "from sklearn.model_selection import KFold\n", + "from sklearn.linear_model import Ridge,Lasso\n", + "from sklearn.feature_selection import RFECV\n", + "from sklearn.linear_model import LinearRegression\n", + "from sklearn.metrics import r2_score,mean_absolute_error,make_scorer\n", + "\n", + "#model=Lasso(alpha=0.2, fit_intercept=True, normalize=True, precompute=False, copy_X=True,\n", + " # max_iter=1000, \n", + " # tol=0.0001, warm_start=False, positive=False, random_state=3192, 
selection='cyclic')\n", + "\n", + "#Ridge(random_state=3192,solver='auto',fit_intercept=True,normalize=True,alpha=0.1)\n", + "#LinearRegression(n_jobs=-1,fit_intercept=True, normalize=True, copy_X=True)\n", + "\n", + "\n", + "rfecv = RFECV(estimator=LinearRegression(n_jobs=-1,fit_intercept=True, normalize=True, copy_X=True), step=2, cv=KFold(4),scoring='neg_mean_absolute_error')\n", + "rfecv.fit(X_train, Y_train)\n", + "\n", + "print(\"Optimal number of features : %d\" % rfecv.n_features_)\n", + "\n", + "# Plot number of features VS. cross-validation scores\n", + "plt.figure()\n", + "plt.xlabel(\"Number of features selected\")\n", + "\n", + "plt.ylabel(\"Cross-validation score (negative mean absolute error)\")\n", + "plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)\n", + "plt.show()\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "DIw8O00U0XPR" + }, + "outputs": [], + "source": [ + "rfe_selection = [i for indx,i in enumerate(X.columns) if rfecv.support_[indx]]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "gHA0x5_80XWy" + }, + "outputs": [], + "source": [ + "#Linear regression with rfe_selection selection\n", + "#rfe_selection, tree_selection, xgboost_selection\n", + "from sklearn.linear_model import LinearRegression\n", + "from sklearn.model_selection import train_test_split\n", + "from sklearn.metrics import r2_score,mean_absolute_error,make_scorer,mean_squared_error\n", + "\n", + "# just to check whether normalized /not normalized data gives better results\n", + "parameters = {'fit_intercept':[True], 'normalize':[True,False], 'copy_X':[True]}\n", + "scoring = {'MAE':'neg_mean_absolute_error','MSE': make_scorer(mean_squared_error,greater_is_better=False)}\n", + "\n", + "grid1 = GridSearchCV(LinearRegression(n_jobs=-1),param_grid=parameters, 
scoring=scoring,cv=5,refit='MAE',\n", + " return_train_score=True,\n", + " verbose=0,n_jobs=-1,pre_dispatch='n_jobs')\n", + "\n", + "grid1.fit(X_train[rfe_selection], Y_train)\n", + "#print(\"5. grid best_score_\",abs(grid.best_score_))\n", + "Y_pred = grid1.predict(X_test[rfe_selection])\n", + "print(\"MAE on test data\",mean_absolute_error(Y_test,Y_pred))\n", + "print(\"MSE on test data\",mean_squared_error(Y_test,Y_pred))\n", + "print(\"R Squared data \",r2_score(Y_test,Y_pred))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "ekn4pBs60XcT" + }, + "outputs": [], + "source": [ + "#pca selection\n", + "from sklearn.decomposition import PCA\n", + "from sklearn.preprocessing import scale\n", + "import matplotlib.pyplot as plt\n", + "from sklearn.preprocessing import scale\n", + "%matplotlib inline\n", + "scaled_x = scale(X)\n", + "pca = PCA(n_components=None, copy=True, whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto', random_state=None)\n", + "pca.fit(scaled_x)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "yFuT-wUN0XfV" + }, + "outputs": [], + "source": [ + "# The amount of variance that each PC explains\n", + "var= pca.explained_variance_ratio_\n", + "#Cumulative Variance explains\n", + "var1=np.cumsum(np.round(pca.explained_variance_ratio_, decimals=4)*100)\n", + "print(var1)\n", + "plt.plot(var1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "iPN4OBUe0XlD" + }, + "outputs": [], + "source": [ + "#Looking at above plot I'm taking 28 variables\n", + "\n", + "pca = PCA(n_components=28, copy=True, whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto', random_state=None)\n", + "pca.fit(scaled_x)\n", + "\n", + "pca1=pca.fit_transform(scaled_x)\n", + "\n", + "pca = PCA(n_components=28, copy=True, 
whiten=True, svd_solver='auto', tol=0.0, iterated_power='auto', random_state=None)\n", + "pca.fit(scaled_x)\n", + "pca2=pca.fit_transform(scaled_x)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "EE4ednPC0XjX" + }, + "outputs": [], + "source": [ + "pcaX_train, pcaX_test, pcaY_train, pcaY_test = train_test_split(pca1, Y, test_size=0.10, random_state=3192)\n", + "pca2X_train, pca2X_test, pca2Y_train, pca2Y_test = train_test_split(pca2, Y, test_size=0.10, random_state=3192)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "erYMXvTG0XaK" + }, + "outputs": [], + "source": [ + "from sklearn.ensemble import GradientBoostingRegressor\n", + "from sklearn.metrics import mean_absolute_error,make_scorer\n", + "from sklearn.model_selection import GridSearchCV\n", + "\n", + "# just to check whether normalized /not normalized data gives better results\n", + "\n", + " # 0.005 for 1200 trees.\n", + "param_grid={'n_estimators':[1200],'max_features':[22]}\n", + "\n", + " \n", + "grid13 = GridSearchCV(GradientBoostingRegressor(subsample=0.8,min_samples_leaf=50,min_samples_split=50,max_depth=9,loss='ls',criterion='friedman_mse',learning_rate=0.005,random_state=3192),\n", + " param_grid=param_grid, cv=5,refit='MAE',\n", + " return_train_score=True,\n", + " verbose=2,n_jobs=-1,pre_dispatch='n_jobs')\n", + "\n", + "grid13.fit(pcaX_train, pcaY_train)\n", + "print(\"5. 
grid best_score_\",abs(grid13.best_score_))\n", + "print(\"best params\",grid13.best_params_)\n", + "print(\"best score\",grid13.best_score_)\n", + "Y_pred = grid13.predict(pcaX_test)\n", + "print(\"MAE on test data\",mean_absolute_error(pcaY_test,Y_pred))\n", + "print(\"MSE on test data\",mean_squared_error(pcaY_test,Y_pred))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "BgtbLCcR0XUx" + }, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "FjdSCEFP0XCM" + }, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "WzATgLxmam5w" + }, + "source": [ + "In this competition, Zillow is asking you to predict the log-error between their Zestimate and the actual sale price, given all the features of a home. The log error is defined as\n", + "\n", + "$$\\text{logerror} = \\log(\\text{Zestimate}) - \\log(\\text{SalePrice})$$\n", + "\n", + "and it is recorded in the transactions file train.csv. In this competition, you are going to predict the logerror for the months in Fall 2017. Since all the real estate transactions in the U.S. are publicly available, we will close the competition (no longer accepting submissions) before the evaluation period begins.\n", + "\n", + "### Train/Test split\n", + "- You are provided with a full list of real estate properties in three counties (Los Angeles, Orange and Ventura, California) with data from 2016.\n", + "- The train data has all the transactions before October 15, 2016, plus some of the transactions after October 15, 2016.\n", + "- The test data in the public leaderboard has the rest of the transactions between October 15 and December 31, 2016.\n", + "- The rest of the test data, which is used for calculating the private leaderboard, covers all the properties from October 15, 2017, to December 15, 2017. 
This period is called the \"sales tracking period\", during which we will not be taking any submissions.\n", + "- You are asked to predict 6 time points for all properties: October 2016 (201610), November 2016 (201611), December 2016 (201612), October 2017 (201710), November 2017 (201711), and December 2017 (201712).\n", + "- Not all the properties are sold in each time period. If a property was not sold in a certain time period, that particular row will be ignored when calculating your score.\n", + "- If a property is sold multiple times within 31 days, we take the first reasonable value as the ground truth. By \"reasonable\", we mean if the data seems wrong, we will take the transaction that has a value that makes more sense.\n", + "\n", + "### File descriptions\n", + "- `properties_2016.csv` - all the properties with their home features for 2016. Note: Some new 2017 properties don't have any data yet except for their parcelid. Those data points should be populated when `properties_2017.csv` is available.\n", + "- `properties_2017.csv` - all the properties with their home features for 2017 (released on 10/2/2017)\n", + "- `train_2016_v2.csv` - the training set with transactions from 1/1/2016 to 12/31/2016\n", + "- `train_2017.csv` - the training set with transactions from 1/1/2017 to 9/15/2017 (released on 10/2/2017)\n", + "- `sample_submission.csv` - a sample submission file in the correct format" + ] + } + ], + "metadata": { + "accelerator": "GPU", + "colab": { + "collapsed_sections": [], + "name": "zillow_kaggle_zestimate_comp.ipynb", + "provenance": [], + "version": "0.3.2" + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.3" + } + }, + "nbformat": 4, + "nbformat_minor": 1 +}