My solution to the New York City Taxi Fare Prediction competition on Kaggle.
To learn more about the competition, see: https://www.kaggle.com/c/new-york-city-taxi-fare-prediction
The training data is in the 'train.csv' file and the testing data is in the 'test.csv' file. I am also adding a file, 'data_description.txt', which explains the fields available in the other data files.
To download the data please follow the links:
- train.csv (https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/download/train.csv)
- test.csv (https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/download/test.csv)
To run this notebook:
- Clone the repository.
- Install virtualenv.
- Navigate to the directory where you unzipped or cloned the repo and create a virtual environment with `virtualenv env`.
- Activate the environment with `source env/bin/activate`.
- Install the required dependencies with `pip install -r requirements.txt`.
- Execute `ipython notebook` from the command line or terminal.
- When you're done, deactivate the virtual environment with `deactivate`.
I have been using Kaggle Kernels for this project. Here's my notebook: https://www.kaggle.com/rishabh254/nyc-ola/notebook
The repository contains the following notebooks:
- main.ipynb: contains the whole code
- eda.ipynb: contains data exploration
- cleaning.ipynb: contains code to remove outliers
- feature_engineering.ipynb: contains code to extract new features
- model1.ipynb: model for rides with variable fare
- model2.ipynb: model for rides with almost constant fare
The goal for the competition is to predict the fare amount (inclusive of tolls) for a taxi ride in New York City given the pickup and dropoff locations.
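
Since the fare is predicted from pickup and dropoff locations, a distance between the two points is a natural feature to derive. The sketch below computes the great-circle (haversine) distance from latitude/longitude pairs. It is only an illustration and not necessarily what feature_engineering.ipynb does; the function name `haversine_km` and the mean Earth radius of 6371 km are assumptions made for this example.

```python
import math

# Assumed mean Earth radius in kilometres (illustrative constant).
EARTH_RADIUS_KM = 6371.0


def haversine_km(pickup_lat, pickup_lon, dropoff_lat, dropoff_lon):
    """Great-circle distance in km between two points given in degrees.

    Hypothetical helper for illustration; not taken from the notebooks.
    """
    # Convert all coordinates from degrees to radians.
    lat1, lon1, lat2, lon2 = map(
        math.radians, (pickup_lat, pickup_lon, dropoff_lat, dropoff_lon)
    )
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine formula.
    a = (
        math.sin(dlat / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


if __name__ == "__main__":
    # Example coordinates roughly in Manhattan (made up for the demo).
    print(haversine_km(40.7614, -73.9798, 40.6513, -73.8803))
```

The same function could be applied row by row to the pickup and dropoff coordinate columns of train.csv and test.csv to produce a distance feature. Haversine gives a straight-line distance, so it underestimates the distance actually driven on the street grid.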