Scripts in support of this post: "Where ya headed"? Analyzing Over 450 Million Taxi Trips Using Hadoop and PySpark.
This repo provides code to download, process, and analyze NYC's taxi data. The data is stored in the Hadoop Distributed File System (HDFS) on Amazon EC2 instances. A great guide on how to set this up yourself can be found here. I plan to make a post soon, in video format, on how to run PySpark in tandem with HDFS. All of the code is written in PySpark, so if you're unfamiliar with the package, a tutorial is here.
Code and GeoJSON file needed to map taxi trip coordinates to neighborhoods.
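For readers curious how a coordinate-to-neighborhood lookup can work, here is a minimal sketch in plain Python. It is not the code from this repo: it assumes the GeoJSON file is a standard FeatureCollection of Polygon or MultiPolygon features, and the property key `neighborhood`, the function names, and the toy square below are all hypothetical stand-ins. It uses a simple ray-casting test, which is fine for illustration but would likely need a spatial index to keep up with hundreds of millions of trips.

```python
def point_in_ring(lon, lat, ring):
    """Ray-casting test: is (lon, lat) inside a closed ring of [lon, lat] pairs?"""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i][0], ring[i][1]
        xj, yj = ring[j][0], ring[j][1]
        # Only edges that straddle the point's latitude can be crossed
        # by a horizontal ray cast from the point.
        if (yi > lat) != (yj > lat):
            x_cross = (xj - xi) * (lat - yi) / (yj - yi) + xi
            if lon < x_cross:
                inside = not inside
        j = i
    return inside


def point_in_polygon(lon, lat, polygon):
    """polygon follows GeoJSON Polygon coordinates: outer ring first, then holes."""
    if not polygon or not point_in_ring(lon, lat, polygon[0]):
        return False
    return not any(point_in_ring(lon, lat, hole) for hole in polygon[1:])


def find_neighborhood(lon, lat, features, name_key="neighborhood"):
    """Return the name of the first feature containing the point, else None.

    name_key is an assumed property name; check the actual GeoJSON file.
    """
    for feature in features:
        geom = feature["geometry"]
        if geom["type"] == "Polygon":
            polygons = [geom["coordinates"]]
        elif geom["type"] == "MultiPolygon":
            polygons = geom["coordinates"]
        else:
            continue  # skip points, lines, and other geometry types
        if any(point_in_polygon(lon, lat, p) for p in polygons):
            return feature["properties"].get(name_key)
    return None


# Toy data (hypothetical): one square "neighborhood" in lon/lat order.
toy_features = [{
    "type": "Feature",
    "properties": {"neighborhood": "Toy Square"},
    "geometry": {
        "type": "Polygon",
        "coordinates": [[
            [-74.0, 40.7], [-73.9, 40.7], [-73.9, 40.8],
            [-74.0, 40.8], [-74.0, 40.7],
        ]],
    },
}]

inside_name = find_neighborhood(-73.95, 40.75, toy_features)
outside_name = find_neighborhood(-73.5, 40.75, toy_features)
```

With a real file, `features` would typically come from the `"features"` list of the parsed GeoJSON. Note that GeoJSON stores coordinates as longitude first, then latitude, which is the opposite of how pickup and dropoff columns are often read aloud.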
Folder containing links to all of the taxi data sets used for this project.
Each Jupyter notebook with 'Section' in its title contains the code used for the corresponding section of the post "Where ya headed"? Analyzing Over 450 Million Taxi Trips Using Hadoop and PySpark.
- Sections 4 & 5.ipynb: Taxi ridership trends. How much will taxi ridership change in the future?
- Section 6.ipynb: Which neighborhoods give taxis the most business?
- Section 7.ipynb: How do rides change on weekdays vs. weekends?
- Section 8.ipynb: What factors determine how much a customer is going to tip?
- Section 9.ipynb: Do customers tip more on holidays?