Import public NYC taxi data with Hadoop, analyze with PySpark

am2786/NYC-taxi-data-analysis

NYC-taxi-data-analysis

Scripts in support of this post: "Where ya headed"? Analyzing Over 450 Million Taxi Trips Using Hadoop and PySpark.

This repo provides code to download, process, and analyze NYC taxi data. The data is stored in the Hadoop Distributed File System (HDFS) on Amazon EC2 instances. A great guide on how to set this up yourself can be found here. I plan to make a post soon, in video format, on how to run PySpark in tandem with HDFS. All of the code is written in PySpark, so if you're unfamiliar with the package, a tutorial is here.

Coordinate-neighborhood mapping

Code and GeoJSON file needed to map taxi trip coordinates to neighborhoods.
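
As a rough illustration of what mapping a coordinate to a neighborhood involves, here is a minimal plain-Python sketch using a ray-casting point-in-polygon test over GeoJSON features. It is not the code in this folder: the `TOY_NEIGHBORHOODS` data, the `neighborhood` property key, and all function names are assumptions made up for the example, and the real GeoJSON file may use different property names.

```python
# Hypothetical toy data: two unit squares standing in for neighborhood polygons.
# GeoJSON coordinates are ordered [longitude, latitude].
TOY_NEIGHBORHOODS = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "properties": {"neighborhood": "A"},
         "geometry": {"type": "Polygon",
                      "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]]}},
        {"type": "Feature",
         "properties": {"neighborhood": "B"},
         "geometry": {"type": "Polygon",
                      "coordinates": [[[1, 0], [2, 0], [2, 1], [1, 1], [1, 0]]]}},
    ],
}


def point_in_ring(lon, lat, ring):
    """Ray casting: count how many ring edges a ray going east from the point crosses."""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i][0], ring[i][1]
        xj, yj = ring[j][0], ring[j][1]
        # Only edges that straddle the point's latitude can be crossed.
        if (yi > lat) != (yj > lat):
            x_cross = xi + (lat - yi) * (xj - xi) / (yj - yi)
            if lon < x_cross:
                inside = not inside
        j = i
    return inside


def point_in_polygon(lon, lat, polygon_coords):
    """A GeoJSON Polygon is an outer ring followed by zero or more holes."""
    outer, holes = polygon_coords[0], polygon_coords[1:]
    if not point_in_ring(lon, lat, outer):
        return False
    return not any(point_in_ring(lon, lat, hole) for hole in holes)


def find_neighborhood(lon, lat, feature_collection, name_key="neighborhood"):
    """Return the name of the first feature containing the point, or None."""
    for feature in feature_collection["features"]:
        geometry = feature["geometry"]
        if geometry["type"] == "Polygon":
            polygons = [geometry["coordinates"]]
        elif geometry["type"] == "MultiPolygon":
            polygons = geometry["coordinates"]
        else:
            continue  # skip points, lines, etc.
        if any(point_in_polygon(lon, lat, p) for p in polygons):
            return feature["properties"].get(name_key)
    return None


print(find_neighborhood(0.5, 0.5, TOY_NEIGHBORHOODS))
print(find_neighborhood(1.5, 0.5, TOY_NEIGHBORHOODS))
```

A lookup like this could in principle be wrapped in a PySpark UDF and applied to pickup and dropoff columns, though points that fall exactly on a shared boundary are assigned somewhat arbitrarily by ray casting.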

Taxi data

Folder containing links to all of the taxi data sets used for this project.

'Section' Jupyter notebooks

Each Jupyter notebook with 'Section' in the title contains the code used for that specific section of the post "Where ya headed"? Analyzing Over 450 Million Taxi Trips Using Hadoop and PySpark.

  • Sections 4 & 5.ipynb: Taxi ridership trends. How much will taxi ridership change in the future?
  • Section 6.ipynb: Which neighborhoods give taxis the most business?
  • Section 7.ipynb: How do rides change on weekdays vs. weekends?
  • Section 8.ipynb: What factors determine how much a customer is going to tip?
  • Section 9.ipynb: Do customers tip more on holidays?
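
To give a flavor of the weekday vs. weekend question in Section 7, here is a small plain-Python sketch that labels pickup timestamps and counts rides per label. It is only an illustration, not the PySpark code in the notebook: the function names, the timestamp format, and the sample timestamps are all made up for the example.

```python
from collections import Counter
from datetime import datetime

# Assumed timestamp format; the real data sets may format pickup times differently.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def day_type(pickup_datetime):
    """Label a pickup timestamp string as 'weekday' or 'weekend'."""
    parsed = datetime.strptime(pickup_datetime, TIMESTAMP_FORMAT)
    # datetime.weekday() returns 0 for Monday through 6 for Sunday.
    return "weekend" if parsed.weekday() >= 5 else "weekday"


def count_by_day_type(pickups):
    """Count rides per day type for an iterable of timestamp strings."""
    return Counter(day_type(p) for p in pickups)


# Made-up sample timestamps (a Saturday, a Sunday, and a Monday in January 2015).
sample_pickups = [
    "2015-01-03 10:15:00",
    "2015-01-04 23:40:00",
    "2015-01-05 08:05:00",
]
counts = count_by_day_type(sample_pickups)
print(counts["weekday"], counts["weekend"])
```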

Questions/issues/contact

[email protected]
