This project provides a Docker-based setup to explore advanced PySpark DataFrame concepts using Jupyter notebooks. The environment includes all necessary dependencies, making it easy to get started with PySpark for data processing and analysis.
- Jupyter Notebooks: Interactive notebooks to experiment with PySpark code.
- PySpark: A Python library for Spark, used for large-scale data processing.
- Docker: Containerization tool to ensure a consistent development environment.
Follow these steps to set up and run the project on your local machine.
Ensure you have the following installed on your machine:
Clone this GitHub repository to your local machine:
git clone https://github.com/your_username/PySpark_Advanced_DataFrame_Concepts.git
cd PySpark_Advanced_DataFrame_Concepts
### 3. Download the NYC Taxi Trip Data
This project uses the NYC Taxi Trip Data in Parquet format as an example dataset. You can download the dataset from the [NYC Taxi & Limousine Commission website](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page).
#### Download the Parquet file(s):
- Choose the relevant data files in Parquet format.
#### Place the files in the datasets folder:
- After downloading, place the Parquet files in the following directory:
```bash
notebooks/datasets
This project uses the official jupyter/pyspark-notebook
Docker image. The docker-compose.yml
file is included for easy setup.
-
Build and run the container using Docker Compose:
docker-compose up
-
Access Jupyter Lab:
- Once the container is running, open your web browser and navigate to http://localhost:8888. Use the token provided in the terminal to log in.
To stop the running Docker containers, use the following command:
docker-compose down
Once the container is running, open Jupyter Lab at http://localhost:8888 in your web browser. You'll find the notebooks inside the /home/jovyan/work/notebooks
directory.
- /notebooks: Contains Jupyter notebooks for various PySpark DataFrame operations.
- /notebooks/datasets: Directory for datasets used in the notebooks.
- Dockerfile: Defines the Docker image for this project (optional if using
jupyter/pyspark-notebook
directly). - docker-compose.yml: Docker Compose file to manage the container setup.