PySpark Exercises

A comprehensive collection of exercises and mini-projects using PySpark (the Python API for Apache Spark). These materials were developed as part of Udacity's Learn Spark course and provide hands-on experience with Apache Spark's core features and advanced capabilities.

🛠 Tech Stack

  • Python
  • PySpark
  • NumPy
  • pandas
  • Matplotlib
  • Jupyter Notebook
  • AWS
  • GitHub

📂 Repository Structure

.
├── data_wrangling_with_spark/          # Data processing fundamentals
│   ├── notebooks covering procedural vs functional programming
│   ├── Spark operations and lazy evaluation
│   ├── DataFrame operations and SQL
│   └── practice datasets
├── debugging_and_optimization/          # Performance tuning
│   └── exercises/
│       ├── data skewness handling
│       ├── broadcast joins
│       └── repartitioning strategies
├── machine_learning_with_spark/         # ML implementations
│   ├── feature engineering
│   ├── linear regression
│   ├── k-means clustering
│   └── model tuning
└── setting_up_spark_clusters_with_aws/  # AWS deployment
    ├── demo_code/
    └── exercises/
        ├── EMR cluster creation
        ├── script submission
        └── S3 integration

📚 Course Content

1. The Power of Spark

  • Introduction to the Big Data ecosystem
  • MapReduce implementation
  • Fundamental Spark concepts
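
As a minimal sketch of the MapReduce pattern expressed through Spark's RDD API (the input lines and app name below are illustrative, not taken from the course materials):

```python
from pyspark.sql import SparkSession

# Local session for experimentation (app name is arbitrary).
spark = SparkSession.builder.master("local[*]").appName("wordcount_sketch").getOrCreate()

# Classic MapReduce word count: map lines to (word, 1) pairs,
# then reduce by key to sum the counts per word.
lines = spark.sparkContext.parallelize([
    "spark makes big data simple",
    "big data with spark",
])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print(counts.collect())  # e.g. [('spark', 2), ('big', 2), ('data', 2), ...]
spark.stop()
```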

2. Data Wrangling with Spark

  • Functional programming principles
  • DataFrame operations and transformations
  • Spark SQL integration
  • Data input/output operations
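
To make these ideas concrete, here is a small sketch (the log schema and values are invented for illustration) showing lazy transformations, a DataFrame aggregation, and the same query through Spark SQL:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count

spark = SparkSession.builder.master("local[*]").appName("wrangling_sketch").getOrCreate()

# Hypothetical user-event log; the columns mirror the kind of data
# used in the data_wrangling_with_spark notebooks.
df = spark.createDataFrame(
    [("alice", "Home"), ("bob", "NextSong"), ("alice", "NextSong")],
    ["userId", "page"],
)

# Transformations are lazy; Spark runs nothing until an action like .show().
plays = (df.filter(col("page") == "NextSong")
           .groupBy("userId")
           .agg(count("*").alias("plays")))
plays.show()

# The same query via Spark SQL, after registering a temporary view.
df.createOrReplaceTempView("log")
spark.sql(
    "SELECT userId, COUNT(*) AS plays FROM log WHERE page = 'NextSong' GROUP BY userId"
).show()
spark.stop()
```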

3. Setting up Spark Clusters with AWS

  • EMR cluster deployment
  • AWS CLI integration
  • S3 data storage
  • Spark job submission
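
As a sketch of what a submitted job might look like, the script below reads from and writes to S3; the bucket, paths, and column name are placeholders, and on EMR a script like this would be launched with spark-submit:

```python
from pyspark.sql import SparkSession

# Minimal script of the kind submitted to an EMR cluster.
# The bucket and key names below are placeholders, not real resources.
spark = SparkSession.builder.appName("s3_sketch").getOrCreate()

df = spark.read.csv("s3://your-bucket/input/cities.csv", header=True, inferSchema=True)

# Aggregate and write the result back to S3.
(df.groupBy("state")
   .count()
   .write.mode("overwrite")
   .csv("s3://your-bucket/output/city_counts"))

spark.stop()
```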

4. Debugging and Optimization

  • Data skewness handling
  • Broadcast join optimization
  • Partition management
  • Performance tuning strategies
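
The sketch below illustrates two of these techniques with invented table names: broadcasting a small dimension table to avoid shuffling the large side of a join, and repartitioning by a well-distributed key to relieve skew:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, col

spark = SparkSession.builder.master("local[*]").appName("optimization_sketch").getOrCreate()

# Hypothetical data: a large fact table and a small dimension table.
events = spark.range(1_000_000).withColumn("user_id", col("id") % 1000)
tiers = spark.createDataFrame([(0, "free"), (1, "paid")], ["user_id", "tier"])

# Broadcasting the small table ships a copy to every executor,
# avoiding a shuffle of the large table.
joined = events.join(broadcast(tiers), on="user_id", how="left")

# Repartitioning on a well-distributed key spreads work evenly.
balanced = joined.repartition(200, "user_id")
print(balanced.rdd.getNumPartitions())  # 200
```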

5. Machine Learning with Spark

  • Feature engineering (numeric and text)
  • Linear regression implementation
  • K-means clustering
  • Model tuning and optimization
  • ML pipeline construction
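
A minimal end-to-end sketch, assuming toy numeric data in place of the engineered features from the notebooks, showing how feature assembly, scaling, and linear regression compose into a single pipeline:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.master("local[*]").appName("ml_sketch").getOrCreate()

# Toy rows standing in for real engineered features.
df = spark.createDataFrame(
    [(1.0, 2.0, 5.0), (2.0, 1.0, 7.0), (3.0, 4.0, 12.0), (4.0, 3.0, 14.0)],
    ["f1", "f2", "label"],
)

# Assemble raw columns into a vector, scale it, then fit a regressor.
pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["f1", "f2"], outputCol="raw_features"),
    StandardScaler(inputCol="raw_features", outputCol="features"),
    LinearRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(df)
model.transform(df).select("label", "prediction").show()
spark.stop()
```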

🚀 Getting Started

  1. Environment Setup

    • Follow PySpark's official installation guide
    • Set up a Python environment with the required dependencies (see the smoke test after this list)
    • Configure AWS credentials (for cluster-related exercises)
  2. Running the Exercises

    • Each directory contains Jupyter notebooks and Python scripts
    • Start with the numbered notebooks in each section
    • Solutions are provided for self-assessment
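
A quick smoke test for a local setup (the app name is arbitrary); if this prints a version string, the environment is ready for the notebooks:

```python
# Verify that PySpark is installed and a local session starts cleanly.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("smoke_test").getOrCreate()
print(spark.version)  # e.g. "3.5.1", depending on the installed release
spark.stop()
```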

📝 Notes

  • Exercise solutions are available in corresponding *_solution notebooks
  • AWS-related exercises require active AWS credentials
  • Sample datasets are included in respective directories

🤝 Contributing

Feel free to submit issues and enhancement requests!
