A comprehensive collection of exercises and mini-projects using PySpark (the Python API for Apache Spark). These materials were developed as part of Udacity's Learn Spark course and provide hands-on experience with Apache Spark's core features and advanced capabilities.
- Python
- PySpark
- NumPy
- pandas
- Matplotlib
- Jupyter Notebook
- AWS
- GitHub
```
.
├── data_wrangling_with_spark/           # Data processing fundamentals
│   ├── notebooks covering procedural vs. functional programming
│   ├── Spark operations and lazy evaluation
│   ├── DataFrame operations and SQL
│   └── practice datasets
├── debugging_and_optimization/          # Performance tuning
│   └── exercises/
│       ├── data skewness handling
│       ├── broadcast joins
│       └── repartitioning strategies
├── machine_learning_with_spark/         # ML implementations
│   ├── feature engineering
│   ├── linear regression
│   ├── k-means clustering
│   └── model tuning
└── setting_up_spark_clusters_with_aws/  # AWS deployment
    ├── demo_code/
    └── exercises/
        ├── EMR cluster creation
        ├── script submission
        └── S3 integration
```
- Introduction to Big Data ecosystem
- MapReduce implementation
- Fundamental Spark concepts
- Functional programming principles
- DataFrame operations and transformations
- Spark SQL integration
- Data input/output operations
- EMR cluster deployment
- AWS CLI integration
- S3 data storage
- Spark job submission
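The EMR/S3/submission workflow roughly follows the commands below. This is a hedged sketch only: the cluster name, release label, key pair, bucket, and script name are placeholders, not the course's actual values, and the commands require configured AWS credentials.

```bash
# Create a small EMR cluster with Spark installed (placeholder values)
aws emr create-cluster \
    --name spark-exercise-cluster \
    --release-label emr-6.2.0 \
    --applications Name=Spark \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --ec2-attributes KeyName=my-key-pair

# Stage a script in S3, then submit it from the master node
aws s3 cp my_script.py s3://my-bucket/scripts/
spark-submit --master yarn s3://my-bucket/scripts/my_script.py
```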
- Data skewness handling
- Broadcast join optimization
- Partition management
- Performance tuning strategies
- Feature engineering (numeric and text)
- Linear regression implementation
- K-means clustering
- Model tuning and optimization
- ML pipeline construction
Environment Setup
- Follow PySpark's official installation guide
- Set up Python environment with required dependencies
- Configure AWS credentials (for cluster-related exercises)
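One possible local setup looks like the following; package set and commands are a suggestion, not a pinned requirements file from the course.

```bash
pip install pyspark numpy pandas matplotlib jupyter
aws configure        # only needed for the EMR/S3 exercises
jupyter notebook     # then open the notebooks in each directory
```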
Running the Exercises
- Each directory contains Jupyter notebooks and Python scripts
- Start with the numbered notebooks in each section
- Solutions for self-assessment are available in the corresponding `*_solution` notebooks
- AWS-related exercises require active AWS credentials
- Sample datasets are included in respective directories
Feel free to submit issues and enhancement requests!