This is the code repository for Mastering Big Data Analytics with PySpark [Video], published by Packt. It contains all the supporting project files necessary to work through the video course from start to finish. Authored by: Danny Meijer
PySpark helps you perform data analysis at scale; it enables you to build more scalable analyses and pipelines. This course starts by introducing you to PySpark's potential for performing effective analyses of large datasets. You'll learn how to interact with Spark from Python and connect Jupyter to Spark to provide rich data visualizations. After that, you'll delve into Spark's architecture and its various components.
You'll learn to work with Apache Spark and perform ML tasks more smoothly than before. You'll gather and query data using Spark SQL and overcome the challenges involved in reading it. You'll use the DataFrame API to work with Spark MLlib and learn about the Pipeline API. Finally, we provide tips and tricks for deploying your code and tuning performance.
By the end of this course, you will not only be able to perform efficient data analytics but will also have learned to use PySpark to analyze large datasets at scale in your organization.
- Gain a solid knowledge of vital Data Analytics concepts via practical use cases
- Create elegant data visualizations using Jupyter
- Run, process, and analyze large datasets using PySpark
- Utilize Spark SQL to easily load big data into DataFrames
- Create fast and scalable Machine Learning applications using MLlib with Spark
- Perform exploratory Data Analysis in a scalable way
- Achieve scalable, high-throughput and fault-tolerant processing of data streams using Spark Streaming
This course will greatly appeal to data science enthusiasts, data scientists, or anyone who is familiar with Machine Learning concepts and wants to scale out their work to big data.
If you find it difficult to analyze large datasets that keep growing, then this course is the perfect guide for you!
A working knowledge of Python is assumed.
For successful completion of this course, students will need a computer system with at least the following:
- OS: Windows, Mac, or Linux
- Processor: Any processor from the last few years
- Memory: 2GB RAM
- Storage: 300MB for the Integrated Development Environment (IDE) and 1GB for cache
For an optimal experience with hands-on labs and other practical activities, we recommend the following configuration:
- OS: Windows, Mac, or Linux
- Processor: Core i5 or better (or AMD equivalent)
- Memory: 8GB RAM or better
- Storage: 2GB free for build caches and dependencies
- Operating system: Windows, Mac, or Linux
- Docker
setting up your interactive development environment.
Once you have cloned this repository locally, simply navigate to the folder you have
stored the repo in and run:
python download_data.py
This will populate the data-sets folder in your repo with a number of data sets that
will be used throughout the course.
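To confirm that the download worked, a small check like the one below can list what landed in the folder. This is only a sketch and not part of the repository: the helper name list_data_sets is hypothetical, and the only assumption taken from the text above is that the folder is called data-sets and sits in the repository root.

```python
from pathlib import Path


def list_data_sets(repo_root="."):
    """Return the sorted relative paths of all files under <repo_root>/data-sets.

    Hypothetical helper for illustration; it is not part of the course repo.
    """
    data_dir = Path(repo_root) / "data-sets"
    if not data_dir.is_dir():
        # The folder does not exist yet, so download_data.py has likely not run.
        return []
    return sorted(
        p.relative_to(data_dir).as_posix()
        for p in data_dir.rglob("*")
        if p.is_file()
    )


if __name__ == "__main__":
    files = list_data_sets()
    if files:
        print(f"Found {len(files)} file(s) in data-sets:")
        for name in files:
            print(" ", name)
    else:
        print("data-sets is missing or empty; run download_data.py first.")
```

Run it from the repository root after the download step; an empty result suggests the download did not complete.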
The Docker image bundled with this course (see Dockerfile) is based on the
pyspark-notebook image, distributed and maintained by Jupyter.
Original copyright (c) Jupyter Development Team. Distributed under the terms of the Modified BSD License.
This course's Docker image extends the pyspark-notebook image with the following additions:
- enables Jupyter Lab by default
- exposes correct ports for JupyterLab and SparkUI
- sets numerous default settings to improve Quality of Life for the user
- installs numerous add-ons (such as pyspark-stubs and blackcellmagic) using jupyter_contrib_nbextensions
There are 2 ways to access the Docker container in this course:
- Through the bundled run_me.py script (recommended)
- Through the Docker CLI (only for advanced users)
The easiest way to run the container that belongs to this course is by running
python run_me.py from the course's repository. This will automatically
build the Docker image, set up the Docker container, download the data, and set up the
necessary volume mounts.
If you would rather start the Docker container manually, use the following instructions:
- Download the data
  python download_data.py
- Build the image
  docker build --rm -f "Dockerfile" -t mastering_pyspark_ml:latest .
- Run the image
  Replace /path/to/mastering_pyspark_ml/repo/ in the following command with the location of your local clone, and run it in a terminal or command prompt:
  docker run -v /path/to/mastering_pyspark_ml/repo/:/home/jovyan/ --rm -d -p 8888:8888 -p 4040:4040 --name mastering_pyspark_ml mastering_pyspark_ml
- Open Jupyter Lab once the Docker container is running
  Navigate to http://localhost:8888/lab
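The placeholder path in the run step can also be filled in programmatically. The sketch below assembles the docker run arguments from the step above, substituting the absolute path of a given directory. The function names and the use of subprocess are assumptions made for illustration; this is not necessarily how run_me.py is implemented.

```python
import subprocess
from pathlib import Path

# Image and container name used by the manual commands in this README.
IMAGE = "mastering_pyspark_ml"


def docker_run_command(repo_path):
    """Build the `docker run` argument list for the given repo path."""
    host_path = Path(repo_path).resolve()
    return [
        "docker", "run",
        "-v", f"{host_path}:/home/jovyan/",  # mount the repo into the container
        "--rm", "-d",
        "-p", "8888:8888",  # Jupyter Lab
        "-p", "4040:4040",  # Spark UI
        "--name", IMAGE,
        IMAGE,
    ]


def start_container(repo_path="."):
    """Start the container; requires Docker and a previously built image."""
    subprocess.run(docker_run_command(repo_path), check=True)


if __name__ == "__main__":
    # Only print the command; call start_container() to actually run it.
    print(" ".join(docker_run_command(".")))
```

Passing the arguments as a list avoids shell quoting problems when the repository path contains spaces.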
Once you are ready to shut down the Docker container, you can use the following command:
docker stop mastering_pyspark_ml
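To check whether the container is up (or has really stopped), one option is to probe the two published ports. The following is a minimal sketch using only the Python standard library; the helper name port_is_open is hypothetical, and the port numbers are taken from the -p mappings in the manual docker run command.

```python
import socket


def port_is_open(host="localhost", port=8888, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    # 8888 is mapped for Jupyter Lab and 4040 for the Spark UI.
    # The Spark UI normally only responds while a Spark application is running.
    for name, port in (("Jupyter Lab", 8888), ("Spark UI", 4040)):
        state = "reachable" if port_is_open(port=port) else "not reachable"
        print(f"{name} on port {port}: {state}")
```

After docker stop, both ports should report as not reachable, unless another process on the host is using them.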