This repository contains templates to help you get started with MP10.
Last updated in May 2021, by Yifan Chen ([email protected]).
Last updated in May 2021, by Sam Cheng ([email protected]).
- Parts B and D (MLlib exercises) can be solved using either the DataFrame-based API (`pyspark.ml`) or the RDD-based API (`pyspark.mllib`). The corresponding templates have the suffixes `_ml` and `_mllib`. Make sure you rename the Python files corresponding to parts B and D to `part_b.py` and `part_d.py` respectively before submitting them.
- Each file can be executed by running `spark-submit part_xxx.py`
- Alternatively, run `spark-submit part_xxx.py 2> /dev/null` to suppress the Spark logs. This discards everything written to stderr, so error messages are hidden as well.
- Make sure the given dataset is in the directory you run the code from. Keeping the directory structure of this repository is recommended.
- The extra argument for graphframes is not required for parts B and D, but you do not need to remove it for these parts.
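
The renaming step for parts B and D can be scripted. The following is only a minimal sketch: the helper name `prepare_submission` is hypothetical, and it assumes the templates are named `part_<part>_<api>.py` (for example `part_b_ml.py`), which may not match the actual file names in this repository. Adjust the names to what you see on disk.

```python
import shutil
from pathlib import Path


def prepare_submission(part: str, api: str, directory: Path = Path(".")) -> Path:
    """Copy the chosen template variant to the file name expected on submission.

    Assumption: templates are named part_<part>_<api>.py, e.g. part_b_ml.py
    or part_d_mllib.py. This naming is a guess based on the suffixes above.
    """
    if part not in ("b", "d"):
        raise ValueError("only parts 'b' and 'd' have two template variants")
    if api not in ("ml", "mllib"):
        raise ValueError("api must be 'ml' or 'mllib'")

    source = directory / f"part_{part}_{api}.py"
    target = directory / f"part_{part}.py"
    if not source.is_file():
        raise FileNotFoundError(f"template not found: {source}")

    # Copy instead of renaming so the original template stays available.
    shutil.copyfile(source, target)
    return target
```

For example, `prepare_submission("b", "ml")` would produce `part_b.py` in the current directory, which can then be run with `spark-submit part_b.py`.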