A Python environment for developing PySpark-based notebooks on Windows. To set it up:
- Python 3 installed and added to the PATH environment variable
- Download and install a Java JDK (if not already installed). Ensure the JAVA_HOME variable is set OR copy the path to the 'jdk-****' folder into 'conf_java_jdk_path.txt' (see the sketch after this list)
- Run '# Step 2 - Create Python Venv.bat' in the 'Setup' folder to set up the Python environment
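The JDK fallback described above (use JAVA_HOME if it is set, otherwise read the path from 'conf_java_jdk_path.txt') can be illustrated with a minimal sketch. This is not part of the repository's scripts; the function name and error message are illustrative assumptions, only the variable and file names come from the steps above.

```python
# Illustrative sketch (not the repo's code): resolve the JDK folder by preferring
# JAVA_HOME and falling back to the path stored in 'conf_java_jdk_path.txt'.
import os
from pathlib import Path

def resolve_jdk_path(conf_file: str = "conf_java_jdk_path.txt") -> str:
    java_home = os.environ.get("JAVA_HOME")
    if java_home:
        return java_home                      # JAVA_HOME wins if it is set
    conf = Path(conf_file)
    if conf.is_file():
        return conf.read_text(encoding="utf-8").strip()  # path copied into the conf file
    raise RuntimeError(f"JAVA_HOME is not set and {conf_file} was not found")

if __name__ == "__main__":
    print("Using JDK at:", resolve_jdk_path())
```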
You can switch to other versions of Spark and/or the Hadoop winutils.exe by following the links below (also listed in the 'Setup' folder):
- Download a Spark release (pre-built for a Hadoop version) from https://spark.apache.org/downloads.html
- Extract the archive to a descriptively named folder
- Update 'conf_spark_to_use.txt' to point to the new Spark folder
- Download the matching Hadoop winutils.exe from https://github.com/kontext-tech/winutils
- Place winutils.exe in '<new_hadoop_folder>\bin'
- Update 'conf_hadoop_winutils_to_use.txt' to point to the new folder (see the sketch below)
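The batch scripts presumably consume these two conf files to locate the chosen Spark and winutils folders. The snippet below is a hedged sketch, not the repository's actual logic: it assumes each conf file holds a plain folder path and shows how such paths are typically exposed to PySpark on Windows via SPARK_HOME, HADOOP_HOME, and PATH.

```python
# Illustrative sketch (assumptions, not the repo's scripts): wire the configured
# folders into the environment variables PySpark expects on Windows.
import os
from pathlib import Path

spark_home = Path("conf_spark_to_use.txt").read_text(encoding="utf-8").strip()
hadoop_home = Path("conf_hadoop_winutils_to_use.txt").read_text(encoding="utf-8").strip()

os.environ["SPARK_HOME"] = spark_home      # extracted Spark folder
os.environ["HADOOP_HOME"] = hadoop_home    # folder containing bin\winutils.exe
os.environ["PATH"] = (
    str(Path(spark_home, "bin")) + os.pathsep
    + str(Path(hadoop_home, "bin")) + os.pathsep
    + os.environ["PATH"]
)

print("SPARK_HOME  =", os.environ["SPARK_HOME"])
print("HADOOP_HOME =", os.environ["HADOOP_HOME"])
```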
To run the environment there are two options:
- Start a Spark shell by running '# Run Spark Shel.bat', OR
- Start a JupyterLab instance with PySpark available by running '# Run Pyspark JupyterLab.bat'
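Once either option is running, a quick smoke test confirms the environment works. The snippet below uses only standard PySpark APIs; the app name and sample data are arbitrary, and it can be pasted into a new notebook or the shell.

```python
# Smoke test: create a local SparkSession, build a tiny DataFrame, and show it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("EnvSmokeTest").getOrCreate()
df = spark.createDataFrame([(1, "spark"), (2, "works")], ["id", "word"])
df.show()
print("Spark version:", spark.version)
spark.stop()
```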