Repository of Python scripts for checking the reproducibility of Hudi issues.
PyCharm is used to run the scripts. Spark is run locally with one thread.
- Download the corresponding Apache Spark version with Hadoop and extract it to some folder, `SPARK_HOME`.
- Prepare a Python virtual environment:

  ```
  python3 -m venv [env_name]
  source [env_name]/bin/activate
  pip install -r requirements.txt
  ```

- Set `SPARK_HOME`, `SPARK_HOST_URL` and `SPARK_WAREHOUSE_PATH` in the `.env` file (a sketch of how a script can consume them follows this list). Examples are presented in `.env-example`.
- For PyCharm, add `pyspark.zip` and `py4j-*-src.zip` from `{SPARK_HOME}/python/lib/` into `Project` >> `Project Structure` >> `Content root`.
- Build the intended version of the Hudi project and copy `hudi-spark*.*-bundle_2.1*-*.jar` to `{SPARK_HOME}/jars/`.
- Start the Spark cluster locally: `{SPARK_HOME}/sbin/start-all.sh`.
Run the script that corresponds to an issue from the configured PyCharm.
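
For orientation, below is a minimal sketch of how a reproduction script might build its `SparkSession` from the `.env` values. The app name is arbitrary, and the use of `python-dotenv` is an assumption; check `requirements.txt` for what the scripts actually rely on.

```python
import os

from dotenv import load_dotenv  # assumption: python-dotenv is in requirements.txt
from pyspark.sql import SparkSession

# Read SPARK_HOST_URL and SPARK_WAREHOUSE_PATH from the .env file.
load_dotenv()

spark = (
    SparkSession.builder
    .appName("hudi-issue-repro")  # hypothetical name
    .master(os.environ["SPARK_HOST_URL"])
    .config("spark.sql.warehouse.dir", os.environ["SPARK_WAREHOUSE_PATH"])
    # Hudi requires the Kryo serializer and its Spark SQL extension.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.extensions",
            "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
    .getOrCreate()
)
```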
To switch versions:

- Change `SPARK_HOME` in the `.env` file.
- For PyCharm, add the related `pyspark.zip` and `py4j-*-src.zip` to `Content root`.
- Build the new version of the Hudi project and copy the new `hudi-spark*.*-bundle_2.1*-*.jar` to the corresponding Spark home directory.
- Stop the previously started Spark cluster, and start the proper version of Spark. A quick smoke test for the new setup is sketched below.
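
After switching, a tiny write/read round trip is one way to confirm that the new Spark distribution and Hudi bundle work together. This is a hedged sketch, not one of the repository's issue scripts; the table name and `/tmp` path are arbitrary.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-smoke-test")
    .master("local[1]")  # one local thread, as in this repository's setup
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.extensions",
            "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
    .getOrCreate()
)
print("Spark version:", spark.version)

# Write a two-row table through the Hudi bundle on the classpath, then read it back.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "name"])
(
    df.write.format("hudi")
    .option("hoodie.table.name", "smoke_test")
    .option("hoodie.datasource.write.recordkey.field", "id")
    .option("hoodie.datasource.write.precombine.field", "name")
    .mode("overwrite")
    .save("/tmp/hudi_smoke_test")  # hypothetical scratch path
)
spark.read.format("hudi").load("/tmp/hudi_smoke_test").show()
```

If the round trip fails with a class-not-found error, the bundle jar most likely was not copied into the `jars/` directory of the currently active `SPARK_HOME`.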