- Step 1: Prepare your environment

  Make sure you have Hadoop and Hive installed in your cluster. `gcc` is also needed to build the TPC-DS data generator.
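  As a quick sanity check (a minimal sketch, assuming the standard `hadoop`, `hive`, and `gcc` command-line tools are installed), verify each prerequisite is reachable:

  ```bash
  # Each command prints its version if the tool is correctly installed.
  hadoop version
  hive --version
  gcc --version
  ```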
- Step 2: Build the data generator

  Run `./tpcds-build.sh` to download and build the TPC-DS data generator.
- Step 3: Generate TPC-DS dataset

  Run `./tpcds-setup.sh 10000`. The resulting Hive database is `tpcds_bin_orc_10000`.

  In general, run `./tpcds-setup.sh <SCALE_FACTOR>` to generate the dataset. The "scale factor" represents how much data you will generate, and roughly translates to gigabytes. For example, `./tpcds-setup.sh 10` will generate about 10GB of data. Note that the scale factor must be greater than 1.
  `tpcds-setup.sh` will launch a MapReduce job to generate the data in text format. By default, the generated data will be placed in `/tmp/tpcds-generate/<SCALE_FACTOR>` of your HDFS cluster. If the folder already exists, the MapReduce job will be skipped.
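  To inspect the raw text files (a sketch, assuming the `hdfs` CLI is available and the default output path above):

  ```bash
  # List the generated text data; replace 10000 with your scale factor.
  hdfs dfs -ls /tmp/tpcds-generate/10000
  ```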
  Once data generation is completed, `tpcds-setup.sh` will load the data into Hive tables. Make sure the `hive` executable is in your `PATH`; alternatively, you can specify your Hive executable path via the `HIVE_BIN` environment variable.

  `tpcds-setup.sh` will create external Hive tables based on the generated text files. These tables reside in a database named `tpcds_text_<SCALE_FACTOR>`. Then `tpcds-setup.sh` will convert the text tables into an optimized format and place the converted tables in a database named `tpcds_bin_<FORMAT>_<SCALE_FACTOR>`. By default, the optimized format is `orc`. You can choose a different format by setting the `FORMAT` environment variable. For example, the following creates a 1TB test dataset in Parquet format: `FORMAT=parquet HIVE_BIN=/path/to/hive ./tpcds-setup.sh 1000`
  Once the data is loaded into Hive, you can use the database `tpcds_bin_<FORMAT>_<SCALE_FACTOR>` to run the benchmark.
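  To confirm the load succeeded (a minimal sketch, assuming the default 10TB ORC database from above and `hive` on your `PATH`):

  ```bash
  # List the benchmark tables in the target database.
  hive -e "USE tpcds_bin_orc_10000; SHOW TABLES;"

  # Spot-check row counts; store_sales is the largest TPC-DS fact table.
  hive -e "SELECT COUNT(*) FROM tpcds_bin_orc_10000.store_sales;"
  ```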
- Step 1: Prepare your Flink environment.
  - Prepare `flink-conf.yaml`: Recommended Conf.
  - Set up Hive integration: Hive dependencies.
  - Set up Hadoop integration: Hadoop environment.
  - Set up the Flink cluster: Standalone cluster or Yarn session (see the sketch after this list).
  - Recommended environment for 10TB:
    - 20 machines.
    - Each machine: 64 processors, 256GB memory, 1 SSD disk for spill, and multiple SATA disks for HDFS.
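  A minimal sketch of the Hadoop integration and cluster startup (assuming a standard Flink distribution at `$FLINK_HOME` and `hadoop` on your `PATH`; adapt paths to your installation):

  ```bash
  # Make the Hadoop dependencies visible to Flink (Hadoop integration).
  export HADOOP_CLASSPATH=$(hadoop classpath)

  # Option A: start a standalone cluster.
  $FLINK_HOME/bin/start-cluster.sh

  # Option B: start a detached YARN session instead.
  # $FLINK_HOME/bin/yarn-session.sh -d
  ```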
- Step 2: Build the test jar.
  - Modify the Flink version and Hive version in `pom.xml`.
  - `cd flink-tpcds`, then run `mvn clean install` (see the sketch below).
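  A sketch of the build (assuming Maven and a JDK are installed; `target/` is Maven's default output directory, and the jar name matches the one used in Step 3):

  ```bash
  cd flink-tpcds
  mvn clean install

  # The benchmark jar with dependencies should now be available:
  ls target/flink-tpcds-0.1-SNAPSHOT-jar-with-dependencies.jar
  ```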
- Step 3: Run the benchmark:

  `flink_home/bin/flink run -c org.apache.flink.benchmark.Benchmark ./flink-tpcds-0.1-SNAPSHOT-jar-with-dependencies.jar --database tpcds_bin_orc_10000 --hive_conf hive_home/conf`
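  To benchmark a dataset generated with a different format or scale, point `--database` at the matching `tpcds_bin_<FORMAT>_<SCALE_FACTOR>` database. A sketch for the 1TB Parquet dataset from above (substitute your own Flink and Hive paths):

  ```bash
  $FLINK_HOME/bin/flink run \
    -c org.apache.flink.benchmark.Benchmark \
    ./flink-tpcds-0.1-SNAPSHOT-jar-with-dependencies.jar \
    --database tpcds_bin_parquet_1000 \
    --hive_conf $HIVE_HOME/conf
  ```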
Because the prepared test data is standard Hive data, other computation frameworks that integrate with Hive can also run the benchmark with little effort. Feel free to build your own environment and test it.
If you have any questions, please contact:
- Jingsong Lee ([email protected])
- Rui Li ([email protected])