- Step 1: Prepare your environment

  Make sure you have Hadoop and Hive installed in your cluster. `gcc` is also needed to build the TPC-DS data generator.
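  As a quick sanity check (a minimal sketch, assuming the standard `hadoop`, `hive`, and `gcc` command-line tools are installed), verify each prerequisite is reachable:

  ```bash
  # Each command prints its version if the tool is correctly installed.
  hadoop version
  hive --version
  gcc --version
  ```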
- Step 2: Build the data generator

  Run `./tpcds-build.sh` to download and build the TPC-DS data generator.
- Step 3: Generate TPC-DS dataset

  Run `./tpcds-setup.sh 10000`. The resulting Hive database is `tpcds_bin_orc_10000`.

  In general, run `./tpcds-setup.sh <SCALE_FACTOR>` to generate the dataset. The "scale factor" represents how much data you will generate, and roughly translates to gigabytes. For example, `./tpcds-setup.sh 10` will generate about 10GB of data. Note that the scale factor must be greater than 1.
  `tpcds-setup.sh` will launch a MapReduce job to generate the data in text format. By default, the generated data will be placed in `/tmp/tpcds-generate/<SCALE_FACTOR>` of your HDFS cluster. If the folder already exists, the MapReduce job will be skipped.
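  To inspect the raw text files (a sketch, assuming the `hdfs` CLI is available and the default output path above):

  ```bash
  # List the generated text data; replace 10000 with your scale factor.
  hdfs dfs -ls /tmp/tpcds-generate/10000
  ```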
  Once data generation is completed, `tpcds-setup.sh` will load the data into Hive tables. Make sure the `hive` executable is in your `PATH`; alternatively, you can specify your Hive executable path via the `HIVE_BIN` environment variable.

  `tpcds-setup.sh` will create external Hive tables based on the generated text files. These tables reside in a database named `tpcds_text_<SCALE_FACTOR>`. Then `tpcds-setup.sh` will convert the text tables into an optimized format and place the converted tables in a database named `tpcds_bin_<FORMAT>_<SCALE_FACTOR>`. By default, the optimized format is `orc`. You can choose a different format by setting the `FORMAT` environment variable. For example, the following creates a 1TB test dataset in Parquet format: `FORMAT=parquet HIVE_BIN=/path/to/hive ./tpcds-setup.sh 1000`
  Once the data is loaded into Hive, you can use the database `tpcds_bin_<FORMAT>_<SCALE_FACTOR>` to run the benchmark.
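  To confirm the load succeeded (a minimal sketch, assuming the default 10TB ORC database from above and `hive` on your `PATH`):

  ```bash
  # List the benchmark tables in the target database.
  hive -e "USE tpcds_bin_orc_10000; SHOW TABLES;"

  # Spot-check row counts; store_sales is the largest TPC-DS fact table.
  hive -e "SELECT COUNT(*) FROM tpcds_bin_orc_10000.store_sales;"
  ```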
- Step 1: Prepare your Flink environment.
  - Prepare `flink-conf.yaml`: Recommended Conf.
  - Set up Hive integration: Hive dependencies.
  - Set up Hadoop integration: Hadoop environment.
  - Set up the Flink cluster: Standalone cluster or Yarn session (see the sketch after this list).
  - Recommended environment for 10TB:
    - 20 machines.
    - Each machine: 64 processors, 256GB memory, 1 SSD disk for spill, and multiple SATA disks for HDFS.
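  A minimal sketch of the Hadoop integration and cluster startup (assuming a standard Flink distribution at `$FLINK_HOME` and `hadoop` on your `PATH`; adapt paths to your installation):

  ```bash
  # Make the Hadoop dependencies visible to Flink (Hadoop integration).
  export HADOOP_CLASSPATH=$(hadoop classpath)

  # Option A: start a standalone cluster.
  $FLINK_HOME/bin/start-cluster.sh

  # Option B: start a detached YARN session instead.
  # $FLINK_HOME/bin/yarn-session.sh -d
  ```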
- Step 2: Build the test jar.
  - Modify the Flink version and Hive version in `pom.xml`.
  - `cd flink-tpcds`, then run `mvn clean install` (see the sketch below).
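  A sketch of the build (assuming Maven and a JDK are installed; `target/` is Maven's default output directory, and the jar name matches the one used in Step 3):

  ```bash
  cd flink-tpcds
  mvn clean install

  # The benchmark jar with dependencies should now be available:
  ls target/flink-tpcds-0.1-SNAPSHOT-jar-with-dependencies.jar
  ```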
- Step 3: Run the benchmark:

  `flink_home/bin/flink run -c org.apache.flink.benchmark.Benchmark ./flink-tpcds-0.1-SNAPSHOT-jar-with-dependencies.jar --database tpcds_bin_orc_10000 --hive_conf hive_home/conf`
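  To benchmark a dataset generated with a different format or scale, point `--database` at the matching `tpcds_bin_<FORMAT>_<SCALE_FACTOR>` database. A sketch for the 1TB Parquet dataset from above (substitute your own Flink and Hive paths):

  ```bash
  $FLINK_HOME/bin/flink run \
    -c org.apache.flink.benchmark.Benchmark \
    ./flink-tpcds-0.1-SNAPSHOT-jar-with-dependencies.jar \
    --database tpcds_bin_parquet_1000 \
    --hive_conf $HIVE_HOME/conf
  ```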
Because the prepared test data is standard Hive data, other computation frameworks that integrate with Hive can also run the benchmark with little effort. Feel free to build your own environment and test it.
If you have any questions, please contact:
- Jingsong Lee ([email protected])
- Rui Li ([email protected])