Spark/xgboost integration #31

Open
szilard opened this issue May 23, 2019 · 8 comments
szilard commented May 23, 2019

m5.2xlarge (8 cores, 30 GB RAM)

For comparison:

On 8 cores (including HT):

(tool, time [s], AUC)

0.1m:
h2o        11.805   0.7022567
xgboost     3.295   0.7324224
lightgbm    2.287   0.7298355

1m:
h2o        29.214   0.7623596
xgboost    12.52    0.7494959
lightgbm    6.962   0.7636987

10m:
h2o       291.868   0.7763336
xgboost   109.124   0.7551197
lightgbm   57.033   0.7742033

On 4 cores:

(tool, time [s], AUC)

0.1m:
h2o        11.499   0.7022444
xgboost     3.023   0.7324224
lightgbm    2.004   0.7298355

1m:
h2o        33.062   0.7623495
xgboost    15.785   0.7494959
lightgbm    6.662   0.7636987

10m:
h2o       376.488   0.7763268
xgboost   126.148   0.7551197
lightgbm   61.678   0.7742033
szilard commented May 23, 2019

val xgbParam = Map(
      "objective" -> "binary:logistic",
      "num_round" -> 100,
      "max_depth" -> 10,
      "eta" -> 0.1,
      "tree_method" -> "hist",
      "num_workers" -> 8,   // needs to be given manually (default =1)
      "missing" -> 0)
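For context, a minimal sketch of how a parameter map like the above can be passed to the Spark estimator API in xgboost4j-spark 0.90. The DataFrame `d_train` with a "features" vector column and a "label" column is an assumption for illustration, not part of the original snippet:

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

val xgbParam = Map(
  "objective"   -> "binary:logistic",
  "num_round"   -> 100,
  "max_depth"   -> 10,
  "eta"         -> 0.1,
  "tree_method" -> "hist",
  "num_workers" -> 8,   // needs to be given manually (default = 1)
  "missing"     -> 0)

// Assumes d_train is a DataFrame with a "features" vector column and a
// "label" column, e.g. assembled with org.apache.spark.ml.feature.VectorAssembler.
val xgb = new XGBoostClassifier(xgbParam)
  .setFeaturesCol("features")
  .setLabelCol("label")

val model = xgb.fit(d_train)   // distributed training across num_workers tasks
```

This requires a running Spark session with the xgboost4j-spark jars on the classpath (as in the spark-shell invocation below), so it is a sketch rather than a standalone program.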

szilard commented May 23, 2019

1M data:

workers  time [s]
8        42.5
4        37.5
2        41.5
1        51.8

szilard commented May 24, 2019

10M data:

workers  time [s]
8        126
4        178
2        277
1        511

szilard commented May 24, 2019

Cluster (standalone Spark, 1 master + 2 slaves):

1M:

workers  time cluster [s]  time local [s]
16       52.8              -
8        52.0              42.5
4        36.3              37.5
2        31.4              41.5
1        50.4              51.8

d_train.rdd.getNumPartitions: local = 6, cluster = 12 (auto)
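On the partition counts above: as I understand the 0.90 release, xgboost4j-spark repartitions the training data to num_workers partitions when the counts don't match, so pre-aligning them avoids an extra shuffle at fit time. A sketch (this repartitioning behavior is my assumption, worth verifying against the xgboost4j-spark source):

```scala
// Align the DataFrame's partition count with the number of XGBoost
// workers so training does not have to repartition the data itself.
val numWorkers = 8
val d_train_aligned = d_train.repartition(numWorkers)

println(d_train_aligned.rdd.getNumPartitions)   // should now equal numWorkers
```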

szilard commented May 25, 2019

Spark cluster (standalone):

On the master:

Set up SSH keys so the master can log in to the slaves.

Confs:

cat spark-2.4.3-bin-hadoop2.7/conf/spark-env.sh

#!/usr/bin/env bash

export SPARK_MASTER_HOST=172.31.10.112
export JAVA_HOME=/usr/lib/jvm/default-java
cat spark-2.4.3-bin-hadoop2.7/conf/slaves

172.31.9.74
172.31.11.93

Start cluster:

~/spark-2.4.3-bin-hadoop2.7/sbin/start-all.sh

Shell:

~/spark-2.4.3-bin-hadoop2.7/bin/spark-shell --master spark://172.31.10.112:7077 \
  --jars  xgboost/jvm-packages/xgboost4j-spark/target/xgboost4j-spark-0.90.jar,xgboost/jvm-packages/xgboost4j/target/xgboost4j-0.90.jar \
  --executor-memory 20G

Browser:

http://ec2-34-216-44-15.us-west-2.compute.amazonaws.com:8080/
http://ec2-34-216-44-15.us-west-2.compute.amazonaws.com:4040/

szilard commented May 25, 2019

10M

workers  time cluster [s]  time local [s]
16       95                -
8        128               126
4        152               178
2        270               277
1        511               511

d_train.rdd.getNumPartitions: local = 8, cluster = 16 (auto)

szilard commented May 25, 2019

(screenshot: Screen Shot 2019-05-25 at 9:17:25 AM)

(screenshot: Screen Shot 2019-05-25 at 9:18:20 AM)

szilard changed the title from "xgboost-spark" to "Spark/xgboost integration" on May 25, 2019