Spark/xgboost integration #31

Open
szilard opened this issue May 23, 2019 · 8 comments
szilard commented May 23, 2019

m5.2xlarge (8 cores, 30 GB RAM)

For comparison:

On 8 cores (including HT):

(tool, time [s], AUC)

0.1m:
h2o        11.805   0.7022567
xgboost     3.295   0.7324224
lightgbm    2.287   0.7298355

1m:
h2o        29.214   0.7623596
xgboost    12.52    0.7494959
lightgbm    6.962   0.7636987

10m:
h2o       291.868   0.7763336
xgboost   109.124   0.7551197
lightgbm   57.033   0.7742033

On 4 cores:

(tool, time [s], AUC)

0.1m:
h2o        11.499   0.7022444
xgboost     3.023   0.7324224
lightgbm    2.004   0.7298355

1m:
h2o        33.062   0.7623495
xgboost    15.785   0.7494959
lightgbm    6.662   0.7636987

10m:
h2o       376.488   0.7763268
xgboost   126.148   0.7551197
lightgbm   61.678   0.7742033
szilard commented May 23, 2019

val xgbParam = Map(
      "objective" -> "binary:logistic",
      "num_round" -> 100,
      "max_depth" -> 10,
      "eta" -> 0.1,
      "tree_method" -> "hist",
      "num_workers" -> 8,   // needs to be given manually (default =1)
      "missing" -> 0)
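For context, a minimal sketch of how a parameter map like the above can be passed to the Spark estimator API in xgboost4j-spark 0.90. The DataFrame `d_train` with a "features" vector column and a "label" column is an assumption for illustration, not part of the original snippet:

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

val xgbParam = Map(
  "objective"   -> "binary:logistic",
  "num_round"   -> 100,
  "max_depth"   -> 10,
  "eta"         -> 0.1,
  "tree_method" -> "hist",
  "num_workers" -> 8,   // needs to be given manually (default = 1)
  "missing"     -> 0)

// Assumes d_train is a DataFrame with a "features" vector column and a
// "label" column, e.g. assembled with org.apache.spark.ml.feature.VectorAssembler.
val xgb = new XGBoostClassifier(xgbParam)
  .setFeaturesCol("features")
  .setLabelCol("label")

val model = xgb.fit(d_train)   // distributed training across num_workers tasks
```

This requires a running Spark session with the xgboost4j-spark jars on the classpath (as in the spark-shell invocation below), so it is a sketch rather than a standalone program.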

szilard commented May 23, 2019

1M data:

workers  time [s]
8        42.5
4        37.5
2        41.5
1        51.8

szilard commented May 24, 2019

10M data:

workers  time [s]
8        126
4        178
2        277
1        511

szilard commented May 24, 2019

Cluster (standalone Spark, 1 master + 2 slaves):

1M:

workers  time cluster [s]  time local [s]
16       52.8              -
8        52.0              42.5
4        36.3              37.5
2        31.4              41.5
1        50.4              51.8

d_train.rdd.getNumPartitions: local = 6, cluster = 12 (auto)
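On the partition counts above: as I understand the 0.90 release, xgboost4j-spark repartitions the training data to num_workers partitions when the counts don't match, so pre-aligning them avoids an extra shuffle at fit time. A sketch (this repartitioning behavior is my assumption, worth verifying against the xgboost4j-spark source):

```scala
// Align the DataFrame's partition count with the number of XGBoost
// workers so training does not have to repartition the data itself.
val numWorkers = 8
val d_train_aligned = d_train.repartition(numWorkers)

println(d_train_aligned.rdd.getNumPartitions)   // should now equal numWorkers
```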

szilard commented May 25, 2019

Spark cluster (standalone):

On the master:

Set up SSH keys so the master can log in to the slaves.

Confs:

cat spark-2.4.3-bin-hadoop2.7/conf/spark-env.sh

#!/usr/bin/env bash

export SPARK_MASTER_HOST=172.31.10.112
export JAVA_HOME=/usr/lib/jvm/default-java
cat spark-2.4.3-bin-hadoop2.7/conf/slaves

172.31.9.74
172.31.11.93

Start cluster:

~/spark-2.4.3-bin-hadoop2.7/sbin/start-all.sh

Shell:

~/spark-2.4.3-bin-hadoop2.7/bin/spark-shell --master spark://172.31.10.112:7077 \
  --jars  xgboost/jvm-packages/xgboost4j-spark/target/xgboost4j-spark-0.90.jar,xgboost/jvm-packages/xgboost4j/target/xgboost4j-0.90.jar \
  --executor-memory 20G

Browser:

http://ec2-34-216-44-15.us-west-2.compute.amazonaws.com:8080/
http://ec2-34-216-44-15.us-west-2.compute.amazonaws.com:4040/

szilard commented May 25, 2019

10M

workers  time cluster [s]  time local [s]
16       95                -
8        128               126
4        152               178
2        270               277
1        511               511

d_train.rdd.getNumPartitions: local = 8, cluster = 16 (auto)

szilard commented May 25, 2019

(screenshot: Screen Shot 2019-05-25 at 9:17:25 AM)

(screenshot: Screen Shot 2019-05-25 at 9:18:20 AM)

szilard changed the title from "xgboost-spark" to "Spark/xgboost integration" on May 25, 2019