-
Notifications
You must be signed in to change notification settings - Fork 767
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Apply initial centroids on Spark Kmeans workload. #187
Open
pfxuan
wants to merge
1
commit into
Intel-bigdata:master
Choose a base branch
from
pfxuan:spark-kmeans
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use KMeans.RANDOM? What is the difference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
KMeans.RANDOM
is from Spark MLLib andKMeans.setInitialModel
shares the initial centroids generated by HiBench GenKMeansDataset. To get a meaningful comparison, we have to pick one of approaches and apply it on both of MapReduce and Spark benchmarks. In this PR, we select the latter approach, HiBench GenKMeansDataset, to generate the random centroids based-on normal (Gaussian) distribution.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ye, the random centroids would be generated by takeSample without a normal distribution in Spark Kmeans even if KMeans.RANDOM is used. If getting a meaningful comparison between MapReduce and Spark benchmarks is the core value of HiBench, I agree with your approach, it looks like not enough for the performance of Spark K-Means algorithm implementation by itself though. @carsonwang
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if k != number of centroids in initiModel, Spark KMeans will throw exception:
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: mismatched cluster count
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hqzizania, this should be an expected behavior. If Spark KMeans uses HiBench generated initial model, the parameter k must has the matched value as defined in the initiModel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pfxuan , please see
10-data-scale-profile.conf
. Thenum_of_clusters
used to generate the input data and thek
might be different. This is one of the concerns for using the input centroids as the initial model. The MR version is currently using the input centroids and doesn't pass thek
parameter. We'd prefer to using thek
parameter as this is the expected number of clusters.As @hqzizania mentioned, the problem in Java Kmeans is the RRD is not cached. Can you please have a check with the scala version which supports both
KMeas ||
andRandom
. It seems there is no huge computation cost when initializing the mode as the RDD is cached? If this is the case, we can fix the Java version and also pass-k
to the MR version. This should make all of them comparable.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did the performance test on scala version Kmeans. The size of input data is about 100 GB across 4 nodes. The running time on
Random
andParallel
is almost same, which took about 4 mins for running 3 iterations including 1.3 mins on centroid initialization. So there is about 48.1% overhead when using either the Spark version ofRandom
orParallel
. As a comparison, the implementation of this PR only took about 2.4 mins for 3 iterations with almost zero-overhead on initialization.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition, Mahout version of random initialization is a sequential rather than MapReduce-based implementation. I passed
-k 20
parameter to the MR benchmark, and it took 18.8 mins to generate 20 initial centroids using only one CPU core. To make a reasonable comparison, I think it would be better to keep the original HiBench generator for all Kmeans benchmarks.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @pfxuan , I just got chance to run KMean with
Random
initialization by passing--initMode Random
inrun.sh
. In my test, therandom
initialization ran much faster and I saw much less stages. TheParallel
initialization is slow as there are many stages likecollect
andaggregate
which run several iterations. But these stages are not expected when usingRandom
initialization. Can you take a look at the Spark UI to see if there is any difference?