This is an implementation of the DBSCAN clustering algorithm on top of Apache Spark. It is loosely based on the paper by He et al., "MR-DBSCAN: A Scalable MapReduce-Based DBSCAN Algorithm for Heavily Skewed Data".
I have also created a visual guide that explains how the algorithm works.
DBSCAN on Spark is built against Scala 2.11.
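To pull it into an sbt build, something along these lines should work. Note that the group id, artifact id, and version below are assumptions for illustration; check the project's published artifacts for the exact coordinates.

// build.sbt -- coordinates are illustrative, verify against the published artifacts
libraryDependencies += "com.irvingc.spark" %% "dbscan" % "0.1.0"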
I have created a sample project showing how DBSCAN on Spark can be used. The following snippet, however, should give you a good idea of how to use it in your application; it reads its input path, output path, and DBSCAN parameters from the command line.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.dbscan.DBSCAN
import org.apache.spark.mllib.linalg.Vectors

object DBSCANSample {
  def main(args: Array[String]): Unit = {
    // Expected arguments: input path, output path, eps, minPoints, maxPointsPerPartition
    val src = args(0)
    val dest = args(1)
    val eps = args(2).toDouble
    val minPoints = args(3).toInt
    val maxPointsPerPartition = args(4).toInt

    val conf = new SparkConf().setAppName("DBSCAN Sample")
    val sc = new SparkContext(conf)

    // Parse each line of comma-separated coordinates into a dense vector
    val data = sc.textFile(src)
    val parsedData = data.map(s => Vectors.dense(s.split(',').map(_.toDouble))).cache()

    // Log the chosen parameters (println keeps the sample self-contained)
    println(s"EPS: $eps minPoints: $minPoints")

    val model = DBSCAN.train(
      parsedData,
      eps = eps,
      minPoints = minPoints,
      maxPointsPerPartition = maxPointsPerPartition)

    // Each labeled point exposes its coordinates and the id of the cluster it was assigned to
    model.labeledPoints.map(p => s"${p.x},${p.y},${p.cluster}").saveAsTextFile(dest)
    sc.stop()
  }
}
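For a quick sanity check before running on real data, the model can also be exercised locally on a handful of 2D points. The following is a minimal sketch, not part of the project: the object name, toy dataset, and parameter values are illustrative, and it assumes only the DBSCAN.train signature and labeledPoints accessors shown above.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.dbscan.DBSCAN
import org.apache.spark.mllib.linalg.Vectors

object DBSCANLocalCheck {
  def main(args: Array[String]): Unit = {
    // local[2] runs Spark in-process with two worker threads
    val sc = new SparkContext(new SparkConf().setAppName("DBSCAN local check").setMaster("local[2]"))

    // Two tight blobs plus one far-away point that should end up labeled as noise
    val points = Seq(
      Vectors.dense(1.0, 1.0), Vectors.dense(1.1, 1.0), Vectors.dense(0.9, 1.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 8.9), Vectors.dense(8.9, 9.1),
      Vectors.dense(5.0, 0.0))

    val model = DBSCAN.train(
      sc.parallelize(points),
      eps = 0.5,                   // neighborhood radius
      minPoints = 2,               // minimum neighbors for a core point
      maxPointsPerPartition = 100) // upper bound on points per spatial partition

    model.labeledPoints.collect().foreach(p => println(s"(${p.x}, ${p.y}) -> cluster ${p.cluster}"))
    sc.stop()
  }
}

With parameters like these, each blob should form its own cluster while the isolated point is left unclustered.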
DBSCAN on Spark is available under the Apache 2.0 license. See the LICENSE file for details.
DBSCAN on Spark is maintained by Irving Cordova ([email protected]).