Run Gemini file-level duplicate detection on PGA #42
Comments
How do we do that? We don't control the Spark cluster. |
Let's measure, identify and document the bottleneck first, set preliminary expectations on resources for 100k, and then discuss the possible options we have, i.e. this can be a powerful argument for changing https://github.com/src-d/charts/tree/master/spark to Apache Spark on k8s. We would be able to improve the performance expectation model later on, based on more data. |
Thanks for keeping it updated! BTW, super-nice issue description and example of how to reproduce 👍 |
Engine issue is resolved in https://github.com/src-d/engine/releases/tag/v0.5.1 |
Yep. But the Engine API has changed a bit. We need to update Gemini. |
Ran Gemini on the new 1k dataset with the new Engine, and it works!!!! The bad news is the timing: 24 min. |
10k has failed with https://github.com/src-d/engine/issues/332 |
Currently blocked by https://github.com/src-d/engine/issues/336 |
To move this forward, as the DR team is super busy now, can we please submit a PR to Engine that just logs |
@carlosms could you please check if https://github.com/src-d/engine/pull/347 solves the issue and allows us to move forward with #42? If that PR is tested on real data and solves the issue, it may be worth posting this information on the PR as well. |
Engine 0.5.7 was released 🎉 with many bug fixes, and discussions like https://github.com/src-d/minutes/pull/210/files#diff-a0ec2b18d53b6bebfc2a342ed864a52fR34 should raise the priority of finishing running Gemini file duplication at up to PGA sizes. |
Title and description are updated to represent the current goal. |
10k repos are processed successfully with Engine 0.5.7. Full PGA is failing with OOM with default params; need to tune them. |
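For reference, a minimal sketch of the kind of Spark memory settings such tuning usually touches; the values below are illustrative assumptions, not the configuration that was actually used here.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative sketch only: these are standard Spark settings, but the values
// are placeholder assumptions for a full-PGA run, not the tuning that was
// actually validated in this issue.
val spark = SparkSession.builder()
  .appName("gemini-hash-pga")
  .config("spark.executor.memory", "30g")          // more heap per executor
  .config("spark.executor.memoryOverhead", "8g")   // off-heap headroom, e.g. for UAST extraction
  .config("spark.driver.memory", "16g")
  .config("spark.driver.maxResultSize", "4g")      // collected results can be large
  .config("spark.sql.shuffle.partitions", "2000")  // smaller, more numerous shuffle tasks
  .getOrCreate()
```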
Plan is:
|
PGA is downloading to the pipeline HDFS cluster (WIP). At this rate it will take ~25h to get there. |
PGA download is finished 🎉 but it's a bit smaller than expected: only 2.4Tb, not 2.7Tb as the rumor has it to be. Will verify PGA integrity first with src-d/datasets#53 |
Pre-conditions for running new Gemini on pipeline staging Apache Spark cluster:
|
Blocked by src-d/backlog#1266
|
Full PGA was downloaded to HDFS 🎉 src-d/datasets#53 (comment)
|
Plan
|
Blocked, as all Feature Extractors deployed under https://github.com/src-d/issues-infrastructure/issues/184 are part of a new, separate Apache Spark cluster in a different k8s namespace. |
Hash has finished successfully; I'm now submitting to Gemini the PRs that enabled it. The report is:
|
1h for hashing ~1/250 of PGA on 3 machines of the pipeline staging cluster.
Report sections: Configuration, Command, Output, FE exceptions, UAST extraction exceptions, DB
|
Thanks a lot for the detailed results, @bzz! Question: how are we sampling the repos for each of these tests? |
Good question. We always used just a single shard of the PGA dataset - all the repos whose .siva file names start with a given prefix. Overall, Apache Spark performance depends on data distribution A LOT, so attaching a .siva file size distribution histogram in 10mb buckets
|
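For reference, a minimal sketch of how such a size histogram can be computed over HDFS; the path and the use of the Hadoop FileSystem API are assumptions, not necessarily how the attached histogram was produced.

```scala
import scala.collection.mutable.ArrayBuffer
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Sketch under assumptions: recursively list the .siva files on HDFS and bucket
// their sizes into 10mb bins. The path below is assumed from later comments.
val sivaDir = new Path("hdfs://hdfs-namenode.default.svc.cluster.local/pga2/siva/latest/")
val fs = sivaDir.getFileSystem(new Configuration())

val sizesMb = ArrayBuffer.empty[Long]
val it = fs.listFiles(sivaDir, true) // recursive: .siva files are sharded into subdirectories
while (it.hasNext) {
  val f = it.next()
  if (f.getPath.getName.endsWith(".siva")) sizesMb += f.getLen / (1024L * 1024L)
}

sizesMb
  .groupBy(mb => (mb / 10) * 10) // 10mb buckets: 0, 10, 20, ...
  .mapValues(_.size)
  .toSeq
  .sortBy(_._1)
  .foreach { case (bucket, count) => println(s"$bucket-${bucket + 10}mb: $count files") }
```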
Local: 1mb, 30k features

DataFrame (local: 8sec, cluster: 4sec):

```scala
val freqDf = features.withColumnRenamed("_1", "feature").withColumnRenamed("_2", "doc")
  .select("feature", "doc")
  .distinct
  .groupBy("feature")
  .agg(count("*").alias("cnt"))
  .map(row => (row.getAs[String]("feature"), row.getAs[Long]("cnt")))
  .collect().toMap
```

RDD (local: 4sec, cluster: 5s):

```scala
val freq = features.rdd
  .map { case (feature, doc, _) => (feature, doc) }
  .distinct
  .map { case (token, _) => (token, 1) }
  .reduceByKey(_ + _)
  .collectAsMap()
```

The DataFrame API does not seem to change performance much, but still has the nice benefit of a uniform API. |
There are 141 .siva files bigger than 1Gb, with the rest (260k+) being smaller.
|
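A hedged sketch of one way to separate those >1Gb outliers before processing; the destination directory name is hypothetical, and the move-on-HDFS approach is an assumption, not necessarily what was done in the next comment.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Sketch under assumptions: move .siva files larger than 1Gb out of the input
// directory so only the ~260k smaller files get processed. The "siva-outliers"
// destination is a hypothetical name, not a path from this issue.
val src = new Path("hdfs://hdfs-namenode.default.svc.cluster.local/pga2/siva/latest/")
val outliers = new Path("hdfs://hdfs-namenode.default.svc.cluster.local/pga2/siva-outliers/")
val fs = src.getFileSystem(new Configuration())
fs.mkdirs(outliers)

val oneGb = 1024L * 1024L * 1024L
val it = fs.listFiles(src, true)
while (it.hasNext) {
  val f = it.next()
  if (f.getPath.getName.endsWith(".siva") && f.getLen > oneGb) {
    fs.rename(f.getPath, new Path(outliers, f.getPath.getName))
  }
}
```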
After moving the biggest files, jobs fail with
After setting
which at this point might indicate broken .siva files on |
@bzz here is your issue: https://github.com/src-d/engine/issues/414 |
Simple processing of full PGA from
Removing the outliers, ~140 .siva files (of ~270k) which are >1Gb each, would speed it up 2-3x.
```scala
val files = repos.getHEAD
  .getCommits
  .getTreeEntries
  .getBlobs
  .filter('is_binary === false)

files
  .sort("repository_id")
  .coalesce(1000)
  .write.parquet("hdfs://hdfs-namenode.default.svc.cluster.local/pga2/parquet/files")
```

Caching all files on-disk in Parquet fails though, with
This happens because the DataFrame API for a String column keeps the longest string in memory, which here is a full file's content of more than 1Gb. |
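One possible mitigation, sketched under assumptions: drop the largest blobs before caching, so no single content cell approaches 1Gb. The `content` column name, the 16mb cutoff, and the `files-small` output path are assumptions for illustration, not what was actually applied; it builds on the `files` DataFrame above.

```scala
import spark.implicits._
import org.apache.spark.sql.functions.length

// Sketch under assumptions: "content" is assumed to be the blob content column,
// and 16mb is an arbitrary illustrative cutoff. "files-small" is a hypothetical
// output path, not one used in this issue.
val smallFiles = files.filter(length('content) < 16 * 1024 * 1024)

smallFiles
  .sort("repository_id")
  .coalesce(1000)
  .write.parquet("hdfs://hdfs-namenode.default.svc.cluster.local/pga2/parquet/files-small")
```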
The on-disk Parquet cache was failing due to the number of tasks (~40k) being too high for our cluster configuration. That was fixed by reducing the number, and it can now proceed over the full PGA (~50h) 🎉 but is failing at the end 😖 with
Simple example to reproduce:

```scala
val path = "hdfs://hdfs-namenode.default.svc.cluster.local/pga2/siva/latest/"
val engine = Engine(spark, path, "siva")
val repos = engine.getRepositories
val files = repos.getHEAD
  .getCommits
  .getTreeEntries
  .getBlobs
  .filter('is_binary === false)

files
  .coalesce(1000)
  .sort("repository_id")
  .write.parquet("hdfs://hdfs-namenode.default.svc.cluster.local/pga2/parquet/files")
```
|
Document in the README the resources needed to successfully process 1k, 2k, 10k, 100k and the whole PGA of .siva files.
So a good start would be