-
Notifications
You must be signed in to change notification settings - Fork 385
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clean up DAGScheduler datastructures after job completes #414
base: master
Are you sure you want to change the base?
Conversation
Good catch; the only thing I'm not 100% sure of is whether this will work when two active jobs depend on the same stage. (This can only happen if two threads submit a job at the same time, but that may occur in a multi-user system or in Spark Streaming.) I'll look into it more carefully, but if you've thought about this, let me know. |
This will remove all stages submitted under a single job - if another job comes along and depends on the same stages, it will not work. In our cases, we have not resubmitted the same RDD's multiple types (using shark we create new RDD's every time). I can see now where getShuffleMapStage in wich a lookup matches a stage by an ID from the RDD. I think a per-job reference-count for the stage-related maps might work- seem plausible? Thank you for looking at this, btw. You seem very busy if the mailing list is any indication. -Jon |
Yeah, I think a reference count would be better. We should also make a test where multiple jobs depend on the same stage. For example:
Here both the actions should depend on that first shuffle map stage before the group-by. The other thing I've been thinking about is whether we should just use WeakHashMap for most of the data structures. This way stages will only be kept for RDDs that the user program references somehow. I think this will be a fair amount of work but it might be the right thing in the long term. |
Can one of the admins verify this patch? |
I'm the Jenkins test bot for the UC, Berkeley AMPLab. I've noticed your pull request and will test it once an admin authorizes me to. Thanks for your submission! |
1 similar comment
I'm the Jenkins test bot for the UC, Berkeley AMPLab. I've noticed your pull request and will test it once an admin authorizes me to. Thanks for your submission! |
Bump -- any progress on this one? |
I think this will have to be done after Spark 0.8, because of the reference-counting issue above. However, @markhamstra has also been looking at the reference counting for some of his own work, I believe. |
Thank you for your pull request. An admin will review this request soon. |
I added some fixes to reduce heap/prevent memory leak we found after running a load test (thousands of jobs). There may be more leaks, but this fixes a big one.