Adding dropTake to RDD. Allows to drop first 'drop' elements and then take next 'take' elements #617
base: master
Conversation
trulite commented on May 19, 2013
- Added a `dropTake(drop: Int, num: Int)` method which drops the first `drop` elements and then takes the next `num` elements.
- Added a test for `take(num: Int)`.
- Added a test for `dropTake(drop: Int, num: Int)`.
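For reference, the proposed semantics match drop-then-take on a local Scala collection. A minimal local sketch (the object and method names here are illustrative, not part of the PR's Spark implementation):

```scala
// Local sketch of the proposed dropTake semantics (not the Spark version):
// drop the first `drop` elements, then take the next `num` elements.
object DropTakeSketch {
  def dropTake[T](xs: Seq[T], drop: Int, num: Int): Seq[T] =
    xs.slice(drop, drop + num) // equivalent to xs.drop(drop).take(num)

  def main(args: Array[String]): Unit =
    println(dropTake((1 to 10).toVector, 3, 4)) // Vector(4, 5, 6, 7)
}
```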
Thank you for your pull request. An admin will review this request soon.
```scala
while ((drop - dropped) > 0 && it.hasNext) {
  it.next()
  dropped += 1
}
```
I am not sure this will work the way you expect it to: `dropped` is updated on the slave and not brought back, so all slaves will see it as `0`.
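The pitfall described here can be illustrated without Spark: each task runs on a serialized copy of the closure, so mutations to a captured var never reach the driver. A simulated sketch (names are hypothetical; this models, rather than uses, Spark's closure serialization):

```scala
// Simulates why mutating a captured var from tasks fails: each "task" gets
// its own copy of the counter, as a serialized closure would in Spark.
object ClosureCapturePitfall {
  def runSimulatedJob(partitions: Seq[Seq[Int]]): Int = {
    var dropped = 0 // driver-side counter
    partitions.foreach { part =>
      var taskCopy = dropped // each task sees a snapshot, not shared state
      part.foreach(_ => taskCopy += 1)
      // taskCopy dies with the task; nothing flows back to the driver
    }
    dropped // still 0: all mutations happened on copies
  }

  def main(args: Array[String]): Unit =
    println(runSimulatedJob(Seq(Seq(1, 2), Seq(3, 4)))) // prints 0
}
```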
Added another commit. Please take a look and see if that is okay. I added some comments which can aid the discussion; I can remove them later if we want to merge. Since I am not very familiar with the usage of accumulators, I will let someone else comment!
```scala
@@ -714,6 +714,42 @@ abstract class RDD[T: ClassManifest](
  }

  /**
   * Drop the first drop elements and then take next num elements of the RDD. This currently
   * scans the partitions *one by one*, so it will be slow if a lot of partitions are required.
   * In that case, use dropCollect(drop) to get the
```
We don't have a `dropCollect` method; maybe suggest `collect().drop(drop)` instead?
I think the current test only exercises one of the branches in `val taken = if (leftToDrop > 0) it.take(0) else it.take(left)`. To cover more cases, I'd add a test where we drop one or more complete partitions before taking results from the remaining partitions.
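A sketch of the suggested extra case, modeling an RDD as a Seq of partitions and reusing the same drop-then-take-per-partition logic (helper names are hypothetical, not the PR's actual test code):

```scala
// Models dropping one or more complete partitions before taking results.
// The `drop` of 4 below spans all of the first partition, exercising the
// branch where a partition contributes nothing (the take(0) case).
object DropWholePartitionTest {
  def dropTake[T](parts: Seq[Seq[T]], drop: Int, num: Int): Seq[T] = {
    var leftToDrop = drop
    var leftToTake = num
    val out = Seq.newBuilder[T]
    for (part <- parts if leftToTake > 0) {
      val remaining = part.drop(leftToDrop)      // drop within this partition
      leftToDrop = math.max(0, leftToDrop - part.size)
      val taken = remaining.take(leftToTake)     // then take what is still needed
      leftToTake -= taken.size
      out ++= taken
    }
    out.result()
  }

  def main(args: Array[String]): Unit = {
    val parts = Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9))
    println(dropTake(parts, 4, 3)) // List(5, 6, 7)
  }
}
```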
I added these tests.
Ping!
Sorry for taking a while to get to this. I'm curious, what do you want to use this operation for? It seems potentially limited in use because it's going to scan the whole RDD linearly.
This provides the LIMIT and TOPK results like in SQL/LINQ. Sometimes it is useful to page through just 2-3 pages (of some page size) of the TOPK results. This will not fetch the whole RDD like collect does.
A) Instead of iteratively dropping, it would be better to: a) count records per partition. Of course, this assumes that:
B) The above will also get rid of the need to do `collect().drop(drop)` as mentioned in the comment, since that will also not scale with the size of the RDD.
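The count-then-slice idea in (A) can be sketched on local collections as follows (names are hypothetical; a real Spark version would compute the per-partition counts with one parallel job and then fetch only the needed slices):

```scala
// Step 1: count records in each partition (parallelizable in Spark).
// Step 2: from the counts, compute each partition's local (lo, hi) slice.
// Step 3: fetch only the slices that overlap [drop, drop + num).
object CountThenSlice {
  def dropTake[T](parts: Seq[Seq[T]], drop: Int, num: Int): Seq[T] = {
    val counts = parts.map(_.size)               // step 1
    val starts = counts.scanLeft(0L)(_ + _)      // global offset of each partition
    parts.zipWithIndex.flatMap { case (part, i) => // steps 2 and 3
      val lo = math.max(drop - starts(i), 0L).toInt
      val hi = math.min(drop.toLong + num - starts(i), counts(i).toLong).toInt
      if (lo < hi) part.slice(lo, hi) else Nil
    }
  }

  def main(args: Array[String]): Unit = {
    val parts = Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9))
    println(dropTake(parts, 4, 3)) // List(5, 6, 7)
  }
}
```

Unlike the iterative version, every needed partition can be sliced independently once the counts are known, so the work parallelizes.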
Assuming "T" seconds to process a partition (for simplicity's sake, count/drop/etc. all take the same time, though that is not accurate): in this PR, if you need to process N partitions, you take N * T seconds. Though count is linear, all partitions get scheduled in parallel, so unless you are running on a single core or the number of partitions is much higher than the number of cores, you won't take num_partitions * time_per_partition. Btw, we are not doing a global count, but a count per partition.
But as a general construct on RDD, which might get used for all use cases, I am worried about the performance implications, particularly the ability to cause OOM and bad scaling, when used on large datasets.
Thanks! I understand and agree. I will whip up something along these lines soon and update this PR. |
If your use case is paging with small values of "drop" and "take", then consider just calling take() on the first few thousand elements and paging through that as a local collection. Otherwise this seems like a very specific thing to add for just one use case. A parallel slicing operation like Mridul says could also be good, but again I'd like to understand where it will be used, because it seems like a complicated operation to add and to implement well. Just as an example, the counts there might be better off being cached across calls to dropTake if it will be used for paging.
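The local-paging alternative described here can be sketched as follows (the page size and helper names are hypothetical):

```scala
// Take the top K elements once, then page through them locally: each page is
// just a slice of the already-fetched collection, with no further RDD scans.
object LocalPaging {
  def page[T](topK: Seq[T], pageIdx: Int, pageSize: Int): Seq[T] =
    topK.slice(pageIdx * pageSize, (pageIdx + 1) * pageSize)

  def main(args: Array[String]): Unit = {
    val topK = (1 to 100).toVector // stands in for rdd.take(100)
    println(page(topK, 2, 10))     // elements 21 through 30
  }
}
```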
Any movement on this request? I could use this.
I was whipping up a generic tool in which I needed to allow the user to page through the top-K elements. Since a generic tool is a one-off use case, I am using my own solution (this PR as of the last update). Since this is driven by a human user, they will only do this for a small number of pages, so the solution I have works better for me than the one Mridul suggested with parallel slicing. I agree with Matei that this might be dangerous to have around as a generic RDD construct. That said, I have experimented with Mridul's approach and might have the code lying around, so if we intend to have this PR pulled I will update the code and test cases.
…ropTake from spark pull request 617 (mesos/spark#617)