Data partitioning architecture #106
No, for now Alenka can use only a single GPU. But it shouldn't be difficult.

Ok, clear. Thank you very much.
Can you please suggest where and how to implement the partitioning? Another idea that came to mind is data sharding, or a global cluster configuration.
If you can point out not only ideas but concrete places in the code where the functionality could be implemented, that would be helpful. I see big potential in this application, and I am interested in doing some comparisons between Alenka and in-memory engines (Redis, etc.). Thanks a lot!
Keep in mind that Alenka is pretty experimental; it is not suitable for production use at this point. Concerning partitioning, you could write a program which would run modified Alenka instances on different nodes and collect the results at a master node. Or something like that.
I understand that it will require effort to implement this feature; I am more or less asking for your opinions.
I was working on using Hadoop HDFS as data storage; this would allow nodes to query the same dataset.
Thanks Alexander,
What I am interested in is dynamic partitioning of the data, which could be seen as grouping of segments.
Imagine I have a table DATA with 2 columns:

| Type | Value |
|------|-------|
| A | something |
| A | something else |
| B | something else |
| B | something else |

Imagine each type has 1 billion records. What I am interested in is to create 2 groups of segments, one for type A and another for type B. When a query like `select * from DATA where type = 'A'` comes in, I will pick up only the segments registered with A. This will of course require more fine-grained processing of segments than is happening at the moment. I would also like to store each partition on a different GPU cluster (just as an ability to tune performance).
Ladislav
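The segment-grouping idea above can be sketched in a few lines. This is a hypothetical illustration, not Alenka's actual API: the `Segment` and `PartitionedTable` names are made up. Each segment records the partition value it holds, and a query filtering on `type = 'A'` scans only the matching group of segments.

```python
# Hypothetical sketch of segment pruning by partition key (not Alenka code):
# rows are routed into per-value segment groups at load time, and a query
# on one value skips every segment registered with another value.

from dataclasses import dataclass, field

@dataclass
class Segment:
    partition_value: str               # e.g. 'A' or 'B'
    rows: list = field(default_factory=list)

@dataclass
class PartitionedTable:
    segments: list = field(default_factory=list)

    def insert(self, type_value, value, max_rows=2):
        # Route each row to a segment for its partition value,
        # opening a new segment when the current one is full.
        for seg in reversed(self.segments):
            if seg.partition_value == type_value and len(seg.rows) < max_rows:
                seg.rows.append(value)
                return
        self.segments.append(Segment(type_value, [value]))

    def query(self, type_value):
        # Partition pruning: only segments registered with the requested
        # value are scanned; all other segments are skipped entirely.
        pruned = [s for s in self.segments if s.partition_value == type_value]
        rows = [r for s in pruned for r in s.rows]
        return rows, len(pruned)

table = PartitionedTable()
for t, v in [("A", "something"), ("A", "something else"),
             ("B", "something else"), ("B", "something else")]:
    table.insert(t, v)

rows, scanned = table.query("A")
print(rows, scanned)   # only type-A rows; only the A segment was scanned
```

With a billion rows per type the win is the same: the query plan touches only the segment group registered for the requested type.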
So you could partition by date and have the data from each day on a different server?
Exactly, though this needs more brainstorming. In the case of monthly partitions, I would like to have the whole year stored on one node. Just a dummy example of a generic approach.
Ladislav
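The generic approach hinted at above separates the partitioning unit from the placement unit. A minimal sketch, assuming monthly partitions placed by year (the function names are made up for illustration, not part of Alenka):

```python
# Hypothetical sketch of a partition-to-node mapping (not Alenka code):
# monthly partitions are the unit of pruning, but placement keys on the
# year, so every month of a year lands on the same node.

def partition_id(date_str):
    # Monthly partition: '2016-12-20' -> '2016-12'
    return date_str[:7]

def node_for(partition, nodes):
    # Placement key = year, so all 12 monthly partitions of a year
    # map to one node, as in the "whole year on one node" example.
    year = partition[:4]
    return nodes[hash(year) % len(nodes)]

nodes = ["gpu-node-0", "gpu-node-1", "gpu-node-2"]
jan = node_for(partition_id("2016-01-15"), nodes)
dec = node_for(partition_id("2016-12-20"), nodes)
print(jan == dec)   # True: same year -> same node
```

Swapping the placement key (year, month, day) tunes how queries spread across servers without changing the pruning granularity.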
One issue I see is that you would need to pre-sort the data you're loading by your partition field when creating the segments. Anyway, the main reason I was pursuing a clustered/distributed approach was to allow for concurrent queries on the dataset. Currently Alenka only supports a single query per GPU.
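The pre-sort point can be shown in a few lines. A hypothetical sketch (not Alenka's loader): cutting unsorted rows into fixed-size segments mixes partition values, while sorting by the partition field first yields single-value segments that can be pruned at query time.

```python
# Hypothetical sketch (not Alenka code): pre-sorting by the partition
# field before cutting rows into segments guarantees each segment holds
# a single partition value, which is what makes segment pruning possible.

def make_segments(rows, key, segment_size):
    rows = sorted(rows, key=key)       # the pre-sort step
    return [rows[i:i + segment_size]
            for i in range(0, len(rows), segment_size)]

rows = [("B", 1), ("A", 2), ("B", 3), ("A", 4)]
segments = make_segments(rows, key=lambda r: r[0], segment_size=2)

# Each segment now contains exactly one partition value:
print([{r[0] for r in seg} for seg in segments])   # [{'A'}, {'B'}]
```

Without the `sorted` call, the same cut would produce segments spanning both A and B, and no segment could be skipped for a filter on either value.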
Exactly, good point, but these are 2 different things we need to target when we jump into clustering:
1. Create a network stack around gpudb to reach the single process on a GPU node.
I am considering using Netty (Java) to build the network around gpudb, or going with pure C++ using Facebook's Netty-like library Wangle: https://github.com/facebook/wangle
Then you need a load balancer and cluster control; this will require monitoring of which GPU is working on which type of task, and especially you need to know how much memory is available, etc.
There are 2 frameworks I am looking into for multi-GPU CUDA: CUDA-aware MPI and rCUDA.
2. Make multiple queries execute in parallel. The current design is completely unaware of that. Also, you usually create segments based on the extracted data size to be processed (to limit the number of segments and therefore the number of offloads), so when one query is executed, the VRAM of one GPU is used to its limit. Maybe an Nvidia Tesla with 32 GB of RAM (16 per GPU core) could bring more room to create segments smaller than VRAM. Then we would need to implement CUDA streams, available since the Kepler architecture (correct me if I am wrong here).
I was doing some research on multi-node interconnects, and a PCIe network could be an option (I saw Dolphin adapters doing 50 Gbit/s with larger messages of 4K), or going with InfiniBand, which can already reach over 100 Gbit/s.
Regards,
Ladislav
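The VRAM constraint in point 2 is really an admission-control problem: concurrent queries on one GPU only work if their combined working sets fit in device memory. A hypothetical sketch, with an invented `GpuScheduler` class (nothing like this exists in Alenka today):

```python
# Hypothetical sketch (not Alenka code) of memory-aware admission control:
# a query runs only if its estimated working set fits in the remaining
# VRAM; otherwise it queues until a running query frees memory.

class GpuScheduler:
    def __init__(self, vram_bytes):
        self.free = vram_bytes
        self.running = []
        self.waiting = []

    def submit(self, name, mem_needed):
        if mem_needed <= self.free:
            self.free -= mem_needed
            self.running.append((name, mem_needed))
            return "running"
        self.waiting.append((name, mem_needed))
        return "queued"

    def finish(self, name):
        for q in self.running:
            if q[0] == name:
                self.running.remove(q)
                self.free += q[1]
                break
        # Admit any waiting queries that now fit.
        still_waiting = []
        for qname, mem in self.waiting:
            if mem <= self.free:
                self.free -= mem
                self.running.append((qname, mem))
            else:
                still_waiting.append((qname, mem))
        self.waiting = still_waiting

GB = 1 << 30
gpu = GpuScheduler(vram_bytes=16 * GB)
print(gpu.submit("q1", 10 * GB))    # running
print(gpu.submit("q2", 10 * GB))    # queued: would exceed 16 GB
gpu.finish("q1")
print([q[0] for q in gpu.running])  # q2 was admitted
```

On the device side, the admitted queries would then map to separate CUDA streams so their kernels and transfers can overlap, which is the part the comment says Kepler-era hardware supports.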
Hey guys, I think it is worth mentioning http://www.bitfusion.io/ in this thread, plus a usage example from http://tech.marksblogg.com/billion-nyc-taxi-rides-aws-ec2-mapd.html. I'm not affiliated with either of them; Bitfusion does GPU virtualisation. Just my 2 cents.
First of all, good stuff here!
I have a question about how Alenka is designed to partition across multiple devices. Suppose I have a big GPU cluster. Does it work by making some "super" partitions, each assigned to 1 device, and then, when making parallel queries, each "super" partition is "mini" partitioned for parallel execution within a specific GPU?
I am more or less experimenting with this code and with CUDA in general at the moment, so sorry if the idea is not accurate.
Thank you.
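The two-level scheme described in the question can be sketched concretely. This is a hypothetical illustration of the idea, not Alenka's actual design (which, per the answers above, is single-GPU for now):

```python
# Hypothetical sketch (not Alenka's design) of two-level partitioning:
# level 1 assigns one "super" partition per device, level 2 splits each
# super partition into "mini" partitions for parallel work on that GPU.

def two_level_partition(rows, n_devices, minis_per_device):
    # Level 1: round-robin rows into one super partition per device.
    supers = [rows[d::n_devices] for d in range(n_devices)]
    # Level 2: split each super partition into mini partitions.
    plan = []
    for device, sup in enumerate(supers):
        size = max(1, -(-len(sup) // minis_per_device))  # ceil division
        minis = [sup[i:i + size] for i in range(0, len(sup), size)]
        plan.append((device, minis))
    return plan

plan = two_level_partition(list(range(8)), n_devices=2, minis_per_device=2)
for device, minis in plan:
    print(device, minis)
```

The interesting design question the thread raises is whether level 1 should be round-robin (for load balance) or keyed on a column like type or date (for pruning), since the two goals pull in opposite directions.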