Data partitioning architecture #106

Open
archenroot opened this issue Nov 1, 2016 · 12 comments

Comments

@archenroot

First of all, good stuff here!

I have a question about how Alenka is designed to partition data across multiple devices. Suppose I have a big GPU cluster. Does it create "super" partitions, each assigned to one device, and then, when running parallel queries, split each "super" partition into "mini" partitions for parallel execution within that GPU?

I am still experimenting with this code and with CUDA in general, so sorry if my idea is inaccurate.

Thank you.

@antonmks
Owner

antonmks commented Nov 1, 2016

No, for now Alenka can use only a single GPU. But it shouldn't be difficult
to modify it to partition the data and process it on multiple GPUs!


@archenroot
Author

Ok, clear. Thank you very much.

@archenroot
Author

archenroot commented Nov 23, 2016

Can you please suggest where and how to implement the partitioning? Another idea that came to mind is data sharding or a global cluster configuration:

  • sharding -> only specific data ranges will live on a given GPU or GPU cluster
  • global cluster -> imagine I have 4 machines, each with 8 GPUs, connected via a PCI Express network, and I would like to run one big Alenka database on them
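
A minimal sketch of how the two ideas above could fit together, assuming integer keys and equal-width ranges. None of this is Alenka code: `make_boundaries`, `shard_for_key`, and `device_for_shard` are hypothetical helper names, and the key range 0..3200 is made up for illustration. Only the 4 machines and 8 GPUs come from the example above.

```python
from bisect import bisect_right
from typing import List, Tuple

MACHINES = 4          # taken from the example above
GPUS_PER_MACHINE = 8  # taken from the example above


def make_boundaries(lo: int, hi: int, shards: int) -> List[int]:
    """Split the key range [lo, hi) into `shards` contiguous ranges and
    return the upper bound of every range except the last one."""
    width = (hi - lo) / shards
    return [lo + round(width * i) for i in range(1, shards)]


def shard_for_key(key: int, boundaries: List[int]) -> int:
    """Index of the range shard that owns `key` (binary search)."""
    return bisect_right(boundaries, key)


def device_for_shard(shard: int) -> Tuple[int, int]:
    """Map a shard index to a (machine, gpu) pair, filling one machine at a time."""
    return divmod(shard, GPUS_PER_MACHINE)


# Hypothetical key range: 32 shards, each 100 keys wide.
boundaries = make_boundaries(0, 3200, MACHINES * GPUS_PER_MACHINE)
shard = shard_for_key(1234, boundaries)
machine, gpu = device_for_shard(shard)
print(shard, machine, gpu)  # 12 1 4
```

With range sharding like this, a query restricted to a key range only has to touch the devices whose ranges overlap it; a hash-based mapping would spread load more evenly but lose that property.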

If you can point not only to ideas but to concrete places in the code where this functionality could be implemented, that would be helpful.

I see big potential in this application, and I am interested in comparing Alenka with in-memory engines (Redis, etc.).

Thanks a lot!

@antonmks
Owner

Keep in mind that Alenka is pretty experimental; it is not suitable for production use at this point. Concerning partitioning, you could write a program that runs modified Alenka instances on different nodes and collects the results at a master node. Or something like that.
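
The "run instances on different nodes and collect at a master" idea is essentially scatter-gather. Below is a rough sketch under stated assumptions: `run_on_node` is a hypothetical stand-in for one Alenka instance (it does not call Alenka), threads stand in for separate machines, and the query is a grouped average chosen only because it shows why nodes should return partial sums and counts rather than finished averages.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Row = Tuple[str, float]                 # (group key, value)
Partial = Dict[str, Tuple[float, int]]  # group key -> (sum, count)


def run_on_node(partition: List[Row]) -> Partial:
    """Hypothetical stand-in for one instance: aggregates only its own partition."""
    acc = defaultdict(lambda: (0.0, 0))
    for key, value in partition:
        s, n = acc[key]
        acc[key] = (s + value, n + 1)
    return dict(acc)


def merge(partials: List[Partial]) -> Dict[str, float]:
    """Master step: add up per-node sums and counts, then finish the average."""
    total = defaultdict(lambda: (0.0, 0))
    for partial in partials:
        for key, (s, n) in partial.items():
            ts, tn = total[key]
            total[key] = (ts + s, tn + n)
    return {key: s / n for key, (s, n) in total.items()}


def scatter_gather(partitions: List[List[Row]]) -> Dict[str, float]:
    """Send each partition to its own worker, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(run_on_node, partitions))
    return merge(partials)


partitions = [
    [("a", 1.0), ("b", 4.0)],
    [("a", 3.0)],
    [("b", 6.0), ("b", 2.0)],
]
result = scatter_gather(partitions)
print(result)  # {'a': 2.0, 'b': 4.0}
```

Averaging the per-node averages would give the wrong answer whenever nodes hold different row counts for a group, which is why the merge works on (sum, count) pairs.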

@archenroot
Author

archenroot commented Nov 27, 2016 via email

@hurdad
Collaborator

hurdad commented Dec 20, 2016

I was working on using Hadoop HDFS as data storage; this would allow multiple nodes to query the same dataset.

@archenroot
Author

archenroot commented Dec 20, 2016 via email

@hurdad
Collaborator

hurdad commented Dec 20, 2016

So you could, for example, partition by date and keep each day's data on a different server?

@archenroot
Author

archenroot commented Dec 20, 2016 via email

@hurdad
Collaborator

hurdad commented Dec 20, 2016

One issue I see is that you would need to pre-sort the data you're loading by your partition field when creating the segments.
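
To illustrate why the pre-sort matters, here is a small sketch, assuming a date string as the partition field and a fixed number of rows per segment. It does not reflect Alenka's actual segment format; `Segment`, `build_segments`, and `segments_for_range` are hypothetical names. Once the rows are sorted, each segment covers a narrow key range, so a query on one day can skip most segments.

```python
from typing import List, NamedTuple, Tuple

Row = Tuple[str, int]  # (ISO date used as the partition field, value)


class Segment(NamedTuple):
    lo: str          # smallest partition key in the segment
    hi: str          # largest partition key in the segment
    rows: List[Row]


def build_segments(rows: List[Row], segment_size: int) -> List[Segment]:
    """Sort by the partition field, then cut into fixed-size segments."""
    ordered = sorted(rows, key=lambda r: r[0])
    segments = []
    for start in range(0, len(ordered), segment_size):
        chunk = ordered[start:start + segment_size]
        segments.append(Segment(chunk[0][0], chunk[-1][0], chunk))
    return segments


def segments_for_range(segments: List[Segment], lo: str, hi: str) -> List[Segment]:
    """Keep only segments whose [lo, hi] key range overlaps the query range."""
    return [s for s in segments if s.hi >= lo and s.lo <= hi]


rows = [
    ("2016-12-03", 30), ("2016-12-01", 10), ("2016-12-02", 20),
    ("2016-12-01", 11), ("2016-12-03", 31), ("2016-12-02", 21),
]
segments = build_segments(rows, segment_size=2)
hits = segments_for_range(segments, "2016-12-02", "2016-12-02")
print(len(segments), len(hits))  # 3 1
```

Without the sort, every segment in this example would span all three days and none could be skipped.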

Anyway, the main reason I was pursuing a clustered/distributed approach was to allow for concurrent queries on the dataset. Currently Alenka supports only a single query at a time per GPU.

@archenroot
Author

archenroot commented Dec 20, 2016 via email

@dkourilov

Hey guys,

I think it's worth mentioning http://www.bitfusion.io/ in this thread, along with the usage example from http://tech.marksblogg.com/billion-nyc-taxi-rides-aws-ec2-mapd.html

I'm not affiliated with either of them. Bitfusion does GPU virtualisation. Just my 2 cents.
