Document Azure Blob Storage support #256

Open
behrica opened this issue Oct 11, 2020 · 18 comments

Comments

@behrica
Collaborator

behrica commented Oct 11, 2020

> @behrica, that sounds awesome! Would you mind sharing what config options to pass, and we can add it to the docs or the README? 😄

Originally posted by @anthony-khong in #228 (comment)

@behrica
Collaborator Author

behrica commented Oct 11, 2020

In which form should we address this?

I would think about a continuation of https://github.com/zero-one-group/geni/blob/develop/docs/kubernetes_basic.md

At the end of the Kubernetes setup, the next natural question is:
How do I read data files?

There are quite a few options for it:

  1. Copy them "somehow" to all pods of the cluster (and the driver) in the same place and access them as local files
  2. Mount Azure blob storage
  3. Potentially a lot of other options, which I have not tried:
    • hdfs
    • copy files on Kubernetes nodes and mount into pods
    • Use Azure Data Lake

All of these are a bit complex, and depend on the "concrete setup".
It might be required to change and rebuild Docker images to add dependencies, and to worry about a lot of other Spark/Kubernetes/Azure-specific details.

And they have little to do with "geni" itself, and are documented elsewhere.
From a "geni" point of view, only 2 things change:

  1. The options passed to g/create-spark-session
  2. The "url" passed to (g/read...)

Maybe it is best just to point this out at the end of kubernetes_basic.md, without giving a solution (because there are so many).

@behrica
Collaborator Author

behrica commented Oct 11, 2020

I added a chapter to the Kubernetes documentation accordingly:

https://github.com/behrica/geni/blob/develop/docs/kubernetes_basic.md

@anthony-khong
Member

Hi @behrica, thank you again for bringing this up.

> I would think about a continuation of https://github.com/zero-one-group/geni/blob/develop/docs/kubernetes_basic.md

Yes, I think that's probably a good place to put it. I think some docs do get a bit long, which is fine.

> And they have little to do with "geni" itself, and are somewhere else documented.

In which case, we can perhaps link to the documents?

I still think it's good to have an example of a working version. I'm happy to try out one of your examples and try to get it working!

> I added a chapter into the Kubernetes documentation accordingly:

Would you like to make a PR? I believe there are some typos - I hope you're okay with it being reviewed 😄

@behrica
Collaborator Author

behrica commented Oct 12, 2020

I made a PR.
I am not a native English speaker, so feel free to fix it.

@behrica
Collaborator Author

behrica commented Oct 12, 2020

This does not yet add anything for reading the files from storage.
I am still figuring out how to do an example which is neither too simplistic nor too complex.

All realistic examples would need to assume the existence of some form of "cloud storage" of data.

@anthony-khong
Member

> I am not English mother tongue, so feel free to fix it

I think it's great! I'm just reviewing the styling, so that it's a bit more consistent throughout the repo 😄

> All realistic examples, would need to assume the existence of some form of "cloud storage "of data.

Ah I see, in which case it may become a bit too involved to set up. Could we work with some public Azure Storage files? I'm not sure what's available out there, though.

@behrica
Collaborator Author

behrica commented Oct 17, 2020

I did a complete walk-through, which goes from "zero" to analysing a 10 GB CSV file, stored in a newly created Azure File Storage, with Geni on an AKS cluster.
https://github.com/behrica/geni/blob/develop/docs/kubernetes_azureStorage.md

All commands are there, but I did not write any text yet.

@anthony-khong Do you have a way to try it out and give me some feedback?

If you copy and paste all the commands one after another, it should all work.

Starting from scratch made it rather long, but this way it is easily reproducible.

@anthony-khong
Member

Hi @behrica, absolutely, I'll give it a go in the coming days and report back to you! It looks really neat!

@anthony-khong
Member

Hi @behrica, I've given it a go, and, as before, it works as expected! And I've never used Azure before, so that's great! Really looking forward to merging this.

I've got some comments and feedback.

  • You need a Microsoft account to go through this tutorial. Let's say that in the prerequisites and link to https://signup.live.com/?
  • I believe many of the az commands run for a few minutes without any indication on stdout. I think we should put a warning to say that some of these commands will take a few minutes, so hang in there!
  • Kubernetes cluster: maybe only two nodes are enough for a simple tutorial? At least I got a limit-exceeded error from Azure with the default settings right after sign up.
  • pvc.yaml: should we minimise the required resources? For instance, do we need 50GB? Maybe we make it as small as possible, and say that this is something to adjust?
  • Creating the driver Docker image: we need to say explicitly to "create a Dockerfile with the following content". Also, let's use the latest Geni version? We can do it like so.
  • When running az acr login --resource-group geni-azure-demo --name genidemo18, I got the following warning: Argument 'resource_group_name' has been deprecated and will be removed in a future release. Not sure if there's another way of doing it that may be more future-proof.
  • kubectl exec -tui geni -n spark -- clj -> -ti?
  • (-> df (g/agg df (g/sum :amount)) g/show) -> get rid of df in the second form of the threading macro?
  • (g/cache df) and (-> df (g/agg df (g/sum :amount)) g/show) take a while to run! I think we should put a warning.
  • When specifying the schema, we can use the data-oriented schema such as here. At the moment it's undocumented, but I've put it in my TODO list :)
  • I assume az group delete -n geni-azure-demo cleans up everything, so that there won't be any charges?
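
The command-level suggestions in this list could look roughly like the sketch below. It assumes a POSIX shell, the resource group and registry names quoted above, and that dropping the resource-group argument from `az acr login` is enough to avoid the deprecation warning. The function names are made up for illustration, and whether deleting the resource group really stops all charges remains an assumption.

```shell
# Sketch only: names come from the feedback above, function names are hypothetical.
RESOURCE_GROUP="${RESOURCE_GROUP:-geni-azure-demo}"
REGISTRY="${REGISTRY:-genidemo18}"

login_registry() {
  # Pass only --name, leaving out the resource-group argument that triggered the warning.
  az acr login --name "$REGISTRY"
}

open_driver_repl() {
  # -t allocates a TTY and -i keeps stdin open (replaces the "-tui" typo).
  kubectl exec -ti geni -n spark -- clj
}

cleanup_demo() {
  # Delete the whole resource group; --yes skips the confirmation prompt.
  az group delete --name "$RESOURCE_GROUP" --yes
  # Prints "true" or "false"; "false" means the group no longer exists.
  az group exists --name "$RESOURCE_GROUP"
}
```

Running `cleanup_demo` at the end of the tutorial and seeing `false` would be a quick check that the group is gone.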

Please let me know if you'd like me to chip in on some of these. I'd be very happy to work on it!

I think this is a really cool guide. I would love to make a lein template that has everything here in it. All you do is make install and lein run, boom you're doing stuff on an Azure cluster.

@behrica
Collaborator Author

behrica commented Oct 22, 2020

Thanks for the feedback, I will take it on board when writing the text.
There is one piece possibly missing; not sure what you think.

Working in a "kubectl exec" terminal is not the most comfortable experience in the world.

So what I do personally is start an nREPL in the driver node, instead of shelling into it.

I can then connect to it remotely from Emacs or any other nREPL client.
This works securely by using "kubectl port-forward".

I could add this as an optional step at the end.

@anthony-khong
Member

I think that's a really good point. Copying and pasting to the terminal is not the end of the world, but it would absolutely be a deal breaker for some people. I think if the kubectl port-forward approach works, we should add it in!

I think we can have a main-like script that gets executed during startup - it can look like Geni's main, but with a different [init-eval](https://github.com/zero-one-group/geni/blob/develop/src/clojure/zero_one/geni/main.clj#L24) where we create a SparkSession that connects to an Azure cluster, we fix the port instead of picking a random one, then on a separate terminal instance we do kubectl port-forward? In any case, it is just a suggestion, please do whatever you think is best 😄

@behrica
Collaborator Author

behrica commented Oct 28, 2020

> I think that's a really good point. Copying and pasting to the terminal is not the end of the world, but it would absolutely be a deal breaker for some people. I think if the kubectl port-forward approach works, we should add it in!
>
> I think we can have a main-like script that gets executed during startup - it can look like Geni's main, but with a different [init-eval](https://github.com/zero-one-group/geni/blob/develop/src/clojure/zero_one/geni/main.clj#L24) where we create a SparkSession that connects to an Azure cluster, we fix the port instead of picking a random one, then on a separate terminal instance we do kubectl port-forward? In any case, it is just a suggestion, please do whatever you think is best

I am still not sure if I want to go the Geni CLI way.
I can clearly see its advantages, being a single executable,
especially for newcomers.

But how long will it take until I want to add dependencies to it?

@behrica
Collaborator Author

behrica commented Oct 29, 2020

> I think that's a really good point. Copying and pasting to the terminal is not the end of the world, but it would absolutely be a deal breaker for some people. I think if the kubectl port-forward approach works, we should add it in!
>
> I think we can have a main-like script that gets executed during startup - it can look like Geni's main, but with a different [init-eval](https://github.com/zero-one-group/geni/blob/develop/src/clojure/zero_one/geni/main.clj#L24) where we create a SparkSession that connects to an Azure cluster, we fix the port instead of picking a random one, then on a separate terminal instance we do kubectl port-forward? In any case, it is just a suggestion, please do whatever you think is best

I did it without using the Geni CLI, just by adding the nREPL start to the command used to start the container.
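
As a rough illustration of what such a container command could run, here is a minimal sketch. It assumes the Clojure CLI is available in the image (a version recent enough to support `-M`), that Geni and Spark are already on the classpath through the project's own deps, and it picks nREPL 0.8.3 and the function name `start_nrepl` purely for illustration. Port 12345 matches the port that is forwarded later in this thread.

```shell
# Sketch only: nREPL version and function name are assumptions, not from the thread.
NREPL_PORT="${NREPL_PORT:-12345}"

start_nrepl() {
  # -Sdeps adds nREPL to the classpath; nrepl.cmdline starts a headless
  # server on the fixed port and blocks until the process is stopped.
  clojure -Sdeps '{:deps {nrepl/nrepl {:mvn/version "0.8.3"}}}' \
    -M -m nrepl.cmdline --port "$NREPL_PORT"
}
```

Fixing the port, rather than letting nREPL pick a random one, is what makes a later `kubectl port-forward` predictable.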

This is also a potentially realistic usage scenario, in which the:

  • Kubernetes cluster runs 24 / 7
  • driver pod and its nrepl are running all the time, ready to be connected to
  • the blob storage is in place, ready to be mounted somewhere else and data can get copied to it

This is maybe still an enterprise scenario, as the Kubernetes cluster costs money for as long as it exists.
But it can be configured to autoscale down to 1 or 2 nodes, and then it costs little money.

@behrica
Collaborator Author

behrica commented Oct 29, 2020

We could potentially make a bash script which does the whole setup "on keypress",

including copying a data file into the blob storage (this can be the most time-consuming part).

This would be more attractive for users who don't want a long-running Kubernetes cluster / blob storage.

@anthony-khong
Member

> I did without using geni CLI, just by adding the nrepl start into the command used by the string container.

Yes, this is exactly what I meant! I agree with you, Geni CLI serves a simple use case to get started up and running quickly (and most realistically on a local machine).

Instead of a bash script, what do you think about making a lein template where you could just do lein setup-azure and lein repl, which starts the nREPL server ready to be port-forwarded and connected to your text editor on your laptop. Not sure if lein is divisive now though, because many people have moved to Clojure CLI tools.

What's the most effective way for me to help you with this? It sounds like a great addition to the library, and I would love to chip in here.

@behrica
Collaborator Author

behrica commented Oct 29, 2020

> > I did without using geni CLI, just by adding the nrepl start into the command used by the string container.
>
> Yes, this is exactly what I meant! I agree with you, Geni CLI serves a simple use case to get started up and running quickly (and most realistically on a local machine).
>
> Instead of a bash script, what do you think about making a lein template where you could just do lein setup-azure and lein repl, which starts the nREPL server ready to be port-forwarded and connected to your text editor on your laptop. Not sure if lein is divisive now though, because many people have moved to Clojure CLI tools.

I thought about this. The setup script is mostly calls to "az", which could easily be shelled out to, or could even use the proper Java/Clojure client.
But we have calls to "docker" in it, and it seems strange not to do them in "bash"...

I have a running bash script ready; I will share it with you and you can have a look.
Then we can discuss whether we should bring it into another form.

@behrica
Collaborator Author

behrica commented Nov 7, 2020

Please find here the working setup script:

https://github.com/behrica/geni/blob/azure_storage_doc/docs/azureSetup/setupKubernetes.sh

It does all the tasks from here:
https://github.com/behrica/geni/blob/azure_storage_doc/docs/kubernetes_azureStorage.md

in one go. At the beginning there are some parameters to be set, if needed.
It also downloads a 10 GB file at the end.
This can take quite a while, 30 minutes or more.

After it finishes, you can do 2 port forwards (in different shells):

kubectl port-forward pod/geni 12345:12345 -n spark

and

kubectl port-forward pod/geni 4040:4040 -n spark

to have the nrepl port and the spark web-gui proxied on your local machine.
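
Both forwards can also be started from a single shell. The following is only a sketch, assuming a POSIX shell and the pod name, namespace and ports from the commands above; the function name `forward_geni_ports` is made up for illustration.

```shell
# Sketch only: runs both port-forwards in the background of one shell.
forward_geni_ports() {
  pod="${1:-pod/geni}"
  namespace="${2:-spark}"

  # nREPL port of the driver pod
  kubectl port-forward "$pod" 12345:12345 -n "$namespace" &
  nrepl_pid=$!
  # Spark web UI
  kubectl port-forward "$pod" 4040:4040 -n "$namespace" &
  ui_pid=$!

  # Stop both forwards when interrupted (for example with Ctrl-C).
  trap 'kill "$nrepl_pid" "$ui_pid" 2>/dev/null' INT TERM
  wait "$nrepl_pid" "$ui_pid"
  trap - INT TERM
}
```

Calling `forward_geni_ports` blocks until it is interrupted. While it runs, an nREPL client can connect to localhost:12345 and the Spark web UI is reachable at http://localhost:4040.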

@behrica
Collaborator Author

behrica commented Nov 7, 2020

I am not sure in which form the script could be re-used, either "as is" or with some modifications.

The concrete setup has so many "moving parts" which a user may want to do differently.
Maybe it is best to see it as an addition to the docs: https://github.com/behrica/geni/blob/azure_storage_doc/docs/kubernetes_azureStorage.md
so a user can just execute it in one go, without copying and pasting the commands from the docs into a shell.
