This repository has been archived by the owner on Dec 26, 2018. It is now read-only.

How to save the trained model to HDFS? #7

Open
echoyes opened this issue Aug 29, 2016 · 3 comments

Comments

@echoyes

echoyes commented Aug 29, 2016

After training the model, I can't find an interface for saving the trained model to HDFS. Is there any way to solve this problem? Many thanks.

@illuzen
Contributor

illuzen commented Aug 30, 2016

We originally had an API for saving it to local disk, for example here

self.saver.save(self.model.session, './models/parameter_server_model', global_step=int(time.time()))

However, we took it out when we removed the Sacred integration, I think. We just use TF's Saver object: https://www.tensorflow.org/versions/r0.10/api_docs/python/state_ops.html#Saver

As for persisting the model to HDFS, I imagine you can hook the Saver up to HDFS through whatever HDFS API you are using. In tensorspark, the model is not distributed; the driver and each worker have their own copy of the model. I hope this helps.
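For example, here is a minimal sketch of what that hookup could look like, assuming you just shell out to the Hadoop CLI: checkpoint locally with tf.train.Saver, then push the checkpoint files to HDFS. The save_to_hdfs helper, the directory names, and the CLI approach are placeholders, not part of tensorspark.

# Rough sketch, not tensorspark code: checkpoint locally with TF's Saver,
# then copy the files to HDFS with the Hadoop CLI. Paths are placeholders.
import glob
import subprocess
import time

import tensorflow as tf

def save_to_hdfs(session, local_dir='./models', hdfs_dir='/models'):
    saver = tf.train.Saver()  # saves all saveable variables by default
    local_path = saver.save(
        session,
        local_dir + '/parameter_server_model',
        global_step=int(time.time()))
    # saver.save writes the checkpoint data plus a .meta graph file; push both.
    subprocess.check_call(['hdfs', 'dfs', '-mkdir', '-p', hdfs_dir])
    for path in glob.glob(local_path + '*'):
        subprocess.check_call(['hdfs', 'dfs', '-put', '-f', path, hdfs_dir])
    return local_path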

@echoyes
Author

echoyes commented Aug 30, 2016

@snakecharmer1024 thanks for answering. I have deployed tensorspark on a cluster with 6 workers and a driver. As far as I can see, tensorspark is data parallel. One question remains: if I hook in model-saving code in the on_close function, does the model have the gradients from all the workers integrated by that point? Also, will there be a model-parallel version in the future?

@illuzen
Contributor

illuzen commented Aug 31, 2016

Glad to hear someone is using it! Yes, tensorspark is data parallel.

I don't think on_close is where you want to save, as there is one ParameterServerWebsocketHandler per websocket connection, so you would be saving the model 6 times for 6 workers. It would, however, probably guarantee getting the most up-to-date gradients if you had the workers close the websocket connection...

You could probably put the saving code at the end of train_epochs and it would provide the same guarantee.
https://github.com/adatao/tensorspark/blob/master/tensorspark.py#L153
Since the gradients have to be pushed before returning... although it's asynchronous, so conceivably the last gradient could still be in transit when train_partition returns... you might be able to just grab the lock before saving (see the sketch below):
https://github.com/adatao/tensorspark/blob/master/tensorspark.py#L84
https://github.com/adatao/tensorspark/blob/master/parameterwebsocketclient.py#L51
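For concreteness, here is a rough sketch of what "grab the lock before saving" could look like; lock, session, and saver stand in for whatever the parameter server object actually exposes in tensorspark.py, so treat the names as assumptions.

import threading
import time

import tensorflow as tf

# Sketch only; in tensorspark the lock, session, and saver would come from the
# parameter server rather than being created at module level like this.
lock = threading.Lock()

def save_checkpoint(session, saver, path='./models/parameter_server_model'):
    # Hold the lock so no asynchronous gradient update is applied to the
    # model while the checkpoint is being written.
    with lock:
        return saver.save(session, path, global_step=int(time.time()))

Calling something like this at the end of train_epochs would checkpoint the driver's copy of the model after all partitions have reported back.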

As you can probably see, we used this for collecting data, but did not productionize it.

There are no current plans to make a model parallel version, but pull requests are welcome!
