Replies: 9 comments
-
The problem with using distributed networks to train large models is the massive bandwidth needed to transfer gradients. The models would likely have billions of parameters, and sending that data back and forth between the nodes and a central server on every training step would make training infeasible because of the latency it would introduce.
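To put rough numbers on that, here is a back-of-the-envelope sketch (the model size and uplink speed are illustrative assumptions, and it assumes plain fp32 gradients with no compression or sparsification):

```python
# Rough estimate of the per-step gradient payload for a large model,
# assuming one fp32 (4-byte) gradient value per parameter and no compression.
num_parameters = 7e9          # e.g. a 7B-parameter model (illustrative figure)
bytes_per_gradient = 4        # fp32
payload_gb = num_parameters * bytes_per_gradient / 1e9
print(f"Gradient payload per node per step: {payload_gb:.0f} GB")  # ~28 GB

# On a 100 Mbit/s consumer uplink, that single transfer alone would take:
uplink_gbit_per_s = 0.1
seconds = payload_gb * 8 / uplink_gbit_per_s
print(f"Upload time per step: {seconds / 60:.0f} minutes")  # ~37 minutes
```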
-
@someone13574 In what form would these gradients be transmitted? What data sizes are we talking about? Is it possible to represent them in binary form and compress them, for example with zstd? I don't think sending huge amounts of data over the network is actually a problem; it's cheaper to buy more bandwidth than to rent another few GPUs.
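One quick way to answer the size/compression question is to try it on a synthetic gradient buffer. A minimal sketch (it assumes the `numpy` and `zstandard` packages are installed; note that raw float gradients typically compress only modestly without quantization or sparsification first):

```python
import numpy as np
import zstandard as zstd  # pip install zstandard

# Synthetic stand-in for one layer's gradients: 10M fp32 values.
grads = np.random.randn(10_000_000).astype(np.float32)
raw = grads.tobytes()

compressed = zstd.ZstdCompressor(level=3).compress(raw)
print(f"raw:        {len(raw) / 1e6:.1f} MB")
print(f"compressed: {len(compressed) / 1e6:.1f} MB "
      f"({len(compressed) / len(raw):.0%} of original)")

# Round-trip to verify the compression is lossless.
restored = np.frombuffer(zstd.ZstdDecompressor().decompress(compressed), dtype=np.float32)
assert np.array_equal(grads, restored)
```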
-
It is not super easy to do distributed training, but there are specialized groups working on it, e.g. check https://github.com/learning-at-home/hivemind
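For reference, hivemind wraps a regular PyTorch optimizer so that peers average updates over the internet. A rough sketch based on its documented quickstart (argument names may differ between hivemind versions, and `initial_peers` would need to point at a real bootstrap peer to join an existing run):

```python
import torch
import hivemind  # pip install hivemind

model = torch.nn.Linear(512, 512)                 # stand-in for the real model
base_opt = torch.optim.SGD(model.parameters(), lr=0.1)

# Start (or join) a DHT that connects this peer with the others.
dht = hivemind.DHT(start=True)                    # pass initial_peers=[...] to join an existing swarm

opt = hivemind.Optimizer(
    dht=dht,
    run_id="open_assistant_test_run",             # all peers with the same run_id train together
    batch_size_per_step=32,                       # samples processed locally per opt.step()
    target_batch_size=10_000,                     # average weights once peers jointly reach this many samples
    optimizer=base_opt,
    use_local_updates=True,                       # step locally, average parameters in the background
)
# In the training loop, opt.step() is then used in place of base_opt.step().
```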
-
@andreaskoepf there are also BOINC-driven solutions
-
I would take inspiration from LAION's earlier successes and lean more towards distributed pre-processing of training data. There is plenty of cost in things like the reinforcement learning rollouts for RLHF, Chain of Thought, and ToolFormer.
-
I think distributed pre-processing would work great for specific elements, such as the "execution of API calls" step in ToolFormer.
Excerpt from the ToolFormer paper:
You would not be transferring millions of parameters, gradients, etc. You would be sending URLs to Docker images together with the API calls you want executed, and receiving back the responses of those API calls. Validation of responses could be done by simple majority voting.
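A minimal sketch of what such a work unit and the majority-vote validation could look like (the field names and the example image URL are hypothetical, purely to illustrate the shape of the protocol; the calculator call is the kind of tool call used in the ToolFormer paper):

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

@dataclass
class WorkUnit:
    """One distributed pre-processing task: run an API call inside a pinned Docker image."""
    docker_image_url: str   # pinned image containing the tool (calculator, search, ...)
    api_call: str           # the ToolFormer-style call to execute

def majority_vote(responses: list[str]) -> Optional[str]:
    """Accept a response only if more than half of the volunteer nodes agree on it."""
    if not responses:
        return None
    value, count = Counter(responses).most_common(1)[0]
    return value if count > len(responses) / 2 else None

# Example: the same work unit was executed by three independent volunteer nodes.
unit = WorkUnit(
    docker_image_url="registry.example.org/toolformer-calculator:latest",  # hypothetical image
    api_call="Calculator(400 / 1400)",
)
responses = ["0.29", "0.29", "0.31"]        # one node returned a diverging result
print(majority_vote(responses))             # -> "0.29"
```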
-
Very interesting idea. If someone is interested in planning and prototyping this in more detail, please let us know (either here or on the OA Discord).
-
I am not really familiar with all of these, but I can implement it if I get a more in-depth explanation.
-
The amount of data in the dataset is growing at an incredible rate, so I suggest leveraging the power of the community: for example, by installing a special worker node, running in a Docker container, on users' computers, which would allow Open Assistant developers to create tasks on these nodes to help train the model (a rough sketch of such a worker follows below).
Benefits:
Cons:
Feel free to discuss
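To make the idea concrete, here is a minimal sketch of what such a volunteer node could look like: a small loop inside the Docker container that polls a hypothetical Open Assistant task coordinator, runs the task, and posts the result back. The URL, the task schema, and the `run_task` stub are all made up for illustration, and the `requests` package is assumed to be installed.

```python
import time
import requests  # pip install requests

COORDINATOR = "https://tasks.example.org"   # hypothetical Open Assistant task coordinator

def run_task(task: dict) -> dict:
    """Placeholder for the actual work, e.g. data pre-processing or an API-call rollout."""
    return {"task_id": task["id"], "result": f"processed {task['payload']!r}"}

def worker_loop(poll_interval: float = 30.0) -> None:
    """Poll for work, execute it, report the result, repeat."""
    while True:
        resp = requests.get(f"{COORDINATOR}/tasks/next", timeout=10)
        if resp.status_code == 200:
            result = run_task(resp.json())
            requests.post(f"{COORDINATOR}/results", json=result, timeout=10)
        else:
            time.sleep(poll_interval)       # no work available, back off and retry

if __name__ == "__main__":
    worker_loop()
```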