-
Notifications
You must be signed in to change notification settings - Fork 467
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding parallelism using ProcessPoolExecutor and concurrent.futures #314
base: master
Are you sure you want to change the base?
Conversation
Hi, Many thanks for your contribution to the examples!
This function seems too specific for a certain task.
Looks like you init a VAD each time the new file gets processed. For long files this overhead will be negligible, but I believe we should do our best to provide the best example we can. Because sometimes, e.g. in a multi-threaded application, it may lead to painful memory leaks. You see, if you are using PyTorch (not ONNX), it actually creates a reference to an underlying object, and in this particular case it hardly matters, but in general I guess a proper way is to init a separate model instance per process. |
Hello @snakers4 Spent a few hours working to pickle the model. And found a small workaround please take a look at the new commit. Cleaned up the code a bit as well. |
Hi,
Looks like this is not used.
Looks like you nevertheless initialize the model in the main process, and then pass it as a reference to process pool executor processes. Why not just init a separate instance of model in each child process, i.e. in
You do not really need to pass any parameters to the init. You can write some plain logs to make sure that several instances of model were created in parallel. |
This comment was marked as off-topic.
This comment was marked as off-topic.
For a real-time system (the VAD is intended for this use-case) latency of about 500ms when processing one 30ms chunk takes about 1ms is not very good. Also batching works fine when you process a lot of files locally, but it becomes problematic when you can receive chunks asynchronously. Of course you can keep the state for each batch element and pass it back and forth for a next chunk in a messaging-like system, but we decided not to maintain public examples of such complexity. |
Well, I still don't see why provide example of batch size=1? Like if someone has 200 files those can be processed by a single executor with batch size 200, not by 200 in parallel. |
Oh that's the whole processing time for 50 wavs of ~10s each. Chunk-wise it was obviously 0.5s/nchunks ~ 5ms.
Well, it's not terribly complicated. I just implemented it (save/restore) for our own use in couple of hours. Arguably to make the best use of your GPU resources (i.e. execution units, ram, scheduling etc), you need to batch up local tasks, not to multiplex them. With proper batching I can multiplex small number of functionally different models (i.e. TTS, STT, VAD etc) for say 50 channels on a single GPU today in a real-time, but I cannot multiplex 150 copies of the same 3 models. Well, I can but on 10x number of GPUs. ;-) So giving only examples of batch size 1 promotes the most inefficient way to use the model, IMHO. |
@snakers4 P.S. Sorry for hiding that comment, it's bit late here, I wanted to give it a fresh read and re-post in the morning but you have beaten me to it. 😂 |
Do not use python for concurrency, GIL will generate cpu spikes |
adding parallelism using ProcessPoolExecutor and concurrent.futures