
detokenization parallelization #37

Open
Wants to merge 17 commits into main.

Conversation

NickNickGo (Contributor):

Async detokenization

NickNickGo requested a review from a team on September 8, 2020 at 22:18.
JiushengChen (Contributor) left a comment:

Does #26 still apply with this multi-process change?

feihugis (Contributor) left a comment:

@NickNickGo It looks good to me in general. One major question is about the output order: we need to make sure the output order is the same as before.

```python
data_queue = Queue()
msg_queue = Queue()
p_list = []
threads = cpu_count()
```
Contributor:

It may be better to allow users to specify the number of CPUs.
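One way to expose this is a CLI flag that defaults to all available CPUs. This is a sketch with a hypothetical `--postprocess-workers` flag; the name and default are illustrative, not from this PR:

```python
import argparse
from multiprocessing import cpu_count

parser = argparse.ArgumentParser()
# Hypothetical flag name; the PR's actual argument may differ.
parser.add_argument("--postprocess-workers", type=int, default=cpu_count(),
                    help="number of worker processes used for detokenization")
args = parser.parse_args([])  # empty list for illustration; a real CLI uses sys.argv
threads = args.postprocess_workers
print(threads)
```

With this, `threads = cpu_count()` becomes the default rather than a hard-coded choice.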

Contributor (Author):

It shouldn't make a big difference, right? Although I can add an argument for it.

Contributor:

There should be some difference: extra processes waste CPU resources, and it also adds overhead to create and manage these processes and to sync data across them.

Member:

There is a parameter defined for this in the fairseq parallel support. GPU machines have 32/64 or more CPUs. Do you get better speed with threads > 1?

Contributor (Author):

@feihugis I added support for this.
@yuyan2do I haven't yet analyzed the effect of the number of threads on speed; let me do that.

Contributor (Author):

I didn't notice significant changes in overall time when the number of threads is changed.


```python
def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]
```
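As a quick sanity check, the generator behaves like this (restated here so the snippet runs standalone):

```python
def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

# The last chunk is shorter when len(lst) is not a multiple of n.
print(list(chunks([1, 2, 3, 4, 5], 2)))  # [[1, 2], [3, 4], [5]]
```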

```python
class IOProcess (Process) :
    """ Write detokenized output to file in order."""
    def __init__ (self, msg_queue, fout):
```
Contributor:

Suggested change:

```diff
-    def __init__ (self, msg_queue, fout):
+    def __init__(self, msg_queue, fout):
```

Contributor:

Remove the similar spaces in other places.

```python
    def run (self) :
        while (True) :
            ind, dec = self.msg_queue.get()
            if dec == GENERATE_FINISHED :
```
Contributor:

Suggested change:

```diff
-            if dec == GENERATE_FINISHED :
+            if dec == GENERATE_FINISHED:
```
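For illustration, one way such an in-order writer can work is to buffer results that arrive out of order and flush whenever the next expected index becomes available. This is a minimal sketch of the idea, not the PR's actual implementation; the sentinel value and buffering details are assumptions:

```python
from multiprocessing import Process

GENERATE_FINISHED = "done"  # illustrative sentinel; the PR's actual value may differ

class IOProcess(Process):
    """Write detokenized output to the file in input order."""

    def __init__(self, msg_queue, fout):
        super().__init__()
        self.msg_queue = msg_queue
        self.fout = fout
        self.waiting = {}    # results that arrived out of order, keyed by index
        self.next_index = 0  # next index expected in the output file

    def run(self):
        while True:
            ind, dec = self.msg_queue.get()
            if dec == GENERATE_FINISHED:
                break
            self.waiting[ind] = dec
            # Flush every result that is now contiguous with what was written.
            while self.next_index in self.waiting:
                self.fout.write(self.waiting.pop(self.next_index))
                self.next_index += 1
        self.fout.flush()
```

Because results are keyed by index, the file contents do not depend on the order in which workers finish.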

NickNickGo (Contributor, Author):

Linting checks are clean. Could you please add any additional formatting requirements to the rcfile? This will reduce formatting iterations.

feihugis (Contributor) commented Sep 14, 2020:

> Linting checks are clean. Could you please add additional formatting requirements (if any) in rcfile, this will reduce formatting iterations.

Good suggestion. The rcfile is enhanced here (#38). One thing it does not cover is the whitespace between 1) a function name and its parentheses and 2) a variable and a colon, which you need to check and remove manually, but that should be easy.

NickNickGo (Contributor, Author) commented Sep 25, 2020:

Multi-worker preprocessing: BART Large, batch size 128, 1k samples; throughput changes from 11.8 samples/s (from #40) to 12.3 samples/s.

feihugis (Contributor) left a comment:

Will the numbers in the benchmarking scripts need to be updated?

Comment on lines 32 to 34:

```python
            return_tensors="pt",
            truncation=True,
            padding="max_length")
```
Contributor:

Add these parameters to the constructor instead of hard-coding them.

NickNickGo (Contributor, Author):

@feihugis thanks, I incorporated all nitpicks.

feihugis (Contributor) left a comment:

Last comments: 1) update the benchmarking scripts, as this PR will change the performance of all the transformers models; 2) add docs for the new classes and public APIs (e.g. a short description of the API, and the types and meanings of the input args and returns).

Comment on lines 18 to 25:

```python
    def __init__(self, examples, tokenizer, model_name, prefix):
        self.examples = examples
        self.tokenizer= tokenizer
        self.model_name = model_name
        self.prefix = prefix
        self.return_tensors="pt"
        self.truncation=True
        self.padding="max_length"
```
Contributor:

I mean something like

```python
def __init__(self, examples, tokenizer, model_name, prefix,
             return_tensors, truncation, padding):
    ...
    self.return_tensors = return_tensors
    ...
```

NickNickGo (Contributor, Author):

Only HF benchmarks:

Before (without #40, #37):

| Model | W/O FastSeq (samples/s) | W/ FastSeq (samples/s) | Speedup |
|---|---|---|---|
| Bart (hf) | 3.4 | 8.1 | 2.4x |
| DistilBart (hf) | 4.0 | 8.5 | 2.1x |
| T5 (hf) | 4.8 | 7.5 | 1.6x |

After:

| Model | W/O FastSeq (samples/s) | W/ FastSeq (samples/s) | Speedup |
|---|---|---|---|
| Bart (hf) | 3.4 | 11.0 | 3.2x |
| DistilBart (hf) | 4.0 | 13.5 | 3.4x |
| T5 (hf) | 4.8 | 17.0 | 3.5x |

feihugis (Contributor):

@NickNickGo One minor question: are the numbers in the benchmark scripts based on #40 or not? If not, the benchmark script may fail when both PRs are merged.

Comment on lines +30 to +31:

```python
        self.return_tensors="pt"
        self.truncation=True
```
Contributor:

Why hard-code these here? We can make these two parameters of the constructor.


```python
class IOProcess (Process):
    """ Write detokenized output to file in order."""
    def __init__(self, msg_queue, fout):
```
Contributor:

missing docs


```python
class PostProcess(Process):
    """ Parallel detokenization """
    def __init__(self, tokenizer, data_queue, msg_queue,
```
Contributor:

missing docs.
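For context, a hedged sketch of what such a detokenization worker might look like: pull `(index, batch)` pairs from `data_queue`, decode, and forward `(index, text)` to `msg_queue` for ordered writing. The sentinel handling and the `decode` call (modeled on the Hugging Face tokenizer API) are assumptions, not the PR's actual code:

```python
from multiprocessing import Process

GENERATE_FINISHED = "done"  # illustrative sentinel; the PR's actual value may differ

class PostProcess(Process):
    """Detokenize batches of token ids in parallel.

    Pulls (index, batch) pairs from data_queue and forwards
    (index, text) pairs to msg_queue for in-order writing.
    """

    def __init__(self, tokenizer, data_queue, msg_queue):
        super().__init__()
        self.tokenizer = tokenizer
        self.data_queue = data_queue
        self.msg_queue = msg_queue

    def run(self):
        while True:
            ind, batch = self.data_queue.get()
            if batch == GENERATE_FINISHED:
                # Put the sentinel back so sibling workers also stop.
                self.data_queue.put((ind, batch))
                break
            text = self.tokenizer.decode(batch, skip_special_tokens=True)
            self.msg_queue.put((ind, text))
```

The index travels with each batch so that the writer process can restore the original order regardless of which worker finishes first.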
