Add tokenization and truncation functionality to Tokenize() function. #39
Conversation
```python
pytest.fail(f"gRPC call failed with error: {e}")

# Verify the response
expected_response = test_case['response']
```
Better to replace with `test_case['response'][0]` and eliminate the for loop below.
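A minimal sketch of the reviewer's suggestion, using an illustrative test-case dictionary (the values here are placeholders, not the PR's real fixtures): index the single expected response directly instead of looping over a one-element list.

```python
# Hypothetical one-element test case, shaped like the PR's fixtures.
test_case = {
    'request': {'text': 'hello'},
    'response': [{'token_count': 1}],
}

# Suggested form: take the single expected response directly,
# replacing "expected_response = test_case['response']" plus a for loop.
expected_response = test_case['response'][0]
assert expected_response['token_count'] == 1
```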
```python
def test_tokenization(server, grpc_stub, test_case):
    """Test tokenization with the given test case."""
    text = test_case['request']['text']
    truncate_input_tokens = test_case['request'].get('truncate_input_tokens',
```
Perhaps you can add a new variable `request = test_case['request']` to eliminate some repetition.
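A small sketch of that refactor, with an illustrative test-case dictionary (values are placeholders): hoisting the `test_case['request']` lookup into a local variable removes the repeated indexing.

```python
# Hypothetical test case, shaped like the PR's fixtures.
test_case = {
    'request': {'text': 'hello world', 'truncate_input_tokens': 2},
}

# Hoist the repeated lookup into a local variable, as suggested.
request = test_case['request']
text = request['text']
truncate_input_tokens = request.get('truncate_input_tokens', 0)
```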
```python
    return_offsets_mapping=request.return_offsets
)  # Tokenize the input text and get offset_mapping

# Tokenize the input text async
```
Comment doesn't match what the code does.
```python
token_count = len(token_ids)

# Truncate the token count if truncate_input_tokens
if 1 <= request.truncate_input_tokens < token_count:
```
I think it would be easier to read `0 < request...tokens < token_count`.
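For integer token counts the two conditions are equivalent, so the change is purely about readability. A quick check over a range of illustrative values:

```python
# For integers, "1 <= x" and "0 < x" are the same predicate, so the
# original and suggested guard conditions always agree.
token_count = 10  # illustrative value

for truncate_input_tokens in range(-1, 12):
    original = 1 <= truncate_input_tokens < token_count
    suggested = 0 < truncate_input_tokens < token_count
    assert original == suggested
```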
```python
# Return the batched tokenization response
```
I think you can remove most of these comments because it's fairly obvious what the code does.
Closing this PR as I have created a more organized one: #47
Here, we use `tokenizer.encode_plus()` directly because we don't want to modify `tokenizer_group.encode_async`, and that function is part of the upstream code/project. This implementation won't affect performance because `tokenizer_group.encode_async` doesn't run `Tokenizer.encode` in a background thread. Should we add a unit test for this implementation?
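For context on what the offset-mapping output looks like, here is a toy, self-contained stand-in: a real Hugging Face fast tokenizer's `encode_plus(text, return_offsets_mapping=True)` returns a dict with `input_ids` and `offset_mapping` (character start/end spans per token). The whitespace tokenizer below only illustrates the shape of that output and is not the PR's actual code.

```python
# Toy stand-in for tokenizer.encode_plus(text, return_offsets_mapping=True).
# It splits on whitespace and fakes vocabulary ids; a real fast tokenizer
# produces the same dict shape with subword tokens and real ids.
def encode_plus_sketch(text):
    input_ids, offset_mapping = [], []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)     # character offset of this token
        end = start + len(token)
        input_ids.append(hash(token) % 1000)  # fake vocabulary id
        offset_mapping.append((start, end))
        pos = end
    return {'input_ids': input_ids, 'offset_mapping': offset_mapping}

result = encode_plus_sketch("hello world")
# offset_mapping holds character spans, e.g. [(0, 5), (6, 11)]
```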