Add tokenization and truncation functionality to Tokenize() function. #39

Closed
wants to merge 4 commits

Conversation

fialhocoelho

Here, we call tokenizer.encode_plus() directly because we don't want to modify tokenizer_group.encode_async, which is part of the upstream code/project. This won't affect performance because tokenizer_group.encode_async doesn't run Tokenizer.encode in a background thread.
Should we add a unit test for this implementation?
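
As a rough sketch of the approach described above (not the exact PR code), truncation via encode_plus() could look like the following, assuming a HuggingFace fast tokenizer and an illustrative truncate_input_tokens parameter:

```python
from transformers import AutoTokenizer

# Illustrative sketch only; parameter and variable names are assumptions,
# not the exact ones in this PR. Offset mapping requires a fast tokenizer.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def tokenize_with_truncation(text: str, truncate_input_tokens: int = 0):
    if truncate_input_tokens > 0:
        # Truncate to at most truncate_input_tokens tokens.
        encoding = tokenizer.encode_plus(
            text,
            truncation=True,
            max_length=truncate_input_tokens,
            return_offsets_mapping=True,
        )
    else:
        encoding = tokenizer.encode_plus(text, return_offsets_mapping=True)
    return encoding["input_ids"], encoding["offset_mapping"]
```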

@fialhocoelho fialhocoelho requested review from njhill and joerunde June 4, 2024 19:27
pytest.fail(f"gRPC call failed with error: {e}")

# Verify the response
expected_response = test_case['response']
Contributor

Better replace with test_case['response'][0] and eliminate the for loop below.
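
A hypothetical, self-contained illustration of this suggestion (assuming each test case carries exactly one expected response): index the single entry directly instead of looping over a one-element list.

```python
# Illustrative only; field names are assumptions, not the PR's proto fields.
test_case = {"response": [{"token_count": 3}]}

expected_response = test_case["response"][0]
assert expected_response["token_count"] == 3
```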

def test_tokenization(server, grpc_stub, test_case):
    """Test tokenization with the given test case."""
    text = test_case['request']['text']
    truncate_input_tokens = test_case['request'].get('truncate_input_tokens',
Contributor

Perhaps you can add a new variable request = test_case['request'] to eliminate some repetition.
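
A hypothetical illustration of the suggested refactor: bind the request dict once so the test doesn't repeat the test_case['request'] lookup (data values below are made up).

```python
# Illustrative only; the test-case contents are assumptions.
test_case = {"request": {"text": "hello world", "truncate_input_tokens": 2}}

request = test_case["request"]
text = request["text"]
truncate_input_tokens = request.get("truncate_input_tokens", 0)
assert (text, truncate_input_tokens) == ("hello world", 2)
```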

return_offsets_mapping=request.return_offsets
) # Tokenize the input text and get offset_mapping

# Tokenize the input text async
Contributor

Comment doesn't match what the code does.

token_count = len(token_ids)

# Truncate the token count if truncate_input_tokens
if 1 <= request.truncate_input_tokens < token_count:
Contributor

I think it would be easier to read as 0 < request.truncate_input_tokens < token_count.
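
For integer values the two bound checks are equivalent; the suggestion is purely about readability. A small self-contained sketch (function name is illustrative, not from the PR):

```python
def should_truncate(truncate_input_tokens: int, token_count: int) -> bool:
    # Equivalent to `1 <= truncate_input_tokens < token_count` for integers.
    return 0 < truncate_input_tokens < token_count

assert should_truncate(1, 5) is True   # smallest positive limit truncates
assert should_truncate(0, 5) is False  # 0 means "no truncation"
assert should_truncate(5, 5) is False  # limit >= token count: nothing to cut
```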


# Return the batched tokenization response
Contributor

I think you can remove most of these comments because it's fairly obvious what the code does.

@fialhocoelho
Author

Closing this PR as I have created a more organized one: #47

@fialhocoelho fialhocoelho deleted the jeff-tokenizer branch June 14, 2024 23:50