Add tokenization and truncation functionality to Tokenize() function. #39

Closed
wants to merge 4 commits

Conversation

fialhocoelho

Here, we call tokenizer.encode_plus() directly because we don't want to modify tokenizer_group.encode_async, which is part of the upstream code/project. This won't affect performance because tokenizer_group.encode_async doesn't run Tokenizer.encode in a background thread.
Should we add a unit test for this implementation?
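
As a rough sketch of the approach described above (not the exact PR code), truncation via encode_plus() could look like the following, assuming a HuggingFace fast tokenizer and an illustrative truncate_input_tokens parameter:

```python
from transformers import AutoTokenizer

# Illustrative sketch only; parameter and variable names are assumptions,
# not the exact ones in this PR. Offset mapping requires a fast tokenizer.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def tokenize_with_truncation(text: str, truncate_input_tokens: int = 0):
    if truncate_input_tokens > 0:
        # Truncate to at most truncate_input_tokens tokens.
        encoding = tokenizer.encode_plus(
            text,
            truncation=True,
            max_length=truncate_input_tokens,
            return_offsets_mapping=True,
        )
    else:
        encoding = tokenizer.encode_plus(text, return_offsets_mapping=True)
    return encoding["input_ids"], encoding["offset_mapping"]
```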

@fialhocoelho fialhocoelho requested review from njhill and joerunde June 4, 2024 19:27
pytest.fail(f"gRPC call failed with error: {e}")

# Verify the response
expected_response = test_case['response']
Contributor

Better replace with test_case['response'][0] and eliminate the for loop below.
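
A hypothetical, self-contained illustration of this suggestion (assuming each test case carries exactly one expected response): index the single entry directly instead of looping over a one-element list.

```python
# Illustrative only; field names are assumptions, not the PR's proto fields.
test_case = {"response": [{"token_count": 3}]}

expected_response = test_case["response"][0]
assert expected_response["token_count"] == 3
```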

def test_tokenization(server, grpc_stub, test_case):
    """Test tokenization with the given test case."""
    text = test_case['request']['text']
    truncate_input_tokens = test_case['request'].get('truncate_input_tokens',
Contributor

Perhaps you can add a new variable request = test_case['request'] to eliminate some repetition.
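
A hypothetical illustration of the suggested refactor: bind the request dict once so the test doesn't repeat the test_case['request'] lookup (data values below are made up).

```python
# Illustrative only; the test-case contents are assumptions.
test_case = {"request": {"text": "hello world", "truncate_input_tokens": 2}}

request = test_case["request"]
text = request["text"]
truncate_input_tokens = request.get("truncate_input_tokens", 0)
assert (text, truncate_input_tokens) == ("hello world", 2)
```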

return_offsets_mapping=request.return_offsets
) # Tokenize the input text and get offset_mapping

# Tokenize the input text async
Contributor

Comment doesn't match what the code does.

token_count = len(token_ids)

# Truncate the token count if truncate_input_tokens
if 1 <= request.truncate_input_tokens < token_count:
Contributor

I think it would be easier to read as 0 < request.truncate_input_tokens < token_count.
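
For integer values the two bound checks are equivalent; the suggestion is purely about readability. A small self-contained sketch (function name is illustrative, not from the PR):

```python
def should_truncate(truncate_input_tokens: int, token_count: int) -> bool:
    # Equivalent to `1 <= truncate_input_tokens < token_count` for integers.
    return 0 < truncate_input_tokens < token_count

assert should_truncate(1, 5) is True   # smallest positive limit truncates
assert should_truncate(0, 5) is False  # 0 means "no truncation"
assert should_truncate(5, 5) is False  # limit >= token count: nothing to cut
```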


# Return the batched tokenization response
Contributor

I think you can remove most of these comments because it's fairly obvious what the code does.

@fialhocoelho
Author

Closing this PR as I have created a more organized one: #47

@fialhocoelho fialhocoelho deleted the jeff-tokenizer branch June 14, 2024 23:50