server : Add option to return token pieces in /tokenize endpoint #9108

Merged
merged 9 commits into from
Sep 12, 2024

mathijshenquet
Contributor

Description

This PR enhances the /tokenize endpoint by adding an option to return token pieces along with their IDs. This feature allows users to easily understand what each token represents without needing to make additional API calls or perform client-side lookups.

Motivation

Often, users want to know not just the token IDs but also what these tokens represent in the original text. Previously, this required either inefficient additional API calls or complex client-side logic. This change makes it much easier and more efficient to get this information in a single request.

Changes

  1. Added a new with_pieces boolean parameter to the /tokenize endpoint.
  2. When with_pieces is true, the response includes both token IDs and their corresponding pieces.
  3. Updated the endpoint documentation to reflect the new parameter and response format.
  4. The change is backward compatible: if with_pieces is not specified or is false, the endpoint behaves as before.
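As a rough illustration of the two response shapes described in the list above, the following Python sketch builds a request body and normalizes a response. It assumes the response carries a top-level `tokens` array, holding plain integer IDs when `with_pieces` is off and objects with `id` and `piece` fields when it is on. Those field names are assumptions made for this sketch and are not stated in this description; the helper names `build_tokenize_request` and `parse_tokenize_response` are hypothetical.

```python
import json


def build_tokenize_request(content, with_pieces=False):
    """Build the JSON body for a /tokenize call (hypothetical helper)."""
    body = {"content": content}
    # Omitting with_pieces keeps the request identical to the old format.
    if with_pieces:
        body["with_pieces"] = True
    return json.dumps(body, ensure_ascii=False)


def parse_tokenize_response(payload):
    """Normalize either response shape into a list of (id, piece) pairs.

    Assumed shapes (not confirmed by the PR text):
      without pieces: {"tokens": [1, 2, 3]}
      with pieces:    {"tokens": [{"id": 1, "piece": "Hello"}, ...]}
    """
    pairs = []
    for tok in payload["tokens"]:
        if isinstance(tok, dict):
            pairs.append((tok["id"], tok.get("piece")))
        else:
            # Plain ID: no piece information was requested.
            pairs.append((tok, None))
    return pairs
```

Because the parser accepts both shapes, a client written this way keeps working against servers that predate the `with_pieces` parameter.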

Testing

  • Added unit tests to cover the new functionality.
  • Manually tested the endpoint with various inputs and parameter combinations.

Documentation

Updated the API documentation to include the new with_pieces parameter and to show example responses for both cases (with and without pieces).

NB: I wasn't able to run the CI pipeline locally as I'm currently on Windows.

@github-actions github-actions bot added the examples, python (python script changes), and server labels Aug 20, 2024
@ngxson
Collaborator

ngxson commented Aug 21, 2024

Also, this will fail if one of the pieces has incomplete unicode bytes. For example:

{"content": "媽", "with_pieces": true}

Response:

{
    "error": {
        "code": 500,
        "message": "[json.exception.type_error.316] incomplete UTF-8 string; last byte: 0xE5",
        "type": "server_error"
    }
}

@mathijshenquet
Contributor Author

> Also, this will fail if one of the pieces has incomplete unicode bytes. For example:

For pieces that are not valid UTF-8, I have added a fallback that sends the piece as a list of bytes instead.
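The fallback idea can be sketched in Python as follows. This is only an illustration of the approach, not the server's C++ code: the function names are hypothetical, and the split of "媽" after its first byte is an assumed example chosen to match the 0xE5 byte in the error message above.

```python
def encode_piece(raw: bytes):
    """Return the piece as a str if it is valid UTF-8, else as a list of byte values."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Fallback: a JSON-safe list of integers instead of an invalid string.
        return list(raw)


def decode_pieces(pieces):
    """Reassemble the original text from a mix of str and byte-list pieces."""
    buf = bytearray()
    for piece in pieces:
        if isinstance(piece, str):
            buf += piece.encode("utf-8")
        else:
            buf += bytes(piece)
    return buf.decode("utf-8")


# "媽" is the three bytes E5 AA BD in UTF-8. Assume, for illustration,
# that a tokenizer splits it after the first byte.
pieces = [encode_piece(b"\xe5"), encode_piece(b"\xaa\xbd")]
text = decode_pieces(pieces)
```

A client that receives byte-list pieces can concatenate them with the neighbouring pieces, as `decode_pieces` does, to recover the original text.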

@mathijshenquet mathijshenquet requested a review from ngxson August 21, 2024 22:46
@github-actions github-actions bot added the devops (improvements to build systems and github actions) label Aug 22, 2024
@mathijshenquet
Contributor Author

ping @ngxson

Collaborator

@ngxson ngxson left a comment

LGTM. Thanks. Let's merge after the CI passes.

Small suggestion: we already have a version of is_valid_utf8 in process_token; it would be nice to deduplicate the code by reusing it. That can be done in a follow-up PR, though.

@mofosyne mofosyne added the Review Complexity : Low (Trivial changes to code that most beginner devs, or those who want a break, can tackle, e.g. UI fix) label Aug 30, 2024
@ggerganov
Owner

@mathijshenquet Sorry for the delay. Let's resolve the conflict and merge.

@ngxson
Collaborator

ngxson commented Sep 12, 2024

I merged with upstream master. That should fix the CI problem. (Let's merge once CI passes)

@ngxson ngxson merged commit 7820364 into ggerganov:master Sep 12, 2024
54 checks passed
dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
…rganov#9108)

* server : added with_pieces functionality to /tokenize endpoint

* server : Add tokenize with pieces tests to server.feature

* Handle case if tokenizer splits along utf8 continuation bytes

* Add example of token splitting

* Remove trailing ws

* Fix trailing ws

* Maybe fix ci

* maybe this fix windows ci?

---------

Co-authored-by: Xuan Son Nguyen <[email protected]>
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024