Cpp-Tiktoken

This is a C++ implementation of the tiktoken tokenizer library. It was heavily inspired by https://github.com/dmitry-brazhenko/SharpToken

To use it, first add lines like the following somewhere in your project:

    #include "tiktoken/encoding.h"

    ....

    auto encoder = GptEncoding::get_encoding(<model name>);

The value returned from this function is a std::shared_ptr, so you do not have to manage its memory.

Supported language models that you can pass as a parameter to this function are:

    LanguageModel::O200K_BASE
    LanguageModel::CL100K_BASE 
    LanguageModel::R50K_BASE
    LanguageModel::P50K_BASE
    LanguageModel::P50K_EDIT

After obtaining an encoder, you can then call

    auto tokens = encoder->encode(string_to_encode);

This returns a vector of the tokens for that language model.

You can decode a vector of tokens back into its original string with

    auto string_value = encoder->decode(tokens);

If you like this project and find it useful, you are invited to donate whatever amount you believe is appropriate via PayPal to markt AT nerdflat.com. There is absolutely no obligation to donate.
