Merge pull request #1 from semji/improve/encoder
Improve encoder
srogier authored Jan 27, 2023
2 parents 32344c0 + 222bfd2 commit 55922d5
Showing 11 changed files with 2,391 additions and 339 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -0,0 +1 @@
+vendor
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,4 @@
+# v1.0.0
+
+- Fork of https://github.com/CodeRevolutionPlugins/GPT-3-Encoder-PHP
+- Use an object-oriented structure
23 changes: 6 additions & 17 deletions README.md
@@ -2,27 +2,16 @@
PHP BPE Text Encoder for GPT-2 / GPT-3

## About
GPT-2 and GPT-3 use byte pair encoding to turn text into a series of integers to feed into the model. This is a PHP implementation of OpenAI's original Python encoder, which can be found [here](https://github.com/openai/gpt-2). The main source of inspiration for writing this encoder was the Node.js version, found [here](https://github.com/latitudegames/GPT-3-Encoder).
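
For intuition, byte pair encoding starts from individual symbols and repeatedly merges the most frequent adjacent pair into a new symbol. A toy sketch of one merge step follows; this is an illustration only, not this package's implementation, which applies OpenAI's pretrained merge table rather than corpus counts:

```php
<?php
// Toy illustration of a single byte-pair-encoding merge step.

// Find the most frequent adjacent pair of symbols.
function mostFrequentPair(array $symbols): ?array
{
    $counts = [];
    for ($i = 0; $i < count($symbols) - 1; $i++) {
        $key = $symbols[$i] . "\x00" . $symbols[$i + 1];
        $counts[$key] = ($counts[$key] ?? 0) + 1;
    }
    if ($counts === []) {
        return null;
    }
    arsort($counts);
    return explode("\x00", array_key_first($counts), 2);
}

// Replace every occurrence of the pair with one merged symbol.
function mergePair(array $symbols, array $pair): array
{
    $out = [];
    for ($i = 0; $i < count($symbols); $i++) {
        if ($i + 1 < count($symbols) && $symbols[$i] === $pair[0] && $symbols[$i + 1] === $pair[1]) {
            $out[] = $pair[0] . $pair[1];
            $i++; // consume both symbols of the merged pair
        } else {
            $out[] = $symbols[$i];
        }
    }
    return $out;
}

$symbols = str_split('banana');     // ['b','a','n','a','n','a']
$pair = mostFrequentPair($symbols); // ['a','n']
print_r(mergePair($symbols, $pair)); // ['b','an','an','a']
```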

You can test the results by comparing the output generated by this script with the [official tokenizer page from OpenAI](https://beta.openai.com/tokenizer).

This specific encoder is used in one of my [WordPress plugins](https://coderevolution.ro) to count the number of tokens a string will use when sent to the OpenAI API.

+A copy of https://github.com/CodeRevolutionPlugins/GPT-3-Encoder-PHP, adapted to fit our usage.

## Usage

The mbstring PHP extension is required for this tool to work correctly when non-ASCII characters are present in the tokenized text: [details on how to install mbstring](https://www.php.net/manual/en/mbstring.installation.php).

+PHP 8.1 or higher is also required.
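
Both requirements can be verified up front with standard PHP functions; a small sketch, not part of the package:

```php
<?php
// Fail fast if the runtime does not meet the requirements above.
if (PHP_VERSION_ID < 80100) {
    exit('PHP 8.1 or higher is required, found ' . PHP_VERSION . PHP_EOL);
}
if (!extension_loaded('mbstring')) {
    exit('The mbstring extension is missing.' . PHP_EOL);
}
echo 'Environment OK' . PHP_EOL;
```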

```php
-$prompt = "Many words map to one token, but some don't: indivisible. Unicode characters like emojis may be split into many tokens containing the underlying bytes: 🤚🏾 Sequences of characters commonly found next to each other may be grouped together: 1234567890";
-$token_array = gpt_encode($prompt);
+use Semji\GPT3Tokenizer\Encoder;
+
+$prompt = "Many words map";
+$encoder = new Encoder();
+$encoder->encode($prompt);
```
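
For the token-counting use case mentioned above, a sketch assuming encode() returns the array of token ids (as the old gpt_encode() did) and that the package is installed via Composer:

```php
<?php

require 'vendor/autoload.php'; // Composer autoloader (PSR-4 mapping added in composer.json)

use Semji\GPT3Tokenizer\Encoder;

$encoder = new Encoder();
$tokens = $encoder->encode("Many words map to one token, but some don't: indivisible.");

// The token count is what matters when budgeting an OpenAI API request.
echo count($tokens) . ' tokens' . PHP_EOL;
print_r($tokens); // the underlying token ids
```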


## TODO

Also create a decoder for the package; currently only an encoder is implemented.
15 changes: 13 additions & 2 deletions composer.json
@@ -1,5 +1,5 @@
 {
-    "name": "coderevolutionplugins/gpt-3-encoder-php",
+    "name": "semji/gpt-3-encoder-php",
     "description": "PHP BPE Text Encoder for GPT-2 / GPT-3",
     "type": "library",
     "license": "MIT",
@@ -9,6 +9,17 @@
             "email": "[email protected]"
         }
     ],
+    "autoload": {
+        "psr-4": {
+            "Semji\\GPT3Tokenizer\\": "src"
+        }
+    },
     "minimum-stability": "stable",
-    "require": {}
+    "require": {
+        "php": "^8.1",
+        "ext-mbstring": "*"
+    },
+    "require-dev": {
+        "phpunit/phpunit": "^9.5"
+    }
 }
(The remaining 7 changed files are not shown.)
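
With phpunit/phpunit in require-dev and the new PSR-4 autoload mapping, tests can target the Encoder API shown in the README. A minimal illustrative sketch; this is not one of the commit's files, and the assertions assume encode() returns a non-empty array of token ids:

```php
<?php

use PHPUnit\Framework\TestCase;
use Semji\GPT3Tokenizer\Encoder;

// Illustrative test only; run with vendor/bin/phpunit.
final class EncoderTest extends TestCase
{
    public function testEncodeReturnsTokens(): void
    {
        $encoder = new Encoder();
        $tokens = $encoder->encode('Many words map');

        $this->assertIsArray($tokens);
        $this->assertNotEmpty($tokens);
    }
}
```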
