Merge pull request #1 from semji/improve/encoder
Improve encoder
srogier authored Jan 27, 2023
2 parents 32344c0 + 222bfd2 commit 55922d5
Showing 11 changed files with 2,391 additions and 339 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -0,0 +1 @@
+vendor
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,4 @@
+# v1.0.0
+
+- Fork of https://github.com/CodeRevolutionPlugins/GPT-3-Encoder-PHP
+- Use an object-oriented structure
23 changes: 6 additions & 17 deletions README.md
@@ -2,27 +2,16 @@
PHP BPE Text Encoder for GPT-2 / GPT-3

## About
GPT-2 and GPT-3 use byte pair encoding to turn text into a series of integers to feed into the model. This is a PHP implementation of OpenAI's original Python encoder, which can be found [here](https://github.com/openai/gpt-2). The main source of inspiration for writing this encoder was the Node.js version, found [here](https://github.com/latitudegames/GPT-3-Encoder).
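
For intuition, byte pair encoding starts from individual symbols and repeatedly merges the most frequent adjacent pair into a new symbol. A toy sketch of one merge step follows; this is an illustration only, not this package's implementation, which applies OpenAI's pretrained merge table rather than corpus counts:

```php
<?php
// Toy illustration of a single byte-pair-encoding merge step.

// Find the most frequent adjacent pair of symbols.
function mostFrequentPair(array $symbols): ?array
{
    $counts = [];
    for ($i = 0; $i < count($symbols) - 1; $i++) {
        $key = $symbols[$i] . "\x00" . $symbols[$i + 1];
        $counts[$key] = ($counts[$key] ?? 0) + 1;
    }
    if ($counts === []) {
        return null;
    }
    arsort($counts);
    return explode("\x00", array_key_first($counts), 2);
}

// Replace every occurrence of the pair with one merged symbol.
function mergePair(array $symbols, array $pair): array
{
    $out = [];
    for ($i = 0; $i < count($symbols); $i++) {
        if ($i + 1 < count($symbols) && $symbols[$i] === $pair[0] && $symbols[$i + 1] === $pair[1]) {
            $out[] = $pair[0] . $pair[1];
            $i++; // consume both symbols of the merged pair
        } else {
            $out[] = $symbols[$i];
        }
    }
    return $out;
}

$symbols = str_split('banana');     // ['b','a','n','a','n','a']
$pair = mostFrequentPair($symbols); // ['a','n']
print_r(mergePair($symbols, $pair)); // ['b','an','an','a']
```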

You can test the results by comparing the output generated by this script with the [official tokenizer page from OpenAI](https://beta.openai.com/tokenizer).

This specific encoder is used in one of my [WordPress plugins](https://coderevolution.ro) to count the number of tokens a string will use when sent to the OpenAI API.

+A copy of https://github.com/CodeRevolutionPlugins/GPT-3-Encoder-PHP, adapted to fit our usage.

## Usage

The mbstring PHP extension is required for this tool to work correctly when non-ASCII characters are present in the tokenized text: [details on how to install mbstring](https://www.php.net/manual/en/mbstring.installation.php).

+PHP 8.1 or higher is also required.
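
Both requirements can be verified up front with standard PHP functions; a small sketch, not part of the package:

```php
<?php
// Fail fast if the runtime does not meet the requirements above.
if (PHP_VERSION_ID < 80100) {
    exit('PHP 8.1 or higher is required, found ' . PHP_VERSION . PHP_EOL);
}
if (!extension_loaded('mbstring')) {
    exit('The mbstring extension is missing.' . PHP_EOL);
}
echo 'Environment OK' . PHP_EOL;
```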

```php
-$prompt = "Many words map to one token, but some don't: indivisible. Unicode characters like emojis may be split into many tokens containing the underlying bytes: 🤚🏾 Sequences of characters commonly found next to each other may be grouped together: 1234567890";
-$token_array = gpt_encode($prompt);
+use Semji\GPT3Tokenizer\Encoder;
+
+$prompt = "Many words map";
+$encoder = new Encoder();
+$encoder->encode($prompt);
```
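
For the token-counting use case mentioned above, a sketch assuming encode() returns the array of token ids (as the old gpt_encode() did) and that the package is installed via Composer:

```php
<?php

require 'vendor/autoload.php'; // Composer autoloader (PSR-4 mapping added in composer.json)

use Semji\GPT3Tokenizer\Encoder;

$encoder = new Encoder();
$tokens = $encoder->encode("Many words map to one token, but some don't: indivisible.");

// The token count is what matters when budgeting an OpenAI API request.
echo count($tokens) . ' tokens' . PHP_EOL;
print_r($tokens); // the underlying token ids
```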


## TODO

Also create a decoder for the package; currently only an encoder is implemented.
15 changes: 13 additions & 2 deletions composer.json
@@ -1,5 +1,5 @@
 {
-    "name": "coderevolutionplugins/gpt-3-encoder-php",
+    "name": "semji/gpt-3-encoder-php",
     "description": "PHP BPE Text Encoder for GPT-2 / GPT-3",
     "type": "library",
     "license": "MIT",
@@ -9,6 +9,17 @@
             "email": "[email protected]"
         }
     ],
+    "autoload": {
+        "psr-4": {
+            "Semji\\GPT3Tokenizer\\": "src"
+        }
+    },
     "minimum-stability": "stable",
-    "require": {}
+    "require": {
+        "php": "^8.1",
+        "ext-mbstring": "*"
+    },
+    "require-dev": {
+        "phpunit/phpunit": "^9.5"
+    }
 }
(The remaining 7 changed files are not shown.)
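
With phpunit/phpunit in require-dev and the new PSR-4 autoload mapping, tests can target the Encoder API shown in the README. A minimal illustrative sketch; this is not one of the commit's files, and the assertions assume encode() returns a non-empty array of token ids:

```php
<?php

use PHPUnit\Framework\TestCase;
use Semji\GPT3Tokenizer\Encoder;

// Illustrative test only; run with vendor/bin/phpunit.
final class EncoderTest extends TestCase
{
    public function testEncodeReturnsTokens(): void
    {
        $encoder = new Encoder();
        $tokens = $encoder->encode('Many words map');

        $this->assertIsArray($tokens);
        $this->assertNotEmpty($tokens);
    }
}
```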
