MLH fellowship contribution: adding the `laser_encoders` module (#249)
* feat: converted SPMapply function to use python script
* modified laserTokenizer class to have a separate function for tokenizing a file
* modified tokenize_file function
* removed instances of Path
* created new function for opening files
* test for LaserTokenizer.tokenize
* tests for normalisation, descape and lower_case
* deleted test dir because of relative import error
* modified test tokenizer function to use the downloaded model before exiting the context manager
* test for tokenize_file
* added test for is_printable
* test for over_write when equal to True and False
* added some type hints for tests
* added type hint for log function
* added header comment
* feat: make LASER pip installable (#239)
* feat: make LASER pip installable
* Added GitHub Actions workflow for tests and linting
* upgraded python version due to node deprecation error
* removed updated python version
* removed poetry
* bug fixes
* removed dependencies install
* updated pyproject and made lint_and_test to install dev and mono dependencies
* removed isort and black
* removed mono dependencies
* removed version from pyproject
* removed duplicate of classifiers
* removed description
* removed dynamic
* added src-layout to discover only laser_encoder
* added build backend
* updated project name
* changed license to BSD
* removed src-layout to test
* added linting to actions
* updated linting to only check the laser_encoders folder
* fixed linting issues
* fixed black linting issues
* added white-space
* Refactor embedder (#241)
* feat: make LASER pip installable
* Added GitHub Actions workflow for tests and linting
* upgraded python version due to node deprecation error
* removed updated python version
* removed poetry
* bug fixes
* removed dependencies install
* updated pyproject and made lint_and_test to install dev and mono dependencies
* removed isort and black
* removed mono dependencies
* removed version from pyproject
* removed duplicate of classifiers
* removed description
* removed dynamic
* added src-layout to discover only laser_encoder
* added build backend
* updated project name
* changed license to BSD
* removed src-layout to test
* added linting to actions
* updated linting to only check the laser_encoders folder
* fixed linting issues
* fixed black linting issues
* added white-space
* refactored embedder to work in the laser tokenizer package
* downgraded numpy version to suit the installed python version
* added test for sentence encoder
* added whitespace to test workflow
* restructured test for sentence encoder
* restructured test for sentence encoder
* fixed black issues
* restructured test for sentence encoder
* changed python version because of workflow error
* updated dependencies requirements version
* removed unnecessary print statement
* updated python version
* restructured test_sentence_encoder
* restructured test_sentence encoder
* black linting fixes
* restructure calling of tempfile module
* updated workflow to remove pip cache
* removed commented code
* refactored code and added type hints
* fixed black issues
* fixed no module found error by adding Laser environment
* feat: Add Python function to download LASER models (#244)
* feat: make LASER pip installable
* Added GitHub Actions workflow for tests and linting
* upgraded python version due to node deprecation error
* removed updated python version
* removed poetry
* bug fixes
* removed dependencies install
* updated pyproject and made lint_and_test to install dev and mono dependencies
* removed isort and black
* removed mono dependencies
* removed version from pyproject
* removed duplicate of classifiers
* removed description
* removed dynamic
* added src-layout to discover only laser_encoder
* added build backend
* updated project name
* changed license to BSD
* removed src-layout to test
* added linting to actions
* updated linting to only check the laser_encoders folder
* fixed linting issues
* fixed black linting issues
* added white-space
* refactored embedder to work in the laser tokenizer package
* downgraded numpy version to suit the installed python version
* added test for sentence encoder
* added whitespace to test workflow
* restructured test for sentence encoder
* restructured test for sentence encoder
* fixed black issues
* restructured test for sentence encoder
* changed python version because of workflow error
* updated dependencies requirements version
* removed unnecessary print statement
* updated python version
* restructured test_sentence_encoder
* restructured test_sentence encoder
* black linting fixes
* restructure calling of tempfile module
* updated workflow to remove pip cache
* removed commented code
* refactored code and added type hints
* fixed black issues
* fixed no module found error by adding Laser environment
* feat: created download function for downloading laser models in python
* added language list and made some changes to the download models
* fixed linting issues
* added type hints
* fixed linting issues
* added progress bar for downloading of models
* fixed black issues
* updated code to download laser model based on where the language is found
* fixed black and linting issues
* fixed black issues
* fixed bug in sentence encoder
* black issues and relative import issues
* removed addition of laser path
* fixed isort issues
* refactored the python entrypoint functions
* fixed black issues
* updated language list with some laser2 and laser3 languages
* refactor: added option for laser
* added laser2 language list
* added laser3 language list
* fixed black issues
* updated language list
* refactored download function to display total filesize in MB and also made some changes to raise an error when laser is not passed
* fixed black issues
* refactored download models to move model_dir to the class
* fixed black issues
* refactored laser tokenizer test to use the laser downloader class methods
* documentation for the laser_encoder
* added tokenizer part
* added some docs for tokenize file and download models
* updated readme to include supported flores200 langs
* corrected readme path and license
* added requirements for laser_encoder
* added __main__.py file for running download command easily
* black and isort fixes, updated docs to effect changes due to creation of __main__.py file
* added contributors section
* Revert "added requirements for laser_encoder"
  This reverts commit 431780e.
  reverting back
* reverting creation of main.py
* fixed isort and black issues
* removed irrelevant comment
* moved pyproject to laser directory and adjust contributors name
* workflow issues due to removal of pyproject
* pointed workflow to laser_encoders dir
* fixed EOF error
* fixed EOF error
* debugging
* debugging
* debugging
* debugging
* debugging
* debugging
* debugging
* debugging
* debugging
* debugging
* debugging
* debugging
* bug fixes and new implementation of convert_tokens_to_id function
* bug fix
* bug fix
* bug fix
* bug fix
* bug fix
* bug fix
* bug fix
* bug fix
* bug fix
* reverting back because of workflow error
* reverting back because of workflow error
* some extra adjustment
* changed ibo to igbo
* updated doc to effect the ibo to igbo change
* refactor: modified the sentence encoder to tokenize a text before encoding it
* debugging failed test
* added a call method to separately handle the tokenization before encoding
* added value error for when there is no spm_model
* documentation for the new __call__ method for tokenization with encoder
* docs: Update docs to include reference to laserembeddings (#254)
* Handle Interrupted Model Weight Downloads (#253)
* fix: Fix interrupted downloads issue
* style: Format code using black
* Update download method to use tempfile
* style: Remove unnecessary space
* Fix OSError by using shutil.move for cross-filesystem moves
  Using os.rename caused an OSError when trying to move files across different filesystems (e.g., from /tmp to another directory). By using shutil.move, we gracefully handle such situations, ensuring files are moved correctly regardless of the source and destination filesystems.
* Refactor `initialize_encoder` to `LaserEncoderPipeline` (#256)
* Remove 'tokenize' argument from initialize_encoder function
* Add LaserEncoderPipeline for streamlined tokenization and encoding
* docs: Update README to show use of LaserEncoderPipeline
* style: Reformat code using black
* refactor: move encoder and tokenizer initialization into respective files
* style: run black
* test: Add test for LaserEncoderPipeline
* test to validate languages
* test to validate languages
* Delete flores directory
* Update validate_models.py
* Update validate_models.py
* Update validate_models.py
* Update validate_models.py
* Update .gitignore
* added pytest to validate_models.py
* Update validate_models.py
* Update validate_models.py
* Update validate_models.py using mock downloader
* Update validate_models.py
* Update validate_models.py
* Update validate_models.py
* Update validate_models.py
* Extend Tokenizer to Support Single Strings and Lists of Strings (#258)
* Handle case for both str and list in tokenizer
* test: Add test for tokenizer call method
* Rename 'sentences' argument to 'text_or_batch' for clarity
* Handle string input in call method
* Update validate_models.py
* Update download_models.py according to 1.
* Update download_models.py
* Update download_models.py
* Update download_models.py
* Update download_models.py
* Enhance LaserTokenizer with Perl Parity, Optional Punctuation Normalization, and Embedding Normalization (#262)
* Introduce Perl compatibility flag
* Add argument `normalize_punct` to `LaserTokenizer`
* Add normalize_embeddings option to encode_sentences
* Update README on normalize_embeddings option
* style: Run black and isort
* test: Add tests for normalize_embeddings flag in sentence encoder
* style: Run black
* Update validate_models.py
* Update models.py
* Update laser_tokenizer.py
* Update download_models.py
* Update validate_models.py
* Update validate_models.py
* Added slow and fast tests to validate_models.py
* Update validate_models.py
* Update validate_models.py
* Create test_validate_models.py
* Rename test_validate_models.py to test_models_initialization.py
* Update test_models_initialization.py
* Update test_models_initialization.py
* Update download_models.py
* Update test_models_initialization.py
* Update test_models_initialization.py
* Update download_models.py
* Update validate_models.py
* Update validate_models.py
* Update validate_models.py
* Update validate_models.py
* Update validate_models.py
* Update validate_models.py
* Update validate_models.py
* Update validate_models.py
* Update README.md
* Update README.md
* Decrease versions of numpy and torch required by laser-encoders (#264)
* Update requirements to follow fairseq
* Update README
* Update dependencies in toml file
* Remove requirements.txt
* Update laser_encoders README
* resolve parity with MOSES-4.0 release
* update test
* Update the main README file with a mention of `laser_encoders` (#266)
* update the main readme file
* wording changes
* update the example in the readme
* fix readme text
* Update language_list.py (#269)
* Update language_list.py
* Update language_list.py
* Update language_list.py
* Updated laser encoder pipeline
* Update models.py
* Update models.py
* Added warning for using laser2 with a language
* add tests to test_laser_tokenizer.py
* Update test_laser_tokenizer.py
* Update models.py
* Update test_laser_tokenizer.py
* Update test_laser_tokenizer.py
* Update language_list.py
* Update language_list.py
* Update language_list.py

---------

Co-authored-by: CaptainVee <[email protected]>
Co-authored-by: Victor Joseph <[email protected]>
Co-authored-by: Kevin Heffernan <[email protected]>
Co-authored-by: Okewunmi Paul <[email protected]>
Co-authored-by: NIXBLACK11 <[email protected]>
Co-authored-by: Siddharth Singh Rana <[email protected]>
Co-authored-by: Kevin Heffernan <[email protected]>
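
The interrupted-download fix described in the message (#253) writes to a temporary file and then moves it into place with shutil.move, because os.rename raises an OSError across filesystems. The sketch below illustrates that general pattern using only the standard library. It is not the code from this commit: the helper name `save_via_tempfile` and its parameters are hypothetical, and the real download logic in `download_models.py` may differ.

```python
import os
import shutil
import tempfile


def save_via_tempfile(data: bytes, destination: str) -> str:
    """Write data to a temp file, then move it to destination.

    Hypothetical helper for illustration only; not part of laser_encoders.
    """
    # The temp file usually lives under the system temp dir (e.g. /tmp),
    # which may be on a different filesystem than the destination.
    fd, tmp_path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as tmp_file:
            tmp_file.write(data)
        os.makedirs(os.path.dirname(os.path.abspath(destination)), exist_ok=True)
        # shutil.move uses os.rename when it can and otherwise copies the
        # file and removes the source, so cross-filesystem moves succeed.
        shutil.move(tmp_path, destination)
    except BaseException:
        # A failed or interrupted write leaves no partial file at destination.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return destination
```

Because the destination path only appears once the write has finished, an interrupted run should leave nothing there that a later run could mistake for a complete model file.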