Added task_type parameter for correct model loading #245

Merged

32 commits
5bbb23b
started adding task_type
MarleneKress79789 Jun 26, 2024
ea94780
started adding quality control tests
MarleneKress79789 Jun 28, 2024
27b6fb0
fixed integration tests, added separate models for tasks
MarleneKress79789 Jul 18, 2024
61df528
cleanup and renamings, added docu for new parameters
MarleneKress79789 Jul 31, 2024
f8d5721
Merge branch 'main' into bug/add_task_type_parameter_for_correct_mode…
MarleneKress79789 Jul 31, 2024
f5c8c7c
[CodeBuild] remove prints, changelog
MarleneKress79789 Jul 31, 2024
34fea48
[CodeBuild] fixed run integration tests without saas, fixed import
MarleneKress79789 Aug 1, 2024
2674d8b
[CodeBuild] fixed text replace error
MarleneKress79789 Aug 5, 2024
1ac2088
[CodeBuild] prepare release
MarleneKress79789 Aug 5, 2024
10cc7d4
Apply suggestions from code review [CodeBuild]
tkilias Aug 5, 2024
f756d0b
Apply suggestions from code review [CodeBuild]
tkilias Aug 5, 2024
8b174c7
[CodeBuild] fix saas db naming error
MarleneKress79789 Aug 6, 2024
c28f186
Use batch build for AWS CodeBuild to speed up tests against backends.…
tkilias Aug 6, 2024
7380563
[CodeBuild]
tkilias Aug 6, 2024
9e22f43
[CodeBuild]
tkilias Aug 6, 2024
727fead
[CodeBuild]
tkilias Aug 6, 2024
4f4c11a
[CodeBuild]
tkilias Aug 6, 2024
6df3989
Fix buildspec.yml [CodeBuild]
tkilias Aug 6, 2024
3e3bd6d
Use correct backend strings for comparison in tests [CodeBuild]
tkilias Aug 6, 2024
0f51211
Use correct backend strings for onprem for comparison in tests [CodeB…
tkilias Aug 6, 2024
3e07bcd
Add --setup-show to running integration tests nox sessions [CodeBuild]
tkilias Aug 6, 2024
99732c1
[CodeBuild]
tkilias Aug 6, 2024
836be30
Build and export SLC before running SaaS integration tests to avoid w…
tkilias Aug 6, 2024
501ba04
Use itde_config fixture instead of itde fixture to avoid starting the itd…
tkilias Aug 6, 2024
406c7ce
Save SaaS Database id in pytest stash to not recreate a SaaS DB for e…
tkilias Aug 6, 2024
5d2d261
Fix pytest stash usage [CodeBuild]
tkilias Aug 6, 2024
8d06a40
Use pytest stash to export and upload the slc only once [CodeBuild]
tkilias Aug 6, 2024
a12d8f8
Fix bucketfs_fixture.py [CodeBuild]
tkilias Aug 6, 2024
3b1281a
Save in pytest stash for which backend we uploaded the slc [CodeBuild]
tkilias Aug 7, 2024
92905ed
Fix upload_slc [CodeBuild]
tkilias Aug 7, 2024
9a36321
Increase DB Mem Size for ITDE to hopefully stabilize onprem tests in …
tkilias Aug 7, 2024
cef1932
Increase VM Size for onprem tests in CodeBuild to hopefully stabilize…
tkilias Aug 7, 2024
45 changes: 18 additions & 27 deletions buildspec.yml
@@ -1,29 +1,20 @@
version: 0.2

env:
shell: bash
secrets-manager:
DOCKER_USER: "Dockerhub:User"
DOCKER_PASSWORD: "Dockerhub:AccessToken"
SAAS_HOST: "ExasolSaaSDatabase:SAAS_HOST"
SAAS_ACCOUNT_ID: "ExasolSaaSDatabase:SAAS_ACCOUNT_ID"
SAAS_PAT: "ExasolSaaSDatabase:SAAS_PAT"

phases:
install:
runtime-versions:
python: 3.10
commands:
- curl -sSL https://install.python-poetry.org | POETRY_VERSION=1.4.2 python3 -
- export PATH=$PATH:$HOME/.local/bin
- poetry env use $(command -v "python3.10")
- poetry --version
- poetry install
- poetry build
pre_build:
commands:
- echo "$DOCKER_PASSWORD" | docker login --username "$DOCKER_USER" --password-stdin
build:
commands:
- poetry run nox -s start_database
- poetry run nox -s integration_tests
batch:
fast-fail: false
build-graph:
- identifier: without_db_tests
env:
compute-type: BUILD_GENERAL1_MEDIUM
privileged-mode: true
buildspec: ./buildspec_without_db.yml
- identifier: saas_tests
env:
compute-type: BUILD_GENERAL1_MEDIUM
privileged-mode: true
buildspec: ./buildspec_saas.yml
- identifier: onprem_tests
env:
compute-type: BUILD_GENERAL1_LARGE
privileged-mode: true
buildspec: ./buildspec_onprem.yml
26 changes: 26 additions & 0 deletions buildspec_onprem.yml
@@ -0,0 +1,26 @@
version: 0.2

env:
shell: bash
secrets-manager:
DOCKER_USER: "Dockerhub:User"
DOCKER_PASSWORD: "Dockerhub:AccessToken"

phases:
install:
runtime-versions:
python: 3.10
commands:
- curl -sSL https://install.python-poetry.org | POETRY_VERSION=1.4.2 python3 -
- export PATH=$PATH:$HOME/.local/bin
- poetry env use $(command -v "python3.10")
- poetry --version
- poetry install
- poetry build
pre_build:
commands:
- echo "$DOCKER_PASSWORD" | docker login --username "$DOCKER_USER" --password-stdin
build:
commands:
- poetry run nox -s start_database
- poetry run nox -s onprem_integration_tests
29 changes: 29 additions & 0 deletions buildspec_saas.yml
@@ -0,0 +1,29 @@
version: 0.2

env:
shell: bash
secrets-manager:
DOCKER_USER: "Dockerhub:User"
DOCKER_PASSWORD: "Dockerhub:AccessToken"
SAAS_HOST: "ExasolSaaSDatabase:SAAS_HOST"
SAAS_ACCOUNT_ID: "ExasolSaaSDatabase:SAAS_ACCOUNT_ID"
SAAS_PAT: "ExasolSaaSDatabase:SAAS_PAT"

phases:
install:
runtime-versions:
python: 3.10
commands:
- curl -sSL https://install.python-poetry.org | POETRY_VERSION=1.4.2 python3 -
- export PATH=$PATH:$HOME/.local/bin
- poetry env use $(command -v "python3.10")
- poetry --version
- poetry install
- poetry build
pre_build:
commands:
- echo "$DOCKER_PASSWORD" | docker login --username "$DOCKER_USER" --password-stdin
build:
commands:
- poetry run nox -s export_slc
- poetry run nox -s saas_integration_tests
25 changes: 25 additions & 0 deletions buildspec_without_db.yml
@@ -0,0 +1,25 @@
version: 0.2

env:
shell: bash
secrets-manager:
DOCKER_USER: "Dockerhub:User"
DOCKER_PASSWORD: "Dockerhub:AccessToken"

phases:
install:
runtime-versions:
python: 3.10
commands:
- curl -sSL https://install.python-poetry.org | POETRY_VERSION=1.4.2 python3 -
- export PATH=$PATH:$HOME/.local/bin
- poetry env use $(command -v "python3.10")
- poetry --version
- poetry install
- poetry build
pre_build:
commands:
- echo "$DOCKER_PASSWORD" | docker login --username "$DOCKER_USER" --password-stdin
build:
commands:
- poetry run nox -s without_db_integration_tests
9 changes: 7 additions & 2 deletions doc/changes/changes_2.0.0.md
@@ -1,9 +1,12 @@
# Transformers Extension 2.0.0, t.b.d
# Transformers Extension 2.0.0, 2024-08-07

Code name:
Code name: Fixed model saving, added SaaS support and updated to Python 3.10

## Summary

This release fixes an error in the saving and loading of the model metadata. It also adds Exasol SaaS support and
updates the project to Python 3.10.


### Features

@@ -13,6 +16,7 @@ Code name:
### Bugs

- #237: Fixed reference to python-extension-common
- #245: Added task_type parameter to fix model saving and loading

### Documentation

@@ -27,5 +31,6 @@ Code name:
- #217: Refactored PredictionUDFs and LoadLocalModel so that LoadLocalModel constructs the bucketfs model file path
- #230: Updated supported python version to >= Python 3.10
- #236: Moved to the PathLike bucketfs interface.
- #218: Changed upload_model_udf to load model from Huggingface

### Security
46 changes: 23 additions & 23 deletions doc/user_guide/user_guide.md
@@ -263,29 +263,36 @@ Once you have internet access, invoke the UDF like this:
```sql
SELECT TE_MODEL_DOWNLOADER_UDF(
model_name,
task_type,
sub_dir,
bucketfs_conn,
token_conn
)

```
- Parameters:
- ```model_name```: The name of the model to use for prediction. You can find the
details of the models on the [huggingface models page](https://huggingface.co/models).
- ```task_type```: The name of the task you want to use the model for.
- ```sub_dir```: The directory where the model is stored in the BucketFS.
- ```bucketfs_conn```: The BucketFS connection name.
- ```token_conn```: The connection name containing the token required for
private models. You can use an empty string ('') for public models. For details
on how to create a connection object with token information, please check
[here](#getting-started).


"task_type" is a variable for the type of task you plan to use the model for.
Some models can be used for multiple types of tasks, but transformers stores
different metadata depending on the task of the model, which affects how the model
is loaded later. Setting an Incorrect task_type, o leaving the task_type empty may affect the models performance
severely. Available task_types are the same as the names of our available UDFs, namely:
`filling_mask`, `question_answering`, `sequence_classification`, `text_generation`, `token_classification`,
`translation` and`zero_shot_classification`.
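
For example, downloading a public question-answering model could look like this (the model, directory, and connection names below are illustrative, not prescribed by this extension):

```sql
SELECT TE_MODEL_DOWNLOADER_UDF(
    'deepset/roberta-base-squad2', -- model_name: a public Hugging Face model
    'question_answering',          -- task_type: matches the UDF the model will be used with
    'qa_models',                   -- sub_dir: where to store the model in the BucketFS
    'my_bucketfs_conn',            -- bucketfs_conn: your BucketFS connection name
    ''                             -- token_conn: empty string for a public model
)
```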

### 2. Model Uploader Script
You can invoke the python script as below which allows to load the transformer
models from the local filesystem into BucketFS:
You can invoke the Python script as below, which allows you to download the transformer
models from the Hugging Face Hub to the local filesystem and then upload them from there to the BucketFS:

```buildoutcfg
python -m exasol_transformers_extension.upload_model <options>
```

#### List of options

@@ -309,26 +316,19 @@ Unless stated otherwise in the comments column, the option is required for eithe
| model-name | [x] | [x] | |
| path-in-bucket | [x] | [x] | Root location in the bucket for all models |
| sub-dir | [x] | [x] | Sub-directory where this model should be stored |
| task_type | [x] | [x] | Name of the task you want to use the model for |
| token | [x] | [x] | Huggingface token (needed for private models) |
| [no_]use-ssl-cert-validation | [x] | [x] | Optional boolean, defaults to True |

**Note**: The option --local-model-path needs to point to a path which contains the model and its tokenizer.
These should have been saved using transformers [save_pretrained](https://huggingface.co/docs/transformers/v4.32.1/en/installation#fetch-models-and-tokenizers-to-use-offline)
function to ensure proper loading by the Transformers Extension UDFs.
You can download the model using Python like this:

```python
import transformers

model_name = "<model name>"                       # the Hugging Face model to download
model_save_path = "<your local model save path>"  # directory shared by model and tokenizer

for model_factory in [transformers.AutoModel, transformers.AutoTokenizer]:
    # download the model / tokenizer from the Hugging Face Hub
    model = model_factory.from_pretrained(model_name)
    # save the downloaded artifact using the save_pretrained function
    model.save_pretrained(model_save_path)
```
***Note:*** Hugging Face models consist of two parts, the model and the tokenizer.
Make sure to download and save both into the same directory so the upload model script uploads them together.
Then upload them using the exasol_transformers_extension.upload_model script with ```--local-model-path = <your local model save path>```, as sketched below.
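
A minimal sketch of such an invocation, assuming only the options listed in the table above (all values are placeholders; additional connection options not shown in this excerpt may be required):

```buildoutcfg
python -m exasol_transformers_extension.upload_model \
    --model-name "<model name>" \
    --task_type "<task type>" \
    --sub-dir "<sub directory>" \
    --path-in-bucket "<path in bucket>" \
    --local-model-path "<your local model save path>"
```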


"task_type" is a variable for the type of task you plan to use the model for.
Some models can be used for multiple types of tasks, but transformers stores
different metadata depending on the task of the model, which affects how the model
is loaded later. Setting an Incorrect task_type, o leaving the task_type empty may affect the models performance
severely. Available task_types are the same as the names of our available UDFs, namely:
`filling_mask`, `question_answering`, `sequence_classification`, `text_generation`, `token_classification`,
`translation` and`zero_shot_classification`.

## Using Prediction UDFs
We provide 7 prediction UDFs in this Transformers Extension, each performing an NLP
task through the [transformers API](https://huggingface.co/docs/transformers/task_summary).
@@ -1,5 +1,6 @@
CREATE OR REPLACE {{ language_alias }} SET SCRIPT "TE_MODEL_DOWNLOADER_UDF"(
model_name VARCHAR(2000000),
task_type VARCHAR(2000000),
sub_dir VARCHAR(2000000),
bfs_conn VARCHAR(2000000),
token_conn VARCHAR(2000000)
10 changes: 5 additions & 5 deletions exasol_transformers_extension/udfs/models/base_model_udf.py
@@ -10,7 +10,7 @@
from exasol_transformers_extension.deployment import constants
from exasol_transformers_extension.utils import device_management, \
bucketfs_operations, dataframe_operations
from exasol_transformers_extension.utils.current_model_specification import CurrentModelSpecification
from exasol_transformers_extension.utils.bucketfs_model_specification import BucketFSModelSpecification
from exasol_transformers_extension.utils.load_local_model import LoadLocalModel
from exasol_transformers_extension.utils.model_factory_protocol import ModelFactoryProtocol
from exasol_transformers_extension.utils.model_specification import ModelSpecification
@@ -40,13 +40,13 @@ def __init__(self,
pipeline: transformers.Pipeline,
base_model: ModelFactoryProtocol,
tokenizer: ModelFactoryProtocol,
task_name: str):
task_type: str):
self.exa = exa
self.batch_size = batch_size
self.pipeline = pipeline
self.base_model = base_model
self.tokenizer = tokenizer
self.task_name = task_name
self.task_type = task_type
self.device = None
self.model_loader = None
self.last_created_pipeline = None
@@ -74,7 +74,7 @@ def create_model_loader(self):
self.model_loader = LoadLocalModel(pipeline_factory=self.pipeline,
base_model_factory=self.base_model,
tokenizer_factory=self.tokenizer,
task_name=self.task_name,
task_type=self.task_type,
device=self.device)

def get_predictions_from_batch(self, batch_df: pd.DataFrame) -> pd.DataFrame:
@@ -185,7 +185,7 @@ def check_cache(self, model_df: pd.DataFrame) -> None:
model_name = model_df["model_name"].iloc[0]
bucketfs_conn = model_df["bucketfs_conn"].iloc[0]
sub_dir = model_df["sub_dir"].iloc[0]
current_model_specification = CurrentModelSpecification(model_name, bucketfs_conn, sub_dir)
current_model_specification = BucketFSModelSpecification(model_name, self.task_type, bucketfs_conn, sub_dir)

if self.model_loader.current_model_specification != current_model_specification:
bucketfs_location = \
@@ -14,7 +14,7 @@ def __init__(self,
base_model=transformers.AutoModelForMaskedLM,
tokenizer=transformers.AutoTokenizer):
super().__init__(exa, batch_size, pipeline, base_model,
tokenizer, task_name='fill-mask')
tokenizer, task_type='fill-mask')
self._mask_token = "<mask>"
self._desired_fields_in_prediction = ["sequence", "score"]
self.new_columns = ["filled_text", "score", "rank", "error_message"]
@@ -3,8 +3,8 @@
import transformers

from exasol_transformers_extension.utils import bucketfs_operations
from exasol_transformers_extension.utils.current_model_specification import \
CurrentModelSpecificationFactory
from exasol_transformers_extension.utils.bucketfs_model_specification import \
BucketFSModelSpecificationFactory
from exasol_transformers_extension.utils.model_factory_protocol import ModelFactoryProtocol
from exasol_transformers_extension.utils.huggingface_hub_bucketfs_model_transfer_sp import \
HuggingFaceHubBucketFSModelTransferSPFactory
@@ -24,13 +24,11 @@ class ModelDownloaderUDF:
"""
def __init__(self,
exa,
base_model_factory: ModelFactoryProtocol = transformers.AutoModel,
tokenizer_factory: ModelFactoryProtocol = transformers.AutoTokenizer,
huggingface_hub_bucketfs_model_transfer: HuggingFaceHubBucketFSModelTransferSPFactory =
HuggingFaceHubBucketFSModelTransferSPFactory(),
current_model_specification_factory: CurrentModelSpecificationFactory = CurrentModelSpecificationFactory()):
current_model_specification_factory: BucketFSModelSpecificationFactory = BucketFSModelSpecificationFactory()):
self._exa = exa
self._base_model_factory = base_model_factory
self._tokenizer_factory = tokenizer_factory
self._huggingface_hub_bucketfs_model_transfer = huggingface_hub_bucketfs_model_transfer
self._current_model_specification_factory = current_model_specification_factory
@@ -47,9 +45,11 @@ def _download_model(self, ctx) -> Tuple[str, str]:
bfs_conn = ctx.bfs_conn # BucketFS connection
token_conn = ctx.token_conn # name of token connection
current_model_specification = self._current_model_specification_factory.create(ctx.model_name,
bfs_conn,
ctx.sub_dir) # specifies details of Huggingface model
ctx.task_type,
bfs_conn,
ctx.sub_dir) # specifies details of Huggingface model

model_factory = current_model_specification.get_model_factory()
# extract token from the connection if token connection name is given.
# note that, token is required for private models. It doesn't matter
# whether there is a token for public model or even what the token is.
@@ -72,7 +72,7 @@ def _download_model(self, ctx) -> Tuple[str, str]:
model_path=model_path,
token=token
) as downloader:
for model in [self._base_model_factory, self._tokenizer_factory]:
for model in [model_factory, self._tokenizer_factory]:
downloader.download_from_huggingface_hub(model)
# upload model files to BucketFS
model_tar_file_path = downloader.upload_to_bucketfs()
@@ -13,7 +13,7 @@ def __init__(self,
base_model=transformers.AutoModelForSequenceClassification,
tokenizer=transformers.AutoTokenizer):
super().__init__(exa, batch_size, pipeline, base_model,
tokenizer, task_name='text-classification')
tokenizer, task_type='text-classification')
self.new_columns = ["label", "score", "error_message"]

def extract_unique_param_based_dataframes(
@@ -13,7 +13,7 @@ def __init__(self,
base_model=transformers.AutoModelForSequenceClassification,
tokenizer=transformers.AutoTokenizer):
super().__init__(exa, batch_size, pipeline, base_model,
tokenizer, task_name='text-classification')
tokenizer, task_type='text-classification')
self.new_columns = ["label", "score", "error_message"]

def extract_unique_param_based_dataframes(
@@ -14,7 +14,7 @@ def __init__(self,
base_model=transformers.AutoModelForCausalLM,
tokenizer=transformers.AutoTokenizer):
super().__init__(exa, batch_size, pipeline, base_model,
tokenizer, task_name='text-generation')
tokenizer, task_type='text-generation')
self.new_columns = ["generated_text", "error_message"]

def extract_unique_param_based_dataframes(
@@ -14,7 +14,7 @@ def __init__(self,
base_model=transformers.AutoModelForTokenClassification,
tokenizer=transformers.AutoTokenizer):
super().__init__(exa, batch_size, pipeline, base_model,
tokenizer, task_name='token-classification')
tokenizer, task_type='token-classification')
self._default_aggregation_strategy = 'simple'
self._desired_fields_in_prediction = [
"start", "end", "word", "entity", "score"]
@@ -14,7 +14,7 @@ def __init__(self,
base_model=transformers.AutoModelForSeq2SeqLM,
tokenizer=transformers.AutoTokenizer):
super().__init__(exa, batch_size, pipeline, base_model,
tokenizer, task_name='translation')
tokenizer, task_type='translation')
self._translation_prefix = "translate {src_lang} to {target_lang}: "
self.new_columns = ["translation_text", "error_message"]
