Describe the bug
When we add special tokens to a ModularTokenizer, we first determine which of the new tokens are already part of the ModularTokenizer. To do this, we check the special tokens of the first subtokenizer alone, assuming that all subtokenizers are consistent.
This, however, is not necessarily the case. For example, consider a second multitokenizer (Z) created from the first one (Y) by adding another subtokenizer (A). Suppose Y is then updated with additional special tokens: all of Y's subtokenizers are updated, but subtokenizer A is not (since A is part of Z only, not of Y). The next time multitokenizer Z is loaded, it will no longer be consistent - its subtokenizer A will be missing special tokens.
Moreover, if we try to add the missing tokens to Z, the addition will fail, because the check against Z's first subtokenizer reports them as already present.
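For illustration, here is a minimal sketch of the flawed discovery step described above. The attribute and method names (`subtokenizers`, `get_special_tokens_map`) are assumptions, not the actual Fuse-Drug API:

```python
def find_new_tokens(modular_tokenizer, candidate_tokens):
    # Look only at the first subtokenizer - this is the flawed assumption:
    first_sub = next(iter(modular_tokenizer.subtokenizers.values()))
    known = set(first_sub.get_special_tokens_map())  # token -> ID mapping
    # If another subtokenizer (e.g. A in Z) is missing some of `known`,
    # those tokens are wrongly treated as already present everywhere.
    return [t for t in candidate_tokens if t not in known]
```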
Possible solutions:
A. Test a ModularTokenizer for consistency each time it's loaded.
- If it is not consistent, run a consolidation function that identifies the tokens (and their IDs) missing from each subtokenizer and adds them, if possible (throwing an exception if not); see the sketch after this list. Alternatively, the consolidation step could be optional, and we could simply throw an exception stating that the ModularTokenizer is inconsistent and must be consolidated.
B. Test a ModularTokenizer for consistency each time before it is changed (e.g. by add_special_tokens), and consolidate it if needed/possible.
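A minimal sketch of solution A, assuming a hypothetical ModularTokenizer interface in which each subtokenizer exposes a special-token-to-ID mapping. The names used here (`subtokenizers`, `get_special_tokens_map`, `id_to_token`, `add_special_token_with_id`) are illustrative, not the real Fuse-Drug API:

```python
def check_consistency(modular_tokenizer) -> dict:
    """Return {subtokenizer_name: {token: id}} of special tokens missing from
    each subtokenizer, relative to the union over all subtokenizers."""
    # Build the union of special tokens (and their IDs) across subtokenizers.
    union: dict = {}
    for name, sub in modular_tokenizer.subtokenizers.items():
        for token, token_id in sub.get_special_tokens_map().items():
            if token in union and union[token] != token_id:
                raise ValueError(
                    f"Conflicting IDs for special token {token!r}: "
                    f"{union[token]} vs {token_id} (in {name})"
                )
            union.setdefault(token, token_id)

    # Record what each subtokenizer is missing relative to the union.
    missing = {}
    for name, sub in modular_tokenizer.subtokenizers.items():
        present = sub.get_special_tokens_map()
        gaps = {t: i for t, i in union.items() if t not in present}
        if gaps:
            missing[name] = gaps
    return missing


def consolidate(modular_tokenizer) -> None:
    """Add missing special tokens (keeping their original IDs) to every
    subtokenizer, raising if a required ID is already taken."""
    for name, gaps in check_consistency(modular_tokenizer).items():
        sub = modular_tokenizer.subtokenizers[name]
        for token, token_id in gaps.items():
            occupant = sub.id_to_token(token_id)
            if occupant is not None and occupant != token:
                raise RuntimeError(
                    f"Cannot consolidate {name}: ID {token_id} needed for "
                    f"{token!r} is already used by {occupant!r}"
                )
            sub.add_special_token_with_id(token, token_id)
```

Running `check_consistency` at load time covers solution A; calling `consolidate` at the start of `add_special_tokens` covers solution B.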
Fuse-Drug version
Fuse-Drug version/tag/commit used.
Python version
Exact Python version used. E.g. 3.8.13
To reproduce
Steps to reproduce the behavior.
Expected behavior
A clear and concise description of what you expected to happen.
Screenshots
If applicable, add screenshots to help explain your problem.
Make sure not to include any sensitive information.
Additional context
Add any other context about the problem here.
The problem is not that acute, since the two tokenizers we use - modular_AA_SMILES_single_path and bmfm_extended_modular_tokenizer - sit in different directories.
An initial manual solution would be to run the special-token-addition code on both tokenizers each time, and to add a Jenkins test that compares them at commit time.
Maybe a better solution would be to add an "add special tokens to all tokenizers" script, so that users will not need to follow the more complicated path of running the add-special-tokens step twice or thrice. A sketch of such a script follows.
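A hedged sketch of the proposed script, under stated assumptions: the directory names come from the comment above, and the loader is passed in as a callable because the real Fuse-Drug loading and saving entry points are not shown here.

```python
from typing import Callable, Iterable, List

# Directory names taken from the comment above; adjust to the repository layout.
TOKENIZER_DIRS: List[str] = [
    "modular_AA_SMILES_single_path",
    "bmfm_extended_modular_tokenizer",
]

def add_special_tokens_everywhere(
    new_tokens: List[str],
    load: Callable[[str], object],          # stand-in for the real loader
    tokenizer_dirs: Iterable[str] = TOKENIZER_DIRS,
) -> None:
    """Apply one special-token update to every maintained ModularTokenizer,
    so the on-disk copies cannot drift apart."""
    for path in tokenizer_dirs:
        tokenizer = load(path)
        tokenizer.add_special_tokens(new_tokens)  # assumed to update all subtokenizers
        tokenizer.save(path)                      # assumed save-back method
```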