
Introducing inconsistency when adding special tokens to a ModularTokenizer #105

Open
floccinauc opened this issue Feb 22, 2024 · 5 comments

floccinauc commented Feb 22, 2024

Describe the bug
When we add special tokens to a ModularTokenizer, we first discover which of the new tokens are already part of the ModularTokenizer. To do this, we consider the special tokens of the first subtokenizer alone, on the assumption that all subtokenizers are consistent.
This, however, is not necessarily the case. For example, consider a second ModularTokenizer (Z) that was created from a first one (Y) by adding another subtokenizer (A). Suppose Y was then updated with additional special tokens: all of Y's subtokenizers were updated, but subtokenizer A was not, since A is part of Z only and not of Y. The next time Z is loaded, it will no longer be consistent - its subtokenizer A will be missing the new special tokens.
Moreover, if we then try to add the missing tokens to Z, we will fail, because they are already found in Z's first subtokenizer (a minimal sketch of this failure mode follows below).
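To make this concrete, here is a minimal, self-contained sketch of the dedup logic described above. Subtokenizers are modeled as plain token-to-ID dicts; the function and data are illustrative only, not the actual Fuse-Drug API:

```python
# Minimal model of the buggy dedup logic; subtokenizers are modeled as
# plain token -> ID dicts purely for illustration.

def add_special_tokens(subtokenizers, tokens):
    # Bug: only the FIRST subtokenizer is consulted when filtering out
    # "already known" tokens, assuming all subtokenizers agree.
    new = [t for t in tokens if t not in subtokenizers[0]]
    next_id = 1 + max(i for sub in subtokenizers for i in sub.values())
    for t in new:
        for sub in subtokenizers:
            sub[t] = next_id
        next_id += 1
    return new

# Z's first subtokenizer already received "<NEW>" through an update of Y,
# but subtokenizer A (part of Z only) missed that update.
sub_first = {"<PAD>": 0, "<NEW>": 1}
sub_a = {"<PAD>": 0}

# "<NEW>" is filtered out because the first subtokenizer already knows it,
# so subtokenizer A never receives it and Z stays inconsistent.
added = add_special_tokens([sub_first, sub_a], ["<NEW>"])
assert added == [] and "<NEW>" not in sub_a
```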
Possible solutions:
A. Test a ModularTokenizer for consistency each time it is loaded.
- If it is not consistent, run a consolidation function (sketched after this list) that identifies the tokens (and their IDs) missing from each subtokenizer and adds them where possible, throwing an exception where it is not. Alternatively, the consolidation step may be optional, and we can simply throw an exception stating that the ModularTokenizer is inconsistent and must be consolidated.
B. Test a ModularTokenizer for consistency each time before it is changed (e.g. by add_special_tokens), and consolidate it if needed/possible.
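For solution A, a rough sketch of the check-and-consolidate step, again modeling subtokenizers as token-to-ID dicts; how a real subtokenizer exposes its special tokens and accepts a token under a fixed ID is an assumption here:

```python
# Hypothetical consolidation for solution A: build the union of special
# tokens across all subtokenizers and back-fill whatever is missing.

def is_consistent(subtokenizers):
    # Consistent means every subtokenizer holds the same special-token map.
    first = subtokenizers[0]
    return all(sub == first for sub in subtokenizers[1:])

def consolidate(subtokenizers):
    union = {}  # token -> ID across all subtokenizers
    for sub in subtokenizers:
        for token, token_id in sub.items():
            if union.get(token, token_id) != token_id:
                # Same token mapped to two different IDs: cannot consolidate.
                raise ValueError(f"conflicting IDs for special token {token!r}")
            union[token] = token_id
    for sub in subtokenizers:
        for token, token_id in union.items():
            if token not in sub:
                if token_id in sub.values():
                    # The ID is taken by another token in this subtokenizer,
                    # so adding the token under that ID is impossible.
                    raise ValueError(f"ID {token_id} already in use")
                sub[token] = token_id

# On load (solution A): check and repair, or fail loudly.
subtokenizers = [{"<PAD>": 0, "<NEW>": 1}, {"<PAD>": 0}]
if not is_consistent(subtokenizers):
    consolidate(subtokenizers)
assert is_consistent(subtokenizers)
```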

Fuse-Drug version
Fuse-Drug version/tag/commit used.

Python version
Exact Python version used. E.g. 3.8.13

To reproduce
Steps to reproduce the behavior.

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots
If applicable, add screenshots to help explain your problem.
Make sure not to include any sensitive information.

Additional context
Add any other context about the problem here.

floccinauc added the bug label Feb 22, 2024
matanninio commented
Is this an issue only with the extended tokenizer, compared to the simple one, or within the simple one as well?

floccinauc commented
It's only an issue with the extended ModularTokenizer.

matanninio commented
After huddle: extend the consistency check to work between sub-tokenizers of two modular tokenizers, and add the test to the Jenkins run.

floccinauc commented
The problem is not that acute, since the two tokenizers we use, modular_AA_SMILES_single_path and bmfm_extended_modular_tokenizer, sit in different directories.
An initial manual solution would be to run the special-token-addition code on both tokenizers each time, and to add a Jenkins test that compares them on each commit (a sketch of such a test follows below).
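A sketch of what that Jenkins comparison could look like; get_special_token_map is a hypothetical helper standing in for however Fuse-Drug loads a saved ModularTokenizer and exposes its special tokens:

```python
# Hypothetical CI check comparing the special tokens of the two tokenizers.

def get_special_token_map(tokenizer_dir: str) -> dict:
    # Placeholder: in practice this would load the ModularTokenizer from
    # tokenizer_dir and return its special-token -> ID mapping.
    raise NotImplementedError

def test_tokenizers_agree():
    a = get_special_token_map("modular_AA_SMILES_single_path")
    b = get_special_token_map("bmfm_extended_modular_tokenizer")
    diff = sorted(set(a.items()) ^ set(b.items()))
    assert a == b, f"special-token mismatch: {diff}"
```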

matanninio commented
An initial manual solution would be to run the special-token-addition code on both tokenizers each time, and to add a Jenkins test that compares them on each commit.

Maybe a better solution would be to add an "add special tokens to all tokenizers" script, so users will not need to follow the more complicated path of running the add-special-tokens step twice or thrice.
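A rough sketch of such a script; the directory list and every ModularTokenizer method used here (load, add_special_tokens, save) are assumptions for illustration, not the actual Fuse-Drug entry points:

```python
# Hypothetical "add special tokens to all tokenizers" script.
# Import ModularTokenizer here; the exact import path depends on the
# Fuse-Drug layout, so it is deliberately left out of this sketch.

TOKENIZER_DIRS = [
    "modular_AA_SMILES_single_path",
    "bmfm_extended_modular_tokenizer",
]

def add_tokens_everywhere(new_tokens):
    # Apply the same special-token update to every tracked tokenizer,
    # so the copies can never drift apart.
    for path in TOKENIZER_DIRS:
        tokenizer = ModularTokenizer.load(path)    # assumed loader
        tokenizer.add_special_tokens(new_tokens)   # assumed mutator
        tokenizer.save(path)                       # assumed writer

if __name__ == "__main__":
    add_tokens_everywhere(["<EXAMPLE_TASK_TOKEN>"])
```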
