
Introducing inconsistency when adding special tokens to a ModularTokenizer #105

Open
floccinauc opened this issue Feb 22, 2024 · 5 comments

floccinauc commented Feb 22, 2024

Describe the bug
When we add special tokens to a ModularTokenizer, we first discover which of the new tokens are already part of the ModularTokenizer. To do this, we consider the special tokens of the first subtokenizer alone, on the assumption that all subtokenizers are consistent.
This, however, is not necessarily the case. For example, consider a second ModularTokenizer (Z) that was created from a first one (Y) by adding another subtokenizer (A). Suppose Y was then updated with additional special tokens: all of Y's subtokenizers were updated, but subtokenizer A was not, since A is part of Z only and not of Y. The next time Z is loaded, it will no longer be consistent - its subtokenizer A will be missing the new special tokens.
Moreover, if we then try to add the missing tokens to Z, we will fail, because they are already found in Z's first subtokenizer (a minimal sketch of this failure mode follows below).
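To make this concrete, here is a minimal, self-contained sketch of the dedup logic described above. Subtokenizers are modeled as plain token-to-ID dicts; the function and data are illustrative only, not the actual Fuse-Drug API:

```python
# Minimal model of the buggy dedup logic; subtokenizers are modeled as
# plain token -> ID dicts purely for illustration.

def add_special_tokens(subtokenizers, tokens):
    # Bug: only the FIRST subtokenizer is consulted when filtering out
    # "already known" tokens, assuming all subtokenizers agree.
    new = [t for t in tokens if t not in subtokenizers[0]]
    next_id = 1 + max(i for sub in subtokenizers for i in sub.values())
    for t in new:
        for sub in subtokenizers:
            sub[t] = next_id
        next_id += 1
    return new

# Z's first subtokenizer already received "<NEW>" through an update of Y,
# but subtokenizer A (part of Z only) missed that update.
sub_first = {"<PAD>": 0, "<NEW>": 1}
sub_a = {"<PAD>": 0}

# "<NEW>" is filtered out because the first subtokenizer already knows it,
# so subtokenizer A never receives it and Z stays inconsistent.
added = add_special_tokens([sub_first, sub_a], ["<NEW>"])
assert added == [] and "<NEW>" not in sub_a
```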
Possible solutions:
A. Test a ModularTokenizer for consistency each time it is loaded.
- If it is not consistent, run a consolidation function (sketched after this list) that identifies the tokens (and their IDs) missing from each subtokenizer and adds them where possible, throwing an exception where it is not. Alternatively, the consolidation step may be optional, and we can simply throw an exception stating that the ModularTokenizer is inconsistent and must be consolidated.
B. Test a ModularTokenizer for consistency each time before it is changed (e.g. by add_special_tokens), and consolidate it if needed/possible.
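For solution A, a rough sketch of the check-and-consolidate step, again modeling subtokenizers as token-to-ID dicts; how a real subtokenizer exposes its special tokens and accepts a token under a fixed ID is an assumption here:

```python
# Hypothetical consolidation for solution A: build the union of special
# tokens across all subtokenizers and back-fill whatever is missing.

def is_consistent(subtokenizers):
    # Consistent means every subtokenizer holds the same special-token map.
    first = subtokenizers[0]
    return all(sub == first for sub in subtokenizers[1:])

def consolidate(subtokenizers):
    union = {}  # token -> ID across all subtokenizers
    for sub in subtokenizers:
        for token, token_id in sub.items():
            if union.get(token, token_id) != token_id:
                # Same token mapped to two different IDs: cannot consolidate.
                raise ValueError(f"conflicting IDs for special token {token!r}")
            union[token] = token_id
    for sub in subtokenizers:
        for token, token_id in union.items():
            if token not in sub:
                if token_id in sub.values():
                    # The ID is taken by another token in this subtokenizer,
                    # so adding the token under that ID is impossible.
                    raise ValueError(f"ID {token_id} already in use")
                sub[token] = token_id

# On load (solution A): check and repair, or fail loudly.
subtokenizers = [{"<PAD>": 0, "<NEW>": 1}, {"<PAD>": 0}]
if not is_consistent(subtokenizers):
    consolidate(subtokenizers)
assert is_consistent(subtokenizers)
```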

Fuse-Drug version
Fuse-Drug version/tag/commit used.

Python version
Exact Python version used. E.g. 3.8.13

To reproduce
Steps to reproduce the behavior.

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots
If applicable, add screenshots to help explain your problem.
Make sure not to include any sensitive information.

Additional context
Add any other context about the problem here.

floccinauc added the bug label Feb 22, 2024
matanninio commented
Is this an issue only with the extended tokenizer, compared to the simple one, or within the simple one as well?

floccinauc commented
It's only an issue with the extended ModularTokenizer.

matanninio commented
After huddle: extend the consistency check to work between sub-tokenizers of two modular tokenizers, and add the test to the Jenkins run.

floccinauc commented
The problem is not that acute, since the two tokenizers we use, modular_AA_SMILES_single_path and bmfm_extended_modular_tokenizer, sit in different directories.
An initial manual solution would be to run the special-token-addition code on both tokenizers each time, and to add a Jenkins test that compares them on each commit (a sketch of such a test follows below).
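A sketch of what that Jenkins comparison could look like; get_special_token_map is a hypothetical helper standing in for however Fuse-Drug loads a saved ModularTokenizer and exposes its special tokens:

```python
# Hypothetical CI check comparing the special tokens of the two tokenizers.

def get_special_token_map(tokenizer_dir: str) -> dict:
    # Placeholder: in practice this would load the ModularTokenizer from
    # tokenizer_dir and return its special-token -> ID mapping.
    raise NotImplementedError

def test_tokenizers_agree():
    a = get_special_token_map("modular_AA_SMILES_single_path")
    b = get_special_token_map("bmfm_extended_modular_tokenizer")
    diff = sorted(set(a.items()) ^ set(b.items()))
    assert a == b, f"special-token mismatch: {diff}"
```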

matanninio commented
An initial manual solution would be to run the special-token-addition code on both tokenizers each time, and to add a Jenkins test that compares them on each commit.

Maybe a better solution would be to add an "add special tokens to all tokenizers" script, so users will not need to follow the more complicated path of running the add-special-tokens step twice or thrice.
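A rough sketch of such a script; the directory list and every ModularTokenizer method used here (load, add_special_tokens, save) are assumptions for illustration, not the actual Fuse-Drug entry points:

```python
# Hypothetical "add special tokens to all tokenizers" script.
# Import ModularTokenizer here; the exact import path depends on the
# Fuse-Drug layout, so it is deliberately left out of this sketch.

TOKENIZER_DIRS = [
    "modular_AA_SMILES_single_path",
    "bmfm_extended_modular_tokenizer",
]

def add_tokens_everywhere(new_tokens):
    # Apply the same special-token update to every tracked tokenizer,
    # so the copies can never drift apart.
    for path in TOKENIZER_DIRS:
        tokenizer = ModularTokenizer.load(path)    # assumed loader
        tokenizer.add_special_tokens(new_tokens)   # assumed mutator
        tokenizer.save(path)                       # assumed writer

if __name__ == "__main__":
    add_tokens_everywhere(["<EXAMPLE_TASK_TOKEN>"])
```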
