Commit

bugfix for IS_ALPHA
guipenedo committed Jan 30, 2025
1 parent b3daef2 commit b9fb72a
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions src/datatrove/utils/word_tokenizers.py
@@ -136,9 +136,9 @@ def _do_tokenize(self, text: str):
         self.tokenizer.max_length = len(text)
         try:
             return [self.tokenizer(t, disable=["parser", "tagger", "ner"]) for t in texts]
-        except KeyError as e:
+        except Exception as e:
             # this dumb string breaks the tokenizer completely
-            if "IS_ALPHA" in str(e):
+            if "IS_ALPHA" in text:
                 return [self.tokenizer(t.replace("IS_ALPHA", ""), disable=["parser", "tagger", "ner"]) for t in texts]
             else:
                 raise e
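
The retry pattern in this hunk can be sketched in isolation. The snippet below is a minimal illustration, not the datatrove implementation: `tokenize_with_fallback` and `fragile_tokenize` are hypothetical names, and `fragile_tokenize` is a stand-in for the spaCy pipeline that simply fails whenever the input contains "IS_ALPHA". It shows one plausible reason to test the input text rather than `str(e)`: the exception message need not mention the offending string. The commit itself does not state its reasoning.

```python
from typing import Callable, List

BAD_STRING = "IS_ALPHA"  # the string the diff strips before retrying


def tokenize_with_fallback(
    tokenize: Callable[[str], List[str]], texts: List[str], text: str
) -> List[List[str]]:
    """Tokenize every chunk; on failure, retry without BAD_STRING if the input had it."""
    try:
        return [tokenize(t) for t in texts]
    except Exception:
        # Inspect the input, not the exception message.
        if BAD_STRING in text:
            return [tokenize(t.replace(BAD_STRING, "")) for t in texts]
        raise  # unrelated failure: propagate unchanged


def fragile_tokenize(t: str) -> List[str]:
    """Hypothetical tokenizer that breaks on BAD_STRING (assumption for this sketch)."""
    if BAD_STRING in t:
        # The message deliberately omits BAD_STRING, so a str(e) check would miss it.
        raise ValueError("lookup failed")
    return t.split()


def always_fail(t: str) -> List[str]:
    """Hypothetical tokenizer that fails for a reason unrelated to BAD_STRING."""
    raise RuntimeError("boom")
```

With these definitions, `tokenize_with_fallback(fragile_tokenize, ["a IS_ALPHA b"], "a IS_ALPHA b")` retries on the cleaned text, while a failure on input that does not contain the string is re-raised.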