Hi, thanks for your great work on this. I noticed a subtle issue while playing with synthetic examples.
The BPE algorithm works as expected, but the unigram algorithm does not make the constant piece described below a token in the vocabulary.
I generate synthetic data where each sentence is a random string followed by a constant piece:
import io

import numpy as np
import sentencepiece as spm

constant_piece = 'helloWorld'

def rand_str(n=10):
    # n random characters drawn from a fixed alphabet
    return ''.join(
        np.random.choice(list('bcegijklmnoqruvwxyz'), n)
    )

data = [rand_str() + constant_piece for _ in range(1000)]

model = io.BytesIO()
spm.SentencePieceTrainer.train(
    sentence_iterator=iter(data), model_writer=model,
    vocab_size=1000,
    minloglevel=5,
)
sp = spm.SentencePieceProcessor(model_proto=model.getvalue())
ex = data[20]
print([
    sp.IdToPiece(x)
    for x in sp.encode(ex, emit_unk_piece=True)
])
outputs: ['▁uy', 'vx', 'yf', 'p', 'gmn', 'he', 'llo', 'W', 'or', 'ld']
It mostly just gets random tokens. I think it gets 'he', 'llo', 'or' and 'ld' not because it noticed the repeating pattern, but just from coincidentally seeing them in the random strings. If I change constant_piece to '123456', I get no tokens for the repeating pattern and only tokens for the random string: ['▁', 'gll', 'imq', 'xc', 'df', '1', '2', '3', '4', '5', '6'].
This happens specifically because the constant_piece is at the end. If I change the data so that constant_piece is at the beginning of each sentence, data = [constant_piece + rand_str() for _ in range(1000)], then I get the expected result: ['▁123456', 'uzb', 'ek', 'hoe', 'wr'].
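For comparison, here is a minimal sketch of the check I would run for both model types (it reuses the data, constant_piece and imports from above; the helper name check_vocab is just mine, while model_type is a standard SentencePieceTrainer option):

def check_vocab(model_type):
    # Train on the same synthetic data, then test whether the constant
    # piece made it into the vocabulary as a single token.
    m = io.BytesIO()
    spm.SentencePieceTrainer.train(
        sentence_iterator=iter(data), model_writer=m,
        vocab_size=1000, model_type=model_type, minloglevel=5,
    )
    sp = spm.SentencePieceProcessor(model_proto=m.getvalue())
    return sp.piece_to_id(constant_piece) != sp.unk_id()

print(check_vocab('bpe'))      # expected True: BPE picks up the constant piece
print(check_vocab('unigram'))  # expected False: unigram misses it (this issue)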
TL;DR
Unexpected result under the following conditions:
- the same string at the end of each sentence in the training data
- using the unigram algorithm