We noticed a bug where this sequence of characters, ̕" (U+0315 COMBINING COMMA ABOVE RIGHT followed by a double quotation mark), would cause a fatal error.
We had a discussion on Slack about it; below are some notes:
The fatal error comes from this: any time a piece of punctuation or a diacritic is hanging out on its own, not adjacent to a word, it gets tokenized into a word (because it's in the "equiv" mapping), but it doesn't survive the g2p cascade (it isn't pronounceable on its own). It then falls through to und, but und can't assign it a pronunciation either. The result is a word in the file with no pronunciation, which we treat as a fatal error.
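To make the failure case concrete, here is a minimal sketch that inspects the offending sequence with Python's standard `unicodedata` module. The helper name `has_word_content` is hypothetical and not part of the codebase; it only illustrates why such a token has nothing for g2p to pronounce: it contains no letters or digits at all.

```python
import unicodedata


def has_word_content(token: str) -> bool:
    """Hypothetical helper: True if the token has at least one letter or digit.

    Unicode general categories starting with "L" are letters and those
    starting with "N" are numbers; marks ("M") and punctuation ("P") are not.
    """
    return any(unicodedata.category(ch)[0] in ("L", "N") for ch in token)


# The sequence from this issue: U+0315 followed by a double quotation mark.
offending = "\u0315\""

for ch in offending:
    # Show the code point, its Unicode name, and its general category.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} {unicodedata.category(ch)}")

print(has_word_content(offending))  # no letter or digit in this token
print(has_word_content("don't"))    # an apostrophe inside a real word is fine
```

U+0315 is a nonspacing mark (category `Mn`) and the quotation mark is punctuation (category `Po`), so a token made up only of these has no pronounceable content.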
The better solution would be, whenever the cascade comes up with no pronunciation whatsoever, to say internally "I guess that's not a word after all" and then decide between two behaviors: exclude it from the FSG and align anyway (probably the best behavior for punctuation/diacritics), or warn the user (probably the best behavior for number strings).
In any case, a random apostrophe (or any other piece of punctuation) hanging out somewhere in a file is pretty common, and our response shouldn't be to fail to align the whole document. We should probably catch these kinds of errors and just gracefully not align them to anything. It's a different kind of error than (say) leaving an unpronounced number in the text file.
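The behavior proposed in these notes could be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the `Token` class, `is_punctuation_only`, and `filter_unpronounceable` are all hypothetical names, and a pronunciation of `None` stands in for "the whole cascade, including und, produced nothing".

```python
import logging
import unicodedata
from dataclasses import dataclass
from typing import List, Optional, Tuple

logger = logging.getLogger(__name__)


@dataclass
class Token:
    """Hypothetical stand-in for a tokenized word and its g2p result."""
    text: str
    pronunciation: Optional[str]  # None or "" means the cascade found nothing


def is_punctuation_only(text: str) -> bool:
    """True if every character is punctuation (P), a mark (M) or a symbol (S)."""
    return bool(text) and all(
        unicodedata.category(ch)[0] in ("P", "M", "S") for ch in text
    )


def filter_unpronounceable(tokens: List[Token]) -> Tuple[List[Token], List[Token]]:
    """Split tokens into those to align and those the user should be warned about.

    Stray punctuation/diacritics are dropped quietly ("not a word after all");
    anything else without a pronunciation (e.g. a number string) gets a warning.
    """
    alignable: List[Token] = []
    needs_attention: List[Token] = []
    for tok in tokens:
        if tok.pronunciation:
            alignable.append(tok)
        elif is_punctuation_only(tok.text):
            # Leave it out of the alignment, but do not fail the document.
            logger.debug("Skipping stray punctuation/diacritic: %r", tok.text)
        else:
            logger.warning("No pronunciation for %r; it will not be aligned", tok.text)
            needs_attention.append(tok)
    return alignable, needs_attention


tokens = [
    Token("hello", "HH AH L OW"),
    Token("\u0315\"", None),  # the sequence from this issue
    Token("1234", None),      # an unpronounced number string
]
alignable, needs_attention = filter_unpronounceable(tokens)
```

With this split, only the tokens in `alignable` would go into the FSG, and the caller can decide whether a non-empty `needs_attention` list should stay a warning or still be treated as an error.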