A proposal for punctuation & symbol recognition: Spelling mode #29
Comments
I like the idea.
First, I agree with OP: an accurate FOSS speech-to-text is great news. Thanks for your appreciated work. Regarding this issue, I couldn't agree more, as I have to use punctuation when writing some text. BUT I don't think this proposal is a good option. If I have to use my finger to switch to punctuation & symbol mode, why not insert the required string directly? There's plenty of space around the mic symbol to add these strings and have them ready to use in one tap. BUT the main point of using an STT app is to use it hands-free, isn't it? With that in mind, managing punctuation & symbols that way is not a good option IMHO. I would suggest another idea: when the app detects a pause (100 ms, editable in settings), the next word said is treated as a punctuation mark or symbol. These strings are defined in a list, so they are easier to identify. That could simplify punctuation detection but could also expand the app's possibilities: users could add keywords that expand to a full sentence, for example, like the "Text insert" Thunderbird module. I don't know how hard it would be to implement, but I think it's an option worth considering. Enjoy!
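The pause-triggered keyword idea above can be sketched in a few lines, assuming the recognizer provides per-word timestamps (Vosk can expose word timing in its results). The threshold, keyword table, and tuple format here are illustrative assumptions, not actual Sayboard code:

```python
# Sketch of pause-triggered keyword substitution (illustrative only).
PAUSE_THRESHOLD = 0.5  # seconds; would be user-editable in settings

# Keyword list: only these words are eligible for substitution after a pause.
KEYWORDS = {"period": ".", "comma": ",", "question mark": "?"}

def apply_pause_keywords(words):
    """words: list of (text, start_time, end_time) tuples from the recognizer."""
    out = []
    prev_end = None
    for text, start, end in words:
        paused = prev_end is not None and (start - prev_end) >= PAUSE_THRESHOLD
        if paused and text in KEYWORDS:
            # A pause only gives priority to the keyword list; a word
            # not in the list is transcribed normally.
            out.append(KEYWORDS[text])
        else:
            out.append(text)
        prev_end = end
    return " ".join(out)
```

With this scheme, `apply_pause_keywords([("hello", 0.0, 0.4), ("world", 0.5, 0.9), ("period", 2.0, 2.4)])` yields "hello world ." because the long gap before "period" triggers the substitution, while a "period" spoken mid-flow stays a word.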
I would strongly advise against anything that's pause-based. People need time to pause and think while typing, like, most of the time. And since voice typing is the fastest input method, that fact is more pertinent here than in any other circumstance. One of the central fallacies in mainstream voice input is the idea that the user will speak to the receiver in natural language with natural cadence. Exhibit A is how Google's voice input shuts off, without being asked to, after what seems like a split-second of silence. This is especially frustrating from an accessibility standpoint for those who struggle with speaking quickly or 'naturally.' The purpose of voice input is not to be as hands-off as possible; the purpose is to be the fastest input method, and to be precise and accessible in doing so. We don't need to pretend the user's fingers disappear whenever voice typing is engaged.
I agree with you: speaking to a device is not as flawless as speaking to a person. But I wasn't clear enough in my previous comment. If I have to use my finger to select punctuation, I'd rather write the whole text by hand; it's faster and less frustrating in my case.
What about a non-verbal utterance like a tongue click (like "tsk") to switch to "punctuation mode"? Anyway, in my use case, a few punctuation buttons on the screen would be enough. ...And they would make the app actually usable, which it currently is not!
That's an interesting idea, I think. |
Just the ability, aside from punctuation, to spell words (maybe using the NATO alphabet) would be nice, for when it just isn't getting the word you're trying to say and you don't want to switch to typing.
I'm also impressed with this keyboard, just to put that in there. I came to the repo to create such a request because, as it is, the keyboard is really not useful. That said, GBoard does an amazing job at transcription and even handles punctuation almost flawlessly, although I'm sure they have machine learning and are considering context. If considering context is too hard, for the time being it seems reasonable that a short pause of some kind, with an associated list of keywords, would be more acceptable than touching the device. In my case, I almost never touch my phone to use the keyboard; instead I just use the voice dictation features of GBoard, which supports the point of freeing up your mind to speak and not worry about typing.
@morenathan to be fair, you can't really compare Google's speech recognition that is done over the cloud, with offline speech recognition done on-device, even apart from the other advantages Google has... Sayboard is great but it's mainly an interface to Vosk, and apart from Vosk, there aren't really (m)any open source speech recognition engines that can work on a phone. For example, OpenAI's Whisper is a great one that does punctuation, but it's not realtime even on my relatively powerful computer! |
@LuccoJ Yes, I realize that. As I stated, I'm sure they're using machine learning and contextual syntax parsing. And actually, with the newer phones much of that can be processed on-device, but they have a model with an unfair advantage. My Pixel 7 experience mirrors this article almost exactly. Without this type of hardware, research, and billions of people's input to train on, I'm left with my experience and what I said earlier: "if considering context is too hard," a "short pause ... with an associated list of keywords would be more acceptable than touching the device."
@morenathan: that's mainly what I talked about 3 weeks ago. ^_^ Definitely, a "user audio dictionary" (i.e. vocal keywords associated with strings), along with a short pause, would be a killer feature. As a reminder, the short pause only gives priority to the keyword list: if the word spoken after a pause is not in that list, the application works with its normal speech-to-text behavior. That said, it is probably harder to implement such a feature. Enjoy!
Yeah, I read through the whole dialog and thought, "This is probably the best option." In order to provide any of those features, you really need to understand the context of the sentence for every type of speaker, unless there is some other way of doing it. And without significant hardware supporting the process, continuous dictation is hard to impossible. That said, I've just started playing with Vosk myself on Linux. Maybe at some point I can offer something to this project, because it's something I might put to use on other Android devices (looking at you, Samsung!).
@morenathan @ElishaAz: I just remembered that my Pebble Time was really efficient at speech-to-text, at least in French, and particularly with punctuation. Maybe there is some available stuff there that could be used in this project. Enjoy!
What about saying the word twice? "I am going to the museum period period" = "I am going to the museum."
@unoukujou: In that case, I would choose the other way around: saying the word twice uses the word, and saying it once inserts the special character. I almost never want to write "comma", but "," instead. 😉
While I often write the words "period" and, believe it or not, "colon". We should also take into account how Vosk uses context: it may try to turn words like "comma" into something that makes more sense within the sentence, and when words are said separately, I have very bad luck getting it to interpret individual, contextless words correctly (may just be my bad pronunciation, though). I think repeating words may throw Vosk off even more.
Sure, either way can get the job done. Perhaps even a setting to have it both ways, in case someone prefers the other one.
This is a very interesting thread to read. As a more basic user who doesn't mind interacting with a keyboard while dictating, I like the original idea of a button to switch context, since that's what I'm doing now as I write this comment: adding in punctuation manually, since we have a keyboard that allows such behavior. It would be similar to adding punctuation simply by touching the button and saying what punctuation you want. However, I see the problem pointed out where it would be better to just say a phrase or word that switches to punctuation. As it stands now with the current keyboard, I'm pretty happy using Sayboard this way with the punctuation keyboard, because I've fitted numbers and the most-used punctuation into one keyboard.
Hello, |
First of all, I want to say how impressed I am with this app already. Finally having a functional FOSS voice IME is something I've been longing for for a while now. I'm very grateful for the work you're doing here and would love to contribute some time once I get better at writing code.
There are a few issues already posted relating to how the models don't yet parse punctuations or other symbols like numbers and such. The current proposed solutions I've seen include:
Regardless of the approach taken, I think something relatively simple could be implemented in the meantime that would be a long-term improvement and advantage for the project compared to more traditional voice IMEs: a spelling mode.
The button would functionally be akin to the to-symbols key on traditional mobile keyboards. What it would do is switch the recognition model to one that only listens for character-by-character utterances and transcribes them in symbol form, e.g. "space" → " ", "eff" → "f", "period" → ".", "three" → "3", "hash" → "#", "right brace" → "}", "slash" → "/", etc. This functionality would reduce the urgency of making the default models "smarter" (and more bloated, I would guess?) as well as provide a more precise UX than any other voice IME ever made, where I always find myself faced with "welp, time to go back to the regular keyboard because x punctuation symbol/undocumented word/non-capitalization isn't supported."
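For what it's worth, Vosk can restrict recognition to a fixed vocabulary by passing a JSON word list as a grammar to the recognizer, which is roughly what a spelling mode needs. A hypothetical sketch of the lexicon and grammar (the table and function names are my own assumptions, not project code):

```python
import json

# Hypothetical spelling-mode lexicon: utterance -> emitted character.
SPELLING_LEXICON = {
    "space": " ",
    "eff": "f",
    "period": ".",
    "three": "3",
    "hash": "#",
    "right brace": "}",
    "slash": "/",
}

def spelling_grammar():
    """JSON word list in the format Vosk's KaldiRecognizer accepts as a
    grammar, limiting recognition to the spelling-mode utterances."""
    return json.dumps(sorted(SPELLING_LEXICON))

def transcribe_spelled(utterances):
    """Map a sequence of recognized utterances to literal characters."""
    return "".join(SPELLING_LEXICON[u] for u in utterances)
```

So speaking "eff", "three", "period" in spelling mode would come out as the literal text `f3.`, with the grammar restriction keeping the recognizer from guessing unrelated words.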
Personally, I like the idea of keeping the default model as a words-only one, as it keeps the distinction between 'three' & '3' and whatnot as precise as possible, but I can also imagine a compromise of a 'smarter' default engine with spelling and words-only modes for more precision. A shift-key would also be nice.
I don't know the logistics of switching models mid-transcription, but what I imagine is that pressing the button places a 'break' marker in the recording that says "once you reach this point, change how you transcribe the audio."
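One way to picture that 'break' idea is to treat the transcript as a stream of events in which a marker toggles between word mode and a symbol table. A rough sketch, with the sentinel and symbol table as illustrative assumptions:

```python
# Sentinel emitted into the event stream when the mode button is pressed.
MODE_SWITCH = object()

# Symbol table used while in spelling/symbol mode (illustrative subset).
SYMBOLS = {"period": ".", "comma": ",", "hash": "#"}

def render(events):
    """events: recognized words interleaved with MODE_SWITCH markers.
    Words seen after an odd number of markers are looked up in SYMBOLS;
    unknown words fall through unchanged."""
    out, symbol_mode = [], False
    for ev in events:
        if ev is MODE_SWITCH:
            symbol_mode = not symbol_mode  # toggle transcription mode
        elif symbol_mode:
            out.append(SYMBOLS.get(ev, ev))
        else:
            out.append(ev)
    return " ".join(out)
```

For example, `render(["see", "you", MODE_SWITCH, "period"])` yields "see you ." since the final utterance is interpreted through the symbol table rather than as the word "period".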
Just food for thought :)