Delay speech-synthesis functions #127

Merged: 11 commits merged into WICG:main on Feb 1, 2022

Conversation

@noamr (Collaborator) commented Jan 30, 2022

See https://wicg.github.io/speech-api/#speechrecognition and https://wicg.github.io/speech-api/#tts-section.
Note that that API doesn't currently handle anything to do with document focus, which should be fixed separately.

@domenic (Collaborator) left a comment

Do we think this delay-everything model is the right one? I could see a few possibilities:

  • Get no-op behavior for free based on document focus or user activation. (The speech synthesis spec doesn't seem to require these right now, but maybe implementations do?)
  • Start in the paused state, and auto-resume upon activation.
  • Be a bit smarter than just delaying. E.g., Allow cancel() while prerendering; allow pausing and resuming; allow using speak() to enqueue things; just avoid actually speaking.

For SpeechRecognition it seems more likely that delay-everything is correct, or maybe it should just auto-fail based on user activation/document focus.

Any idea what our implementation does?
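
For concreteness, here is a rough, non-normative sketch of what the delay-everything model amounts to from a page's point of view, written as the manual equivalent a page could do today. It assumes the `document.prerendering` flag and `prerenderingchange` event from the prerendering spec; none of this is proposed normative text.

```js
// Illustrative only: the page-side equivalent of "delay everything".
// While the document is prerendering, the call is deferred and runs once upon activation.
function speakWhenActivated(text) {
  const utterance = new SpeechSynthesisUtterance(text);
  if (document.prerendering) {
    document.addEventListener('prerenderingchange', () => {
      speechSynthesis.speak(utterance);
    }, { once: true });
  } else {
    speechSynthesis.speak(utterance);
  }
}
```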

@noamr (Collaborator, Author) commented Jan 31, 2022

Do we think this delay-everything model is the right one? I could see a few possibilities:

  • Get no-op behavior for free based on document focus or user activation. (The speech synthesis spec doesn't seem to require these right now, but maybe implementations do?)

  • Start in the paused state, and auto-resume upon activation.

  • Be a bit smarter than just delaying. E.g., Allow cancel() while prerendering; allow pausing and resuming; allow using speak() to enqueue things; just avoid actually speaking.

Yeah, I see your point... The problem with all of these suggestions, and the reason I went with something much more basic, is that the speech synthesis spec doesn't say anything about multiple clients. I feel that's a conversation that should start on that spec's GitHub regardless of prerendering; I wasn't sure whether to create a dependency, but maybe that's the right thing to do.

For SpeechRecognition it seems more likely that delay-everything is correct, or maybe it should just auto-fail based on user activation/document focus.

Any idea what our implementation does?

@domenic (Collaborator) commented Jan 31, 2022

Yeah I agree we don't want to take on too large of a dependency here; we're not responsible for solving all the spec tech debt in everything we touch. IMO the right tradeoff here is:

  • Investigate if we have any easy-outs, e.g. user activation or focus requirements that just aren't specced currently. If so, adding those to the speech spec seems like a reasonable amount of tech debt for us to fix while we're here.

  • If we don't have any easy-outs, then just make sure what we spec here either matches the Chromium implementation, or is reasonably easy to implement and we have some agreement to do so. We shouldn't knowingly spec something we don't plan to implement.

@noamr (Collaborator, Author) commented Feb 1, 2022

Yeah I agree we don't want to take on too large of a dependency here; we're not responsible for solving all the spec tech debt in everything we touch. IMO the right tradeoff here is:

  • Investigate if we have any easy-outs, e.g. user activation or focus requirements that just aren't specced currently. If so, adding those to the speech spec seems like a reasonable amount of tech debt for us to fix while we're here.

The activation/focus gate is currently [in discussion](https://github.com/WebAudio/web-speech-api/issues/35), and doesn't work the same across browsers. Firefox requires focus; Chrome uses the autoplay rules, which should result in a not-allowed error when trying to speak before the page is activated; and WebKit requires a user gesture, but only on iOS. When opening a new unfocused tab that uses speech synthesis on desktop, Firefox delays the speech but Chrome/Safari don't. I believe the Firefox implementation is the closest to how I would expect this to behave when prerendering (the page appears normal, but some things only happen once you activate it).
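
As a concrete illustration of the Chrome behaviour described above: as I understand it, the gate surfaces as an error event with code "not-allowed" on the utterance rather than as a thrown exception, so a page might see something like the sketch below. The retry-on-click part is just a hypothetical recovery strategy, not anything either spec requires.

```js
// Sketch: observing the activation/autoplay-style gate on speak() before activation.
const utterance = new SpeechSynthesisUtterance('Hello');
utterance.onerror = (event) => {
  if (event.error === 'not-allowed') {
    // Speaking was blocked; retry after a user gesture (illustrative recovery only).
    document.addEventListener('click', () => speechSynthesis.speak(utterance), { once: true });
  }
};
speechSynthesis.speak(utterance);
```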

  • If we don't have any easy-outs, then just make sure what we spec here either matches the Chromium implementation, or is reasonably easy to implement and we have some agreement to do so. We shouldn't knowingly spec something we don't plan to implement.

Currently the implementation would produce a not-allowed error, I believe. @nyaxt, can you confirm? This is problematic, as prerendering would cause pages with speech synthesis to reach an error branch that they wouldn't reach with regular rendering.

I'm becoming more convinced that the most straightforward solution is a simple DelayWhilePrerendering. If we do that for most less-common web features, and forgo some possible subtleties at least in the first phase, there's a better chance developers will understand the prerendering mechanism and how to work with it; if every feature behaves slightly differently, it would be confusing, especially for a new feature.
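
To make the intended behaviour concrete, a rough sketch of the observable timeline under a blanket delay-while-prerendering rule (non-normative; assumes document.prerendering and prerenderingchange from the prerendering spec):

```js
// Hypothetical timeline with "delay while prerendering":
// speak() in a prerendered document neither speaks nor fires a "not-allowed" error;
// the delayed method steps run only once the document is activated.
const u = new SpeechSynthesisUtterance('Welcome');
u.onerror = (e) => console.log('error:', e.error); // should not fire merely because we are prerendering
u.onstart = () => console.log('started speaking'); // fires only after activation
speechSynthesis.speak(u);

document.addEventListener('prerenderingchange', () => {
  console.log('activated; document.prerendering is now', document.prerendering); // false
}, { once: true });
```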

@domenic (Collaborator) commented Feb 1, 2022

I think I agree with that reasoning. I am slightly concerned [DelayWhilePrerendering] will be harder to implement than it is to spec, but we're already all-in on using it everywhere else, so I hope @nyaxt can agree to it as a general strategy :).

Given the underspecification of all the methods you've decorated, it's hard to review this thoroughly and be sure that delaying them will work as expected. (I.e., it seems like it would require some sort of queue of actions, which is implicit in the speech spec but not explicit.) But this is probably good enough.
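
For what it's worth, a minimal sketch of the kind of action queue this implies, purely illustrative and not part of either spec's text: each delayed call is recorded while prerendering and replayed in order on activation, so speak()/pause()/cancel() keep their relative ordering.

```js
// Minimal sketch of an explicit "queue of actions" for delayed speech calls.
const deferredCalls = [];

function delayWhilePrerendering(fn) {
  if (document.prerendering) {
    deferredCalls.push(fn); // record the call while prerendering
  } else {
    fn();                   // otherwise run it immediately
  }
}

document.addEventListener('prerenderingchange', () => {
  // Flush in FIFO order so the calls keep their original relative ordering.
  while (deferredCalls.length) deferredCalls.shift()();
}, { once: true });

// Usage:
// delayWhilePrerendering(() => speechSynthesis.speak(new SpeechSynthesisUtterance('hi')));
```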

@domenic merged commit 3bc55b3 into WICG:main on Feb 1, 2022