Create speech-recognition-context.md #140

Open · wants to merge 1 commit into main

Conversation

yrw-google commented:

Add an explainer for the new speech recognition context feature

@evanbliu requested a review from padenot · February 19, 2025 23:36
@padenot (Member) left a comment:

A very welcome addition, thanks for the initial design. Let's iterate a bit on the API shape, but I'm excited about this.



### 2. **Enhanced Relevance**
By incorporating contextual information, the speech recognition models can produce transcriptions that are more meaningful and aligned with the user's intent. This leads to better understanding of the spoken content and more accurate execution of voice commands.
padenot (Member):

s/intent/expectations/. Or maybe something more correct; I don't know how best to phrase this. But the user doesn't have an intent here, they receive a result.

yrw-google (Author):

I rewrote it to say "better align with the user's expectations of the output".



### 1. **SpeechRecognitionPhrase**
This interface holds fundamental information about a biasing phrase, such as a text string and a boost value indicating how likely the phrase is to appear.
padenot (Member):

A WebIDL strawman usually helps. Is this just an alias for DOMString?

yrw-google (Author):

I've added IDL definitions for these interfaces; please take a look again.
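The IDL itself isn't reproduced in this thread; as a rough usage sketch based only on the constructor shape that appears in the examples below (the `phrase` and `boost` attribute names are illustrative assumptions, not the final API):

```js
// Sketch only: the constructor takes a phrase string and a numeric
// boost, matching the examples later in this thread.
const phrase = new SpeechRecognitionPhrase("contextual biasing", 2.0);

// Attribute names below are assumptions for illustration.
console.log(phrase.phrase); // "contextual biasing"
console.log(phrase.boost);  // 2.0
```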



### 2. **SpeechRecognitionPhraseList**
This interface holds a list of `SpeechRecognitionPhrase` objects and supports adding more phrases to the list.
padenot (Member):

Is this just `sequence<SpeechRecognitionPhrase>`?

@yrw-google (Author) commented on Feb 21, 2025:

I experimented with using sequence a bit. I think it is feasible to put `sequence<SpeechRecognitionPhrase>` inside `SpeechRecognitionContext` and get rid of `SpeechRecognitionPhraseList`, but that means we would need to move methods like `addItem` and the definition of `length` (which blink/v8 requires when it detects an array-like) from `SpeechRecognitionPhraseList` into `SpeechRecognitionContext` too. In that case, if we add more types of data to `SpeechRecognitionContext` in the future, it might become confusing, and we wouldn't be able to support another array inside `SpeechRecognitionContext` because the `length` definition would have to be duplicated.

From the IDL examples I can find, it seems to be common practice to create a new `ObjectList` interface to support a new `Object` interface (e.g. `DataTransferItemList`), while `sequence` is used more often in a dictionary. I'm not sure whether we should make `SpeechRecognitionPhrase` and `SpeechRecognitionContext` dictionaries instead, so that their relationship would be as simple as `SpeechRecognitionContext` containing a sequence of `SpeechRecognitionPhrase`. We want to perform data validation on each `SpeechRecognitionPhrase`, so using a dictionary may oversimplify things? Let me know what you think!
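To make the two candidate shapes concrete, here is a side-by-side sketch; both are proposals from this thread rather than a settled API, and the dictionary keys follow the suggestion further below:

```js
const recognition = new SpeechRecognition();

// Shape 1: dedicated interfaces, as in the current explainer draft.
// Each SpeechRecognitionPhrase can validate its own data eagerly.
const list = new SpeechRecognitionPhraseList();
list.addItem(new SpeechRecognitionPhrase("updated text", 2.0));
recognition.updateContext(new SpeechRecognitionContext(list));

// Shape 2: a sequence of plain dictionaries, as suggested below.
// Less ceremony, but validation can only happen once updateContext()
// receives the values.
recognition.updateContext([{ text: "updated text", weight: 2.0 }]);
```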

Comment on lines 70 to 108
```js
const recognition = new SpeechRecognition();
recognition.start();
var list = new SpeechRecognitionPhraseList();
list.addItem(new SpeechRecognitionPhrase("updated text", 2.0));
var context = new SpeechRecognitionContext(list);
recognition.updateContext(context);
```
padenot (Member):

What do we lose by simply doing:

Suggested change:

```diff
 const recognition = new SpeechRecognition();
 recognition.start();
-var list = new SpeechRecognitionPhraseList();
-list.addItem(new SpeechRecognitionPhrase("updated text", 2.0));
-var context = new SpeechRecognitionContext(list);
-recognition.updateContext(context);
+var list = [{text: "update text", weight: 2.0}];
+recognition.updateContext(list);
```

padenot (Member):

The same interface simplification goes for the other example.

yrw-google (Author):

If you still want to simplify things like this, can you tell me how the IDL would look in this case? We also want to perform data validation on each `SpeechRecognitionPhrase`, so using `new SpeechRecognitionPhrase()` helps us throw an error as we validate there; otherwise we would need to hold off validation until `updateContext()` is called?
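A sketch of the timing difference being described here, assuming purely for illustration that a negative boost is invalid and rejected:

```js
// With the interface shape, invalid data can throw immediately:
try {
  // Hypothetical: a negative boost rejected by the constructor.
  const bad = new SpeechRecognitionPhrase("hello world", -1.0);
} catch (e) {
  console.log("Rejected at construction time:", e);
}

// With plain dictionaries, the same mistake can only surface later:
const recognition = new SpeechRecognition();
try {
  recognition.updateContext([{ text: "hello world", weight: -1.0 }]);
} catch (e) {
  console.log("Rejected only when updateContext() runs:", e);
}
```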

```js
// (Handler wrapper reconstructed; the quoted excerpt showed only its body.)
recognition.onerror = (event) => {
  if (event.error == "recognition-context-not-supported") {
    console.log("Recognition context is not supported: ", event);
  }
};
```
padenot (Member):

Is this example leaving out the actual use of the context that would lead to the error? I'm not sure I follow.

yrw-google (Author):

I rewrote it and added more sample code to hopefully make it clear. Please take another look and let me know if additional explanation is needed!
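The rewritten sample code isn't reproduced in this thread; as an illustration of the flow that could surface the error above, here is a sketch assuming the interface shape from the explainer (the actual example in the PR may differ):

```js
const recognition = new SpeechRecognition();

// Register the error handler before supplying a context.
recognition.onerror = (event) => {
  if (event.error == "recognition-context-not-supported") {
    console.log("Recognition context is not supported: ", event);
  }
};

// Supplying a context is what can trigger the error: if the platform's
// recognizer cannot apply contextual biasing, the handler above fires.
const list = new SpeechRecognitionPhraseList();
list.addItem(new SpeechRecognitionPhrase("updated text", 2.0));
recognition.updateContext(new SpeechRecognitionContext(list));
recognition.start();
```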

Add an explainer for the new speech recognition context feature
yrw-google added a commit to yrw-google/web-speech-api that referenced this pull request Feb 24, 2025
Explainer for speech recognition context is added in WebAudio#140