Automated testing tools for Automatic Speech Recognition (ASR) systems uncover failing test cases by generating audio from a corpus of text. ASDF is a differential testing framework for ASR systems that builds on CrossASR++, an existing ASR testing tool that automates audio test case generation, and extends it with differential testing methods. CrossASR++ selects texts from an input text corpus, converts them into audio using a Text-to-Speech (TTS) service, and uses the audio to test the ASR systems. However, the quality of these tests depends heavily on the quality of the supplied text corpus, and the corpus's limited variation and vocabulary may leave underlying weaknesses of the ASRs undiscovered.
ASDF extends CrossASR++ by applying text transformation methods to the original text inputs that failed the CrossASR++ testing process, effectively creating new high-quality test cases for ASR systems. The text transformation methods used are homophone transformation, augmentation, plurality transformation, tense transformation and adjacent deletion. We also improved analysis by providing high-level summaries of failed test cases and a graph of phonemes in the text input against their frequency of failure. Finally, we improved the tool's usability by adding a CLI.
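The overall flow can be summarised in a short sketch. The helper objects (`tts`, `asrs`, `transformers`) and their methods are hypothetical stand-ins for the actual CrossASR++/ASDF components, and exact string comparison stands in for the tool's real transcription matching:

```python
# Minimal sketch of the ASDF differential testing loop.
# All helpers (tts.synthesize, asr.transcribe, transformer.transform)
# are hypothetical stand-ins, not the real CrossASR++ API.

def differential_test(corpus, tts, asrs, transformers):
    failed_texts = []
    for text in corpus:
        audio = tts.synthesize(text)  # TTS: text -> audio test case
        transcripts = {asr.name: asr.transcribe(audio) for asr in asrs}
        # A text is determinable only if at least one ASR transcribes it correctly.
        if all(t != text for t in transcripts.values()):
            continue  # indeterminable: discard
        if any(t != text for t in transcripts.values()):
            failed_texts.append(text)  # failed by at least one ASR

    # Differential step: derive new test cases from the failed texts.
    new_cases = []
    for text in failed_texts:
        for transformer in transformers:
            new_cases.extend(transformer.transform(text))
    return failed_texts, new_cases
```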
ASDF's approach to differential testing of ASR systems generates an audio test suite using a TTS library. This assumes that the generated audio is interchangeable with real-world human speech, which may not hold in practice: human speech differs from computer-generated audio in pronunciation, accent, and tone. If human speech were used instead of computer-generated audio, these inconsistencies could yield different results in our differential testing, and variations between the accents of speakers from different regions of the world could affect the outcomes further. Future extensions of this research could therefore use real-world conversations instead. One possible source is Meta's Casual Conversations dataset, which provides real-world conversations as video files (.mp4); once converted to an audio file format (.mp3, .wav), these could feed into the testing process.
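For illustration, such a conversion could be done with ffmpeg; this is a minimal sketch that assumes ffmpeg is installed, and the file names are placeholders:

```python
# Extract a 16 kHz mono WAV track from an .mp4 file using ffmpeg.
# ffmpeg must be installed; the file paths are illustrative only.
import subprocess

def mp4_to_wav(mp4_path: str, wav_path: str) -> None:
    subprocess.run(
        ["ffmpeg", "-y", "-i", mp4_path,
         "-vn",           # drop the video stream
         "-ac", "1",      # mono
         "-ar", "16000",  # 16 kHz sample rate, common for ASR models
         wav_path],
        check=True,
    )

mp4_to_wav("conversation.mp4", "conversation.wav")
```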
Future extensions may include integrating a fully functional library that transforms the newly generated test cases into syntactically correct sentences. Such an implementation is difficult to configure and implement because English is inconsistent in its grammar and spelling; during the course of our research, we did not find a reliable, foolproof library that handles such text transformations consistently. If it were implemented, however, it would make the generated results more precise and considerably more reliable.
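One imperfect candidate for such post-processing is LanguageTool; the sketch below (not part of ASDF) runs a transformed sentence through its Python wrapper, which repairs many agreement errors but, as noted above, is not foolproof:

```python
# Possible (imperfect) post-processing: repair obvious grammatical errors
# in transformed sentences with LanguageTool.
# Requires `pip install language-tool-python` and a Java runtime.
import language_tool_python

tool = language_tool_python.LanguageTool("en-US")

def grammar_fix(sentence: str) -> str:
    # correct() applies LanguageTool's suggested replacements;
    # it is a heuristic, not a guarantee of grammaticality.
    return tool.correct(sentence)

print(grammar_fix("this had was did in no less than fifteen member states"))
```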
Furthermore, if appropriate, more sophisticated text transformation strategies can be incorporated into differential testing to test their effectiveness in revealing erroneous words. Such techniques include changes in grammar, changes in words, and modifications of sentence structure. It is important to note that not all text transformation techniques are viable, as some may have low effectiveness metrics (percentage of failed text inputs and percentage of failed test cases). It is therefore imperative to choose an effective text transformation technique as the ideal transformer. The implemented transformers behave as follows:
- Homophone Transformation: Identifies the error-inducing term of the failed test case. A homophone of the error-inducing term is then found, and an example sentence containing the homophone is obtained through an online dictionary API; that example sentence becomes the new test case. Only transforms texts whose error-inducing term has a homophone. Each homophone produces a new test case.
- Augmentation: Randomly inserts words into the failed test case through contextual word embedding. After insertion, the sentence is used as the new test case. Does not use the error-inducing term in its transformation process.
- Adjacent Deletion: Identifies the error-inducing term of the failed test case, then deletes the words adjacent to it in the sentence; the resulting sentence is used as the new test case. Produces one test case per error-inducing term, each deleting the words adjacent to a single error-inducing term (see the sketch after this list).
- Plurality Transformation: Identifies the error-inducing terms of the failed test case and converts them from singular nouns to plural nouns. Only transforms texts whose error-inducing term is a noun. Produces a single test case with all error-inducing nouns transformed.
- Tense Transformation: Identifies the verbs that express the grammatical tense and changes them to the past tense. Does not use the error-inducing term in its transformation process. Produces a single test case with all such verbs transformed.
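As referenced above, Adjacent Deletion is the simplest transformer to illustrate; the following is a minimal sketch (hypothetical function, not ASDF's actual implementation) that reproduces the Table II example below:

```python
# Sketch of the Adjacent Deletion transformer: given a failed text and
# its error-inducing terms, produce one new test case per term by
# deleting the words immediately adjacent to that term.

def adjacent_deletion(text: str, error_terms: list[str]) -> list[str]:
    words = text.split()
    cases = []
    for term in error_terms:
        if term not in words:
            continue
        i = words.index(term)
        # Keep every word except the immediate left and right neighbours.
        kept = [w for j, w in enumerate(words) if j not in (i - 1, i + 1)]
        cases.append(" ".join(kept))
    return cases

# For the first error term "in", neighbours "done" and "no" are deleted:
print(adjacent_deletion(
    "this has been done in no less than fifteen member states "
    "and through the medium of twelve community languages",
    ["in"],
))
```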
Table I: Original text input

| Original Text | Error Terms |
|---|---|
| this has been done in no less than fifteen member states and through the medium of twelve community languages | in, fifteen, member |
Table II: Text transformation methods and transformed texts

| Text Transformation Method | Affected Terms | Transformed Text |
|---|---|---|
| Homophone Transformation | First error term homophone transformation: in | the inns of court the inns of chancery serjeants inns |
| Augmentation | - | this service has been being done in no less than his fifteen member states and varies through the medium of nearly twelve community spoken languages |
| Adjacent Deletion | First error term adjacent deletion: done, no | this has been in less than fifteen member states and through the medium of twelve community languages |
| Plurality Transformation | All error-inducing nouns: in, fifteen, member | this has been done ins no less than fifteens members states and through the medium of twelve community languages |
| Tense Transformation | All verbs: has, been, done | this had was did in no less than fifteen member states and through the medium of twelve community languages |
Note: For Homophone Transformation and Adjacent Deletion, additional test cases can be created from the other error-inducing terms. However, fifteen and member have no homophones, so Homophone Transformation produces only a single test case (for in). For Adjacent Deletion, the other two test cases are not shown.
- Set up WSL to use Linux in Windows. Guide
- To access Windows files in WSL: WSL mounts your machine's fixed drives under the /mnt/ folder in your Linux distros. For example, your C: drive is mounted under /mnt/c/
- It is recommended to store your local repo under your Linux distro home directory instead of /mnt/... because differing file naming systems will cause issues (e.g. filenames are not case-sensitive in the Windows file system)
- To connect to the Internet, change the `nameserver` in `/etc/resolv.conf` to 8.8.8.8. StackOverflow
- `/etc/resolv.conf` will reset every time you restart WSL. To solve this issue, look at this post.
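A commonly used fix (stated here as an assumption; the linked post may describe a different approach) is to stop WSL from regenerating the file by adding the following to `/etc/wsl.conf`, then recreating `/etc/resolv.conf` with the desired nameserver:

```ini
# /etc/wsl.conf — prevents WSL from overwriting /etc/resolv.conf on restart
[network]
generateResolvConf = false
```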
- Download Docker Desktop
- If you are using WSL, set up Docker Desktop for Windows with WSL 2. Guide
- Go to the `examples` directory using `cd asdf-differential-testing/CrossASRplus/examples`
- Execute the `start.bat` batch file
- Follow the prompts to customise execution
For certain selections, the default settings are used if not specified by the user:
- Input corpus file path: `CrossASRplus/examples/corpus/50-europarl-20000.txt`
- Output file path: `CrossASRplus/examples/output/default`
- ASRs: `deepspeech`, `wav2letter`, `wav2vec`
- Number of texts to be processed: 100
Note: Setup and usage have only been tested on Windows running WSL.
Once testing is completed, a folder containing the results can be found within the output directory specified by user input. If none is specified, it defaults to `CrossASRplus/examples/output/default`.
Four folders are created within this directory:

- `case`: Contains nested folders of the format `<tts_name>/<asr_name>/<test_case_no>.txt` containing text files indicating the result of each test case for each individual ASR. `0` indicates an indeterminable test case, `1` indicates a failed test case and `2` indicates a successful test case.
- `data`: Contains two nested folders of the format `audio/<tts_name>` and `transcription/<tts_name>/<asr_name>`, which hold the audio files produced from the TTS and the transcriptions of the audio by each individual ASR respectively.
- `execution_time`: Contains two nested folders of the format `audio/<tts_name>` and `transcription/<tts_name>/<asr_name>/<test_case_no>`, which contain text files indicating the time taken to produce each test case's audio and transcription by each ASR respectively.
- `result`: Contains a nested folder of the format `<tts_name>/<asr_name_1_asr_name_2_...asr_name_n>/num_iteration_2/text_batch_size_global`, which contains six files:
  - `all_test_cases.json`: JSON file with a key for each iteration containing an array value detailing the outcome of each text input.
  - `indeterminable.json`: JSON file detailing the statistics of indeterminable test cases.
  - `without_estimator.json`: JSON file detailing the statistics of failed test cases.
  - `failed_test_cases_analysis.txt`: Text file detailing the performance of the ASRs using ten different metrics.
  - `phoneme_graph.pdf`: PDF file graphing each phoneme within the processed text corpus against its frequency of failure.
  - `asr_comparison.csv`: CSV file containing two tables for self-analysis: failed cases per ASR and failed cases per text input.
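Given the layout above, the `case` folder can be post-processed with a few lines of Python; this sketch assumes each case file holds a single digit as described, and the output path is illustrative:

```python
# Tally test-case outcomes per ASR from the `case` folder
# (0 = indeterminable, 1 = failed, 2 = successful).
# Layout assumed: <output>/case/<tts_name>/<asr_name>/<test_case_no>.txt
from collections import Counter
from pathlib import Path

def tally_cases(output_dir: str) -> dict[str, Counter]:
    results: dict[str, Counter] = {}
    for case_file in Path(output_dir, "case").glob("*/*/*.txt"):
        asr_name = case_file.parent.name
        outcome = case_file.read_text().strip()
        results.setdefault(asr_name, Counter())[outcome] += 1
    return results

for asr, counts in tally_cases("CrossASRplus/examples/output/default").items():
    print(asr, "failed:", counts["1"], "successful:", counts["2"],
          "indeterminable:", counts["0"])
```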
Ten metrics can be found within `failed_test_cases_analysis.txt`. A failed text is defined as an input text that was incorrectly transcribed by at least one ASR service. A text input must be transcribed correctly by at least one ASR service to be deemed valid (or determinable); otherwise, it is considered indeterminable and discarded. A failed case, on the other hand, is defined as a specific text output from an individual ASR service that does not match its corresponding input text.
Table III: Explanation of each metric

| Metric | Explanation |
|---|---|
| `total_corpus_text` | Total number of texts in the corpus. |
| `total_transformed_text` | Total number of texts generated by the text transformer. |
| `total_corpus_failed_text` | Total number of texts in the corpus that fail to be transcribed by any of the ASRs. |
| `total_transformed_failed_text` | Total number of transformed texts that fail to be transcribed by any of the ASRs. |
| `total_corpus_failed_cases` | The sum of the count of failed test cases from texts in the corpus from each ASR. |
| `total_transformed_failed_cases` | The sum of the count of transformed failed test cases from each ASR. |
| `corpus_failed_text_percentage` | `total_corpus_failed_text` as a percentage of `total_corpus_text`. |
| `transformed_failed_text_percentage` | `total_transformed_failed_text` as a percentage of `total_transformed_text`. |
| `corpus_failed_cases_percentage` | `total_corpus_failed_cases` as a percentage of the total number of corpus test cases. |
| `transformed_failed_cases_percentage` | `total_transformed_failed_cases` as a percentage of the total number of transformed test cases. |
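As a quick sanity check of how the percentage metrics relate to the counts, the sketch below computes them from made-up example numbers. It assumes each text yields one test case per ASR; the generated report file remains authoritative for the exact denominators:

```python
# Illustrative reconstruction of the four percentage metrics.
# All counts below are made-up example values, not real results.
num_asrs = 3
total_corpus_text = 100
total_corpus_failed_text = 40
total_corpus_failed_cases = 65
total_transformed_text = 250
total_transformed_failed_text = 120
total_transformed_failed_cases = 190

def percentage(part: int, whole: int) -> float:
    return 100.0 * part / whole if whole else 0.0

corpus_failed_text_percentage = percentage(
    total_corpus_failed_text, total_corpus_text)
transformed_failed_text_percentage = percentage(
    total_transformed_failed_text, total_transformed_text)
# Assumes one test case per (text, ASR) pair.
corpus_failed_cases_percentage = percentage(
    total_corpus_failed_cases, total_corpus_text * num_asrs)
transformed_failed_cases_percentage = percentage(
    total_transformed_failed_cases, total_transformed_text * num_asrs)

print(corpus_failed_text_percentage, corpus_failed_cases_percentage)
```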
Some analyses that can be done by graphing the data in the CSV file are as follows:
- Example 1: Failed test cases per ASR, where ASRs that are more failure-prone can be identified
- Example 2: Failed test cases per text input, where areas of the input corpus that cause the most failures can be analysed
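For instance, Example 1 could be produced with pandas and matplotlib. The column names `asr` and `failed_cases` below are hypothetical and should be replaced with the actual headers in `asr_comparison.csv`; since that file holds both tables, some manual splitting may also be needed:

```python
# Sketch of Example 1: bar chart of failed test cases per ASR.
# Column names are hypothetical; check asr_comparison.csv for the real ones.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("asr_comparison.csv")
df.plot.bar(x="asr", y="failed_cases", legend=False)
plt.ylabel("Failed test cases")
plt.title("Failed test cases per ASR")
plt.tight_layout()
plt.savefig("failed_cases_per_asr.png")
```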