-**Dataset**: LibriSpeech
+**Dataset**: LibriSpeech, Fluent Speech
## Superresolution
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_int8/README.md b/models/speech_recognition/tiny_wav2letter/tflite_int8/README.md
new file mode 100644
index 0000000..1a2c2c4
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_int8/README.md
@@ -0,0 +1,74 @@
+# Tiny Wav2letter INT8
+
+## Description
+Tiny Wav2letter is a tiny version of the original Wav2Letter model. It is a convolutional speech recognition neural network. This implementation was created by Arm, pruned to 50% sparsity, fine-tuned and quantized using the TensorFlow Model Optimization Toolkit.
+
+
+
+## License
+[Apache-2.0](https://spdx.org/licenses/Apache-2.0.html)
+
+## Network Information
+| Network Information | Value |
+|---------------------|----------------|
+| Framework | TensorFlow Lite |
+| SHA-1 Hash | 13ca2294ba4bbb1f1c6c5e663cb532d58cd76a6b |
+| Size (Bytes) | 3997112 |
+| Provenance | https://github.com/ARM-software/ML-zoo/tree/master/models/speech_recognition/wav2letter |
+| Paper | https://arxiv.org/abs/1609.03193 |
+
+## Performance
+
+| Platform | Optimized |
+|----------|:---------:|
+| Cortex-A |:heavy_check_mark: |
+| Cortex-M |:heavy_check_mark: |
+| Mali GPU |:heavy_multiplication_x: |
+| Ethos U |:heavy_check_mark: |
+
+### Key
+* :heavy_check_mark: - Will run on this platform.
+* :heavy_multiplication_x: - Will not run on this platform.
+
+## Accuracy
+Dataset: Fluent Speech (trained on LibriSpeech, Mini LibriSpeech, and Fluent Speech)
+
+Please note that the Fluent Speech dataset hosted on Kaggle is a licensed dataset.
+
+| Metric | Value |
+|--------|-------|
+| LER | 0.0348 |
+| WER | 0.1123 |
+
+## Optimizations
+| Optimization | Value |
+|--------------|---------|
+| Quantization | INT8 |
+
+## Network Inputs
+
+| Input Node Name | Shape | Description |
+|-----------------|-------|-------------|
+| input_1_int8 | (1, 296, 39) | Speech converted to MFCCs and quantized to INT8 |
+
+## Network Outputs
+
+| Output Node Name | Shape | Description |
+|------------------|-------|-------------|
+| Identity_int8 | (1, 1, 148, 29) | A tensor of time and class probabilities that represents the probability of each class at each timestep. Should be passed to a decoder, for example ctc_beam_search_decoder. |
+
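+A minimal decoding sketch (illustrative only; `logits` is assumed to hold the dequantized FP32 output of shape (1, 1, 148, 29)):
+
+```python
+import numpy as np
+import tensorflow as tf
+
+# Placeholder for a real dequantized model output of shape (1, 1, 148, 29)
+logits = np.random.randn(1, 1, 148, 29).astype(np.float32)
+
+# Drop the extra axis and move time to the front: (time, batch, classes)
+logits = tf.transpose(tf.squeeze(logits, axis=1), perm=[1, 0, 2])
+
+# Beam search decode; the last class (index 28) is treated as the CTC blank
+decoded, _ = tf.nn.ctc_beam_search_decoder(
+    logits, sequence_length=[logits.shape[0]], beam_width=100, top_paths=1
+)
+
+alphabet = "abcdefghijklmnopqrstuvwxyz' @"
+transcript = "".join(alphabet[i] for i in tf.sparse.to_dense(decoded[0]).numpy()[0])
+print(transcript)
+```
+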
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_int8/definition.yaml b/models/speech_recognition/tiny_wav2letter/tflite_int8/definition.yaml
new file mode 100644
index 0000000..6f59991
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_int8/definition.yaml
@@ -0,0 +1,60 @@
+author_notes: null
+benchmark:
+ benchmark_description: please note that fluent-speech-corpus dataset hosted on Kaggle
+ is a licensed dataset.
+ benchmark_link: https://www.kaggle.com/tommyngx/fluent-speech-corpus
+ benchmark_metrics:
+ LER: '0.0348'
+ WER: '0.1123'
+ benchmark_name: Fluent speech
+description: "Tiny Wav2letter is a tiny version of the original Wav2Letter model.\
+ \ It is a convolutional speech recognition neural network. This implementation was\
+ \ created by Arm, pruned to 50% sparsity, fine-tuned and quantized using the TensorFlow\
+ \ Model Optimization Toolkit.\r\n\r\n"
+license:
+- Apache-2.0
+network:
+ datatype: int8
+ file_size_bytes: 3997112
+ filename: tiny_wav2letter_int8.tflite
+ framework: TensorFlow Lite
+ framework_version: 2.4.1
+ hash:
+ algorithm: sha1
+ value: 13ca2294ba4bbb1f1c6c5e663cb532d58cd76a6b
+ provenance: https://github.com/ARM-software/ML-zoo/tree/master/models/speech_recognition/wav2letter
+  training: LibriSpeech, Mini LibriSpeech, Fluent Speech
+network_parameters:
+ input_nodes:
+ - description: Speech converted to MFCCs and quantized to INT8
+ example_input:
+ path: models/speech_recognition/tiny_wav2letter/tflite_int8/testing_input/input_1_int8
+ input_datatype: int8
+ name: input_1_int8
+ shape:
+ - 1
+ - 296
+ - 39
+ output_nodes:
+ - description: A tensor of time and class probabilities, that represents the probability
+ of each class at each timestep. Should be passed to a decoder. For example ctc_beam_search_decoder.
+ example_output:
+ path: models/speech_recognition/tiny_wav2letter/tflite_int8/testing_output/Identity_int8
+ name: Identity_int8
+ output_datatype: int8
+ shape:
+ - 1
+ - 1
+ - 148
+ - 29
+network_quality:
+ quality_level: Deployable
+ quality_level_hero_hw: null
+operators:
+ TensorFlow Lite:
+ - CONV_2D
+ - DEQUANTIZE
+ - LEAKY_RELU
+ - QUANTIZE
+ - RESHAPE
+paper: https://arxiv.org/abs/1609.03193
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_int8/demo_input/84-121550-0000.flac b/models/speech_recognition/tiny_wav2letter/tflite_int8/demo_input/84-121550-0000.flac
new file mode 100644
index 0000000..8dd88a8
Binary files /dev/null and b/models/speech_recognition/tiny_wav2letter/tflite_int8/demo_input/84-121550-0000.flac differ
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_int8/inference_demo.ipynb b/models/speech_recognition/tiny_wav2letter/tflite_int8/inference_demo.ipynb
new file mode 100644
index 0000000..5a43bf4
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_int8/inference_demo.ipynb
@@ -0,0 +1,323 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "source": [
+ "import multiprocessing\n",
+ "import tensorflow as tf\n",
+ "import librosa\n",
+ "import numpy as np\n",
+ "from jiwer import wer"
+ ],
+ "outputs": [],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "This is the audio file we are going to transcribe, as well as the ground truth transcription"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 92,
+ "source": [
+ "audio_file = 'demo_input/84-121550-0000.flac'\n",
+ "transcript = 'BUT WITH FULL RAVISHMENT THE HOURS OF PRIME SINGING RECEIVED THEY IN THE MIDST OF LEAVES THAT EVER BORE A BURDEN TO THEIR RHYMES'"
+ ],
+ "outputs": [],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We first convert the transcript into integers, as well as defining a reverse mapping for decoding the final output."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 101,
+ "source": [
+ "alphabet = \"abcdefghijklmnopqrstuvwxyz' @\"\n",
+ "alphabet_dict = {c: ind for (ind, c) in enumerate(alphabet)}\n",
+ "index_dict = {ind: c for (ind, c) in enumerate(alphabet)}\n",
+ "transcript_ints = [alphabet_dict[letter] for letter in transcript.lower()]\n",
+ "print(transcript_ints)"
+ ],
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "[1, 20, 19, 27, 22, 8, 19, 7, 27, 5, 20, 11, 11, 27, 17, 0, 21, 8, 18, 7, 12, 4, 13, 19, 27, 19, 7, 4, 27, 7, 14, 20, 17, 18, 27, 14, 5, 27, 15, 17, 8, 12, 4, 27, 18, 8, 13, 6, 8, 13, 6, 27, 17, 4, 2, 4, 8, 21, 4, 3, 27, 19, 7, 4, 24, 27, 8, 13, 27, 19, 7, 4, 27, 12, 8, 3, 18, 19, 27, 14, 5, 27, 11, 4, 0, 21, 4, 18, 27, 19, 7, 0, 19, 27, 4, 21, 4, 17, 27, 1, 14, 17, 4, 27, 0, 27, 1, 20, 17, 3, 4, 13, 27, 19, 14, 27, 19, 7, 4, 8, 17, 27, 17, 7, 24, 12, 4, 18]\n"
+ ]
+ }
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We then load the audio file and convert it to MFCCs (with an extra batch dimension)."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 70,
+ "source": [
+ "def normalize(values):\n",
+ " \"\"\"\n",
+ " Normalize values to mean 0 and std 1\n",
+ " \"\"\"\n",
+ " return (values - np.mean(values)) / np.std(values)\n",
+ "\n",
+ "def transform_audio_to_mfcc(audio_file, transcript, n_mfcc=13, n_fft=512, hop_length=160):\n",
+ " audio_data, sample_rate = librosa.load(audio_file, sr=16000)\n",
+ "\n",
+ " mfcc = librosa.feature.mfcc(audio_data, sr=sample_rate, n_mfcc=n_mfcc, n_fft=n_fft, hop_length=hop_length)\n",
+ "\n",
+ " # add derivatives and normalize\n",
+ " mfcc_delta = librosa.feature.delta(mfcc)\n",
+ " mfcc_delta2 = librosa.feature.delta(mfcc, order=2)\n",
+ " mfcc = np.concatenate((normalize(mfcc), normalize(mfcc_delta), normalize(mfcc_delta2)), axis=0)\n",
+ "\n",
+ " seq_length = mfcc.shape[1] // 2\n",
+ "\n",
+ " sequences = np.concatenate([[seq_length], transcript]).astype(np.int32)\n",
+ " sequences = np.expand_dims(sequences, 0)\n",
+ " mfcc_out = mfcc.T.astype(np.float32)\n",
+ " mfcc_out = np.expand_dims(mfcc_out, 0)\n",
+ "\n",
+ " return mfcc_out, sequences"
+ ],
+ "outputs": [],
+ "metadata": {}
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 33,
+ "source": [
+ "def log(std):\n",
+ " \"\"\"Log the given string to the standard output.\"\"\"\n",
+ " print(\"******* {}\".format(std), flush=True)"
+ ],
+ "outputs": [],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We use the ctc decoder to decode the output of the network"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 99,
+ "source": [
+ "def ctc_preparation(tensor, y_predict):\n",
+ " if len(y_predict.shape) == 4:\n",
+ " y_predict = tf.squeeze(y_predict, axis=1)\n",
+ " y_predict = tf.transpose(y_predict, (1, 0, 2))\n",
+ " sequence_lengths, labels = tensor[:, 0], tensor[:, 1:]\n",
+ " idx = tf.where(tf.not_equal(labels, 28))\n",
+ " sparse_labels = tf.SparseTensor(\n",
+ " idx, tf.gather_nd(labels, idx), tf.shape(labels, out_type=tf.int64)\n",
+ " )\n",
+ " return sparse_labels, sequence_lengths, y_predict\n",
+ "\n",
+ "def ctc_ler(y_true, y_predict):\n",
+ " sparse_labels, logit_length, y_predict = ctc_preparation(y_true, y_predict)\n",
+ " decoded, log_probabilities = tf.nn.ctc_greedy_decoder(\n",
+ " y_predict, tf.cast(logit_length, tf.int32), merge_repeated=True\n",
+ " )\n",
+ " return tf.reduce_mean(\n",
+ " tf.edit_distance(\n",
+ " tf.cast(decoded[0], tf.int32), tf.cast(sparse_labels, tf.int32)\n",
+ " ).numpy()\n",
+ " ), tf.sparse.to_dense(decoded[0]).numpy()\n",
+ "\n",
+ "def trans_int_to_string(trans_int):\n",
+ " #create dictionary int -> string (0 -> a 1 -> b)\n",
+ " string = \"\"\n",
+ " alphabet = \"abcdefghijklmnopqrstuvwxyz' @\"\n",
+ " alphabet_dict = {}\n",
+ " count = 0\n",
+ " for x in alphabet:\n",
+ " alphabet_dict[count] = x\n",
+ " count += 1\n",
+ " for letter in trans_int:\n",
+ " letter_np = np.array(letter).item(0)\n",
+ " if letter_np != 28:\n",
+ " string += alphabet_dict[letter_np]\n",
+ " return string\n",
+ "\n",
+ "def ctc_wer(y_true, y_predict):\n",
+ " sparse_labels, logit_length, y_predict = ctc_preparation(y_true, y_predict)\n",
+ " decoded, log_probabilities = tf.nn.ctc_greedy_decoder(\n",
+ " y_predict, tf.cast(logit_length, tf.int32), merge_repeated=True\n",
+ " )\n",
+ " true_sentence = tf.cast(sparse_labels.values, tf.int32)\n",
+ " return wer(str(trans_int_to_string(decoded[0].values)), str(trans_int_to_string(true_sentence)))"
+ ],
+ "outputs": [],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "The TFLite file requires inputs of size 296, so we apply a window to the input"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 94,
+ "source": [
+ "def evaluate_tflite(tflite_path, input_window_length = 296):\n",
+ " \"\"\"Evaluates tflite (fp32, int8).\"\"\"\n",
+ " results = []\n",
+ " data, label = transform_audio_to_mfcc(audio_file, transcript_ints)\n",
+ "\n",
+ " interpreter = tf.lite.Interpreter(model_path=tflite_path, num_threads=multiprocessing.cpu_count())\n",
+ " interpreter.allocate_tensors()\n",
+ " input_chunk = interpreter.get_input_details()[0]\n",
+ " output_details = interpreter.get_output_details()[0]\n",
+ "\n",
+ " input_shape = input_chunk[\"shape\"]\n",
+ " log(\"eval_model() - input_shape: {}\".format(input_shape))\n",
+ " input_dtype = input_chunk[\"dtype\"]\n",
+ " output_dtype = output_details[\"dtype\"]\n",
+ "\n",
+ " # Check if the input/output type is quantized,\n",
+ " # set scale and zero-point accordingly\n",
+ " if input_dtype != tf.float32:\n",
+ " input_scale, input_zero_point = input_chunk[\"quantization\"]\n",
+ " else:\n",
+ " input_scale, input_zero_point = 1, 0\n",
+ "\n",
+ " if output_dtype != tf.float32:\n",
+ " output_scale, output_zero_point = output_details[\"quantization\"]\n",
+ " else:\n",
+ " output_scale, output_zero_point = 1, 0\n",
+ "\n",
+ "\n",
+ " data = data / input_scale + input_zero_point\n",
+ " # Round the data up if dtype is int8, uint8 or int16\n",
+ " if input_dtype is not np.float32:\n",
+ " data = np.round(data)\n",
+ "\n",
+ " while data.shape[1] < input_window_length:\n",
+ " data = np.append(data, data[:, -2:-1, :], axis=1)\n",
+ " # Zero-pad any odd-length inputs\n",
+ " if data.shape[1] % 2 == 1:\n",
+ " # log('Input length is odd, zero-padding to even (first layer has stride 2)')\n",
+ " data = np.concatenate([data, np.zeros((1, 1, data.shape[2]), dtype=input_dtype)], axis=1)\n",
+ "\n",
+ " context = 24 + 2 * (7 * 3 + 16) # = 98 - theoretical max receptive field on each side\n",
+ " size = input_chunk['shape'][1]\n",
+ " inner = size - 2 * context\n",
+ " data_end = data.shape[1]\n",
+ "\n",
+ " # Initialize variables for the sliding window loop\n",
+ " data_pos = 0\n",
+ " outputs = []\n",
+ "\n",
+ " while data_pos < data_end:\n",
+ " if data_pos == 0:\n",
+ " # Align inputs from the first window to the start of the data and include the intial context in the output\n",
+ " start = data_pos\n",
+ " end = start + size\n",
+ " y_start = 0\n",
+ " y_end = y_start + (size - context) // 2\n",
+ " data_pos = end - context\n",
+ " elif data_pos + inner + context >= data_end:\n",
+ " # Shift left to align final window to the end of the data and include the final context in the output\n",
+ " shift = (data_pos + inner + context) - data_end\n",
+ " start = data_pos - context - shift\n",
+ " end = start + size\n",
+ " assert start >= 0\n",
+ " y_start = (shift + context) // 2 # Will be even because we assert it above\n",
+ " y_end = size // 2\n",
+ " data_pos = data_end\n",
+ " else:\n",
+ " # Capture only the inner region from mid-input inferences, excluding output from both context regions\n",
+ " start = data_pos - context\n",
+ " end = start + size\n",
+ " y_start = context // 2\n",
+ " y_end = y_start + inner // 2\n",
+ " data_pos = end - context\n",
+ "\n",
+ " interpreter.set_tensor(input_chunk[\"index\"], tf.cast(data[:, start:end, :], input_dtype))\n",
+ " interpreter.invoke()\n",
+ " cur_output_data = interpreter.get_tensor(output_details[\"index\"])[:, :, y_start:y_end, :]\n",
+ " cur_output_data = output_scale * (\n",
+ " cur_output_data.astype(np.float32) - output_zero_point\n",
+ " )\n",
+ " outputs.append(cur_output_data)\n",
+ "\n",
+ " complete = np.concatenate(outputs, axis=2)\n",
+ " LER, output = ctc_ler(label, complete)\n",
+ " WER = ctc_wer(label, complete)\n",
+ " return output, LER , WER\n"
+ ],
+ "outputs": [],
+ "metadata": {}
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 107,
+ "source": [
+ "wav2letter_tflite_path = \"tflite_int8/tiny_wav2letter_int8.tflite\"\n",
+ "output, LER , WER = evaluate_tflite(wav2letter_tflite_path)\n",
+ "\n",
+ "decoded_output = [index_dict[value] for value in output[0]]\n",
+ "log(f'Transcribed File: {\"\".join(decoded_output)}')\n",
+ "log(f'Letter Error Rate is {LER}')\n",
+ "log(f'Word Error Rate is {WER}')"
+ ],
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "******* eval_model() - input_shape: [ 1 296 39]\n",
+ "******* Input length is odd, zero-padding to even (first layer has stride 2)\n",
+ "******* Transcribed File: but with full ravishment the hours of prime singing received they in the midst of leaves that everborea burden to their rimes\n",
+ "******* Letter Error Rate is 0.03125\n",
+ "******* Word Error Rate is 1.05\n"
+ ]
+ }
+ ],
+ "metadata": {}
+ }
+ ],
+ "metadata": {
+ "orig_nbformat": 4,
+ "language_info": {
+ "name": "python",
+ "version": "3.8.2",
+ "mimetype": "text/x-python",
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "pygments_lexer": "ipython3",
+ "nbconvert_exporter": "python",
+ "file_extension": ".py"
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3.8.2 64-bit ('env': venv)"
+ },
+ "interpreter": {
+ "hash": "4b529a2edd0e262cfd8353ba70b138cbba10314325c544d99b9316c477c7841b"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_int8/model_development_guide.md b/models/speech_recognition/tiny_wav2letter/tflite_int8/model_development_guide.md
new file mode 100644
index 0000000..546a975
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_int8/model_development_guide.md
@@ -0,0 +1,58 @@
+# Model Development Guide
+
+This document describes the process of training a model from scratch, using the Tiny Wav2Letter model as an example.
+
+## Datasets
+
+The first thing to decide is which dataset the model is to be trained on. Most commonly used datasets can be found either online or in the ARM AWS S3 bucket. In the case of Tiny Wav2Letter, both the LibriSpeech dataset hosted on [OpenSLR](http://www.openslr.org/resources.php) and the fluent-speech-corpus dataset hosted on [Kaggle](https://www.kaggle.com/tommyngx/fluent-speech-corpus) were used to train the model.
+
+Please note that the fluent-speech-corpus dataset hosted on [Kaggle](https://www.kaggle.com/tommyngx/fluent-speech-corpus) is a licensed dataset.
+
+## Preprocessing
+
+The dataset is often not in the right format for training, so preprocessing steps must be taken. In this case, the LibriSpeech dataset consists of audio files, but the paper specifies MFCCs as the network input, so the audio files needed to be converted. It is recommended that all preprocessing be performed offline, as this makes the actual training process faster: the data is already in the correct format. The most convenient way to store the preprocessed data is using [TFRecords](https://www.tensorflow.org/tutorials/load_data/tfrecord), as these are very easily loaded into TFDatasets. While it can take a long time to write the whole dataset to a TFRecord file, this cost is outweighed by the time saved during training.
+
+Please note that the input audio sample rate is 16 kHz.
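+
+As a rough illustration (a sketch, not the exact preprocessing script in this repository), a serialized MFCC/transcript pair can be written to a TFRecord like this:
+
+```python
+import numpy as np
+import tensorflow as tf
+
+def _bytes_feature(value):
+    """Wrap a serialized tensor in a tf.train.Feature."""
+    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value.numpy()]))
+
+# Placeholder data: MFCCs of shape (time, 39) and an integer label sequence
+mfcc = np.zeros((296, 39), dtype=np.float32)
+transcript = np.array([1, 2, 3], dtype=np.int32)
+
+example = tf.train.Example(features=tf.train.Features(feature={
+    'mfcc_bytes': _bytes_feature(tf.io.serialize_tensor(mfcc)),
+    'sequence_bytes': _bytes_feature(tf.io.serialize_tensor(transcript)),
+}))
+
+with tf.io.TFRecordWriter('train.tfrecord') as writer:
+    writer.write(example.SerializeToString())
+```
+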
+## Model Architecture
+
+The model architecture can generally be found from a variety of sources. If a similar model exists in the IMZ, then [Netron](https://netron.app) can be used to inspect the TFLite file. The original paper in which the model was proposed will also define the architecture. The model should ideally be defined using the TensorFlow Functional API rather than the Sequential API.
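+
+For example, a small convolutional block in the Functional API might look like this (layer sizes are illustrative, not the exact Tiny Wav2Letter configuration):
+
+```python
+import tensorflow as tf
+
+MFCC_COEFFS = 39
+NUM_CLASSES = 29
+
+# Variable-length MFCC input: (batch, time, coefficients)
+inputs = tf.keras.layers.Input(shape=(None, MFCC_COEFFS))
+x = tf.keras.layers.Conv1D(100, kernel_size=11, strides=2, padding='same')(inputs)
+x = tf.keras.layers.LeakyReLU()(x)
+x = tf.keras.layers.Conv1D(100, kernel_size=11, padding='same')(x)
+x = tf.keras.layers.LeakyReLU()(x)
+outputs = tf.keras.layers.Conv1D(NUM_CLASSES, kernel_size=1, padding='same')(x)
+
+model = tf.keras.Model(inputs=inputs, outputs=outputs)
+model.summary()
+```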
+
+## Loss Function and Metrics
+
+The loss function and desired metrics are determined by the model. If at all possible, structure the data such that the input to the loss function is in the form (y_true, y_predicted), as this enables model.fit to be used and avoids custom training loops. TensorFlow provides many standard loss functions out of the box, but custom loss functions can be defined if needed, as was the case for Tiny Wav2Letter.
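+
+As a sketch of such a wrapper (assuming, as in this repository's data pipeline, that the first column of the label tensor carries the logit length and the remaining columns are labels padded with the blank index 28):
+
+```python
+import tensorflow as tf
+
+BLANK_INDEX = 28
+
+def ctc_loss(y_true, y_pred):
+    """CTC loss with a (y_true, y_pred) signature so it can be passed to model.compile."""
+    logit_length = tf.cast(y_true[:, 0:1], tf.int32)
+    labels = tf.cast(y_true[:, 1:], tf.int32)
+    # Count the non-padding labels in each row
+    label_length = tf.reduce_sum(
+        tf.cast(tf.not_equal(labels, BLANK_INDEX), tf.int32), axis=1, keepdims=True
+    )
+    # ctc_batch_cost expects per-timestep probabilities with the blank as the last class
+    return tf.keras.backend.ctc_batch_cost(
+        labels, tf.nn.softmax(y_pred, axis=-1), logit_length, label_length
+    )
+```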
+
+## Model Training
+
+If everything else has been set up properly, the code here should not be complicated. Load the datasets, create an instance of the model, and then ideally run model.fit; if that is not possible, use tf.GradientTape. Use callbacks to write to a log directory (tf.keras.callbacks.TensorBoard), then use TensorBoard to visualise the training process. The [TensorFlow Profiler](https://www.tensorflow.org/guide/profiler) can be used to identify bottlenecks in the training pipeline and speed up training. Another useful callback is tf.keras.callbacks.ModelCheckpoint, which saves a checkpoint at defined intervals so that training can be resumed from where it left off. Generally we want a training set, a validation set and a test set, with roughly a 90:5:5 split. If the model performs well on the training set but not on the validation or test set, then the model is overfitting. This can be reduced by introducing regularisation, increasing the amount of data, reducing model complexity or adjusting hyperparameters. In the case of Tiny Wav2Letter, the model was initially trained on the full-size LibriSpeech dataset to capture the features of speech, then fine-tuned on the much smaller Mini LibriSpeech to improve accuracy on the smaller dataset, and finally fine-tuned on the fluent-speech-corpus dataset.
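+
+A minimal training sketch with these callbacks (assuming `model`, `train_data`, and `val_data` already exist, and using a CTC loss such as the one sketched above):
+
+```python
+import tensorflow as tf
+
+callbacks = [
+    # Write logs that TensorBoard can visualise
+    tf.keras.callbacks.TensorBoard(log_dir='logs/tiny_wav2letter'),
+    # Save weights every epoch so training can be resumed
+    tf.keras.callbacks.ModelCheckpoint(
+        filepath='checkpoints/tiny_wav2letter_{epoch:02d}.h5',
+        save_weights_only=True,
+    ),
+]
+
+model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss=ctc_loss)
+model.fit(train_data, validation_data=val_data, epochs=30, callbacks=callbacks)
+```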
+
+## Optimisation and Conversion
+
+Once the model has been trained to satisfaction, it can optionally be optimised using the TensorFlow Model Optimization Toolkit. Pruning sets a specified percentage of the weights to 0, so the model is sparser, which can lead to faster inference. Clustering groups together weights of similar values, reducing the number of unique values; this again can lead to faster inference. Quantisation (e.g. to INT8) converts all the weights to INT8 representations, giving a 4x reduction in size compared to FP32. If the quantisation process affects the metric too severely, quantisation-aware training can be performed, which fine-tunes the model and makes it more robust to quantisation. Quantisation-aware training requires at least TensorFlow 2.5.0. The final step is to convert the model to the TFLite format. If using INT8 conversion, one must define a representative dataset to calibrate the model.
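+
+An INT8 conversion sketch with a representative dataset (assuming `model` and a list of calibration windows `calibration_samples`, each a float32 array of shape (1, 296, 39)):
+
+```python
+import tensorflow as tf
+
+def representative_dataset():
+    # Yield real input windows so the converter can calibrate quantization ranges
+    for sample in calibration_samples:
+        yield [sample]
+
+converter = tf.lite.TFLiteConverter.from_keras_model(model)
+converter.optimizations = [tf.lite.Optimize.DEFAULT]
+converter.representative_dataset = representative_dataset
+converter.inference_input_type = tf.int8
+converter.inference_output_type = tf.int8
+
+tflite_model = converter.convert()
+with open('tiny_wav2letter_int8.tflite', 'wb') as f:
+    f.write(tflite_model)
+```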
+
+## Training a smaller FP32 Keras Model
+
+The trained Wav2Letter model then serves as the foundation for investigations into how best to reduce the size of the network.
+
+There are three hyperparameters relevant to the size of the network. They are:
+
+- Number of layers
+- Number of filters in each convolutional layer
+- Stride of each convolutional filter
+
+The following tables show the chosen architecture for Tiny Wav2Letter and the effect that it has on the model size and accuracy.
+
+| Identifier | Total Number of Layers | Number of middle Convolutional Layers | Corresponding number of filters | Number of filters in the antepenultimate Conv2D Layer | Number of filters in the penultimate Conv2D Layer |
+| ------ | ------ | ------ | ------ | ------ | ------ |
+| Wav2Letter | 11 | 7 | 250 | 2000 | 2000 |
+| Tiny Wav2Letter | 6 | 5 | 100 | 750 | 750 |
+
+| Identifier | Size (MB) | LER | WER |
+| ------ | ------ | ------ | ------ |
+| Wav2Letter INT8| 22.7 | 0.0877** | N/A |
+| Wav2Letter INT8 pruned| 22.7 | 0.0783** | N/A |
+| Tiny Wav2Letter FP32| 15.6* | 0.0351 | 0.0714 |
+| Tiny Wav2Letter FP32 pruned| 15.6* | 0.0266 | 0.0577 |
+| Tiny Wav2Letter INT8| 3.81 | 0.0348 | 0.1123 |
+| Tiny Wav2Letter INT8 pruned| 3.81 | 0.0283 | 0.0886 |
+
+"*" - the size is according to the tflite model \
+** trained on different dataset
+
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/.idea/misc.xml b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/.idea/misc.xml
new file mode 100644
index 0000000..625040c
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/.idea/misc.xml
@@ -0,0 +1,7 @@
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/.idea/workspace.xml b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/.idea/workspace.xml
new file mode 100644
index 0000000..1dcd832
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/.idea/workspace.xml
@@ -0,0 +1,156 @@
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/README.md b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/README.md
new file mode 100644
index 0000000..90ff4be
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/README.md
@@ -0,0 +1,13 @@
+# Tiny Wav2letter FP32/INT8/INT8_Pruned Model Re-Creation
+This folder contains a script that allows for the model to be re-created from scratch.
+## Datasets
+Tiny Wav2Letter was trained on both the LibriSpeech dataset hosted on OpenSLR and the fluent-speech-corpus dataset hosted on Kaggle.
+Please note that fluent-speech-corpus dataset hosted on [Kaggle](https://www.kaggle.com/tommyngx/fluent-speech-corpus) is a licensed dataset.
+## Requirements
+The script in this folder requires the following:
+- Python 3.6
+- A new directory named fluent_speech_commands_dataset
+- The fluent-speech-corpus dataset (a licensed dataset) downloaded from https://www.kaggle.com/tommyngx/fluent-speech-corpus and extracted into the fluent_speech_commands_dataset directory
+
+## Running The Script
+To run the script, run the following in a terminal: `./recreate_model.sh`
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/__pycache__/load_mfccs.cpython-36.pyc b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/__pycache__/load_mfccs.cpython-36.pyc
new file mode 100644
index 0000000..b74e040
Binary files /dev/null and b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/__pycache__/load_mfccs.cpython-36.pyc differ
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/__pycache__/tinywav2letter.cpython-36.pyc b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/__pycache__/tinywav2letter.cpython-36.pyc
new file mode 100644
index 0000000..c8ddf4a
Binary files /dev/null and b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/__pycache__/tinywav2letter.cpython-36.pyc differ
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/__pycache__/train_model.cpython-36.pyc b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/__pycache__/train_model.cpython-36.pyc
new file mode 100644
index 0000000..b760462
Binary files /dev/null and b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/__pycache__/train_model.cpython-36.pyc differ
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/corpus.py b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/corpus.py
new file mode 100644
index 0000000..b5bd5ce
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/corpus.py
@@ -0,0 +1,171 @@
+# Copyright (C) 2020 Arm Limited or its affiliates. All rights reserved.
+#
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the License); you may
+# not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an AS IS BASIS, WITHOUT
+# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+import tarfile
+import urllib.request
+
+
+class SpeechCorpusProvider:
+ """
+ Ensures the availability and downloads the speech corpus if necessary
+ """
+
+ DATA_SOURCE = {
+ "librispeech_full_size":
+ {"DATA_SETS":
+ {
+ ('train', 'train-clean-100'),
+ ('train', 'train-clean-360'),
+ ('val', 'dev-clean')
+ },
+ "BASE_URL": 'http://www.openslr.org/resources/12/'
+ },
+ "librispeech_reduced_size":
+ {"DATA_SETS":
+ {
+ ('train', 'train-clean-5'),
+ ('val', 'dev-clean-2')
+ },
+ "BASE_URL": 'http://www.openslr.org/resources/31/'
+ }
+ }
+
+ SET_FILE_EXTENSION = '.tar.gz'
+ TAR_ROOT = 'LibriSpeech/'
+
+ def __init__(self, data_directory):
+ """
+ Creates a new SpeechCorpusProvider with the root directory `data_directory`.
+ The speech corpus is downloaded and extracted into sub-directories.
+
+ Args:
+ data_directory: the root directory to use, e.g. data/
+ """
+
+ self._data_directory = data_directory
+ self._make_dir_if_not_exists(data_directory)
+ self.data_sets = SpeechCorpusProvider.DATA_SOURCE[data_directory]['DATA_SETS']
+ self.base_url = SpeechCorpusProvider.DATA_SOURCE[data_directory]['BASE_URL']
+
+ @staticmethod
+ def _make_dir_if_not_exists(directory):
+ """
+ Helper function to create a directory if it doesn't exist.
+
+ Args:
+ directory: directory to create
+ """
+
+ if not os.path.exists(directory):
+ os.makedirs(directory)
+
+ def _download_if_not_exists(self, remote_file_name):
+ """
+ Downloads the given `remote_file_name` if not yet stored in the `data_directory`
+
+ Args:
+ remote_file_name: the file to download
+
+ Returns: path to downloaded file
+ """
+
+ path = os.path.join(self._data_directory, remote_file_name)
+ if not os.path.exists(path):
+ print('Downloading {}...'.format(remote_file_name))
+ urllib.request.urlretrieve(self.base_url + remote_file_name, path)
+ return path
+
+ @staticmethod
+ def _extract_from_to(tar_file_name, source, target_directory):
+ """
+ Extract all necessary files `source` from `tar_file_name` into `target_directory`
+
+ Args:
+ tar_file_name: the tar file to extract from
+ source: the directory in the root to extract
+ target_directory: the directory to store the files in
+ """
+
+ print('Extracting {}...'.format(tar_file_name))
+ with tarfile.open(tar_file_name, 'r:gz') as tar:
+ source_members = [
+ tarinfo for tarinfo in tar.getmembers()
+ if tarinfo.name.startswith(SpeechCorpusProvider.TAR_ROOT + source)
+ ]
+ for member in source_members:
+ # Extract without prefix
+ member.name = member.name.replace(SpeechCorpusProvider.TAR_ROOT, '')
+ tar.extractall(target_directory, source_members)
+
+ def _is_ready(self):
+ """
+ Returns whether all `data_sets` are downloaded and extracted
+
+ Args:
+ data_sets: list of the datasets to ensure
+
+ Returns: bool, is ready to use
+
+ """
+
+ data_set_paths = [os.path.join(set_type, set_name)
+ for set_type, set_name in self.data_sets]
+
+ return all([os.path.exists(os.path.join(self._data_directory, data_set))
+ for data_set in data_set_paths])
+
+ def _download(self):
+ """
+ Download the given `data_sets`
+
+ Args:
+ data_sets: a list of the datasets to download
+ """
+
+ for data_set_type, data_set_name in self.data_sets:
+ remote_file = data_set_name + SpeechCorpusProvider.SET_FILE_EXTENSION
+ self._download_if_not_exists(remote_file)
+
+ def _extract(self):
+ """
+ Extract all necessary files from the given `data_sets`
+ """
+
+ for data_set_type, data_set_name in self.data_sets:
+ local_file = os.path.join(self._data_directory, data_set_name + SpeechCorpusProvider.SET_FILE_EXTENSION)
+ target_directory = self._data_directory
+ self._extract_from_to(local_file, data_set_name, target_directory)
+
+ def ensure_availability(self):
+ """
+ Ensure that all datasets are downloaded and extracted. If this is not the case,
+        the download and extraction are initiated.
+ """
+
+ if not self._is_ready():
+ self._download()
+ self._extract()
+
+
+if __name__=="__main__":
+ full_size_corpus = SpeechCorpusProvider("librispeech_full_size")
+ full_size_corpus.ensure_availability()
+
+ reduced_size_corpus = SpeechCorpusProvider("librispeech_reduced_size")
+ reduced_size_corpus.ensure_availability()
+
+
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/evaluate_saved_weights.py b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/evaluate_saved_weights.py
new file mode 100644
index 0000000..399a914
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/evaluate_saved_weights.py
@@ -0,0 +1,52 @@
+# Copyright (C) 2020 Arm Limited or its affiliates. All rights reserved.
+#
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the License); you may
+# not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an AS IS BASIS, WITHOUT
+# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import argparse
+
+import tensorflow as tf
+
+from tinywav2letter import get_metrics, create_tinywav2letter
+from train_model import get_data
+
+def evaluate_saved_weights(args, pruned = False):
+
+ model = create_tinywav2letter(batch_size = args.batch_size)
+
+ model.load_weights('weights/tiny_wav2letter' + pruned * "_pruned" + '_weights.h5')
+
+ opt = tf.keras.optimizers.Adam()
+ model.compile(loss=get_metrics("loss"), metrics=[get_metrics("ler")], optimizer=opt)
+
+ (reduced_validation_data, reduced_validation_num_steps) = get_data(args, "val_reduced_size", args.batch_size)
+
+ model.evaluate(reduced_validation_data)
+
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser(allow_abbrev=False)
+
+ parser.add_argument(
+ "--batch_size",
+ dest="batch_size",
+ type=int,
+ required=False,
+ default=32,
+ help="batch size wanted when creating model",
+ )
+
+ args = parser.parse_args()
+ evaluate_saved_weights(args)
+
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/load_mfccs.py b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/load_mfccs.py
new file mode 100644
index 0000000..d36bf6b
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/load_mfccs.py
@@ -0,0 +1,177 @@
+# Copyright (C) 2020 Arm Limited or its affiliates. All rights reserved.
+#
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the License); you may
+# not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an AS IS BASIS, WITHOUT
+# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import tensorflow as tf
+import os
+import numpy as np
+
+class MFCC_Loader:
+ def __init__(self, full_size_data_dir:str, reduced_size_data_dir:str, fluent_speech_data_dir:str):
+ """
+ Args:
+ data_dir: Absolute path to librispeech data folder
+ """
+ self.full_size_data_dir = full_size_data_dir
+ self.reduced_size_data_dir = reduced_size_data_dir
+ self.fluent_speech_data_dir = fluent_speech_data_dir
+ self.seed = 0
+ self.train = False
+ self.batch_size = 32
+ self.num_samples = 0
+ self.input_files = []
+
+
+ @staticmethod
+ def _extract_features(example_proto):
+ feature_description = {
+ 'mfcc_bytes': tf.io.FixedLenFeature([], tf.string),
+ 'sequence_bytes': tf.io.FixedLenFeature([], tf.string),
+ }
+ # Parse the input tf.train.Example proto using the dictionary above.
+ serialized_tensor = tf.io.parse_single_example(example_proto, feature_description)
+
+ mfcc_features = tf.io.parse_tensor(serialized_tensor['mfcc_bytes'], out_type = tf.float32)
+ sequences = tf.io.parse_tensor(serialized_tensor['sequence_bytes'], out_type = tf.int32)
+
+ return mfcc_features, sequences
+
+ def full_training_set(self, batch_size=32, num_samples = -1):
+ """
+ Args:
+ batch_size: batch size required for the set
+ """
+ self.tfrecord_file = [
+ os.path.join(self.full_size_data_dir, 'preprocessed/train-clean-100/train-clean-100.tfrecord'),
+ os.path.join(self.full_size_data_dir, 'preprocessed/train-clean-360/train-clean-360.tfrecord')]
+
+ self.train = True
+ self.batch_size = batch_size
+ self.num_samples = 132553
+ return self.load_dataset(num_samples)
+
+ def reduced_training_set(self, batch_size=32, num_samples = -1):
+ """
+ Args:
+ batch_size: batch size required for the set
+ """
+ self.tfrecord_file = os.path.join(self.reduced_size_data_dir, 'preprocessed/train-clean-5/train-clean-5.tfrecord')
+ self.train = True
+ self.batch_size = batch_size
+ self.num_samples = 1519
+ return self.load_dataset(num_samples)
+
+ def full_validation_set(self, batch_size=32):
+ """
+ Args:
+ batch_size: batch size required for the set
+ """
+ self.tfrecord_file = os.path.join(self.full_size_data_dir, 'preprocessed/dev-clean/dev-clean.tfrecord')
+ self.train = False
+ self.batch_size = batch_size
+ self.num_samples = 2703
+ return self.load_dataset()
+
+ def reduced_validation_set(self, batch_size=32):
+ """
+ Args:
+ batch_size: batch size required for the set
+ """
+ self.tfrecord_file = os.path.join(self.reduced_size_data_dir, 'preprocessed/dev-clean-2/dev-clean-2.tfrecord')
+ self.train = False
+ self.batch_size = batch_size
+ self.num_samples = 1089
+ return self.load_dataset()
+
+
+ def evaluation_set(self, batch_size=32):
+ """
+ Args:
+ batch_size: batch size required for the set
+ """
+
+ self.tfrecord_file = os.path.join(self.full_size_data_dir, 'preprocessed/test-clean/test-clean.tfrecord')
+ self.train = False
+ self.batch_size = batch_size
+ self.num_samples = 2620
+ return self.load_dataset()
+
+ def fluent_speech_train_set(self, batch_size=32, num_samples = -1):
+ """
+ Args:
+ batch_size: batch size required for the set
+ """
+ self.tfrecord_file = os.path.join(self.fluent_speech_data_dir, 'preprocessed/train/train.tfrecord')
+
+ self.train = True
+ self.batch_size = batch_size
+ self.num_samples = 23132
+ return self.load_dataset(num_samples)
+
+ def fluent_speech_validation_set(self, batch_size=32):
+ """
+ Args:
+ batch_size: batch size required for the set
+ """
+ self.tfrecord_file = os.path.join(self.fluent_speech_data_dir, 'preprocessed/dev/dev.tfrecord')
+ self.train = False
+ self.batch_size = batch_size
+ self.num_samples = 3118
+ return self.load_dataset()
+
+ def fluent_speech_test_set(self, batch_size=32):
+ """
+ Args:
+ batch_size: batch size required for the set
+ """
+ self.tfrecord_file = os.path.join(self.fluent_speech_data_dir, 'preprocessed/test/test.tfrecord')
+ self.train = False
+ self.batch_size = batch_size
+ self.num_samples = 3793
+ return self.load_dataset()
+
+ def num_steps(self, batch):
+ """
+ Get the number of steps based on the given batch size and the number
+ of samples.
+ """
+ return int(np.math.ceil(self.num_samples / batch))
+
+
+ def load_dataset(self, num_samples = -1):
+
+ # load the specified TF Record files
+ dataset = tf.data.TFRecordDataset(self.tfrecord_file)
+
+ # parse the data, and take the desired number of samples
+ dataset = dataset.map(self._extract_features, num_parallel_calls = tf.data.AUTOTUNE).take(num_samples)
+
+ dataset = dataset.cache()
+
+ # shuffle the training set
+ if self.train:
+ dataset = dataset.shuffle(buffer_size=max(self.batch_size * 2, 1024), seed=self.seed)
+
+ MFCC_coeffs = 39
+ blank_index = 28
+
+
+ # Pad the dataset so that all the data is the same size
+ dataset = dataset.padded_batch(
+ self.batch_size,
+ padded_shapes=(tf.TensorShape([None, MFCC_coeffs]), tf.TensorShape([None])),
+ padding_values=(0.0, blank_index), drop_remainder=True
+ )
+ return dataset.prefetch(tf.data.experimental.AUTOTUNE)
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/preprocessing.py b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/preprocessing.py
new file mode 100644
index 0000000..0eddb4c
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/preprocessing.py
@@ -0,0 +1,257 @@
+# Copyright (C) 2020 Arm Limited or its affiliates. All rights reserved.
+#
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the License); you may
+# not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an AS IS BASIS, WITHOUT
+# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import fnmatch
+import os
+import librosa
+
+import numpy as np
+import tensorflow as tf
+from corpus import SpeechCorpusProvider
+from preprocessing_fluent_speech_commands import preprocess_fluent_sppech
+from preprocessing_convert_to_flac import convert_to_flac
+def _bytes_feature(value):
+ """Returns a bytes_list from a string / byte."""
+ # If the value is an eager tensor BytesList won't unpack a string from an EagerTensor.
+ if isinstance(value, type(tf.constant(0))):
+ value = value.numpy()
+ return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
+
+def _int64_feature(value):
+ """Returns an int64_list from a bool / enum / int / uint."""
+ return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))
+
+def normalize(values):
+ """
+ Normalize values to mean 0 and std 1
+ """
+ return (values - np.mean(values)) / np.std(values)
+
+
+def iglob_recursive(directory, file_pattern):
+ """
+ Recursively search for `file_pattern` in `directory`
+
+ Args:
+ directory: the directory to search in
+ file_pattern: the file pattern to match (wildcard compatible)
+
+ Returns:
+ iterator for found files
+
+ """
+ for root, dir_names, file_names in os.walk(directory):
+ for filename in fnmatch.filter(file_names, file_pattern):
+ yield os.path.join(root, filename)
+
+
+class SpeechCorpusReader:
+ """
+ Reads preprocessed speech corpus to be used by the NN
+ """
+ def __init__(self, data_directory):
+ """
+ Create SpeechCorpusReader and read samples from `data_directory`
+
+ Args:
+ data_directory: the directory to use
+ """
+ self._data_directory = data_directory
+ self._transcript_dict = self._build_transcript()
+
+ @staticmethod
+ def _get_transcript_entries(transcript_directory):
+ """
+ Iterate over all transcript lines and yield splitted entries
+
+ Args:
+ transcript_directory: open all transcript files in this directory and extract their contents
+
+ Returns: Iterator for all entries in the form (id, sentence)
+
+ """
+ transcript_files = iglob_recursive(transcript_directory, '*.trans.txt')
+ for transcript_file in transcript_files:
+ with open(transcript_file, 'r') as f:
+ for line in f:
+ # Strip included new line symbol
+ line = line.rstrip('\n')
+
+ # Each line is in the form
+ # 00-000000-0000 WORD1 WORD2 ...
+ splitted = line.split(' ', 1)
+ yield splitted
+
+ def _build_transcript(self):
+ """
+ Builds a transcript from transcript files, mapping from audio-id to a list of vocabulary ids
+
+ Returns: the created transcript
+ """
+ alphabet = "abcdefghijklmnopqrstuvwxyz' @"
+ alphabet_dict = {c: ind for (ind, c) in enumerate(alphabet)}
+
+ # Create the transcript dictionary
+ transcript_dict = dict()
+ for splitted in self._get_transcript_entries(self._data_directory):
+ transcript_dict[splitted[0]] = [alphabet_dict[letter] for letter in splitted[1].lower()]
+
+ return transcript_dict
+
+ @classmethod
+ def _extract_audio_id(cls, audio_file):
+ file_name = os.path.basename(audio_file)
+ audio_id = os.path.splitext(file_name)[0]
+
+ return audio_id
+
+ @staticmethod
+ def transform_audio_to_mfcc(audio_file, transcript, n_mfcc=13, n_fft=512, hop_length=160):
+ """
+ Calculate mfcc coefficients from the given raw audio data
+
+ Args:
+ audio_file: .flac audio file
+ n_mfcc: the number of coefficients to generate
+ n_fft: the window size of the fft
+ hop_length: the hop length for the window
+
+ Returns:
+ the mfcc coefficients in the form [time, coefficients]
+ sequences: the transcript of the audio file
+
+ """
+
+ audio_data, sample_rate = librosa.load(audio_file, sr=16000)
+
+ mfcc = librosa.feature.mfcc(audio_data, sr=sample_rate, n_mfcc=n_mfcc, n_fft=n_fft, hop_length=hop_length)
+
+ # add derivatives and normalize
+ mfcc_delta = librosa.feature.delta(mfcc)
+ mfcc_delta2 = librosa.feature.delta(mfcc, order=2)
+        mfcc = np.concatenate((normalize(mfcc), normalize(mfcc_delta), normalize(mfcc_delta2)), axis=0)  # mfcc is now 13+13+13=39 (matching the input shape)
+
+ seq_length = mfcc.shape[1] // 2
+
+ sequences = np.concatenate([[seq_length], transcript]).astype(np.int32)
+ mfcc_out = mfcc.T.astype(np.float32)
+
+ return mfcc_out, sequences
+
+ @staticmethod
+ def _create_feature(mfcc_bytes, sequence_bytes):
+ """
+ Creates a tf.train.Example message ready to be written to a file.
+ """
+ # Create a dictionary mapping the feature name to the tf.train.Example-compatible
+ # data type.
+
+ feature = {
+ 'mfcc_bytes': _bytes_feature(mfcc_bytes),
+ 'sequence_bytes': _bytes_feature(sequence_bytes),
+ }
+
+ # Create a Features message using tf.train.Example.
+ return tf.train.Example(features=tf.train.Features(feature=feature))
+
+
+ def _get_directory(self, sub_directory):
+ preprocess_directory = 'preprocessed'
+
+ directory = self._data_directory + '/' + preprocess_directory + '/' + sub_directory
+
+ return directory
+
+
+ def process_data(self, directory):
+ """
+ Read audio files from `directory` and store the preprocessed version in preprocessed/`directory`
+
+ Args:
+ directory: the sub-directory to read from
+
+ """
+ # create a list of all the .flac files
+ audio_files = list(iglob_recursive(self._data_directory + '/' + directory , '*.flac'))
+
+ out_directory = self._get_directory(directory)
+
+ if not os.path.exists(out_directory):
+ os.makedirs(out_directory)
+
+ # the file the TFRecord will be written to
+ filename = out_directory + f'/{directory}.tfrecord'
+ with tf.io.TFRecordWriter(filename) as writer:
+ for audio_file in audio_files:
+                if os.path.getsize(audio_file) >= 13885:  # small files are not supported
+ audio_id = self._extract_audio_id(audio_file)
+
+ # identify the transcript corresponding to the audio file
+ transcript = self._transcript_dict[audio_id]
+
+ # convert the audio to MFCCs
+ mfcc_feature, sequences = self.transform_audio_to_mfcc(audio_file, transcript)
+
+ # Serialise the MFCCs and transcript
+ mfcc_bytes = tf.io.serialize_tensor(mfcc_feature)
+ sequence_bytes = tf.io.serialize_tensor(sequences)
+
+ # Write to the TFRecord file
+ writer.write(self._create_feature(mfcc_bytes, sequence_bytes).SerializeToString())
+
+ else:
+ print('Processed data already exists')
+
+
+class Preprocessing:
+
+ def __init__(self, data_dir):
+ self.data_dir = data_dir
+
+ def run(self):
+ # Download the raw data
+ corpus = SpeechCorpusProvider(self.data_dir)
+ corpus.ensure_availability()
+
+ corpus_reader = SpeechCorpusReader(self.data_dir)
+
+ # initialise the datasets
+ data_sets = [data_set[1] for data_set in corpus.data_sets]
+
+ for data_set in data_sets:
+ print(f"Preprocessing {data_set} data")
+ corpus_reader.process_data(data_set)
+
+ print('Preprocessing Complete')
+ def run_without_download(self):
+ corpus_reader = SpeechCorpusReader(self.data_dir)
+ corpus_reader.process_data('dev')
+ corpus_reader.process_data('train')
+ corpus_reader.process_data('test')
+if __name__=="__main__":
+ reduced_preprocessing = Preprocessing('librispeech_reduced_size')
+ reduced_preprocessing.run()
+
+ full_preprocessing = Preprocessing('librispeech_full_size')
+ full_preprocessing.run()
+
+ preprocess_fluent_sppech()
+ convert_to_flac()
+    fluent_speech_preprocessing = Preprocessing('fluent_speech_commands_dataset')  # please note this is a licensed dataset
+ fluent_speech_preprocessing.run_without_download()
+
+
+
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/preprocessing_convert_to_flac.py b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/preprocessing_convert_to_flac.py
new file mode 100644
index 0000000..9327164
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/preprocessing_convert_to_flac.py
@@ -0,0 +1,100 @@
+# Copyright (C) 2020 Arm Limited or its affiliates. All rights reserved.
+#
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the License); you may
+# not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an AS IS BASIS, WITHOUT
+# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from queue import Queue
+import logging
+import os
+from threading import Thread
+import audiotools
+from audiotools.wav import InvalidWave
+
+class W2F:
+
+ logger = ''
+
+ def __init__(self):
+ global logger
+ # create logger
+ logger = logging.getLogger(__name__)
+ logger.setLevel(logging.DEBUG)
+
+ # create a file handler
+ handler = logging.FileHandler('converter.log')
+ handler.setLevel(logging.INFO)
+
+ # create a logging format
+ formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
+ handler.setFormatter(formatter)
+
+ # add the handlers to the logger
+ logger.addHandler(handler)
+
+ def convert(self,path):
+ global logger
+ file_queue = Queue()
+ num_converter_threads = 5
+
+ # collect files to be converted
+ for root, dirs, files in os.walk(path):
+
+ for file in files:
+ if file.endswith(".wav"):
+ file_wav = os.path.join(root, file)
+ file_flac = file_wav.replace(".wav", ".flac")
+
+ if (os.path.exists(file_flac)):
+ logger.debug(''.join(["File ",file_flac, " already exists."]))
+ else:
+ file_queue.put(file_wav)
+
+ logger.info("Start converting: %s files", str(file_queue.qsize()))
+
+ # Set up some threads to convert files
+ for i in range(num_converter_threads):
+ worker = Thread(target=self.process, args=(file_queue,))
+ worker.setDaemon(True)
+ worker.start()
+
+ file_queue.join()
+
+ def process(self, q):
+ """This is the worker thread function.
+ It processes files in the queue one after
+ another. These daemon threads go into an
+ infinite loop, and only exit when
+ the main thread ends.
+ """
+ while True:
+ global logger
+ compression_quality = '0' #min compression
+ file_wav = q.get()
+ file_flac = file_wav.replace(".wav", ".flac")
+
+ try:
+ audiotools.open(file_wav).convert(file_flac,audiotools.FlacAudio, compression_quality)
+ logger.info(''.join(["Converted ", file_wav, " to: ", file_flac]))
+ q.task_done()
+ except InvalidWave:
+                logger.error(''.join(["Failed to convert ", file_wav, " to ", file_flac]), exc_info=True)
+
+def convert_to_flac():
+ reduced_preprocessing = W2F()
+ reduced_preprocessing.convert("fluent_speech_commands_dataset/train/")
+ reduced_preprocessing.convert("fluent_speech_commands_dataset/dev/")
+ reduced_preprocessing.convert("fluent_speech_commands_dataset/test/")
+ print('')
+
+
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/preprocessing_fluent_speech_commands.py b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/preprocessing_fluent_speech_commands.py
new file mode 100644
index 0000000..831fb06
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/preprocessing_fluent_speech_commands.py
@@ -0,0 +1,50 @@
+# Copyright (C) 2020 Arm Limited or its affiliates. All rights reserved.
+#
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the License); you may
+# not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an AS IS BASIS, WITHOUT
+# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from csv import reader
+import os
+from shutil import copyfile
+from pathlib import Path
+import re
+def preprocess(flag):
+ if flag == 'train':
+ path_csv = 'fluent_speech_commands_dataset/data/train_data.csv'
+ path_dir = 'fluent_speech_commands_dataset/train/'
+ elif flag == 'dev':
+ path_csv = 'fluent_speech_commands_dataset/data/valid_data.csv'
+ path_dir = 'fluent_speech_commands_dataset/dev/'
+ else:
+ path_csv = 'fluent_speech_commands_dataset/data/test_data.csv'
+ path_dir = 'fluent_speech_commands_dataset/test/'
+ with open(path_csv, 'r') as read_obj:
+ csv_reader = reader(read_obj)
+ if not os.path.exists(path_dir):
+ os.makedirs(path_dir)
+ with open(path_dir + flag + '.trans.txt', 'w') as write_obj:
+ for row in csv_reader:
+ print(row)
+ if(row[1] == 'path'):
+ continue
+ head, file_name = os.path.split(row[1])
+ copyfile('fluent_speech_commands_dataset/' + row[1],path_dir + file_name)
+ text = row[3]
+ text = text.upper()
+ text = re.sub('[^a-zA-Z \']+', '', text) #remove all other chars
+ write_obj.write(Path(file_name).stem + " " + text + '\n')
+def preprocess_fluent_sppech():
+ preprocess('train')
+ preprocess('dev')
+ preprocess('test')
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/prune_and_quantise_model.py b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/prune_and_quantise_model.py
new file mode 100755
index 0000000..742436b
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/prune_and_quantise_model.py
@@ -0,0 +1,386 @@
+# Copyright (C) 2020 Arm Limited or its affiliates. All rights reserved.
+#
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the License); you may
+# not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an AS IS BASIS, WITHOUT
+# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import argparse
+import multiprocessing
+import os
+import pathlib
+
+import tensorflow as tf
+import tensorflow_model_optimization as tfmot
+from tqdm import tqdm
+import numpy as np
+
+from tinywav2letter import get_metrics
+from train_model import log, get_lr_schedule, get_data
+
+options = tf.data.Options()
+options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA
+
+gpus = tf.config.list_logical_devices('GPU')
+strategy = tf.distribute.MirroredStrategy(gpus)
+
+def load_model():
+ """
+ Returns the model saved at the end of training
+ """
+
+ # load the saved model
+ with strategy.scope():
+ model = tf.keras.models.load_model(f'saved_models/tiny_wav2letter',
+ custom_objects={'ctc_loss': get_metrics("loss"), 'ctc_ler':get_metrics("ler")}
+ )
+
+ return model
+
+def evaluate_model(args, model):
+ """
+ Evaluates an unquantised model
+
+ Args:
+ args: The command line arguments
+ model: The model to evaluate
+ """
+
+ # Get the data to evaluate the model on
+ (fluent_speech_test_data, fluent_speech_test_num_steps) = get_data(args, "test_fluent_speech", args.batch_size)
+
+ # Compile and evaluate the model - LER
+ with strategy.scope():
+ opt = tf.keras.optimizers.Adam(learning_rate=get_lr_schedule(learning_rate = 1e-6, steps_per_epoch=fluent_speech_test_num_steps))
+ model.compile(loss=get_metrics("loss"), metrics=[get_metrics("ler")], optimizer=opt, run_eagerly=True)
+ log(f'Evaluating TinyWav2Letter - LER')
+ model.evaluate(fluent_speech_test_data)
+
+ # Get the data to evaluate the model on
+    (fluent_speech_test_data, fluent_speech_test_num_steps) = get_data(args, "test_fluent_speech", batch_size=1)  # WER is evaluated with batch size 1
+
+ # Compile and evaluate the model - WER
+ with strategy.scope():
+ opt = tf.keras.optimizers.Adam(learning_rate=get_lr_schedule(learning_rate = 1e-6, steps_per_epoch=fluent_speech_test_num_steps))
+ model.compile(loss=get_metrics("loss"), metrics=[get_metrics("wer")], optimizer=opt,run_eagerly=True)
+ log(f'Evaluating TinyWav2Letter - WER')
+ model.evaluate(fluent_speech_test_data)
+
+def prune_model(args, model):
+ """Performs pruning, fine-tuning and returns stripped pruned model"""
+
+ # Get all the training and validation data
+ (full_training_data, full_training_num_steps) = get_data(args, "train_full_size", args.batch_size)
+ (full_validation_data, full_validation_num_steps) = get_data(args, "val_full_size", args.batch_size)
+ (fluent_speech_training_data, fluent_speech_training_num_steps) = get_data(args, "train_fluent_speech",args.batch_size)
+ (fluent_speech_validation_data, fluent_speech_validation_num_steps) = get_data(args, "val_fluent_speech",args.batch_size)
+ (fluent_speech_test_data, fluent_speech_test_num_steps) = get_data(args, "test_fluent_speech", args.batch_size)
+
+ log("Pruning model to {} sparsity".format(args.sparsity))
+
+ log_dir = f"logs/pruned_model"
+
+ # Set up the callbacks
+ pruning_callbacks = [
+ tfmot.sparsity.keras.UpdatePruningStep(),
+ tfmot.sparsity.keras.PruningSummaries(log_dir=log_dir),
+ ]
+
+ # Perform the pruning - full_training_data
+ pruning_params = {
+ "pruning_schedule": tfmot.sparsity.keras.ConstantSparsity(
+ args.sparsity, begin_step=0, end_step=int(full_training_num_steps * 0.7), frequency=10
+ )
+ }
+
+
+ with strategy.scope():
+ pruned_model = tfmot.sparsity.keras.prune_low_magnitude(model, **pruning_params)
+ opt = tf.keras.optimizers.Adam(learning_rate=get_lr_schedule(learning_rate = 1e-4, steps_per_epoch=full_training_num_steps))
+ pruned_model.compile(loss=get_metrics("loss"), metrics=[get_metrics("ler")], optimizer=opt)
+
+ pruned_model.fit(
+ full_training_data,
+ epochs=5,
+ verbose=1,
+ callbacks=pruning_callbacks,
+ validation_data=full_validation_data,
+ )
+
+ # Perform the pruning - fluent_speech_training_data
+ pruning_params = {
+ "pruning_schedule": tfmot.sparsity.keras.ConstantSparsity(
+ args.sparsity, begin_step=0, end_step=int(fluent_speech_validation_num_steps * 0.7), frequency=10
+ )
+ }
+
+
+ with strategy.scope():
+ pruned_model = tfmot.sparsity.keras.prune_low_magnitude(model, **pruning_params)
+ opt = tf.keras.optimizers.Adam(learning_rate=get_lr_schedule(learning_rate = 1e-4, steps_per_epoch=fluent_speech_validation_num_steps))
+ pruned_model.compile(loss=get_metrics("loss"), metrics=[get_metrics("ler")], optimizer=opt)
+
+ pruned_model.fit(
+ fluent_speech_training_data,
+ epochs=5,
+ verbose=1,
+ callbacks=pruning_callbacks,
+ validation_data=fluent_speech_validation_data,
+ )
+
+ stripped_model = tfmot.sparsity.keras.strip_pruning(pruned_model)
+
+ return stripped_model
+
+def prepare_model_for_inference(model, input_window_length = 296):
+    "Takes a model and returns a model with a fixed input size"
+ MFCC_coeffs = 39
+
+ # Define the input
+ layer_input = tf.keras.layers.Input((input_window_length, MFCC_coeffs), batch_size=1)
+ static_shaped_model = tf.keras.models.Model(
+ inputs=[layer_input], outputs=[model.call(layer_input)]
+ )
+ return static_shaped_model
+
+def tflite_conversion(model, tflite_path, conversion_type="fp32"):
+ """Performs tflite conversion (fp32, int8)."""
+ # Prepare model for inference
+ model = prepare_model_for_inference(model)
+ converter = tf.lite.TFLiteConverter.from_keras_model(model)
+
+ # define a dataset to calibrate the conversion to INT8
+ def representative_dataset_gen(input_dim):
+ calib_data = []
+ for data in tqdm(fluent_speech_test_data.take(1000), desc="model calibration"):
+ input_data = data[0]
+ for i in range(input_data.shape[1] // input_dim):
+ input_chunks = [
+ input_data[:, i * input_dim: (i + 1) * input_dim, :, ]
+ ]
+ for chunk in input_chunks:
+ calib_data.append([chunk])
+
+ return lambda: [
+ (yield data) for data in tqdm(calib_data, desc="model calibration")
+ ]
+
+ if conversion_type == "int8":
+ log("Quantizing Model")
+ (fluent_speech_test_data, fluent_speech_test_num_steps) = get_data(args, "test_fluent_speech", args.batch_size)
+ converter.optimizations = [tf.lite.Optimize.DEFAULT]
+ converter.inference_input_type = tf.int8
+ converter.inference_output_type = tf.int8
+ converter.representative_dataset = representative_dataset_gen(model.input_shape[1])
+
+ tflite_model = converter.convert()
+ open(tflite_path, "wb").write(tflite_model)
+
+def evaluate_tflite(tflite_path, input_window_length = 296):
+ """Evaluates tflite (fp32, int8)."""
+ results_ler = []
+ results_wer = []
+ (fluent_speech_test_data, fluent_speech_test_num_steps) = get_data(args, "test_fluent_speech", batch_size=1)
+ log("Setting number of used threads to {}".format(multiprocessing.cpu_count()))
+ interpreter = tf.lite.Interpreter(
+ model_path=tflite_path, num_threads=multiprocessing.cpu_count()
+ )
+ interpreter.allocate_tensors()
+ input_chunk = interpreter.get_input_details()[0]
+ output_details = interpreter.get_output_details()[0]
+
+ input_shape = input_chunk["shape"]
+ log("eval_model() - input_shape: {}".format(input_shape))
+ input_dtype = input_chunk["dtype"]
+ output_dtype = output_details["dtype"]
+
+ # Check if the input/output type is quantized,
+ # set scale and zero-point accordingly
+ if input_dtype != tf.float32:
+ input_scale, input_zero_point = input_chunk["quantization"]
+ else:
+ input_scale, input_zero_point = 1, 0
+
+ if output_dtype != tf.float32:
+ output_scale, output_zero_point = output_details["quantization"]
+ else:
+ output_scale, output_zero_point = 1, 0
+
+ log("Running {} iterations".format(fluent_speech_test_num_steps))
+ for i_iter, (data, label) in enumerate(
+ tqdm(fluent_speech_test_data, total=fluent_speech_test_num_steps)
+ ):
+ data = data / input_scale + input_zero_point
+ # Round the data up if dtype is int8, uint8 or int16
+ if input_dtype is not np.float32:
+ data = np.round(data)
+
+ while data.shape[1] < input_window_length:
+ data = np.append(data, data[:, -2:-1, :], axis=1)
+ # Zero-pad any odd-length inputs
+ if data.shape[1] % 2 == 1:
+ log('Input length is odd, zero-padding to even (first layer has stride 2)')
+ data = np.concatenate([data, np.zeros((1, 1, data.shape[2]), dtype=input_dtype)], axis=1)
+
+ context = 24 + 2 * (7 * 3 + 16) # = 98 - theoretical max receptive field on each side
+ size = input_chunk['shape'][1]
+ inner = size - 2 * context
+ data_end = data.shape[1]
+
+ # Initialize variables for the sliding window loop
+ data_pos = 0
+ outputs = []
+
+ while data_pos < data_end:
+ if data_pos == 0:
+            # Align inputs from the first window to the start of the data and include the initial context in the output
+ start = data_pos
+ end = start + size
+ y_start = 0
+ y_end = y_start + (size - context) // 2
+ data_pos = end - context
+ elif data_pos + inner + context >= data_end:
+ # Shift left to align final window to the end of the data and include the final context in the output
+ shift = (data_pos + inner + context) - data_end
+ start = data_pos - context - shift
+ end = start + size
+ assert start >= 0
+ y_start = (shift + context) // 2 # Will be even because we assert it above
+ y_end = size // 2
+ data_pos = data_end
+ else:
+ # Capture only the inner region from mid-input inferences, excluding output from both context regions
+ start = data_pos - context
+ end = start + size
+ y_start = context // 2
+ y_end = y_start + inner // 2
+ data_pos = end - context
+
+ interpreter.set_tensor(
+ input_chunk["index"], tf.cast(data[:, start:end, :], input_dtype))
+ interpreter.invoke()
+ cur_output_data = interpreter.get_tensor(output_details["index"])[:, :, y_start:y_end, :]
+ cur_output_data = output_scale * (
+ cur_output_data.astype(np.float32) - output_zero_point
+ )
+ outputs.append(cur_output_data)
+
+ complete = np.concatenate(outputs, axis=2)
+ results_ler.append(get_metrics("ler")(label, complete))
+ results_wer.append(get_metrics("wer")(label, complete))
+
+ log("ler: {}".format(np.mean(results_ler)))
+ log("wer: {}".format(np.mean(results_wer))) #based on batch=1
+
+def main(args):
+
+ model = load_model()
+ evaluate_model(args, model)
+
+ if args.prune:
+ model = prune_model(args, model)
+ model.save(f"saved_models/pruned_tiny_wav2letter")
+ evaluate_model(args, model)
+
+ output_directory = pathlib.Path(os.path.dirname(os.path.abspath(__file__)))
+ output_directory = os.path.join(output_directory, "tiny_wav2letter/tflite_models")
+ wav2letter_tflite_path = os.path.join(output_directory, args.prune * "pruned_" + f"tiny_wav2letter_int8.tflite")
+
+ if not os.path.exists(os.path.dirname(wav2letter_tflite_path)):
+ try:
+ os.makedirs(os.path.dirname(wav2letter_tflite_path))
+ except OSError as exc:
+ raise
+
+ # Convert the saved model to TFLite and then evaluate
+ tflite_conversion(model, wav2letter_tflite_path, "int8")
+ evaluate_tflite(wav2letter_tflite_path)
+
+
+if __name__ == "__main__":
+
+ parser = argparse.ArgumentParser(allow_abbrev=False)
+ parser.add_argument(
+ "--training_set_size",
+ dest="training_set_size",
+ type=int,
+ required=False,
+ default = 132553,
+ help="The number of samples in the training set"
+ )
+ parser.add_argument(
+ "--fluent_speech_training_set_size",
+ dest="fluent_speech_set_size",
+ type=int,
+ required=False,
+ default = 23132,
+ help="The number of samples in the fluent speech training dataset"
+ )
+ parser.add_argument(
+ "--batch_size",
+ dest="batch_size",
+ type=int,
+ required=False,
+ default=32,
+ help="batch size wanted when creating model",
+ )
+
+ parser.add_argument(
+ "--finetuning_epochs",
+ dest="finetuning_epochs",
+ type=int,
+ required=False,
+ default=10,
+ help="Number of epochs for fine-tuning",
+ )
+ parser.add_argument(
+ "--full_data_dir",
+ dest="full_data_dir",
+ type=str,
+ required=False,
+ default='librispeech_full_size',
+ help="Path to dataset directory",
+ )
+ parser.add_argument(
+ "--reduced_data_dir",
+ dest="reduced_data_dir",
+ type=str,
+ required=False,
+ default='librispeech_reduced_size',
+ help="Path to dataset directory",
+ )
+ parser.add_argument(
+ "--fluent_speech_data_dir",
+ dest="fluent_speech_data_dir",
+ type=str,
+ required=False,
+ default='fluent_speech_commands_dataset',
+ help="Path to dataset directory",
+ )
+ parser.add_argument(
+ "--prune",
+ dest="prune",
+ required=False,
+ action='store_true',
+ help="Prune model true or false",
+ )
+ parser.add_argument(
+ "--sparsity",
+ dest="sparsity",
+ type=float,
+ required=False,
+ default=0.5,
+ help="Level of sparsity required",
+ )
+
+ args = parser.parse_args()
+ main(args)
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/recreate_model.sh b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/recreate_model.sh
new file mode 100644
index 0000000..5cc1b39
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/recreate_model.sh
@@ -0,0 +1,11 @@
+python3 -m venv env
+source env/bin/activate
+
+pip install -r requirements.txt
+python preprocessing.py
+python train_model.py --with_baseline --baseline_epochs 30 --with_finetuning --finetuning_epochs 10 --with_fluent_speech --fluent_speech_epochs 30
+python prune_and_quantise_model.py --prune --sparsity 0.5 --finetuning_epochs 10
+python prune_and_quantise_model.py --sparsity 0.5 --finetuning_epochs 10
+
+
+
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/requirements.txt b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/requirements.txt
new file mode 100644
index 0000000..6ea5a87
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/requirements.txt
@@ -0,0 +1,10 @@
+librosa==0.8.1
+numpy==1.19.5
+tensorboard==2.6.0
+tensorboard-data-server==0.6.1
+tensorboard-plugin-profile==2.5.0
+tensorboard-plugin-wit==1.8.0
+tensorflow==2.4.1
+tensorflow-model-optimization==0.6.0
+tqdm
+jiwer==2.3.0
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/saved_models/pruned_tiny_wav2letter/saved_model.pb b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/saved_models/pruned_tiny_wav2letter/saved_model.pb
new file mode 100644
index 0000000..5b2edf2
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/saved_models/pruned_tiny_wav2letter/saved_model.pb
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:081d7d86e2b6fd788ca37b9566213560140f595d7702a2126b5a4b895d00cc9e
+size 206996
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/saved_models/pruned_tiny_wav2letter/variables/variables.data-00000-of-00001 b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/saved_models/pruned_tiny_wav2letter/variables/variables.data-00000-of-00001
new file mode 100644
index 0000000..134cd19
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/saved_models/pruned_tiny_wav2letter/variables/variables.data-00000-of-00001
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:df3647d94be8c5feb098f69122e22007b11873bddb11bdc606112576d4588d4f
+size 15644464
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/saved_models/pruned_tiny_wav2letter/variables/variables.index b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/saved_models/pruned_tiny_wav2letter/variables/variables.index
new file mode 100644
index 0000000..e69b315
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/saved_models/pruned_tiny_wav2letter/variables/variables.index
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e6ba80bc6403066c573eed4e039219c085f2d83b32c3e0e19f40a21a03c8efb7
+size 1339
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/saved_models/tiny_wav2letter/saved_model.pb b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/saved_models/tiny_wav2letter/saved_model.pb
new file mode 100644
index 0000000..93177d0
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/saved_models/tiny_wav2letter/saved_model.pb
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:163b79816a803b5d03d45e2a792e62981a1102fc0d9bd58efb4947d44b5e2af7
+size 264815
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/saved_models/tiny_wav2letter/variables/variables.data-00000-of-00001 b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/saved_models/tiny_wav2letter/variables/variables.data-00000-of-00001
new file mode 100644
index 0000000..230a0dc
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/saved_models/tiny_wav2letter/variables/variables.data-00000-of-00001
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:d4048120f5ff4474cf70d3161723ac756b053e7053df76ffd48387416dab24e4
+size 46925195
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/saved_models/tiny_wav2letter/variables/variables.index b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/saved_models/tiny_wav2letter/variables/variables.index
new file mode 100644
index 0000000..b53f5fd
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/saved_models/tiny_wav2letter/variables/variables.index
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:82c10539eb05229e769397e7ee90a7befa1ec22377032da89440d69c69f4e07d
+size 4440
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/tinywav2letter.py b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/tinywav2letter.py
new file mode 100644
index 0000000..c998497
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/tinywav2letter.py
@@ -0,0 +1,152 @@
+# Copyright (C) 2020 Arm Limited or its affiliates. All rights reserved.
+#
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the License); you may
+# not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an AS IS BASIS, WITHOUT
+# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Model definition for Tinywav2Letter."""
+import tensorflow as tf
+from tensorflow.python.ops import ctc_ops
+import numpy as np
+from jiwer import wer
+
+def get_metrics(metric):
+ """Get metrics needed to compile wav2letter."""
+ def ctc_preparation(tensor, y_predict):
+ if len(y_predict.shape) == 4:
+ y_predict = tf.squeeze(y_predict, axis=1)
+ y_predict = tf.transpose(y_predict, (1, 0, 2))
+ sequence_lengths, labels = tensor[:, 0], tensor[:, 1:]
+ idx = tf.where(tf.not_equal(labels, 28))
+ sparse_labels = tf.SparseTensor(
+ idx, tf.gather_nd(labels, idx), tf.shape(labels, out_type=tf.int64)
+ )
+ return sparse_labels, sequence_lengths, y_predict
+
+ def get_loss():
+ """Calculate CTC loss."""
+ def ctc_loss(y_true, y_predict):
+ sparse_labels, logit_length, y_predict = ctc_preparation(y_true, y_predict)
+ return tf.reduce_mean(
+ ctc_ops.ctc_loss_v2(
+ labels=sparse_labels,
+ logits=y_predict,
+ label_length=None,
+ logit_length=logit_length,
+ blank_index=-1,
+ )
+ )
+ return ctc_loss
+
+ def get_ler():
+ """Calculate CTC LER (Letter Error Rate)."""
+ def ctc_ler(y_true, y_predict):
+ sparse_labels, logit_length, y_predict = ctc_preparation(y_true, y_predict)
+ decoded, log_probabilities = tf.nn.ctc_greedy_decoder(
+ y_predict, tf.cast(logit_length, tf.int32), merge_repeated=True
+ )
+ return tf.reduce_mean(
+ tf.edit_distance(
+ tf.cast(decoded[0], tf.int32), tf.cast(sparse_labels, tf.int32)
+ )
+ )
+ return ctc_ler
+ def get_wer():
+ """Calculate CTC WER (Word Error Rate) only for batch size = 1."""
+
+ def trans_int_to_string(trans_int):
+ #create dictionary int -> string (0 -> a 1 -> b)
+ string = ""
+ alphabet = "abcdefghijklmnopqrstuvwxyz' @"
+ alphabet_dict = {}
+ count = 0
+ for x in alphabet:
+ alphabet_dict[count] = x
+ count += 1
+ for letter in trans_int:
+ letter_np = np.array(letter).item(0)
+ if letter_np != 28:
+ string += alphabet_dict[letter_np]
+ return string
+
+ def ctc_wer(y_true, y_predict):
+ sparse_labels, logit_length, y_predict = ctc_preparation(y_true, y_predict)
+ decoded, log_probabilities = tf.nn.ctc_greedy_decoder(
+ y_predict, tf.cast(logit_length, tf.int32), merge_repeated=True
+ )
+ true_sentence = tf.cast(sparse_labels.values, tf.int32)
+ return wer(str(trans_int_to_string(true_sentence)),str(trans_int_to_string(decoded[0].values)))
+ return ctc_wer
+
+ return {"loss": get_loss(), "ler": get_ler(), "wer": get_wer()}[metric]
+
+
+def create_tinywav2letter(batch_size=1, no_stride_count=5, filters_small=100, filters_large_1=750, filters_large_2=750) -> tf.keras.models.Model:
+ """Create and return Tinywav2Letter model"""
+ layer = tf.keras.layers
+ leaky_relu = layer.LeakyReLU([0.20000000298023224])
+ MFCC_coeffs = 39
+ input = layer.Input(shape=[None, MFCC_coeffs], batch_size=batch_size)
+ # Reshape to prepare input for first layer
+ x = layer.Reshape([1, -1, 39])(input)
+ # One striding layer of output size [batch_size, max_time / 2, 250]
+ x = layer.Conv2D(
+ filters=250,
+ kernel_size=[1, 48],
+ padding="same",
+ activation=None,
+ strides=[1, 2],
+ )(x)
+ # Add non-linearity
+ x = leaky_relu(x)
+    # layers without striding of output size [batch_size, max_time / 2, filters_small]
+ for i in range(0, no_stride_count):
+ x = layer.Conv2D(
+ filters=filters_small,
+ kernel_size=[1, 7],
+ padding="same",
+ activation=None,
+ strides=[1, 1],
+ )(x)
+ # Add non-linearity
+ x = leaky_relu(x)
+    # 1 layer with a large kernel width and output size [batch_size, max_time / 2, filters_large_1]
+ x = layer.Conv2D(
+ filters=filters_large_1,
+ kernel_size=[1, 32],
+ padding="same",
+ activation=None,
+ strides=[1, 1],
+ )(x)
+ # Add non-linearity
+ x = leaky_relu(x)
+    # 1 layer of output size [batch_size, max_time / 2, filters_large_2]
+ x = layer.Conv2D(
+ filters=filters_large_2,
+ kernel_size=[1, 1],
+ padding="same",
+ activation=None,
+ strides=[1, 1],
+ )(x)
+ # Add non-linearity
+ x = leaky_relu(x)
+ # 1 layer of output size [batch_size, max_time / 2, num_classes]
+    # We must not apply a non-linearity in this last layer
+ x = layer.Conv2D(
+ filters=29,
+ kernel_size=[1, 1],
+ padding="same",
+ activation=None,
+ strides=[1, 1],
+ )(x)
+ return tf.keras.models.Model(inputs=[input], outputs=[x])
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/train_model.py b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/train_model.py
new file mode 100644
index 0000000..1ae4943
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_int8/recreate_code/train_model.py
@@ -0,0 +1,267 @@
+# Copyright (C) 2020 Arm Limited or its affiliates. All rights reserved.
+#
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the License); you may
+# not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an AS IS BASIS, WITHOUT
+# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Wav2letter training, optimisation and evaluation script"""
+import argparse
+
+import tensorflow as tf
+
+from tinywav2letter import create_tinywav2letter, get_metrics
+from load_mfccs import MFCC_Loader
+
+options = tf.data.Options()
+options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA
+
+gpus = tf.config.list_logical_devices('GPU')
+strategy = tf.distribute.MirroredStrategy(gpus)
+
+
+def log(std):
+ """Log the given string to the standard output."""
+ print("******* {}".format(std), flush=True)
+
+def get_data(args, dataset_type, batch_size):
+ """Returns particular training and validation dataset."""
+ dataset = MFCC_Loader(args.full_data_dir, args.reduced_data_dir,args.fluent_speech_data_dir)
+
+ return {"train_full_size": [dataset.full_training_set(batch_size=batch_size, num_samples = args.training_set_size).with_options(options), dataset.num_steps(batch=batch_size)],
+ "train_reduced_size": [dataset.reduced_training_set(batch_size=batch_size, num_samples = args.training_set_size).with_options(options), dataset.num_steps(batch=batch_size)],
+ "val_full_size": [dataset.full_validation_set(batch_size=batch_size).with_options(options), dataset.num_steps(batch=batch_size)],
+ "val_reduced_size": [dataset.reduced_validation_set(batch_size=batch_size).with_options(options), dataset.num_steps(batch=batch_size)],
+ "train_fluent_speech": [dataset.fluent_speech_train_set(batch_size=batch_size, num_samples = args.fluent_speech_set_size).with_options(options), dataset.num_steps(batch=batch_size)],
+ "val_fluent_speech": [dataset.fluent_speech_validation_set(batch_size=batch_size).with_options(options), dataset.num_steps(batch=batch_size)],
+ "test_fluent_speech": [dataset.fluent_speech_test_set(batch_size=batch_size).with_options(options),dataset.num_steps(batch=batch_size)],
+ }[dataset_type]
+
+def setup_callbacks(checkpoint_path, log_dir):
+ """Returns callbacks for baseline training and optimization fine-tuning."""
+ callbacks = [
+ tf.keras.callbacks.TerminateOnNaN(),
+ tf.keras.callbacks.ModelCheckpoint(
+ filepath=checkpoint_path,
+ verbose=1,
+ save_weights_only=True,
+ save_freq='epoch', # save every epoch
+ ),
+ tf.keras.callbacks.TensorBoard(
+ log_dir=log_dir,
+ histogram_freq=1, # update every epoch
+ update_freq=100, # update every 100 batch
+
+ ),
+ ]
+ return callbacks
+
+def get_lr_schedule(steps_per_epoch, learning_rate=1e-4, lr_schedule_config=[[1.0, 0.1, 0.01, 0.001]]):
+ """Returns learn rate schedule for baseline training and optimization fine-tuning."""
+ initial_learning_rate = learning_rate
+ lr_schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
+ boundaries=list(p[1] * steps_per_epoch for p in lr_schedule_config),
+ values=[initial_learning_rate] + list(p[0] * initial_learning_rate for p in lr_schedule_config))
+ return lr_schedule
+
+
+def train_model(args):
+    """Trains Tiny Wav2Letter (baseline, fine-tuning and fluent speech stages) and saves the final model"""
+ log("Commencing Model Training")
+
+ # Get all of the required datasets
+ (full_training_data, full_training_num_steps) = get_data(args, "train_full_size", args.batch_size)
+ (reduced_training_data, reduced_training_num_steps) = get_data(args, "train_reduced_size", args.batch_size)
+ (full_validation_data, full_validation_num_steps) = get_data(args, "val_full_size", args.batch_size)
+ (reduced_validation_data, reduced_validation_num_steps) = get_data(args, "val_reduced_size", args.batch_size)
+ (fluent_speech_training_data, fluent_speech_training_num_steps) = get_data(args, "train_fluent_speech", args.batch_size)
+ (fluent_speech_validation_data, fluent_speech_validation_num_steps) = get_data(args, "val_fluent_speech", args.batch_size)
+ (fluent_speech_test_data, fluent_speech_test_num_steps) = get_data(args, "test_fluent_speech",args.batch_size)
+
+ # Set up checkpoint paths, directories for the log files and the callbacks
+ baseline_checkpoint_path = f"checkpoints/baseline/checkpoint.ckpt"
+ finetuning_checkpoint_path = f"checkpoints/finetuning/checkpoint.ckpt"
+
+ baseline_log_dir = f"logs/tiny_wav2letter_baseline"
+ finetuning_log_dir = f"logs/tiny_wav2letter_finetuning"
+
+ baseline_callbacks = setup_callbacks(baseline_checkpoint_path, baseline_log_dir)
+ finetuning_callbacks = setup_callbacks(finetuning_checkpoint_path, finetuning_log_dir)
+
+ # Initialise the Tiny Wav2Letter model
+ with strategy.scope():
+ model = create_tinywav2letter(batch_size = args.batch_size)
+
+
+ # Perform the baseline training with the full size LibriSpeech dataset
+ if args.with_baseline:
+
+ with strategy.scope():
+ opt = tf.keras.optimizers.Adam(learning_rate=get_lr_schedule(learning_rate = 1e-4, steps_per_epoch=full_training_num_steps))
+ model.compile(loss=get_metrics("loss"), metrics=[get_metrics("ler")], optimizer=opt)
+
+ model.fit(
+ full_training_data,
+ epochs=args.baseline_epochs,
+ verbose=1,
+ callbacks=baseline_callbacks,
+ validation_data=full_validation_data,
+ initial_epoch = 0
+ )
+
+ log(f'Evaluating Tiny Wav2Letter post baseline training')
+ model.evaluate(fluent_speech_test_data)
+
+ # Perform finetuning training with the reduced size MiniLibriSpeech dataset
+ if args.with_finetuning:
+
+ with strategy.scope():
+ opt = tf.keras.optimizers.Adam(learning_rate=get_lr_schedule(learning_rate = 1e-5, steps_per_epoch=reduced_training_num_steps))
+ model.compile(loss=get_metrics("loss"), metrics=[get_metrics("ler")], optimizer=opt)
+
+ model.fit(
+ reduced_training_data,
+ epochs=args.finetuning_epochs + args.baseline_epochs,
+ verbose=1,
+ callbacks=finetuning_callbacks,
+ validation_data=reduced_validation_data,
+ initial_epoch = args.baseline_epochs
+ )
+
+ log(f'Evaluating Tiny Wav2Letter post finetuning')
+ model.evaluate(x=fluent_speech_test_data)
+
+
+ if args.with_fluent_speech:
+
+ with strategy.scope():
+ opt = tf.keras.optimizers.Adam(learning_rate=get_lr_schedule(learning_rate = 1e-5, steps_per_epoch=fluent_speech_training_num_steps))
+ model.compile(loss=get_metrics("loss"), metrics=[get_metrics("ler")], optimizer=opt,run_eagerly=True)
+
+ model.fit(
+ fluent_speech_training_data,
+ epochs=args.finetuning_epochs + args.baseline_epochs,
+ verbose=1,
+ callbacks=finetuning_callbacks,
+ validation_data=fluent_speech_validation_data,
+ initial_epoch = args.baseline_epochs
+ )
+
+ model.evaluate(x=fluent_speech_test_data)
+ # Save the final trained model in TF SavedModel format
+ model.save(f"saved_models/tiny_wav2letter")
+
+if __name__ == "__main__":
+
+ parser = argparse.ArgumentParser(allow_abbrev=False)
+ parser.add_argument(
+ "--training_set_size",
+ dest="training_set_size",
+ type=int,
+ required=False,
+ default = 132553,
+ help="The number of samples in the training set"
+ )
+ parser.add_argument(
+ "--fluent_speech_training_set_size",
+ dest="fluent_speech_set_size",
+ type=int,
+ required=False,
+ default = 23132,
+ help="The number of samples in the fluent speech training dataset"
+ )
+ parser.add_argument(
+ "--batch_size",
+ dest="batch_size",
+ type=int,
+ required=False,
+ default=32,
+ help="batch size wanted when creating model",
+ )
+ parser.add_argument(
+ "--full_data_dir",
+ dest="full_data_dir",
+ type=str,
+ required=False,
+ default='librispeech_full_size',
+ help="Path to dataset directory",
+ )
+ parser.add_argument(
+ "--reduced_data_dir",
+ dest="reduced_data_dir",
+ type=str,
+ required=False,
+ default='librispeech_reduced_size',
+ help="Path to dataset directory",
+ )
+ parser.add_argument(
+ "--fluent_speech_data_dir",
+ dest="fluent_speech_data_dir",
+ type=str,
+ required=False,
+ default='fluent_speech_commands_dataset',
+ help="Path to dataset directory",
+ )
+ parser.add_argument(
+ "--load_model",
+ dest="load_model",
+ required=False,
+ action='store_true',
+        help="Load a previously saved model",
+ )
+ parser.add_argument(
+ "--with_baseline",
+ dest="with_baseline",
+ required=False,
+ action='store_true',
+ help="Perform pre-training baseline using the full size dataset",
+ )
+ parser.add_argument(
+ "--with_finetuning",
+ dest="with_finetuning",
+ required=False,
+ action='store_true',
+ help="Perform fine-tuning training using the reduced corpus dataset",
+ )
+ parser.add_argument(
+ "--with_fluent_speech",
+ dest="with_fluent_speech",
+ required=False,
+ action='store_true',
+ help="Perform fluent_speech training using the fluent speech dataset",
+ )
+ parser.add_argument(
+ "--baseline_epochs",
+ dest="baseline_epochs",
+ type=int,
+ required=False,
+ default=30,
+ help="Number of epochs for baseline training",
+ )
+ parser.add_argument(
+ "--finetuning_epochs",
+ dest="finetuning_epochs",
+ type=int,
+ required=False,
+ default=10,
+ help="Number of epochs for fine-tuning",
+ )
+ parser.add_argument(
+ "--fluent_speech_epochs",
+ dest="fluent_speech_epochs",
+ type=int,
+ required=False,
+ default=30,
+ help="Number of epochs for fluent speech training",
+ )
+ args = parser.parse_args()
+ train_model(args)
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_int8/testing_input/input_1_int8/0.npy b/models/speech_recognition/tiny_wav2letter/tflite_int8/testing_input/input_1_int8/0.npy
new file mode 100644
index 0000000..0d65816
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_int8/testing_input/input_1_int8/0.npy
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:8b4c5a0ac79f152bca4fe9fe66d1d2ac9981aba59e935cde12cbc40ceedad4f9
+size 11672
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_int8/testing_output/Identity_int8/0.npy b/models/speech_recognition/tiny_wav2letter/tflite_int8/testing_output/Identity_int8/0.npy
new file mode 100644
index 0000000..2bcaab3
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_int8/testing_output/Identity_int8/0.npy
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:fcc8e66ffab57ef3e42b544349f33e8ff4c5ff5077d7505710ea115769335b63
+size 4420
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_int8/tiny_wav2letter_int8.tflite b/models/speech_recognition/tiny_wav2letter/tflite_int8/tiny_wav2letter_int8.tflite
new file mode 100644
index 0000000..dc2ef41
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_int8/tiny_wav2letter_int8.tflite
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:953f63f82c375520bb34e0e80d005bdec5aa4e1090d1702e3dae56f4988edea0
+size 3997112
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/README.md b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/README.md
new file mode 100644
index 0000000..c0d8535
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/README.md
@@ -0,0 +1,75 @@
+# Tiny Wav2letter Pruned INT8
+
+## Description
+Tiny Wav2letter is a tiny version of the original Wav2Letter model. It is a convolutional speech recognition neural network. This implementation was created by Arm, pruned to 50% sparsity, fine-tuned and quantized using the TensorFlow Model Optimization Toolkit.
+
+
+## License
+[Apache-2.0](https://spdx.org/licenses/Apache-2.0.html)
+
+## Network Information
+| Network Information | Value |
+|---------------------|----------------|
+| Framework | TensorFlow Lite |
+| SHA-1 Hash | edc581b85190b2bcbfba904b50645264be52f516 |
+| Size (Bytes) | 3997112 |
+| Provenance | https://github.com/ARM-software/ML-zoo/tree/master/models/speech_recognition/wav2letter |
+| Paper | https://arxiv.org/abs/1609.03193 |
+
+## Performance
+
+| Platform | Optimized |
+|----------|:---------:|
+| Cortex-A |:heavy_check_mark: |
+| Cortex-M |:heavy_check_mark: |
+| Mali GPU |:heavy_multiplication_x: |
+| Ethos U |:heavy_check_mark: |
+
+### Key
+* :heavy_check_mark: - Will run on this platform.
+* :heavy_multiplication_x: - Will not run on this platform.
+
+## Accuracy
+Dataset: Fluent Speech (trained on LibriSpeech, Mini LibriSpeech and Fluent Speech)
+
+Please note that the Fluent Speech dataset hosted on Kaggle is a licensed dataset.
+
+
+
+| Metric | Value |
+|--------|-------|
+| LER | 0.0283 |
+| WER | 0.089 |
+
+## Optimizations
+| Optimization | Value |
+|--------------|---------|
+| Quantization | INT8 |
+
+## Network Inputs
+
+| Input Node Name | Shape | Description |
+|-----------------|-------|-------------|
+| input_1_int8 | (1, 296, 39) | Speech converted to MFCCs and quantized to INT8 |
+
+## Network Outputs
+
+| Output Node Name | Shape | Description |
+|------------------|-------|-------------|
+| Identity_int8 | (1, 1, 148, 29) | A tensor of time and class probabilities that represents the probability of each class at each timestep. It should be passed to a decoder, for example ctc_beam_search_decoder (see the sketch below). |
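+
+## Decoding the Output
+
+The following is a minimal sketch of the post-processing, not the repository's reference decoder. It assumes a `tf.lite.Interpreter` named `interpreter` has already been invoked on one input window and that `output_details` holds the metadata of the Identity_int8 tensor; the greedy decoder is used for brevity, and tf.nn.ctc_beam_search_decoder can be substituted.
+
+```python
+import numpy as np
+import tensorflow as tf
+
+# Dequantize the INT8 logits using the output tensor's quantization parameters
+raw = interpreter.get_tensor(output_details["index"])          # int8, shape (1, 1, 148, 29)
+scale, zero_point = output_details["quantization"]
+logits = tf.cast(scale * (raw.astype(np.float32) - zero_point), tf.float32)
+
+# (1, 1, 148, 29) -> (time, batch, classes), the layout expected by the CTC decoder
+logits = tf.transpose(tf.squeeze(logits, axis=1), (1, 0, 2))
+decoded, _ = tf.nn.ctc_greedy_decoder(logits, sequence_length=[logits.shape[0]], merge_repeated=True)
+
+# Map the decoded integer labels back to characters (29-symbol alphabet, index 28 is the blank)
+alphabet = "abcdefghijklmnopqrstuvwxyz' @"
+text = "".join(alphabet[int(i)] for i in tf.sparse.to_dense(decoded[0]).numpy()[0])
+print(text)
+```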
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/definition.yaml b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/definition.yaml
new file mode 100644
index 0000000..4947ebb
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/definition.yaml
@@ -0,0 +1,60 @@
+author_notes: null
+benchmark:
+ benchmark_description: please note that fluent-speech-corpus dataset hosted on Kaggle
+ is a licensed dataset.
+ benchmark_link: https://www.kaggle.com/tommyngx/fluent-speech-corpus
+ benchmark_metrics:
+ LER: '0.0283'
+ WER: '0.0886'
+ benchmark_name: Fluent speech
+description: "Tiny Wav2letter is a tiny version of the original Wav2Letter model.\
+ \ It is a convolutional speech recognition neural network. This implementation was\
+ \ created by Arm, pruned to 50% sparsity, fine-tuned and quantized using the TensorFlow\
+ \ Model Optimization Toolkit.\r\n\r\n"
+license:
+- Apache-2.0
+network:
+ datatype: int8
+ file_size_bytes: 3997112
+ filename: tiny_wav2letter_pruned_int8.tflite
+ framework: TensorFlow Lite
+ framework_version: 2.4.1
+ hash:
+ algorithm: sha1
+ value: edc581b85190b2bcbfba904b50645264be52f516
+ provenance: https://github.com/ARM-software/ML-zoo/tree/master/models/speech_recognition/wav2letter
+  training: LibriSpeech, Mini LibriSpeech, Fluent Speech
+network_parameters:
+ input_nodes:
+ - description: Speech converted to MFCCs and quantized to INT8
+ example_input:
+ path: models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/testing_input/input_1_int8
+ input_datatype: int8
+ name: input_1_int8
+ shape:
+ - 1
+ - 296
+ - 39
+ output_nodes:
+ - description: A tensor of time and class probabilities, that represents the probability of
+ each class at each timestep. Should be passed to a decoder. For example ctc_beam_search_decoder.
+ example_output:
+ path: models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/testing_output/Identity_int8
+ name: Identity_int8
+ output_datatype: int8
+ shape:
+ - 1
+ - 1
+ - 148
+ - 29
+network_quality:
+ quality_level: Deployable
+ quality_level_hero_hw: null
+operators:
+ TensorFlow Lite:
+ - CONV_2D
+ - DEQUANTIZE
+ - LEAKY_RELU
+ - QUANTIZE
+ - RESHAPE
+paper: https://arxiv.org/abs/1609.03193
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/demo_input/84-121550-0000.flac b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/demo_input/84-121550-0000.flac
new file mode 100644
index 0000000..8dd88a8
Binary files /dev/null and b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/demo_input/84-121550-0000.flac differ
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/inference_demo.ipynb b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/inference_demo.ipynb
new file mode 100644
index 0000000..5a43bf4
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/inference_demo.ipynb
@@ -0,0 +1,323 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "source": [
+ "import multiprocessing\n",
+ "import tensorflow as tf\n",
+ "import librosa\n",
+ "import numpy as np\n",
+ "from jiwer import wer"
+ ],
+ "outputs": [],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "This is the audio file we are going to transcribe, as well as the ground truth transcription"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 92,
+ "source": [
+ "audio_file = 'demo_input/84-121550-0000.flac'\n",
+ "transcript = 'BUT WITH FULL RAVISHMENT THE HOURS OF PRIME SINGING RECEIVED THEY IN THE MIDST OF LEAVES THAT EVER BORE A BURDEN TO THEIR RHYMES'"
+ ],
+ "outputs": [],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We first convert the transcript into integers, as well as defining a reverse mapping for decoding the final output."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 101,
+ "source": [
+ "alphabet = \"abcdefghijklmnopqrstuvwxyz' @\"\n",
+ "alphabet_dict = {c: ind for (ind, c) in enumerate(alphabet)}\n",
+ "index_dict = {ind: c for (ind, c) in enumerate(alphabet)}\n",
+ "transcript_ints = [alphabet_dict[letter] for letter in transcript.lower()]\n",
+ "print(transcript_ints)"
+ ],
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "[1, 20, 19, 27, 22, 8, 19, 7, 27, 5, 20, 11, 11, 27, 17, 0, 21, 8, 18, 7, 12, 4, 13, 19, 27, 19, 7, 4, 27, 7, 14, 20, 17, 18, 27, 14, 5, 27, 15, 17, 8, 12, 4, 27, 18, 8, 13, 6, 8, 13, 6, 27, 17, 4, 2, 4, 8, 21, 4, 3, 27, 19, 7, 4, 24, 27, 8, 13, 27, 19, 7, 4, 27, 12, 8, 3, 18, 19, 27, 14, 5, 27, 11, 4, 0, 21, 4, 18, 27, 19, 7, 0, 19, 27, 4, 21, 4, 17, 27, 1, 14, 17, 4, 27, 0, 27, 1, 20, 17, 3, 4, 13, 27, 19, 14, 27, 19, 7, 4, 8, 17, 27, 17, 7, 24, 12, 4, 18]\n"
+ ]
+ }
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We then load the audio file and convert it to MFCCs (with an extra batch dimension)."
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 70,
+ "source": [
+ "def normalize(values):\n",
+ " \"\"\"\n",
+ " Normalize values to mean 0 and std 1\n",
+ " \"\"\"\n",
+ " return (values - np.mean(values)) / np.std(values)\n",
+ "\n",
+ "def transform_audio_to_mfcc(audio_file, transcript, n_mfcc=13, n_fft=512, hop_length=160):\n",
+ " audio_data, sample_rate = librosa.load(audio_file, sr=16000)\n",
+ "\n",
+ " mfcc = librosa.feature.mfcc(audio_data, sr=sample_rate, n_mfcc=n_mfcc, n_fft=n_fft, hop_length=hop_length)\n",
+ "\n",
+ " # add derivatives and normalize\n",
+ " mfcc_delta = librosa.feature.delta(mfcc)\n",
+ " mfcc_delta2 = librosa.feature.delta(mfcc, order=2)\n",
+ " mfcc = np.concatenate((normalize(mfcc), normalize(mfcc_delta), normalize(mfcc_delta2)), axis=0)\n",
+ "\n",
+ " seq_length = mfcc.shape[1] // 2\n",
+ "\n",
+ " sequences = np.concatenate([[seq_length], transcript]).astype(np.int32)\n",
+ " sequences = np.expand_dims(sequences, 0)\n",
+ " mfcc_out = mfcc.T.astype(np.float32)\n",
+ " mfcc_out = np.expand_dims(mfcc_out, 0)\n",
+ "\n",
+ " return mfcc_out, sequences"
+ ],
+ "outputs": [],
+ "metadata": {}
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 33,
+ "source": [
+ "def log(std):\n",
+ " \"\"\"Log the given string to the standard output.\"\"\"\n",
+ " print(\"******* {}\".format(std), flush=True)"
+ ],
+ "outputs": [],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We use the ctc decoder to decode the output of the network"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 99,
+ "source": [
+ "def ctc_preparation(tensor, y_predict):\n",
+ " if len(y_predict.shape) == 4:\n",
+ " y_predict = tf.squeeze(y_predict, axis=1)\n",
+ " y_predict = tf.transpose(y_predict, (1, 0, 2))\n",
+ " sequence_lengths, labels = tensor[:, 0], tensor[:, 1:]\n",
+ " idx = tf.where(tf.not_equal(labels, 28))\n",
+ " sparse_labels = tf.SparseTensor(\n",
+ " idx, tf.gather_nd(labels, idx), tf.shape(labels, out_type=tf.int64)\n",
+ " )\n",
+ " return sparse_labels, sequence_lengths, y_predict\n",
+ "\n",
+ "def ctc_ler(y_true, y_predict):\n",
+ " sparse_labels, logit_length, y_predict = ctc_preparation(y_true, y_predict)\n",
+ " decoded, log_probabilities = tf.nn.ctc_greedy_decoder(\n",
+ " y_predict, tf.cast(logit_length, tf.int32), merge_repeated=True\n",
+ " )\n",
+ " return tf.reduce_mean(\n",
+ " tf.edit_distance(\n",
+ " tf.cast(decoded[0], tf.int32), tf.cast(sparse_labels, tf.int32)\n",
+ " ).numpy()\n",
+ " ), tf.sparse.to_dense(decoded[0]).numpy()\n",
+ "\n",
+ "def trans_int_to_string(trans_int):\n",
+ " #create dictionary int -> string (0 -> a 1 -> b)\n",
+ " string = \"\"\n",
+ " alphabet = \"abcdefghijklmnopqrstuvwxyz' @\"\n",
+ " alphabet_dict = {}\n",
+ " count = 0\n",
+ " for x in alphabet:\n",
+ " alphabet_dict[count] = x\n",
+ " count += 1\n",
+ " for letter in trans_int:\n",
+ " letter_np = np.array(letter).item(0)\n",
+ " if letter_np != 28:\n",
+ " string += alphabet_dict[letter_np]\n",
+ " return string\n",
+ "\n",
+ "def ctc_wer(y_true, y_predict):\n",
+ " sparse_labels, logit_length, y_predict = ctc_preparation(y_true, y_predict)\n",
+ " decoded, log_probabilities = tf.nn.ctc_greedy_decoder(\n",
+ " y_predict, tf.cast(logit_length, tf.int32), merge_repeated=True\n",
+ " )\n",
+ " true_sentence = tf.cast(sparse_labels.values, tf.int32)\n",
+ " return wer(str(trans_int_to_string(decoded[0].values)), str(trans_int_to_string(true_sentence)))"
+ ],
+ "outputs": [],
+ "metadata": {}
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "The TFLite file requires inputs of size 296, so we apply a window to the input"
+ ],
+ "metadata": {}
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 94,
+ "source": [
+ "def evaluate_tflite(tflite_path, input_window_length = 296):\n",
+ " \"\"\"Evaluates tflite (fp32, int8).\"\"\"\n",
+ " results = []\n",
+ " data, label = transform_audio_to_mfcc(audio_file, transcript_ints)\n",
+ "\n",
+ " interpreter = tf.lite.Interpreter(model_path=tflite_path, num_threads=multiprocessing.cpu_count())\n",
+ " interpreter.allocate_tensors()\n",
+ " input_chunk = interpreter.get_input_details()[0]\n",
+ " output_details = interpreter.get_output_details()[0]\n",
+ "\n",
+ " input_shape = input_chunk[\"shape\"]\n",
+ " log(\"eval_model() - input_shape: {}\".format(input_shape))\n",
+ " input_dtype = input_chunk[\"dtype\"]\n",
+ " output_dtype = output_details[\"dtype\"]\n",
+ "\n",
+ " # Check if the input/output type is quantized,\n",
+ " # set scale and zero-point accordingly\n",
+ " if input_dtype != tf.float32:\n",
+ " input_scale, input_zero_point = input_chunk[\"quantization\"]\n",
+ " else:\n",
+ " input_scale, input_zero_point = 1, 0\n",
+ "\n",
+ " if output_dtype != tf.float32:\n",
+ " output_scale, output_zero_point = output_details[\"quantization\"]\n",
+ " else:\n",
+ " output_scale, output_zero_point = 1, 0\n",
+ "\n",
+ "\n",
+ " data = data / input_scale + input_zero_point\n",
+ " # Round the data up if dtype is int8, uint8 or int16\n",
+ " if input_dtype is not np.float32:\n",
+ " data = np.round(data)\n",
+ "\n",
+ " while data.shape[1] < input_window_length:\n",
+ " data = np.append(data, data[:, -2:-1, :], axis=1)\n",
+ " # Zero-pad any odd-length inputs\n",
+ " if data.shape[1] % 2 == 1:\n",
+ " # log('Input length is odd, zero-padding to even (first layer has stride 2)')\n",
+ " data = np.concatenate([data, np.zeros((1, 1, data.shape[2]), dtype=input_dtype)], axis=1)\n",
+ "\n",
+ " context = 24 + 2 * (7 * 3 + 16) # = 98 - theoretical max receptive field on each side\n",
+ " size = input_chunk['shape'][1]\n",
+ " inner = size - 2 * context\n",
+ " data_end = data.shape[1]\n",
+ "\n",
+ " # Initialize variables for the sliding window loop\n",
+ " data_pos = 0\n",
+ " outputs = []\n",
+ "\n",
+ " while data_pos < data_end:\n",
+ " if data_pos == 0:\n",
+ " # Align inputs from the first window to the start of the data and include the intial context in the output\n",
+ " start = data_pos\n",
+ " end = start + size\n",
+ " y_start = 0\n",
+ " y_end = y_start + (size - context) // 2\n",
+ " data_pos = end - context\n",
+ " elif data_pos + inner + context >= data_end:\n",
+ " # Shift left to align final window to the end of the data and include the final context in the output\n",
+ " shift = (data_pos + inner + context) - data_end\n",
+ " start = data_pos - context - shift\n",
+ " end = start + size\n",
+ " assert start >= 0\n",
+ " y_start = (shift + context) // 2 # Will be even because we assert it above\n",
+ " y_end = size // 2\n",
+ " data_pos = data_end\n",
+ " else:\n",
+ " # Capture only the inner region from mid-input inferences, excluding output from both context regions\n",
+ " start = data_pos - context\n",
+ " end = start + size\n",
+ " y_start = context // 2\n",
+ " y_end = y_start + inner // 2\n",
+ " data_pos = end - context\n",
+ "\n",
+ " interpreter.set_tensor(input_chunk[\"index\"], tf.cast(data[:, start:end, :], input_dtype))\n",
+ " interpreter.invoke()\n",
+ " cur_output_data = interpreter.get_tensor(output_details[\"index\"])[:, :, y_start:y_end, :]\n",
+ " cur_output_data = output_scale * (\n",
+ " cur_output_data.astype(np.float32) - output_zero_point\n",
+ " )\n",
+ " outputs.append(cur_output_data)\n",
+ "\n",
+ " complete = np.concatenate(outputs, axis=2)\n",
+ " LER, output = ctc_ler(label, complete)\n",
+ " WER = ctc_wer(label, complete)\n",
+ " return output, LER , WER\n"
+ ],
+ "outputs": [],
+ "metadata": {}
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 107,
+ "source": [
+ "wav2letter_tflite_path = \"tflite_int8/tiny_wav2letter_int8.tflite\"\n",
+ "output, LER , WER = evaluate_tflite(wav2letter_tflite_path)\n",
+ "\n",
+ "decoded_output = [index_dict[value] for value in output[0]]\n",
+ "log(f'Transcribed File: {\"\".join(decoded_output)}')\n",
+ "log(f'Letter Error Rate is {LER}')\n",
+ "log(f'Word Error Rate is {WER}')"
+ ],
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "******* eval_model() - input_shape: [ 1 296 39]\n",
+ "******* Input length is odd, zero-padding to even (first layer has stride 2)\n",
+ "******* Transcribed File: but with full ravishment the hours of prime singing received they in the midst of leaves that everborea burden to their rimes\n",
+ "******* Letter Error Rate is 0.03125\n",
+ "******* Word Error Rate is 1.05\n"
+ ]
+ }
+ ],
+ "metadata": {}
+ }
+ ],
+ "metadata": {
+ "orig_nbformat": 4,
+ "language_info": {
+ "name": "python",
+ "version": "3.8.2",
+ "mimetype": "text/x-python",
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "pygments_lexer": "ipython3",
+ "nbconvert_exporter": "python",
+ "file_extension": ".py"
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3.8.2 64-bit ('env': venv)"
+ },
+ "interpreter": {
+ "hash": "4b529a2edd0e262cfd8353ba70b138cbba10314325c544d99b9316c477c7841b"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/model_development_guide.md b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/model_development_guide.md
new file mode 100644
index 0000000..546a975
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/model_development_guide.md
@@ -0,0 +1,58 @@
+# Model Development Guide
+
+This document describes the process of training a model from scratch, using the Tiny Wav2Letter model as an example.
+
+## Datasets
+
+The first thing to decide is which dataset the model is to be trained on. Most commonly used datasets can either be found online or in the ARM AWS S3 bucket. In the case of Tiny Wav2Letter, both the LibriSpeech dataset hosted on [OpenSLR](http://www.openslr.org/resources.php) and the fluent-speech-corpus dataset hosted on [Kaggle](https://www.kaggle.com/tommyngx/fluent-speech-corpus) were used to train the model.
+
+Please note that the fluent-speech-corpus dataset hosted on [Kaggle](https://www.kaggle.com/tommyngx/fluent-speech-corpus) is a licensed dataset.
+
+## Preprocessing
+
+The dataset is often not in the right format for training, so preprocessing steps must be taken. In this case, the LibriSpeech dataset consists of audio files, but the paper specifies MFCCs as the network input, so the audio files needed to be converted. It is recommended that all preprocessing be performed offline, as this makes the actual training process faster because the data is already in the correct format. The most convenient way to store the preprocessed data is using [TFRecords](https://www.tensorflow.org/tutorials/load_data/tfrecord), as these are very easily loaded into TF Datasets. While it can take a long time to write the whole dataset to a TFRecord file, this cost is outweighed by the time saved during training.
+
+Please note: the input audio data sample rate is 16 kHz.
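+
+As an illustration, the following is a minimal sketch of writing precomputed MFCC features and integer-encoded transcripts to a TFRecord file. The feature names and the `preprocessed_samples` iterable are assumptions made for the example, not the exact pipeline used for Tiny Wav2Letter.
+
+```python
+import numpy as np
+import tensorflow as tf
+
+def serialize_example(mfcc: np.ndarray, transcript) -> bytes:
+    """Packs one MFCC matrix and its integer-encoded transcript into a tf.train.Example."""
+    feature = {
+        "mfcc": tf.train.Feature(float_list=tf.train.FloatList(value=mfcc.flatten())),
+        "mfcc_shape": tf.train.Feature(int64_list=tf.train.Int64List(value=list(mfcc.shape))),
+        "transcript": tf.train.Feature(int64_list=tf.train.Int64List(value=list(transcript))),
+    }
+    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()
+
+# `preprocessed_samples` is a hypothetical iterable of (mfcc, transcript) pairs produced offline
+with tf.io.TFRecordWriter("train.tfrecord") as writer:
+    for mfcc, transcript in preprocessed_samples:
+        writer.write(serialize_example(mfcc, transcript))
+```
+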
+## Model Architecture
+
+The model architecture can generally be found from a variety of sources. If a similar model exists in the IMZ, then [Netron](https://netron.app) can be used to inspect the TFLite file. The original paper in which the model was proposed will also define the architecture. The model should ideally be defined using the TensorFlow Functional API rather than the Sequential API.
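+
+For reference, the following is a minimal Functional API sketch of a convolutional block in the same style as this model; the layer sizes are illustrative rather than the exact Tiny Wav2Letter configuration.
+
+```python
+import tensorflow as tf
+
+# Inputs and outputs are wired explicitly, which makes it straightforward to rebuild
+# the same graph later with a fixed input length for TFLite export.
+inputs = tf.keras.layers.Input(shape=(None, 39))                     # variable-length MFCC sequence
+x = tf.keras.layers.Reshape((1, -1, 39))(inputs)
+x = tf.keras.layers.Conv2D(250, (1, 48), strides=(1, 2), padding="same")(x)
+x = tf.keras.layers.LeakyReLU(0.2)(x)
+outputs = tf.keras.layers.Conv2D(29, (1, 1), padding="same")(x)
+model = tf.keras.models.Model(inputs=inputs, outputs=outputs)
+```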
+
+## Loss Function and Metrics
+
+The loss function and desired metrics will be defined by the model. If at all possible, structure the data such that the input to the loss function is in the form (y_true, y_predicted), as this will enable model.fit to be used and avoid custom training loops. TensorFlow provides many standard loss functions out of the box, but custom loss functions can be defined if need be, as was the case for Tiny Wav2Letter; a minimal sketch is shown below.
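+
+The following is a minimal sketch of such a custom loss in the (y_true, y_predicted) form, using TensorFlow's built-in CTC loss; the label layout (logit length stored in the first column, padding index 28) follows the convention used by the training scripts in this folder.
+
+```python
+import tensorflow as tf
+
+def ctc_loss(y_true, y_pred):
+    # y_true packs the logit length in column 0 and the integer-encoded labels afterwards
+    logit_length, labels = y_true[:, 0], y_true[:, 1:]
+    if len(y_pred.shape) == 4:
+        y_pred = tf.squeeze(y_pred, axis=1)        # (batch, 1, time, classes) -> (batch, time, classes)
+    y_pred = tf.transpose(y_pred, (1, 0, 2))       # time-major, as expected by the CTC loss
+    idx = tf.where(tf.not_equal(labels, 28))       # 28 is the padding/blank index
+    sparse_labels = tf.SparseTensor(idx, tf.gather_nd(labels, idx), tf.shape(labels, out_type=tf.int64))
+    return tf.reduce_mean(
+        tf.nn.ctc_loss(
+            labels=tf.cast(sparse_labels, tf.int32),
+            logits=y_pred,
+            label_length=None,
+            logit_length=tf.cast(logit_length, tf.int32),
+            blank_index=-1,
+        )
+    )
+
+# model.compile(loss=ctc_loss, optimizer="adam")
+```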
+
+## Model Training
+
+If everything else has been set up properly, the code here should not be complicated. Load the datasets, create an instance of the model, and then ideally run model.fit; if that is not possible, use tf.GradientTape. Use callbacks to write to a log directory (tf.keras.callbacks.TensorBoard), then use TensorBoard to visualise the training process; a minimal callback setup is sketched below. The [TensorFlow Profiler](https://www.tensorflow.org/guide/profiler) can be used to identify bottlenecks in the training pipeline and speed up training. Another useful callback is tf.keras.callbacks.ModelCheckpoint, which saves a checkpoint at defined intervals so that training can be resumed from where it was left off.
+
+Generally we will want a training set, a validation set and a test set, normally with about a 90:5:5 split. If the model performs well on the training set but not on the validation or test set, then the model is overfitting. This can be reduced by introducing regularisation, increasing the amount of data, reducing model complexity or adjusting hyperparameters. In the case of Tiny Wav2Letter, the model was initially trained on the full-size LibriSpeech dataset to capture the features of speech, then fine-tuned on the much smaller Mini LibriSpeech to improve accuracy on the smaller dataset, and finally fine-tuned on the fluent-speech-corpus dataset.
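+
+A minimal sketch of that setup, assuming `model`, `train_ds` and `val_ds` have already been created:
+
+```python
+import tensorflow as tf
+
+callbacks = [
+    tf.keras.callbacks.TensorBoard(log_dir="logs/tiny_wav2letter", update_freq=100),
+    tf.keras.callbacks.ModelCheckpoint(
+        filepath="checkpoints/checkpoint.ckpt", save_weights_only=True, save_freq="epoch"
+    ),
+    tf.keras.callbacks.TerminateOnNaN(),
+]
+
+model.fit(train_ds, validation_data=val_ds, epochs=30, callbacks=callbacks)
+```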
+
+## Optimisation and Conversion
+
+Once the model has been trained to satisfaction, it can optionally be optimised using the TensorFlow Model Optimization Toolkit. Pruning sets a specified percentage of the weights to 0, so the model is sparser, which can lead to faster inference. Clustering groups together weights of similar values, reducing the number of unique values; this again can lead to faster inference. Quantisation (e.g. to INT8) converts all the weights to INT8 representations, giving a 4x reduction in size compared to FP32. If the quantisation process affects the metric too severely, quantisation-aware training can be performed, which fine-tunes the model and makes it more robust to quantisation. Quantisation-aware training requires at least TensorFlow 2.5.0. The final step is to convert the model to the TFLite model format; if using INT8 conversion, one must define a representative dataset to calibrate the model, as in the sketch below.
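+
+The following is a minimal sketch of INT8 post-training quantisation with a representative dataset; `model` and `calibration_samples` (an assumed iterable of float32 MFCC windows of shape (1, 296, 39)) are expected to exist already.
+
+```python
+import tensorflow as tf
+
+def representative_dataset():
+    # Yield a few hundred representative inputs so the converter can calibrate the INT8 ranges
+    for sample in calibration_samples:
+        yield [sample]
+
+converter = tf.lite.TFLiteConverter.from_keras_model(model)
+converter.optimizations = [tf.lite.Optimize.DEFAULT]
+converter.representative_dataset = representative_dataset
+converter.inference_input_type = tf.int8
+converter.inference_output_type = tf.int8
+
+with open("tiny_wav2letter_int8.tflite", "wb") as f:
+    f.write(converter.convert())
+```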
+
+## Training a smaller FP32 Keras Model
+
+The trained Wav2Letter model then serves as the foundation for investigations into how best to reduce the size of the network.
+
+There are three hyperparameters relevant to the size of the network. They are:
+
+- Number of layers
+- Number of filters in each convolutional layer
+- Stride of each convolutional filter
+
+The following table shows the chosen architecture for Tiny Wav2Letter and the effect that it has on the size of the network.
+
+| Identifier | Total Number of Layers | Number of middle Convolutional Layers | Corresponding number of filters | Number of filters in the antepenultimate Conv2D Layer | Number of filters in the penultimate Conv2D Layer |
+| ------ | ------ | ------ | ------ | ------ | ------ |
+| Wav2Letter | 11 | 7 | 250 | 2000 | 2000 |
+| Tiny Wav2Letter | 6 | 5 | 100 | 750 | 750 |
+
+| Identifier | Size (MB) | LER | WER |
+| ------ | ------ | ------ | ------ |
+| Wav2Letter INT8| 22.7 | 0.0877** | N/A |
+| Wav2Letter INT8 pruned| 22.7 | 0.0783** | N/A |
+| Tiny Wav2Letter FP32| 15.6* | 0.0351 | 0.0714 |
+| Tiny Wav2Letter FP32 pruned| 15.6* | 0.0266 | 0.0577 |
+| Tiny Wav2Letter INT8| 3.81 | 0.0348 | 0.1123 |
+| Tiny Wav2Letter INT8 pruned| 3.81 | 0.0283 | 0.0886 |
+
+"*" - the size is according to the tflite model \
+** trained on different dataset
+
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/.idea/misc.xml b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/.idea/misc.xml
new file mode 100644
index 0000000..625040c
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/.idea/misc.xml
@@ -0,0 +1,7 @@
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/.idea/workspace.xml b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/.idea/workspace.xml
new file mode 100644
index 0000000..1dcd832
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/.idea/workspace.xml
@@ -0,0 +1,156 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ 1639400202242
+
+
+ 1639400202242
+
+
+
+
+
+
+
+
+
+
+ file://$PROJECT_DIR$/preprocessing.py
+ 148
+
+
+
+ file://$PROJECT_DIR$/train_model.py
+ 158
+
+
+
+ file://$PROJECT_DIR$/train_model.py
+ 99
+
+
+
+
+
+
\ No newline at end of file
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/README.md b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/README.md
new file mode 100644
index 0000000..90ff4be
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/README.md
@@ -0,0 +1,13 @@
+# Tiny Wav2letter FP32/INT8/INT8_Pruned Model Re-Creation
+This folder contains a script that allows for the model to be re-created from scratch.
+## Datasets
+Tiny Wav2Letter was trained on both the LibriSpeech dataset hosted on OpenSLR and the fluent-speech-corpus dataset hosted on Kaggle.
+Please note that the fluent-speech-corpus dataset hosted on [Kaggle](https://www.kaggle.com/tommyngx/fluent-speech-corpus) is a licensed dataset.
+## Requirements
+The script in this folder requires the following:
+- Python 3.6
+- A new directory named `fluent_speech_commands_dataset`
+- The fluent-speech-corpus dataset (a licensed dataset) downloaded from https://www.kaggle.com/tommyngx/fluent-speech-corpus and extracted into the `fluent_speech_commands_dataset` directory
+
+## Running The Script
+To run the script, run the following in a terminal: `./recreate_model.sh`
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/__pycache__/load_mfccs.cpython-36.pyc b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/__pycache__/load_mfccs.cpython-36.pyc
new file mode 100644
index 0000000..b74e040
Binary files /dev/null and b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/__pycache__/load_mfccs.cpython-36.pyc differ
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/__pycache__/tinywav2letter.cpython-36.pyc b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/__pycache__/tinywav2letter.cpython-36.pyc
new file mode 100644
index 0000000..c8ddf4a
Binary files /dev/null and b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/__pycache__/tinywav2letter.cpython-36.pyc differ
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/__pycache__/train_model.cpython-36.pyc b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/__pycache__/train_model.cpython-36.pyc
new file mode 100644
index 0000000..b760462
Binary files /dev/null and b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/__pycache__/train_model.cpython-36.pyc differ
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/corpus.py b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/corpus.py
new file mode 100644
index 0000000..b5bd5ce
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/corpus.py
@@ -0,0 +1,171 @@
+# Copyright (C) 2020 Arm Limited or its affiliates. All rights reserved.
+#
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the License); you may
+# not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an AS IS BASIS, WITHOUT
+# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+import tarfile
+import urllib.request
+
+
+class SpeechCorpusProvider:
+ """
+    Ensures the availability of the speech corpus, downloading it if necessary
+ """
+
+ DATA_SOURCE = {
+ "librispeech_full_size":
+ {"DATA_SETS":
+ {
+ ('train', 'train-clean-100'),
+ ('train', 'train-clean-360'),
+ ('val', 'dev-clean')
+ },
+ "BASE_URL": 'http://www.openslr.org/resources/12/'
+ },
+ "librispeech_reduced_size":
+ {"DATA_SETS":
+ {
+ ('train', 'train-clean-5'),
+ ('val', 'dev-clean-2')
+ },
+ "BASE_URL": 'http://www.openslr.org/resources/31/'
+ }
+ }
+
+ SET_FILE_EXTENSION = '.tar.gz'
+ TAR_ROOT = 'LibriSpeech/'
+
+ def __init__(self, data_directory):
+ """
+ Creates a new SpeechCorpusProvider with the root directory `data_directory`.
+ The speech corpus is downloaded and extracted into sub-directories.
+
+ Args:
+ data_directory: the root directory to use, e.g. data/
+ """
+
+ self._data_directory = data_directory
+ self._make_dir_if_not_exists(data_directory)
+ self.data_sets = SpeechCorpusProvider.DATA_SOURCE[data_directory]['DATA_SETS']
+ self.base_url = SpeechCorpusProvider.DATA_SOURCE[data_directory]['BASE_URL']
+
+ @staticmethod
+ def _make_dir_if_not_exists(directory):
+ """
+ Helper function to create a directory if it doesn't exist.
+
+ Args:
+ directory: directory to create
+ """
+
+ if not os.path.exists(directory):
+ os.makedirs(directory)
+
+ def _download_if_not_exists(self, remote_file_name):
+ """
+ Downloads the given `remote_file_name` if not yet stored in the `data_directory`
+
+ Args:
+ remote_file_name: the file to download
+
+ Returns: path to downloaded file
+ """
+
+ path = os.path.join(self._data_directory, remote_file_name)
+ if not os.path.exists(path):
+ print('Downloading {}...'.format(remote_file_name))
+ urllib.request.urlretrieve(self.base_url + remote_file_name, path)
+ return path
+
+ @staticmethod
+ def _extract_from_to(tar_file_name, source, target_directory):
+ """
+ Extract all necessary files `source` from `tar_file_name` into `target_directory`
+
+ Args:
+ tar_file_name: the tar file to extract from
+ source: the directory in the root to extract
+ target_directory: the directory to store the files in
+ """
+
+ print('Extracting {}...'.format(tar_file_name))
+ with tarfile.open(tar_file_name, 'r:gz') as tar:
+ source_members = [
+ tarinfo for tarinfo in tar.getmembers()
+ if tarinfo.name.startswith(SpeechCorpusProvider.TAR_ROOT + source)
+ ]
+ for member in source_members:
+ # Extract without prefix
+ member.name = member.name.replace(SpeechCorpusProvider.TAR_ROOT, '')
+ tar.extractall(target_directory, source_members)
+
+ def _is_ready(self):
+ """
+        Returns whether all configured datasets are downloaded and extracted
+
+        Returns: bool, is ready to use
+
+ """
+
+ data_set_paths = [os.path.join(set_type, set_name)
+ for set_type, set_name in self.data_sets]
+
+ return all([os.path.exists(os.path.join(self._data_directory, data_set))
+ for data_set in data_set_paths])
+
+ def _download(self):
+ """
+        Download all configured datasets that are not yet stored locally
+ """
+
+ for data_set_type, data_set_name in self.data_sets:
+ remote_file = data_set_name + SpeechCorpusProvider.SET_FILE_EXTENSION
+ self._download_if_not_exists(remote_file)
+
+ def _extract(self):
+ """
+ Extract all necessary files from the given `data_sets`
+ """
+
+ for data_set_type, data_set_name in self.data_sets:
+ local_file = os.path.join(self._data_directory, data_set_name + SpeechCorpusProvider.SET_FILE_EXTENSION)
+ target_directory = self._data_directory
+ self._extract_from_to(local_file, data_set_name, target_directory)
+
+ def ensure_availability(self):
+ """
+ Ensure that all datasets are downloaded and extracted. If this is not the case,
+        the download and extraction is initiated.
+ """
+
+ if not self._is_ready():
+ self._download()
+ self._extract()
+
+
+if __name__=="__main__":
+ full_size_corpus = SpeechCorpusProvider("librispeech_full_size")
+ full_size_corpus.ensure_availability()
+
+ reduced_size_corpus = SpeechCorpusProvider("librispeech_reduced_size")
+ reduced_size_corpus.ensure_availability()
+
+
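
As a small usage sketch (assuming `corpus.py` above is importable from the working directory), the archive URLs the provider will fetch can be listed directly from `DATA_SOURCE`:

```python
from corpus import SpeechCorpusProvider

# Sketch: list the LibriSpeech archives fetched for the full-size corpus.
# Note that the constructor creates the data directory if it does not exist.
provider = SpeechCorpusProvider("librispeech_full_size")
for set_type, set_name in provider.data_sets:
    url = provider.base_url + set_name + SpeechCorpusProvider.SET_FILE_EXTENSION
    print(set_type, url)
# e.g. train http://www.openslr.org/resources/12/train-clean-100.tar.gz
```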
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/evaluate_saved_weights.py b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/evaluate_saved_weights.py
new file mode 100644
index 0000000..399a914
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/evaluate_saved_weights.py
@@ -0,0 +1,52 @@
+# Copyright (C) 2020 Arm Limited or its affiliates. All rights reserved.
+#
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the License); you may
+# not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an AS IS BASIS, WITHOUT
+# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import argparse
+
+import tensorflow as tf
+
+from tinywav2letter import get_metrics, create_tinywav2letter
+from train_model import get_data
+
+def evaluate_saved_weights(args, pruned = False):
+
+ model = create_tinywav2letter(batch_size = args.batch_size)
+
+ model.load_weights('weights/tiny_wav2letter' + pruned * "_pruned" + '_weights.h5')
+
+ opt = tf.keras.optimizers.Adam()
+ model.compile(loss=get_metrics("loss"), metrics=[get_metrics("ler")], optimizer=opt)
+
+ (reduced_validation_data, reduced_validation_num_steps) = get_data(args, "val_reduced_size", args.batch_size)
+
+ model.evaluate(reduced_validation_data)
+
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser(allow_abbrev=False)
+
+ parser.add_argument(
+ "--batch_size",
+ dest="batch_size",
+ type=int,
+ required=False,
+ default=32,
+ help="batch size wanted when creating model",
+ )
+
+ args = parser.parse_args()
+ evaluate_saved_weights(args)
+
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/load_mfccs.py b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/load_mfccs.py
new file mode 100644
index 0000000..d36bf6b
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/load_mfccs.py
@@ -0,0 +1,177 @@
+# Copyright (C) 2020 Arm Limited or its affiliates. All rights reserved.
+#
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the License); you may
+# not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an AS IS BASIS, WITHOUT
+# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import tensorflow as tf
+import os
+import numpy as np
+
+class MFCC_Loader:
+ def __init__(self, full_size_data_dir:str, reduced_size_data_dir:str, fluent_speech_data_dir:str):
+ """
+        Args:
+            full_size_data_dir: path to the full-size LibriSpeech data folder
+            reduced_size_data_dir: path to the Mini LibriSpeech data folder
+            fluent_speech_data_dir: path to the Fluent Speech Commands data folder
+ """
+ self.full_size_data_dir = full_size_data_dir
+ self.reduced_size_data_dir = reduced_size_data_dir
+ self.fluent_speech_data_dir = fluent_speech_data_dir
+ self.seed = 0
+ self.train = False
+ self.batch_size = 32
+ self.num_samples = 0
+ self.input_files = []
+
+
+ @staticmethod
+ def _extract_features(example_proto):
+ feature_description = {
+ 'mfcc_bytes': tf.io.FixedLenFeature([], tf.string),
+ 'sequence_bytes': tf.io.FixedLenFeature([], tf.string),
+ }
+ # Parse the input tf.train.Example proto using the dictionary above.
+ serialized_tensor = tf.io.parse_single_example(example_proto, feature_description)
+
+ mfcc_features = tf.io.parse_tensor(serialized_tensor['mfcc_bytes'], out_type = tf.float32)
+ sequences = tf.io.parse_tensor(serialized_tensor['sequence_bytes'], out_type = tf.int32)
+
+ return mfcc_features, sequences
+
+ def full_training_set(self, batch_size=32, num_samples = -1):
+ """
+ Args:
+ batch_size: batch size required for the set
+ """
+ self.tfrecord_file = [
+ os.path.join(self.full_size_data_dir, 'preprocessed/train-clean-100/train-clean-100.tfrecord'),
+ os.path.join(self.full_size_data_dir, 'preprocessed/train-clean-360/train-clean-360.tfrecord')]
+
+ self.train = True
+ self.batch_size = batch_size
+ self.num_samples = 132553
+ return self.load_dataset(num_samples)
+
+ def reduced_training_set(self, batch_size=32, num_samples = -1):
+ """
+ Args:
+ batch_size: batch size required for the set
+ """
+ self.tfrecord_file = os.path.join(self.reduced_size_data_dir, 'preprocessed/train-clean-5/train-clean-5.tfrecord')
+ self.train = True
+ self.batch_size = batch_size
+ self.num_samples = 1519
+ return self.load_dataset(num_samples)
+
+ def full_validation_set(self, batch_size=32):
+ """
+ Args:
+ batch_size: batch size required for the set
+ """
+ self.tfrecord_file = os.path.join(self.full_size_data_dir, 'preprocessed/dev-clean/dev-clean.tfrecord')
+ self.train = False
+ self.batch_size = batch_size
+ self.num_samples = 2703
+ return self.load_dataset()
+
+ def reduced_validation_set(self, batch_size=32):
+ """
+ Args:
+ batch_size: batch size required for the set
+ """
+ self.tfrecord_file = os.path.join(self.reduced_size_data_dir, 'preprocessed/dev-clean-2/dev-clean-2.tfrecord')
+ self.train = False
+ self.batch_size = batch_size
+ self.num_samples = 1089
+ return self.load_dataset()
+
+
+ def evaluation_set(self, batch_size=32):
+ """
+ Args:
+ batch_size: batch size required for the set
+ """
+
+ self.tfrecord_file = os.path.join(self.full_size_data_dir, 'preprocessed/test-clean/test-clean.tfrecord')
+ self.train = False
+ self.batch_size = batch_size
+ self.num_samples = 2620
+ return self.load_dataset()
+
+ def fluent_speech_train_set(self, batch_size=32, num_samples = -1):
+ """
+ Args:
+ batch_size: batch size required for the set
+ """
+ self.tfrecord_file = os.path.join(self.fluent_speech_data_dir, 'preprocessed/train/train.tfrecord')
+
+ self.train = True
+ self.batch_size = batch_size
+ self.num_samples = 23132
+ return self.load_dataset(num_samples)
+
+ def fluent_speech_validation_set(self, batch_size=32):
+ """
+ Args:
+ batch_size: batch size required for the set
+ """
+ self.tfrecord_file = os.path.join(self.fluent_speech_data_dir, 'preprocessed/dev/dev.tfrecord')
+ self.train = False
+ self.batch_size = batch_size
+ self.num_samples = 3118
+ return self.load_dataset()
+
+ def fluent_speech_test_set(self, batch_size=32):
+ """
+ Args:
+ batch_size: batch size required for the set
+ """
+ self.tfrecord_file = os.path.join(self.fluent_speech_data_dir, 'preprocessed/test/test.tfrecord')
+ self.train = False
+ self.batch_size = batch_size
+ self.num_samples = 3793
+ return self.load_dataset()
+
+ def num_steps(self, batch):
+ """
+ Get the number of steps based on the given batch size and the number
+ of samples.
+ """
+ return int(np.math.ceil(self.num_samples / batch))
+
+
+ def load_dataset(self, num_samples = -1):
+
+ # load the specified TF Record files
+ dataset = tf.data.TFRecordDataset(self.tfrecord_file)
+
+ # parse the data, and take the desired number of samples
+ dataset = dataset.map(self._extract_features, num_parallel_calls = tf.data.AUTOTUNE).take(num_samples)
+
+ dataset = dataset.cache()
+
+ # shuffle the training set
+ if self.train:
+ dataset = dataset.shuffle(buffer_size=max(self.batch_size * 2, 1024), seed=self.seed)
+
+ MFCC_coeffs = 39
+ blank_index = 28
+
+
+ # Pad the dataset so that all the data is the same size
+ dataset = dataset.padded_batch(
+ self.batch_size,
+ padded_shapes=(tf.TensorShape([None, MFCC_coeffs]), tf.TensorShape([None])),
+ padding_values=(0.0, blank_index), drop_remainder=True
+ )
+ return dataset.prefetch(tf.data.experimental.AUTOTUNE)
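
A minimal usage sketch for `MFCC_Loader` (assuming the preprocessed TFRecords created by `preprocessing.py` already exist under the three dataset directories):

```python
from load_mfccs import MFCC_Loader

# Sketch: build the Mini LibriSpeech training pipeline and inspect one batch.
loader = MFCC_Loader("librispeech_full_size",
                     "librispeech_reduced_size",
                     "fluent_speech_commands_dataset")

train_set = loader.reduced_training_set(batch_size=32)  # tf.data.Dataset
print(loader.num_steps(batch=32))                       # steps per epoch

for mfccs, sequences in train_set.take(1):
    # mfccs:     (32, padded_time, 39) float32 MFCC features
    # sequences: (32, padded_len) int32, [seq_length, label ids..., 28 padding]
    print(mfccs.shape, sequences.shape)
```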
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/preprocessing.py b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/preprocessing.py
new file mode 100644
index 0000000..0eddb4c
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/preprocessing.py
@@ -0,0 +1,257 @@
+# Copyright (C) 2020 Arm Limited or its affiliates. All rights reserved.
+#
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the License); you may
+# not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an AS IS BASIS, WITHOUT
+# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import fnmatch
+import os
+import librosa
+
+import numpy as np
+import tensorflow as tf
+from corpus import SpeechCorpusProvider
+from preprocessing_fluent_speech_commands import preprocess_fluent_sppech
+from preprocessing_convert_to_flac import convert_to_flac
+def _bytes_feature(value):
+ """Returns a bytes_list from a string / byte."""
+ # If the value is an eager tensor BytesList won't unpack a string from an EagerTensor.
+ if isinstance(value, type(tf.constant(0))):
+ value = value.numpy()
+ return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
+
+def _int64_feature(value):
+ """Returns an int64_list from a bool / enum / int / uint."""
+ return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))
+
+def normalize(values):
+ """
+ Normalize values to mean 0 and std 1
+ """
+ return (values - np.mean(values)) / np.std(values)
+
+
+def iglob_recursive(directory, file_pattern):
+ """
+ Recursively search for `file_pattern` in `directory`
+
+ Args:
+ directory: the directory to search in
+ file_pattern: the file pattern to match (wildcard compatible)
+
+ Returns:
+ iterator for found files
+
+ """
+ for root, dir_names, file_names in os.walk(directory):
+ for filename in fnmatch.filter(file_names, file_pattern):
+ yield os.path.join(root, filename)
+
+
+class SpeechCorpusReader:
+ """
+ Reads preprocessed speech corpus to be used by the NN
+ """
+ def __init__(self, data_directory):
+ """
+ Create SpeechCorpusReader and read samples from `data_directory`
+
+ Args:
+ data_directory: the directory to use
+ """
+ self._data_directory = data_directory
+ self._transcript_dict = self._build_transcript()
+
+ @staticmethod
+ def _get_transcript_entries(transcript_directory):
+ """
+ Iterate over all transcript lines and yield splitted entries
+
+ Args:
+ transcript_directory: open all transcript files in this directory and extract their contents
+
+ Returns: Iterator for all entries in the form (id, sentence)
+
+ """
+ transcript_files = iglob_recursive(transcript_directory, '*.trans.txt')
+ for transcript_file in transcript_files:
+ with open(transcript_file, 'r') as f:
+ for line in f:
+ # Strip included new line symbol
+ line = line.rstrip('\n')
+
+ # Each line is in the form
+ # 00-000000-0000 WORD1 WORD2 ...
+ splitted = line.split(' ', 1)
+ yield splitted
+
+ def _build_transcript(self):
+ """
+ Builds a transcript from transcript files, mapping from audio-id to a list of vocabulary ids
+
+ Returns: the created transcript
+ """
+ alphabet = "abcdefghijklmnopqrstuvwxyz' @"
+ alphabet_dict = {c: ind for (ind, c) in enumerate(alphabet)}
+
+ # Create the transcript dictionary
+ transcript_dict = dict()
+ for splitted in self._get_transcript_entries(self._data_directory):
+ transcript_dict[splitted[0]] = [alphabet_dict[letter] for letter in splitted[1].lower()]
+
+ return transcript_dict
+
+ @classmethod
+ def _extract_audio_id(cls, audio_file):
+ file_name = os.path.basename(audio_file)
+ audio_id = os.path.splitext(file_name)[0]
+
+ return audio_id
+
+ @staticmethod
+ def transform_audio_to_mfcc(audio_file, transcript, n_mfcc=13, n_fft=512, hop_length=160):
+ """
+ Calculate mfcc coefficients from the given raw audio data
+
+ Args:
+ audio_file: .flac audio file
+ n_mfcc: the number of coefficients to generate
+ n_fft: the window size of the fft
+ hop_length: the hop length for the window
+
+ Returns:
+ the mfcc coefficients in the form [time, coefficients]
+ sequences: the transcript of the audio file
+
+ """
+
+ audio_data, sample_rate = librosa.load(audio_file, sr=16000)
+
+ mfcc = librosa.feature.mfcc(audio_data, sr=sample_rate, n_mfcc=n_mfcc, n_fft=n_fft, hop_length=hop_length)
+
+ # add derivatives and normalize
+ mfcc_delta = librosa.feature.delta(mfcc)
+ mfcc_delta2 = librosa.feature.delta(mfcc, order=2)
+        mfcc = np.concatenate((normalize(mfcc), normalize(mfcc_delta), normalize(mfcc_delta2)), axis=0)  # mfcc is now 13+13+13=39 coefficients (matching the model input shape)
+
+ seq_length = mfcc.shape[1] // 2
+
+ sequences = np.concatenate([[seq_length], transcript]).astype(np.int32)
+ mfcc_out = mfcc.T.astype(np.float32)
+
+ return mfcc_out, sequences
+
+ @staticmethod
+ def _create_feature(mfcc_bytes, sequence_bytes):
+ """
+ Creates a tf.train.Example message ready to be written to a file.
+ """
+ # Create a dictionary mapping the feature name to the tf.train.Example-compatible
+ # data type.
+
+ feature = {
+ 'mfcc_bytes': _bytes_feature(mfcc_bytes),
+ 'sequence_bytes': _bytes_feature(sequence_bytes),
+ }
+
+ # Create a Features message using tf.train.Example.
+ return tf.train.Example(features=tf.train.Features(feature=feature))
+
+
+ def _get_directory(self, sub_directory):
+ preprocess_directory = 'preprocessed'
+
+ directory = self._data_directory + '/' + preprocess_directory + '/' + sub_directory
+
+ return directory
+
+
+ def process_data(self, directory):
+ """
+ Read audio files from `directory` and store the preprocessed version in preprocessed/`directory`
+
+ Args:
+ directory: the sub-directory to read from
+
+ """
+ # create a list of all the .flac files
+ audio_files = list(iglob_recursive(self._data_directory + '/' + directory , '*.flac'))
+
+ out_directory = self._get_directory(directory)
+
+ if not os.path.exists(out_directory):
+ os.makedirs(out_directory)
+
+ # the file the TFRecord will be written to
+ filename = out_directory + f'/{directory}.tfrecord'
+ with tf.io.TFRecordWriter(filename) as writer:
+ for audio_file in audio_files:
+                if os.path.getsize(audio_file) >= 13885:  # very small files are not supported
+ audio_id = self._extract_audio_id(audio_file)
+
+ # identify the transcript corresponding to the audio file
+ transcript = self._transcript_dict[audio_id]
+
+ # convert the audio to MFCCs
+ mfcc_feature, sequences = self.transform_audio_to_mfcc(audio_file, transcript)
+
+ # Serialise the MFCCs and transcript
+ mfcc_bytes = tf.io.serialize_tensor(mfcc_feature)
+ sequence_bytes = tf.io.serialize_tensor(sequences)
+
+ # Write to the TFRecord file
+ writer.write(self._create_feature(mfcc_bytes, sequence_bytes).SerializeToString())
+
+ else:
+ print('Processed data already exists')
+
+
+class Preprocessing:
+
+ def __init__(self, data_dir):
+ self.data_dir = data_dir
+
+ def run(self):
+ # Download the raw data
+ corpus = SpeechCorpusProvider(self.data_dir)
+ corpus.ensure_availability()
+
+ corpus_reader = SpeechCorpusReader(self.data_dir)
+
+ # initialise the datasets
+ data_sets = [data_set[1] for data_set in corpus.data_sets]
+
+ for data_set in data_sets:
+ print(f"Preprocessing {data_set} data")
+ corpus_reader.process_data(data_set)
+
+        print('Preprocessing Complete')
+
+    def run_without_download(self):
+        corpus_reader = SpeechCorpusReader(self.data_dir)
+        corpus_reader.process_data('dev')
+        corpus_reader.process_data('train')
+        corpus_reader.process_data('test')
+
+
+if __name__=="__main__":
+ reduced_preprocessing = Preprocessing('librispeech_reduced_size')
+ reduced_preprocessing.run()
+
+ full_preprocessing = Preprocessing('librispeech_full_size')
+ full_preprocessing.run()
+
+ preprocess_fluent_sppech()
+ convert_to_flac()
+    fluent_speech_preprocessing = Preprocessing('fluent_speech_commands_dataset')  # please note this is a licensed dataset
+ fluent_speech_preprocessing.run_without_download()
+
+
+
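
To make the feature arithmetic in `transform_audio_to_mfcc` concrete, a small sketch follows (one second of silence is used here as an assumed stand-in for any 16 kHz mono signal): 13 MFCCs plus their deltas and delta-deltas give 39 coefficients per frame, one frame every 10 ms, and the `seq_length` label prefix is the frame count halved because the model's first convolution has stride 2. The positional call style matches librosa 0.8.1 as pinned in requirements.txt.

```python
import numpy as np
import librosa

audio = np.zeros(16000, dtype=np.float32)  # 1 s of (silent) audio at 16 kHz
mfcc = librosa.feature.mfcc(audio, sr=16000, n_mfcc=13, n_fft=512, hop_length=160)
print(mfcc.shape)            # (13, 101): one frame every 10 ms

delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)
features = np.concatenate((mfcc, delta, delta2), axis=0)
print(features.T.shape)      # (101, 39): [time, coefficients], as stored in the TFRecord

seq_length = features.shape[1] // 2   # 50: accounts for the stride-2 first layer
```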
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/preprocessing_convert_to_flac.py b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/preprocessing_convert_to_flac.py
new file mode 100644
index 0000000..9327164
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/preprocessing_convert_to_flac.py
@@ -0,0 +1,100 @@
+# Copyright (C) 2020 Arm Limited or its affiliates. All rights reserved.
+#
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the License); you may
+# not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an AS IS BASIS, WITHOUT
+# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from queue import Queue
+import logging
+import os
+from threading import Thread
+import audiotools
+from audiotools.wav import InvalidWave
+
+class W2F:
+
+ logger = ''
+
+ def __init__(self):
+ global logger
+ # create logger
+ logger = logging.getLogger(__name__)
+ logger.setLevel(logging.DEBUG)
+
+ # create a file handler
+ handler = logging.FileHandler('converter.log')
+ handler.setLevel(logging.INFO)
+
+ # create a logging format
+ formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
+ handler.setFormatter(formatter)
+
+ # add the handlers to the logger
+ logger.addHandler(handler)
+
+ def convert(self,path):
+ global logger
+ file_queue = Queue()
+ num_converter_threads = 5
+
+ # collect files to be converted
+ for root, dirs, files in os.walk(path):
+
+ for file in files:
+ if file.endswith(".wav"):
+ file_wav = os.path.join(root, file)
+ file_flac = file_wav.replace(".wav", ".flac")
+
+ if (os.path.exists(file_flac)):
+ logger.debug(''.join(["File ",file_flac, " already exists."]))
+ else:
+ file_queue.put(file_wav)
+
+ logger.info("Start converting: %s files", str(file_queue.qsize()))
+
+ # Set up some threads to convert files
+ for i in range(num_converter_threads):
+ worker = Thread(target=self.process, args=(file_queue,))
+ worker.setDaemon(True)
+ worker.start()
+
+ file_queue.join()
+
+ def process(self, q):
+ """This is the worker thread function.
+ It processes files in the queue one after
+ another. These daemon threads go into an
+ infinite loop, and only exit when
+ the main thread ends.
+ """
+ while True:
+ global logger
+ compression_quality = '0' #min compression
+ file_wav = q.get()
+ file_flac = file_wav.replace(".wav", ".flac")
+
+ try:
+ audiotools.open(file_wav).convert(file_flac,audiotools.FlacAudio, compression_quality)
+ logger.info(''.join(["Converted ", file_wav, " to: ", file_flac]))
+ q.task_done()
+            except InvalidWave:
+                logger.error(''.join(["Failed to convert ", file_wav, " to ", file_flac]), exc_info=True)
+                # mark the item as done even on failure so that file_queue.join() does not hang
+                q.task_done()
+
+def convert_to_flac():
+ reduced_preprocessing = W2F()
+ reduced_preprocessing.convert("fluent_speech_commands_dataset/train/")
+ reduced_preprocessing.convert("fluent_speech_commands_dataset/dev/")
+ reduced_preprocessing.convert("fluent_speech_commands_dataset/test/")
+ print('')
+
+
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/preprocessing_fluent_speech_commands.py b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/preprocessing_fluent_speech_commands.py
new file mode 100644
index 0000000..831fb06
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/preprocessing_fluent_speech_commands.py
@@ -0,0 +1,50 @@
+# Copyright (C) 2020 Arm Limited or its affiliates. All rights reserved.
+#
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the License); you may
+# not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an AS IS BASIS, WITHOUT
+# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from csv import reader
+import os
+from shutil import copyfile
+from pathlib import Path
+import re
+
+
+def preprocess(flag):
+ if flag == 'train':
+ path_csv = 'fluent_speech_commands_dataset/data/train_data.csv'
+ path_dir = 'fluent_speech_commands_dataset/train/'
+ elif flag == 'dev':
+ path_csv = 'fluent_speech_commands_dataset/data/valid_data.csv'
+ path_dir = 'fluent_speech_commands_dataset/dev/'
+ else:
+ path_csv = 'fluent_speech_commands_dataset/data/test_data.csv'
+ path_dir = 'fluent_speech_commands_dataset/test/'
+ with open(path_csv, 'r') as read_obj:
+ csv_reader = reader(read_obj)
+ if not os.path.exists(path_dir):
+ os.makedirs(path_dir)
+ with open(path_dir + flag + '.trans.txt', 'w') as write_obj:
+ for row in csv_reader:
+ print(row)
+ if(row[1] == 'path'):
+ continue
+ head, file_name = os.path.split(row[1])
+ copyfile('fluent_speech_commands_dataset/' + row[1],path_dir + file_name)
+ text = row[3]
+ text = text.upper()
+ text = re.sub('[^a-zA-Z \']+', '', text) #remove all other chars
+                write_obj.write(Path(file_name).stem + " " + text + '\n')
+
+
+def preprocess_fluent_sppech():
+ preprocess('train')
+ preprocess('dev')
+ preprocess('test')
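
For illustration, a hypothetical CSV row and the LibriSpeech-style transcript line that `preprocess()` above would write for it (the row and its column layout are assumptions matching the indices used in the function, not real dataset contents):

```python
import os
import re
from pathlib import Path

# Hypothetical CSV row (column layout as assumed by preprocess() above):
# [index, relative wav path, ..., transcription text]
row = ["0", "wavs/speakers/abc/xyz123.wav", "abc", "Turn on the lights"]

head, file_name = os.path.split(row[1])
text = re.sub("[^a-zA-Z ']+", "", row[3].upper())
print(Path(file_name).stem + " " + text)   # -> "xyz123 TURN ON THE LIGHTS"
```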
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/prune_and_quantise_model.py b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/prune_and_quantise_model.py
new file mode 100755
index 0000000..742436b
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/prune_and_quantise_model.py
@@ -0,0 +1,386 @@
+# Copyright (C) 2020 Arm Limited or its affiliates. All rights reserved.
+#
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the License); you may
+# not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an AS IS BASIS, WITHOUT
+# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import argparse
+import multiprocessing
+import os
+import pathlib
+
+import tensorflow as tf
+import tensorflow_model_optimization as tfmot
+from tqdm import tqdm
+import numpy as np
+
+from tinywav2letter import get_metrics
+from train_model import log, get_lr_schedule, get_data
+
+options = tf.data.Options()
+options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA
+
+gpus = tf.config.list_logical_devices('GPU')
+strategy = tf.distribute.MirroredStrategy(gpus)
+
+def load_model():
+ """
+ Returns the model saved at the end of training
+ """
+
+ # load the saved model
+ with strategy.scope():
+ model = tf.keras.models.load_model(f'saved_models/tiny_wav2letter',
+ custom_objects={'ctc_loss': get_metrics("loss"), 'ctc_ler':get_metrics("ler")}
+ )
+
+ return model
+
+def evaluate_model(args, model):
+ """
+ Evaluates an unquantised model
+
+ Args:
+ args: The command line arguments
+ model: The model to evaluate
+ """
+
+ # Get the data to evaluate the model on
+ (fluent_speech_test_data, fluent_speech_test_num_steps) = get_data(args, "test_fluent_speech", args.batch_size)
+
+ # Compile and evaluate the model - LER
+ with strategy.scope():
+ opt = tf.keras.optimizers.Adam(learning_rate=get_lr_schedule(learning_rate = 1e-6, steps_per_epoch=fluent_speech_test_num_steps))
+ model.compile(loss=get_metrics("loss"), metrics=[get_metrics("ler")], optimizer=opt, run_eagerly=True)
+ log(f'Evaluating TinyWav2Letter - LER')
+ model.evaluate(fluent_speech_test_data)
+
+ # Get the data to evaluate the model on
+ (fluent_speech_test_data, fluent_speech_test_num_steps) = get_data(args, "test_fluent_speech", batch_size=1) # #based on batch=1
+
+ # Compile and evaluate the model - WER
+ with strategy.scope():
+ opt = tf.keras.optimizers.Adam(learning_rate=get_lr_schedule(learning_rate = 1e-6, steps_per_epoch=fluent_speech_test_num_steps))
+ model.compile(loss=get_metrics("loss"), metrics=[get_metrics("wer")], optimizer=opt,run_eagerly=True)
+ log(f'Evaluating TinyWav2Letter - WER')
+ model.evaluate(fluent_speech_test_data)
+
+def prune_model(args, model):
+ """Performs pruning, fine-tuning and returns stripped pruned model"""
+
+ # Get all the training and validation data
+ (full_training_data, full_training_num_steps) = get_data(args, "train_full_size", args.batch_size)
+ (full_validation_data, full_validation_num_steps) = get_data(args, "val_full_size", args.batch_size)
+ (fluent_speech_training_data, fluent_speech_training_num_steps) = get_data(args, "train_fluent_speech",args.batch_size)
+ (fluent_speech_validation_data, fluent_speech_validation_num_steps) = get_data(args, "val_fluent_speech",args.batch_size)
+ (fluent_speech_test_data, fluent_speech_test_num_steps) = get_data(args, "test_fluent_speech", args.batch_size)
+
+ log("Pruning model to {} sparsity".format(args.sparsity))
+
+ log_dir = f"logs/pruned_model"
+
+ # Set up the callbacks
+ pruning_callbacks = [
+ tfmot.sparsity.keras.UpdatePruningStep(),
+ tfmot.sparsity.keras.PruningSummaries(log_dir=log_dir),
+ ]
+
+ # Perform the pruning - full_training_data
+ pruning_params = {
+ "pruning_schedule": tfmot.sparsity.keras.ConstantSparsity(
+ args.sparsity, begin_step=0, end_step=int(full_training_num_steps * 0.7), frequency=10
+ )
+ }
+
+
+ with strategy.scope():
+ pruned_model = tfmot.sparsity.keras.prune_low_magnitude(model, **pruning_params)
+ opt = tf.keras.optimizers.Adam(learning_rate=get_lr_schedule(learning_rate = 1e-4, steps_per_epoch=full_training_num_steps))
+ pruned_model.compile(loss=get_metrics("loss"), metrics=[get_metrics("ler")], optimizer=opt)
+
+ pruned_model.fit(
+ full_training_data,
+ epochs=5,
+ verbose=1,
+ callbacks=pruning_callbacks,
+ validation_data=full_validation_data,
+ )
+
+ # Perform the pruning - fluent_speech_training_data
+ pruning_params = {
+ "pruning_schedule": tfmot.sparsity.keras.ConstantSparsity(
+ args.sparsity, begin_step=0, end_step=int(fluent_speech_validation_num_steps * 0.7), frequency=10
+ )
+ }
+
+
+ with strategy.scope():
+ pruned_model = tfmot.sparsity.keras.prune_low_magnitude(model, **pruning_params)
+ opt = tf.keras.optimizers.Adam(learning_rate=get_lr_schedule(learning_rate = 1e-4, steps_per_epoch=fluent_speech_validation_num_steps))
+ pruned_model.compile(loss=get_metrics("loss"), metrics=[get_metrics("ler")], optimizer=opt)
+
+ pruned_model.fit(
+ fluent_speech_training_data,
+ epochs=5,
+ verbose=1,
+ callbacks=pruning_callbacks,
+ validation_data=fluent_speech_validation_data,
+ )
+
+ stripped_model = tfmot.sparsity.keras.strip_pruning(pruned_model)
+
+ return stripped_model
+
+def prepare_model_for_inference(model, input_window_length = 296):
+ "Takes the a model and returns a model with fixed input size"
+ MFCC_coeffs = 39
+
+ # Define the input
+ layer_input = tf.keras.layers.Input((input_window_length, MFCC_coeffs), batch_size=1)
+ static_shaped_model = tf.keras.models.Model(
+ inputs=[layer_input], outputs=[model.call(layer_input)]
+ )
+ return static_shaped_model
+
+def tflite_conversion(model, tflite_path, conversion_type="fp32"):
+ """Performs tflite conversion (fp32, int8)."""
+ # Prepare model for inference
+ model = prepare_model_for_inference(model)
+ converter = tf.lite.TFLiteConverter.from_keras_model(model)
+
+ # define a dataset to calibrate the conversion to INT8
+ def representative_dataset_gen(input_dim):
+ calib_data = []
+ for data in tqdm(fluent_speech_test_data.take(1000), desc="model calibration"):
+ input_data = data[0]
+ for i in range(input_data.shape[1] // input_dim):
+ input_chunks = [
+ input_data[:, i * input_dim: (i + 1) * input_dim, :, ]
+ ]
+ for chunk in input_chunks:
+ calib_data.append([chunk])
+
+ return lambda: [
+ (yield data) for data in tqdm(calib_data, desc="model calibration")
+ ]
+
+ if conversion_type == "int8":
+ log("Quantizing Model")
+ (fluent_speech_test_data, fluent_speech_test_num_steps) = get_data(args, "test_fluent_speech", args.batch_size)
+ converter.optimizations = [tf.lite.Optimize.DEFAULT]
+ converter.inference_input_type = tf.int8
+ converter.inference_output_type = tf.int8
+ converter.representative_dataset = representative_dataset_gen(model.input_shape[1])
+
+ tflite_model = converter.convert()
+ open(tflite_path, "wb").write(tflite_model)
+
+def evaluate_tflite(tflite_path, input_window_length = 296):
+ """Evaluates tflite (fp32, int8)."""
+ results_ler = []
+ results_wer = []
+ (fluent_speech_test_data, fluent_speech_test_num_steps) = get_data(args, "test_fluent_speech", batch_size=1)
+ log("Setting number of used threads to {}".format(multiprocessing.cpu_count()))
+ interpreter = tf.lite.Interpreter(
+ model_path=tflite_path, num_threads=multiprocessing.cpu_count()
+ )
+ interpreter.allocate_tensors()
+ input_chunk = interpreter.get_input_details()[0]
+ output_details = interpreter.get_output_details()[0]
+
+ input_shape = input_chunk["shape"]
+ log("eval_model() - input_shape: {}".format(input_shape))
+ input_dtype = input_chunk["dtype"]
+ output_dtype = output_details["dtype"]
+
+ # Check if the input/output type is quantized,
+ # set scale and zero-point accordingly
+ if input_dtype != tf.float32:
+ input_scale, input_zero_point = input_chunk["quantization"]
+ else:
+ input_scale, input_zero_point = 1, 0
+
+ if output_dtype != tf.float32:
+ output_scale, output_zero_point = output_details["quantization"]
+ else:
+ output_scale, output_zero_point = 1, 0
+
+ log("Running {} iterations".format(fluent_speech_test_num_steps))
+ for i_iter, (data, label) in enumerate(
+ tqdm(fluent_speech_test_data, total=fluent_speech_test_num_steps)
+ ):
+ data = data / input_scale + input_zero_point
+ # Round the data up if dtype is int8, uint8 or int16
+ if input_dtype is not np.float32:
+ data = np.round(data)
+
+ while data.shape[1] < input_window_length:
+ data = np.append(data, data[:, -2:-1, :], axis=1)
+ # Zero-pad any odd-length inputs
+ if data.shape[1] % 2 == 1:
+ log('Input length is odd, zero-padding to even (first layer has stride 2)')
+ data = np.concatenate([data, np.zeros((1, 1, data.shape[2]), dtype=input_dtype)], axis=1)
+
+ context = 24 + 2 * (7 * 3 + 16) # = 98 - theoretical max receptive field on each side
+ size = input_chunk['shape'][1]
+ inner = size - 2 * context
+ data_end = data.shape[1]
+
+ # Initialize variables for the sliding window loop
+ data_pos = 0
+ outputs = []
+
+ while data_pos < data_end:
+ if data_pos == 0:
+ # Align inputs from the first window to the start of the data and include the intial context in the output
+ start = data_pos
+ end = start + size
+ y_start = 0
+ y_end = y_start + (size - context) // 2
+ data_pos = end - context
+ elif data_pos + inner + context >= data_end:
+ # Shift left to align final window to the end of the data and include the final context in the output
+ shift = (data_pos + inner + context) - data_end
+ start = data_pos - context - shift
+ end = start + size
+ assert start >= 0
+ y_start = (shift + context) // 2 # Will be even because we assert it above
+ y_end = size // 2
+ data_pos = data_end
+ else:
+ # Capture only the inner region from mid-input inferences, excluding output from both context regions
+ start = data_pos - context
+ end = start + size
+ y_start = context // 2
+ y_end = y_start + inner // 2
+ data_pos = end - context
+
+ interpreter.set_tensor(
+ input_chunk["index"], tf.cast(data[:, start:end, :], input_dtype))
+ interpreter.invoke()
+ cur_output_data = interpreter.get_tensor(output_details["index"])[:, :, y_start:y_end, :]
+ cur_output_data = output_scale * (
+ cur_output_data.astype(np.float32) - output_zero_point
+ )
+ outputs.append(cur_output_data)
+
+ complete = np.concatenate(outputs, axis=2)
+ results_ler.append(get_metrics("ler")(label, complete))
+ results_wer.append(get_metrics("wer")(label, complete))
+
+ log("ler: {}".format(np.mean(results_ler)))
+ log("wer: {}".format(np.mean(results_wer))) #based on batch=1
+
+def main(args):
+
+ model = load_model()
+ evaluate_model(args, model)
+
+ if args.prune:
+ model = prune_model(args, model)
+ model.save(f"saved_models/pruned_tiny_wav2letter")
+ evaluate_model(args, model)
+
+ output_directory = pathlib.Path(os.path.dirname(os.path.abspath(__file__)))
+ output_directory = os.path.join(output_directory, "tiny_wav2letter/tflite_models")
+ wav2letter_tflite_path = os.path.join(output_directory, args.prune * "pruned_" + f"tiny_wav2letter_int8.tflite")
+
+ if not os.path.exists(os.path.dirname(wav2letter_tflite_path)):
+ try:
+ os.makedirs(os.path.dirname(wav2letter_tflite_path))
+ except OSError as exc:
+ raise
+
+ # Convert the saved model to TFLite and then evaluate
+ tflite_conversion(model, wav2letter_tflite_path, "int8")
+ evaluate_tflite(wav2letter_tflite_path)
+
+
+if __name__ == "__main__":
+
+ parser = argparse.ArgumentParser(allow_abbrev=False)
+ parser.add_argument(
+ "--training_set_size",
+ dest="training_set_size",
+ type=int,
+ required=False,
+ default = 132553,
+ help="The number of samples in the training set"
+ )
+ parser.add_argument(
+ "--fluent_speech_training_set_size",
+ dest="fluent_speech_set_size",
+ type=int,
+ required=False,
+ default = 23132,
+ help="The number of samples in the fluent speech training dataset"
+ )
+ parser.add_argument(
+ "--batch_size",
+ dest="batch_size",
+ type=int,
+ required=False,
+ default=32,
+ help="batch size wanted when creating model",
+ )
+
+ parser.add_argument(
+ "--finetuning_epochs",
+ dest="finetuning_epochs",
+ type=int,
+ required=False,
+ default=10,
+ help="Number of epochs for fine-tuning",
+ )
+ parser.add_argument(
+ "--full_data_dir",
+ dest="full_data_dir",
+ type=str,
+ required=False,
+ default='librispeech_full_size',
+ help="Path to dataset directory",
+ )
+ parser.add_argument(
+ "--reduced_data_dir",
+ dest="reduced_data_dir",
+ type=str,
+ required=False,
+ default='librispeech_reduced_size',
+ help="Path to dataset directory",
+ )
+ parser.add_argument(
+ "--fluent_speech_data_dir",
+ dest="fluent_speech_data_dir",
+ type=str,
+ required=False,
+ default='fluent_speech_commands_dataset',
+ help="Path to dataset directory",
+ )
+ parser.add_argument(
+ "--prune",
+ dest="prune",
+ required=False,
+ action='store_true',
+ help="Prune model true or false",
+ )
+ parser.add_argument(
+ "--sparsity",
+ dest="sparsity",
+ type=float,
+ required=False,
+ default=0.5,
+ help="Level of sparsity required",
+ )
+
+ args = parser.parse_args()
+ main(args)
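
The sliding-window bookkeeping in `evaluate_tflite` is the subtlest part of the script: each 296-frame window carries `context = 98` frames of receptive-field padding on each side, so only `inner = 100` fresh frames are consumed per middle window, while the first and last windows keep their outer context. A stand-alone sketch of just that arithmetic (no TFLite involved; `data_end` is an arbitrary example length):

```python
size = 296                       # model input window, in MFCC frames
context = 24 + 2 * (7 * 3 + 16)  # 98: receptive field on each side
inner = size - 2 * context       # 100 fresh frames per middle window

data_end = 500                   # example utterance length (>= size)
data_pos, windows = 0, []
while data_pos < data_end:
    if data_pos == 0:
        # first window: aligned to the start, keeps the leading context
        start, end = 0, size
        data_pos = end - context
    elif data_pos + inner + context >= data_end:
        # last window: shifted left so it ends exactly at the data end
        start, end = data_end - size, data_end
        data_pos = data_end
    else:
        # middle windows: advance by `inner` frames each time
        start = data_pos - context
        end = start + size
        data_pos = end - context
    windows.append((start, end))

print(windows)   # [(0, 296), (100, 396), (204, 500)]
```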
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/recreate_model.sh b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/recreate_model.sh
new file mode 100644
index 0000000..5cc1b39
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/recreate_model.sh
@@ -0,0 +1,11 @@
+python3 -m venv env
+source env/bin/activate
+
+pip install -r requirements.txt
+python preprocessing.py
+python train_model.py --with_baseline --baseline_epochs 30 --with_finetuning --finetuning_epochs 10 --with_fluent_speech --fluent_speech_epochs 30
+python prune_and_quantise_model.py --prune --sparsity 0.5 --finetuning_epochs 10
+python prune_and_quantise_model.py --sparsity 0.5 --finetuning_epochs 10
+
+
+
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/requirements.txt b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/requirements.txt
new file mode 100644
index 0000000..6ea5a87
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/requirements.txt
@@ -0,0 +1,10 @@
+librosa==0.8.1
+numpy==1.19.5
+tensorboard==2.6.0
+tensorboard-data-server==0.6.1
+tensorboard-plugin-profile==2.5.0
+tensorboard-plugin-wit==1.8.0
+tensorflow==2.4.1
+tensorflow-model-optimization==0.6.0
+tqdm
+jiwer==2.3.0
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/saved_models/pruned_tiny_wav2letter/saved_model.pb b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/saved_models/pruned_tiny_wav2letter/saved_model.pb
new file mode 100644
index 0000000..5b2edf2
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/saved_models/pruned_tiny_wav2letter/saved_model.pb
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:081d7d86e2b6fd788ca37b9566213560140f595d7702a2126b5a4b895d00cc9e
+size 206996
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/saved_models/pruned_tiny_wav2letter/variables/variables.data-00000-of-00001 b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/saved_models/pruned_tiny_wav2letter/variables/variables.data-00000-of-00001
new file mode 100644
index 0000000..134cd19
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/saved_models/pruned_tiny_wav2letter/variables/variables.data-00000-of-00001
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:df3647d94be8c5feb098f69122e22007b11873bddb11bdc606112576d4588d4f
+size 15644464
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/saved_models/pruned_tiny_wav2letter/variables/variables.index b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/saved_models/pruned_tiny_wav2letter/variables/variables.index
new file mode 100644
index 0000000..e69b315
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/saved_models/pruned_tiny_wav2letter/variables/variables.index
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e6ba80bc6403066c573eed4e039219c085f2d83b32c3e0e19f40a21a03c8efb7
+size 1339
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/saved_models/tiny_wav2letter/saved_model.pb b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/saved_models/tiny_wav2letter/saved_model.pb
new file mode 100644
index 0000000..93177d0
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/saved_models/tiny_wav2letter/saved_model.pb
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:163b79816a803b5d03d45e2a792e62981a1102fc0d9bd58efb4947d44b5e2af7
+size 264815
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/saved_models/tiny_wav2letter/variables/variables.data-00000-of-00001 b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/saved_models/tiny_wav2letter/variables/variables.data-00000-of-00001
new file mode 100644
index 0000000..230a0dc
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/saved_models/tiny_wav2letter/variables/variables.data-00000-of-00001
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:d4048120f5ff4474cf70d3161723ac756b053e7053df76ffd48387416dab24e4
+size 46925195
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/saved_models/tiny_wav2letter/variables/variables.index b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/saved_models/tiny_wav2letter/variables/variables.index
new file mode 100644
index 0000000..b53f5fd
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/saved_models/tiny_wav2letter/variables/variables.index
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:82c10539eb05229e769397e7ee90a7befa1ec22377032da89440d69c69f4e07d
+size 4440
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/tinywav2letter.py b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/tinywav2letter.py
new file mode 100644
index 0000000..c998497
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/tinywav2letter.py
@@ -0,0 +1,152 @@
+# Copyright (C) 2020 Arm Limited or its affiliates. All rights reserved.
+#
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the License); you may
+# not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an AS IS BASIS, WITHOUT
+# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Model definition for Tinywav2Letter."""
+import tensorflow as tf
+from tensorflow.python.ops import ctc_ops
+import numpy as np
+from jiwer import wer
+
+def get_metrics(metric):
+ """Get metrics needed to compile wav2letter."""
+ def ctc_preparation(tensor, y_predict):
+ if len(y_predict.shape) == 4:
+ y_predict = tf.squeeze(y_predict, axis=1)
+ y_predict = tf.transpose(y_predict, (1, 0, 2))
+ sequence_lengths, labels = tensor[:, 0], tensor[:, 1:]
+ idx = tf.where(tf.not_equal(labels, 28))
+ sparse_labels = tf.SparseTensor(
+ idx, tf.gather_nd(labels, idx), tf.shape(labels, out_type=tf.int64)
+ )
+ return sparse_labels, sequence_lengths, y_predict
+
+ def get_loss():
+ """Calculate CTC loss."""
+ def ctc_loss(y_true, y_predict):
+ sparse_labels, logit_length, y_predict = ctc_preparation(y_true, y_predict)
+ return tf.reduce_mean(
+ ctc_ops.ctc_loss_v2(
+ labels=sparse_labels,
+ logits=y_predict,
+ label_length=None,
+ logit_length=logit_length,
+ blank_index=-1,
+ )
+ )
+ return ctc_loss
+
+ def get_ler():
+ """Calculate CTC LER (Letter Error Rate)."""
+ def ctc_ler(y_true, y_predict):
+ sparse_labels, logit_length, y_predict = ctc_preparation(y_true, y_predict)
+ decoded, log_probabilities = tf.nn.ctc_greedy_decoder(
+ y_predict, tf.cast(logit_length, tf.int32), merge_repeated=True
+ )
+ return tf.reduce_mean(
+ tf.edit_distance(
+ tf.cast(decoded[0], tf.int32), tf.cast(sparse_labels, tf.int32)
+ )
+ )
+ return ctc_ler
+ def get_wer():
+ """Calculate CTC WER (Word Error Rate) only for batch size = 1."""
+
+ def trans_int_to_string(trans_int):
+ #create dictionary int -> string (0 -> a 1 -> b)
+ string = ""
+ alphabet = "abcdefghijklmnopqrstuvwxyz' @"
+ alphabet_dict = {}
+ count = 0
+ for x in alphabet:
+ alphabet_dict[count] = x
+ count += 1
+ for letter in trans_int:
+ letter_np = np.array(letter).item(0)
+ if letter_np != 28:
+ string += alphabet_dict[letter_np]
+ return string
+
+ def ctc_wer(y_true, y_predict):
+ sparse_labels, logit_length, y_predict = ctc_preparation(y_true, y_predict)
+ decoded, log_probabilities = tf.nn.ctc_greedy_decoder(
+ y_predict, tf.cast(logit_length, tf.int32), merge_repeated=True
+ )
+ true_sentence = tf.cast(sparse_labels.values, tf.int32)
+ return wer(str(trans_int_to_string(true_sentence)),str(trans_int_to_string(decoded[0].values)))
+ return ctc_wer
+
+ return {"loss": get_loss(), "ler": get_ler(), "wer": get_wer()}[metric]
+
+
+def create_tinywav2letter(batch_size=1, no_stride_count=5, filters_small=100, filters_large_1=750, filters_large_2=750) -> tf.keras.models.Model:
+ """Create and return Tinywav2Letter model"""
+ layer = tf.keras.layers
+ leaky_relu = layer.LeakyReLU([0.20000000298023224])
+ MFCC_coeffs = 39
+ input = layer.Input(shape=[None, MFCC_coeffs], batch_size=batch_size)
+ # Reshape to prepare input for first layer
+ x = layer.Reshape([1, -1, 39])(input)
+ # One striding layer of output size [batch_size, max_time / 2, 250]
+ x = layer.Conv2D(
+ filters=250,
+ kernel_size=[1, 48],
+ padding="same",
+ activation=None,
+ strides=[1, 2],
+ )(x)
+ # Add non-linearity
+ x = leaky_relu(x)
+    # layers without striding of output size [batch_size, max_time / 2, filters_small]
+ for i in range(0, no_stride_count):
+ x = layer.Conv2D(
+ filters=filters_small,
+ kernel_size=[1, 7],
+ padding="same",
+ activation=None,
+ strides=[1, 1],
+ )(x)
+ # Add non-linearity
+ x = leaky_relu(x)
+    # 1 layer with high kernel width and output size [batch_size, max_time / 2, filters_large_1]
+ x = layer.Conv2D(
+ filters=filters_large_1,
+ kernel_size=[1, 32],
+ padding="same",
+ activation=None,
+ strides=[1, 1],
+ )(x)
+ # Add non-linearity
+ x = leaky_relu(x)
+    # 1 layer of output size [batch_size, max_time / 2, filters_large_2]
+ x = layer.Conv2D(
+ filters=filters_large_2,
+ kernel_size=[1, 1],
+ padding="same",
+ activation=None,
+ strides=[1, 1],
+ )(x)
+ # Add non-linearity
+ x = leaky_relu(x)
+ # 1 layer of output size [batch_size, max_time / 2, num_classes]
+ # We must not apply a non linearity in this last layer
+ x = layer.Conv2D(
+ filters=29,
+ kernel_size=[1, 1],
+ padding="same",
+ activation=None,
+ strides=[1, 1],
+ )(x)
+ return tf.keras.models.Model(inputs=[input], outputs=[x])
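
A small sketch (assuming `tinywav2letter.py` above is importable) that builds the model for batch size 1 and confirms the output layout consumed by `ctc_preparation`: the time dimension is halved by the single stride-2 layer and there are 29 output classes.

```python
import numpy as np
from tinywav2letter import create_tinywav2letter

model = create_tinywav2letter(batch_size=1)
mfccs = np.zeros((1, 296, 39), dtype=np.float32)  # one padded 296-frame window
logits = model(mfccs)
print(logits.shape)   # (1, 1, 148, 29)
```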
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/train_model.py b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/train_model.py
new file mode 100644
index 0000000..1ae4943
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/recreate_code/train_model.py
@@ -0,0 +1,267 @@
+# Copyright (C) 2020 Arm Limited or its affiliates. All rights reserved.
+#
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the License); you may
+# not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an AS IS BASIS, WITHOUT
+# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Wav2letter training, optimisation and evaluation script"""
+import argparse
+
+import tensorflow as tf
+
+from tinywav2letter import create_tinywav2letter, get_metrics
+from load_mfccs import MFCC_Loader
+
+options = tf.data.Options()
+options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA
+
+gpus = tf.config.list_logical_devices('GPU')
+strategy = tf.distribute.MirroredStrategy(gpus)
+
+
+def log(std):
+ """Log the given string to the standard output."""
+ print("******* {}".format(std), flush=True)
+
+def get_data(args, dataset_type, batch_size):
+ """Returns particular training and validation dataset."""
+ dataset = MFCC_Loader(args.full_data_dir, args.reduced_data_dir,args.fluent_speech_data_dir)
+
+ return {"train_full_size": [dataset.full_training_set(batch_size=batch_size, num_samples = args.training_set_size).with_options(options), dataset.num_steps(batch=batch_size)],
+ "train_reduced_size": [dataset.reduced_training_set(batch_size=batch_size, num_samples = args.training_set_size).with_options(options), dataset.num_steps(batch=batch_size)],
+ "val_full_size": [dataset.full_validation_set(batch_size=batch_size).with_options(options), dataset.num_steps(batch=batch_size)],
+ "val_reduced_size": [dataset.reduced_validation_set(batch_size=batch_size).with_options(options), dataset.num_steps(batch=batch_size)],
+ "train_fluent_speech": [dataset.fluent_speech_train_set(batch_size=batch_size, num_samples = args.fluent_speech_set_size).with_options(options), dataset.num_steps(batch=batch_size)],
+ "val_fluent_speech": [dataset.fluent_speech_validation_set(batch_size=batch_size).with_options(options), dataset.num_steps(batch=batch_size)],
+ "test_fluent_speech": [dataset.fluent_speech_test_set(batch_size=batch_size).with_options(options),dataset.num_steps(batch=batch_size)],
+ }[dataset_type]
+
+def setup_callbacks(checkpoint_path, log_dir):
+ """Returns callbacks for baseline training and optimization fine-tuning."""
+ callbacks = [
+ tf.keras.callbacks.TerminateOnNaN(),
+ tf.keras.callbacks.ModelCheckpoint(
+ filepath=checkpoint_path,
+ verbose=1,
+ save_weights_only=True,
+ save_freq='epoch', # save every epoch
+ ),
+ tf.keras.callbacks.TensorBoard(
+ log_dir=log_dir,
+ histogram_freq=1, # update every epoch
+ update_freq=100, # update every 100 batch
+
+ ),
+ ]
+ return callbacks
+
+def get_lr_schedule(steps_per_epoch, learning_rate=1e-4, lr_schedule_config=[[1.0, 0.1, 0.01, 0.001]]):
+ """Returns learn rate schedule for baseline training and optimization fine-tuning."""
+ initial_learning_rate = learning_rate
+ lr_schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
+ boundaries=list(p[1] * steps_per_epoch for p in lr_schedule_config),
+ values=[initial_learning_rate] + list(p[0] * initial_learning_rate for p in lr_schedule_config))
+ return lr_schedule
+
+
+def train_model(args):
+ """Performs pruning, fine-tuning and returns stripped pruned model"""
+ log("Commencing Model Training")
+
+ # Get all of the required datasets
+ (full_training_data, full_training_num_steps) = get_data(args, "train_full_size", args.batch_size)
+ (reduced_training_data, reduced_training_num_steps) = get_data(args, "train_reduced_size", args.batch_size)
+ (full_validation_data, full_validation_num_steps) = get_data(args, "val_full_size", args.batch_size)
+ (reduced_validation_data, reduced_validation_num_steps) = get_data(args, "val_reduced_size", args.batch_size)
+ (fluent_speech_training_data, fluent_speech_training_num_steps) = get_data(args, "train_fluent_speech", args.batch_size)
+ (fluent_speech_validation_data, fluent_speech_validation_num_steps) = get_data(args, "val_fluent_speech", args.batch_size)
+ (fluent_speech_test_data, fluent_speech_test_num_steps) = get_data(args, "test_fluent_speech",args.batch_size)
+
+ # Set up checkpoint paths, directories for the log files and the callbacks
+ baseline_checkpoint_path = f"checkpoints/baseline/checkpoint.ckpt"
+ finetuning_checkpoint_path = f"checkpoints/finetuning/checkpoint.ckpt"
+
+ baseline_log_dir = f"logs/tiny_wav2letter_baseline"
+ finetuning_log_dir = f"logs/tiny_wav2letter_finetuning"
+
+ baseline_callbacks = setup_callbacks(baseline_checkpoint_path, baseline_log_dir)
+ finetuning_callbacks = setup_callbacks(finetuning_checkpoint_path, finetuning_log_dir)
+
+ # Initialise the Tiny Wav2Letter model
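+    # Model variables are created under the tf.distribute strategy scope set up earlier in this script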
+ with strategy.scope():
+ model = create_tinywav2letter(batch_size = args.batch_size)
+
+
+ # Perform the baseline training with the full size LibriSpeech dataset
+ if args.with_baseline:
+
+ with strategy.scope():
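+            # get_metrics (defined earlier in this script) is expected to provide the CTC loss
+            # function and the letter error rate (LER) metric used for training and evaluation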
+ opt = tf.keras.optimizers.Adam(learning_rate=get_lr_schedule(learning_rate = 1e-4, steps_per_epoch=full_training_num_steps))
+ model.compile(loss=get_metrics("loss"), metrics=[get_metrics("ler")], optimizer=opt)
+
+ model.fit(
+ full_training_data,
+ epochs=args.baseline_epochs,
+ verbose=1,
+ callbacks=baseline_callbacks,
+ validation_data=full_validation_data,
+ initial_epoch = 0
+ )
+
+        log('Evaluating Tiny Wav2Letter post baseline training')
+ model.evaluate(fluent_speech_test_data)
+
+    # Perform fine-tuning with the reduced-size Mini LibriSpeech dataset
+ if args.with_finetuning:
+
+ with strategy.scope():
+ opt = tf.keras.optimizers.Adam(learning_rate=get_lr_schedule(learning_rate = 1e-5, steps_per_epoch=reduced_training_num_steps))
+ model.compile(loss=get_metrics("loss"), metrics=[get_metrics("ler")], optimizer=opt)
+
+ model.fit(
+ reduced_training_data,
+ epochs=args.finetuning_epochs + args.baseline_epochs,
+ verbose=1,
+ callbacks=finetuning_callbacks,
+ validation_data=reduced_validation_data,
+ initial_epoch = args.baseline_epochs
+ )
+
+        log('Evaluating Tiny Wav2Letter post finetuning')
+ model.evaluate(x=fluent_speech_test_data)
+
+
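+    # Perform transfer learning with the Fluent Speech Commands dataset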
+ if args.with_fluent_speech:
+
+ with strategy.scope():
+            opt = tf.keras.optimizers.Adam(learning_rate=get_lr_schedule(learning_rate=1e-5, steps_per_epoch=fluent_speech_training_num_steps))
+            model.compile(loss=get_metrics("loss"), metrics=[get_metrics("ler")], optimizer=opt, run_eagerly=True)
+
+ model.fit(
+ fluent_speech_training_data,
+            epochs=args.fluent_speech_epochs + args.baseline_epochs,
+ verbose=1,
+ callbacks=finetuning_callbacks,
+ validation_data=fluent_speech_validation_data,
+ initial_epoch = args.baseline_epochs
+ )
+
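+        log('Evaluating Tiny Wav2Letter post Fluent Speech training')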
+ model.evaluate(x=fluent_speech_test_data)
+ # Save the final trained model in TF SavedModel format
+ model.save(f"saved_models/tiny_wav2letter")
+
+if __name__ == "__main__":
+
+ parser = argparse.ArgumentParser(allow_abbrev=False)
+ parser.add_argument(
+ "--training_set_size",
+ dest="training_set_size",
+ type=int,
+ required=False,
+ default = 132553,
+ help="The number of samples in the training set"
+ )
+ parser.add_argument(
+ "--fluent_speech_training_set_size",
+ dest="fluent_speech_set_size",
+ type=int,
+ required=False,
+ default = 23132,
+ help="The number of samples in the fluent speech training dataset"
+ )
+ parser.add_argument(
+ "--batch_size",
+ dest="batch_size",
+ type=int,
+ required=False,
+ default=32,
+ help="batch size wanted when creating model",
+ )
+ parser.add_argument(
+ "--full_data_dir",
+ dest="full_data_dir",
+ type=str,
+ required=False,
+ default='librispeech_full_size',
+        help="Path to the full-size LibriSpeech dataset directory",
+ )
+ parser.add_argument(
+ "--reduced_data_dir",
+ dest="reduced_data_dir",
+ type=str,
+ required=False,
+ default='librispeech_reduced_size',
+        help="Path to the reduced-size Mini LibriSpeech dataset directory",
+ )
+ parser.add_argument(
+ "--fluent_speech_data_dir",
+ dest="fluent_speech_data_dir",
+ type=str,
+ required=False,
+ default='fluent_speech_commands_dataset',
+        help="Path to the Fluent Speech Commands dataset directory",
+ )
+ parser.add_argument(
+ "--load_model",
+ dest="load_model",
+ required=False,
+ action='store_true',
+        help="Load a previously trained model before training",
+ )
+ parser.add_argument(
+ "--with_baseline",
+ dest="with_baseline",
+ required=False,
+ action='store_true',
+        help="Perform baseline training using the full-size LibriSpeech dataset",
+ )
+ parser.add_argument(
+ "--with_finetuning",
+ dest="with_finetuning",
+ required=False,
+ action='store_true',
+        help="Perform fine-tuning using the reduced-size Mini LibriSpeech dataset",
+ )
+ parser.add_argument(
+ "--with_fluent_speech",
+ dest="with_fluent_speech",
+ required=False,
+ action='store_true',
+        help="Perform additional training using the Fluent Speech Commands dataset",
+ )
+ parser.add_argument(
+ "--baseline_epochs",
+ dest="baseline_epochs",
+ type=int,
+ required=False,
+ default=30,
+ help="Number of epochs for baseline training",
+ )
+ parser.add_argument(
+ "--finetuning_epochs",
+ dest="finetuning_epochs",
+ type=int,
+ required=False,
+ default=10,
+ help="Number of epochs for fine-tuning",
+ )
+ parser.add_argument(
+ "--fluent_speech_epochs",
+ dest="fluent_speech_epochs",
+ type=int,
+ required=False,
+ default=30,
+ help="Number of epochs for fluent speech training",
+ )
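+    # Example invocation (the script filename is an assumption; the dataset paths shown are the argparse defaults):
+    #   python train_model.py --with_baseline --with_finetuning --with_fluent_speech \
+    #       --full_data_dir librispeech_full_size --reduced_data_dir librispeech_reduced_size \
+    #       --fluent_speech_data_dir fluent_speech_commands_dataset --batch_size 32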
+ args = parser.parse_args()
+ train_model(args)
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/testing_input/input_1_int8/0.npy b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/testing_input/input_1_int8/0.npy
new file mode 100644
index 0000000..61a809d
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/testing_input/input_1_int8/0.npy
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e373de705b3eae4fc8b51998317cf1ebf7cf26e728b8524fd1c97a2822811757
+size 11672
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/testing_output/Identity_int8/0.npy b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/testing_output/Identity_int8/0.npy
new file mode 100644
index 0000000..74b5cb7
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/testing_output/Identity_int8/0.npy
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:9ab109a9f565cd48cf95d5f83cfdd9c0694db812407b01f0c535172ac68f0bf8
+size 4420
diff --git a/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/tiny_wav2letter_pruned_int8.tflite b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/tiny_wav2letter_pruned_int8.tflite
new file mode 100644
index 0000000..fbf9521
--- /dev/null
+++ b/models/speech_recognition/tiny_wav2letter/tflite_pruned_int8/tiny_wav2letter_pruned_int8.tflite
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:4487c1f55e60b0824939f9ae0bce9522489700a8d9bcb1830f17822d4a5c0041
+size 3997112