git clone [email protected]:daleonpz/stwin_AI_vowel_recognition.git
dvc remote add origin https://dagshub.com/daleonpz/stwin_AI_vowel_recognition.dvc
dvc pull -r origin
The objective of this project was to train a neural network for vowel recognition based on Inertial Movement Unit (IMU) measurements and deploy it on the dev board STEVAL-STWINKT1 from STM32.
The neural network is based on two layers of CNN + BN + ReLU, a Global Average Pooling layers and a Fully Connected + Softmax layer. The size of the neural network is 13KB, and has 95% accuracy on the testset.
Workflow overview:
- Pytorch 1.10.2+cu102
- ONNIX-tensorflow 1.10.0
- STM32 CUBE-AI 7.1.0
- Sensor ISM330DHCX
- 18.04.1-Ubuntu
Vowel recognition is a classification problem, so I defined the following setup for the neural network.
-
Number of classes: 5 (A, E, I, O, U)
-
Data:
- 200 samples/class = 1000 samples
- Split 80/10/10
-
Input: a 20x20x6 matrix based on IMU data.
-
Output: a 5x1 probability vector.
-
Architecture: Convolutional Neural Network + Fully Connected + Softmax.
-
Loss function: Cross Entropy, which is typical in multiclass classification.
I collected 200 samples for each class and each sample corresponds to 2sec of data from the ISM330DHCX sensor at a sampling rate of 200Hz. I decided to collect data from the accelerometer and gyroscope, because it has been proven that using multiple sensors improves gesture characterization. 1, 2, 3.
That means that each data sample have the following form.
such that
Before inputting data into the model, I conducted the normalization operation. Since the IMU signals differ in value and range, thus I normalize each signal
where
$$a_{min} = \min_{ a_i \in \{ \mathbf{a_x}, \mathbf{a_y},\mathbf{a_z} \} } a_i^{(j)} , j = 1,\cdots , 200 \\$$ $$a_{max} = \max_{ a_i \in \{ \mathbf{a_x}, \mathbf{a_y},\mathbf{a_z} \} } a_i^{(j)} , j = 1,\cdots , 200 \\$$ $$g_{min} = \min_{ a_i \in \{ \mathbf{a_x}, \mathbf{a_y},\mathbf{a_z} \} } a_i^{(j)} , j = 1,\cdots , 200 \\$$Thus the normalized feature matrix
4 showed that encoding feature vectors or matrices as images could take advantage of the performance of CNN on images. Thus the shape of input image
For the data collection I wrote some python scripts and C code for the microcontroller. All the relevant code for this task can be found here. The main script is collect.sh, where I defined the labels for each class.
I tried two architectures based on Convolution neural and fully connected networks. The papers I used as a reference are the following:
- Jianjie, Lu & Raymond, Tong. (2018). Encoding Accelerometer Signals as Images for Activity Recognition Using Residual Neural Network.
- Jiang, Yujian & Song, Lin & Zhang, Junming & Song, Yang & Yan, Ming. (2022). Multi-Category Gesture Recognition Modeling Based on sEMG and IMU Signals. Sensors. 22. 5855. 10.3390/s22155855.
The first architecture was the following.
-
Number of parameters:
- Conv2D + Batch + ReLU: $2463*3 + 24 + 24 + 24 + 0 = 1368$
- Conv2D + Batch + ReLU: $48243*3 + 48 + 48 + 48 + 0 = 10512$
- FC + ReLU:
$512*4800 + 512 + 0 = 2458112$ - FC + ReLU:
$512*32 + 32 + 0 = 16416$ - FC:
$5*32 + 5 = 165$ - Total number of parameters: 2486573
-
Number of elements per parameter: 4 (float)
-
Size of the model:
$2486573 * 4 = 9713.754 KB = 9.48 MB$
As you can see, the model size is too big 9.48 MB for the STWINK devboard, which has ARM Cortex-M4 MCU with 2048 kbytes of flash. But I still trained the model to see if I was on a right direction.
Here are the results:
Model | n_samples | num_epochs | learning_rate | criterion | optimizer | batch_size | train_acc | train_loss | val_acc | val_loss | test_acc | test_loss | size(float32) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
cnn (512,32) | 200 | 10 | 0.00001 | CrossEntropy | Adam | 16 | 91.67 | 0.514 | 83.33 | 0.544 | 82.5 | 0.5727 | 9MB |
cnn (512,32) | 200 | 50 | 0.00001 | CrossEntropy | Adam | 16 | 100 | 0.328 | 95 | 0.376 | 97.5 | 0.3458 | 9MB |
cnn (512,32) | 200 | 20 | 0.00001 | CrossEntropy | Adam | 16 | 96.67 | 0.402 | 86.67 | 0.469 | 92.5 | 0.4352 | 9MB |
As you can see, the accuracy in all datasets (train, validation and test) is above 90% which is a good indication. Although I plotted the curves "Accuracy vs epoch" and "Loss vs epoch", as well as the "confusion matrix" that validate performance of this model, I do not shown them because this model was not deployed.
The model I ended up deploying is the following
-
Number of parameters:
- Conv2D + Batch + ReLU: $1263*3 + 12 + 12 + 12 + 0 = 684$
- Conv2D + Batch + ReLU: $24123*3 + 24 + 24 + 24 + 0 = 2664$
- Global Average Pooling:
$0$ . With this layer I could reduce the number of parameters and code size - FC:
$5*24 + 5 = 125$ - Total number of parameters: 3473
-
Number of elements per parameter: 4 (float)
-
Size of the model:
$3473 * 4 = 13.863 KB$
This model is small enough to be deployed on the microcontroller, it uses only 0.68% of the available flash memory.
The results of training are the following:
The overall performance is above 90%.
For the training I wrote python scripts and C files for the microcontroller. All the relevant code for this task can be found here. Run python3 train_model.py --dataset_path ../data --num_epochs 200 --learning_rate 0.0005 --batch_size 128
to train the model.
Ep 199/200: Accuracy : Train:98.25 Val:83.00 || Loss: Train 0.956 Val 1.101: 99%|█████████████████████████████████████████████████████▍|
Ep 199/200: Accuracy : Train:98.25 Val:83.00 || Loss: Train 0.956 Val 1.101: 100%|██████████████████████████████████████████████████████|
Ep 199/200: Accuracy : Train:98.25 Val:83.00 || Loss: Train 0.956 Val 1.101: 100%|██████████████████████████████████████████████████████|
precision recall f1-score support
0 1.000 1.000 1.000 23
1 0.966 1.000 0.982 28
2 0.955 0.875 0.913 24
3 0.800 0.857 0.828 14
4 0.818 0.818 0.818 11
accuracy 0.930 100
macro avg 0.908 0.910 0.908 100
weighted avg 0.931 0.930 0.930 100
Test: acc 93.0 loss 0.9912329316139221
The CUBE-AI tool from STMicroelectronics doesn't support pytorch models, so I had two options. Either train the whole model again in tensorflow and then export it to tensorflow lite, or convert the pytorch model to ONNX (Open Neural Network Exchange). I chose the latter.
To export from pytorch to ONNX is straighforward:
from models.cnn_2 import CNN
# Load pytorch model
loadedmodel = CNN(fc_num_output=5, fc_hidden_size=[8]).to(DEVICE) # my model
loadedmodel.load_state_dict(torch.load('results/model.pth'))
loadedmodel.eval()
# Fuse some modules. it may save on memory access, make the model run faster, and improve its accuracy.
# https://pytorch.org/tutorials/recipes/fuse.html
torch.quantization.fuse_modules(loadedmodel,
[['conv1', 'bn1','relu1'],
['conv2', 'bn2','relu2']],
inplace=True)
# Convert to ONNX.
# Explanation on why we need a dummy input
# https://github.com/onnx/tutorials/issues/158
dummy_input = torch.randn(1, 6, 20, 20)
torch.onnx.export(loadedmodel,
dummy_input,
'model.onnx',
input_names=['input'],
output_names=['output'])
Once the the model is exported in ONNX format, it's time to import it into cube ai 7.1.0. STM has a CLI tool stm32ai
that imports the ONNX model and generates the corresponding C files.
$ stm32ai generate -m <model_path>/model.onnx --type onnx -o <output_dir> --name <project>
I wrote a script for that (link). The script generates the following files:
$ ll
drwxrwxr-x 2 me me 4096 13.01.2023 22:56 stm32ai_ws/
-rw-rw-r-- 1 me me 24356 13.01.2023 22:56 model.c
-rw-rw-r-- 1 me me 1500 13.01.2023 22:56 model_config.h
-rw-rw-r-- 1 me me 90718 13.01.2023 22:56 model_data.c
-rw-rw-r-- 1 me me 2624 13.01.2023 22:56 model_data.h
-rw-rw-r-- 1 me me 17926 13.01.2023 22:56 model_generate_report.txt
-rw-rw-r-- 1 me me 8766 13.01.2023 22:56 model.h
Here you can find the documentation in HTML of the API for the CUBE-AI framework. You could also see the function definitions in my code in the following links:
For the deployment the following modules were required:
- Ring buffer of size 600x6, because the sampling rate of the sensor is 200Hz and each sample is 2 seconds that means I needed at least an array of (200x2x6) elements. link
- Normalization of the samples between (0,1) before feeding the model. link
- Inference module with a threshold to detect the vowels. I set the threshold at 0.8. link
Since the model is small and the quantization model from Pytorch is not as good as from Tensorflow, I decide to use floats for my inference. In the future, I would build a quantized model using tensorflow and see how it performs.
The C code can be found under deployment. The main is defined in the main.c file.
Here are the results of the inference on the microcontroller.
Welcome to minicom 2.7.1
OPTIONS: I18n
Compiled on Aug 13 2017, 15:25:34.
Port /dev/ttyACM0, 19:45:52
Press CTRL-A Z for help on special keys
(HAL 1.13.0_0)
Compiled Jan 17 2023 19:34:40 (STM32CubeIDE)
Send Every 5mS Acc/Gyro/Magneto
system clock ----> 120000000
Testing label: 0
model output:
0.966853
0.004895
0.027935
0.000016
0.000300
inference result: 0
gesture: A
Testing label: 1
model output:
0.000000
0.999976
0.000006
0.000018
0.000000
inference result: 1
gesture: E
Testing label: 2
model output:
0.001132
0.000371
0.990676
0.000235
0.007587
inference result: 2
gesture: I
Testing label: 3
model output:
0.000021
0.007765
0.000039
0.990865
0.001310
inference result: 3
gesture: O
Testing label: 4
model output:
0.000081
0.000070
0.000067
0.005766
0.994015
model result: 4
gesture: U
I still having troubles with the timing between sensor reading and model inference. For that reason, I wanted to make sure that the inference on the microcontroller works, so I fed the model with some samples (ai float
arrays) from the Testset and run the inference, which was successful.
I fixed the timing problem and the model is working, but it is still difficult for the model to differentiate between A and I, as the patterns are similar.