Neural Networks, and especially Convolutional Neural Networks (CNNs), have become the backbone of advances in Computer Vision over the past decade. While CNNs exhibit remarkable performance on a wide variety of Computer Vision tasks, they require a large amount of memory and compute resources. To counteract this and be able to deploy CNNs on edge devices, researchers have proposed using Binary Neural Networks (BNNs) and Ternary Neural Networks (TNNs), new forms of Neural Networks which represent weights as binary and ternary values respectively, thereby saving large amounts of storage and computation.
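For illustration, ternarization typically maps full-precision weights to {-1, 0, +1} by thresholding; the sketch below shows the idea (the threshold scheme and function name are illustrative, not taken from the baseline code):

```cpp
#include <cstdint>

// Illustrative threshold-based ternarization (not the baseline's exact scheme):
// weights near zero become 0, all others keep only their sign.
int8_t ternarize_weight(float w, float threshold) {
  if (w > threshold) return +1;
  if (w < -threshold) return -1;
  return 0;  // each weight now needs 2 bits instead of 32
}
```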
In this report, we go one step further and present a highly optimized version of a Ternary Convolutional Layer. We optimize the algorithm top-down: we start by changing the data order and merging subfunctions, then continue with blocking, unrolling, and finally vectorizing the code where possible. To assess the quality of our improvements, we perform multiple benchmarks, on which we achieve a speedup of up to 29x.
This work was done by Felix Möller (@FelixMoeller3), Daniel Nezamabadi (@dnezam), Rudy Peterson (@rudynicolop), and Luca Tagliavini (@lucat1) in the context of the ETH Zurich course "Advanced Systems Lab" in Spring 2024.
We also thank Shien Zhu for supervising us and putting no restrictions on the baseline code, which was written in the context of the paper "TAB: Unified and Optimized Ternary, Binary, and Mixed-Precision Neural Network Inference on the Edge" by Zhu, S., Duong, L. H. K., and Liu, W. (2022).
Finally, we also use libpopcnt by kimwalisch.
The interested reader can find our slides and report in 19.pdf and 19_report.pdf respectively.
All distributed code is licensed under AGPL-3.0-or-later except for libpopcnt.h, which is licensed under BSD 2-Clause.
To build, run:
make
To view help:
./tnn -h
To run all tests:
./tnn -t
To get benchmarks using particular parameters, for example those in `parameters/channels.csv`, run:
./tnn -b -p parameters/channels.csv -o benchmarks/channels.csv
We use `clang-format` to make sure our C and C++ code is consistently formatted. Install `clang-format` on your system. Before committing and pushing to the remote, be sure to run

make format

Alternatively, you can run the formatting automatically in a pre-commit hook by adding a file named `pre-commit` to your local `.git/hooks` directory and making it executable:
#!/bin/sh
# Run the make format command to format code before committing
make format
# Check if make format succeeded
if [ $? -ne 0 ]; then
echo "Code formatting failed. Please fix any issues and try committing again."
exit 1
fi
Generate benchmarks in `.csv` format using any of the scripts in the `scripts` directory:

./scripts/bench_*

Make sure the resulting `.csv` file is in the `benchmarks` folder, then generate performance and runtime plots by running:

python3 -m plotting.plotter

The plots will be placed in the `plots` folder.
To filter implementation data from an existing `.csv` into a new `.csv`, run:
python3 -m plotting.filter_csv -i results/csvs/final/fc.csv -o benchmarks/fc.csv -n original best_impl_avx512
To introduce a new optimization you need to:
- Add a new directory under `include/main_impls` or `include/minor_impls`, depending on whether this is a major improvement or just a minor change.
- Add a new `tab.hpp` header file to this new directory.
- Add a new directory under `src/main_impls` or `src/minor_impls`.
- Add a new `tab.cpp` file to this new directory.
- Define a new namespace for this new optimization.
- In `src/main.cpp`, add a new element to `vector<Implementation> impls`, where you give a name to the optimization, specify the order of the tensor dimensions, and provide the convolution function (see the sketch after this list).
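As a rough sketch of the last step (the exact fields of `Implementation` may differ; the namespace `my_opt` and the `DataOrder` enum shown here are hypothetical):

```cpp
// src/main.cpp (sketch): register the new optimization so the driver can
// test and benchmark it. Field names are illustrative; check the actual
// definition of `Implementation` in the repository.
vector<Implementation> impls = {
    // ... existing implementations ...
    {"my_opt",         // name used in test and benchmark output
     DataOrder::NHWC,  // order of the tensor dimensions (hypothetical enum)
     my_opt::conv},    // convolution function from the new namespace
};
```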
To time a specific operation, for example ternary GEMM, surround the call to the procedure with calls to `measure_point`, defined in `measure.hpp`:
measure_point(MeasurementFunction::TNN_GEMM, MeasurementEvent::START);
auto gemm_result = ternary_gemm(reshaped, kernel);
measure_point(MeasurementFunction::TNN_GEMM, MeasurementEvent::END);
- original: original implementation using vectors (data order: nchw)
- data_order_nhwc: simple implementation using tensors (data order changed to nhwc)
- data_order_nchw_tensor_macro1: replace getters and setters of tensors with simple macros
- t2r_gemmLU: merge gemm and prelu (based on tern2row_cpy)
- best_impl_avx2: overall best implementation, using AVX2 vectorization
- best_impl_avx512: overall best implementation, using AVX512 vectorization
- nchw: simple implementation using tensors (data order: nchw)
- nchw_tmacro1: replace getters and setters of tensors with simple macros
- nchw_tmacro1_sinline: use the inline keyword for steps like ternarize, gemm, prelu, etc.
- nchw_tmacro2: manually eliminate redundant computation in indices
- nchw_tmacro2_sinline: nchw_tmacro2 combined with inlined steps
- nhwc_tmacro1_sinline: nhwc variant of nchw_tmacro1_sinline
- nhwc_tmacro2: nhwc variant of nchw_tmacro2
- nhwc_tmacro2_sinline: nhwc variant of nchw_tmacro2_sinline
- indirect: compute an indirection buffer instead of im2row
- more_indirect: smaller indirection buffer
- tern2row: naively merge ternarize and im2row
- tern2row_cpy: avoid recomputation by copying already computed elements (uses a loop for copying)
- tern2row_memcpy: copy using memcpy instead of a loop
- t2r_ur_gemmLU_block: unrolling by 2 in ternarize+im2row, and blocking merged gemm+prelu
- t2r_gemmLU_autoblock: template for searching for the best blocking parameters
- t2r_gemmLU_block: blocked gemmLU
- t2r_gemmLU_lord: conditionally swaps the loop order (inspired by model-based ATLAS)
- t2r_gemmLU_unroll: unroll the innermost loop in SSA style
- t2r_avx2u_gemmLU_block: AVX2 plus an unrolled-by-2 cleanup loop in ternarize+im2row
- t2r_avx2u_ur_gemmLU_block: unrolled-by-2 AVX2 plus an unrolled-by-2 cleanup loop in ternarize+im2row
- t2r_avx2u_permute_gemmLU_block: AVX2 with a permute intrinsic for reducing results, plus an unrolled-by-2 cleanup loop in ternarize+im2row
- t2r_avx2u_permute_ur_gemmLU_block: unrolled-by-2 AVX2 with a permute intrinsic for reducing results, plus an unrolled-by-2 cleanup loop in ternarize+im2row
- avx2: straightforward AVX2, unrolled twice
- avx2_lessunpack: same as before + more computations and fewer unpacks
- avx2_lessunpack_popout: same as before + libpopcnt on a big vector
- avx2_popout: straightforward AVX2 with libpopcnt on a big vector
- t2r_gemmLU_block_avx2: AVX2 used in blocked gemmLU
- t2r_avx512u_gemmLU_block: AVX512 plus an unrolled cleanup loop in ternarize+im2row
- t2r_avx512u_ur_gemmLU_block: unrolled-by-2 AVX512 plus an unrolled cleanup loop in ternarize+im2row
- t2r_gemmLU_block_avx512: AVX512 used in blocked gemmLU
- ternary_nhwc: nhwc using ternary operators for prelu/ternarize
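Many of the vectorized variants above reduce to bitwise operations plus population counts (hence the use of libpopcnt). As background, here is a conceptual sketch of a bit-packed ternary dot product; the two-bitmap encoding and all names are illustrative, not the repository's actual kernel:

```cpp
#include <bit>      // std::popcount (C++20)
#include <cstdint>

// 64 ternary values in {-1, 0, +1}, packed as two bitmaps.
struct PackedTernary {
  uint64_t nz;  // bit i set  <=>  lane i is nonzero
  uint64_t sg;  // bit i set  <=>  lane i is -1 (only meaningful where nz is set)
};

// Dot product of two 64-lane ternary vectors.
int ternary_dot(PackedTernary a, PackedTernary b) {
  uint64_t nz = a.nz & b.nz;          // product is nonzero iff both lanes are nonzero
  uint64_t neg = (a.sg ^ b.sg) & nz;  // product is -1 iff exactly one operand is -1
  // (#positive lanes) - (#negative lanes) = popcount(nz) - 2 * popcount(neg)
  return std::popcount(nz) - 2 * std::popcount(neg);
}
```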
Some parts of the code have been vectorized with AVX512. If you do not have AVX512 on your machine, remove all code that uses AVX512 and the corresponding flags in the `Makefile`.
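If you are unsure whether your machine supports AVX512, the following standalone snippet (not part of the project) checks the relevant CPU feature via a GCC/Clang builtin:

```cpp
#include <cstdio>

int main() {
  // __builtin_cpu_supports is available in GCC and Clang on x86.
  if (__builtin_cpu_supports("avx512f"))
    std::puts("AVX512F supported: the AVX512 implementations should run");
  else
    std::puts("AVX512F not supported: remove the AVX512 code and Makefile flags");
  return 0;
}
```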