TY - JOUR
T1 - Automatic Generation of Resource and Accuracy Configurable Processing Elements
AU - León-Vega, Luis G.
AU - Salazar-Villalobos, Eduardo
AU - Rodriguez-Figueroa, Alejandro
AU - Castro-Godínez, Jorge
N1 - Publisher Copyright: © 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2023/7/24
Y1 - 2023/7/24
N2 - Low-power constraints and scarce computational resources limit computation at the edge. Meanwhile, the approximate computing paradigm offers promising techniques for designing accelerators that deal with the inherent limitations of the edge, and high-level synthesis with C++ opens the opportunity to use meta-programming for specialisable generic designs. This work proposes a framework for automatically generating synthesis-time configurable processing elements (PEs) for matrix multiplication-addition (GEMMA) and convolution. To evaluate our work, we perform a design exploration by varying data bit-width, operand sizes, and kernel sizes. Our analyses include resource consumption scaling, clocks-to-solution, design efficiency, and error distribution, presenting a comprehensive view of how the parameters affect the properties of our generic implementations. The GEMMA PEs exhibit a trade-off between granularity and efficiency, where large PEs with short data widths are favoured by the design efficiency, achieving, theoretically, up to 75 GMAC/s on a Xilinx XC7Z020 @ 100 MHz with an efficiency of 27%. For design efficiency, we propose a figure of merit that evaluates operations per second and resource utilisation with respect to the maximum achievable by the FPGA. Regarding the convolution PEs, we implemented two algorithms: a window-based spatial convolution and Winograd. The former is the best in terms of performance with 150 GMAC/s, reaching an efficiency of up to 47%. Winograd, however, performed better numerically with a 3×3 kernel filter, presenting a mean error of 11.01% for 4-bit operands with a PSNR of 16.28 dB, compared to the spatial convolution with a mean error of 38.2% and a PSNR of 5.89 dB. Finally, we discuss how the error depends mostly on the PE's parameters. In the GEMMA, the error depends on the matrix size, which limits PE scaling, although the PEs remain applicable to accelerators. The PEs developed during this research will lead to further granular approximate accelerator research.
KW - Approximate computing
KW - deep neural networks
KW - edge computing
KW - hardware acceleration
KW - multi-layer neural network
KW - neural network hardware
UR - http://www.scopus.com/inward/record.url?scp=85168767741&partnerID=8YFLogxK
U2 - 10.1145/3594540
DO - 10.1145/3594540
M3 - Article
AN - SCOPUS:85168767741
SN - 1539-9087
VL - 22
JO - ACM Transactions on Embedded Computing Systems
JF - ACM Transactions on Embedded Computing Systems
IS - 4
M1 - 75
ER -