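Results Report Details
Management number: 20190000000355
Title: FY2018 (Heisei 30) Interim Annual Report: Technology Development of AI Chips and Next-Generation Computing Enabling High-Efficiency, High-Speed Processing / Development of Innovative AI Edge Computing Technology / Research and Development of a Small Linear Array Accelerator Connected as Memory
Publication date: 2019/6/6
Report fiscal year: 2018 - 2018
Contractor: Nara Institute of Science and Technology (国立大学法人奈良先端科学技術大学院大学)
Project number: P16007
Department: IoT Promotion Department (IoT推進部)
Japanese abstract:
English abstract: Title: Project for Innovative AI Chips and Next-Generation Computing Technology Development / Development of innovative AI edge computing technologies / Research and Development of a Small-Scale Linear Array Accelerator for Memory Interfaces (FY2018-FY2019) FY2018 Annual Report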

Efficient SIMT/SIMD operation requires a multithreading mechanism backed by a huge register file to hide DDR latency of more than 1000 cycles. The high power consumption of GPUs therefore seems unavoidable if conventional programmability is to be preserved. At the same time, there is growing awareness of a wall that limits further performance improvement of von Neumann computing infrastructure. In this scenario, GPUs are used to develop an algorithm, and Domain Specific Accelerators (DSAs) are used to implement that algorithm on an FPGA or ASIC attached to each product. However, FPGAs have drawbacks in price and operating frequency, and ASICs have problems with algorithm flexibility and with development cost, including custom high-speed I/O. For this reason, more flexible hardware (one kind of DSA) specialized for CNNs, and systolic array architectures that execute CNNs efficiently, have been reported. These are promising techniques for engineers who agree that low power consumption and easy-to-estimate performance matter for social implementation, despite the restricted programmability. Furthermore, if such accelerators can be fabricated with a small footprint and scaled simply by increasing the number of chips according to the application, powerful and efficient computing platforms will spread into IoT devices and edge computers, which are sensitive to power, price, environment, and algorithm improvements.

In this research, to improve the scalability and area efficiency of systolic arrays for the edge side, we propose: (1) a cascaded multichip structure that acts as an AXI slave, together with an intra-chip multi-bus structure, ensuring scalability without increasing the number of external memory buses; (2) column multithreading, which reduces the number of physical columns to avoid the performance degradation caused by long wires and self-loop accumulation; and (3) a multilevel loop controller and an adaptive map shifter, which reduce startup overhead.
We designed a single-column x 64-row systolic array as an AXI4 slave in Verilog HDL, estimated its operating frequency and performance using a prototype system on an FPGA, and evaluated its area with a TSMC 28nm library and memory generator. We found that (1) for a matrix multiplication / a convolution operation / a light-field depth extraction whose size exceeds the capacity of the local memory, the execution speed is 6.3x / 9.2x / 6.6x that of a conventional systolic array (EMAX); (2) the speed of a four-chip configuration is 19.6x / 16.0x / 8.5x that of EMAX; and (3) the size of a single chip is 8.4 mm^2 (0.31x of EMAX), and the basic performance per area is 3/4/0.31 = 2.4x.
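The reported figures can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not part of the report: the function names are hypothetical, and it assumes that the "3/4" in the performance-per-area expression is the basic performance relative to EMAX, which the abstract does not state explicitly. It derives how much of the ideal 4x gain the four-chip configuration achieves for each workload, and reproduces the 2.4x performance-per-area figure.

```python
# Speedups over the conventional systolic array (EMAX), as quoted in the abstract.
SINGLE_CHIP = {
    "matrix multiplication": 6.3,
    "convolution": 9.2,
    "light-field depth extraction": 6.6,
}
FOUR_CHIP = {
    "matrix multiplication": 19.6,
    "convolution": 16.0,
    "light-field depth extraction": 8.5,
}


def scaling_gain(single, multi):
    """Speedup of the multi-chip configuration relative to one chip."""
    return {name: multi[name] / single[name] for name in single}


def perf_per_area(relative_performance, relative_area):
    """Performance per area relative to the baseline (EMAX)."""
    return relative_performance / relative_area


gains = scaling_gain(SINGLE_CHIP, FOUR_CHIP)
for name, gain in gains.items():
    # Fraction of the ideal 4x that four chips deliver for this workload.
    print(f"{name}: {gain:.2f}x over one chip, {gain / 4:.0%} of ideal")

# Assumption: 3/4 is the relative basic performance, 0.31 the relative area.
print(f"performance per area: {perf_per_area(3 / 4, 0.31):.1f}x")
```

Under these assumptions, matrix multiplication scales best across chips (about 3.1x from one chip to four), while light-field depth extraction gains about 1.3x.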
Download: The full report is available from the Results Report Database (user registration required).
