Paper Title
SMART Paths for Latency Reduction in ReRAM Processing-In-Memory Architecture for CNN Inference
Authors
Abstract
This work proposes an analog ReRAM-based PIM (processing-in-memory) architecture for fast and efficient CNN (convolutional neural network) inference. The overall architecture uses a basic hardware hierarchy of nodes, tiles, cores, and subarrays. On top of that, we design intra-layer pipelining, inter-layer pipelining, and batch pipelining to exploit parallelism in the architecture and increase overall throughput for inference on an input image stream. We also optimize the performance of the NoC (network-on-chip) routers by reducing hop counts using SMART (single-cycle multi-hop asynchronous repeated traversal) flow control. Finally, we experiment with weight replication for different CNN layers in VGG (A-E) on the large-scale ImageNet data set. In our simulation, we achieve 40.4027 TOPS (tera-operations per second) for the best-case performance, which corresponds to over 1029 FPS (frames per second). We also achieve 3.5914 TOPS/W (tera-operations per second per watt) for the best-case energy efficiency. In addition, the architecture with aggressive pipelining and weight replication achieves a 14X speedup over the baseline architecture with basic pipelining, and SMART flow control achieves a 1.08X speedup in this architecture compared to the baseline. Lastly, we evaluate the performance of SMART flow control using synthetic traffic.
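
To illustrate how SMART flow control can reduce hop counts, the following is a minimal sketch and not the simulator used in this work. It assumes a 2D mesh NoC with dimension-ordered (XY) routing, no contention, and a hypothetical parameter `hpc_max` giving the maximum number of hops a flit may traverse in a single cycle along one dimension. The function names and the example coordinates are illustrative assumptions.

```python
def mesh_hops(src, dst):
    """Hop count on a 2D mesh with XY routing: one router hop per link."""
    dx = abs(dst[0] - src[0])
    dy = abs(dst[1] - src[1])
    return dx + dy


def smart_hops(src, dst, hpc_max):
    """Best-case effective hop count when up to hpc_max links in a straight
    line can be bypassed in one cycle (assumption: no contention, and a
    bypass path cannot turn from the X dimension into the Y dimension)."""
    if hpc_max < 1:
        raise ValueError("hpc_max must be at least 1")
    dx = abs(dst[0] - src[0])
    dy = abs(dst[1] - src[1])
    # Integer ceiling division per dimension.
    return -(-dx // hpc_max) + -(-dy // hpc_max)


if __name__ == "__main__":
    src, dst = (0, 0), (5, 3)  # hypothetical router coordinates
    print("baseline hops:", mesh_hops(src, dst))
    print("SMART hops (hpc_max=4):", smart_hops(src, dst, hpc_max=4))
```

With `hpc_max = 1` the sketch reduces to the baseline mesh hop count, so the parameter alone captures the difference between the two flow-control schemes in this simplified model. Real traffic adds contention, which shortens the achievable bypass paths.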
