PIM을 지원하는 GPU system에서의 CNN 연산 가속 = Acceleration of CNN Computation on a PIM-enabled GPU system

최정우, 2022
Thesis details
Paper impact by topic
Topics
  • applied physics
  • approximate computing
  • convolutional neural networks
  • gpu
  • image processing
  • processing-in-memory
Total papers on the same topics: 5,005
Total citations: 0
Average topic impact: 0.0%

References

  • [9] D. Zhang, N. Jayasena, A. Lyashevsky, J. L. Greathouse, L. Xu, and M. Ignatowski, TOP-PIM: Throughput-oriented programmable processing in memory, in Proceedings of International Symposium on High-Performance Parallel and Distributed Computing (HPDC), 2014, pp. 85–98.
  • [8] M. Gokhale, B. Holmes, and K. Iobst, Processing in memory: The Terasys massively parallel PIM array, Computer, vol. 28, no. 4, pp. 23–31, 1995.
  • [87] J. Kim and Y. Kim, HBM: Memory solution for bandwidth-hungry processors, in Proceedings of IEEE Hot Chips Symposium (HCS), 2014, pp. 1–24.
  • [86] Y. Eckert, N. Jayasena, and G. H. Loh, Thermal feasibility of die-stacked processing in memory, 2014.
  • [84] M. Gebhart, D. R. Johnson, D. Tarjan, S. W. Keckler, W. J. Dally, E. Lindholm, and K. Skadron, Energy-efficient mechanisms for managing thread context in throughput processors, in Proceedings of IEEE International Symposium on Computer Architecture (ISCA), 2011, pp. 235–246.
  • [83] M. Horowitz, 1.1 Computing’s energy problem (and what we can do about it), in Proceedings of International Solid-State Circuits Conference (ISSCC), 2014, pp. 10–14.
  • [82] B. Dally, Power, programmability, and granularity: The challenges of ExaScale computing, in Proceedings of IEEE International Test Conference, 2011, pp. 12–12.
  • [81] NVIDIA, NVIDIA CUDA SDK 4.2, 2011.
  • [80] H. Jeon, G. Koo, and M. Annavaram, CTA-aware prefetching for GPGPU, Univ. Southern California, Los Angeles, CA, USA, Comput. Eng. Tech. Rep. CENG-2014-08, 2014.
  • [7] L. Nai, R. Hadidi, J. Sim, H. Kim, P. Kumar, and H. Kim, GraphPIM: Enabling instruction-level PIM offloading in graph computing frameworks, in Proceedings of IEEE International Symposium on High Performance Computer Architecture (HPCA), 2017, pp. 457–468.
  • [79] A. Jog, O. Kayiran, N. Chidambaram Nachiappan, A. K. Mishra, M. T. Kandemir, O. Mutlu, R. Iyer, and C. R. Das, OWL: Cooperative thread array aware scheduling techniques for improving GPGPU performance, ACM SIGPLAN Notices, vol. 48, no. 4, pp. 395–406, 2013.
  • [78] A. Sethia, G. Dasika, M. Samadi, and S. Mahlke, APOGEE: Adaptive prefetching on GPUs for energy efficiency, in Proceedings of International Conference on Parallel Architectures and Compilation Techniques (PACT), 2013, pp. 73–82.
  • [77] J. Lee, N. B. Lakshminarayana, H. Kim, and R. Vuduc, Many-thread aware prefetching mechanisms for GPGPU applications, in Proceedings of IEEE/ACM International Symposium on Microarchitecture (MICRO), 2010, pp. 213–224.
  • [76] A. Jog, O. Kayiran, A. K. Mishra, M. T. Kandemir, O. Mutlu, R. Iyer, and C. R. Das, Orchestrated scheduling and prefetching for GPGPUs, in Proceedings of IEEE International Symposium on Computer Architecture (ISCA), 2013, pp. 332–343.
  • [75] M. Lee, S. Song, J. Moon, J. Kim, W. Seo, Y. Cho, and S. Ryu, Improving GPGPU resource utilization through alternative thread block scheduling, in Proceedings of IEEE International Symposium on High Performance Computer Architecture (HPCA), 2014, pp. 260–271.
  • [74] O. Kayiran, A. Jog, M. T. Kandemir, and C. R. Das, Neither more nor less: Optimizing thread-level parallelism for GPGPUs, in Proceedings of International Conference on Parallel Architectures and Compilation Techniques (PACT), 2013, pp. 157–166.
  • [73] J. T. Pawlowski, Hybrid memory cube (HMC), in Proceedings of IEEE Hot Chips Symposium (HCS), 2011, pp. 1–24.
  • [72] JEDEC Standard, High Bandwidth Memory (HBM) DRAM, JESD235, 2013.
  • [71] H. Wang, R. Singh, M. J. Schulte, and N. S. Kim, Memory scheduling towards high-throughput cooperative heterogeneous computing, in Proceedings of International Conference on Parallel Architectures and Compilation Techniques (PACT), 2014, pp. 331–341.
  • [70] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, Densely connected convolutional networks, in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4700–4708.
  • [69] K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556, 2014.
  • [67] Z. Jia et al., Dissecting the NVIDIA Volta GPU architecture via microbenchmarking, arXiv preprint arXiv:1804.06826, 2018.
  • [66] NVIDIA Tesla V100 GPU Architecture: The World’s Most Advanced Data Center GPU, NVIDIA Corporation, 2017.
  • [62] B. Kim, J. Chung, E. Lee, W. Jung, S. Lee, J. Choi, J. Park, M. Wi, S. Lee, and J. H. Ahn, MViD: Sparse matrix-vector multiplication in mobile DRAM for accelerating recurrent neural networks, IEEE Transactions on Computers, vol. 69, no. 7, pp. 955–967, 2020.
  • [60] K. Hsieh, E. Ebrahimi, G. Kim, N. Chatterjee, M. O’Connor, N. Vijaykumar, O. Mutlu, and S. W. Keckler, Transparent offloading and mapping (TOM): Enabling programmer-transparent near-data processing in GPU systems, ACM SIGARCH Computer Architecture News, vol. 44, pp. 204–216, 2016.
  • [5] J. Draper, J. Chame, M. Hall, C. Steele, T. Barrett, J. LaCoss, J. Granacki, J. Shin, C. Chen, C. W. Kang, I. Kim, and G. Daglikoca, The architecture of the DIVA processing-in-memory chip, in Proceedings of International Conference on Supercomputing (ICS), 2002, pp. 14–25.
  • [59] A. Nag and R. Balasubramonian, OrderLight: Lightweight memory-ordering primitive for efficient fine-grained PIM computations, in Proceedings of IEEE/ACM International Symposium on Microarchitecture (MICRO), 2021, pp. 298–310.
  • [58] A. Farmahini-Farahani, J. H. Ahn, K. Morrow, and N. S. Kim, NDA: Near-DRAM acceleration architecture leveraging commodity DRAM devices and standard memory modules, in Proceedings of IEEE International Symposium on High Performance Computer Architecture (HPCA), 2015, pp. 283–295.
  • [57] A. Pattnaik, X. Tang, A. Jog, O. Kayiran, A. K. Mishra, M. T. Kandemir, O. Mutlu, and C. R. Das, Scheduling techniques for GPU architectures with processing-in-memory capabilities, in Proceedings of International Conference on Parallel Architectures and Compilation Techniques (PACT), 2016, pp. 31–44.
  • [54] W. Jung, D. Jung, B. Kim, S. Lee, W. Rhee, and J. H. Ahn, Restructuring batch normalization to accelerate CNN training, in Proceedings of Machine Learning and Systems (MLSys), 2019, pp. 14–26.
  • [51] H. H. Saleh and E. E. Swartzlander, A floating-point fused dot-product unit, in Proceedings of International Conference on Computer Design (ICCD), 2008, pp. 427–431.
  • [50] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255.
  • [49] R. Nishino and S. H. C. Loomis, CuPy: A NumPy-compatible library for NVIDIA GPU calculations, in Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2017.
  • [48] M. O’Connor, N. Chatterjee, D. Lee, J. Wilson, A. Agrawal, S. W. Keckler, and W. J. Dally, Fine-grained DRAM: Energy-efficient DRAM for extreme bandwidth systems, in Proceedings of IEEE/ACM International Symposium on Microarchitecture (MICRO), 2017, pp. 41–54.
  • [47] R. Balasubramonian, A. B. Kahng, N. Muralimanohar, A. Shafiee, and V. Srinivas, CACTI 7: New tools for interconnect exploration in innovative off-chip memories, ACM Transactions on Architecture and Code Optimization (TACO), vol. 14, no. 2, pp. 1–25, 2017.
  • [46] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi, GPUWattch: Enabling energy optimizations in GPGPUs, ACM SIGARCH Computer Architecture News, vol. 41, no. 3, pp. 487–498, 2013.
  • [45] A. Bakhoda, G. L. Yuan, W. W. Fung, H. Wong, and T. M. Aamodt, Analyzing CUDA workloads using a detailed GPU simulator, in Proceedings of International Symposium on Performance Analysis of Systems and Software (ISPASS), 2009, pp. 163–174.
  • [44] S. Hong, P. J. Nair, B. Abali, A. Buyuktosunoglu, K.-H. Kim, and M. Healy, Attaché: Towards ideal memory compression by mitigating metadata bandwidth overheads, in Proceedings of IEEE/ACM International Symposium on Microarchitecture (MICRO), 2018, pp. 326–338.
  • [43] J. Ji, J. Li, S. Yan, B. Zhang, and Q. Tian, Super-bit locality sensitive hashing, in Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2012, pp. 108–116.
  • [42] M. S. Charikar, Similarity estimation techniques from rounding algorithms, in Proceedings of ACM Symposium on Theory of Computing (STOC), 2002, pp. 380–388.
  • [41] J. Lee, J. H. Ahn, and K. Choi, Buffered compares: Excavating the hidden parallelism inside DRAM architectures with lightweight logic, in Proceedings of Design, Automation & Test in Europe Conference & Exhibition (DATE), 2016, pp. 1243–1248.
  • [40] D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay, Neurocube: A programmable digital neuromorphic architecture with high-density 3D memory, ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 380–392, 2016.
  • [39] J. Lee, J. Lee, D. Han, J. Lee, G. Park, and H.-J. Yoo, 7.7 LNPU: A 25.3 TFLOPS/W sparse deep-neural-network learning processor with fine-grained mixed precision of FP8-FP16, in Proceedings of International Solid-State Circuits Conference (ISSCC), 2019, pp. 142–144.
  • [38] D. Kim, J. Ahn, and S. Yoo, ZeNA: Zero-aware neural network accelerator, IEEE Design & Test, vol. 35, no. 1, pp. 39–46, 2017.
  • [37] A. Yasoubi, R. Hojabr, and M. Modarressi, Power-efficient accelerator design for neural networks using computation reuse, IEEE Computer Architecture Letters, vol. 16, no. 1, pp. 72–75, 2016.
  • [36] M. Imani, D. Peroni, Y. Kim, A. Rahimi, and T. Rosing, Efficient neural network acceleration on GPGPU using content addressable memory, in Proceedings of Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017, pp. 1026–1031.
  • [35] M. S. Razlighi, M. Imani, F. Koushanfar, and T. Rosing, LookNN: Neural network with no multiplication, in Proceedings of Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017, pp. 1775–1780.
  • [34] L. Mocerino, V. Tenace, and A. Calimera, Energy-efficient convolutional neural networks via recurrent data reuse, in Proceedings of Design, Automation & Test in Europe Conference & Exhibition (DATE), 2019, pp. 848–853.
  • [33] X. Jiao, V. Akhlaghi, Y. Jiang, and R. K. Gupta, Energy-efficient neural networks using approximate computation reuse, in Proceedings of Design, Automation & Test in Europe Conference & Exhibition (DATE), 2018, pp. 1223–1228.
  • [32] I. Qiqieh, R. Shafik, G. Tarawneh, D. Sokolov, S. Das, and A. Yakovlev, Significance-driven logic compression for energy-efficient multiplier design, IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 8, no. 3, pp. 417–430, 2018.
  • [31] M. S. Ansari, H. Jiang, B. F. Cockburn, and J. Han, Low-power approximate multipliers using encoded partial products and approximate compressors, IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 8, no. 3, pp. 404–416, 2018.
  • [30] M. Imani, D. Peroni, and T. Rosing, CFPU: Configurable floating point multiplier for energy-efficient computing, in Proceedings of Design Automation Conference (DAC), 2017, pp. 1–6.
  • [2] Y. LeCun, K. Kavukcuoglu, and C. Farabet, Convolutional networks and applications in vision, in Proceedings of IEEE International Symposium on Circuits and Systems (ISCAS), 2010, pp. 253–256.
  • [29] C. Liu, J. Han, and F. Lombardi, A low-power, high-performance approximate multiplier with configurable partial error recovery, in Proceedings of Design, Automation & Test in Europe Conference & Exhibition (DATE), 2014, pp. 1–4.
  • [28] A. Ghofrani, A. Rahimi, M. A. Lastras-Montaño, L. Benini, R. K. Gupta, and K.-T. Cheng, Associative memristive memory for approximate computing in GPUs, IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 6, no. 2, pp. 222–234, 2016.
  • [27] V. K. Chippa, D. Mohapatra, K. Roy, S. T. Chakradhar, and A. Raghunathan, Scalable effort hardware design, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 22, no. 9, pp. 2004–2016, 2014.
  • [26] P. K. Krause and I. Polian, Adaptive voltage over-scaling for resilient applications, in Proceedings of Design, Automation & Test in Europe Conference & Exhibition (DATE), 2011, pp. 1–6.
  • [25] H. E. Yantır, A. M. Eltawil, and F. J. Kurdahi, A hybrid approximate computing approach for associative in-memory processors, IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 8, no. 4, pp. 758–769, 2018.
  • [23] A. Li, S. L. Song, W. Liu, X. Liu, A. Kumar, and H. Corporaal, Locality-aware CTA clustering for modern GPUs, ACM SIGARCH Computer Architecture News, vol. 45, pp. 297–311, 2017.
  • [22] NVIDIA GeForce GTX 980: Featuring Maxwell, the Most Advanced GPU Ever Made, White Paper, NVIDIA Corporation, 2014.
  • [20] M. F. Deering, S. A. Schlapp, and M. G. Lavelle, FBRAM: A new form of memory optimized for 3D graphics, in Proceedings of Computer Graphics and Interactive Techniques (SIGGRAPH), 1994, pp. 167–174.
  • [19] S. Lee, S. Kang, J. Lee, H. Kim, E. Lee, S. Seo, H. Yoon, S. Lee, K. Lim, H. Shin, J. Kim, Seongil O, A. Iyer, D. Wang, K. Sohn, and N. S. Kim, Hardware architecture and software stack for PIM based on commercial DRAM technology: Industrial product, in Proceedings of IEEE International Symposium on Computer Architecture (ISCA), 2021, pp. 43–56.
  • [16] C. Xie, S. L. Song, J. Wang, W. Zhang, and X. Fu, Processing-in-memory enabled graphics processors for 3D rendering, in Proceedings of IEEE International Symposium on High Performance Computer Architecture (HPCA), 2017, pp. 637–648.
  • [15] S. Aga, N. Jayasena, and M. Ignatowski, Co-ML: A case for collaborative ML acceleration using near-data processing, in Proceedings of International Symposium on Memory Systems (MEMSYS), 2019, pp. 506–517.
  • [14] Q. Deng, L. Jiang, Y. Zhang, M. Zhang, and J. Yang, DrAcc: A DRAM based accelerator for accurate CNN inference, in Proceedings of Design Automation Conference (DAC), 2018, pp. 1–6.
  • [13] M. He, C. Song, I. Kim, C. Jeong, S. Kim, I. Park, M. Thottethodi, and T. Vijaykumar, Newton: A DRAM-maker’s accelerator-in-memory (AiM) architecture for machine learning, in Proceedings of IEEE/ACM International Symposium on Microarchitecture (MICRO), 2020, pp. 372–385.
  • [12] H. Shin, D. Kim, E. Park, S. Park, Y. Park, and S. Yoo, McDRAM: Low latency and energy-efficient matrix computations in DRAM, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 11, pp. 2613–2622, 2018.
  • [11] M. Gao, G. Ayers, and C. Kozyrakis, Practical near-data processing for in-memory analytics frameworks, in Proceedings of International Conference on Parallel Architectures and Compilation Techniques (PACT), 2015, pp. 113–124.
  • [10] B. Hong, G. Kim, J. H. Ahn, Y. Kwon, H. Kim, and J. Kim, Accelerating linked-list traversal through near-data processing, in Proceedings of International Conference on Parallel Architectures and Compilation Techniques (PACT), 2016, pp. 113–124.
  • Thread cluster memory scheduling: Exploiting differences in memory access behavior
  • TETRIS: Scalable and efficient neural network acceleration with 3D memory
  • Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems
  • Prefetching techniques for near-memory throughput processors
  • Power modeling for GPU architectures using McPAT
  • Near data acceleration with concurrent host access
  • Modeling deep learning accelerator enabled GPUs
  • Mini-batch serialization: CNN training with inter-layer data reuse
  • Lightweight SIMT core designs for intelligent 3D stacked DRAM
  • Learning and transferring mid-level image representations using convolutional neural networks
  • In-place activated BatchNorm for memory-optimized training of DNNs
  • Gaussian YOLOv3: An accurate and fast object detector using localization uncertainty for autonomous driving
  • Evaluating fast algorithms for convolutional neural networks on FPGAs
  • Deep residual learning for image recognition
  • An FPGA design framework for CNN sparsification and acceleration
  • A scalable processing-in-memory accelerator for parallel graph processing