PIM을 지원하는 GPU system에서의 CNN 연산 가속 = Acceleration of CNN Computation on a PIM-enabled GPU system

최정우, 2022
Thesis details
Paper impact by topic
Topics
  • applied physics
  • approximate computing
  • convolutional neural networks
  • gpu
  • image processing
  • processing-in-memory
Total papers on the same topics: 5,005
Total citations: 0
Average topic impact: 0.0%

References

  • [9] D. Zhang, N. Jayasena, A. Lyashevsky, J. L. Greathouse, L. Xu, and M. Ignatowski, TOP-PIM: Throughput-oriented programmable processing in memory, in Proceedings of International Symposium on High-Performance Parallel and Distributed Computing (HPDC), 2014, pp. 85–98.
  • [8] M. Gokhale, B. Holmes, and K. Iobst, Processing in memory: The Terasys massively parallel PIM array, Computer, vol. 28, no. 4, pp. 23–31, 1995.
  • [87] J. Kim and Y. Kim, HBM: Memory solution for bandwidth-hungry processors, in Proceedings of IEEE Hot Chips Symposium (HCS), 2014, pp. 1–24.
  • [86] Y. Eckert, N. Jayasena, and G. H. Loh, Thermal feasibility of die-stacked processing in memory, 2014.
  • [84] M. Gebhart, D. R. Johnson, D. Tarjan, S. W. Keckler, W. J. Dally, E. Lindholm, and K. Skadron, Energy-efficient mechanisms for managing thread context in throughput processors, in Proceedings of IEEE International Symposium on Computer Architecture (ISCA), 2011, pp. 235–246.
  • [83] M. Horowitz, 1.1 Computing’s energy problem (and what we can do about it), in Proceedings of International Solid-State Circuits Conference (ISSCC), 2014, pp. 10–14.
  • [82] B. Dally, Power, programmability, and granularity: The challenges of ExaScale computing, in Proceedings of IEEE International Test Conference, 2011, pp. 12–12.
  • [81] NVIDIA, NVIDIA CUDA SDK 4.2, 2011.
  • [80] H. Jeon, G. Koo, and M. Annavaram, CTA-aware prefetching for GPGPU, Univ. Southern California, Los Angeles, CA, USA, Comput. Eng. Tech. Rep. CENG-2014-08, 2014.
  • [7] L. Nai, R. Hadidi, J. Sim, H. Kim, P. Kumar, and H. Kim, GraphPIM: Enabling instruction-level PIM offloading in graph computing frameworks, in Proceedings of IEEE International Symposium on High Performance Computer Architecture (HPCA), 2017, pp. 457–468.
  • [79] A. Jog, O. Kayiran, N. Chidambaram Nachiappan, A. K. Mishra, M. T. Kandemir, O. Mutlu, R. Iyer, and C. R. Das, OWL: Cooperative thread array aware scheduling techniques for improving GPGPU performance, ACM SIGPLAN Notices, vol. 48, no. 4, pp. 395–406, 2013.
  • [78] A. Sethia, G. Dasika, M. Samadi, and S. Mahlke, APOGEE: Adaptive prefetching on GPUs for energy efficiency, in Proceedings of International Conference on Parallel Architectures and Compilation Techniques (PACT), 2013, pp. 73–82.
  • [77] J. Lee, N. B. Lakshminarayana, H. Kim, and R. Vuduc, Many-thread aware prefetching mechanisms for GPGPU applications, in Proceedings of IEEE/ACM International Symposium on Microarchitecture (MICRO), 2010, pp. 213–224.
  • [76] A. Jog, O. Kayiran, A. K. Mishra, M. T. Kandemir, O. Mutlu, R. Iyer, and C. R. Das, Orchestrated scheduling and prefetching for GPGPUs, in Proceedings of IEEE International Symposium on Computer Architecture (ISCA), 2013, pp. 332–343.
  • [75] M. Lee, S. Song, J. Moon, J. Kim, W. Seo, Y. Cho, and S. Ryu, Improving GPGPU resource utilization through alternative thread block scheduling, in Proceedings of IEEE International Symposium on High Performance Computer Architecture (HPCA), 2014, pp. 260–271.
  • [74] O. Kayiran, A. Jog, M. T. Kandemir, and C. R. Das, Neither more nor less: Optimizing thread-level parallelism for GPGPUs, in Proceedings of International Conference on Parallel Architectures and Compilation Techniques (PACT), 2013, pp. 157–166.
  • [73] J. T. Pawlowski, Hybrid memory cube (HMC), in Proceedings of IEEE Hot Chips Symposium (HCS), 2011, pp. 1–24.
  • [72] JEDEC Standard, High Bandwidth Memory (HBM) DRAM, JESD235, 2013.
  • [71] H. Wang, R. Singh, M. J. Schulte, and N. S. Kim, Memory scheduling towards high-throughput cooperative heterogeneous computing, in Proceedings of International Conference on Parallel Architectures and Compilation Techniques (PACT), 2014, pp. 331–341.
  • [70] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, Densely connected convolutional networks, in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4700–4708.
  • [69] K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556, 2014.
  • [67] Z. Jia et al., Dissecting the NVIDIA Volta GPU architecture via microbenchmarking, arXiv preprint arXiv:1804.06826, 2018.
  • [66] NVIDIA Tesla V100 GPU Architecture: The World’s Most Advanced Data Center GPU, NVIDIA Corporation, 2017.
  • [62] B. Kim, J. Chung, E. Lee, W. Jung, S. Lee, J. Choi, J. Park, M. Wi, S. Lee, and J. H. Ahn, MViD: Sparse matrix-vector multiplication in mobile DRAM for accelerating recurrent neural networks, IEEE Transactions on Computers, vol. 69, no. 7, pp. 955–967, 2020.
  • [60] K. Hsieh, E. Ebrahimi, G. Kim, N. Chatterjee, M. O’Connor, N. Vijaykumar, O. Mutlu, and S. W. Keckler, Transparent offloading and mapping (TOM): Enabling programmer-transparent near-data processing in GPU systems, ACM SIGARCH Computer Architecture News, vol. 44, pp. 204–216, 2016.
  • [5] J. Draper, J. Chame, M. Hall, C. Steele, T. Barrett, J. LaCoss, J. Granacki, J. Shin, C. Chen, C. W. Kang, I. Kim, and G. Daglikoca, The architecture of the DIVA processing-in-memory chip, in Proceedings of International Conference on Supercomputing (ICS), 2002, pp. 14–25.
  • [59] A. Nag and R. Balasubramonian, OrderLight: Lightweight memory-ordering primitive for efficient fine-grained PIM computations, in Proceedings of IEEE/ACM International Symposium on Microarchitecture (MICRO), 2021, pp. 298–310.
  • [58] A. Farmahini-Farahani, J. H. Ahn, K. Morrow, and N. S. Kim, NDA: Near-DRAM acceleration architecture leveraging commodity DRAM devices and standard memory modules, in Proceedings of IEEE International Symposium on High Performance Computer Architecture (HPCA), 2015, pp. 283–295.
  • [57] A. Pattnaik, X. Tang, A. Jog, O. Kayiran, A. K. Mishra, M. T. Kandemir, O. Mutlu, and C. R. Das, Scheduling techniques for GPU architectures with processing-in-memory capabilities, in Proceedings of International Conference on Parallel Architectures and Compilation Techniques (PACT), 2016, pp. 31–44.
  • [54] W. Jung, D. Jung, B. Kim, S. Lee, W. Rhee, and J. H. Ahn, Restructuring batch normalization to accelerate CNN training, in Proceedings of Machine Learning and Systems (MLSys), 2019, pp. 14–26.
  • [51] H. H. Saleh and E. E. Swartzlander, A floating-point fused dot-product unit, in Proceedings of International Conference on Computer Design (ICCD), 2008, pp. 427–431.
  • [50] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255.
  • [49] R. Nishino and S. H. C. Loomis, CuPy: A NumPy-compatible library for NVIDIA GPU calculations, in Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2017.
  • [48] M. O’Connor, N. Chatterjee, D. Lee, J. Wilson, A. Agrawal, S. W. Keckler, and W. J. Dally, Fine-grained DRAM: Energy-efficient DRAM for extreme bandwidth systems, in Proceedings of IEEE/ACM International Symposium on Microarchitecture (MICRO), 2017, pp. 41–54.
  • [47] R. Balasubramonian, A. B. Kahng, N. Muralimanohar, A. Shafiee, and V. Srinivas, CACTI 7: New tools for interconnect exploration in innovative off-chip memories, ACM Transactions on Architecture and Code Optimization (TACO), vol. 14, no. 2, pp. 1–25, 2017.
  • [46] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi, GPUWattch: Enabling energy optimizations in GPGPUs, ACM SIGARCH Computer Architecture News, vol. 41, no. 3, pp. 487–498, 2013.
  • [45] A. Bakhoda, G. L. Yuan, W. W. Fung, H. Wong, and T. M. Aamodt, Analyzing CUDA workloads using a detailed GPU simulator, in Proceedings of International Symposium on Performance Analysis of Systems and Software (ISPASS), 2009, pp. 163–174.
  • [44] S. Hong, P. J. Nair, B. Abali, A. Buyuktosunoglu, K.-H. Kim, and M. Healy, Attaché: Towards ideal memory compression by mitigating metadata bandwidth overheads, in Proceedings of IEEE/ACM International Symposium on Microarchitecture (MICRO), 2018, pp. 326–338.
  • [43] J. Ji, J. Li, S. Yan, B. Zhang, and Q. Tian, Super-bit locality sensitive hashing, in Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2012, pp. 108–116.
  • [42] M. S. Charikar, Similarity estimation techniques from rounding algorithms, in Proceedings of ACM Symposium on Theory of Computing (STOC), 2002, pp. 380–388.
  • [41] J. Lee, J. H. Ahn, and K. Choi, Buffered compares: Excavating the hidden parallelism inside DRAM architectures with lightweight logic, in Proceedings of Design, Automation & Test in Europe Conference & Exhibition (DATE), 2016, pp. 1243–1248.
  • [40] D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay, Neurocube: A programmable digital neuromorphic architecture with high-density 3D memory, ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 380–392, 2016.
  • [39] J. Lee, J. Lee, D. Han, J. Lee, G. Park, and H.-J. Yoo, 7.7 LNPU: A 25.3 TFLOPS/W sparse deep-neural-network learning processor with fine-grained mixed precision of FP8-FP16, in Proceedings of International Solid-State Circuits Conference (ISSCC), 2019, pp. 142–144.
  • [38] D. Kim, J. Ahn, and S. Yoo, ZeNA: Zero-aware neural network accelerator, IEEE Design & Test, vol. 35, no. 1, pp. 39–46, 2017.
  • [37] A. Yasoubi, R. Hojabr, and M. Modarressi, Power-efficient accelerator design for neural networks using computation reuse, IEEE Computer Architecture Letters, vol. 16, no. 1, pp. 72–75, 2016.
  • [36] M. Imani, D. Peroni, Y. Kim, A. Rahimi, and T. Rosing, Efficient neural network acceleration on GPGPU using content addressable memory, in Proceedings of Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017, pp. 1026–1031.
  • [35] M. S. Razlighi, M. Imani, F. Koushanfar, and T. Rosing, LookNN: Neural network with no multiplication, in Proceedings of Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017, pp. 1775–1780.
  • [34] L. Mocerino, V. Tenace, and A. Calimera, Energy-efficient convolutional neural networks via recurrent data reuse, in Proceedings of Design, Automation & Test in Europe Conference & Exhibition (DATE), 2019, pp. 848–853.
  • [33] X. Jiao, V. Akhlaghi, Y. Jiang, and R. K. Gupta, Energy-efficient neural networks using approximate computation reuse, in Proceedings of Design, Automation & Test in Europe Conference & Exhibition (DATE), 2018, pp. 1223–1228.
  • [32] I. Qiqieh, R. Shafik, G. Tarawneh, D. Sokolov, S. Das, and A. Yakovlev, Significance-driven logic compression for energy-efficient multiplier design, IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 8, no. 3, pp. 417–430, 2018.
  • [31] M. S. Ansari, H. Jiang, B. F. Cockburn, and J. Han, Low-power approximate multipliers using encoded partial products and approximate compressors, IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 8, no. 3, pp. 404–416, 2018.
  • [30] M. Imani, D. Peroni, and T. Rosing, CFPU: Configurable floating point multiplier for energy-efficient computing, in Proceedings of Design Automation Conference (DAC), 2017, pp. 1–6.
  • [2] Y. LeCun, K. Kavukcuoglu, and C. Farabet, Convolutional networks and applications in vision, in Proceedings of IEEE International Symposium on Circuits and Systems (ISCAS), 2010, pp. 253–256.
  • [29] C. Liu, J. Han, and F. Lombardi, A low-power, high-performance approximate multiplier with configurable partial error recovery, in Proceedings of Design, Automation & Test in Europe Conference & Exhibition (DATE), 2014, pp. 1–4.
  • [28] A. Ghofrani, A. Rahimi, M. A. Lastras-Montaño, L. Benini, R. K. Gupta, and K.-T. Cheng, Associative memristive memory for approximate computing in GPUs, IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 6, no. 2, pp. 222–234, 2016.
  • [27] V. K. Chippa, D. Mohapatra, K. Roy, S. T. Chakradhar, and A. Raghunathan, Scalable effort hardware design, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 22, no. 9, pp. 2004–2016, 2014.
  • [26] P. K. Krause and I. Polian, Adaptive voltage over-scaling for resilient applications, in Proceedings of Design, Automation & Test in Europe Conference & Exhibition (DATE), 2011, pp. 1–6.
  • [25] H. E. Yantır, A. M. Eltawil, and F. J. Kurdahi, A hybrid approximate computing approach for associative in-memory processors, IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 8, no. 4, pp. 758–769, 2018.
  • [23] A. Li, S. L. Song, W. Liu, X. Liu, A. Kumar, and H. Corporaal, Locality-aware CTA clustering for modern GPUs, ACM SIGARCH Computer Architecture News, vol. 45, pp. 297–311, 2017.
  • [22] NVIDIA GeForce GTX 980: Featuring Maxwell, the Most Advanced GPU Ever Made, White Paper, NVIDIA Corporation, 2014.
  • [20] M. F. Deering, S. A. Schlapp, and M. G. Lavelle, FBRAM: A new form of memory optimized for 3D graphics, in Proceedings of Computer Graphics and Interactive Techniques (SIGGRAPH), 1994, pp. 167–174.
  • [19] S. Lee, S. Kang, J. Lee, H. Kim, E. Lee, S. Seo, H. Yoon, S. Lee, K. Lim, H. Shin, J. Kim, Seongil O, A. Iyer, D. Wang, K. Sohn, and N. S. Kim, Hardware architecture and software stack for PIM based on commercial DRAM technology: Industrial product, in Proceedings of IEEE International Symposium on Computer Architecture (ISCA), 2021, pp. 43–56.
  • [16] C. Xie, S. L. Song, J. Wang, W. Zhang, and X. Fu, Processing-in-memory enabled graphics processors for 3D rendering, in Proceedings of IEEE International Symposium on High Performance Computer Architecture (HPCA), 2017, pp. 637–648.
  • [15] S. Aga, N. Jayasena, and M. Ignatowski, Co-ML: A case for collaborative ML acceleration using near-data processing, in Proceedings of International Symposium on Memory Systems (MEMSYS), 2019, pp. 506–517.
  • [14] Q. Deng, L. Jiang, Y. Zhang, M. Zhang, and J. Yang, DrAcc: A DRAM based accelerator for accurate CNN inference, in Proceedings of Design Automation Conference (DAC), 2018, pp. 1–6.
  • [13] M. He, C. Song, I. Kim, C. Jeong, S. Kim, I. Park, M. Thottethodi, and T. Vijaykumar, Newton: A DRAM-maker’s accelerator-in-memory (AiM) architecture for machine learning, in Proceedings of IEEE/ACM International Symposium on Microarchitecture (MICRO), 2020, pp. 372–385.
  • [12] H. Shin, D. Kim, E. Park, S. Park, Y. Park, and S. Yoo, McDRAM: Low latency and energy-efficient matrix computations in DRAM, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 11, pp. 2613–2622, 2018.
  • [11] M. Gao, G. Ayers, and C. Kozyrakis, Practical near-data processing for in-memory analytics frameworks, in Proceedings of International Conference on Parallel Architectures and Compilation Techniques (PACT), 2015, pp. 113–124.
  • [10] B. Hong, G. Kim, J. H. Ahn, Y. Kwon, H. Kim, and J. Kim, Accelerating linked-list traversal through near-data processing, in Proceedings of International Conference on Parallel Architectures and Compilation Techniques (PACT), 2016, pp. 113–124.
  • Thread cluster memory scheduling: Exploiting differences in memory access behavior
  • TETRIS: Scalable and efficient neural network acceleration with 3D memory
  • Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems
  • Prefetching techniques for near-memory throughput processors
  • Power modeling for GPU architectures using McPAT
  • Near data acceleration with concurrent host access
  • Modeling deep learning accelerator enabled GPUs
  • Mini-batch serialization: CNN training with inter-layer data reuse
  • Lightweight SIMT core designs for intelligent 3D stacked DRAM
  • Learning and transferring mid-level image representations using convolutional neural networks
  • In-place activated BatchNorm for memory-optimized training of DNNs
  • Gaussian YOLOv3: An accurate and fast object detector using localization uncertainty for autonomous driving
  • Evaluating fast algorithms for convolutional neural networks on FPGAs
  • Deep residual learning for image recognition
  • An FPGA design framework for CNN sparsification and acceleration
  • A scalable processing-in-memory accelerator for parallel graph processing