Optimizing GPU-accelerated applications using workload scheduling and memory management

박정호, 2020

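Thesis details

- Author: 박정호
- Alternative title: 워크로드 스케줄링과 메모리 관리를 이용한 GPU 가속 응용프로그램 최적화에 대한 연구 (A study on optimizing GPU-accelerated applications using workload scheduling and memory management)
- Physical description: Illustrations, tables; 26 cm; ix, 119 leaves
- General note: Includes bibliographical references
- Thesis note: Doctoral thesis, Department of Electrical and Computer Engineering, Graduate School, Seoul National University, February 2020
- DDC: 621.3, 22
- Place of publication: Seoul
- Language: eng
- Year of publication: 2020
- Publisher: Graduate School, Seoul National University
- Keywords: Deep Learning, Heterogeneous computing, IPsec, Optimizations, Workload Scheduling, APU, CUDA, GPU, OpenCL
- References (87)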

Paper influence by topic for 'Optimizing GPU-accelerated applications using workload scheduling and memory management'

Paper influence summary
Topics: Applied physics, Deep Learning, Heterogeneous computing, IPsec, Optimizations, Workload Scheduling, apu, cuda, gpu, opencl
Total papers on the same topics	Total citations received	Average influence by topic
6,727	0	0.0%

Paper influence
Topic		Papers per topic	Influence per topic
Subject classification (KDC/DDC)	Applied physics	4,649	0.0%
Keyword	Deep Learning	1,788	0.0%
	Heterogeneous computing	3	0.0%
	IPsec	4	0.0%
	Optimizations	2	0.0%
	Workload Scheduling	1	0.0%
	apu	3	0.0%
	cuda	87	0.0%
	gpu	146	0.0%
	opencl	44	0.0%
Total		6,727	0.0%
* Number of citations received from papers with other keywords

References of 'Optimizing GPU-accelerated applications using workload scheduling and memory management'

vdnn: Virtualized deep neural networks for scalable, memory-efficient neural network design
pages 1–13
inception-resnet and the impact of residual connections on learning
[2016]
cudnn: Efficient primitives for deep learning
[2014]
[84] Linnan Wang, Jinmian Ye, Yiyang Zhao, Wei Wu, Ang Li, Shuaiwen Leon Song, Zenglin Xu, and Tim Kraska. Superneurons: Dynamic gpu memory management for training deep neural networks. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’18, pages 41–53, New York, NY, USA, 2018. ACM.
[74] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition, 2014.
[70] F. Seide and A. Agarwal. Cntk: Microsoft’s open-source deep-learning toolkit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pages 2135–2135, New York, NY, USA, 2016. ACM.
[6] Advanced Micro Devices, Inc. Hip: C++ heterogeneous-compute interface for portability. Website, 2017. http://gpuopen.com/compute-product/hip-convert-cuda-to-portable-c-code/.
[67] AMD. “AMD PowerNow”, http://www.amd.com/usen/Processors/ProductInformation/0,,30_118_10220_10221%5E964,00.html.
[63] Ashutosh Pattnaik, Xulong Tang, Adwait Jog, Onur Kayiran, Asit K. Mishra, Mahmut T. Kandemir, Onur Mutlu, and Chita R. Das. Scheduling techniques for gpu architectures with processing-in-memory capabilities. In Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, PACT ’16, pages 31–44, New York, NY, USA, 2016. ACM.
[54] Chen Li, Rachata Ausavarungnirun, Christopher J. Rossbach, Youtao Zhang, Onur Mutlu, Yang Guo, and Jun Yang. A framework for memory oversubscription management in graphics processing units. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’19, pages 49–63, New York, NY, USA, 2019. ACM.
[50] David B. Kirk and Wen-mei W. Hwu. Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1st edition, 2010.
[47] RFC 3784, Intermediate System to Intermediate System (IS-IS) Extensions for Traffic Engineering (TE), IETF, 2004
[47] Khronos Group. OpenCL 2.0 Specification. Khronos Group, November 2013.
[3] Heterogeneous System Architecture. Website. http://www.hsafoundation.com.
[33] Helion Technology Limited. IPsec solutions. Website. http://www.heliontech.com/ipsec.htm.
[2] DPDK: Data Plane Development Kit. Website. http://www.dpdk.org.
[1] APUs-Accelerated Processing Units. Website. http://www.amd.com/en-us/innovations/software-technologies/apu/.
[19] Andreas Dandalis and Viktor K. Prasanna. An Adaptive Cryptographic Engine for Internet Protocol Security Architectures. volume 9, pages 333–353, July 2004.
Vast: The illusion of a large memory space for gpus
pages 443–454
Using Intel AES New Instructions and PCLMULQDQ to Significantly Improve IPSec Performance on Linux
[2010]
Training deep nets with sublinear memory cost
[2016]
Towards High-performance IPsec on Cavium OCTEON Platform
10, pages 37–46, [2011]
Torch7: A matlab-like environment for machine learning
[2011]
Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel,
Theano: A Python framework for fast computation of mathematical expressions.
Supporting x86-64 address translation for 100s of gpu lanes.
Security Architecture for the Internet Protocol.
RFC 4301, updated by RFC 3168 [1998]
SSLShader: Cheap SSL Acceleration with Commodity Processors.
11, pages 1–14 [2011]
Rsvm: A region-based software virtual memory for gpu
[2013]
Practical recommendations for gradient-based training of deep architectures. In Neural networks: Tricks of the trade
pages 437–478 [2012]
PacketShader: A GPU-accelerated Software Router
10, pages 195–206, [2010]
Overfeat: Integrated recognition, localization and detection using convolutional networks
Optimizing the use of gpu memory in applications with large data sets
pages 408–418
On-the-fly elimination of dynamic irregularities for gpu computing
ACM [2011]
On-line learning in neural networks. chapter On-line Learning and Stochastic Approximations
pages 9–42 [1998]
On large-batch training for deep learning: Generalization gap and sharp minima
[2016]
NBA (Network Balancing Act): A High-performance Packet Processing Framework for Heterogeneous Processors
15, pages 22:1–22:14, [2015]
Natural language processing (almost) from scratch.
NVIDIA cuBLAS Library User Guide
[2017]
MIDeA: A Multi-parallel Intrusion Detection Architecture
Kargus: A Highly-scalable Software-based Intrusion Detection System
12, pages 317–328, [2012]
Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane, R.
Interplay between hardware prefetcher and page eviction policy in cpu-gpu unified virtual memory
ACM [2019]
Integrated Cryptographic and Compression Accelerators on Intel Architecture Platforms
Improving gpu performance prediction with data transfer modeling
pages 1097–1106
ImageNet classification with deep convolutional neural networks
12, pages 1097–1105, USA [2012]
ImageNet: A Large-scale Hierarchical Image Database
[2009]
Imagenet large scale visual recognition challenge
115(3):211–252,
IP Routing Processing with Graphic Processors. In Proceedings of the Conference on Design, Automation and Test in Europe
DATE ’10, pages 93–98, Leuven, Belgium [2010]
IP Lookup on GPU-based Software Routers
10, pages 429–430, [2010]
High-Speed FPGA Implementation of Secure Hash Algorithm for IPSec and VPN Applications
37(2):179–195 [2006]
Handwritten digit recognition: applications of neural network chips and automatic learning.
27(11):41–46
Gpuswap: Enabling oversubscription of gpu memory through transparent swapping
ACM [2015]
Gpudmm: A high-performance and memory-oblivious gpu architecture using dynamic memory management.
[2014]
Gpu resource sharing and virtualization on high performance computing systems
pages 733–742 [2011]
Going Deeper with Convolutions
[2015]
Gnort: High Performance Network Intrusion Detection Using Graphics Processors
’08, pages 116–134, [2008]
GASPP: A GPU-accelerated Stateful Packet Processing Framework
14, pages 321–332 [2014]
GAMT: A Fast and Scalable IP Lookup Engine for GPU-based Software Routers
13, pages 1–12 [2013]
Fine-grained resource sharing for concurrent gpgpu kernels. In Proceedings of the 4th USENIX Conference on Hot Topics in Parallelism
HotPar ’12, pages 10–10 [2012]
Experimental Testing of the Gigabit IPSec-Compliant Implementations of Rijndael and Triple DES Using SLAAC-1V FPGA Accelerator Board
01, pages 220–234, [2001]
Efficient Software Architecture for IPSec Acceleration Using a Programmable Security Processor
DATE ’08, pages 1148–1153 [2008]
Dynamic warp formation and scheduling for efficient gpu control flow
[2007]
Dynamic load balancing on single- and multi-gpu systems
pages 1–12
Design and Implementation of High Performance IPSec Applications with MultiCore Processors. In Proceedings of the 2008 International Seminar on Future Information Technology and Management Engineering
FITME ’08, pages 595–598 [2008]
Deep residual learning for image recognition
pages 770–778
Deep Residual Learning for Image Recognition
pages 770–778 [2015]
Convolutional neural networks for speech recognition
22(10):1533–1545
Computers and Intractability; A Guide to the Theory of NP-Completeness
[1990]
Caffe: Convolutional Architecture for Fast Feature Embedding
[2014]
CUDA C Programming Guide
Bounds on Multiprocessing Timing Anomalies
17(2):416–429 [1969]
Beyond Moore’s law: Internet growth trends
33(1):117–119
Automatic cpu-gpu communication management and optimization. In Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation
PLDI ’11, pages 142–151, ACM [2011]
Automatic gpu memory management for large neural models in tensorflow
Architectural support for address translation on gpus: Designing memory management units for cpu/gpus with unified address spaces
ACM [2014]
Architectural Design Features of a Programmable High Throughput AES Coprocessor.
04) Volume 2 - Volume 2, ITCC ’04, pages 498–, [2004]
Apunet: Revitalizing GPU as packet processing accelerator. In 14th USENIX Symposium on Networked Systems Design and Implementation
(NSDI 17), pages 83–96
An approximate optimal solution to gpu workload scheduling
20(5):63–76, [2018]
An accurate gpu performance model for effective control flow divergence optimization
35(7):1165–1178
Adaptive heterogeneous scheduling for integrated gpus
ACM [2014]
Activepointers: A case for software address translation on gpus
pages 596–608,
ASIC design of IPSec hardware accelerator for network security
pages 168–171
A user mode cpu-gpu scheduling framework for hybrid workloads
63(C):25–36
A survey of homogeneous and heterogeneous system architectures in high performance computing
pages 170–175, 11 [2016]
A framework for efficient and scalable execution of domain-specific templates on gpus
pages 1–12
A Performance Model for GPUs with Caches. Parallel and Distributed Systems
26(7):1800–1813,
