Ph.D. dissertation

Optimizing GPU-accelerated applications using workload scheduling and memory management

박정호, 2020
Topics
  • Applied Physics
  • Deep Learning
  • Heterogeneous computing
  • IPsec
  • Optimizations
  • Workload Scheduling
  • APU
  • CUDA
  • GPU
  • OpenCL

References

  • vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design
    pages 1–13
  • Inception-ResNet and the impact of residual connections on learning
    [2016]
  • cuDNN: Efficient primitives for deep learning
    [2014]
  • Linnan Wang, Jinmian Ye, Yiyang Zhao, Wei Wu, Ang Li, Shuaiwen Leon Song, Zenglin Xu, and Tim Kraska. Superneurons: Dynamic GPU memory management for training deep neural networks. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’18, pages 41–53, New York, NY, USA, 2018. ACM.
  • K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. 2014.
  • F. Seide and A. Agarwal. CNTK: Microsoft’s open-source deep-learning toolkit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, page 2135, New York, NY, USA, 2016. ACM.
  • Advanced Micro Devices, Inc. HIP: C++ heterogeneous-compute interface for portability. Website, 2017. http://gpuopen.com/compute-product/hip-convert-cuda-to-portable-c-code/
  • AMD. “AMD PowerNow”. http://www.amd.com/usen/Processors/ProductInformation/0,,30_118_10220_10221%5E964,00.html
  • Ashutosh Pattnaik, Xulong Tang, Adwait Jog, Onur Kayiran, Asit K. Mishra, Mahmut T. Kandemir, Onur Mutlu, and Chita R. Das. Scheduling techniques for GPU architectures with processing-in-memory capabilities. In Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, PACT ’16, pages 31–44, New York, NY, USA, 2016. ACM.
  • Chen Li, Rachata Ausavarungnirun, Christopher J. Rossbach, Youtao Zhang, Onur Mutlu, Yang Guo, and Jun Yang. A framework for memory oversubscription management in graphics processing units. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’19, pages 49–63, New York, NY, USA, 2019. ACM.
  • David B. Kirk and Wen-mei W. Hwu. Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1st edition, 2010.
  • RFC 3784, Intermediate System to Intermediate System (IS-IS) Extensions for Traffic Engineering (TE), IETF, 2004
  • Khronos Group. OpenCL 2.0 Specification. Khronos Group, November 2013.
  • Heterogeneous System Architecture. Website. http://www.hsafoundation.com
  • Helion Technology Limited. IPsec solutions. Website. http://www.heliontech.com/ipsec.htm
  • DPDK: Data Plane Development Kit. Website. http://www.dpdk.org
  • APUs: Accelerated Processing Units. Website. http://www.amd.com/en-us/innovations/software-technologies/apu/
  • Andreas Dandalis and Viktor K. Prasanna. An Adaptive Cryptographic Engine for Internet Protocol Security Architectures. Volume 9, pages 333–353, July 2004.
  • VAST: The illusion of a large memory space for GPUs
    pages 443–454
  • Using Intel AES New Instructions and PCLMULQDQ to Significantly Improve IPSec Performance on Linux
    [2010]
  • Training deep nets with sublinear memory cost
    [2016]
  • Towards High-performance IPsec on Cavium OCTEON Platform
    ’10, pages 37–46, [2011]
  • Torch7: A MATLAB-like environment for machine learning
    [2011]
  • Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel,
  • Theano: A Python framework for fast computation of mathematical expressions.
  • Supporting x86-64 address translation for 100s of GPU lanes.
  • Security Architecture for the Internet Protocol.
    RFC 4301, updated by RFC 3168 [1998]
  • SSLShader: Cheap SSL Acceleration with Commodity Processors.
    ’11, pages 1–14 [2011]
  • RSVM: A region-based software virtual memory for GPU
    [2013]
  • Practical recommendations for gradient-based training of deep architectures. In Neural Networks: Tricks of the Trade
    pages 437–478 [2012]
  • PacketShader: A GPU-accelerated Software Router
    ’10, pages 195–206, [2010]
  • OverFeat: Integrated recognition, localization and detection using convolutional networks
  • Optimizing the use of GPU memory in applications with large data sets
    pages 408–418
  • On-the-fly elimination of dynamic irregularities for GPU computing
    ACM [2011]
  • On-line learning in neural networks. Chapter: On-line Learning and Stochastic Approximations
    pages 9–42 [1998]
  • On large-batch training for deep learning: Generalization gap and sharp minima
    [2016]
  • NBA (Network Balancing Act): A High-performance Packet Processing Framework for Heterogeneous Processors
    ’15, pages 22:1–22:14, [2015]
  • Natural language processing (almost) from scratch.
  • NVIDIA cuBLAS Library User Guide
    [2017]
  • MIDeA: A Multi-parallel Intrusion Detection Architecture
  • Kargus: A Highly-scalable Software-based Intrusion Detection System
    ’12, pages 317–328, [2012]
  • Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane, R.
  • Interplay between hardware prefetcher and page eviction policy in CPU-GPU unified virtual memory
    ACM [2019]
  • Integrated Cryptographic and Compression Accelerators on Intel Architecture Platforms
  • Improving GPU performance prediction with data transfer modeling
    pages 1097–1106
  • ImageNet Classification with Deep Convolutional Neural Networks
    ’12, pages 1097–1105, USA [2012]
  • ImageNet: A Large-scale Hierarchical Image Database
    [2009]
  • ImageNet Large Scale Visual Recognition Challenge
    115(3):211–252
  • IP Routing Processing with Graphic Processors. In Proceedings of the Conference on Design, Automation and Test in Europe
    DATE ’10, pages 93–98, Leuven, Belgium [2010]
  • IP Lookup on GPU-based Software Routers
    ’10, pages 429–430, [2010]
  • High-Speed FPGA Implementation of Secure Hash Algorithm for IPSec and VPN Applications
    37(2):179–195 [2006]
  • Handwritten digit recognition: applications of neural network chips and automatic learning.
    27(11):41–46
  • GPUswap: Enabling oversubscription of GPU memory through transparent swapping
    ACM [2015]
  • GPUdmm: A high-performance and memory-oblivious GPU architecture using dynamic memory management.
    [2014]
  • GPU resource sharing and virtualization on high performance computing systems
    pages 733–742 [2011]
  • Going Deeper with Convolutions
    [2015]
  • Gnort: High Performance Network Intrusion Detection Using Graphics Processors
    ’08, pages 116–134, [2008]
  • GASPP: A GPU-accelerated Stateful Packet Processing Framework
    ’14, pages 321–332 [2014]
  • GAMT: A Fast and Scalable IP Lookup Engine for GPU-based Software Routers
    ’13, pages 1–12 [2013]
  • Fine-grained resource sharing for concurrent GPGPU kernels. In Proceedings of the 4th USENIX Conference on Hot Topics in Parallelism
    HotPar ’12, page 10 [2012]
  • Experimental Testing of the Gigabit IPSec-Compliant Implementations of Rijndael and Triple DES Using SLAAC-1V FPGA Accelerator Board
    ’01, pages 220–234, [2001]
  • Efficient Software Architecture for IPSec Acceleration Using a Programmable Security Processor
    DATE ’08, pages 1148–1153 [2008]
  • Dynamic warp formation and scheduling for efficient GPU control flow
    [2007]
  • Dynamic load balancing on single- and multi-GPU systems
    pages 1–12
  • Design and Implementation of High Performance IPSec Applications with MultiCore Processors. In Proceedings of the 2008 International Seminar on Future Information Technology and Management Engineering
    FITME ’08, pages 595–598 [2008]
  • Deep Residual Learning for Image Recognition
    pages 770–778 [2015]
  • Convolutional neural networks for speech recognition
    22(10):1533–1545
  • Computers and Intractability; A Guide to the Theory of NP-Completeness
    [1990]
  • Caffe: Convolutional Architecture for Fast Feature Embedding
    [2014]
  • CUDA C Programming Guide
  • Bounds on Multiprocessing Timing Anomalies
    17(2):416–429 [1969]
  • Beyond Moore’s law: Internet growth trends
    33(1):117–119
  • Automatic CPU-GPU communication management and optimization. In Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation
    PLDI ’11, pages 142–151, ACM [2011]
  • Automatic GPU memory management for large neural models in TensorFlow
  • Architectural support for address translation on GPUs: Designing memory management units for CPU/GPUs with unified address spaces
    ACM [2014]
  • Architectural Design Features of a Programmable High Throughput AES Coprocessor.
    ITCC ’04, Volume 2, pages 498–, [2004]
  • APUNet: Revitalizing GPU as packet processing accelerator. In 14th USENIX Symposium on Networked Systems Design and Implementation
    (NSDI 17), pages 83–96
  • An approximate optimal solution to gpu workload scheduling
    20(5):63–76, [2018]
  • An accurate GPU performance model for effective control flow divergence optimization
    35(7):1165–1178
  • Adaptive heterogeneous scheduling for integrated GPUs
    ACM [2014]
  • ActivePointers: A case for software address translation on GPUs
    pages 596–608
  • ASIC design of IPSec hardware accelerator for network security
    pages 168–171
  • A user mode CPU-GPU scheduling framework for hybrid workloads
    63(C):25–36
  • A survey of homogeneous and heterogeneous system architectures in high performance computing
    pages 170–175, 11 [2016]
  • A framework for efficient and scalable execution of domain-specific templates on GPUs
    pages 1–12
  • A Performance Model for GPUs with Caches. Parallel and Distributed Systems
    26(7):1800–1813