Doctoral dissertation

High level synthesis of OpenCL kernels for FPGAs

조강원, 2020
Thesis details
Paper influence by topic for 'High level synthesis of OpenCL kernels for FPGAs'
Paper influence summary
Topics
  • Applied physics
  • Datapath
  • High Level Synthesis
  • Memory Access Pattern
  • Work-Item Pipelining
  • FPGA
  • OpenCL
Total papers on the same topics: 4,949
Total citations received: 0
Average paper influence by topic: 0.0%

References of 'High level synthesis of OpenCL kernels for FPGAs'

  • Exploiting memory access patterns to improve memory performance in data-parallel architectures. 22(1):105–118, 2011.
  • [8] M. Bauer, H. Cook, and B. Khailany. CudaDMA. http://lightsighter.github.io/CudaDMA/.
  • [80] Xilinx. Virtex UltraScale+. https://www.xilinx.com/products/silicon-devices/fpga/virtex-ultrascale-plus.html.
  • [78] M. Weinhardt and W. Luk. Pipeline vectorization. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 20(2):234–248, 2001.
  • [77] P. Tu and D. Padua. Gated SSA-based demand-driven symbolic analysis for parallelizing compilers. In Proceedings of the 9th International Conference on Supercomputing, pages 414–423, 1995.
  • [6] AMD. AMD APP SDK OpenCL optimization guide. http://amddev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_OpenCL_Programming_Optimization_Guide2.pdf, 2015.
  • [69] K. Shagrithaya, K. Kepa, and P. Athanas. Enabling development of OpenCL applications on FPGA platforms. In Proceedings of the 2013 IEEE 24th International Conference on Application-Specific Systems, Architectures and Processors, pages 26–30, 2013.
  • [68] Seoul National University. SnuCL suite: OpenCL frameworks and tools for heterogeneous clusters. http://snucl.snu.ac.kr.
  • [65] A. Putnam, A. M. Caulfield, E. S. Chung, D. Chiou, K. Constantinides, J. Demme, H. Esmaeilzadeh, J. Fowers, G. P. Gopal, J. Gray, M. Haselman, S. Hauck, S. Heil, A. Hormati, J.-Y. Kim, S. Lanka, J. Larus, E. Peterson, S. Pope, A. Smith, J. Thong, P. Y. Xiao, and D. Burger. A reconfigurable fabric for accelerating large-scale datacenter services. In Proceedings of the 41st Annual International Symposium on Computer Architecture, pages 13–24, 2014.
  • [60] K. Ovtcharov, O. Ruwase, J.-Y. Kim, J. Fowers, K. Strauss, and E. S. Chung. Accelerating deep convolutional neural networks using specialized hardware. Technical report, Microsoft Research, 2015. https://www.microsoft.com/en-us/research/publication/accelerating-deep-convolutional-neural-networks-using-specialized-hardware/.
  • [5] Amazon. Amazon EC2 F1 instances. https://aws.amazon.com/ec2/instance-types/f1/.
  • [58] NVIDIA. CUDA C best practices guide. http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/, 2015.
  • [56] S. S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann Publishers, 1997.
  • [51] Khronos OpenCL Working Group. The OpenCL specification, version 2.1. https://www.khronos.org/registry/OpenCL/specs/opencl-2.1.pdf, 2015.
  • [50] Khronos OpenCL Working Group. The OpenCL specification, version 1.2. https://www.khronos.org/registry/OpenCL/specs/opencl-1.2.pdf, 2012.
  • [4] OpenACC directives for accelerators. https://www.openacc.org/.
  • [49] Khronos Group. SPIR generator/Clang. https://github.com/KhronosGroup/SPIR.
  • [45] H. M. Jacobson, P. N. Kudva, P. Bose, P. W. Cook, S. E. Schuster, E. G. Mercer, and C. J. Myers. Synchronous interlocked pipelines. In Proceedings of the Eighth International Symposium on Asynchronous Circuits and Systems, pages 3–12, 2002.
  • [44] Intel. Avalon interface specifications. https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/manual/mnl_avalon_spec.pdf, 2019.
  • [43] Intel. Open programmable acceleration engine documentation. https://opae.github.io/.
  • [41] Intel. Intel Quartus Prime. https://www.intel.com/content/www/us/en/software/programmable/quartus-prime/overview.html.
  • [3] MIOpen. https://github.com/ROCmSoftwarePlatform/MIOpen.
  • [39] M. R. Haghighat and C. D. Polychronopoulos. Symbolic analysis for parallelizing compilers. ACM Transactions on Programming Languages and Systems, 18:477–518, 1996.
  • [35] T. Grosser, A. Groesslinger, and C. Lengauer. Polly – performing polyhedral optimizations on a low-level intermediate representation. Parallel Processing Letters, 22(4), 2012.
  • [34] S. Grauer-Gray and L.-N. Pouchet. PolyBench/GPU. http://web.cse.ohio-state.edu/~pouchet.2/software/polybench/GPU/.
  • [2] clTorch. https://github.com/hughperkins/cltorch.
  • [22] Standard Performance Evaluation Corporation. SPEC ACCEL. https://www.spec.org/accel/.
  • [1] clFFT. https://clmathlibraries.github.io/clFFT/.
  • [19] S. Che, J. W. Sheaffer, and K. Skadron. Dymaxion: Optimizing memory access patterns for heterogeneous systems. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, 2011.
  • [16] J. M. P. Cardoso, P. C. Diniz, and M. Weinhardt. Compiling for reconfigurable computing: A survey. ACM Computing Surveys, 42(4):13:1–13:65, 2010.
  • [14] T. J. Callahan and J. Wawrzynek. Adapting software pipelining for reconfigurable computing. In Proceedings of the 2000 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, pages 57–64, 2000.
  • [13] T. J. Callahan and J. Wawrzynek. Instruction-level parallelism for reconfigurable computing. In Proceedings of the 8th International Workshop on Field-Programmable Logic and Applications, From FPGAs to Computing Paradigm, pages 248–257, 1998.
  • [12] T. J. Callahan, J. R. Hauser, and J. Wawrzynek. The Garp architecture and C compiler. IEEE Computer, 33(4):62–69, 2000.
  • [11] M. Budiu, G. Venkataramani, T. Chelcea, and S. C. Goldstein. Spatial computation. In Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 14–26, 2004.
  • Vivado design suite
  • Very long instruction word architectures and the ELI-512, pages 140–150, 1983.
  • Trace scheduling: A technique for global microcode compaction. C-30(7):478–490, 1981.
  • Theano: A Python framework for fast computation of mathematical expressions.
  • The program dependence web: A representation supporting control-, data-, and demand-driven interpretation of imperative languages, pages 257–271, 1990.
  • The OpenMP API specification for parallel programming
  • Synthesis of synchronous elastic architectures, pages 657–662, 2006.
  • Synthesis of platform architectures from OpenCL programs, pages 186–193, 2011.
  • Structural analysis: A new approach to flow analysis in optimizing compilers. 5(3–4):141–153, 1980.
  • SPARK: A high-level synthesis framework for applying parallelizing compiler transformations, pages 461–466, 2003.
  • SDAccel development environment
  • Rodinia: A benchmark suite for heterogeneous computing, pages 44–54, 2009.
  • Points-to analysis in almost linear time, pages 32–41, 1996.
  • Platform-based behavior-level and system-level synthesis, pages 199–202, 2006.
  • Performance and power of cache-based reconfigurable computing, pages 395–405, 2009.
  • Parboil: A revised benchmark suite for scientific and commercial throughput computing, 2012.
  • PIPSEA: A practical IPsec gateway on embedded APUs, pages 1255–1267, 2016.
  • Optimized generation of data-path from C codes for FPGAs, pages 112–117, 2005.
  • Optimization and architecture effects on GPU computing workload performance, 2012.
  • OpenRCL: Low-power high-performance computing with reconfigurable devices, pages 458–463, 2010.
  • OpenCL overview - the open standard for parallel programming of heterogeneous systems
  • OpenCL for FPGAs: Prototyping a compiler, 2012.
  • Memory access patterns: The missing piece of the multi-GPU puzzle, 2015.
  • MASA-OpenCL: Parallel pruned comparison of long DNA sequences with OpenCL. Concurrency and Computation: Practice and Experience, 31(11):e5039, 2019.
  • LegUp: High-level synthesis for FPGA-based processor/accelerator systems, pages 33–36, 2011.
  • J. D. Poznanovic, and M. B. Gokhale. Trident: An FPGA compiler framework for floating-point algorithms, pages 317–322, 2005.
  • Intel Stratix 10 FPGAs overview
  • Intel FPGA SDK for OpenCL
  • Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers, pages 364–373, 1990.
  • Impact of cache architecture and interface on performance and area of FPGA-based processor/parallel-accelerator systems, pages 17–24, 2012.
  • Impact of FPGA architecture on resource sharing in high-level synthesis, pages 111–114, 2012.
  • High-level Synthesis: Introduction to Chip and System Design, 1992.
  • GRAPHITE: Polyhedral analyses and optimizations for GCC, 2006.
  • From OpenCL to high-performance hardware on FPGAs, pages 531–534, 2012.
  • FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs, pages 35–42, 2009.
  • Efficiently computing static single assignment form and the control dependence graph. 13(4):451–490, 1991.
  • CudaDMA: Optimizing GPU memory bandwidth via warp specialization, 2011.
  • Creating HW/SW codesigned MPSoPC's from high level programming models, pages 554–560, 2011.
  • CUDASW++: optimizing Smith-Waterman sequence database searches for CUDA-enabled graphics processing units. 2, 2009.
  • Automatic OpenCL work-group size selection for multicore CPUs, pages 387–397, 2013.
  • An introduction to high-level synthesis. 26(4):8–17, 2009.
  • Achieving a single compute device image in OpenCL for multiple GPUs, pages 277–288, 2011.
  • APUNet: Revitalizing GPU as packet processing accelerator, pages 83–96, 2017.
  • A scalable high-bandwidth architecture for lossless compression on FPGAs, pages 52–59, 2015.
  • A practical automatic polyhedral parallelizer and locality optimizer, pages 101–113, 2008.
  • [57] K. G. Murty, 1983.
  • 3D finite difference computation on GPUs using CUDA, pages 79–84, 2009.