Project Details
Performance, Portability, and Productivity for Deep Learning Applications on Multi- and Many-Core Architectures (PPP-DL)
Applicant
Professor Dr. Sergei Gorlatch
Subject Area
Computer Architecture, Embedded and Massively Parallel Systems
Term
since 2022
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 470527619
Deep Learning (DL) is currently the most popular machine-learning method, solving a great variety of real-world problems in academia and industry. The success of DL applications critically depends on the quality of the software that implements DL algorithms for modern parallel architectures such as multi-core CPUs, Graphics Processing Units (GPUs), and Field-Programmable Gate Arrays (FPGAs). State-of-the-art DL frameworks like TensorFlow and PyTorch rely for high performance on general-purpose libraries provided by vendors such as Intel or NVIDIA, which causes major weaknesses regarding three fundamental aspects:

i) suboptimal performance: many DL-specific optimizations are not applicable because the libraries are geared toward general-purpose usage;

ii) lack of both functional and performance portability: the libraries are designed and optimized for the architectures of particular vendors only;

iii) restricted user productivity: the libraries are limited to a fixed set of pre-implemented algorithms (e.g., matrix multiplication and convolutions), and integrating high-performance libraries into DL frameworks is cumbersome.

This project will develop a novel, holistic approach to automatic code generation and optimization for DL applications targeting modern parallel architectures. Its overall goal is to address, in one combined approach, three major research challenges in high-performance computing for DL: Performance, Portability, and Productivity (PPP). We plan to achieve this goal based on the following new contributions:

1) a new algebraic formalism and a formalism-based Domain-Specific Language (DSL) for conveniently expressing and implementing established and emerging DL applications at a high level of abstraction, thereby contributing to programmer productivity (see the first sketch below);

2) a uniform low-level programming model for DL applications that enables functional portability of code by being straightforwardly lowerable to executable code in state-of-practice parallel programming approaches such as OpenMP, CUDA, and OpenCL;

3) a code generation mechanism for our DSL that enables high, portable performance across various architectures and input/output characteristics by automatically generating auto-tunable code in our low-level programming model;

4) a systematic process, based on the emerging MLIR framework, that integrates our code generation mechanism into modern DL frameworks;

5) a new auto-tuning system that fully automatically optimizes our generated code via combined numerical search techniques;

6) a new analytical cost model that predicts, for different architectures, the run time of DL applications expressed in our DSL, in order to accelerate the auto-tuning process (see the second sketch below).

We will experimentally compare our approach, in terms of all three aspects of performance, portability, and productivity, to state-of-the-art approaches for a broad range of DL applications, parallel architectures, and real-world DL data sets.
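The project's DSL (contribution 1) is itself a planned result and is not specified here; the following is only a minimal, hypothetical Python sketch of the underlying algebraic idea, in which a DL computation such as matrix multiplication is expressed as a multi-dimensional map (independent scalar work) combined with a reduce (an associative combine). All names (md_map, md_reduce, matmul) are illustrative assumptions, not the project's actual formalism or DSL.

```python
import numpy as np

def md_map(f, index_space):
    """Apply f independently at every point of a multi-dimensional index space."""
    return {p: f(*p) for p in index_space}

def md_reduce(op, values, group_by):
    """Combine values with an associative operator op, grouped by group_by."""
    out = {}
    for p, v in values.items():
        k = group_by(*p)
        out[k] = op(out[k], v) if k in out else v
    return out

def matmul(A, B):
    I, K = A.shape
    K2, J = B.shape
    assert K == K2
    space = [(i, j, k) for i in range(I) for j in range(J) for k in range(K)]
    # Scalar work: one multiplication per point (i, j, k).
    products = md_map(lambda i, j, k: A[i, k] * B[k, j], space)
    # Associative combine: sum over k for each output point (i, j).
    sums = md_reduce(lambda x, y: x + y, products,
                     group_by=lambda i, j, k: (i, j))
    C = np.empty((I, J))
    for (i, j), v in sums.items():
        C[i, j] = v
    return C

A, B = np.random.rand(4, 5), np.random.rand(5, 3)
assert np.allclose(matmul(A, B), A @ B)
```

Because map and reduce expose the full iteration space and the associativity of the combine operator, a code generator is free to tile, reorder, and parallelize the computation, which is what makes the generated code auto-tunable in the first place.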
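Contributions 5) and 6) concern auto-tuning guided by a cost model. As a second minimal, hypothetical illustration (not the project's actual auto-tuner, which will combine numerical search techniques), the sketch below searches the tile sizes of a blocked matrix multiplication and keeps the configuration with the lowest measured run time; the fits_cache function is an assumed, deliberately crude stand-in for an analytical cost model that prunes the search space before any measurement.

```python
import itertools
import time
import numpy as np

def tiled_matmul(A, B, ti, tj, tk):
    """Blocked matrix multiplication; the tile sizes (ti, tj, tk)
    are the tuning parameters of this illustrative kernel."""
    I, K = A.shape
    _, J = B.shape
    C = np.zeros((I, J))
    for i in range(0, I, ti):
        for j in range(0, J, tj):
            for k in range(0, K, tk):
                C[i:i+ti, j:j+tj] += A[i:i+ti, k:k+tk] @ B[k:k+tk, j:j+tj]
    return C

def fits_cache(ti, tj, tk, itemsize=8, budget=256 * 1024):
    """Crude analytical cost model (an illustrative assumption):
    keep only configurations whose three active tiles fit a cache budget."""
    return (ti * tk + tk * tj + ti * tj) * itemsize <= budget

def measure(f):
    """Wall-clock time of one invocation of f."""
    start = time.perf_counter()
    f()
    return time.perf_counter() - start

def auto_tune(A, B, candidates, trials=2):
    """Search the tuning-parameter space, keeping the configuration
    with the lowest measured run time."""
    best_cfg, best_t = None, float("inf")
    for cfg in candidates:
        t = min(measure(lambda: tiled_matmul(A, B, *cfg)) for _ in range(trials))
        if t < best_t:
            best_cfg, best_t = cfg, t
    return best_cfg, best_t

n = 512
A, B = np.random.rand(n, n), np.random.rand(n, n)
tiles = [16, 32, 64, 128]
# The cost model prunes candidates before any expensive measurement.
candidates = [c for c in itertools.product(tiles, repeat=3) if fits_cache(*c)]
cfg, t = auto_tune(A, B, candidates)
print(f"best tile sizes {cfg}: {t * 1e3:.1f} ms")
```

The division of labor shown here mirrors the project's plan at a toy scale: an analytical model cheaply discards unpromising configurations so that costly empirical measurements are spent only on the remaining candidates.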
DFG Programme
Research Grants
International Connection
France, United Kingdom, USA