Parallel computing based on the general-purpose Graphics Processing Unit (GPU) provides massive many-core parallelism and can deliver staggering performance improvements over traditional single-core and existing general multi-core computing techniques. The recent introduction of general-purpose GPU (GPGPU) computing has attracted strong interest from the scientific community for tackling many computationally intensive problems. GPU computing power, however, has not been fully exploited for many important engineering computing problems in VLSI design practice. Simulation of massive global interconnects and of radio-frequency (RF) and millimeter-wave (MM) integrated circuits (ICs) at very high frequencies remains a difficult problem confronting chip designers. Designing new parallel and scalable computing algorithms that can unleash the potential of GPU-based parallel computing techniques is therefore highly desirable. This research seeks to investigate new parallel simulation approaches for solving massive interconnect circuits and analog/RF/MM integrated circuits based on a single general-purpose GPU or networked GPUs on a computer (GPU cluster). First, the PI will investigate new parallel simulation algorithms, based on analytic solutions, for structured interconnect circuits such as on-chip power delivery and clock distribution networks on a GPU or GPU cluster. Second, the PI proposes to develop a highly efficient numerical parallel simulation algorithm for analyzing general interconnects. The new algorithm will perform circuit-complexity reduction to improve efficiency, and the PI's team will investigate parallelizing the major computing steps of this method. Third, the PI plans to develop new parallel shooting-Newton methods for high-frequency (RF/MM) circuits. The new method will exploit structured Krylov subspaces and GPU-based parallelization to improve both the efficiency and the convergence of RF/MM integrated circuit simulation.
The outcome of this research will add significantly to the core knowledge of parallel numerical analysis of linear and nonlinear dynamic systems on GPU and GPU-cluster systems. By working with an industry partner, the PI expects to have an immediate impact on the design community, improving design productivity for nanometer VLSI systems. The research results will also help the electronic design automation (EDA) community gain more insight into exploiting current and future general-purpose GPUs for parallelizing entire EDA tools on GPUs and multicore systems. The interdisciplinary nature of the proposed research and the associated training will allow students to gain critical skills for the highly competitive high-tech job market. This grant will enable the PI to hire more female and underrepresented-minority students, further contributing to the diversity of America's science and technology workforce.
PI: Sheldon X.-D. Tan, Department of Electrical and Computer Engineering, UC Riverside
Project period: 09/1/2010 – 8/31/2013

The goal of this project is to investigate new parallel simulation approaches for solving large interconnect circuits and analog/RF/MM integrated circuits on GPU platforms. The project has the following major outcomes:

GLU -- a new GPU-accelerated sparse LU factorization solver for VLSI circuit analysis
First, we developed a novel GPU-accelerated sparse LU solver, called GLU. Our experimental data show that GLU is faster than all existing sparse LU solvers, including another recently published GPU-based sparse solver based on the left-looking method. GLU represents a significant new technical advance on this problem. The API library for the GLU solver is provided at www.ee.ucr.edu/~stan/project/glu/glu_proj.htm for free distribution and evaluation; source code may be released in the near future. See the attached table and figure for the comparison details.

GPU-accelerated GMRES solver for large dynamic linear network analysis
The new method is based on the preconditioned generalized minimum residual (GMRES) iterative method implemented on heterogeneous CPU-GPU platforms. Numerical results on the published IBM benchmark circuits and mesh-structured power grid networks show that the GPU-GMRES solver delivers orders-of-magnitude speedup over the direct LU solver UMFPACK. GPU-GMRES also delivers 3-10X speedup over a CPU implementation of the same GMRES method in transient analysis.

GPU-accelerated GMRES solver for thermal analysis of 3D circuits with liquid cooling
The new method starts from basic physics-based heat equations to model 3D-ICs with inter-tier liquid-cooling micro-channels and directly solves the resulting partial differential equations.
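The preconditioned GMRES iteration at the heart of both GPU-GMRES solvers above can be sketched on the CPU with SciPy. This is a minimal illustration only: the test matrix and the simple Jacobi (diagonal) preconditioner are stand-ins, not the PI's GPU implementation or the preconditioner actually used in the project.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Small SPD tridiagonal system standing in for a power-grid conductance matrix.
n = 100
A = sp.diags([-1.0, 4.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

# Jacobi (diagonal) preconditioner: M approximates A^{-1} by 1/diag(A).
diag = A.diagonal()
M = spla.LinearOperator((n, n), matvec=lambda v: v / diag)

# Preconditioned GMRES; info == 0 signals convergence.
x, info = spla.gmres(A, b, M=M)
print(info, np.linalg.norm(A @ x - b))
```

On a GPU, the expensive kernels inside each GMRES iteration (the sparse matrix-vector product and the preconditioner apply) are the parts that get parallelized, which is why SpMV performance matters so much for this class of solvers.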
Numerical results show that the proposed GPU-GMRES solver delivers orders-of-magnitude speedup over a parallel LU-based solver and up to 4X speedup over CPU-GMRES for both DC and transient thermal analyses on a number of thermal circuits and other published problems.

segSpMV -- a new fast GPU-accelerated sparse matrix-vector multiplication algorithm
Sparse matrix-vector multiplication (SpMV) is fundamental to many science and engineering problems. We developed a novel, efficient SpMV algorithm based on the general CSR (compressed sparse row) format. The new algorithm, called segSpMV, allows more regular memory access than existing methods because it regularizes the sparse data structures, and as a result it achieves higher performance. On a set of public matrix benchmarks, segSpMV consistently outperforms all published algorithms as well as the SpMV routine in the recent cuSPARSE library (CUDA 5.0).

GPU-accelerated envelope-following algorithm for switching power converter simulation
The new method first exploits the parallelism in the envelope-following method and parallelizes the Newton-update solve, the most computationally expensive part, on GPU platforms to boost simulation performance. It further applies the matrix-free Krylov-basis generation technique previously used for RF simulation. Results from several industrial examples show that the structured parallel shooting-Newton method on Nvidia Tesla GPUs leads to speedups of more than 20X over state-of-the-art implicit GMRES methods at the same accuracy on the CPU.

Parallel Monte Carlo statistical analysis of analog circuits using graph-based approaches on GPU platforms
We proposed a new parallel statistical analysis method for large analog circuits using a determinant decision diagram (DDD) based graph technique on GPU platforms.
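The segSpMV algorithm itself is not detailed in this summary; as background, the baseline CSR storage layout and SpMV kernel that it improves upon can be sketched as follows (a generic textbook illustration with a made-up 3x3 matrix, not the segSpMV code):

```python
import numpy as np

def csr_spmv(vals, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR format.

    vals/col_idx hold the nonzero values and their column indices, packed
    row by row; row_ptr[i]:row_ptr[i+1] delimits the nonzeros of row i.
    """
    n = len(row_ptr) - 1
    y = np.zeros(n)
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y

# CSR encoding of [[4, 0, 1], [0, 2, 0], [3, 0, 5]]
vals    = np.array([4.0, 1.0, 2.0, 3.0, 5.0])
col_idx = np.array([0, 2, 1, 0, 2])
row_ptr = np.array([0, 2, 3, 5])
x = np.array([1.0, 1.0, 1.0])
print(csr_spmv(vals, col_idx, row_ptr, x))  # [5. 2. 8.]
```

The irregularity that segSpMV targets is visible here: rows have different numbers of nonzeros, so a naive one-thread-per-row GPU mapping of this loop yields uneven work and scattered memory accesses.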
We designed novel data structures to represent the DDD graphs on the GPU, enabling fast memory access by massive numbers of parallel threads when computing the numerical values of the DDD graphs. Numerical results show that the new evaluation algorithm achieves about one to two orders of magnitude speedup over serial CPU-based evaluation and 2-3X speedup over a numerical SPICE-based simulation method on some large analog circuits.

The PI's group has published 11 conference papers, 3 journal papers, and 3 book chapters. Two Ph.D. students (Dr. Hai Wang and Dr. Xuexin Liu) received support from this award. On the educational side, the PI established the UCR CUDA Teaching Center in 2010, sponsored by Nvidia Corporation, CA, to advance the state of parallel education using CUDA C/C++. The CUDA Teaching Center comes with equipment donations, funding support for course development, course-material assistance, and software licenses from Nvidia Corporation. In 2012, the PI also established the UCR CUDA Research Center to further promote this research and education at UC Riverside. Nvidia Corporation has donated many GPU cards and teaching materials to support the education and research activities at UCR. The PI also introduced the graduate-level course EE/CS 217: GPU Architecture and Parallel Programming (www.ee.ucr.edu/~stan/courses/eecs217/eecs217_home.htm) in 2011 and taught it every winter from 2011 to 2014. The course was well received by UCR graduate students from both the EE and CS departments.