Benchmarking the NVIDIA 8800 GTX with the CUDA Development Platform

Michael McGraw-Herdeg, Douglas P. Enright

msra(2007)

Abstract
Two HPEC Challenge benchmarks, finite impulse response and QR decomposition, were implemented on an NVIDIA 8800 GTX graphics card using a data-parallel implementation approach. For the finite impulse response filter bank benchmark, a fast-convolution, FFT-based frequency-domain approach on the GPU performed 4 to 35 times faster than the comparable calculation on a CPU. A non-transform time-domain approach outperformed the comparable CPU calculation by a factor of 1.6 to 15. When computing the QR decomposition of a complex matrix, GPU computations are consistently 2.5 times faster than on the CPU. All of these parallel algorithms were written in NVIDIA's Compute Unified Device Architecture (CUDA), a C interface that provides quick, effective parallelization.

Hardware and Software

The NVIDIA 8800 GTX video card has 16 multiprocessors, each composed of 8 SIMD processors operating at 1350 MHz [1]. Each multiprocessor has 8192 registers, a 16 KB parallel data cache of fast "shared memory," and access to 768 MB of GDDR3 "global memory." The card is used most efficiently in a data-parallel fashion, when the ratio of computations to memory access is high and when many computations are performed concurrently.

Table 1: FIR Test Parameters
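
To make the non-transform time-domain approach concrete, the following is a minimal CUDA sketch of a direct complex FIR convolution, with one thread computing one output sample. It is an illustration under stated assumptions, not the authors' implementation: the names `cmac`, `fir_kernel`, `fir_cpu`, and `fir_gpu` are hypothetical, a single filter is processed rather than a filter bank, inputs are assumed non-empty, taps are read straight from global memory rather than staged in shared memory, and CUDA error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Complex multiply-accumulate: acc += a * b (float2 holds real in .x, imaginary in .y).
__host__ __device__ inline void cmac(float2 &acc, float2 a, float2 b) {
    acc.x += a.x * b.x - a.y * b.y;
    acc.y += a.x * b.y + a.y * b.x;
}

// One thread per output sample: y[n] = sum over k of h[k] * x[n - k].
__global__ void fir_kernel(const float2 *x, const float2 *h, float2 *y,
                           int n_in, int n_taps) {
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    int n_out = n_in + n_taps - 1;           // full convolution length
    if (n >= n_out) return;                  // guard the last, partial block
    float2 acc = make_float2(0.0f, 0.0f);
    for (int k = 0; k < n_taps; ++k) {
        int i = n - k;
        if (i >= 0 && i < n_in) cmac(acc, h[k], x[i]);
    }
    y[n] = acc;
}

// CPU reference with the same arithmetic, useful for checking the GPU result.
std::vector<float2> fir_cpu(const std::vector<float2> &x,
                            const std::vector<float2> &h) {
    std::vector<float2> y(x.size() + h.size() - 1, make_float2(0.0f, 0.0f));
    for (size_t n = 0; n < y.size(); ++n)
        for (size_t k = 0; k < h.size() && k <= n; ++k)
            if (n - k < x.size()) cmac(y[n], h[k], x[n - k]);
    return y;
}

// Host wrapper (hypothetical helper): copy inputs to the device, launch, copy back.
// Assumes x and h are non-empty; error checking omitted to keep the sketch short.
std::vector<float2> fir_gpu(const std::vector<float2> &x,
                            const std::vector<float2> &h) {
    int n_in = static_cast<int>(x.size());
    int n_taps = static_cast<int>(h.size());
    int n_out = n_in + n_taps - 1;
    float2 *dx = nullptr, *dh = nullptr, *dy = nullptr;
    cudaMalloc(&dx, n_in * sizeof(float2));
    cudaMalloc(&dh, n_taps * sizeof(float2));
    cudaMalloc(&dy, n_out * sizeof(float2));
    cudaMemcpy(dx, x.data(), n_in * sizeof(float2), cudaMemcpyHostToDevice);
    cudaMemcpy(dh, h.data(), n_taps * sizeof(float2), cudaMemcpyHostToDevice);
    int threads = 128;                                // assumed block size
    int blocks = (n_out + threads - 1) / threads;     // round up to cover n_out
    fir_kernel<<<blocks, threads>>>(dx, dh, dy, n_in, n_taps);
    std::vector<float2> y(n_out);
    cudaMemcpy(y.data(), dy, n_out * sizeof(float2), cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dh);
    cudaFree(dy);
    return y;
}
```

Calling `fir_gpu(x, h)` and `fir_cpu(x, h)` on the same inputs should give results that agree to within single-precision rounding. Each thread performs `n_taps` complex multiply-accumulates against only a handful of memory reads per tap, so the ratio of computation to memory access grows with filter length, which is the regime the hardware description above identifies as efficient.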