The chip giant was very careful to position this not as a new graphics chip but as a new “compute and graphics” chip, in that order (italics mine). In fact, nearly everything revealed about the new chip relates to its computational features, rather than traditionally graphics-oriented stuff like texture units and render back-ends. What we do know is that the chip is huge, at an estimated 3.0 billion transistors, and will be produced on a 40nm process at TSMC. That is about 40 percent more transistors than the RV870 chip in the new Radeon 5800 series DirectX 11 cards just released by rival AMD. The chip has 512 processing units (Nvidia calls them CUDA cores) organized into 16 “streaming multiprocessors” of 32 cores each. This is more than double the 240 cores in GT200, and the cores have significant enhancements besides. The chip will use a 384-bit GDDR5 memory interface.
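The core-count and transistor figures above can be checked with a little arithmetic. The sketch below uses only numbers quoted in this article; the variable names are my own, and the RV870 figure it prints is merely what the "about 40 percent" comparison implies, not an official count.

```python
# Back-of-the-envelope check of the figures quoted in the article.
# All names are illustrative; the inputs are the article's own numbers.

SM_COUNT = 16                 # streaming multiprocessors
CORES_PER_SM = 32             # CUDA cores per SM
GT200_CORES = 240             # cores in the previous-generation GT200
FERMI_TRANSISTORS = 3.0e9     # estimated transistor count
TRANSISTOR_ADVANTAGE = 0.40   # "about 40 percent more" than RV870

# 16 SMs of 32 cores each
total_cores = SM_COUNT * CORES_PER_SM

# How many times more cores than GT200
core_ratio = total_cores / GT200_CORES

# RV870 count implied by the "40 percent more" comparison (an inference)
implied_rv870_transistors = FERMI_TRANSISTORS / (1 + TRANSISTOR_ADVANTAGE)

print(f"Total CUDA cores: {total_cores}")
print(f"Cores vs. GT200: {core_ratio:.2f}x")
print(f"Implied RV870 transistors: {implied_rv870_transistors / 1e9:.2f} billion")
```

The core ratio comes out a little above 2, which matches the "more than double" wording.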
Here are some of the major bullet points:
Third Generation Streaming Multiprocessor (SM)
- 32 CUDA cores per SM, 4x over GT200
- 8x the peak double precision floating point performance over GT200
- Dual Warp Scheduler that schedules and dispatches two warps of 32 threads per clock
- 64 KB of RAM with a configurable partitioning of shared memory and L1 cache
Second Generation Parallel Thread Execution ISA
- Unified Address Space with Full C++ Support
- Optimized for OpenCL and DirectCompute
- Full IEEE 754-2008 32-bit and 64-bit precision
- Full 32-bit integer path with 64-bit extensions
- Memory access instructions to support transition to 64-bit addressing
- Improved Performance through Predication
Improved Memory Subsystem
- NVIDIA Parallel DataCache hierarchy with Configurable L1 and Unified L2 Caches
- First GPU with ECC memory support
- Greatly improved atomic memory operation performance
NVIDIA GigaThread Engine
- 10x faster application context switching
- Concurrent kernel execution
- Out of Order thread block execution
- Dual overlapped memory transfer engines
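To put the Dual Warp Scheduler bullet in chip-wide terms, the sketch below multiplies it out across the 16 streaming multiprocessors. Treat the result as an upper bound on threads issued per clock, assuming every scheduler finds a ready warp on every cycle. It is not a measured throughput, and the function name is my own.

```python
# Rough upper bound on threads issued per clock, using the bullet-point figures.
# Assumption (not from Nvidia): every scheduler has a ready warp every cycle.

WARP_SIZE = 32         # threads per warp
WARPS_PER_CLOCK = 2    # dual warp scheduler, per SM
SM_COUNT = 16          # streaming multiprocessors on the chip


def peak_threads_issued_per_clock(sm_count=SM_COUNT,
                                  warps_per_clock=WARPS_PER_CLOCK,
                                  warp_size=WARP_SIZE):
    """Hypothetical helper: threads issued per clock if nothing stalls."""
    return sm_count * warps_per_clock * warp_size


per_sm = peak_threads_issued_per_clock(sm_count=1)
chip_wide = peak_threads_issued_per_clock()

print(f"Per SM: {per_sm} threads per clock")
print(f"Chip-wide: {chip_wide} threads per clock")
```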
There are lots of additional features that should improve the performance of this chip in stream computing tasks, such as a much faster double-precision floating point computation rate. Current Nvidia GPUs compute double-precision operations at a fraction of the speed of single-precision operations. Double-precision floating point operations should now run at half the speed of single-precision, which is a huge improvement. Big improvements in caching and scheduling are apparent as well. You can read more about the architecture at Nvidia’s new Fermi page, which includes a PDF whitepaper.
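The half-rate claim is easy to express as a ratio. The sketch below assumes, purely for illustration, that each of the 512 CUDA cores completes one single-precision operation per clock. That per-core figure is not given in this article, and the helper name is hypothetical.

```python
from fractions import Fraction

CUDA_CORES = 512                   # from the article
DP_TO_SP_RATIO = Fraction(1, 2)    # "half the performance of single-precision"


def dp_ops_per_clock(sp_ops_per_clock, ratio=DP_TO_SP_RATIO):
    """Hypothetical helper: scale a single-precision rate to double precision."""
    return sp_ops_per_clock * ratio


# ASSUMPTION: one single-precision operation per core per clock.
sp_ops = CUDA_CORES * 1
dp_ops = dp_ops_per_clock(sp_ops)

print(f"Single-precision ops per clock (assumed): {sp_ops}")
print(f"Double-precision ops per clock at half rate: {dp_ops}")
```

Swapping in a different ratio, for example `Fraction(1, 8)`, shows how much a chip gives up when double precision runs at only a small fraction of the single-precision rate.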