Eight months after unveiling Tegra K1’s 32-bit version, Nvidia is providing further architectural details of the chip’s 64-bit version at HOT CHIPS, a technical conference on high-performance chips.
This new version of Tegra K1 pairs Nvidia’s 192-core Kepler architecture-based GPU with the company’s own custom-designed, 64-bit, dual-core "Project Denver" CPU, which is fully compatible with the ARMv8 architecture. Further, the Denver-based chip is fully pin-compatible with the 32-bit Tegra K1.
The 64-bit Tegra K1 is the world’s first 64-bit ARM processor for Android, and according to Nvidia, the new chip completely outpaces other ARM-based mobile processors.
Each of the two Denver cores implements a 7-way superscalar microarchitecture (up to 7 concurrent micro-ops can be executed per clock), and includes a 128KB 4-way L1 instruction cache, a 64KB 4-way L1 data cache, and a 2MB 16-way L2 cache, which services both cores.
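As a rough illustration of what those cache figures imply, the number of sets in a set-associative cache is its capacity divided by the product of associativity and line size. The short Python sketch below works that out for the three caches. The 64-byte line size is an assumption (the figures above do not state one), and `num_sets` is a hypothetical helper name.

```python
KB = 1024
MB = 1024 * KB

def num_sets(size_bytes, ways, line_bytes=64):
    """Sets in a set-associative cache: capacity / (ways * line size).

    line_bytes=64 is an assumed value, not a published Denver figure.
    """
    return size_bytes // (ways * line_bytes)

# Cache sizes and associativity as quoted in the article.
caches = {
    "L1 instruction": (128 * KB, 4),
    "L1 data": (64 * KB, 4),
    "L2 (shared)": (2 * MB, 16),
}

for name, (size, ways) in caches.items():
    print(f"{name}: {num_sets(size, ways)} sets of {ways} ways")
```

With a different line size the set counts scale inversely, so these numbers are only as good as the assumption behind them.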
Denver implements a process called Dynamic Code Optimization, which optimizes frequently used software routines at runtime into dense, highly tuned, microcode-equivalent routines. These are stored in a dedicated, 128MB main-memory-based optimization cache. Once read into the instruction cache, the optimized micro-ops are executed, then re-fetched from the instruction cache and executed again for as long as they are needed and capacity allows.
Effectively, this reduces the need to re-optimize the software routines. Instead of using hardware to extract the instruction-level parallelism (ILP) inherent in the code, Denver extracts the ILP once via software techniques, and then executes those routines repeatedly, thus amortizing the cost of ILP extraction over the many execution instances.
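The amortization argument can be made concrete with a little arithmetic. The sketch below is a toy cost model, not Nvidia's: all cycle counts are made-up illustrative numbers, and `amortized_cost` and `break_even_runs` are hypothetical helper names. It shows how a one-time optimization cost is spread over repeated executions, and how many executions it takes before optimizing pays off.

```python
import math

def amortized_cost(optimize_cost, optimized_run_cost, runs):
    """Average cost per run when a one-time optimization cost is spread over `runs`."""
    return (optimize_cost + optimized_run_cost * runs) / runs

def break_even_runs(optimize_cost, plain_run_cost, optimized_run_cost):
    """Smallest run count at which optimizing is no slower overall than not optimizing.

    Returns None when the optimized routine is not actually cheaper per run.
    """
    saving_per_run = plain_run_cost - optimized_run_cost
    if saving_per_run <= 0:
        return None
    return math.ceil(optimize_cost / saving_per_run)

# Illustrative numbers only: 1000 cycles to optimize once,
# 10 cycles per unoptimized run, 5 cycles per optimized run.
runs_needed = break_even_runs(1000, 10, 5)
print("break-even after", runs_needed, "runs")
print("average cost at break-even:", amortized_cost(1000, 5, runs_needed))
```

Beyond the break-even point the average cost per run keeps falling toward the optimized per-run cost, which is why the approach favors frequently executed routines.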
As part of the Dynamic Code Optimization process, Denver looks across a window of hundreds of instructions and unrolls loops, renames registers, removes unused instructions, and reorders the code in various ways for optimal speed. According to Nvidia, converting ARM code into highly optimized microcode routines in this way effectively doubles the performance of the base-level hardware and improves execution energy efficiency.
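To give a feel for one of the transformations listed above, removing unused instructions, here is a minimal sketch of dead-instruction elimination over a straight-line instruction window. It is a textbook backward liveness pass, not Denver's actual optimizer, whose internals are not described here. The `(dest, op, sources)` instruction format, the register names, and the function `remove_unused` are all assumptions made for illustration.

```python
def remove_unused(instrs, live_out):
    """Drop instructions whose results are never used.

    instrs:   list of (dest, op, sources) tuples in program order (hypothetical format).
    live_out: names whose values are still needed after the window.
    Walks backwards, keeping an instruction only if its destination is live.
    """
    live = set(live_out)
    kept = []
    for dest, op, srcs in reversed(instrs):
        if dest in live:
            live.discard(dest)   # this instruction defines dest
            live.update(srcs)    # and needs its sources to be live
            kept.append((dest, op, srcs))
    kept.reverse()
    return kept

window = [
    ("r1", "load", ("a",)),
    ("r2", "load", ("b",)),
    ("r3", "add", ("r1", "r2")),
    ("r4", "mul", ("r1", "r1")),   # r4 is never read afterwards
    ("r5", "sub", ("r3", "r1")),
]

for instr in remove_unused(window, live_out={"r5"}):
    print(instr)
```

In this example only the `mul` that writes `r4` is dropped, because nothing later reads it and it is not live at the end of the window. The sketch also assumes instructions have no side effects, which a real optimizer could not take for granted.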
The slight overhead of the dynamic optimization process is outweighed by the performance gains of already having optimized code ready to execute. In cases where code may not be frequently reused, Denver can process those ARM instructions directly without going through the dynamic optimization process.
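The choice between direct execution and the optimized path can be pictured as a simple hot-code policy. The sketch below is a toy model under stated assumptions: the execution-count threshold, the class name `ToyDispatcher`, and the idea of keying on a routine address are all hypothetical, since the article does not say how Denver decides that a routine is frequently used.

```python
from collections import Counter

class ToyDispatcher:
    """Toy model: run a routine directly until it proves hot, then use an optimized copy."""

    def __init__(self, hot_threshold=3):
        self.hot_threshold = hot_threshold   # assumed policy knob, not a Denver parameter
        self.counts = Counter()              # direct executions seen per routine
        self.optimized = set()               # routines with an optimized copy available

    def execute(self, routine_addr):
        """Return which path this execution takes: 'optimized' or 'direct'."""
        if routine_addr in self.optimized:
            return "optimized"
        self.counts[routine_addr] += 1
        if self.counts[routine_addr] >= self.hot_threshold:
            # Routine is now considered hot: later executions use the optimized copy.
            self.optimized.add(routine_addr)
        return "direct"

dispatcher = ToyDispatcher(hot_threshold=3)
print([dispatcher.execute(0x1000) for _ in range(5)])
print(dispatcher.execute(0x2000))
```

A routine that runs only once or twice never pays the optimization cost in this model, which mirrors the article's point that rarely reused code is processed directly.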
Dynamic Code Optimization works with all standard ARM-based applications, requires no customization from developers, and adds no power consumption compared with other ARM mobile processors. That’s because the 7-way superscalar design allows faster throughput than would otherwise be possible at the same clock speed.
Denver’s design delivers performance for both single- and multi-threaded applications, as well as multitasking scenarios. Nvidia says that the two Denver CPU cores can attain higher performance than existing four- to eight-core mobile CPUs on most mobile workloads.
Denver also features new low-latency power-state transitions, in addition to extensive power-gating and dynamic voltage and clock scaling based on workloads. "Combining Dynamic Code Optimization, 7-way superscalar design and efficient power usage, Denver’s performance will rival some mainstream PC-class CPUs at significantly reduced power consumption," Nvidia claims.