Arm on Tuesday released the Cortex-A78 CPU, the Mali-G78 GPU, and the Ethos-N78 NPU, as well as new technology aimed at helping Android devices catch up to Apple’s iPhones for certain computing tasks such as video games.
Cortex-A78: Bringing PC-level productivity to new smartphone form factors
The new Arm Cortex-A78 CPU answers the call for performance gains combined with more efficiencies in power and area. In fact, the Cortex-A78 is Arm's most efficient Cortex-A CPU ever designed for mobile. It will enable multi-day 5G experiences thanks to a 20% increase in sustained performance over Cortex-A77-based devices within a 1-watt power budget, and more efficient management of compute workloads along with greater on-device Machine Learning (ML) performance. Additionally, the performance-per-watt of Cortex-A78 makes it suited for the greater overall computing needs driven by the emerging category of foldable devices with multiple and larger screens.
The major push on power efficiency translates into higher energy efficiency. At high-performance points, such as those that are the peak for current mobile devices, Cortex-A78 offers 50 percent energy savings over 2019 devices at the same performance as Cortex-A77. This makes Cortex-A78 the most energy-efficient premium Cortex-A CPU ever designed.
AAA mobile gaming is one gaming use-case that is further improved by Cortex-A78, especially when combined with Arm’s Mali GPUs. In fact, the new Arm Mali-G78 GPU is helping to bring high-fidelity gaming experiences to mobile. The greater performance of the new Cortex-A78 and Mali-G78, coupled with the fast speeds and high bandwidth of 5G will enable premium gaming experiences on mobile. Moreover, the efficiency benefits of Cortex-A78 provide longer battery life on smartphones for extended and enhanced ‘all-day-play’. Arm is working with Unity to bring the power of the Burst Compiler to Android, further enhancing multiprocessor performance and power management.
Compared to Cortex-A77, Cortex-A78 uses 8 percent less power, on average, for ML-based tasks, leading to 10 percent efficiency improvements overall.
Cortex-A78 has the same architecture as the previous generation. However, Arm has added microarchitectural features that push performance in an area- and power-efficient manner.
The performance benefits are enabled through additional microarchitectural features that optimize width and depth. Arm has improved branch prediction bandwidth and accuracy and added more instruction fusion cases. These microarchitecture improvements enable a 7 percent increase in single-thread performance over the Cortex-A77.
Arm has maximized efficiency by reducing structures that have low performance and area, such as on the L1-I and L1-D caches. The company has then optimized existing structures to consume less power, such as the branch prediction structures. This leads to 4 percent less power for performance per mW and 5 percent less area for performance per mm² compared to Cortex-A77.
The DynamIQ cluster of 4x Cortex-A77 CPUs and 4x Cortex-A55 CPUs can be upgraded to 4x Cortex-A78 CPUs and 4x Cortex-A55 CPUs. This provides 20 percent sustained performance improvements in 15 percent less area. The sustained performance push through the DynamIQ cluster makes a big difference for any applications that require several high-performance threads in parallel, such as high-fidelity gaming.
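The two cluster figures can be combined into a rough performance-per-area ratio. The following is a minimal sketch in Python; the function name is hypothetical, and the resulting ratio is derived arithmetic from the numbers quoted above, not a figure published by Arm.

```python
def perf_per_area_gain(perf_gain: float, area_reduction: float) -> float:
    """Return performance per unit area of the new cluster relative to the old.

    perf_gain: fractional performance increase (0.20 for 20 percent).
    area_reduction: fractional area decrease (0.15 for 15 percent).
    """
    if not 0 <= area_reduction < 1:
        raise ValueError("area_reduction must be in [0, 1)")
    return (1 + perf_gain) / (1 - area_reduction)


# Figures quoted above: 20 percent more sustained performance in 15 percent less area.
ratio = perf_per_area_gain(0.20, 0.15)
print(f"Performance per unit area: {ratio:.2f}x the Cortex-A77 cluster")
```

Under these inputs the ratio works out to roughly 1.41, which suggests the area saving contributes about as much to density as the performance gain does.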
Mali-G78: Enabling immersive entertainment on the go
Last year saw a massive leap in Arm graphics performance and efficiency with the Mali-G77 GPU based on Arm's new Valhall architecture. The latest Valhall-based GPU, the Mali-G78, will deliver a 25% increase in graphics performance relative to Mali-G77. With support for up to 24 cores, these advances are made possible via asynchronous top level, tiler enhancements, and improved fragment dependency tracking. Additionally, the power- and energy-efficient Mali-G78 can help extend mobile device battery life, enabling users to enjoy entertainment experiences even longer on the go. For developers, this means Arm is making it easy to optimize their own content to run on Arm Mali GPUs. Enhanced tools from Arm, like the Performance Advisor, allow quick detection of bottlenecks and real-time reporting to enable continuous integration and faster workflow.
Mali-G78 enables high-quality mobile gaming experiences, with console games now available on mobile. Alongside higher quality comes greater quantity, with Mali-G78 bringing longer battery life to premium mobile devices.
Mali-G78 provides a 15 percent performance density improvement for gaming content compared to Mali-G77. This means that Mali-G78 will give more performance for the same amount of area as the previous generation. The performance boost is made possible by four key features:
- Support for up to 24 cores
- Asynchronous Top Level
- Tiler improvements
- Improved fragment dependency tracking.
Arm has increased the maximum core count to enable its highest-ever performance. The maximum core count on Mali-G77 was 16, so Arm has pushed for greater performance with support for up to 24 cores. Asynchronous Top Level then ensures that all this performance is delivered efficiently and effectively across all the cores. This squeezes as much performance out of mobile games as possible, ensuring maximum performance productivity.
Tiler improvements add an extra layer of quality to mobile games. Games that are adapted from PC and console to mobile often have extremely complicated assets and sophisticated scenes. These cause performance sticking points and bottlenecks. Improvements to the tiler reduce the vertex load on the GPU for these complex scenes and assets. This improves performance for complicated PC and console-like gaming content.
Finally, Arm has enhanced the fragment dependency tracking on Mali-G78. Again, this particularly affects mobile games with complex gaming scenes involving smoke, trees, and grass. The results from this feature change are impressive. On different frames, Arm sees up to 17 percent performance improvements on top mobile games compared to Mali-G77 performance.
Arm is already working with the game and technology development company Crytek to bring their CRYENGINE game engine to the Android mobile ecosystem first. Crytek’s flagship ‘Neon Noir’ demo fully utilizes Vulkan on Arm Mali to achieve graphic fidelity.
The Asynchronous Top Level feature plays a vital role in energy efficiency. Using Asynchronous Top Level enables a reduction in power, so content is generated in a sustainable way. This means that when the device is outputting content at a desired frame rate, it can clock down to save energy. Increasing the Asynchronous Top Level for this task uses a bit more energy, but the energy savings from reducing the frequency of the shader cores are far higher. This is because the shader cores use 90-95 percent of the GPU’s energy budget.
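The trade-off can be illustrated with a toy energy model. This is a sketch under stated assumptions: only the 90-95 percent shader-core share comes from the text. The function name, the 10 percent shader saving, the 20 percent top-level increase, and the simplification that all non-shader energy belongs to the top level are hypothetical choices made for illustration.

```python
def net_gpu_energy(shader_share: float,
                   shader_saving: float,
                   top_level_increase: float) -> float:
    """Return total GPU energy relative to a baseline of 1.0.

    shader_share: fraction of GPU energy used by shader cores (0.90-0.95 per the text).
    shader_saving: fractional energy saved in the shader cores (hypothetical).
    top_level_increase: fractional extra energy spent outside the shader cores,
        all attributed to the top level here (a simplifying assumption).
    """
    other_share = 1.0 - shader_share
    shader_energy = shader_share * (1.0 - shader_saving)
    other_energy = other_share * (1.0 + top_level_increase)
    return shader_energy + other_energy


# Hypothetical inputs: shader cores save 10 percent, top level costs 20 percent more.
for share in (0.90, 0.95):
    total = net_gpu_energy(share, shader_saving=0.10, top_level_increase=0.20)
    print(f"shader share {share:.0%}: total energy {total:.3f} of baseline")
```

Because the shader cores dominate the budget, even a modest saving there outweighs a proportionally larger increase elsewhere in this model; with the inputs above, total energy still falls below the baseline.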
Another important feature leading to better energy efficiency in Mali-G78 is the new fused multiply-add (FMA) unit. This has been completely redesigned from the ground up, leading to a 30 percent energy reduction to the unit. The FMA unit is responsible for most of the calculations that happen inside a GPU. Therefore, it was a good candidate to target for energy reductions.
Although the GPU’s primary function is graphics processing, the parallel data processing capability makes it suitable for running ML workloads. While the CPU and NPU remain the primary processors for ML, as use-cases get more complex some of these will be offloaded to the GPU. The main ML use-cases for the GPU are linked to security features on the device, different camera and video modes, and applications with AR features.
Focusing specifically on applications, the role of ML on the GPU is important. Real-time AR emojis are a fun feature on modern communication applications, such as Snapchat. This transposes AR cartoon features onto the user’s face when taking a photo or video. The GPU is used to detect the emotion of the face to auto-select the appropriate emoji. Face tracking within the photo or video frame can also be carried out by the GPU. Moreover, more compute-intensive AR-based applications are also possible on the smartphone thanks to ML on the GPU, such as mobile gaming apps that utilize AR features. These games use the GPU to transpose the AR graphics and features onto real-world environments, with ML improving this process.
To carry out these various ML-based tasks, Mali-G78 has seen an average 15 percent performance improvement for various ML workloads compared to Mali-G77. This improvement has been made despite Mali-G77 bringing a huge 60 percent improvement to ML performance over previous GPU generations. Yet again, Asynchronous Top Level is vital in boosting ML performance, as clocking the shader cores helps with the various ML use-cases on the GPU.
Alongside the premium Mali-G78, Arm is introducing a new sub-premium tier of GPUs. The first in this new tier is the Arm Mali-G68, which supports up to 6 cores and inherits all the latest Mali-G78 features.
Introducing the Cortex-X Custom program
The pace of increasing performance in smartphones exceeds that of any other computing device category in the industry today. To address this insatiable demand for the highest performance possible, Arm is introducing a new engagement program called the Cortex-X Custom program to give its partners the option of having more flexibility and scalability for increasing performance.
The Cortex-X Custom Program allows for customization and differentiation beyond the traditional roadmap of Arm Cortex products, giving Arm's partners a solution for delivering the ultimate performance for specific use cases. Arm's partners can define their own performance points outside of the usual Cortex-A design envelope of performance, power, and area (PPA). This final custom CPU, designed and built by Arm, will then be delivered under the Arm Cortex-X brand.
Cortex-X1 is the most powerful Cortex CPU to date, bringing 30 percent peak performance improvements in the next generation over the current Arm Cortex-A77 CPU.
Cortex-X1 also provides performance uplifts when compared to the Cortex-A78, offering 22 percent integer (single-thread) performance improvements. Furthermore, Cortex-X1 offers 2x machine learning (ML) performance improvements over Cortex-A77.
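As a rough consistency check, the two Cortex-X1 uplift figures can be divided to estimate the implied Cortex-A78 gain over Cortex-A77. This is a minimal sketch, and it assumes the 30 percent peak figure and the 22 percent integer figure measure comparable workloads, which the text does not confirm. The function name is hypothetical.

```python
def implied_relative_gain(gain_over_old: float, gain_over_mid: float) -> float:
    """Return the implied speedup of the middle design over the old one.

    gain_over_old: fractional gain of the newest design over the old one (0.30).
    gain_over_mid: fractional gain of the newest design over the middle one (0.22).
    """
    return (1 + gain_over_old) / (1 + gain_over_mid)


# Cortex-X1 vs Cortex-A77 (30 percent) and Cortex-X1 vs Cortex-A78 (22 percent).
a78_over_a77 = implied_relative_gain(0.30, 0.22)
print(f"Implied Cortex-A78 gain over Cortex-A77: {(a78_over_a77 - 1) * 100:.1f} percent")
```

The result lands close to the 7 percent single-thread improvement quoted earlier for Cortex-A78, although the match should be read as indicative only given the assumption above.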
The DynamIQ cluster of 4x Cortex-A78 and 4x Cortex-A55 provides 20 percent sustained performance improvements over the 4x Cortex-A77 and 4x Cortex-A55 cluster. However, introducing Cortex-X1 enables even greater scalability by bringing a boost in peak performance. With 1x Cortex-X1 added to the DynamIQ cluster alongside 3x Cortex-A78 and 4x Cortex-A55, peak performance is 30 percent higher than the previous generation.
The key market for solutions with Cortex-X1 is smartphones and new form factors. The performance uplift supports the move towards new foldable designs and bigger, multiple screens. Cortex-X1 provides quicker, more seamless user experiences, with faster app loading times and improved webpage scrolling responsiveness. The big ML uplift enables more advanced AI and ML-based experiences.
Similar to Cortex-A78, Cortex-X1 enables improvements to multiple digital immersion use-cases and experiences on mobile. These range from common productivity, communication, security, and camera-based use-cases right through to advanced gaming and XR (augmented reality and virtual reality) experiences.
Cortex-X1 has various microarchitecture upgrades that enable ultimate performance. Compared to Cortex-A78, the decode bandwidth has been increased by 25 percent to 5 instructions decoded per cycle. Moreover, the MOP cache throughput has been increased by 33 percent to 8 MOPs per cycle. On Cortex-X1, the Neon engine gets two additional pipes, doubling its compute capacity over Cortex-A78. Finally, on cache sizes, Cortex-X1 supports 64kB L1 and up to 1MB L2 cache. The DynamIQ cluster has also been upgraded to now support 8MB of L3 for ultimate performance. This larger L3 can also be used by Cortex-A78 when used in conjunction with Cortex-X1.
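The percentage increases quoted above imply baseline values for Cortex-A78. The following is a minimal sketch that backs them out; the function name is hypothetical, and the baselines are derived arithmetic rather than figures stated in the text. The 33 percent figure is presumably rounded, so the second result is only approximately a whole number.

```python
def implied_baseline(new_value: float, percent_increase: float) -> float:
    """Return the value before a given percentage increase was applied."""
    return new_value / (1 + percent_increase / 100)


# Figures quoted above for Cortex-X1.
decode_a78 = implied_baseline(5, 25)  # instructions decoded per cycle
mop_a78 = implied_baseline(8, 33)     # MOPs per cycle

print(f"Implied Cortex-A78 decode width: {decode_a78:.2f} per cycle")
print(f"Implied Cortex-A78 MOP cache throughput: {mop_a78:.2f} per cycle")
```

The results point to a decode width of 4 instructions and a MOP cache throughput of about 6 MOPs per cycle for Cortex-A78.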
Arm will work with chip suppliers for high-end Android phones to provide cores that are capable of giving a boost in performance, even if it burns a bit of battery power.
Arm will also give game developers tools to take advantage of the new computing power when they make apps for Android devices.
To address expanding ML use cases ranging from new AR-based smartphone applications to smart home-hubs, Arm is introducing the Ethos-N78 neural processing unit (NPU). This latest highly scalable and efficient NPU builds on the success of the Ethos-N77 by delivering greater on-device ML capabilities and up to 25% more performance efficiency. The Ethos-N78 also offers unprecedented levels of configurability, with available configurations ranging from 1 TOP/s up to 10 TOP/s.