One of the most interesting features planned for AMD's next generation core architecture, built around the new "Bulldozer" core, is something called the "Flex FP," which promises to deliver tremendous floating point capabilities for technical and financial applications.
For those of you not familiar with floating point math, this is the more
demanding arithmetic, as opposed to the simple 1+1 integer math that most
applications use. In computing, floating point describes a system for
representing numbers that would be too large or too small to be
represented as integers. Numbers are generally represented approximately,
to a fixed number of significant digits, and scaled using an exponent. AMD
claims that its "Flex FP" floating point unit could offer technical and
financial applications that rely heavily on floating point math huge
increases in performance over existing architectures, as well as far more
flexibility.
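To make the "significant digits scaled by an exponent" idea concrete, here is a minimal Python sketch. It uses the standard library's math.frexp, which splits a value into a binary significand and exponent; the helper name decompose is ours, chosen for illustration.

```python
import math

def decompose(x: float) -> tuple[float, int]:
    """Return (significand, exponent) such that x == significand * 2**exponent."""
    return math.frexp(x)

for value in (6.0e23, 0.1, 1.0e-9):
    significand, exponent = decompose(value)
    print(f"{value!r} = {significand!r} * 2**{exponent}")

# 0.1 has no exact binary representation, so the stored value is approximate.
print(f"{0.1:.20f}")
```

Very large and very small values share the same layout; only the exponent changes, which is why a floating point unit can cover a range that integers cannot.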
Flex FP is a single floating point unit that is shared between two
integer cores in a module (so an AMD 16-core "Interlagos" would have 8
Flex FP units). Each Flex FP has its own scheduler; it does not rely on
the integer scheduler to schedule FP commands, nor does it take integer
resources to schedule 256-bit executions. This helps to ensure that the
FP unit stays full as floating point commands occur. AMD says that Intel
and other competitors' architectures have had a single scheduler for both
integer and floating point, which means that both integer and floating
point commands are issued by one shared scheduler rather than by
dedicated schedulers for integer and floating point executions.
There will be some instruction set extensions that include SSSE3, SSE 4.1
and 4.2, AVX, AES, FMA4, XOP, PCLMULQDQ and others.
One of these new instruction set extensions, AVX, can handle 256-bit FP
executions. However, there is no such thing as a 256-bit command. Single
precision commands are 32-bit and double precision are 64-bit. With
today's standard 128-bit FPUs, you execute four single precision commands
or two double precision commands in parallel per cycle. With AVX you can
double that, executing eight 32-bit commands or four 64-bit commands per
cycle, but only if your application supports AVX. If it doesn't support
AVX, then that flashy new 256-bit FPU only executes in 128-bit mode (half
the throughput). That is, unless you have a Flex FP.
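The per-cycle figures above are simple division of register width by element width. A short Python sketch of that arithmetic follows; the function name lanes is ours, not an AMD or Intel term.

```python
def lanes(register_bits: int, element_bits: int) -> int:
    """Number of elements of a given width that fit in one vector register."""
    return register_bits // element_bits

for register_bits in (128, 256):
    single = lanes(register_bits, 32)   # single precision elements
    double = lanes(register_bits, 64)   # double precision elements
    print(f"{register_bits}-bit: {single} single precision, "
          f"{double} double precision per cycle")
```

The 256-bit row only applies to code that has been compiled to use AVX; older code stays on the 128-bit row.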
In today's typical data center workloads, the bulk of the processing is
integer and a smaller portion is floating point. So, in most cases you
don't want one massive 256-bit floating point unit per core consuming all
of that die space and all of that power just to sit around watching the
integer cores do all of the heavy lifting. By sharing one 256-bit floating
point unit between every two cores, AMD can keep die size and power
consumption down, helping hold down both the acquisition cost and
long-term management costs.
The Flex FP unit is built on two 128-bit FMAC units. The FMAC building
blocks are quite robust on their own. Each FMAC can do an FMAC, FADD or a
FMUL per cycle.
"When you compare that to competitive solutions that can only do an FADD
on their single FADD pipe or an FMUL on their single FMUL pipe, you start
to see the power of the Flex FP: whether 128-bit or 256-bit, there is
flexibility for your technical applications. With FMAC, the multiplication
or addition commands don't start to stack up like a standard FMUL or FADD;
there is flexibility to handle either math on either unit," said John
Fruehe, the director of product marketing for server/workstation products
at AMD.
Here are some additional benefits:
* Non-destructive DEST via FMA4 support (which helps reduce register
pressure)
* Higher accuracy (via elimination of the intermediate rounding step)
* Can accommodate FMUL or FADD ops (if an app is FADD limited, then
both FMACs can do FADDs, etc.), which is a huge benefit
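The accuracy point in the list above can be illustrated in software. The Python sketch below is an emulation, not the FMA4 hardware path: it computes a*b+c exactly with fractions.Fraction and rounds once at the end, which is the behavior a fused multiply-accumulate provides, and compares that with the ordinary two-step version that rounds the product first. The function names and the input values are ours, picked so the difference is visible.

```python
from fractions import Fraction

def unfused_multiply_add(a: float, b: float, c: float) -> float:
    """Separate multiply and add: the product is rounded, then the sum is rounded."""
    return a * b + c

def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulated fused operation: exact a*b+c, rounded to a double only once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)

a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

# The exact product is 1 - 2**-60, which rounds to 1.0 in double precision,
# so the two-step version loses the answer entirely.
print("unfused:", unfused_multiply_add(a, b, c))
print("fused:  ", fused_multiply_add(a, b, c))
```

With these inputs the two-step version returns zero while the single-rounding version keeps the small nonzero result.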
The new AES instructions allow hardware to accelerate the large base of
applications that use this type of standard encryption (FIPS 197). The
"Bulldozer" Flex FP is able to execute these instructions, which operate
on 16 bytes at a time, at a rate of 1 per cycle, which provides 2X more
bandwidth than current offerings, AMD added.
By having a shared Flex FP the power budget for the processor is held
down. This allows AMD to add more integer cores into the same power
budget. By sharing FP resources (that are often idle in any given cycle)
AMD can add more integer execution resources (which are more often busy
with commands waiting in line). In fact, the Flex FP is designed to reduce
its active idle power consumption to a mere 2% of its peak power
consumption.
"The Flex FP gives you the best of both worlds: performance where you
need it yet smart enough to save power when you don't need it," Mr. Fruehe
said.
The beauty of the Flex FP is that it is a single 256-bit FPU that is
shared by two integer cores. With each cycle, either core can operate on
256 bits of parallel data via two 128-bit instructions or one 256-bit
instruction, or each of the integer cores can execute 128-bit commands
simultaneously. This is not something hard coded in the BIOS or in the
application; it can change with each processor cycle to meet the needs at
that moment. Since servers spend most of their time executing integer
commands, when a set of FP commands needs to be dispatched it is likely
that only one core needs the unit, so that core has all 256 bits available
to schedule.
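The per-cycle sharing described above can be pictured with a toy model. The Python sketch below is a deliberate simplification based only on the description in this article: it reduces each core's demand to a yes/no flag per cycle, whereas the real Flex FP scheduler arbitrates individual operations. The names FLEX_FP_BITS, PIPE_BITS and allocate are hypothetical.

```python
FLEX_FP_BITS = 256   # total width of the shared unit (two 128-bit FMACs)
PIPE_BITS = 128      # width of a single FMAC pipe

def allocate(core0_wants_fp: bool, core1_wants_fp: bool) -> tuple[int, int]:
    """Toy model: FP width (in bits) available to (core0, core1) in one cycle."""
    if core0_wants_fp and core1_wants_fp:
        return PIPE_BITS, PIPE_BITS      # both cores issue 128-bit work
    if core0_wants_fp:
        return FLEX_FP_BITS, 0           # core 0 gets the whole unit
    if core1_wants_fp:
        return 0, FLEX_FP_BITS           # core 1 gets the whole unit
    return 0, 0                          # unit is idle this cycle

# Demand can change every cycle; nothing is fixed in the BIOS or application.
for cycle, demand in enumerate([(True, False), (True, True), (False, False)]):
    print(cycle, allocate(*demand))
```

In this simplified picture a lone FP-heavy thread sees a full 256-bit unit, and contention only halves the width in cycles where both cores have floating point work.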
Floating point operations typically have longer latencies, so utilization
of the floating point unit is usually much lower; two threads can easily
interleave with minimal performance impact. So the idea of sharing doesn't
necessarily present a dramatic trade-off because of the types of
operations being handled.
Also, each of AMD's pipes can handle SSE or AVX as well as FMUL, FADD, or
FMAC providing the greatest flexibility for any given application.
Existing apps will be able to take full advantage of AMD's hardware with
potential for improvement by leveraging the new ISAs, the company said.
"Obviously, there are benefits of recompiled code that will support the
new AVX instructions. But, if you think that you will have some older
128-bit FP code hanging around (and let's face it, you will), then don't
you think having a flexible floating point solution is a more flexible
choice for your applications? For applications to support the new 256-bit
AVX capabilities they will need to be recompiled; this takes time and
testing, so I wouldn't expect to see rapid movement to AVX until well
after platforms are available on the streets. That means in the meantime,
as we all work through this transition, having flexibility is a good
thing. Which is why we designed the Flex FP the way that we have," Mr.
Fruehe added.