Difference between revisions of "Floating Point Optimization"
(→VFP-Lite RunFast) |
Linux-SWAT (talk | contribs) |
||
(12 intermediate revisions by 5 users not shown) | |||
Line 5: | Line 5: | ||
== Compiler Support == | == Compiler Support == | ||
− | The NEON + VFP-lite is a | + | The NEON + VFP-lite is a design from ARM. Mainline GCC supports it but you may want to use another compiler. |
+ | |||
+ | Code Sourcery Compiler versions: | ||
* CSL 2007q3: Working NEON, Softfp Support | * CSL 2007q3: Working NEON, Softfp Support | ||
* CSL 2008q3: Broken NEON! | * CSL 2008q3: Broken NEON! | ||
Line 12: | Line 14: | ||
Generally the CS2007q3 release is recommended, the CSL 2009q1 release is promising but it has not been thoroughly tested yet. One big problem with the current compilers is the heavy dependence on VFP code, currently they only output NEON code when an obvious chance of vectorization is encountered (rarely). Apart from the esoteric rounding, vector, etc modes of the VFP (most of which compilers don't use) and predication (used occasionally), most VFP floating point instructions can be exactly replicated using an order of magnitude faster NEON instructions.... Infact it has been reported to me that the GCC packaged with the iPhone 3GS SDK does exactly this. Hopefully future compilers will support this feature. | Generally the CS2007q3 release is recommended, the CSL 2009q1 release is promising but it has not been thoroughly tested yet. One big problem with the current compilers is the heavy dependence on VFP code, currently they only output NEON code when an obvious chance of vectorization is encountered (rarely). Apart from the esoteric rounding, vector, etc modes of the VFP (most of which compilers don't use) and predication (used occasionally), most VFP floating point instructions can be exactly replicated using an order of magnitude faster NEON instructions.... Infact it has been reported to me that the GCC packaged with the iPhone 3GS SDK does exactly this. Hopefully future compilers will support this feature. | ||
− | + | In order to instruct the compiler to produce NEON or VFP code you should use the following compile flags: <pre>-mfpu=neon or -mfpu=vfp</pre> | |
+ | Unfortunately the CSL 2007 / 2008 toolchains do not support the passing of values in floating point registers (i talk about this some more in the Transfers section), so you must specify a software ABI via -mfloat-abi=softfp. The CSL 2009q1 release is the first release to support the passing of values in FP registers (AKA hardfp) via the -mfloat-abi=hard compile flag. Note that hardfp compiled binaries are not compatible with softfp ones and vice versa, so make sure your libraries have the correct ABI. Additionally, If you want the compiler to attempt to vectorize your integer / floating point code for the NEON you should add: -ftree-vectorize to your flags. | ||
− | Therefore i recommend the following flags: | + | Therefore i recommend the following flags: |
+ | <pre>-O3 -mcpu=cortex-a8 -mfpu=neon -ftree-vectorize -mfloat-abi=(softfp|hard) -ffast-math -fsingle-precision-constant</pre> | ||
+ | where -mfloat-abi=hard for the CSL 2009q1 release and softfp for all the others. | ||
== VFP-Lite RunFast == | == VFP-Lite RunFast == | ||
Under the correct circumstances some of The VFPs instructions will be executed in the NEON coprocessor. Unfortunately this does not gain the full benefit of the NEON, it still takes 7 cycles for an FMAC / FMUL / FADD. Due to this quirk you will likely get better scalar performance by accessing the NEON directly via Intrinsics or ASM. | Under the correct circumstances some of The VFPs instructions will be executed in the NEON coprocessor. Unfortunately this does not gain the full benefit of the NEON, it still takes 7 cycles for an FMAC / FMUL / FADD. Due to this quirk you will likely get better scalar performance by accessing the NEON directly via Intrinsics or ASM. | ||
− | + | In order for VFP instructions to execute in the NFP the following constraints must be met: | |
* RunFast mode must be enabled | * RunFast mode must be enabled | ||
* Must be single precision floating point operands | * Must be single precision floating point operands | ||
Line 50: | Line 55: | ||
== Single Precision Floating Point == | == Single Precision Floating Point == | ||
− | One important and easy optimization is to make sure that single precision constants are being used. By default this is not the case, instead a double precision constant is being used, so all related operations involving that constant require double precision instructions and cannot be executed on the NEON. eg | + | One important and easy optimization is to make sure that single precision constants are being used. By default this is not the case, instead a double precision constant is being used, so all related operations involving that constant require slower double precision instructions and cannot be executed on the NEON. eg |
<source lang="c"> | <source lang="c"> | ||
Line 72: | Line 77: | ||
You can enforce single precision constants by including the compiler flag: '''-fsingle-precision-constant''', alternatively you can append an 'f' to the end of each constant. ie 2.123f | You can enforce single precision constants by including the compiler flag: '''-fsingle-precision-constant''', alternatively you can append an 'f' to the end of each constant. ie 2.123f | ||
− | Another thing to watch out for is the double versions of the functions in libm (sin, exp, sqrt) | + | Another thing to watch out for is the double versions of the functions in libm (sin, exp, sqrt, etc). By default these functions operate on double precision floating point values and suffer the same problems as the constants. Luckily libm supplies floating point versions as well, they can be accessed by appending an 'f' to the end of the function. ie sinf(), expf(), sqrtf(). |
== NFP / VFP to ARM Transfers == | == NFP / VFP to ARM Transfers == | ||
− | Probably the biggest bottleneck in the architecture is that | + | Probably the biggest bottleneck in the architecture is that in order to transfer a number from the VFP / NFP registers onto the ARM you must stall both the ARM and NFP / VFP for >20 cycles. This is particularly troublesome because this is how GCC (except the CSL 2009q1 release) supplies arguments and recieves returns from functions. Possibly The best way to minimize operand passing stalls is to make the floating point functions inline. |
− | Another source of NFP / VFP | + | Another source of NFP / VFP to ARM transfers are conditional branches that depend on floating point numbers. You can do the condition on the VFP but in order to branch the flags must be sent from the VFP to the ARM. For very simple branches your best bet is to not branch at all and instead use arithmetic. ie |
<source lang="c">if (x < 0) {x += 1.1244;}</source> | <source lang="c">if (x < 0) {x += 1.1244;}</source> | ||
Is the same as: | Is the same as: | ||
Line 105: | Line 110: | ||
}</source> | }</source> | ||
+ | |||
+ | The last common source of transfers is when you cast a floating point value as an integer, by default all integer work will be done in the ARM pipeline and hence a transfer operation occurs. This is particularly problematic for complex algorithms that rely on bitwise or rounding operations on floating point numbers, ie almost all the functions in cmath depend on range reduction (rounding). A smart compiler would recognize that they can almost always be done in the NEON's integer pipeline. | ||
== NEON SIMD == | == NEON SIMD == | ||
Line 111: | Line 118: | ||
== Summary == | == Summary == | ||
− | Therefore, | + | It's often said amongst software developers that you 'may as well not bother trying to outperform a compiler', whilst there is a grain of truth in this where X86 is concerned, this is definitely not the case with Floating point on the ARM Cortex A8. In fact it is almost the opposite, you can almost always make significant gains via targeting the NEON. Therefore, In order to achieve the best floating point performance on the Pandora (or ARM Cortex A8 device): |
− | * Use the CodeSourcery 2007q3 | + | * Use the CodeSourcery 2007q3 or 2009q1 releases and these flags |
+ | <pre> -O3 -mcpu=cortex-a8 -mfpu=neon -ftree-vectorize -mfloat-abi=(softfp|hard) -ffast-math -fsingle-precision-constant</pre> | ||
* Only use single precision floating point | * Only use single precision floating point | ||
+ | * Use NEON intrinsics / ASM when ever you find a bottlenecking FP function. You can do better than the compiler. | ||
+ | * Minimize Conditional Branches | ||
* Enable RunFast mode | * Enable RunFast mode | ||
− | + | ||
+ | For softfp: | ||
* Inline floating point code (unless its very large) | * Inline floating point code (unless its very large) | ||
− | + | * Pass FP arguments via pointers instead of by value and do integer work in between function calls. | |
− | * Pass | + | |
+ | [[Category:Development]] | ||
+ | [[Category:Chipset]] |
Latest revision as of 20:59, 4 June 2015
Contents
Introduction
In the past it was rare for an embedded processor to have dedicated floating point hardware, this usually limited you to either using fixed point math (which can be very tricky to write) or very slow software floating point emulation. Fortunately the ARM Cortex A8 found in the OMAP3 has 2 Floating Point Units, a non-pipelined VFP-lite conforming to the IEEE754 standard for floating point arithmetic and a pipelined SIMD NEON coprocessor. The VFP-lite can handle both single and double precision arithmetic, as well as properly handling exceptions and subnormal numbers. However, Due to the full spec compliance and presence of the NEON, it is a relatively slow implementation in the A8, usually taking between 18 - 21 cycles to perform a single precision multiply accumulate. The NEON unit on the other hand is designed for very fast single precision vector math, it can sustain multiply accumulates at a rate of two per cycle. Efficiently utilizing these coprocessors in GCC will be the focus of this article.
Note: In this article I refer to the A8's integer pipeline as the "ARM" , the VFP-lite as simply the "VFP" and the NEON unit as the "NFP".
Compiler Support
The NEON + VFP-lite is a design from ARM. Mainline GCC supports it but you may want to use another compiler.
Code Sourcery Compiler versions:
- CSL 2007q3: Working NEON, Softfp Support
- CSL 2008q3: Broken NEON!
- CSL 2009q1: Working NEON, Hardfp + Softfp Support
Generally the CS2007q3 release is recommended, the CSL 2009q1 release is promising but it has not been thoroughly tested yet. One big problem with the current compilers is the heavy dependence on VFP code, currently they only output NEON code when an obvious chance of vectorization is encountered (rarely). Apart from the esoteric rounding, vector, etc modes of the VFP (most of which compilers don't use) and predication (used occasionally), most VFP floating point instructions can be exactly replicated using an order of magnitude faster NEON instructions.... Infact it has been reported to me that the GCC packaged with the iPhone 3GS SDK does exactly this. Hopefully future compilers will support this feature.
In order to instruct the compiler to produce NEON or VFP code you should use the following compile flags:
-mfpu=neon or -mfpu=vfp
Unfortunately the CSL 2007 / 2008 toolchains do not support the passing of values in floating point registers (i talk about this some more in the Transfers section), so you must specify a software ABI via -mfloat-abi=softfp. The CSL 2009q1 release is the first release to support the passing of values in FP registers (AKA hardfp) via the -mfloat-abi=hard compile flag. Note that hardfp compiled binaries are not compatible with softfp ones and vice versa, so make sure your libraries have the correct ABI. Additionally, If you want the compiler to attempt to vectorize your integer / floating point code for the NEON you should add: -ftree-vectorize to your flags.
Therefore i recommend the following flags:
-O3 -mcpu=cortex-a8 -mfpu=neon -ftree-vectorize -mfloat-abi=(softfp|hard) -ffast-math -fsingle-precision-constant
where -mfloat-abi=hard for the CSL 2009q1 release and softfp for all the others.
VFP-Lite RunFast
Under the correct circumstances some of The VFPs instructions will be executed in the NEON coprocessor. Unfortunately this does not gain the full benefit of the NEON, it still takes 7 cycles for an FMAC / FMUL / FADD. Due to this quirk you will likely get better scalar performance by accessing the NEON directly via Intrinsics or ASM.
In order for VFP instructions to execute in the NFP the following constraints must be met:
- RunFast mode must be enabled
- Must be single precision floating point operands
- Must not be a vector instruction (GCC doesn't appear to use this feature, so don't worry about it)
Runfast mode is enabled when the following conditions are present:
- Subnormal numbers are being flushed to zero
- Default NaN mode is active
- No floating point exceptions are enabled
I'm not sure if Runfast mode will be enabled by default in the Angstrom distribution packaged with the Pandora. If it isn't you can use the following C code to enforce it:
void enable_runfast()
{
static const unsigned int x = 0x04086060;
static const unsigned int y = 0x03000000;
int r;
asm volatile (
"fmrx %0, fpscr \n\t" //r0 = FPSCR
"and %0, %0, %1 \n\t" //r0 = r0 & 0x04086060
"orr %0, %0, %2 \n\t" //r0 = r0 | 0x03000000
"fmxr fpscr, %0 \n\t" //FPSCR = r0
: "=r"(r)
: "r"(x), "r"(y)
);
}
The instructions that are executed on the NFP are: FADDS, FSUBS, FABSS, FNEGS, FMULS, FNMULS, FMACS, FNMACS, FMSCS, FNMSCS, FCMPS, FCMPES, FCMPZS, FCMPEZS, FUITOS, FSITOS, FTOUIS, FTOSIS, FTOUIZS, FTOSIZS, FSHTOS, FSLTOS, FUHTOS, FULTOS, FTOSHS, FTOSLS, FTOUHS, FTOULS.
Single Precision Floating Point
One important and easy optimization is to make sure that single precision constants are being used. By default this is not the case, instead a double precision constant is being used, so all related operations involving that constant require slower double precision instructions and cannot be executed on the NEON. eg
float foo(float x)
{
return (2.123 * x);
}
might end up the same as:
float foo(float x)
{
double dx = (double) x;
double dy = (double) 2.123;
double dr = dx * dy;
float r = (float) dr;
return r;
}
You can enforce single precision constants by including the compiler flag: -fsingle-precision-constant, alternatively you can append an 'f' to the end of each constant. ie 2.123f
Another thing to watch out for is the double versions of the functions in libm (sin, exp, sqrt, etc). By default these functions operate on double precision floating point values and suffer the same problems as the constants. Luckily libm supplies floating point versions as well, they can be accessed by appending an 'f' to the end of the function. ie sinf(), expf(), sqrtf().
NFP / VFP to ARM Transfers
Probably the biggest bottleneck in the architecture is that in order to transfer a number from the VFP / NFP registers onto the ARM you must stall both the ARM and NFP / VFP for >20 cycles. This is particularly troublesome because this is how GCC (except the CSL 2009q1 release) supplies arguments and recieves returns from functions. Possibly The best way to minimize operand passing stalls is to make the floating point functions inline.
Another source of NFP / VFP to ARM transfers are conditional branches that depend on floating point numbers. You can do the condition on the VFP but in order to branch the flags must be sent from the VFP to the ARM. For very simple branches your best bet is to not branch at all and instead use arithmetic. ie
if (x < 0) {x += 1.1244;}
Is the same as:
x = x + (x < 0) * 1.1244
However you might want to keep a close eye on what the compiler actually produces with the above code.
One interesting fact is that using stores and loads do not cause a stall. So aslong as you don't need the result straight away you can hide the 20 cycle latency. Instead of doing a transfer you; store your NFP / VFP result to memory, do some work on the ARM, then load the result back onto the ARM without penalty. ie
void foo(float *x, float *r)
{
*r = 123 + *x;
}
void bar(float *x, float *r)
{
*r = 546 + *x;
}
void main()
{
float x = 10;
float y, z;
foo(&x, &y)
//do ~20 cycles of ARM work
bar(&y, &z);
}
The last common source of transfers is when you cast a floating point value as an integer, by default all integer work will be done in the ARM pipeline and hence a transfer operation occurs. This is particularly problematic for complex algorithms that rely on bitwise or rounding operations on floating point numbers, ie almost all the functions in cmath depend on range reduction (rounding). A smart compiler would recognize that they can almost always be done in the NEON's integer pipeline.
NEON SIMD
The NEON unit is similar to the MMX and SSE extensions found on X86 processors, it is optimized for Single Instruction Multiple Data (SIMD) operations. The NEON unit has 2 floating point pipelines, an integer pipeline and a 128bit load/store/permute pipeline. When properly utilized it is a very powerful coprocessor. Unfortunately GCC does a rather poor job of vectorizing code for the NEON unit. To get the best performance you should use either the intrinsics provided in the "arm_neon.h" header or hand written assembly.
Summary
It's often said amongst software developers that you 'may as well not bother trying to outperform a compiler', whilst there is a grain of truth in this where X86 is concerned, this is definitely not the case with Floating point on the ARM Cortex A8. In fact it is almost the opposite, you can almost always make significant gains via targeting the NEON. Therefore, In order to achieve the best floating point performance on the Pandora (or ARM Cortex A8 device):
- Use the CodeSourcery 2007q3 or 2009q1 releases and these flags
-O3 -mcpu=cortex-a8 -mfpu=neon -ftree-vectorize -mfloat-abi=(softfp|hard) -ffast-math -fsingle-precision-constant
- Only use single precision floating point
- Use NEON intrinsics / ASM when ever you find a bottlenecking FP function. You can do better than the compiler.
- Minimize Conditional Branches
- Enable RunFast mode
For softfp:
- Inline floating point code (unless its very large)
- Pass FP arguments via pointers instead of by value and do integer work in between function calls.