https://pandorawiki.org/index.php?title=Floating_Point_Optimization&feed=atom&action=historyFloating Point Optimization - Revision history2024-03-29T12:41:15ZRevision history for this page on the wikiMediaWiki 1.32.0-alphahttps://pandorawiki.org/index.php?title=Floating_Point_Optimization&diff=30070&oldid=prevLinux-SWAT at 20:59, 4 June 20152015-06-04T20:59:23Z<p></p>
<table class="diff diff-contentalign-left" data-mw="interface">
<col class="diff-marker" />
<col class="diff-content" />
<col class="diff-marker" />
<col class="diff-content" />
<tr class="diff-title" lang="en">
<td colspan="2" style="background-color: #fff; color: #222; text-align: center;">← Older revision</td>
<td colspan="2" style="background-color: #fff; color: #222; text-align: center;">Revision as of 20:59, 4 June 2015</td>
</tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l5" >Line 5:</td>
<td colspan="2" class="diff-lineno">Line 5:</td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Compiler Support ==</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Compiler Support ==</div></td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>The NEON + VFP-lite is a <del class="diffchange diffchange-inline">new </del>design from ARM <del class="diffchange diffchange-inline">and hence does not yet have very mature </del>compiler <del class="diffchange diffchange-inline">support. At present the Code Sourcery toolchain has the best support since the mainline GCCs do not support NEON yet</del>. Code Sourcery Compiler versions:</div></td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>The NEON + VFP-lite is a design from ARM<ins class="diffchange diffchange-inline">. Mainline GCC supports it but you may want to use another </ins>compiler.</div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div> </div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>Code Sourcery Compiler versions:</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* CSL 2007q3: Working NEON, Softfp Support</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* CSL 2007q3: Working NEON, Softfp Support</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* CSL 2008q3: Broken NEON!</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* CSL 2008q3: Broken NEON!</div></td></tr>
</table>Linux-SWAThttps://pandorawiki.org/index.php?title=Floating_Point_Optimization&diff=28099&oldid=prevLolla at 22:16, 9 October 20132013-10-09T22:16:43Z<p></p>
<table class="diff diff-contentalign-left" data-mw="interface">
<col class="diff-marker" />
<col class="diff-content" />
<col class="diff-marker" />
<col class="diff-content" />
<tr class="diff-title" lang="en">
<td colspan="2" style="background-color: #fff; color: #222; text-align: center;">← Older revision</td>
<td colspan="2" style="background-color: #fff; color: #222; text-align: center;">Revision as of 22:16, 9 October 2013</td>
</tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l112" >Line 112:</td>
<td colspan="2" class="diff-lineno">Line 112:</td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== NEON SIMD ==</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== NEON SIMD ==</div></td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>The <del class="diffchange diffchange-inline">[[</del>NEON<del class="diffchange diffchange-inline">]] </del>unit is similar to the MMX and SSE extensions found on x86 processors; it is optimized for Single Instruction Multiple Data (SIMD) operations.</div></td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>The NEON unit is similar to the MMX and SSE extensions found on x86 processors; it is optimized for Single Instruction Multiple Data (SIMD) operations.</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The NEON unit has 2 floating point pipelines, an integer pipeline and a 128-bit load/store/permute pipeline. When properly utilized it is a very powerful coprocessor. Unfortunately, GCC does a rather poor job of vectorizing code for the NEON unit. To get the best performance you should use either the intrinsics provided in the "arm_neon.h" header or hand-written assembly. </div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The NEON unit has 2 floating point pipelines, an integer pipeline and a 128-bit load/store/permute pipeline. When properly utilized it is a very powerful coprocessor. Unfortunately, GCC does a rather poor job of vectorizing code for the NEON unit. To get the best performance you should use either the intrinsics provided in the "arm_neon.h" header or hand-written assembly. </div></td></tr>
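<tr><td colspan="4" style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>As a rough sketch of what the "arm_neon.h" intrinsics look like in practice, the following multiply-accumulate loop handles four single precision floats per iteration. The names madd_f32 and HAVE_NEON are hypothetical and chosen only for this illustration; the scalar fallback is an assumption added so the same file still builds when NEON is not enabled.

```c
/* Use the NEON intrinsics only when the compiler targets NEON. */
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include "arm_neon.h"
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* acc[i] += a[i] * b[i] for every i in [0, n). Hypothetical helper. */
void madd_f32(float *acc, const float *a, const float *b, int n)
{
    int i = 0;
#if HAVE_NEON
    /* Four floats fit in one 128-bit Q register. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va   = vld1q_f32(a + i);
        float32x4_t vb   = vld1q_f32(b + i);
        float32x4_t vacc = vld1q_f32(acc + i);
        vacc = vmlaq_f32(vacc, va, vb);   /* vacc + va * vb, per lane */
        vst1q_f32(acc + i, vacc);
    }
#endif
    /* Scalar tail; also the whole loop when NEON is unavailable. */
    for (; i < n; ++i) {
        acc[i] += a[i] * b[i];
    }
}
```

Whether this beats the compiler's own output depends on the toolchain and flags, so it is worth timing both versions on the target device.</div></td></tr>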
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Summary ==</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Summary ==</div></td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>It's often said amongst software developers that you 'may as well not bother trying to outperform a compiler'. Whilst there is a grain of truth in this where x86 is concerned, this is definitely not the case with floating point on the ARM Cortex A8. In fact it is almost the opposite: you can almost always make significant gains by targeting the <del class="diffchange diffchange-inline">[[</del>NEON<del class="diffchange diffchange-inline">]]</del>. Therefore, in order to achieve the best floating point performance on the Pandora (or any ARM Cortex A8 device):</div></td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>It's often said amongst software developers that you 'may as well not bother trying to outperform a compiler'. Whilst there is a grain of truth in this where x86 is concerned, this is definitely not the case with floating point on the ARM Cortex A8. In fact it is almost the opposite: you can almost always make significant gains by targeting the NEON. Therefore, in order to achieve the best floating point performance on the Pandora (or any ARM Cortex A8 device):</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* Use the CodeSourcery 2007q3 or 2009q1 releases and these flags</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* Use the CodeSourcery 2007q3 or 2009q1 releases and these flags</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div><pre> -O3 -mcpu=cortex-a8 -mfpu=neon -ftree-vectorize -mfloat-abi=(softfp|hard) -ffast-math -fsingle-precision-constant</pre></div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div><pre> -O3 -mcpu=cortex-a8 -mfpu=neon -ftree-vectorize -mfloat-abi=(softfp|hard) -ffast-math -fsingle-precision-constant</pre></div></td></tr>
</table>Lollahttps://pandorawiki.org/index.php?title=Floating_Point_Optimization&diff=28098&oldid=prevLolla at 22:12, 9 October 20132013-10-09T22:12:25Z<p></p>
<table class="diff diff-contentalign-left" data-mw="interface">
<col class="diff-marker" />
<col class="diff-content" />
<col class="diff-marker" />
<col class="diff-content" />
<tr class="diff-title" lang="en">
<td colspan="2" style="background-color: #fff; color: #222; text-align: center;">← Older revision</td>
<td colspan="2" style="background-color: #fff; color: #222; text-align: center;">Revision as of 22:12, 9 October 2013</td>
</tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l5" >Line 5:</td>
<td colspan="2" class="diff-lineno">Line 5:</td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Compiler Support ==</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Compiler Support ==</div></td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>The NEON + VFP-lite is a new design from ARM and hence does not yet have very mature compiler support. At present the Code Sourcery <del class="diffchange diffchange-inline">[[</del>toolchain<del class="diffchange diffchange-inline">]] </del>has the best support since the mainline GCCs do not support NEON yet. Code Sourcery Compiler versions:</div></td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>The NEON + VFP-lite is a new design from ARM and hence does not yet have very mature compiler support. At present the Code Sourcery toolchain has the best support since the mainline GCCs do not support NEON yet. Code Sourcery Compiler versions:</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* CSL 2007q3: Working NEON, Softfp Support</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* CSL 2007q3: Working NEON, Softfp Support</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* CSL 2008q3: Broken NEON!</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* CSL 2008q3: Broken NEON!</div></td></tr>
<tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l78" >Line 78:</td>
<td colspan="2" class="diff-lineno">Line 78:</td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== NFP / VFP to ARM Transfers ==</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== NFP / VFP to ARM Transfers ==</div></td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>Probably the biggest bottleneck in the architecture is that in order to transfer a number from the <del class="diffchange diffchange-inline">[[</del>VFP<del class="diffchange diffchange-inline">]] </del>/ NFP registers onto the ARM you must stall both the <del class="diffchange diffchange-inline">[[</del>ARM<del class="diffchange diffchange-inline">]] </del>and <del class="diffchange diffchange-inline">[[</del>NFP<del class="diffchange diffchange-inline">]] </del>/ <del class="diffchange diffchange-inline">[[</del>VFP<del class="diffchange diffchange-inline">]] </del>for >20 cycles. This is particularly troublesome because this is how GCC (except the CSL 2009q1 release) supplies arguments to and receives return values from functions. Possibly the best way to minimize operand passing stalls is to make the floating point functions inline.</div></td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>Probably the biggest bottleneck in the architecture is that in order to transfer a number from the VFP / NFP registers onto the ARM you must stall both the ARM and NFP / VFP for >20 cycles. This is particularly troublesome because this is how GCC (except the CSL 2009q1 release) supplies arguments to and receives return values from functions. Possibly the best way to minimize operand passing stalls is to make the floating point functions inline.</div></td></tr>
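<tr><td colspan="4" style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>A minimal sketch of the inlining advice, with hypothetical helper names (scale_add, sum_scaled) that are not part of the article. The idea is that once the small floating point helper is inlined into the loop, its operands and result need not cross a function call boundary.

```c
/* Hypothetical helper. "static inline" is only a hint to the compiler;
   GCC also offers __attribute__((always_inline)) to insist on it. */
static inline float scale_add(float x, float scale, float bias)
{
    return x * scale + bias;
}

/* Sums scale_add() over an array; with the helper inlined, the
   intermediate floats do not have to be passed through a call. */
float sum_scaled(const float *v, int n, float scale, float bias)
{
    float sum = 0.0f;
    int i;
    for (i = 0; i < n; ++i) {
        sum += scale_add(v[i], scale, bias);
    }
    return sum;
}
```

Whether the call really disappears depends on the optimization level, so inspecting the generated assembly is the only reliable check.</div></td></tr>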
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>Another source of <del class="diffchange diffchange-inline">[[</del>NFP<del class="diffchange diffchange-inline">]] </del>/ <del class="diffchange diffchange-inline">[[</del>VFP<del class="diffchange diffchange-inline">]] </del>to <del class="diffchange diffchange-inline">[[</del>ARM<del class="diffchange diffchange-inline">]] </del>transfers is conditional branches that depend on floating point numbers. You can evaluate the condition on the VFP, but in order to branch the flags must be sent from the VFP to the ARM. For very simple branches your best bet is to not branch at all and instead use arithmetic, i.e.</div></td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>Another source of NFP / VFP to ARM transfers is conditional branches that depend on floating point numbers. You can evaluate the condition on the VFP, but in order to branch the flags must be sent from the VFP to the ARM. For very simple branches your best bet is to not branch at all and instead use arithmetic, i.e.</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div><source lang="c">if (x < 0) {x += 1.1244;}</source></div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div><source lang="c">if (x < 0) {x += 1.1244;}</source></div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Is the same as:</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Is the same as:</div></td></tr>
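<tr><td colspan="4" style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>A minimal sketch of one possible arithmetic form, assuming x is a float and using a single precision constant (in the spirit of -fsingle-precision-constant). The function names are hypothetical, and the expression is an illustration rather than the article's own wording.

```c
/* Branching form, as in the text. */
float add_if_negative_branch(float x)
{
    if (x < 0) { x += 1.1244f; }
    return x;
}

/* Arithmetic form: (x < 0) evaluates to the int 0 or 1, so the
   constant is added only when x is negative. */
float add_if_negative_arith(float x)
{
    return x + 1.1244f * (float)(x < 0);
}
```

Both functions return the same value for any input; whether the second one actually avoids the flag transfer depends on the code the compiler generates, so it should be checked in the assembly output.</div></td></tr>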
</table>Lollahttps://pandorawiki.org/index.php?title=Floating_Point_Optimization&diff=28096&oldid=prevLolla at 21:59, 9 October 20132013-10-09T21:59:04Z<p></p>
<table class="diff diff-contentalign-left" data-mw="interface">
<col class="diff-marker" />
<col class="diff-content" />
<col class="diff-marker" />
<col class="diff-content" />
<tr class="diff-title" lang="en">
<td colspan="2" style="background-color: #fff; color: #222; text-align: center;">← Older revision</td>
<td colspan="2" style="background-color: #fff; color: #222; text-align: center;">Revision as of 21:59, 9 October 2013</td>
</tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l1" >Line 1:</td>
<td colspan="2" class="diff-lineno">Line 1:</td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Introduction ==</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Introduction ==</div></td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>In the past it was rare for an embedded <del class="diffchange diffchange-inline">[[</del>processor<del class="diffchange diffchange-inline">]] </del>to have dedicated floating point hardware; this usually limited you to either using fixed point math (which can be very tricky to write) or very slow software floating point emulation. Fortunately the <del class="diffchange diffchange-inline">[[</del>ARM<del class="diffchange diffchange-inline">]] </del>Cortex A8 found in the <del class="diffchange diffchange-inline">[[</del>OMAP3<del class="diffchange diffchange-inline">]] </del>has 2 floating point units: a non-pipelined <del class="diffchange diffchange-inline">[[</del>VFP-lite<del class="diffchange diffchange-inline">]] </del>conforming to the IEEE 754 standard for floating point arithmetic and a pipelined SIMD <del class="diffchange diffchange-inline">[[</del>NEON<del class="diffchange diffchange-inline">]] </del>coprocessor. The VFP-lite can handle both single and double precision arithmetic, as well as properly handling exceptions and subnormal numbers. However, due to the full spec compliance and presence of the NEON, it is a relatively slow implementation in the A8, usually taking between 18 and 21 cycles to perform a single precision multiply accumulate. The NEON unit, on the other hand, is designed for very fast single precision vector math; it can sustain multiply accumulates at a rate of two per cycle. 
Efficiently utilizing these coprocessors in <del class="diffchange diffchange-inline">[[</del>GCC<del class="diffchange diffchange-inline">]] </del>will be the focus of this article.</div></td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>In the past it was rare for an embedded processor to have dedicated floating point hardware; this usually limited you to either using fixed point math (which can be very tricky to write) or very slow software floating point emulation. Fortunately the ARM Cortex A8 found in the OMAP3 has 2 floating point units: a non-pipelined VFP-lite conforming to the IEEE 754 standard for floating point arithmetic and a pipelined SIMD NEON coprocessor. The VFP-lite can handle both single and double precision arithmetic, as well as properly handling exceptions and subnormal numbers. However, due to the full spec compliance and presence of the NEON, it is a relatively slow implementation in the A8, usually taking between 18 and 21 cycles to perform a single precision multiply accumulate. The NEON unit, on the other hand, is designed for very fast single precision vector math; it can sustain multiply accumulates at a rate of two per cycle. Efficiently utilizing these coprocessors in GCC will be the focus of this article.</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>'''Note:''' In this article I refer to the A8's integer pipeline as the "ARM", the VFP-lite as simply the "VFP" and the NEON unit as the "NFP".</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>'''Note:''' In this article I refer to the A8's integer pipeline as the "ARM", the VFP-lite as simply the "VFP" and the NEON unit as the "NFP".</div></td></tr>
</table>Lollahttps://pandorawiki.org/index.php?title=Floating_Point_Optimization&diff=7716&oldid=prevABC: formating2011-04-20T15:09:24Z<p>formating</p>
<table class="diff diff-contentalign-left" data-mw="interface">
<col class="diff-marker" />
<col class="diff-content" />
<col class="diff-marker" />
<col class="diff-content" />
<tr class="diff-title" lang="en">
<td colspan="2" style="background-color: #fff; color: #222; text-align: center;">← Older revision</td>
<td colspan="2" style="background-color: #fff; color: #222; text-align: center;">Revision as of 15:09, 20 April 2011</td>
</tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l1" >Line 1:</td>
<td colspan="2" class="diff-lineno">Line 1:</td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Introduction ==</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Introduction ==</div></td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>In the past it was rare for an embedded processor to have dedicated floating point hardware; this usually limited you to either using fixed point math (which can be very tricky to write) or very slow software floating point emulation. Fortunately the ARM Cortex A8 found in the OMAP3 has 2 floating point units: a non-pipelined VFP-lite conforming to the IEEE 754 standard for floating point arithmetic and a pipelined SIMD NEON coprocessor. The VFP-lite can handle both single and double precision arithmetic, as well as properly handling exceptions and subnormal numbers. However, due to the full spec compliance and presence of the NEON, it is a relatively slow implementation in the A8, usually taking between 18 and 21 cycles to perform a single precision multiply accumulate. The NEON unit, on the other hand, is designed for very fast single precision vector math; it can sustain multiply accumulates at a rate of two per cycle. Efficiently utilizing these coprocessors in GCC will be the focus of this article.</div></td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>In the past it was rare for an embedded <ins class="diffchange diffchange-inline">[[</ins>processor<ins class="diffchange diffchange-inline">]] </ins>to have dedicated floating point hardware; this usually limited you to either using fixed point math (which can be very tricky to write) or very slow software floating point emulation. 
Fortunately the <ins class="diffchange diffchange-inline">[[</ins>ARM<ins class="diffchange diffchange-inline">]] </ins>Cortex A8 found in the <ins class="diffchange diffchange-inline">[[</ins>OMAP3<ins class="diffchange diffchange-inline">]] </ins>has 2 floating point units: a non-pipelined <ins class="diffchange diffchange-inline">[[</ins>VFP-lite<ins class="diffchange diffchange-inline">]] </ins>conforming to the IEEE 754 standard for floating point arithmetic and a pipelined SIMD <ins class="diffchange diffchange-inline">[[</ins>NEON<ins class="diffchange diffchange-inline">]] </ins>coprocessor. The VFP-lite can handle both single and double precision arithmetic, as well as properly handling exceptions and subnormal numbers. However, due to the full spec compliance and presence of the NEON, it is a relatively slow implementation in the A8, usually taking between 18 and 21 cycles to perform a single precision multiply accumulate. The NEON unit, on the other hand, is designed for very fast single precision vector math; it can sustain multiply accumulates at a rate of two per cycle. Efficiently utilizing these coprocessors in <ins class="diffchange diffchange-inline">[[</ins>GCC<ins class="diffchange diffchange-inline">]] </ins>will be the focus of this article.</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>'''Note:''' In this article I refer to the A8's integer pipeline as the "ARM", the VFP-lite as simply the "VFP" and the NEON unit as the "NFP".</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>'''Note:''' In this article I refer to the A8's integer pipeline as the "ARM", the VFP-lite as simply the "VFP" and the NEON unit as the "NFP".</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Compiler Support ==</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Compiler Support ==</div></td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>The NEON + VFP-lite is a new design from ARM and hence does not yet have very mature compiler support. At present the Code Sourcery toolchain has the best support since the mainline GCCs do not support NEON yet. Code Sourcery Compiler versions:</div></td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>The NEON + VFP-lite is a new design from ARM and hence does not yet have very mature compiler support. At present the Code Sourcery <ins class="diffchange diffchange-inline">[[</ins>toolchain<ins class="diffchange diffchange-inline">]] </ins>has the best support since the mainline GCCs do not support NEON yet. Code Sourcery Compiler versions:</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* CSL 2007q3: Working NEON, Softfp Support</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* CSL 2007q3: Working NEON, Softfp Support</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* CSL 2008q3: Broken NEON!</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* CSL 2008q3: Broken NEON!</div></td></tr>
<tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l12" >Line 12:</td>
<td colspan="2" class="diff-lineno">Line 12:</td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Generally the CS2007q3 release is recommended; the CSL 2009q1 release is promising but it has not been thoroughly tested yet. One big problem with the current compilers is the heavy dependence on VFP code; currently they only output NEON code when an obvious chance of vectorization is encountered (rarely). Apart from the esoteric rounding, vector, etc. modes of the VFP (most of which compilers don't use) and predication (used occasionally), most VFP floating point instructions can be exactly replicated using an order of magnitude faster NEON instructions. In fact it has been reported to me that the GCC packaged with the iPhone 3GS SDK does exactly this. Hopefully future compilers will support this feature. </div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Generally the CS2007q3 release is recommended; the CSL 2009q1 release is promising but it has not been thoroughly tested yet. One big problem with the current compilers is the heavy dependence on VFP code; currently they only output NEON code when an obvious chance of vectorization is encountered (rarely). Apart from the esoteric rounding, vector, etc. modes of the VFP (most of which compilers don't use) and predication (used occasionally), most VFP floating point instructions can be exactly replicated using an order of magnitude faster NEON instructions. In fact it has been reported to me that the GCC packaged with the iPhone 3GS SDK does exactly this. Hopefully future compilers will support this feature. </div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>In order to instruct the compiler to produce NEON or VFP code you should use the following compile flags: -mfpu=neon or -mfpu=vfp<del class="diffchange diffchange-inline">. </del>Unfortunately the CSL 2007 / 2008 toolchains do not support the passing of values in floating point registers (I talk about this some more in the Transfers section), so you must select the soft-float calling convention via -mfloat-abi=softfp (FP instructions are still generated, but values are passed in integer registers). The CSL 2009q1 release is the first release to support the passing of values in FP registers (AKA hardfp) via the -mfloat-abi=hard compile flag. Note that hardfp-compiled binaries are not compatible with softfp ones and vice versa, so make sure your libraries have the correct ABI. Additionally, if you want the compiler to attempt to vectorize your integer / floating point code for the NEON you should add -ftree-vectorize to your flags. </div></td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>In order to instruct the compiler to produce NEON or VFP code you should use the following compile flags: <ins class="diffchange diffchange-inline"><pre></ins>-mfpu=neon or -mfpu=vfp<ins class="diffchange diffchange-inline"></pre></ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>Unfortunately the CSL 2007 / 2008 toolchains do not support the passing of values in floating point registers (I talk about this some more in the Transfers section), so you must select the soft-float calling convention via -mfloat-abi=softfp (FP instructions are still generated, but values are passed in integer registers). The CSL 2009q1 release is the first release to support the passing of values in FP registers (AKA hardfp) via the -mfloat-abi=hard compile flag. Note that hardfp-compiled binaries are not compatible with softfp ones and vice versa, so make sure your libraries have the correct ABI. Additionally, if you want the compiler to attempt to vectorize your integer / floating point code for the NEON you should add -ftree-vectorize to your flags. </div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>Therefore I recommend the following flags: <del class="diffchange diffchange-inline">'''</del>-O3 -mcpu=cortex-a8 -mfpu=neon -ftree-vectorize -mfloat-abi=(softfp|hard) -ffast-math -fsingle-precision-constant<del class="diffchange diffchange-inline">''' </del>where -mfloat-abi=hard is used for the CSL 2009q1 release and softfp for all the others.</div></td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>Therefore I recommend the following flags:</div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins class="diffchange diffchange-inline"><pre></ins>-O3 -mcpu=cortex-a8 -mfpu=neon -ftree-vectorize -mfloat-abi=(softfp|hard) -ffast-math -fsingle-precision-constant<ins class="diffchange diffchange-inline"></pre></ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>where -mfloat-abi=hard is used for the CSL 2009q1 release and softfp for all the others.</div></td></tr>
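<tr><td colspan="4" style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>To illustrate what these flags act on, here is a minimal sketch of a loop that the vectorizer has a reasonable chance of handling. The function name scale_add and its parameters are hypothetical and only serve the example; whether a given toolchain actually emits NEON code for it has to be checked in the generated assembly.

```c
#include <stddef.h>

/* Hypothetical helper (not from the article): dst[i] += k * src[i] * 0.5f.
 * 'restrict' promises the compiler that dst and src do not overlap,
 * which removes one common obstacle to -ftree-vectorize. */
void scale_add(float *restrict dst, const float *restrict src,
               float k, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        /* 0.5f is an explicit single precision constant. A bare 0.5 is a
         * double in C and would pull the expression into double precision
         * unless -fsingle-precision-constant is in effect. */
        dst[i] = dst[i] + k * src[i] * 0.5f;
    }
}
```

Writing the f suffix explicitly keeps the arithmetic in single precision even when a file is built without -fsingle-precision-constant, so the code does not silently depend on the flag. The block needs a C99 compiler mode because of restrict.</div></td></tr>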
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== VFP-Lite RunFast ==</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== VFP-Lite RunFast ==</div></td></tr>
<tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l75" >Line 75:</td>
<td colspan="2" class="diff-lineno">Line 78:</td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== NFP / VFP to ARM Transfers ==</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== NFP / VFP to ARM Transfers ==</div></td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>Probably the biggest bottleneck in the architecture is that in order to transfer a number from the VFP / NFP registers onto the ARM you must stall both the ARM and NFP / VFP for more than 20 cycles. This is particularly troublesome because this is how GCC (except the CSL 2009q1 release) supplies arguments to and receives return values from functions. Possibly the best way to minimize operand passing stalls is to make the floating point functions inline.</div></td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>Probably the biggest bottleneck in the architecture is that in order to transfer a number from the <ins class="diffchange diffchange-inline">[[</ins>VFP<ins class="diffchange diffchange-inline">]] </ins>/ NFP registers onto the ARM you must stall both the <ins class="diffchange diffchange-inline">[[</ins>ARM<ins class="diffchange diffchange-inline">]] </ins>and <ins class="diffchange diffchange-inline">[[</ins>NFP<ins class="diffchange diffchange-inline">]] </ins>/ <ins class="diffchange diffchange-inline">[[</ins>VFP<ins class="diffchange diffchange-inline">]] </ins>for more than 20 cycles. This is particularly troublesome because this is how GCC (except the CSL 2009q1 release) supplies arguments to and receives return values from functions. Possibly the best way to minimize operand passing stalls is to make the floating point functions inline.</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>Another source of NFP / VFP to ARM transfers are conditional branches that depend on floating point numbers. You can evaluate the condition on the VFP, but in order to branch, the flags must be sent from the VFP to the ARM. For very simple branches your best bet is to not branch at all and instead use arithmetic. For example:</div></td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>Another source of <ins class="diffchange diffchange-inline">[[</ins>NFP<ins class="diffchange diffchange-inline">]] </ins>/ <ins class="diffchange diffchange-inline">[[</ins>VFP<ins class="diffchange diffchange-inline">]] </ins>to <ins class="diffchange diffchange-inline">[[</ins>ARM<ins class="diffchange diffchange-inline">]] </ins>transfers are conditional branches that depend on floating point numbers. You can evaluate the condition on the VFP, but in order to branch, the flags must be sent from the VFP to the ARM. For very simple branches your best bet is to not branch at all and instead use arithmetic. For example:</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div><source lang="c">if (x < 0) {x += 1.1244;}</source></div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div><source lang="c">if (x < 0) {x += 1.1244;}</source></div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Is the same as:</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Is the same as:</div></td></tr>
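<tr><td colspan="4" style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>One possible arithmetic form is sketched here as an independent illustration, not as the article's own listing. The function names bump_branch and bump_arith are hypothetical, and the constant is written in single precision as an assumption.

```c
/* Branching form, matching the snippet in the text. */
float bump_branch(float x)
{
    if (x < 0.0f) { x += 1.1244f; }
    return x;
}

/* Arithmetic form: in C the comparison (x < 0.0f) evaluates to the
 * integer 1 or 0, so the constant is multiplied by 1.0f for negative x
 * and by 0.0f otherwise. */
float bump_arith(float x)
{
    x += 1.1244f * (float)(x < 0.0f);
    return x;
}
```

Both functions return the same value for ordinary inputs. Whether the arithmetic form really avoids the flag transfer depends on the compiler and flags in use, so it is worth inspecting the generated assembly before relying on it.</div></td></tr>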
<tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l109" >Line 109:</td>
<td colspan="2" class="diff-lineno">Line 112:</td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== NEON SIMD ==</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== NEON SIMD ==</div></td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>The NEON unit is similar to the MMX and SSE extensions found on x86 processors; it is optimized for Single Instruction Multiple Data (SIMD) operations.</div></td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>The <ins class="diffchange diffchange-inline">[[</ins>NEON<ins class="diffchange diffchange-inline">]] </ins>unit is similar to the MMX and SSE extensions found on x86 processors; it is optimized for Single Instruction Multiple Data (SIMD) operations.</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The NEON unit has two floating point pipelines, an integer pipeline and a 128-bit load/store/permute pipeline. When properly utilized it is a very powerful coprocessor. Unfortunately GCC does a rather poor job of vectorizing code for the NEON unit. To get the best performance you should use either the intrinsics provided in the "arm_neon.h" header or hand-written assembly. </div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The NEON unit has two floating point pipelines, an integer pipeline and a 128-bit load/store/permute pipeline. When properly utilized it is a very powerful coprocessor. Unfortunately GCC does a rather poor job of vectorizing code for the NEON unit. To get the best performance you should use either the intrinsics provided in the "arm_neon.h" header or hand-written assembly. </div></td></tr>
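<tr><td colspan="4" style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>As a rough sketch of the intrinsics route, the following multiply-accumulate processes four floats per iteration with "arm_neon.h" and falls back to plain C elsewhere. The function name mul_acc is hypothetical, and the preprocessor guard is an assumption about how the toolchain advertises NEON support; it is an illustration rather than tuned code.

```c
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper (not from the article): acc[i] += a[i] * b[i]. */
void mul_acc(float *acc, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Four single precision lanes per 128-bit NEON register. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va   = vld1q_f32(a + i);
        float32x4_t vb   = vld1q_f32(b + i);
        float32x4_t vacc = vld1q_f32(acc + i);
        vacc = vmlaq_f32(vacc, va, vb);   /* vacc + va * vb, per lane */
        vst1q_f32(acc + i, vacc);
    }
#endif
    /* Scalar tail; also the whole loop when NEON is unavailable. */
    for (; i < n; ++i) {
        acc[i] += a[i] * b[i];
    }
}
```

Keeping the scalar tail in the same function means the routine accepts any length, and the same source still builds on a desktop machine for testing.</div></td></tr>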
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Summary ==</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Summary ==</div></td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>It's often said amongst software developers that you 'may as well not bother trying to outperform a compiler'. Whilst there is a grain of truth in this where x86 is concerned, this is definitely not the case with floating point on the ARM Cortex A8. In fact it is almost the opposite: you can almost always make significant gains by targeting the NEON. Therefore, in order to achieve the best floating point performance on the Pandora (or any ARM Cortex A8 device):</div></td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>It's often said amongst software developers that you 'may as well not bother trying to outperform a compiler'. Whilst there is a grain of truth in this where x86 is concerned, this is definitely not the case with floating point on the ARM Cortex A8. In fact it is almost the opposite: you can almost always make significant gains by targeting the <ins class="diffchange diffchange-inline">[[</ins>NEON<ins class="diffchange diffchange-inline">]]</ins>. Therefore, in order to achieve the best floating point performance on the Pandora (or any ARM Cortex A8 device):</div></td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>* Use the CodeSourcery 2007q3 or 2009q1 releases and these flags<del class="diffchange diffchange-inline">: </del>-O3 -mcpu=cortex-a8 -mfpu=neon -ftree-vectorize -mfloat-abi=(softfp|hard) -ffast-math -fsingle-precision-constant</div></td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>* Use the CodeSourcery 2007q3 or 2009q1 releases and these flags</div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins class="diffchange diffchange-inline"><pre> </ins>-O3 -mcpu=cortex-a8 -mfpu=neon -ftree-vectorize -mfloat-abi=(softfp|hard) -ffast-math -fsingle-precision-constant<ins class="diffchange diffchange-inline"></pre></ins></div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* Only use single precision floating point</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* Only use single precision floating point</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* Use NEON intrinsics / ASM whenever you find a bottlenecking FP function. You can do better than the compiler.</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* Use NEON intrinsics / ASM whenever you find a bottlenecking FP function. You can do better than the compiler.</div></td></tr>
<tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l125" >Line 125:</td>
<td colspan="2" class="diff-lineno">Line 129:</td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>[[Category:Development]]</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>[[Category:Development]]</div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">[[Category:Chipset]]</ins></div></td></tr>
</table>ABChttps://pandorawiki.org/index.php?title=Floating_Point_Optimization&diff=2216&oldid=prevTor: Minor fixes: out perform -> outperform, infact -> in fact2010-03-11T10:31:13Z<p>Minor fixes: out perform -> outperform, infact -> in fact</p>
<table class="diff diff-contentalign-left" data-mw="interface">
<col class="diff-marker" />
<col class="diff-content" />
<col class="diff-marker" />
<col class="diff-content" />
<tr class="diff-title" lang="en">
<td colspan="2" style="background-color: #fff; color: #222; text-align: center;">← Older revision</td>
<td colspan="2" style="background-color: #fff; color: #222; text-align: center;">Revision as of 10:31, 11 March 2010</td>
</tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l113" >Line 113:</td>
<td colspan="2" class="diff-lineno">Line 113:</td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Summary ==</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Summary ==</div></td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>It's often said amongst software developers that you 'may as well not bother trying to <del class="diffchange diffchange-inline">out perform </del>a compiler'. Whilst there is a grain of truth in this where x86 is concerned, this is definitely not the case with floating point on the ARM Cortex A8. <del class="diffchange diffchange-inline">Infact </del>it is almost the opposite: you can almost always make significant gains by targeting the NEON. Therefore, in order to achieve the best floating point performance on the Pandora (or any ARM Cortex A8 device):</div></td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>It's often said amongst software developers that you 'may as well not bother trying to <ins class="diffchange diffchange-inline">outperform </ins>a compiler'. Whilst there is a grain of truth in this where x86 is concerned, this is definitely not the case with floating point on the ARM Cortex A8. <ins class="diffchange diffchange-inline">In fact </ins>it is almost the opposite: you can almost always make significant gains by targeting the NEON. Therefore, in order to achieve the best floating point performance on the Pandora (or any ARM Cortex A8 device):</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* Use the CodeSourcery 2007q3 or 2009q1 releases and these flags: -O3 -mcpu=cortex-a8 -mfpu=neon -ftree-vectorize -mfloat-abi=(softfp|hard) -ffast-math -fsingle-precision-constant</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* Use the CodeSourcery 2007q3 or 2009q1 releases and these flags: -O3 -mcpu=cortex-a8 -mfpu=neon -ftree-vectorize -mfloat-abi=(softfp|hard) -ffast-math -fsingle-precision-constant</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* Only use single precision floating point</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* Only use single precision floating point</div></td></tr>
</table>Torhttps://pandorawiki.org/index.php?title=Floating_Point_Optimization&diff=2215&oldid=prevTor: Fixed some minor text quirks2010-03-11T10:23:39Z<p>Fixed some minor text quirks</p>
<table class="diff diff-contentalign-left" data-mw="interface">
<col class="diff-marker" />
<col class="diff-content" />
<col class="diff-marker" />
<col class="diff-content" />
<tr class="diff-title" lang="en">
<td colspan="2" style="background-color: #fff; color: #222; text-align: center;">← Older revision</td>
<td colspan="2" style="background-color: #fff; color: #222; text-align: center;">Revision as of 10:23, 11 March 2010</td>
</tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l12" >Line 12:</td>
<td colspan="2" class="diff-lineno">Line 12:</td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Generally the CS2007q3 release is recommended; the CSL 2009q1 release is promising but has not been thoroughly tested yet. One big problem with the current compilers is the heavy dependence on VFP code: currently they only output NEON code when an obvious chance of vectorization is encountered (rarely). Apart from the esoteric rounding modes, vector modes, etc. of the VFP (most of which compilers don't use) and predication (used occasionally), most VFP floating point instructions can be exactly replicated using NEON instructions that are an order of magnitude faster. In fact, it has been reported to me that the GCC packaged with the iPhone 3GS SDK does exactly this. Hopefully future compilers will support this feature. </div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Generally the CS2007q3 release is recommended; the CSL 2009q1 release is promising but has not been thoroughly tested yet. One big problem with the current compilers is the heavy dependence on VFP code: currently they only output NEON code when an obvious chance of vectorization is encountered (rarely). Apart from the esoteric rounding modes, vector modes, etc. of the VFP (most of which compilers don't use) and predication (used occasionally), most VFP floating point instructions can be exactly replicated using NEON instructions that are an order of magnitude faster. In fact, it has been reported to me that the GCC packaged with the iPhone 3GS SDK does exactly this. Hopefully future compilers will support this feature. </div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del class="diffchange diffchange-inline">Inorder </del>to instruct the compiler to produce NEON or VFP code you should use the following compile flags: -mfpu=neon or -mfpu=vfp. Unfortunately the CSL 2007 / 2008 toolchains do not support the passing of values in floating point registers (I talk about this some more in the Transfers section), so you must specify a software ABI via -mfloat-abi=softfp. The CSL 2009q1 release is the first release to support the passing of values in FP registers (AKA hardfp) via the -mfloat-abi=hard compile flag. Note that hardfp-compiled binaries are not compatible with softfp ones and vice versa, so make sure your libraries have the correct ABI. Additionally, if you want the compiler to attempt to vectorize your integer / floating point code for the NEON you should add: -ftree-vectorize to your flags. </div></td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins class="diffchange diffchange-inline">In order </ins>to instruct the compiler to produce NEON or VFP code you should use the following compile flags: -mfpu=neon or -mfpu=vfp. Unfortunately the CSL 2007 / 2008 toolchains do not support the passing of values in floating point registers (I talk about this some more in the Transfers section), so you must specify a software ABI via -mfloat-abi=softfp. The CSL 2009q1 release is the first release to support the passing of values in FP registers (AKA hardfp) via the -mfloat-abi=hard compile flag. Note that hardfp-compiled binaries are not compatible with softfp ones and vice versa, so make sure your libraries have the correct ABI. 
Additionally, if you want the compiler to attempt to vectorize your integer / floating point code for the NEON you should add: -ftree-vectorize to your flags. </div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Therefore I recommend the following flags: '''-O3 -mcpu=cortex-a8 -mfpu=neon -ftree-vectorize -mfloat-abi=(softfp|hard) -ffast-math -fsingle-precision-constant''' with -mfloat-abi=hard for the CSL 2009q1 release and -mfloat-abi=softfp for all the others.</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Therefore I recommend the following flags: '''-O3 -mcpu=cortex-a8 -mfpu=neon -ftree-vectorize -mfloat-abi=(softfp|hard) -ffast-math -fsingle-precision-constant''' with -mfloat-abi=hard for the CSL 2009q1 release and -mfloat-abi=softfp for all the others.</div></td></tr>
<tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l19" >Line 19:</td>
<td colspan="2" class="diff-lineno">Line 19:</td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Under the correct circumstances some of the VFP's instructions will be executed in the NEON coprocessor. Unfortunately this does not gain the full benefit of the NEON: it still takes 7 cycles for an FMAC / FMUL / FADD. Due to this quirk you will likely get better scalar performance by accessing the NEON directly via intrinsics or ASM.</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Under the correct circumstances some of the VFP's instructions will be executed in the NEON coprocessor. Unfortunately this does not gain the full benefit of the NEON: it still takes 7 cycles for an FMAC / FMUL / FADD. Due to this quirk you will likely get better scalar performance by accessing the NEON directly via intrinsics or ASM.</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del class="diffchange diffchange-inline">Inorder </del>for VFP instructions to execute in the NFP the following constraints must be met:</div></td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins class="diffchange diffchange-inline">In order </ins>for VFP instructions to execute in the NFP the following constraints must be met:</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* RunFast mode must be enabled</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* RunFast mode must be enabled</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* Must be single precision floating point operands</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* Must be single precision floating point operands</div></td></tr>
<tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l72" >Line 72:</td>
<td colspan="2" class="diff-lineno">Line 72:</td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>You can enforce single precision constants by including the compiler flag '''-fsingle-precision-constant'''. Alternatively, you can append an 'f' to the end of each constant, e.g. 2.123f</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>You can enforce single precision constants by including the compiler flag '''-fsingle-precision-constant'''. Alternatively, you can append an 'f' to the end of each constant, e.g. 2.123f</div></td></tr>
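<tr><td colspan="4" style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>A minimal sketch of what the 'f' suffix changes, assuming a build without -fsingle-precision-constant. The function names are hypothetical and exist only for illustration; sizeof is used because it reports the size of an expression's type without evaluating it.

```c
#include <assert.h>

/* Unsuffixed constant: 2.123 has type double, so x is promoted to
 * double, multiplied, and the result converted back to float. */
float scale_promoted(float x)
{
    return x * 2.123;
}

/* Suffixed constant: 2.123f has type float, so the multiply stays in
 * single precision from start to finish. */
float scale_single(float x)
{
    return x * 2.123f;
}

/* Checks the type of each product through its size. The second
 * assertion assumes default constant typing, i.e. that the code is
 * NOT built with -fsingle-precision-constant. */
void check_constant_types(void)
{
    float x = 1.0f;
    assert(sizeof(x * 2.123f) == sizeof(float));
    assert(sizeof(x * 2.123) == sizeof(double));
}
```

With -fsingle-precision-constant both products would be computed in single precision, so the second assertion in check_constant_types() only holds without that flag.</div></td></tr>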
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>Another thing to watch out for is the double versions of the functions in libm (sin, exp, sqrt, etc.). By default these functions operate on double precision floating point values and suffer the same problems as the constants. Luckily libm supplies single precision versions <del class="diffchange diffchange-inline">aswell</del>, they can be accessed by appending an 'f' to the end of the function name, e.g. sinf(), expf(), sqrtf().</div></td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>Another thing to watch out for is the double versions of the functions in libm (sin, exp, sqrt, etc.). By default these functions operate on double precision floating point values and suffer the same problems as the constants. Luckily libm supplies single precision versions <ins class="diffchange diffchange-inline">as well</ins>, they can be accessed by appending an 'f' to the end of the function name, e.g. sinf(), expf(), sqrtf().</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== NFP / VFP to ARM Transfers ==</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== NFP / VFP to ARM Transfers ==</div></td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>Probably the biggest bottleneck in the architecture is that <del class="diffchange diffchange-inline">inorder </del>to transfer a number from the VFP / NFP registers onto the ARM you must stall both the ARM and NFP / VFP for >20 cycles. This is particularly troublesome because this is how GCC (except the CSL 2009q1 release) supplies arguments and receives returns from functions. Possibly the best way to minimize operand passing stalls is to make the floating point functions inline.</div></td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>Probably the biggest bottleneck in the architecture is that <ins class="diffchange diffchange-inline">in order </ins>to transfer a number from the VFP / NFP registers onto the ARM you must stall both the ARM and NFP / VFP for >20 cycles. This is particularly troublesome because this is how GCC (except the CSL 2009q1 release) supplies arguments and receives returns from functions. Possibly the best way to minimize operand passing stalls is to make the floating point functions inline.</div></td></tr>
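<tr><td colspan="4" style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>As a rough sketch of the inlining advice, assuming a softfp build: the helper lerp() and the caller lerp_array() are hypothetical names chosen for illustration. Once the helper is expanded at the call site, its float operands and result no longer have to cross a function call boundary for every element.

```c
#include <assert.h>
#include <stddef.h>

/* Small FP helper. As a static inline function it can be expanded at
 * the call site, so its float arguments and result need not be passed
 * through the calling convention on every call. */
static inline float lerp(float a, float b, float t)
{
    return a + (b - a) * t;
}

/* Interpolates n elements. The arrays are passed as pointers and t is
 * passed by value once per call rather than once per element. */
void lerp_array(float *dst, const float *a, const float *b, float t, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < n; ++i) {
        dst[i] = lerp(a[i], b[i], t);
    }
}
```

Note that inline is only a hint to the compiler. GCC also provides __attribute__((always_inline)) for cases where the hint is not taken; whether that is worthwhile depends on the size of the function.</div></td></tr>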
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>Another source of NFP / VFP to ARM transfers is conditional branches that depend on floating point numbers. You can evaluate the condition on the VFP but <del class="diffchange diffchange-inline">inorder </del>to branch the flags must be sent from the VFP to the ARM. For very simple branches your best bet is to not branch at all and instead use arithmetic, e.g.</div></td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>Another source of NFP / VFP to ARM transfers is conditional branches that depend on floating point numbers. You can evaluate the condition on the VFP but <ins class="diffchange diffchange-inline">in order </ins>to branch the flags must be sent from the VFP to the ARM. For very simple branches your best bet is to not branch at all and instead use arithmetic, e.g.</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div><source lang="c">if (x < 0) {x += 1.1244;}</source></div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div><source lang="c">if (x < 0) {x += 1.1244;}</source></div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Is the same as:</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Is the same as:</div></td></tr>
<tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l113" >Line 113:</td>
<td colspan="2" class="diff-lineno">Line 113:</td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Summary ==</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Summary ==</div></td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>It's often said amongst software developers that you 'may <del class="diffchange diffchange-inline">aswell </del>not bother trying to out perform a compiler', whilst there is a grain of truth in this where X86 is concerned, this is definitely not the case with Floating point on the ARM Cortex A8. Infact it is almost the opposite, you can almost always make significant gains via targeting the NEON. Therefore, <del class="diffchange diffchange-inline">Inorder </del>to achieve the best floating point performance on the Pandora (or ARM Cortex A8 device):</div></td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>It's often said amongst software developers that you 'may <ins class="diffchange diffchange-inline">as well </ins>not bother trying to out perform a compiler', whilst there is a grain of truth in this where X86 is concerned, this is definitely not the case with Floating point on the ARM Cortex A8. Infact it is almost the opposite, you can almost always make significant gains via targeting the NEON. Therefore, <ins class="diffchange diffchange-inline">In order </ins>to achieve the best floating point performance on the Pandora (or ARM Cortex A8 device):</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* Use the CodeSourcery 2007q3 or 2009q1 releases and these flags: -O3 -mcpu=cortex-a8 -mfpu=neon -ftree-vectorize -mfloat-abi=(softfp|hard) -ffast-math -fsingle-precision-constant</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* Use the CodeSourcery 2007q3 or 2009q1 releases and these flags: -O3 -mcpu=cortex-a8 -mfpu=neon -ftree-vectorize -mfloat-abi=(softfp|hard) -ffast-math -fsingle-precision-constant</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* Only use single precision floating point</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* Only use single precision floating point</div></td></tr>
</table>Torhttps://pandorawiki.org/index.php?title=Floating_Point_Optimization&diff=1998&oldid=prevGlenn: +cat2010-01-10T20:42:13Z<p>+cat</p>
<table class="diff diff-contentalign-left" data-mw="interface">
<col class="diff-marker" />
<col class="diff-content" />
<col class="diff-marker" />
<col class="diff-content" />
<tr class="diff-title" lang="en">
<td colspan="2" style="background-color: #fff; color: #222; text-align: center;">← Older revision</td>
<td colspan="2" style="background-color: #fff; color: #222; text-align: center;">Revision as of 20:42, 10 January 2010</td>
</tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l123" >Line 123:</td>
<td colspan="2" class="diff-lineno">Line 123:</td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* Inline floating point code (unless it's very large)</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* Inline floating point code (unless it's very large)</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* Pass FP arguments via pointers instead of by value and do integer work in between function calls.</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* Pass FP arguments via pointers instead of by value and do integer work in between function calls.</div></td></tr>
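<tr><td colspan="4" style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The pointer-passing advice can be sketched as follows, assuming a softfp build. Both function names are hypothetical and the dot product is just a convenient example; the point is only the difference in how the operands reach the function.

```c
#include <assert.h>
#include <stddef.h>

/* By value: under the softfp ABI each float argument, and the float
 * return value, is passed through the integer side of the calling
 * convention rather than in FP registers. */
float dot3_by_value(float ax, float ay, float az,
                    float bx, float by, float bz)
{
    return ax * bx + ay * by + az * bz;
}

/* By pointer: only addresses cross the call boundary. The operands
 * are loaded from memory and the result is stored back to memory. */
void dot3_by_pointer(const float *a, const float *b, float *out)
{
    assert(a != NULL && b != NULL && out != NULL);
    *out = a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}
```

The caller can then do integer work before reading *out, which is the second half of the bullet above.</div></td></tr>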
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;"></ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">[[Category:Development]]</ins></div></td></tr>
</table>Glennhttps://pandorawiki.org/index.php?title=Floating_Point_Optimization&diff=1437&oldid=prevAdventus: /* Summary */2009-08-01T01:06:56Z<p><span dir="auto"><span class="autocomment">Summary</span></span></p>
<table class="diff diff-contentalign-left" data-mw="interface">
<col class="diff-marker" />
<col class="diff-content" />
<col class="diff-marker" />
<col class="diff-content" />
<tr class="diff-title" lang="en">
<td colspan="2" style="background-color: #fff; color: #222; text-align: center;">← Older revision</td>
<td colspan="2" style="background-color: #fff; color: #222; text-align: center;">Revision as of 01:06, 1 August 2009</td>
</tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l113" >Line 113:</td>
<td colspan="2" class="diff-lineno">Line 113:</td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Summary ==</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Summary ==</div></td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>Therefore, Inorder to achieve the best floating point performance on the Pandora:</div></td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins class="diffchange diffchange-inline">It's often said amongst software developers that you 'may aswell not bother trying to out perform a compiler', whilst there is a grain of truth in this where X86 is concerned, this is definitely not the case with Floating point on the ARM Cortex A8. Infact it is almost the opposite, you can almost always make significant gains via targeting the NEON. </ins>Therefore, Inorder to achieve the best floating point performance on the Pandora <ins class="diffchange diffchange-inline">(or ARM Cortex A8 device)</ins>:</div></td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>* Use the CodeSourcery 2007q3 <del class="diffchange diffchange-inline">release </del>and these flags: -O3 -mcpu=cortex-a8 -mfpu=neon -ftree-vectorize -mfloat-abi=softfp -ffast-math -fsingle-precision-constant</div></td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>* Use the CodeSourcery 2007q3 <ins class="diffchange diffchange-inline">or 2009q1 releases </ins>and these flags: -O3 -mcpu=cortex-a8 -mfpu=neon -ftree-vectorize -mfloat-abi=<ins class="diffchange diffchange-inline">(</ins>softfp<ins class="diffchange diffchange-inline">|hard) </ins>-ffast-math -fsingle-precision-constant</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* Only use single precision floating point</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* Only use single precision floating point</div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">* Use NEON intrinsics / ASM whenever you find a bottlenecking FP function. You can do better than the compiler.</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">* Minimize Conditional Branches</ins></div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* Enable RunFast mode</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* Enable RunFast mode</div></td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del class="diffchange diffchange-inline">* Use NEON intrinsics / ASM for vector, or even scalar, code.</del></div></td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div> </div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins class="diffchange diffchange-inline">For softfp:</ins></div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* Inline floating point code (unless it's very large)</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* Inline floating point code (unless it's very large)</div></td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del class="diffchange diffchange-inline">* Minimize Conditional Branches</del></div></td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>* Pass <ins class="diffchange diffchange-inline">FP arguments </ins>via pointers instead of by value and do integer work in between function calls.</div></td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>* Pass <del class="diffchange diffchange-inline">Arguments </del>via pointers instead of by value and do integer work in between function calls.</div></td><td colspan="2"> </td></tr>
</table>Adventushttps://pandorawiki.org/index.php?title=Floating_Point_Optimization&diff=1436&oldid=prevAdventus: /* NFP / VFP to ARM Transfers */2009-08-01T00:58:13Z<p><span dir="auto"><span class="autocomment">NFP / VFP to ARM Transfers</span></span></p>
<table class="diff diff-contentalign-left" data-mw="interface">
<col class="diff-marker" />
<col class="diff-content" />
<col class="diff-marker" />
<col class="diff-content" />
<tr class="diff-title" lang="en">
<td colspan="2" style="background-color: #fff; color: #222; text-align: center;">← Older revision</td>
<td colspan="2" style="background-color: #fff; color: #222; text-align: center;">Revision as of 00:58, 1 August 2009</td>
</tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l105" >Line 105:</td>
<td colspan="2" class="diff-lineno">Line 105:</td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>}</source></div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>}</source></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;"></ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">The last common source of transfers is casting a floating point value to an integer. By default all integer work is done in the ARM pipeline, so a transfer operation occurs. This is particularly problematic for complex algorithms that rely on bitwise or rounding operations on floating point numbers; e.g., almost all the functions in cmath depend on range reduction (rounding). A smart compiler would recognize that these operations can almost always be done in NEON's integer pipeline.</ins></div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== NEON SIMD ==</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #222; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== NEON SIMD ==</div></td></tr>
</table>Adventus
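
The cast-induced transfer described in the 1 August 2009 revision can be illustrated with a small range-reduction sketch. This is not code from the wiki page: the helper name <code>reduce4</code> and its constants are hypothetical. The scalar branch shows the float-to-int cast that, per the text above, tends to pull the value into the ARM pipeline; the NEON branch does the same rounding with conversion intrinsics so the values can stay in NEON registers, assuming the compiler does not move them out on its own.

```c
#include <assert.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: reduce four angles to roughly [-pi, pi] by
 * subtracting the nearest multiple of 2*pi. */
static void reduce4(const float *in, float *out)
{
    const float two_pi     = 6.2831853f;
    const float inv_two_pi = 0.15915494f;

    assert(in && out);

#if HAVE_NEON
    float32x4_t x = vld1q_f32(in);
    float32x4_t t = vmulq_n_f32(x, inv_two_pi);

    /* Add +0.5 or -0.5 (matching the sign of t) so that the truncating
     * conversion below rounds to the nearest integer. */
    uint32x4_t  neg  = vcltq_f32(t, vdupq_n_f32(0.0f));
    float32x4_t half = vbslq_f32(neg, vdupq_n_f32(-0.5f), vdupq_n_f32(0.5f));

    /* float -> int -> float, all on NEON vector registers. */
    int32x4_t   k  = vcvtq_s32_f32(vaddq_f32(t, half));
    float32x4_t kf = vcvtq_f32_s32(k);

    /* out = x - kf * two_pi */
    vst1q_f32(out, vmlsq_n_f32(x, kf, two_pi));
#else
    /* Scalar fallback: the (int) cast is the pattern that typically
     * causes a NEON/VFP to ARM transfer on the Cortex-A8. */
    for (int i = 0; i < 4; ++i) {
        float t = in[i] * inv_two_pi;
        int   k = (int)(t + (t < 0.0f ? -0.5f : 0.5f));
        out[i] = in[i] - (float)k * two_pi;
    }
#endif
}
```

Calling <code>reduce4</code> on an array of four floats writes the reduced angles to a second four-element array. Whether the NEON branch is actually faster depends on the compiler and the surrounding code, so it is worth checking the generated assembly for transfers between register files.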