Floating Point Optimization

From Pandora Wiki
Revision as of 06:57, 25 July 2009 by Adventus (talk | contribs)

Introduction

In the past it was rare for an embedded processor to have dedicated floating point hardware. This usually limited you to either fixed point math (which can be very tricky to write) or very slow software floating point emulation. Fortunately the ARM Cortex A8 found in the OMAP3 has two floating point units: a non-pipelined VFP-lite conforming to the IEEE754 standard for floating point arithmetic, and a pipelined SIMD NEON coprocessor. The VFP-lite can handle both single and double precision arithmetic, and it properly handles exceptions and subnormal numbers. However, this full spec compliance and the limited die space available have resulted in a relatively slow implementation: it usually takes from 18 to 21 cycles to perform a single precision multiply accumulate. The NEON unit, on the other hand, is designed for very fast single precision vector math and can sustain multiply accumulates at a rate of two per cycle. Efficiently utilizing these coprocessors in GCC is the focus of this article.

Note: In this article I refer to the A8's integer pipeline as the "ARM", the VFP-lite as simply the "VFP" and the NEON unit as the "NFP".

VFP-Lite RunFast

The VFP-Lite has one saving grace: under the correct circumstances some of its instructions will be executed in the NEON coprocessor and gain the full benefits of doing so. In order for this to occur the following constraints must be met:

  • RunFast mode must be enabled
  • Must be single precision floating point operands
  • Must not be a vector instruction (GCC doesn't appear to use this feature, so don't worry about it)

RunFast mode is enabled when the following conditions are present:

  • subnormal numbers are being flushed to zero
  • default NaN mode is active
  • no floating point exceptions are enabled
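
The three conditions above map onto bits of the FPSCR register, so they can be checked on a value read from it. The following is only a sketch: the helper name fpscr_is_runfast is hypothetical, and the bit positions (FZ at bit 24, DN at bit 25, trap enables at bits 8 to 12 and 15) are taken from the ARM architecture documentation rather than from this article.

```c
#include <stdbool.h>
#include <stdint.h>

/* FPSCR bit positions, assumed from the ARM architecture documentation. */
#define FPSCR_FZ (1u << 24)            /* flush subnormals to zero */
#define FPSCR_DN (1u << 25)            /* default NaN mode */
/* Trap enable bits: IOE(8), DZE(9), OFE(10), UFE(11), IXE(12), IDE(15). */
#define FPSCR_TRAP_ENABLES 0x00009F00u

/* Hypothetical helper: true if the given FPSCR value meets all three
 * RunFast conditions listed above. */
static bool fpscr_is_runfast(uint32_t fpscr)
{
    return (fpscr & FPSCR_FZ) != 0
        && (fpscr & FPSCR_DN) != 0
        && (fpscr & FPSCR_TRAP_ENABLES) == 0;
}
```

The function takes a plain integer so it can be exercised on any machine; on the Pandora the argument would be the value read with fmrx from fpscr.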


At present I am not sure whether RunFast mode will be enabled by default in the Angstrom distribution. If it is not, you can use the following C code to enforce it:

void enable_runfast()
{
	static const unsigned int x = 0x04086060;
	static const unsigned int y = 0x03000000;
	int r;
	asm volatile (
		"fmrx	%0, fpscr			\n\t"	//%0 = FPSCR
		"and	%0, %0, %1			\n\t"	//%0 = %0 & 0x04086060
		"orr	%0, %0, %2			\n\t"	//%0 = %0 | 0x03000000
		"fmxr	fpscr, %0			\n\t"	//FPSCR = %0
		: "=&r"(r)	//early clobber: %0 is written before %1 and %2 are read
		: "r"(x), "r"(y)
	);
}

The instructions that are executed on the NFP are: FADDS, FSUBS, FABSS, FNEGS, FMULS, FNMULS, FMACS, FNMACS, FMSCS, FNMSCS, FCMPS, FCMPES, FCMPZS, FCMPEZS, FUITOS, FSITOS, FTOUIS, FTOSIS, FTOUIZS, FTOSIZS, FSHTOS, FSLTOS, FUHTOS, FULTOS, FTOSHS, FTOSLS, FTOUHS, FTOULS.

Single Precision Constants

One important and easy optimization is to make sure that single precision constants are being used. By default this is not the case: an unsuffixed constant such as 2.123 is double precision, so operations involving that constant can require double precision instructions and cannot be executed on the NEON. For example:

float foo(float x)
{ 
	float y = 2.123;
	float r = y * x;
	return r; 
}

might end up the same as:

float foo(float x)
{
	double dx = (double) x;
	double dy = (double) 2.123; 
	double dr = dx * dy;
	float r = (float) dr;
	return r;
}

You can enforce single precision constants by including the compiler flag: -fsingle-precision-constant.
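
If changing compiler flags is not an option, standard C also lets you mark an individual constant as single precision with the f suffix. A minimal sketch (the function names here are made up for illustration):

```c
/* Unsuffixed constant: x is promoted to double, the multiply is done in
 * double precision and the result is converted back to float on return. */
static float foo_double_const(float x)
{
    return 2.123 * x;
}

/* f suffix: the constant is a float, so the multiply stays in single
 * precision and is eligible for the fast path described above. */
static float foo_float_const(float x)
{
    return 2.123f * x;
}
```

It is still worth checking the generated assembly (for example with gcc -S) to confirm that single precision instructions such as fmuls are being emitted.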

NFP / VFP to ARM Transfers

Probably the biggest bottleneck in the architecture is that in order to transfer a number from the VFP / NFP registers onto the ARM you must stall both the ARM and the NFP / VFP for more than 20 cycles. This is particularly troublesome because this is how GCC supplies arguments to and receives return values from functions. Possibly the best way to minimize operand passing stalls is to make the floating point functions inline.
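
As a rough illustration of the inlining advice, a small floating point helper can be declared static inline so that GCC is free to fold it into the calling loop instead of passing its operands through a function call. The names below are hypothetical, and inlining remains the compiler's decision unless you force it (GCC offers __attribute__((always_inline)) for that).

```c
#include <stddef.h>

/* Hypothetical helper: small enough that GCC will normally inline it
 * when optimization is enabled. */
static inline float scale_offset(float x, float scale, float offset)
{
    return x * scale + offset;
}

/* Applies the helper to every element; once inlined, the values can stay
 * in floating point registers for the whole loop. */
static void scale_offset_array(float *data, size_t n, float scale, float offset)
{
    for (size_t i = 0; i < n; i++)
        data[i] = scale_offset(data[i], scale, offset);
}
```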

Another source of NFP / VFP to ARM transfers is conditional branches that depend on floating point numbers. You can evaluate the condition on the VFP, but in order to branch the flags must be sent from the VFP to the ARM. For very simple branches your best bet is to not branch at all and instead use arithmetic, i.e.

if (x < 0) {x += 1.1244;}

Is the same as:

x = x + (x < 0) * 1.1244;

However you might want to keep a close eye on what the compiler actually produces with the above code.
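
To make the comparison concrete, here are the two forms side by side as complete functions. This is a sketch with made-up function names, and it uses the f suffix on the constant so the arithmetic stays in single precision. Whether the arithmetic form actually avoids the flag transfer depends on the code GCC generates, so inspect the assembly before relying on it.

```c
/* Branching form: the comparison result has to reach the ARM so that it
 * can decide whether to take the branch. */
static float add_if_negative_branch(float x)
{
    if (x < 0) { x += 1.1244f; }
    return x;
}

/* Arithmetic form: (x < 0) evaluates to 0 or 1, which scales the constant
 * to either 0.0f or 1.1244f before the add. */
static float add_if_negative_arith(float x)
{
    return x + (x < 0) * 1.1244f;
}
```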

One interesting fact is that stores and loads do not cause a stall. So as long as you don't need the result straight away you can hide the 20 cycle latency. Instead of doing a transfer, you store your NFP / VFP result to memory, do some work on the ARM, then load the result back onto the ARM without penalty, i.e.

void foo(float x, float *r)
{
	*r = 123 + x;
}

void bar()
{
	float x = 10;
	float r;
	foo(x, &r);
	//do ~20 cycles of ARM work
	//then access r, ie r = r * 10;

}

NEON SIMD

The NEON unit is similar to the MMX and SSE extensions found on x86 processors: it is optimized for Single Instruction Multiple Data (SIMD) operations. The NEON unit has two floating point pipelines, an integer pipeline and a 128-bit load/store/permute pipeline. When properly utilized it is a very powerful coprocessor. Unfortunately GCC does a rather poor job of vectorizing code for the NEON unit. To get the best performance you should use either the intrinsics provided in the "arm_neon.h" header or hand written assembly.
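
As a small taste of the intrinsics route, the sketch below performs a multiply accumulate over two float arrays, four elements at a time. The function name is hypothetical. The intrinsics used (vld1q_f32, vmlaq_f32, vst1q_f32) come from arm_neon.h, and the code falls back to a plain scalar loop when NEON is not available at compile time. With GCC on the Cortex A8 you would typically need -mfpu=neon for the NEON path to be compiled in.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: acc[i] += a[i] * b[i] for every i in [0, n). */
static void multiply_accumulate(float *acc, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Process four floats per iteration in 128-bit registers. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        float32x4_t vacc = vld1q_f32(acc + i);
        vacc = vmlaq_f32(vacc, va, vb);   /* vacc += va * vb */
        vst1q_f32(acc + i, vacc);
    }
#endif
    /* Scalar loop: handles the leftover elements, or everything when NEON
     * is not available. */
    for (; i < n; i++)
        acc[i] += a[i] * b[i];
}
```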