Assembly Code Optimization

Assembly code optimization on the Cortex-A8

This guide presents specific optimization techniques for the ARM Cortex-A8 processor and its dual-issue, in-order pipeline.

Use the ARMv7 movw/movt instructions

ARMv7 processors can load a 32-bit value using two instructions with 16-bit immediates. The movw instruction loads the lower 16 bits and clears the upper 16 bits, so a 16-bit value can be loaded with a single instruction. The movt instruction loads the upper 16 bits and leaves the lower bits unchanged.

On older ARM processors it was common to load 32-bit constants with a PC-relative load from a literal pool. This should be avoided on the Cortex-A8 because the load may miss in the data cache.
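
For example, the same constant can be materialized either way (the value 0x12345678 is purely illustrative):

 @ old style: PC-relative load from a literal pool (data-side memory access)
 ldr r0, =0x12345678

 @ ARMv7 style: two halfword immediates, no memory access
 movw r0, #0x5678     @ r0 = 0x00005678, upper half cleared
 movt r0, #0x1234     @ r0 = 0x12345678, lower half unchanged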

Branch Prediction

There is a one-cycle stall in instruction fetch when a branch is predicted taken. It is therefore preferable to structure code so that the most likely code path is the one where the branch is not taken.
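
For example, when a test rarely succeeds, branch out of line for the rare case so that the common path falls through without a taken branch (the labels and registers are illustrative):

 cmp r0, #0
 beq rare_case        @ seldom taken: no taken-branch stall on the common path
 add r1, r1, #1       @ common path continues in a straight line
 bx lr
rare_case:
 mov r1, #0
 bx lr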

Unconditional branches may be mispredicted, and load instructions which follow branches may be decoded and cause cache accesses. Avoid placing load instructions after branches unless you intend the CPU to prefetch the addresses they reference.

The branch predictor has a call/return stack for branches which reference r14 and for stack operations through r13 which load the program counter. For best performance, make sure that any pop of the program counter, mov pc, lr, or bx lr returns to the address set by the matching bl instruction. For indirect jumps that are not returns, use a register other than r14.
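
A sketch of the matched call/return pattern, plus an indirect jump made through a scratch register so that it is not treated as a return (labels and registers are illustrative):

caller:
 bl helper            @ pushes the return address on the return stack
 add r0, r0, #1       @ execution resumes here after the return
 bx lr

helper:
 add r0, r0, r1
 bx lr                @ return predicted from the return stack

dispatch:
 ldr r12, [r3]        @ indirect jump that is not a return:
 bx r12               @ use a register other than r14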

Although instructions are decoded in pairs, the branch predictor can only predict one branch target per cycle. If you have a conditional branch which is immediately followed by another branch, and the first branch is likely to be taken, place a no-operation instruction between the branches to prevent decoding and prediction of the second branch. Inserting NOPs is detrimental if the first branch is infrequently taken.
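
For example, when the conditional branch is usually taken (labels are illustrative):

 cmp r0, #0
 bne fast_path        @ usually taken
 nop                  @ keeps the second branch out of the same decode pair
 b slow_path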

Dual-Issue Restrictions

Only one branch instruction can issue per cycle. Only one load or store instruction can issue per cycle. Instructions which write to the same register cannot issue together.

There is a one-cycle delay before any written register can be used as the address of a subsequent load or store instruction. There is a one-cycle delay before any written register can be used by an instruction which performs a shift or rotation on that register.

There is a one-cycle delay before the result of a load can be used. When a loaded value is then used as the address of a load or store, or by an instruction which shifts or rotates it, the delays combine for a total of two cycles.
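
A sketch of hiding the load-use delay by placing an independent instruction between the load and the instruction that consumes its result (registers are illustrative):

 ldr r2, [r0]         @ result is not ready on the next cycle
 add r4, r4, #1       @ independent work fills the delay
 add r3, r2, r1       @ uses r2 without stalling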

Stores occur at the end of the pipeline and store instructions can issue in the same cycle as another instruction which writes the same register. Similarly, conditional branches can issue in the same cycle as flag-setting instructions because the branch is not resolved until the following cycle. In all other cases, instructions cannot issue together if the second instruction depends on the results of the first.
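
For example, both pairs in the following loop can dual-issue despite the apparent dependencies (registers are illustrative):

loop:
 add r1, r1, #1       @ writes r1
 str r1, [r0]         @ store data is read late, so this pairs with the add
 subs r2, r2, #1      @ sets the flags
 bne loop             @ resolves next cycle, so this pairs with the subs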

Instruction pairing restrictions following a branch

The Cortex-A8 processor normally fetches two instructions per cycle and the branch predictor tests each against the global history buffer. However, there is only one entry in the branch target buffer for each pair of instructions. Therefore, branch instructions, and certain instructions which resemble branches, may adversely affect branch prediction when present in pairs. To reduce the risk of branch misprediction, avoid pairing branch instructions with the following:

  • Branch instructions
  • Load instructions, whether or not they write r15
  • Arithmetic/logic instructions which do not set flags and do not have an immediate value, whether or not they write r15

Flag-setting instructions can be used to avoid this restriction. Additionally, MOV instructions that do not have immediate values can be replaced with ADD, SUB, ORR, or EOR instructions using zero as an immediate value.
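
For instance, a register-to-register mov that must sit next to a branch can be rewritten so that it carries an immediate (registers are illustrative):

 mov r0, r1           @ no immediate: may hurt prediction next to a branch
 orr r0, r1, #0       @ same result, but has an immediate value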

Code alignment

Code alignment should be used with caution. Alignment has the potential to increase performance, but may be detrimental in some circumstances.

Because the instruction decoder always fetches 64-bit aligned words from the level-1 instruction cache, aligning code can improve instruction fetch and decode. However, the global history buffer of the branch predictor is indexed by the low bits of the instruction address. Excessive code alignment can result in a suboptimal distribution of entries in the branch history tables and increase branch misprediction. Additionally, the speculative prefetch may retrieve and decode instructions used as padding, even if those instructions are never executed. The types of instructions used as padding will affect the branch predictor as stated above, and this may affect the prefetch of code into the instruction cache.
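
If profiling shows that a particular loop entry benefits from alignment, a GNU assembler directive can request it; whether it actually helps must be measured, for the reasons above:

 .balign 8            @ align the loop entry to a 64-bit fetch boundary
loop:
 ldr r2, [r0], #4
 add r3, r3, r2
 subs r1, r1, #1
 bne loop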

NEON

As the manual states, passing data from NEON (and VFP) back to ARM takes at least 20 cycles, but there are some tricks that can be used. The vmov instruction that passes the data will not itself cause a stall (provided the NEON register it reads is already available); the stall happens on the next ARM instruction that follows, including branches. This means that:

  • Multiple "vmov.32 rX, dY[Z]" transfers can be done back-to-back and only incur one stall after the whole sequence
  • Additional NEON instructions can be issued after the "vmov" without stalling (with enough instructions the stall can be avoided entirely)
  • Additional unrelated NEON instructions can be issued before the "vmov" to hide the latency of the instruction that produces the source register data (see the sketch after this list)
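
A minimal sketch of those tricks combined; the surrounding NEON instructions and registers are illustrative:

 vadd.u32 q2, q2, q3  @ unrelated NEON work issued before the vmov
 vmov.32 r2, d0[0]    @ the transfer itself does not stall
 vmov.32 r3, d0[1]    @ back-to-back transfers: still only one stall
 vsub.u32 q2, q2, q3  @ NEON work after the vmov helps hide the latency
 add r2, r2, r3       @ first ARM instruction: any remaining stall lands here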

Even though the VFP is not pipelined, the VFP vldr instructions are. "vldr s0, [r0]" will be faster than "vld1.32 {d0[0]}, [r0]" (2 vs 3 cycles; vldr also offers a more flexible addressing mode). The same cannot be said of "vmov sX, sY", which waits for all previous instructions to complete before issuing.

You should avoid any NEON load from a location that has recently been written to. NEON has no store forwarding, so such a load can cost dozens of cycles. This can happen without you realizing it if you use unaligned loads, because they are performed as two aligned accesses. For example, consider this sequence:

loop:
 vld1.u32 { q0 }, [r0]      @ load the current element
 vadd.u32 q0, q0, q1        @ modify it
 vst1.u32 { q0 }, [r0]!     @ write it back and advance r0
 subs r1, #1
 bne loop

If r0 is unaligned, the vld1.u32 of the next iteration will overlap the vst1.u32 of the previous iteration, causing a stall. To avoid this, reorder the loop so that the store happens one element behind the load.
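
A minimal sketch of that reordering, assuming r1 holds a count of at least one; the load of the next element is issued before the previous result is stored, so the load no longer waits on the store:

 mov r2, r0                 @ r2: store pointer, stays one element behind r0
 vld1.u32 { q0 }, [r0]!     @ prologue: load the first element
 vadd.u32 q0, q0, q1
 subs r1, #1
 beq last
loop:
 vld1.u32 { q2 }, [r0]!     @ load the next element first...
 vst1.u32 { q0 }, [r2]!     @ ...then store the previous result
 vadd.u32 q0, q2, q1
 subs r1, #1
 bne loop
last:
 vst1.u32 { q0 }, [r2]      @ epilogue: store the final result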