Difference between revisions of "Assembly Code Optimization"

From Pandora Wiki

Revision as of 12:30, 13 March 2010

Assembly code optimization on the Cortex-A8

This guide presents specific optimization techniques for the ARM Cortex-A8 processor and its dual-issue, in-order pipeline.

Use the ARMv7 movw/movt instructions

Newer ARM processors allow loading 32-bit values as two 16-bit immediates. The movw instruction loads the lower 16 bits, and movt loads the upper 16 bits. The movw instruction clears the upper 16 bits, so that 16-bit values can be loaded using a single instruction. The movt instruction does not affect the lower bits.

On older ARM processors, it was common to load 32-bit values with a PC-relative load from a literal pool. This should be avoided because the constant is read from memory, which may result in a data cache miss.
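As a rough illustration of the two approaches, the sketch below uses GNU assembler syntax. The constant 0x12345678, the symbol `table`, and the label `load_constants` are hypothetical names chosen for the example, not taken from any existing code.

```asm
    .arch   armv7-a
    .data
table:                          @ hypothetical data symbol
    .word   0

    .text
load_constants:
    @ 32-bit constant built from two 16-bit immediates
    movw    r0, #0x5678         @ r0 = 0x00005678 (upper 16 bits cleared)
    movt    r0, #0x1234         @ r0 = 0x12345678 (lower 16 bits unchanged)

    @ A 16-bit value needs only movw
    movw    r1, #0xBEEF         @ r1 = 0x0000BEEF

    @ Address of a symbol, using the assembler's relocation operators
    movw    r2, #:lower16:table
    movt    r2, #:upper16:table

    @ Older style, for comparison: the assembler typically places the
    @ constant in a literal pool and emits a PC-relative ldr
    ldr     r3, =0x12345678
    bx      lr
```

The movw/movt pair costs two instructions but touches no data memory, whereas the literal-pool form is a single instruction that depends on the constant being available in the data cache.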

Branch Prediction

Branches which have not been seen before are predicted not taken. It is therefore preferable to structure code so that the most likely code path is the one where the branch is not taken.
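A minimal sketch of this layout is shown here, assuming that a nonzero value in r0 signals a rare error case. The label `process` and the register usage are illustrative only.

```asm
    .text
process:
    cmp     r0, #0
    bne     1f                  @ rare case branches away
    add     r1, r1, #1          @ common case falls through (branch not taken)
    bx      lr
1:
    mov     r1, #0              @ rare case, placed out of line
    bx      lr
```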

Unconditional branches may be mispredicted, and load instructions which follow branches may be decoded and cause cache accesses. Avoid placing load instructions after branches unless you intend the CPU to prefetch the addresses they reference.

The branch predictor maintains a call/return stack, which is used for jumps that reference r14 and for stack operations using r13 that load the program counter. For best performance, make sure any pop, mov pc,lr or bx lr returns to the location that was set by the matching bl instruction. For non-return jumps, use a register other than r14.
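The sketch below shows one way to follow these rules. The labels `caller`, `helper` and `dispatch` are hypothetical, and r12 is assumed to already hold the target address of the indirect jump.

```asm
    .text
caller:
    push    {r4, lr}            @ save the return address on the stack (r13)
    bl      helper              @ call: sets lr to the following instruction
    pop     {r4, pc}            @ return via a stack operation that loads pc

helper:
    add     r0, r0, #1
    bx      lr                  @ return to the location set by the matching bl

dispatch:
    @ Non-return indirect jump: use a register other than r14
    bx      r12
```

Jumping through r14 to an address that was not set by a matching bl would leave the call/return stack out of step with the real control flow, which is why the indirect jump in `dispatch` uses r12 instead.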

Although instructions are decoded in pairs, the branch predictor can only predict one branch per cycle. If you have a conditional branch which is immediately followed by another branch, and the first branch is very likely to be taken, place a no-operation instruction between the branches to prevent decoding and prediction of the second branch. Inserting NOPs is detrimental if the first branch is not frequently taken.
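As an illustration, assuming the first branch is taken most of the time, the two branches can be separated like this. The labels are hypothetical.

```asm
    .text
check:
    cmp     r0, #0
    beq     likely_target       @ assumed to be taken most of the time
    nop                         @ separates the two branches
    b       other_path
likely_target:
    mov     r0, #1
    bx      lr
other_path:
    mov     r0, #2
    bx      lr
```

If profiling shows that the first branch is usually not taken, the nop only adds an instruction to the common path and should be dropped.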

Dual-Issue Restrictions

Only one branch instruction can issue per cycle. Only one load or store instruction can issue per cycle. Instructions which write to the same register cannot issue together.

There is a one-cycle delay before any written register can be used as the address of a load or store instruction.

There is a one-cycle delay before the result of a load can be used. Stores occur at the end of the pipeline, so store instructions can issue in the same cycle as another instruction which writes the same register. Similarly, conditional branches can issue in the same cycle as flag-setting instructions because the branch is not resolved until the following cycle.

Shift or rotate instructions take two cycles, and may stall if any of three preceding instructions write to the shifted register.
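The address and load-use delays described in this section can often be hidden by moving independent instructions into the gaps. The following is only a sketch: the registers r4 and r5 stand in for unrelated work, and the pairing actually achieved depends on the surrounding code.

```asm
    .text
unscheduled:
    add     r1, r2, #4          @ writes r1
    ldr     r0, [r1]            @ uses r1 as an address immediately
    add     r3, r0, #1          @ uses the loaded value immediately
    bx      lr

rescheduled:
    add     r1, r2, #4          @ writes r1
    add     r4, r4, #1          @ independent work before r1 is used as an address
    ldr     r0, [r1]
    add     r5, r5, #1          @ independent work before the loaded value is used
    add     r3, r0, #1
    bx      lr
```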

Instruction pairing restrictions following a branch

The Cortex-A8 processor normally decodes two instructions per cycle, but branch instructions, and certain instructions which resemble branches, are decoded at a rate of only one per cycle if present in pairs. These instructions can still issue and execute in parallel with other instructions that were already in the queue. However, following a mispredicted branch, the queue will be empty, and the following instructions issue and execute at a rate of only one per cycle:

  • Branch instructions
  • Load instructions, whether or not they reference r15
  • Arithmetic/logic instructions which do not set flags and do not have an immediate value.

Flag-setting instructions may be used to avoid this restriction. Additionally, MOV or MVN instructions which do not have immediate values can often be replaced with ADD, ORR, or EOR instructions using immediate values.
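For example, a register-to-register move at a branch target might be rewritten along the following lines. The label `after_branch` is hypothetical, and the flag-setting variant is only usable where the current flag values are no longer needed.

```asm
    .text
after_branch:                   @ assumed to be reached after a mispredicted branch
    @ Instead of: mov r0, r1
    add     r0, r1, #0          @ r0 = r1, using an immediate operand
    @ Instead of: mov r2, r3
    orr     r2, r3, #0          @ r2 = r3, alternative immediate form
    @ Instead of: mov r4, r5
    movs    r4, r5              @ r4 = r5, flag-setting form (overwrites N and Z)
    bx      lr
```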

Code alignment

Code alignment should be used with caution. Alignment has the potential to increase performance, but may be detrimental in some circumstances.

Because the instruction decoder always fetches 64-bit aligned words from the level-1 instruction cache, aligning code can improve instruction fetch and decode. However, excessive code alignment can result in a suboptimal distribution of entries in the branch history tables and increase branch misprediction. Additionally, the speculative prefetch may retrieve and decode instructions used as padding, even if those instructions are never executed. The types of instructions used as padding will affect the instruction decoder as stated above, and this may reduce the prefetch of code into the instruction cache.
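With the GNU assembler, a 64-bit boundary can be requested with the `.p2align` directive, as in the sketch below. The label `hot_loop` and the loop body are illustrative. In a code section the assembler normally fills the gap with no-op instructions, which is the padding referred to above.

```asm
    .text
    .p2align 3                  @ align the next instruction to 8 bytes (64 bits)
hot_loop:
    subs    r0, r0, #1          @ count down
    bne     hot_loop            @ loop until r0 reaches zero
    bx      lr
```

Aligning only a few frequently executed loop heads or branch targets, and measuring the result, is generally safer than aligning every label.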