<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://pandorawiki.org/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Ari64</id>
	<title>Pandora Wiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://pandorawiki.org/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Ari64"/>
	<link rel="alternate" type="text/html" href="https://pandorawiki.org/Special:Contributions/Ari64"/>
	<updated>2026-04-13T13:30:42Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.32.0-alpha</generator>
	<entry>
		<id>https://pandorawiki.org/index.php?title=Mupen64plus_dynamic_recompiler&amp;diff=5496</id>
		<title>Mupen64plus dynamic recompiler</title>
		<link rel="alternate" type="text/html" href="https://pandorawiki.org/index.php?title=Mupen64plus_dynamic_recompiler&amp;diff=5496"/>
		<updated>2011-02-09T01:19:41Z</updated>

		<summary type="html">&lt;p&gt;Ari64: Cache is 32MB by default in 20110128&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes the dynamic recompiler in [[Mupen64Plus]], and the changes made for the ARM port.&lt;br /&gt;
&lt;br /&gt;
==The original dynamic recompiler by Hacktarux==&lt;br /&gt;
&lt;br /&gt;
The dynamic recompiler used in Mupen64plus v1.5 is based on the original written by Hacktarux in 2002.&lt;br /&gt;
&lt;br /&gt;
It recompiles contiguous blocks of MIPS instructions.  First, each instruction is decoded into a dynamically-allocated 132-byte data structure.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
typedef struct _precomp_instr&lt;br /&gt;
{&lt;br /&gt;
   void (*ops)();&lt;br /&gt;
   union&lt;br /&gt;
     {&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         long long int *rs;&lt;br /&gt;
         long long int *rt;&lt;br /&gt;
         short immediate;&lt;br /&gt;
      } i; /* I-type (immediate) */&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned int inst_index;&lt;br /&gt;
      } j; /* J-type (jump target index) */&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         long long int *rs;&lt;br /&gt;
         long long int *rt;&lt;br /&gt;
         long long int *rd;&lt;br /&gt;
         unsigned char sa;&lt;br /&gt;
         unsigned char nrd;&lt;br /&gt;
      } r; /* R-type (register) */&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned char base;&lt;br /&gt;
         unsigned char ft;&lt;br /&gt;
         short offset;&lt;br /&gt;
      } lf; /* floating-point load/store */&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned char ft;&lt;br /&gt;
         unsigned char fs;&lt;br /&gt;
         unsigned char fd;&lt;br /&gt;
      } cf; /* floating-point computation */&lt;br /&gt;
     } f;&lt;br /&gt;
   unsigned int addr; /* word-aligned instruction address in r4300 address space */&lt;br /&gt;
   unsigned int local_addr; /* byte offset to start of corresponding x86_64 instructions, from start of code block */&lt;br /&gt;
   reg_cache_struct reg_cache_infos;&lt;br /&gt;
} precomp_instr;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The decoded instructions are then compiled, generating x86 instructions for each MIPS instruction.  A 32K block is allocated with malloc() to hold the x86 code.  If this size proves insufficient, it is incrementally resized with realloc().&lt;br /&gt;
&lt;br /&gt;
MIPS registers are allocated to x86 registers with a least-recently-used replacement policy.  All cached registers are written back prior to a branch, or a memory read or write.&lt;br /&gt;
&lt;br /&gt;
To facilitate invalidation and replacement of modified code blocks, each 4K page is compiled separately.  If a sequence of instructions crosses a 4K page boundary, the current block is ended, and the next instructions will be compiled as a separate block.&lt;br /&gt;
&lt;br /&gt;
Branch instructions within a 4K page are compiled as branches directly to the target address.  Branches which cross a 4K page go through an indirect address lookup.&lt;br /&gt;
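&lt;br /&gt;
The page-locality test behind this decision can be sketched in C (an illustrative model, not the original source):&lt;br /&gt;
&lt;br /&gt;
```c
/* Sketch: under the original design, a branch is compiled as a
   direct jump only when its source and target share a 4 KiB page;
   otherwise an indirect address lookup is used. */
int same_page(unsigned int pc, unsigned int target)
{
    return (pc >> 12) == (target >> 12);   /* compare 4 KiB page numbers */
}
```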
&lt;br /&gt;
Compiled code blocks are invalidated on write.  On an actual MIPS CPU, the instruction cache is invalidated using the CACHE instruction; however, a few N64 games clear the cache using other methods.  Trapping writes appears to be the most reliable method of ensuring cache coherency in the emulated system.  The CACHE instruction itself is ignored.&lt;br /&gt;
&lt;br /&gt;
==Problems with the original design==&lt;br /&gt;
&lt;br /&gt;
The most significant performance problem with this design is its excessive memory usage.  The decoded instruction data is retained (132 bytes for each MIPS instruction) and occasionally referenced during execution.  Memory accesses frequently miss the L2 cache, resulting in poor performance.&lt;br /&gt;
&lt;br /&gt;
Additionally, the register cache is relatively inefficient, since all registers are flushed before any read or write operation.&lt;br /&gt;
&lt;br /&gt;
==A new approach==&lt;br /&gt;
&lt;br /&gt;
To reduce memory usage, the new design allocates a single large block of memory (generally 32 MiB, but the size is configurable) which is used for recompiled code.  This memory is allocated using mmap with the PROT_EXEC bit set, to ensure operation on CPUs with no-execute (NX) page permissions.&lt;br /&gt;
&lt;br /&gt;
Contiguous blocks of MIPS instructions are compiled (that is, the recompiler does not attempt to follow branches or 'hot paths').&lt;br /&gt;
&lt;br /&gt;
The recompiler consists of eight stages, plus a linker, and a memory manager.&lt;br /&gt;
&lt;br /&gt;
Compiled blocks are invalidated on write, as before; however, because the compiler will now cross 4K page boundaries, writes may invalidate adjacent pages as well.&lt;br /&gt;
&lt;br /&gt;
Currently the dynarec generates x86, x86-64, ARMv5, and ARMv7 (little-endian).  Most of the code is shared between the architectures, but a different code generator is included at compile time, using different #include statements depending on the CPU type.&lt;br /&gt;
&lt;br /&gt;
==Pass 1: Disassembly==&lt;br /&gt;
&lt;br /&gt;
When an instruction address is encountered which has not been compiled, the function new_recompile_block is called, with the (virtual) address of the target instruction as its sole parameter.  If the address is invalid, the function returns a nonzero value and the caller is responsible for handling the pagefault.&lt;br /&gt;
&lt;br /&gt;
Instructions are decoded until an unconditional jump, usually a return, is encountered.  Disassembly ordinarily continues past a JAL (subroutine call) instruction; however, this strategy is abandoned if invalid instructions are encountered.&lt;br /&gt;
&lt;br /&gt;
Up to 4096 instructions (16K of MIPS code) may be disassembled at once.  Surprisingly, some games do actually reach this limit.&lt;br /&gt;
&lt;br /&gt;
==Pass 2: Liveness analysis==&lt;br /&gt;
&lt;br /&gt;
After disassembly, liveness analysis is performed on the registers.  This determines when a particular register will no longer be used, and thus can be removed from the register cache.&lt;br /&gt;
&lt;br /&gt;
A separate analysis is done on the upper 32 bits of 64-bit registers.  This can determine when only the lower 32 bits of a register are significant, thus allowing use of a 32-bit register.  This enables more efficient code generation on 32-bit processors.&lt;br /&gt;
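&lt;br /&gt;
A minimal sketch of such a backward liveness scan, assuming a simple array representation (the recompiler presumably tracks this with bitmasks; all names here are invented):&lt;br /&gt;
&lt;br /&gt;
```c
/* Hedged sketch of the liveness pass over one decoded block.  rs/rt
   are the source registers and wr the destination of each instruction
   (0 meaning none); dead[i][r] is set when register r is overwritten
   before being read again after instruction i. */
void liveness(int n, const int rs[], const int rt[], const int wr[],
              unsigned char dead[][32])
{
    unsigned char live[32];
    for (int r = 0; r != 32; r++)
        live[r] = 1;                     /* conservatively live at block end */
    for (int i = n - 1; i >= 0; i--) {   /* scan backward */
        for (int r = 1; r != 32; r++)
            dead[i][r] = 1 - live[r];    /* no longer needed after i */
        if (wr[i] != 0) live[wr[i]] = 0; /* overwritten: dead above here */
        if (rs[i] != 0) live[rs[i]] = 1; /* read: live above here */
        if (rt[i] != 0) live[rt[i]] = 1;
    }
}
```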
&lt;br /&gt;
==Pass 3: Register allocation==&lt;br /&gt;
&lt;br /&gt;
The 31 MIPS registers must be mapped onto seven available registers on x86, or twelve available registers on ARM.&lt;br /&gt;
&lt;br /&gt;
Instructions with 64-bit results require two registers on 32-bit host processors.  32-bit instructions require only one.  A flag is set for registers containing 32-bit values, and these registers will be sign-extended before they are written to the register file.&lt;br /&gt;
&lt;br /&gt;
Registers are made available using a least-soon-needed replacement policy.  When the register cache is full, and no registers can be eliminated using the liveness analysis, a ten-instruction lookahead is used to determine which registers will not be needed soon and these registers are evicted from the cache.&lt;br /&gt;
&lt;br /&gt;
==Pass 4: Free unused registers==&lt;br /&gt;
&lt;br /&gt;
After the initial register allocation, the allocations are reviewed to determine if any registers remain allocated longer than necessary.  These are then removed from the register cache.  This avoids having to write back a large number of registers just before a branch, and makes more registers available for the next pass.&lt;br /&gt;
&lt;br /&gt;
==Pass 5: Pre-allocate registers==&lt;br /&gt;
&lt;br /&gt;
If a register will be used soon and needs to be loaded, try to load it early if the register is available.  This improves execution on CPUs with a load-use penalty.&lt;br /&gt;
&lt;br /&gt;
==Pass 6: Optimize clean/dirty state==&lt;br /&gt;
&lt;br /&gt;
If a cached register is 'dirty' and needs to be written out, try to do so as soon as the register will no longer be modified.  This avoids having to write the same register on multiple code paths due to conditional branches.  Additionally, try to avoid writing out dirty registers inside of loops.&lt;br /&gt;
&lt;br /&gt;
==Pass 7: Identify where registers are assumed to be 32-bit==&lt;br /&gt;
&lt;br /&gt;
When a 64-bit register is mapped to a 32-bit register with the assumption that the value will be sign-extended before being used, it is necessary to ensure that no register contains a 64-bit value when branching to such a location.  These instructions are flagged to identify them as requiring 32-bit inputs.  This information is used by the linker, and the exception return (ERET) handler.&lt;br /&gt;
&lt;br /&gt;
==Pass 8: Assembly==&lt;br /&gt;
&lt;br /&gt;
This generates and outputs the recompiled code.&lt;br /&gt;
&lt;br /&gt;
Following the main code block, handlers for certain exceptions as well as alternate entry points are added.&lt;br /&gt;
&lt;br /&gt;
If a recompiled instruction relies on a certain MIPS register being cached in a certain native register, then a short 'stub' of code is generated to load the necessary registers.  When an instruction outside of this block needs to jump to that location, it will instead jump to the stub.  The necessary registers will be loaded, and then it will jump into the main code sequence.&lt;br /&gt;
&lt;br /&gt;
On architectures which require literal pools (ARMv5), these are inserted as necessary.&lt;br /&gt;
&lt;br /&gt;
==Linker==&lt;br /&gt;
&lt;br /&gt;
The linker fills in all unresolved branches.  Branches within the block are linked to their target address.  Branches which jump outside of the block are linked to their target if that address has been compiled already. These inter-block branches are recorded in the jump_out array.  This information will be used to remove the links in the event that the target of the branch is invalidated.&lt;br /&gt;
&lt;br /&gt;
Unresolved branches point to a stub which loads the address of the branch instruction and the virtual address of its target into registers, and calls the dynamic linker.  When this code is executed, the dynamic linker will compile the target if necessary, and then patch the branch instruction with the new address.&lt;br /&gt;
&lt;br /&gt;
==Memory manager==&lt;br /&gt;
&lt;br /&gt;
The last step in new_recompile_block is to ensure that there will be sufficient memory available to compile the next block.  If there is not, then the oldest blocks are purged.&lt;br /&gt;
&lt;br /&gt;
The dynarec cache can be described as a 32MB circular buffer divided into eight segments.  Memory is allocated in order from beginning to end.&lt;br /&gt;
&lt;br /&gt;
When fewer than two empty segments remain, the next segment in sequence is cleared, wrapping around from the end of the buffer to the beginning as memory is needed.&lt;br /&gt;
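&lt;br /&gt;
The wrap-around arithmetic can be modeled as follows (an illustrative sketch; the segment size follows from 32 MB divided by eight, but the function name is invented):&lt;br /&gt;
&lt;br /&gt;
```c
/* Illustrative model: 32 MiB cache, 8 segments of 4 MiB each.
   Given the current output-pointer offset, return the segment that
   must be cleared to keep at least two empty segments ahead. */
enum { CACHE_SIZE = 32*1024*1024, SEGMENTS = 8,
       SEG_SIZE = CACHE_SIZE/SEGMENTS };

unsigned int segment_to_clear(unsigned int out_offset)
{
    unsigned int current = out_offset / SEG_SIZE;
    return (current + 2) % SEGMENTS;    /* two ahead, wrapping around */
}
```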
&lt;br /&gt;
==Invalidation and restoration==&lt;br /&gt;
&lt;br /&gt;
Normally, code blocks are looked up via a hash table, using the virtual address of the target instruction.  However, for purposes of invalidation, blocks are grouped by physical address.  This can be described as a virtually-indexed, physically-tagged (VIPT) cache.&lt;br /&gt;
&lt;br /&gt;
References to compiled blocks are stored in one of 4096 linked lists in the jump_in array.  Each list covers a 4K block of memory, and 2048 such lists are sufficient to cover the 8MB of RAM in the Nintendo 64.  The remaining lists are for code in ROM, and the bootloader in SP memory.&lt;br /&gt;
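&lt;br /&gt;
The mapping from physical address to list index might be sketched like this (hedged: the folding of ROM and SP addresses into the upper lists is illustrative, not the exact scheme):&lt;br /&gt;
&lt;br /&gt;
```c
/* Sketch: map a physical address to one of 4096 per-page lists.
   The 8 MiB of RDRAM covers indices 0-2047; addresses outside RAM
   (ROM, SP memory) are folded into the remaining 2048 lists.
   The folding shown here is illustrative only. */
unsigned int get_page(unsigned int paddr)
{
    unsigned int page = paddr / 4096;          /* 4 KiB granularity */
    if (page > 2047)
        page = 2048 + page % 2048;             /* fold non-RAM ranges */
    return page;
}
```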
&lt;br /&gt;
When a write hits a memory page marked as cached, all entries in the corresponding list are invalidated.  If any code is found to cross a 4K boundary, the adjacent lists are invalidated also.&lt;br /&gt;
&lt;br /&gt;
Sometimes blocks may be invalidated even when none of the code is actually modified.  This can happen if data is written to memory in the same 4K page, or if code is reloaded without actually modifying it.  If blocks which were previously invalidated are subsequently found to be unmodified, those blocks are marked in the restore_candidate array.  If the block remains unmodified, it will be restored as a valid block, to avoid recompiling blocks which do not need to be recompiled.  This is performed by the clean_blocks function which is called periodically.&lt;br /&gt;
&lt;br /&gt;
The jump_in array, which lists unmodified blocks, is physically indexed, however the jump_dirty array, which lists potentially-modified blocks, is virtually indexed.  This allows blocks which have changed physical addresses, but are still at the same virtual address, to be recognized as being the same as before, and not in need of recompilation.&lt;br /&gt;
&lt;br /&gt;
==Dynamic linker==&lt;br /&gt;
&lt;br /&gt;
Branches with unresolved addresses jump to the dynamic linker.  This will look through the jump_in list corresponding to the physical page containing the virtual target address.  If found, the branch instruction will be patched with the address, and then it will jump to this address.&lt;br /&gt;
&lt;br /&gt;
If not found, the jump_dirty list will be searched for blocks which were previously compiled but may have been modified.  If a potential match is found, the code will be compared against a cached copy to determine if any changes have been made.  If not, then it will jump to the block.  Because the memory could be modified again, branch instructions referencing these blocks are not altered, and continue to point to the dynamic linker.  These blocks will continue to be verified each time they are called, until restored to the jump_in list by the clean_blocks function described above.&lt;br /&gt;
&lt;br /&gt;
If no compiled block is found, or the existing block was modified, the target is recompiled.&lt;br /&gt;
&lt;br /&gt;
==Address lookup==&lt;br /&gt;
&lt;br /&gt;
When a JR (jump register) instruction is encountered, the address of the recompiled code must be looked up using the address of the MIPS code.  The majority of such instructions jump to the link register (r31) to return to the location following a JAL (call) instruction.&lt;br /&gt;
&lt;br /&gt;
When a JAL or JALR is executed, the address of the following instruction is inserted into a small 32-entry hash table, which is checked when a JR r31 instruction is executed.  This allows for a quick return from subroutine calls.&lt;br /&gt;
&lt;br /&gt;
If the JR instruction uses a register other than r31, or the small hash table lookup fails to find a match, a larger 131072-entry hash table is checked.  This table contains 65536 bins with up to 2 addresses per bin.  If this also fails to find a match (which occurs less than 1% of the time) an exhaustive search of all compiled addresses within that 4K memory page is performed.&lt;br /&gt;
&lt;br /&gt;
If no match is found by any of these methods, the target address is compiled, and the new address is inserted into the hash table.&lt;br /&gt;
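&lt;br /&gt;
The larger table can be sketched as follows (a simplified model: the hash function, structure layout, and names are invented, and empty slots are simply zero here, so a real implementation would reserve a sentinel value instead):&lt;br /&gt;
&lt;br /&gt;
```c
/* Simplified model of a 65536-bin, 2-way table mapping MIPS
   addresses to compiled-code addresses. */
typedef struct { unsigned int vaddr[2]; void *ptr[2]; } ht_bin;

unsigned int ht_index(unsigned int vaddr)
{
    return ((vaddr >> 16) ^ vaddr) % 65536;   /* fold upper into lower */
}

void *ht_lookup(ht_bin table[], unsigned int vaddr)
{
    ht_bin *b = table + ht_index(vaddr);
    if (b->vaddr[0] == vaddr) return b->ptr[0];
    if (b->vaddr[1] == vaddr) return b->ptr[1];
    return 0;    /* miss: fall back to exhaustive search of the page */
}

void ht_insert(ht_bin table[], unsigned int vaddr, void *ptr)
{
    ht_bin *b = table + ht_index(vaddr);
    b->vaddr[1] = b->vaddr[0];  b->ptr[1] = b->ptr[0];  /* age out */
    b->vaddr[0] = vaddr;        b->ptr[0] = ptr;
}
```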
&lt;br /&gt;
==Cycle counting==&lt;br /&gt;
&lt;br /&gt;
Cycles are counted before each branch by adding the cycles from the preceding instructions to a specific register.  The cycle count is kept in R10 on ARM and ESI on x86.  The value in this register is normally negative; when it becomes non-negative, a conditional branch is taken which jumps to an interrupt handler.&lt;br /&gt;
&lt;br /&gt;
For example, the following x86 code adds eight cycles:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add $8,%esi&lt;br /&gt;
jns interrupt_handler&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The conditional branch jumps to a small bit of code located after the main compiled block, which saves the cached registers to the register file, sets the instruction pointer which will be used upon return from the interrupt, and then calls cc_interrupt.&lt;br /&gt;
&lt;br /&gt;
As in the original mupen64plus, the emulated clock runs at 37.5 MHz, and each instruction takes 2 clock cycles.&lt;br /&gt;
&lt;br /&gt;
==Interrupt handler==&lt;br /&gt;
&lt;br /&gt;
When the cycle count register reaches its limit, cc_interrupt is called, which in turn calls gen_interupt [sic].  If interrupts are not enabled, cc_interrupt returns.  If interrupts are enabled, and an interrupt is to be taken, the pending_exception flag will be set.  In this case, cc_interrupt does not return, and instead pops the stack and causes an unconditional jump to the address in pcaddr (usually 0x80000180).&lt;br /&gt;
&lt;br /&gt;
There is one additional case where the interrupt handler may be called.  If interrupts were disabled, and are enabled by writing to coprocessor 0 register 12, any pending interrupts are handled immediately.&lt;br /&gt;
&lt;br /&gt;
==Delay slots==&lt;br /&gt;
&lt;br /&gt;
MIPS has 'delay slots', where the instruction after the branch is executed before the branch is taken.  Instructions in delay slots are issued out-of-order in the recompiled code.&lt;br /&gt;
&lt;br /&gt;
[[Image:Recompiler delay slot reordering.png]]&lt;br /&gt;
&lt;br /&gt;
When a branch jumps into the delay slot of another branch, this case must be handled slightly differently:&lt;br /&gt;
&lt;br /&gt;
[[Image:Recompiler delay slot reordering 2.png]]&lt;br /&gt;
&lt;br /&gt;
The branch test and delay slot are executed in-order if a dependency exists, or for 'likely' branches where the delay slot is nullified when the branch condition is false.  These cases are infrequent (typically less than 10% of branches).&lt;br /&gt;
&lt;br /&gt;
==Constant propagation==&lt;br /&gt;
&lt;br /&gt;
When an instruction loads a constant into a register, the register is tagged as a constant.  The constant tag will be retained if subsequent instructions modify the constant using other constants.  During assembly, such a sequence of instructions is combined into a single load.  For example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
LUI r8,12340000  --&amp;gt;  mov $0x12345678,%eax&lt;br /&gt;
ORI r8,r8,5678&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This optimization is not performed where a branch target intervenes, e.g.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
...&lt;br /&gt;
 LUI r8,12340000  --&amp;gt;  mov $0x12340000,%eax&lt;br /&gt;
L1:&lt;br /&gt;
 ORI r8,r8,5678   --&amp;gt;  or $0x5678,%eax&lt;br /&gt;
 ...&lt;br /&gt;
 BEQ r0,r0,L1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Registers containing constants are identified by bits in the isconst and wasconst fields of the regstat structure.  The wasconst bit is set for a register if the register contained a known constant before the instruction, and the isconst bit is set for a register if the register will contain a known constant after the instruction executes.&lt;br /&gt;
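&lt;br /&gt;
This bookkeeping can be sketched in array form (the regstat structure uses bitfields; only LUI and ORI are modeled here, and all names are illustrative):&lt;br /&gt;
&lt;br /&gt;
```c
/* Sketch of per-register constant tracking across one instruction. */
typedef struct {
    unsigned char isconst[32];  /* value known after the instruction */
    unsigned int  value[32];
} const_state;

void prop_lui(const_state *s, int rt, unsigned int imm16)
{
    if (rt == 0) return;                 /* r0 is hardwired to zero */
    s->isconst[rt] = 1;
    s->value[rt] = imm16 * 65536;        /* immediate in the upper half */
}

void prop_ori(const_state *s, int rt, int rs, unsigned int imm16)
{
    if (rt == 0) return;
    if (rs == 0 || s->isconst[rs]) {     /* source value is known */
        unsigned int v = (rs == 0) ? 0 : s->value[rs];
        s->isconst[rt] = 1;
        s->value[rt] = v | imm16;        /* foldable into a single load */
    } else {
        s->isconst[rt] = 0;              /* result no longer constant */
    }
}
```

Running the LUI/ORI pair from the example above through this model yields the single folded constant 0x12345678.&lt;br /&gt;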
&lt;br /&gt;
==Translation lookaside buffer emulation==&lt;br /&gt;
&lt;br /&gt;
Most Nintendo 64 games do not use virtual memory, but some do.  At startup, main memory is directly mapped at addresses from 0x80000000 to 0x803FFFFF, or up to 0x807FFFFF if the memory expansion is used.&lt;br /&gt;
&lt;br /&gt;
Normally, read or write operations are checked against this range, and if outside this range, control is passed to the appropriate I/O handler.  This is done as follows:&lt;br /&gt;
&lt;br /&gt;
MIPS instruction: LW r1,8(r2)&lt;br /&gt;
&lt;br /&gt;
ARM code:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add r1, r2, #8&lt;br /&gt;
cmp r1, #8388608&lt;br /&gt;
bvc handler&lt;br /&gt;
ldr r1, [r1]&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
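&lt;br /&gt;
The bvc may look odd for a range check, but one way to read it is this: subtracting 0x800000 sets the overflow (V) flag exactly when the address, viewed as a signed 32-bit value, lies in the RDRAM range 0x80000000 to 0x807FFFFF, so overflow-clear means the access is outside RAM.  A C model of this reading (illustrative):&lt;br /&gt;
&lt;br /&gt;
```c
/* Model of the cmp/bvc pair: 'cmp r1, #0x800000' overflows exactly
   when r1, as a signed 32-bit value, is in 0x80000000-0x807FFFFF.
   Equivalently, (addr >> 23) == 0x100. */
int in_rdram(unsigned int addr)
{
    long long a = addr;
    if (a > 2147483647LL) a -= 4294967296LL;  /* sign-extend 32 bits */
    long long d = a - 0x800000;               /* the cmp subtraction */
    int v_flag = 0 > d + 2147483648LL;        /* below INT32_MIN */
    return v_flag;     /* V set: address is in RDRAM, take the load */
}
```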
&lt;br /&gt;
If there are valid entries in the TLB, this would instead be compiled as follows:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add r1, r2, #8&lt;br /&gt;
mov r0, #264&lt;br /&gt;
add r0, r0, r1, lsr #12&lt;br /&gt;
ldr r0, [r11, r0, lsl #2]&lt;br /&gt;
tst r0, r0&lt;br /&gt;
bmi handler&lt;br /&gt;
ldr r1, [r1, r0, lsl #2]&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This looks up the offset in the memory_map table (which, in this example, is located at r11+264*4).  The high bit is tested to determine whether a valid mapping exists for this page.&lt;br /&gt;
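&lt;br /&gt;
In C, the lookup amounts to the following (a sketch: the array size and the return convention are illustrative, though the scaling by four mirrors the lsl #2 above):&lt;br /&gt;
&lt;br /&gt;
```c
/* Sketch of the memory_map lookup: one entry per 4 KiB page; the
   high bit marks an unmapped page, otherwise the entry, scaled by
   four (the lsl #2), is the offset from emulated to host address.
   Returns the host address as an integer, or 0 on a fault. */
unsigned long long tlb_translate(unsigned int vaddr,
                                 const unsigned int map[])
{
    unsigned int entry = map[vaddr >> 12];
    if (entry >> 31)                  /* high bit set: invalid mapping */
        return 0;                     /* caller branches to the handler */
    return (unsigned long long)vaddr
         + (unsigned long long)entry * 4;
}
```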
&lt;br /&gt;
==Page fault emulation==&lt;br /&gt;
&lt;br /&gt;
If a memory address references an invalid page, a conditional branch is taken to a bit of code placed after the main block.  This will save any modified cached registers, and call an appropriate handler function for the address.  The handler will either perform I/O, or generate an exception (pagefault).&lt;br /&gt;
&lt;br /&gt;
==Mapping executable pages==&lt;br /&gt;
&lt;br /&gt;
If the dynamic recompiler encounters code which is not in contiguous physical memory, it will end the block at the page boundary, so that the block can be removed cleanly if the mapping is changed.&lt;br /&gt;
&lt;br /&gt;
There is a special case of this, where a branch and delay slot span two pages.  If a branch instruction is the last instruction in a virtual memory page, it is compiled in a different manner than other branches.  The branch condition is evaluated, and the target address is placed in a register (%ebp on x86, and r8 on ARM).  A special form of the dynamic linker (dyna_linker_ds) is used to link the branch to its corresponding delay slot in another block.  If no page fault occurs, the delay slot executes and then jumps to the address in the register.  For conditional branches that are not taken, the target address is the next instruction.  This code is generated by the pagespan_assemble function.&lt;br /&gt;
&lt;br /&gt;
== Self-modifying code detection ==&lt;br /&gt;
&lt;br /&gt;
Pages that contain no compiled code, or whose code may have been modified since compilation, are marked in the invalid_code array.  Writes are checked against this array, and if a write hits a valid (compiled and unmodified) page, invalidate_block is called:&lt;br /&gt;
&lt;br /&gt;
MIPS instruction: SW r1,8(r2)&lt;br /&gt;
&lt;br /&gt;
ARM code:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
ldr r3, [r11, #88]  // pointer to invalid_code&lt;br /&gt;
add r4, r2, #8&lt;br /&gt;
cmp r4, #8388608&lt;br /&gt;
bvc handler&lt;br /&gt;
str r1, [r4]&lt;br /&gt;
ldrb r14, [r3, r4, lsr #12]&lt;br /&gt;
cmp r14, #1&lt;br /&gt;
bne invstub&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In TLB mode, the invalid_code array is not checked directly.  Instead, pages are marked non-writable in memory_map.&lt;br /&gt;
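&lt;br /&gt;
The non-TLB write path can be modeled in C as follows (names and array shapes are illustrative; a byte value of 1 in invalid_code means the page holds no compiled code):&lt;br /&gt;
&lt;br /&gt;
```c
/* Sketch: perform the store, then invalidate compiled code if the
   page is still marked valid (a byte other than 1 means compiled
   code may exist for the page and must be thrown away). */
static int invalidations;            /* stands in for invalidate_block */
static void invalidate_block(unsigned int page)
{
    (void)page;
    invalidations++;                 /* real code walks the page's list */
}

void write_word_checked(unsigned int addr, unsigned int value,
                        unsigned int ram[], unsigned char invalid[])
{
    ram[addr / 4] = value;           /* the store itself */
    if (invalid[addr >> 12] != 1)    /* page may hold compiled code */
        invalidate_block(addr >> 12);
}
```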
&lt;br /&gt;
== Long jumps ==&lt;br /&gt;
&lt;br /&gt;
Branch instructions are limited to a +/-32MB range on ARM.  In some cases, the dynamic recompiler needs to generate calls to locations beyond this range.  This is accomplished via a jump table located at the end of the code generation area, and the full address is loaded via a pointer.  The jump table is generated in arch_init().&lt;br /&gt;
&lt;br /&gt;
As these indirect jumps cause some delay, it is best to avoid this situation if possible, by locating this area close to the other executable code.&lt;br /&gt;
&lt;br /&gt;
== Compile options ==&lt;br /&gt;
&lt;br /&gt;
=== ARMv5_ONLY ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the UXTH instruction is not used, and the dynamic recompiler will generate literal pools instead of using movw/movt.  This provides compatibility with older processors, but generates somewhat less efficient code.&lt;br /&gt;
&lt;br /&gt;
=== RAM_OFFSET ===&lt;br /&gt;
&lt;br /&gt;
When compiling for ARM, this allocates an additional register which is used to add an offset to all pointers to memory addresses between 0x80000000 and 0x807FFFFF.  This allows the N64's memory to be mapped at an alternate address.  This incurs a small performance penalty, but is required for certain operating systems (e.g. Google Android, which places shared libraries at 0x80000000).&lt;br /&gt;
&lt;br /&gt;
This option is not used for x86.  The x86 instruction set allows for a full 32-bit offset in the instruction encoding, making it unnecessary to allocate an additional register for this purpose.&lt;br /&gt;
&lt;br /&gt;
=== CORTEX_A8_BRANCH_PREDICTION_HACK ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the dynamic recompiler will avoid generating consecutive branch instructions without another instruction in between.  This avoids a possible branch misprediction on the Cortex-A8 due to this processor having dual instruction decoders, but only one branch-prediction unit.  See [[Assembly Code Optimization]] for details.&lt;br /&gt;
&lt;br /&gt;
=== USE_MINI_HT ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the dynamic recompiler attempts to look up return addresses in a small hash table before checking the larger hash table.  This usually improves performance.&lt;br /&gt;
&lt;br /&gt;
=== IMM_PREFETCH ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the x86 PREFETCH instruction is used to prefetch entries from the hash table.  The increase in code size often outweighs the benefit of this.&lt;br /&gt;
&lt;br /&gt;
=== REG_PREFETCH ===&lt;br /&gt;
&lt;br /&gt;
Similar to the above, but loads the address into a register first, then uses the ARM PLD instruction.  The increase in code size almost always outweighs the benefit of this.&lt;br /&gt;
&lt;br /&gt;
=== R29_HACK ===&lt;br /&gt;
&lt;br /&gt;
Assume that the stack pointer (r29) is always a valid memory address and do not check it.  It is similar to the optimization described [http://strmnnrmn.blogspot.com/2007/08/interesting-dynarec-hack.html here].  This can crash the emulator and is not enabled by default.&lt;br /&gt;
&lt;br /&gt;
== Debugging ==&lt;br /&gt;
&lt;br /&gt;
Debugging information can be obtained by defining the assem_debug macro as printf.  This will cause the dynamic recompiler to print debugging information to stdout.  For each disassembled MIPS instruction, an entry similar to the following will be printed:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
U: r1 r8 r11 r16 r31 UU: r29 32: r0 r9&lt;br /&gt;
pre: eax=-1 ecx=9 edx=-1 ebx=-1 ebp=29 esi=36 edi=-1&lt;br /&gt;
needs: ecx ebp esi r: r9&lt;br /&gt;
entry: eax=-1 ecx=9 edx=-1 ebx=-1 ebp=29 esi=36 edi=-1&lt;br /&gt;
dirty: ecx ebp esi &lt;br /&gt;
  800001d8: LW r16,r29+14&lt;br /&gt;
eax=16 ecx=9 edx=-1 ebx=-1 ebp=29 esi=36 edi=-1 dirty: eax ecx ebp esi &lt;br /&gt;
 32: r0 r9 r16&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
U: A list of MIPS registers which will not be used before they are overwritten (liveness analysis)&lt;br /&gt;
&lt;br /&gt;
UU: A list of MIPS registers for which the upper 32 bits will not be used before they are overwritten&lt;br /&gt;
&lt;br /&gt;
32: Registers that contain 32-bit sign-extended values&lt;br /&gt;
&lt;br /&gt;
pre: The state of the register mapping prior to execution of this instruction.  (-1 = no mapping; 36 = cycle count; The complete list of values with special meanings can be found in the source code)&lt;br /&gt;
&lt;br /&gt;
needs: a list of register mappings that were considered necessary and which could not be eliminated to make room for other mappings&lt;br /&gt;
&lt;br /&gt;
r: Registers that are known to contain 32-bit values and where optimizations rely on the assumption that the register does not contain a value outside of the range -2&amp;lt;sup&amp;gt;31&amp;lt;/sup&amp;gt; to 2&amp;lt;sup&amp;gt;31&amp;lt;/sup&amp;gt;-1&lt;br /&gt;
&lt;br /&gt;
entry: The minimum set of register mappings required to jump to this point&lt;br /&gt;
&lt;br /&gt;
dirty: Cached registers that have been modified and will need to be written back&lt;br /&gt;
&lt;br /&gt;
address: instruction - The decoded opcode, followed by the register mapping in effect after this instruction executes&lt;br /&gt;
&lt;br /&gt;
An asterisk (*) designates locations which are the target of a branch instruction.  Constant propagation will not be performed across these points.&lt;br /&gt;
&lt;br /&gt;
After the complete disassembly, the recompiled native code is shown.&lt;br /&gt;
&lt;br /&gt;
Note that the output can be quite voluminous; 20-30 MB is typical.&lt;br /&gt;
&lt;br /&gt;
==Potential improvements==&lt;br /&gt;
&lt;br /&gt;
===Copy propagation/offset propagation===&lt;br /&gt;
&lt;br /&gt;
A common instruction sequence is of the form:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
LUI r9,12340000&lt;br /&gt;
ADD r9,r9,r8&lt;br /&gt;
LW r9,5678(r9)&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
It would be helpful to recognize this as a load from r8+12345678.  The current constant propagation code does not do so.&lt;br /&gt;
&lt;br /&gt;
===Constant propagation and register assignment===&lt;br /&gt;
&lt;br /&gt;
Constant propagation is currently done after register assignment.  Registers are assigned even if the register will always contain a known value.  In certain cases, such as where the constant is used only to generate a memory address, this could be avoided and no register would need to be allocated.&lt;br /&gt;
&lt;br /&gt;
===Unaligned memory access===&lt;br /&gt;
&lt;br /&gt;
A small improvement could be made by combining adjacent LWL/LWR instructions.  The potential gain from doing so is very limited because these instructions typically represent less than 1% of all memory accesses.&lt;br /&gt;
&lt;br /&gt;
===SLT/branch merging===&lt;br /&gt;
&lt;br /&gt;
A frequent occurrence in MIPS code is an SLT or SLTI instruction followed by a branch.  This is generated relatively inefficiently on x86 and ARM, first doing a compare and set, then testing this value and doing a conditional branch.  Doing only one comparison would save at least one instruction, and could potentially save up to three instructions if the liveness analysis reveals that the result of the SLT instruction is used nowhere else.&lt;br /&gt;
&lt;br /&gt;
While a potentially useful optimization, there are several problems with this approach.  First, there are often additional instructions between the SLT and the branch.  These must be checked to make sure they do not modify the registers, as that would prevent reordering the instruction stream as desired.  Second, if the result of the SLT is found to be live, but unmodified, on both paths of the branch, clean_registers will normally write this value before the branch, to avoid duplicating the writeback code on both paths.  This optimization would have to be removed if the SLT were combined with the branch.&lt;br /&gt;
&lt;br /&gt;
===x86-64===&lt;br /&gt;
&lt;br /&gt;
Currently the x86-64 backend generates only 32-bit instructions.  Proper 64-bit code generation would improve performance.&lt;br /&gt;
&lt;br /&gt;
===PowerPC===&lt;br /&gt;
&lt;br /&gt;
It would be possible to add a PowerPC code generator to the dynamic recompiler.  Currently no one is working on this.  (The mupen64gc project is using a different codebase.)&lt;br /&gt;
&lt;br /&gt;
The following is a summary of the changes which would be necessary to add a PowerPC backend.&lt;br /&gt;
&lt;br /&gt;
The slt* instructions use conditional moves, which are unavailable on PowerPC.  A suitable alternative (such as moves from the condition register) would need to be used.&lt;br /&gt;
&lt;br /&gt;
The assembler can generate as much as 256K of code in a single block; however, conditional branches on PowerPC are limited to +/-64K.  It will be necessary either to restrict the block size or to insert jump tables in a manner similar to the literal pools on ARM.&lt;br /&gt;
&lt;br /&gt;
PowerPC generally relies on early branch resolution rather than statistical branch prediction.  Scheduling branch condition tests earlier may be advantageous.  (For example, the address bounds check could be done in address_generation, rather than just before the load or store.  Similarly it may be advantageous to test the branch condition and update the cycle count before executing the delay slot.)&lt;br /&gt;
&lt;br /&gt;
===MIPS===&lt;br /&gt;
&lt;br /&gt;
Recompiling MIPS into MIPS would be relatively straightforward; however, the current code generator has no facility for filling delay slots, a capability that would be required for efficient code generation.&lt;br /&gt;
&lt;br /&gt;
[[Category:Development]]&lt;/div&gt;</summary>
		<author><name>Ari64</name></author>
		
	</entry>
	<entry>
		<id>https://pandorawiki.org/index.php?title=Mupen64plus_dynamic_recompiler&amp;diff=5298</id>
		<title>Mupen64plus dynamic recompiler</title>
		<link rel="alternate" type="text/html" href="https://pandorawiki.org/index.php?title=Mupen64plus_dynamic_recompiler&amp;diff=5298"/>
		<updated>2011-02-02T21:58:46Z</updated>

		<summary type="html">&lt;p&gt;Ari64: /* Long jumps */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes the dynamic recompiler in [[Mupen64Plus]], and the changes made for the ARM port.&lt;br /&gt;
&lt;br /&gt;
==The original dynamic recompiler by Hacktarux==&lt;br /&gt;
&lt;br /&gt;
The dynamic recompiler used in Mupen64plus v1.5 is based on the original written by Hacktarux in 2002.&lt;br /&gt;
&lt;br /&gt;
It recompiles contiguous blocks of MIPS instructions.  First, each instruction is decoded into a dynamically-allocated 132-byte data structure.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
typedef struct _precomp_instr&lt;br /&gt;
{&lt;br /&gt;
   void (*ops)();&lt;br /&gt;
   union&lt;br /&gt;
     {&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         long long int *rs;&lt;br /&gt;
         long long int *rt;&lt;br /&gt;
         short immediate;&lt;br /&gt;
      } i;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned int inst_index;&lt;br /&gt;
      } j;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         long long int *rs;&lt;br /&gt;
         long long int *rt;&lt;br /&gt;
         long long int *rd;&lt;br /&gt;
         unsigned char sa;&lt;br /&gt;
         unsigned char nrd;&lt;br /&gt;
      } r;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned char base;&lt;br /&gt;
         unsigned char ft;&lt;br /&gt;
         short offset;&lt;br /&gt;
      } lf;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned char ft;&lt;br /&gt;
         unsigned char fs;&lt;br /&gt;
         unsigned char fd;&lt;br /&gt;
      } cf;&lt;br /&gt;
     } f;&lt;br /&gt;
   unsigned int addr; /* word-aligned instruction address in r4300 address space */&lt;br /&gt;
   unsigned int local_addr; /* byte offset to start of corresponding x86_64 instructions, from start of code block */&lt;br /&gt;
   reg_cache_struct reg_cache_infos;&lt;br /&gt;
} precomp_instr;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The decoded instructions are then compiled, generating x86 instructions for each MIPS instruction.  A 32K block is allocated with malloc() to hold the x86 code.  If this size proves insufficient, it is incrementally resized with realloc().&lt;br /&gt;
&lt;br /&gt;
MIPS registers are allocated to x86 registers with a least-recently-used replacement policy.  All cached registers are written back prior to a branch, or a memory read or write.&lt;br /&gt;
&lt;br /&gt;
To facilitate invalidation and replacement of modified code blocks, each 4K page is compiled separately.  If a sequence of instructions crosses a 4K page boundary, the current block is ended, and the next instructions will be compiled as a separate block.&lt;br /&gt;
&lt;br /&gt;
Branch instructions within a 4K page are compiled as branches directly to the target address.  Branches which cross a 4K page go through an indirect address lookup.&lt;br /&gt;
&lt;br /&gt;
Compiled code blocks are invalidated on write.  On an actual MIPS CPU, the instruction cache is invalidated using the CACHE instruction, however a few N64 games clear the cache using other methods.  Trapping writes appears to be the most reliable method of ensuring cache coherency in the emulated system.  The cache instruction is ignored.&lt;br /&gt;
&lt;br /&gt;
==Problems with the original design==&lt;br /&gt;
&lt;br /&gt;
The most significant performance problem with this design is its excessive memory usage.  The decoded instruction data is retained (132 bytes for each MIPS instruction) and occasionally referenced during execution.  Memory accesses frequently miss the L2 cache, resulting in poor performance.&lt;br /&gt;
&lt;br /&gt;
Additionally, the register cache is relatively inefficient, since all registers are flushed before any read or write operation.&lt;br /&gt;
&lt;br /&gt;
==A new approach==&lt;br /&gt;
&lt;br /&gt;
To reduce memory usage, the new design allocates a single large block of memory (currently 16 MiB) which is used for recompiled code.  This memory is allocated using mmap with the PROT_EXEC bit set, to ensure operation on CPUs with no-execute (NX) page permissions.&lt;br /&gt;
&lt;br /&gt;
Contiguous blocks of MIPS instructions are compiled (that is, the recompiler does not attempt to follow branches or 'hot paths').&lt;br /&gt;
&lt;br /&gt;
The recompiler consists of eight stages, plus a linker, and a memory manager.&lt;br /&gt;
&lt;br /&gt;
Compiled blocks are invalidated on write, as before; however, since compiled code may now cross a 4K page boundary, writes may invalidate adjacent pages as well.&lt;br /&gt;
&lt;br /&gt;
Currently the dynarec generates x86, x86-64, ARMv5, and ARMv7 (little-endian).  Most of the code is shared between the architectures, but a different code generator is included at compile time, using different #include statements depending on the CPU type.&lt;br /&gt;
&lt;br /&gt;
==Pass 1: Disassembly==&lt;br /&gt;
&lt;br /&gt;
When an instruction address is encountered which has not been compiled, the function new_recompile_block is called, with the (virtual) address of the target instruction as its sole parameter.  If the address is invalid, the function returns a nonzero value and the caller is responsible for handling the pagefault.&lt;br /&gt;
&lt;br /&gt;
Instructions are decoded until an unconditional jump, usually a return, is encountered.  Disassembly ordinarily continues past a JAL (subroutine call) instruction; however, this strategy is abandoned if invalid instructions are encountered.&lt;br /&gt;
&lt;br /&gt;
Up to 4096 instructions (16K of MIPS code) may be disassembled at once.  Surprisingly, some games do actually reach this limit.&lt;br /&gt;
&lt;br /&gt;
==Pass 2: Liveness analysis==&lt;br /&gt;
&lt;br /&gt;
After disassembly, liveness analysis is performed on the registers.  This determines when a particular register will no longer be used, and thus can be removed from the register cache.&lt;br /&gt;
&lt;br /&gt;
A separate analysis is done on the upper 32 bits of 64-bit registers.  This can determine when only the lower 32 bits of a register are significant, thus allowing use of a 32-bit register.  This enables more efficient code generation on 32-bit processors.&lt;br /&gt;
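&lt;br /&gt;
The analysis can be sketched as a single backward pass over the decoded block.  The following Python sketch is illustrative only: the instruction fields and data layout are assumptions, not the recompiler's actual structures.&lt;br /&gt;
&lt;br /&gt;
```python
# Backward liveness scan over a decoded block: a register's destination
# kills it, its sources revive it (field names here are assumptions).
def liveness(block):
    """needed[i] = MIPS registers still live after instruction i."""
    live = set()
    needed = [None] * len(block)
    for i in reversed(range(len(block))):
        ins = block[i]
        needed[i] = set(live)
        live.discard(ins['rd'])          # result overwritten: dead above here
        for src in ('rs', 'rt'):         # sources must be live before this op
            if ins[src] is not None:
                live.add(ins[src])
    return needed

block = [
    {'rs': 4, 'rt': None, 'rd': 2},      # r2 = op(r4)
    {'rs': 2, 'rt': 5, 'rd': 2},         # r2 = op(r2, r5)
    {'rs': 2, 'rt': None, 'rd': None},   # use r2 (e.g. a store)
]
assert liveness(block) == [{2, 5}, {2}, set()]
```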
&lt;br /&gt;
==Pass 3: Register allocation==&lt;br /&gt;
&lt;br /&gt;
The 31 MIPS registers must be mapped onto seven available registers on x86, or twelve available registers on ARM.&lt;br /&gt;
&lt;br /&gt;
Instructions with 64-bit results require two registers on 32-bit host processors.  32-bit instructions require only one.  A flag is set for registers containing 32-bit values, and these registers will be sign-extended before they are written to the register file.&lt;br /&gt;
&lt;br /&gt;
Registers are made available using a least-soon-needed replacement policy.  When the register cache is full, and no registers can be eliminated using the liveness analysis, a ten-instruction lookahead is used to determine which registers will not be needed soon and these registers are evicted from the cache.&lt;br /&gt;
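&lt;br /&gt;
The eviction choice can be modeled roughly as follows.  The ten-instruction window matches the text above, but the representation of upcoming instructions is an assumption made for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
# 'Least soon needed' eviction: scan a ten-instruction window and evict
# the cached register whose next use is furthest away (window size from
# the text; the representation of upcoming instructions is assumed).
LOOKAHEAD = 10

def pick_victim(cached, upcoming):
    def next_use(reg):
        for dist, regs in enumerate(upcoming[:LOOKAHEAD]):
            if reg in regs:              # regs = registers this op touches
                return dist
        return LOOKAHEAD                 # not needed within the window
    return max(cached, key=next_use)

# r4 is used immediately, r7 one instruction later, r2 last: evict r2.
assert pick_victim([2, 4, 7], [{4}, {7, 9}, {4, 2}]) == 2
```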
&lt;br /&gt;
==Pass 4: Free unused registers==&lt;br /&gt;
&lt;br /&gt;
After the initial register allocation, the allocations are reviewed to determine if any registers remain allocated longer than necessary.  These are then removed from the register cache.  This avoids having to write back a large number of registers just before a branch, and makes more registers available for the next pass.&lt;br /&gt;
&lt;br /&gt;
==Pass 5: Pre-allocate registers==&lt;br /&gt;
&lt;br /&gt;
If a register will be used soon and needs to be loaded, try to load it early if the register is available.  This improves execution on CPUs with a load-use penalty.&lt;br /&gt;
&lt;br /&gt;
==Pass 6: Optimize clean/dirty state==&lt;br /&gt;
&lt;br /&gt;
If a cached register is 'dirty' and needs to be written out, try to do so as soon as the register will no longer be modified.  This avoids having to write the same register on multiple code paths due to conditional branches.  Additionally, try to avoid writing out dirty registers inside of loops.&lt;br /&gt;
&lt;br /&gt;
==Pass 7: Identify where registers are assumed to be 32-bit==&lt;br /&gt;
&lt;br /&gt;
When a 64-bit register is mapped to a 32-bit register with the assumption that the value will be sign-extended before being used, it is necessary to ensure that no register contains a 64-bit value when branching to such a location.  These instructions are flagged to identify them as requiring 32-bit inputs.  This information is used by the linker, and the exception return (ERET) handler.&lt;br /&gt;
&lt;br /&gt;
==Pass 8: Assembly==&lt;br /&gt;
&lt;br /&gt;
This generates and outputs the recompiled code.&lt;br /&gt;
&lt;br /&gt;
Following the main code block, handlers for certain exceptions as well as alternate entry points are added.&lt;br /&gt;
&lt;br /&gt;
If a recompiled instruction relies on a certain MIPS register being cached in a certain native register, then a short 'stub' of code is generated to load the necessary registers.  When an instruction outside of this block needs to jump to that location, it will instead jump to the stub.  The necessary registers will be loaded, and then it will jump into the main code sequence.&lt;br /&gt;
&lt;br /&gt;
On architectures which require literal pools (ARMv5) these are inserted as necessary.&lt;br /&gt;
&lt;br /&gt;
==Linker==&lt;br /&gt;
&lt;br /&gt;
The linker fills in all unresolved branches.  Branches within the block are linked to their target address.  Branches which jump outside of the block are linked to their target if that address has been compiled already. These inter-block branches are recorded in the jump_out array.  This information will be used to remove the links in the event that the target of the branch is invalidated.&lt;br /&gt;
&lt;br /&gt;
Unresolved branches point to a stub which loads the address of the branch instruction and the virtual address of its target into registers, and calls the dynamic linker.  When this code is executed, the dynamic linker will compile the target if necessary, and then patch the branch instruction with the new address.&lt;br /&gt;
&lt;br /&gt;
==Memory manager==&lt;br /&gt;
&lt;br /&gt;
The last step in new_recompile_block is to ensure that there will be sufficient memory available to compile the next block.  If there is not, then the oldest blocks are purged.&lt;br /&gt;
&lt;br /&gt;
The dynarec cache can be described as a 16MB circular buffer divided into eight segments.  Memory is allocated in order from beginning to end.&lt;br /&gt;
&lt;br /&gt;
When there are fewer than two empty segments, the next segment in sequence is cleared, wrapping around to the beginning of the buffer.  This continues as memory is needed, wrapping around from end to beginning.&lt;br /&gt;
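&lt;br /&gt;
The allocation scheme can be modeled roughly as follows.  This is a simplified Python sketch: block bookkeeping is reduced to plain lists and the purge timing is approximate.&lt;br /&gt;
&lt;br /&gt;
```python
# 16 MB circular buffer in eight 2 MB segments; the oldest blocks are
# purged to keep empty segments ahead of the write pointer (bookkeeping
# simplified to plain lists; purge timing is approximate).
CACHE_SIZE = 16 * 1024 * 1024
SEGMENTS = 8
SEG_SIZE = CACHE_SIZE // SEGMENTS

class DynarecCache:
    def __init__(self):
        self.out = 0                                  # next free offset
        self.blocks = [[] for _ in range(SEGMENTS)]   # block list per segment

    def alloc(self, size):
        if self.out + size > CACHE_SIZE:
            self.out = 0                              # wrap to the beginning
        seg = self.out // SEG_SIZE
        for ahead in (1, 2):                          # purge the next two
            self.blocks[(seg + ahead) % SEGMENTS] = []
        addr = self.out
        self.blocks[seg].append(addr)
        self.out += size
        return addr

cache = DynarecCache()
assert (cache.alloc(100), cache.alloc(50)) == (0, 100)
```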
&lt;br /&gt;
==Invalidation and restoration==&lt;br /&gt;
&lt;br /&gt;
Normally, code blocks are looked up via a hash table, using the virtual address of the target instruction.  However, for purposes of invalidation, blocks are grouped by physical address.  This can be described as a virtually-indexed, physically-tagged (VIPT) cache.&lt;br /&gt;
&lt;br /&gt;
References to compiled blocks are stored in one of 4096 linked lists in the jump_in array.  Each list covers a 4K block of memory, and 2048 such lists are sufficient to cover the 8MB of RAM in the Nintendo 64.  The remaining lists are for code in ROM, and the bootloader in SP memory.&lt;br /&gt;
&lt;br /&gt;
When a write hits a memory page marked as cached, all entries in the corresponding list are invalidated.  If any code is found to cross a 4K boundary, the adjacent lists are invalidated also.&lt;br /&gt;
&lt;br /&gt;
Sometimes blocks may be invalidated even when none of the code is actually modified.  This can happen if data is written to memory in the same 4K page, or if code is reloaded without actually modifying it.  If blocks which were previously invalidated are subsequently found to be unmodified, those blocks are marked in the restore_candidate array.  If the block remains unmodified, it will be restored as a valid block, to avoid recompiling blocks which do not need to be recompiled.  This is performed by the clean_blocks function which is called periodically.&lt;br /&gt;
&lt;br /&gt;
The jump_in array, which lists unmodified blocks, is physically indexed, however the jump_dirty array, which lists potentially-modified blocks, is virtually indexed.  This allows blocks which have changed physical addresses, but are still at the same virtual address, to be recognized as being the same as before, and not in need of recompilation.&lt;br /&gt;
&lt;br /&gt;
==Dynamic linker==&lt;br /&gt;
&lt;br /&gt;
Branches with unresolved addresses jump to the dynamic linker.  This will look through the jump_in list corresponding to the physical page containing the virtual target address.  If found, the branch instruction will be patched with the address, and then it will jump to this address.&lt;br /&gt;
&lt;br /&gt;
If not found, the jump_dirty list will be searched for blocks which were previously compiled but may have been modified.  If a potential match is found, the code will be compared against a cached copy to determine if any changes have been made.  If not, then it will jump to the block.  Because the memory could be modified again, branch instructions referencing these blocks are not altered, and continue to point to the dynamic linker.  These blocks will continue to be verified each time they are called, until restored to the jump_in list by the clean_blocks function described above.&lt;br /&gt;
&lt;br /&gt;
If no compiled block is found, or the existing block was modified, the target is recompiled.&lt;br /&gt;
&lt;br /&gt;
==Address lookup==&lt;br /&gt;
&lt;br /&gt;
When a JR (jump register) instruction is encountered, the address of the recompiled code must be looked up using the address of the MIPS code.  The majority of such instructions jump to the link register (r31) to return to the location following a JAL (call) instruction.&lt;br /&gt;
&lt;br /&gt;
When a JAL or JALR is executed, the address of the following instruction is inserted into a small 32-entry hash table, which is checked when a JR r31 instruction is executed.  This allows for a quick return from subroutine calls.&lt;br /&gt;
&lt;br /&gt;
If the JR instruction uses a register other than r31, or the small hash table lookup fails to find a match, a larger 131072-entry hash table is checked.  This table contains 65536 bins with up to 2 addresses per bin.  If this also fails to find a match (which occurs less than 1% of the time) an exhaustive search of all compiled addresses within that 4K memory page is performed.&lt;br /&gt;
&lt;br /&gt;
If no match is found by any of these methods, the target address is compiled, and the new address is inserted into the hash table.&lt;br /&gt;
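&lt;br /&gt;
The two-level lookup can be sketched as follows.  The table sizes come from the text above, but the hash functions shown are placeholders, not the ones actually used.&lt;br /&gt;
&lt;br /&gt;
```python
MINI_SIZE = 32
BINS = 65536                              # 131072 entries: 2 per bin

mini_ht = [None] * MINI_SIZE              # filled when a JAL/JALR executes
big_ht = [[None, None] for _ in range(BINS)]

def mini_hash(vaddr):
    return (vaddr // 4) % MINI_SIZE       # placeholder hash, not the real one

def big_hash(vaddr):
    return (vaddr // 4) % BINS            # placeholder hash, not the real one

def record_return(vaddr, native):
    mini_ht[mini_hash(vaddr)] = (vaddr, native)

def lookup(vaddr, is_r31):
    if is_r31:                            # fast path for JR r31 returns
        e = mini_ht[mini_hash(vaddr)]
        if e is not None and e[0] == vaddr:
            return e[1]
    for e in big_ht[big_hash(vaddr)]:     # up to 2 candidates per bin
        if e is not None and e[0] == vaddr:
            return e[1]
    return None                           # page scan or recompile needed

record_return(0x80001000, 'native_a')
big_ht[big_hash(0x80002000)][0] = (0x80002000, 'native_b')
assert lookup(0x80001000, True) == 'native_a'
assert lookup(0x80002000, False) == 'native_b'
assert lookup(0x80003000, False) is None
```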
&lt;br /&gt;
==Cycle counting==&lt;br /&gt;
&lt;br /&gt;
Cycles are counted before each branch by adding the cycles from the preceding instructions to a specific register.  The cycle count is kept in R10 on ARM and ESI on x86.  The value in this register is normally negative.  When it becomes non-negative, a conditional branch is taken which jumps to an interrupt handler.&lt;br /&gt;
&lt;br /&gt;
For example, the following x86 code adds eight cycles:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add $8,%esi&lt;br /&gt;
jns interrupt_handler&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The conditional branch jumps to a small bit of code located after the main compiled block, which saves the cached registers to the register file, sets the instruction pointer which will be used upon return from the interrupt, and then calls cc_interrupt.&lt;br /&gt;
&lt;br /&gt;
As in the original mupen64plus, the emulated clock runs at 37.5 MHz, and each instruction takes 2 clock cycles.&lt;br /&gt;
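&lt;br /&gt;
The counting scheme can be modeled as follows, mirroring the add/jns pair above.  The starting counter value is illustrative, not taken from the source.&lt;br /&gt;
&lt;br /&gt;
```python
# Model of the add/jns pair above: the counter starts negative, each
# instruction costs 2 cycles, and the interrupt check fires once the
# counter becomes non-negative (the starting value is illustrative).
CYCLES_PER_INSTR = 2

def run_block(n_instructions, count):
    """Return (new_count, interrupt_taken) after one compiled block."""
    count += n_instructions * CYCLES_PER_INSTR    # add $N,%esi
    return count, count >= 0                      # jns interrupt_handler

count = -100                        # 50 instructions until the next event
count, irq = run_block(8, count)
assert (count, irq) == (-84, False)
count, irq = run_block(42, count)
assert (count, irq) == (0, True)
```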
&lt;br /&gt;
==Interrupt handler==&lt;br /&gt;
&lt;br /&gt;
When the cycle count register reaches its limit, cc_interrupt is called, which in turn calls gen_interupt [sic].  If interrupts are not enabled, cc_interrupt returns.  If interrupts are enabled, and an interrupt is to be taken, the pending_exception flag will be set.  In this case, cc_interrupt does not return, and instead pops the stack and causes an unconditional jump to the address in pcaddr (usually 0x80000180).&lt;br /&gt;
&lt;br /&gt;
There is one additional case where the interrupt handler may be called.  If interrupts were disabled, and are enabled by writing to coprocessor 0 register 12, any pending interrupts are handled immediately.&lt;br /&gt;
&lt;br /&gt;
==Delay slots==&lt;br /&gt;
&lt;br /&gt;
MIPS has 'delay slots', where the instruction after the branch is executed before the branch is taken.  Instructions in delay slots are issued out-of-order in the recompiled code.&lt;br /&gt;
&lt;br /&gt;
[[Image:Recompiler delay slot reordering.png]]&lt;br /&gt;
&lt;br /&gt;
When a branch jumps into the delay slot of another branch, this case must be handled slightly differently:&lt;br /&gt;
&lt;br /&gt;
[[Image:Recompiler delay slot reordering 2.png]]&lt;br /&gt;
&lt;br /&gt;
The branch test and delay slot are executed in-order if a dependency exists, or for 'likely' branches where the delay slot is nullified when the branch condition is false.  These cases are infrequent (typically less than 10% of branches).&lt;br /&gt;
&lt;br /&gt;
==Constant propagation==&lt;br /&gt;
&lt;br /&gt;
When an instruction loads a constant into a register, the register is tagged as a constant.  The constant tag will be retained if subsequent instructions modify the constant using other constants.  During assembly, such a sequence of instructions is combined into a single load.  For example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
LUI r8,12340000  --&amp;gt;  mov $0x12345678,%eax&lt;br /&gt;
ORI r8,r8,5678&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This optimization is not performed where a branch target intervenes, e.g.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
...&lt;br /&gt;
 LUI r8,12340000  --&amp;gt;  mov $0x12340000,%eax&lt;br /&gt;
L1:&lt;br /&gt;
 ORI r8,r8,5678   --&amp;gt;  or $0x5678,%eax&lt;br /&gt;
 ...&lt;br /&gt;
 BEQ r0,r0,L1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Registers containing constants are identified by bits in the isconst and wasconst fields of the regstat structure.  The wasconst bit is set for a register if the register contained a known constant before the instruction, and the isconst bit is set for a register if the register will contain a known constant after the instruction executes.&lt;br /&gt;
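&lt;br /&gt;
The isconst/wasconst bookkeeping can be sketched as follows.  This Python model handles only the LUI/ORI case from the example, and uses sets of register numbers in place of the actual bit fields.&lt;br /&gt;
&lt;br /&gt;
```python
# isconst/wasconst bookkeeping, modeled with sets instead of bit fields,
# for the LUI/ORI case shown above (other opcodes simply kill the tag).
def propagate(block):
    consts = {}                          # reg number -> known constant
    for ins in block:
        ins['wasconst'] = set(consts)    # known constants before this op
        op, rd = ins['op'], ins['rd']
        if op == 'LUI':
            consts[rd] = ins['imm'] * 65536
        elif op == 'ORI' and ins['rs'] in consts:
            consts[rd] = consts[ins['rs']] | ins['imm']
        else:
            consts.pop(rd, None)         # result is no longer a constant
        ins['isconst'] = set(consts)     # known constants after this op
    return consts

block = [
    {'op': 'LUI', 'rd': 8, 'imm': 0x1234},
    {'op': 'ORI', 'rd': 8, 'rs': 8, 'imm': 0x5678},
]
assert propagate(block)[8] == 0x12345678  # folds to a single constant load
```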
&lt;br /&gt;
==Translation lookaside buffer emulation==&lt;br /&gt;
&lt;br /&gt;
Most Nintendo 64 games do not use virtual memory, but some do.  At startup, main memory is directly mapped at addresses from 0x80000000 to 0x803FFFFF, or up to 0x807FFFFF if the memory expansion is used.&lt;br /&gt;
&lt;br /&gt;
Normally, read or write operations are checked against this range, and if outside this range, control is passed to the appropriate I/O handler.  This is done as follows:&lt;br /&gt;
&lt;br /&gt;
MIPS instruction: LW r1,8(r2)&lt;br /&gt;
&lt;br /&gt;
ARM code:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add r1, r2, #8&lt;br /&gt;
cmp r1, #8388608&lt;br /&gt;
bvc handler&lt;br /&gt;
ldr r1, [r1]&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If there are valid entries in the TLB, this would instead be compiled as follows:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add r1, r2, #8&lt;br /&gt;
mov r0, #264&lt;br /&gt;
add r0, r0, r1, lsr #12&lt;br /&gt;
ldr r0, [r11, r0, lsl #2]&lt;br /&gt;
tst r0, r0&lt;br /&gt;
bmi handler&lt;br /&gt;
ldr r1, [r1, r0, lsl #2]&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This looks up the offset in the memory_map table (which, in this example, is located at r11+264*4).  The high bit is tested to determine whether a valid mapping exists for this page.&lt;br /&gt;
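&lt;br /&gt;
The lookup above can be modeled in Python as follows.  The memory_map layout, with the high bit marking unmapped pages, follows the text; the toy memory representation is an assumption.&lt;br /&gt;
&lt;br /&gt;
```python
PAGE = 4096
INVALID = 2 ** 31                        # 'high bit set': no valid mapping

memory_map = {}                          # virtual page number -> word offset
memory_map[0x80001000 // PAGE] = 0       # identity-map one page (offset 0)
ram = {0x80001008: 0xDEADBEEF}           # toy host memory, word-addressed

def tlb_read(vaddr):
    entry = memory_map.get(vaddr // PAGE, INVALID)   # the lsr #12 lookup
    if entry >= INVALID:
        return None                      # corresponds to the 'bmi handler' path
    return ram[vaddr + entry * 4]        # ldr with the offset scaled by 4

assert tlb_read(0x80001008) == 0xDEADBEEF
assert tlb_read(0x12345678) is None
```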
&lt;br /&gt;
==Page fault emulation==&lt;br /&gt;
&lt;br /&gt;
If a memory address references an invalid page, a conditional branch is taken to a bit of code placed after the main block.  This will save any modified cached registers, and call an appropriate handler function for the address.  The handler will either perform I/O, or generate an exception (pagefault).&lt;br /&gt;
&lt;br /&gt;
==Mapping executable pages==&lt;br /&gt;
&lt;br /&gt;
If the dynamic recompiler encounters code which is not in contiguous physical memory, it will end the block at the page boundary, so that the block can be removed cleanly if the mapping is changed.&lt;br /&gt;
&lt;br /&gt;
There is a special case of this, where a branch and delay slot span two pages.  If a branch instruction is the last instruction in a virtual memory page, it is compiled in a different manner than other branches.  The branch condition is evaluated, and the target address is placed in a register (%ebp on x86, and r8 on ARM).  A special form of the dynamic linker (dyna_linker_ds) is used to link the branch to its corresponding delay slot in another block.  If no page fault occurs, the delay slot executes and then jumps to the address in the register.  For conditional branches that are not taken, the target address is the next instruction.  This code is generated by the pagespan_assemble function.&lt;br /&gt;
&lt;br /&gt;
== Self-modifying code detection ==&lt;br /&gt;
&lt;br /&gt;
Pages which contain no compiled code, or whose code may have been modified since compilation, are marked in the invalid_code array.  Writes are checked against this array, and if a write hits a valid (compiled and unmodified) page, invalidate_block is called:&lt;br /&gt;
&lt;br /&gt;
MIPS instruction: SW r1,8(r2)&lt;br /&gt;
&lt;br /&gt;
ARM code:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
ldr r3, [r11, #88]  // pointer to invalid_code&lt;br /&gt;
add r4, r2, #8&lt;br /&gt;
cmp r4, #8388608&lt;br /&gt;
bvc handler&lt;br /&gt;
str r1, [r4]&lt;br /&gt;
ldrb r14, [r3, r4, lsr #12]&lt;br /&gt;
cmp r14, #1&lt;br /&gt;
bne invstub&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In TLB mode, the invalid_code array is not checked directly.  Instead, pages are marked non-writable in memory_map.&lt;br /&gt;
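&lt;br /&gt;
The write-side check can be sketched as follows.  invalid_code is one byte per 4K page as described above, but the helper names and initialization here are assumptions.&lt;br /&gt;
&lt;br /&gt;
```python
# One byte per 4K page: nonzero means the page has no valid compiled
# code.  A store into a page still marked valid triggers invalidation.
PAGE_SIZE = 4096
invalid_code = bytearray([1] * 2 ** 20)  # start with every page invalid
ram = {}

def invalidate_block(page):
    invalid_code[page] = 1               # compiled blocks here are now stale

def write_word(addr, value):
    ram[addr] = value
    page = addr // PAGE_SIZE
    if invalid_code[page] == 0:          # page held valid compiled code
        invalidate_block(page)

invalid_code[0x1000 // PAGE_SIZE] = 0    # pretend this page was compiled
write_word(0x1234, 7)                    # store into the compiled page
assert invalid_code[0x1234 // PAGE_SIZE] == 1
```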
&lt;br /&gt;
== Long jumps ==&lt;br /&gt;
&lt;br /&gt;
Branch instructions are limited to a +/-32MB range on ARM.  In some cases, the dynamic recompiler needs to generate calls to locations beyond this range.  This is accomplished via a jump table located at the end of the code generation area, and the full address is loaded via a pointer.  The jump table is generated in arch_init().&lt;br /&gt;
&lt;br /&gt;
As these indirect jumps cause some delay, it is best to avoid this situation if possible, by locating this area close to the other executable code.&lt;br /&gt;
&lt;br /&gt;
== Compile options ==&lt;br /&gt;
&lt;br /&gt;
=== ARMv5_ONLY ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the UXTH instruction is not used, and the dynamic recompiler will generate literal pools instead of using movw/movt.  This provides compatibility with older processors, but generates somewhat less efficient code.&lt;br /&gt;
&lt;br /&gt;
=== RAM_OFFSET ===&lt;br /&gt;
&lt;br /&gt;
When compiling for ARM, this allocates an additional register which is used to add an offset to all pointers to memory addresses between 0x80000000 and 0x807FFFFF.  This allows the N64's memory to be mapped at an alternate address.  It incurs a small performance penalty, but is required on certain operating systems (e.g. Google Android, which places shared libraries at 0x80000000).&lt;br /&gt;
&lt;br /&gt;
This option is not used for x86.  The x86 instruction set allows for a full 32-bit offset in the instruction encoding, making it unnecessary to allocate an additional register for this purpose.&lt;br /&gt;
&lt;br /&gt;
=== CORTEX_A8_BRANCH_PREDICTION_HACK ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the dynamic recompiler will avoid generating consecutive branch instructions without another instruction in between.  This avoids a possible branch misprediction on the Cortex-A8 due to this processor having dual instruction decoders, but only one branch-prediction unit.  See [[Assembly Code Optimization]] for details.&lt;br /&gt;
&lt;br /&gt;
=== USE_MINI_HT ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, attempt to look up return addresses in a small hash table before checking the larger hash table.  Usually improves performance.&lt;br /&gt;
&lt;br /&gt;
=== IMM_PREFETCH ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the x86 PREFETCH instruction is used to prefetch entries from the hash table.  The increase in code size often outweighs the benefit of this.&lt;br /&gt;
&lt;br /&gt;
=== REG_PREFETCH ===&lt;br /&gt;
&lt;br /&gt;
Similar to the above, but loads the address into a register first, then uses the ARM PLD instruction.  The increase in code size almost always outweighs the benefit of this.&lt;br /&gt;
&lt;br /&gt;
=== R29_HACK ===&lt;br /&gt;
&lt;br /&gt;
Assume that the stack pointer (r29) is always a valid memory address and do not check it.  It is similar to the optimization described [http://strmnnrmn.blogspot.com/2007/08/interesting-dynarec-hack.html here].  This can crash the emulator and is not enabled by default.&lt;br /&gt;
&lt;br /&gt;
== Debugging ==&lt;br /&gt;
&lt;br /&gt;
Debugging information can be obtained by defining the assem_debug macro as printf.  This will cause the dynamic recompiler to print debugging information to stdout.  For each disassembled MIPS instruction, an entry similar to the following will be printed:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
U: r1 r8 r11 r16 r31 UU: r29 32: r0 r9&lt;br /&gt;
pre: eax=-1 ecx=9 edx=-1 ebx=-1 ebp=29 esi=36 edi=-1&lt;br /&gt;
needs: ecx ebp esi r: r9&lt;br /&gt;
entry: eax=-1 ecx=9 edx=-1 ebx=-1 ebp=29 esi=36 edi=-1&lt;br /&gt;
dirty: ecx ebp esi &lt;br /&gt;
  800001d8: LW r16,r29+14&lt;br /&gt;
eax=16 ecx=9 edx=-1 ebx=-1 ebp=29 esi=36 edi=-1 dirty: eax ecx ebp esi &lt;br /&gt;
 32: r0 r9 r16&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
U: A list of MIPS registers which will not be used before they are overwritten (liveness analysis)&lt;br /&gt;
&lt;br /&gt;
UU: A list of MIPS registers for which the upper 32 bits will not be used before they are overwritten&lt;br /&gt;
&lt;br /&gt;
32: Registers that contain 32-bit sign-extended values&lt;br /&gt;
&lt;br /&gt;
pre: The state of the register mapping prior to execution of this instruction.  (-1 = no mapping; 36 = cycle count; The complete list of values with special meanings can be found in the source code)&lt;br /&gt;
&lt;br /&gt;
needs: a list of register mappings that were considered necessary and which could not be eliminated to make room for other mappings&lt;br /&gt;
&lt;br /&gt;
r: Registers that are known to contain 32-bit values and where optimizations rely on the assumption that the register does not contain a value outside of the range -2&amp;lt;sup&amp;gt;31&amp;lt;/sup&amp;gt; to 2&amp;lt;sup&amp;gt;31&amp;lt;/sup&amp;gt;-1&lt;br /&gt;
&lt;br /&gt;
entry: The minimum set of register mappings required to jump to this point&lt;br /&gt;
&lt;br /&gt;
dirty: Cached registers that have been modified and will need to be written back&lt;br /&gt;
&lt;br /&gt;
address: instruction - The decoded opcode, followed by the register mapping in effect after this instruction executes&lt;br /&gt;
&lt;br /&gt;
An asterisk (*) designates locations which are the target of a branch instruction.  Constant propagation will not be performed across these points.&lt;br /&gt;
&lt;br /&gt;
After the complete disassembly, the recompiled native code is shown.&lt;br /&gt;
&lt;br /&gt;
Note that the output can be quite voluminous; 20-30 MB is typical.&lt;br /&gt;
&lt;br /&gt;
==Potential improvements==&lt;br /&gt;
&lt;br /&gt;
===Copy propagation/offset propagation===&lt;br /&gt;
&lt;br /&gt;
A common instruction sequence is of the form:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
LUI r9,12340000&lt;br /&gt;
ADD r9,r9,r8&lt;br /&gt;
LW r9,5678(r9)&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
It would be helpful to recognize this as a load from r8+12345678.  The current constant propagation code does not do so.&lt;br /&gt;
&lt;br /&gt;
===Constant propagation and register assignment===&lt;br /&gt;
&lt;br /&gt;
Constant propagation is currently done after register assignment.  Registers are assigned even if the register will always contain a known value.  In certain cases, such as where the constant is used only to generate a memory address, this could be avoided and no register would need to be allocated.&lt;br /&gt;
&lt;br /&gt;
===Unaligned memory access===&lt;br /&gt;
&lt;br /&gt;
A small improvement could be made by combining adjacent LWL/LWR instructions.  The potential gain from doing so is very limited because these instructions typically represent less than 1% of all memory accesses.&lt;br /&gt;
&lt;br /&gt;
===SLT/branch merging===&lt;br /&gt;
&lt;br /&gt;
A frequent occurrence in MIPS code is an SLT or SLTI instruction followed by a branch.  This is generated relatively inefficiently on x86 and ARM, first doing a compare and set, then testing this value and doing a conditional branch.  Doing only one comparison would save at least one instruction, and could potentially save up to three instructions if the liveness analysis reveals that the result of the SLT instruction is used nowhere else.&lt;br /&gt;
&lt;br /&gt;
While potentially useful, this optimization presents several problems.  First, there are often additional instructions between the SLT and the branch; these must be checked to ensure they do not modify the relevant registers, since that would prevent reordering the instruction stream as desired.  Secondly, if the result of the SLT is found to be live, but unmodified, on both paths of the branch, clean_registers will normally write this value back before the branch, to avoid duplicating the writeback code on both paths.  This optimization would have to be suppressed if the SLT were combined with the branch.&lt;br /&gt;
&lt;br /&gt;
===x86-64===&lt;br /&gt;
&lt;br /&gt;
Currently the x86-64 backend generates only 32-bit instructions.  Proper 64-bit code generation would improve performance.&lt;br /&gt;
&lt;br /&gt;
===PowerPC===&lt;br /&gt;
&lt;br /&gt;
It would be possible to add a PowerPC code generator to the dynamic recompiler.  Currently no one is working on this.  (The mupen64gc project is using a different codebase.)&lt;br /&gt;
&lt;br /&gt;
The following is a summary of the changes which would be necessary to add a PowerPC backend.&lt;br /&gt;
&lt;br /&gt;
The slt* instructions use conditional moves, which are unavailable on PowerPC.  A suitable alternative (such as moves from the condition register) would need to be used.&lt;br /&gt;
&lt;br /&gt;
The assembler can generate as much as 256K of code in a single block; however, conditional branches on PowerPC are limited to +/-32K.  It will be necessary either to restrict the block size, or to insert jump tables in a manner similar to the literal pools on ARM.&lt;br /&gt;
&lt;br /&gt;
PowerPC generally relies on early branch resolution rather than statistical branch prediction.  Scheduling branch condition tests earlier may be advantageous.  (For example, the address bounds check could be done in address_generation, rather than just before the load or store.  Similarly it may be advantageous to test the branch condition and update the cycle count before executing the delay slot.)&lt;br /&gt;
&lt;br /&gt;
===MIPS===&lt;br /&gt;
&lt;br /&gt;
Recompiling MIPS into MIPS would be relatively straightforward; however, the current code generator has no facility for filling delay slots.  This capability would be required for efficient code generation.&lt;br /&gt;
&lt;br /&gt;
[[Category:Development]]&lt;/div&gt;</summary>
		<author><name>Ari64</name></author>
		
	</entry>
	<entry>
		<id>https://pandorawiki.org/index.php?title=Mupen64plus_dynamic_recompiler&amp;diff=5297</id>
		<title>Mupen64plus dynamic recompiler</title>
		<link rel="alternate" type="text/html" href="https://pandorawiki.org/index.php?title=Mupen64plus_dynamic_recompiler&amp;diff=5297"/>
		<updated>2011-02-02T21:53:39Z</updated>

		<summary type="html">&lt;p&gt;Ari64: /* Constant propagation and register assignment */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes the dynamic recompiler in [[Mupen64Plus]], and the changes made for the ARM port.&lt;br /&gt;
&lt;br /&gt;
==The original dynamic recompiler by Hacktarux==&lt;br /&gt;
&lt;br /&gt;
The dynamic recompiler used in Mupen64plus v1.5 is based on the original written by Hacktarux in 2002.&lt;br /&gt;
&lt;br /&gt;
It recompiles contiguous blocks of MIPS instructions.  First, each instruction is decoded into a dynamically-allocated 132-byte data structure.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
typedef struct _precomp_instr&lt;br /&gt;
{&lt;br /&gt;
   void (*ops)();&lt;br /&gt;
   union&lt;br /&gt;
     {&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         long long int *rs;&lt;br /&gt;
         long long int *rt;&lt;br /&gt;
         short immediate;&lt;br /&gt;
      } i;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned int inst_index;&lt;br /&gt;
      } j;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         long long int *rs;&lt;br /&gt;
         long long int *rt;&lt;br /&gt;
         long long int *rd;&lt;br /&gt;
         unsigned char sa;&lt;br /&gt;
         unsigned char nrd;&lt;br /&gt;
      } r;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned char base;&lt;br /&gt;
         unsigned char ft;&lt;br /&gt;
         short offset;&lt;br /&gt;
      } lf;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned char ft;&lt;br /&gt;
         unsigned char fs;&lt;br /&gt;
         unsigned char fd;&lt;br /&gt;
      } cf;&lt;br /&gt;
     } f;&lt;br /&gt;
   unsigned int addr; /* word-aligned instruction address in r4300 address space */&lt;br /&gt;
   unsigned int local_addr; /* byte offset to start of corresponding x86_64 instructions, from start of code block */&lt;br /&gt;
   reg_cache_struct reg_cache_infos;&lt;br /&gt;
} precomp_instr;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The decoded instructions are then compiled, generating x86 instructions for each MIPS instruction.  A 32K block is allocated with malloc() to hold the x86 code.  If this size proves insufficient, it is incrementally resized with realloc().&lt;br /&gt;
&lt;br /&gt;
MIPS registers are allocated to x86 registers with a least-recently-used replacement policy.  All cached registers are written back prior to a branch, or a memory read or write.&lt;br /&gt;
&lt;br /&gt;
To facilitate invalidation and replacement of modified code blocks, each 4K page is compiled separately.  If a sequence of instructions crosses a 4K page boundary, the current block is ended, and the next instructions will be compiled as a separate block.&lt;br /&gt;
&lt;br /&gt;
Branch instructions within a 4K page are compiled as branches directly to the target address.  Branches which cross a 4K page go through an indirect address lookup.&lt;br /&gt;
&lt;br /&gt;
Compiled code blocks are invalidated on write.  On an actual MIPS CPU, the instruction cache is invalidated using the CACHE instruction, however a few N64 games clear the cache using other methods.  Trapping writes appears to be the most reliable method of ensuring cache coherency in the emulated system.  The cache instruction is ignored.&lt;br /&gt;
&lt;br /&gt;
==Problems with the original design==&lt;br /&gt;
&lt;br /&gt;
The most significant performance problem with this design is its excessive memory usage.  The decoded instruction data is retained (132 bytes for each MIPS instruction) and occasionally referenced during execution.  Memory accesses frequently miss the L2 cache, resulting in poor performance.&lt;br /&gt;
&lt;br /&gt;
Additionally, the register cache is relatively inefficient, since all registers are flushed before any read or write operation.&lt;br /&gt;
&lt;br /&gt;
==A new approach==&lt;br /&gt;
&lt;br /&gt;
To reduce memory usage, the new design allocates a single large block of memory (currently 16 MiB) which is used for recompiled code.  This memory is allocated using mmap with the PROT_EXEC bit set, to ensure operation on CPUs with no-execute (NX) page permissions.&lt;br /&gt;
&lt;br /&gt;
The recompiler processes contiguous blocks of MIPS instructions; it does not attempt to follow branches or identify 'hot paths'.&lt;br /&gt;
&lt;br /&gt;
The recompiler consists of eight stages, plus a linker, and a memory manager.&lt;br /&gt;
&lt;br /&gt;
Compiled blocks are invalidated on write, as before; however, because a compiled block may cross a 4K page boundary, a write may invalidate adjacent pages as well.&lt;br /&gt;
&lt;br /&gt;
Currently the dynarec generates x86, x86-64, ARMv5, and ARMv7 (little-endian).  Most of the code is shared between the architectures, but a different code generator is included at compile time, using different #include statements depending on the CPU type.&lt;br /&gt;
&lt;br /&gt;
==Pass 1: Disassembly==&lt;br /&gt;
&lt;br /&gt;
When an instruction address is encountered which has not been compiled, the function new_recompile_block is called, with the (virtual) address of the target instruction as its sole parameter.  If the address is invalid, the function returns a nonzero value and the caller is responsible for handling the pagefault.&lt;br /&gt;
&lt;br /&gt;
Instructions are decoded until an unconditional jump, usually a return, is encountered.  Disassembly ordinarily continues past a JAL (subroutine call) instruction; however, this strategy is abandoned if invalid instructions are encountered.&lt;br /&gt;
&lt;br /&gt;
Up to 4096 instructions (16K of MIPS code) may be disassembled at once.  Surprisingly, some games do actually reach this limit.&lt;br /&gt;
&lt;br /&gt;
==Pass 2: Liveness analysis==&lt;br /&gt;
&lt;br /&gt;
After disassembly, liveness analysis is performed on the registers.  This determines when a particular register will no longer be used, and thus can be removed from the register cache.&lt;br /&gt;
&lt;br /&gt;
A separate analysis is done on the upper 32 bits of 64-bit registers.  This can determine when only the lower 32 bits of a register are significant, thus allowing use of a 32-bit register.  This enables more efficient code generation on 32-bit processors.&lt;br /&gt;
&lt;br /&gt;
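The backward scan over the block can be sketched as follows (illustrative pseudocode; the names are not those used in the actual source, and the merging of liveness information at branch targets is omitted):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
live = all registers                 // conservative assumption at block exit&lt;br /&gt;
for i = last instruction down to first instruction:&lt;br /&gt;
    unneeded[i] = ~live              // not read again before being overwritten&lt;br /&gt;
    live = (live &amp;amp; ~written(i)) | read(i)   // a write kills a value, a read revives it&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The same scan, applied only to the upper 32 bits of each register, produces the upper-half liveness information.&lt;br /&gt;
&lt;br /&gt;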
==Pass 3: Register allocation==&lt;br /&gt;
&lt;br /&gt;
The 31 MIPS registers must be mapped onto seven available registers on x86, or twelve available registers on ARM.&lt;br /&gt;
&lt;br /&gt;
Instructions with 64-bit results require two registers on 32-bit host processors.  32-bit instructions require only one.  A flag is set for registers containing 32-bit values, and these registers will be sign-extended before they are written to the register file.&lt;br /&gt;
&lt;br /&gt;
Registers are made available using a least-soon-needed replacement policy.  When the register cache is full, and no registers can be eliminated using the liveness analysis, a ten-instruction lookahead is used to determine which registers will not be needed soon and these registers are evicted from the cache.&lt;br /&gt;
&lt;br /&gt;
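The eviction choice can be sketched as follows (illustrative pseudocode, not the actual source):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
for each host register h currently mapped to a MIPS register:&lt;br /&gt;
    dist[h] = instructions until that MIPS register is next used&lt;br /&gt;
              (scanning at most 10 instructions ahead)&lt;br /&gt;
evict the mapping with the largest dist[h]   // least soon needed&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;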
==Pass 4: Free unused registers==&lt;br /&gt;
&lt;br /&gt;
After the initial register allocation, the allocations are reviewed to determine if any registers remain allocated longer than necessary.  These are then removed from the register cache.  This avoids having to write back a large number of registers just before a branch, and makes more registers available for the next pass.&lt;br /&gt;
&lt;br /&gt;
==Pass 5: Pre-allocate registers==&lt;br /&gt;
&lt;br /&gt;
If a register will be used soon and needs to be loaded, try to load it early if the register is available.  This improves execution on CPUs with a load-use penalty.&lt;br /&gt;
&lt;br /&gt;
==Pass 6: Optimize clean/dirty state==&lt;br /&gt;
&lt;br /&gt;
If a cached register is 'dirty' and needs to be written out, try to do so as soon as the register will no longer be modified.  This avoids having to write the same register on multiple code paths due to conditional branches.  Additionally, try to avoid writing out dirty registers inside of loops.&lt;br /&gt;
&lt;br /&gt;
==Pass 7: Identify where registers are assumed to be 32-bit==&lt;br /&gt;
&lt;br /&gt;
When a 64-bit register is mapped to a 32-bit register with the assumption that the value will be sign-extended before being used, it is necessary to ensure that no register contains a 64-bit value when branching to such a location.  These instructions are flagged to identify them as requiring 32-bit inputs.  This information is used by the linker, and the exception return (ERET) handler.&lt;br /&gt;
&lt;br /&gt;
==Pass 8: Assembly==&lt;br /&gt;
&lt;br /&gt;
This generates and outputs the recompiled code.&lt;br /&gt;
&lt;br /&gt;
Following the main code block, handlers for certain exceptions as well as alternate entry points are added.&lt;br /&gt;
&lt;br /&gt;
If a recompiled instruction relies on a certain MIPS register being cached in a certain native register, then a short 'stub' of code is generated to load the necessary registers.  When an instruction outside of this block needs to jump to that location, it will instead jump to the stub.  The necessary registers will be loaded, and then it will jump into the main code sequence.&lt;br /&gt;
&lt;br /&gt;
On architectures which require literal pools (ARMv5) these are inserted as necessary.&lt;br /&gt;
&lt;br /&gt;
==Linker==&lt;br /&gt;
&lt;br /&gt;
The linker fills in all unresolved branches.  Branches within the block are linked to their target address.  Branches which jump outside of the block are linked to their target if that address has been compiled already. These inter-block branches are recorded in the jump_out array.  This information will be used to remove the links in the event that the target of the branch is invalidated.&lt;br /&gt;
&lt;br /&gt;
Unresolved branches point to a stub which loads the address of the branch instruction and the virtual address of its target into registers, and calls the dynamic linker.  When this code is executed, the dynamic linker will compile the target if necessary, and then patch the branch instruction with the new address.&lt;br /&gt;
&lt;br /&gt;
==Memory manager==&lt;br /&gt;
&lt;br /&gt;
The last step in new_recompile_block is to ensure that there will be sufficient memory available to compile the next block.  If there is not, then the oldest blocks are purged.&lt;br /&gt;
&lt;br /&gt;
The dynarec cache can be described as a 16MB circular buffer divided into eight segments.  Memory is allocated in order from beginning to end.&lt;br /&gt;
&lt;br /&gt;
When fewer than two empty segments remain, the next segment in sequence is cleared.  This continues as memory is needed, wrapping around from the end of the buffer to the beginning.&lt;br /&gt;
&lt;br /&gt;
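The purging policy can be sketched as follows (the segment arithmetic and names here are illustrative, not taken from the actual source):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
// 16MB buffer / 8 segments = 2MB per segment&lt;br /&gt;
if empty_segments() &amp;lt; 2:&lt;br /&gt;
    victim = (write_segment + 2) mod 8   // next segment ahead of the write pointer&lt;br /&gt;
    purge all blocks in segment victim   // unlink branches and mark invalid&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;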
==Invalidation and restoration==&lt;br /&gt;
&lt;br /&gt;
Normally, code blocks are looked up via a hash table, using the virtual address of the target instruction.  However, for purposes of invalidation, blocks are grouped by physical address.  This can be described as a virtually-indexed, physically-tagged (VIPT) cache.&lt;br /&gt;
&lt;br /&gt;
References to compiled blocks are stored in one of 4096 linked lists in the jump_in array.  Each list covers a 4K block of memory, and 2048 such lists are sufficient to cover the 8MB of RAM in the Nintendo 64.  The remaining lists are for code in ROM, and the bootloader in SP memory.&lt;br /&gt;
&lt;br /&gt;
When a write hits a memory page marked as cached, all entries in the corresponding list are invalidated.  If any code is found to cross a 4K boundary, the adjacent lists are invalidated also.&lt;br /&gt;
&lt;br /&gt;
Sometimes blocks may be invalidated even when none of the code is actually modified.  This can happen if data is written to memory in the same 4K page, or if code is reloaded without actually modifying it.  If blocks which were previously invalidated are subsequently found to be unmodified, those blocks are marked in the restore_candidate array.  If the block remains unmodified, it will be restored as a valid block, to avoid recompiling blocks which do not need to be recompiled.  This is performed by the clean_blocks function which is called periodically.&lt;br /&gt;
&lt;br /&gt;
The jump_in array, which lists unmodified blocks, is physically indexed, however the jump_dirty array, which lists potentially-modified blocks, is virtually indexed.  This allows blocks which have changed physical addresses, but are still at the same virtual address, to be recognized as being the same as before, and not in need of recompilation.&lt;br /&gt;
&lt;br /&gt;
==Dynamic linker==&lt;br /&gt;
&lt;br /&gt;
Branches with unresolved addresses jump to the dynamic linker.  This will look through the jump_in list corresponding to the physical page containing the virtual target address.  If found, the branch instruction will be patched with the address, and then it will jump to this address.&lt;br /&gt;
&lt;br /&gt;
If not found, the jump_dirty list will be searched for blocks which were previously compiled but may have been modified.  If a potential match is found, the code will be compared against a cached copy to determine if any changes have been made.  If not, then it will jump to the block.  Because the memory could be modified again, branch instructions referencing these blocks are not altered, and continue to point to the dynamic linker.  These blocks will continue to be verified each time they are called, until restored to the jump_in list by the clean_blocks function described above.&lt;br /&gt;
&lt;br /&gt;
If no compiled block is found, or the existing block was modified, the target is recompiled.&lt;br /&gt;
&lt;br /&gt;
==Address lookup==&lt;br /&gt;
&lt;br /&gt;
When a JR (jump register) instruction is encountered, the address of the recompiled code must be looked up using the address of the MIPS code.  The majority of such instructions jump to the link register (r31) to return to the location following a JAL (call) instruction.&lt;br /&gt;
&lt;br /&gt;
When a JAL or JALR is executed, the address of the following instruction is inserted into a small 32-entry hash table, which is checked when a JR r31 instruction is executed.  This allows for a quick return from subroutine calls.&lt;br /&gt;
&lt;br /&gt;
If the JR instruction uses a register other than r31, or the small hash table lookup fails to find a match, a larger 131072-entry hash table is checked.  This table contains 65536 bins with up to 2 addresses per bin.  If this also fails to find a match (which occurs less than 1% of the time) an exhaustive search of all compiled addresses within that 4K memory page is performed.&lt;br /&gt;
&lt;br /&gt;
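The overall lookup sequence for a JR target can be sketched as follows (illustrative pseudocode):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
if JR uses r31: check the 32-entry mini hash table first&lt;br /&gt;
h = hash(vaddr) mod 65536                  // select one of 65536 bins&lt;br /&gt;
for slot in 0..1:                          // up to 2 addresses per bin&lt;br /&gt;
    if bin[h].addr[slot] == vaddr: jump to bin[h].ptr[slot]&lt;br /&gt;
search all compiled blocks in the 4K page containing vaddr   // needed &amp;lt;1% of the time&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;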
If no match is found by any of these methods, the target address is compiled, and the new address is inserted into the hash table.&lt;br /&gt;
&lt;br /&gt;
==Cycle counting==&lt;br /&gt;
&lt;br /&gt;
Cycles are counted before each branch by adding the cycles from the preceding instructions to a specific register.  The cycle count is in R10 on ARM and ESI on x86.  The value in this register is normally a negative number.  When this number exceeds zero, a conditional branch is taken which jumps to an interrupt handler.&lt;br /&gt;
&lt;br /&gt;
For example, the following x86 code adds eight cycles:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add $8,%esi&lt;br /&gt;
jns interrupt_handler&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The conditional branch jumps to a small bit of code located after the main compiled block, which saves the cached registers to the register file, sets the instruction pointer which will be used upon return from the interrupt, and then calls cc_interrupt.&lt;br /&gt;
&lt;br /&gt;
As in the original mupen64plus, the emulated clock runs at 37.5 MHz, and each instruction takes 2 clock cycles.&lt;br /&gt;
&lt;br /&gt;
==Interrupt handler==&lt;br /&gt;
&lt;br /&gt;
When the cycle count register reaches its limit, cc_interrupt is called, which in turn calls gen_interupt [sic].  If interrupts are not enabled, cc_interrupt returns.  If interrupts are enabled, and an interrupt is to be taken, the pending_exception flag will be set.  In this case, cc_interrupt does not return, and instead pops the stack and causes an unconditional jump to the address in pcaddr (usually 0x80000180).&lt;br /&gt;
&lt;br /&gt;
There is one additional case where the interrupt handler may be called.  If interrupts were disabled, and are enabled by writing to coprocessor 0 register 12, any pending interrupts are handled immediately.&lt;br /&gt;
&lt;br /&gt;
==Delay slots==&lt;br /&gt;
&lt;br /&gt;
MIPS has 'delay slots', where the instruction after the branch is executed before the branch is taken.  Instructions in delay slots are issued out-of-order in the recompiled code.&lt;br /&gt;
&lt;br /&gt;
[[Image:Recompiler delay slot reordering.png]]&lt;br /&gt;
&lt;br /&gt;
When a branch jumps into the delay slot of another branch, this case must be handled slightly differently:&lt;br /&gt;
&lt;br /&gt;
[[Image:Recompiler delay slot reordering 2.png]]&lt;br /&gt;
&lt;br /&gt;
The branch test and delay slot are executed in-order if a dependency exists, or for 'likely' branches where the delay slot is nullified when the branch condition is false.  These cases are infrequent (typically less than 10% of branches).&lt;br /&gt;
&lt;br /&gt;
==Constant propagation==&lt;br /&gt;
&lt;br /&gt;
When an instruction loads a constant into a register, the register is tagged as a constant.  The constant tag will be retained if subsequent instructions modify the constant using other constants.  During assembly, such a sequence of instructions is combined into a single load.  For example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
LUI r8,12340000  --&amp;gt;  mov $0x12345678,%eax&lt;br /&gt;
ORI r8,r8,5678&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This optimization is not performed where a branch target intervenes, e.g.:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
...&lt;br /&gt;
 LUI r8,12340000  --&amp;gt;  mov $0x12340000,%eax&lt;br /&gt;
L1:&lt;br /&gt;
 ORI r8,r8,5678   --&amp;gt;  or $0x5678,%eax&lt;br /&gt;
 ...&lt;br /&gt;
 BEQ r0,r0,L1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Registers containing constants are identified by bits in the isconst and wasconst fields of the regstat structure.  The wasconst bit is set for a register if the register contained a known constant before the instruction, and the isconst bit is set for a register if the register will contain a known constant after the instruction executes.&lt;br /&gt;
&lt;br /&gt;
==Translation lookaside buffer emulation==&lt;br /&gt;
&lt;br /&gt;
Most Nintendo 64 games do not use virtual memory, but some do.  At startup, main memory is directly mapped at addresses from 0x80000000 to 0x803FFFFF, or up to 0x807FFFFF if the memory expansion is used.&lt;br /&gt;
&lt;br /&gt;
Normally, read or write operations are checked against this range, and if outside this range, control is passed to the appropriate I/O handler.  This is done as follows:&lt;br /&gt;
&lt;br /&gt;
MIPS instruction: LW r1,8(r2)&lt;br /&gt;
&lt;br /&gt;
ARM code:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add r1, r2, #8&lt;br /&gt;
cmp r1, #8388608&lt;br /&gt;
bvc handler&lt;br /&gt;
ldr r1, [r1]&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If there are valid entries in the TLB, this would instead be compiled as follows:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add r1, r2, #8&lt;br /&gt;
mov r0, #264&lt;br /&gt;
add r0, r0, r1, lsr #12&lt;br /&gt;
ldr r0, [r11, r0, lsl #2]&lt;br /&gt;
tst r0, r0&lt;br /&gt;
bmi handler&lt;br /&gt;
ldr r1, [r1, r0, lsl #2]&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This looks up the offset in the memory_map table (which, in this example, is located at r11+264*4).  The high bit is tested to determine whether a valid mapping exists for this page.&lt;br /&gt;
&lt;br /&gt;
==Page fault emulation==&lt;br /&gt;
&lt;br /&gt;
If a memory address references an invalid page, a conditional branch is taken to a bit of code placed after the main block.  This will save any modified cached registers, and call an appropriate handler function for the address.  The handler will either perform I/O, or generate an exception (pagefault).&lt;br /&gt;
&lt;br /&gt;
==Mapping executable pages==&lt;br /&gt;
&lt;br /&gt;
If the dynamic recompiler encounters code which is not in contiguous physical memory, it will end the block at the page boundary, so that the block can be removed cleanly if the mapping is changed.&lt;br /&gt;
&lt;br /&gt;
There is a special case of this, where a branch and delay slot span two pages.  If a branch instruction is the last instruction in a virtual memory page, it is compiled in a different manner than other branches.  The branch condition is evaluated, and the target address is placed in a register (%ebp on x86, and r8 on ARM).  A special form of the dynamic linker (dyna_linker_ds) is used to link the branch to its corresponding delay slot in another block.  If no page fault occurs, the delay slot executes and then jumps to the address in the register.  For conditional branches that are not taken, the target address is the next instruction.  This code is generated by the pagespan_assemble function.&lt;br /&gt;
&lt;br /&gt;
== Self-modifying code detection ==&lt;br /&gt;
&lt;br /&gt;
The invalid_code array marks pages which contain no compiled code, or whose code may have been modified since compilation.  Writes are checked against this array, and if a write hits a valid (compiled and unmodified) page, invalidate_block is called:&lt;br /&gt;
&lt;br /&gt;
MIPS instruction: SW r1,8(r2)&lt;br /&gt;
&lt;br /&gt;
ARM code:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
ldr r3, [r11, #88]  // pointer to invalid_code&lt;br /&gt;
add r4, r2, #8&lt;br /&gt;
cmp r4, #8388608&lt;br /&gt;
bvc handler&lt;br /&gt;
str r1, [r4]&lt;br /&gt;
ldrb r14, [r3, r4, lsr #12]&lt;br /&gt;
cmp r14, #1&lt;br /&gt;
bne invstub&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In TLB mode, the invalid_code array is not checked directly.  Instead, pages are marked non-writable in memory_map.&lt;br /&gt;
&lt;br /&gt;
== Compile options ==&lt;br /&gt;
&lt;br /&gt;
=== ARMv5_ONLY ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the UXTH instruction is not used, and the dynamic recompiler will generate literal pools instead of using movw/movt.  This provides compatibility with older processors, but generates somewhat less efficient code.&lt;br /&gt;
&lt;br /&gt;
=== RAM_OFFSET ===&lt;br /&gt;
&lt;br /&gt;
When compiling for ARM, this allocates an additional register which is used to add an offset to all pointers to memory addresses between 0x80000000 and 0x807FFFFF.  This allows the N64's memory to be mapped at an alternate address.  This incurs a small performance penalty, but is required for certain operating systems (e.g. Google Android, which places shared libraries at 0x80000000).&lt;br /&gt;
&lt;br /&gt;
This option is not used for x86.  The x86 instruction set allows for a full 32-bit offset in the instruction encoding, making it unnecessary to allocate an additional register for this purpose.&lt;br /&gt;
&lt;br /&gt;
=== CORTEX_A8_BRANCH_PREDICTION_HACK ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the dynamic recompiler will avoid generating consecutive branch instructions without another instruction in between.  This avoids a possible branch misprediction on the Cortex-A8 due to this processor having dual instruction decoders, but only one branch-prediction unit.  See [[Assembly Code Optimization]] for details.&lt;br /&gt;
&lt;br /&gt;
=== USE_MINI_HT ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, attempt to look up return addresses in a small hash table before checking the larger hash table.  Usually improves performance.&lt;br /&gt;
&lt;br /&gt;
=== IMM_PREFETCH ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the x86 PREFETCH instruction is used to prefetch entries from the hash table.  The increase in code size often outweighs the benefit of this.&lt;br /&gt;
&lt;br /&gt;
=== REG_PREFETCH ===&lt;br /&gt;
&lt;br /&gt;
Similar to the above, but loads the address into a register first, then uses the ARM PLD instruction.  The increase in code size almost always outweighs the benefit of this.&lt;br /&gt;
&lt;br /&gt;
=== R29_HACK ===&lt;br /&gt;
&lt;br /&gt;
Assume that the stack pointer (r29) is always a valid memory address and do not check it.  It is similar to the optimization described [http://strmnnrmn.blogspot.com/2007/08/interesting-dynarec-hack.html here].  This can crash the emulator and is not enabled by default.&lt;br /&gt;
&lt;br /&gt;
== Debugging ==&lt;br /&gt;
&lt;br /&gt;
Debugging information can be obtained by defining the assem_debug macro as printf.  This will cause the dynamic recompiler to print debugging information to stdout.  For each disassembled MIPS instruction, an entry similar to the following will be printed:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
U: r1 r8 r11 r16 r31 UU: r29 32: r0 r9&lt;br /&gt;
pre: eax=-1 ecx=9 edx=-1 ebx=-1 ebp=29 esi=36 edi=-1&lt;br /&gt;
needs: ecx ebp esi r: r9&lt;br /&gt;
entry: eax=-1 ecx=9 edx=-1 ebx=-1 ebp=29 esi=36 edi=-1&lt;br /&gt;
dirty: ecx ebp esi &lt;br /&gt;
  800001d8: LW r16,r29+14&lt;br /&gt;
eax=16 ecx=9 edx=-1 ebx=-1 ebp=29 esi=36 edi=-1 dirty: eax ecx ebp esi &lt;br /&gt;
 32: r0 r9 r16&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
U: A list of MIPS registers which will not be used before they are overwritten (liveness analysis)&lt;br /&gt;
&lt;br /&gt;
UU: A list of MIPS registers for which the upper 32 bits will not be used before they are overwritten&lt;br /&gt;
&lt;br /&gt;
32: Registers that contain 32-bit sign-extended values&lt;br /&gt;
&lt;br /&gt;
pre: The state of the register mapping prior to execution of this instruction.  (-1 = no mapping; 36 = cycle count; The complete list of values with special meanings can be found in the source code)&lt;br /&gt;
&lt;br /&gt;
needs: a list of register mappings that were considered necessary and which could not be eliminated to make room for other mappings&lt;br /&gt;
&lt;br /&gt;
r: Registers that are known to contain 32-bit values and where optimizations rely on the assumption that the register does not contain a value outside of the range -2&amp;lt;sup&amp;gt;31&amp;lt;/sup&amp;gt; to 2&amp;lt;sup&amp;gt;31&amp;lt;/sup&amp;gt;-1&lt;br /&gt;
&lt;br /&gt;
entry: The minimum set of register mappings required to jump to this point&lt;br /&gt;
&lt;br /&gt;
dirty: Cached registers that have been modified and will need to be written back&lt;br /&gt;
&lt;br /&gt;
address: instruction - The decoded opcode, followed by the register mapping in effect after this instruction executes&lt;br /&gt;
&lt;br /&gt;
An asterisk (*) designates locations which are the target of a branch instruction.  Constant propagation will not be performed across these points.&lt;br /&gt;
&lt;br /&gt;
After the complete disassembly, the recompiled native code is shown.&lt;br /&gt;
&lt;br /&gt;
Note that the output can be quite voluminous; 20-30 MB is typical.&lt;br /&gt;
&lt;br /&gt;
==Potential improvements==&lt;br /&gt;
&lt;br /&gt;
===Copy propagation/offset propagation===&lt;br /&gt;
&lt;br /&gt;
A common instruction sequence is of the form:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
LUI r9,12340000&lt;br /&gt;
ADD r9,r9,r8&lt;br /&gt;
LW r9,5678(r9)&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
It would be helpful to recognize this as a load from r8+12345678.  The current constant propagation code does not do so.&lt;br /&gt;
&lt;br /&gt;
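With offset propagation, the sequence above could instead collapse into a single load; for example, on x86 (assuming, hypothetically, that r8 is cached in %esi and r9 is assigned to %eax):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
LUI r9,12340000  --&amp;gt;  mov 0x12345678(%esi),%eax&lt;br /&gt;
ADD r9,r9,r8&lt;br /&gt;
LW  r9,5678(r9)&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;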
===Constant propagation and register assignment===&lt;br /&gt;
&lt;br /&gt;
Constant propagation is currently done after register assignment.  Registers are assigned even if the register will always contain a known value.  In certain cases, such as where the constant is used only to generate a memory address, this could be avoided and no register would need to be allocated.&lt;br /&gt;
&lt;br /&gt;
===Unaligned memory access===&lt;br /&gt;
&lt;br /&gt;
A small improvement could be made by combining adjacent LWL/LWR instructions.  The potential gain from doing so is very limited because these instructions typically represent less than 1% of all memory accesses.&lt;br /&gt;
&lt;br /&gt;
===SLT/branch merging===&lt;br /&gt;
&lt;br /&gt;
A frequent occurrence in MIPS code is an SLT or SLTI instruction followed by a branch.  This is generated relatively inefficiently on x86 and ARM, first doing a compare and set, then testing this value and doing a conditional branch.  Doing only one comparison would save at least one instruction, and could potentially save up to three instructions if the liveness analysis reveals that the result of the SLT instruction is used nowhere else.&lt;br /&gt;
&lt;br /&gt;
While a potentially useful optimization, there are several problems with this approach.  First, there are often additional instructions between the SLT and the branch.  These must be checked to make sure they do not modify the relevant registers, as that would prevent reordering the instruction stream as desired.  Second, if the result of the SLT is found to be live, but unmodified, on both paths of the branch, clean_registers will normally write this value before the branch, to avoid duplicating the writeback code on both paths.  This optimization would have to be disabled if the SLT were combined with the branch.&lt;br /&gt;
&lt;br /&gt;
===x86-64===&lt;br /&gt;
&lt;br /&gt;
Currently the x86-64 backend generates only 32-bit instructions.  Proper 64-bit code generation would improve performance.&lt;br /&gt;
&lt;br /&gt;
===PowerPC===&lt;br /&gt;
&lt;br /&gt;
It would be possible to add a PowerPC code generator to the dynamic recompiler.  Currently no one is working on this.  (The mupen64gc project is using a different codebase.)&lt;br /&gt;
&lt;br /&gt;
The following is a summary of the changes which would be necessary to add a PowerPC backend.&lt;br /&gt;
&lt;br /&gt;
The slt* instructions use conditional moves, which are unavailable on PowerPC.  A suitable alternative (such as moves from the condition register) would need to be used.&lt;br /&gt;
&lt;br /&gt;
The assembler can generate as much as 256K of code in a single block; however, conditional branches on PowerPC are limited to +/-64K.  It will be necessary either to restrict the block size, or to insert jump tables in a manner similar to the literal pools on ARM.&lt;br /&gt;
&lt;br /&gt;
PowerPC generally relies on early branch resolution rather than statistical branch prediction.  Scheduling branch condition tests earlier may be advantageous.  (For example, the address bounds check could be done in address_generation, rather than just before the load or store.  Similarly it may be advantageous to test the branch condition and update the cycle count before executing the delay slot.)&lt;br /&gt;
&lt;br /&gt;
===MIPS===&lt;br /&gt;
&lt;br /&gt;
Recompiling MIPS into MIPS would be relatively straightforward; however, the current code generator has no facility for filling delay slots.  This capability would be required for efficient code generation.&lt;br /&gt;
&lt;br /&gt;
[[Category:Development]]&lt;/div&gt;</summary>
		<author><name>Ari64</name></author>
		
	</entry>
	<entry>
		<id>https://pandorawiki.org/index.php?title=Mupen64plus_dynamic_recompiler&amp;diff=5296</id>
		<title>Mupen64plus dynamic recompiler</title>
		<link rel="alternate" type="text/html" href="https://pandorawiki.org/index.php?title=Mupen64plus_dynamic_recompiler&amp;diff=5296"/>
		<updated>2011-02-02T21:50:39Z</updated>

		<summary type="html">&lt;p&gt;Ari64: /* Compile options */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes the dynamic recompiler in [[Mupen64Plus]], and the changes made for the ARM port.&lt;br /&gt;
&lt;br /&gt;
==The original dynamic recompiler by Hacktarux==&lt;br /&gt;
&lt;br /&gt;
The dynamic recompiler used in Mupen64plus v1.5 is based on the original written by Hacktarux in 2002.&lt;br /&gt;
&lt;br /&gt;
It recompiles contiguous blocks of MIPS instructions.  First, each instruction is decoded into a dynamically-allocated 132-byte data structure.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
typedef struct _precomp_instr&lt;br /&gt;
{&lt;br /&gt;
   void (*ops)();&lt;br /&gt;
   union&lt;br /&gt;
     {&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         long long int *rs;&lt;br /&gt;
         long long int *rt;&lt;br /&gt;
         short immediate;&lt;br /&gt;
      } i;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned int inst_index;&lt;br /&gt;
      } j;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         long long int *rs;&lt;br /&gt;
         long long int *rt;&lt;br /&gt;
         long long int *rd;&lt;br /&gt;
         unsigned char sa;&lt;br /&gt;
         unsigned char nrd;&lt;br /&gt;
      } r;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned char base;&lt;br /&gt;
         unsigned char ft;&lt;br /&gt;
         short offset;&lt;br /&gt;
      } lf;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned char ft;&lt;br /&gt;
         unsigned char fs;&lt;br /&gt;
         unsigned char fd;&lt;br /&gt;
      } cf;&lt;br /&gt;
     } f;&lt;br /&gt;
   unsigned int addr; /* word-aligned instruction address in r4300 address space */&lt;br /&gt;
   unsigned int local_addr; /* byte offset to start of corresponding x86_64 instructions, from start of code block */&lt;br /&gt;
   reg_cache_struct reg_cache_infos;&lt;br /&gt;
} precomp_instr;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The decoded instructions are then compiled, generating x86 instructions for each MIPS instruction.  A 32K block is allocated with malloc() to hold the x86 code.  If this size proves insufficient, it is incrementally resized with realloc().&lt;br /&gt;
&lt;br /&gt;
MIPS registers are allocated to x86 registers with a least-recently-used replacement policy.  All cached registers are written back prior to a branch, or a memory read or write.&lt;br /&gt;
&lt;br /&gt;
To facilitate invalidation and replacement of modified code blocks, each 4K page is compiled separately.  If a sequence of instructions crosses a 4K page boundary, the current block is ended, and the next instructions will be compiled as a separate block.&lt;br /&gt;
&lt;br /&gt;
Branch instructions within a 4K page are compiled as branches directly to the target address.  Branches which cross a 4K page go through an indirect address lookup.&lt;br /&gt;
&lt;br /&gt;
Compiled code blocks are invalidated on write.  On an actual MIPS CPU, the instruction cache is invalidated using the CACHE instruction, however a few N64 games clear the cache using other methods.  Trapping writes appears to be the most reliable method of ensuring cache coherency in the emulated system.  The cache instruction is ignored.&lt;br /&gt;
&lt;br /&gt;
==Problems with the original design==&lt;br /&gt;
&lt;br /&gt;
The most significant performance problem with this design is its excessive memory usage.  The decoded instruction data is retained (132 bytes for each MIPS instruction) and occasionally referenced during execution.  Memory accesses frequently miss the L2 cache, resulting in poor performance.&lt;br /&gt;
&lt;br /&gt;
Additionally, the register cache is relatively inefficient, since all registers are flushed before any read or write operation.&lt;br /&gt;
&lt;br /&gt;
==A new approach==&lt;br /&gt;
&lt;br /&gt;
To reduce memory usage, the new design allocates a single large block of memory (currently 16 MiB) which is used for recompiled code.  This memory is allocated using mmap with the PROT_EXEC bit set, to ensure operation on CPUs with no-execute (NX) page permissions.&lt;br /&gt;
&lt;br /&gt;
Contiguous blocks of MIPS instructions are compiled; the recompiler does not attempt to follow branches or trace 'hot paths'.&lt;br /&gt;
&lt;br /&gt;
The recompiler consists of eight stages, plus a linker, and a memory manager.&lt;br /&gt;
&lt;br /&gt;
Compiled blocks are invalidated on write, as before; however, because a compiled block may cross a 4K page boundary, a write may invalidate adjacent pages as well.&lt;br /&gt;
&lt;br /&gt;
Currently the dynarec generates x86, x86-64, ARMv5, and ARMv7 (little-endian).  Most of the code is shared between the architectures, but a different code generator is included at compile time, using different #include statements depending on the CPU type.&lt;br /&gt;
&lt;br /&gt;
==Pass 1: Disassembly==&lt;br /&gt;
&lt;br /&gt;
When an instruction address is encountered which has not been compiled, the function new_recompile_block is called, with the (virtual) address of the target instruction as its sole parameter.  If the address is invalid, the function returns a nonzero value and the caller is responsible for handling the pagefault.&lt;br /&gt;
&lt;br /&gt;
Instructions are decoded until an unconditional jump, usually a return, is encountered.  Disassembly ordinarily continues past a JAL (subroutine call) instruction; however, this strategy is abandoned if invalid instructions are encountered.&lt;br /&gt;
&lt;br /&gt;
Up to 4096 instructions (16K of MIPS code) may be disassembled at once.  Surprisingly, some games do actually reach this limit.&lt;br /&gt;
&lt;br /&gt;
==Pass 2: Liveness analysis==&lt;br /&gt;
&lt;br /&gt;
After disassembly, liveness analysis is performed on the registers.  This determines when a particular register will no longer be used, and thus can be removed from the register cache.&lt;br /&gt;
&lt;br /&gt;
A separate analysis is done on the upper 32 bits of 64-bit registers.  This can determine when only the lower 32 bits of a register are significant, thus allowing use of a 32-bit register.  This enables more efficient code generation on 32-bit processors.&lt;br /&gt;
&lt;br /&gt;
==Pass 3: Register allocation==&lt;br /&gt;
&lt;br /&gt;
The 31 MIPS registers must be mapped onto seven available registers on x86, or twelve available registers on ARM.&lt;br /&gt;
&lt;br /&gt;
Instructions with 64-bit results require two registers on 32-bit host processors.  32-bit instructions require only one.  A flag is set for registers containing 32-bit values, and these registers will be sign-extended before they are written to the register file.&lt;br /&gt;
&lt;br /&gt;
Registers are made available using a least-soon-needed replacement policy.  When the register cache is full, and no registers can be eliminated using the liveness analysis, a ten-instruction lookahead is used to determine which registers will not be needed soon and these registers are evicted from the cache.&lt;br /&gt;
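The eviction policy above can be sketched as follows.  This is a hedged illustration only; pick_victim, cached_regs and uses_by_instruction are hypothetical names, and the real allocator operates on its own internal tables:&lt;br /&gt;

```python
# Sketch of least-soon-needed eviction with a ten-instruction lookahead.
# All names here are illustrative; the real allocator works on its own
# internal data structures, not Python lists.
LOOKAHEAD = 10

def pick_victim(cached_regs, uses_by_instruction):
    """Return the cached register whose next use is farthest away.

    uses_by_instruction[i] lists the MIPS registers read or written
    by the i-th upcoming instruction."""
    def next_use(reg):
        window = uses_by_instruction[:LOOKAHEAD]
        for dist, regs in enumerate(window):
            if reg in regs:
                return dist
        return LOOKAHEAD  # not needed within the window
    # evict the register that will be needed least soon
    return max(cached_regs, key=next_use)
```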
&lt;br /&gt;
==Pass 4: Free unused registers==&lt;br /&gt;
&lt;br /&gt;
After the initial register allocation, the allocations are reviewed to determine if any registers remain allocated longer than necessary.  These are then removed from the register cache.  This avoids having to write back a large number of registers just before a branch, and makes more registers available for the next pass.&lt;br /&gt;
&lt;br /&gt;
==Pass 5: Pre-allocate registers==&lt;br /&gt;
&lt;br /&gt;
If a register will be used soon and needs to be loaded, try to load it early if the register is available.  This improves execution on CPUs with a load-use penalty.&lt;br /&gt;
&lt;br /&gt;
==Pass 6: Optimize clean/dirty state==&lt;br /&gt;
&lt;br /&gt;
If a cached register is 'dirty' and needs to be written out, try to do so as soon as the register will no longer be modified.  This avoids having to write the same register on multiple code paths due to conditional branches.  Additionally, try to avoid writing out dirty registers inside of loops.&lt;br /&gt;
&lt;br /&gt;
==Pass 7: Identify where registers are assumed to be 32-bit==&lt;br /&gt;
&lt;br /&gt;
When a 64-bit register is mapped to a 32-bit register with the assumption that the value will be sign-extended before being used, it is necessary to ensure that no register contains a 64-bit value when branching to such a location.  These instructions are flagged to identify them as requiring 32-bit inputs.  This information is used by the linker, and the exception return (ERET) handler.&lt;br /&gt;
&lt;br /&gt;
==Pass 8: Assembly==&lt;br /&gt;
&lt;br /&gt;
This generates and outputs the recompiled code.&lt;br /&gt;
&lt;br /&gt;
Following the main code block, handlers for certain exceptions as well as alternate entry points are added.&lt;br /&gt;
&lt;br /&gt;
If a recompiled instruction relies on a certain MIPS register being cached in a certain native register, then a short 'stub' of code is generated to load the necessary registers.  When an instruction outside of this block needs to jump to that location, it will instead jump to the stub.  The necessary registers will be loaded, and then it will jump into the main code sequence.&lt;br /&gt;
&lt;br /&gt;
On architectures which require literal pools (ARMv5) these are inserted as necessary.&lt;br /&gt;
&lt;br /&gt;
==Linker==&lt;br /&gt;
&lt;br /&gt;
The linker fills in all unresolved branches.  Branches within the block are linked to their target address.  Branches which jump outside of the block are linked to their target if that address has been compiled already. These inter-block branches are recorded in the jump_out array.  This information will be used to remove the links in the event that the target of the branch is invalidated.&lt;br /&gt;
&lt;br /&gt;
Unresolved branches point to a stub which loads the address of the branch instruction and the virtual address of its target into registers, and calls the dynamic linker.  When this code is executed, the dynamic linker will compile the target if necessary, and then patch the branch instruction with the new address.&lt;br /&gt;
&lt;br /&gt;
==Memory manager==&lt;br /&gt;
&lt;br /&gt;
The last step in new_recompile_block is to ensure that there will be sufficient memory available to compile the next block.  If there is not, then the oldest blocks are purged.&lt;br /&gt;
&lt;br /&gt;
The dynarec cache can be described as a 16MB circular buffer divided into eight segments.  Memory is allocated in order from beginning to end.&lt;br /&gt;
&lt;br /&gt;
When fewer than two segments are empty, the next segment in sequence is cleared, wrapping around from the end of the buffer to the beginning as more memory is needed.&lt;br /&gt;
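A minimal sketch of this reclamation policy, with illustrative names (the actual memory manager tracks compiled blocks, not Python lists):&lt;br /&gt;

```python
# Hedged sketch of the segment-reclamation policy described above.
# The cache is divided into eight segments; clearing proceeds in order
# and wraps from the last segment back to the first.
NUM_SEGMENTS = 8

def segments_to_clear(next_seg, empty_segs):
    """Return which segments to purge so that at least two are empty."""
    need = max(0, 2 - empty_segs)
    cleared = []
    seg = next_seg
    for _ in range(need):
        cleared.append(seg)
        seg = (seg + 1) % NUM_SEGMENTS  # wrap around end to beginning
    return cleared
```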
&lt;br /&gt;
==Invalidation and restoration==&lt;br /&gt;
&lt;br /&gt;
Normally, code blocks are looked up via a hash table, using the virtual address of the target instruction.  However, for purposes of invalidation, blocks are grouped by physical address.  This can be described as a virtually-indexed, physically-tagged (VIPT) cache.&lt;br /&gt;
&lt;br /&gt;
References to compiled blocks are stored in one of 4096 linked lists in the jump_in array.  Each list covers a 4K block of memory, and 2048 such lists are sufficient to cover the 8MB of RAM in the Nintendo 64.  The remaining lists are for code in ROM, and the bootloader in SP memory.&lt;br /&gt;
&lt;br /&gt;
When a write hits a memory page marked as cached, all entries in the corresponding list are invalidated.  If any code is found to cross a 4K boundary, the adjacent lists are invalidated also.&lt;br /&gt;
&lt;br /&gt;
Sometimes blocks may be invalidated even when none of the code is actually modified.  This can happen if data is written to memory in the same 4K page, or if code is reloaded without actually modifying it.  If blocks which were previously invalidated are subsequently found to be unmodified, those blocks are marked in the restore_candidate array.  If the block remains unmodified, it will be restored as a valid block, to avoid recompiling blocks which do not need to be recompiled.  This is performed by the clean_blocks function which is called periodically.&lt;br /&gt;
&lt;br /&gt;
The jump_in array, which lists unmodified blocks, is physically indexed, however the jump_dirty array, which lists potentially-modified blocks, is virtually indexed.  This allows blocks which have changed physical addresses, but are still at the same virtual address, to be recognized as being the same as before, and not in need of recompilation.&lt;br /&gt;
&lt;br /&gt;
==Dynamic linker==&lt;br /&gt;
&lt;br /&gt;
Branches with unresolved addresses jump to the dynamic linker.  This will look through the jump_in list corresponding to the physical page containing the virtual target address.  If found, the branch instruction will be patched with the address, and then it will jump to this address.&lt;br /&gt;
&lt;br /&gt;
If not found, the jump_dirty list will be searched for blocks which were previously compiled but may have been modified.  If a potential match is found, the code will be compared against a cached copy to determine if any changes have been made.  If not, then it will jump to the block.  Because the memory could be modified again, branch instructions referencing these blocks are not altered, and continue to point to the dynamic linker.  These blocks will continue to be verified each time they are called, until restored to the jump_in list by the clean_blocks function described above.&lt;br /&gt;
&lt;br /&gt;
If no compiled block is found, or the existing block was modified, the target is recompiled.&lt;br /&gt;
&lt;br /&gt;
==Address lookup==&lt;br /&gt;
&lt;br /&gt;
When a JR (jump register) instruction is encountered, the address of the recompiled code must be looked up using the address of the MIPS code.  The majority of such instructions jump to the link register (r31) to return to the location following a JAL (call) instruction.&lt;br /&gt;
&lt;br /&gt;
When a JAL or JALR is executed, the address of the following instruction is inserted into a small 32-entry hash table, which is checked when a JR r31 instruction is executed.  This allows for a quick return from subroutine calls.&lt;br /&gt;
&lt;br /&gt;
If the JR instruction uses a register other than r31, or the small hash table lookup fails to find a match, a larger 131072-entry hash table is checked.  This table contains 65536 bins with up to 2 addresses per bin.  If this also fails to find a match (which occurs less than 1% of the time) an exhaustive search of all compiled addresses within that 4K memory page is performed.&lt;br /&gt;
&lt;br /&gt;
If no match is found by any of these methods, the target address is compiled, and the new address is inserted into the hash table.&lt;br /&gt;
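The lookup cascade can be sketched as follows.  The sizes match the text, but the names and hash functions are assumptions for illustration, not the dynarec's actual ones:&lt;br /&gt;

```python
# Illustrative two-level address lookup: a 32-entry return-address
# table, then a larger table of 65536 bins holding up to two
# (mips_addr, native_addr) pairs each.
MINI_SIZE = 32
BINS = 65536

mini_ht = [None] * MINI_SIZE
big_ht = [[] for _ in range(BINS)]

def lookup(vaddr):
    # 1. small return-address table (fast path for JR r31)
    slot = mini_ht[(vaddr >> 2) % MINI_SIZE]
    if slot is not None and slot[0] == vaddr:
        return slot[1]
    # 2. larger table: 65536 bins of up to 2 entries each
    for addr, native in big_ht[(vaddr >> 2) % BINS]:
        if addr == vaddr:
            return native
    return None  # caller falls back to an exhaustive page search

def insert(vaddr, native):
    mini_ht[(vaddr >> 2) % MINI_SIZE] = (vaddr, native)
    bin_ = big_ht[(vaddr >> 2) % BINS]
    bin_.insert(0, (vaddr, native))
    del bin_[2:]  # keep at most two addresses per bin
```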
&lt;br /&gt;
==Cycle counting==&lt;br /&gt;
&lt;br /&gt;
Cycles are counted before each branch by adding the cycles from the preceding instructions to a specific register.  The cycle count is kept in R10 on ARM and ESI on x86.  The value in this register is normally negative; when it reaches zero or above, a conditional branch is taken which jumps to an interrupt handler.&lt;br /&gt;
&lt;br /&gt;
For example, the following x86 code adds eight cycles:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add $8,%esi&lt;br /&gt;
jns interrupt_handler&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The conditional branch jumps to a small bit of code located after the main compiled block, which saves the cached registers to the register file, sets the instruction pointer which will be used upon return from the interrupt, and then calls cc_interrupt.&lt;br /&gt;
&lt;br /&gt;
As in the original mupen64plus, the emulated clock runs at 37.5 MHz, and each instruction takes 2 clock cycles.&lt;br /&gt;
&lt;br /&gt;
==Interrupt handler==&lt;br /&gt;
&lt;br /&gt;
When the cycle count register reaches its limit, cc_interrupt is called, which in turn calls gen_interupt [sic].  If interrupts are not enabled, cc_interrupt returns.  If interrupts are enabled, and an interrupt is to be taken, the pending_exception flag will be set.  In this case, cc_interrupt does not return, and instead pops the stack and causes an unconditional jump to the address in pcaddr (usually 0x80000180).&lt;br /&gt;
&lt;br /&gt;
There is one additional case where the interrupt handler may be called.  If interrupts were disabled, and are enabled by writing to coprocessor 0 register 12, any pending interrupts are handled immediately.&lt;br /&gt;
&lt;br /&gt;
==Delay slots==&lt;br /&gt;
&lt;br /&gt;
MIPS has 'delay slots', where the instruction after the branch is executed before the branch is taken.  Instructions in delay slots are issued out-of-order in the recompiled code.&lt;br /&gt;
&lt;br /&gt;
[[Image:Recompiler delay slot reordering.png]]&lt;br /&gt;
&lt;br /&gt;
When a branch jumps into the delay slot of another branch, this case must be handled slightly differently:&lt;br /&gt;
&lt;br /&gt;
[[Image:Recompiler delay slot reordering 2.png]]&lt;br /&gt;
&lt;br /&gt;
The branch test and delay slot are executed in-order if a dependency exists, or for 'likely' branches where the delay slot is nullified when the branch condition is false.  These cases are infrequent (typically less than 10% of branches).&lt;br /&gt;
&lt;br /&gt;
==Constant propagation==&lt;br /&gt;
&lt;br /&gt;
When an instruction loads a constant into a register, the register is tagged as a constant.  The constant tag will be retained if subsequent instructions modify the constant using other constants.  During assembly, such a sequence of instructions is combined into a single load.  For example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
LUI r8,12340000  --&amp;gt;  mov $0x12345678,%eax&lt;br /&gt;
ORI r8,r8,5678&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
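A minimal sketch of the folding itself, taking the raw 16-bit immediate fields (the listing above writes the LUI immediate in its shifted form):&lt;br /&gt;

```python
# Fold a LUI/ORI pair into the single constant the recompiler emits.
def fold_lui_ori(lui_imm16, ori_imm16):
    upper = (lui_imm16 % 0x10000) * 0x10000  # LUI: imm16 into the high half
    return upper | (ori_imm16 % 0x10000)     # ORI: or imm16 into the low half
```
&lt;br /&gt;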
This optimization is not performed where a branch target intervenes, e.g.:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
...&lt;br /&gt;
 LUI r8,12340000  --&amp;gt;  mov $0x12340000,%eax&lt;br /&gt;
L1:&lt;br /&gt;
 ORI r8,r8,5678   --&amp;gt;  or $0x5678,%eax&lt;br /&gt;
 ...&lt;br /&gt;
 BEQ r0,r0,L1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Registers containing constants are identified by bits in the isconst and wasconst fields of the regstat structure.  The wasconst bit is set for a register if the register contained a known constant before the instruction, and the isconst bit is set for a register if the register will contain a known constant after the instruction executes.&lt;br /&gt;
&lt;br /&gt;
==Translation lookaside buffer emulation==&lt;br /&gt;
&lt;br /&gt;
Most Nintendo 64 games do not use virtual memory, but some do.  At startup, main memory is directly mapped at addresses from 0x80000000 to 0x803FFFFF, or up to 0x807FFFFF if the memory expansion is used.&lt;br /&gt;
&lt;br /&gt;
Normally, read or write operations are checked against this range, and if outside this range, control is passed to the appropriate I/O handler.  This is done as follows:&lt;br /&gt;
&lt;br /&gt;
MIPS instruction: LW r1,8(r2)&lt;br /&gt;
&lt;br /&gt;
ARM code:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add r1, r2, #8&lt;br /&gt;
cmp r1, #8388608&lt;br /&gt;
bvc handler&lt;br /&gt;
ldr r1, [r1]&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If there are valid entries in the TLB, this would instead be compiled as follows:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add r1, r2, #8&lt;br /&gt;
mov r0, #264&lt;br /&gt;
add r0, r0, r1, lsr #12&lt;br /&gt;
ldr r0, [r11, r0, lsl #2]&lt;br /&gt;
tst r0, r0&lt;br /&gt;
bmi handler&lt;br /&gt;
ldr r1, [r1, r0, lsl #2]&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This looks up the offset in the memory_map table (which, in this example, is located at r11+264*4).  The high bit is tested to determine whether a valid mapping exists for this page.&lt;br /&gt;
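A hedged Python rendering of the lookup performed by the ARM sequence above (the names and the exact invalid-entry encoding are assumptions based on the description):&lt;br /&gt;

```python
# memory_map holds one word offset per 4K page; an entry with the high
# bit set marks a page with no valid mapping.
INVALID = 0x80000000

def translate(vaddr, memory_map):
    entry = memory_map[vaddr >> 12]  # one entry per 4K page
    if entry >= INVALID:             # high bit set: no valid mapping
        return None                  # control passes to the handler
    return vaddr + entry * 4         # apply the word offset
```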
&lt;br /&gt;
==Page fault emulation==&lt;br /&gt;
&lt;br /&gt;
If a memory address references an invalid page, a conditional branch is taken to a bit of code placed after the main block.  This will save any modified cached registers, and call an appropriate handler function for the address.  The handler will either perform I/O, or generate an exception (pagefault).&lt;br /&gt;
&lt;br /&gt;
==Mapping executable pages==&lt;br /&gt;
&lt;br /&gt;
If the dynamic recompiler encounters code which is not in contiguous physical memory, it will end the block at the page boundary, so that the block can be removed cleanly if the mapping is changed.&lt;br /&gt;
&lt;br /&gt;
There is a special case of this, where a branch and delay slot span two pages.  If a branch instruction is the last instruction in a virtual memory page, it is compiled in a different manner than other branches.  The branch condition is evaluated, and the target address is placed in a register (%ebp on x86, and r8 on ARM).  A special form of the dynamic linker (dyna_linker_ds) is used to link the branch to its corresponding delay slot in another block.  If no page fault occurs, the delay slot executes and then jumps to the address in the register.  For conditional branches that are not taken, the target address is the next instruction.  This code is generated by the pagespan_assemble function.&lt;br /&gt;
&lt;br /&gt;
== Self-modifying code detection ==&lt;br /&gt;
&lt;br /&gt;
Pages that contain no compiled code, or whose code may have been modified since compilation, are marked in the invalid_code array.  Writes are checked against this array, and if a write hits a valid (compiled and unmodified) page, invalidate_block is called:&lt;br /&gt;
&lt;br /&gt;
MIPS instruction: SW r1,8(r2)&lt;br /&gt;
&lt;br /&gt;
ARM code:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
ldr r3, [r11, #88]  // pointer to invalid_code&lt;br /&gt;
add r4, r2, #8&lt;br /&gt;
cmp r4, #8388608&lt;br /&gt;
bvc handler&lt;br /&gt;
str r1, [r4]&lt;br /&gt;
ldrb r14, [r3, r4, lsr #12]&lt;br /&gt;
cmp r14, #1&lt;br /&gt;
bne invstub&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In TLB mode, the invalid_code array is not checked directly.  Instead, pages are marked non-writable in memory_map.&lt;br /&gt;
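The write-side check can be sketched as follows (illustrative names; the real check is inlined into the generated store sequence shown above):&lt;br /&gt;

```python
# invalid_code has one flag per 4K page: 1 means no compiled, unmodified
# code exists there, so a store may proceed without invalidation.
def store_needs_invalidate(addr, invalid_code):
    return invalid_code[addr >> 12] != 1
```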
&lt;br /&gt;
== Compile options ==&lt;br /&gt;
&lt;br /&gt;
=== ARMv5_ONLY ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the UXTH instruction is not used, and the dynamic recompiler will generate literal pools instead of using movw/movt.  This provides compatibility with older processors, but generates somewhat less efficient code.&lt;br /&gt;
&lt;br /&gt;
=== RAM_OFFSET ===&lt;br /&gt;
&lt;br /&gt;
When compiling for ARM, this allocates an additional register which is used to add an offset to all pointers to memory addresses between 0x80000000 and 0x807FFFFF.  This allows the N64's memory to be mapped at an alternate address.  It incurs a small performance penalty, but is required on certain operating systems (e.g. Google Android, which places shared libraries at 0x80000000).&lt;br /&gt;
&lt;br /&gt;
This option is not used for x86.  The x86 instruction set allows for a full 32-bit offset in the instruction encoding, making it unnecessary to allocate an additional register for this purpose.&lt;br /&gt;
&lt;br /&gt;
=== CORTEX_A8_BRANCH_PREDICTION_HACK ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the dynamic recompiler will avoid generating consecutive branch instructions without another instruction in between.  This avoids a possible branch misprediction on the Cortex-A8 due to this processor having dual instruction decoders, but only one branch-prediction unit.  See [[Assembly Code Optimization]] for details.&lt;br /&gt;
&lt;br /&gt;
=== USE_MINI_HT ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the recompiler attempts to look up return addresses in a small hash table before checking the larger hash table.  This usually improves performance.&lt;br /&gt;
&lt;br /&gt;
=== IMM_PREFETCH ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the x86 PREFETCH instruction is used to prefetch entries from the hash table.  The increase in code size often outweighs the benefit of this.&lt;br /&gt;
&lt;br /&gt;
=== REG_PREFETCH ===&lt;br /&gt;
&lt;br /&gt;
Similar to the above, but loads the address into a register first, then uses the ARM PLD instruction.  The increase in code size almost always outweighs the benefit of this.&lt;br /&gt;
&lt;br /&gt;
=== R29_HACK ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the stack pointer (r29) is assumed to always contain a valid memory address, and accesses through it are not checked.  It is similar to the optimization described [http://strmnnrmn.blogspot.com/2007/08/interesting-dynarec-hack.html here].  This can crash the emulator and is not enabled by default.&lt;br /&gt;
&lt;br /&gt;
== Debugging ==&lt;br /&gt;
&lt;br /&gt;
Debugging information can be obtained by defining the assem_debug macro as printf.  This will cause the dynamic recompiler to print debugging information to stdout.  For each disassembled MIPS instruction, an entry similar to the following will be printed:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
U: r1 r8 r11 r16 r31 UU: r29 32: r0 r9&lt;br /&gt;
pre: eax=-1 ecx=9 edx=-1 ebx=-1 ebp=29 esi=36 edi=-1&lt;br /&gt;
needs: ecx ebp esi r: r9&lt;br /&gt;
entry: eax=-1 ecx=9 edx=-1 ebx=-1 ebp=29 esi=36 edi=-1&lt;br /&gt;
dirty: ecx ebp esi &lt;br /&gt;
  800001d8: LW r16,r29+14&lt;br /&gt;
eax=16 ecx=9 edx=-1 ebx=-1 ebp=29 esi=36 edi=-1 dirty: eax ecx ebp esi &lt;br /&gt;
 32: r0 r9 r16&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
U: A list of MIPS registers which will not be used before they are overwritten (liveness analysis)&lt;br /&gt;
&lt;br /&gt;
UU: A list of MIPS registers for which the upper 32 bits will not be used before they are overwritten&lt;br /&gt;
&lt;br /&gt;
32: Registers that contain 32-bit sign-extended values&lt;br /&gt;
&lt;br /&gt;
pre: The state of the register mapping prior to execution of this instruction.  (-1 = no mapping; 36 = cycle count; The complete list of values with special meanings can be found in the source code)&lt;br /&gt;
&lt;br /&gt;
needs: a list of register mappings that were considered necessary and which could not be eliminated to make room for other mappings&lt;br /&gt;
&lt;br /&gt;
r: Registers that are known to contain 32-bit values and where optimizations rely on the assumption that the register does not contain a value outside of the range -2&amp;lt;sup&amp;gt;31&amp;lt;/sup&amp;gt; to 2&amp;lt;sup&amp;gt;31&amp;lt;/sup&amp;gt;-1&lt;br /&gt;
&lt;br /&gt;
entry: The minimum set of register mappings required to jump to this point&lt;br /&gt;
&lt;br /&gt;
dirty: Cached registers that have been modified and will need to be written back&lt;br /&gt;
&lt;br /&gt;
address: instruction - The decoded opcode, followed by the register mapping in effect after this instruction executes&lt;br /&gt;
&lt;br /&gt;
An asterisk (*) designates locations which are the target of a branch instruction.  Constant propagation will not be performed across these points.&lt;br /&gt;
&lt;br /&gt;
After the complete disassembly, the recompiled native code is shown.&lt;br /&gt;
&lt;br /&gt;
Note that the output can be quite voluminous; 20-30 MB is typical.&lt;br /&gt;
&lt;br /&gt;
==Potential improvements==&lt;br /&gt;
&lt;br /&gt;
===Copy propagation/offset propagation===&lt;br /&gt;
&lt;br /&gt;
A common instruction sequence is of the form:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
LUI r9,12340000&lt;br /&gt;
ADD r9,r9,r8&lt;br /&gt;
LW r9,5678(r9)&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
It would be helpful to recognize this as a load from r8+12345678.  The current constant propagation code does not do so.&lt;br /&gt;
&lt;br /&gt;
===Unaligned memory access===&lt;br /&gt;
&lt;br /&gt;
A small improvement could be made by combining adjacent LWL/LWR instructions.  The potential gain from doing so is very limited because these instructions typically represent less than 1% of all memory accesses.&lt;br /&gt;
&lt;br /&gt;
===SLT/branch merging===&lt;br /&gt;
&lt;br /&gt;
A frequent occurrence in MIPS code is an SLT or SLTI instruction followed by a branch.  This is generated relatively inefficiently on x86 and ARM, first doing a compare and set, then testing this value and doing a conditional branch.  Doing only one comparison would save at least one instruction, and could potentially save up to three instructions if the liveness analysis reveals that the result of the SLT instruction is used nowhere else.&lt;br /&gt;
&lt;br /&gt;
While a potentially useful optimization, this approach has several problems.  First, there are often additional instructions between the slt and the branch.  These must be checked to ensure they do not modify the relevant registers, since that would prevent reordering the instruction stream as desired.  Secondly, if the result of the slt is found to be live, but unmodified, on both paths of the branch, clean_registers will normally write this value before the branch, to avoid duplicating the writeback code on both paths.  This optimization would have to be removed if the slt were combined with the branch.&lt;br /&gt;
&lt;br /&gt;
===x86-64===&lt;br /&gt;
&lt;br /&gt;
Currently the x86-64 backend generates only 32-bit instructions.  Proper 64-bit code generation would improve performance.&lt;br /&gt;
&lt;br /&gt;
===PowerPC===&lt;br /&gt;
&lt;br /&gt;
It would be possible to add a PowerPC code generator to the dynamic recompiler.  Currently no one is working on this.  (The mupen64gc project is using a different codebase.)&lt;br /&gt;
&lt;br /&gt;
The following is a summary of the changes which would be necessary to add a PowerPC backend.&lt;br /&gt;
&lt;br /&gt;
The slt* instructions use conditional moves, which are unavailable on PowerPC.  A suitable alternative (such as moves from the condition register) would need to be used.&lt;br /&gt;
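&lt;br /&gt;
One possible replacement for SLT (a hypothetical sketch, assuming rs in r4, rt in r5, and rd in r3) moves the comparison result out of the condition register:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
cmpw   cr0, r4, r5&lt;br /&gt;
mfcr   r0&lt;br /&gt;
rlwinm r3, r0, 1, 31, 31   # rotate the cr0 LT bit into bit 31 and mask&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;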
&lt;br /&gt;
The assembler can generate as much as 256K of code in a single block; however, conditional branches on PowerPC are limited to +/-32K.  It will be necessary either to restrict the block size, or to insert jump tables in a manner similar to the literal pools on ARM.&lt;br /&gt;
&lt;br /&gt;
PowerPC generally relies on early branch resolution rather than statistical branch prediction.  Scheduling branch condition tests earlier may be advantageous.  (For example, the address bounds check could be done in address_generation, rather than just before the load or store.  Similarly it may be advantageous to test the branch condition and update the cycle count before executing the delay slot.)&lt;br /&gt;
&lt;br /&gt;
===MIPS===&lt;br /&gt;
&lt;br /&gt;
Recompiling MIPS into MIPS would be relatively straightforward; however, the current code generator has no facility for filling delay slots.  This capability would be required for efficient code generation.&lt;br /&gt;
&lt;br /&gt;
[[Category:Development]]&lt;/div&gt;</summary>
		<author><name>Ari64</name></author>
		
	</entry>
	<entry>
		<id>https://pandorawiki.org/index.php?title=Mupen64plus_dynamic_recompiler&amp;diff=4399</id>
		<title>Mupen64plus dynamic recompiler</title>
		<link rel="alternate" type="text/html" href="https://pandorawiki.org/index.php?title=Mupen64plus_dynamic_recompiler&amp;diff=4399"/>
		<updated>2010-12-18T02:54:01Z</updated>

		<summary type="html">&lt;p&gt;Ari64: /* Self-modifying code detection */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes the dynamic recompiler in [[Mupen64Plus]], and the changes made for the ARM port.&lt;br /&gt;
&lt;br /&gt;
==The original dynamic recompiler by Hacktarux==&lt;br /&gt;
&lt;br /&gt;
The dynamic recompiler used in Mupen64plus v1.5 is based on the original written by Hacktarux in 2002.&lt;br /&gt;
&lt;br /&gt;
It recompiles contiguous blocks of MIPS instructions.  First, each instruction is decoded into a dynamically-allocated 132-byte data structure.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
typedef struct _precomp_instr&lt;br /&gt;
{&lt;br /&gt;
   void (*ops)();&lt;br /&gt;
   union&lt;br /&gt;
     {&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         long long int *rs;&lt;br /&gt;
         long long int *rt;&lt;br /&gt;
         short immediate;&lt;br /&gt;
      } i;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned int inst_index;&lt;br /&gt;
      } j;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         long long int *rs;&lt;br /&gt;
         long long int *rt;&lt;br /&gt;
         long long int *rd;&lt;br /&gt;
         unsigned char sa;&lt;br /&gt;
         unsigned char nrd;&lt;br /&gt;
      } r;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned char base;&lt;br /&gt;
         unsigned char ft;&lt;br /&gt;
         short offset;&lt;br /&gt;
      } lf;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned char ft;&lt;br /&gt;
         unsigned char fs;&lt;br /&gt;
         unsigned char fd;&lt;br /&gt;
      } cf;&lt;br /&gt;
     } f;&lt;br /&gt;
   unsigned int addr; /* word-aligned instruction address in r4300 address space */&lt;br /&gt;
   unsigned int local_addr; /* byte offset to start of corresponding x86_64 instructions, from start of code block */&lt;br /&gt;
   reg_cache_struct reg_cache_infos;&lt;br /&gt;
} precomp_instr;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The decoded instructions are then compiled, generating x86 instructions for each MIPS instruction.  A 32K block is allocated with malloc() to hold the x86 code.  If this size proves insufficient, it is incrementally resized with realloc().&lt;br /&gt;
&lt;br /&gt;
MIPS registers are allocated to x86 registers with a least-recently-used replacement policy.  All cached registers are written back prior to a branch, or a memory read or write.&lt;br /&gt;
&lt;br /&gt;
To facilitate invalidation and replacement of modified code blocks, each 4K page is compiled separately.  If a sequence of instructions crosses a 4K page boundary, the current block is ended, and the next instructions will be compiled as a separate block.&lt;br /&gt;
&lt;br /&gt;
Branch instructions within a 4K page are compiled as branches directly to the target address.  Branches which cross a 4K page go through an indirect address lookup.&lt;br /&gt;
&lt;br /&gt;
Compiled code blocks are invalidated on write.  On an actual MIPS CPU, the instruction cache is invalidated using the CACHE instruction, however a few N64 games clear the cache using other methods.  Trapping writes appears to be the most reliable method of ensuring cache coherency in the emulated system.  The cache instruction is ignored.&lt;br /&gt;
&lt;br /&gt;
==Problems with the original design==&lt;br /&gt;
&lt;br /&gt;
The most significant performance problem with this design is its excessive memory usage.  The decoded instruction data is retained (132 bytes for each MIPS instruction) and occasionally referenced during execution.  Memory accesses frequently miss the L2 cache, resulting in poor performance.&lt;br /&gt;
&lt;br /&gt;
Additionally, the register cache is relatively inefficient, since all registers are flushed before any read or write operation.&lt;br /&gt;
&lt;br /&gt;
==A new approach==&lt;br /&gt;
&lt;br /&gt;
To reduce memory usage, the new design allocates a single large block of memory (currently 16 MiB) which is used for recompiled code.  This memory is allocated using mmap with the PROT_EXEC bit set, to ensure operation on CPUs with no-execute (NX) page permissions.&lt;br /&gt;
&lt;br /&gt;
The recompiler compiles contiguous blocks of MIPS instructions; that is, it does not attempt to follow branches or trace 'hot paths'.&lt;br /&gt;
&lt;br /&gt;
The recompiler consists of eight stages, plus a linker, and a memory manager.&lt;br /&gt;
&lt;br /&gt;
Compiled blocks are invalidated on write, as before; however, since compiled blocks may cross 4K page boundaries, writes may invalidate adjacent pages as well.&lt;br /&gt;
&lt;br /&gt;
Currently the dynarec generates x86, x86-64, ARMv5, and ARMv7 (little-endian).  Most of the code is shared between the architectures, but a different code generator is included at compile time, using different #include statements depending on the CPU type.&lt;br /&gt;
&lt;br /&gt;
==Pass 1: Disassembly==&lt;br /&gt;
&lt;br /&gt;
When an instruction address is encountered which has not been compiled, the function new_recompile_block is called, with the (virtual) address of the target instruction as its sole parameter.  If the address is invalid, the function returns a nonzero value and the caller is responsible for handling the pagefault.&lt;br /&gt;
&lt;br /&gt;
Instructions are decoded until an unconditional jump, usually a return, is encountered.  Disassembly ordinarily continues past a JAL (subroutine call) instruction; however, this strategy is abandoned if invalid instructions are encountered.&lt;br /&gt;
&lt;br /&gt;
Up to 4096 instructions (16K of MIPS code) may be disassembled at once.  Surprisingly, some games do actually reach this limit.&lt;br /&gt;
&lt;br /&gt;
==Pass 2: Liveness analysis==&lt;br /&gt;
&lt;br /&gt;
After disassembly, liveness analysis is performed on the registers.  This determines when a particular register will no longer be used, and thus can be removed from the register cache.&lt;br /&gt;
&lt;br /&gt;
A separate analysis is done on the upper 32 bits of 64-bit registers.  This can determine when only the lower 32 bits of a register are significant, thus allowing use of a 32-bit register.  This enables more efficient code generation on 32-bit processors.&lt;br /&gt;
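&lt;br /&gt;
For example, in a hypothetical sequence such as&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
LD   r5,0(r4)&lt;br /&gt;
ADDU r6,r5,r7&lt;br /&gt;
SW   r6,8(r4)&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
the upper 32 bits of r5 are dead after the LD (provided r5 has no other uses), because the 32-bit ADDU depends only on the lower word of each operand.  The analysis then allows r5 to occupy a single 32-bit host register.&lt;br /&gt;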
&lt;br /&gt;
==Pass 3: Register allocation==&lt;br /&gt;
&lt;br /&gt;
The 31 MIPS registers must be mapped onto seven available registers on x86, or twelve available registers on ARM.&lt;br /&gt;
&lt;br /&gt;
Instructions with 64-bit results require two registers on 32-bit host processors.  32-bit instructions require only one.  A flag is set for registers containing 32-bit values, and these registers will be sign-extended before they are written to the register file.&lt;br /&gt;
&lt;br /&gt;
Registers are made available using a least-soon-needed replacement policy.  When the register cache is full, and no registers can be eliminated using the liveness analysis, a ten-instruction lookahead is used to determine which registers will not be needed soon and these registers are evicted from the cache.&lt;br /&gt;
&lt;br /&gt;
==Pass 4: Free unused registers==&lt;br /&gt;
&lt;br /&gt;
After the initial register allocation, the allocations are reviewed to determine if any registers remain allocated longer than necessary.  These are then removed from the register cache.  This avoids having to write back a large number of registers just before a branch, and makes more registers available for the next pass.&lt;br /&gt;
&lt;br /&gt;
==Pass 5: Pre-allocate registers==&lt;br /&gt;
&lt;br /&gt;
If a register will be used soon and needs to be loaded, try to load it early if the register is available.  This improves execution on CPUs with a load-use penalty.&lt;br /&gt;
&lt;br /&gt;
==Pass 6: Optimize clean/dirty state==&lt;br /&gt;
&lt;br /&gt;
If a cached register is 'dirty' and needs to be written out, try to do so as soon as the register will no longer be modified.  This avoids having to write the same register on multiple code paths due to conditional branches.  Additionally, try to avoid writing out dirty registers inside of loops.&lt;br /&gt;
&lt;br /&gt;
==Pass 7: Identify where registers are assumed to be 32-bit==&lt;br /&gt;
&lt;br /&gt;
When a 64-bit register is mapped to a 32-bit register with the assumption that the value will be sign-extended before being used, it is necessary to ensure that no register contains a 64-bit value when branching to such a location.  These instructions are flagged to identify them as requiring 32-bit inputs.  This information is used by the linker, and the exception return (ERET) handler.&lt;br /&gt;
&lt;br /&gt;
==Pass 8: Assembly==&lt;br /&gt;
&lt;br /&gt;
This generates and outputs the recompiled code.&lt;br /&gt;
&lt;br /&gt;
Following the main code block, handlers for certain exceptions as well as alternate entry points are added.&lt;br /&gt;
&lt;br /&gt;
If a recompiled instruction relies on a certain MIPS register being cached in a certain native register, then a short 'stub' of code is generated to load the necessary registers.  When an instruction outside of this block needs to jump to that location, it will instead jump to the stub.  The necessary registers will be loaded, and then it will jump into the main code sequence.&lt;br /&gt;
&lt;br /&gt;
On architectures which require literal pools (ARMv5) these are inserted as necessary.&lt;br /&gt;
&lt;br /&gt;
==Linker==&lt;br /&gt;
&lt;br /&gt;
The linker fills in all unresolved branches.  Branches within the block are linked to their target address.  Branches which jump outside of the block are linked to their target if that address has been compiled already. These inter-block branches are recorded in the jump_out array.  This information will be used to remove the links in the event that the target of the branch is invalidated.&lt;br /&gt;
&lt;br /&gt;
Unresolved branches point to a stub which loads the address of the branch instruction and the virtual address of its target into registers, and calls the dynamic linker.  When this code is executed, the dynamic linker will compile the target if necessary, and then patch the branch instruction with the new address.&lt;br /&gt;
&lt;br /&gt;
==Memory manager==&lt;br /&gt;
&lt;br /&gt;
The last step in new_recompile_block is to ensure that there will be sufficient memory available to compile the next block.  If there is not, then the oldest blocks are purged.&lt;br /&gt;
&lt;br /&gt;
The dynarec cache can be described as a 16MB circular buffer divided into eight segments.  Memory is allocated in order from beginning to end.&lt;br /&gt;
&lt;br /&gt;
When fewer than two segments are empty, the next segment in sequence is cleared.  This continues as memory is needed, wrapping around from the end of the buffer to the beginning.&lt;br /&gt;
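&lt;br /&gt;
The allocation policy can be sketched roughly as follows (hypothetical names; the actual code tracks the individual blocks within each segment):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
/* 16MB cache divided into 8 segments of 2MB each */&lt;br /&gt;
while (empty_segments() &amp;lt; 2) {&lt;br /&gt;
   purge_blocks_in(next_segment);   /* unlink and invalidate */&lt;br /&gt;
   next_segment = (next_segment + 1) % 8;&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;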
&lt;br /&gt;
==Invalidation and restoration==&lt;br /&gt;
&lt;br /&gt;
Normally, code blocks are looked up via a hash table, using the virtual address of the target instruction.  However, for purposes of invalidation, blocks are grouped by physical address.  This can be described as a virtually-indexed, physically-tagged (VIPT) cache.&lt;br /&gt;
&lt;br /&gt;
References to compiled blocks are stored in one of 4096 linked lists in the jump_in array.  Each list covers a 4K block of memory, and 2048 such lists are sufficient to cover the 8MB of RAM in the Nintendo 64.  The remaining lists are for code in ROM, and the bootloader in SP memory.&lt;br /&gt;
&lt;br /&gt;
When a write hits a memory page marked as cached, all entries in the corresponding list are invalidated.  If any code is found to cross a 4K boundary, the adjacent lists are invalidated also.&lt;br /&gt;
&lt;br /&gt;
Sometimes blocks may be invalidated even when none of the code is actually modified.  This can happen if data is written to memory in the same 4K page, or if code is reloaded without actually modifying it.  If blocks which were previously invalidated are subsequently found to be unmodified, those blocks are marked in the restore_candidate array.  If the block remains unmodified, it will be restored as a valid block, to avoid recompiling blocks which do not need to be recompiled.  This is performed by the clean_blocks function which is called periodically.&lt;br /&gt;
&lt;br /&gt;
The jump_in array, which lists unmodified blocks, is physically indexed, however the jump_dirty array, which lists potentially-modified blocks, is virtually indexed.  This allows blocks which have changed physical addresses, but are still at the same virtual address, to be recognized as being the same as before, and not in need of recompilation.&lt;br /&gt;
&lt;br /&gt;
==Dynamic linker==&lt;br /&gt;
&lt;br /&gt;
Branches with unresolved addresses jump to the dynamic linker.  This will look through the jump_in list corresponding to the physical page containing the virtual target address.  If found, the branch instruction will be patched with the address, and then it will jump to this address.&lt;br /&gt;
&lt;br /&gt;
If not found, the jump_dirty list will be searched for blocks which were previously compiled but may have been modified.  If a potential match is found, the code will be compared against a cached copy to determine if any changes have been made.  If not, then it will jump to the block.  Because the memory could be modified again, branch instructions referencing these blocks are not altered, and continue to point to the dynamic linker.  These blocks will continue to be verified each time they are called, until restored to the jump_in list by the clean_blocks function described above.&lt;br /&gt;
&lt;br /&gt;
If no compiled block is found, or the existing block was modified, the target is recompiled.&lt;br /&gt;
&lt;br /&gt;
==Address lookup==&lt;br /&gt;
&lt;br /&gt;
When a JR (jump register) instruction is encountered, the address of the recompiled code must be looked up using the address of the MIPS code.  The majority of such instructions jump to the link register (r31) to return to the location following a JAL (call) instruction.&lt;br /&gt;
&lt;br /&gt;
When a JAL or JALR is executed, the address of the following instruction is inserted into a small 32-entry hash table, which is checked when a JR r31 instruction is executed.  This allows for a quick return from subroutine calls.&lt;br /&gt;
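&lt;br /&gt;
A rough sketch of this mechanism (hypothetical structure and names, not the actual implementation):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
/* hypothetical 32-entry return-address cache */&lt;br /&gt;
struct { unsigned int vaddr; void *native; } mini_ht[32];&lt;br /&gt;
&lt;br /&gt;
/* on JAL/JALR: record the address of the following instruction */&lt;br /&gt;
int h = (return_addr / 4) % 32;&lt;br /&gt;
mini_ht[h].vaddr  = return_addr;&lt;br /&gt;
mini_ht[h].native = compiled_return_point;&lt;br /&gt;
&lt;br /&gt;
/* on JR r31: probe this table before the full lookup */&lt;br /&gt;
if (mini_ht[(r31 / 4) % 32].vaddr == r31)&lt;br /&gt;
   jump_to(mini_ht[(r31 / 4) % 32].native);&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;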
&lt;br /&gt;
If the JR instruction uses a register other than r31, or the small hash table lookup fails to find a match, a larger 131072-entry hash table is checked.  This table contains 65536 bins with up to 2 addresses per bin.  If this also fails to find a match (which occurs less than 1% of the time) an exhaustive search of all compiled addresses within that 4K memory page is performed.&lt;br /&gt;
&lt;br /&gt;
If no match is found by any of these methods, the target address is compiled, and the new address is inserted into the hash table.&lt;br /&gt;
&lt;br /&gt;
==Cycle counting==&lt;br /&gt;
&lt;br /&gt;
Cycles are counted before each branch by adding the cycles from the preceding instructions to a specific register.  The cycle count is in R10 on ARM and ESI on x86.  The value in this register is normally a negative number.  When this number exceeds zero, a conditional branch is taken which jumps to an interrupt handler.&lt;br /&gt;
&lt;br /&gt;
For example, the following x86 code adds eight cycles:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add $8,%esi&lt;br /&gt;
jns interrupt_handler&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The conditional branch jumps to a small bit of code located after the main compiled block, which saves the cached registers to the register file, sets the instruction pointer which will be used upon return from the interrupt, and then calls cc_interrupt.&lt;br /&gt;
&lt;br /&gt;
As in the original mupen64plus, the emulated clock runs at 37.5 MHz, and each instruction takes 2 clock cycles.&lt;br /&gt;
&lt;br /&gt;
==Interrupt handler==&lt;br /&gt;
&lt;br /&gt;
When the cycle count register reaches its limit, cc_interrupt is called, which in turn calls gen_interupt [sic].  If interrupts are not enabled, cc_interrupt returns.  If interrupts are enabled, and an interrupt is to be taken, the pending_exception flag will be set.  In this case, cc_interrupt does not return, and instead pops the stack and causes an unconditional jump to the address in pcaddr (usually 0x80000180).&lt;br /&gt;
&lt;br /&gt;
There is one additional case where the interrupt handler may be called.  If interrupts were disabled, and are enabled by writing to coprocessor 0 register 12, any pending interrupts are handled immediately.&lt;br /&gt;
&lt;br /&gt;
==Delay slots==&lt;br /&gt;
&lt;br /&gt;
MIPS has 'delay slots', where the instruction after the branch is executed before the branch is taken.  Instructions in delay slots are issued out-of-order in the recompiled code.&lt;br /&gt;
&lt;br /&gt;
[[Image:Recompiler delay slot reordering.png]]&lt;br /&gt;
&lt;br /&gt;
When a branch jumps into the delay slot of another branch, this case must be handled slightly differently:&lt;br /&gt;
&lt;br /&gt;
[[Image:Recompiler delay slot reordering 2.png]]&lt;br /&gt;
&lt;br /&gt;
The branch test and delay slot are executed in-order if a dependency exists, or for 'likely' branches where the delay slot is nullified when the branch condition is false.  These cases are infrequent (typically less than 10% of branches).&lt;br /&gt;
&lt;br /&gt;
==Constant propagation==&lt;br /&gt;
&lt;br /&gt;
When an instruction loads a constant into a register, the register is tagged as a constant.  The constant tag will be retained if subsequent instructions modify the constant using other constants.  During assembly, such a sequence of instructions is combined into a single load.  For example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
LUI r8,12340000  --&amp;gt;  mov $0x12345678,%eax&lt;br /&gt;
ORI r8,r8,5678&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This optimization is not performed where a branch target intervenes, e.g.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
...&lt;br /&gt;
 LUI r8,12340000  --&amp;gt;  mov $0x12340000,%eax&lt;br /&gt;
L1:&lt;br /&gt;
 ORI r8,r8,5678   --&amp;gt;  or $0x5678,%eax&lt;br /&gt;
 ...&lt;br /&gt;
 BEQ r0,r0,L1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Registers containing constants are identified by bits in the isconst and wasconst fields of the regstat structure.  The wasconst bit is set for a register if the register contained a known constant before the instruction, and the isconst bit is set for a register if the register will contain a known constant after the instruction executes.&lt;br /&gt;
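&lt;br /&gt;
In rough outline (a hypothetical sketch following the description above):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
/* before instruction i is processed, carry the state forward */&lt;br /&gt;
wasconst = isconst;&lt;br /&gt;
&lt;br /&gt;
/* LUI rt,imm produces a known constant */&lt;br /&gt;
set_const(rt, imm * 0x10000);   /* sets the isconst bit for rt */&lt;br /&gt;
&lt;br /&gt;
/* a result that cannot be computed at compile time */&lt;br /&gt;
clear_const(rt);                /* clears the isconst bit for rt */&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;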
&lt;br /&gt;
==Translation lookaside buffer emulation==&lt;br /&gt;
&lt;br /&gt;
Most Nintendo 64 games do not use virtual memory, but some do.  At startup, main memory is directly mapped at addresses from 0x80000000 to 0x803FFFFF, or up to 0x807FFFFF if the memory expansion is used.&lt;br /&gt;
&lt;br /&gt;
Normally, read or write operations are checked against this range, and if outside this range, control is passed to the appropriate I/O handler.  This is done as follows:&lt;br /&gt;
&lt;br /&gt;
MIPS instruction: LW r1,8(r2)&lt;br /&gt;
&lt;br /&gt;
ARM code:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add r1, r2, #8&lt;br /&gt;
cmp r1, #8388608&lt;br /&gt;
bvc handler&lt;br /&gt;
ldr r1, [r1]&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If there are valid entries in the TLB, this would instead be compiled as follows:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add r1, r2, #8&lt;br /&gt;
mov r0, #264&lt;br /&gt;
add r0, r0, r1, lsr #12&lt;br /&gt;
ldr r0, [r11, r0, lsl #2]&lt;br /&gt;
tst r0, r0&lt;br /&gt;
bmi handler&lt;br /&gt;
ldr r1, [r1, r0, lsl #2]&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This looks up the offset in the memory_map table (which, in this example, is located at r11+264*4).  The high bit is tested to determine whether a valid mapping exists for this page.&lt;br /&gt;
&lt;br /&gt;
==Page fault emulation==&lt;br /&gt;
&lt;br /&gt;
If a memory address references an invalid page, a conditional branch is taken to a bit of code placed after the main block.  This will save any modified cached registers, and call an appropriate handler function for the address.  The handler will either perform I/O, or generate an exception (pagefault).&lt;br /&gt;
&lt;br /&gt;
==Mapping executable pages==&lt;br /&gt;
&lt;br /&gt;
If the dynamic recompiler encounters code which is not in contiguous physical memory, it will end the block at the page boundary, so that the block can be removed cleanly if the mapping is changed.&lt;br /&gt;
&lt;br /&gt;
There is a special case of this, where a branch and delay slot span two pages.  If a branch instruction is the last instruction in a virtual memory page, it is compiled in a different manner than other branches.  The branch condition is evaluated, and the target address is placed in a register (%ebp on x86, and r8 on ARM).  A special form of the dynamic linker (dyna_linker_ds) is used to link the branch to its corresponding delay slot in another block.  If no page fault occurs, the delay slot executes and then jumps to the address in the register.  For conditional branches that are not taken, the target address is the next instruction.  This code is generated by the pagespan_assemble function.&lt;br /&gt;
&lt;br /&gt;
== Self-modifying code detection ==&lt;br /&gt;
&lt;br /&gt;
Pages which contain no compiled code, or whose code may have been modified since compilation, are marked in the invalid_code array.  Writes are checked against this array, and if a write hits a valid (compiled and unmodified) page, invalidate_block is called:&lt;br /&gt;
&lt;br /&gt;
MIPS instruction: SW r1,8(r2)&lt;br /&gt;
&lt;br /&gt;
ARM code:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
ldr r3, [r11, #88]  // pointer to invalid_code&lt;br /&gt;
add r4, r2, #8&lt;br /&gt;
cmp r4, #8388608&lt;br /&gt;
bvc handler&lt;br /&gt;
str r1, [r4]&lt;br /&gt;
ldrb r14, [r3, r4, lsr #12]&lt;br /&gt;
cmp r14, #1&lt;br /&gt;
bne invstub&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In TLB mode, the invalid_code array is not checked directly.  Instead, pages are marked non-writable in memory_map.&lt;br /&gt;
&lt;br /&gt;
== Compile options ==&lt;br /&gt;
&lt;br /&gt;
=== ARMv5_ONLY ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the UXTH instruction is not used, and the dynamic recompiler will generate literal pools instead of using movw/movt.  This provides compatibility with older processors, but generates somewhat less efficient code.&lt;br /&gt;
&lt;br /&gt;
=== CORTEX_A8_BRANCH_PREDICTION_HACK ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the dynamic recompiler will avoid generating consecutive branch instructions without another instruction in between.  This avoids a possible branch misprediction on the Cortex-A8 due to this processor having dual instruction decoders, but only one branch-prediction unit.  See [[Assembly Code Optimization]] for details.&lt;br /&gt;
&lt;br /&gt;
=== USE_MINI_HT ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, attempt to look up return addresses in a small hash table before checking the larger hash table.  Usually improves performance.&lt;br /&gt;
&lt;br /&gt;
=== IMM_PREFETCH ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the x86 PREFETCH instruction is used to prefetch entries from the hash table.  The increase in code size often outweighs the benefit of this.&lt;br /&gt;
&lt;br /&gt;
=== REG_PREFETCH ===&lt;br /&gt;
&lt;br /&gt;
Similar to the above, but loads the address into a register first, then uses the ARM PLD instruction.  The increase in code size almost always outweighs the benefit of this.&lt;br /&gt;
&lt;br /&gt;
=== R29_HACK ===&lt;br /&gt;
&lt;br /&gt;
Assume that the stack pointer (r29) is always a valid memory address and do not check it.  It is similar to the optimization described [http://strmnnrmn.blogspot.com/2007/08/interesting-dynarec-hack.html here].  This can crash the emulator and is not enabled by default.&lt;br /&gt;
&lt;br /&gt;
== Debugging ==&lt;br /&gt;
&lt;br /&gt;
Debugging information can be obtained by defining the assem_debug macro as printf.  This will cause the dynamic recompiler to print debugging information to stdout.  For each disassembled MIPS instruction, an entry similar to the following will be printed:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
U: r1 r8 r11 r16 r31 UU: r29 32: r0 r9&lt;br /&gt;
pre: eax=-1 ecx=9 edx=-1 ebx=-1 ebp=29 esi=36 edi=-1&lt;br /&gt;
needs: ecx ebp esi r: r9&lt;br /&gt;
entry: eax=-1 ecx=9 edx=-1 ebx=-1 ebp=29 esi=36 edi=-1&lt;br /&gt;
dirty: ecx ebp esi &lt;br /&gt;
  800001d8: LW r16,r29+14&lt;br /&gt;
eax=16 ecx=9 edx=-1 ebx=-1 ebp=29 esi=36 edi=-1 dirty: eax ecx ebp esi &lt;br /&gt;
 32: r0 r9 r16&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
U: A list of MIPS registers which will not be used before they are overwritten (liveness analysis)&lt;br /&gt;
&lt;br /&gt;
UU: A list of MIPS registers for which the upper 32 bits will not be used before they are overwritten&lt;br /&gt;
&lt;br /&gt;
32: Registers that contain 32-bit sign-extended values&lt;br /&gt;
&lt;br /&gt;
pre: The state of the register mapping prior to execution of this instruction.  (-1 = no mapping; 36 = cycle count; The complete list of values with special meanings can be found in the source code)&lt;br /&gt;
&lt;br /&gt;
needs: a list of register mappings that were considered necessary and which could not be eliminated to make room for other mappings&lt;br /&gt;
&lt;br /&gt;
r: Registers that are known to contain 32-bit values and where optimizations rely on the assumption that the register does not contain a value outside of the range -2&amp;lt;sup&amp;gt;31&amp;lt;/sup&amp;gt; to 2&amp;lt;sup&amp;gt;31&amp;lt;/sup&amp;gt;-1&lt;br /&gt;
&lt;br /&gt;
entry: The minimum set of register mappings required to jump to this point&lt;br /&gt;
&lt;br /&gt;
dirty: Cached registers that have been modified and will need to be written back&lt;br /&gt;
&lt;br /&gt;
address: instruction - The decoded opcode, followed by the register mapping in effect after this instruction executes&lt;br /&gt;
&lt;br /&gt;
An asterisk (*) designates locations which are the target of a branch instruction.  Constant propagation will not be performed across these points.&lt;br /&gt;
&lt;br /&gt;
After the complete disassembly, the recompiled native code is shown.&lt;br /&gt;
&lt;br /&gt;
Note that the output can be quite voluminous; 20-30 MB is typical.&lt;br /&gt;
&lt;br /&gt;
==Potential improvements==&lt;br /&gt;
&lt;br /&gt;
===Copy propagation/offset propagation===&lt;br /&gt;
&lt;br /&gt;
A common instruction sequence is of the form:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
LUI r9,12340000&lt;br /&gt;
ADD r9,r9,r8&lt;br /&gt;
LW r9,5678(r9)&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
It would be helpful to recognize this as a load from r8+12345678.  The current constant propagation code does not do so.&lt;br /&gt;
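&lt;br /&gt;
If it did, and assuming (hypothetically) that r8 were cached in %ebx and r9 allocated to %eax, the sequence could be emitted as a single x86 load:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
mov 0x12345678(%ebx),%eax&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;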
&lt;br /&gt;
===Unaligned memory access===&lt;br /&gt;
&lt;br /&gt;
A small improvement could be made by combining adjacent LWL/LWR instructions.  The potential gain from doing so is very limited because these instructions typically represent less than 1% of all memory accesses.&lt;br /&gt;
&lt;br /&gt;
===SLT/branch merging===&lt;br /&gt;
&lt;br /&gt;
A frequent occurrence in MIPS code is an SLT or SLTI instruction followed by a branch.  This is generated relatively inefficiently on x86 and ARM, first doing a compare and set, then testing this value and doing a conditional branch.  Doing only one comparison would save at least one instruction, and could potentially save up to three instructions if the liveness analysis reveals that the result of the SLT instruction is used nowhere else.&lt;br /&gt;
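&lt;br /&gt;
For example, the MIPS sequence SLT r2,r4,r5 followed by BNE r2,r0,target currently produces ARM code along these lines (a hypothetical sketch):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
cmp r4, r5&lt;br /&gt;
movlt r2, #1&lt;br /&gt;
movge r2, #0&lt;br /&gt;
tst r2, r2&lt;br /&gt;
bne target&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If liveness analysis showed the SLT result to be dead after the branch, the merged form would reduce to:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
cmp r4, r5&lt;br /&gt;
blt target&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;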
&lt;br /&gt;
While a potentially useful optimization, this approach has several problems.  First, there are often additional instructions between the SLT and the branch.  These must be checked to ensure that they do not modify the registers involved, as that would prevent the desired reordering of the instruction stream.  Secondly, if the result of the SLT is found to be live, but unmodified, on both paths of the branch, clean_registers will normally write this value back before the branch, to avoid duplicating the writeback code on both paths.  That optimization would have to be suppressed if the SLT were combined with the branch.&lt;br /&gt;
&lt;br /&gt;
===x86-64===&lt;br /&gt;
&lt;br /&gt;
Currently the x86-64 backend generates only 32-bit instructions.  Proper 64-bit code generation would improve performance.&lt;br /&gt;
&lt;br /&gt;
===PowerPC===&lt;br /&gt;
&lt;br /&gt;
It would be possible to add a PowerPC code generator to the dynamic recompiler.  Currently no one is working on this.  (The mupen64gc project is using a different codebase.)&lt;br /&gt;
&lt;br /&gt;
The following is a summary of the changes which would be necessary to add a PowerPC backend.&lt;br /&gt;
&lt;br /&gt;
The slt* instructions use conditional moves, which are unavailable on PowerPC.  A suitable alternative (such as moves from the condition register) would need to be used.&lt;br /&gt;
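&lt;br /&gt;
One possible replacement for SLT (a hypothetical sketch, assuming rs in r4, rt in r5, and rd in r3) moves the comparison result out of the condition register:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
cmpw   cr0, r4, r5&lt;br /&gt;
mfcr   r0&lt;br /&gt;
rlwinm r3, r0, 1, 31, 31   # rotate the cr0 LT bit into bit 31 and mask&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;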
&lt;br /&gt;
The assembler can generate as much as 256K of code in a single block; however, conditional branches on PowerPC are limited to +/-32K.  It will be necessary either to restrict the block size, or to insert jump tables in a manner similar to the literal pools on ARM.&lt;br /&gt;
&lt;br /&gt;
PowerPC generally relies on early branch resolution rather than statistical branch prediction.  Scheduling branch condition tests earlier may be advantageous.  (For example, the address bounds check could be done in address_generation, rather than just before the load or store.  Similarly it may be advantageous to test the branch condition and update the cycle count before executing the delay slot.)&lt;br /&gt;
&lt;br /&gt;
===MIPS===&lt;br /&gt;
&lt;br /&gt;
Recompiling MIPS into MIPS would be relatively straightforward; however, the current code generator has no facility for filling delay slots, a capability that would be required for efficient code generation.&lt;br /&gt;
&lt;br /&gt;
[[Category:Development]]&lt;/div&gt;</summary>
		<author><name>Ari64</name></author>
		
	</entry>
	<entry>
		<id>https://pandorawiki.org/index.php?title=Mupen64plus_dynamic_recompiler&amp;diff=4243</id>
		<title>Mupen64plus dynamic recompiler</title>
		<link rel="alternate" type="text/html" href="https://pandorawiki.org/index.php?title=Mupen64plus_dynamic_recompiler&amp;diff=4243"/>
		<updated>2010-11-29T05:36:32Z</updated>

		<summary type="html">&lt;p&gt;Ari64: Interrupt handler&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes the dynamic recompiler in Mupen64plus, and the changes made for the ARM port.&lt;br /&gt;
&lt;br /&gt;
==The original dynamic recompiler by Hacktarux==&lt;br /&gt;
&lt;br /&gt;
The dynamic recompiler used in Mupen64plus v1.5 is based on the original written by Hacktarux in 2002.&lt;br /&gt;
&lt;br /&gt;
It recompiles contiguous blocks of MIPS instructions.  First, each instruction is decoded into a dynamically-allocated 132-byte data structure.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
typedef struct _precomp_instr&lt;br /&gt;
{&lt;br /&gt;
   void (*ops)();&lt;br /&gt;
   union&lt;br /&gt;
     {&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         long long int *rs;&lt;br /&gt;
         long long int *rt;&lt;br /&gt;
         short immediate;&lt;br /&gt;
      } i;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned int inst_index;&lt;br /&gt;
      } j;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         long long int *rs;&lt;br /&gt;
         long long int *rt;&lt;br /&gt;
         long long int *rd;&lt;br /&gt;
         unsigned char sa;&lt;br /&gt;
         unsigned char nrd;&lt;br /&gt;
      } r;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned char base;&lt;br /&gt;
         unsigned char ft;&lt;br /&gt;
         short offset;&lt;br /&gt;
      } lf;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned char ft;&lt;br /&gt;
         unsigned char fs;&lt;br /&gt;
         unsigned char fd;&lt;br /&gt;
      } cf;&lt;br /&gt;
     } f;&lt;br /&gt;
   unsigned int addr; /* word-aligned instruction address in r4300 address space */&lt;br /&gt;
   unsigned int local_addr; /* byte offset to start of corresponding x86_64 instructions, from start of code block */&lt;br /&gt;
   reg_cache_struct reg_cache_infos;&lt;br /&gt;
} precomp_instr;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The decoded instructions are then compiled, generating x86 instructions for each MIPS instruction.  A 32K block is allocated with malloc() to hold the x86 code.  If this size proves insufficient, it is incrementally resized with realloc().&lt;br /&gt;
&lt;br /&gt;
MIPS registers are allocated to x86 registers with a least-recently-used replacement policy.  All cached registers are written back prior to a branch, or a memory read or write.&lt;br /&gt;
&lt;br /&gt;
To facilitate invalidation and replacement of modified code blocks, each 4K page is compiled separately.  If a sequence of instructions crosses a 4K page boundary, the current block is ended, and the next instructions will be compiled as a separate block.&lt;br /&gt;
&lt;br /&gt;
Branch instructions within a 4K page are compiled as branches directly to the target address.  Branches which cross a 4K page go through an indirect address lookup.&lt;br /&gt;
&lt;br /&gt;
Compiled code blocks are invalidated on write.  On an actual MIPS CPU, the instruction cache is invalidated using the CACHE instruction, however a few N64 games clear the cache using other methods.  Trapping writes appears to be the most reliable method of ensuring cache coherency in the emulated system.  The cache instruction is ignored.&lt;br /&gt;
&lt;br /&gt;
==Problems with the original design==&lt;br /&gt;
&lt;br /&gt;
The most significant performance problem with this design is its excessive memory usage.  The decoded instruction data is retained (132 bytes for each MIPS instruction) and occasionally referenced during execution.  Memory accesses frequently miss the L2 cache, resulting in poor performance.&lt;br /&gt;
&lt;br /&gt;
Additionally, the register cache is relatively inefficient, since all registers are flushed before any read or write operation.&lt;br /&gt;
&lt;br /&gt;
==A new approach==&lt;br /&gt;
&lt;br /&gt;
To reduce memory usage, the new design allocates a single large block of memory (currently 16 MiB) which is used for recompiled code.  This memory is allocated using mmap with the PROT_EXEC bit set, to ensure operation on CPUs with no-execute (NX) page permissions.&lt;br /&gt;
&lt;br /&gt;
Contiguous blocks of MIPS instructions are compiled; that is, the recompiler does not attempt to follow branches or trace 'hot paths'.&lt;br /&gt;
&lt;br /&gt;
The recompiler consists of eight stages, plus a linker, and a memory manager.&lt;br /&gt;
&lt;br /&gt;
Compiled blocks are invalidated on write, as before; however, because compiled blocks may now cross 4K page boundaries, a write may invalidate adjacent pages as well.&lt;br /&gt;
&lt;br /&gt;
Currently the dynarec generates x86, x86-64, ARMv5, and ARMv7 (little-endian) code.  Most of the code is shared between the architectures, but a different code generator is included at compile time, using different #include statements depending on the CPU type.&lt;br /&gt;
&lt;br /&gt;
==Pass 1: Disassembly==&lt;br /&gt;
&lt;br /&gt;
When an instruction address is encountered which has not been compiled, the function new_recompile_block is called, with the (virtual) address of the target instruction as its sole parameter.  If the address is invalid, the function returns a nonzero value and the caller is responsible for handling the pagefault.&lt;br /&gt;
&lt;br /&gt;
Instructions are decoded until an unconditional jump, usually a return, is encountered.  Disassembly ordinarily continues past a JAL (subroutine call) instruction; however, this strategy is abandoned if invalid instructions are encountered.&lt;br /&gt;
&lt;br /&gt;
Up to 4096 instructions (16K of MIPS code) may be disassembled at once.  Surprisingly, some games do actually reach this limit.&lt;br /&gt;
&lt;br /&gt;
==Pass 2: Liveness analysis==&lt;br /&gt;
&lt;br /&gt;
After disassembly, liveness analysis is performed on the registers.  This determines when a particular register will no longer be used, and thus can be removed from the register cache.&lt;br /&gt;
&lt;br /&gt;
A separate analysis is done on the upper 32 bits of 64-bit registers.  This can determine when only the lower 32 bits of a register are significant, thus allowing use of a 32-bit register.  This enables more efficient code generation on 32-bit processors.&lt;br /&gt;
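&lt;br /&gt;
As a rough illustration, liveness over a straight-line block can be computed in a single backward pass.  The sketch below uses hypothetical rs1/rs2/rd fields rather than the recompiler's actual data structures, and it ignores branches (which require propagating liveness across branch targets):&lt;br /&gt;

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical decoded instruction: reads rs1 and rs2, writes rd.
 * Register numbers index a 64-bit mask; r0 is the hardwired zero. */
typedef struct { int rs1, rs2, rd; } insn_t;

/* Backward pass: a register is live before instruction i if i reads it,
 * or if it is live after i and i does not overwrite it.  live_in[i]
 * receives the live set (as a bitmask) before each instruction; the
 * return value is the live set at block entry. */
static uint64_t liveness(const insn_t *code, int n, uint64_t *live_in)
{
    uint64_t live = 0;                  /* nothing assumed live at block exit */
    for (int i = n - 1; i >= 0; i--) {
        live &= ~(1ull << code[i].rd);  /* a definition kills liveness */
        live |= 1ull << code[i].rs1;    /* uses make registers live */
        live |= 1ull << code[i].rs2;
        live_in[i] = live & ~1ull;      /* r0 never needs to be cached */
    }
    return n ? live_in[0] : 0;
}
```

A register whose bit is clear in live_in can be dropped from the register cache at that point.&lt;br /&gt;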
&lt;br /&gt;
==Pass 3: Register allocation==&lt;br /&gt;
&lt;br /&gt;
The 31 MIPS registers must be mapped onto seven available registers on x86, or twelve available registers on ARM.&lt;br /&gt;
&lt;br /&gt;
Instructions with 64-bit results require two registers on 32-bit host processors.  32-bit instructions require only one.  A flag is set for registers containing 32-bit values, and these registers will be sign-extended before they are written to the register file.&lt;br /&gt;
&lt;br /&gt;
Registers are made available using a least-soon-needed replacement policy.  When the register cache is full, and no registers can be eliminated using the liveness analysis, a ten-instruction lookahead is used to determine which registers will not be needed soon and these registers are evicted from the cache.&lt;br /&gt;
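&lt;br /&gt;
The ten-instruction lookahead can be pictured as follows.  This is a minimal sketch with a toy encoding (one source register per instruction); the function names and encoding are illustrative, not the recompiler's:&lt;br /&gt;

```c
#include <assert.h>

#define LOOKAHEAD 10

/* code[i] = register read by instruction i (-1 if none): a toy encoding. */
static int next_use(const int *code, int n, int pos, int reg)
{
    for (int d = 0; d < LOOKAHEAD && pos + d < n; d++)
        if (code[pos + d] == reg) return d;
    return LOOKAHEAD;                /* not needed within the lookahead */
}

/* Evict the cached register whose next use is furthest away. */
static int pick_victim(const int *code, int n, int pos,
                       const int *cached, int ncached)
{
    int victim = cached[0], worst = -1;
    for (int i = 0; i < ncached; i++) {
        int d = next_use(code, n, pos, cached[i]);
        if (d > worst) { worst = d; victim = cached[i]; }
    }
    return victim;
}
```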
&lt;br /&gt;
==Pass 4: Free unused registers==&lt;br /&gt;
&lt;br /&gt;
After the initial register allocation, the allocations are reviewed to determine if any registers remain allocated longer than necessary.  These are then removed from the register cache.  This avoids having to write back a large number of registers just before a branch, and makes more registers available for the next pass.&lt;br /&gt;
&lt;br /&gt;
==Pass 5: Pre-allocate registers==&lt;br /&gt;
&lt;br /&gt;
If a MIPS register will be needed soon, it is loaded early when a host register is available.  This improves execution on CPUs with a load-use penalty.&lt;br /&gt;
&lt;br /&gt;
==Pass 6: Optimize clean/dirty state==&lt;br /&gt;
&lt;br /&gt;
If a cached register is 'dirty' and needs to be written out, try to do so as soon as the register will no longer be modified.  This avoids having to write the same register on multiple code paths due to conditional branches.  Additionally, try to avoid writing out dirty registers inside of loops.&lt;br /&gt;
&lt;br /&gt;
==Pass 7: Identify where registers are assumed to be 32-bit==&lt;br /&gt;
&lt;br /&gt;
When a 64-bit register is mapped to a 32-bit register with the assumption that the value will be sign-extended before being used, it is necessary to ensure that no register contains a 64-bit value when branching to such a location.  These instructions are flagged to identify them as requiring 32-bit inputs.  This information is used by the linker, and the exception return (ERET) handler.&lt;br /&gt;
&lt;br /&gt;
==Pass 8: Assembly==&lt;br /&gt;
&lt;br /&gt;
This generates and outputs the recompiled code.&lt;br /&gt;
&lt;br /&gt;
Following the main code block, handlers for certain exceptions as well as alternate entry points are added.&lt;br /&gt;
&lt;br /&gt;
If a recompiled instruction relies on a certain MIPS register being cached in a certain native register, then a short 'stub' of code is generated to load the necessary registers.  When an instruction outside of this block needs to jump to that location, it will instead jump to the stub.  The necessary registers will be loaded, and then it will jump into the main code sequence.&lt;br /&gt;
&lt;br /&gt;
On architectures which require literal pools (ARMv5) these are inserted as necessary.&lt;br /&gt;
&lt;br /&gt;
==Linker==&lt;br /&gt;
&lt;br /&gt;
The linker fills in all unresolved branches.  Branches within the block are linked to their target address.  Branches which jump outside of the block are linked to their target if that address has been compiled already. These inter-block branches are recorded in the jump_out array.  This information will be used to remove the links in the event that the target of the branch is invalidated.&lt;br /&gt;
&lt;br /&gt;
Unresolved branches point to a stub which loads the address of the branch instruction and the virtual address of its target into registers, and calls the dynamic linker.  When this code is executed, the dynamic linker will compile the target if necessary, and then patch the branch instruction with the new address.&lt;br /&gt;
&lt;br /&gt;
==Memory manager==&lt;br /&gt;
&lt;br /&gt;
The last step in new_recompile_block is to ensure that there will be sufficient memory available to compile the next block.  If there is not, then the oldest blocks are purged.&lt;br /&gt;
&lt;br /&gt;
The dynarec cache can be described as a 16MB circular buffer divided into eight segments.  Memory is allocated in order from beginning to end.&lt;br /&gt;
&lt;br /&gt;
When fewer than two empty segments remain, the next segment in sequence is cleared.  This continues as memory is needed, wrapping around from the end of the buffer to the beginning.&lt;br /&gt;
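&lt;br /&gt;
The purge policy might be sketched like this, with one flag per segment (illustrative bookkeeping; the real memory manager tracks actual block chains within each segment):&lt;br /&gt;

```c
#include <assert.h>

enum { NSEG = 8 };           /* 16MB cache divided into eight segments */
static int seg_empty[NSEG];  /* 1 = segment holds no live blocks */
static int cur;              /* segment currently being filled */
static int purged;           /* number of purge operations, for inspection */

/* Purge segments just ahead of the write position, circularly, until at
 * least two segments are empty again. */
static void ensure_space(void)
{
    int empty = 0;
    for (int i = 0; i < NSEG; i++) empty += seg_empty[i];
    int next = (cur + 1) % NSEG;
    while (empty < 2) {
        if (!seg_empty[next]) { seg_empty[next] = 1; empty++; purged++; }
        next = (next + 1) % NSEG;
    }
}

/* Called when the current segment fills up. */
static void fill_segment(void)
{
    seg_empty[cur] = 0;
    cur = (cur + 1) % NSEG;
    ensure_space();
}
```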
&lt;br /&gt;
==Invalidation and restoration==&lt;br /&gt;
&lt;br /&gt;
Normally, code blocks are looked up via a hash table, using the virtual address of the target instruction.  However, for purposes of invalidation, blocks are grouped by physical address.  This can be described as a virtually-indexed, physically-tagged (VIPT) cache.&lt;br /&gt;
&lt;br /&gt;
References to compiled blocks are stored in one of 4096 linked lists in the jump_in array.  Each list covers a 4K block of memory, and 2048 such lists are sufficient to cover the 8MB of RAM in the Nintendo 64.  The remaining lists are for code in ROM, and the bootloader in SP memory.&lt;br /&gt;
&lt;br /&gt;
When a write hits a memory page marked as cached, all entries in the corresponding list are invalidated.  If any code is found to cross a 4K boundary, the adjacent lists are invalidated also.&lt;br /&gt;
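&lt;br /&gt;
In outline, the per-page invalidation behaves like the sketch below.  The block structure here is a minimal stand-in for the real jump_in entries, and only blocks crossing forward into the next page are handled:&lt;br /&gt;

```c
#include <assert.h>
#include <stdlib.h>

/* One list of compiled blocks per 4K page. */
typedef struct block { unsigned start, end; struct block *next; } block_t;
static block_t *jump_in[4096];

static void add_block(unsigned start, unsigned end)
{
    block_t *b = malloc(sizeof *b);
    b->start = start;
    b->end = end;
    b->next = jump_in[(start >> 12) & 4095];
    jump_in[(start >> 12) & 4095] = b;
}

/* A write to this page discards its list; any block spilling past the
 * 4K boundary invalidates the adjacent page's list as well. */
static void invalidate_page(unsigned page)
{
    block_t *b = jump_in[page & 4095];
    jump_in[page & 4095] = NULL;   /* detach first, so a crossing block
                                      cannot re-enter this list */
    while (b) {
        block_t *next = b->next;
        if ((b->end - 1) >> 12 != b->start >> 12)
            invalidate_page((b->start >> 12) + 1);
        free(b);
        b = next;
    }
}
```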
&lt;br /&gt;
Sometimes blocks may be invalidated even when none of the code is actually modified.  This can happen if data is written to memory in the same 4K page, or if code is reloaded without actually modifying it.  If blocks which were previously invalidated are subsequently found to be unmodified, those blocks are marked in the restore_candidate array.  If the block remains unmodified, it will be restored as a valid block, to avoid recompiling blocks which do not need to be recompiled.  This is performed by the clean_blocks function which is called periodically.&lt;br /&gt;
&lt;br /&gt;
The jump_in array, which lists unmodified blocks, is physically indexed, however the jump_dirty array, which lists potentially-modified blocks, is virtually indexed.  This allows blocks which have changed physical addresses, but are still at the same virtual address, to be recognized as being the same as before, and not in need of recompilation.&lt;br /&gt;
&lt;br /&gt;
==Dynamic linker==&lt;br /&gt;
&lt;br /&gt;
Branches with unresolved addresses jump to the dynamic linker.  This will look through the jump_in list corresponding to the physical page containing the virtual target address.  If found, the branch instruction will be patched with the address, and then it will jump to this address.&lt;br /&gt;
&lt;br /&gt;
If not found, the jump_dirty list will be searched for blocks which were previously compiled but may have been modified.  If a potential match is found, the code will be compared against a cached copy to determine if any changes have been made.  If not, then it will jump to the block.  Because the memory could be modified again, branch instructions referencing these blocks are not altered, and continue to point to the dynamic linker.  These blocks will continue to be verified each time they are called, until restored to the jump_in list by the clean_blocks function described above.&lt;br /&gt;
&lt;br /&gt;
If no compiled block is found, or the existing block was modified, the target is recompiled.&lt;br /&gt;
&lt;br /&gt;
==Address lookup==&lt;br /&gt;
&lt;br /&gt;
When a JR (jump register) instruction is encountered, the address of the recompiled code must be looked up using the address of the MIPS code.  The majority of such instructions jump to the link register (r31) to return to the location following a JAL (call) instruction.&lt;br /&gt;
&lt;br /&gt;
When a JAL or JALR is executed, the address of the following instruction is inserted into a small 32-entry hash table, which is checked when a JR r31 instruction is executed.  This allows for a quick return from subroutine calls.&lt;br /&gt;
&lt;br /&gt;
If the JR instruction uses a register other than r31, or the small hash table lookup fails to find a match, a larger 131072-entry hash table is checked.  This table contains 65536 bins with up to 2 addresses per bin.  If this also fails to find a match (which occurs less than 1% of the time) an exhaustive search of all compiled addresses within that 4K memory page is performed.&lt;br /&gt;
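&lt;br /&gt;
The two-level lookup can be modeled as follows, with the table sizes taken from the text; the hash functions shown are illustrative stand-ins, not the ones the recompiler actually uses:&lt;br /&gt;

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static struct { uint32_t vaddr; void *code; } mini_ht[32];
static struct { uint32_t vaddr[2]; void *code[2]; } hash_table[65536];

static uint32_t hash16(uint32_t va) { return (va >> 2) & 0xFFFF; } /* assumed */

/* Record a return address after a JAL/JALR. */
static void mini_insert(uint32_t va, void *code)
{
    mini_ht[(va >> 2) & 31].vaddr = va;
    mini_ht[(va >> 2) & 31].code = code;
}

/* Insert into the large table: newest entry first, older one displaced. */
static void ht_insert(uint32_t va, void *code)
{
    uint32_t h = hash16(va);
    hash_table[h].vaddr[1] = hash_table[h].vaddr[0];
    hash_table[h].code[1]  = hash_table[h].code[0];
    hash_table[h].vaddr[0] = va;
    hash_table[h].code[0]  = code;
}

static void *ht_lookup(uint32_t va)
{
    uint32_t m = (va >> 2) & 31;          /* mini table, checked first */
    if (mini_ht[m].vaddr == va) return mini_ht[m].code;
    uint32_t h = hash16(va);              /* then the 65536-bin table */
    if (hash_table[h].vaddr[0] == va) return hash_table[h].code[0];
    if (hash_table[h].vaddr[1] == va) return hash_table[h].code[1];
    return NULL;          /* fall back to the exhaustive page search */
}
```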
&lt;br /&gt;
If no match is found by any of these methods, the target address is compiled, and the new address is inserted into the hash table.&lt;br /&gt;
&lt;br /&gt;
==Cycle counting==&lt;br /&gt;
&lt;br /&gt;
Cycles are counted before each branch by adding the cycles from the preceding instructions to a dedicated register: R10 on ARM, ESI on x86.  The value in this register is normally negative.  When it reaches zero or becomes positive, a conditional branch is taken which jumps to an interrupt handler.&lt;br /&gt;
&lt;br /&gt;
For example, the following x86 code adds eight cycles:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add $8,%esi&lt;br /&gt;
jns interrupt_handler&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The conditional branch jumps to a small bit of code located after the main compiled block, which saves the cached registers to the register file, sets the instruction pointer which will be used upon return from the interrupt, and then calls cc_interrupt.&lt;br /&gt;
&lt;br /&gt;
As in the original mupen64plus, the emulated clock runs at 37.5 MHz, and each instruction takes 2 clock cycles.&lt;br /&gt;
&lt;br /&gt;
==Interrupt handler==&lt;br /&gt;
&lt;br /&gt;
When the cycle count register reaches its limit, cc_interrupt is called, which in turn calls gen_interupt [sic].  If interrupts are not enabled, cc_interrupt returns.  If interrupts are enabled, and an interrupt is to be taken, the pending_exception flag will be set.  In this case, cc_interrupt does not return, and instead pops the stack and causes an unconditional jump to the address in pcaddr (usually 0x80000180).&lt;br /&gt;
&lt;br /&gt;
There is one additional case where the interrupt handler may be called.  If interrupts were disabled, and are enabled by writing to coprocessor 0 register 12, any pending interrupts are handled immediately.&lt;br /&gt;
&lt;br /&gt;
==Delay slots==&lt;br /&gt;
&lt;br /&gt;
MIPS has 'delay slots', where the instruction after the branch is executed before the branch is taken.  Instructions in delay slots are issued out-of-order in the recompiled code.&lt;br /&gt;
&lt;br /&gt;
[[Image:Recompiler delay slot reordering.png]]&lt;br /&gt;
&lt;br /&gt;
When a branch jumps into the delay slot of another branch, this case must be handled slightly differently:&lt;br /&gt;
&lt;br /&gt;
[[Image:Recompiler delay slot reordering 2.png]]&lt;br /&gt;
&lt;br /&gt;
The branch test and delay slot are executed in-order if a dependency exists, or for 'likely' branches where the delay slot is nullified when the branch condition is false.  These cases are infrequent (typically less than 10% of branches).&lt;br /&gt;
&lt;br /&gt;
==Constant propagation==&lt;br /&gt;
&lt;br /&gt;
When an instruction loads a constant into a register, the register is tagged as a constant.  The constant tag will be retained if subsequent instructions modify the constant using other constants.  During assembly, such a sequence of instructions is combined into a single load.  For example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
LUI r8,12340000  --&amp;gt;  mov $0x12345678,%eax&lt;br /&gt;
ORI r8,r8,5678&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This optimization is not performed where a branch target intervenes, e.g.:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
...&lt;br /&gt;
 LUI r8,12340000  --&amp;gt;  mov $0x12340000,%eax&lt;br /&gt;
L1:&lt;br /&gt;
 ORI r8,r8,5678   --&amp;gt;  or $0x5678,%eax&lt;br /&gt;
 ...&lt;br /&gt;
 BEQ r0,r0,L1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Registers containing constants are identified by bits in the isconst and wasconst fields of the regstat structure.  The wasconst bit is set for a register if the register contained a known constant before the instruction, and the isconst bit is set for a register if the register will contain a known constant after the instruction executes.&lt;br /&gt;
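&lt;br /&gt;
The tag handling for the LUI/ORI case above can be sketched as follows.  This is a minimal model of the isconst bookkeeping only; the real regstat tracking covers many more instructions and the wasconst history:&lt;br /&gt;

```c
#include <assert.h>
#include <stdint.h>

static uint32_t isconst;        /* bit n set: value of register n is known */
static uint32_t constval[32];   /* the known value, when the bit is set */

/* LUI always produces a known constant. */
static void do_lui(int rt, uint16_t imm)
{
    constval[rt] = (uint32_t)imm << 16;
    isconst |= 1u << rt;
}

/* ORI folds into a constant only if its source is tagged as one;
 * otherwise the destination's tag must be dropped. */
static void do_ori(int rt, int rs, uint16_t imm)
{
    if (isconst & (1u << rs)) {
        constval[rt] = constval[rs] | imm;
        isconst |= 1u << rt;
    } else {
        isconst &= ~(1u << rt);
    }
}
```

At a branch target the tags are discarded (isconst cleared), which is why the optimization is suppressed in the labeled example above.&lt;br /&gt;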
&lt;br /&gt;
==Translation lookaside buffer emulation==&lt;br /&gt;
&lt;br /&gt;
Most Nintendo 64 games do not use virtual memory, but some do.  At startup, main memory is directly mapped at addresses from 0x80000000 to 0x803FFFFF, or up to 0x807FFFFF if the memory expansion is used.&lt;br /&gt;
&lt;br /&gt;
Normally, read or write operations are checked against this range, and if outside this range, control is passed to the appropriate I/O handler.  This is done as follows:&lt;br /&gt;
&lt;br /&gt;
MIPS instruction: LW r1,8(r2)&lt;br /&gt;
&lt;br /&gt;
ARM code:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add r1, r2, #8&lt;br /&gt;
cmp r1, #8388608&lt;br /&gt;
bvc handler&lt;br /&gt;
ldr r1, [r1]&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If there are valid entries in the TLB, this would instead be compiled as follows:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add r1, r2, #8&lt;br /&gt;
mov r0, #264&lt;br /&gt;
add r0, r0, r1, lsr #12&lt;br /&gt;
ldr r0, [r11, r0, lsl #2]&lt;br /&gt;
tst r0, r0&lt;br /&gt;
bmi handler&lt;br /&gt;
ldr r1, [r1, r0, lsl #2]&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This looks up the offset in the memory_map table (which, in this example, is located at r11+264*4).  The high bit is tested to determine whether a valid mapping exists for this page.&lt;br /&gt;
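&lt;br /&gt;
A simplified C model of this lookup is shown below.  For clarity it stores a host pointer per 4K page, with NULL marking unmapped pages, instead of the shifted offset (tested via its sign bit) that the generated code adds directly to the virtual address:&lt;br /&gt;

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One entry per 4K virtual page; NULL means no valid mapping. */
static uint32_t *memory_map_model[1 << 20];

static uint32_t *translate(uint32_t vaddr)
{
    uint32_t *page = memory_map_model[vaddr >> 12];
    if (page == NULL)
        return NULL;             /* would branch to the I/O or TLB handler */
    return page + ((vaddr & 0xFFF) >> 2);   /* word within the page */
}
```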
&lt;br /&gt;
==Page fault emulation==&lt;br /&gt;
&lt;br /&gt;
If a memory address references an invalid page, a conditional branch is taken to a bit of code placed after the main block.  This will save any modified cached registers, and call an appropriate handler function for the address.  The handler will either perform I/O, or generate an exception (pagefault).&lt;br /&gt;
&lt;br /&gt;
==Mapping executable pages==&lt;br /&gt;
&lt;br /&gt;
If the dynamic recompiler encounters code which is not in contiguous physical memory, it will end the block at the page boundary, so that the block can be removed cleanly if the mapping is changed.&lt;br /&gt;
&lt;br /&gt;
There is a special case of this, where a branch and delay slot span two pages.  If a branch instruction is the last instruction in a virtual memory page, it is compiled in a different manner than other branches.  The branch condition is evaluated, and the target address is placed in a register (%ebp on x86, and r8 on ARM).  A special form of the dynamic linker (dyna_linker_ds) is used to link the branch to its corresponding delay slot in another block.  If no page fault occurs, the delay slot executes and then jumps to the address in the register.  For conditional branches that are not taken, the target address is the next instruction.  This code is generated by the pagespan_assemble function.&lt;br /&gt;
&lt;br /&gt;
== Compile options ==&lt;br /&gt;
&lt;br /&gt;
=== ARMv5_ONLY ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the UXTH instruction is not used, and the dynamic recompiler will generate literal pools instead of using movw/movt.  This provides compatibility with older processors, but generates somewhat less efficient code.&lt;br /&gt;
&lt;br /&gt;
=== CORTEX_A8_BRANCH_PREDICTION_HACK ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the dynamic recompiler will avoid generating consecutive branch instructions without another instruction in between.  This avoids a possible branch misprediction on the Cortex-A8 due to this processor having dual instruction decoders, but only one branch-prediction unit.  See [[Assembly Code Optimization]] for details.&lt;br /&gt;
&lt;br /&gt;
=== USE_MINI_HT ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, attempt to look up return addresses in a small hash table before checking the larger hash table.  Usually improves performance.&lt;br /&gt;
&lt;br /&gt;
=== IMM_PREFETCH ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the x86 PREFETCH instruction is used to prefetch entries from the hash table.  The increase in code size often outweighs the benefit of this.&lt;br /&gt;
&lt;br /&gt;
=== REG_PREFETCH ===&lt;br /&gt;
&lt;br /&gt;
Similar to the above, but loads the address into a register first, then uses the ARM PLD instruction.  The increase in code size almost always outweighs the benefit of this.&lt;br /&gt;
&lt;br /&gt;
=== R29_HACK ===&lt;br /&gt;
&lt;br /&gt;
Assume that the stack pointer (r29) is always a valid memory address and do not check it.  It is similar to the optimization described [http://strmnnrmn.blogspot.com/2007/08/interesting-dynarec-hack.html here].  This can crash the emulator and is not enabled by default.&lt;br /&gt;
&lt;br /&gt;
== Debugging ==&lt;br /&gt;
&lt;br /&gt;
Debugging information can be obtained by defining the assem_debug macro as printf.  This will cause the dynamic recompiler to print debugging information to stdout.  For each disassembled MIPS instruction, an entry similar to the following will be printed:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
U: r1 r8 r11 r16 r31 UU: r29 32: r0 r9&lt;br /&gt;
pre: eax=-1 ecx=9 edx=-1 ebx=-1 ebp=29 esi=36 edi=-1&lt;br /&gt;
needs: ecx ebp esi r: r9&lt;br /&gt;
entry: eax=-1 ecx=9 edx=-1 ebx=-1 ebp=29 esi=36 edi=-1&lt;br /&gt;
dirty: ecx ebp esi &lt;br /&gt;
  800001d8: LW r16,r29+14&lt;br /&gt;
eax=16 ecx=9 edx=-1 ebx=-1 ebp=29 esi=36 edi=-1 dirty: eax ecx ebp esi &lt;br /&gt;
 32: r0 r9 r16&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
U: A list of MIPS registers which will not be used before they are overwritten (liveness analysis)&lt;br /&gt;
&lt;br /&gt;
UU: A list of MIPS registers for which the upper 32 bits will not be used before they are overwritten&lt;br /&gt;
&lt;br /&gt;
32: Registers that contain 32-bit sign-extended values&lt;br /&gt;
&lt;br /&gt;
pre: The state of the register mapping prior to execution of this instruction.  (-1 = no mapping; 36 = cycle count; The complete list of values with special meanings can be found in the source code)&lt;br /&gt;
&lt;br /&gt;
needs: a list of register mappings that were considered necessary and which could not be eliminated to make room for other mappings&lt;br /&gt;
&lt;br /&gt;
r: Registers that are known to contain 32-bit values and where optimizations rely on the assumption that the register does not contain a value outside of the range -2&amp;lt;sup&amp;gt;31&amp;lt;/sup&amp;gt; to 2&amp;lt;sup&amp;gt;31&amp;lt;/sup&amp;gt;-1&lt;br /&gt;
&lt;br /&gt;
entry: The minimum set of register mappings required to jump to this point&lt;br /&gt;
&lt;br /&gt;
dirty: Cached registers that have been modified and will need to be written back&lt;br /&gt;
&lt;br /&gt;
address: instruction - The decoded opcode, followed by the register mapping in effect after this instruction executes&lt;br /&gt;
&lt;br /&gt;
An asterisk (*) designates locations which are the target of a branch instruction.  Constant propagation will not be performed across these points.&lt;br /&gt;
&lt;br /&gt;
After the complete disassembly, the recompiled native code is shown.&lt;br /&gt;
&lt;br /&gt;
Note that the output can be quite voluminous; 20-30 MB is typical.&lt;br /&gt;
&lt;br /&gt;
==Potential improvements==&lt;br /&gt;
&lt;br /&gt;
===Copy propagation/offset propagation===&lt;br /&gt;
&lt;br /&gt;
A common instruction sequence is of the form:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
LUI r9,12340000&lt;br /&gt;
ADD r9,r9,r8&lt;br /&gt;
LW r9,5678(r9)&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
It would be helpful to recognize this as a load from r8+12345678.  The current constant propagation code does not do so.&lt;br /&gt;
&lt;br /&gt;
===Unaligned memory access===&lt;br /&gt;
&lt;br /&gt;
A small improvement could be made by combining adjacent LWL/LWR instructions.  The potential gain from doing so is very limited because these instructions typically represent less than 1% of all memory accesses.&lt;br /&gt;
&lt;br /&gt;
===SLT/branch merging===&lt;br /&gt;
&lt;br /&gt;
A frequent occurrence in MIPS code is an SLT or SLTI instruction followed by a branch.  This is generated relatively inefficiently on x86 and ARM, first doing a compare and set, then testing this value and doing a conditional branch.  Doing only one comparison would save at least one instruction, and could potentially save up to three instructions if the liveness analysis reveals that the result of the SLT instruction is used nowhere else.&lt;br /&gt;
&lt;br /&gt;
While potentially useful, this optimization faces several problems.  First, there are often additional instructions between the SLT and the branch.  These must be checked to make sure they do not modify the relevant registers, as that would prevent reordering the instruction stream as desired.  Secondly, if the result of the SLT is found to be live, but unmodified, on both paths of the branch, clean_registers will normally write this value back before the branch, to avoid duplicating the writeback code on both paths.  That writeback would have to be suppressed if the SLT were combined with the branch.&lt;br /&gt;
&lt;br /&gt;
===x86-64===&lt;br /&gt;
&lt;br /&gt;
Currently the x86-64 backend generates only 32-bit instructions.  Proper 64-bit code generation would improve performance.&lt;br /&gt;
&lt;br /&gt;
===PowerPC===&lt;br /&gt;
&lt;br /&gt;
It would be possible to add a PowerPC code generator to the dynamic recompiler.  Currently no one is working on this.  (The mupen64gc project is using a different codebase.)&lt;br /&gt;
&lt;br /&gt;
The following is a summary of the changes which would be necessary to add a PowerPC backend.&lt;br /&gt;
&lt;br /&gt;
The slt* instructions use conditional moves, which are unavailable on PowerPC.  A suitable alternative (such as moves from the condition register) would need to be used.&lt;br /&gt;
&lt;br /&gt;
The assembler can generate as much as 256K of code in a single block; however, conditional branches on PowerPC are limited to a +/-32K displacement (a 16-bit signed byte offset).  It will be necessary to either restrict the block size, or insert jump tables in a manner similar to the literal pools on ARM.&lt;br /&gt;
&lt;br /&gt;
PowerPC generally relies on early branch resolution rather than statistical branch prediction.  Scheduling branch condition tests earlier may be advantageous.  (For example, the address bounds check could be done in address_generation, rather than just before the load or store.  Similarly it may be advantageous to test the branch condition and update the cycle count before executing the delay slot.)&lt;br /&gt;
&lt;br /&gt;
===MIPS===&lt;br /&gt;
&lt;br /&gt;
Recompiling MIPS into MIPS would be relatively straightforward; however, the current code generator has no facility for filling delay slots, a capability that would be required for efficient code generation.&lt;br /&gt;
&lt;br /&gt;
[[Category:Development]]&lt;/div&gt;</summary>
		<author><name>Ari64</name></author>
		
	</entry>
	<entry>
		<id>https://pandorawiki.org/index.php?title=Assembly_Code_Optimization&amp;diff=4123</id>
		<title>Assembly Code Optimization</title>
		<link rel="alternate" type="text/html" href="https://pandorawiki.org/index.php?title=Assembly_Code_Optimization&amp;diff=4123"/>
		<updated>2010-11-12T12:58:42Z</updated>

		<summary type="html">&lt;p&gt;Ari64: /* Instruction pairing restrictions following a branch */ Remove somewhat inaccurate statement&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Assembly code optimization on the Cortex-A8 ==&lt;br /&gt;
This guide presents specific optimization techniques for the ARM Cortex-A8 processor and its dual-issue, in-order pipeline.&lt;br /&gt;
&lt;br /&gt;
== Use the ARMv7 movw/movt instructions ==&lt;br /&gt;
Newer ARM processors allow loading 32-bit values as two 16-bit immediates.  The movw instruction loads the lower 16 bits, and movt loads the upper 16 bits.  The movw instruction clears the upper 16 bits, so that 16-bit values can be loaded using a single instruction.  The movt instruction does not affect the lower bits.&lt;br /&gt;
&lt;br /&gt;
On older ARM processors, it was common to load 32-bit values with a PC-relative load.  This should be avoided because it may result in a cache miss.&lt;br /&gt;
&lt;br /&gt;
== Branch Prediction ==&lt;br /&gt;
There is a one-cycle stall in instruction fetch when a branch is predicted taken.  It is therefore preferable to structure code so that the most likely code path is the one where the branch is not taken.&lt;br /&gt;
&lt;br /&gt;
Unconditional branches may be mispredicted, and load instructions which follow branches may be decoded and cause cache accesses.  Avoid placing load instructions after branches unless you intend the CPU to prefetch the addresses they reference.&lt;br /&gt;
&lt;br /&gt;
The branch predictor has a call/return stack for jumps which reference r14 or stack operations using r13 which load the program counter.  For best performance, make sure any pop, mov pc,lr or bx lr jumps to the same location as was set by the matching bl instruction.  For non-return jumps, use a register other than r14.&lt;br /&gt;
&lt;br /&gt;
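As an illustration (the labels are hypothetical), a call/return pair that keeps the predictor's return stack consistent:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
    bl   subroutine     @ sets r14 and pushes a return-stack entry&lt;br /&gt;
    b    done           @ execution resumes here after the return&lt;br /&gt;
subroutine:&lt;br /&gt;
    bx   lr             @ return address predicted from the matching entry&lt;br /&gt;
done:&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;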
Although instructions are decoded in pairs, the branch predictor can only predict one branch target per cycle.  If you have a conditional branch which is immediately followed by another branch, and the first branch is likely to be taken, place a no-operation instruction between the branches to prevent decoding and prediction of the second branch.  Inserting NOPs is detrimental if the first branch is infrequently taken.&lt;br /&gt;
&lt;br /&gt;
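For example, assuming the first branch is usually taken (labels are illustrative):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
    cmp  r0, #0&lt;br /&gt;
    beq  common_case    @ usually taken&lt;br /&gt;
    nop                 @ keeps the next branch out of the same decode pair&lt;br /&gt;
    b    rare_case&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;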
== Dual-Issue Restrictions ==&lt;br /&gt;
Only one branch instruction can issue per cycle.  Only one load or store instruction can issue per cycle.  Instructions which write to the same register cannot issue together.&lt;br /&gt;
&lt;br /&gt;
There is a one-cycle delay before any written register can be used as the address of a subsequent load or store instruction.  There is a one-cycle delay before any written register can be used by an instruction which performs a shift or rotation on that register.&lt;br /&gt;
&lt;br /&gt;
There is a one-cycle delay before the result of a load can be used.  In the aforementioned cases involving subsequent shift or rotate instructions, or the address of a load or store instruction, the total delay is two cycles.&lt;br /&gt;
&lt;br /&gt;
Stores occur at the end of the pipeline, so store instructions can issue in the same cycle as another instruction which writes the same register.  Similarly, conditional branches can issue in the same cycle as flag-setting instructions because the branch is not resolved until the following cycle.  In all other cases, instructions cannot issue together if the second instruction depends on the results of the first.&lt;br /&gt;
&lt;br /&gt;
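A sketch of hiding the load-use delay described above by placing an independent instruction between a load and its consumer:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
    ldr  r2, [r1]       @ r2 is not ready for one cycle&lt;br /&gt;
    add  r4, r4, #1     @ independent work fills the delay&lt;br /&gt;
    add  r3, r3, r2     @ uses r2 without stalling&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;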
== Instruction pairing restrictions following a branch ==&lt;br /&gt;
The Cortex-A8 processor normally fetches two instructions per cycle and the branch predictor tests each against the global history buffer.  However, there is only one entry in the branch target buffer for each pair of instructions.  Therefore, branch instructions, and certain instructions which resemble branches, may adversely affect branch prediction when present in pairs.  To reduce the risk of branch misprediction, avoid pairing branch instructions with the following:&lt;br /&gt;
&lt;br /&gt;
* Branch instructions&lt;br /&gt;
* Load instructions, whether or not they write r15&lt;br /&gt;
* Arithmetic/logic instructions which do not set flags and do not have an immediate value, whether or not they write r15&lt;br /&gt;
&lt;br /&gt;
Flag-setting instructions can be used to avoid this restriction.  Additionally, MOV instructions that do not have immediate values can be replaced with ADD, SUB, ORR, or EOR instructions using zero as an immediate value.&lt;br /&gt;
&lt;br /&gt;
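For example, a register-to-register MOV, which sets no flags and has no immediate, can be rewritten with a zero immediate so that it no longer falls under this restriction:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
    mov  r0, r1         @ no flags, no immediate: restricted after a branch&lt;br /&gt;
    orr  r0, r1, #0     @ equivalent move with an immediate: pairs freely&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;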
== Code alignment ==&lt;br /&gt;
Code alignment should be used with caution.  Alignment has the potential to increase performance, but may be detrimental in some circumstances.&lt;br /&gt;
&lt;br /&gt;
Because the instruction decoder always fetches 64-bit aligned words from the level-1 instruction cache, aligning code can improve instruction fetch and decode.  However, the global history buffer of the branch predictor is indexed by the low bits of the instruction address.  Excessive code alignment can result in a suboptimal distribution of entries in the branch history tables and increase branch misprediction.  Additionally, the speculative prefetch may retrieve and decode instructions used as padding, even if those instructions are never executed.  The types of instructions used as padding will affect the branch predictor as stated above, and this may affect the prefetch of code into the instruction cache.&lt;br /&gt;
&lt;br /&gt;
[[Category:Development]]&lt;/div&gt;</summary>
		<author><name>Ari64</name></author>
		
	</entry>
	<entry>
		<id>https://pandorawiki.org/index.php?title=Assembly_Code_Optimization&amp;diff=4106</id>
		<title>Assembly Code Optimization</title>
		<link rel="alternate" type="text/html" href="https://pandorawiki.org/index.php?title=Assembly_Code_Optimization&amp;diff=4106"/>
		<updated>2010-11-10T13:34:24Z</updated>

		<summary type="html">&lt;p&gt;Ari64: /* Code alignment */ Effect on branch predictor GHB&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Assembly code optimization on the Cortex-A8 ==&lt;br /&gt;
This guide presents specific optimization techniques for the ARM Cortex-A8 processor and its dual-issue, in-order pipeline.&lt;br /&gt;
&lt;br /&gt;
== Use the ARMv7 movw/movt instructions ==&lt;br /&gt;
Newer ARM processors allow loading 32-bit values as two 16-bit immediates.  The movw instruction loads the lower 16 bits, and movt loads the upper 16 bits.  The movw instruction clears the upper 16 bits, so that 16-bit values can be loaded using a single instruction.  The movt instruction does not affect the lower bits.&lt;br /&gt;
&lt;br /&gt;
On older ARM processors, it was common to load 32-bit values with a PC-relative load.  This should be avoided because it may result in a cache miss.&lt;br /&gt;
&lt;br /&gt;
== Branch Prediction ==&lt;br /&gt;
There is a one-cycle stall in instruction fetch when a branch is predicted taken.  It is therefore preferable to structure code so that the most likely code path is the one where the branch is not taken.&lt;br /&gt;
&lt;br /&gt;
Unconditional branches may be mispredicted, and load instructions which follow branches may be decoded and cause cache accesses.  Avoid placing load instructions after branches unless you intend the CPU to prefetch the addresses they reference.&lt;br /&gt;
&lt;br /&gt;
The branch predictor has a call/return stack for jumps which reference r14 or stack operations using r13 which load the program counter.  For best performance, make sure any pop, mov pc,lr or bx lr jumps to the same location as was set by the matching bl instruction.  For non-return jumps, use a register other than r14.&lt;br /&gt;
&lt;br /&gt;
Although instructions are decoded in pairs, the branch predictor can only predict one branch target per cycle.  If you have a conditional branch which is immediately followed by another branch, and the first branch is likely to be taken, place a no-operation instruction between the branches to prevent decoding and prediction of the second branch.  Inserting NOPs is detrimental if the first branch is infrequently taken.&lt;br /&gt;
&lt;br /&gt;
== Dual-Issue Restrictions ==&lt;br /&gt;
Only one branch instruction can issue per cycle.  Only one load or store instruction can issue per cycle.  Instructions which write to the same register cannot issue together.&lt;br /&gt;
&lt;br /&gt;
There is a one-cycle delay before any written register can be used as the address of a subsequent load or store instruction.  There is a one-cycle delay before any written register can be used by an instruction which performs a shift or rotation on that register.&lt;br /&gt;
&lt;br /&gt;
There is a one-cycle delay before the result of a load can be used.  In the aforementioned cases involving subsequent shift or rotate instructions, or the address of a load or store instruction, the total delay is two cycles.&lt;br /&gt;
&lt;br /&gt;
Stores occur at the end of the pipeline, so store instructions can issue in the same cycle as another instruction which writes the same register.  Similarly, conditional branches can issue in the same cycle as flag-setting instructions because the branch is not resolved until the following cycle.  In all other cases, instructions cannot issue together if the second instruction depends on the results of the first.&lt;br /&gt;
&lt;br /&gt;
== Instruction pairing restrictions following a branch ==&lt;br /&gt;
The Cortex-A8 processor normally fetches two instructions per cycle and the branch predictor tests each against the global history buffer.  However, there is only one entry in the branch target buffer for each pair of instructions.  Therefore, branch instructions, and certain instructions which resemble branches, may adversely affect branch prediction when present in pairs.  To reduce the risk of branch misprediction, avoid pairing branch instructions with the following:&lt;br /&gt;
&lt;br /&gt;
* Branch instructions&lt;br /&gt;
* Load instructions, whether or not they write r15&lt;br /&gt;
* Arithmetic/logic instructions which do not set flags and do not have an immediate value, whether or not they write r15&lt;br /&gt;
&lt;br /&gt;
Flag-setting instructions can be used to avoid this restriction.  Additionally, MOV instructions that do not have immediate values can be replaced with ADD, SUB, ORR, or EOR instructions using zero as an immediate value.&lt;br /&gt;
&lt;br /&gt;
One exception to this rule is that placing such an instruction (e.g. mov r0,r0) after an unconditional branch may reduce the severity of a misprediction: the second instruction may be incorrectly predicted as a branch, but the correct target is then retrieved from the BTB.&lt;br /&gt;
&lt;br /&gt;
== Code alignment ==&lt;br /&gt;
Code alignment should be used with caution.  Alignment has the potential to increase performance, but may be detrimental in some circumstances.&lt;br /&gt;
&lt;br /&gt;
Because the instruction decoder always fetches 64-bit aligned words from the level-1 instruction cache, aligning code can improve instruction fetch and decode.  However, the global history buffer of the branch predictor is indexed by the low bits of the instruction address.  Excessive code alignment can result in a suboptimal distribution of entries in the branch history tables and increase branch misprediction.  Additionally, the speculative prefetch may retrieve and decode instructions used as padding, even if those instructions are never executed.  The types of instructions used as padding will affect the branch predictor as stated above, and this may affect the prefetch of code into the instruction cache.&lt;br /&gt;
&lt;br /&gt;
[[Category:Development]]&lt;/div&gt;</summary>
		<author><name>Ari64</name></author>
		
	</entry>
	<entry>
		<id>https://pandorawiki.org/index.php?title=Assembly_Code_Optimization&amp;diff=4105</id>
		<title>Assembly Code Optimization</title>
		<link rel="alternate" type="text/html" href="https://pandorawiki.org/index.php?title=Assembly_Code_Optimization&amp;diff=4105"/>
		<updated>2010-11-10T13:30:04Z</updated>

		<summary type="html">&lt;p&gt;Ari64: /* Code alignment */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Assembly code optimization on the Cortex-A8 ==&lt;br /&gt;
This guide presents specific optimization techniques for the ARM Cortex-A8 processor and its dual-issue, in-order pipeline.&lt;br /&gt;
&lt;br /&gt;
== Use the ARMv7 movw/movt instructions ==&lt;br /&gt;
Newer ARM processors allow loading 32-bit values as two 16-bit immediates.  The movw instruction loads the lower 16 bits, and movt loads the upper 16 bits.  The movw instruction clears the upper 16 bits, so that 16-bit values can be loaded using a single instruction.  The movt instruction does not affect the lower bits.&lt;br /&gt;
&lt;br /&gt;
On older ARM processors, it was common to load 32-bit values with a PC-relative load.  This should be avoided because it may result in a cache miss.&lt;br /&gt;
&lt;br /&gt;
== Branch Prediction ==&lt;br /&gt;
There is a one-cycle stall in instruction fetch when a branch is predicted taken.  It is therefore preferable to structure code so that the most likely code path is the one where the branch is not taken.&lt;br /&gt;
&lt;br /&gt;
Unconditional branches may be mispredicted, and load instructions which follow branches may be decoded and cause cache accesses.  Avoid placing load instructions after branches unless you intend the CPU to prefetch the addresses they reference.&lt;br /&gt;
&lt;br /&gt;
The branch predictor has a call/return stack for jumps which reference r14 or stack operations using r13 which load the program counter.  For best performance, make sure any pop, mov pc,lr or bx lr jumps to the same location as was set by the matching bl instruction.  For non-return jumps, use a register other than r14.&lt;br /&gt;
&lt;br /&gt;
Although instructions are decoded in pairs, the branch predictor can only predict one branch target per cycle.  If you have a conditional branch which is immediately followed by another branch, and the first branch is likely to be taken, place a no-operation instruction between the branches to prevent decoding and prediction of the second branch.  Inserting NOPs is detrimental if the first branch is infrequently taken.&lt;br /&gt;
&lt;br /&gt;
== Dual-Issue Restrictions ==&lt;br /&gt;
Only one branch instruction can issue per cycle.  Only one load or store instruction can issue per cycle.  Instructions which write to the same register cannot issue together.&lt;br /&gt;
&lt;br /&gt;
There is a one-cycle delay before any written register can be used as the address of a subsequent load or store instruction.  There is a one-cycle delay before any written register can be used by an instruction which performs a shift or rotation on that register.&lt;br /&gt;
&lt;br /&gt;
There is a one-cycle delay before the result of a load can be used.  In the aforementioned cases involving subsequent shift or rotate instructions, or the address of a load or store instruction, the total delay is two cycles.&lt;br /&gt;
&lt;br /&gt;
Stores occur at the end of the pipeline, so store instructions can issue in the same cycle as another instruction which writes the same register.  Similarly, conditional branches can issue in the same cycle as flag-setting instructions because the branch is not resolved until the following cycle.  In all other cases, instructions cannot issue together if the second instruction depends on the results of the first.&lt;br /&gt;
&lt;br /&gt;
== Instruction pairing restrictions following a branch ==&lt;br /&gt;
The Cortex-A8 processor normally fetches two instructions per cycle and the branch predictor tests each against the global history buffer.  However, there is only one entry in the branch target buffer for each pair of instructions.  Therefore, branch instructions, and certain instructions which resemble branches, may adversely affect branch prediction when present in pairs.  To reduce the risk of branch misprediction, avoid pairing branch instructions with the following:&lt;br /&gt;
&lt;br /&gt;
* Branch instructions&lt;br /&gt;
* Load instructions, whether or not they write r15&lt;br /&gt;
* Arithmetic/logic instructions which do not set flags and do not have an immediate value, whether or not they write r15&lt;br /&gt;
&lt;br /&gt;
Flag-setting instructions can be used to avoid this restriction.  Additionally, MOV instructions that do not have immediate values can be replaced with ADD, SUB, ORR, or EOR instructions using zero as an immediate value.&lt;br /&gt;
&lt;br /&gt;
One exception to this rule is that placing such an instruction (e.g. mov r0,r0) after an unconditional branch may reduce the severity of a misprediction: the second instruction may be incorrectly predicted as a branch, but the correct target is then retrieved from the BTB.&lt;br /&gt;
&lt;br /&gt;
== Code alignment ==&lt;br /&gt;
Code alignment should be used with caution.  Alignment has the potential to increase performance, but may be detrimental in some circumstances.&lt;br /&gt;
&lt;br /&gt;
Because the instruction decoder always fetches 64-bit aligned words from the level-1 instruction cache, aligning code can improve instruction fetch and decode.  However, excessive code alignment can result in a suboptimal distribution of entries in the branch history tables and increase branch misprediction.  Additionally, the speculative prefetch may retrieve and decode instructions used as padding, even if those instructions are never executed.  The types of instructions used as padding will affect the branch predictor as stated above, and this may affect the prefetch of code into the instruction cache.&lt;br /&gt;
&lt;br /&gt;
[[Category:Development]]&lt;/div&gt;</summary>
		<author><name>Ari64</name></author>
		
	</entry>
	<entry>
		<id>https://pandorawiki.org/index.php?title=Assembly_Code_Optimization&amp;diff=4103</id>
		<title>Assembly Code Optimization</title>
		<link rel="alternate" type="text/html" href="https://pandorawiki.org/index.php?title=Assembly_Code_Optimization&amp;diff=4103"/>
		<updated>2010-11-10T13:27:53Z</updated>

		<summary type="html">&lt;p&gt;Ari64: /* Code alignment */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Assembly code optimization on the Cortex-A8 ==&lt;br /&gt;
This guide presents specific optimization techniques for the ARM Cortex-A8 processor and its dual-issue, in-order pipeline.&lt;br /&gt;
&lt;br /&gt;
== Use the ARMv7 movw/movt instructions ==&lt;br /&gt;
Newer ARM processors allow loading 32-bit values as two 16-bit immediates.  The movw instruction loads the lower 16 bits, and movt loads the upper 16 bits.  The movw instruction clears the upper 16 bits, so that 16-bit values can be loaded using a single instruction.  The movt instruction does not affect the lower bits.&lt;br /&gt;
&lt;br /&gt;
On older ARM processors, it was common to load 32-bit values with a PC-relative load.  This should be avoided because it may result in a cache miss.&lt;br /&gt;
&lt;br /&gt;
== Branch Prediction ==&lt;br /&gt;
There is a one-cycle stall in instruction fetch when a branch is predicted taken.  It is therefore preferable to structure code so that the most likely code path is the one where the branch is not taken.&lt;br /&gt;
&lt;br /&gt;
Unconditional branches may be mispredicted, and load instructions which follow branches may be decoded and cause cache accesses.  Avoid placing load instructions after branches unless you intend the CPU to prefetch the addresses they reference.&lt;br /&gt;
&lt;br /&gt;
The branch predictor has a call/return stack for jumps which reference r14 or stack operations using r13 which load the program counter.  For best performance, make sure any pop, mov pc,lr or bx lr jumps to the same location as was set by the matching bl instruction.  For non-return jumps, use a register other than r14.&lt;br /&gt;
&lt;br /&gt;
Although instructions are decoded in pairs, the branch predictor can only predict one branch target per cycle.  If you have a conditional branch which is immediately followed by another branch, and the first branch is likely to be taken, place a no-operation instruction between the branches to prevent decoding and prediction of the second branch.  Inserting NOPs is detrimental if the first branch is infrequently taken.&lt;br /&gt;
&lt;br /&gt;
== Dual-Issue Restrictions ==&lt;br /&gt;
Only one branch instruction can issue per cycle.  Only one load or store instruction can issue per cycle.  Instructions which write to the same register cannot issue together.&lt;br /&gt;
&lt;br /&gt;
There is a one-cycle delay before any written register can be used as the address of a subsequent load or store instruction.  There is a one-cycle delay before any written register can be used by an instruction which performs a shift or rotation on that register.&lt;br /&gt;
&lt;br /&gt;
There is a one-cycle delay before the result of a load can be used.  In the aforementioned cases involving subsequent shift or rotate instructions, or the address of a load or store instruction, the total delay is two cycles.&lt;br /&gt;
&lt;br /&gt;
Stores occur at the end of the pipeline, so store instructions can issue in the same cycle as another instruction which writes the same register.  Similarly, conditional branches can issue in the same cycle as flag-setting instructions because the branch is not resolved until the following cycle.  In all other cases, instructions cannot issue together if the second instruction depends on the results of the first.&lt;br /&gt;
&lt;br /&gt;
== Instruction pairing restrictions following a branch ==&lt;br /&gt;
The Cortex-A8 processor normally fetches two instructions per cycle and the branch predictor tests each against the global history buffer.  However, there is only one entry in the branch target buffer for each pair of instructions.  Therefore, branch instructions, and certain instructions which resemble branches, may adversely affect branch prediction when present in pairs.  To reduce the risk of branch misprediction, avoid pairing branch instructions with the following:&lt;br /&gt;
&lt;br /&gt;
* Branch instructions&lt;br /&gt;
* Load instructions, whether or not they write r15&lt;br /&gt;
* Arithmetic/logic instructions which do not set flags and do not have an immediate value, whether or not they write r15&lt;br /&gt;
&lt;br /&gt;
Flag-setting instructions can be used to avoid this restriction.  Additionally, MOV instructions that do not have immediate values can be replaced with ADD, SUB, ORR, or EOR instructions using zero as an immediate value.&lt;br /&gt;
&lt;br /&gt;
One exception to this rule is that placing such an instruction (e.g. mov r0,r0) after an unconditional branch may reduce the severity of a misprediction: the second instruction may be incorrectly predicted as a branch, but the correct target is then retrieved from the BTB.&lt;br /&gt;
&lt;br /&gt;
== Code alignment ==&lt;br /&gt;
Code alignment should be used with caution.  Alignment has the potential to increase performance, but may be detrimental in some circumstances.&lt;br /&gt;
&lt;br /&gt;
Because the instruction decoder always fetches 64-bit aligned words from the level-1 instruction cache, aligning code can improve instruction fetch and decode.  However, excessive code alignment can result in a suboptimal distribution of entries in the branch history tables and increase branch misprediction.  Additionally, the speculative prefetch may retrieve and decode instructions used as padding, even if those instructions are never executed.  The types of instructions used as padding will affect the branch predictor as stated above, and this may reduce the prefetch of code into the instruction cache.&lt;br /&gt;
&lt;br /&gt;
[[Category:Development]]&lt;/div&gt;</summary>
		<author><name>Ari64</name></author>
		
	</entry>
	<entry>
		<id>https://pandorawiki.org/index.php?title=Mupen64plus_dynamic_recompiler&amp;diff=4102</id>
		<title>Mupen64plus dynamic recompiler</title>
		<link rel="alternate" type="text/html" href="https://pandorawiki.org/index.php?title=Mupen64plus_dynamic_recompiler&amp;diff=4102"/>
		<updated>2010-11-10T13:13:51Z</updated>

		<summary type="html">&lt;p&gt;Ari64: Debugging&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes the dynamic recompiler in Mupen64plus, and the changes made for the ARM port.&lt;br /&gt;
&lt;br /&gt;
==The original dynamic recompiler by Hacktarux==&lt;br /&gt;
&lt;br /&gt;
The dynamic recompiler used in Mupen64plus v1.5 is based on the original written by Hacktarux in 2002.&lt;br /&gt;
&lt;br /&gt;
It recompiles contiguous blocks of MIPS instructions.  First, each instruction is decoded into a dynamically-allocated 132-byte data structure.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
typedef struct _precomp_instr&lt;br /&gt;
{&lt;br /&gt;
   void (*ops)();&lt;br /&gt;
   union&lt;br /&gt;
     {&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         long long int *rs;&lt;br /&gt;
         long long int *rt;&lt;br /&gt;
         short immediate;&lt;br /&gt;
      } i;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned int inst_index;&lt;br /&gt;
      } j;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         long long int *rs;&lt;br /&gt;
         long long int *rt;&lt;br /&gt;
         long long int *rd;&lt;br /&gt;
         unsigned char sa;&lt;br /&gt;
         unsigned char nrd;&lt;br /&gt;
      } r;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned char base;&lt;br /&gt;
         unsigned char ft;&lt;br /&gt;
         short offset;&lt;br /&gt;
      } lf;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned char ft;&lt;br /&gt;
         unsigned char fs;&lt;br /&gt;
         unsigned char fd;&lt;br /&gt;
      } cf;&lt;br /&gt;
     } f;&lt;br /&gt;
   unsigned int addr; /* word-aligned instruction address in r4300 address space */&lt;br /&gt;
   unsigned int local_addr; /* byte offset to start of corresponding x86_64 instructions, from start of code block */&lt;br /&gt;
   reg_cache_struct reg_cache_infos;&lt;br /&gt;
} precomp_instr;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The decoded instructions are then compiled, generating x86 instructions for each MIPS instruction.  A 32K block is allocated with malloc() to hold the x86 code.  If this size proves insufficient, it is incrementally resized with realloc().&lt;br /&gt;
&lt;br /&gt;
MIPS registers are allocated to x86 registers with a least-recently-used replacement policy.  All cached registers are written back prior to a branch, or a memory read or write.&lt;br /&gt;
&lt;br /&gt;
To facilitate invalidation and replacement of modified code blocks, each 4K page is compiled separately.  If a sequence of instructions crosses a 4K page boundary, the current block is ended, and the next instructions will be compiled as a separate block.&lt;br /&gt;
&lt;br /&gt;
Branch instructions within a 4K page are compiled as branches directly to the target address.  Branches which cross a 4K page go through an indirect address lookup.&lt;br /&gt;
&lt;br /&gt;
Compiled code blocks are invalidated on write.  On an actual MIPS CPU, the instruction cache is invalidated using the CACHE instruction; however, a few N64 games clear the cache using other methods.  Trapping writes appears to be the most reliable method of ensuring cache coherency in the emulated system, so the CACHE instruction is ignored.&lt;br /&gt;
&lt;br /&gt;
==Problems with the original design==&lt;br /&gt;
&lt;br /&gt;
The most significant performance problem with this design is its excessive memory usage.  The decoded instruction data is retained (132 bytes for each MIPS instruction) and occasionally referenced during execution.  Memory accesses frequently miss the L2 cache, resulting in poor performance.&lt;br /&gt;
&lt;br /&gt;
Additionally, the register cache is relatively inefficient, since all registers are flushed before any read or write operation.&lt;br /&gt;
&lt;br /&gt;
==A new approach==&lt;br /&gt;
&lt;br /&gt;
To reduce memory usage, the new design allocates a single large block of memory (currently 16 MiB) which is used for recompiled code.  This memory is allocated using mmap with the PROT_EXEC bit set, to ensure operation on CPUs with no-execute (NX) page permissions.&lt;br /&gt;
&lt;br /&gt;
Contiguous blocks of MIPS instructions are compiled; that is, the recompiler does not attempt to follow branches or 'hot paths'.&lt;br /&gt;
&lt;br /&gt;
The recompiler consists of eight stages, plus a linker, and a memory manager.&lt;br /&gt;
&lt;br /&gt;
Compiled blocks are invalidated on write, as before; however, since the compiler will now cross 4K page boundaries, writes may invalidate adjacent pages as well.&lt;br /&gt;
&lt;br /&gt;
Currently the dynarec generates x86, x86-64, ARMv5, and ARMv7 (little-endian).  Most of the code is shared between the architectures, but a different code generator is included at compile time, using different #include statements depending on the CPU type.&lt;br /&gt;
&lt;br /&gt;
==Pass 1: Disassembly==&lt;br /&gt;
&lt;br /&gt;
When an instruction address is encountered which has not been compiled, the function new_recompile_block is called, with the (virtual) address of the target instruction as its sole parameter.  If the address is invalid, the function returns a nonzero value and the caller is responsible for handling the page fault.&lt;br /&gt;
&lt;br /&gt;
Instructions are decoded until an unconditional jump, usually a return, is encountered.  Disassembly ordinarily continues past a JAL (subroutine call) instruction; however, this strategy is abandoned if invalid instructions are encountered.&lt;br /&gt;
&lt;br /&gt;
Up to 4096 instructions (16K of MIPS code) may be disassembled at once.  Surprisingly, some games do actually reach this limit.&lt;br /&gt;
&lt;br /&gt;
==Pass 2: Liveness analysis==&lt;br /&gt;
&lt;br /&gt;
After disassembly, liveness analysis is performed on the registers.  This determines when a particular register will no longer be used, and thus can be removed from the register cache.&lt;br /&gt;
&lt;br /&gt;
A separate analysis is done on the upper 32 bits of 64-bit registers.  This can determine when only the lower 32 bits of a register are significant, thus allowing use of a 32-bit register.  This enables more efficient code generation on 32-bit processors.&lt;br /&gt;
&lt;br /&gt;
==Pass 3: Register allocation==&lt;br /&gt;
&lt;br /&gt;
The 31 MIPS registers must be mapped onto seven available registers on x86, or twelve available registers on ARM.&lt;br /&gt;
&lt;br /&gt;
Instructions with 64-bit results require two registers on 32-bit host processors.  32-bit instructions require only one.  A flag is set for registers containing 32-bit values, and these registers will be sign-extended before they are written to the register file.&lt;br /&gt;
&lt;br /&gt;
Registers are made available using a least-soon-needed replacement policy.  When the register cache is full, and no registers can be eliminated using the liveness analysis, a ten-instruction lookahead is used to determine which registers will not be needed soon and these registers are evicted from the cache.&lt;br /&gt;
&lt;br /&gt;
==Pass 4: Free unused registers==&lt;br /&gt;
&lt;br /&gt;
After the initial register allocation, the allocations are reviewed to determine if any registers remain allocated longer than necessary.  These are then removed from the register cache.  This avoids having to write back a large number of registers just before a branch, and makes more registers available for the next pass.&lt;br /&gt;
&lt;br /&gt;
==Pass 5: Pre-allocate registers==&lt;br /&gt;
&lt;br /&gt;
If a register will be used soon and needs to be loaded, try to load it early if the register is available.  This improves execution on CPUs with a load-use penalty.&lt;br /&gt;
&lt;br /&gt;
==Pass 6: Optimize clean/dirty state==&lt;br /&gt;
&lt;br /&gt;
If a cached register is 'dirty' and needs to be written out, try to do so as soon as the register will no longer be modified.  This avoids having to write the same register on multiple code paths due to conditional branches.  Additionally, try to avoid writing out dirty registers inside of loops.&lt;br /&gt;
&lt;br /&gt;
==Pass 7: Identify where registers are assumed to be 32-bit==&lt;br /&gt;
&lt;br /&gt;
When a 64-bit register is mapped to a 32-bit register with the assumption that the value will be sign-extended before being used, it is necessary to ensure that no register contains a 64-bit value when branching to such a location.  These instructions are flagged to identify them as requiring 32-bit inputs.  This information is used by the linker, and the exception return (ERET) handler.&lt;br /&gt;
&lt;br /&gt;
==Pass 8: Assembly==&lt;br /&gt;
&lt;br /&gt;
This generates and outputs the recompiled code.&lt;br /&gt;
&lt;br /&gt;
Following the main code block, handlers for certain exceptions as well as alternate entry points are added.&lt;br /&gt;
&lt;br /&gt;
If a recompiled instruction relies on a certain MIPS register being cached in a certain native register, then a short 'stub' of code is generated to load the necessary registers.  When an instruction outside of this block needs to jump to that location, it will instead jump to the stub.  The necessary registers will be loaded, and then it will jump into the main code sequence.&lt;br /&gt;
&lt;br /&gt;
On architectures which require literal pools (ARMv5), these are inserted as necessary.&lt;br /&gt;
&lt;br /&gt;
==Linker==&lt;br /&gt;
&lt;br /&gt;
The linker fills in all unresolved branches.  Branches within the block are linked to their target address.  Branches which jump outside of the block are linked to their target if that address has been compiled already. These inter-block branches are recorded in the jump_out array.  This information will be used to remove the links in the event that the target of the branch is invalidated.&lt;br /&gt;
&lt;br /&gt;
Unresolved branches point to a stub which loads the address of the branch instruction and the virtual address of its target into registers, and calls the dynamic linker.  When this code is executed, the dynamic linker will compile the target if necessary, and then patch the branch instruction with the new address.&lt;br /&gt;
&lt;br /&gt;
==Memory manager==&lt;br /&gt;
&lt;br /&gt;
The last step in new_recompile_block is to ensure that there will be sufficient memory available to compile the next block.  If there is not, then the oldest blocks are purged.&lt;br /&gt;
&lt;br /&gt;
The dynarec cache can be described as a 16MB circular buffer divided into eight segments.  Memory is allocated in order from beginning to end.&lt;br /&gt;
&lt;br /&gt;
When fewer than two empty segments remain, the next segment in sequence is cleared, wrapping around from the end of the buffer to the beginning.  This continues as memory is needed.&lt;br /&gt;
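&lt;br /&gt;
A minimal sketch of this allocation scheme, assuming eight 2MB segments and illustrative names (the real code must also invalidate the blocks in a cleared segment and unlink branches pointing into it):&lt;br /&gt;

```c
#include <assert.h>

#define CACHE_SIZE (16u << 20)               /* 16MB circular buffer */
#define SEGMENTS   8u
#define SEG_SIZE   (CACHE_SIZE / SEGMENTS)   /* 2MB per segment */

/* Given the current allocation offset into the cache, return the segment
 * that must be cleared so that at least two empty segments remain ahead
 * of the allocation pointer, wrapping around at the end of the buffer. */
static unsigned segment_to_clear(unsigned alloc_ptr)
{
    unsigned current = alloc_ptr / SEG_SIZE;
    return (current + 2) % SEGMENTS;
}
```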
&lt;br /&gt;
==Invalidation and restoration==&lt;br /&gt;
&lt;br /&gt;
Normally, code blocks are looked up via a hash table, using the virtual address of the target instruction.  However, for purposes of invalidation, blocks are grouped by physical address.  This can be described as a virtually-indexed, physically-tagged (VIPT) cache.&lt;br /&gt;
&lt;br /&gt;
References to compiled blocks are stored in one of 4096 linked lists in the jump_in array.  Each list covers a 4K block of memory, and 2048 such lists are sufficient to cover the 8MB of RAM in the Nintendo 64.  The remaining lists are for code in ROM, and the bootloader in SP memory.&lt;br /&gt;
&lt;br /&gt;
When a write hits a memory page marked as cached, all entries in the corresponding list are invalidated.  If any code is found to cross a 4K boundary, the adjacent lists are invalidated also.&lt;br /&gt;
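&lt;br /&gt;
A toy model of the page-marking logic (illustrative names; the real code walks linked lists of blocks rather than a flag array):&lt;br /&gt;

```c
#include <assert.h>

#define PAGES 4096

/* One flag per 4K page.  A block whose code crosses a 4K boundary marks
 * every page it touches, matching the rule that adjacent lists are
 * invalidated when code spans a page boundary. */
static void invalidate_block_pages(unsigned char invalid[PAGES],
                                   unsigned start, unsigned end)
{
    for (unsigned page = start >> 12; page <= end >> 12; page++)
        invalid[page & (PAGES - 1)] = 1;
}
```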
&lt;br /&gt;
Sometimes blocks may be invalidated even when none of the code is actually modified.  This can happen if data is written to memory in the same 4K page, or if code is reloaded without actually modifying it.  If blocks which were previously invalidated are subsequently found to be unmodified, those blocks are marked in the restore_candidate array.  If the block remains unmodified, it will be restored as a valid block, to avoid recompiling blocks which do not need to be recompiled.  This is performed by the clean_blocks function which is called periodically.&lt;br /&gt;
&lt;br /&gt;
The jump_in array, which lists unmodified blocks, is physically indexed; however, the jump_dirty array, which lists potentially-modified blocks, is virtually indexed.  This allows blocks which have changed physical addresses, but are still at the same virtual address, to be recognized as being the same as before, and not in need of recompilation.&lt;br /&gt;
&lt;br /&gt;
==Dynamic linker==&lt;br /&gt;
&lt;br /&gt;
Branches with unresolved addresses jump to the dynamic linker.  This will look through the jump_in list corresponding to the physical page containing the virtual target address.  If found, the branch instruction will be patched with the address, and then it will jump to this address.&lt;br /&gt;
&lt;br /&gt;
If not found, the jump_dirty list will be searched for blocks which were previously compiled but may have been modified.  If a potential match is found, the code will be compared against a cached copy to determine if any changes have been made.  If not, then it will jump to the block.  Because the memory could be modified again, branch instructions referencing these blocks are not altered, and continue to point to the dynamic linker.  These blocks will continue to be verified each time they are called, until restored to the jump_in list by the clean_blocks function described above.&lt;br /&gt;
&lt;br /&gt;
If no compiled block is found, or the existing block was modified, the target is recompiled.&lt;br /&gt;
&lt;br /&gt;
==Address lookup==&lt;br /&gt;
&lt;br /&gt;
When a JR (jump register) instruction is encountered, the address of the recompiled code must be looked up using the address of the MIPS code.  The majority of such instructions jump to the link register (r31) to return to the location following a JAL (call) instruction.&lt;br /&gt;
&lt;br /&gt;
When a JAL or JALR is executed, the address of the following instruction is inserted into a small 32-entry hash table, which is checked when a JR r31 instruction is executed.  This allows for a quick return from subroutine calls.&lt;br /&gt;
&lt;br /&gt;
If the JR instruction uses a register other than r31, or the small hash table lookup fails to find a match, a larger 131072-entry hash table is checked.  This table contains 65536 bins with up to 2 addresses per bin.  If this also fails to find a match (which occurs less than 1% of the time) an exhaustive search of all compiled addresses within that 4K memory page is performed.&lt;br /&gt;
&lt;br /&gt;
If no match is found by any of these methods, the target address is compiled, and the new address is inserted into the hash table.&lt;br /&gt;
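&lt;br /&gt;
The two-level lookup can be sketched as follows; the hash functions and names here are illustrative, not the ones used by the recompiler:&lt;br /&gt;

```c
#include <assert.h>
#include <stddef.h>

#define MINI_SIZE 32      /* small return-address table */
#define BINS      65536   /* large table: 65536 bins, 2 slots per bin */

struct ht_entry { unsigned vaddr; void *code; };

static struct ht_entry mini_ht[MINI_SIZE];
static struct ht_entry hash_table[BINS][2];

static unsigned hash(unsigned vaddr)
{
    return (vaddr ^ (vaddr >> 16)) & (BINS - 1);   /* illustrative hash */
}

/* Check the small table first (quick subroutine returns), then the large
 * table; NULL means fall back to the exhaustive page search, and finally
 * to recompilation. */
static void *lookup(unsigned vaddr)
{
    struct ht_entry *m = &mini_ht[(vaddr >> 2) & (MINI_SIZE - 1)];
    if (m->vaddr == vaddr) return m->code;
    struct ht_entry *b = hash_table[hash(vaddr)];
    if (b[0].vaddr == vaddr) return b[0].code;
    if (b[1].vaddr == vaddr) return b[1].code;
    return NULL;
}
```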
&lt;br /&gt;
==Cycle counting==&lt;br /&gt;
&lt;br /&gt;
Cycles are counted before each branch by adding the cycles from the preceding instructions to a specific register.  The cycle count is kept in R10 on ARM and ESI on x86.  The value in this register is normally a negative number.  When this number becomes non-negative, a conditional branch is taken which jumps to an interrupt handler.&lt;br /&gt;
&lt;br /&gt;
For example, the following x86 code adds eight cycles:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add $8,%esi&lt;br /&gt;
jns interrupt_handler&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The conditional branch jumps to a small bit of code located after the main compiled block, which saves the cached registers to the register file, sets the instruction pointer which will be used upon return from the interrupt, and then calls cc_interrupt.&lt;br /&gt;
&lt;br /&gt;
As in the original mupen64plus, the emulated clock runs at 37.5 MHz, and each instruction takes 2 clock cycles.&lt;br /&gt;
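&lt;br /&gt;
A C model of the counting scheme (illustrative only):&lt;br /&gt;

```c
#include <assert.h>

/* The cycle counter runs as a negative number.  Each block adds its cycle
 * count before branching; the interrupt check fires when the count
 * becomes non-negative, which is what the jns in the x86 code tests. */
static int cycle_count;

static int add_cycles(int n)
{
    cycle_count += n;
    return cycle_count >= 0;   /* 1 = take the branch to the handler */
}
```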
&lt;br /&gt;
==Delay slots==&lt;br /&gt;
&lt;br /&gt;
MIPS has 'delay slots': the instruction following a branch is executed before the branch takes effect.  Instructions in delay slots are issued out-of-order in the recompiled code.&lt;br /&gt;
&lt;br /&gt;
[[Image:Recompiler delay slot reordering.png]]&lt;br /&gt;
&lt;br /&gt;
When a branch jumps into the delay slot of another branch, this case must be handled slightly differently:&lt;br /&gt;
&lt;br /&gt;
[[Image:Recompiler delay slot reordering 2.png]]&lt;br /&gt;
&lt;br /&gt;
The branch test and delay slot are executed in-order if a dependency exists, or for 'likely' branches where the delay slot is nullified when the branch condition is false.  These cases are infrequent (typically less than 10% of branches).&lt;br /&gt;
&lt;br /&gt;
==Constant propagation==&lt;br /&gt;
&lt;br /&gt;
When an instruction loads a constant into a register, the register is tagged as a constant.  The constant tag will be retained if subsequent instructions modify the constant using other constants.  During assembly, such a sequence of instructions is combined into a single load.  For example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
LUI r8,12340000  --&amp;gt;  mov $0x12345678,%eax&lt;br /&gt;
ORI r8,r8,5678&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This optimization is not performed where a branch target intervenes, e.g.:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
...&lt;br /&gt;
 LUI r8,12340000  --&amp;gt;  mov $0x12340000,%eax&lt;br /&gt;
L1:&lt;br /&gt;
 ORI r8,r8,5678   --&amp;gt;  or $0x5678,%eax&lt;br /&gt;
 ...&lt;br /&gt;
 BEQ r0,r0,L1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Registers containing constants are identified by bits in the isconst and wasconst fields of the regstat structure.  The wasconst bit is set for a register if the register contained a known constant before the instruction, and the isconst bit is set for a register if the register will contain a known constant after the instruction executes.&lt;br /&gt;
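&lt;br /&gt;
A simplified model of this tagging for the LUI/ORI case above (illustrative names; the real structure also tracks wasconst and 64-bit values):&lt;br /&gt;

```c
#include <assert.h>
#include <stdint.h>

/* One bit per MIPS register in isconst, plus the known value.  LUI makes
 * a register constant; ORI keeps the destination constant only if the
 * source register is itself a known constant. */
struct constprop { uint32_t isconst; uint32_t value[32]; };

static void do_lui(struct constprop *c, int rt, uint16_t imm)
{
    c->isconst |= 1u << rt;
    c->value[rt] = (uint32_t)imm << 16;
}

static void do_ori(struct constprop *c, int rt, int rs, uint16_t imm)
{
    if (c->isconst & (1u << rs)) {
        c->isconst |= 1u << rt;
        c->value[rt] = c->value[rs] | imm;
    } else {
        c->isconst &= ~(1u << rt);   /* result no longer known */
    }
}
```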
&lt;br /&gt;
==Translation lookaside buffer emulation==&lt;br /&gt;
&lt;br /&gt;
Most Nintendo 64 games do not use virtual memory, but some do.  At startup, main memory is directly mapped at addresses from 0x80000000 to 0x803FFFFF, or up to 0x807FFFFF if the memory expansion is used.&lt;br /&gt;
&lt;br /&gt;
Normally, read or write operations are checked against this range, and if outside this range, control is passed to the appropriate I/O handler.  This is done as follows:&lt;br /&gt;
&lt;br /&gt;
MIPS instruction: LW r1,8(r2)&lt;br /&gt;
&lt;br /&gt;
ARM code:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add r1, r2, #8&lt;br /&gt;
cmp r1, #8388608&lt;br /&gt;
bvc handler&lt;br /&gt;
ldr r1, [r1]&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The cmp subtracts 8388608 (0x800000) from the address, which sets the signed-overflow flag only for addresses in the directly-mapped range 0x80000000-0x807FFFFF; bvc therefore branches to the handler for all other addresses.&lt;br /&gt;
&lt;br /&gt;
If there are valid entries in the TLB, this would instead be compiled as follows:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add r1, r2, #8&lt;br /&gt;
mov r0, #264&lt;br /&gt;
add r0, r0, r1, lsr #12&lt;br /&gt;
ldr r0, [r11, r0, lsl #2]&lt;br /&gt;
tst r0, r0&lt;br /&gt;
bmi handler&lt;br /&gt;
ldr r1, [r1, r0, lsl #2]&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This looks up the offset in the memory_map table (which, in this example, is located at r11+264*4).  The high bit is tested to determine whether a valid mapping exists for this page.&lt;br /&gt;
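&lt;br /&gt;
The same lookup expressed as C (a model of the ARM sequence above; names are illustrative):&lt;br /&gt;

```c
#include <assert.h>
#include <stdint.h>

/* memory_map holds one word-offset per 4K page, with the high bit set for
 * unmapped pages.  The host address is vaddr + memory_map[vaddr>>12] * 4,
 * matching the "ldr r1, [r1, r0, lsl #2]" in the ARM code. */
static int32_t memory_map[1 << 20];   /* one entry per 4K page */

static int translate(uint32_t vaddr, uint32_t *host)
{
    int32_t off = memory_map[vaddr >> 12];
    if (off < 0) return 0;            /* "bmi handler": no valid mapping */
    *host = vaddr + (uint32_t)off * 4;
    return 1;
}
```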
&lt;br /&gt;
==Page fault emulation==&lt;br /&gt;
&lt;br /&gt;
If a memory address references an invalid page, a conditional branch is taken to a bit of code placed after the main block.  This will save any modified cached registers, and call an appropriate handler function for the address.  The handler will either perform I/O, or generate an exception (pagefault).&lt;br /&gt;
&lt;br /&gt;
==Mapping executable pages==&lt;br /&gt;
&lt;br /&gt;
If the dynamic recompiler encounters code which is not in contiguous physical memory, it will end the block at the page boundary, so that the block can be removed cleanly if the mapping is changed.&lt;br /&gt;
&lt;br /&gt;
There is a special case of this, where a branch and delay slot span two pages.  If a branch instruction is the last instruction in a virtual memory page, it is compiled in a different manner than other branches.  The branch condition is evaluated, and the target address is placed in a register (%ebp on x86, and r8 on ARM).  A special form of the dynamic linker (dyna_linker_ds) is used to link the branch to its corresponding delay slot in another block.  If no page fault occurs, the delay slot executes and then jumps to the address in the register.  For conditional branches that are not taken, the target address is the next instruction.  This code is generated by the pagespan_assemble function.&lt;br /&gt;
&lt;br /&gt;
== Compile options ==&lt;br /&gt;
&lt;br /&gt;
=== ARMv5_ONLY ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the UXTH instruction is not used, and the dynamic recompiler will generate literal pools instead of using movw/movt.  This provides compatibility with older processors, but generates somewhat less efficient code.&lt;br /&gt;
&lt;br /&gt;
=== CORTEX_A8_BRANCH_PREDICTION_HACK ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the dynamic recompiler will avoid generating consecutive branch instructions without another instruction in between.  This avoids a possible branch misprediction on the Cortex-A8 due to this processor having dual instruction decoders, but only one branch-prediction unit.  See [[Assembly Code Optimization]] for details.&lt;br /&gt;
&lt;br /&gt;
=== USE_MINI_HT ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, attempt to look up return addresses in a small hash table before checking the larger hash table.  Usually improves performance.&lt;br /&gt;
&lt;br /&gt;
=== IMM_PREFETCH ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the x86 PREFETCH instruction is used to prefetch entries from the hash table.  The increase in code size often outweighs the benefit of this.&lt;br /&gt;
&lt;br /&gt;
=== REG_PREFETCH ===&lt;br /&gt;
&lt;br /&gt;
Similar to the above, but loads the address into a register first, then uses the ARM PLD instruction.  The increase in code size almost always outweighs the benefit of this.&lt;br /&gt;
&lt;br /&gt;
=== R29_HACK ===&lt;br /&gt;
&lt;br /&gt;
Assume that the stack pointer (r29) is always a valid memory address and do not check it.  It is similar to the optimization described [http://strmnnrmn.blogspot.com/2007/08/interesting-dynarec-hack.html here].  This can crash the emulator and is not enabled by default.&lt;br /&gt;
&lt;br /&gt;
== Debugging ==&lt;br /&gt;
&lt;br /&gt;
Debugging information can be obtained by defining the assem_debug macro as printf.  This will cause the dynamic recompiler to print debugging information to stdout.  For each disassembled MIPS instruction, an entry similar to the following will be printed:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
U: r1 r8 r11 r16 r31 UU: r29 32: r0 r9&lt;br /&gt;
pre: eax=-1 ecx=9 edx=-1 ebx=-1 ebp=29 esi=36 edi=-1&lt;br /&gt;
needs: ecx ebp esi r: r9&lt;br /&gt;
entry: eax=-1 ecx=9 edx=-1 ebx=-1 ebp=29 esi=36 edi=-1&lt;br /&gt;
dirty: ecx ebp esi &lt;br /&gt;
  800001d8: LW r16,r29+14&lt;br /&gt;
eax=16 ecx=9 edx=-1 ebx=-1 ebp=29 esi=36 edi=-1 dirty: eax ecx ebp esi &lt;br /&gt;
 32: r0 r9 r16&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
U: A list of MIPS registers which will not be used before they are overwritten (liveness analysis)&lt;br /&gt;
&lt;br /&gt;
UU: A list of MIPS registers for which the upper 32 bits will not be used before they are overwritten&lt;br /&gt;
&lt;br /&gt;
32: Registers that contain 32-bit sign-extended values&lt;br /&gt;
&lt;br /&gt;
pre: The state of the register mapping prior to execution of this instruction.  (-1 = no mapping; 36 = cycle count; the complete list of values with special meanings can be found in the source code)&lt;br /&gt;
&lt;br /&gt;
needs: a list of register mappings that were considered necessary and which could not be eliminated to make room for other mappings&lt;br /&gt;
&lt;br /&gt;
r: Registers that are known to contain 32-bit values and where optimizations rely on the assumption that the register does not contain a value outside of the range -2&amp;lt;sup&amp;gt;31&amp;lt;/sup&amp;gt; to 2&amp;lt;sup&amp;gt;31&amp;lt;/sup&amp;gt;-1&lt;br /&gt;
&lt;br /&gt;
entry: The minimum set of register mappings required to jump to this point&lt;br /&gt;
&lt;br /&gt;
dirty: Cached registers that have been modified and will need to be written back&lt;br /&gt;
&lt;br /&gt;
address: instruction - The decoded opcode, followed by the register mapping in effect after this instruction executes&lt;br /&gt;
&lt;br /&gt;
An asterisk (*) designates locations which are the target of a branch instruction.  Constant propagation will not be performed across these points.&lt;br /&gt;
&lt;br /&gt;
After the complete disassembly, the recompiled native code is shown.&lt;br /&gt;
&lt;br /&gt;
Note that the output can be quite voluminous; 20-30 MB is typical.&lt;br /&gt;
&lt;br /&gt;
==Potential improvements==&lt;br /&gt;
&lt;br /&gt;
===Copy propagation/offset propagation===&lt;br /&gt;
&lt;br /&gt;
A common instruction sequence is of the form:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
LUI r9,12340000&lt;br /&gt;
ADD r9,r9,r8&lt;br /&gt;
LW r9,5678(r9)&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
It would be helpful to recognize this as a load from r8+12345678.  The current constant propagation code does not do so.&lt;br /&gt;
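&lt;br /&gt;
A sketch of what such offset propagation could look like (hypothetical; this code does not exist in the recompiler, and the names are illustrative):&lt;br /&gt;

```c
#include <assert.h>
#include <stdint.h>

/* Track "rX = rY + constant" so that LUI r9,0x1234 / ADD r9,r9,r8 /
 * LW r9,0x5678(r9) can be treated as a single load from r8+0x12345678.
 * base = -1 means the register's origin is unknown; base = 0 (r0, the
 * MIPS zero register) means a pure constant. */
struct offs { int base[32]; uint32_t off[32]; };

static void fold_lui(struct offs *o, int rt, uint16_t imm)
{
    o->base[rt] = 0;
    o->off[rt] = (uint32_t)imm << 16;
}

static void fold_add(struct offs *o, int rd, int rs, int rt)
{
    if (o->base[rs] == 0) {        /* rs is a pure constant */
        o->base[rd] = rt;
        o->off[rd] = o->off[rs];
    } else {
        o->base[rd] = -1;
    }
}

/* Folded effective address of LW rt, imm(base_reg). */
static int fold_lw(const struct offs *o, int base_reg, int16_t imm,
                   int *out_base, uint32_t *out_off)
{
    if (o->base[base_reg] < 0) return 0;
    *out_base = o->base[base_reg];
    *out_off = o->off[base_reg] + (int32_t)imm;
    return 1;
}
```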
&lt;br /&gt;
===Unaligned memory access===&lt;br /&gt;
&lt;br /&gt;
A small improvement could be made by combining adjacent LWL/LWR instructions.  The potential gain from doing so is very limited because these instructions typically represent less than 1% of all memory accesses.&lt;br /&gt;
&lt;br /&gt;
===SLT/branch merging===&lt;br /&gt;
&lt;br /&gt;
A frequent occurrence in MIPS code is an SLT or SLTI instruction followed by a branch.  This is generated relatively inefficiently on x86 and ARM, first doing a compare and set, then testing this value and doing a conditional branch.  Doing only one comparison would save at least one instruction, and could potentially save up to three instructions if the liveness analysis reveals that the result of the SLT instruction is used nowhere else.&lt;br /&gt;
&lt;br /&gt;
While a potentially useful optimization, there are several problems with this approach.  First, there are often additional instructions between the SLT and the branch.  These must be checked to make sure they do not modify the relevant registers, as that would prevent reordering the instruction stream as desired.  Second, if the result of the SLT is found to be live, but unmodified, on both paths of the branch, clean_registers will normally write this value back before the branch, to avoid duplicating the writeback code on both paths.  This optimization would have to be removed if the SLT were combined with the branch.&lt;br /&gt;
&lt;br /&gt;
===x86-64===&lt;br /&gt;
&lt;br /&gt;
Currently the x86-64 backend generates only 32-bit instructions.  Proper 64-bit code generation would improve performance.&lt;br /&gt;
&lt;br /&gt;
===PowerPC===&lt;br /&gt;
&lt;br /&gt;
It would be possible to add a PowerPC code generator to the dynamic recompiler.  Currently no one is working on this.  (The mupen64gc project is using a different codebase.)&lt;br /&gt;
&lt;br /&gt;
The following is a summary of the changes which would be necessary to add a PowerPC backend.&lt;br /&gt;
&lt;br /&gt;
The slt* instructions use conditional moves, which are unavailable on PowerPC.  A suitable alternative (such as moves from the condition register) would need to be used.&lt;br /&gt;
&lt;br /&gt;
The assembler can generate as much as 256K of code in a single block; however, conditional branches on PowerPC are limited to a +/-32K displacement.  It will be necessary to either restrict the block size, or insert jump tables in a manner similar to the literal pools on ARM.&lt;br /&gt;
&lt;br /&gt;
PowerPC generally relies on early branch resolution rather than statistical branch prediction.  Scheduling branch condition tests earlier may be advantageous.  (For example, the address bounds check could be done in address_generation, rather than just before the load or store.  Similarly it may be advantageous to test the branch condition and update the cycle count before executing the delay slot.)&lt;br /&gt;
&lt;br /&gt;
===MIPS===&lt;br /&gt;
&lt;br /&gt;
Recompiling MIPS into MIPS would be relatively straightforward, however the current code generator has no facility for filling delay slots.  This capability would be required for efficient code generation.&lt;br /&gt;
&lt;br /&gt;
[[Category:Development]]&lt;/div&gt;</summary>
		<author><name>Ari64</name></author>
		
	</entry>
	<entry>
		<id>https://pandorawiki.org/index.php?title=Mupen64plus_dynamic_recompiler&amp;diff=3854</id>
		<title>Mupen64plus dynamic recompiler</title>
		<link rel="alternate" type="text/html" href="https://pandorawiki.org/index.php?title=Mupen64plus_dynamic_recompiler&amp;diff=3854"/>
		<updated>2010-10-11T04:32:18Z</updated>

		<summary type="html">&lt;p&gt;Ari64: Potential improvements aren't necessarily future improvements, just possible ones&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes the dynamic recompiler in Mupen64plus, and the changes made for the ARM port.&lt;br /&gt;
&lt;br /&gt;
==The original dynamic recompiler by Hacktarux==&lt;br /&gt;
&lt;br /&gt;
The dynamic recompiler used in Mupen64plus v1.5 is based on the original written by Hacktarux in 2002.&lt;br /&gt;
&lt;br /&gt;
It recompiles contiguous blocks of MIPS instructions.  First, each instruction is decoded into a dynamically-allocated 132-byte data structure.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
typedef struct _precomp_instr&lt;br /&gt;
{&lt;br /&gt;
   void (*ops)();&lt;br /&gt;
   union&lt;br /&gt;
     {&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         long long int *rs;&lt;br /&gt;
         long long int *rt;&lt;br /&gt;
         short immediate;&lt;br /&gt;
      } i;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned int inst_index;&lt;br /&gt;
      } j;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         long long int *rs;&lt;br /&gt;
         long long int *rt;&lt;br /&gt;
         long long int *rd;&lt;br /&gt;
         unsigned char sa;&lt;br /&gt;
         unsigned char nrd;&lt;br /&gt;
      } r;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned char base;&lt;br /&gt;
         unsigned char ft;&lt;br /&gt;
         short offset;&lt;br /&gt;
      } lf;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned char ft;&lt;br /&gt;
         unsigned char fs;&lt;br /&gt;
         unsigned char fd;&lt;br /&gt;
      } cf;&lt;br /&gt;
     } f;&lt;br /&gt;
   unsigned int addr; /* word-aligned instruction address in r4300 address space */&lt;br /&gt;
   unsigned int local_addr; /* byte offset to start of corresponding x86_64 instructions, from start of code block */&lt;br /&gt;
   reg_cache_struct reg_cache_infos;&lt;br /&gt;
} precomp_instr;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The decoded instructions are then compiled, generating x86 instructions for each MIPS instruction.  A 32K block is allocated with malloc() to hold the x86 code.  If this size proves insufficient, it is incrementally resized with realloc().&lt;br /&gt;
&lt;br /&gt;
MIPS registers are allocated to x86 registers with a least-recently-used replacement policy.  All cached registers are written back prior to a branch, or a memory read or write.&lt;br /&gt;
&lt;br /&gt;
To facilitate invalidation and replacement of modified code blocks, each 4K page is compiled separately.  If a sequence of instructions crosses a 4K page boundary, the current block is ended, and the next instructions will be compiled as a separate block.&lt;br /&gt;
&lt;br /&gt;
Branch instructions within a 4K page are compiled as branches directly to the target address.  Branches which cross a 4K page boundary go through an indirect address lookup.&lt;br /&gt;
&lt;br /&gt;
Compiled code blocks are invalidated on write.  On an actual MIPS CPU, the instruction cache is invalidated using the CACHE instruction; however, a few N64 games clear the cache using other methods.  Trapping writes appears to be the most reliable method of ensuring cache coherency in the emulated system.  The CACHE instruction itself is ignored.&lt;br /&gt;
&lt;br /&gt;
==Problems with the original design==&lt;br /&gt;
&lt;br /&gt;
The most significant performance problem with this design is its excessive memory usage.  The decoded instruction data is retained (132 bytes for each MIPS instruction) and occasionally referenced during execution.  Memory accesses frequently miss the L2 cache, resulting in poor performance.&lt;br /&gt;
&lt;br /&gt;
Additionally, the register cache is relatively inefficient, since all registers are flushed before any read or write operation.&lt;br /&gt;
&lt;br /&gt;
==A new approach==&lt;br /&gt;
&lt;br /&gt;
To reduce memory usage, the new design allocates a single large block of memory (currently 16 MiB) which is used for recompiled code.  This memory is allocated using mmap with the PROT_EXEC bit set, to ensure operation on CPUs with no-execute (NX) page permissions.&lt;br /&gt;
&lt;br /&gt;
Contiguous blocks of MIPS instructions are compiled; that is, the recompiler does not attempt to follow branches or 'hot paths'.&lt;br /&gt;
&lt;br /&gt;
The recompiler consists of eight stages, plus a linker, and a memory manager.&lt;br /&gt;
&lt;br /&gt;
Compiled blocks are invalidated on write, as before; however, since the compiler will cross a 4K page boundary, writes may invalidate adjacent pages as well.&lt;br /&gt;
&lt;br /&gt;
Currently the dynarec generates x86, x86-64, ARMv5, and ARMv7 (little-endian).  Most of the code is shared between the architectures, but a different code generator is included at compile time, using different #include statements depending on the CPU type.&lt;br /&gt;
&lt;br /&gt;
==Pass 1: Disassembly==&lt;br /&gt;
&lt;br /&gt;
When an instruction address is encountered which has not been compiled, the function new_recompile_block is called, with the (virtual) address of the target instruction as its sole parameter.  If the address is invalid, the function returns a nonzero value and the caller is responsible for handling the pagefault.&lt;br /&gt;
&lt;br /&gt;
Instructions are decoded until an unconditional jump, usually a return, is encountered.  Disassembly ordinarily continues past a JAL (subroutine call) instruction; however, this strategy is abandoned if invalid instructions are encountered.&lt;br /&gt;
&lt;br /&gt;
Up to 4096 instructions (16K of MIPS code) may be disassembled at once.  Surprisingly, some games do actually reach this limit.&lt;br /&gt;
&lt;br /&gt;
==Pass 2: Liveness analysis==&lt;br /&gt;
&lt;br /&gt;
After disassembly, liveness analysis is performed on the registers.  This determines when a particular register will no longer be used, and thus can be removed from the register cache.&lt;br /&gt;
&lt;br /&gt;
A separate analysis is done on the upper 32 bits of 64-bit registers.  This can determine when only the lower 32 bits of a register are significant, thus allowing use of a 32-bit register.  This enables more efficient code generation on 32-bit processors.&lt;br /&gt;
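&lt;br /&gt;
A toy backward liveness pass over a straight-line block (illustrative; the real pass also handles branches and the separate upper-32-bit analysis):&lt;br /&gt;

```c
#include <assert.h>
#include <stdint.h>

struct insn { int rs, rt, rd; };   /* read rs and rt, write rd; -1 = none */

/* Walk the block backwards.  unneeded[i] gets one bit per MIPS register
 * whose value is dead immediately after instruction i, i.e. the register
 * will be overwritten before it is read again. */
static void liveness(const struct insn *ins, int n, uint32_t exit_live,
                     uint32_t unneeded[])
{
    uint32_t live = exit_live;         /* registers live at the block exit */
    for (int i = n - 1; i >= 0; i--) {
        unneeded[i] = ~live;
        if (ins[i].rd >= 0) live &= ~(1u << ins[i].rd);   /* kill */
        if (ins[i].rs >= 0) live |= 1u << ins[i].rs;      /* gen */
        if (ins[i].rt >= 0) live |= 1u << ins[i].rt;
    }
}
```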
&lt;br /&gt;
==Pass 3: Register allocation==&lt;br /&gt;
&lt;br /&gt;
The 31 MIPS registers must be mapped onto seven available registers on x86, or twelve available registers on ARM.&lt;br /&gt;
&lt;br /&gt;
Instructions with 64-bit results require two registers on 32-bit host processors.  32-bit instructions require only one.  A flag is set for registers containing 32-bit values, and these registers will be sign-extended before they are written to the register file.&lt;br /&gt;
&lt;br /&gt;
Registers are made available using a least-soon-needed replacement policy.  When the register cache is full, and no registers can be eliminated using the liveness analysis, a ten-instruction lookahead is used to determine which registers will not be needed soon and these registers are evicted from the cache.&lt;br /&gt;
&lt;br /&gt;
==Pass 4: Free unused registers==&lt;br /&gt;
&lt;br /&gt;
After the initial register allocation, the allocations are reviewed to determine if any registers remain allocated longer than necessary.  These are then removed from the register cache.  This avoids having to write back a large number of registers just before a branch, and makes more registers available for the next pass.&lt;br /&gt;
&lt;br /&gt;
==Pass 5: Pre-allocate registers==&lt;br /&gt;
&lt;br /&gt;
If a register will be used soon and needs to be loaded, try to load it early if the register is available.  This improves execution on CPUs with a load-use penalty.&lt;br /&gt;
&lt;br /&gt;
==Pass 6: Optimize clean/dirty state==&lt;br /&gt;
&lt;br /&gt;
If a cached register is 'dirty' and needs to be written out, try to do so as soon as the register will no longer be modified.  This avoids having to write the same register on multiple code paths due to conditional branches.  Additionally, try to avoid writing out dirty registers inside of loops.&lt;br /&gt;
&lt;br /&gt;
==Pass 7: Identify where registers are assumed to be 32-bit==&lt;br /&gt;
&lt;br /&gt;
When a 64-bit register is mapped to a 32-bit register with the assumption that the value will be sign-extended before being used, it is necessary to ensure that no register contains a 64-bit value when branching to such a location.  These instructions are flagged to identify them as requiring 32-bit inputs.  This information is used by the linker, and the exception return (ERET) handler.&lt;br /&gt;
&lt;br /&gt;
==Pass 8: Assembly==&lt;br /&gt;
&lt;br /&gt;
This generates and outputs the recompiled code.&lt;br /&gt;
&lt;br /&gt;
Following the main code block, handlers for certain exceptions as well as alternate entry points are added.&lt;br /&gt;
&lt;br /&gt;
If a recompiled instruction relies on a certain MIPS register being cached in a certain native register, then a short 'stub' of code is generated to load the necessary registers.  When an instruction outside of this block needs to jump to that location, it will instead jump to the stub.  The necessary registers will be loaded, and then it will jump into the main code sequence.&lt;br /&gt;
&lt;br /&gt;
On architectures which require literal pools (ARMv5), these are inserted as necessary.&lt;br /&gt;
&lt;br /&gt;
==Linker==&lt;br /&gt;
&lt;br /&gt;
The linker fills in all unresolved branches.  Branches within the block are linked to their target address.  Branches which jump outside of the block are linked to their target if that address has been compiled already. These inter-block branches are recorded in the jump_out array.  This information will be used to remove the links in the event that the target of the branch is invalidated.&lt;br /&gt;
&lt;br /&gt;
Unresolved branches point to a stub which loads the address of the branch instruction and the virtual address of its target into registers, and calls the dynamic linker.  When this code is executed, the dynamic linker will compile the target if necessary, and then patch the branch instruction with the new address.&lt;br /&gt;
&lt;br /&gt;
==Memory manager==&lt;br /&gt;
&lt;br /&gt;
The last step in new_recompile_block is to ensure that there will be sufficient memory available to compile the next block.  If there is not, then the oldest blocks are purged.&lt;br /&gt;
&lt;br /&gt;
The dynarec cache can be described as a 16MB circular buffer divided into eight segments.  Memory is allocated in order from beginning to end.&lt;br /&gt;
&lt;br /&gt;
When fewer than two empty segments remain, the next segment in sequence is cleared, wrapping around from the end of the buffer to the beginning.  This continues as memory is needed.&lt;br /&gt;
&lt;br /&gt;
==Invalidation and restoration==&lt;br /&gt;
&lt;br /&gt;
Normally, code blocks are looked up via a hash table, using the virtual address of the target instruction.  However, for purposes of invalidation, blocks are grouped by physical address.  This can be described as a virtually-indexed, physically-tagged (VIPT) cache.&lt;br /&gt;
&lt;br /&gt;
References to compiled blocks are stored in one of 4096 linked lists in the jump_in array.  Each list covers a 4K block of memory, and 2048 such lists are sufficient to cover the 8MB of RAM in the Nintendo 64.  The remaining lists are for code in ROM, and the bootloader in SP memory.&lt;br /&gt;
&lt;br /&gt;
When a write hits a memory page marked as cached, all entries in the corresponding list are invalidated.  If any code is found to cross a 4K boundary, the adjacent lists are invalidated also.&lt;br /&gt;
&lt;br /&gt;
Sometimes blocks may be invalidated even when none of the code is actually modified.  This can happen if data is written to memory in the same 4K page, or if code is reloaded without actually modifying it.  If blocks which were previously invalidated are subsequently found to be unmodified, those blocks are marked in the restore_candidate array.  If the block remains unmodified, it will be restored as a valid block, to avoid recompiling blocks which do not need to be recompiled.  This is performed by the clean_blocks function which is called periodically.&lt;br /&gt;
&lt;br /&gt;
The jump_in array, which lists unmodified blocks, is physically indexed; the jump_dirty array, which lists potentially-modified blocks, is virtually indexed.  This allows blocks which have changed physical address, but are still at the same virtual address, to be recognized as unchanged and not in need of recompilation.&lt;br /&gt;
&lt;br /&gt;
==Dynamic linker==&lt;br /&gt;
&lt;br /&gt;
Branches with unresolved addresses jump to the dynamic linker.  This will look through the jump_in list corresponding to the physical page containing the virtual target address.  If found, the branch instruction will be patched with the address, and then it will jump to this address.&lt;br /&gt;
&lt;br /&gt;
If not found, the jump_dirty list will be searched for blocks which were previously compiled but may have been modified.  If a potential match is found, the code will be compared against a cached copy to determine if any changes have been made.  If not, then it will jump to the block.  Because the memory could be modified again, branch instructions referencing these blocks are not altered, and continue to point to the dynamic linker.  These blocks will continue to be verified each time they are called, until restored to the jump_in list by the clean_blocks function described above.&lt;br /&gt;
&lt;br /&gt;
If no compiled block is found, or the existing block was modified, the target is recompiled.&lt;br /&gt;
&lt;br /&gt;
==Address lookup==&lt;br /&gt;
&lt;br /&gt;
When a JR (jump register) instruction is encountered, the address of the recompiled code must be looked up using the address of the MIPS code.  The majority of such instructions jump to the link register (r31) to return to the location following a JAL (call) instruction.&lt;br /&gt;
&lt;br /&gt;
When a JAL or JALR is executed, the address of the following instruction is inserted into a small 32-entry hash table, which is checked when a JR r31 instruction is executed.  This allows for a quick return from subroutine calls.&lt;br /&gt;
&lt;br /&gt;
If the JR instruction uses a register other than r31, or the small hash table lookup fails to find a match, a larger 131072-entry hash table is checked.  This table contains 65536 bins with up to 2 addresses per bin.  If this also fails to find a match (which occurs less than 1% of the time) an exhaustive search of all compiled addresses within that 4K memory page is performed.&lt;br /&gt;
&lt;br /&gt;
If no match is found by any of these methods, the target address is compiled, and the new address is inserted into the hash table.&lt;br /&gt;
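&lt;br /&gt;
The larger table (65536 bins of 2 entries) can be sketched in C as follows.  The hash function and struct layout here are illustrative assumptions, not the actual source; on a miss the caller would fall back to the exhaustive page search and then to the compiler, as described above.&lt;br /&gt;
&lt;br /&gt;
```c
#include <stdint.h>

struct ht_entry { uint32_t vaddr[2]; void *ptr[2]; };
static struct ht_entry hash_table[65536];

/* Fold a 32-bit virtual address into a 16-bit bin number
 * (hypothetical hash; the real one may differ). */
static unsigned ht_hash(uint32_t vaddr)
{
    return ((vaddr >> 16) ^ vaddr) & 0xFFFF;
}

/* Insert: the newest entry takes slot 0, the older entry shifts to
 * slot 1, so each bin holds up to 2 addresses. */
void ht_add(uint32_t vaddr, void *ptr)
{
    struct ht_entry *e = &hash_table[ht_hash(vaddr)];
    e->vaddr[1] = e->vaddr[0]; e->ptr[1] = e->ptr[0];
    e->vaddr[0] = vaddr;       e->ptr[0] = ptr;
}

/* Lookup: check both slots of the bin; NULL means a miss. */
void *ht_find(uint32_t vaddr)
{
    struct ht_entry *e = &hash_table[ht_hash(vaddr)];
    if (e->vaddr[0] == vaddr) return e->ptr[0];
    if (e->vaddr[1] == vaddr) return e->ptr[1];
    return 0;
}
```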
&lt;br /&gt;
==Cycle counting==&lt;br /&gt;
&lt;br /&gt;
Cycles are counted before each branch by adding the cycles from the preceding instructions to a dedicated register.  The cycle count is kept in R10 on ARM and ESI on x86.  The value in this register is normally negative; when it reaches zero or becomes positive, a conditional branch is taken which jumps to an interrupt handler.&lt;br /&gt;
&lt;br /&gt;
For example, the following x86 code adds eight cycles:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add $8,%esi&lt;br /&gt;
jns interrupt_handler&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The conditional branch jumps to a small bit of code located after the main compiled block, which saves the cached registers to the register file, sets the instruction pointer which will be used upon return from the interrupt, and then calls cc_interrupt.&lt;br /&gt;
&lt;br /&gt;
As in the original mupen64plus, the emulated clock runs at 37.5 MHz, and each instruction takes 2 clock cycles.&lt;br /&gt;
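&lt;br /&gt;
The negative-counter scheme can be modelled in C.  The names and the fixed reload value below are illustrative assumptions; the point is that the generated "add; jns" pair is the only per-branch cost.&lt;br /&gt;
&lt;br /&gt;
```c
static int cycle_count;        /* the emulated counter (ESI / R10)  */
static int interrupts_taken;

/* The counter starts below zero by the number of cycles remaining
 * until the next timer interrupt. */
void start_timeslice(int cycles_until_interrupt)
{
    cycle_count = -cycles_until_interrupt;
}

/* Called at each branch with the cost of the preceding instructions;
 * reaching zero or above corresponds to the "jns" being taken. */
void consume_cycles(int cycles)
{
    cycle_count += cycles;
    if (cycle_count >= 0) {
        interrupts_taken++;
        start_timeslice(100);  /* handler reloads the counter */
    }
}
```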
&lt;br /&gt;
==Delay slots==&lt;br /&gt;
&lt;br /&gt;
MIPS has 'delay slots', where the instruction after the branch is executed before the branch is taken.  Instructions in delay slots are issued out-of-order in the recompiled code.&lt;br /&gt;
&lt;br /&gt;
[[Image:Recompiler delay slot reordering.png]]&lt;br /&gt;
&lt;br /&gt;
When a branch jumps into the delay slot of another branch, this case must be handled slightly differently:&lt;br /&gt;
&lt;br /&gt;
[[Image:Recompiler delay slot reordering 2.png]]&lt;br /&gt;
&lt;br /&gt;
The branch test and delay slot are executed in-order if a dependency exists, or for 'likely' branches where the delay slot is nullified when the branch condition is false.  These cases are infrequent (typically less than 10% of branches).&lt;br /&gt;
&lt;br /&gt;
==Constant propagation==&lt;br /&gt;
&lt;br /&gt;
When an instruction loads a constant into a register, the register is tagged as a constant.  The constant tag will be retained if subsequent instructions modify the constant using other constants.  During assembly, such a sequence of instructions is combined into a single load.  For example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
LUI r8,12340000  --&amp;gt;  mov $0x12345678,%eax&lt;br /&gt;
ORI r8,r8,5678&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This optimization is not performed where a branch target intervenes, e.g.:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
...&lt;br /&gt;
 LUI r8,12340000  --&amp;gt;  mov $0x12340000,%eax&lt;br /&gt;
L1:&lt;br /&gt;
 ORI r8,r8,5678   --&amp;gt;  or $0x5678,%eax&lt;br /&gt;
 ...&lt;br /&gt;
 BEQ r0,r0,L1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Registers containing constants are identified by bits in the isconst and wasconst fields of the regstat structure.  The wasconst bit is set for a register if the register contained a known constant before the instruction, and the isconst bit is set for a register if the register will contain a known constant after the instruction executes.&lt;br /&gt;
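&lt;br /&gt;
A minimal model of this tagging in C, assuming a simplified state structure (the real regstat layout and function names differ):&lt;br /&gt;
&lt;br /&gt;
```c
#include <stdint.h>

/* Each MIPS register carries an "is constant" bit and, if set, its
 * known value; the assembler later emits one host-side load. */
struct const_state {
    uint32_t isconst;    /* bit n set => value of rN is known */
    uint32_t value[32];
};

void do_lui(struct const_state *s, int rt, uint16_t imm)
{
    s->isconst |= 1u << rt;
    s->value[rt] = (uint32_t)imm << 16;
}

void do_ori(struct const_state *s, int rt, int rs, uint16_t imm)
{
    if (s->isconst & (1u << rs)) {   /* constant stays constant   */
        s->isconst |= 1u << rt;
        s->value[rt] = s->value[rs] | imm;
    } else {
        s->isconst &= ~(1u << rt);   /* result no longer known    */
    }
}
```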
&lt;br /&gt;
==Translation lookaside buffer emulation==&lt;br /&gt;
&lt;br /&gt;
Most Nintendo 64 games do not use virtual memory, but some do.  At startup, main memory is directly mapped at addresses from 0x80000000 to 0x803FFFFF, or up to 0x807FFFFF if the memory expansion is used.&lt;br /&gt;
&lt;br /&gt;
Normally, read or write operations are checked against this range, and if outside this range, control is passed to the appropriate I/O handler.  This is done as follows:&lt;br /&gt;
&lt;br /&gt;
MIPS instruction: LW r1,8(r2)&lt;br /&gt;
&lt;br /&gt;
ARM code:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add r1, r2, #8&lt;br /&gt;
cmp r1, #8388608&lt;br /&gt;
bvc handler&lt;br /&gt;
ldr r1, [r1]&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If there are valid entries in the TLB, this would instead be compiled as follows:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add r1, r2, #8&lt;br /&gt;
mov r0, #264&lt;br /&gt;
add r0, r0, r1, lsr #12&lt;br /&gt;
ldr r0, [r11, r0, lsl #2]&lt;br /&gt;
tst r0, r0&lt;br /&gt;
bmi handler&lt;br /&gt;
ldr r1, [r1, r0, lsl #2]&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This looks up the offset in the memory_map table (which, in this example, is located at r11+264*4).  The high bit is tested to determine whether a valid mapping exists for this page.&lt;br /&gt;
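&lt;br /&gt;
The table lookup can be modelled in C.  This is a simplified sketch: here the table stores a word index into an emulated RAM array, whereas the actual code stores an offset added directly to the emulated address (as in the "ldr r1, [r1, r0, lsl #2]" above); the negative-entry test for an unmapped page is the same idea in both.&lt;br /&gt;
&lt;br /&gt;
```c
#include <stdint.h>

static uint32_t rdram[0x200000];      /* 8MB of emulated RDRAM         */
static int32_t  memory_map[0x100000]; /* one entry per 4K virtual page */

/* Fill in the direct mapping 0x80000000-0x807FFFFF -> rdram, with
 * every other page marked invalid (negative entry / high bit set). */
void init_memory_map(void)
{
    for (unsigned p = 0; p < 0x100000; p++)
        memory_map[p] = -1;
    for (unsigned p = 0; p < 0x800; p++)
        memory_map[0x80000 + p] = p << 10;  /* 4K page = 1024 words */
}

/* Index the table with the page number, test the high bit (the
 * "bmi handler" case), then form the host address. */
uint32_t *translate(uint32_t vaddr)
{
    int32_t off = memory_map[vaddr >> 12];
    if (off < 0)
        return 0;                     /* unmapped: go to handler */
    return &rdram[off + ((vaddr & 0xFFF) >> 2)];
}
```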
&lt;br /&gt;
==Page fault emulation==&lt;br /&gt;
&lt;br /&gt;
If a memory address references an invalid page, a conditional branch is taken to a bit of code placed after the main block.  This will save any modified cached registers, and call an appropriate handler function for the address.  The handler will either perform I/O, or generate an exception (pagefault).&lt;br /&gt;
&lt;br /&gt;
==Mapping executable pages==&lt;br /&gt;
&lt;br /&gt;
If the dynamic recompiler encounters code which is not in contiguous physical memory, it will end the block at the page boundary, so that the block can be removed cleanly if the mapping is changed.&lt;br /&gt;
&lt;br /&gt;
There is a special case of this, where a branch and delay slot span two pages.  If a branch instruction is the last instruction in a virtual memory page, it is compiled in a different manner than other branches.  The branch condition is evaluated, and the target address is placed in a register (%ebp on x86, and r8 on ARM).  A special form of the dynamic linker (dyna_linker_ds) is used to link the branch to its corresponding delay slot in another block.  If no page fault occurs, the delay slot executes and then jumps to the address in the register.  For conditional branches that are not taken, the target address is the next instruction.  This code is generated by the pagespan_assemble function.&lt;br /&gt;
&lt;br /&gt;
== Compile options ==&lt;br /&gt;
&lt;br /&gt;
=== ARMv5_ONLY ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the UXTH instruction is not used, and the dynamic recompiler will generate literal pools instead of using movw/movt.  This provides compatibility with older processors, but generates somewhat less efficient code.&lt;br /&gt;
&lt;br /&gt;
=== CORTEX_A8_BRANCH_PREDICTION_HACK ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the dynamic recompiler will avoid generating consecutive branch instructions without another instruction in between.  This avoids a possible branch misprediction on the Cortex-A8 due to this processor having dual instruction decoders, but only one branch-prediction unit.  See [[Assembly Code Optimization]] for details.&lt;br /&gt;
&lt;br /&gt;
=== USE_MINI_HT ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the recompiler attempts to look up return addresses in a small hash table before checking the larger hash table.  This usually improves performance.&lt;br /&gt;
&lt;br /&gt;
=== IMM_PREFETCH ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the x86 PREFETCH instruction is used to prefetch entries from the hash table.  The increase in code size often outweighs the benefit of this.&lt;br /&gt;
&lt;br /&gt;
=== REG_PREFETCH ===&lt;br /&gt;
&lt;br /&gt;
Similar to the above, but loads the address into a register first, then uses the ARM PLD instruction.  The increase in code size almost always outweighs the benefit of this.&lt;br /&gt;
&lt;br /&gt;
=== R29_HACK ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the recompiler assumes that the stack pointer (r29) always points to a valid memory address and omits the range check.  It is similar to the optimization described [http://strmnnrmn.blogspot.com/2007/08/interesting-dynarec-hack.html here].  This can crash the emulator and is not enabled by default.&lt;br /&gt;
&lt;br /&gt;
==Potential improvements==&lt;br /&gt;
&lt;br /&gt;
===Copy propagation/offset propagation===&lt;br /&gt;
&lt;br /&gt;
A common instruction sequence is of the form:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
LUI r9,12340000&lt;br /&gt;
ADD r9,r9,r8&lt;br /&gt;
LW r9,5678(r9)&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
It would be helpful to recognize this as a load from r8+12345678.  The current constant propagation code does not do so.&lt;br /&gt;
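&lt;br /&gt;
The proposed folding could be sketched as follows.  The struct and function names are hypothetical; the point is only the arithmetic, including sign-extension of the 16-bit displacement as on real MIPS.&lt;br /&gt;
&lt;br /&gt;
```c
#include <stdint.h>

struct base_offset { int base_reg; uint32_t offset; };

/* When r9 is known to hold r8 + (lui_imm << 16), a following
 * LW lw_disp(r9) can fold into a single load from r8 + offset. */
struct base_offset fold_lui_add_lw(int base_reg, uint16_t lui_imm,
                                   int16_t lw_disp)
{
    struct base_offset bo;
    bo.base_reg = base_reg;
    /* sign-extend the 16-bit displacement before combining */
    bo.offset = ((uint32_t)lui_imm << 16) + (uint32_t)(int32_t)lw_disp;
    return bo;
}
```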
&lt;br /&gt;
===Unaligned memory access===&lt;br /&gt;
&lt;br /&gt;
A small improvement could be made by combining adjacent LWL/LWR instructions.  The potential gain from doing so is very limited because these instructions typically represent less than 1% of all memory accesses.&lt;br /&gt;
&lt;br /&gt;
===SLT/branch merging===&lt;br /&gt;
&lt;br /&gt;
A frequent occurrence in MIPS code is an SLT or SLTI instruction followed by a branch.  This is generated relatively inefficiently on x86 and ARM, first doing a compare and set, then testing this value and doing a conditional branch.  Doing only one comparison would save at least one instruction, and could potentially save up to three instructions if the liveness analysis reveals that the result of the SLT instruction is used nowhere else.&lt;br /&gt;
&lt;br /&gt;
While a potentially useful optimization, there are several problems with this approach.  First, there are often additional instructions between the SLT and the branch.  These must be checked to ensure they do not modify the registers involved, as that would prevent reordering the instruction stream as desired.  Second, if the result of the SLT is found to be live, but unmodified, on both paths of the branch, clean_registers will normally write this value back before the branch, to avoid duplicating the writeback code on both paths.  This optimization would have to be removed if the SLT were combined with the branch.&lt;br /&gt;
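&lt;br /&gt;
Detecting the fusible pattern might look like this in C.  The instruction record, opcode constants, and liveness encoding are hypothetical simplifications of the decoded-instruction arrays, and this sketch only handles the adjacent case, not the intervening-instruction case discussed above.&lt;br /&gt;
&lt;br /&gt;
```c
enum op { OP_SLT, OP_BNE, OP_ADDU };

struct insn {
    enum op  op;
    int      rs, rt, rd;  /* register operands; rd = result       */
    unsigned live_out;    /* liveness bitmask after this insn     */
};

/* Returns 1 if insn[i] is an SLT that can be merged into the branch
 * at insn[i+1]: the branch tests the SLT result and the result is
 * dead after the branch, so no compare-and-set need be emitted. */
int can_fuse(const struct insn *code, int i)
{
    const struct insn *slt = &code[i], *br = &code[i + 1];
    if (slt->op != OP_SLT || br->op != OP_BNE)
        return 0;
    if (br->rs != slt->rd && br->rt != slt->rd)
        return 0;                    /* branch ignores the flag    */
    return !(br->live_out & (1u << slt->rd));
}
```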
&lt;br /&gt;
===x86-64===&lt;br /&gt;
&lt;br /&gt;
Currently the x86-64 backend generates only 32-bit instructions.  Proper 64-bit code generation would improve performance.&lt;br /&gt;
&lt;br /&gt;
===PowerPC===&lt;br /&gt;
&lt;br /&gt;
It would be possible to add a PowerPC code generator to the dynamic recompiler.  Currently no one is working on this.  (The mupen64gc project is using a different codebase.)&lt;br /&gt;
&lt;br /&gt;
The following is a summary of the changes which would be necessary to add a PowerPC backend.&lt;br /&gt;
&lt;br /&gt;
The slt* instructions use conditional moves, which are unavailable on PowerPC.  A suitable alternative (such as moves from the condition register) would need to be used.&lt;br /&gt;
&lt;br /&gt;
The assembler can generate as much as 256K of code in a single block; however, conditional branches on PowerPC are limited to a +/-32K displacement.  It will be necessary to either restrict the block size, or insert jump tables in a manner similar to the literal pools on ARM.&lt;br /&gt;
&lt;br /&gt;
PowerPC generally relies on early branch resolution rather than statistical branch prediction.  Scheduling branch condition tests earlier may be advantageous.  (For example, the address bounds check could be done in address_generation, rather than just before the load or store.  Similarly it may be advantageous to test the branch condition and update the cycle count before executing the delay slot.)&lt;br /&gt;
&lt;br /&gt;
===MIPS===&lt;br /&gt;
&lt;br /&gt;
Recompiling MIPS into MIPS would be relatively straightforward; however, the current code generator has no facility for filling delay slots.  This capability would be required for efficient code generation.&lt;br /&gt;
&lt;br /&gt;
[[Category:Development]]&lt;/div&gt;</summary>
		<author><name>Ari64</name></author>
		
	</entry>
	<entry>
		<id>https://pandorawiki.org/index.php?title=Mupen64plus_dynamic_recompiler&amp;diff=3771</id>
		<title>Mupen64plus dynamic recompiler</title>
		<link rel="alternate" type="text/html" href="https://pandorawiki.org/index.php?title=Mupen64plus_dynamic_recompiler&amp;diff=3771"/>
		<updated>2010-09-27T20:33:18Z</updated>

		<summary type="html">&lt;p&gt;Ari64: /* Compile options */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes the dynamic recompiler in Mupen64plus, and the changes made for the ARM port.&lt;br /&gt;
&lt;br /&gt;
==The original dynamic recompiler by Hacktarux==&lt;br /&gt;
&lt;br /&gt;
The dynamic recompiler used in Mupen64plus v1.5 is based on the original written by Hacktarux in 2002.&lt;br /&gt;
&lt;br /&gt;
It recompiles contiguous blocks of MIPS instructions.  First, each instruction is decoded into a dynamically-allocated 132-byte data structure.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
typedef struct _precomp_instr&lt;br /&gt;
{&lt;br /&gt;
   void (*ops)();&lt;br /&gt;
   union&lt;br /&gt;
     {&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         long long int *rs;&lt;br /&gt;
         long long int *rt;&lt;br /&gt;
         short immediate;&lt;br /&gt;
      } i;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned int inst_index;&lt;br /&gt;
      } j;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         long long int *rs;&lt;br /&gt;
         long long int *rt;&lt;br /&gt;
         long long int *rd;&lt;br /&gt;
         unsigned char sa;&lt;br /&gt;
         unsigned char nrd;&lt;br /&gt;
      } r;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned char base;&lt;br /&gt;
         unsigned char ft;&lt;br /&gt;
         short offset;&lt;br /&gt;
      } lf;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned char ft;&lt;br /&gt;
         unsigned char fs;&lt;br /&gt;
         unsigned char fd;&lt;br /&gt;
      } cf;&lt;br /&gt;
     } f;&lt;br /&gt;
   unsigned int addr; /* word-aligned instruction address in r4300 address space */&lt;br /&gt;
   unsigned int local_addr; /* byte offset to start of corresponding x86_64 instructions, from start of code block */&lt;br /&gt;
   reg_cache_struct reg_cache_infos;&lt;br /&gt;
} precomp_instr;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The decoded instructions are then compiled, generating x86 instructions for each MIPS instruction.  A 32K block is allocated with malloc() to hold the x86 code.  If this size proves insufficient, it is incrementally resized with realloc().&lt;br /&gt;
&lt;br /&gt;
MIPS registers are allocated to x86 registers with a least-recently-used replacement policy.  All cached registers are written back prior to a branch, or a memory read or write.&lt;br /&gt;
&lt;br /&gt;
To facilitate invalidation and replacement of modified code blocks, each 4K page is compiled separately.  If a sequence of instructions crosses a 4K page boundary, the current block is ended, and the next instructions will be compiled as a separate block.&lt;br /&gt;
&lt;br /&gt;
Branch instructions within a 4K page are compiled as branches directly to the target address.  Branches which cross a 4K page go through an indirect address lookup.&lt;br /&gt;
&lt;br /&gt;
Compiled code blocks are invalidated on write.  On an actual MIPS CPU, the instruction cache is invalidated using the CACHE instruction; however, a few N64 games clear the cache using other methods.  Trapping writes appears to be the most reliable way of ensuring cache coherency in the emulated system.  The CACHE instruction is therefore ignored.&lt;br /&gt;
&lt;br /&gt;
==Problems with the original design==&lt;br /&gt;
&lt;br /&gt;
The most significant performance problem with this design is its excessive memory usage.  The decoded instruction data is retained (132 bytes for each MIPS instruction) and occasionally referenced during execution.  Memory accesses frequently miss the L2 cache, resulting in poor performance.&lt;br /&gt;
&lt;br /&gt;
Additionally, the register cache is relatively inefficient, since all registers are flushed before any read or write operation.&lt;br /&gt;
&lt;br /&gt;
==A new approach==&lt;br /&gt;
&lt;br /&gt;
To reduce memory usage, the new design allocates a single large block of memory (currently 16 MiB) which is used for recompiled code.  This memory is allocated using mmap with the PROT_EXEC bit set, to ensure operation on CPUs with no-execute (NX) page permissions.&lt;br /&gt;
&lt;br /&gt;
Contiguous blocks of MIPS instructions are compiled; that is, the recompiler does not attempt to follow branches or trace 'hot paths'.&lt;br /&gt;
&lt;br /&gt;
The recompiler consists of eight stages, plus a linker, and a memory manager.&lt;br /&gt;
&lt;br /&gt;
Compiled blocks are invalidated on write, as before; however, since the compiler may cross a 4K page boundary, writes may invalidate adjacent pages as well.&lt;br /&gt;
&lt;br /&gt;
Currently the dynarec generates x86, x86-64, ARMv5, and ARMv7 (little-endian).  Most of the code is shared between the architectures, but a different code generator is included at compile time, using different #include statements depending on the CPU type.&lt;br /&gt;
&lt;br /&gt;
==Pass 1: Disassembly==&lt;br /&gt;
&lt;br /&gt;
When an instruction address is encountered which has not been compiled, the function new_recompile_block is called, with the (virtual) address of the target instruction as its sole parameter.  If the address is invalid, the function returns a nonzero value and the caller is responsible for handling the pagefault.&lt;br /&gt;
&lt;br /&gt;
Instructions are decoded until an unconditional jump, usually a return, is encountered.  Disassembly ordinarily continues past a JAL (subroutine call) instruction; however, this strategy is abandoned if invalid instructions are encountered.&lt;br /&gt;
&lt;br /&gt;
Up to 4096 instructions (16K of MIPS code) may be disassembled at once.  Surprisingly, some games do actually reach this limit.&lt;br /&gt;
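&lt;br /&gt;
The stopping rule for pass 1 can be modelled in C.  This sketch decodes only J and JR as block terminators and counts the delay slot; the actual decoder handles ERET, SYSCALL, invalid opcodes, and the JAL continuation described above.&lt;br /&gt;
&lt;br /&gt;
```c
#include <stdint.h>

#define MAXBLOCK 4096   /* 16K of MIPS code */

/* Scan a MIPS word stream and return the block length: everything
 * up to and including the delay slot of the first unconditional
 * jump, capped at MAXBLOCK instructions.  JAL (opcode 3) does NOT
 * terminate the scan. */
int count_block(const uint32_t *code, int max)
{
    for (int i = 0; i < max && i < MAXBLOCK; i++) {
        uint32_t op = code[i] >> 26;
        uint32_t funct = code[i] & 0x3F;
        int is_jr = (op == 0 && funct == 0x08);  /* SPECIAL/JR */
        if (op == 0x02 || is_jr)                 /* J or JR    */
            return i + 2;                        /* + delay slot */
    }
    return max < MAXBLOCK ? max : MAXBLOCK;
}
```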
&lt;br /&gt;
==Pass 2: Liveness analysis==&lt;br /&gt;
&lt;br /&gt;
After disassembly, liveness analysis is performed on the registers.  This determines when a particular register will no longer be used, and thus can be removed from the register cache.&lt;br /&gt;
&lt;br /&gt;
A separate analysis is done on the upper 32 bits of 64-bit registers.  This can determine when only the lower 32 bits of a register are significant, thus allowing use of a 32-bit register.  This enables more efficient code generation on 32-bit processors.&lt;br /&gt;
&lt;br /&gt;
==Pass 3: Register allocation==&lt;br /&gt;
&lt;br /&gt;
The 31 MIPS registers must be mapped onto seven available registers on x86, or twelve available registers on ARM.&lt;br /&gt;
&lt;br /&gt;
Instructions with 64-bit results require two registers on 32-bit host processors.  32-bit instructions require only one.  A flag is set for registers containing 32-bit values, and these registers will be sign-extended before they are written to the register file.&lt;br /&gt;
&lt;br /&gt;
Registers are made available using a least-soon-needed replacement policy.  When the register cache is full, and no registers can be eliminated using the liveness analysis, a ten-instruction lookahead is used to determine which registers will not be needed soon and these registers are evicted from the cache.&lt;br /&gt;
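&lt;br /&gt;
The least-soon-needed policy can be sketched in C.  The next_use encoding and the function name are assumptions for illustration; the real allocator works over the decoded instruction arrays rather than a precomputed distance table.&lt;br /&gt;
&lt;br /&gt;
```c
#define LOOKAHEAD 10

/* next_use[r] = distance in instructions to the next read of MIPS
 * register r; anything at or beyond the ten-instruction window is
 * treated as "not needed soon".  Returns the register to evict. */
int pick_victim(const int *next_use, const int *cached, int ncached)
{
    int victim = cached[0], worst = -1;
    for (int i = 0; i < ncached; i++) {
        int d = next_use[cached[i]];
        if (d > LOOKAHEAD) d = LOOKAHEAD;  /* beyond window: tied */
        if (d > worst) { worst = d; victim = cached[i]; }
    }
    return victim;
}
```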
&lt;br /&gt;
==Pass 4: Free unused registers==&lt;br /&gt;
&lt;br /&gt;
After the initial register allocation, the allocations are reviewed to determine if any registers remain allocated longer than necessary.  These are then removed from the register cache.  This avoids having to write back a large number of registers just before a branch, and makes more registers available for the next pass.&lt;br /&gt;
&lt;br /&gt;
==Pass 5: Pre-allocate registers==&lt;br /&gt;
&lt;br /&gt;
If a register will be used soon and needs to be loaded, the recompiler tries to load it early, provided a host register is available.  This improves execution on CPUs with a load-use penalty.&lt;br /&gt;
&lt;br /&gt;
==Pass 6: Optimize clean/dirty state==&lt;br /&gt;
&lt;br /&gt;
If a cached register is 'dirty' and needs to be written out, the recompiler tries to do so as soon as the register will no longer be modified.  This avoids having to write the same register on multiple code paths due to conditional branches.  Additionally, it tries to avoid writing out dirty registers inside loops.&lt;br /&gt;
&lt;br /&gt;
==Pass 7: Identify where registers are assumed to be 32-bit==&lt;br /&gt;
&lt;br /&gt;
When a 64-bit register is mapped to a 32-bit register with the assumption that the value will be sign-extended before being used, it is necessary to ensure that no register contains a 64-bit value when branching to such a location.  These instructions are flagged to identify them as requiring 32-bit inputs.  This information is used by the linker, and the exception return (ERET) handler.&lt;br /&gt;
&lt;br /&gt;
==Pass 8: Assembly==&lt;br /&gt;
&lt;br /&gt;
This generates and outputs the recompiled code.&lt;br /&gt;
&lt;br /&gt;
Following the main code block, handlers for certain exceptions as well as alternate entry points are added.&lt;br /&gt;
&lt;br /&gt;
If a recompiled instruction relies on a certain MIPS register being cached in a certain native register, then a short 'stub' of code is generated to load the necessary registers.  When an instruction outside of this block needs to jump to that location, it will instead jump to the stub.  The necessary registers will be loaded, and then it will jump into the main code sequence.&lt;br /&gt;
&lt;br /&gt;
On architectures which require literal pools (ARMv5) these are inserted as necessary.&lt;br /&gt;
&lt;br /&gt;
==Linker==&lt;br /&gt;
&lt;br /&gt;
The linker fills in all unresolved branches.  Branches within the block are linked to their target address.  Branches which jump outside of the block are linked to their target if that address has been compiled already. These inter-block branches are recorded in the jump_out array.  This information will be used to remove the links in the event that the target of the branch is invalidated.&lt;br /&gt;
&lt;br /&gt;
Unresolved branches point to a stub which loads the address of the branch instruction and the virtual address of its target into registers, and calls the dynamic linker.  When this code is executed, the dynamic linker will compile the target if necessary, and then patch the branch instruction with the new address.&lt;br /&gt;
&lt;br /&gt;
==Memory manager==&lt;br /&gt;
&lt;br /&gt;
The last step in new_recompile_block is to ensure that there will be sufficient memory available to compile the next block.  If there is not, then the oldest blocks are purged.&lt;br /&gt;
&lt;br /&gt;
The dynarec cache can be described as a 16MB circular buffer divided into eight segments.  Memory is allocated in order from beginning to end.&lt;br /&gt;
&lt;br /&gt;
When fewer than two empty segments remain, the next segment in sequence is cleared, wrapping around to the beginning of the buffer.  This continues for as long as memory is needed.&lt;br /&gt;
&lt;br /&gt;
==Invalidation and restoration==&lt;br /&gt;
&lt;br /&gt;
Normally, code blocks are looked up via a hash table, using the virtual address of the target instruction.  However, for purposes of invalidation, blocks are grouped by physical address.  This can be described as a virtually-indexed, physically-tagged (VIPT) cache.&lt;br /&gt;
&lt;br /&gt;
References to compiled blocks are stored in one of 4096 linked lists in the jump_in array.  Each list covers a 4K block of memory, and 2048 such lists are sufficient to cover the 8MB of RAM in the Nintendo 64.  The remaining lists are for code in ROM, and the bootloader in SP memory.&lt;br /&gt;
&lt;br /&gt;
When a write hits a memory page marked as cached, all entries in the corresponding list are invalidated.  If any code is found to cross a 4K boundary, the adjacent lists are invalidated also.&lt;br /&gt;
&lt;br /&gt;
Sometimes blocks may be invalidated even when none of the code is actually modified.  This can happen if data is written to memory in the same 4K page, or if code is reloaded without actually modifying it.  If blocks which were previously invalidated are subsequently found to be unmodified, those blocks are marked in the restore_candidate array.  If the block remains unmodified, it will be restored as a valid block, to avoid recompiling blocks which do not need to be recompiled.  This is performed by the clean_blocks function which is called periodically.&lt;br /&gt;
&lt;br /&gt;
The jump_in array, which lists unmodified blocks, is physically indexed; the jump_dirty array, which lists potentially-modified blocks, is virtually indexed.  This allows blocks which have changed physical address, but are still at the same virtual address, to be recognized as unchanged and not in need of recompilation.&lt;br /&gt;
&lt;br /&gt;
==Dynamic linker==&lt;br /&gt;
&lt;br /&gt;
Branches with unresolved addresses jump to the dynamic linker.  This will look through the jump_in list corresponding to the physical page containing the virtual target address.  If found, the branch instruction will be patched with the address, and then it will jump to this address.&lt;br /&gt;
&lt;br /&gt;
If not found, the jump_dirty list will be searched for blocks which were previously compiled but may have been modified.  If a potential match is found, the code will be compared against a cached copy to determine if any changes have been made.  If not, then it will jump to the block.  Because the memory could be modified again, branch instructions referencing these blocks are not altered, and continue to point to the dynamic linker.  These blocks will continue to be verified each time they are called, until restored to the jump_in list by the clean_blocks function described above.&lt;br /&gt;
&lt;br /&gt;
If no compiled block is found, or the existing block was modified, the target is recompiled.&lt;br /&gt;
&lt;br /&gt;
==Address lookup==&lt;br /&gt;
&lt;br /&gt;
When a JR (jump register) instruction is encountered, the address of the recompiled code must be looked up using the address of the MIPS code.  The majority of such instructions jump to the link register (r31) to return to the location following a JAL (call) instruction.&lt;br /&gt;
&lt;br /&gt;
When a JAL or JALR is executed, the address of the following instruction is inserted into a small 32-entry hash table, which is checked when a JR r31 instruction is executed.  This allows for a quick return from subroutine calls.&lt;br /&gt;
&lt;br /&gt;
If the JR instruction uses a register other than r31, or the small hash table lookup fails to find a match, a larger 131072-entry hash table is checked.  This table contains 65536 bins with up to 2 addresses per bin.  If this also fails to find a match (which occurs less than 1% of the time) an exhaustive search of all compiled addresses within that 4K memory page is performed.&lt;br /&gt;
&lt;br /&gt;
If no match is found by any of these methods, the target address is compiled, and the new address is inserted into the hash table.&lt;br /&gt;
&lt;br /&gt;
==Cycle counting==&lt;br /&gt;
&lt;br /&gt;
Cycles are counted before each branch by adding the cycles from the preceding instructions to a dedicated register.  The cycle count is kept in R10 on ARM and ESI on x86.  The value in this register is normally negative; when it reaches zero or becomes positive, a conditional branch is taken which jumps to an interrupt handler.&lt;br /&gt;
&lt;br /&gt;
For example, the following x86 code adds eight cycles:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add $8,%esi&lt;br /&gt;
jns interrupt_handler&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The conditional branch jumps to a small bit of code located after the main compiled block, which saves the cached registers to the register file, sets the instruction pointer which will be used upon return from the interrupt, and then calls cc_interrupt.&lt;br /&gt;
&lt;br /&gt;
As in the original mupen64plus, the emulated clock runs at 37.5 MHz, and each instruction takes 2 clock cycles.&lt;br /&gt;
&lt;br /&gt;
==Delay slots==&lt;br /&gt;
&lt;br /&gt;
MIPS has 'delay slots', where the instruction after the branch is executed before the branch is taken.  Instructions in delay slots are issued out-of-order in the recompiled code.&lt;br /&gt;
&lt;br /&gt;
[[Image:Recompiler delay slot reordering.png]]&lt;br /&gt;
&lt;br /&gt;
When a branch jumps into the delay slot of another branch, this case must be handled slightly differently:&lt;br /&gt;
&lt;br /&gt;
[[Image:Recompiler delay slot reordering 2.png]]&lt;br /&gt;
&lt;br /&gt;
The branch test and delay slot are executed in-order if a dependency exists, or for 'likely' branches where the delay slot is nullified when the branch condition is false.  These cases are infrequent (typically less than 10% of branches).&lt;br /&gt;
&lt;br /&gt;
==Constant propagation==&lt;br /&gt;
&lt;br /&gt;
When an instruction loads a constant into a register, the register is tagged as a constant.  The constant tag will be retained if subsequent instructions modify the constant using other constants.  During assembly, such a sequence of instructions is combined into a single load.  For example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
LUI r8,12340000  --&amp;gt;  mov $0x12345678,%eax&lt;br /&gt;
ORI r8,r8,5678&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This optimization is not performed when a branch target intervenes between the instructions, e.g.:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
 ...&lt;br /&gt;
 LUI r8,12340000  --&amp;gt;  mov $0x12340000,%eax&lt;br /&gt;
L1:&lt;br /&gt;
 ORI r8,r8,5678   --&amp;gt;  or $0x5678,%eax&lt;br /&gt;
 ...&lt;br /&gt;
 BEQ r0,r0,L1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Registers containing constants are identified by bits in the isconst and wasconst fields of the regstat structure.  The wasconst bit is set for a register if the register contained a known constant before the instruction, and the isconst bit is set for a register if the register will contain a known constant after the instruction executes.&lt;br /&gt;
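The tagging can be sketched in Python (an illustrative model, not the actual regstat data structures):&lt;br /&gt;

```python
# Sketch of constant propagation across a LUI/ORI pair: a register is
# tagged as constant when its value is known, and an ORI whose source is
# constant keeps the tag, so the assembler can emit one 32-bit load.

def propagate(instructions):
    """Track which registers hold known constants (illustrative names)."""
    const = {}                                # reg -> known 32-bit value
    for op, rd, rs, imm in instructions:
        if op == "LUI":                       # rd = imm << 16: now constant
            const[rd] = (imm << 16) & 0xFFFFFFFF
        elif op == "ORI" and rs in const:     # constant | constant: tag kept
            const[rd] = const[rs] | (imm & 0xFFFF)
        else:                                 # value no longer known
            const.pop(rd, None)
    return const

# The LUI/ORI pair from the example folds to one 32-bit constant, which
# the assembler could emit as a single 'mov $0x12345678,%eax'.
block = [("LUI", "r8", None, 0x1234), ("ORI", "r8", "r8", 0x5678)]
assert propagate(block) == {"r8": 0x12345678}
```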
&lt;br /&gt;
==Translation lookaside buffer emulation==&lt;br /&gt;
&lt;br /&gt;
Most Nintendo 64 games do not use virtual memory, but some do.  At startup, main memory is directly mapped at addresses from 0x80000000 to 0x803FFFFF, or up to 0x807FFFFF if the memory expansion is used.&lt;br /&gt;
&lt;br /&gt;
Normally, read and write operations are checked against this range; if the address falls outside it, control is passed to the appropriate I/O handler.  This is done as follows:&lt;br /&gt;
&lt;br /&gt;
MIPS instruction: LW r1,8(r2)&lt;br /&gt;
&lt;br /&gt;
ARM code:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add r1, r2, #8&lt;br /&gt;
cmp r1, #8388608&lt;br /&gt;
bvc handler&lt;br /&gt;
ldr r1, [r1]&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If there are valid entries in the TLB, this would instead be compiled as follows:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add r1, r2, #8&lt;br /&gt;
mov r0, #264&lt;br /&gt;
add r0, r0, r1, lsr #12&lt;br /&gt;
ldr r0, [r11, r0, lsl #2]&lt;br /&gt;
tst r0, r0&lt;br /&gt;
bmi handler&lt;br /&gt;
ldr r1, [r1, r0, lsl #2]&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This looks up the offset in the memory_map table (which, in this example, is located at r11+264*4).  The high bit is tested to determine whether a valid mapping exists for this page.&lt;br /&gt;
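The lookup can be modeled in Python (a sketch only; the function and table names are illustrative, and the real table lives at a fixed offset from r11 as described above):&lt;br /&gt;

```python
# Model of the generated TLB lookup: one word offset per 4K page, with
# the high bit marking an unmapped page ('bmi handler' in the ARM code).

PAGE_SHIFT = 12

def tlb_load(addr, memory_map, read_word):
    """Translate a guest address via memory_map and load a word."""
    entry = memory_map[addr >> PAGE_SHIFT]
    if entry & 0x80000000:              # high bit set: no valid mapping
        return None                     # I/O or page-fault handler path
    # Host address = guest address + entry*4 ('ldr r1,[r1,r0,lsl #2]')
    return read_word(addr + (entry << 2))

# Identity mapping (offset 0) for one page, plus one unmapped page.
memory = {0x80000008: 42}
page_map = {0x80000008 >> PAGE_SHIFT: 0, 0: 0x80000000}
assert tlb_load(0x80000008, page_map, memory.get) == 42
assert tlb_load(0x00000000, page_map, memory.get) is None
```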
&lt;br /&gt;
==Page fault emulation==&lt;br /&gt;
&lt;br /&gt;
If a memory access references an invalid page, a conditional branch is taken to a bit of code placed after the main block.  This code saves any modified cached registers and calls the appropriate handler function for the address.  The handler will either perform I/O or generate an exception (page fault).&lt;br /&gt;
&lt;br /&gt;
==Mapping executable pages==&lt;br /&gt;
&lt;br /&gt;
If the dynamic recompiler encounters code which is not in contiguous physical memory, it will end the block at the page boundary, so that the block can be removed cleanly if the mapping is changed.&lt;br /&gt;
&lt;br /&gt;
There is a special case of this, where a branch and delay slot span two pages.  If a branch instruction is the last instruction in a virtual memory page, it is compiled in a different manner than other branches.  The branch condition is evaluated, and the target address is placed in a register (%ebp on x86, and r8 on ARM).  A special form of the dynamic linker (dyna_linker_ds) is used to link the branch to its corresponding delay slot in another block.  If no page fault occurs, the delay slot executes and then jumps to the address in the register.  For conditional branches that are not taken, the target address is the next instruction.  This code is generated by the pagespan_assemble function.&lt;br /&gt;
&lt;br /&gt;
== Compile options ==&lt;br /&gt;
&lt;br /&gt;
=== ARMv5_ONLY ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the UXTH instruction is not used, and the dynamic recompiler will generate literal pools instead of using movw/movt.  This provides compatibility with older processors, but generates somewhat less efficient code.&lt;br /&gt;
&lt;br /&gt;
=== CORTEX_A8_BRANCH_PREDICTION_HACK ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the dynamic recompiler will avoid generating consecutive branch instructions without another instruction in between.  This avoids a possible branch misprediction on the Cortex-A8 due to this processor having dual instruction decoders, but only one branch-prediction unit.  See [[Assembly Code Optimization]] for details.&lt;br /&gt;
&lt;br /&gt;
=== USE_MINI_HT ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the recompiler looks up return addresses in a small hash table before checking the larger hash table.  This usually improves performance.&lt;br /&gt;
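The two-level lookup can be sketched as follows (the table size and hash function here are assumptions for illustration; the real implementation differs):&lt;br /&gt;

```python
# Illustrative two-level hash lookup: a tiny direct-mapped table is
# probed first; on a miss we fall back to the full hash table and cache
# the result.  The size (32) and hash below are made up, not real code.

MINI_HT_SIZE = 32

def lookup(vaddr, mini_ht, full_ht):
    slot = (vaddr >> 2) % MINI_HT_SIZE      # hypothetical hash
    tag, ptr = mini_ht[slot]
    if tag == vaddr:                        # mini-HT hit: fast path
        return ptr
    ptr = full_ht.get(vaddr)                # slow path: big table
    if ptr is not None:
        mini_ht[slot] = (vaddr, ptr)        # cache for next time
    return ptr

mini = [(None, None)] * MINI_HT_SIZE
full = {0x80001234: "compiled_block"}
assert lookup(0x80001234, mini, full) == "compiled_block"
assert mini[(0x80001234 >> 2) % MINI_HT_SIZE] == (0x80001234, "compiled_block")
```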
&lt;br /&gt;
=== IMM_PREFETCH ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the x86 PREFETCH instruction is used to prefetch entries from the hash table.  The increase in code size often outweighs the benefit of this.&lt;br /&gt;
&lt;br /&gt;
=== REG_PREFETCH ===&lt;br /&gt;
&lt;br /&gt;
Similar to the above, but loads the address into a register first, then uses the ARM PLD instruction.  The increase in code size almost always outweighs the benefit of this.&lt;br /&gt;
&lt;br /&gt;
=== R29_HACK ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the recompiler assumes that the stack pointer (r29) always holds a valid memory address and does not check it.  This is similar to the optimization described [http://strmnnrmn.blogspot.com/2007/08/interesting-dynarec-hack.html here].  It can crash the emulator and is not enabled by default.&lt;br /&gt;
&lt;br /&gt;
==Future improvements==&lt;br /&gt;
&lt;br /&gt;
===Copy propagation/offset propagation===&lt;br /&gt;
&lt;br /&gt;
A common instruction sequence is of the form:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
LUI r9,12340000&lt;br /&gt;
ADD r9,r9,r8&lt;br /&gt;
LW r9,5678(r9)&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
It would be helpful to recognize this as a load from r8+12345678.  The current constant propagation code does not do so.&lt;br /&gt;
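What such offset propagation would compute can be shown in Python (sign extension matters because MIPS load displacements are signed 16-bit values):&lt;br /&gt;

```python
# The LUI/ADD/LW sequence above computes r8 + (0x12340000 + 0x5678).
# Folding the constants requires sign-extending the 16-bit LW offset.

def sign_extend16(x):
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def fold(lui_imm16, lw_offset):
    """Combined constant offset for a LUI/ADD/LW pattern (illustration)."""
    return ((lui_imm16 << 16) + sign_extend16(lw_offset)) & 0xFFFFFFFF

assert fold(0x1234, 0x5678) == 0x12345678
# A negative displacement folds correctly too:
assert fold(0x1234, 0x8000) == 0x12338000
```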
&lt;br /&gt;
===Unaligned memory access===&lt;br /&gt;
&lt;br /&gt;
A small improvement could be made by combining adjacent LWL/LWR instructions.  The potential gain from doing so is very limited because these instructions typically represent less than 1% of all memory accesses.&lt;br /&gt;
&lt;br /&gt;
===SLT/branch merging===&lt;br /&gt;
&lt;br /&gt;
A frequent occurrence in MIPS code is an SLT or SLTI instruction followed by a branch.  This is generated relatively inefficiently on x86 and ARM, first doing a compare and set, then testing this value and doing a conditional branch.  Doing only one comparison would save at least one instruction, and could potentially save up to three instructions if the liveness analysis reveals that the result of the SLT instruction is used nowhere else.&lt;br /&gt;
&lt;br /&gt;
While a potentially useful optimization, this approach presents several problems.  First, there are often additional instructions between the slt and the branch.  These must be checked to make sure they do not modify the relevant registers, as that would prevent reordering the instruction stream as desired.  Second, if the result of the slt is found to be live, but unmodified, on both paths of the branch, clean_registers will normally write this value before the branch, to avoid duplicating the writeback code on both paths.  This optimization would have to be removed if the slt were combined with the branch.&lt;br /&gt;
&lt;br /&gt;
===x86-64===&lt;br /&gt;
&lt;br /&gt;
Currently the x86-64 backend generates only 32-bit instructions.  Proper 64-bit code generation would improve performance.&lt;br /&gt;
&lt;br /&gt;
===PowerPC===&lt;br /&gt;
&lt;br /&gt;
It would be possible to add a PowerPC code generator to the dynamic recompiler.  Currently no one is working on this.  (The mupen64gc project is using a different codebase.)&lt;br /&gt;
&lt;br /&gt;
The following is a summary of the changes which would be necessary to add a PowerPC backend.&lt;br /&gt;
&lt;br /&gt;
The slt* instructions use conditional moves, which are unavailable on PowerPC.  A suitable alternative (such as moves from the condition register) would need to be used.&lt;br /&gt;
&lt;br /&gt;
The assembler can generate as much as 256K of code in a single block; however, conditional branches on PowerPC are limited to a +/-32K displacement.  It will be necessary to either restrict the block size, or insert jump tables in a manner similar to the literal pools on ARM.&lt;br /&gt;
&lt;br /&gt;
PowerPC generally relies on early branch resolution rather than statistical branch prediction.  Scheduling branch condition tests earlier may be advantageous.  (For example, the address bounds check could be done in address_generation, rather than just before the load or store.  Similarly it may be advantageous to test the branch condition and update the cycle count before executing the delay slot.)&lt;br /&gt;
&lt;br /&gt;
===MIPS===&lt;br /&gt;
&lt;br /&gt;
Recompiling MIPS into MIPS would be relatively straightforward; however, the current code generator has no facility for filling delay slots.  This capability would be required for efficient code generation.&lt;br /&gt;
&lt;br /&gt;
[[Category:Development]]&lt;/div&gt;</summary>
		<author><name>Ari64</name></author>
		
	</entry>
	<entry>
		<id>https://pandorawiki.org/index.php?title=Assembly_Code_Optimization&amp;diff=3744</id>
		<title>Assembly Code Optimization</title>
		<link rel="alternate" type="text/html" href="https://pandorawiki.org/index.php?title=Assembly_Code_Optimization&amp;diff=3744"/>
		<updated>2010-09-23T20:32:13Z</updated>

		<summary type="html">&lt;p&gt;Ari64: /* Branch Prediction */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Assembly code optimization on the Cortex-A8 ==&lt;br /&gt;
This guide presents specific optimization techniques for the ARM Cortex-A8 processor and its dual-issue, in-order pipeline.&lt;br /&gt;
&lt;br /&gt;
== Use the ARMv7 movw/movt instructions ==&lt;br /&gt;
Newer ARM processors allow loading 32-bit values as two 16-bit immediates.  The movw instruction loads the lower 16 bits, and movt loads the upper 16 bits.  The movw instruction clears the upper 16 bits, so that 16-bit values can be loaded using a single instruction.  The movt instruction does not affect the lower bits.&lt;br /&gt;
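The effect of the two instructions can be illustrated in Python:&lt;br /&gt;

```python
# How a 32-bit constant is built from a movw/movt pair: movw sets the
# low half and clears the top; movt replaces only the high half.

def movw(imm16):
    return imm16 & 0xFFFF                       # upper 16 bits cleared

def movt(reg, imm16):
    return (reg & 0xFFFF) | ((imm16 & 0xFFFF) << 16)

value = 0xDEADBEEF
r0 = movw(value & 0xFFFF)      # movw r0, #0xBEEF
r0 = movt(r0, value >> 16)     # movt r0, #0xDEAD
assert r0 == 0xDEADBEEF

# A 16-bit value needs only the movw:
assert movw(0x1234) == 0x1234
```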
&lt;br /&gt;
On older ARM processors, it was common to load 32-bit values with a PC-relative load.  This should be avoided because it may result in a cache miss.&lt;br /&gt;
&lt;br /&gt;
== Branch Prediction ==&lt;br /&gt;
There is a one-cycle stall in instruction fetch when a branch is predicted taken.  It is therefore preferable to structure code so that the most likely code path is the one where the branch is not taken.&lt;br /&gt;
&lt;br /&gt;
Unconditional branches may be mispredicted, and load instructions which follow branches may be decoded and cause cache accesses.  Avoid placing load instructions after branches unless you intend the CPU to prefetch the addresses they reference.&lt;br /&gt;
&lt;br /&gt;
The branch predictor has a call/return stack for jumps which reference r14 or stack operations using r13 which load the program counter.  For best performance, make sure any pop, mov pc,lr or bx lr jumps to the same location as was set by the matching bl instruction.  For non-return jumps, use a register other than r14.&lt;br /&gt;
&lt;br /&gt;
Although instructions are decoded in pairs, the branch predictor can only predict one branch target per cycle.  If you have a conditional branch which is immediately followed by another branch, and the first branch is likely to be taken, place a no-operation instruction between the branches to prevent decoding and prediction of the second branch.  Inserting NOPs is detrimental if the first branch is infrequently taken.&lt;br /&gt;
&lt;br /&gt;
== Dual-Issue Restrictions ==&lt;br /&gt;
Only one branch instruction can issue per cycle.  Only one load or store instruction can issue per cycle.  Instructions which write to the same register can not issue together.&lt;br /&gt;
&lt;br /&gt;
There is a one-cycle delay before any written register can be used as the address of a subsequent load or store instruction.  There is a one-cycle delay before any written register can be used by an instruction which performs a shift or rotation on that register.&lt;br /&gt;
&lt;br /&gt;
There is a one-cycle delay before the result of a load can be used.  In the aforementioned cases involving subsequent shift or rotate instructions, or the address of a load or store instruction, the total delay is two cycles.  &lt;br /&gt;
&lt;br /&gt;
Stores occur at the end of the pipeline and store instructions can issue in the same cycle as another instruction which writes the same register.  Similarly, conditional branches can issue in the same cycle as flag-setting instructions because the branch is not resolved until the following cycle.  In all other cases, instructions can not issue together if the second instruction depends on the results of the first.&lt;br /&gt;
&lt;br /&gt;
== Instruction pairing restrictions following a branch ==&lt;br /&gt;
The Cortex-A8 processor normally fetches two instructions per cycle and the branch predictor tests each against the global history buffer.  However, there is only one entry in the branch target buffer for each pair of instructions.  Therefore, branch instructions, and certain instructions which resemble branches, may adversely affect branch prediction when present in pairs.  To reduce the risk of branch misprediction, avoid pairing branch instructions with the following:&lt;br /&gt;
&lt;br /&gt;
* Branch instructions&lt;br /&gt;
* Load instructions, whether or not they write r15&lt;br /&gt;
* Arithmetic/logic instructions which do not set flags and do not have an immediate value, whether or not they write r15&lt;br /&gt;
&lt;br /&gt;
Flag-setting instructions can be used to avoid this restriction.  Additionally, MOV instructions that do not have immediate values can be replaced with ADD, SUB, ORR, or EOR instructions using zero as an immediate value.&lt;br /&gt;
&lt;br /&gt;
One exception to this rule is that placing such an instruction (eg mov r0,r0) following an unconditional branch may reduce the severity of a misprediction, due to the second instruction possibly being incorrectly predicted as a branch, but resulting in the correct target being retrieved from the BTB.&lt;br /&gt;
&lt;br /&gt;
== Code alignment ==&lt;br /&gt;
Code alignment should be used with caution.  Alignment has the potential to increase performance, but may be detrimental in some circumstances.&lt;br /&gt;
&lt;br /&gt;
Because the instruction decoder always fetches 64-bit aligned words from the level-1 instruction cache, aligning code can improve instruction fetch and decode.  However, excessive code alignment can result in a suboptimal distribution of entries in the branch history tables and increase branch misprediction.  Additionally, the speculative prefetch may retrieve and decode instructions used as padding, even if those instructions are never executed.  The types of instructions used as padding will affect the instruction decoder as stated above, and this may reduce the prefetch of code into the instruction cache.&lt;br /&gt;
&lt;br /&gt;
[[Category:Development]]&lt;/div&gt;</summary>
		<author><name>Ari64</name></author>
		
	</entry>
	<entry>
		<id>https://pandorawiki.org/index.php?title=Assembly_Code_Optimization&amp;diff=3740</id>
		<title>Assembly Code Optimization</title>
		<link rel="alternate" type="text/html" href="https://pandorawiki.org/index.php?title=Assembly_Code_Optimization&amp;diff=3740"/>
		<updated>2010-09-23T02:30:09Z</updated>

		<summary type="html">&lt;p&gt;Ari64: /* Branch Prediction */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Assembly code optimization on the Cortex-A8 ==&lt;br /&gt;
This guide presents specific optimization techniques for the ARM Cortex-A8 processor and its dual-issue, in-order pipeline.&lt;br /&gt;
&lt;br /&gt;
== Use the ARMv7 movw/movt instructions ==&lt;br /&gt;
Newer ARM processors allow loading 32-bit values as two 16-bit immediates.  The movw instruction loads the lower 16 bits, and movt loads the upper 16 bits.  The movw instruction clears the upper 16 bits, so that 16-bit values can be loaded using a single instruction.  The movt instruction does not affect the lower bits.&lt;br /&gt;
&lt;br /&gt;
On older ARM processors, it was common to load 32-bit values with a PC-relative load.  This should be avoided because it may result in a cache miss.&lt;br /&gt;
&lt;br /&gt;
== Branch Prediction ==&lt;br /&gt;
There is a one-cycle stall in instruction fetch when a branch is predicted taken.  It is therefore preferable to structure code so that the most likely code path is the one where the branch is not taken.&lt;br /&gt;
&lt;br /&gt;
Unconditional branches may be mispredicted, and load instructions which follow branches may be decoded and cause cache accesses.  Avoid placing load instructions after branches unless you intend the CPU to prefetch the addresses they reference.&lt;br /&gt;
&lt;br /&gt;
Branches which have never been taken (BTB miss) are ignored, and the branch history buffers are not updated.&lt;br /&gt;
&lt;br /&gt;
The branch predictor has a call/return stack for jumps which reference r14 or stack operations using r13 which load the program counter.  For best performance, make sure any pop, mov pc,lr or bx lr jumps to the same location as was set by the matching bl instruction.  For non-return jumps, use a register other than r14.&lt;br /&gt;
&lt;br /&gt;
Although instructions are decoded in pairs, the branch predictor can only predict one branch target per cycle.  If you have a conditional branch which is immediately followed by another branch, and the first branch is likely to be taken, place a no-operation instruction between the branches to prevent decoding and prediction of the second branch.  Inserting NOPs is detrimental if the first branch is not frequently taken.&lt;br /&gt;
&lt;br /&gt;
== Dual-Issue Restrictions ==&lt;br /&gt;
Only one branch instruction can issue per cycle.  Only one load or store instruction can issue per cycle.  Instructions which write to the same register can not issue together.&lt;br /&gt;
&lt;br /&gt;
There is a one-cycle delay before any written register can be used as the address of a subsequent load or store instruction.  There is a one-cycle delay before any written register can be used by an instruction which performs a shift or rotation on that register.&lt;br /&gt;
&lt;br /&gt;
There is one cycle delay before the result of a load can be used.  In the aforementioned cases involving subsequent shift or rotate instructions, or the address of a load or store instruction, the total delay is two cycles.  &lt;br /&gt;
&lt;br /&gt;
Stores occur at the end of the pipeline and store instructions can issue in the same cycle as another instruction which writes the same register.  Similarly, conditional branches can issue in the same cycle as flag-setting instructions because the branch is not resolved until the following cycle.  In all other cases, instructions can not issue together if the second instruction depends on the results of the first.&lt;br /&gt;
&lt;br /&gt;
== Instruction pairing restrictions following a branch ==&lt;br /&gt;
The Cortex-A8 processor normally fetches two instructions per cycle and the branch predictor tests each against the global history buffer.  However, there is only one entry in the branch target buffer for each pair of instructions.  Therefore, branch instructions, and certain instructions which resemble branches, may adversely affect branch prediction when present in pairs.  To reduce the risk of branch misprediction, avoid pairing branch instructions with the following:&lt;br /&gt;
&lt;br /&gt;
* Branch instructions&lt;br /&gt;
* Load instructions, whether or not they write r15&lt;br /&gt;
* Arithmetic/logic instructions which do not set flags and do not have an immediate value, whether or not they write r15&lt;br /&gt;
&lt;br /&gt;
Flag-setting instructions can be used to avoid this restriction.  Additionally, MOV instructions that do not have immediate values can be replaced with ADD, SUB, ORR, or EOR instructions using zero as an immediate value.&lt;br /&gt;
&lt;br /&gt;
One exception to this rule is that placing such an instruction (eg mov r0,r0) following an unconditional branch may reduce the severity of a misprediction, due to the second instruction possibly being incorrectly predicted as a branch, but resulting in the correct target being retrieved from the BTB.&lt;br /&gt;
&lt;br /&gt;
== Code alignment ==&lt;br /&gt;
Code alignment should be used with caution.  Alignment has the potential to increase performance, but may be detrimental in some circumstances.&lt;br /&gt;
&lt;br /&gt;
Because the instruction decoder always fetches 64-bit aligned words from the level-1 instruction cache, aligning code can improve instruction fetch and decode.  However, excessive code alignment can result in a suboptimal distribution of entries in the branch history tables and increase branch misprediction.  Additionally, the speculative prefetch may retrieve and decode instructions used as padding, even if those instructions are never executed.  The types of instructions used as padding will affect the instruction decoder as stated above, and this may reduce the prefetch of code into the instruction cache.&lt;br /&gt;
&lt;br /&gt;
[[Category:Development]]&lt;/div&gt;</summary>
		<author><name>Ari64</name></author>
		
	</entry>
	<entry>
		<id>https://pandorawiki.org/index.php?title=Assembly_Code_Optimization&amp;diff=3739</id>
		<title>Assembly Code Optimization</title>
		<link rel="alternate" type="text/html" href="https://pandorawiki.org/index.php?title=Assembly_Code_Optimization&amp;diff=3739"/>
		<updated>2010-09-22T20:32:46Z</updated>

		<summary type="html">&lt;p&gt;Ari64: /* Instruction pairing restrictions following a branch */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Assembly code optimization on the Cortex-A8 ==&lt;br /&gt;
This guide presents specific optimization techniques for the ARM Cortex-A8 processor and its dual-issue, in-order pipeline.&lt;br /&gt;
&lt;br /&gt;
== Use the ARMv7 movw/movt instructions ==&lt;br /&gt;
Newer ARM processors allow loading 32-bit values as two 16-bit immediates.  The movw instruction loads the lower 16 bits, and movt loads the upper 16 bits.  The movw instruction clears the upper 16 bits, so that 16-bit values can be loaded using a single instruction.  The movt instruction does not affect the lower bits.&lt;br /&gt;
&lt;br /&gt;
On older ARM processors, it was common to load 32-bit values with a PC-relative load.  This should be avoided because it may result in a cache miss.&lt;br /&gt;
&lt;br /&gt;
== Branch Prediction ==&lt;br /&gt;
There is a one-cycle stall in instruction fetch when a branch is predicted taken.  It is therefore preferable to structure code so that the most likely code path is the one where the branch is not taken.&lt;br /&gt;
&lt;br /&gt;
Unconditional branches may be mispredicted, and load instructions which follow branches may be decoded and cause cache accesses.  Avoid placing load instructions after branches unless you intend the CPU to prefetch the addresses they reference.&lt;br /&gt;
&lt;br /&gt;
The branch predictor has a call/return stack for jumps which reference r14 or stack operations using r13 which load the program counter.  For best performance, make sure any pop, mov pc,lr or bx lr jumps to the same location as was set by the matching bl instruction.  For non-return jumps, use a register other than r14.&lt;br /&gt;
&lt;br /&gt;
Although instructions are decoded in pairs, the branch predictor can only predict one branch per cycle.  If you have a conditional branch which is immediately followed by another branch, and the first branch is very likely to be taken, place a no-operation instruction between the branches to prevent decoding and prediction of the second branch.  Inserting NOPs is detrimental if the first branch is not frequently taken.&lt;br /&gt;
&lt;br /&gt;
== Dual-Issue Restrictions ==&lt;br /&gt;
Only one branch instruction can issue per cycle.  Only one load or store instruction can issue per cycle.  Instructions which write to the same register can not issue together.&lt;br /&gt;
&lt;br /&gt;
There is a one-cycle delay before any written register can be used as the address of a subsequent load or store instruction.  There is a one-cycle delay before any written register can be used by an instruction which performs a shift or rotation on that register.&lt;br /&gt;
&lt;br /&gt;
There is one cycle delay before the result of a load can be used.  In the aforementioned cases involving subsequent shift or rotate instructions, or the address of a load or store instruction, the total delay is two cycles.  &lt;br /&gt;
&lt;br /&gt;
Stores occur at the end of the pipeline and store instructions can issue in the same cycle as another instruction which writes the same register.  Similarly, conditional branches can issue in the same cycle as flag-setting instructions because the branch is not resolved until the following cycle.  In all other cases, instructions can not issue together if the second instruction depends on the results of the first.&lt;br /&gt;
&lt;br /&gt;
== Instruction pairing restrictions following a branch ==&lt;br /&gt;
The Cortex-A8 processor normally fetches two instructions per cycle and the branch predictor tests each against the global history buffer.  However, there is only one entry in the branch target buffer for each pair of instructions.  Therefore, branch instructions, and certain instructions which resemble branches, may adversely affect branch prediction when present in pairs.  To reduce the risk of branch misprediction, avoid pairing branch instructions with the following:&lt;br /&gt;
&lt;br /&gt;
* Branch instructions&lt;br /&gt;
* Load instructions, whether or not they write r15&lt;br /&gt;
* Arithmetic/logic instructions which do not set flags and do not have an immediate value, whether or not they write r15&lt;br /&gt;
&lt;br /&gt;
Flag-setting instructions can be used to avoid this restriction.  Additionally, MOV instructions that do not have immediate values can be replaced with ADD, SUB, ORR, or EOR instructions using zero as an immediate value.&lt;br /&gt;
&lt;br /&gt;
One exception to this rule is that placing such an instruction (eg mov r0,r0) following an unconditional branch may reduce the severity of a misprediction, due to the second instruction possibly being incorrectly predicted as a branch, but resulting in the correct target being retrieved from the BTB.&lt;br /&gt;
&lt;br /&gt;
== Code alignment ==&lt;br /&gt;
Code alignment should be used with caution.  Alignment has the potential to increase performance, but may be detrimental in some circumstances.&lt;br /&gt;
&lt;br /&gt;
Because the instruction decoder always fetches 64-bit aligned words from the level-1 instruction cache, aligning code can improve instruction fetch and decode.  However, excessive code alignment can result in a suboptimal distribution of entries in the branch history tables and increase branch misprediction.  Additionally, the speculative prefetch may retrieve and decode instructions used as padding, even if those instructions are never executed.  The types of instructions used as padding will affect the instruction decoder as stated above, and this may reduce the prefetch of code into the instruction cache.&lt;br /&gt;
&lt;br /&gt;
[[Category:Development]]&lt;/div&gt;</summary>
		<author><name>Ari64</name></author>
		
	</entry>
	<entry>
		<id>https://pandorawiki.org/index.php?title=Assembly_Code_Optimization&amp;diff=3735</id>
		<title>Assembly Code Optimization</title>
		<link rel="alternate" type="text/html" href="https://pandorawiki.org/index.php?title=Assembly_Code_Optimization&amp;diff=3735"/>
		<updated>2010-09-19T20:52:34Z</updated>

		<summary type="html">&lt;p&gt;Ari64: /* Branch Prediction */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Assembly code optimization on the Cortex-A8 ==&lt;br /&gt;
This guide presents specific optimization techniques for the ARM Cortex-A8 processor and its dual-issue, in-order pipeline.&lt;br /&gt;
&lt;br /&gt;
== Use the ARMv7 movw/movt instructions ==&lt;br /&gt;
Newer ARM processors allow loading 32-bit values as two 16-bit immediates.  The movw instruction loads the lower 16 bits, and movt loads the upper 16 bits.  The movw instruction clears the upper 16 bits, so that 16-bit values can be loaded using a single instruction.  The movt instruction does not affect the lower bits.&lt;br /&gt;
&lt;br /&gt;
On older ARM processors, it was common to load 32-bit values with a PC-relative load.  This should be avoided because it may result in a cache miss.&lt;br /&gt;
&lt;br /&gt;
== Branch Prediction ==&lt;br /&gt;
There is a one-cycle stall in instruction fetch when a branch is predicted taken.  It is therefore preferable to structure code so that the most likely code path is the one where the branch is not taken.&lt;br /&gt;
&lt;br /&gt;
Unconditional branches may be mispredicted, and load instructions which follow branches may be decoded and cause cache accesses.  Avoid placing load instructions after branches unless you intend the CPU to prefetch the addresses they reference.&lt;br /&gt;
&lt;br /&gt;
The branch predictor has a call/return stack for jumps which reference r14 or stack operations using r13 which load the program counter.  For best performance, make sure any pop, mov pc,lr or bx lr jumps to the same location as was set by the matching bl instruction.  For non-return jumps, use a register other than r14.&lt;br /&gt;
&lt;br /&gt;
Although instructions are decoded in pairs, the branch predictor can only predict one branch per cycle.  If you have a conditional branch which is immediately followed by another branch, and the first branch is very likely to be taken, place a no-operation instruction between the branches to prevent decoding and prediction of the second branch.  Inserting NOPs is detrimental if the first branch is not frequently taken.&lt;br /&gt;
&lt;br /&gt;
== Dual-Issue Restrictions ==&lt;br /&gt;
Only one branch instruction can issue per cycle.  Only one load or store instruction can issue per cycle.  Instructions which write to the same register can not issue together.&lt;br /&gt;
&lt;br /&gt;
There is a one-cycle delay before any written register can be used as the address of a subsequent load or store instruction.  There is a one-cycle delay before any written register can be used by an instruction which performs a shift or rotation on that register.&lt;br /&gt;
&lt;br /&gt;
There is one cycle delay before the result of a load can be used.  In the aforementioned cases involving subsequent shift or rotate instructions, or the address of a load or store instruction, the total delay is two cycles.  &lt;br /&gt;
&lt;br /&gt;
Stores occur at the end of the pipeline and store instructions can issue in the same cycle as another instruction which writes the same register.  Similarly, conditional branches can issue in the same cycle as flag-setting instructions because the branch is not resolved until the following cycle.  In all other cases, instructions can not issue together if the second instruction depends on the results of the first.&lt;br /&gt;
&lt;br /&gt;
== Instruction pairing restrictions following a branch ==&lt;br /&gt;
The Cortex-A8 processor normally decodes two instructions per cycle, but branch instructions, and certain instructions which resemble branches, may adversely affect the branch predictor when decoded in pairs.  These instructions can still issue and execute in parallel when branches are correctly predicted and there are sufficient instructions in the queue.  However, when a branch is mispredicted, the following instructions may cause an additional delay when two such instructions are paired:&lt;br /&gt;
&lt;br /&gt;
* Branch instructions&lt;br /&gt;
* Load instructions, whether or not they reference r15&lt;br /&gt;
* Arithmetic/logic instructions which do not set flags and do not have an immediate value.&lt;br /&gt;
&lt;br /&gt;
Flag-setting instructions can be used to avoid this restriction.  Additionally, MOV instructions that do not have immediate values can be replaced with ADD, SUB, ORR, or EOR instructions using zero as an immediate value.&lt;br /&gt;
&lt;br /&gt;
Intentionally placing a series of such instructions, such as mov r0,r0, following an unconditional branch may reduce unwanted code prefetch.&lt;br /&gt;
&lt;br /&gt;
== Code alignment ==&lt;br /&gt;
Code alignment should be used with caution.  Alignment has the potential to increase performance, but may be detrimental in some circumstances.&lt;br /&gt;
&lt;br /&gt;
Because the instruction decoder always fetches 64-bit aligned words from the level-1 instruction cache, aligning code can improve instruction fetch and decode.  However, excessive code alignment can result in a suboptimal distribution of entries in the branch history tables and increase branch misprediction.  Additionally, the speculative prefetch may retrieve and decode instructions used as padding, even if those instructions are never executed.  The types of instructions used as padding will affect the instruction decoder as stated above, and this may reduce the prefetch of code into the instruction cache.&lt;br /&gt;
&lt;br /&gt;
[[Category:Development]]&lt;/div&gt;</summary>
		<author><name>Ari64</name></author>
		
	</entry>
	<entry>
		<id>https://pandorawiki.org/index.php?title=Assembly_Code_Optimization&amp;diff=3734</id>
		<title>Assembly Code Optimization</title>
		<link rel="alternate" type="text/html" href="https://pandorawiki.org/index.php?title=Assembly_Code_Optimization&amp;diff=3734"/>
		<updated>2010-09-19T19:49:49Z</updated>

		<summary type="html">&lt;p&gt;Ari64: /* Instruction pairing restrictions following a branch */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Assembly code optimization on the Cortex-A8 ==&lt;br /&gt;
This guide presents specific optimization techniques for the ARM Cortex-A8 processor and its dual-issue, in-order pipeline.&lt;br /&gt;
&lt;br /&gt;
== Use the ARMv7 movw/movt instructions ==&lt;br /&gt;
Newer ARM processors allow loading 32-bit values as two 16-bit immediates.  The movw instruction loads the lower 16 bits, and movt loads the upper 16 bits.  The movw instruction clears the upper 16 bits, so that 16-bit values can be loaded using a single instruction.  The movt instruction does not affect the lower bits.&lt;br /&gt;
&lt;br /&gt;
On older ARM processors, it was common to load 32-bit values with a PC-relative load.  This should be avoided because it may result in a cache miss.&lt;br /&gt;
&lt;br /&gt;
== Branch Prediction ==&lt;br /&gt;
Branches which have not been seen before are predicted not taken.  It is therefore preferable to structure code so that the most likely code path is the one where the branch is not taken.&lt;br /&gt;
&lt;br /&gt;
Unconditional branches may be mispredicted, and load instructions which follow branches may be decoded and cause cache accesses.  Avoid placing load instructions after branches unless you intend the CPU to prefetch the addresses they reference.&lt;br /&gt;
&lt;br /&gt;
The branch predictor includes a call/return stack, used for jumps which reference r14 and for stack operations on r13 which load the program counter.  For best performance, make sure that each pop, mov pc,lr, or bx lr instruction returns to the address saved by the matching bl instruction.  For non-return jumps, use a register other than r14.&lt;br /&gt;
&lt;br /&gt;
Although instructions are decoded in pairs, the branch predictor can only predict one branch per cycle.  If you have a conditional branch which is immediately followed by another branch, and the first branch is very likely to be taken, place a no-operation instruction between the branches to prevent decoding and prediction of the second branch.  Inserting NOPs is detrimental if the first branch is not frequently taken.&lt;br /&gt;
&lt;br /&gt;
== Dual-Issue Restrictions ==&lt;br /&gt;
Only one branch instruction can issue per cycle.  Only one load or store instruction can issue per cycle.  Instructions which write to the same register can not issue together.&lt;br /&gt;
&lt;br /&gt;
There is a one-cycle delay before any written register can be used as the address of a subsequent load or store instruction.  There is a one-cycle delay before any written register can be used by an instruction which performs a shift or rotation on that register.&lt;br /&gt;
&lt;br /&gt;
There is a one-cycle delay before the result of a load can be used.  In the aforementioned cases involving subsequent shift or rotate instructions, or the address of a load or store instruction, the total delay is two cycles.&lt;br /&gt;
&lt;br /&gt;
Stores occur at the end of the pipeline and store instructions can issue in the same cycle as another instruction which writes the same register.  Similarly, conditional branches can issue in the same cycle as flag-setting instructions because the branch is not resolved until the following cycle.  In all other cases, instructions can not issue together if the second instruction depends on the results of the first.&lt;br /&gt;
&lt;br /&gt;
== Instruction pairing restrictions following a branch ==&lt;br /&gt;
The Cortex-A8 processor normally decodes two instructions per cycle, but branch instructions, and certain instructions which resemble branches, may adversely affect the branch predictor when decoded in pairs.  These instructions can still issue and execute in parallel when branches are correctly predicted and there are sufficient instructions in the queue.  However, when a branch is mispredicted, the following instructions may cause an additional delay when two such instructions are paired:&lt;br /&gt;
&lt;br /&gt;
* Branch instructions&lt;br /&gt;
* Load instructions, whether or not they reference r15&lt;br /&gt;
* Arithmetic/logic instructions which do not set flags and do not have an immediate value.&lt;br /&gt;
&lt;br /&gt;
Flag-setting instructions can be used to avoid this restriction.  Additionally, MOV instructions that do not have immediate values can be replaced with ADD, SUB, ORR, or EOR instructions using zero as an immediate value.&lt;br /&gt;
&lt;br /&gt;
Intentionally placing a series of such instructions, such as mov r0,r0, following an unconditional branch may reduce unwanted code prefetch.&lt;br /&gt;
&lt;br /&gt;
== Code alignment ==&lt;br /&gt;
Code alignment should be used with caution.  Alignment has the potential to increase performance, but may be detrimental in some circumstances.&lt;br /&gt;
&lt;br /&gt;
Because the instruction decoder always fetches 64-bit aligned words from the level-1 instruction cache, aligning code can improve instruction fetch and decode.  However, excessive code alignment can result in a suboptimal distribution of entries in the branch history tables and increase branch misprediction.  Additionally, the speculative prefetch may retrieve and decode instructions used as padding, even if those instructions are never executed.  The types of instructions used as padding will affect the instruction decoder as stated above, and this may reduce the prefetch of code into the instruction cache.&lt;br /&gt;
&lt;br /&gt;
[[Category:Development]]&lt;/div&gt;</summary>
		<author><name>Ari64</name></author>
		
	</entry>
	<entry>
		<id>https://pandorawiki.org/index.php?title=Mupen64plus_dynamic_recompiler&amp;diff=3431</id>
		<title>Mupen64plus dynamic recompiler</title>
		<link rel="alternate" type="text/html" href="https://pandorawiki.org/index.php?title=Mupen64plus_dynamic_recompiler&amp;diff=3431"/>
		<updated>2010-08-27T19:03:58Z</updated>

		<summary type="html">&lt;p&gt;Ari64: /* Translation lookaside buffer emulation */ typos&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes the dynamic recompiler in Mupen64plus, and the changes made for the ARM port.&lt;br /&gt;
&lt;br /&gt;
==The original dynamic recompiler by Hacktarux==&lt;br /&gt;
&lt;br /&gt;
The dynamic recompiler used in Mupen64plus v1.5 is based on the original written by Hacktarux in 2002.&lt;br /&gt;
&lt;br /&gt;
It recompiles contiguous blocks of MIPS instructions.  First, each instruction is decoded into a dynamically-allocated 132-byte data structure.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
typedef struct _precomp_instr&lt;br /&gt;
{&lt;br /&gt;
   void (*ops)();&lt;br /&gt;
   union&lt;br /&gt;
     {&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         long long int *rs;&lt;br /&gt;
         long long int *rt;&lt;br /&gt;
         short immediate;&lt;br /&gt;
      } i;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned int inst_index;&lt;br /&gt;
      } j;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         long long int *rs;&lt;br /&gt;
         long long int *rt;&lt;br /&gt;
         long long int *rd;&lt;br /&gt;
         unsigned char sa;&lt;br /&gt;
         unsigned char nrd;&lt;br /&gt;
      } r;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned char base;&lt;br /&gt;
         unsigned char ft;&lt;br /&gt;
         short offset;&lt;br /&gt;
      } lf;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned char ft;&lt;br /&gt;
         unsigned char fs;&lt;br /&gt;
         unsigned char fd;&lt;br /&gt;
      } cf;&lt;br /&gt;
     } f;&lt;br /&gt;
   unsigned int addr; /* word-aligned instruction address in r4300 address space */&lt;br /&gt;
   unsigned int local_addr; /* byte offset to start of corresponding x86_64 instructions, from start of code block */&lt;br /&gt;
   reg_cache_struct reg_cache_infos;&lt;br /&gt;
} precomp_instr;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The decoded instructions are then compiled, generating x86 instructions for each MIPS instruction.  A 32K block is allocated with malloc() to hold the x86 code.  If this size proves insufficient, it is incrementally resized with realloc().&lt;br /&gt;
&lt;br /&gt;
MIPS registers are allocated to x86 registers with a least-recently-used replacement policy.  All cached registers are written back prior to a branch, or a memory read or write.&lt;br /&gt;
&lt;br /&gt;
To facilitate invalidation and replacement of modified code blocks, each 4K page is compiled separately.  If a sequence of instructions crosses a 4K page boundary, the current block is ended, and the next instructions will be compiled as a separate block.&lt;br /&gt;
&lt;br /&gt;
Branch instructions within a 4K page are compiled as branches directly to the target address.  Branches which cross a 4K page go through an indirect address lookup.&lt;br /&gt;
&lt;br /&gt;
Compiled code blocks are invalidated on write.  On an actual MIPS CPU, the instruction cache is invalidated using the CACHE instruction; however, a few N64 games clear the cache using other methods.  Trapping writes appears to be the most reliable method of ensuring cache coherency in the emulated system.  The CACHE instruction is therefore ignored.&lt;br /&gt;
&lt;br /&gt;
==Problems with the original design==&lt;br /&gt;
&lt;br /&gt;
The most significant performance problem with this design is its excessive memory usage.  The decoded instruction data is retained (132 bytes for each MIPS instruction) and occasionally referenced during execution.  Memory accesses frequently miss the L2 cache, resulting in poor performance.&lt;br /&gt;
&lt;br /&gt;
Additionally, the register cache is relatively inefficient, since all registers are flushed before any read or write operation.&lt;br /&gt;
&lt;br /&gt;
==A new approach==&lt;br /&gt;
&lt;br /&gt;
To reduce memory usage, the new design allocates a single large block of memory (currently 16 MiB) which is used for recompiled code.  This memory is allocated using mmap with the PROT_EXEC bit set, to ensure operation on CPUs with no-execute (NX) page permissions.&lt;br /&gt;
&lt;br /&gt;
Contiguous blocks of MIPS instructions are compiled; the recompiler does not attempt to follow branches or trace 'hot paths'.&lt;br /&gt;
&lt;br /&gt;
The recompiler consists of eight stages, plus a linker, and a memory manager.&lt;br /&gt;
&lt;br /&gt;
Compiled blocks are invalidated on write, as before; however, because the compiler may cross a 4K page boundary, writes may invalidate adjacent pages as well.&lt;br /&gt;
&lt;br /&gt;
Currently the dynarec generates x86, x86-64, ARMv5, and ARMv7 (little-endian) code.  Most of the code is shared between the architectures, but a different code generator is selected at compile time via #include statements, depending on the CPU type.&lt;br /&gt;
&lt;br /&gt;
==Pass 1: Disassembly==&lt;br /&gt;
&lt;br /&gt;
When an instruction address is encountered which has not been compiled, the function new_recompile_block is called, with the (virtual) address of the target instruction as its sole parameter.  If the address is invalid, the function returns a nonzero value and the caller is responsible for handling the pagefault.&lt;br /&gt;
&lt;br /&gt;
Instructions are decoded until an unconditional jump, usually a return, is encountered.  Disassembly ordinarily continues past a JAL (subroutine call) instruction; however, this strategy is abandoned if invalid instructions are encountered.&lt;br /&gt;
&lt;br /&gt;
Up to 4096 instructions (16K of MIPS code) may be disassembled at once.  Surprisingly, some games do actually reach this limit.&lt;br /&gt;
&lt;br /&gt;
==Pass 2: Liveness analysis==&lt;br /&gt;
&lt;br /&gt;
After disassembly, liveness analysis is performed on the registers.  This determines when a particular register will no longer be used, and thus can be removed from the register cache.&lt;br /&gt;
&lt;br /&gt;
A separate analysis is done on the upper 32 bits of 64-bit registers.  This can determine when only the lower 32 bits of a register are significant, thus allowing use of a 32-bit register.  This enables more efficient code generation on 32-bit processors.&lt;br /&gt;
&lt;br /&gt;
==Pass 3: Register allocation==&lt;br /&gt;
&lt;br /&gt;
The 31 writable MIPS general-purpose registers (r0 is hardwired to zero) must be mapped onto seven available registers on x86, or twelve available registers on ARM.&lt;br /&gt;
&lt;br /&gt;
Instructions with 64-bit results require two registers on 32-bit host processors.  32-bit instructions require only one.  A flag is set for registers containing 32-bit values, and these registers will be sign-extended before they are written to the register file.&lt;br /&gt;
&lt;br /&gt;
Registers are made available using a least-soon-needed replacement policy.  When the register cache is full, and no registers can be eliminated using the liveness analysis, a ten-instruction lookahead is used to determine which registers will not be needed soon and these registers are evicted from the cache.&lt;br /&gt;
&lt;br /&gt;
==Pass 4: Free unused registers==&lt;br /&gt;
&lt;br /&gt;
After the initial register allocation, the allocations are reviewed to determine if any registers remain allocated longer than necessary.  These are then removed from the register cache.  This avoids having to write back a large number of registers just before a branch, and makes more registers available for the next pass.&lt;br /&gt;
&lt;br /&gt;
==Pass 5: Pre-allocate registers==&lt;br /&gt;
&lt;br /&gt;
If a register will be used soon and needs to be loaded, try to load it early if the register is available.  This improves execution on CPUs with a load-use penalty.&lt;br /&gt;
&lt;br /&gt;
==Pass 6: Optimize clean/dirty state==&lt;br /&gt;
&lt;br /&gt;
If a cached register is 'dirty' and needs to be written out, try to do so as soon as the register will no longer be modified.  This avoids having to write the same register on multiple code paths due to conditional branches.  Additionally, try to avoid writing out dirty registers inside of loops.&lt;br /&gt;
&lt;br /&gt;
==Pass 7: Identify where registers are assumed to be 32-bit==&lt;br /&gt;
&lt;br /&gt;
When a 64-bit register is mapped to a 32-bit register with the assumption that the value will be sign-extended before being used, it is necessary to ensure that no register contains a 64-bit value when branching to such a location.  These instructions are flagged to identify them as requiring 32-bit inputs.  This information is used by the linker, and the exception return (ERET) handler.&lt;br /&gt;
&lt;br /&gt;
==Pass 8: Assembly==&lt;br /&gt;
&lt;br /&gt;
This generates and outputs the recompiled code.&lt;br /&gt;
&lt;br /&gt;
Following the main code block, handlers for certain exceptions as well as alternate entry points are added.&lt;br /&gt;
&lt;br /&gt;
If a recompiled instruction relies on a certain MIPS register being cached in a certain native register, then a short 'stub' of code is generated to load the necessary registers.  When an instruction outside of this block needs to jump to that location, it will instead jump to the stub.  The necessary registers will be loaded, and then it will jump into the main code sequence.&lt;br /&gt;
&lt;br /&gt;
On architectures which require literal pools (ARMv5) these are inserted as necessary.&lt;br /&gt;
&lt;br /&gt;
==Linker==&lt;br /&gt;
&lt;br /&gt;
The linker fills in all unresolved branches.  Branches within the block are linked to their target address.  Branches which jump outside of the block are linked to their target if that address has been compiled already. These inter-block branches are recorded in the jump_out array.  This information will be used to remove the links in the event that the target of the branch is invalidated.&lt;br /&gt;
&lt;br /&gt;
Unresolved branches point to a stub which loads the address of the branch instruction and the virtual address of its target into registers, and calls the dynamic linker.  When this code is executed, the dynamic linker will compile the target if necessary, and then patch the branch instruction with the new address.&lt;br /&gt;
&lt;br /&gt;
==Memory manager==&lt;br /&gt;
&lt;br /&gt;
The last step in new_recompile_block is to ensure that there will be sufficient memory available to compile the next block.  If there is not, then the oldest blocks are purged.&lt;br /&gt;
&lt;br /&gt;
The dynarec cache can be described as a 16MB circular buffer divided into eight segments.  Memory is allocated in order from beginning to end.&lt;br /&gt;
&lt;br /&gt;
When fewer than two empty segments remain, the next segment in sequence is cleared.  This continues as memory is needed, wrapping around from the end of the buffer to the beginning.&lt;br /&gt;
&lt;br /&gt;
==Invalidation and restoration==&lt;br /&gt;
&lt;br /&gt;
Normally, code blocks are looked up via a hash table, using the virtual address of the target instruction.  However, for purposes of invalidation, blocks are grouped by physical address.  This can be described as a virtually-indexed, physically-tagged (VIPT) cache.&lt;br /&gt;
&lt;br /&gt;
References to compiled blocks are stored in one of 4096 linked lists in the jump_in array.  Each list covers a 4K block of memory, and 2048 such lists are sufficient to cover the 8MB of RAM in the Nintendo 64.  The remaining lists are for code in ROM, and the bootloader in SP memory.&lt;br /&gt;
&lt;br /&gt;
When a write hits a memory page marked as cached, all entries in the corresponding list are invalidated.  If any code is found to cross a 4K boundary, the adjacent lists are invalidated also.&lt;br /&gt;
&lt;br /&gt;
Sometimes blocks may be invalidated even when none of the code is actually modified.  This can happen if data is written to memory in the same 4K page, or if code is reloaded without actually modifying it.  If blocks which were previously invalidated are subsequently found to be unmodified, those blocks are marked in the restore_candidate array.  If the block remains unmodified, it will be restored as a valid block, to avoid recompiling blocks which do not need to be recompiled.  This is performed by the clean_blocks function which is called periodically.&lt;br /&gt;
&lt;br /&gt;
The jump_in array, which lists unmodified blocks, is physically indexed, however the jump_dirty array, which lists potentially-modified blocks, is virtually indexed.  This allows blocks which have changed physical addresses, but are still at the same virtual address, to be recognized as being the same as before, and not in need of recompilation.&lt;br /&gt;
&lt;br /&gt;
==Dynamic linker==&lt;br /&gt;
&lt;br /&gt;
Branches with unresolved addresses jump to the dynamic linker.  This will look through the jump_in list corresponding to the physical page containing the virtual target address.  If found, the branch instruction will be patched with the address, and then it will jump to this address.&lt;br /&gt;
&lt;br /&gt;
If not found, the jump_dirty list will be searched for blocks which were previously compiled but may have been modified.  If a potential match is found, the code will be compared against a cached copy to determine if any changes have been made.  If not, then it will jump to the block.  Because the memory could be modified again, branch instructions referencing these blocks are not altered, and continue to point to the dynamic linker.  These blocks will continue to be verified each time they are called, until restored to the jump_in list by the clean_blocks function described above.&lt;br /&gt;
&lt;br /&gt;
If no compiled block is found, or the existing block was modified, the target is recompiled.&lt;br /&gt;
&lt;br /&gt;
==Address lookup==&lt;br /&gt;
&lt;br /&gt;
When a JR (jump register) instruction is encountered, the address of the recompiled code must be looked up using the address of the MIPS code.  The majority of such instructions jump to the link register (r31) to return to the location following a JAL (call) instruction.&lt;br /&gt;
&lt;br /&gt;
When a JAL or JALR is executed, the address of the following instruction is inserted into a small 32-entry hash table, which is checked when a JR r31 instruction is executed.  This allows for a quick return from subroutine calls.&lt;br /&gt;
&lt;br /&gt;
If the JR instruction uses a register other than r31, or the small hash table lookup fails to find a match, a larger 131072-entry hash table is checked.  This table contains 65536 bins with up to 2 addresses per bin.  If this also fails to find a match (which occurs less than 1% of the time) an exhaustive search of all compiled addresses within that 4K memory page is performed.&lt;br /&gt;
&lt;br /&gt;
If no match is found by any of these methods, the target address is compiled, and the new address is inserted into the hash table.&lt;br /&gt;
&lt;br /&gt;
==Cycle counting==&lt;br /&gt;
&lt;br /&gt;
Cycles are counted before each branch by adding the cycles from the preceding instructions to a specific register.  The cycle count is kept in R10 on ARM and ESI on x86.  The value in this register is normally negative; when it becomes non-negative, a conditional branch is taken which jumps to an interrupt handler.&lt;br /&gt;
&lt;br /&gt;
For example, the following x86 code adds eight cycles:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add $8,%esi&lt;br /&gt;
jns interrupt_handler&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The conditional branch jumps to a small bit of code located after the main compiled block, which saves the cached registers to the register file, sets the instruction pointer which will be used upon return from the interrupt, and then calls cc_interrupt.&lt;br /&gt;
&lt;br /&gt;
As in the original mupen64plus, the emulated clock runs at 37.5 MHz, and each instruction takes 2 clock cycles.&lt;br /&gt;
&lt;br /&gt;
==Delay slots==&lt;br /&gt;
&lt;br /&gt;
MIPS has 'delay slots', where the instruction after the branch is executed before the branch is taken.  Instructions in delay slots are issued out-of-order in the recompiled code.&lt;br /&gt;
&lt;br /&gt;
[[Image:Recompiler delay slot reordering.png]]&lt;br /&gt;
&lt;br /&gt;
When a branch jumps into the delay slot of another branch, this case must be handled slightly differently:&lt;br /&gt;
&lt;br /&gt;
[[Image:Recompiler delay slot reordering 2.png]]&lt;br /&gt;
&lt;br /&gt;
The branch test and delay slot are executed in-order if a dependency exists, or for 'likely' branches where the delay slot is nullified when the branch condition is false.  These cases are infrequent (typically less than 10% of branches).&lt;br /&gt;
&lt;br /&gt;
==Constant propagation==&lt;br /&gt;
&lt;br /&gt;
When an instruction loads a constant into a register, the register is tagged as a constant.  The constant tag will be retained if subsequent instructions modify the constant using other constants.  During assembly, such a sequence of instructions is combined into a single load.  For example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
LUI r8,12340000  --&amp;gt;  mov $0x12345678,%eax&lt;br /&gt;
ORI r8,r8,5678&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This optimization is not performed where a branch target intervenes, e.g.:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
...&lt;br /&gt;
 LUI r8,12340000  --&amp;gt;  mov $0x12340000,%eax&lt;br /&gt;
L1:&lt;br /&gt;
 ORI r8,r8,5678   --&amp;gt;  or $0x5678,%eax&lt;br /&gt;
 ...&lt;br /&gt;
 BEQ r0,r0,L1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Registers containing constants are identified by bits in the isconst and wasconst fields of the regstat structure.  The wasconst bit is set for a register if the register contained a known constant before the instruction, and the isconst bit is set for a register if the register will contain a known constant after the instruction executes.&lt;br /&gt;
&lt;br /&gt;
==Translation lookaside buffer emulation==&lt;br /&gt;
&lt;br /&gt;
Most Nintendo 64 games do not use virtual memory, but some do.  At startup, main memory is directly mapped at addresses from 0x80000000 to 0x803FFFFF, or up to 0x807FFFFF if the memory expansion is used.&lt;br /&gt;
&lt;br /&gt;
Normally, read or write operations are checked against this range, and if outside this range, control is passed to the appropriate I/O handler.  This is done as follows:&lt;br /&gt;
&lt;br /&gt;
MIPS instruction: LW r1,8(r2)&lt;br /&gt;
&lt;br /&gt;
ARM code:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add r1, r2, #8&lt;br /&gt;
cmp r1, #8388608&lt;br /&gt;
bvc handler&lt;br /&gt;
ldr r1, [r1]&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If there are valid entries in the TLB, this would instead be compiled as follows:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add r1, r2, #8&lt;br /&gt;
mov r0, #264&lt;br /&gt;
add r0, r0, r1, lsr #12&lt;br /&gt;
ldr r0, [r11, r0, lsl #2]&lt;br /&gt;
tst r0, r0&lt;br /&gt;
bmi handler&lt;br /&gt;
ldr r1, [r1, r0, lsl #2]&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This looks up the offset in the memory_map table (which, in this example, is located at r11+264*4).  The high bit is tested to determine whether a valid mapping exists for this page.&lt;br /&gt;
&lt;br /&gt;
==Page fault emulation==&lt;br /&gt;
&lt;br /&gt;
If a memory address references an invalid page, a conditional branch is taken to a bit of code placed after the main block.  This will save any modified cached registers, and call an appropriate handler function for the address.  The handler will either perform I/O, or generate an exception (pagefault).&lt;br /&gt;
&lt;br /&gt;
==Mapping executable pages==&lt;br /&gt;
&lt;br /&gt;
If the dynamic recompiler encounters code which is not in contiguous physical memory, it will end the block at the page boundary, so that the block can be removed cleanly if the mapping is changed.&lt;br /&gt;
&lt;br /&gt;
There is a special case of this, where a branch and delay slot span two pages.  If a branch instruction is the last instruction in a virtual memory page, it is compiled in a different manner than other branches.  The branch condition is evaluated, and the target address is placed in a register (%ebp on x86, and r8 on ARM).  A special form of the dynamic linker (dyna_linker_ds) is used to link the branch to its corresponding delay slot in another block.  If no page fault occurs, the delay slot executes and then jumps to the address in the register.  For conditional branches that are not taken, the target address is the next instruction.  This code is generated by the pagespan_assemble function.&lt;br /&gt;
&lt;br /&gt;
== Compile options ==&lt;br /&gt;
&lt;br /&gt;
=== ARMv5_ONLY ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the UXTH instruction is not used, and the dynamic recompiler will generate literal pools instead of using movw/movt.  This provides compatibility with older processors, but generates somewhat less efficient code.&lt;br /&gt;
&lt;br /&gt;
=== CORTEX_A8_BRANCH_PREDICTION_HACK ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the dynamic recompiler will avoid generating consecutive branch instructions without another instruction in between.  This avoids a stall in instruction decoding on the Cortex-A8 due to this processor having dual instruction decoders, but only one branch-prediction unit.  See [[Assembly Code Optimization]] for details.&lt;br /&gt;
&lt;br /&gt;
=== USE_MINI_HT ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, attempt to look up return addresses in a small hash table before checking the larger hash table.  Usually improves performance.&lt;br /&gt;
&lt;br /&gt;
=== IMM_PREFETCH ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the x86 PREFETCH instruction is used to prefetch entries from the hash table.  The increase in code size often outweighs the benefit of this.&lt;br /&gt;
&lt;br /&gt;
=== REG_PREFETCH ===&lt;br /&gt;
&lt;br /&gt;
Similar to the above, but loads the address into a register first, then uses the ARM PLD instruction.  The increase in code size almost always outweighs the benefit of this.&lt;br /&gt;
&lt;br /&gt;
=== R29_HACK ===&lt;br /&gt;
&lt;br /&gt;
Assume that the stack pointer (r29) is always a valid memory address and do not check it.  It is similar to the optimization described [http://strmnnrmn.blogspot.com/2007/08/interesting-dynarec-hack.html here].  This can crash the emulator and is not enabled by default.&lt;br /&gt;
&lt;br /&gt;
==Future improvements==&lt;br /&gt;
&lt;br /&gt;
===Copy propagation/offset propagation===&lt;br /&gt;
&lt;br /&gt;
A common instruction sequence is of the form:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
LUI r9,12340000&lt;br /&gt;
ADD r9,r9,r8&lt;br /&gt;
LW r9,5678(r9)&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
It would be helpful to recognize this as a load from r8+12345678.  The current constant propagation code does not do so.&lt;br /&gt;
&lt;br /&gt;
===Unaligned memory access===&lt;br /&gt;
&lt;br /&gt;
A small improvement could be made by combining adjacent LWL/LWR instructions.  The potential gain from doing so is very limited because these instructions typically represent less than 1% of all memory accesses.&lt;br /&gt;
&lt;br /&gt;
===SLT/branch merging===&lt;br /&gt;
&lt;br /&gt;
A frequent occurrence in MIPS code is an SLT or SLTI instruction followed by a branch.  This is generated relatively inefficiently on x86 and ARM, first doing a compare and set, then testing this value and doing a conditional branch.  Doing only one comparison would save at least one instruction, and could potentially save up to three instructions if the liveness analysis reveals that the result of the SLT instruction is used nowhere else.&lt;br /&gt;
&lt;br /&gt;
While a potentially useful optimization, this approach has several problems.  First, there are often additional instructions between the SLT and the branch; these must be checked to make sure they do not modify the relevant registers, as that would prevent reordering the instruction stream as desired.  Second, if the result of the SLT is found to be live, but unmodified, on both paths of the branch, clean_registers will normally write this value before the branch, to avoid duplicating the writeback code on both paths.  That optimization would have to be disabled if the SLT were combined with the branch.&lt;br /&gt;
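&lt;br /&gt;
As a sketch of the intended transformation (approximate code, with r4 cached in %edx and r2 in %ecx; this is not the current output):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
SLTI r2,r4,10&lt;br /&gt;
BNE  r2,r0,target&lt;br /&gt;
&lt;br /&gt;
# current output (approximate):&lt;br /&gt;
cmp    $10,%edx&lt;br /&gt;
setl   %cl&lt;br /&gt;
movzbl %cl,%ecx&lt;br /&gt;
test   %ecx,%ecx&lt;br /&gt;
jne    target&lt;br /&gt;
&lt;br /&gt;
# merged form, if r2 is dead after the branch:&lt;br /&gt;
cmp    $10,%edx&lt;br /&gt;
jl     target&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;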
&lt;br /&gt;
===x86-64===&lt;br /&gt;
&lt;br /&gt;
Currently the x86-64 backend generates only 32-bit instructions.  Proper 64-bit code generation would improve performance.&lt;br /&gt;
&lt;br /&gt;
===PowerPC===&lt;br /&gt;
&lt;br /&gt;
It would be possible to add a PowerPC code generator to the dynamic recompiler.  Currently no one is working on this.  (The mupen64gc project is using a different codebase.)&lt;br /&gt;
&lt;br /&gt;
The following is a summary of the changes which would be necessary to add a PowerPC backend.&lt;br /&gt;
&lt;br /&gt;
The slt* instructions use conditional moves, which are unavailable on PowerPC.  A suitable alternative (such as moves from the condition register) would need to be used.&lt;br /&gt;
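&lt;br /&gt;
For example, one hedged possibility (register numbers are illustrative) is a compare followed by a move from the condition register:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
# SLT r3,r4,r5 (signed) might become:&lt;br /&gt;
cmpw   cr0,r4,r5&lt;br /&gt;
mfcr   r3&lt;br /&gt;
rlwinm r3,r3,1,31,31   # extract the LT bit of cr0 into the low bit of r3&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;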
&lt;br /&gt;
The assembler can generate as much as 256K of code in a single block; however, conditional branches on PowerPC are limited to +/-32K.  It will be necessary either to restrict the block size, or to insert jump tables in a manner similar to the literal pools on ARM.&lt;br /&gt;
&lt;br /&gt;
PowerPC generally relies on early branch resolution rather than statistical branch prediction.  Scheduling branch condition tests earlier may be advantageous.  (For example, the address bounds check could be done in address_generation, rather than just before the load or store.  Similarly it may be advantageous to test the branch condition and update the cycle count before executing the delay slot.)&lt;br /&gt;
&lt;br /&gt;
===MIPS===&lt;br /&gt;
&lt;br /&gt;
Recompiling MIPS into MIPS would be relatively straightforward; however, the current code generator has no facility for filling delay slots.  This capability would be required for efficient code generation.&lt;br /&gt;
&lt;br /&gt;
[[Category:Development]]&lt;/div&gt;</summary>
		<author><name>Ari64</name></author>
		
	</entry>
	<entry>
		<id>https://pandorawiki.org/index.php?title=Mupen64plus_dynamic_recompiler&amp;diff=3360</id>
		<title>Mupen64plus dynamic recompiler</title>
		<link rel="alternate" type="text/html" href="https://pandorawiki.org/index.php?title=Mupen64plus_dynamic_recompiler&amp;diff=3360"/>
		<updated>2010-08-19T19:12:19Z</updated>

		<summary type="html">&lt;p&gt;Ari64: Changes for 20100819&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes the dynamic recompiler in Mupen64plus, and the changes made for the ARM port.&lt;br /&gt;
&lt;br /&gt;
==The original dynamic recompiler by Hacktarux==&lt;br /&gt;
&lt;br /&gt;
The dynamic recompiler used in Mupen64plus v1.5 is based on the original written by Hacktarux in 2002.&lt;br /&gt;
&lt;br /&gt;
It recompiles contiguous blocks of MIPS instructions.  First, each instruction is decoded into a dynamically-allocated 132-byte data structure.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
typedef struct _precomp_instr&lt;br /&gt;
{&lt;br /&gt;
   void (*ops)();&lt;br /&gt;
   union&lt;br /&gt;
     {&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         long long int *rs;&lt;br /&gt;
         long long int *rt;&lt;br /&gt;
         short immediate;&lt;br /&gt;
      } i;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned int inst_index;&lt;br /&gt;
      } j;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         long long int *rs;&lt;br /&gt;
         long long int *rt;&lt;br /&gt;
         long long int *rd;&lt;br /&gt;
         unsigned char sa;&lt;br /&gt;
         unsigned char nrd;&lt;br /&gt;
      } r;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned char base;&lt;br /&gt;
         unsigned char ft;&lt;br /&gt;
         short offset;&lt;br /&gt;
      } lf;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned char ft;&lt;br /&gt;
         unsigned char fs;&lt;br /&gt;
         unsigned char fd;&lt;br /&gt;
      } cf;&lt;br /&gt;
     } f;&lt;br /&gt;
   unsigned int addr; /* word-aligned instruction address in r4300 address space */&lt;br /&gt;
   unsigned int local_addr; /* byte offset to start of corresponding x86_64 instructions, from start of code block */&lt;br /&gt;
   reg_cache_struct reg_cache_infos;&lt;br /&gt;
} precomp_instr;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The decoded instructions are then compiled, generating x86 instructions for each MIPS instruction.  A 32K block is allocated with malloc() to hold the x86 code.  If this size proves insufficient, it is incrementally resized with realloc().&lt;br /&gt;
&lt;br /&gt;
MIPS registers are allocated to x86 registers with a least-recently-used replacement policy.  All cached registers are written back prior to a branch, or a memory read or write.&lt;br /&gt;
&lt;br /&gt;
To facilitate invalidation and replacement of modified code blocks, each 4K page is compiled separately.  If a sequence of instructions crosses a 4K page boundary, the current block is ended, and the next instructions will be compiled as a separate block.&lt;br /&gt;
&lt;br /&gt;
Branch instructions within a 4K page are compiled as branches directly to the target address.  Branches which cross a 4K page go through an indirect address lookup.&lt;br /&gt;
&lt;br /&gt;
Compiled code blocks are invalidated on write.  On an actual MIPS CPU, the instruction cache is invalidated using the CACHE instruction; however, a few N64 games clear the cache using other methods.  Trapping writes appears to be the most reliable method of ensuring cache coherency in the emulated system.  The CACHE instruction is ignored.&lt;br /&gt;
&lt;br /&gt;
==Problems with the original design==&lt;br /&gt;
&lt;br /&gt;
The most significant performance problem with this design is its excessive memory usage.  The decoded instruction data is retained (132 bytes for each MIPS instruction) and occasionally referenced during execution.  Memory accesses frequently miss the L2 cache, resulting in poor performance.&lt;br /&gt;
&lt;br /&gt;
Additionally, the register cache is relatively inefficient, since all registers are flushed before any read or write operation.&lt;br /&gt;
&lt;br /&gt;
==A new approach==&lt;br /&gt;
&lt;br /&gt;
To reduce memory usage, the new design allocates a single large block of memory (currently 16 MiB) which is used for recompiled code.  This memory is allocated using mmap with the PROT_EXEC bit set, to ensure operation on CPUs with no-execute (NX) page permissions.&lt;br /&gt;
&lt;br /&gt;
Contiguous blocks of MIPS instructions are compiled as-is; the recompiler does not attempt to follow branches or trace 'hot paths'.&lt;br /&gt;
&lt;br /&gt;
The recompiler consists of eight stages, plus a linker, and a memory manager.&lt;br /&gt;
&lt;br /&gt;
Compiled blocks are invalidated on write, as before; however, since the compiler may cross a 4K page boundary, writes may invalidate adjacent pages as well.&lt;br /&gt;
&lt;br /&gt;
Currently the dynarec generates x86, x86-64, ARMv5, and ARMv7 (little-endian).  Most of the code is shared between the architectures, but a different code generator is included at compile time, using different #include statements depending on the CPU type.&lt;br /&gt;
&lt;br /&gt;
==Pass 1: Disassembly==&lt;br /&gt;
&lt;br /&gt;
When an instruction address is encountered which has not been compiled, the function new_recompile_block is called, with the (virtual) address of the target instruction as its sole parameter.  If the address is invalid, the function returns a nonzero value and the caller is responsible for handling the pagefault.&lt;br /&gt;
&lt;br /&gt;
Instructions are decoded until an unconditional jump, usually a return, is encountered.  Disassembly ordinarily continues past a JAL (subroutine call) instruction; however, this strategy is abandoned if invalid instructions are encountered.&lt;br /&gt;
&lt;br /&gt;
Up to 4096 instructions (16K of MIPS code) may be disassembled at once.  Surprisingly, some games do actually reach this limit.&lt;br /&gt;
&lt;br /&gt;
==Pass 2: Liveness analysis==&lt;br /&gt;
&lt;br /&gt;
After disassembly, liveness analysis is performed on the registers.  This determines when a particular register will no longer be used, and thus can be removed from the register cache.&lt;br /&gt;
&lt;br /&gt;
A separate analysis is done on the upper 32 bits of 64-bit registers.  This can determine when only the lower 32 bits of a register are significant, thus allowing use of a 32-bit register.  This enables more efficient code generation on 32-bit processors.&lt;br /&gt;
&lt;br /&gt;
==Pass 3: Register allocation==&lt;br /&gt;
&lt;br /&gt;
The 31 MIPS registers must be mapped onto seven available registers on x86, or twelve available registers on ARM.&lt;br /&gt;
&lt;br /&gt;
Instructions with 64-bit results require two registers on 32-bit host processors.  32-bit instructions require only one.  A flag is set for registers containing 32-bit values, and these registers will be sign-extended before they are written to the register file.&lt;br /&gt;
&lt;br /&gt;
Registers are made available using a least-soon-needed replacement policy.  When the register cache is full, and no registers can be eliminated using the liveness analysis, a ten-instruction lookahead is used to determine which registers will not be needed soon and these registers are evicted from the cache.&lt;br /&gt;
&lt;br /&gt;
==Pass 4: Free unused registers==&lt;br /&gt;
&lt;br /&gt;
After the initial register allocation, the allocations are reviewed to determine if any registers remain allocated longer than necessary.  These are then removed from the register cache.  This avoids having to write back a large number of registers just before a branch, and makes more registers available for the next pass.&lt;br /&gt;
&lt;br /&gt;
==Pass 5: Pre-allocate registers==&lt;br /&gt;
&lt;br /&gt;
If a register will be used soon and needs to be loaded, try to load it early if the register is available.  This improves execution on CPUs with a load-use penalty.&lt;br /&gt;
&lt;br /&gt;
==Pass 6: Optimize clean/dirty state==&lt;br /&gt;
&lt;br /&gt;
If a cached register is 'dirty' and needs to be written out, try to do so as soon as the register will no longer be modified.  This avoids having to write the same register on multiple code paths due to conditional branches.  Additionally, try to avoid writing out dirty registers inside of loops.&lt;br /&gt;
&lt;br /&gt;
==Pass 7: Identify where registers are assumed to be 32-bit==&lt;br /&gt;
&lt;br /&gt;
When a 64-bit register is mapped to a 32-bit register with the assumption that the value will be sign-extended before being used, it is necessary to ensure that no register contains a 64-bit value when branching to such a location.  These instructions are flagged to identify them as requiring 32-bit inputs.  This information is used by the linker, and the exception return (ERET) handler.&lt;br /&gt;
&lt;br /&gt;
==Pass 8: Assembly==&lt;br /&gt;
&lt;br /&gt;
This generates and outputs the recompiled code.&lt;br /&gt;
&lt;br /&gt;
Following the main code block, handlers for certain exceptions as well as alternate entry points are added.&lt;br /&gt;
&lt;br /&gt;
If a recompiled instruction relies on a certain MIPS register being cached in a certain native register, then a short 'stub' of code is generated to load the necessary registers.  When an instruction outside of this block needs to jump to that location, it will instead jump to the stub.  The necessary registers will be loaded, and then it will jump into the main code sequence.&lt;br /&gt;
&lt;br /&gt;
On architectures which require literal pools (ARMv5) these are inserted as necessary.&lt;br /&gt;
&lt;br /&gt;
==Linker==&lt;br /&gt;
&lt;br /&gt;
The linker fills in all unresolved branches.  Branches within the block are linked to their target address.  Branches which jump outside of the block are linked to their target if that address has been compiled already. These inter-block branches are recorded in the jump_out array.  This information will be used to remove the links in the event that the target of the branch is invalidated.&lt;br /&gt;
&lt;br /&gt;
Unresolved branches point to a stub which loads the address of the branch instruction and the virtual address of its target into registers, and calls the dynamic linker.  When this code is executed, the dynamic linker will compile the target if necessary, and then patch the branch instruction with the new address.&lt;br /&gt;
&lt;br /&gt;
==Memory manager==&lt;br /&gt;
&lt;br /&gt;
The last step in new_recompile_block is to ensure that there will be sufficient memory available to compile the next block.  If there is not, then the oldest blocks are purged.&lt;br /&gt;
&lt;br /&gt;
The dynarec cache can be described as a 16MB circular buffer divided into eight segments.  Memory is allocated in order from beginning to end.&lt;br /&gt;
&lt;br /&gt;
When there are fewer than two empty segments, the next segment in sequence is cleared, wrapping around to the beginning of the buffer.  This continues as memory is needed, cycling from the end of the buffer back to the beginning.&lt;br /&gt;
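&lt;br /&gt;
The allocation policy can be sketched in pseudocode as follows (the names and exact clearing rule are illustrative):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
segment_size = cache_size / 8&lt;br /&gt;
while empty_segments() is less than 2:&lt;br /&gt;
    clear_segment(next_segment)          # purge the oldest blocks&lt;br /&gt;
    next_segment = (next_segment + 1) mod 8&lt;br /&gt;
emit code at out_ptr, wrapping to the start of the buffer at the end&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;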
&lt;br /&gt;
==Invalidation and restoration==&lt;br /&gt;
&lt;br /&gt;
Normally, code blocks are looked up via a hash table, using the virtual address of the target instruction.  However, for purposes of invalidation, blocks are grouped by physical address.  This can be described as a virtually-indexed, physically-tagged (VIPT) cache.&lt;br /&gt;
&lt;br /&gt;
References to compiled blocks are stored in one of 4096 linked lists in the jump_in array.  Each list covers a 4K block of memory, and 2048 such lists are sufficient to cover the 8MB of RAM in the Nintendo 64.  The remaining lists are for code in ROM, and the bootloader in SP memory.&lt;br /&gt;
&lt;br /&gt;
When a write hits a memory page marked as cached, all entries in the corresponding list are invalidated.  If any code is found to cross a 4K boundary, the adjacent lists are invalidated also.&lt;br /&gt;
&lt;br /&gt;
Sometimes blocks may be invalidated even when none of the code is actually modified.  This can happen if data is written to memory in the same 4K page, or if code is reloaded without actually being modified.  If blocks which were previously invalidated are subsequently found to be unmodified, those blocks are marked in the restore_candidate array.  If such a block remains unmodified, it will be restored as a valid block, avoiding unnecessary recompilation.  This is performed by the clean_blocks function, which is called periodically.&lt;br /&gt;
&lt;br /&gt;
The jump_in array, which lists unmodified blocks, is physically indexed; however, the jump_dirty array, which lists potentially-modified blocks, is virtually indexed.  This allows blocks which have changed physical addresses, but are still at the same virtual address, to be recognized as being the same as before, and not in need of recompilation.&lt;br /&gt;
&lt;br /&gt;
==Dynamic linker==&lt;br /&gt;
&lt;br /&gt;
Branches with unresolved addresses jump to the dynamic linker.  This will look through the jump_in list corresponding to the physical page containing the virtual target address.  If found, the branch instruction will be patched with the address, and then it will jump to this address.&lt;br /&gt;
&lt;br /&gt;
If not found, the jump_dirty list will be searched for blocks which were previously compiled but may have been modified.  If a potential match is found, the code will be compared against a cached copy to determine if any changes have been made.  If not, then it will jump to the block.  Because the memory could be modified again, branch instructions referencing these blocks are not altered, and continue to point to the dynamic linker.  These blocks will continue to be verified each time they are called, until restored to the jump_in list by the clean_blocks function described above.&lt;br /&gt;
&lt;br /&gt;
If no compiled block is found, or the existing block was modified, the target is recompiled.&lt;br /&gt;
&lt;br /&gt;
==Address lookup==&lt;br /&gt;
&lt;br /&gt;
When a JR (jump register) instruction is encountered, the address of the recompiled code must be looked up using the address of the MIPS code.  The majority of such instructions jump to the link register (r31) to return to the location following a JAL (call) instruction.&lt;br /&gt;
&lt;br /&gt;
When a JAL or JALR is executed, the address of the following instruction is inserted into a small 32-entry hash table, which is checked when a JR r31 instruction is executed.  This allows for a quick return from subroutine calls.&lt;br /&gt;
&lt;br /&gt;
If the JR instruction uses a register other than r31, or the small hash table lookup fails to find a match, a larger 131072-entry hash table is checked.  This table contains 65536 bins with up to 2 addresses per bin.  If this also fails to find a match (which occurs less than 1% of the time) an exhaustive search of all compiled addresses within that 4K memory page is performed.&lt;br /&gt;
&lt;br /&gt;
If no match is found by any of these methods, the target address is compiled, and the new address is inserted into the hash table.&lt;br /&gt;
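&lt;br /&gt;
The whole lookup chain can be summarized in pseudocode (the hash functions shown are illustrative, not the exact implementation):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
# on JAL/JALR: mini_ht[hash_small(return_addr)] = (return_addr, native_addr)&lt;br /&gt;
i = hash_small(target)               # 32-entry table, checked for JR r31&lt;br /&gt;
if mini_ht[i].vaddr == target: jump mini_ht[i].code&lt;br /&gt;
b = hash_large(target)               # 65536 bins, up to 2 entries each&lt;br /&gt;
for each entry e in bin b:&lt;br /&gt;
    if e.vaddr == target: jump e.code&lt;br /&gt;
search all compiled blocks in the 4K page containing target&lt;br /&gt;
if still not found: compile target and add it to the hash table&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;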
&lt;br /&gt;
==Cycle counting==&lt;br /&gt;
&lt;br /&gt;
Cycles are counted before each branch by adding the cycles from the preceding instructions to a specific register.  The cycle count is kept in R10 on ARM and ESI on x86.  The value in this register is normally a negative number; when it becomes non-negative, a conditional branch is taken which jumps to an interrupt handler.&lt;br /&gt;
&lt;br /&gt;
For example, the following x86 code adds eight cycles:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add $8,%esi&lt;br /&gt;
jns interrupt_handler&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The conditional branch jumps to a small bit of code located after the main compiled block, which saves the cached registers to the register file, sets the instruction pointer which will be used upon return from the interrupt, and then calls cc_interrupt.&lt;br /&gt;
&lt;br /&gt;
As in the original mupen64plus, the emulated clock runs at 37.5 MHz, and each instruction takes 2 clock cycles.&lt;br /&gt;
&lt;br /&gt;
==Delay slots==&lt;br /&gt;
&lt;br /&gt;
MIPS has 'delay slots', where the instruction after the branch is executed before the branch is taken.  Instructions in delay slots are issued out-of-order in the recompiled code.&lt;br /&gt;
&lt;br /&gt;
[[Image:Recompiler delay slot reordering.png]]&lt;br /&gt;
&lt;br /&gt;
When a branch jumps into the delay slot of another branch, this case must be handled slightly differently:&lt;br /&gt;
&lt;br /&gt;
[[Image:Recompiler delay slot reordering 2.png]]&lt;br /&gt;
&lt;br /&gt;
The branch test and delay slot are executed in-order if a dependency exists, or for 'likely' branches where the delay slot is nullified when the branch condition is false.  These cases are infrequent (typically less than 10% of branches).&lt;br /&gt;
&lt;br /&gt;
==Constant propagation==&lt;br /&gt;
&lt;br /&gt;
When an instruction loads a constant into a register, the register is tagged as a constant.  The constant tag will be retained if subsequent instructions modify the constant using other constants.  During assembly, such a sequence of instructions is combined into a single load.  For example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
LUI r8,12340000  --&amp;gt;  mov $0x12345678,%eax&lt;br /&gt;
ORI r8,r8,5678&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This optimization is not performed where a branch target intervenes, e.g.:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
...&lt;br /&gt;
 LUI r8,12340000  --&amp;gt;  mov $0x12340000,%eax&lt;br /&gt;
L1:&lt;br /&gt;
 ORI r8,r8,5678   --&amp;gt;  or $0x5678,%eax&lt;br /&gt;
 ...&lt;br /&gt;
 BEQ r0,r0,L1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Registers containing constants are identified by bits in the isconst and wasconst fields of the regstat structure.  The wasconst bit is set for a register if the register contained a known constant before the instruction, and the isconst bit is set for a register if the register will contain a known constant after the instruction executes.&lt;br /&gt;
&lt;br /&gt;
==Translation lookaside buffer emulation==&lt;br /&gt;
&lt;br /&gt;
Most Nintendo 64 games do not use virtual memory, but some do.  At startup, main memory is directly mapped at addresses from 0x80000000 to 0x803FFFFF, or up to 0x807FFFFF if the memory expansion is used.&lt;br /&gt;
&lt;br /&gt;
Normally, read or write operations are checked against this range, and if outside this range, control is passed to the appropriate I/O handler.  This is done as follows:&lt;br /&gt;
&lt;br /&gt;
MIPS instruction: LW r1,8(r2)&lt;br /&gt;
&lt;br /&gt;
ARM code:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add r1, r2, #8&lt;br /&gt;
cmp r1, #8388608&lt;br /&gt;
bvc handler&lt;br /&gt;
ldr r1, [r1]&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If there are valid entries in the TLB, this would instead be compiled as follows:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add r1, r2, #8&lt;br /&gt;
mov r0, #264&lt;br /&gt;
add r0, r0, r1, lsr #12&lt;br /&gt;
ldr r0, [r11, r0, lsl #2]&lt;br /&gt;
tst r0, r0&lt;br /&gt;
bmi handler&lt;br /&gt;
ldr r1, [r1, r0, lsl #2]&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This looks up the offset in the memory_map table (which, in this example, is located at r11+264*4).  The high bit is tested to determine whether a valid mapping exists for this page.&lt;br /&gt;
&lt;br /&gt;
==Page fault emulation==&lt;br /&gt;
&lt;br /&gt;
If a memory address references an invalid page, a conditional branch is taken to a bit of code placed after the main block.  This will save any modified cached registers, and call an appropriate handler function for the address.  The handler will either perform I/O, or generate an exception (pagefault).&lt;br /&gt;
&lt;br /&gt;
==Mapping executable pages==&lt;br /&gt;
&lt;br /&gt;
If the dynamic recompiler encounters code which is not in contiguous physical memory, it will end the block at the page boundary, so that the block can be removed cleanly if the mapping is changed.&lt;br /&gt;
&lt;br /&gt;
There is a special case of this, where a branch and delay slot span two pages.  If a branch instruction is the last instruction in a virtual memory page, it is compiled in a different manner than other branches.  The branch condition is evaluated, and the target address is placed in a register (%ebp on x86, and r8 on ARM).  A special form of the dynamic linker (dyna_linker_ds) is used to link the branch to its corresponding delay slot in another block.  If no page fault occurs, the delay slot executes and then jumps to the address in the register.  For conditional branches that are not taken, the target address is the next instruction.  This code is generated by the pagespan_assemble function.&lt;br /&gt;
&lt;br /&gt;
== Compile options ==&lt;br /&gt;
&lt;br /&gt;
=== ARMv5_ONLY ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the UXTH instruction is not used, and the dynamic recompiler will generate literal pools instead of using movw/movt.  This provides compatibility with older processors, but generates somewhat less efficient code.&lt;br /&gt;
&lt;br /&gt;
=== CORTEX_A8_BRANCH_PREDICTION_HACK ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the dynamic recompiler will avoid generating consecutive branch instructions without another instruction in between.  This avoids a stall in instruction decoding on the Cortex-A8 due to this processor having dual instruction decoders, but only one branch-prediction unit.  See [[Assembly Code Optimization]] for details.&lt;br /&gt;
&lt;br /&gt;
=== USE_MINI_HT ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, attempt to look up return addresses in a small hash table before checking the larger hash table.  Usually improves performance.&lt;br /&gt;
&lt;br /&gt;
=== IMM_PREFETCH ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the x86 PREFETCH instruction is used to prefetch entries from the hash table.  The increase in code size often outweighs the benefit of this.&lt;br /&gt;
&lt;br /&gt;
=== REG_PREFETCH ===&lt;br /&gt;
&lt;br /&gt;
Similar to the above, but loads the address into a register first, then uses the ARM PLD instruction.  The increase in code size almost always outweighs the benefit of this.&lt;br /&gt;
&lt;br /&gt;
=== R29_HACK ===&lt;br /&gt;
&lt;br /&gt;
Assume that the stack pointer (r29) is always a valid memory address and do not check it.  It is similar to the optimization described [http://strmnnrmn.blogspot.com/2007/08/interesting-dynarec-hack.html here].  This can crash the emulator and is not enabled by default.&lt;br /&gt;
&lt;br /&gt;
==Future improvements==&lt;br /&gt;
&lt;br /&gt;
===Copy propagation/offset propagation===&lt;br /&gt;
&lt;br /&gt;
A common instruction sequence is of the form:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
LUI r9,12340000&lt;br /&gt;
ADD r9,r9,r8&lt;br /&gt;
LW r9,5678(r9)&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
It would be helpful to recognize this as a load from r8+12345678.  The current constant propagation code does not do so.&lt;br /&gt;
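&lt;br /&gt;
As a hypothetical sketch (the x86 register assignments are illustrative, assuming r8 is cached in %ebx and r9 in %ecx), recognizing the combined offset would let all three instructions fold into a single load with a 32-bit displacement:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
LUI r9,12340000&lt;br /&gt;
ADD r9,r9,r8&lt;br /&gt;
LW  r9,5678(r9)  --&amp;gt;  mov 0x12345678(%ebx),%ecx&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;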
&lt;br /&gt;
===Unaligned memory access===&lt;br /&gt;
&lt;br /&gt;
A small improvement could be made by combining adjacent LWL/LWR instructions.  The potential gain from doing so is very limited because these instructions typically represent less than 1% of all memory accesses.&lt;br /&gt;
&lt;br /&gt;
===SLT/branch merging===&lt;br /&gt;
&lt;br /&gt;
A frequent occurrence in MIPS code is an SLT or SLTI instruction followed by a branch.  This is generated relatively inefficiently on x86 and ARM, first doing a compare and set, then testing this value and doing a conditional branch.  Doing only one comparison would save at least one instruction, and could potentially save up to three instructions if the liveness analysis reveals that the result of the SLT instruction is used nowhere else.&lt;br /&gt;
&lt;br /&gt;
While a potentially useful optimization, this approach has several problems.  First, there are often additional instructions between the SLT and the branch; these must be checked to make sure they do not modify the relevant registers, as that would prevent reordering the instruction stream as desired.  Second, if the result of the SLT is found to be live, but unmodified, on both paths of the branch, clean_registers will normally write this value before the branch, to avoid duplicating the writeback code on both paths.  That optimization would have to be disabled if the SLT were combined with the branch.&lt;br /&gt;
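&lt;br /&gt;
As a sketch of the intended transformation (approximate code, with r4 cached in %edx and r2 in %ecx; this is not the current output):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
SLTI r2,r4,10&lt;br /&gt;
BNE  r2,r0,target&lt;br /&gt;
&lt;br /&gt;
# current output (approximate):&lt;br /&gt;
cmp    $10,%edx&lt;br /&gt;
setl   %cl&lt;br /&gt;
movzbl %cl,%ecx&lt;br /&gt;
test   %ecx,%ecx&lt;br /&gt;
jne    target&lt;br /&gt;
&lt;br /&gt;
# merged form, if r2 is dead after the branch:&lt;br /&gt;
cmp    $10,%edx&lt;br /&gt;
jl     target&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;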
&lt;br /&gt;
===x86-64===&lt;br /&gt;
&lt;br /&gt;
Currently the x86-64 backend generates only 32-bit instructions.  Proper 64-bit code generation would improve performance.&lt;br /&gt;
&lt;br /&gt;
===PowerPC===&lt;br /&gt;
&lt;br /&gt;
It would be possible to add a PowerPC code generator to the dynamic recompiler.  Currently no one is working on this.  (The mupen64gc project is using a different codebase.)&lt;br /&gt;
&lt;br /&gt;
The following is a summary of the changes which would be necessary to add a PowerPC backend.&lt;br /&gt;
&lt;br /&gt;
The slt* instructions use conditional moves, which are unavailable on PowerPC.  A suitable alternative (such as moves from the condition register) would need to be used.&lt;br /&gt;
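&lt;br /&gt;
For example, one hedged possibility (register numbers are illustrative) is a compare followed by a move from the condition register:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
# SLT r3,r4,r5 (signed) might become:&lt;br /&gt;
cmpw   cr0,r4,r5&lt;br /&gt;
mfcr   r3&lt;br /&gt;
rlwinm r3,r3,1,31,31   # extract the LT bit of cr0 into the low bit of r3&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;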
&lt;br /&gt;
The assembler can generate as much as 256K of code in a single block; however, conditional branches on PowerPC are limited to +/-32K.  It will be necessary either to restrict the block size, or to insert jump tables in a manner similar to the literal pools on ARM.&lt;br /&gt;
&lt;br /&gt;
PowerPC generally relies on early branch resolution rather than statistical branch prediction.  Scheduling branch condition tests earlier may be advantageous.  (For example, the address bounds check could be done in address_generation, rather than just before the load or store.  Similarly it may be advantageous to test the branch condition and update the cycle count before executing the delay slot.)&lt;br /&gt;
&lt;br /&gt;
===MIPS===&lt;br /&gt;
&lt;br /&gt;
Recompiling MIPS into MIPS would be relatively straightforward; however, the current code generator has no facility for filling delay slots.  This capability would be required for efficient code generation.&lt;br /&gt;
&lt;br /&gt;
[[Category:Development]]&lt;/div&gt;</summary>
		<author><name>Ari64</name></author>
		
	</entry>
	<entry>
		<id>https://pandorawiki.org/index.php?title=Mupen64plus_dynamic_recompiler&amp;diff=3348</id>
		<title>Mupen64plus dynamic recompiler</title>
		<link rel="alternate" type="text/html" href="https://pandorawiki.org/index.php?title=Mupen64plus_dynamic_recompiler&amp;diff=3348"/>
		<updated>2010-08-14T22:30:27Z</updated>

		<summary type="html">&lt;p&gt;Ari64: typos&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes the dynamic recompiler in Mupen64plus, and the changes made for the ARM port.&lt;br /&gt;
&lt;br /&gt;
==The original dynamic recompiler by Hacktarux==&lt;br /&gt;
&lt;br /&gt;
The dynamic recompiler used in Mupen64plus v1.5 is based on the original written by Hacktarux in 2002.&lt;br /&gt;
&lt;br /&gt;
It recompiles contiguous blocks of MIPS instructions.  First, each instruction is decoded into a dynamically-allocated 132-byte data structure.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
typedef struct _precomp_instr&lt;br /&gt;
{&lt;br /&gt;
   void (*ops)();&lt;br /&gt;
   union&lt;br /&gt;
     {&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         long long int *rs;&lt;br /&gt;
         long long int *rt;&lt;br /&gt;
         short immediate;&lt;br /&gt;
      } i;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned int inst_index;&lt;br /&gt;
      } j;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         long long int *rs;&lt;br /&gt;
         long long int *rt;&lt;br /&gt;
         long long int *rd;&lt;br /&gt;
         unsigned char sa;&lt;br /&gt;
         unsigned char nrd;&lt;br /&gt;
      } r;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned char base;&lt;br /&gt;
         unsigned char ft;&lt;br /&gt;
         short offset;&lt;br /&gt;
      } lf;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned char ft;&lt;br /&gt;
         unsigned char fs;&lt;br /&gt;
         unsigned char fd;&lt;br /&gt;
      } cf;&lt;br /&gt;
     } f;&lt;br /&gt;
   unsigned int addr; /* word-aligned instruction address in r4300 address space */&lt;br /&gt;
   unsigned int local_addr; /* byte offset to start of corresponding x86_64 instructions, from start of code block */&lt;br /&gt;
   reg_cache_struct reg_cache_infos;&lt;br /&gt;
} precomp_instr;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The decoded instructions are then compiled, generating x86 instructions for each MIPS instruction.  A 32K block is allocated with malloc() to hold the x86 code.  If this size proves insufficient, it is incrementally resized with realloc().&lt;br /&gt;
&lt;br /&gt;
MIPS registers are allocated to x86 registers with a least-recently-used replacement policy.  All cached registers are written back prior to a branch, or a memory read or write.&lt;br /&gt;
&lt;br /&gt;
To facilitate invalidation and replacement of modified code blocks, each 4K page is compiled separately.  If a sequence of instructions crosses a 4K page boundary, the current block is ended, and the next instructions will be compiled as a separate block.&lt;br /&gt;
&lt;br /&gt;
Branch instructions within a 4K page are compiled as branches directly to the target address.  Branches which cross a 4K page go through an indirect address lookup.&lt;br /&gt;
&lt;br /&gt;
Compiled code blocks are invalidated on write.  On an actual MIPS CPU, the instruction cache is invalidated using the CACHE instruction; however, a few N64 games clear the cache using other methods.  Trapping writes appears to be the most reliable method of ensuring cache coherency in the emulated system, so the CACHE instruction itself is ignored.&lt;br /&gt;
&lt;br /&gt;
==Problems with the original design==&lt;br /&gt;
&lt;br /&gt;
The most significant performance problem with this design is its excessive memory usage.  The decoded instruction data is retained (132 bytes for each MIPS instruction) and occasionally referenced during execution.  Memory accesses frequently miss the L2 cache, resulting in poor performance.&lt;br /&gt;
&lt;br /&gt;
Additionally, the register cache is relatively inefficient, since all registers are flushed before any read or write operation.&lt;br /&gt;
&lt;br /&gt;
==A new approach==&lt;br /&gt;
&lt;br /&gt;
To reduce memory usage, the new design allocates a single large block of memory (currently 16 MiB) which is used for recompiled code.  This memory is allocated using mmap with the PROT_EXEC bit set, to ensure operation on CPUs with no-execute (NX) page permissions.&lt;br /&gt;
&lt;br /&gt;
The recompiler compiles contiguous blocks of MIPS instructions (that is, it does not attempt to follow branches or trace 'hot paths').&lt;br /&gt;
&lt;br /&gt;
The recompiler consists of eight stages, plus a linker, and a memory manager.&lt;br /&gt;
&lt;br /&gt;
Compiled blocks are invalidated on write, as before; however, since compiled blocks may cross a 4K page boundary, writes may invalidate adjacent pages as well.&lt;br /&gt;
&lt;br /&gt;
Currently the dynarec generates code for x86, x86-64, ARMv5, and ARMv7 (little-endian).  Most of the code is shared between the architectures, but a different code generator is included at compile time, using different #include statements depending on the CPU type.&lt;br /&gt;
&lt;br /&gt;
==Pass 1: Disassembly==&lt;br /&gt;
&lt;br /&gt;
When an instruction address is encountered which has not been compiled, the function new_recompile_block is called, with the (virtual) address of the target instruction as its sole parameter.  If the address is invalid, the function returns a nonzero value and the caller is responsible for handling the pagefault.&lt;br /&gt;
&lt;br /&gt;
Instructions are decoded until an unconditional jump, usually a return, is encountered.  Disassembly ordinarily continues past a JAL (subroutine call) instruction; however, this strategy is abandoned if invalid instructions are encountered.&lt;br /&gt;
&lt;br /&gt;
Up to 4096 instructions (16K of MIPS code) may be disassembled at once.  Surprisingly, some games do actually reach this limit.&lt;br /&gt;
&lt;br /&gt;
==Pass 2: Liveness analysis==&lt;br /&gt;
&lt;br /&gt;
After disassembly, liveness analysis is performed on the registers.  This determines when a particular register will no longer be used, and thus can be removed from the register cache.&lt;br /&gt;
&lt;br /&gt;
A separate analysis is done on the upper 32 bits of 64-bit registers.  This can determine when only the lower 32 bits of a register are significant, thus allowing use of a 32-bit register.  This enables more efficient code generation on 32-bit processors.&lt;br /&gt;
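&lt;br /&gt;
For example, in a hypothetical fragment such as the following (register numbers are arbitrary), the upper-half analysis can prove that each value is a sign-extended 32-bit quantity, so no host registers need to be spent tracking upper halves:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
ADDIU r8,r0,100   ; 32-bit result, upper half is an implied sign extension&lt;br /&gt;
ADDU r9,r8,r8     ; reads and writes only the lower 32 bits&lt;br /&gt;
SW r9,0(r29)      ; stores only the lower 32 bits&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;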
&lt;br /&gt;
==Pass 3: Register allocation==&lt;br /&gt;
&lt;br /&gt;
The 31 MIPS registers must be mapped onto seven available registers on x86, or twelve available registers on ARM.&lt;br /&gt;
&lt;br /&gt;
Instructions with 64-bit results require two registers on 32-bit host processors.  32-bit instructions require only one.  A flag is set for registers containing 32-bit values, and these registers will be sign-extended before they are written to the register file.&lt;br /&gt;
&lt;br /&gt;
Registers are made available using a least-soon-needed replacement policy.  When the register cache is full, and no registers can be eliminated using the liveness analysis, a ten-instruction lookahead is used to determine which registers will not be needed soon and these registers are evicted from the cache.&lt;br /&gt;
&lt;br /&gt;
==Pass 4: Free unused registers==&lt;br /&gt;
&lt;br /&gt;
After the initial register allocation, the allocations are reviewed to determine if any registers remain allocated longer than necessary.  These are then removed from the register cache.  This avoids having to write back a large number of registers just before a branch, and makes more registers available for the next pass.&lt;br /&gt;
&lt;br /&gt;
==Pass 5: Pre-allocate registers==&lt;br /&gt;
&lt;br /&gt;
If a register will be used soon and needs to be loaded, try to load it early if the register is available.  This improves execution on CPUs with a load-use penalty.&lt;br /&gt;
&lt;br /&gt;
==Pass 6: Optimize clean/dirty state==&lt;br /&gt;
&lt;br /&gt;
If a cached register is 'dirty' and needs to be written out, try to do so as soon as the register will no longer be modified.  This avoids having to write the same register on multiple code paths due to conditional branches.  Additionally, try to avoid writing out dirty registers inside of loops.&lt;br /&gt;
&lt;br /&gt;
==Pass 7: Identify where registers are assumed to be 32-bit==&lt;br /&gt;
&lt;br /&gt;
When a 64-bit register is mapped to a 32-bit register with the assumption that the value will be sign-extended before being used, it is necessary to ensure that no register contains a 64-bit value when branching to such a location.  These instructions are flagged to identify them as requiring 32-bit inputs.  This information is used by the linker, and the exception return (ERET) handler.&lt;br /&gt;
&lt;br /&gt;
==Pass 8: Assembly==&lt;br /&gt;
&lt;br /&gt;
This generates and outputs the recompiled code.&lt;br /&gt;
&lt;br /&gt;
Following the main code block, handlers for certain exceptions as well as alternate entry points are added.&lt;br /&gt;
&lt;br /&gt;
If a recompiled instruction relies on a certain MIPS register being cached in a certain native register, then a short 'stub' of code is generated to load the necessary registers.  When an instruction outside of this block needs to jump to that location, it will instead jump to the stub.  The necessary registers will be loaded, and then it will jump into the main code sequence.&lt;br /&gt;
&lt;br /&gt;
On architectures which require literal pools (ARMv5), these are inserted as necessary.&lt;br /&gt;
&lt;br /&gt;
==Linker==&lt;br /&gt;
&lt;br /&gt;
The linker fills in all unresolved branches.  Branches within the block are linked to their target address.  Branches which jump outside of the block are linked to their target if that address has been compiled already. These inter-block branches are recorded in the jump_out array.  This information will be used to remove the links in the event that the target of the branch is invalidated.&lt;br /&gt;
&lt;br /&gt;
Unresolved branches point to a stub which loads the address of the branch instruction and the virtual address of its target into registers, and calls the dynamic linker.  When this code is executed, the dynamic linker will compile the target if necessary, and then patch the branch instruction with the new address.&lt;br /&gt;
&lt;br /&gt;
==Memory manager==&lt;br /&gt;
&lt;br /&gt;
The last step in new_recompile_block is to ensure that there will be sufficient memory available to compile the next block.  If there is not, then the oldest blocks are purged.&lt;br /&gt;
&lt;br /&gt;
The dynarec cache can be described as a 16MB circular buffer divided into eight segments.  Memory is allocated in order from beginning to end.&lt;br /&gt;
&lt;br /&gt;
When fewer than two segments are empty, the next segment in sequence is cleared.  This continues as memory is needed, wrapping around from the end of the buffer to the beginning.&lt;br /&gt;
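&lt;br /&gt;
The allocation policy can be sketched in C-like pseudocode (the names and details here are illustrative, not the actual implementation):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
/* 16MB buffer, 8 segments of 2MB each */&lt;br /&gt;
while (empty_segments() &amp;lt; 2) {&lt;br /&gt;
   invalidate_blocks_in(next_segment);    /* purge oldest code, unlink branches */&lt;br /&gt;
   next_segment = (next_segment + 1) % 8; /* wrap around the circular buffer */&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;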
&lt;br /&gt;
==Invalidation and restoration==&lt;br /&gt;
&lt;br /&gt;
Normally, code blocks are looked up via a hash table, using the virtual address of the target instruction.  However, for purposes of invalidation, blocks are grouped by physical address.  This can be described as a virtually-indexed, physically-tagged (VIPT) cache.&lt;br /&gt;
&lt;br /&gt;
References to compiled blocks are stored in one of 4096 linked lists in the jump_in array.  Each list covers a 4K block of memory, and 2048 such lists are sufficient to cover the 8MB of RAM in the Nintendo 64.  The remaining lists are for code in ROM, and the bootloader in SP memory.&lt;br /&gt;
&lt;br /&gt;
When a write hits a memory page marked as cached, all entries in the corresponding list are invalidated.  If any code is found to cross a 4K boundary, the adjacent lists are invalidated also.&lt;br /&gt;
&lt;br /&gt;
Sometimes blocks may be invalidated even when none of the code was actually modified.  This can happen if data is written to memory in the same 4K page, or if code is reloaded without modification.  Blocks which were previously invalidated but are subsequently found to be unmodified are marked in the restore_candidate array.  If such a block remains unmodified, it is restored as a valid block, to avoid recompiling code that does not need to be recompiled.  This is performed by the clean_blocks function, which is called periodically.&lt;br /&gt;
&lt;br /&gt;
==Dynamic linker==&lt;br /&gt;
&lt;br /&gt;
Branches with unresolved addresses jump to the dynamic linker.  This will look through the jump_in list corresponding to the physical page containing the virtual target address.  If found, the branch instruction will be patched with the address, and then it will jump to this address.&lt;br /&gt;
&lt;br /&gt;
If not found, the jump_dirty list will be searched for blocks which were previously compiled but may have been modified.  If a potential match is found, the code will be compared against a cached copy to determine if any changes have been made.  If not, then it will jump to the block.  Because the memory could be modified again, branch instructions referencing these blocks are not altered, and continue to point to the dynamic linker.  These blocks will continue to be verified each time they are called, until restored to the jump_in list by the clean_blocks function described above.&lt;br /&gt;
&lt;br /&gt;
If no compiled block is found, or the existing block was modified, the target is recompiled.&lt;br /&gt;
&lt;br /&gt;
==Address lookup==&lt;br /&gt;
&lt;br /&gt;
When a JR (jump register) instruction is encountered, the address of the recompiled code must be looked up using the address of the MIPS code.  The majority of such instructions jump to the link register (r31) to return to the location following a JAL (call) instruction.&lt;br /&gt;
&lt;br /&gt;
When a JAL or JALR is executed, the address of the following instruction is inserted into a small 32-entry hash table, which is checked when a JR r31 instruction is executed.  This allows for a quick return from subroutine calls.&lt;br /&gt;
&lt;br /&gt;
If the JR instruction uses a register other than r31, or the small hash table lookup fails to find a match, a larger 131072-entry hash table is checked.  This table contains 65536 bins with up to 2 addresses per bin.  If this also fails to find a match (which occurs less than 1% of the time) an exhaustive search of all compiled addresses within that 4K memory page is performed.&lt;br /&gt;
&lt;br /&gt;
If no match is found by any of these methods, the target address is compiled, and the new address is inserted into the hash table.&lt;br /&gt;
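&lt;br /&gt;
The overall lookup cascade can be summarized in pseudocode (the function names here are illustrative):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
addr = mini_ht_lookup(vaddr);          /* 32-entry table, JR r31 only */&lt;br /&gt;
if (!addr) addr = hash_lookup(vaddr);  /* 65536 bins, 2 entries each */&lt;br /&gt;
if (!addr) addr = page_search(vaddr);  /* scan compiled blocks in the 4K page */&lt;br /&gt;
if (!addr) addr = compile_and_insert(vaddr);  /* recompile the target */&lt;br /&gt;
jump(addr);&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;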
&lt;br /&gt;
==Cycle counting==&lt;br /&gt;
&lt;br /&gt;
Cycles are counted before each branch by adding the cycles from the preceding instructions to a specific register.  The cycle count is in R10 on ARM and ESI on x86.  The value in this register is normally negative.  When it reaches or exceeds zero, a conditional branch is taken which jumps to an interrupt handler.&lt;br /&gt;
&lt;br /&gt;
For example, the following x86 code adds eight cycles:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add $8,%esi&lt;br /&gt;
jns interrupt_handler&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The conditional branch jumps to a small bit of code located after the main compiled block, which saves the cached registers to the register file, sets the instruction pointer which will be used upon return from the interrupt, and then calls cc_interrupt.&lt;br /&gt;
&lt;br /&gt;
As in the original mupen64plus, the emulated clock runs at 37.5 MHz, and each instruction takes 2 clock cycles.&lt;br /&gt;
&lt;br /&gt;
==Delay slots==&lt;br /&gt;
&lt;br /&gt;
MIPS has 'delay slots', where the instruction after the branch is executed before the branch is taken.  Instructions in delay slots are issued out-of-order in the recompiled code.&lt;br /&gt;
&lt;br /&gt;
[[Image:Recompiler delay slot reordering.png]]&lt;br /&gt;
&lt;br /&gt;
When a branch jumps into the delay slot of another branch, this case must be handled slightly differently:&lt;br /&gt;
&lt;br /&gt;
[[Image:Recompiler delay slot reordering 2.png]]&lt;br /&gt;
&lt;br /&gt;
The branch test and delay slot are executed in-order if a dependency exists, or for 'likely' branches where the delay slot is nullified when the branch condition is false.  These cases are infrequent (typically less than 10% of branches).&lt;br /&gt;
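&lt;br /&gt;
For example, given a branch whose delay slot does not affect the branch condition (a hypothetical fragment):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
BEQ r1,r2,L1    ; branch if r1 equals r2&lt;br /&gt;
ADDIU r3,r3,1   ; delay slot&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
the recompiled code issues the ADDIU first, then performs the comparison and conditional branch, since the addition does not alter the branch condition.&lt;br /&gt;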
&lt;br /&gt;
==Constant propagation==&lt;br /&gt;
&lt;br /&gt;
When an instruction loads a constant into a register, the register is tagged as a constant.  The constant tag will be retained if subsequent instructions modify the constant using other constants.  During assembly, such a sequence of instructions is combined into a single load.  For example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
LUI r8,12340000  --&amp;gt;  mov $0x12345678,%eax&lt;br /&gt;
ORI r8,r8,5678&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This optimization is not performed where a branch target intervenes, e.g.:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
...&lt;br /&gt;
 LUI r8,12340000  --&amp;gt;  mov $0x12340000,%eax&lt;br /&gt;
L1:&lt;br /&gt;
 ORI r8,r8,5678   --&amp;gt;  or $0x5678,%eax&lt;br /&gt;
 ...&lt;br /&gt;
 BEQ r0,r0,L1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Registers containing constants are identified by bits in the isconst and wasconst fields of the regstat structure.  The wasconst bit is set for a register if the register contained a known constant before the instruction, and the isconst bit is set for a register if the register will contain a known constant after the instruction executes.&lt;br /&gt;
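&lt;br /&gt;
A sketch of how an ORI instruction might maintain these tags during assembly (pseudocode; the actual field names and layout differ):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
if (is_const(rs)) {&lt;br /&gt;
   set_const(rt, get_const(rs) | imm);  /* fold at compile time, emit nothing */&lt;br /&gt;
} else {&lt;br /&gt;
   clear_const(rt);        /* result is no longer a known constant */&lt;br /&gt;
   emit_ori(rt, rs, imm);  /* generate the host instruction */&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;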
&lt;br /&gt;
==Translation lookaside buffer emulation==&lt;br /&gt;
&lt;br /&gt;
Most Nintendo 64 games do not use virtual memory, but some do.  At startup, main memory is directly mapped at addresses from 0x80000000 to 0x803FFFFF, or up to 0x807FFFFF if the memory expansion is used.&lt;br /&gt;
&lt;br /&gt;
Normally, read or write operations are checked against this range, and if outside this range, control is passed to the appropriate I/O handler.  This is done as follows:&lt;br /&gt;
&lt;br /&gt;
MIPS instruction: LW r1,8(r2)&lt;br /&gt;
&lt;br /&gt;
ARM code:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add r1, r2, #8&lt;br /&gt;
cmp r1, #8388608&lt;br /&gt;
bvc handler&lt;br /&gt;
ldr r1, [r1]&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If there are valid entries in the TLB, this would instead be compiled as follows:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add r1, r2, #8&lt;br /&gt;
mov r0, #264&lt;br /&gt;
add r0, r0, r1, lsr #12&lt;br /&gt;
ldr r0, [r11, r0, lsl #2]&lt;br /&gt;
tst r0, r0&lt;br /&gt;
bmi handler&lt;br /&gt;
ldr r1, [r1, r0, lsl #2]&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This looks up the offset in the memory_map table (which, in this example, is located at r11+264*4).  The high bit is tested to determine whether a valid mapping exists for this page.&lt;br /&gt;
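&lt;br /&gt;
In C-like pseudocode, the generated sequence corresponds to:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
entry = memory_map[addr / 4096];  /* one entry per 4K page */&lt;br /&gt;
if (entry &amp;lt; 0)&lt;br /&gt;
   handler(addr);  /* no valid mapping: raise an exception or perform I/O */&lt;br /&gt;
else&lt;br /&gt;
   value = *(u32 *)(addr + entry * 4);  /* offset translates to host memory */&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;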
&lt;br /&gt;
==Page fault emulation==&lt;br /&gt;
&lt;br /&gt;
If a memory address references an invalid page, a conditional branch is taken to a bit of code placed after the main block.  This will save any modified cached registers, and call an appropriate handler function for the address.  The handler will either perform I/O, or generate an exception (pagefault).&lt;br /&gt;
&lt;br /&gt;
==Mapping executable pages==&lt;br /&gt;
&lt;br /&gt;
If the dynamic recompiler encounters code which is not in contiguous physical memory, it will end the block at the page boundary, so that the block can be removed cleanly if the mapping is changed.&lt;br /&gt;
&lt;br /&gt;
There is one exception to this, where a branch and delay slot span two pages.  verify_dirty and verify_mapping check for this special case, and ensure that the TLB mapping remains the same before executing any such page-spanning block.&lt;br /&gt;
&lt;br /&gt;
== Compile options ==&lt;br /&gt;
&lt;br /&gt;
=== ARMv5_ONLY ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the UXTH instruction is not used, and the dynamic recompiler will generate literal pools instead of using movw/movt.  This provides compatibility with older processors, but generates somewhat less efficient code.&lt;br /&gt;
&lt;br /&gt;
=== CORTEX_A8_BRANCH_PREDICTION_HACK ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the dynamic recompiler will avoid generating consecutive branch instructions without another instruction in between.  This avoids a stall in instruction decoding on the Cortex-A8 due to this processor having dual instruction decoders, but only one branch-prediction unit.  See [[Assembly Code Optimization]] for details.&lt;br /&gt;
&lt;br /&gt;
=== USE_MINI_HT ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, attempt to look up return addresses in a small hash table before checking the larger hash table.  Usually improves performance.&lt;br /&gt;
&lt;br /&gt;
=== IMM_PREFETCH ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the x86 PREFETCH instruction is used to prefetch entries from the hash table.  The increase in code size often outweighs the benefit of this.&lt;br /&gt;
&lt;br /&gt;
=== REG_PREFETCH ===&lt;br /&gt;
&lt;br /&gt;
Similar to the above, but loads the address into a register first, then uses the ARM PLD instruction.  The increase in code size almost always outweighs the benefit of this.&lt;br /&gt;
&lt;br /&gt;
=== R29_HACK ===&lt;br /&gt;
&lt;br /&gt;
Assume that the stack pointer (r29) is always a valid memory address and do not check it.  It is similar to the optimization described [http://strmnnrmn.blogspot.com/2007/08/interesting-dynarec-hack.html here].  This can crash the emulator and is not enabled by default.&lt;br /&gt;
&lt;br /&gt;
==Future improvements==&lt;br /&gt;
&lt;br /&gt;
===Copy propagation/offset propagation===&lt;br /&gt;
&lt;br /&gt;
A common instruction sequence is of the form:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
LUI r9,12340000&lt;br /&gt;
ADD r9,r9,r8&lt;br /&gt;
LW r9,5678(r9)&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
It would be helpful to recognize this as a load from r8+12345678.  The current constant propagation code does not do so.&lt;br /&gt;
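&lt;br /&gt;
With such offset propagation, the sequence above could compile to a single host instruction, e.g. on x86 (assuming r8 is cached in %ebx and r9 in %eax):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
mov 0x12345678(%ebx),%eax&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;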
&lt;br /&gt;
===Unaligned memory access===&lt;br /&gt;
&lt;br /&gt;
A small improvement could be made by combining adjacent LWL/LWR instructions.  The potential gain from doing so is very limited because these instructions typically represent less than 1% of all memory accesses.&lt;br /&gt;
&lt;br /&gt;
===SLT/branch merging===&lt;br /&gt;
&lt;br /&gt;
A frequent occurrence in MIPS code is an SLT or SLTI instruction followed by a branch.  This is generated relatively inefficiently on x86 and ARM, first doing a compare and set, then testing this value and doing a conditional branch.  Doing only one comparison would save at least one instruction, and could potentially save up to three instructions if the liveness analysis reveals that the result of the SLT instruction is used nowhere else.&lt;br /&gt;
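&lt;br /&gt;
For example (hypothetical fragment; the x86 output assumes r2 is cached in %ebx, r3 in %ecx, and that the result in r1 is not otherwise used):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
SLT r1,r2,r3   --&amp;gt;  cmp %ecx,%ebx&lt;br /&gt;
BNE r1,r0,L1         jl L1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;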
&lt;br /&gt;
While a potentially useful optimization, there are several problems with this approach.  First, there are often additional instructions between the SLT and the branch; these must be checked to make sure they do not modify the relevant registers, as that would prevent reordering the instruction stream as desired.  Secondly, if the result of the SLT is found to be live, but unmodified, on both paths of the branch, clean_registers will normally write this value before the branch, to avoid duplicating the writeback code on both paths.  This optimization would have to be removed if the SLT were combined with the branch.&lt;br /&gt;
&lt;br /&gt;
===Matching by virtual address===&lt;br /&gt;
&lt;br /&gt;
The clean_blocks function will only recognize blocks as unmodified if both the physical and virtual addresses are the same.  It would be more efficient if it could recognize unmodified blocks where the virtual address is the same even though the physical address may have changed.&lt;br /&gt;
&lt;br /&gt;
===x86-64===&lt;br /&gt;
&lt;br /&gt;
Currently the x86-64 backend generates only 32-bit instructions.  Proper 64-bit code generation would improve performance.&lt;br /&gt;
&lt;br /&gt;
===PowerPC===&lt;br /&gt;
&lt;br /&gt;
It would be possible to add a PowerPC code generator to the dynamic recompiler.  Currently no one is working on this.  (The mupen64gc project is using a different codebase.)&lt;br /&gt;
&lt;br /&gt;
The following is a summary of the changes which would be necessary to add a PowerPC backend.&lt;br /&gt;
&lt;br /&gt;
The slt* instructions use conditional moves, which are unavailable on PowerPC.  A suitable alternative (such as moves from the condition register) would need to be used.&lt;br /&gt;
&lt;br /&gt;
The assembler can generate as much as 256K of code in a single block; however, conditional branch displacements on PowerPC are limited to +/-32K.  It will be necessary to either restrict the block size, or insert jump tables in a manner similar to the literal pools on ARM.&lt;br /&gt;
&lt;br /&gt;
PowerPC generally relies on early branch resolution rather than statistical branch prediction.  Scheduling branch condition tests earlier may be advantageous.  (For example, the address bounds check could be done in address_generation, rather than just before the load or store.  Similarly it may be advantageous to test the branch condition and update the cycle count before executing the delay slot.)&lt;br /&gt;
&lt;br /&gt;
===MIPS===&lt;br /&gt;
&lt;br /&gt;
Recompiling MIPS into MIPS would be relatively straightforward, however the current code generator has no facility for filling delay slots.  This capability would be required for efficient code generation.&lt;br /&gt;
&lt;br /&gt;
[[Category:Development]]&lt;/div&gt;</summary>
		<author><name>Ari64</name></author>
		
	</entry>
	<entry>
		<id>https://pandorawiki.org/index.php?title=Mupen64plus_dynamic_recompiler&amp;diff=3062</id>
		<title>Mupen64plus dynamic recompiler</title>
		<link rel="alternate" type="text/html" href="https://pandorawiki.org/index.php?title=Mupen64plus_dynamic_recompiler&amp;diff=3062"/>
		<updated>2010-07-18T18:22:18Z</updated>

		<summary type="html">&lt;p&gt;Ari64: Compile options&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes the dynamic recompiler in Mupen64plus, and the changes made for the ARM port.&lt;br /&gt;
&lt;br /&gt;
==The original dynamic recompiler by Hacktarux==&lt;br /&gt;
&lt;br /&gt;
The dynamic recompiler used in Mupen64plus v1.5 is based on the original written by Hacktarux in 2002.&lt;br /&gt;
&lt;br /&gt;
It recompiles contiguous blocks of MIPS instructions.  First, each instruction is decoded into a dynamically-allocated 132-byte data structure.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
typedef struct _precomp_instr&lt;br /&gt;
{&lt;br /&gt;
   void (*ops)();&lt;br /&gt;
   union&lt;br /&gt;
     {&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         long long int *rs;&lt;br /&gt;
         long long int *rt;&lt;br /&gt;
         short immediate;&lt;br /&gt;
      } i;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned int inst_index;&lt;br /&gt;
      } j;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         long long int *rs;&lt;br /&gt;
         long long int *rt;&lt;br /&gt;
         long long int *rd;&lt;br /&gt;
         unsigned char sa;&lt;br /&gt;
         unsigned char nrd;&lt;br /&gt;
      } r;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned char base;&lt;br /&gt;
         unsigned char ft;&lt;br /&gt;
         short offset;&lt;br /&gt;
      } lf;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned char ft;&lt;br /&gt;
         unsigned char fs;&lt;br /&gt;
         unsigned char fd;&lt;br /&gt;
      } cf;&lt;br /&gt;
     } f;&lt;br /&gt;
   unsigned int addr; /* word-aligned instruction address in r4300 address space */&lt;br /&gt;
   unsigned int local_addr; /* byte offset to start of corresponding x86_64 instructions, from start of code block */&lt;br /&gt;
   reg_cache_struct reg_cache_infos;&lt;br /&gt;
} precomp_instr;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The decoded instructions are then compiled, generating x86 instructions for each MIPS instruction.  A 32K block is allocated with malloc() to hold the x86 code.  If this size proves insufficient, it is incrementally resized with realloc().&lt;br /&gt;
&lt;br /&gt;
MIPS registers are allocated to x86 registers with a least-recently-used replacement policy.  All cached registers are written back prior to a branch, or a memory read or write.&lt;br /&gt;
&lt;br /&gt;
To facilitate invalidation and replacement of modified code blocks, each 4K page is compiled separately.  If a sequence of instructions crosses a 4K page boundary, the current block is ended, and the next instructions will be compiled as a separate block.&lt;br /&gt;
&lt;br /&gt;
Branch instructions within a 4K page are compiled as branches directly to the target address.  Branches which cross a 4K page go through an indirect address lookup.&lt;br /&gt;
&lt;br /&gt;
Compiled code blocks are invalidated on write.  On an actual MIPS CPU, the instruction cache is invalidated using the CACHE instruction, however a few N64 games clear the cache using other methods.  Trapping writes appears to be the most reliable method of ensuring cache coherency in the emulated system.  The cache instruction is ignored.&lt;br /&gt;
&lt;br /&gt;
==Problems with the original design==&lt;br /&gt;
&lt;br /&gt;
The most significant performance problem with this design is its excessive memory usage.  The decoded instruction data is retained (132 bytes for each MIPS instruction) and occasionally referenced during execution.  Memory accesses frequently miss the L2 cache, resulting in poor performance.&lt;br /&gt;
&lt;br /&gt;
Additionally, the register cache is relatively inefficient, since all registers are flushed before any read or write operation.&lt;br /&gt;
&lt;br /&gt;
==A new approach==&lt;br /&gt;
&lt;br /&gt;
To reduce memory usage, the new design allocates a single large block of memory (currently 16 MiB) which is used for recompiled code.  This memory is allocated using mmap with the PROT_EXEC bit set, to ensure operation on CPUs with no-execute (NX) page permissions.&lt;br /&gt;
&lt;br /&gt;
Contiguous blocks of MIPS instructions are compiled (that is, it does not attempt to follow branches or 'hot-paths').&lt;br /&gt;
&lt;br /&gt;
The recompiler consists of eight stages, plus a linker, and a memory manager.&lt;br /&gt;
&lt;br /&gt;
Compiled blocks are invalidated on write, as before; however, since compiled blocks may cross a 4K page boundary, writes may invalidate adjacent pages as well.&lt;br /&gt;
&lt;br /&gt;
Currently the dynarec generates x86, x86-64, ARMv5, and ARMv7 (little-endian).  Most of the code is shared between the architectures, but a different code generator is included at compile time, using different #include statements depending on the CPU type.&lt;br /&gt;
&lt;br /&gt;
==Pass 1: Disassembly==&lt;br /&gt;
&lt;br /&gt;
When an instruction address is encountered which has not been compiled, the function new_recompile_block is called, with the (virtual) address of the target instruction as its sole parameter.  If the address is invalid, the function returns a nonzero value and the caller is responsible for handling the pagefault.&lt;br /&gt;
&lt;br /&gt;
Instructions are decoded until an unconditional jump, usually a return, is encountered.  Disassembly ordinarily continues past a JAL (subroutine call) instruction; however, this strategy is abandoned if invalid instructions are encountered.&lt;br /&gt;
&lt;br /&gt;
Up to 4096 instructions (16K of MIPS code) may be disassembled at once.  Surprisingly, some games do actually reach this limit.&lt;br /&gt;
&lt;br /&gt;
==Pass 2: Liveness analysis==&lt;br /&gt;
&lt;br /&gt;
After disassembly, liveness analysis is performed on the registers.  This determines when a particular register will no longer be used, and thus can be removed from the register cache.&lt;br /&gt;
&lt;br /&gt;
A separate analysis is done on the upper 32 bits of 64-bit registers.  This can determine when only the lower 32 bits of a register are significant, thus allowing use of a 32-bit register.  This enables more efficient code generation on 32-bit processors.&lt;br /&gt;
&lt;br /&gt;
==Pass 3: Register allocation==&lt;br /&gt;
&lt;br /&gt;
The 31 MIPS registers must be mapped onto seven available registers on x86, or twelve available registers on ARM.&lt;br /&gt;
&lt;br /&gt;
Instructions with 64-bit results require two registers on 32-bit host processors.  32-bit instructions require only one.  A flag is set for registers containing 32-bit values, and these registers will be sign-extended before they are written to the register file.&lt;br /&gt;
&lt;br /&gt;
Registers are made available using a least-soon-needed replacement policy.  When the register cache is full, and no registers can be eliminated using the liveness analysis, a ten-instruction lookahead is used to determine which registers will not be needed soon and these registers are evicted from the cache.&lt;br /&gt;
&lt;br /&gt;
==Pass 4: Free unused registers==&lt;br /&gt;
&lt;br /&gt;
After the initial register allocation, the allocations are reviewed to determine if any registers remain allocated longer than necessary.  These are then removed from the register cache.  This avoids having to write back a large number of registers just before a branch, and makes more registers available for the next pass.&lt;br /&gt;
&lt;br /&gt;
==Pass 5: Pre-allocate registers==&lt;br /&gt;
&lt;br /&gt;
If a register will be used soon and needs to be loaded, try to load it early if the register is available.  This improves execution on CPUs with a load-use penalty.&lt;br /&gt;
&lt;br /&gt;
==Pass 6: Optimize clean/dirty state==&lt;br /&gt;
&lt;br /&gt;
If a cached register is 'dirty' and needs to be written out, try to do so as soon as the register will no longer be modified.  This avoids having to write the same register on multiple code paths due to conditional branches.  Additionally, try to avoid writing out dirty registers inside loops.&lt;br /&gt;
&lt;br /&gt;
==Pass 7: Identify where registers are assumed to be 32-bit==&lt;br /&gt;
&lt;br /&gt;
When a 64-bit register is mapped to a 32-bit register with the assumption that the value will be sign-extended before being used, it is necessary to ensure that no register contains a 64-bit value when branching to such a location.  These instructions are flagged to identify them as requiring 32-bit inputs.  This information is used by the linker, and the exception return (ERET) handler.&lt;br /&gt;
&lt;br /&gt;
==Pass 8: Assembly==&lt;br /&gt;
&lt;br /&gt;
This generates and outputs the recompiled code.&lt;br /&gt;
&lt;br /&gt;
Following the main code block, handlers for certain exceptions as well as alternate entry points are added.&lt;br /&gt;
&lt;br /&gt;
If a recompiled instruction relies on a certain MIPS register being cached in a certain native register, then a short 'stub' of code is generated to load the necessary registers.  When an instruction outside of this block needs to jump to that location, it will instead jump to the stub.  The necessary registers will be loaded, and then it will jump into the main code sequence.&lt;br /&gt;
&lt;br /&gt;
On architectures which require literal pools (ARMv5) these are inserted as necessary.&lt;br /&gt;
&lt;br /&gt;
==Linker==&lt;br /&gt;
&lt;br /&gt;
The linker fills in all unresolved branches.  Branches within the block are linked to their target address.  Branches which jump outside of the block are linked to their target if that address has been compiled already. These inter-block branches are recorded in the jump_out array.  This information will be used to remove the links in the event that the target of the branch is invalidated.&lt;br /&gt;
&lt;br /&gt;
Unresolved branches point to a stub which loads the address of the branch instruction and the virtual address of its target into registers, and calls the dynamic linker.  When this code is executed, the dynamic linker will compile the target if necessary, and then patch the branch instruction with the new address.&lt;br /&gt;
&lt;br /&gt;
==Memory manager==&lt;br /&gt;
&lt;br /&gt;
The last step in new_recompile_block is to ensure that there will be sufficient memory available to compile the next block.  If there is not, then the oldest blocks are purged.&lt;br /&gt;
&lt;br /&gt;
The dynarec cache can be described as a 16MB circular buffer divided into eight segments.  Memory is allocated in order from beginning to end.&lt;br /&gt;
&lt;br /&gt;
When fewer than two empty segments remain, the next segment in sequence is cleared.  This continues as memory is needed, wrapping around from the end of the buffer to the beginning.&lt;br /&gt;
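&lt;br /&gt;
The segment bookkeeping can be sketched as follows (illustrative C using the 16MB/eight-segment figures above; the variable names are assumptions):&lt;br /&gt;
&lt;br /&gt;
```c
#define CACHE_SIZE (16u << 20)            /* 16MB translation cache */
#define SEG_SIZE   (CACHE_SIZE / 8)       /* eight 2MB segments */

unsigned int out = 0;                     /* current allocation offset */
int seg_empty[8] = {1, 1, 1, 1, 1, 1, 1, 1};

/* Return the segment to purge next, or -1 if the two segments ahead of
   the write pointer are already empty.  Indices wrap modulo eight. */
int segment_to_clear(void)
{
  unsigned int seg = out / SEG_SIZE;
  unsigned int a1 = (seg + 1) % 8, a2 = (seg + 2) % 8;
  if (seg_empty[a1] && seg_empty[a2]) return -1;
  return seg_empty[a1] ? (int)a2 : (int)a1;
}
```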
&lt;br /&gt;
==Invalidation and restoration==&lt;br /&gt;
&lt;br /&gt;
Normally, code blocks are looked up via a hash table, using the virtual address of the target instruction.  However, for purposes of invalidation, blocks are grouped by physical address.  This can be described as a virtually-indexed, physically-tagged (VIPT) cache.&lt;br /&gt;
&lt;br /&gt;
References to compiled blocks are stored in one of 4096 linked lists in the jump_in array.  Each list covers a 4K block of memory, and 2048 such lists are sufficient to cover the 8MB of RAM in the Nintendo 64.  The remaining lists are for code in ROM, and the bootloader in SP memory.&lt;br /&gt;
&lt;br /&gt;
When a write hits a memory page marked as cached, all entries in the corresponding list are invalidated.  If any code is found to cross a 4K boundary, the adjacent lists are invalidated also.&lt;br /&gt;
&lt;br /&gt;
Sometimes blocks may be invalidated even when none of the code is actually modified.  This can happen if data is written to memory in the same 4K page, or if code is reloaded without actually modifying it.  If blocks which were previously invalidated are subsequently found to be unmodified, those blocks are marked in the restore_candidate array.  If the block remains unmodified, it will be restored as a valid block, to avoid recompiling blocks which do not need to be recompiled.  This is performed by the clean_blocks function which is called periodically.&lt;br /&gt;
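&lt;br /&gt;
The sweep can be modelled like this (illustrative C; the real clean_blocks walks the linked lists described above, and these declarations are stand-ins):&lt;br /&gt;
&lt;br /&gt;
```c
#include <string.h>

struct block {
  const void *code;   /* current contents of the source MIPS code */
  const void *copy;   /* cached copy taken at compile time */
  int len;
  int valid;          /* currently in jump_in */
  int candidate;      /* flagged in restore_candidate */
};

/* Periodic sweep: restore an invalidated block whose source code still
   matches the cached copy; a block that changed again loses its flag. */
void clean_blocks(struct block *b, int n)
{
  for (int i = 0; i < n; i++)
    if (!b[i].valid && b[i].candidate) {
      if (memcmp(b[i].code, b[i].copy, b[i].len) == 0)
        b[i].valid = 1;          /* unmodified since flagged: restore */
      else
        b[i].candidate = 0;      /* modified after all: stay invalid */
    }
}
```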
&lt;br /&gt;
==Dynamic linker==&lt;br /&gt;
&lt;br /&gt;
Branches with unresolved addresses jump to the dynamic linker.  This will look through the jump_in list corresponding to the physical page containing the virtual target address.  If found, the branch instruction will be patched with the address, and then it will jump to this address.&lt;br /&gt;
&lt;br /&gt;
If not found, the jump_dirty list will be searched for blocks which were previously compiled but may have been modified.  If a potential match is found, the code will be compared against a cached copy to determine if any changes have been made.  If not, then it will jump to the block.  Because the memory could be modified again, branch instructions referencing these blocks are not altered, and continue to point to the dynamic linker.  These blocks will continue to be verified each time they are called, until restored to the jump_in list by the clean_blocks function described above.&lt;br /&gt;
&lt;br /&gt;
If no compiled block is found, or the existing block was modified, the target is recompiled.&lt;br /&gt;
&lt;br /&gt;
==Address lookup==&lt;br /&gt;
&lt;br /&gt;
When a JR (jump register) instruction is encountered, the address of the recompiled code must be looked up using the address of the MIPS code.  The majority of such instructions jump to the link register (r31) to return to the location following a JAL (call) instruction.&lt;br /&gt;
&lt;br /&gt;
When a JAL or JALR is executed, the address of the following instruction is inserted into a small 32-entry hash table, which is checked when a JR r31 instruction is executed.  This allows for a quick return from subroutine calls.&lt;br /&gt;
&lt;br /&gt;
If the JR instruction uses a register other than r31, or the small hash table lookup fails to find a match, a larger 131072-entry hash table is checked.  This table contains 65536 bins with up to 2 addresses per bin.  If this also fails to find a match (which occurs less than 1% of the time) an exhaustive search of all compiled addresses within that 4K memory page is performed.&lt;br /&gt;
&lt;br /&gt;
If no match is found by any of these methods, the target address is compiled, and the new address is inserted into the hash table.&lt;br /&gt;
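&lt;br /&gt;
The two-level lookup can be modelled as follows (an illustrative C sketch: the bin layout matches the figures above, but the hash functions and names are assumptions, and the mini table, which is filled on JAL/JALR, is only consulted here):&lt;br /&gt;
&lt;br /&gt;
```c
#include <stdint.h>

static uint32_t mini_vaddr[32];            /* small table filled on JAL/JALR */
static void *mini_ptr[32];
static struct { uint32_t vaddr[2]; void *ptr[2]; } hash_table[65536];

/* Look up recompiled code for a MIPS virtual address: mini table first,
   then the 65536-bin table with two slots per bin (131072 entries). */
void *lookup_addr(uint32_t vaddr)
{
  uint32_t m = (vaddr >> 2) & 31;
  if (mini_vaddr[m] == vaddr) return mini_ptr[m];
  uint32_t h = ((vaddr >> 16) ^ vaddr) & 65535;
  if (hash_table[h].vaddr[0] == vaddr) return hash_table[h].ptr[0];
  if (hash_table[h].vaddr[1] == vaddr) return hash_table[h].ptr[1];
  return 0;   /* fall back to the exhaustive page search, or recompile */
}

void insert_addr(uint32_t vaddr, void *ptr)
{
  uint32_t h = ((vaddr >> 16) ^ vaddr) & 65535;
  hash_table[h].vaddr[1] = hash_table[h].vaddr[0];  /* keep newest first */
  hash_table[h].ptr[1]   = hash_table[h].ptr[0];
  hash_table[h].vaddr[0] = vaddr;
  hash_table[h].ptr[0]   = ptr;
}
```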
&lt;br /&gt;
==Cycle counting==&lt;br /&gt;
&lt;br /&gt;
Cycles are counted before each branch by adding the cycles from the preceding instructions to a specific register.  The cycle count is in R10 on ARM and ESI on x86.  The value in this register is normally a negative number.  When this number exceeds zero, a conditional branch is taken which jumps to an interrupt handler.&lt;br /&gt;
&lt;br /&gt;
For example, the following x86 code adds eight cycles:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add $8,%esi&lt;br /&gt;
jns interrupt_handler&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The conditional branch jumps to a small bit of code located after the main compiled block, which saves the cached registers to the register file, sets the instruction pointer which will be used upon return from the interrupt, and then calls cc_interrupt.&lt;br /&gt;
&lt;br /&gt;
As in the original mupen64plus, the emulated clock runs at 37.5 MHz, and each instruction takes 2 clock cycles.&lt;br /&gt;
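&lt;br /&gt;
The counting scheme can be modelled in C (illustrative; the reset value and handler bookkeeping here are stand-ins for the real interrupt scheduling):&lt;br /&gt;
&lt;br /&gt;
```c
/* The cycle count register holds a negative number: minus the cycles
   remaining until the next timed event.  Each block adds its cycle
   total; reaching zero or above corresponds to the 'jns' branch above. */
int cycle_count = -100;   /* illustrative: next event 100 cycles away */
int interrupts_taken = 0;

void add_cycles(int n)
{
  cycle_count += n;
  if (cycle_count >= 0) {       /* sign flag clear: take the interrupt */
    interrupts_taken++;
    cycle_count = -100;         /* rescheduled by the handler */
  }
}
```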
&lt;br /&gt;
==Delay slots==&lt;br /&gt;
&lt;br /&gt;
MIPS has 'delay slots', where the instruction after the branch is executed before the branch is taken.  Instructions in delay slots are issued out-of-order in the recompiled code.&lt;br /&gt;
&lt;br /&gt;
[[Image:Recompiler delay slot reordering.png]]&lt;br /&gt;
&lt;br /&gt;
When a branch jumps into the delay slot of another branch, this case must be handled slightly differently:&lt;br /&gt;
&lt;br /&gt;
[[Image:Recompiler delay slot reordering 2.png]]&lt;br /&gt;
&lt;br /&gt;
The branch test and delay slot are executed in-order if a dependency exists, or for 'likely' branches where the delay slot is nullified when the branch condition is false.  These cases are infrequent (typically less than 10% of branches).&lt;br /&gt;
&lt;br /&gt;
==Constant propagation==&lt;br /&gt;
&lt;br /&gt;
When an instruction loads a constant into a register, the register is tagged as a constant.  The constant tag will be retained if subsequent instructions modify the constant using other constants.  During assembly, such a sequence of instructions is combined into a single load.  For example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
LUI r8,12340000  --&amp;gt;  mov $0x12345678,%eax&lt;br /&gt;
ORI r8,r8,5678&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This optimization is not performed where a branch target intervenes, e.g.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
...&lt;br /&gt;
 LUI r8,12340000  --&amp;gt;  mov $0x12340000,%eax&lt;br /&gt;
L1:&lt;br /&gt;
 ORI r8,r8,5678   --&amp;gt;  or $0x5678,%eax&lt;br /&gt;
 ...&lt;br /&gt;
 BEQ r0,r0,L1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Registers containing constants are identified by bits in the isconst and wasconst fields of the regstat structure.  The wasconst bit is set for a register if the register contained a known constant before the instruction, and the isconst bit is set for a register if the register will contain a known constant after the instruction executes.&lt;br /&gt;
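&lt;br /&gt;
The tagging can be sketched for the LUI/ORI pair above (illustrative C; value[] and the helper names are assumptions, while isconst corresponds to the field described):&lt;br /&gt;
&lt;br /&gt;
```c
#include <stdint.h>

uint32_t isconst;     /* bit n set: register n holds a known constant */
uint32_t value[32];   /* the tracked constant for each register */

void tag_lui(int rt, uint16_t imm)
{
  value[rt] = (uint32_t)imm << 16;
  isconst |= 1u << rt;
}

void tag_ori(int rt, int rs, uint16_t imm)
{
  if (isconst & (1u << rs)) {     /* constant op constant: fold it */
    value[rt] = value[rs] | imm;
    isconst |= 1u << rt;
  } else {
    isconst &= ~(1u << rt);       /* result is no longer a known constant */
  }
}
```
During assembly, a register whose isconst bit is set can then be materialized with a single load of its tracked value, as in the mov $0x12345678 example above.&lt;br /&gt;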
&lt;br /&gt;
==Translation lookaside buffer emulation==&lt;br /&gt;
&lt;br /&gt;
Most Nintendo 64 games do not use virtual memory, but some do.  At startup, main memory is directly mapped at addresses from 0x80000000 to 0x803FFFFF, or up to 0x807FFFFF if the memory expansion is used.&lt;br /&gt;
&lt;br /&gt;
Normally, read or write operations are checked against this range, and if outside this range, control is passed to the appropriate I/O handler.  This is done as follows:&lt;br /&gt;
&lt;br /&gt;
MIPS instruction: LW r1,8(r2)&lt;br /&gt;
&lt;br /&gt;
ARM code:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add r1, r2, #8&lt;br /&gt;
cmp r1, #8388608&lt;br /&gt;
bvc handler&lt;br /&gt;
ldr r1, [r1]&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If there are valid entries in the TLB, this would instead be compiled as follows:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add r1, r2, #8&lt;br /&gt;
mov r0, #264&lt;br /&gt;
add r0, r0, r1, lsr #12&lt;br /&gt;
ldr r0, [r11, r0, lsl #2]&lt;br /&gt;
tst r0, r0&lt;br /&gt;
bmi handler&lt;br /&gt;
ldr r1, [r1, r0, lsl #2]&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This looks up the offset in the memory_map table (which, in this example, is located at r11+264*4).  The high bit is tested to determine whether a valid mapping exists for this page.&lt;br /&gt;
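&lt;br /&gt;
In C, the lookup corresponds to roughly the following (a simplified model: here the emulated RAM is an array and the offset is applied modulo its size, whereas the real table stores offsets that rebase the virtual address onto host memory):&lt;br /&gt;
&lt;br /&gt;
```c
#include <stdint.h>

#define RAM_WORDS (0x800000 / 4)        /* emulated 8MB of RDRAM */
uint32_t ram[RAM_WORDS];
uint32_t memory_map[1u << 20];          /* one entry per 4K virtual page */
#define NO_MAPPING 0x80000000u          /* high bit set: no valid mapping */

/* Mirror of the second ARM sequence: fetch the page's word offset, test
   the high bit (the 'bmi handler'), then load at vaddr plus the offset. */
int read_word(uint32_t vaddr, uint32_t *out)
{
  uint32_t off = memory_map[vaddr >> 12];
  if (off & NO_MAPPING) return -1;      /* I/O or page fault path */
  *out = ram[((vaddr >> 2) + off) % RAM_WORDS];
  return 0;
}
```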
&lt;br /&gt;
==Page fault emulation==&lt;br /&gt;
&lt;br /&gt;
If a memory address references an invalid page, a conditional branch is taken to a bit of code placed after the main block.  This will save any modified cached registers, and call an appropriate handler function for the address.  The handler will either perform I/O, or generate an exception (pagefault).&lt;br /&gt;
&lt;br /&gt;
==Mapping executable pages==&lt;br /&gt;
&lt;br /&gt;
If the dynamic recompiler encounters code which is not in contiguous physical memory, it will end the block at the page boundary, so that the block can be removed cleanly if the mapping is changed.&lt;br /&gt;
&lt;br /&gt;
There is one exception to this, where a branch and delay slot span two pages.  verify_dirty and verify_mapping check for this special case, and ensure that the TLB mapping remains the same before executing any such page-spanning block.&lt;br /&gt;
&lt;br /&gt;
== Compile options ==&lt;br /&gt;
&lt;br /&gt;
=== ARMv5_ONLY ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the UXTH instruction is not used, and the dynamic recompiler will generate literal pools instead of using movw/movt.  This provides compatibility with older processors, but generates somewhat less efficient code.&lt;br /&gt;
&lt;br /&gt;
=== CORTEX_A8_BRANCH_PREDICTION_HACK ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the dynamic recompiler will avoid generating consecutive branch instructions without another instruction in between.  This avoids a stall in instruction decoding on the Cortex-A8 due to this processor having dual instruction decoders, but only one branch-prediction unit.  See [[Assembly Code Optimization]] for details.&lt;br /&gt;
&lt;br /&gt;
=== USE_MINI_HT ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, attempt to look up return addresses in a small hash table before checking the larger hash table.  Usually improves performance.&lt;br /&gt;
&lt;br /&gt;
=== IMM_PREFETCH ===&lt;br /&gt;
&lt;br /&gt;
If this is defined, the x86 PREFETCH instruction is used to prefetch entries from the hash table.  The increase in code size often outweighs the benefit of this.&lt;br /&gt;
&lt;br /&gt;
=== REG_PREFETCH ===&lt;br /&gt;
&lt;br /&gt;
Similar to the above, but loads the address into a register first, then uses the ARM PLD instruction.  The increase in code size almost always outweighs the benefit of this.&lt;br /&gt;
&lt;br /&gt;
=== R29_HACK ===&lt;br /&gt;
&lt;br /&gt;
Assume that the stack pointer (r29) is always a valid memory address and do not check it.  It is similar to the optimization described [http://strmnnrmn.blogspot.com/2007/08/interesting-dynarec-hack.html here].  This can crash the emulator and is not enabled by default.&lt;br /&gt;
&lt;br /&gt;
==Future improvements==&lt;br /&gt;
&lt;br /&gt;
===Copy propagation/offset propagation===&lt;br /&gt;
&lt;br /&gt;
A common instruction sequence is of the form:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
LUI r9,12340000&lt;br /&gt;
ADD r9,r9,r8&lt;br /&gt;
LW r9,5678(r9)&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
It would be helpful to recognize this as a load from r8+12345678.  The current constant propagation code does not do so.&lt;br /&gt;
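&lt;br /&gt;
For reference, the folded address would be computed as follows (illustrative; note that the 16-bit load offset is sign-extended):&lt;br /&gt;
&lt;br /&gt;
```c
#include <stdint.h>

/* Combine LUI hi / ADD base / LW lo(...) into one base+constant address. */
uint32_t fold_address(uint32_t base, uint16_t hi, int16_t lo)
{
  return base + ((uint32_t)hi << 16) + (uint32_t)(int32_t)lo;
}
```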
&lt;br /&gt;
===Unaligned memory access===&lt;br /&gt;
&lt;br /&gt;
A small improvement could be made by combining adjacent LWL/LWR instructions.  The potential gain from doing so is very limited because these instructions typically represent less than 1% of all memory accesses.&lt;br /&gt;
&lt;br /&gt;
===SLT/branch merging===&lt;br /&gt;
&lt;br /&gt;
A frequent occurrence in MIPS code is an SLT or SLTI instruction followed by a branch.  This is generated relatively inefficiently on x86 and ARM, first doing a compare and set, then testing this value and doing a conditional branch.  Doing only one comparison would save at least one instruction, and could potentially save up to three instructions if the liveness analysis reveals that the result of the SLT instruction is used nowhere else.&lt;br /&gt;
&lt;br /&gt;
While a potentially useful optimization, there are two problems with this approach.  First, there are often additional instructions between the SLT and the branch.  These must be checked to make sure they do not modify the relevant registers, as that would prevent reordering the instruction stream as desired.  Second, if the result of the SLT is found to be live, but unmodified, on both paths of the branch, clean_registers will normally write this value before the branch, to avoid duplicating the writeback code on both paths.  This optimization would have to be removed if the SLT were combined with the branch.&lt;br /&gt;
&lt;br /&gt;
===Matching by virtual address===&lt;br /&gt;
&lt;br /&gt;
The clean_blocks function will only recognize blocks as unmodified if both the physical and virtual addresses are the same.  It would be more efficient if it could recognize unmodified blocks where the virtual address is the same even though the physical address may have changed.&lt;br /&gt;
&lt;br /&gt;
===x86-64===&lt;br /&gt;
&lt;br /&gt;
Currently the x86-64 backend generates only 32-bit instructions.  Proper 64-bit code generation would improve performance.&lt;br /&gt;
&lt;br /&gt;
===PowerPC===&lt;br /&gt;
&lt;br /&gt;
It would be possible to add a PowerPC code generator to the dynamic recompiler.  Currently no one is working on this.  (The mupen64gc project is using a different codebase.)&lt;br /&gt;
&lt;br /&gt;
The following is a summary of the changes which would be necessary to add a PowerPC backend.&lt;br /&gt;
&lt;br /&gt;
The slt* instructions use conditional moves, which are unavailable on PowerPC.  A suitable alternative (such as moves from the condition register) would need to be used.&lt;br /&gt;
&lt;br /&gt;
The assembler can generate as much as 256K of code in a single block; however, conditional branches on PowerPC are limited to a +/-32K range.  It will be necessary to either restrict the block size, or insert jump tables in a manner similar to the literal pools on ARM.&lt;br /&gt;
&lt;br /&gt;
PowerPC generally relies on early branch resolution rather than statistical branch prediction.  Scheduling branch condition tests earlier may be advantageous.  (For example, the address bounds check could be done in address_generation, rather than just before the load or store.  Similarly it may be advantageous to test the branch condition and update the cycle count before executing the delay slot.)&lt;br /&gt;
&lt;br /&gt;
===MIPS===&lt;br /&gt;
&lt;br /&gt;
Recompiling MIPS into MIPS would be relatively straightforward; however, the current code generator has no facility for filling delay slots.  This capability would be required for efficient code generation.&lt;br /&gt;
&lt;br /&gt;
[[Category:Development]]&lt;/div&gt;</summary>
		<author><name>Ari64</name></author>
		
	</entry>
	<entry>
		<id>https://pandorawiki.org/index.php?title=Mupen64plus_dynamic_recompiler&amp;diff=2625</id>
		<title>Mupen64plus dynamic recompiler</title>
		<link rel="alternate" type="text/html" href="https://pandorawiki.org/index.php?title=Mupen64plus_dynamic_recompiler&amp;diff=2625"/>
		<updated>2010-06-30T15:44:11Z</updated>

		<summary type="html">&lt;p&gt;Ari64: typos&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes the dynamic recompiler in Mupen64plus, and the changes made for the ARM port.&lt;br /&gt;
&lt;br /&gt;
==The original dynamic recompiler by Hacktarux==&lt;br /&gt;
&lt;br /&gt;
The dynamic recompiler used in Mupen64plus v1.5 is based on the original written by Hacktarux in 2002.&lt;br /&gt;
&lt;br /&gt;
It recompiles contiguous blocks of MIPS instructions.  First, each instruction is decoded into a dynamically-allocated 132-byte data structure.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
typedef struct _precomp_instr&lt;br /&gt;
{&lt;br /&gt;
   void (*ops)();&lt;br /&gt;
   union&lt;br /&gt;
     {&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         long long int *rs;&lt;br /&gt;
         long long int *rt;&lt;br /&gt;
         short immediate;&lt;br /&gt;
      } i;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned int inst_index;&lt;br /&gt;
      } j;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         long long int *rs;&lt;br /&gt;
         long long int *rt;&lt;br /&gt;
         long long int *rd;&lt;br /&gt;
         unsigned char sa;&lt;br /&gt;
         unsigned char nrd;&lt;br /&gt;
      } r;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned char base;&lt;br /&gt;
         unsigned char ft;&lt;br /&gt;
         short offset;&lt;br /&gt;
      } lf;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned char ft;&lt;br /&gt;
         unsigned char fs;&lt;br /&gt;
         unsigned char fd;&lt;br /&gt;
      } cf;&lt;br /&gt;
     } f;&lt;br /&gt;
   unsigned int addr; /* word-aligned instruction address in r4300 address space */&lt;br /&gt;
   unsigned int local_addr; /* byte offset to start of corresponding x86_64 instructions, from start of code block */&lt;br /&gt;
   reg_cache_struct reg_cache_infos;&lt;br /&gt;
} precomp_instr;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The decoded instructions are then compiled, generating x86 instructions for each MIPS instruction.  A 32K block is allocated with malloc() to hold the x86 code.  If this size proves insufficient, it is incrementally resized with realloc().&lt;br /&gt;
&lt;br /&gt;
MIPS registers are allocated to x86 registers with a least-recently-used replacement policy.  All cached registers are written back prior to a branch, or a memory read or write.&lt;br /&gt;
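&lt;br /&gt;
The LRU choice amounts to the following (illustrative C; the original tracks usage differently, this only shows the policy):&lt;br /&gt;
&lt;br /&gt;
```c
/* Evict the cached register that was touched least recently:
   last_used[hr] holds the time of the most recent access. */
int lru_victim(const int *last_used, int nhost)
{
  int victim = 0;
  for (int hr = 1; hr < nhost; hr++)
    if (last_used[hr] < last_used[victim]) victim = hr;
  return victim;
}
```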
&lt;br /&gt;
To facilitate invalidation and replacement of modified code blocks, each 4K page is compiled separately.  If a sequence of instructions crosses a 4K page boundary, the current block is ended, and the next instructions will be compiled as a separate block.&lt;br /&gt;
&lt;br /&gt;
Branch instructions within a 4K page are compiled as branches directly to the target address.  Branches which cross a 4K page go through an indirect address lookup.&lt;br /&gt;
&lt;br /&gt;
Compiled code blocks are invalidated on write.  On an actual MIPS CPU, the instruction cache is invalidated using the CACHE instruction; however, a few N64 games clear the cache using other methods.  Trapping writes appears to be the most reliable method of ensuring cache coherency in the emulated system.  The CACHE instruction is ignored.&lt;br /&gt;
&lt;br /&gt;
==Problems with the original design==&lt;br /&gt;
&lt;br /&gt;
The most significant performance problem with this design is its excessive memory usage.  The decoded instruction data is retained (132 bytes for each MIPS instruction) and occasionally referenced during execution.  Memory accesses frequently miss the L2 cache, resulting in poor performance.&lt;br /&gt;
&lt;br /&gt;
Additionally, the register cache is relatively inefficient, since all registers are flushed before any read or write operation.&lt;br /&gt;
&lt;br /&gt;
==A new approach==&lt;br /&gt;
&lt;br /&gt;
To reduce memory usage, the new design allocates a single large block of memory (currently 16 MiB) which is used for recompiled code.  This memory is allocated using mmap with the PROT_EXEC bit set, to ensure operation on CPUs with no-execute (NX) page permissions.&lt;br /&gt;
&lt;br /&gt;
Contiguous blocks of MIPS instructions are compiled; that is, the recompiler does not attempt to follow branches or trace 'hot paths'.&lt;br /&gt;
&lt;br /&gt;
The recompiler consists of eight stages, plus a linker, and a memory manager.&lt;br /&gt;
&lt;br /&gt;
Compiled blocks are invalidated on write, as before; however, because compiled code may cross 4K page boundaries, a write may also invalidate blocks in adjacent pages.&lt;br /&gt;
&lt;br /&gt;
Currently the dynarec generates x86, x86-64, ARMv5, and ARMv7 (little-endian).  Most of the code is shared between the architectures, but a different code generator is included at compile time, using different #include statements depending on the CPU type.&lt;br /&gt;
&lt;br /&gt;
==Pass 1: Disassembly==&lt;br /&gt;
&lt;br /&gt;
When an instruction address is encountered which has not been compiled, the function new_recompile_block is called, with the (virtual) address of the target instruction as its sole parameter.  If the address is invalid, the function returns a nonzero value and the caller is responsible for handling the pagefault.&lt;br /&gt;
&lt;br /&gt;
Instructions are decoded until an unconditional jump, usually a return, is encountered.  Disassembly ordinarily continues past a JAL (subroutine call) instruction; however, this strategy is abandoned if invalid instructions are encountered.&lt;br /&gt;
&lt;br /&gt;
Up to 4096 instructions (16K of MIPS code) may be disassembled at once.  Surprisingly, some games do actually reach this limit.&lt;br /&gt;
&lt;br /&gt;
==Pass 2: Liveness analysis==&lt;br /&gt;
&lt;br /&gt;
After disassembly, liveness analysis is performed on the registers.  This determines when a particular register will no longer be used, and thus can be removed from the register cache.&lt;br /&gt;
&lt;br /&gt;
A separate analysis is done on the upper 32 bits of 64-bit registers.  This can determine when only the lower 32 bits of a register are significant, thus allowing use of a 32-bit register.  This enables more efficient code generation on 32-bit processors.&lt;br /&gt;
&lt;br /&gt;
==Pass 3: Register allocation==&lt;br /&gt;
&lt;br /&gt;
The 31 MIPS registers must be mapped onto seven available registers on x86, or twelve available registers on ARM.&lt;br /&gt;
&lt;br /&gt;
Instructions with 64-bit results require two registers on 32-bit host processors.  32-bit instructions require only one.  A flag is set for registers containing 32-bit values, and these registers will be sign-extended before they are written to the register file.&lt;br /&gt;
&lt;br /&gt;
Registers are made available using a least-soon-needed replacement policy.  When the register cache is full, and no registers can be eliminated using the liveness analysis, a ten-instruction lookahead is used to determine which registers will not be needed soon and these registers are evicted from the cache.&lt;br /&gt;
&lt;br /&gt;
==Pass 4: Free unused registers==&lt;br /&gt;
&lt;br /&gt;
After the initial register allocation, the allocations are reviewed to determine if any registers remain allocated longer than necessary.  These are then removed from the register cache.  This avoids having to write back a large number of registers just before a branch, and makes more registers available for the next pass.&lt;br /&gt;
&lt;br /&gt;
==Pass 5: Pre-allocate registers==&lt;br /&gt;
&lt;br /&gt;
If a register will be used soon and needs to be loaded, try to load it early if the register is available.  This improves execution on CPUs with a load-use penalty.&lt;br /&gt;
&lt;br /&gt;
==Pass 6: Optimize clean/dirty state==&lt;br /&gt;
&lt;br /&gt;
If a cached register is 'dirty' and needs to be written out, try to do so as soon as the register will no longer be modified.  This avoids having to write the same register on multiple code paths due to conditional branches.  Additionally, try to avoid writing out dirty registers inside of loops.&lt;br /&gt;
&lt;br /&gt;
==Pass 7: Identify where registers are assumed to be 32-bit==&lt;br /&gt;
&lt;br /&gt;
When a 64-bit register is mapped to a 32-bit register with the assumption that the value will be sign-extended before being used, it is necessary to ensure that no register contains a 64-bit value when branching to such a location.  These instructions are flagged to identify them as requiring 32-bit inputs.  This information is used by the linker, and the exception return (ERET) handler.&lt;br /&gt;
&lt;br /&gt;
==Pass 8: Assembly==&lt;br /&gt;
&lt;br /&gt;
This generates and outputs the recompiled code.&lt;br /&gt;
&lt;br /&gt;
Following the main code block, handlers for certain exceptions as well as alternate entry points are added.&lt;br /&gt;
&lt;br /&gt;
If a recompiled instruction relies on a certain MIPS register being cached in a certain native register, then a short 'stub' of code is generated to load the necessary registers.  When an instruction outside of this block needs to jump to that location, it will instead jump to the stub.  The necessary registers will be loaded, and then it will jump into the main code sequence.&lt;br /&gt;
&lt;br /&gt;
On architectures which require literal pools (ARMv5) these are inserted as necessary.&lt;br /&gt;
&lt;br /&gt;
==Linker==&lt;br /&gt;
&lt;br /&gt;
The linker fills in all unresolved branches.  Branches within the block are linked to their target address.  Branches which jump outside of the block are linked to their target if that address has been compiled already. These inter-block branches are recorded in the jump_out array.  This information will be used to remove the links in the event that the target of the branch is invalidated.&lt;br /&gt;
&lt;br /&gt;
Unresolved branches point to a stub which loads the address of the branch instruction and the virtual address of its target into registers, and calls the dynamic linker.  When this code is executed, the dynamic linker will compile the target if necessary, and then patch the branch instruction with the new address.&lt;br /&gt;
&lt;br /&gt;
==Memory manager==&lt;br /&gt;
&lt;br /&gt;
The last step in new_recompile_block is to ensure that there will be sufficient memory available to compile the next block.  If there is not, then the oldest blocks are purged.&lt;br /&gt;
&lt;br /&gt;
The dynarec cache can be described as a 16MB circular buffer divided into eight segments.  Memory is allocated in order from beginning to end.&lt;br /&gt;
&lt;br /&gt;
When there are less than 2 empty segments, the next segment in sequence is cleared, wrapping around to the beginning of the buffer.  This continues as memory is needed, wrapping around from end to beginning.&lt;br /&gt;
&lt;br /&gt;
==Invalidation and restoration==&lt;br /&gt;
&lt;br /&gt;
Normally, code blocks are looked up via a hash table, using the virtual address of the target instruction.  However, for purposes of invalidation, blocks are grouped by physical address.  This can be described as a virtually-indexed, physically-tagged (VIPT) cache.&lt;br /&gt;
&lt;br /&gt;
References to compiled blocks are stored in one of 4096 linked lists in the jump_in array.  Each list covers a 4K block of memory, and 2048 such lists are sufficient to cover the 8MB of RAM in the Nintendo 64.  The remaining lists are for code in ROM, and the bootloader in SP memory.&lt;br /&gt;
&lt;br /&gt;
When a write hits a memory page marked as cached, all entries in the corresponding list are invalidated.  If any code is found to cross a 4K boundary, the adjacent lists are invalidated also.&lt;br /&gt;
&lt;br /&gt;
Sometimes blocks may be invalidated even when none of the code is actually modified.  This can happen if data is written to memory in the same 4K page, or if code is reloaded without actually modifying it.  If blocks which were previously invalidated are subsequently found to be unmodified, those blocks are marked in the restore_candidate array.  If the block remains unmodified, it will be restored as a valid block, to avoid recompiling blocks which do not need to be recompiled.  This is performed by the clean_blocks function which is called periodically.&lt;br /&gt;
&lt;br /&gt;
==Dynamic linker==&lt;br /&gt;
&lt;br /&gt;
Branches with unresolved addresses jump to the dynamic linker.  This will look through the jump_in list corresponding to the physical page containing the virtual target address.  If found, the branch instruction will be patched with the address, and then it will jump to this address.&lt;br /&gt;
&lt;br /&gt;
If not found, the jump_dirty list will be searched for blocks which were previously compiled but may have been modified.  If a potential match is found, the code will be compared against a cached copy to determine if any changes have been made.  If not, then it will jump to the block.  Because the memory could be modified again, branch instructions referencing these blocks are not altered, and continue to point to the dynamic linker.  These blocks will continue to be verified each time they are called, until restored to the jump_in list by the clean_blocks function described above.&lt;br /&gt;
&lt;br /&gt;
If no compiled block is found, or the existing block was modified, the target is recompiled.&lt;br /&gt;
&lt;br /&gt;
==Address lookup==&lt;br /&gt;
&lt;br /&gt;
When a JR (jump register) instruction is encountered, the address of the recompiled code must be looked up using the address of the MIPS code.  The majority of such instructions jump to the link register (r31) to return to the location following a JAL (call) instruction.&lt;br /&gt;
&lt;br /&gt;
When a JAL or JALR is executed, the address of the following instruction is inserted into a small 32-entry hash table, which is checked when a JR r31 instruction is executed.  This allows for a quick return from subroutine calls.&lt;br /&gt;
&lt;br /&gt;
If the JR instruction uses a register other than r31, or the small hash table lookup fails to find a match, a larger 131072-entry hash table is checked.  This table contains 65536 bins with up to 2 addresses per bin.  If this also fails to find a match (which occurs less than 1% of the time), an exhaustive search of all compiled addresses within that 4K memory page is performed.&lt;br /&gt;
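&lt;br /&gt;
The two-level lookup can be sketched as follows.  This is an illustrative Python model: the hash function, table layout, and tuple format are assumptions, not the recompiler's actual ones.&lt;br /&gt;

```python
RET_SIZE = 32
BINS = 65536

return_cache = [None] * RET_SIZE                 # (mips_addr, native_addr)
hash_table = [[None, None] for _ in range(BINS)] # two entries per bin

def hash_addr(addr):
    # Illustrative hash over the MIPS address.
    return (addr ^ (addr >> 16)) & (BINS - 1)

def insert(mips_addr, native_addr):
    b = hash_table[hash_addr(mips_addr)]
    b[1] = b[0]                  # keep the two most recent entries per bin
    b[0] = (mips_addr, native_addr)

def lookup(mips_addr):
    # Small return cache first (JR r31 case), then the large table.
    e = return_cache[mips_addr % RET_SIZE]
    if e and e[0] == mips_addr:
        return e[1]
    for e in hash_table[hash_addr(mips_addr)]:
        if e and e[0] == mips_addr:
            return e[1]
    return None                  # would fall back to the exhaustive search
```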
&lt;br /&gt;
If no match is found by any of these methods, the target address is compiled, and the new address is inserted into the hash table.&lt;br /&gt;
&lt;br /&gt;
==Cycle counting==&lt;br /&gt;
&lt;br /&gt;
Cycles are counted before each branch by adding the cycles from the preceding instructions to a dedicated register: R10 on ARM, ESI on x86.  The value in this register is normally negative.  When it becomes non-negative, a conditional branch is taken which jumps to an interrupt handler.&lt;br /&gt;
&lt;br /&gt;
For example, the following x86 code adds eight cycles:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add $8,%esi&lt;br /&gt;
jns interrupt_handler&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The conditional branch jumps to a small bit of code located after the main compiled block, which saves the cached registers to the register file, sets the instruction pointer which will be used upon return from the interrupt, and then calls cc_interrupt.&lt;br /&gt;
&lt;br /&gt;
As in the original mupen64plus, the emulated clock runs at 37.5 MHz, and each instruction takes 2 clock cycles.&lt;br /&gt;
&lt;br /&gt;
==Delay slots==&lt;br /&gt;
&lt;br /&gt;
MIPS has 'delay slots', where the instruction after the branch is executed before the branch is taken.  Instructions in delay slots are issued out-of-order in the recompiled code.&lt;br /&gt;
&lt;br /&gt;
[[Image:Recompiler delay slot reordering.png]]&lt;br /&gt;
&lt;br /&gt;
When a branch jumps into the delay slot of another branch, this case must be handled slightly differently:&lt;br /&gt;
&lt;br /&gt;
[[Image:Recompiler delay slot reordering 2.png]]&lt;br /&gt;
&lt;br /&gt;
The branch test and delay slot are executed in-order if a dependency exists, or for 'likely' branches where the delay slot is nullified when the branch condition is false.  These cases are infrequent (typically less than 10% of branches).&lt;br /&gt;
&lt;br /&gt;
==Constant propagation==&lt;br /&gt;
&lt;br /&gt;
When an instruction loads a constant into a register, the register is tagged as a constant.  The constant tag will be retained if subsequent instructions modify the constant using other constants.  During assembly, such a sequence of instructions is combined into a single load.  For example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
LUI r8,12340000  --&amp;gt;  mov $0x12345678,%eax&lt;br /&gt;
ORI r8,r8,5678&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This optimization is not performed where a branch target intervenes, e.g.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
...&lt;br /&gt;
 LUI r8,12340000  --&amp;gt;  mov $0x12340000,%eax&lt;br /&gt;
L1:&lt;br /&gt;
 ORI r8,r8,5678   --&amp;gt;  or $0x5678,%eax&lt;br /&gt;
 ...&lt;br /&gt;
 BEQ r0,r0,L1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Registers containing constants are identified by bits in the isconst and wasconst fields of the regstat structure.  The wasconst bit is set for a register if the register contained a known constant before the instruction, and the isconst bit is set for a register if the register will contain a known constant after the instruction executes.&lt;br /&gt;
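&lt;br /&gt;
The tagging can be sketched with one bit per MIPS register, as described above.  This is a hypothetical Python model of the propagation step; the instruction-tuple format and the handling of only LUI/ORI are illustrative.&lt;br /&gt;

```python
def propagate(instrs):
    # instrs: list of (op, rd, rs, imm) tuples.
    # Returns per-instruction (wasconst, isconst) bitmasks plus the
    # known constant values at the end of the sequence.
    isconst, values, out = 0, {}, []
    for op, rd, rs, imm in instrs:
        wasconst = isconst           # state before this instruction
        if op == "LUI":
            values[rd] = (imm << 16) & 0xFFFFFFFF
            isconst |= 1 << rd
        elif op == "ORI" and (isconst >> rs) & 1:
            values[rd] = values[rs] | imm
            isconst |= 1 << rd
        else:                        # result no longer a known constant
            isconst &= ~(1 << rd)
            values.pop(rd, None)
        out.append((wasconst, isconst))
    return out, values
```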
&lt;br /&gt;
==Translation lookaside buffer emulation==&lt;br /&gt;
&lt;br /&gt;
Most Nintendo 64 games do not use virtual memory, but some do.  At startup, main memory is directly mapped at addresses from 0x80000000 to 0x803FFFFF, or up to 0x807FFFFF if the memory expansion is used.&lt;br /&gt;
&lt;br /&gt;
Normally, read or write operations are checked against this range, and if outside this range, control is passed to the appropriate I/O handler.  This is done as follows:&lt;br /&gt;
&lt;br /&gt;
MIPS instruction: LW r1,8(r2)&lt;br /&gt;
&lt;br /&gt;
ARM code:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add r1, r2, #8&lt;br /&gt;
cmp r1, #8388608&lt;br /&gt;
bvc handler&lt;br /&gt;
ldr r1, [r1]&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If there are valid entries in the TLB, this would instead be compiled as follows:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add r1, r2, #8&lt;br /&gt;
mov r0, #264&lt;br /&gt;
add r0, r0, r1, lsr #12&lt;br /&gt;
ldr r0, [r11, r0, lsl #2]&lt;br /&gt;
tst r0, r0&lt;br /&gt;
bmi handler&lt;br /&gt;
ldr r1, [r1, r0, lsl #2]&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This looks up the offset in the memory_map table (which, in this example, is located at r11+264*4).  The high bit is tested to determine whether a valid mapping exists for this page.&lt;br /&gt;
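&lt;br /&gt;
The same lookup can be rendered in Python as a sketch: memory_map holds, per 4K page, a word offset whose high bit flags an unmapped page.  The dictionary-based memory and the example mapping value are illustrative, not the emulator's actual layout.&lt;br /&gt;

```python
INVALID = 1 << 31   # high bit set: no valid mapping for this page

def read_word(memory_map, ram, vaddr):
    entry = memory_map[vaddr >> 12]
    if entry & INVALID:
        raise MemoryError("unmapped page")   # would branch to the handler
    # (entry << 2) is a byte offset added to the virtual address,
    # matching the 'ldr r1, [r1, r0, lsl #2]' in the ARM code above.
    return ram[(vaddr + (entry << 2)) & 0xFFFFFFFF]
```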
&lt;br /&gt;
==Page fault emulation==&lt;br /&gt;
&lt;br /&gt;
If a memory address references an invalid page, a conditional branch is taken to a bit of code placed after the main block.  This will save any modified cached registers, and call an appropriate handler function for the address.  The handler will either perform I/O, or generate an exception (pagefault).&lt;br /&gt;
&lt;br /&gt;
==Mapping executable pages==&lt;br /&gt;
&lt;br /&gt;
If the dynamic recompiler encounters code which is not in contiguous physical memory, it will end the block at the page boundary, so that the block can be removed cleanly if the mapping is changed.&lt;br /&gt;
&lt;br /&gt;
There is one exception to this, where a branch and delay slot span two pages.  verify_dirty and verify_mapping check for this special case, and ensure that the TLB mapping remains the same before executing any such page-spanning block.&lt;br /&gt;
&lt;br /&gt;
==Future improvements==&lt;br /&gt;
&lt;br /&gt;
===Copy propagation/offset propagation===&lt;br /&gt;
&lt;br /&gt;
A common instruction sequence is of the form:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
LUI r9,12340000&lt;br /&gt;
ADD r9,r9,r8&lt;br /&gt;
LW r9,5678(r9)&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
It would be helpful to recognize this as a load from r8+12345678.  The current constant propagation code does not do so.&lt;br /&gt;
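&lt;br /&gt;
A hypothetical sketch of the proposed folding: combine the LUI upper half and the LW offset into a single base-plus-constant access.  The function name and tuple result are illustrative only.&lt;br /&gt;

```python
def fold_load(upper, base_reg, offset):
    # LUI rt,upper ; ADD rt,rt,base ; LW rd,offset(rt)
    # behaves like a single load from base_reg + combined constant.
    combined = ((upper << 16) + offset) & 0xFFFFFFFF
    return (base_reg, combined)
```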
&lt;br /&gt;
===Unaligned memory access===&lt;br /&gt;
&lt;br /&gt;
A small improvement could be made by combining adjacent LWL/LWR instructions.  The potential gain from doing so is very limited because these instructions typically represent less than 1% of all memory accesses.&lt;br /&gt;
&lt;br /&gt;
===SLT/branch merging===&lt;br /&gt;
&lt;br /&gt;
A frequent occurrence in MIPS code is an SLT or SLTI instruction followed by a branch.  This is generated relatively inefficiently on x86 and ARM, first doing a compare and set, then testing this value and doing a conditional branch.  Doing only one comparison would save at least one instruction, and could potentially save up to three instructions if the liveness analysis reveals that the result of the SLT instruction is used nowhere else.&lt;br /&gt;
&lt;br /&gt;
While a potentially useful optimization, there are several problems with this approach.  First, there are often additional instructions between the SLT and the branch.  These must be checked to make sure they do not modify the registers involved in the comparison, as that would prevent reordering the instruction stream as desired.  Secondly, if the result of the SLT is found to be live, but unmodified, on both paths of the branch, clean_registers will normally write this value before the branch, to avoid duplicating the writeback code on both paths.  This optimization would have to be disabled if the SLT were combined with the branch.&lt;br /&gt;
&lt;br /&gt;
===Matching by virtual address===&lt;br /&gt;
&lt;br /&gt;
The clean_blocks function will only recognize blocks as unmodified if both the physical and virtual addresses are the same.  It would be more efficient if it could recognize unmodified blocks where the virtual address is the same even though the physical address may have changed.&lt;br /&gt;
&lt;br /&gt;
===x86-64===&lt;br /&gt;
&lt;br /&gt;
Currently the x86-64 backend generates only 32-bit instructions.  Proper 64-bit code generation would improve performance.&lt;br /&gt;
&lt;br /&gt;
===PowerPC===&lt;br /&gt;
&lt;br /&gt;
It would be possible to add a PowerPC code generator to the dynamic recompiler.  Currently no one is working on this.  (The mupen64gc project is using a different codebase.)&lt;br /&gt;
&lt;br /&gt;
The following is a summary of the changes which would be necessary to add a PowerPC backend.&lt;br /&gt;
&lt;br /&gt;
The slt* instructions use conditional moves, which are unavailable on PowerPC.  A suitable alternative (such as moves from the condition register) would need to be used.&lt;br /&gt;
&lt;br /&gt;
The assembler can generate as much as 256K of code in a single block; however, conditional branches on PowerPC have a 16-bit displacement, limiting their range to ±32K.  It will be necessary to either restrict the block size, or insert jump tables in a manner similar to the literal pools on ARM.&lt;br /&gt;
&lt;br /&gt;
PowerPC generally relies on early branch resolution rather than statistical branch prediction.  Scheduling branch condition tests earlier may be advantageous.  (For example, the address bounds check could be done in address_generation, rather than just before the load or store.  Similarly it may be advantageous to test the branch condition and update the cycle count before executing the delay slot.)&lt;br /&gt;
&lt;br /&gt;
===MIPS===&lt;br /&gt;
&lt;br /&gt;
Recompiling MIPS into MIPS would be relatively straightforward; however, the current code generator has no facility for filling delay slots.  This capability would be required for efficient code generation.&lt;br /&gt;
&lt;br /&gt;
[[Category:Development]]&lt;/div&gt;</summary>
		<author><name>Ari64</name></author>
		
	</entry>
	<entry>
		<id>https://pandorawiki.org/index.php?title=Mupen64plus_dynamic_recompiler&amp;diff=2580</id>
		<title>Mupen64plus dynamic recompiler</title>
		<link rel="alternate" type="text/html" href="https://pandorawiki.org/index.php?title=Mupen64plus_dynamic_recompiler&amp;diff=2580"/>
		<updated>2010-06-28T00:52:28Z</updated>

		<summary type="html">&lt;p&gt;Ari64: Description of the dynamic recompiler&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes the dynamic recompiler in Mupen64plus, and the changes made for the ARM port.&lt;br /&gt;
&lt;br /&gt;
==The original dynamic recompiler by Hacktarux==&lt;br /&gt;
&lt;br /&gt;
The dynamic recompiler used in Mupen64plus v1.5 is based on the original written by Hacktarux in 2002.&lt;br /&gt;
&lt;br /&gt;
It recompiles contiguous blocks of MIPS instructions.  First, each instruction is decoded into a dynamically-allocated 132-byte data structure.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
typedef struct _precomp_instr&lt;br /&gt;
{&lt;br /&gt;
   void (*ops)();&lt;br /&gt;
   union&lt;br /&gt;
     {&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         long long int *rs;&lt;br /&gt;
         long long int *rt;&lt;br /&gt;
         short immediate;&lt;br /&gt;
      } i;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned int inst_index;&lt;br /&gt;
      } j;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         long long int *rs;&lt;br /&gt;
         long long int *rt;&lt;br /&gt;
         long long int *rd;&lt;br /&gt;
         unsigned char sa;&lt;br /&gt;
         unsigned char nrd;&lt;br /&gt;
      } r;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned char base;&lt;br /&gt;
         unsigned char ft;&lt;br /&gt;
         short offset;&lt;br /&gt;
      } lf;&lt;br /&gt;
    struct&lt;br /&gt;
      {&lt;br /&gt;
         unsigned char ft;&lt;br /&gt;
         unsigned char fs;&lt;br /&gt;
         unsigned char fd;&lt;br /&gt;
      } cf;&lt;br /&gt;
     } f;&lt;br /&gt;
   unsigned int addr; /* word-aligned instruction address in r4300 address space */&lt;br /&gt;
   unsigned int local_addr; /* byte offset to start of corresponding x86_64 instructions, from start of code block */&lt;br /&gt;
   reg_cache_struct reg_cache_infos;&lt;br /&gt;
} precomp_instr;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The decoded instructions are then compiled, generating x86 instructions for each MIPS instruction.  A 32K block is allocated with malloc() to hold the x86 code.  If this size proves insufficient, it is incrementally resized with realloc().&lt;br /&gt;
&lt;br /&gt;
MIPS registers are allocated to x86 registers with a least-recently-used replacement policy.  All cached registers are written back prior to a branch, or a memory read or write.&lt;br /&gt;
&lt;br /&gt;
To facilitate invalidation and replacement of modified code blocks, each 4K page is compiled separately.  If a sequence of instructions crosses a 4K page boundary, the current block is ended, and the next instructions will be compiled as a separate block.&lt;br /&gt;
&lt;br /&gt;
Branch instructions within a 4K page are compiled as branches directly to the target address.  Branches which cross a 4K page go through an indirect address lookup.&lt;br /&gt;
&lt;br /&gt;
Compiled code blocks are invalidated on write.  On an actual MIPS CPU, the instruction cache is invalidated using the CACHE instruction; however, a few N64 games clear the cache using other methods.  Trapping writes appears to be the most reliable method of ensuring cache coherency in the emulated system, so the CACHE instruction is ignored.&lt;br /&gt;
&lt;br /&gt;
==Problems with the original design==&lt;br /&gt;
&lt;br /&gt;
The most significant performance problem with this design is its excessive memory usage.  The decoded instruction data is retained (132 bytes for each MIPS instruction) and occasionally referenced during execution.  Memory accesses frequently miss the L2 cache, resulting in poor performance.&lt;br /&gt;
&lt;br /&gt;
Additionally, the register cache is relatively inefficient, since all registers are flushed before any read or write operation.&lt;br /&gt;
&lt;br /&gt;
==A new approach==&lt;br /&gt;
&lt;br /&gt;
To reduce memory usage, the new design allocates a single large block of memory (currently 16 MiB) which is used for recompiled code.  This memory is allocated using mmap with the PROT_EXEC bit set, to ensure operation on CPUs with no-execute (NX) page permissions.&lt;br /&gt;
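&lt;br /&gt;
The allocation can be illustrated in Python, which exposes the same POSIX PROT_* flags the C code would pass to mmap.  This is a sketch assuming a Unix system that permits writable-and-executable anonymous mappings; hardened kernels may refuse the combination.&lt;br /&gt;

```python
import mmap

CACHE_SIZE = 16 * 1024 * 1024   # 16 MiB, as in this revision

# Anonymous mapping with read, write, and execute permission, so the
# recompiled code can be run even on CPUs that enforce NX pages.
cache = mmap.mmap(-1, CACHE_SIZE,
                  prot=mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC)
```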
&lt;br /&gt;
Contiguous blocks of MIPS instructions are compiled; the recompiler does not attempt to follow branches or trace 'hot paths'.&lt;br /&gt;
&lt;br /&gt;
The recompiler consists of eight stages, plus a linker, and a memory manager.&lt;br /&gt;
&lt;br /&gt;
Compiled blocks are invalidated on write, as before; however, since a compiled block may cross a 4K page boundary, a write may invalidate adjacent pages as well.&lt;br /&gt;
&lt;br /&gt;
Currently the dynarec generates x86, x86-64, ARMv5, and ARMv7 (little-endian) code.  Most of the code is shared between the architectures, but a different code generator is included at compile time, using different #include statements depending on the CPU type.&lt;br /&gt;
&lt;br /&gt;
==Pass 1: Disassembly==&lt;br /&gt;
&lt;br /&gt;
When an instruction address is encountered which has not been compiled, the function new_recompile_block is called, with the (virtual) address of the target instruction as its sole parameter.  If the address is invalid, the function returns a nonzero value and the caller is responsible for handling the pagefault.&lt;br /&gt;
&lt;br /&gt;
Instructions are decoded until an unconditional jump, usually a return, is encountered.  Disassembly is ordinarily continued past a JAL (subroutine call) instruction; however, this strategy is abandoned if invalid instructions are encountered.&lt;br /&gt;
&lt;br /&gt;
Up to 4096 instructions (16K of MIPS code) may be disassembled at once.  Surprisingly, some games do actually reach this limit.&lt;br /&gt;
&lt;br /&gt;
==Pass 2: Liveness analysis==&lt;br /&gt;
&lt;br /&gt;
After disassembly, liveness analysis is performed on the registers.  This determines when a particular register will no longer be used, and thus can be removed from the register cache.&lt;br /&gt;
&lt;br /&gt;
A separate analysis is done on the upper 32 bits of 64-bit registers.  This can determine when only the lower 32 bits of a register are significant, thus allowing use of a 32-bit register.  This enables more efficient code generation on 32-bit processors.&lt;br /&gt;
&lt;br /&gt;
==Pass 3: Register allocation==&lt;br /&gt;
&lt;br /&gt;
The 31 MIPS registers must be mapped onto seven available registers on x86, or twelve available registers on ARM.&lt;br /&gt;
&lt;br /&gt;
Instructions with 64-bit results require two registers on 32-bit host processors.  32-bit instructions require only one.  A flag is set for registers containing 32-bit values, and these registers will be sign-extended before they are written to the register file.&lt;br /&gt;
&lt;br /&gt;
Registers are made available using a least-soon-needed replacement policy.  When the register cache is full, and no registers can be eliminated using the liveness analysis, a ten-instruction lookahead is used to determine which registers will not be needed soon and these registers are evicted from the cache.&lt;br /&gt;
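&lt;br /&gt;
The least-soon-needed policy can be sketched as follows: among the cached registers, evict the one whose next use within the ten-instruction lookahead is farthest away.  This Python model is illustrative; the function name and the set-based use lists are assumptions.&lt;br /&gt;

```python
LOOKAHEAD = 10

def pick_victim(cached, uses):
    # cached: MIPS registers currently held in host registers.
    # uses: uses[i] is the set of MIPS registers touched by instruction i.
    def next_use(reg):
        for i, regs in enumerate(uses[:LOOKAHEAD]):
            if reg in regs:
                return i
        return LOOKAHEAD         # not needed within the lookahead window
    # Evict the register whose next use is farthest in the future.
    return max(cached, key=next_use)
```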
&lt;br /&gt;
==Pass 4: Free unused registers==&lt;br /&gt;
&lt;br /&gt;
After the initial register allocation, the allocations are reviewed to determine if any registers remain allocated longer than necessary.  These are then removed from the register cache.  This avoids having to write back a large number of registers just before a branch, and makes more registers available for the next pass.&lt;br /&gt;
&lt;br /&gt;
==Pass 5: Pre-allocate registers==&lt;br /&gt;
&lt;br /&gt;
If a register will be used soon and needs to be loaded, try to load it early if the register is available.  This improves execution on CPUs with a load-use penalty.&lt;br /&gt;
&lt;br /&gt;
==Pass 6: Optimize clean/dirty state==&lt;br /&gt;
&lt;br /&gt;
If a cached register is 'dirty' and needs to be written out, try to do so as soon as the register will no longer be modified.  This avoids having to write the same register on multiple code paths due to conditional branches.  Additionally, try to avoid writing out dirty registers inside of loops.&lt;br /&gt;
&lt;br /&gt;
==Pass 7: Identify where registers are assumed to be 32-bit==&lt;br /&gt;
&lt;br /&gt;
When a 64-bit register is mapped to a 32-bit register with the assumption that the value will be sign-extended before being used, it is necessary to ensure that no register contains a 64-bit value when branching to such a location.  These instructions are flagged to identify them as requiring 32-bit inputs.  This information is used by the linker, and the exception return (ERET) handler.&lt;br /&gt;
&lt;br /&gt;
==Pass 8: Assembly==&lt;br /&gt;
&lt;br /&gt;
This generates and outputs the recompiled code.&lt;br /&gt;
&lt;br /&gt;
Following the main code block, handlers for certain exceptions as well as alternate entry points are added.&lt;br /&gt;
&lt;br /&gt;
If a recompiled instruction relies on a certain MIPS register being cached in a certain native register, then a short 'stub' of code is generated to load the necessary registers.  When an instruction outside of this block needs to jump to that location, it will instead jump to the stub.  The necessary registers will be loaded, and then it will jump into the main code sequence.&lt;br /&gt;
&lt;br /&gt;
On architectures which require literal pools (ARMv5) these are inserted as necessary.&lt;br /&gt;
&lt;br /&gt;
==Linker==&lt;br /&gt;
&lt;br /&gt;
The linker fills in all unresolved branches.  Branches within the block are linked to their target address.  Branches which jump outside of the block are linked to their target if that address has been compiled already. These inter-block branches are recorded in the jump_out array.  This information will be used to remove the links in the event that the target of the branch is invalidated.&lt;br /&gt;
&lt;br /&gt;
Unresolved branches point to a stub which loads the address of the branch instruction and the virtual address of its target into registers, and calls the dynamic linker.  When this code is executed, the dynamic linker will compile the target if necessary, and then patch the branch instruction with the new address.&lt;br /&gt;
&lt;br /&gt;
==Memory manager==&lt;br /&gt;
&lt;br /&gt;
The last step in new_recompile_block is to ensure that there will be sufficient memory available to compile the next block.  If there is not, then the oldest blocks are purged.&lt;br /&gt;
&lt;br /&gt;
The dynarec cache can be described as a 16MB circular buffer divided into eight segments.  Memory is allocated in order from beginning to end.&lt;br /&gt;
&lt;br /&gt;
When there are fewer than two empty segments, the next segment in sequence is cleared, wrapping around from the end of the buffer to the beginning as memory is needed.&lt;br /&gt;
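&lt;br /&gt;
The purge policy amounts to the following sketch, assuming eight segments and a "fewer than two empty" threshold as described.  The flag-list representation is illustrative.&lt;br /&gt;

```python
SEGMENTS = 8

def ensure_space(empty, next_purge):
    # empty: per-segment emptiness flags; next_purge: oldest segment index.
    # Purge oldest segments until at least two are empty, wrapping around.
    while sum(empty) < 2:
        empty[next_purge] = True        # clear all blocks in this segment
        next_purge = (next_purge + 1) % SEGMENTS
    return next_purge
```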
&lt;br /&gt;
==Invalidation and restoration==&lt;br /&gt;
&lt;br /&gt;
Normally, code blocks are looked up via a hash table, using the virtual address of the target instruction.  However, for purposes of invalidation, blocks are grouped by physical address.  This can be described as a virtually-indexed, physically-tagged (VIPT) cache.&lt;br /&gt;
&lt;br /&gt;
References to compiled blocks are stored in one of 4096 linked lists in the jump_in array.  Each list covers a 4K block of memory, and 2048 such lists are sufficient to cover the 8MB of RAM in the Nintendo 64.  The remaining lists are for code in ROM, and the bootloader in SP memory.&lt;br /&gt;
&lt;br /&gt;
When a write hits a memory page marked as cached, all entries in the corresponding list are invalidated.  If any code is found to cross a 4K boundary, the adjacent lists are invalidated also.&lt;br /&gt;
&lt;br /&gt;
Sometimes blocks may be invalidated even when none of the code is actually modified.  This can happen if data is written to memory in the same 4K page, or if code is reloaded without actually modifying it.  If blocks which were previously invalidated are subsequently found to be unmodified, those blocks are marked in the restore_candidate array.  If the block remains unmodified, it will be restored as a valid block, to avoid recompiling blocks which do not need to be recompiled.  This is performed by the clean_blocks function which is called periodically.&lt;br /&gt;
&lt;br /&gt;
==Dynamic linker==&lt;br /&gt;
&lt;br /&gt;
Branches with unresolved addresses jump to the dynamic linker.  This will look through the jump_in list corresponding to the physical page containing the virtual target address.  If found, the branch instruction will be patched with the address, and then it will jump to this address.&lt;br /&gt;
&lt;br /&gt;
If not found, the jump_dirty list will be searched for blocks which were previously compiled but may have been modified.  If a potential match is found, the code will be compared against a cached copy to determine if any changes have been made.  If not, then it will jump to the block.  Because the memory could be modified again, branch instructions referencing these blocks are not altered, and continue to point to the dynamic linker.  These blocks will continue to be verified each time they are called, until restored to the jump_in list by the clean_blocks function described above.&lt;br /&gt;
&lt;br /&gt;
If no compiled block is found, or the existing block was modified, the target is recompiled.&lt;br /&gt;
&lt;br /&gt;
==Address lookup==&lt;br /&gt;
&lt;br /&gt;
When a JR (jump register) instruction is encountered, the address of the recompiled code must be looked up using the address of the MIPS code.  The majority of such instructions jump to the link register (r31) to return to the location following a JAL (call) instruction.&lt;br /&gt;
&lt;br /&gt;
When a JAL or JALR is executed, the address of the following instruction is inserted into a small 32-entry hash table, which is checked when a JR r31 instruction is executed.  This allows for a quick return from subroutine calls.&lt;br /&gt;
&lt;br /&gt;
If the JR instruction uses a register other than r31, or the small hash table lookup fails to find a match, a larger 131072-entry hash table is checked.  This table contains 65536 bins with up to 2 addresses per bin.  If this also fails to find a match (which occurs less than 1% of the time), an exhaustive search of all compiled addresses within that 4K memory page is performed.&lt;br /&gt;
&lt;br /&gt;
If no match is found by any of these methods, the target address is compiled, and the new address is inserted into the hash table.&lt;br /&gt;
&lt;br /&gt;
==Cycle counting==&lt;br /&gt;
&lt;br /&gt;
Cycles are counted before each branch by adding the cycles from the preceding instructions to a dedicated register: R10 on ARM, ESI on x86.  The value in this register is normally negative.  When it becomes non-negative, a conditional branch is taken which jumps to an interrupt handler.&lt;br /&gt;
&lt;br /&gt;
For example, the following x86 code adds eight cycles:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add $8,%esi&lt;br /&gt;
jns interrupt_handler&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The conditional branch jumps to a small bit of code located after the main compiled block, which saves the cached registers to the register file, sets the instruction pointer which will be used upon return from the interrupt, and then calls cc_interrupt.&lt;br /&gt;
&lt;br /&gt;
As in the original mupen64plus, the emulated clock runs at 37.5 MHz, and each instruction takes 2 clock cycles.&lt;br /&gt;
&lt;br /&gt;
==Delay slots==&lt;br /&gt;
&lt;br /&gt;
MIPS has 'delay slots', where the instruction after the branch is executed before the branch is taken.  Instructions in delay slots are issued out-of-order in the recompiled code.&lt;br /&gt;
&lt;br /&gt;
[[Image:Recompiler delay slot reordering.png]]&lt;br /&gt;
&lt;br /&gt;
When a branch jumps into the delay slot of another branch, this case must be handled slightly differently:&lt;br /&gt;
&lt;br /&gt;
[[Image:Recompiler delay slot reordering 2.png]]&lt;br /&gt;
&lt;br /&gt;
The branch test and delay slot are executed in-order if a dependency exists, or for 'likely' branches where the delay slot is nullified when the branch condition is false.  These cases are infrequent (typically less than 10% of branches).&lt;br /&gt;
&lt;br /&gt;
==Constant propagation==&lt;br /&gt;
&lt;br /&gt;
When an instruction loads a constant into a register, the register is tagged as a constant.  The constant tag will be retained if subsequent instructions modify the constant using other constants.  During assembly, such a sequence of instructions is combined into a single load.  For example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
LUI r8,12340000  --&amp;gt;  mov $0x12345678,%eax&lt;br /&gt;
ORI r8,r8,5678&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This optimization is not performed where a branch target intervenes, e.g.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
...&lt;br /&gt;
 LUI r8,12340000  --&amp;gt;  mov $0x12340000,%eax&lt;br /&gt;
L1:&lt;br /&gt;
 ORI r8,r8,5678   --&amp;gt;  or $0x5678,%eax&lt;br /&gt;
 ...&lt;br /&gt;
 BEQ r0,r0,L1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Registers containing constants are identified by bits in the isconst and wasconst fields of the regstat structure.  The wasconst bit is set for a register if the register contained a known constant before the instruction, and the isconst bit is set for a register if the register will contain a known constant after the instruction executes.&lt;br /&gt;
&lt;br /&gt;
==Translation lookaside buffer emulation==&lt;br /&gt;
&lt;br /&gt;
Most Nintendo 64 games do not use virtual memory, but some do.  At startup, main memory is directly mapped at addresses from 0x80000000 to 0x803FFFFF, or up to 0x807FFFFF if the memory expansion is used.&lt;br /&gt;
&lt;br /&gt;
Normally, read or write operations are checked against this range, and if outside this range, control is passed to the appropriate I/O handler.  This is done as follows:&lt;br /&gt;
&lt;br /&gt;
MIPS instruction: LW r1,8(r2)&lt;br /&gt;
&lt;br /&gt;
ARM code:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add r1, r2, #8&lt;br /&gt;
cmp r1, #8388608&lt;br /&gt;
bvc handler&lt;br /&gt;
ldr r1, [r1]&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If there are valid entries in the TLB, this would instead be compiled as follows:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
add r1, r2, #8&lt;br /&gt;
mov r0, #264&lt;br /&gt;
add r0, r0, r1, lsr #12&lt;br /&gt;
ldr r0, [r11, r0, lsl #2]&lt;br /&gt;
tst r0, r0&lt;br /&gt;
bmi handler&lt;br /&gt;
ldr r1, [r1, r0, lsl #2]&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This looks up the offset in the memory_map table (which, in this example, is located at r11+264*4).  The high bit is tested to determine whether a valid mapping exists for this page.&lt;br /&gt;
&lt;br /&gt;
==Page fault emulation==&lt;br /&gt;
&lt;br /&gt;
If a memory address references an invalid page, a conditional branch is taken to a bit of code placed after the main block.  This will save any modified cached registers, and call an appropriate handler function for the address.  The handler will either perform I/O, or generate an exception (pagefault).&lt;br /&gt;
&lt;br /&gt;
==Mapping executable pages==&lt;br /&gt;
&lt;br /&gt;
If the dynamic recompiler encounters code which is not in contiguous physical memory, it will end the block at the page boundary, so that the block can be removed cleanly if the mapping is changed.&lt;br /&gt;
&lt;br /&gt;
There is one exception to this, where a branch and delay slot span two pages.  verify_dirty and verify_mapping check for this special case, and ensure that the TLB mapping remains the same before executing any such page-spanning block.&lt;br /&gt;
&lt;br /&gt;
==Future improvements==&lt;br /&gt;
&lt;br /&gt;
===Copy propagation/offset propagation===&lt;br /&gt;
&lt;br /&gt;
A common instruction sequence is of the form:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
LUI r9,12340000&lt;br /&gt;
ADD r9,r9,r8&lt;br /&gt;
LW r9,5678(r9)&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
It would be helpful to recognize this as a load from r8+12345678.  The current constant propagation code does not do so.&lt;br /&gt;
&lt;br /&gt;
===Unaligned memory access===&lt;br /&gt;
&lt;br /&gt;
A small improvement could be made by combining adjacent LWL/LWR instructions.  The potential gain from doing so is very limited because these instructions typically represent less than 1% of all memory accesses.&lt;br /&gt;
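&lt;br /&gt;
For reference, a typical adjacent pair (using the big-endian N64 convention) that could be combined into a single unaligned 32-bit load is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
LWL r9,0(r8)&lt;br /&gt;
LWR r9,3(r8)&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;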
&lt;br /&gt;
===SLT/branch merging===&lt;br /&gt;
&lt;br /&gt;
A frequent occurrence in MIPS code is an SLT or SLTI instruction followed by a branch.  This is generated relatively inefficiently on x86 and ARM, first doing a compare and set, then testing this value and doing a conditional branch.  Doing only one comparison would save at least one instruction, and could potentially save up to three instructions if the liveness analysis reveals that the result of the SLT instruction is used nowhere else.&lt;br /&gt;
&lt;br /&gt;
While a potentially useful optimization, this approach has several problems.  First, there are often additional instructions between the slt and the branch.  These must be checked to make sure they do not modify the registers involved, as that would prevent reordering the instruction stream as desired.  Secondly, if the result of the slt is found to be live, but unmodified, on both paths of the branch, clean_registers will normally write this value before the branch, to avoid duplicating the writeback code on both paths.  This optimization would have to be removed if the slt were combined with the branch.&lt;br /&gt;
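&lt;br /&gt;
In the best case (no intervening instructions, and the slt result dead after the branch), a sequence such as&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
SLT r1,r2,r3&lt;br /&gt;
BNE r1,r0,target&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
could hypothetically become a single compare and conditional branch on ARM (cmp followed by blt), rather than a compare-and-set, a test, and a branch.&lt;br /&gt;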
&lt;br /&gt;
===Matching by virtual address===&lt;br /&gt;
&lt;br /&gt;
The clean_blocks function will only recognize blocks as unmodified if both the physical and virtual addresses are the same.  It would be more efficient if it could recognize unmodified blocks where the virtual address is the same even though the physical address may have changed.&lt;br /&gt;
&lt;br /&gt;
===x86-64===&lt;br /&gt;
&lt;br /&gt;
Currently the x86-64 backend generates only 32-bit code.  Proper 64-bit code generation would improve performance.&lt;br /&gt;
&lt;br /&gt;
===PowerPC===&lt;br /&gt;
&lt;br /&gt;
It would be possible to add a PowerPC code generator to the dynamic recompiler.  Currently no one is working on this.  (The mupen64gc project is using a different codebase.)&lt;br /&gt;
&lt;br /&gt;
The following is a summary of the changes which would be necessary to add a PowerPC backend.&lt;br /&gt;
&lt;br /&gt;
The slt* instructions use conditional moves, which are unavailable on PowerPC.  A suitable alternative (such as moves from the condition register) would need to be used.&lt;br /&gt;
&lt;br /&gt;
The assembler can generate as much as 256K of code in a single block; however, conditional branches on PowerPC are limited to approximately +/-32K.  It would be necessary either to restrict the block size, or to insert jump tables in a manner similar to the literal pools on ARM.&lt;br /&gt;
&lt;br /&gt;
PowerPC generally relies on early branch resolution rather than statistical branch prediction.  Scheduling branch condition tests earlier may be advantageous.  (For example, the address bounds check could be done in address_generation, rather than just before the load or store.  Similarly it may be advantageous to test the branch condition and update the cycle count before executing the delay slot.)&lt;br /&gt;
&lt;br /&gt;
===MIPS===&lt;br /&gt;
&lt;br /&gt;
Recompiling MIPS into MIPS would be relatively straightforward; however, the current code generator has no facility for filling delay slots.  This capability would be required for efficient code generation.&lt;/div&gt;</summary>
		<author><name>Ari64</name></author>
		
	</entry>
	<entry>
		<id>https://pandorawiki.org/index.php?title=File:Recompiler_delay_slot_reordering_2.png&amp;diff=2579</id>
		<title>File:Recompiler delay slot reordering 2.png</title>
		<link rel="alternate" type="text/html" href="https://pandorawiki.org/index.php?title=File:Recompiler_delay_slot_reordering_2.png&amp;diff=2579"/>
		<updated>2010-06-28T00:43:23Z</updated>

		<summary type="html">&lt;p&gt;Ari64: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Ari64</name></author>
		
	</entry>
	<entry>
		<id>https://pandorawiki.org/index.php?title=File:Recompiler_delay_slot_reordering.png&amp;diff=2578</id>
		<title>File:Recompiler delay slot reordering.png</title>
		<link rel="alternate" type="text/html" href="https://pandorawiki.org/index.php?title=File:Recompiler_delay_slot_reordering.png&amp;diff=2578"/>
		<updated>2010-06-28T00:42:17Z</updated>

		<summary type="html">&lt;p&gt;Ari64: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Ari64</name></author>
		
	</entry>
	<entry>
		<id>https://pandorawiki.org/index.php?title=Assembly_Code_Optimization&amp;diff=2254</id>
		<title>Assembly Code Optimization</title>
		<link rel="alternate" type="text/html" href="https://pandorawiki.org/index.php?title=Assembly_Code_Optimization&amp;diff=2254"/>
		<updated>2010-03-22T20:44:37Z</updated>

		<summary type="html">&lt;p&gt;Ari64: /* Branch Prediction */ fix typo&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Assembly code optimization on the Cortex-A8 ==&lt;br /&gt;
This guide presents specific optimization techniques for the ARM Cortex-A8 processor and its dual-issue, in-order pipeline.&lt;br /&gt;
&lt;br /&gt;
== Use the ARMv7 movw/movt instructions ==&lt;br /&gt;
Newer ARM processors allow loading 32-bit values as two 16-bit immediates.  The movw instruction loads the lower 16 bits, and movt loads the upper 16 bits.  The movw instruction clears the upper 16 bits, so that 16-bit values can be loaded using a single instruction.  The movt instruction does not affect the lower bits.&lt;br /&gt;
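&lt;br /&gt;
For example, loading the constant 0x12345678 into r0:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
movw r0, #0x5678    @ r0 = 0x00005678 (upper 16 bits cleared)&lt;br /&gt;
movt r0, #0x1234    @ r0 = 0x12345678 (lower 16 bits unchanged)&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;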
&lt;br /&gt;
On older ARM processors, it was common to load 32-bit values with a PC-relative load.  This should be avoided because it may result in a cache miss.&lt;br /&gt;
&lt;br /&gt;
== Branch Prediction ==&lt;br /&gt;
Branches which have not been seen before are predicted not taken.  It is therefore preferable to structure code so that the most likely code path is the one where the branch is not taken.&lt;br /&gt;
&lt;br /&gt;
Unconditional branches may be mispredicted, and load instructions which follow branches may be decoded and cause cache accesses.  Avoid placing load instructions after branches unless you intend the CPU to prefetch the addresses they reference.&lt;br /&gt;
&lt;br /&gt;
The branch predictor has a call/return stack for jumps which reference r14 or stack operations using r13 which load the program counter.  For best performance, make sure any pop, mov pc,lr or bx lr jumps to the same location as was set by the matching bl instruction.  For non-return jumps, use a register other than r14.&lt;br /&gt;
&lt;br /&gt;
Although instructions are decoded in pairs, the branch predictor can only predict one branch per cycle.  If you have a conditional branch which is immediately followed by another branch, and the first branch is very likely to be taken, place a no-operation instruction between the branches to prevent decoding and prediction of the second branch.  Inserting NOPs is detrimental if the first branch is not frequently taken.&lt;br /&gt;
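&lt;br /&gt;
For example (label names are illustrative only):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
cmp r0, #0&lt;br /&gt;
beq common_case     @ taken most of the time&lt;br /&gt;
nop                 @ keeps the following branch out of the same decode pair&lt;br /&gt;
b   rare_case&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;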
&lt;br /&gt;
== Dual-Issue Restrictions ==&lt;br /&gt;
Only one branch instruction can issue per cycle.  Only one load or store instruction can issue per cycle.  Instructions which write to the same register can not issue together.&lt;br /&gt;
&lt;br /&gt;
There is a one-cycle delay before any written register can be used as the address of a subsequent load or store instruction.  There is a one-cycle delay before any written register can be used by an instruction which performs a shift or rotation on that register.&lt;br /&gt;
&lt;br /&gt;
There is a one-cycle delay before the result of a load can be used.  In the aforementioned cases involving subsequent shift or rotate instructions, or the address of a load or store instruction, the total delay is two cycles.&lt;br /&gt;
&lt;br /&gt;
Stores occur at the end of the pipeline and store instructions can issue in the same cycle as another instruction which writes the same register.  Similarly, conditional branches can issue in the same cycle as flag-setting instructions because the branch is not resolved until the following cycle.  In all other cases, instructions can not issue together if the second instruction depends on the results of the first.&lt;br /&gt;
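&lt;br /&gt;
To illustrate the load-use delays described above:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
ldr r0, [r1]        @ result available after a one-cycle delay&lt;br /&gt;
add r2, r0, #1      @ waits one cycle for r0&lt;br /&gt;
ldr r3, [r4, r0]    @ r0 used as an address: two-cycle delay from the load&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;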
&lt;br /&gt;
== Instruction pairing restrictions following a branch ==&lt;br /&gt;
The Cortex-A8 processor normally decodes two instructions per cycle, but branch instructions, and certain instructions which resemble branches, are decoded at a rate of only one per cycle if present in pairs.  These instructions can still issue and execute in parallel if there are sufficient instructions in the queue.  However, following a mispredicted branch, the queue will be empty, and the following instructions issue and execute at a rate of only one per cycle when two such instructions are paired:&lt;br /&gt;
&lt;br /&gt;
* Branch instructions&lt;br /&gt;
* Load instructions, whether or not they reference r15&lt;br /&gt;
* Arithmetic/logic instructions which do not set flags and do not have an immediate value.&lt;br /&gt;
&lt;br /&gt;
Flag-setting instructions can be used to avoid this restriction.  Additionally, MOV instructions that do not have immediate values can be replaced with ADD, SUB, ORR, or EOR instructions using zero as an immediate value.&lt;br /&gt;
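&lt;br /&gt;
For example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
mov r0, r1          @ no flags, no immediate: decodes slowly after a mispredict&lt;br /&gt;
orr r0, r1, #0      @ equivalent result; the immediate form avoids the restriction&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;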
&lt;br /&gt;
Intentionally placing a series of such instructions, such as mov r0,r0, following an unconditional branch may reduce unwanted code prefetch.&lt;br /&gt;
&lt;br /&gt;
== Code alignment ==&lt;br /&gt;
Code alignment should be used with caution.  Alignment has the potential to increase performance, but may be detrimental in some circumstances.&lt;br /&gt;
&lt;br /&gt;
Because the instruction decoder always fetches 64-bit aligned words from the level-1 instruction cache, aligning code can improve instruction fetch and decode.  However, excessive code alignment can result in a suboptimal distribution of entries in the branch history tables and increase branch misprediction.  Additionally, the speculative prefetch may retrieve and decode instructions used as padding, even if those instructions are never executed.  The types of instructions used as padding will affect the instruction decoder as stated above, and this may reduce the prefetch of code into the instruction cache.&lt;br /&gt;
&lt;br /&gt;
[[Category:Development]]&lt;/div&gt;</summary>
		<author><name>Ari64</name></author>
		
	</entry>
	<entry>
		<id>https://pandorawiki.org/index.php?title=Assembly_Code_Optimization&amp;diff=2238</id>
		<title>Assembly Code Optimization</title>
		<link rel="alternate" type="text/html" href="https://pandorawiki.org/index.php?title=Assembly_Code_Optimization&amp;diff=2238"/>
		<updated>2010-03-14T09:24:27Z</updated>

		<summary type="html">&lt;p&gt;Ari64: /* Instruction pairing restrictions following a branch */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Assembly code optimization on the Cortex-A8 ==&lt;br /&gt;
This guide presents specific optimization techniques for the ARM Cortex-A8 processor and its dual-issue, in-order pipeline.&lt;br /&gt;
&lt;br /&gt;
== Use the ARMv7 movw/movt instructions ==&lt;br /&gt;
Newer ARM processors allow loading 32-bit values as two 16-bit immediates.  The movw instruction loads the lower 16 bits, and movt loads the upper 16 bits.  The movw instruction clears the upper 16 bits, so that 16-bit values can be loaded using a single instruction.  The movt instruction does not affect the lower bits.&lt;br /&gt;
&lt;br /&gt;
On older ARM processors, it was common to load 32-bit values with a PC-relative load.  This should be avoided because it may result in a cache miss.&lt;br /&gt;
&lt;br /&gt;
== Branch Prediction ==&lt;br /&gt;
Branches which have not been seen before are predicted not taken.  It is therefore preferable to structure code so that the most likely code path is the one where the branch is not taken.&lt;br /&gt;
&lt;br /&gt;
Unconditional branches may be mispredicted, and load instructions which follow branches may be decoded and cause cache accesses.  Avoid placing load instructions after branches unless you intend the CPU to prefetch the addresses they reference.&lt;br /&gt;
&lt;br /&gt;
The branch predictor has a call/return stack for jumps which reference r14 or stack operations using r13 which load the program counter.  For best performance, make sure any pop, mov pc,lr or bx lr jumps to the same location as was set by the matching bl instruction.  For non-return jumps, use a register other than r14.&lt;br /&gt;
&lt;br /&gt;
Although instructions are decoded in pairs, the branch predictor can only predict one branch per cycle.  If you have a conditional branch which is immediately followed by another branch, and the first branch is very likely to be taken, place a no-operation instruction between the branches to prevent decoding and prediction of the second branch.  Inserting NOPs is detrimental if the first branch is not frequently taken.&lt;br /&gt;
&lt;br /&gt;
== Dual-Issue Restrictions ==&lt;br /&gt;
Only one branch instruction can issue per cycle.  Only one load or store instruction can issue per cycle.  Instructions which write to the same register can not issue together.&lt;br /&gt;
&lt;br /&gt;
There is a one-cycle delay before any written register can be used as the address of a subsequent load or store instruction.  There is a one-cycle delay before any written register can be used by an instruction which performs a shift or rotation on that register.&lt;br /&gt;
&lt;br /&gt;
There is a one-cycle delay before the result of a load can be used.  In the aforementioned cases involving subsequent shift or rotate instructions, or the address of a load or store instruction, the total delay is two cycles.&lt;br /&gt;
&lt;br /&gt;
Stores occur at the end of the pipeline and store instructions can issue in the same cycle as another instruction which writes the same register.  Similarly, conditional branches can issue in the same cycle as flag-setting instructions because the branch is not resolved until the following cycle.  In all other cases, instructions can not issue together if the second instruction depends on the results of the first.&lt;br /&gt;
&lt;br /&gt;
== Instruction pairing restrictions following a branch ==&lt;br /&gt;
The Cortex-A8 processor normally decodes two instructions per cycle, but branch instructions, and certain instructions which resemble branches, are decoded at a rate of only one per cycle if present in pairs.  These instructions can still issue and execute in parallel if there are sufficient instructions in the queue.  However, following a mispredicted branch, the queue will be empty, and the following instructions issue and execute at a rate of only one per cycle when two such instructions are paired:&lt;br /&gt;
&lt;br /&gt;
* Branch instructions&lt;br /&gt;
* Load instructions, whether or not they reference r15&lt;br /&gt;
* Arithmetic/logic instructions which do not set flags and do not have an immediate value.&lt;br /&gt;
&lt;br /&gt;
Flag-setting instructions can be used to avoid this restriction.  Additionally, MOV instructions that do not have immediate values can be replaced with ADD, SUB, ORR, or EOR instructions using zero as an immediate value.&lt;br /&gt;
&lt;br /&gt;
Intentionally placing a series of such instructions, such as mov r0,r0, following an unconditional branch may reduce unwanted code prefetch.&lt;br /&gt;
&lt;br /&gt;
== Code alignment ==&lt;br /&gt;
Code alignment should be used with caution.  Alignment has the potential to increase performance, but may be detrimental in some circumstances.&lt;br /&gt;
&lt;br /&gt;
Because the instruction decoder always fetches 64-bit aligned words from the level-1 instruction cache, aligning code can improve instruction fetch and decode.  However, excessive code alignment can result in a suboptimal distribution of entries in the branch history tables and increase branch misprediction.  Additionally, the speculative prefetch may retrieve and decode instructions used as padding, even if those instructions are never executed.  The types of instructions used as padding will affect the instruction decoder as stated above, and this may reduce the prefetch of code into the instruction cache.&lt;br /&gt;
&lt;br /&gt;
[[Category:Development]]&lt;/div&gt;</summary>
		<author><name>Ari64</name></author>
		
	</entry>
	<entry>
		<id>https://pandorawiki.org/index.php?title=Assembly_Code_Optimization&amp;diff=2235</id>
		<title>Assembly Code Optimization</title>
		<link rel="alternate" type="text/html" href="https://pandorawiki.org/index.php?title=Assembly_Code_Optimization&amp;diff=2235"/>
		<updated>2010-03-13T12:49:50Z</updated>

		<summary type="html">&lt;p&gt;Ari64: /* Dual-Issue Restrictions */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Assembly code optimization on the Cortex-A8 ==&lt;br /&gt;
This guide presents specific optimization techniques for the ARM Cortex-A8 processor and its dual-issue, in-order pipeline.&lt;br /&gt;
&lt;br /&gt;
== Use the ARMv7 movw/movt instructions ==&lt;br /&gt;
Newer ARM processors allow loading 32-bit values as two 16-bit immediates.  The movw instruction loads the lower 16 bits, and movt loads the upper 16 bits.  The movw instruction clears the upper 16 bits, so that 16-bit values can be loaded using a single instruction.  The movt instruction does not affect the lower bits.&lt;br /&gt;
&lt;br /&gt;
On older ARM processors, it was common to load 32-bit values with a PC-relative load.  This should be avoided because it may result in a cache miss.&lt;br /&gt;
&lt;br /&gt;
== Branch Prediction ==&lt;br /&gt;
Branches which have not been seen before are predicted not taken.  It is therefore preferable to structure code so that the most likely code path is the one where the branch is not taken.&lt;br /&gt;
&lt;br /&gt;
Unconditional branches may be mispredicted, and load instructions which follow branches may be decoded and cause cache accesses.  Avoid placing load instructions after branches unless you intend the CPU to prefetch the addresses they reference.&lt;br /&gt;
&lt;br /&gt;
The branch predictor has a call/return stack for jumps which reference r14 or stack operations using r13 which load the program counter.  For best performance, make sure any pop, mov pc,lr or bx lr jumps to the same location as was set by the matching bl instruction.  For non-return jumps, use a register other than r14.&lt;br /&gt;
&lt;br /&gt;
Although instructions are decoded in pairs, the branch predictor can only predict one branch per cycle.  If you have a conditional branch which is immediately followed by another branch, and the first branch is very likely to be taken, place a no-operation instruction between the branches to prevent decoding and prediction of the second branch.  Inserting NOPs is detrimental if the first branch is not frequently taken.&lt;br /&gt;
&lt;br /&gt;
== Dual-Issue Restrictions ==&lt;br /&gt;
Only one branch instruction can issue per cycle.  Only one load or store instruction can issue per cycle.  Instructions which write to the same register can not issue together.&lt;br /&gt;
&lt;br /&gt;
There is a one-cycle delay before any written register can be used as the address of a subsequent load or store instruction.  There is a one-cycle delay before any written register can be used by an instruction which performs a shift or rotation on that register.&lt;br /&gt;
&lt;br /&gt;
There is a one-cycle delay before the result of a load can be used.  In the aforementioned cases involving subsequent shift or rotate instructions, or the address of a load or store instruction, the total delay is two cycles.&lt;br /&gt;
&lt;br /&gt;
Stores occur at the end of the pipeline and store instructions can issue in the same cycle as another instruction which writes the same register.  Similarly, conditional branches can issue in the same cycle as flag-setting instructions because the branch is not resolved until the following cycle.  In all other cases, instructions can not issue together if the second instruction depends on the results of the first.&lt;br /&gt;
&lt;br /&gt;
== Instruction pairing restrictions following a branch ==&lt;br /&gt;
The Cortex-A8 processor normally decodes two instructions per cycle, but branch instructions, and certain instructions which resemble branches, are decoded at a rate of only one per cycle if present in pairs.  These instructions can still issue and execute in parallel with other instructions that were already in the queue.  However, following a mispredicted branch, the queue will be empty, and the following instructions issue and execute at a rate of only one per cycle:&lt;br /&gt;
&lt;br /&gt;
* Branch instructions&lt;br /&gt;
* Load instructions, whether or not they reference r15&lt;br /&gt;
* Arithmetic/logic instructions which do not set flags and do not have an immediate value.&lt;br /&gt;
&lt;br /&gt;
Flag-setting instructions may be used to avoid this restriction.  Additionally, MOV instructions which do not have immediate values can often be replaced with ADD, ORR, or EOR instructions using zero as an immediate value.&lt;br /&gt;
&lt;br /&gt;
== Code alignment ==&lt;br /&gt;
Code alignment should be used with caution.  Alignment has the potential to increase performance, but may be detrimental in some circumstances.&lt;br /&gt;
&lt;br /&gt;
Because the instruction decoder always fetches 64-bit aligned words from the level-1 instruction cache, aligning code can improve instruction fetch and decode.  However, excessive code alignment can result in a suboptimal distribution of entries in the branch history tables and increase branch misprediction.  Additionally, the speculative prefetch may retrieve and decode instructions used as padding, even if those instructions are never executed.  The types of instructions used as padding will affect the instruction decoder as stated above, and this may reduce the prefetch of code into the instruction cache.&lt;br /&gt;
&lt;br /&gt;
[[Category:Development]]&lt;/div&gt;</summary>
		<author><name>Ari64</name></author>
		
	</entry>
	<entry>
		<id>https://pandorawiki.org/index.php?title=Assembly_Code_Optimization&amp;diff=2234</id>
		<title>Assembly Code Optimization</title>
		<link rel="alternate" type="text/html" href="https://pandorawiki.org/index.php?title=Assembly_Code_Optimization&amp;diff=2234"/>
		<updated>2010-03-13T12:30:37Z</updated>

		<summary type="html">&lt;p&gt;Ari64: /* Code alignment */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Assembly code optimization on the Cortex-A8 ==&lt;br /&gt;
This guide presents specific optimization techniques for the ARM Cortex-A8 processor and its dual-issue, in-order pipeline.&lt;br /&gt;
&lt;br /&gt;
== Use the ARMv7 movw/movt instructions ==&lt;br /&gt;
Newer ARM processors allow loading 32-bit values as two 16-bit immediates.  The movw instruction loads the lower 16 bits, and movt loads the upper 16 bits.  The movw instruction clears the upper 16 bits, so that 16-bit values can be loaded using a single instruction.  The movt instruction does not affect the lower bits.&lt;br /&gt;
&lt;br /&gt;
On older ARM processors, it was common to load 32-bit values with a PC-relative load.  This should be avoided because it may result in a cache miss.&lt;br /&gt;
&lt;br /&gt;
== Branch Prediction ==&lt;br /&gt;
Branches which have not been seen before are predicted not taken.  It is therefore preferable to structure code so that the most likely code path is the one where the branch is not taken.&lt;br /&gt;
&lt;br /&gt;
Unconditional branches may be mispredicted, and load instructions which follow branches may be decoded and cause cache accesses.  Avoid placing load instructions after branches unless you intend the CPU to prefetch the addresses they reference.&lt;br /&gt;
&lt;br /&gt;
The branch predictor has a call/return stack for jumps which reference r14 or stack operations using r13 which load the program counter.  For best performance, make sure any pop, mov pc,lr or bx lr jumps to the same location as was set by the matching bl instruction.  For non-return jumps, use a register other than r14.&lt;br /&gt;
&lt;br /&gt;
Although instructions are decoded in pairs, the branch predictor can only predict one branch per cycle.  If you have a conditional branch which is immediately followed by another branch, and the first branch is very likely to be taken, place a no-operation instruction between the branches to prevent decoding and prediction of the second branch.  Inserting NOPs is detrimental if the first branch is not frequently taken.&lt;br /&gt;
&lt;br /&gt;
== Dual-Issue Restrictions ==&lt;br /&gt;
Only one branch instruction can issue per cycle.  Only one load or store instruction can issue per cycle.  Instructions which write to the same register can not issue together.&lt;br /&gt;
&lt;br /&gt;
There is a one-cycle delay before any written register can be used as the address of a load or store instruction.&lt;br /&gt;
&lt;br /&gt;
There is a one-cycle delay before the result of a load can be used.  Stores occur at the end of the pipeline and store instructions can issue in the same cycle as another instruction which writes the same register.  Similarly, conditional branches can issue in the same cycle as flag-setting instructions because the branch is not resolved until the following cycle.&lt;br /&gt;
&lt;br /&gt;
Shift or rotate instructions take two cycles, and may stall if any of three preceding instructions write to the shifted register.&lt;br /&gt;
&lt;br /&gt;
== Instruction pairing restrictions following a branch ==&lt;br /&gt;
The Cortex-A8 processor normally decodes two instructions per cycle, but branch instructions, and certain instructions which resemble branches, are decoded at a rate of only one per cycle if present in pairs.  These instructions can still issue and execute in parallel with other instructions that were already in the queue.  However, following a mispredicted branch, the queue will be empty, and the following instructions issue and execute at a rate of only one per cycle:&lt;br /&gt;
&lt;br /&gt;
* Branch instructions&lt;br /&gt;
* Load instructions, whether or not they reference r15&lt;br /&gt;
* Arithmetic/logic instructions which do not set flags and do not have an immediate value.&lt;br /&gt;
&lt;br /&gt;
Flag-setting instructions may be used to avoid this restriction.  Additionally, MOV instructions which do not have immediate values can often be replaced with ADD, ORR, or EOR instructions using zero as an immediate value.&lt;br /&gt;
&lt;br /&gt;
== Code alignment ==&lt;br /&gt;
Code alignment should be used with caution.  Alignment has the potential to increase performance, but may be detrimental in some circumstances.&lt;br /&gt;
&lt;br /&gt;
Because the instruction decoder always fetches 64-bit aligned words from the level-1 instruction cache, aligning code can improve instruction fetch and decode.  However, excessive code alignment can result in a suboptimal distribution of entries in the branch history tables and increase branch misprediction.  Additionally, the speculative prefetch may retrieve and decode instructions used as padding, even if those instructions are never executed.  The types of instructions used as padding will affect the instruction decoder as stated above, and this may reduce the prefetch of code into the instruction cache.&lt;br /&gt;
&lt;br /&gt;
[[Category:Development]]&lt;/div&gt;</summary>
		<author><name>Ari64</name></author>
		
	</entry>
	<entry>
		<id>https://pandorawiki.org/index.php?title=Assembly_Code_Optimization&amp;diff=2230</id>
		<title>Assembly Code Optimization</title>
		<link rel="alternate" type="text/html" href="https://pandorawiki.org/index.php?title=Assembly_Code_Optimization&amp;diff=2230"/>
		<updated>2010-03-12T04:06:50Z</updated>

		<summary type="html">&lt;p&gt;Ari64: clarify some wording&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Assembly code optimization on the Cortex-A8 ==&lt;br /&gt;
This guide presents specific optimization techniques for the ARM Cortex-A8 processor and its dual-issue, in-order pipeline.&lt;br /&gt;
&lt;br /&gt;
== Use the ARMv7 movw/movt instructions ==&lt;br /&gt;
Newer ARM processors allow loading 32-bit values as two 16-bit immediates.  The movw instruction loads the lower 16 bits, and movt loads the upper 16 bits.  The movw instruction clears the upper 16 bits, so that 16-bit values can be loaded using a single instruction.  The movt instruction does not affect the lower bits.&lt;br /&gt;
&lt;br /&gt;
On older ARM processors, it was common to load 32-bit values with a PC-relative load.  This should be avoided because it may result in a cache miss.&lt;br /&gt;
&lt;br /&gt;
== Branch Prediction ==&lt;br /&gt;
Branches which have not been seen before are predicted not taken.  It is therefore preferable to structure code so that the most likely code path is the one where the branch is not taken.&lt;br /&gt;
&lt;br /&gt;
Unconditional branches may be mispredicted, and load instructions which follow branches may be decoded and cause cache accesses.  Avoid placing load instructions after branches unless you intend the CPU to prefetch the addresses they reference.&lt;br /&gt;
&lt;br /&gt;
The branch predictor has a call/return stack for jumps which reference r14 or stack operations using r13 which load the program counter.  For best performance, make sure any pop, mov pc,lr or bx lr jumps to the same location as was set by the matching bl instruction.  For non-return jumps, use a register other than r14.&lt;br /&gt;
&lt;br /&gt;
Although instructions are decoded in pairs, the branch predictor can only predict one branch per cycle.  If you have a conditional branch which is immediately followed by another branch, and the first branch is very likely to be taken, place a no-operation instruction between the branches to prevent decoding and prediction of the second branch.  Inserting NOPs is detrimental if the first branch is not frequently taken.&lt;br /&gt;
&lt;br /&gt;
== Dual-Issue Restrictions ==&lt;br /&gt;
Only one branch instruction can issue per cycle.  Only one load or store instruction can issue per cycle.  Instructions which write to the same register can not issue together.&lt;br /&gt;
&lt;br /&gt;
There is a one-cycle delay before any written register can be used as the address of a load or store instruction.&lt;br /&gt;
&lt;br /&gt;
There is a one-cycle delay before the result of a load can be used.  Stores occur at the end of the pipeline and store instructions can issue in the same cycle as another instruction which writes the same register.  Similarly, conditional branches can issue in the same cycle as flag-setting instructions because the branch is not resolved until the following cycle.&lt;br /&gt;
&lt;br /&gt;
Shift or rotate instructions take two cycles, and may stall if any of three preceding instructions write to the shifted register.&lt;br /&gt;
&lt;br /&gt;
== Instruction pairing restrictions following a branch ==&lt;br /&gt;
The Cortex-A8 processor normally decodes two instructions per cycle, but branch instructions, and certain instructions which resemble branches, are decoded at a rate of only one per cycle if present in pairs.  These instructions can still issue and execute in parallel with other instructions that were already in the queue.  However, following a mispredicted branch, the queue will be empty, and the following instructions issue and execute at a rate of only one per cycle:&lt;br /&gt;
&lt;br /&gt;
* Branch instructions&lt;br /&gt;
* Load instructions, whether or not they reference r15&lt;br /&gt;
* Arithmetic/logic instructions which do not set flags and do not have an immediate value.&lt;br /&gt;
&lt;br /&gt;
Flag-setting instructions may be used to avoid this restriction.  Additionally, MOV instructions which do not have immediate values can often be replaced with ADD, ORR, or EOR instructions using zero as an immediate value.&lt;br /&gt;
&lt;br /&gt;
== Code alignment ==&lt;br /&gt;
Code alignment is a trade-off: it can reduce performance as well as improve it.&lt;br /&gt;
&lt;br /&gt;
In some cases, aligning code may improve instruction fetch and decode.  However, excessive code alignment can result in a suboptimal distribution of entries in the branch history tables and increase branch misprediction.&lt;br /&gt;
&lt;br /&gt;
[[Category:Development]]&lt;/div&gt;</summary>
		<author><name>Ari64</name></author>
		
	</entry>
	<entry>
		<id>https://pandorawiki.org/index.php?title=Development_tutorials&amp;diff=2229</id>
		<title>Development tutorials</title>
		<link rel="alternate" type="text/html" href="https://pandorawiki.org/index.php?title=Development_tutorials&amp;diff=2229"/>
		<updated>2010-03-12T02:32:19Z</updated>

		<summary type="html">&lt;p&gt;Ari64: /* ARM Cortex A8 Tutorials */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ARM Cortex A8 Tutorials ==&lt;br /&gt;
* [[Floating Point Optimization]]&lt;br /&gt;
* [[Assembly Code Optimization]]&lt;br /&gt;
&lt;br /&gt;
== SDL Tutorials ==&lt;br /&gt;
&lt;br /&gt;
These tutorials assume you know the basics of C++ programming, and know your way around a C++ compiler.&lt;br /&gt;
&lt;br /&gt;
* [http://www.lazyfoo.net/SDL_tutorials/index.php Lazy Foo's Tutorials].  Not Pandora specific, but a good guide to getting your programming environment set up, along with many SDL tutorials.&lt;br /&gt;
* [http://iki.fi/sol/gp/ Sol's Graphics for beginners].  Not Pandora specific, but a good place to get started with SDL graphics coding.&lt;br /&gt;
&lt;br /&gt;
==OpenGL on the Pandora==&lt;br /&gt;
&lt;br /&gt;
*[[OpenGL ES 1.1 Tutorial]]&lt;br /&gt;
&lt;br /&gt;
*[[OpenGL ES 2.0 Tutorial]]&lt;br /&gt;
&lt;br /&gt;
*[[Combining OpenGL ES 1.1 and SDL to create a window on the Pandora]]&lt;br /&gt;
&lt;br /&gt;
== The Kernel ==&lt;br /&gt;
* [[Kernel build instructions|Compiling the Kernel from Git]]&lt;br /&gt;
* [[Kernel interface|Kernel Interface]]&lt;br /&gt;
&lt;br /&gt;
== Matchbox Window Manager ==&lt;br /&gt;
&lt;br /&gt;
* [[Matchbox|Matchbox version]]&lt;br /&gt;
* [[xoo on ubuntu|Setting up xoo on Ubuntu 8.04/8.10]] (Theme Testing and Development)&lt;br /&gt;
&lt;br /&gt;
== See Also ==&lt;br /&gt;
&lt;br /&gt;
* [[Development Tools]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Development]]&lt;/div&gt;</summary>
		<author><name>Ari64</name></author>
		
	</entry>
	<entry>
		<id>https://pandorawiki.org/index.php?title=Assembly_Code_Optimization&amp;diff=2228</id>
		<title>Assembly Code Optimization</title>
		<link rel="alternate" type="text/html" href="https://pandorawiki.org/index.php?title=Assembly_Code_Optimization&amp;diff=2228"/>
		<updated>2010-03-12T02:31:04Z</updated>

		<summary type="html">&lt;p&gt;Ari64: Optimization guidelines for Cortex-A8&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Assembly code optimization on the Cortex-A8 ==&lt;br /&gt;
This guide presents specific optimization techniques for the ARM Cortex-A8 processor and its dual-issue, in-order pipeline.&lt;br /&gt;
&lt;br /&gt;
== Use the ARMv7 movw/movt instructions ==&lt;br /&gt;
Newer ARM processors allow loading 32-bit values as two 16-bit immediates.  The movw instruction loads the lower 16 bits, and movt loads the upper 16 bits.  The movw instruction clears the upper 16 bits, so that 16-bit values can be loaded using a single instruction.  The movt instruction does not affect the lower bits.&lt;br /&gt;
&lt;br /&gt;
On older ARM processors, it was common to load 32-bit values with a PC-relative load.  This should be avoided because it may result in a cache miss.&lt;br /&gt;
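&lt;br /&gt;
As a sketch (the register and constant are chosen for illustration), loading 0x12345678 both ways:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
@ Older style: PC-relative load from a literal pool; may miss in the data cache&lt;br /&gt;
ldr   r0, =0x12345678&lt;br /&gt;
&lt;br /&gt;
@ ARMv7 style: two immediates, no memory access&lt;br /&gt;
movw  r0, #0x5678    @ r0 = 0x00005678; upper 16 bits cleared&lt;br /&gt;
movt  r0, #0x1234    @ r0 = 0x12345678; lower 16 bits unchanged&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;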
&lt;br /&gt;
== Branch Prediction ==&lt;br /&gt;
Branches which have not been seen before are predicted not taken.  It is therefore preferable to structure code so that the most likely code path is the one where the branch is not taken.&lt;br /&gt;
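&lt;br /&gt;
As an illustration (the label and register are hypothetical), an error check arranged so that the common case falls through:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
cmp   r0, #0&lt;br /&gt;
beq   error_path     @ rarely taken; predicted not taken on first encounter&lt;br /&gt;
@ common case continues here without taking a branch&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;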
&lt;br /&gt;
Unconditional branches may be mispredicted, and load instructions which follow branches may be speculatively decoded, causing cache line fills.  Avoid placing load instructions immediately after branches unless you intend the CPU to prefetch the addresses they reference.&lt;br /&gt;
&lt;br /&gt;
The branch predictor has a call/ret stack for jumps which reference r14 or push/pop operations using r13 which load the program counter.  For best performance, make sure any pop, mov pc,lr or bx lr jumps to the same location as was set by the previous bl instruction.  For non-return jumps, use a register other than r14.&lt;br /&gt;
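&lt;br /&gt;
A minimal sketch of a call that the call/return stack can predict (the subroutine name is illustrative):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
bl    do_work        @ pushes the return address onto the predictor's stack&lt;br /&gt;
...&lt;br /&gt;
do_work:&lt;br /&gt;
...&lt;br /&gt;
bx    lr             @ return; predicted from the call/return stack&lt;br /&gt;
&lt;br /&gt;
@ A computed jump that is not a return should use a register other than r14:&lt;br /&gt;
bx    r12&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;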
&lt;br /&gt;
Although instructions are decoded in pairs, the branch predictor can only predict one branch per cycle.  If you have a conditional branch which is immediately followed by another branch, and the first branch is very likely to be taken, place a no-operation instruction between the branches to prevent decoding and prediction of the second branch.  Inserting NOPs is detrimental if the first branch is not frequently taken.&lt;br /&gt;
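&lt;br /&gt;
For example (the labels are hypothetical), separating two adjacent branches when the first is usually taken:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
cmp   r0, #0&lt;br /&gt;
bne   common_case    @ usually taken&lt;br /&gt;
nop                  @ keeps the second branch out of the same decode pair&lt;br /&gt;
b     rare_case&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;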
&lt;br /&gt;
== Dual-Issue Restrictions ==&lt;br /&gt;
Only one branch instruction can issue per cycle.  Only one load or store instruction can issue per cycle.  Instructions which write to the same register cannot issue together.&lt;br /&gt;
&lt;br /&gt;
There is a one-cycle delay before any written register can be used as the address of a load or store instruction.&lt;br /&gt;
&lt;br /&gt;
There is a one-cycle delay before the result of a load can be used.  Stores occur at the end of the pipeline, so store instructions can issue in the same cycle as another instruction which writes the same register.  Similarly, conditional branches can issue in the same cycle as flag-setting instructions because the branch is not resolved until the following cycle.&lt;br /&gt;
&lt;br /&gt;
Shift or rotate instructions take two cycles, and may stall if any of the three preceding instructions writes to the shifted register.&lt;br /&gt;
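&lt;br /&gt;
A scheduling sketch (the registers are illustrative) showing how an unrelated instruction can hide the load-use delay:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
@ Stalls: the add needs r0 in the cycle after the load&lt;br /&gt;
ldr   r0, [r1]&lt;br /&gt;
add   r2, r0, #1&lt;br /&gt;
&lt;br /&gt;
@ Scheduled: independent work fills the load-use slot&lt;br /&gt;
ldr   r0, [r1]&lt;br /&gt;
add   r3, r3, #4     @ unrelated; issues while the load completes&lt;br /&gt;
add   r2, r0, #1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;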
&lt;br /&gt;
== Instruction pairing restrictions following a branch ==&lt;br /&gt;
The Cortex-A8 processor normally decodes two instructions per cycle, but branch instructions, and certain instructions which resemble branches, are decoded at a rate of only one per cycle if present in pairs.  These instructions can still issue and execute in parallel with other instructions that were already in the queue.  However, following a mispredicted branch, the queue will be empty, and the following instructions issue and execute at a rate of only one per cycle:&lt;br /&gt;
&lt;br /&gt;
* Branch instructions&lt;br /&gt;
* Load instructions, whether or not they reference r15&lt;br /&gt;
* Arithmetic/logic instructions which do not set flags and do not have an immediate value.&lt;br /&gt;
&lt;br /&gt;
Flag-setting instructions may be used to avoid this restriction.  Additionally, MOV or MVN instructions which do not have immediate values can often be replaced with ADD, ORR, or EOR instructions using immediate values.&lt;br /&gt;
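&lt;br /&gt;
For example, a register-to-register move rewritten with an immediate form (either replacement leaves r0 identical):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
@ Decodes at one per cycle after a mispredicted branch&lt;br /&gt;
mov   r0, r1&lt;br /&gt;
&lt;br /&gt;
@ Equivalent forms with immediate values avoid the restriction&lt;br /&gt;
add   r0, r1, #0&lt;br /&gt;
orr   r0, r1, #0&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;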
&lt;br /&gt;
== Code alignment ==&lt;br /&gt;
Code alignment is a trade-off: it can reduce performance as well as improve it.&lt;br /&gt;
&lt;br /&gt;
In some cases, aligning code may improve instruction fetch and decode.  However, excessive code alignment can result in a suboptimal distribution of entries in the branch history tables and increase branch misprediction.&lt;br /&gt;
&lt;br /&gt;
[[Category:Development]]&lt;/div&gt;</summary>
		<author><name>Ari64</name></author>
		
	</entry>
</feed>