diff --git a/docs/cpu_todo.md b/docs/cpu_todo.md
new file mode 100644
index 000000000..7b1050452
--- /dev/null
+++ b/docs/cpu_todo.md
@@ -0,0 +1,317 @@
# CPU TODO

There are many improvements that could be made under `xe::cpu` to improve
debugging, performance (both of the JIT itself and of the code it generates),
and portability. Some are in various states of completion, and others are just
thoughts that need more exploring.

## Debugging Improvements

### Reproducible X64 Emission

It'd be useful to be able to run a PPC function through the entire pipeline and
spit out x64 that is byte-for-byte identical across runs. This would allow
automated verification, bulk analysis, etc. Currently `X64Emitter::Emplace`
relocates the x64 when placing it in memory, and that memory is at a different
location each run. Instead it would be nice to have xbyak's `calcJmpAddress`,
which performs the relocations, use an address of our choosing.

### Stack Walking

We currently rely on the Windows/VC++ dbghelp stack walking, which is not
portable, is slow, and cannot resolve JIT'ed symbols properly. Having our own
stack walking code that could fall back to dbghelp (via some pluggable system)
for host symbols would let us quickly get stacks through host and guest code
and make things like sampling profilers, kernel callstack tracing, and other
features possible.

### Sampling Profiler

Once we have stack walking it'd be nice to take something like
[micro-profiler](https://code.google.com/p/micro-profiler/) and augment it to
support our system. This would let us run continuous performance analysis and
track hotspots in JIT'ed code without a large performance impact. Automatically
showing the top hot functions in the debugger could help track down poor
translation much faster.

### Intel Architecture Code Analyzer Support

The [Intel ACA](https://software.intel.com/en-us/articles/intel-architecture-code-analyzer)
is a nifty tool that, given a kernel of x64, can detail theoretical performance
characteristics on different processors down to cycle timings and potential
bottlenecks on memory/execution units. It's designed to run on elf/obj/etc
files, but it simply looks for special markers in the code. Something that
walks the code cache and dumps a specially formatted file with the markers
around basic blocks would allow running the tool in bulk. Alternatively, it
would be useful to invoke it one-off by dumping a specific x64 block to disk
and processing it for display when looking at the code in the debugger.

I've done some early experiments with this, and it's possible to pass the tool
just a bin file containing the markers and the x64.

### Function Tracing/Coverage Information

`function_trace_data.h` contains the `FunctionTraceData` struct, which is
currently partially populated by the x64 backend. It enables tracking of which
threads a function is called on, the function's call count, its recent
callers, and even instruction-level execution counts.

This is all only partially implemented, though, and there's no tool to read it
out. It would be nice to integrate this into the debugger so that it can
overlay the information when viewing a function, but the data would also be
useful in aggregate to find hot functions/code paths or to enhance callstacks
by automatically annotating thread information.
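To make that concrete, here is a rough sketch of the kind of record this
tracks. The field names and layout here are illustrative only; the actual
in-progress struct lives in `function_trace_data.h` and differs in detail:

```
// Hypothetical shape of a per-function trace record; see
// function_trace_data.h for the real (partial) definition.
#include <cstdint>

struct TraceRecord {
  uint32_t function_address;     // Guest address of the traced function.
  uint64_t thread_mask;          // Bit N set once thread N calls the function.
  uint64_t call_count;           // Total calls observed.
  uint32_t recent_callers[4];    // Ring buffer of recent caller addresses.
  uint32_t recent_caller_index;  // Next ring buffer slot to overwrite.
  // Followed in the trace buffer by one uint64_t counter per instruction
  // (or per block; see the next section).
};
```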
#### Block-level Counting

Currently the code assumes each instruction has its own count; however, this
is expensive and often unneeded, as the counting can be done at the block
level and the per-instruction counts derived from that. This can reduce the
overhead (both in memory and in accounting time) by an order of magnitude.

### On-Stack Context Inspection

Currently the debugger only works with `--store_all_context_values`, as it can
only get the values of PPC registers when they are stored to the PPC context
after each instruction. As this can slow things down by ~10-20%, it could be
useful to preserve the optimized and register-allocated HIR so that the host
registers holding context values can be derived on demand. Or, we could just
make `--store_all_context_values` faster.

## JIT Performance Improvements

### Reduce HIR Size

Currently there are a lot of pointers stored within `Instr`, `Value`, and
related types. These are big 8-byte values that eat a lot of memory and really
hurt the cache (especially with all the block/instruction walking done).
Aligning everything in the arena to 16-byte boundaries and using 16-bit
indices (or something similar) could shrink things a lot.

### Serialize Code Cache

The x64 code cache is currently set up to use fixed memory addresses and is
even represented as mapped memory. It should be fairly easy to back this with
a file and have all generated code written to disk. Adding more metadata, or
perhaps a side-car file, would describe what has been cached. On future runs
the code cache could load this data (by mapping the file containing the code
right into memory) and short-circuit JIT'ing entirely.

It would be possible to use a common container format (ELF/etc); however,
there's elegance in not requiring any additional steps beyond the memory
mapping. Such containers could be useful for running static tools against,
though.

## Portability Improvements

### Emulated Opcode Layer

Having a way to use emulated variants for any HIR opcode in a backend would
help when writing a new backend as well as when verifying the existing
backends. This may look like a C library with functions for each opcode/type
pairing and utilities to call out to them, as in the sketch below. The x64
backend could then call out to these with `CallNativeSafe` (or some faster
equivalent), and an interpreter backend built on top of them would be fairly
trivial to write.
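A minimal sketch of the opcode/type-pairing idea. Everything here is
hypothetical (no `xe_emu_*` library exists today); it only illustrates the
shape such a layer might take:

```
// Hypothetical emulated-opcode library: one function per opcode/type pair,
// exposed with C linkage so any backend can call out to it.
#include <cstdint>

extern "C" {

typedef struct {
  uint8_t u8[16];
} xe_emu_vec128_t;

// Scalar pairings, e.g. for an add opcode:
uint32_t xe_emu_add_i32(uint32_t a, uint32_t b) { return a + b; }
uint64_t xe_emu_add_i64(uint64_t a, uint64_t b) { return a + b; }

// Wide types are passed by pointer so any backend ABI can reach them.
// This stands in for a per-lane vector op with i8 lanes:
void xe_emu_vector_add_i8(const xe_emu_vec128_t* a, const xe_emu_vec128_t* b,
                          xe_emu_vec128_t* dest) {
  for (int i = 0; i < 16; ++i) {
    dest->u8[i] = (uint8_t)(a->u8[i] + b->u8[i]);
  }
}

}  // extern "C"
```

An interpreter backend would simply dispatch each HIR instruction to the
matching function, while the x64 backend could call the same functions as a
slow-but-correct reference to verify its emitted sequences against.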
## X64 Backend Improvements

### Implement Emulated Instructions

There are a ton of half-implemented HIR opcodes that call out to C++ to do
their work. These are extremely expensive as they incur a full guest-to-host
thunk (~hundreds of instructions!). Basically, any of the
`Emulate*`/`CallNativeSafe` functions in `x64_sequences.cc` need to be
replaced with proper AVX/AVX2 variants.

### Increase Register Availability

Currently only a few x64 registers are usable (due to reservations by the
backend or ABI conflicts). Though register pressure is surprisingly light in
most cases, there are pathological cases that result in a lot of spills.
Freeing up some of these registers would reduce the spills.

### Constant Pooling

This may make sense as a compiler pass instead.

Right now, particular sequences of instructions are nasty, such as anything
using `LoadConstantXmm` to load vec128 constants other than zero or one.
Instead of emitting the super fat (20-30 byte!) constant loads as is done now,
it may be better to keep a per-function constant table and use RIP-relative
addressing (or something similar) so that the memory-form AVX instructions can
be used.

For example, right now this:
```
  v82.v128 = [0,1,2,3]
  v83.v128 = or v81.v128, v82.v128
```

Translates to (something like):
```
  mov([rsp+0x...], 0x00000000)
  mov([rsp+0x...+4], 0x00000001)
  mov([rsp+0x...+8], 0x00000002)
  mov([rsp+0x...+12], 0x00000003)
  vmovdqa(xmm3, [rsp+0x...])
  vor(xmm2, xmm2, xmm3)
```

Whereas it could be:
```
  vor(xmm2, xmm2, [rip+0x...])
```

Whether the cost of doing the constant de-dupe is worth it remains to be seen;
a sketch of the bookkeeping follows. Right now the inline loads waste a lot of
instruction cache space, increase decode time, and potentially use a lot more
memory bandwidth.
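A rough sketch of the per-function constant table, de-duping by bit pattern.
No such helper exists today; the names are illustrative:

```
// Hypothetical constant pool for one function: interns 16-byte constants and
// returns the pool offset to reference via RIP-relative addressing once the
// pool is emitted after the function body.
#include <array>
#include <cstdint>
#include <cstring>
#include <map>
#include <vector>

class ConstantPool {
 public:
  size_t Intern(const uint8_t value[16]) {
    std::array<uint8_t, 16> key;
    std::memcpy(key.data(), value, 16);
    auto it = offsets_.find(key);
    if (it != offsets_.end()) {
      return it->second;  // Constant already pooled; reuse its slot.
    }
    size_t offset = bytes_.size();
    bytes_.insert(bytes_.end(), key.begin(), key.end());
    offsets_.emplace(key, offset);
    return offset;
  }
  const std::vector<uint8_t>& bytes() const { return bytes_; }

 private:
  std::vector<uint8_t> bytes_;  // Raw pool data, kept 16-byte aligned.
  std::map<std::array<uint8_t, 16>, size_t> offsets_;
};
```

The emitter would record a fixup for each use and patch the RIP-relative
displacement once the final address of the pool is known.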
## Optimization Improvements

### Speed Up RegisterAllocationPass

Currently the slowest pass, this could be improved by requiring less use
tracking or perhaps by maintaining the use tracking in other passes. A faster
`SortUsageList` (radix or something fancy?) may be helpful as well.

### More Opcodes in ConstantPropagationPass

There are a few HIR opcodes with no handling, and others with minimal
handling. It'd be nice to know which paths need improvement and add them, as
any constants folded here are free wins for everything downstream.

### Cross-Block ConstantPropagationPass

Constant propagation currently only occurs within a single block. This makes
it difficult to optimize common PPC patterns like loading the constants 0 or 1
into a register before a loop, as well as other loads of expensive altivec
values.

Either ControlFlowAnalysisPass or DataFlowAnalysisPass could be piggy-backed
to track constant load_context/store_context's across block boundaries and
propagate the values. This is simpler than handling dynamic values, as no phi
functions or anything fancy need to happen.

### Add TypePropagationPass

There are many extensions/truncations in generated code right now due to
various loads/stores of varying widths. Being able to find and short-circuit
the conversions early on would make the following passes cleaner and faster,
as they'd have to trace through fewer value definitions and there'd be fewer
extraneous movs in the final code.

Example (after ContextPromotion):
```
  v82.i32 = truncate v81.i64
  v84.i32 = and v82.i32, 3F
  v85.i64 = zero_extend v84.i32
```

Becomes (after DCE/etc):
```
  v85.i64 = and v81.i64, 3F
```

### Enhance MemorySequenceCombinationPass with Extend/Truncate

Currently this pass will look for byte_swap and merge it into loads/stores.
This allows for better final codegen at the cost of making optimization more
difficult, so it only happens at the end of the process.

There are currently TODOs in there for adding extend/truncate support, which
will extend what the pass does with swaps to also merge
sign_extend/zero_extend/truncate into the matching load/store. This lets the
x64 backend generate the proper mov variants for these operations without
requiring additional steps. Note that if we had an LIR and a peephole
optimizer this would be better done there.

Load with swap and extend:
```
  v1.i32 = load v0
  v2.i32 = byte_swap v1.i32
  v3.i64 = zero_extend v2.i32
```

Becomes:
```
  v1.i64 = load_convert v0, [swap|i32->i64,zero]
```

Store with truncate and swap:
```
  v1.i64 = ...
  v2.i32 = truncate v1.i64
  v3.i32 = byte_swap v2.i32
  store v0, v3.i32
```

Becomes:
```
  store_convert v0, v1.i64, [swap|i64->i32,trunc]
```

### Add DeadStoreEliminationPass

Generic DSE pass, removing all redundant stores. ContextPromotion may be able
to take care of most of these, as the input assembly is generally pretty
optimized already. This pass would mainly be looking for introduced stores,
such as those from comparisons.

Currently ControlFlowAnalysisPass will annotate blocks with incoming/outgoing
edges as well as dominators, and that could be used to check whether stores
into the context are used in their destination block or instead overwritten
(currently they almost never are).

If this pass were able to remove a good number of the stores, then the
comparisons would also be removed by dead code elimination, dramatically
reducing branch overhead.

Example:
```
:
  v0 = compare_ult ... (later removed by DCE)
  v1 = compare_ugt ... (later removed by DCE)
  v2 = compare_eq ...
  store_context +300, v0 <-- removed
  store_context +301, v1 <-- removed
  store_context +302, v2 <-- removed
  branch_true v1, ...
:
  v3 = compare_ult ...
  v4 = compare_ugt ...
  v5 = compare_eq ...
  store_context +300, v3 <-- these may be required if at end of function
  store_context +301, v4     or before a call
  store_context +302, v5
  branch_true v5, ...
```

### Add X64CanonicalizationPass

For various opcodes, add copies or commute the arguments to match x64's
destructive two-operand semantics. This makes code generation easier and, if
done before register allocation, can prevent a lot of extra shuffling in the
emitted code.

Example:
```
:
  v0 = ...
  v1 = ...
  v2 = add v0, v1 <-- v1 now unused
```

Becomes:
```
  v0 = ...
  v1 = ...
  v1 = add v1, v0 <-- src1 = dest/src, so reuse for both
                      by commuting and setting dest = src1
```

### Add MergeLocalSlotsPass

As the RegisterAllocationPass runs, it generates load_local/store_local
instructions as it spills. Currently each set of locals is unique to each
block, which in very large functions can result in a lot of locals that are
only used briefly. It may be useful to use the results of the
ControlFlowAnalysisPass to track local liveness and merge the slots so they
are reused when they cannot possibly be live at the same time, as in the
sketch below. This saves stack space and potentially improves cache behavior.
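A rough sketch of the slot-merging idea, assuming each spill slot has been
given a conservative [first_use, last_use] range over some linearized block
order derived from the ControlFlowAnalysisPass results. None of these names
exist in the codebase; this is a greedy interval assignment, not the actual
pass:

```
// Hypothetical greedy merging of spill slots by live range.
#include <algorithm>
#include <cstdint>
#include <vector>

struct SlotLiveRange {
  uint32_t slot_id;    // Dense id of the original local slot.
  uint32_t first_use;  // Linearized ordinal of first load_local/store_local.
  uint32_t last_use;   // Linearized ordinal of last use.
};

// Returns a merged slot index for each original slot; slots whose ranges
// never overlap share a merged slot, shrinking the stack frame.
std::vector<uint32_t> MergeLocalSlots(std::vector<SlotLiveRange> ranges) {
  std::sort(ranges.begin(), ranges.end(),
            [](const SlotLiveRange& a, const SlotLiveRange& b) {
              return a.first_use < b.first_use;
            });
  std::vector<uint32_t> merged(ranges.size());
  std::vector<uint32_t> occupied_until;  // last_use of each merged slot.
  for (const auto& range : ranges) {
    bool reused = false;
    for (uint32_t i = 0; i < occupied_until.size(); ++i) {
      if (occupied_until[i] < range.first_use) {
        merged[range.slot_id] = i;  // Previous occupant is dead; reuse.
        occupied_until[i] = range.last_use;
        reused = true;
        break;
      }
    }
    if (!reused) {
      merged[range.slot_id] = uint32_t(occupied_until.size());
      occupied_until.push_back(range.last_use);  // Need a fresh slot.
    }
  }
  return merged;
}
```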
diff --git a/src/xenia/cpu/backend/x64/x64_emitter.cc b/src/xenia/cpu/backend/x64/x64_emitter.cc
index 7a1cbd3de..820ae3538 100644
--- a/src/xenia/cpu/backend/x64/x64_emitter.cc
+++ b/src/xenia/cpu/backend/x64/x64_emitter.cc
@@ -53,11 +53,6 @@
 static const size_t MAX_CODE_SIZE = 1 * 1024 * 1024;
 static const size_t STASH_OFFSET = 32;
 static const size_t STASH_OFFSET_HIGH = 32 + 32;
-// If we are running with tracing on we have to store the EFLAGS in the stack,
-// otherwise our calls out to C to print will clear it before DID_CARRY/etc
-// can get the value.
-#define STORE_EFLAGS 1
-
 const uint32_t X64Emitter::gpr_reg_map_[X64Emitter::GPR_COUNT] = {
     Operand::RBX, Operand::R12, Operand::R13, Operand::R14, Operand::R15,
 };
@@ -539,25 +534,6 @@ void X64Emitter::nop(size_t length) {
   }
 }
 
-void X64Emitter::LoadEflags() {
-#if STORE_EFLAGS
-  mov(eax, dword[rsp + STASH_OFFSET]);
-  btr(eax, 0);
-#else
-// EFLAGS already present.
-#endif  // STORE_EFLAGS
-}
-
-void X64Emitter::StoreEflags() {
-#if STORE_EFLAGS
-  pushf();
-  pop(dword[rsp + STASH_OFFSET]);
-#else
-// EFLAGS should have CA set?
-// (so long as we don't fuck with it)
-#endif  // STORE_EFLAGS
-}
-
 bool X64Emitter::ConstantFitsIn32Reg(uint64_t v) {
   if ((v & ~0x7FFFFFFF) == 0) {
     // Fits under 31 bits, so just load using normal mov.
diff --git a/src/xenia/cpu/backend/x64/x64_emitter.h b/src/xenia/cpu/backend/x64/x64_emitter.h
index d5b56ee17..abda0ee4d 100644
--- a/src/xenia/cpu/backend/x64/x64_emitter.h
+++ b/src/xenia/cpu/backend/x64/x64_emitter.h
@@ -173,9 +173,6 @@ class X64Emitter : public Xbyak::CodeGenerator {
 
   // TODO(benvanik): Label for epilog (don't use strings).
 
-  void LoadEflags();
-  void StoreEflags();
-
   // Moves a 64bit immediate into memory.
   bool ConstantFitsIn32Reg(uint64_t v);
   void MovMem64(const Xbyak::RegExp& addr, uint64_t v);
diff --git a/src/xenia/cpu/compiler/compiler_passes.h b/src/xenia/cpu/compiler/compiler_passes.h
index 90196c7a4..7601df8dd 100644
--- a/src/xenia/cpu/compiler/compiler_passes.h
+++ b/src/xenia/cpu/compiler/compiler_passes.h
@@ -24,157 +24,4 @@
 #include "xenia/cpu/compiler/passes/validation_pass.h"
 #include "xenia/cpu/compiler/passes/value_reduction_pass.h"
 
-// TODO:
-//   - mark_use/mark_set
-//     For now: mark_all_changed on all calls
-//     For external functions:
-//       - load_context/mark_use on all arguments
-//       - mark_set on return argument?
-//     For internal functions:
-//       - if liveness analysis already done, use that
-//       - otherwise, assume everything dirty (ACK!)
-//         - could use scanner to insert mark_use
-//
-// Maybe:
-//   - v0.xx = load_constant
-//   - v0.xx = load_zero
-//   Would prevent NULL defs on values, and make constant de-duping possible.
-//   Not sure if it's worth it, though, as the extra register allocation
-//   pressure due to de-duped constants seems like it would slow things down
-//   a lot.
-//
-// - CFG:
-//   Blocks need predecessors()/successor()
-//   phi Instr reference
-//
-// - block liveness tracking (in/out)
-//   Block gets:
-//     AddIncomingValue(Value* value, Block* src_block) ??
-
-// Potentially interesting passes:
-//
-// Run order:
-//   ContextPromotion
-//   Simplification
-//   ConstantPropagation
-//   TypePropagation
-//   ByteSwapElimination
-//   Simplification
-//   DeadStoreElimination
-//   DeadCodeElimination
-//
-// - TypePropagation
-//   There are many extensions/truncations in generated code right now due to
-//   various load/stores of varying widths. Being able to find and short-
-//   circuit the conversions early on would make following passes cleaner
-//   and faster as they'd have to trace through fewer value definitions.
-//   Example (after ContextPromotion):
-//     v81.i64 = load_context +88
-//     v82.i32 = truncate v81.i64
-//     v84.i32 = and v82.i32, 3F
-//     v85.i64 = zero_extend v84.i32
-//     v87.i64 = load_context +248
-//     v88.i64 = v85.i64
-//     v89.i32 = truncate v88.i64 <-- zero_extend/truncate => v84.i32
-//     v90.i32 = byte_swap v89.i32
-//     store v87.i64, v90.i32
-//   after type propagation / simplification / DCE:
-//     v81.i64 = load_context +88
-//     v82.i32 = truncate v81.i64
-//     v84.i32 = and v82.i32, 3F
-//     v87.i64 = load_context +248
-//     v90.i32 = byte_swap v84.i32
-//     store v87.i64, v90.i32
-//
-// - ByteSwapElimination
-//   Find chained byte swaps and replace with assignments. This is often found
-//   in memcpy paths.
-//   Example:
-//     v0 = load ...
-//     v1 = byte_swap v0
-//     v2 = byte_swap v1
-//     store ..., v2 <-- this could be v0
-//
-//   It may be tricky to detect, though, as often times there are intervening
-//   instructions:
-//     v21.i32 = load v20.i64
-//     v22.i32 = byte_swap v21.i32
-//     v23.i64 = zero_extend v22.i32
-//     v88.i64 = v23.i64 (from ContextPromotion)
-//     v89.i32 = truncate v88.i64
-//     v90.i32 = byte_swap v89.i32
-//     store v87.i64, v90.i32
-//   After type propagation:
-//     v21.i32 = load v20.i64
-//     v22.i32 = byte_swap v21.i32
-//     v89.i32 = v22.i32
-//     v90.i32 = byte_swap v89.i32
-//     store v87.i64, v90.i32
-//   This could ideally become:
-//     v21.i32 = load v20.i64
-//     ... (DCE takes care of this) ...
-//     store v87.i64, v21.i32
-//
-// - DeadStoreElimination
-//   Generic DSE pass, removing all redundant stores. ContextPromotion may be
-//   able to take care of most of these, as the input assembly is generally
-//   pretty optimized already. This pass would mainly be looking for introduced
-//   stores, such as those from comparisons.
-//
-//   Example:
-//   :
-//     v0 = compare_ult ... (later removed by DCE)
-//     v1 = compare_ugt ... (later removed by DCE)
-//     v2 = compare_eq ...
-//     store_context +300, v0 <-- removed
-//     store_context +301, v1 <-- removed
-//     store_context +302, v2 <-- removed
-//     branch_true v1, ...
-//   :
-//     v3 = compare_ult ...
-//     v4 = compare_ugt ...
-//     v5 = compare_eq ...
-//     store_context +300, v3 <-- these may be required if at end of function
-//     store_context +301, v4     or before a call
-//     store_context +302, v5
-//     branch_true v5, ...
-//
-// - X86Canonicalization
-//   For various opcodes add copies/commute the arguments to match x86
-//   operand semantics. This makes code generation easier and if done
-//   before register allocation can prevent a lot of extra shuffling in
-//   the emitted code.
-//
-//   Example:
-//   :
-//     v0 = ...
-//     v1 = ...
-//     v2 = add v0, v1 <-- v1 now unused
-//   Becomes:
-//     v0 = ...
-//     v1 = ...
-//     v1 = add v1, v0 <-- src1 = dest/src, so reuse for both
-//                         by commuting and setting dest = src1
-//
-// - RegisterAllocation
-//   Given a machine description (register classes, counts) run over values
-//   and assign them to registers, adding spills as needed. It should be
-//   possible to directly emit code from this form.
-//
-//   Example:
-//   :
-//     v0 = load_context +0
-//     v1 = load_context +1
-//     v0 = add v0, v1
-//     ...
-//     v2 = mul v0, v1
-//   Becomes:
-//     reg0 = load_context +0
-//     reg1 = load_context +1
-//     reg2 = add reg0, reg1
-//     store_local +123, reg2 <-- spill inserted
-//     ...
-//     reg0 = load_local +123 <-- load inserted
-//     reg0 = mul reg0, reg1
-
 #endif  // XENIA_COMPILER_COMPILER_PASSES_H_