In some very unscientific benchmark:
spu_thread::do_dma_transfer() was taking 2.27% of my CPU before, now
0.07%, while __memmove_avx_unaligned_erms() was taking 1.47% and now
2.88%, which added makes about 0.8% saved.
This race could lead to spu status bits indicate RUNNING status, but cpu state being stopped.
Fix it by making sure cpu state is set before spu status.
* Removed wrong code in sys_spu_thread_group_terminate.
* SPU Thread ID is accurate, including 5th thread id "rule".
* Fixed possible use-after-free access of spu_thread::group member.
* RawSPU ID management simplified.
This replaces the totally messed up PR #6728
Some games make heavy use of getllar() & putllc() without even changing data.
In this case avoid unneccesary heavy locking of the PPU threads on non-TSX
hosts.
* Ignore more warnings
These are intentional
* Signed/unsigned mismatch when comparing
* Explictly cast values
* Intentionally discard a nodiscard value
* Change ppu_tid to u32
* Do not use POSIX function name on Windows
* Qt: Use horizontalAdvance instead of width
* Change progress variables to u32
With the option enabled GET commands are blocked until the current PUTLLC/PUTLLUC executer on that address finishes
Additional improvements:
- Minor race fix of sys_ppu_thread_exit (wait until the writer finishes)
- Max number of ppu threads bumped to 8
vm::spu max address was overflowing resulting in issues, so cast to u64 where needed. Fixes#6145.
Use vm::get_addr instead of manually substructing vm::base(0) from pointer in texture cache code.
Prefer std::atomic_thread_fence over _mm_?fence(), adjust usage to be more correct.
Used sequantially consistent ordering in semaphore_release for TSX path as well.
Improved memory ordering for sys_rsx_context_iounmap/map.
Fixed sync bugs in HLE gcm because of not using atomic instructions.
Use release memory barrier in lwsync for PPU LLVM, according to this xbox360 programming guide lwsync is a hw release memory barrier.
Also use release barrier where lwsync was originally used in liblv2 sys_lwmutex and cellSync.
Use acquire barrier for isync instruction, see https://devblogs.microsoft.com/oldnewthing/20180814-00/?p=99485