This implementation optimises correctly on all relevant compilers,
unlike GSL's, which produced extremely slow code on any compiler other
than MSVC.
Supersedes #6948.
- When the point sprite flag is set, the input is overridden similarly to the
2D mask: the returned X and Y values are always the gl_PointCoord values
for the fragment.
- Stacks with the 2D mask to override the z and w coordinates.
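A rough sketch of how the two overrides compose (all names are illustrative, not the decompiler's actual code):

```cpp
struct vec4 { float x, y, z, w; };

vec4 apply_texcoord_overrides(vec4 in, vec4 point_coord,
                              bool point_sprite_enabled, bool texcoord_2d_masked)
{
    if (point_sprite_enabled)
    {
        // X and Y always come from gl_PointCoord for the fragment
        in.x = point_coord.x;
        in.y = point_coord.y;
    }
    if (texcoord_2d_masked)
    {
        // The 2D mask stacks on top and overrides Z and W
        in.z = 0.f;
        in.w = 1.f;
    }
    return in;
}
```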
- Allow delaying report flushes triggered by image_in or buffer_notify
- When the report is ready, all the delayed transfers are performed
automatically.
- TODO: Make this configurable?
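A hedged sketch of the deferral (zcull_report and its transfer queue are assumed types, not RPCS3's internals):

```cpp
#include <functional>
#include <vector>

// Illustrative only: transfers requested while the report is still pending are
// queued instead of forcing a flush, then replayed once the report completes.
struct zcull_report
{
    bool ready = false;
    std::vector<std::function<void()>> pending_transfers;

    void request_transfer(std::function<void()> transfer)
    {
        if (ready) { transfer(); return; } // nothing to delay
        pending_transfers.push_back(std::move(transfer));
    }

    void on_report_ready()
    {
        ready = true;
        for (auto& transfer : pending_transfers)
            transfer(); // all delayed transfers are done automatically
        pending_transfers.clear();
    }
};
```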
- Renormalizes arbitrary N-bit values as 8-bit normalized.
- NV hardware performs integer normalization at 8 bits if the size is less than 8.
- This can cause significant arithmetic drift because the error is multiplied by a huge number when sampling.
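The correction this implies, as an arithmetic sketch (the function and its form are assumptions, not the shader code): an N-bit value v encodes v / (2^N - 1), but hardware normalizing at 8 bits returns v / 255, so the sampled result must be rescaled by 255 / (2^N - 1).

```cpp
#include <cstdint>

// Sketch: rescale a value the hardware normalized at 8 bits back to its true
// N-bit normalized range (bits < 8).
constexpr float renormalize(std::uint8_t raw, unsigned bits)
{
    const float hw_result = raw / 255.f;                 // what NV hardware returns
    const float scale = 255.f / float((1u << bits) - 1); // correction factor
    return hw_result * scale;
}
// e.g. for a 4-bit max value of 15: hardware yields 15/255 ~= 0.059,
// and 0.059 * (255/15) = 1.0, the correct normalized result.
```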
* Allow sys_rsx_device_map to be called twice: in this case, the DEVICE address retrieved from the previous call is returned
* Add ENOMEM checks for sys_rsx_memory_allocate and sys_rsx_context_allocate
* Add EINVAL check for sys_rsx_context_allocate if the memory handle is not found
* Separate sys_rsx_device_map allocation from sys_rsx_context_allocate's
* Implement sys_rsx_memory_free; used by cellGcmInit upon failure
* Added context_id checks
* Throw if sys_rsx_context_allocate was called twice.
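A simplified sketch of the added error paths (types, names and limits are illustrative; the real syscalls take more parameters):

```cpp
#include <cstdint>
#include <stdexcept>
#include <unordered_map>

enum class error_code { CELL_OK, CELL_ENOMEM, CELL_EINVAL };

struct rsx_state
{
    std::unordered_map<std::uint32_t, std::uint32_t> memory_handles; // handle -> size
    bool context_allocated = false;
};

error_code sys_rsx_memory_allocate(rsx_state& rsx, std::uint32_t size,
                                   std::uint32_t& handle_out)
{
    if (size == 0 || size > 254u * 1024 * 1024) // ENOMEM when the request cannot fit
        return error_code::CELL_ENOMEM;         // (the cap here is illustrative)

    static std::uint32_t next_handle = 1;
    handle_out = next_handle++;
    rsx.memory_handles.emplace(handle_out, size);
    return error_code::CELL_OK;
}

error_code sys_rsx_context_allocate(rsx_state& rsx, std::uint32_t mem_handle)
{
    // EINVAL if the memory handle is not found
    if (!rsx.memory_handles.count(mem_handle))
        return error_code::CELL_EINVAL;

    // Throw if called twice
    if (rsx.context_allocated)
        throw std::runtime_error("sys_rsx_context_allocate called twice");

    rsx.context_allocated = true;
    return error_code::CELL_OK;
}
```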
- Allows frameskipping to occur naturally if the RSX thread is bombarded with flip requests, by just jumping to the last one when possible
- See request_emu_flip() for async frame submission and implicit skipping
- Also allows the display queue to fill faster than the flip thread can drain it
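Illustrative sketch of the drain policy (queue element type assumed):

```cpp
#include <deque>
#include <optional>

// If the flip thread falls behind, present only the newest queued request;
// everything older is skipped implicitly and never shown.
std::optional<int> take_latest_flip(std::deque<int>& flip_queue)
{
    if (flip_queue.empty())
        return std::nullopt;

    const int latest = flip_queue.back();
    flip_queue.clear(); // drop all stale requests in one go
    return latest;
}
```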
Move se_t and se_storage to util/endian.hpp
Use single template instead of two specializations.
Add minor optimization for MSVC.
Remove v128 dependency.
Try to enable intrinsics for unaligned data.
Fix minor bug in u16/u32/u64 specializations.
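A minimal sketch of the single-template idea (illustrative; the real se_t/se_storage in util/endian.hpp covers more cases). The memcpy form keeps unaligned access legal while still letting compilers lower the loop to bswap/movbe:

```cpp
#include <cstring>
#include <type_traits>
#include <utility>

template <typename T>
T swap_bytes(const T& value)
{
    static_assert(std::is_trivially_copyable_v<T>);

    unsigned char bytes[sizeof(T)];
    std::memcpy(bytes, &value, sizeof(T)); // safe even for unaligned sources

    for (std::size_t i = 0; i < sizeof(T) / 2; i++)
        std::swap(bytes[i], bytes[sizeof(T) - 1 - i]);

    T result;
    std::memcpy(&result, bytes, sizeof(T));
    return result;
}
```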
- Separate displayed statistics from actual backend statistics.
Allows asynchronous flipping to work correctly as it just uses display stats.
The real stats are used by the frame scope marker to determine behavior like engaging the FIFO optimizer or skipping draw calls correctly.
- Add an explicit frame scope marker tied in with the queue_prepare command
Since queue_prepare is emitted at the end of a frame, it can be used as an end-of-frame marker in games that emit it
- If this command is not emitted, the FIFO flattener and frameskip will not work
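Illustrative shape of the split (struct and field names assumed):

```cpp
struct frame_statistics_t
{
    unsigned draw_calls = 0;
    unsigned submit_count = 0;
};

struct rsx_statistics
{
    frame_statistics_t display; // consumed/reset by flip(); safe for async presentation
    frame_statistics_t backend; // consumed by the frame scope marker to drive the
                                // FIFO optimizer and draw-call skipping decisions
};
```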
- While the mask for surface_a is at index 0, the surface cache expects the order to be maintained correctly!
Set the correct mask, since the surface store now checks each RTT individually
- Fix reading of varying registers in FP
Different registers have different behavior
- Always write to varying registers. If a register is not written to, it is initialized to (0, 0, 0, 1)
- Reimplements two-sided lighting correctly without hacks
- Also bumps shader cache version
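The initialization rule in isolation (a sketch; vec4 and the register file layout are assumed):

```cpp
#include <array>

struct vec4 { float x = 0.f, y = 0.f, z = 0.f, w = 1.f; };

// Any varying register the program never writes must read as (0, 0, 0, 1),
// so the register file starts out filled with that value.
std::array<vec4, 16> make_varying_regs()
{
    return {}; // value-initialization yields (0, 0, 0, 1) in every slot
}
```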
- Do not allow offloader to handle its own faults. Serialize them on RSX instead.
This approach introduces a GPU race condition that should be avoided with improved synchronization.
- TODO: Use proper GPU-side synchronization to avoid this situation
- The ZCULL queue was updated to one-per-CB, but the conditional render sync hint was not updated to match.
- Do not unconditionally flush the queue unless the upcoming ref is contained in the active CB.
- This avoids spamming queue flushes, which frees up resources and improves performance
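The flush condition as a sketch (refs modeled as monotonic counters; names assumed):

```cpp
#include <cstdint>

// Only force a flush when the ref we are about to wait on was emitted by the
// active command buffer; otherwise the hint can be serviced lazily.
bool should_flush_queue(std::uint32_t upcoming_ref,
                        std::uint32_t active_cb_first_ref,
                        std::uint32_t active_cb_last_ref)
{
    return upcoming_ref >= active_cb_first_ref &&
           upcoming_ref <= active_cb_last_ref;
}
```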
- Merge viewport raster window and scissor into one clipping region
- Viewport raster clip is different from viewport geometry clipping in
hardware as the latter is configurable separately
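One way to merge them, as a sketch (region layout assumed):

```cpp
#include <algorithm>

struct region { int x, y, width, height; };

// Intersect the viewport raster window with the scissor box to get a single
// clipping region for the backend.
region merge_clip(const region& viewport, const region& scissor)
{
    const int x1 = std::max(viewport.x, scissor.x);
    const int y1 = std::max(viewport.y, scissor.y);
    const int x2 = std::min(viewport.x + viewport.width,  scissor.x + scissor.width);
    const int y2 = std::min(viewport.y + viewport.height, scissor.y + scissor.height);
    return { x1, y1, std::max(0, x2 - x1), std::max(0, y2 - y1) };
}
```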
The vm::spu max address was overflowing, resulting in issues, so cast to u64 where needed. Fixes #6145.
Use vm::get_addr instead of manually subtracting vm::base(0) from the pointer in texture cache code.
Prefer std::atomic_thread_fence over _mm_?fence(), and adjust usage to be more correct.
Used sequentially consistent ordering in semaphore_release for the TSX path as well.
Improved memory ordering for sys_rsx_context_iounmap/map.
Fixed sync bugs in HLE gcm because of not using atomic instructions.
Use a release memory barrier in lwsync for PPU LLVM; according to the Xbox 360 programming guide, lwsync is a hardware release memory barrier.
Also use a release barrier where lwsync was originally used in liblv2 sys_lwmutex and cellSync.
Use an acquire barrier for the isync instruction, see https://devblogs.microsoft.com/oldnewthing/20180814-00/?p=99485
Prefer vm::ptr<>::ptr over vm::get_addr.
Prefer vm::_ptr/base over vm::g_base_addr with offset.
Added methods atomic_t<>::bts and atomic_t<>::btr.
Removed obsolete rsx::thread::Read/WriteIO32 methods.
Removed wrong check in semaphore_release.
Added handling for PUTRx commands for RawSPU MFC proxy.
Prefer overloaded methods of v128 over _mm_... intrinsics in the VPKSHUS PPU interpreter (precise).
Fixed more potential overflows that may result in wrong behaviour.
Added io/size alignment check for sys_rsx_context_iounmap.
Added rsx::constants::local_mem_base which represents RSX local memory base address.
Removed obsolete rsx::thread::main_mem_addr/ioSize/ioAddress members.
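Two of the changes above, sketched against plain std::atomic (atomic_t<> in the codebase differs; this only shows the shape of the mapping):

```cpp
#include <atomic>
#include <cstdint>

// Barrier mapping for the PPU instructions, per the reading cited above:
// lwsync acts as a release barrier, isync as an acquire barrier.
inline void ppu_lwsync() { std::atomic_thread_fence(std::memory_order_release); }
inline void ppu_isync()  { std::atomic_thread_fence(std::memory_order_acquire); }

// bts/btr in the spirit of atomic_t<>::bts/btr: atomically set or clear one
// bit and report its previous state.
inline bool bts(std::atomic<std::uint32_t>& v, std::uint32_t bit)
{
    return (v.fetch_or(1u << bit) >> bit) & 1;
}

inline bool btr(std::atomic<std::uint32_t>& v, std::uint32_t bit)
{
    return (v.fetch_and(~(1u << bit)) >> bit) & 1;
}
```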
- The fixed-point D24S8 format does special Z clamping during compare, which matches PS3 behaviour
- D32S8 is a floating-point format and comparison with Dref > 1 always fails, causing black edges/borders
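The compare difference, reduced to a sketch (LESS-EQUAL compare shown; function names are illustrative):

```cpp
#include <algorithm>

// Fixed-point D24S8: both sides are clamped to [0, 1] before the compare, so a
// Dref above 1.0 still behaves as 1.0 (PS3 behaviour).
bool shadow_test_d24s8(float dref, float stored_z)
{
    return std::clamp(dref, 0.f, 1.f) <= std::clamp(stored_z, 0.f, 1.f);
}

// Float D32S8: no clamp, so Dref > 1 compares against stored values <= 1.0 and
// always fails, producing the black edges/borders.
bool shadow_test_d32s8(float dref, float stored_z)
{
    return dref <= stored_z;
}
```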
- Allows D24S8 and D32S8 transport via typeless channels
- Allows uploading and downloading D24S8 data easily
- TODO: Implement optional byteswapping to fix flushed readbacks with
the same method
- Apply dither to edges that almost fail the straight-up alpha test
- Significantly improves alpha tested geometry far from the camera
- Also removes blend factor overrides/hacks as they give incorrect results due to background bleeding
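Loosely, the idea (a speculative sketch, not the shader's actual math): fragments that narrowly fail the alpha reference pass or fail through a per-pixel dither threshold instead of a hard cutoff.

```cpp
// 'dither' is a per-pixel value in [0, 1) from an ordered-dither pattern;
// 'fade_range' widens the band below the reference that gets dithered.
bool alpha_test_dithered(float alpha, float ref, float dither, float fade_range)
{
    if (alpha >= ref)
        return true;                          // unambiguous pass
    const float deficit = (ref - alpha) / fade_range;
    return deficit < 1.f && dither > deficit; // near-misses pass stochastically
}
```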
- Remove the required_xxx_pitch constraint as it makes no sense. The pitch controls what can be written per line.
- It is possible to have a huge surface width but only render to a small region at the beginning, with a smaller pitch than can fit the surface (NFS Carbon)
- Do not bind companion framebuffer when clearing single aspect; let the
contest mechanism sort it out instead
- Do not prematurely tag framebuffers; instead, only do so at
write-confirmation time. This should avoid false tagging if the setup does
not allow a render to occur.
- Immediate mode is isolated from the rest of the vertex configuration
- TODO: Verify register behaviour when immediate mode is used
Check if per-primitive const register values are supported (likely are)
* Restore the stack in FIFO error handling
* Update the GET register after command execution
* Fix PUT pauses in the middle of a command
* Add restore points when branching to self
* Precise nopcmd detection
* Test all invalid commands for early treatment of queue corruption
- Do not assume flip marks end-of-frame if executed via syscall
- Also disables skip_frame for these applications as there is no frame boundary
- NOTE: QUEUE_HEAD cannot be relied on as it is seemingly possible to flip the same head and not need to queue it
- Tries to detect when FIFO preprocessing is beneficial and only enables optimizations if the benefit outweighs the cost
- The current threshold is at least 500 draw calls saved, at over 2000 total draw calls, to justify the overhead
- TODO: More tuning for other CPUs
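The stated numbers as a sketch:

```cpp
// Enable FIFO preprocessing only when it saves at least 500 draw calls in a
// frame that issues more than 2000, so the benefit outweighs the cost.
bool enable_fifo_optimizations(unsigned total_draw_calls, unsigned draw_calls_saved)
{
    return total_draw_calls > 2000 && draw_calls_saved >= 500;
}
```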
- Improve vertex attribute layout format. Allows for full 16-bit attribute divisor
- Use the actual pitch when declaring the framebuffer RSX pitch instead of the register value in case of swizzled rendering
- Also fix visual corruption when using disjoint indexed draws
- Refactor draw call emit again (vk)
- Improve execution barrier resolve
- Allow vertex/index rebase inside begin/end pair
- Add ALPHA_TEST to list of excluded methods [TODO: defer raster state]
- gl bringup
- Simplify
- Using the simple_array gets back a few more fps :)