- Allows sections reclaimed by the surface store due to overlap/inheritance to be identified and removed.
- Additionally, this potentially lowers the number of flushes required per block with multiple overlaps, improving efficiency and, in theory, performance.
- Reject writes to RTT if the source data is of unknown origin. Non-RTT data that is only 1 line in length is suspicious and is often GPU data such as programs or other rendering inputs (see the sketch after this list).
- Attempt to identify blit operations that will be flushed immediately
after and just do them on CPU instead if the transformation is trivial.
- If only a single blit section is contributing to an atlas merge op, the
threshold should be 100%. The only acceptable result here is a
truncation.
- Raise passing 'score' from 50% to 90% to filter out very incomplete
merge operations.
- Catch unfit sections passing the match test; possible for blit_dst data but will likely always be harmless. Disabled in release builds by default.
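A minimal sketch of the rejection heuristic described above; all names here are hypothetical and only illustrate the shape of the check:

```cpp
// Hypothetical sketch: a 1-line source of non-RTT origin writing into a
// render target is more likely to be program data or other rendering inputs
// than pixel data, so such writes are rejected from the RTT path.
static bool should_reject_rtt_write(bool source_is_rtt, unsigned source_height_in_lines)
{
    return !source_is_rtt && source_height_in_lines == 1;
}
```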
* Prefer default initialization over zeroing with std::memset when possible and more readable.
* Use std::format when obtaining trophy file names.
* Use vm::ptr<>::operator bool() instead of comparing vm::ptr to vm::null or using addr().
* Add a few std::memset calls in hle where it matters (or in some places just to document an actual firmware memcpy call).
Round-to-nearest integer division, optimized for unsigned integral types (see the sketch below).
Used in sceNpTrophyGetGameProgress.
Do not allow signed values for aligned_div(), align().
This implementation optimises correctly on all relevant compilers,
unlike GSL’s which gave extremely slow code on any compiler other than
MSVC.
Supersedes #6948.
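A minimal sketch of the round-to-nearest division described above; the helper name rounded_div is hypothetical:

```cpp
#include <type_traits>

// Hypothetical sketch: adding half the divisor before dividing rounds the
// quotient to the nearest integer. Restricted to unsigned types, in line
// with aligned_div()/align() rejecting signed values.
template <typename T>
constexpr std::enable_if_t<std::is_unsigned_v<T>, T> rounded_div(T num, T den)
{
    return static_cast<T>((num + den / 2) / den);
}

// e.g. a progress percentage: rounded_div(7u * 100, 9u) == 78 (77.7... rounds up)
```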
- When the point sprite flag is set, overrides the input similarly to the 2D mask. The returned X and Y values are always the gl_PointCoord values for the fragment.
- Stacks with the 2D mask to override the z and w coordinates.
* rsx: Optimise primitive_restart::upload_untouched() with SSE4.1
This optimisation is only applied when skip_restart is false.
I’ve only tested the u16 codepath, as it is the one used in NieR.
In some very unscientific profiling, this function used to take 2.76% of the total frame time at the save point of the port town; it now takes about 0.40%.
* rsx: Mark all SSE4.1 functions with attributes on gcc and clang
This assures the compiler that we will only call these functions after having checked that the CPU supports these instructions.
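For reference, a minimal sketch of how such marking works on gcc/clang; the macro name is hypothetical:

```cpp
// Compile the annotated function with SSE4.1 code generation enabled even if
// the translation unit is built without -msse4.1. The caller remains
// responsible for checking CPU support (e.g. via cpuid) before calling it.
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_SSE4_1 __attribute__((target("sse4.1")))
#else
#define TARGET_SSE4_1
#endif

TARGET_SSE4_1
void upload_untouched_sse4_1(const void* src, void* dst, unsigned count);
```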
* rsx: Add an AVX2 implementation of primitive restart ibo upload
* rsx: Remove redefinition of SSE4.1 instructions
Now that clang is aware that our functions are compiled with SSE4.1, it
lets us generate this code using its intrinsics.
* rsx: Optimise vector to scalar conversion
This is done using the minpos and srli intrinsics and generates less code than before (see the sketch below).
Thanks Nekotekina for the suggestion!
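A sketch combining the u16 upload scan with the minpos reduction; names are hypothetical and restart-index handling is omitted for brevity. The trick: _mm_minpos_epu16 finds the horizontal minimum, so applying it to the complemented lanes and complementing the result yields the horizontal maximum:

```cpp
#include <smmintrin.h> // SSE4.1
#include <cstdint>
#include <cstddef>

// Hypothetical sketch: copy u16 indices while tracking the maximum index,
// then reduce the vector maximum to a scalar via minpos instead of a chain
// of shuffles and scalar compares.
static uint16_t copy_indices_sse4_1(uint16_t* dst, const uint16_t* src, size_t count)
{
    __m128i max_v = _mm_setzero_si128();
    size_t i = 0;
    for (; i + 8 <= count; i += 8)
    {
        const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), v);
        max_v = _mm_max_epu16(max_v, v); // SSE4.1: unsigned 16-bit max
    }

    // Horizontal max via minpos on the complement: min(~x) == ~max(x).
    const __m128i inv = _mm_xor_si128(max_v, _mm_set1_epi32(-1));
    uint16_t max_index = static_cast<uint16_t>(~_mm_extract_epi16(_mm_minpos_epu16(inv), 0));

    for (; i < count; ++i) // scalar tail
    {
        dst[i] = src[i];
        if (src[i] > max_index) max_index = src[i];
    }
    return max_index;
}
```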
subresource_layout::dim_in_texel
- These two are not always linked when working with compressed textures. The actual texels extend past the nominal size of the image if the size is not aligned. E.g. if height is 1, the real height is 4, but it's not possible to determine this from the aligned size alone; it could be 1, 2, 3 or 4.
- Fixes image out-of-bounds writes when uploading from CPU
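A sketch of the distinction for 4x4 block compression; the struct and field names are hypothetical, loosely mirroring dim_in_texel vs a dimension in blocks:

```cpp
// Hypothetical sketch: block-compressed (DXT/BCn) storage is rounded up to
// the 4x4 block grid, but the sampled image keeps its true texel dimensions.
// Only the texel dimensions can say that a 4-row allocation is a 1-texel-high
// image; the aligned size alone cannot.
struct compressed_dims
{
    unsigned width_in_texel, height_in_texel;  // e.g. 1 x 1
    unsigned width_in_block, height_in_block;  // e.g. 1 x 1 blocks = 4 x 4 texels of storage
};

static compressed_dims make_dims(unsigned width, unsigned height)
{
    return { width, height, (width + 3) / 4, (height + 3) / 4 };
}
```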
- Noticed a glitch on AMD hw and windows drivers where discard seems to affect entire 4x4 cells.
- Dead fragments (outside the primitive boundary) could have their discards trigger as they do not have proper access to variables.
- This introduces dead fragments along triangle edges, causing a diagonal line pattern across the screen that is very annoying.
- Renormalizes arbitrary N-bit values as 8-bit normalized.
- NV hardware performs integer normalization at 8 bits if the size is less than 8.
- This can cause significant arithmetic drift because the error is multiplied by a huge number when sampling.
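A sketch of the renormalization, assuming the hardware divides by 255 regardless of channel width (names hypothetical):

```cpp
#include <cstdint>

// Hypothetical sketch: expand an N-bit value to 8-bit normalized so hardware
// that always divides by 255 still yields v / (2^N - 1). Without this, a
// 5-bit channel read as x/255 instead of x/31 drifts badly once sampled
// values are multiplied by large factors.
static uint8_t renormalize_to_8bit(unsigned value, unsigned bit_depth)
{
    const unsigned n_max = (1u << bit_depth) - 1;                     // e.g. 31 for 5-bit
    return static_cast<uint8_t>((value * 255u + n_max / 2) / n_max);  // round to nearest
}

// e.g. renormalize_to_8bit(31, 5) == 255, so sampling yields 31/31 == 1.0 as expected
```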
- If either source data or dest is a render target, do image operations on the GPU same as before
- If swizzle is desired, use CPU fallback
- If no scaling and no format conversion is required, use CPU fallback
- If scaling is desired and the transfer target is in local memory, use the GPU
- When doing trivial copies, use the routine in rsx_methods instead of
duplicating code. Also has the benefit of better range checking.
- When committing a block as fbo, keep blit_dst data as well.
- Avoids removing (and losing data from) blit targets that just happen to share a page with a framebuffer.
- Uncacheable resources can be reused as soon as they're made visible to the draw call.
- Since they're likely to be reused every draw call until the shader changes, it is important to reuse as much as possible
- Calculate exact sizes when doing hit tests to avoid false negatives
- Defer page checking until actually required to do memory setup
- Introduce align2 helper to do non-pow2 alignments (see the sketch after this list)
- Allow use of intrinsics when SSSE3 and SSE4.1 are not available in the build target environment
- Properly separate SSE4.1 code from SSSE3 code for some older processors without SSE4.1
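A minimal sketch of the align2 helper mentioned above, assuming its job is rounding up to an arbitrary, not necessarily power-of-2, multiple:

```cpp
// Hypothetical sketch: round 'value' up to the next multiple of 'alignment'.
// The usual pow2 trick ((v + a - 1) & ~(a - 1)) requires a power-of-2
// alignment; plain division works for any alignment.
template <typename T>
constexpr T align2(T value, T alignment)
{
    return ((value + alignment - 1) / alignment) * alignment;
}

// e.g. align2(10u, 6u) == 12, where the pow2 trick would wrongly return 10
```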
- With harmonization between all texture types implemented, there is no difference between blit_engine_src and shader_read for supported formats
- Adds extra format filtering to ensure no conflicts when copying data
- While the mask for surface_a is at index 0, the surface cache expects the order to be maintained correctly!
Set the correct mask since surface store now checks each RTT individually
- Avoid silly broken tests due to queue_tag being called before pitch is initialized.
- Return actual memory range covered and exclude trailing padding.
- Coordinates in src are to be calculated with src_pitch, not required_pitch.
- This allows creating buffers with no MAP bits set which should ensure they are created for VRAM usage only
- TODO: Implement compute kernels to avoid software fallback mode for pack/unpack operations
- Fix 2D coordinate sampling of W coordinate.
W is actually HPOS.w and not 1. Z is however always 0.
- Optimize register usage a bit
Disassembling compiled SPV shows that global declaration results in fewer ops than using inout modifiers. Modifiers generate extra mov instructions.
- Fix reading of varying registers in FP
Different registers have different behavior
- Always write to varying registers. If a register is not written to, it is initialized to (0, 0, 0, 1)
- Reimplements two-sided lighting correctly without hacks
- Also bumps shader cache version
- Properly commit orphaned blocks without invalidating existing cache structures
- Do not ignore overwritten objects when committing as unprotected fbo. Avoids stale references to invalidated surface objects.
- Load into memory as straightforward BGRA
- Fixes a bug in vulkan caused by byte shuffling in blit engine vs shader access
- Removes the need for memory shuffling when transferring into a rendertarget
- Implements render target data load (aka Read Color Buffer/Read Depth Buffer)
- Refactors vulkan surface barrier to be much cleaner.
- Removes redundant surface barrier invocations after doing a merged load
from surface cache.
- Adds explicit access modes when gathering surfaces from cache.
- Further improve aliased data preservation by unconditionally scanning.
It is possible for cache aliasing to occur when doing a memory split.
- Also sets up for RCB/RDB implementation
- After splitting, the sections may not be referenced at all for anything other than just pixel storage
- In such cases, either merge down or sample from the upstream source instead
- Texel borders are no longer actually supported in modern APIs
- Removes the border texels and uses border color instead which is incorrect but should work fine
vm::spu max address was overflowing, resulting in issues, so cast to u64 where needed. Fixes #6145.
Use vm::get_addr instead of manually subtracting vm::base(0) from a pointer in texture cache code.
Prefer std::atomic_thread_fence over _mm_?fence(), adjust usage to be more correct.
Used sequentially consistent ordering in semaphore_release for the TSX path as well.
Improved memory ordering for sys_rsx_context_iounmap/map.
Fixed sync bugs in HLE gcm because of not using atomic instructions.
Use a release memory barrier in lwsync for PPU LLVM; according to the Xbox 360 programming guide, lwsync is a hw release memory barrier.
Also use release barrier where lwsync was originally used in liblv2 sys_lwmutex and cellSync.
Use acquire barrier for isync instruction, see https://devblogs.microsoft.com/oldnewthing/20180814-00/?p=99485
Prefer vm::ptr<>::ptr over vm::get_addr.
Prefer vm::_ptr/base over vm::g_base_addr with offset.
Added methods atomic_t<>::bts and atomic_t<>::btr (see the sketch after this list).
Removed obsolete rsx::thread::Read/WriteIO32 methods.
Removed wrong check in semaphore_release.
Added handling for PUTRx commands for RawSPU MFC proxy.
Prefer overloaded methods of v128 instead of _mm_... in VPKSHUS ppu interpreter precise.
Fixed more potential overflows that may result in wrong behaviour.
Added io/size alignment check for sys_rsx_context_iounmap.
Added rsx::constants::local_mem_base which represents RSX local memory base address.
Removed obsolete rsx::thread::main_mem_addr/ioSize/ioAddress members.
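A sketch of the bts/btr semantics over std::atomic; atomic_t<> is the project's own type, so the real signatures may differ:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical sketch: atomically set (bts) or clear (btr) one bit and return
// its previous value, mirroring the x86 bts/btr instructions.
static bool bts(std::atomic<uint32_t>& value, uint32_t bit)
{
    const uint32_t mask = 1u << bit;
    return (value.fetch_or(mask, std::memory_order_acq_rel) & mask) != 0;
}

static bool btr(std::atomic<uint32_t>& value, uint32_t bit)
{
    const uint32_t mask = 1u << bit;
    return (value.fetch_and(~mask, std::memory_order_acq_rel) & mask) != 0;
}
```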
- Remove string comparisons from the hot-path!
- Use attribute streaming and push constants to avoid forcing a descriptor block copy every other draw call/pass.
While this isn't so bad on nvidia cards, it makes AMD cards a slideshow.
- Multiple header files were missing #includes for other headers used within them. The correct headers happened to be included in the correct order in source files, which is why everything compiled.
- Added missing #includes so header files correctly include all their dependencies; this fixes problems with IDEs being unable to parse headers correctly due to missing symbols
- TODO: Option to completely skip clamping in some architectures as it is not needed in most games
- Mostly affects older GPUs that do not have access to native fp16
- Ensures the current renderpass matches the image properties even when a cyclic reference is detected
- Solves SDK debug output error spam due to mismatching layouts and renderpasses
- Transition attachments to LAYOUT_GENERAL in case of a feedback loop
- Fixes appearance of garbage along polygon edges in some
post-processing passes.
- Also reverse this transition when rendering goes back to normal
- Allows render targets to behave like stacked 3D views, the same way shader inputs are resolved
- Basically implements most of the 'Read Color/Depth Buffers' option for 'free'.
- Allows splitting RTV/DSV resources if they are superseded by a partial surface
- Also allows intersecting new resources through the surface cache for proper inheritance from other scattered data
- TODO: Refactor bind_surface_as_rtt and bind_surface_as_ds to reduce asinine code duplication
TODO: Investigate the _s input modifier behaviour further, in case it can avoid generating zeroes from a MAD instruction.
x = MAD(+ve, -ve, -ve) with _s input modifier in BFBC expects result to be Non-zero
- Properly test for NaN and Inf when clamping down to fp16
- Optimize divsq a bit; mix(vec, vec, bvec) emits OpSelect which is what
we want here, instead of component-wise selection which is much slower.
- While mul(0, nan) = nan and 0 / 0 = nan, 0 / sqrt(0) = 0 because of hw
gremlins. normalize(0) is also nan so this behaviour does not work
around that particular case either which makes it even more baffling.
- The hw generates inaccurate values when doing perspective-correct
interpolation of vertex output attributes and makes the comparison (a ==
b) fail even when they are a fixed constant value.
- Increase equality tolerance when doing comparisons in fragment
shaders for NV cards only to work around this issue.
- Teepo fix
- The fixed-point D24S8 format does special Z clamping during compare which matches PS3 behaviour
- D32S8 is a floating point format and comparison with Dref > 1 always fails causing black edges/borders
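A sketch of the workaround implied above; the helper is hypothetical and the real change may differ in detail:

```cpp
#include <algorithm>

// Hypothetical sketch: fixed-point D24S8 clamps the compare reference to
// [0, 1] before the test (PS3 behaviour); float D32S8 does not, so a Dref
// above 1 always fails. Clamping the reference restores the D24S8 result.
static float clamp_dref_for_float_depth(float dref)
{
    return std::clamp(dref, 0.f, 1.f);
}
```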
- Improve support for float16_t by minimizing mixed inputs to functions
(ambiguous overloads)
- Minimize amount of downcasts in code by using opcode flags
- Re-enable float16_t support for vulkan
- Emulating f16 with f32 is not ideal and requires a lot of value clamping
- Using native data type can significantly improve performance and accuracy
- With openGL, check for the compatible extensions NV_gpu_shader5 and
AMD_gpu_shader_half_float
- With Vulkan, enable this functionality in the deviceFeatures if applicable (VK_KHR_shader_float16_int8 extension; see the sketch after this list)
- Temporarily disable hw fp16 for vulkan
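For reference, a minimal sketch of the Vulkan feature query using the VK_KHR_shader_float16_int8 structures (physical_device is assumed to be a valid handle):

```cpp
#include <vulkan/vulkan.h>

// Query native fp16 shader support; if present, the same struct can be
// chained into VkDeviceCreateInfo::pNext so vkCreateDevice enables it.
static bool supports_shader_float16(VkPhysicalDevice physical_device)
{
    VkPhysicalDeviceFloat16Int8FeaturesKHR f16_features = {};
    f16_features.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FLOAT16_INT8_FEATURES_KHR;

    VkPhysicalDeviceFeatures2 features2 = {};
    features2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FEATURES_2;
    features2.pNext = &f16_features;
    vkGetPhysicalDeviceFeatures2(physical_device, &features2);

    return f16_features.shaderFloat16 == VK_TRUE;
}
```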
- When reverse scanning, offsets are inverted and offset value of 0 is logically equivalent to an offset of -1
- Add an explicit message if clipping happens to avoid silent errors/bugs
- Revert to using block metrics, but with optional per-channel decode
stage for the final transfer. Much cleaner than hacking in the width to
be in channels instead of blocks.
- Removes CPU-only transforms that broke GPU-side code.
-- Channels in GPU compute are laid out in cell-order, but CPU was uploading in favorable order and compensating with swizzles.
-- This leads to 2 different layouts depending on the location of the data (CPU vs GPU)
- Implement R8G8_R8B8 interleaved format decode
- General improvements
- Do not round up sub-pixel offsets, round down instead
- Do not allow incomplete sources for hw blit transfer
- Reimplement src clipping (slice_h)
- Check 'area' of incoming texels and correct for them before RTT lookup/transfer
- Filter out incomplete targets when performing RTT lookup (1 texel or less contribution)
- If a transfer writes to a RTT and depth mismatch happens, create a local target and the upload function will likely resolve between the two
- If a surface is rejected, reset the target region!
- Also refactors some bpp handling code
- Simplify texture intersection test to use a normalized/uniform coordinate space
- Fix broken bounds checking as well
- Batch dma transfers whenever possible and do them in one go
- vk: Always ensure that queued dma transfers are visible to the GPU before they are needed by the host
Requires a little refactoring to allow proper communication of the commandbuffer state
- vk: Code cleanup; the simplified mechanism makes it so that it's not necessary to pass tons of args to methods
- vk: Fixup - do not forcefully do dma transfers on sections in an invalidation zone! They may have been speculated correctly already
- Properly wait for the buffer transfer operation to finish before map/readback!
- Change VkFence to VkEvent, which works more like a GL fence, which is what is needed (see the sketch after this list).
- Implement supporting methods and functions
- Do not destroy fence by immediately waiting after copying to dma buffer
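A minimal sketch of the VkEvent pattern referenced above; device and cmd are assumed to be valid handles, and recording/polling are shown together only for brevity:

```cpp
#include <vulkan/vulkan.h>

// Hypothetical sketch: signal an event from the command buffer once the DMA
// copy completes, then poll it host-side before mapping for readback. Unlike
// a VkFence wait, this behaves like a GL sync object and needs no
// destroy/reset dance around every wait.
static void wait_for_dma(VkDevice device, VkCommandBuffer cmd)
{
    VkEvent dma_done = VK_NULL_HANDLE;
    VkEventCreateInfo info = { VK_STRUCTURE_TYPE_EVENT_CREATE_INFO, nullptr, 0 };
    vkCreateEvent(device, &info, nullptr, &dma_done);

    // Recorded after the copy commands in 'cmd':
    vkCmdSetEvent(cmd, dma_done, VK_PIPELINE_STAGE_TRANSFER_BIT);

    // Host side, after submitting 'cmd' and before mapping the dma buffer:
    while (vkGetEventStatus(device, dma_done) != VK_EVENT_SET)
    {
        // spin/yield until the GPU executes the set-event command
    }
    vkDestroyEvent(device, dma_done, nullptr);
}
```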
- Avoids blindly reusing blit dst sections as they may contain garbage.
If a section was unlocked for a flush, just discard it, as its reuse introduces potential data corruption.
Since the data needs to be reuploaded anyway (for now), it's better to start afresh
- In case of format mismatch, reset the calculated dst block
- Add a bounds check to determine if data contained in an atlas is good enough for sampling the cache.
If not enough data is provided, fall back to full upload
- Use a 5-point tap with an X pattern across the target's memory space to reduce chances of false positives
- TODO: Potential false positives identified, requires some minor
restructuring of surface_store
- Apply dither to edges that almost fail the straight-up alpha test (see the sketch after this list)
- Significantly improves alpha tested geometry far from the camera
- Also removes blend factor overrides/hacks as they give incorrect results due to background bleeding
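A rough scalar sketch of the dithering idea, assuming a Bayer-matrix threshold; the actual shader logic may differ:

```cpp
// Hypothetical sketch: rather than a hard (alpha >= ref) cut, fragments whose
// alpha lands near the reference pass in a screen-position dither pattern,
// which softens distant alpha-tested geometry without blend overrides.
static const float bayer4x4[4][4] = {
    {  0.f/16,  8.f/16,  2.f/16, 10.f/16 },
    { 12.f/16,  4.f/16, 14.f/16,  6.f/16 },
    {  3.f/16, 11.f/16,  1.f/16,  9.f/16 },
    { 15.f/16,  7.f/16, 13.f/16,  5.f/16 },
};

static bool alpha_test_dithered(float alpha, float ref, int x, int y)
{
    // Shift the effective threshold per pixel: clear passes/fails behave as
    // before, only the borderline band gets dithered.
    const float noise = bayer4x4[y & 3][x & 3] - 0.5f; // in [-0.5, +0.4375]
    return alpha >= ref + noise * 0.5f;                // band width of 0.5 assumed
}
```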
- Index offset is ignored anyway and only used to calculate vertex attribute divisor index
- Specialized optimization for untouched xfer without primitive restart
- Avoid tagging and rely on read/write barriers and the dirty flag mechanism. Testing is done with a weak 8-byte memory test
- Introducing new data when tagging breaks applications with race conditions where tags can overwrite flushed data
- gl: Include an execution state wrapper to ensure state changes are consistent. Also removes a lot of required 'cleanup' for helper methods
- texture_cache: Make execution context a mandatory field as it is required for all operations. Also removes a lot of situations where a duplicate argument is added in for both fixed and vararg fields
- Explicit read/write barrier for framebuffer resources depending on
usage. Allows for operations like optional memory initialization before
reading
- If draw call resources consume memory that intersects with NA parts of the texture cache, we get a framebuffer test mismatch.
This mismatch is false and happens because the thread has not yet reached the point of relocking the pages
- Implicitly invoke a memory barrier if actively reading from an unsynchronized texture
- Simplify memory transfer operations
- Should allow more games to work without strict mode
- Do not bind companion framebuffer when clearing single aspect; let the
contest mechanism sort it out instead
- Do not prematurely tag framebuffers, instead only do so at
write-confirmation time. Should avoid false tagging if setup does not
allow a render to occur.
- Implements a mirror view of D24S8 data that accesses the stencil components.
Finishes the implementation of TEX2D_DEPTH_RGBA as the stencil component was previously missing from the reconstructed data
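A minimal sketch of such a mirror view in Vulkan; device and depth_stencil_image are assumed to be valid:

```cpp
#include <vulkan/vulkan.h>

// Create a second view over the same D24S8 image exposing only the stencil
// aspect, so the reconstructed TEX2D_DEPTH_RGBA value can combine depth
// (from the depth view) with stencil (from this one).
static VkImageView make_stencil_view(VkDevice device, VkImage depth_stencil_image)
{
    VkImageViewCreateInfo info = { VK_STRUCTURE_TYPE_IMAGE_VIEW_CREATE_INFO };
    info.image = depth_stencil_image;
    info.viewType = VK_IMAGE_VIEW_TYPE_2D;
    info.format = VK_FORMAT_D24_UNORM_S8_UINT;
    info.subresourceRange = { VK_IMAGE_ASPECT_STENCIL_BIT, 0, 1, 0, 1 };

    VkImageView view = VK_NULL_HANDLE;
    vkCreateImageView(device, &info, nullptr, &view);
    return view;
}
```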
- Add a few missing destructors
Image classes are inherited a lot and I forgot to make the dtors virtual
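The issue in miniature, as a hedged example:

```cpp
// Without a virtual destructor in the base, deleting a derived image through
// a base pointer skips the derived cleanup (leaking views/memory).
struct image_base
{
    virtual ~image_base() = default; // the fix: make the dtor virtual
};

struct texture_image : image_base
{
    ~texture_image() override { /* release API handles here */ }
};
```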
- Per-channel conditional execution introduces RAW hazards all over the place
- It's cheaper to process both branches and select between the two
- Also improves ShaderVariable functionality to support features such as match_size and taking complex variables as inputs
To avoid the need (and performance hit) of Read Color/Depth Buffers, we
may not invalidate overlapping fbos inside lock_memory_region unless
they are guaranteed to be superseded by the new one.
This avoids e.g. issues with overblooming, among others.
Fixes VRAM leaks and incorrect destruction of resources, which could
lead to driver crashes.
Additionally, lock_memory_region is now able to flush superseded
sections. However, due to the potential performance impact of this
for little gain, a new debug setting ("Strict Flushing") has been
added to config.yaml
- Improve vertex attribute layout format. Allows for full 16-bit attribute divisor
- Use actual pitch when declaring framebuffer rsx pitch instead of register value in case of swizzle? rendering
- Also fix visual corruption when using disjoint indexed draws
- Refactor draw call emit again (vk)
- Improve execution barrier resolve
- Allow vertex/index rebase inside begin/end pair
- Add ALPHA_TEST to list of excluded methods [TODO: defer raster state]
- gl bringup
- Simplify
- using the simple_array gets back a few more fps :)
- Orders flushing to preserve memory at all cost
- Avoids false positive where flushing overlapping sections can falsely invalidate another with head/tail test
- Forcefully downloads and reuploads data from the CPU in case of unexpected overlaps
- Properly detect correct size of newly created blit targets
- Remember to clear any existing views when changing the default component map!
- NOTE: The address swizzle index is only for use as src. The address registers are only used one channel at a time.
- When used as the destination of ARL, the encoding is the same as for the other temp registers
- Retag resources reprotected under flush_always rules
- Properly check for blit resource fitting taking into account format
mismatch, pitch mismatch and typeless transfers
- The x value contains the VP output value interpolated across primitive surface
- The y coordinate contains the fog fraction according to the selected fog formula
- Tags framebuffer resources on first use (when on_write is called to verify memory)
- Texture cache now selects the best match and even sorts atlas writes by memory write order to avoid older data showing over newer data
- Some formats are proven to ignore swizzle flag
- DXT compressed textures
- COMPRESSED_BG_GB class textures
- Some applications are using swizzled wide integer formats so those are confirmed to swizzle
invalidate_range_impl_base does not mark all textures that will only be
unprotected as dirty when doing a deferred flush, since that is done by
flush_all.
However, if there are no sections to flush, the deferred flush will
use the same code path as non-deferred flushes for unprotecting textures
and forget to mark them as dirty.
This commit fixes this bug.
The existing implementation restarts the loop immediately after
finding a range_data instance that updates the trampled_range.
This commit refactors this method to continue the loop with the updated
trampled_range, and then repeat only those range_data instances that
were iterated through before the trampled_range was last updated.
As a result, the number of total iterations required is reduced.
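A sketch of the refactored scan, assuming ranges live in a flat container (all names hypothetical). Iterating circularly until a full lap passes without growth is equivalent to re-visiting only the entries seen before the last update:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct range_t { unsigned start, end; };
struct range_data { range_t range; };

static bool overlaps(const range_t& a, const range_t& b)
{
    return a.start < b.end && b.start < a.end;
}

// Hypothetical sketch: expand 'trampled' with every overlapping entry without
// restarting from the beginning after each update.
static void expand_trampled_range(const std::vector<range_data>& entries, range_t& trampled)
{
    const std::size_t count = entries.size();
    std::size_t i = 0, since_update = 0;
    while (since_update < count)
    {
        const range_t& r = entries[i].range;
        if (overlaps(r, trampled) && (r.start < trampled.start || r.end > trampled.end))
        {
            trampled = { std::min(r.start, trampled.start), std::max(r.end, trampled.end) };
            since_update = 0; // earlier entries get exactly one more look
        }
        i = (i + 1) % count;
        ++since_update;
    }
}
```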