- The situation was clarified in the official Vulkan spec to allow this
behavior.
Barriers are now only inserted by the driver when layout
transitions are requested.
- Fixes a data leak that can happen when a surface is rejected due to aspect mismatch.
- A mismatch can lead to rejection because the covered area excludes the RTT, followed by an inevitable texture upload from the CPU at the same location.
- Overlapping fbo/shader_read resources are not allowed.
Use -fno-exceptions in cmake.
On MSVC, define _HAS_EXCEPTIONS=0.
Cleanup throw/catch from the source.
Create yaml.cpp enclave because it needs exceptions to work.
Disable thread_local optimizations in logs.cpp (TODO).
Implement cpu_counter for cpu_threads (moved globals).
- Detect writes to the display output memory and handle it specially.
It already defines a known 2D region.
- Try and detect situations where raw transfers would be of benefit.
- It is possible to have a RTV<->DSV transfer with compatible-sized formats.
Mark the depth size as typeless in such a situation to avoid crossing the aspect barrier with the API.
- Avoids a situation where a game configures output correctly but gets back bogus information later when querying.
- Should fix games being broken at some resolutions but not others.
- Both ZCULL stats and ZPASS stats require hardware queries, but
ZCULL stats should not contribute to ZPASS stats and vice versa!
- Disables hardware queries for ZCULL stats by themselves, we cannot
generate them correctly anyway and no game so far has been found to
actually use them. Should lessen the load on the backend for games
that do not actually require it.
- System widgets are callable from outside RSX code.
- Responding to draw requests while setup is in progress can cause malformed cached output
- Fixes glitched layouts for system message dialogs
- Add support for Hangul blocks (Korean)
- Restructure font fallback system to allow the user to 'install' fonts if missing.
Should allow fonts to work with no firmware on open systems like Linux.
- Gets JP glyphs to render correctly, but the generalization may negatively affect other CJK glyph sets.
PS3 doesn't seem to use other glyph sets much however.
- Take viewport offset into account when applying window transforms.
This is necessary because gl_FragCoord is based on the framebuffer and not the viewport.
- Allows sections reclaimed by the surface store due to overlap/inheritance to be identified and removed.
- Additionally, potentially lowers the number of flushes required per block with multiple overlaps improving efficiency and theoretically performance.
Allow up to 30 times the vblank rate and clocks scale of the original PS3.
In theory, a game limited to 60 fps whose frame limit scales perfectly with the vblank rate can be played at up to 1800 fps with this change.
And:
* Fixed lv2 sleep with Clocks Scaling
* Make these settings dynamically adjustable.
* Avoid code duplication
- Reject writes to RTT if the source data is of unknown origin.
Non-RTT data that is only 1 line in length is suspicious and is often GPU data like programs or other rendering inputs.
- Attempt to identify blit operations that will be flushed immediately
after and just do them on CPU instead if the transformation is trivial.
- If only a single blit section is contributing to an atlas merge op, the
threshold should be 100%. The only acceptable result here is a
truncation.
- Raise passing 'score' from 50% to 90% to filter out very incomplete
merge operations.
- Catch unfit sections passing the match test; possible for blit_dst
data but will likely be always harmless. Disabled in release builds by default.
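The merge thresholds above can be sketched roughly as follows. This is only an illustration, not the texture cache code: the function name `atlas_merge_passes`, its parameters, the use of texel counts as the coverage metric, and treating exactly 90% as passing are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: decides whether an atlas merge has enough coverage.
// covered_texels / total_texels is the fraction of the atlas that the
// contributing sections actually fill.
bool atlas_merge_passes(std::uint64_t covered_texels, std::uint64_t total_texels,
                        std::size_t section_count)
{
    assert(total_texels != 0 && covered_texels <= total_texels);

    if (section_count == 0)
        return false;

    // A single contributing section must cover the whole atlas (100%).
    if (section_count == 1)
        return covered_texels == total_texels;

    // Otherwise require at least 90% coverage (assumed inclusive).
    // Integer math avoids floating-point rounding at the boundary.
    return covered_texels * 10 >= total_texels * 9;
}
```

With these assumptions, two sections covering 90 of 100 texels pass, while a single section covering 99 of 100 is rejected.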
- Some games use texture semaphore for zcull sync, which is rather bizarre.
However, it works on real hardware as the depth test happens before fragment shader completion.
- Due to the high performance penalty incurred by this act, this
behavior is only enabled by the "strict rendering mode" option.
- Adds support for partial (letterboxed) source images by taking insets
into account.
- Bugfix for potential access violation when capturing screenshot on
Vulkan
- Adds the same optimization/simplification steps to complex image
transfer routines. Whenever possible, multi-step transfers are collapsed
into a single operation.
- Avoid double transfers where a transfer to a temp image is done
without scaling and then a secondary transfer follows. Combines the two
steps into one whenever possible which can significantly alleviate
bandwidth problems at higher resolutions. Significant speedup, up to 90%
in some cases (PDF, PDF2)
- Interpolating floats is not the same as interpolating their bits!
Use integer format to interpolate linearly for D32F formats instead of using R32F as intermediary
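The first point can be checked numerically. The sketch below only shows that averaging two float values and averaging their raw bit patterns give different results; it does not reproduce the D32F transfer path, and all function names are hypothetical. It assumes 32-bit IEEE 754 floats.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(std::numeric_limits<float>::is_iec559 && sizeof(float) == 4,
              "this sketch assumes 32-bit IEEE 754 floats");

// Reinterpret a float's storage as a 32-bit unsigned integer.
std::uint32_t float_to_bits(float f)
{
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof(u));
    return u;
}

// Reinterpret a 32-bit unsigned integer as a float.
float bits_to_float(std::uint32_t u)
{
    float f;
    std::memcpy(&f, &u, sizeof(f));
    return f;
}

// Midpoint computed on the float values.
float midpoint_of_values(float a, float b)
{
    return a + (b - a) * 0.5f;
}

// Midpoint computed on the raw bit patterns, treated as integers.
float midpoint_of_bits(float a, float b)
{
    const std::uint64_t sum = static_cast<std::uint64_t>(float_to_bits(a)) + float_to_bits(b);
    return bits_to_float(static_cast<std::uint32_t>(sum / 2));
}

// 1.0f is 0x3F800000 and 4.0f is 0x40800000; their integer average is
// 0x40000000, which is 2.0f, while the arithmetic midpoint is 2.5f.
void check_midpoints_differ()
{
    assert(midpoint_of_values(1.0f, 4.0f) == 2.5f);
    assert(midpoint_of_bits(1.0f, 4.0f) == 2.0f);
}
```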
- Queueing commands on the offloader is a good idea but unfortunately
page faults can still happen causing a cyclic dependency and eventual
deadlock. Characterized by a vk::wait_for_event timed out error
accompanied by severe hitching.
- Drain the fault-able commands before pushing a submit operation to the
queue. If a fault is in progress, bypass the queue system and submit
raw. Technically this is incorrect but there isn't much that can be
done about it right now.
- When using partial results on NVIDIA, a non-zero result is returned even when the draw is fully occluded.
This, I believe, violates spec which says the partial result shall be between 0 and the final result.
- Adds color interpolation and modulation pass and refactors the code a
bit. Elements with this pass applied have their color modulated by the
animated color from the pass. Modulation transform is multiplicative.
Instead of speed, direction and distance, the user now specifies
start/end offsets and how much time the transition should take.
Fixes:
- Stuttering caused by framerate estimation.
- An edge case where animations would go over their supposed limit.
Adds:
- The ability to specify arbitrary easing functions for the animations
- Implemented quadratic ease in and ease out and cubic ease in/out.
- Usage of cubic ease in/out in the trophy notification
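The start/end offset model with pluggable easing can be sketched as below. The easing formulas are the standard quadratic and cubic curves; the `animate_offset` helper, its signature, and the use of seconds as the time unit are assumptions for illustration and are not taken from the overlay code.

```cpp
#include <cassert>

// Quadratic ease-in: starts slow and accelerates. t is normalized time in [0, 1].
double ease_in_quad(double t)
{
    return t * t;
}

// Quadratic ease-out: starts fast and decelerates.
double ease_out_quad(double t)
{
    return t * (2.0 - t);
}

// Cubic ease-in/out: slow at both ends, fastest in the middle.
double ease_in_out_cubic(double t)
{
    if (t < 0.5)
        return 4.0 * t * t * t;

    const double u = -2.0 * t + 2.0;
    return 1.0 - (u * u * u) / 2.0;
}

using easing_fn = double (*)(double);

// Hypothetical translate helper: maps elapsed time onto a position between
// the start and end offsets using the given easing function.
double animate_offset(double start, double end, double elapsed, double duration, easing_fn ease)
{
    assert(duration > 0.0);

    double t = elapsed / duration;
    if (t < 0.0) t = 0.0;
    if (t > 1.0) t = 1.0; // clamping keeps the animation from overshooting its end offset

    return start + (end - start) * ease(t);
}
```

For example, `animate_offset(10.0, 20.0, 0.5, 1.0, ease_in_quad)` yields 12.5, and any elapsed time past the duration returns the end offset.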
- Slightly increases the size of the trophy dialog and the font size.
The old dimensions did not work with some libre fonts causing
alignment errors and other problems.
- Adds animation support. This commit adds the base framework and
implements a translate animation used to slide elements around the
screen. This is then used to implement the sliding animation for the
trophy notification.
- Properly synchronize DMA transfers when handling RSX pipeline
barriers. Texture read barrier is used to signify completion of DMA
routines and is often used to signal that Cell can overwrite vertex
data!
* Prefer default initializer over std::memset 0 when possible and more readable.
* Use std::format in trophy files name obtaining.
* Use vm::ptr<>::operator bool() instead of comparing vm::ptr to vm::null or using addr().
* Add a few std::memset calls in hle where it matters (or in some places just to document an actual firmware memcpy call).
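As a small illustration of the first point, value-initialization with `{}` zero-fills a plain aggregate just as `std::memset` does, without the size argument that can be mistyped. The `trophy_details` struct and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical plain-data structure, used only for this example.
struct trophy_details
{
    std::uint32_t id;
    std::uint32_t grade;
    char name[16];
};

// Older style: declare, then clear the bytes manually.
trophy_details zeroed_with_memset()
{
    trophy_details info;
    std::memset(&info, 0, sizeof(info));
    return info;
}

// Preferred style: the empty initializer zero-initializes every member.
trophy_details zeroed_with_initializer()
{
    trophy_details info{};
    return info;
}

// Both approaches produce an all-zero object.
void check_both_are_zeroed()
{
    const trophy_details a = zeroed_with_memset();
    const trophy_details b = zeroed_with_initializer();
    assert(a.id == 0 && a.grade == 0 && a.name[0] == '\0');
    assert(b.id == 0 && b.grade == 0 && b.name[15] == '\0');
}
```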
- A few nagging issues remain, specifically that partial command stream
largely caused by poor synchronization structures for partial CS flush
and also the fact that occlusion map entries wait on a command buffer
and not an EID!
- Prefer lazy retire model. Sync commands are sent out and the reports will be
retired when they are available without forcing.
- To make this work with conditional rendering, hardware support is
required where the backend will automatically determine visibility by
itself during rendering.
Round-to-nearest integral based division, optimized for unsigned integral.
Used in sceNpTrophyGetGameProgress.
Do not allow signed values for aligned_div(), align().
This implementation optimises correctly on all relevant compilers,
unlike GSL’s which gave extremely slow code on any compiler other than
MSVC.
Supersedes #6948.
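A minimal sketch of what unsigned-only rounding division helpers can look like is shown below. The function names follow the text, but the signatures and bodies are assumptions rather than the code from this change. Note that the intermediate sums can wrap for values close to the type's maximum.

```cpp
#include <cassert>
#include <type_traits>

// Round-to-nearest division for unsigned integers (exact halves round up).
template <typename T>
constexpr T rounded_div(T value, T divisor)
{
    static_assert(std::is_unsigned<T>::value, "unsigned integral types only");
    assert(divisor != 0);
    return (value + divisor / 2) / divisor;
}

// Division that rounds up: how many blocks of `divisor` are needed to hold `value`.
template <typename T>
constexpr T aligned_div(T value, T divisor)
{
    static_assert(std::is_unsigned<T>::value, "unsigned integral types only");
    assert(divisor != 0);
    return (value + divisor - 1) / divisor;
}
```

Restricting the helpers to unsigned types keeps the expression a plain add and divide, with no sign handling for the compiler to work around.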
- When the point sprite flag is set, overrides the input similar to the
2D mask. The returned X and Y values are always the gl_PointCoord values
for the fragment.
- Stacks with the 2D mask to override the z and w coordinates.
- Adhere to workgroup count limits as exposed by the GPU vendor.
They already execute properly even when going beyond the limits but this removes validation noise.
- Fix invocation counts for deswizzle kernel. The count was incorrect if blocksize was not 4, causing a bunch of useless work to be done.
- Handles all LODs per layer meaning cubemaps are now fully handled in 6 passes instead of 6 * (log2(width)) passes.
- Handles all LODs of a 3D texture in one pass as well.
- The improvements do warrant dropping down the number of allowed compute invocations a bit
- Remove use of uniform buffers for compute static data. Use push
constants instead.
- Minor touchups to the deswizzle code to avoid redundant data copies.
- Allow delaying report flushes triggered by image_in or buffer_notify
- When the report is ready, all the delayed transfers will automatically
be done.
- TODO: Make this configurable?
* rsx: Optimise primitive_restart::upload_untouched() with SSE4.1
This optimisation is only applied when skip_restart is false.
I’ve only tested the u16 codepath, as it is the one used in NieR.
In some very unscientific profiling, this function used to take 2.76% of
the total frame time at the save point of the port town, it now takes
about 0.40%.
* rsx: Mark all SSE4.1 functions with attributes on gcc and clang
This assures the compiler we will take care of only calling these
functions after having checked that the CPU does support these
instructions.
* rsx: Add an AVX2 implementation of primitive restart ibo upload
* rsx: Remove redefinition of SSE4.1 instructions
Now that clang is aware that our functions are compiled with SSE4.1, it
lets us generate this code using its intrinsics.
* rsx: Optimise vector to scalar conversion
This is done using minpos and srli intrinsics and generates less code
than before.
Thanks Nekotekina for the suggestion!
- Simplify active instance management. While multicontext support will
be required in future, this is better done with multiple logical devices
rather than multiple instances.
- Destroy the WSI surface on exit
- Enable depthBoundsTest explicitly. TODO: Properly check for supported
features.
subresource_layout::dim_in_texel
- These two are not always linked when working with compressed textures.
The actual texels extend past the actual size of the image if the size
is not aligned. e.g. if height is 1, the real height is 4, but it's not
possible to determine this from the aligned size. It could be 1, 2, 3 or
4 for example.
- Fixes image out-of-bounds writes when uploading from CPU
This lowers the relative cost of this function from ~2.25% to ~1.80% on
gcc 9 which I found quite surprising, some of it probably gets inlined
better in the callers, but I haven’t been able to isolate which parts.
The current behaviour when going fullscreen from windowed was to keep
the previous size of the swapchain, with black borders on all sides,
which looks quite ugly.
The root of this issue is that rpcs3 only checks for frame resize if
vkQueuePresent() returns VK_SUBOPTIMAL_KHR, which drivers can’t do on
Wayland, see https://gitlab.freedesktop.org/mesa/mesa/issues/1979
- Noticed a glitch on AMD hw and windows drivers where discard seems to affect entire 4x4 cells.
- Dead fragments (outside the primitive boundary) could have their discards trigger as they do not have proper access to variables.
- This introduces dead fragments along triangle edges, causing a diagonal line pattern across the screen that is very annoying.
- Renormalizes arbitrary N-bit values as 8-bit normalized.
- NV hardware performs integer normalization at 8 bits if the size is less than 8.
- This can cause significant arithmetic drift because the error is multiplied by a huge number when sampling.
* allow sys_rsx_device_map to be called twice: in this case the DEVICE address retrieved by the previous call is returned
* Add ENOMEM checks for sys_rsx_memory_allocate and sys_rsx_context_allocate
* add EINVAL check for sys_rsx_context_allocate if memory handle is not found
* Separate sys_rsx_device_map allocation from sys_rsx_context_allocate's
* Implement sys_rsx_memory_free; used by cellGcmInit upon failure
* Added context_id checks
* Throw if sys_rsx_context_allocate was called twice.
- If either source data or dest is a render target, do image operations on the GPU same as before
- If swizzle is desired, use CPU fallback
- If no scaling and no format conversion is required, use CPU fallback
- If scaling is desired and the transfer target is in local memory, use the GPU
- When doing trivial copies, use the routine in rsx_methods instead of
duplicating code. Also has the benefit of better range checking.
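The selection rules above can be read as an ordered decision chain. The sketch below encodes them in that order; the `transfer_request` struct, the enum, and the final default branch (for cases the rules do not mention) are assumptions made for illustration.

```cpp
#include <cassert>

// Hypothetical description of an image transfer request.
struct transfer_request
{
    bool src_is_render_target;
    bool dst_is_render_target;
    bool swizzled;
    bool scaled;
    bool format_conversion;
    bool dst_in_local_memory;
};

enum class transfer_path { gpu, cpu_fallback };

// Applies the rules in the order they are listed.
transfer_path choose_transfer_path(const transfer_request& req)
{
    if (req.src_is_render_target || req.dst_is_render_target)
        return transfer_path::gpu;

    if (req.swizzled)
        return transfer_path::cpu_fallback;

    if (!req.scaled && !req.format_conversion)
        return transfer_path::cpu_fallback;

    if (req.scaled && req.dst_in_local_memory)
        return transfer_path::gpu;

    // Not covered by the listed rules; defaulting to the CPU is an assumption.
    return transfer_path::cpu_fallback;
}

// A render target on either side wins over every other rule.
void check_render_target_priority()
{
    const transfer_request req{true, false, true, false, false, false};
    assert(choose_transfer_path(req) == transfer_path::gpu);
}
```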
- When committing a block as fbo, keep blit_dst data as well.
- Avoids removing (and losing data from) blit targets that just happen to share a page with a framebuffer.
- Uncacheable resources can be reused as soon as they're made visible to the draw call.
- Since they're likely to be reused every draw call until the shader changes, it is important to reuse as much as possible
- If a stale reference is left lying around (e.g the texture bound to
depth has been deleted and we attach a color image) no operations
actually take place. glCheckFramebufferStatus also does not catch this
problem.
- Sometimes program-point-size is enabled, but the vs does not actually
write to the point size register. In this case, pass the incoming point
size along instead of the default register init.
- When compiling LLVM objects, it is possible to starve the driver thread and cause the timeouts to trigger
- Observed in RE6 when using SPU LLVM since the game generates a very large number of objects "infinitely"
- Allows frameskipping to occur naturally if RSX thread is bombarded with flip requests by just jumping to the last one if possible
- See request_emu_flip() for async frame submission and implicit skipping
- Also allows display queue to fill faster than the flip thread can drain the queue
Move se_t and se_storage to util/endian.hpp
Use single template instead of two specializations.
Add minor optimization for MSVC.
Remove v128 dependency.
Try to enable intrinsics for unaligned data.
Fix minor bug in u16/u32/u64 specializations.
- Separate displayed statistics from actual backend statistics.
Allows asynchronous flipping to work correctly as it just uses display stats.
The real stats are used by the frame scope marker to determine behavior like engaging the FIFO optimizer or skipping draw calls correctly.
- Add an explicit frame scope marker tied in with the queue_prepare command
Since queue_prepare is emitted at the end of a frame, it can be used as end-of-frame in games that emit this
- If this command is not emitted, fifo flattener and frameskip will not work
- Calculate exact sizes when doing hit tests to avoid false negatives
- Defer page checking until actually required to do memory setup
- Introduce align2 helper to do non-pow2 alignments
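The difference between a power-of-two alignment and a general one can be sketched as follows. The name `align2` comes from the text; its signature and body here, and the `align_pow2` counterpart, are assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Power-of-two alignment: the mask trick is only valid when `alignment`
// is a power of two.
constexpr std::uint32_t align_pow2(std::uint32_t value, std::uint32_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (value + (alignment - 1)) & ~(alignment - 1);
}

// General alignment: rounds up to the next multiple of any non-zero
// alignment at the cost of a division.
constexpr std::uint32_t align2(std::uint32_t value, std::uint32_t alignment)
{
    assert(alignment != 0);
    return ((value + alignment - 1) / alignment) * alignment;
}
```

With an alignment of 6, the mask form would compute `(10 + 5) & ~5`, which is 10 and not a multiple of 6, whereas `align2(10, 6)` gives 12.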
- Allow use of intrinsics when SSSE3 and SSE4.1 are not available in the build target environment
- Properly separate SSE4.1 code from SSSE3 code for some older processors without SSE4.1
- With harmonization between all texture types implemented, there is no difference between blit_engine_src and shader_read for supported formats
- Adds extra format filtering to ensure no conflicts when copying data
- While the mask for surface_a is at index 0, the surface cache expects the order to be maintained correctly!
Set the correct mask since surface store now checks each RTT individually
- Avoid silly broken tests due to queue_tag being called before pitch is initialized.
- Return actual memory range covered and exclude trailing padding.
- Coordinates in src are to be calculated with src_pitch, not required_pitch.
- This allows creating buffers with no MAP bits set which should ensure they are created for VRAM usage only
- TODO: Implement compute kernels to avoid software fallback mode for pack/unpack operations
- Fix 2D coordinate sampling of W coordinate.
W is actually HPOS.w and not 1. Z is however always 0.
- Optimize register usage a bit
Disassembling compiled SPV shows that global declaration results in less ops than using inout modifiers. Modifiers generate extra mov instructions.
- Fix reading of varying registers in FP
Different registers have different behavior
- Always write to varying registers. If a register is not written to, it is initialized to (0, 0, 0, 1)
- Reimplements two-sided lighting correctly without hacks
- Also bumps shader cache version
- Do not allow offloader to handle its own faults. Serialize them on RSX instead.
This approach introduces a GPU race condition that should be avoided with improved synchronization.
- TODO: Use proper GPU-side synchronization to avoid this situation
- Avoids memory appearing older when used for depth test without depth write
The write_barrier before the call will inherit new data but the tag will not update as no new information is added.
- Properly commit orphaned blocks not invalidating existing cache structures
- Do not ignore overwritten objects when committing as unprotected fbo. Avoids stale references to invalidated surface objects.
- Load into memory as straightforward BGRA
- Fixes a bug in Vulkan caused by byte shuffling in blit engine vs shader access
- Removes the need for memory shuffling when transferring into a rendertarget
- Implements render target data load (aka Read Color Buffer/Read Depth Buffer)
- Refactors vulkan surface barrier to be much cleaner.
- Removes redundant surface barrier invocations after doing a merged load
from surface cache.
- Adds explicit access modes when gathering surfaces from cache.
- Further improve aliased data preservation by unconditionally scanning.
It is possible for cache aliasing to occur when doing memory split.
- Also sets up for RCB/RDB implementation
vkAcquireNextImageKHR can also return VK_SUBOPTIMAL_KHR and is non-fatal.
However, it's a good idea to still recreate the swap chain later to maintain
optimal presentation paths after temporary occlusion.
- ZCULL queue was updated to one-per-cb but the conditional render sync hint was not updated.
- Do not unconditionally flush the queue unless the upcoming ref is contained in the active CB.
- This avoids spamming queue flush, which frees up resources and improves performance
- Merge viewport raster window and scissor into one clipping region
- Viewport raster clip is different from viewport geometry clipping in
hardware as the latter is configurable separately
- After splitting, the sections may not be referenced at all for anything other than just pixel storage
- In such cases, either merge down or sample from the upstream source instead
- Texel borders are no longer actually supported in modern APIs
- Removes the border texels and uses border color instead which is incorrect but should work fine
- Tagged eventIDs can be used to safely delete resources that are no
longer used
- TODO: Expand gc to collect images as well
- TODO: Fix the texture cache to avoid over-allocating image resources
- Fix a typo in OpenAL
- Fix typo in cellHttp.h
- Unused variables in catch
- Use 64-bit shifts
- Use use_count with shared pointers, unique is deprecated and getting removed
- Explicitly cast boolean to int
- Signed/unsigned issues with loop variables
- Fix missing return statement (the code path is unreachable, but compiler wants a return)
- */ outside of comment
- Fix duplicate layout name
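For the shared pointer item, `std::shared_ptr::unique()` was deprecated in C++17 and removed in C++20; the usual replacement is comparing `use_count()` with 1. A minimal sketch, with a hypothetical helper name:

```cpp
#include <cassert>
#include <memory>

// True if `ptr` is non-empty and is the only shared_ptr owning its object.
// In multithreaded code the count is only an approximation.
template <typename T>
bool is_sole_owner(const std::shared_ptr<T>& ptr)
{
    return ptr.use_count() == 1;
}

void sole_owner_example()
{
    auto first = std::make_shared<int>(42);
    assert(is_sole_owner(first));

    auto second = first; // a second owner now exists
    assert(!is_sole_owner(first));
}
```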
vm::spu max address was overflowing, causing issues, so cast to u64 where needed. Fixes #6145.
Use vm::get_addr instead of manually subtracting vm::base(0) from pointer in texture cache code.
Prefer std::atomic_thread_fence over _mm_?fence(), adjust usage to be more correct.
Used sequentially consistent ordering in semaphore_release for TSX path as well.
Improved memory ordering for sys_rsx_context_iounmap/map.
Fixed sync bugs in HLE gcm because of not using atomic instructions.
Use release memory barrier in lwsync for PPU LLVM, according to this xbox360 programming guide lwsync is a hw release memory barrier.
Also use release barrier where lwsync was originally used in liblv2 sys_lwmutex and cellSync.
Use acquire barrier for isync instruction, see https://devblogs.microsoft.com/oldnewthing/20180814-00/?p=99485
Prefer vm::ptr<>::ptr over vm::get_addr.
Prefer vm::_ptr/base over vm::g_base_addr with offset.
Added methods atomic_t<>::bts and atomic_t<>::btr .
Removed obsolete rsx::thread::Read/WriteIO32 methods.
Removed wrong check in semaphore_release.
Added handling for PUTRx commands for RawSPU MFC proxy.
Prefer overloaded methods of v128 instead of _mm_... in VPKSHUS ppu interpreter precise.
Fixed more potential overflows that may result in wrong behaviour.
Added io/size alignment check for sys_rsx_context_iounmap.
Added rsx::constants::local_mem_base which represents RSX local memory base address.
Removed obsolete rsx::thread::main_mem_addr/ioSize/ioAddress members.
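The kind of overflow addressed by the u64 casts above can be illustrated with a range check. This is a generic sketch, not code from the emulator; the function names and the `limit` parameter are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Buggy variant: addr + size is computed in 32 bits and wraps modulo 2^32,
// so a range that runs past the end of the address space appears to fit.
bool range_fits_u32(std::uint32_t addr, std::uint32_t size, std::uint64_t limit)
{
    const std::uint32_t end = addr + size;
    return end <= limit;
}

// Fixed variant: widen to 64 bits before adding so the sum cannot wrap.
bool range_fits_u64(std::uint32_t addr, std::uint32_t size, std::uint64_t limit)
{
    const std::uint64_t end = static_cast<std::uint64_t>(addr) + size;
    return end <= limit;
}

void check_wraparound()
{
    const std::uint64_t limit = 0x100000000ull; // end of a 4 GiB address space
    assert(range_fits_u32(0xFFFFF000u, 0x2000u, limit));  // wrong: end wrapped to 0x1000
    assert(!range_fits_u64(0xFFFFF000u, 0x2000u, limit)); // correct: end is 0x100001000
}
```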
-Indentation warnings
-prevent shift overflow
-This was declared extern in all contexts. Remove this for initialization
-Fix main return types. OH CANADA!
-Silence extraneous 'unused expression' warning
-Force use return value (warning)
-Remove tautological compare copy-pasta (char always < 256)
- Do not consume a slot every draw call, instead batch as many draws as possible
- Since renderpasses are dispatched per-draw-clause, keeping occlusion queries outside the renderpasses works fine
- If renderpasses are reorganized, occlusion tasks will have to be reorganized again
- Remove string comparisons from the hot-path!
- Use attribute streaming and push constants to avoid forcing a descriptor block copy every other draw call/pass.
While this isn't so bad on nvidia cards, it makes AMD cards a slideshow.
- When multithreaded RSX is enabled, the vertex cache just lowers performance
- The small cost of upload is paid by the asynchronous thread, allowing RSX to work optimally
- Multiple header files were missing #includes to other headers that
were used in the header. Correct header was included in correct
order in source files which caused everything to compile.
- Added missing #includes so header files correctly include all their
dependencies and fixes problems with IDEs being unable to parse
headers correctly due to missing symbols
- Add vm_locking.h and vm_reservation.h and move relevant functions
and types to these headers.
- Change include order and make vm_ptr.h, vm_var.h and vm_ref.h headers
usable individually, with them including vm.h instead of the other way around
- Because usage of vm::ptr now requires including vm_ptr.h instead of
vm.h updated multiple #includes
- Added additional #includes to vm_reservation.h and vm_locking to
where vm::reservation_* and locking related functions are used
- Refactor overlays and resolve passes to support use of push constants instead of relying on buffer map/unmap
- Add support for nvidia resolve (NV is the only vendor not supporting shader_stencil_export)
- TODO: Option to completely skip clamping in some architectures as it is not needed in most games
- Mostly affects older GPUs that do not have access to native fp16
- Removes a lot of wm_event code that was used to perform window management and is no longer needed.
- Significantly simplifies the Vulkan code.
- Implements resource management when Vulkan window is minimized to allow resources to be freed.
- Just use a semaphore and let the driver handle it instead of manual framepacing.
We lose framepace control but drivers have matured in the past few years so it should work fine.
- Ensures the current renderpass matches the image properties even when a cyclic reference is detected
- Solves SDK debug output error spam due to mismatching layouts and renderpasses
Intel ANV has been tested and verified to work without workaround
AMDVLK and the proprietary AMD driver have been confirmed to require workaround for window resizing
- Add double-buffered descriptor pools to avoid use-after-free situations
- Make descriptor pools more configurable
- Also adds in a hack to allow renderdoc to capture properly
- Use a simple queue to avoid redundant checks over all the contexts
- Poll queue if RSX pipe is idle
- Only check the queue when the frame context is dirty (after a queue operation)
- Reset descriptors at the start of the frame context to avoid having to synchronize mid-frame
- Fully synchronize if a descriptor reset is required mid-frame (spec compliance; also fixes flickering verts on some hardware)
- Transition attachments to LAYOUT_GENERAL in case of a feedback loop
- Fixes appearance of garbage along polygon edges in some
post-processing passes.
- Also reverse this transition when rendering goes back to normal
- Allows render targets to behave like stacked 3D views same as shader inputs are resolved
- Basically implements most of 'Read Color/Depth Buffers' option for 'free'.
- Allows splitting RTV/DSV resources if they are superseded by a partial surface
- Also allows intersecting new resources through the surface cache for proper inheritance from other scattered data
- TODO: Refactor bind_surface_as_rtt and bind_surface_as_ds to reduce asinine code duplication
TODO: Investigate the _s input modifier behaviour further, in case it can avoid generating zeroes from a MAD instruction.
x = MAD(+ve, -ve, -ve) with _s input modifier in BFBC expects result to be Non-zero
- Properly test for NaN and Inf when clamping down to fp16
- Optimize divsq a bit; mix(vec, vec, bvec) emits OpSelect which is what
we want here, instead of component-wise selection which is much slower.
- While mul(0, nan) = nan and 0 / 0 = nan, 0 / sqrt(0) = 0 because of hw
gremlins. normalize(0) is also nan so this behaviour does not work
around that particular case either which makes it even more baffling.
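A CPU-side sketch of the NaN/Inf-aware clamp to fp16 range mentioned above is given below. The policy shown (NaN passes through unchanged, infinities are clamped to the largest finite half-precision value) is an assumption; the actual shader code may treat these cases differently, and the names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Largest finite magnitude representable in IEEE 754 half precision.
constexpr float fp16_max = 65504.0f;

// Brings an f32 value into the finite fp16 range.
// NaN is tested explicitly because comparisons against NaN are always
// false, so a plain min/max clamp cannot be relied on to handle it.
float clamp_to_fp16_range(float x)
{
    if (std::isnan(x))
        return x; // assumed policy: propagate NaN unchanged

    // Infinities fall through here and are clamped like any large value.
    return std::fmax(-fp16_max, std::fmin(x, fp16_max));
}

void check_clamp()
{
    assert(clamp_to_fp16_range(1.0e10f) == fp16_max);
    assert(clamp_to_fp16_range(-INFINITY) == -fp16_max);
    assert(std::isnan(clamp_to_fp16_range(NAN)));
}
```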
- The hw generates inaccurate values when doing perspective-correct
interpolation of vertex output attributes and makes the comparison (a ==
b) fail even when they are a fixed constant value.
- Increase equality tolerance when doing comparisons in fragment
shaders for NV cards only to work around this issue.
- Teepo fix
- The fixed-point D24S8 format does special Z clamping during compare which matches PS3 behaviour
- D32S8 is a floating point format and comparison with Dref > 1 always fails causing black edges/borders
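The failure mode can be illustrated with a small model. It assumes a less-or-equal compare function and stored depth values in [0, 1]; the clamped variant mimics the fixed-point behaviour described above. Both functions are hypothetical and are not the code from this change.

```cpp
#include <cassert>
#include <cmath>

// Plain floating-point compare: passes when the reference is <= the stored depth.
// With stored depth in [0, 1], a reference above 1 can never pass.
bool depth_compare_lequal(float ref, float stored)
{
    return ref <= stored;
}

// Compare with the reference clamped into [0, 1] first, as a fixed-point
// depth format would effectively do.
bool depth_compare_lequal_clamped(float ref, float stored)
{
    const float clamped = std::fmin(std::fmax(ref, 0.0f), 1.0f);
    return clamped <= stored;
}

void check_reference_above_one()
{
    assert(!depth_compare_lequal(1.5f, 1.0f));        // always fails without the clamp
    assert(depth_compare_lequal_clamped(1.5f, 1.0f)); // passes against the far plane
}
```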
- Improve support for float16_t by minimizing mixed inputs to functions
(ambiguous overloads)
- Minimize amount of downcasts in code by using opcode flags
- Re-enable float16_t support for Vulkan
- Emulating f16 with f32 is not ideal and requires a lot of value clamping
- Using native data type can significantly improve performance and accuracy
- With OpenGL, check for the compatible extensions NV_gpu_shader5 and
AMD_gpu_shader_half_float
- With Vulkan, enable this functionality in the deviceFeatures if
applicable. (VK_KHR_shader_float16_int8 extension)
- Temporarily disable hw fp16 for Vulkan
- When reverse scanning, offsets are inverted and offset value of 0 is logically equivalent to an offset of -1
- Add an explicit message if clipping happens to avoid silent errors/bugs
- Revert to using block metrics, but with optional per-channel decode
stage for the final transfer. Much cleaner than hacking in the width to
be in channels instead of blocks.
- Removes CPU-only transforms that broke GPU-side code.
-- Channels in GPU compute are laid out in cell-order, but CPU was uploading in favorable order and compensating with swizzles.
-- This leads to 2 different layouts depending on the location of the data (CPU vs GPU)
- Implement R8G8_R8B8 interleaved format decode
- General improvements
formats
- Allows D24S8 and D32S8 transport via typeless channels
- Allows uploading and downloading D24S8 data easily
- TODO: Implement optional byteswapping to fix flushed readbacks with
the same method
- Do not round up sub-pixel offsets, round down instead
- Do not allow incomplete sources for hw blit transfer
- Reimplement src clipping (slice_h)
- Check 'area' of incoming texels and correct for them before RTT lookup/transfer
- Filter out incomplete targets when performing RTT lookup (1 texel or less contribution)
- If a transfer writes to a RTT and depth mismatch happens, create a local target and the upload function will likely resolve between the two
- If a surface is rejected, reset the target region!
- Also refactors some bpp handling code
- Simplify texture intersection test to use a normalized/uniform coordinate space
- Fix broken bounds checking as well
- Batch dma transfers whenever possible and do them in one go
- vk: Always ensure that queued dma transfers are visible to the GPU before they are needed by the host
Requires a little refactoring to allow proper communication of the commandbuffer state
- vk: Code cleanup, the simplified mechanism makes it so that it's not necessary to pass tons of args to methods
- vk: Fixup - do not forcefully do dma transfers on sections in an invalidation zone! They may have been speculated correctly already
- Properly wait for the buffer transfer operation to finish before map/readback!
- Change vkFence to vkEvent which works more like a GL fence which is what is needed.
- Implement supporting methods and functions
- Do not destroy fence by immediately waiting after copying to dma buffer
- Avoids blindly reusing blit dst sections as they may contain garbage.
If a section was unlocked for a flush, just discard it as its reuse introduces potential data corruption.
Since the data needs to be reuploaded anyway (for now), it's better to start afresh
- In case of format mismatch, reset the calculated dst block
- Add a bounds check to determine if data contained in an atlas is good enough for sampling the cache.
If not enough data is provided, fall back to full upload
- Blit operations do format conversion automatically which is NOT what we want!
- Scale onto temp buffer with similar format before performing data cast.
- Use a 5-point tap with an X pattern across the target's memory space to reduce chances of false positives
- TODO: Potential false positives identified, requires some minor
restructuring of surface_store
- Properly synchronize when transitioning to/from GENERAL layout.
- General layout requires full pipeline dependency since it's used in a 'general' sense. As such, its use is to be largely avoided.
- gl: Properly initialize and manage sampler states
- gl/vk: Snap overlay elements to pixel grid by aligning to pixel centers
- overlays: Disable grid snapping in stb since it's now handled in the backend
- Make detail a separate text entity as it often contains a lot of noise
- Properly pad the entry if needed to avoid text sitting too close to the edge
- Use custom string conversion to ensure overlay deals with extended ASCII whenever possible
- Improves language compatibility greatly and avoids empty spaces for unknown glyphs
- Adds all the major buttons to native dialog input options
- Adds more button options for the native osk
- Brighten osk cell backgrounds a bit to improve visibility
- NVIDIA drivers hook into the msq before our nativeEvent handler. This means NV is aware of events before rpcs3 is aware of them and sometimes stops until a new event is triggered.
If rpcs3 is inside a driver call at this time, the system will deadlock since the driver waits for msq which waits for the renderer which waits for the driver.
- Use explicit hook management to control window events
- Add fence timeout to attempt detection of surface loss events
- Disable DEPTH<->RGBA typeless transfers for now as they require a lot more work to work for all vendors
- Do not allow switching layouts to UNDEFINED/PREINITIALIZED formats
- Apply dither to edges that almost fail the straight-up alpha test
- Significantly improves alpha tested geometry far from the camera
- Also removes blend factor overrides/hacks as they give incorrect results due to background bleeding
- Index offset is ignored anyway and only used to calculate vertex attribute divisor index
- Specialized optimization for untouched xfer without primitive restart
- Avoid tagging and rely on read/write barriers and the dirty flag mechanism. Testing is done with a weak 8-byte memory test
- Introducing new data when tagging breaks applications with race conditions where tags can overwrite flushed data
- D24S8 targets have 2 aspects that are dealt with separately; forcefully initialize the remaining data if a partial init is done. It's 'free' anyway
- It seems that the stencil mask matters when clearing, unlike the depth mask and color mask
- gl: Include an execution state wrapper to ensure state changes are consistent. Also removes a lot of required 'cleanup' for helper methods
- texture_cache: Make execution context a mandatory field as it is required for all operations. Also removes a lot of situations where a duplicate argument is added in for both fixed and vararg fields
- Explicit read/write barrier for framebuffer resources depending on
usage. Allows for operations like optional memory initialization before
reading
- Remove the required_xxx_pitch constraint as it makes no sense. The pitch controls what can be written per line.
- It is possible to have a huge surface width but only render to a small region at the beginning and have a smaller pitch than can fit the surface (NFS Carbon)
- If draw call resources consume memory that intersects with NA parts of the texture cache, we get a framebuffer test mismatch.
This mismatch is false and happens because the thread has not yet reached the point of relocking the pages
- Implicitly invoke a memory barrier if actively reading from an unsynchronized texture
- Simplify memory transfer operations
- Should allow more games to work without strict mode
- Do not bind companion framebuffer when clearing single aspect; let the
contest mechanism sort it out instead
- Do not prematurely tag framebuffers, instead only do so at
write-confirmation time. Should avoid false tagging if setup does not
allow a render to occur.
- Immediate mode is isolated from the rest of the vertex configuration
- TODO: Verify register behaviour when immediate mode is used
Check if per-primitive const register values are supported (likely are)
- Implements a mirror view of D24S8 data that accesses the stencil components.
Finishes the implementation of TEX2D_DEPTH_RGBA as the stencil component was previously missing from the reconstructed data
- Add a few missing destructors
Image classes are inherited a lot and I forgot to make the dtors virtual
- Per-channel conditional execution introduces RAW hazards all over the place
- It's cheaper to process both branches and select between the two
- Also improves ShaderVariable functionality to allow functionality such as match_size and taking complex variables as inputs
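The "process both branches and select" idea can be sketched on the CPU side as follows. This is not the decompiler's actual output; `vec4`, `bvec4` and `select` are illustrative names, loosely modelled on a per-component shader select.

```cpp
#include <array>
#include <cstddef>

using vec4 = std::array<float, 4>;
using bvec4 = std::array<bool, 4>;

// Per-channel select: both inputs are already fully computed, and each
// output channel picks if_true[i] where cond[i] holds, else if_false[i].
inline vec4 select(const bvec4& cond, const vec4& if_true, const vec4& if_false)
{
    vec4 out{};
    for (std::size_t i = 0; i < 4; ++i)
    {
        out[i] = cond[i] ? if_true[i] : if_false[i];
    }
    return out;
}
```

Because every channel is written exactly once from fully evaluated inputs, there is no partially updated register for a later instruction to read, which is the kind of read-after-write hazard that per-channel conditional execution introduces.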
* Restore stack in fifo error handling
* Update get register after the cmd execution
* Fix put pause in the middle of command
* Add restore points when branching to self
* Precise nopcmd detection
* Test all invalid cmds for early treatment of queue corruption
To avoid the need (and performance hit) of Read Color/Depth Buffers, we
may not invalidate overlapping fbos inside lock_memory_region unless
they are guaranteed to be superseded by the new one.
This avoids e.g. issues with overblooming, among others.
Fixes VRAM leaks and incorrect destruction of resources, which could
lead to driver crashes.
Additionally, lock_memory_region is now able to flush superseded
sections. However, due to the potential performance impact of this
for little gain, a new debug setting ("Strict Flushing") has been
added to config.yaml
The hw doesn't fix up the pitch: when the src pitch is 0, it copies the same line of pixels to dst. Keep in mind that out_pitch = 0 is not allowed in image_in.
Same goes for buffer_notify, though it allows out_pitch to be 0.
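A rough sketch of the zero source pitch behaviour described above, assuming a plain row-by-row copy. `copy_lines` and its parameters are hypothetical names, not the emulator's transfer code.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copies `rows` lines of `line_bytes` each. A src_pitch of 0 is not corrected:
// the source address never advances, so the same line is replicated into
// every destination row.
inline void copy_lines(std::uint8_t* dst, std::uint32_t dst_pitch,
                       const std::uint8_t* src, std::uint32_t src_pitch,
                       std::uint32_t line_bytes, std::uint32_t rows)
{
    for (std::uint32_t y = 0; y < rows; ++y)
    {
        std::memcpy(dst + static_cast<std::size_t>(y) * dst_pitch,
                    src + static_cast<std::size_t>(y) * src_pitch,
                    line_bytes);
    }
}
```

With `src_pitch == 0` and `dst_pitch == line_bytes`, the destination ends up holding `rows` identical copies of the first source line.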
* Don't bother capturing 'destination' blocks with no data. Instead, premap all main memory to ensure it is allocated
* Capture zcull and tile state as their compressed gcm forms
* Fix index array capturing, ignore empty sets
* hle gcm: Fix byteswapping in cellGcmSetZcull
* Added a helper function for fetching game's PARAM.SFO path
This should properly get SFO path for unlocked C00 games
* Normalized line endings
* Refresh game list after installing a RAP file
- Do not assume flip marks end-of-frame if executed via syscall
- Also disables skip_frame for these applications as there is no frame boundary
- NOTE: QUEUE_HEAD cannot be relied on as it is seemingly possible to flip the same head and not need to queue it
- Ignore barriers inserted after BEGIN but before any draw commands are emitted
- Properly process tail barriers inserted before END but after draw commands are submitted
- Ignore execution barriers with no effect (same register value written)
- Tries to detect when FIFO preprocessing is beneficial and only enables optimizations if the benefit outweighs the cost
- Current threshold is at least 500 draw calls saved at over 2000 draw calls to justify the overhead
- TODO: More tuning for other CPUs
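The threshold above can be read as a simple cost/benefit check. The sketch below uses the two numbers from the note; the function name, the parameter names, and the assumption that both counts are gathered over the same period are illustrative only.

```cpp
#include <cstdint>

// Thresholds taken from the note above.
constexpr std::uint32_t min_total_draws = 2000; // must be exceeded
constexpr std::uint32_t min_saved_draws = 500;  // must be reached

// Returns true when FIFO preprocessing looks worth its overhead:
// more than 2000 draw calls were seen and at least 500 of them were saved.
inline bool preprocessing_pays_off(std::uint32_t total_draws, std::uint32_t saved_draws)
{
    return total_draws > min_total_draws && saved_draws >= min_saved_draws;
}
```

For example, `preprocessing_pays_off(2500, 600)` holds, while a workload with exactly 2000 draw calls or only 499 saved calls leaves the optimizations disabled.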
- Improve vertex attribute layout format. Allows for full 16-bit attribute divisor
- Use actual pitch when declaring framebuffer rsx pitch instead of register value in case of swizzle? rendering
- Replace a few more vectors with simple_array<T>
- Avoid unnecessary string comparisons in backends. We already know referenced textures from the program analysers!
- Also fix visual corruption when using disjoint indexed draws
- Refactor draw call emit again (vk)
- Improve execution barrier resolve
- Allow vertex/index rebase inside begin/end pair
- Add ALPHA_TEST to list of excluded methods [TODO: defer raster state]
- gl bringup
- Simplify
- using the simple_array gets back a few more fps :)
* gcm: Fix tile offset setting
The highest bit signifies the location, so ignore it while reading the offset.
* rsx-capture: Fix tile binding
fixes division by zero when dividing by pitch when the tile is not bound.
* rsx-capture: Fix zcull binding
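The tile fixes above can be illustrated with a small sketch. It assumes a 32-bit register value whose top bit carries the location, and it treats a pitch of 0 as "tile not bound"; the helper names are hypothetical and this is not the actual gcm or rsx-capture code.

```cpp
#include <cstdint>

// Assumption: the highest bit of a 32-bit value selects the memory location.
constexpr std::uint32_t location_bit = 0x80000000u;

// Offset with the location bit masked off.
inline std::uint32_t tile_offset(std::uint32_t reg)
{
    return reg & ~location_bit;
}

// Location flag taken from the highest bit (0 or 1).
inline std::uint32_t tile_location(std::uint32_t reg)
{
    return reg >> 31;
}

// Rows covered by a tile. An unbound tile (pitch == 0) yields 0 rows
// instead of triggering a division by zero.
inline std::uint32_t tile_rows(std::uint32_t size_bytes, std::uint32_t pitch)
{
    return pitch == 0 ? 0 : size_bytes / pitch;
}
```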