Move se_t and se_storage to util/endian.hpp
Use single template instead of two specializations.
Add minor optimization for MSVC.
Remove v128 dependency.
Try to enable intrinsics for unaligned data.
Fix minor bug in u16/u32/u64 specializations.
- ZCULL queue was updated to one-per-cb but the conditional render sync hint was not updated.
- Do not unconditionally flush the queue unless the upcoming ref is contained in the active CB.
- This avoids spamming queue flush, which frees up resources and improves performance
- Merge viewport raster window and scissor into one clipping region
- Viewport raster clip is different from viewport geometry clipping in
hardware as the latter is configurable separately
vm::spu max address was overflowing resulting in issues, so cast to u64 where needed. Fixes#6145.
Use vm::get_addr instead of manually substructing vm::base(0) from pointer in texture cache code.
Prefer std::atomic_thread_fence over _mm_?fence(), adjust usage to be more correct.
Used sequantially consistent ordering in semaphore_release for TSX path as well.
Improved memory ordering for sys_rsx_context_iounmap/map.
Fixed sync bugs in HLE gcm because of not using atomic instructions.
Use release memory barrier in lwsync for PPU LLVM, according to this xbox360 programming guide lwsync is a hw release memory barrier.
Also use release barrier where lwsync was originally used in liblv2 sys_lwmutex and cellSync.
Use acquire barrier for isync instruction, see https://devblogs.microsoft.com/oldnewthing/20180814-00/?p=99485
Prefer vm::ptr<>::ptr over vm::get_addr.
Prefer vm::_ptr/base over vm::g_base_addr with offset.
Added methods atomic_t<>::bts and atomic_t<>::btr .
Removed obsolute rsx:🧵:Read/WriteIO32 methods.
Removed wrong check in semaphore_release.
Added handling for PUTRx commands for RawSPU MFC proxy.
Prefer overloaded methods of v128 instead of _mm_... in VPKSHUS ppu interpreter precise.
Fixed more potential overflows that may result in wrong behaviour.
Added io/size alignment check for sys_rsx_context_iounmap.
Added rsx::constants::local_mem_base which represents RSX local memory base address.
Removed obsolute rsx:🧵:main_mem_addr/ioSize/ioAddress members.
- Add vm_locking.h and vm_reservation.h and move relevant functions
and types to these headers.
- Change include order and make vm_ptr.h, vm_var.h and vm_ref.h headers
usable invidually and them including vm.h instead of other way around
- Because usage of vm::ptr now requires including vm_ptr.h instead of
vm.h updated multiple #includes
- Added additional #includes to vm_reservation.h and vm_locking to
where vm::reservation_* and locking related functions are used
- When reverse scanning, offsets are inverted and offset value of 0 is logically equivalent to an offset of -1
- Add an explicit message if clipping happens to avoid silent errors/bugs
- Do not round up sub-pixel offsets, round down instead
- Do not allow incomplete sources for hw blit transfer
- Reimplement src clipping (slice_h)
- Check 'area' of incoming texels and correct for them before RTT lookup/transfer
- Filter out incomplete targets when performing RTT lookup (1 texel or less contribution)
- Immediate mode is isolated from the rest of the vertex configuration
- TODO: Verify register behaviour when immediate mode is used
Check if per-primitive const register values are supported (likely are)
The hw doesnt fix pitch, when specifying src pitch 0 it copies the same pixels line to dst. keep in mind out_pitch = 0 is not allowed in image_in.
Same goes for buffer_notify, though it allows out_pitch to be 0.
- Ignore barriers inserted after BEGIN but before any draw commands are emitted
- Properly process tail barriers inserted before END but after draw commands are submitted
- Ignore execution barriers with no effect (same register value written)
- Also fix visual corruption when using disjoint indexed draws
- Refactor draw call emit again (vk)
- Improve execution barrier resolve
- Allow vertex/index rebase inside begin/end pair
- Add ALPHA_TEST to list of excluded methods [TODO: defer raster state]
- gl bringup
- Simplify
- using the simple_array gets back a few more fps :)
- Some games overflow the program buffer e.g Resistance games
The observed overflow is one instruction longer, likely an engine bug
with counting instructions
- Implement forced reading when calling update method to sync partial lists
- Defer conditional render evaluation and use a read barrier to avoid extra work
- Fix HLE gcm library when binding tiles & zcull RAM
- Do not do a full sync on a texture read barrier
- Avoid calling zcull sync in FIFO spin wait
- Do not flush memory to cache from the renderer side; this method is now obsolete
- Adds proper support for vertex textures, including dimensions other than 2D textures
- Minor analyser fixup, removes spurious 'analyser failed' errors
- Minor optimizations for program state tracking
- Fix program abort logic to never abort before resolving later label addresses
Fixes jumping over broken code and jumping over END markers
- TEXTURE_CONTROL2 has indexing range of [0..15] without stride skipping!
This register does not have interleaving with other texture registers
- Track shader address poke as it seems to invalidate programs as well
Drop _xbegin family intrinsics due to bad codegen
Implemented `notifier` class, replacing vm::notify
Minor optimization: detach transactions from global mutex on TSX path
Minor optimization: don't acquire vm::passive_lock on PPU on TSX path
- Improve dirty state tracking affecting program state
- vk: Refactor out transform constants upload into a separate channel to avoid if possible
transform data uploads are quite expensive
- Properly detect inline array registers vs constant value registers
- Silence needless spam, 306E is 2D surface engiine, the assumption that y is multiplied by 306E pitch is not crazy
- Track asynchronous operations in RSX core
- Add read barriers to force pending writes to finish.
Fixes zcull delay flicker in all UE3 titles without forcing hard stall
- Increase zcull latency as all writes should be synchronized now
- ZCULL unit emulation rewritten
- ZCULL reports are now deferred avoiding pipeline stalls
- Minor optimizations; replaced std::mutex with shared_mutex where contention is rare
- Silence unnecessary error message
- Small improvement to out of memory handling for vulkan and slightly bump vertex buffer heap
- Removes the duplicate local_transform_constants
- Resets the transform constants on every context reset
- Simplifies the code abit which should make it faster
- NOTE: Transform constants are persistent across context re-init events (VF5)
- Implement flush-always behaviour to partially fix readback from a currently bound fbo
- Without this, only the first read is correct, as more draws are added the results become 'wrong'
- Fixes WCB and cpublit behviour
- Synchronize blit_dst surfaces to avoid data loss when gpu texture scaling is used
- Its still faster in such cases to disable gpu texture scaling but some types cannot be disabled without force cpu blit (e.g framebuffer transfers)
- Memory management tuning
- rsx: on-demand texture cache rescanning for unprotected sections
- rsx: Only framebuffer resources are upscaled
- Do not resize regular blit engine resources
- Lazy initialize readback buffer when using opengl
-- These measures should help minimize vram usage
- Track last working state and reset to it if RSX starts to desync
-- This is especially useful when running vulkan since the renderer will easily outpace the rest of the system when merely recording draw commands
- Ignore empty sets
-- Mark empty/invalid IB sets as having 0 element counts.
- Abort nv406e semaphore acquire if the rsx thread stalls/crashes
- Fix texture size approximation to take mipmaps into account. Fixes some games hanging with WCB
- Fix VERTEX_DATA_3F_M element alignment (its 16 bytes per attribute)
- Fix DATA_2S_X interpretation type. Its signed 16-bit unnormalized (s32k) and not signed normalized (s1)
- Zeta pitch is ignored by real HW for some reason
- Monitor ptch value changes as well since they may affect disabled surfaces
- TODO: Verify if MRT pitch is really taken into consideration
- Reimplement fragment program fetch and rewrite texture upload mechanism
-- All of these steps should only be done at most once per draw call
-- Eliminates continously checking the surface store for overlapping addresses as well
addenda - critical fixes
- gl: Bind TIU before starting texture operations as they will affect the currently bound texture
- vk: Reuse sampler objects if possible
- rsx: Support for depth resampling for depth textures obtained via blit engine
vk/rsx: Minor fixes
- Fix accidental imageview dereference when using WCB if texture memory occupies FB memory
- Invalidate dirty framebuffers (strict mode only)
- Normalize line endings because VS is dumb
- Limits buffer size to min 720 in the Y axis (1024 section causes conflicts in some cases - TODO)
rsx: Fixups to allow large textures for blit operation
- Also includes checks for both leaking sections and blit regions for vulkan
hotfix for hanging when using WCB
addendum - unlock both ro and no blocks before attempting to copy memory blocks
gl: Fixups for ARB_explicit_uniform_location
- Forces glsl v 430 to make use of the extension
rsx/vk: Rework texture cache to minimize recursive access violations
- Also modifies the vulkan commandbuffer begin/end/submit mechanism
gl: Fix cached_texture_section::is_flushable to take memory protection into account
rsx: Fix blit dst offset calculation
gl/vk/rsx: Refactoring; unify texture cache code
gl: Fixups
- Removes rsx::gl::texture class and leave gl::texture intact
- Simplify texture create and upload mechanisms
- Re-enable texture uploads with the new texture cache mechanism
rsx: texture cache - check if bit region fits into dst texture before attempting to copy
gl/vk: Cleanup
- Set initial texture layout to DST_OPTIMAL since it has no data in it anyway at the start
- Move structs outside of classes to avoid clutter
gl: Fix multidraw [WIP]
rsx: Ignore vertex base when data source is generated using arithmetic
vk: Check pending flag before doing fence poke
vk/gl: Fix for inlined array and immediate draws
rsx: Collapse joined draws when batching
rsx: Blit engine improvements
- Always handle blits to and from framebuffers through the GPU
- Handle depth surfaces properly when using GL
- Check for format mismatches when blitting to the surface store [WIP]
Fix rsx offscreen-render-to-display-buffer-blit surface reads
- Also, properly scale display output height if reading from compressed tile
gl: Fix broken dst height computation
- The extra padding is only there to force power-of-2 sizes and isnt used
gl: Ignore compression scaling if output is rendered to in a renderpass
rsx/gl/vk: Cleanup for GPU texture scaling. Initial impl [WIP]
- TODO: Refactor more shared code into RSX/common
- Significant gains from greatly reduced CPU work
- Also reorders command submission in end() to improve throughput
- Refactors most of the vertex buffer handling
- All vertex processing is moved GPU side
Improve scaling and separate sampler state from texture state
gl: Unify all texture cache objects under one structure separate by use case
gl: Texture cache fixes
- Acquire lock when finding matching textures
- Account for swizzled surfaces when deciding whether to cpu memcpy
- Handle swizzled images on the GPU
`unveil<>` renamed to `fmt_unveil<>`, now packs args to u64 imitating va_args
`bijective...` removed, `cfg::enum_entry` now uses formatting system
`fmt_class_string<>` added, providing type-specific "%s" handler function
Added `fmt::append`, removed `fmt::narrow` (too obscure)
Utilities/cfmt.h: C-style format template function (WIP)
Minor formatting fixes and cleanup
rsx_decode.h implements a "rsx_decoders" template class that is specialized for most GCM command
found in rsx command buffer. 3 static members are defined : a "decode" function that turns command
value into a more meaninfull type if applicable (for instance bool for _enabled* command, surface
formats for set_surface_format command...), a "commit_rsx_state" that modifies a given rsx_state
structure when the command is parsed, and a "dump" function used in rsx_debugger for pretty printing.
Hopefully having the 3 functions in a single place for every command will act as a self documenting
list of rsx command buffer opcode.
rsx_state is also expanded into several explicit variables instead of being stored into a u32 array.
This should makes debugging easier (Visual Studio will display the exact value of these member for instance)
as well as preparing rsx_state for serialisation/deserialisation.
The vertex array and textures opcode are not concerned atm for bisecting purpose.
* Optimizations
1) Some headers simplified for better compilation time
2) Some templates simplified for smaller executable size
3) Eliminate std::future to fix compilation for mingw64
4) PKG installation can be cancelled now
5) cellGame fixes
6) XAudio2 fix for mingw64
7) PPUInterpreter bug fixed (Clang)
* any_pod<> implemented
Aliases: any16, any32, any64
rsx::make_command fixed