Use the _xe suffix instead of the xesl_ prefix for quicker visual
recognition of identifiers, and switch to snake_case for consistency.
Also add the f suffix to float32 literals because the Metal Shading
Language is based on C++.
Enable portability subset physical device enumeration.
Don't use Vulkan 1.1+ logical devices on Vulkan 1.0 instances due to the
VkApplicationInfo::apiVersion specification.
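The version rule can be sketched as a small helper. This is a minimal illustration, not the actual Xenia code: `MakeApiVersion` and `UsableDeviceApiVersion` are hypothetical names, and the version packing mirrors `VK_MAKE_API_VERSION` with variant 0.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Packs a version the same way VK_MAKE_API_VERSION does (variant 0).
constexpr uint32_t MakeApiVersion(uint32_t major, uint32_t minor,
                                  uint32_t patch = 0) {
  return (major << 22) | (minor << 12) | patch;
}

// Hypothetical helper: the highest API version that device-level code may
// rely on, given VkApplicationInfo::apiVersion and the version reported in
// VkPhysicalDeviceProperties::apiVersion.
constexpr uint32_t UsableDeviceApiVersion(
    uint32_t application_api_version, uint32_t physical_device_api_version) {
  // An apiVersion below 1.1 (including 0) means a Vulkan 1.0 instance, so
  // the device must be treated as 1.0 regardless of what it reports.
  if (application_api_version < MakeApiVersion(1, 1)) {
    return MakeApiVersion(1, 0);
  }
  return std::min(application_api_version, physical_device_api_version);
}
```

With this sketch, a 1.3 physical device on a 1.0 instance is clamped to 1.0, while a 1.1 device on a 1.2 instance is used as 1.1.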
Make sure all extension dependencies are enabled when creating a device.
Prefer exposing feature support over extension support via the device
interface to avoid causing confusion with regard to promoted extensions
(especially those that required some features as extensions, but had those
features made optional when they were promoted).
Allow creating presentation-only devices, not demanding any optional
features beyond the basic Vulkan 1.0, for use cases such as internal tools
or CPU rendering.
Require the independentBlend feature for GPU emulation, as working around
its absence is complicated, while support is almost ubiquitous.
Move the graphics system initialization fatal error message to xenia_main
after attempting to initialize all implementations, for automatic fallback
to other implementations in the future.
Log Vulkan driver info.
Improve Vulkan debug message logging, enabled by default.
Refactor code, with simplified logic for enabling extensions and layers.
VS/PS_NUM_REG is 6-bit on Adreno 200, and games aren't seen using
bit 7 to indicate that no GPRs are used. It's not clear why Freedreno
configures it this way.
Some texture fetch fields were deprecated or moved during the development
of the Xenos, reflect that in the comments.
Add definitions of the registers configuring the conversion of vertex
positions to fixed-point. Although there isn't much that can be done with
it when emulating using PC GPU APIs, there are some places in Xenia that
wrongly (though sometimes deliberately, for results closer to the behavior
of the host GPU) assume that the conversion works like in Direct3D 10+,
however the Xenos supports only up to 4 subpixel bits rather than 8. The
effects of this difference are largely negligible, though.
Also add more detailed info about register references and differences from
other ATI/AMD GPUs for potential future contributors.
The `SUB` instruction can only encode immediates in the form of `0xFFF`
or `0xFFF000`. In the case that the stack size is greater than `0xFFF`,
then just align the stack-size by `0x1000` to keep the bottom 12 bits
clear.
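A minimal sketch of that rounding, assuming hypothetical helper names (`IsSubImmediate`, `EncodableStackSize`) that are not taken from the actual emitter:

```cpp
#include <cassert>
#include <cstddef>

// True if `value` fits the 12-bit immediate of SUB, either unshifted
// (0xFFF) or shifted left by 12 (0xFFF000).
constexpr bool IsSubImmediate(size_t value) {
  return (value & ~size_t{0xFFF}) == 0 || (value & ~size_t{0xFFF000}) == 0;
}

// Hypothetical helper: rounds a stack size above 0xFFF up to a multiple of
// 0x1000 so the bottom 12 bits are clear and the shifted form applies.
// Sizes above 0xFFF000 are not handled by this sketch.
constexpr size_t EncodableStackSize(size_t stack_size) {
  if (stack_size <= 0xFFF) {
    return stack_size;
  }
  return (stack_size + 0xFFF) & ~size_t{0xFFF};
}
```

For example, a stack size of `0x1234` is not encodable as-is, but rounds up to `0x2000`, which is.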
Move the `FMOV` constant functions into `a64_util` so they are available to other translation units. Optimize constant-splats with conditional use of `MOVI` and `FMOV`.
Byte-sized constants can utilize the `MOVI` instructions. This makes
many cases such as zero-splats much faster since this encodes as just a
register-rename (similar to `xor` on x64).
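The check for a byte-sized splat could look roughly like this. It is a sketch only: `ByteSplatValue` is a hypothetical name, and a plain `std::array` stands in for whatever 128-bit constant type the backend uses.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical helper: if all 16 bytes of a 128-bit constant are equal,
// returns that byte so the caller can emit a single byte-wise MOVI instead
// of loading the constant from memory.
std::optional<uint8_t> ByteSplatValue(const std::array<uint8_t, 16>& bytes) {
  const uint8_t first = bytes[0];
  const bool all_equal =
      std::all_of(bytes.begin(), bytes.end(),
                  [first](uint8_t b) { return b == first; });
  if (all_equal) {
    return first;
  }
  return std::nullopt;
}
```

A zero vector yields `0`, which corresponds to the zero-splat case mentioned above; any constant with differing bytes falls through to the other paths.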
The emitter doesn't actually hold onto executable code, but just
generates the assembly-data into a buffer for the currently-resolving
function before placing it into a code-cache. When code gets pushed into
the code-cache, it can just be copied from an `std::vector` and reset.
The code-cache itself maintains the actual executable memory and
stack-unwinding code and such.
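The split of responsibilities can be sketched as follows. Both classes are hypothetical stand-ins, and a plain `std::vector` takes the place of real executable memory:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for the code cache: owns the final storage for generated code.
class CodeCacheSketch {
 public:
  explicit CodeCacheSketch(size_t capacity_words) : memory_(capacity_words) {}

  // Copies a finished function into the cache and returns its word offset.
  size_t Place(const std::vector<uint32_t>& code) {
    assert(used_ + code.size() <= memory_.size());
    const size_t offset = used_;
    std::copy(code.begin(), code.end(),
              memory_.begin() + static_cast<std::ptrdiff_t>(offset));
    used_ += code.size();
    return offset;
  }

  uint32_t At(size_t index) const { return memory_[index]; }

 private:
  std::vector<uint32_t> memory_;
  size_t used_ = 0;
};

// Stand-in for the emitter: only accumulates instruction words for the
// function that is currently being resolved.
class EmitterSketch {
 public:
  void Emit(uint32_t instruction) { buffer_.push_back(instruction); }

  // Hands the buffered code to the cache, then resets for the next function.
  size_t Commit(CodeCacheSketch& cache) {
    const size_t offset = cache.Place(buffer_);
    buffer_.clear();
    return offset;
  }

  size_t pending() const { return buffer_.size(); }

 private:
  std::vector<uint32_t> buffer_;
};
```

Because the emitter never knows the final address while generating code, anything it emits has to be valid wherever the cache ends up placing it.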
This also fixes a bunch of erroneous relative-addressing glitches where
relative addresses were calculated based on the address of the unused
CodeBlock rather than being position-independent. `MOVP2R` in particular
was generating different instructions depending on its distance from the
code block when it should always just use `MOV` and not do any
relative-address calculations, since we can't predict where the actual
instruction's offset will be (we cannot predict what the program counter
will be). Oaknut probably needs a "position independent" policy or mode
or something so that it avoids PC-relative instructions.
`dc civac` causes an illegal-instruction on Windows-ARM. This is likely
a security measure against cache-attacks. On Linux this instruction
is trapped into an EL1 kernel function. Windows does not seem to have
any user-mode cache-maintenance instructions available for
data-cache (only instruction-cache via `FlushInstructionCache`).
The closest thing we can do for now is a full data memory-barrier with
`dsb ish`.
Prefetches are implemented using `prfm pldl1keep, ...`.
`FMOV` encodes an 8-bit floating point immediate that can be used to
accelerate the loading of certain constant floating point values between
-31.0 and 31.0 (magnitudes from 0.125 to 31.0). A lot of immediates such
as -1.0, 1.0, 0.5, etc. fall within this range and this code gets lots of
hits in my testing. This is much cheaper than loading a 32/64-bit value
into W0/X0 and moving it into an FP register.
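For reference, here is a sketch of how a float32 value can be tested against the 8-bit `FMOV` immediate format. The helper name `EncodeFmovImm8` is hypothetical and is not the function in `a64_util`; the bit layout follows the architectural expansion of the immediate (sign, a 3-bit exponent, 4 fraction bits).

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <optional>

// Hypothetical helper: returns the imm8 encoding of `value` if it is
// representable as an FMOV (immediate) operand, or nullopt otherwise.
// Expanded float32 layout for imm8 = abcdefgh:
//   bit 31 = a, bit 30 = NOT(b), bits 29..25 = bbbbb,
//   bits 24..23 = cd, bits 22..19 = efgh, bits 18..0 = 0
std::optional<uint8_t> EncodeFmovImm8(float value) {
  uint32_t bits;
  std::memcpy(&bits, &value, sizeof(bits));
  // Only the top 4 fraction bits may be set.
  if ((bits & 0x7FFFF) != 0) {
    return std::nullopt;
  }
  // Bits 30..25 must be 100000 (b = 0) or 011111 (b = 1).
  const uint32_t exponent_high = (bits >> 25) & 0x3F;
  if (exponent_high != 0x20 && exponent_high != 0x1F) {
    return std::nullopt;
  }
  const uint32_t a = bits >> 31;
  const uint32_t b = (bits >> 29) & 1;
  const uint32_t cdefgh = (bits >> 19) & 0x3F;
  return static_cast<uint8_t>((a << 7) | (b << 6) | cdefgh);
}
```

Values such as 0.0, 32.0 or 0.1 are rejected by this check and would need a different loading strategy.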
The 64-bit case uses a particular replicated 8-bit immediate, so
something else will have to handle that. This catches a lot of cases
without having to touch memory. Does not catch cases of
`1.0` (0x3f800000).
This directly maps to the QC bit in the FPSR. Just have to make sure
that the saturated instruction is the very last instruction (which is
currently the case for stuff like VECTOR_ADD and such).
Uses `CNTFRQ` and `CNTVCT` system-registers as a raw clock source.
On my ThinkPad X13s, the raw clock source returns a tick-frequency of
19,200,000 while the platform clock source (QueryPerformanceFrequency)
returns 10,000,000. Almost double the resolution of the platform-clock!
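The arithmetic behind that comparison, as a small sketch (`TickPeriodNanoseconds` is a hypothetical helper, and the two frequencies are the ones quoted above):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: length of one clock tick in nanoseconds.
constexpr double TickPeriodNanoseconds(uint64_t frequency_hz) {
  return 1e9 / static_cast<double>(frequency_hz);
}

// Frequencies quoted in the commit message above.
constexpr uint64_t kRawFrequencyHz = 19'200'000;       // CNTFRQ
constexpr uint64_t kPlatformFrequencyHz = 10'000'000;  // QueryPerformanceFrequency
```

At these frequencies one raw tick is roughly 52 ns against 100 ns for the platform clock, a ratio of 1.92.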
Load the pointer to the VConst table once, and use offsets from this base address derived from the underlying enum value.
Reduces the number of instructions for each VConst memory load.
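In plain C++ terms the addressing scheme looks roughly like the sketch below. The enum, table contents and helper names are all hypothetical; the real table and enum live in the backend.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical 128-bit constant and a tiny stand-in constant table.
struct alignas(16) Vec128 {
  uint32_t u32[4];
};

enum class VConstSketch : uint32_t { kZero = 0, kOne, kSignMask, kCount };

constexpr Vec128 kVConstTableSketch[] = {
    {{0x00000000, 0x00000000, 0x00000000, 0x00000000}},  // kZero
    {{0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000}},  // kOne (1.0f x4)
    {{0x80000000, 0x80000000, 0x80000000, 0x80000000}},  // kSignMask
};

// Byte offset of a constant from the table base, derived from the
// underlying enum value.
constexpr size_t VConstOffset(VConstSketch id) {
  return static_cast<size_t>(id) * sizeof(Vec128);
}

// Loads a constant using only the base pointer plus the computed offset,
// which is what a single base register and an immediate offset would do.
inline const Vec128& LoadVConst(const Vec128* table_base, VConstSketch id) {
  const auto* bytes = reinterpret_cast<const uint8_t*>(table_base);
  return *reinterpret_cast<const Vec128*>(bytes + VConstOffset(id));
}
```

Once the base address is in a register, each constant access only needs the offset, rather than materializing a full address per constant.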
Use `FMADD` and `FMLA`
Tests are the same, though now it should run a bit faster.
The tests that fail are primarily denormals and other subtle precision
issues it seems.
Ex:
```
i> 00002358 - vmaddfp_7298_GEN
!> 00002358 Register v4 assert failed:
!> 00002358 Expected: v4 == [00000000, 00000000, 00000000, 00000000]
!> 00002358 Actual: v4 == [000D000E, 00138014, 000E4CDC, 0018B34D]
!> 00002358 TEST FAILED
```
Host-To-Guest and Guest-To-Host thunks should probably restore/preserve
the FPCR to maintain these roundings.
```
4.2.2.4 Floating-Point Rounding and Conversion Instructions
...
Floating-point conversions to integers (vctuxs, vctsxs) use round-toward-zero (truncate).
...
```
This passes all of the `vctuxs` and `vctsxs` unit tests.
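A scalar model of round-toward-zero conversion with saturation, for illustration only. The function names are hypothetical, the scale operand of the guest instructions is left out, and mapping NaN to 0 follows what the AArch64 `FCVTZS`/`FCVTZU` instructions do, which is an assumption about the intended guest behavior here.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical model of a saturating, truncating float -> int32 conversion.
int32_t ConvertToInt32Truncating(float value) {
  if (std::isnan(value)) {
    return 0;  // Assumption: NaN converts to 0, as with FCVTZS.
  }
  if (value >= 2147483648.0f) {
    return INT32_MAX;  // Saturate on positive overflow.
  }
  if (value <= -2147483648.0f) {
    return INT32_MIN;  // Saturate on negative overflow.
  }
  return static_cast<int32_t>(value);  // C++ conversion truncates toward zero.
}

// Hypothetical model of a saturating, truncating float -> uint32 conversion.
uint32_t ConvertToUint32Truncating(float value) {
  if (std::isnan(value) || value <= 0.0f) {
    return 0;  // NaN (assumption) and all non-positive inputs give 0.
  }
  if (value >= 4294967296.0f) {
    return UINT32_MAX;  // Saturate on overflow.
  }
  return static_cast<uint32_t>(value);
}
```

Note that 1.9 becomes 1 and -1.9 becomes -1, which is the truncation the quoted specification text asks for, as opposed to round-to-nearest.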