Add atomic_t<>::observe() (relaxed load)
Add atomic_fence_XXX() (barrier functions)
Get rid of MFENCE instruction, replace with no-op LOCK OR on stack.
Remove <atomic> dependence from stdafx.h and relevant headers.
These conversions don't exist in std::shared_ptr-alike types.
But I don't want to bother with == operators until we have proper C++20.
Removed trivial conversion for atomic_ptr because it's heavyweight.
Fix atomic_ptr load() edge case.
Implement atomic_ptr::peek_op() to make possible to reduce load() overhead.
Implement atomic_ptr::compare_exchange() and friends.
Implement null_ptr constant, remove nullptr_t assignment/construction.
Notification with "phantom value" could be useful in theory.
As an alternative way of sending signal from the notifier.
But current implementation is just useless.
Also fixed slow_mutex (not used anywhere).
It enables Transparend Huge Pages for some regions on Linux.
Although it can't be actively useful, it seems to do something.
Maybe it's even harmful for recompilers.
Static initialization is all-zeros anyway.
But constexpr was killing my intellisense.
And probably also affected compile time.
Also make some internal structures hidden ("static").
Billions of events can reduce to thousands, saving CPU time a few s.
Only 4 slots are available (arbitrarily), and only 1 is usually used.
Other slots are used only for waiting on multiple atomics.
Takes ~80 ns instead of ~40 ns, same about deallocation.
Loops don't exist here, only 4-level semaphore tree.
Worst case only happens with concurrence, not from looping.
This optimization is not really necessary at current state of RPCS3.
This is more like to test C++ compilers and MSVC u128 implementation.
Remove unnecessary optimization from cond_alloc().
Optimistic case was absolutely dominating anyway.
Although the whole function is a dirty hack.
Now scanning through all threads is faster.
Compress 16-bit ref counter and two 48+64 bit slot allocators.
This allowed to remove some weird unnecessary logic paths.
Adjust hashtable size to keep it the same.
Instead of plain waiting while equal to some value,
it can be something like less, or greater, or even bitcount.
But it's a draft and untested. Hopefully doesn't break anything.
Hashtable increased and flatten, tree-alike extensions removed.
Some things simplified, so it can actually decrease perf a bit.
But most platforms shouldn't be affected.
Removed limit of 56 waiters per pointer.
Real limit now is about 65535.