oo             oo     oo                                                                                            20230824  
  OO     ,oOOo. oOOooo oOOooo ,oOOo. ,oOOOo                                                                           
  OO     OO  OO  OO     OO    OO  OO OO..            THIS IS FICTITIOUS, FROM FICTIONAL PERSONA, NO IDENTIFICATION WITH ACTUAL
  OO.    OO  OO  OO.    OO.   OO"""' `"""OO          PEOPLE|PRODUCTS|EMPLOYERS|BUSINESSES  IS INTENDED  OR SHOULD BE  INFERRED
  `Ooooo `OooO'  `Oooo  `Oooo `Ooooo ooooO'          SPAM HOLE: FIRST AND LAST NAME  JOINED WITHOUT A SPACE AT  PROTONMAIL.COM
________________________________________________________________________________________________________________________________
[SPC]

IN PROGRESS ...


________________________________________________________________________________________________________________________________
[SPC] Page Faults & BO Alloc
Post on the mechanics of CPU/GPU communication. Using AMDgpu-based timing results on the SteamDeck as an example, but relating to the larger picture of PC GFX APIs like Vulkan/etc.


Both RADV and AMDVLK: flush/invalidate of mapped memory ranges is a NOP. So bus-crossing dGPU traffic to HOST_VISIBLE is automatically snooping CPU caches. The type without HOST_CACHED is Write-Combined [WC] on store, and Uncached [UC] on read. The type with HOST_CACHED is non-WC/UC.


In AMDgpu (the kernel driver), likely DEVICE_LOCAL maps to AMDGPU_GEM_DOMAIN_VRAM (also the carve out on APUs) and the non-DEVICE_LOCAL maps to AMDGPU_GEM_DOMAIN_GTT.


AMD+RADV added {DEVICE_COHERENT_BIT_AMD, DEVICE_UNCACHED_BIT_AMD} variations to the core 4 memory types. Likely to support GPU crash debug. But also provides a way to avoid needing to write-back (flush) GPU caches before CPU read. Likely AMDgpu kernel flag mapping below.


This AMDGPU_GEM_CREATE_CPU_GTT_USWC appears to toggle on Write-Combine [WC] for CPU store, and Uncached [UC] for CPU reads (cases of HOST_VISIBLE without HOST_CACHED).
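A minimal sketch of the guessed mapping described above, written as a lookup from Vulkan memory property bits to a GEM domain plus create flags. The helper name gem_from_vk and the local enum names are hypothetical. The numeric values are stand-ins believed to mirror vulkan_core.h and amdgpu_drm.h, and should be checked against the real headers. This is not taken from either driver's source.

```c
#include <stdint.h>

/* Local stand-ins so the sketch builds without Vulkan or libdrm headers.
   ASSUMPTION: values mirror the real headers; verify before relying on them. */
enum { MEM_DEVICE_LOCAL = 0x1, MEM_HOST_VISIBLE = 0x2, MEM_HOST_CACHED = 0x8 };
enum { GEM_DOMAIN_GTT = 0x2, GEM_DOMAIN_VRAM = 0x4 };
enum { GEM_CREATE_CPU_GTT_USWC = 1 << 2 };

typedef struct { uint32_t domain; uint64_t flags; } GemChoice;

/* Hypothetical helper expressing the guess in the text:
   DEVICE_LOCAL -> VRAM, otherwise GTT; HOST_VISIBLE without HOST_CACHED -> USWC. */
static GemChoice gem_from_vk(uint32_t vk_props) {
    GemChoice c;
    c.domain = (vk_props & MEM_DEVICE_LOCAL) ? GEM_DOMAIN_VRAM : GEM_DOMAIN_GTT;
    c.flags = 0;
    if ((vk_props & MEM_HOST_VISIBLE) && !(vk_props & MEM_HOST_CACHED))
        c.flags |= GEM_CREATE_CPU_GTT_USWC; /* WC stores, UC loads on the CPU map */
    return c;
}
```

So HOST_VISIBLE alone would come out as GTT+USWC, and HOST_VISIBLE with HOST_CACHED as plain GTT.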


For review from ChipsAndCheese Deck bandwidths: ~71 GB/s GPU, ~43 GB/s DMA, ~34 GB/s shader copy CPU<->GPU, ~25 GB/s CPU/CPU, and damn, brutal 0.27 GB/s CPU mapped GPU buffer reads, 0.71 GB/s CPU mapped GPU buffer writes.
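To put those rates in terms of wall time, a small helper that converts a bandwidth into seconds per transfer. The function name is made up for illustration, and it assumes 1 GB = 1e9 bytes.

```c
#include <stdint.h>

/* Seconds needed to move `bytes` at `gb_per_s`, taking 1 GB as 1e9 bytes. */
static double transfer_seconds(uint64_t bytes, double gb_per_s) {
    return (double)bytes / (gb_per_s * 1e9);
}
```

For a 4 GiB buffer this gives roughly 15.9 s at 0.27 GB/s (CPU mapped reads), roughly 6.0 s at 0.71 GB/s (CPU mapped writes), and roughly 0.13 s at 34 GB/s (shader copy).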


And going direct to AMDgpu instead of VK on the Deck shows these kinds of bandwidths (non-DEVICE_LOCAL, HOST_VISIBLE with HOST_CACHED and without). So using Write-Combined is amazingly painful for stores.


Tangential Notes


Back to SteamDeck Numbers
GTT (HOST_VISIBLE) + USWC (non-HOST_CACHED) takes 1 sec for 1st 4 GiB BO alloc, but *30 sec* for 2nd 4 GiB alloc. Kernel driver time for memory allocation (maybe page table related) can be brutal (general comment for PCs too).

Related, I think Chips&Cheese CPU/GPU link DMA and Compute Bandwidths are only measured to a GTT+USWC buffer (only supporting useless uncached CPU reads). Bandwidth exceeds the CPU's bus capacity, implies it is 'garlic' or GPU bus only accesses, direct to DRAM.


Since RADV+AMDVLK don't support user-space CPU mapped flush | invalidate, this implies the only supported mapping to read from DRAM direct is USWC (uncached R and write+combine W). Doing GPU stores to a CPU mapped GTT without USWC would be cripplingly slow (limited by the snooping bus rate). But this is unfortunately the only option available for CPU read back. So if doing a shader store, it better be only a few waves and running in parallel.

"Use GTT because it's as fast as VRAM on the Deck", could only work if GTT+USWC, as that would be only way to get Garlic (direct to DRAM high bandwidth bus). GTT without USWC would need Onion (slow snooping bus) because AMDgpu's only no-CPU map option is for VRAM!


Summary of the theory on best Steam Deck practices. This is the plan for my deck-compute-only driver too. Theory -> as in I haven't yet verified the GPU-side parts (my driver isn't that far along yet).


Possible to do better? MAYBE! CPU readback actually has 2 problems, 1st the slow GPU-side copy (4 GiB copy via snooping bus could be almost 4 sec, but direct to DRAM via USWC might be just 160 ms). Also CPU only has 4 MiB of L3, so the majority of 4 GiB will be uncached later. Believe UC MTYPE (uncached) forces the CPU into serialized behavior. My test was single thread 8-byte/access reads. VMOVDQA_M256_YMM! Looks like Zen2 might be able to get 32-byte/access via VMOVDQA, and going multi-threaded (8 thread), that might be a 32x speed up.

If so might be able to approach under a GB/s for the CPU-side part (UC multithread via VMOVDQA), which would be close enough to the non-USWC running single threaded using the cache. What you'd really want here as a band-aid workaround is ability for the GPU to act as if the CPU map was USWC (so go direct to DRAM), but have the CPU map act as non-USWC, so it goes through the cache. Then some kind of hack to flush the tiny 4 MiB L3 and lower caches on the CPU.
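The 32x figure is just access width times thread count. A sketch of that estimate, assuming uncached loads are fully serialized per thread so throughput scales with bytes per access, and that threads scale perfectly (both optimistic). The function name is hypothetical.

```c
/* Ideal speedup over a 1-thread baseline when each thread issues wider
   uncached loads. ASSUMPTION: perfect scaling in both width and threads. */
static unsigned uc_read_speedup(unsigned base_bytes, unsigned wide_bytes, unsigned threads) {
    if (base_bytes == 0) return 0;
    return (wide_bytes / base_bytes) * threads;
}
```

With 8-byte baseline loads, 32-byte loads, and 8 threads: (32 / 8) * 8 = 32.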

CPU readback (reading and summing 8-byte) GTT without USWC.



Going multi-core on cached readback doesn't really help much. Proper test, parked threads waiting on futex, signal, last-1st active core timing.

Now CPU readback GTT+USWC.



So measurements match theory, going multi-core with MOVNTDQA uncached read on GTT+USWC can be made to match 1 thread GTT (without USWC). Both those results above had been using a pair of 4 GiB allocations. The GTT+USWC one used one 4 GiB GTT+USWC for timing, and one 4 GiB GTT (unused). And that test suffered from a 30 sec GTT BO allocation. So something was going very wrong in the page mapping. When I rerun same test with just one 4 GiB GTT+USWC allocation, the 30 sec stall is gone, and the performance also changes.



Oh!
Perhaps there is some resource limit that kills perf if too much memory gets mapped, page faults? Top with thread cumulative results doesn't show anything significantly different between the 4 GiB and 8 GiB runs in terms of page faults ... suggests it must be something else.


And yet there is obviously a bug in my multi-core tests, you can tell directly from the page fault numbers, only one thread is taking all the faults. So it is back to finding my coding error (fail). Lunch break and fixed the bug. Two runs now and leaving threads open to get TOP results. First run definitely soaks up the page faults, second run is page fault free (expected). Both around only 7 GB/s.


And the 8 GiB of BO mapped, but only 4 GiB used run. The first pass gets only 1 GiB/s and the second gets 7 GiB/s. Page fault number is similar to last run, can only conclude page fault costs exploded?


30 sec BO alloc time + super low bandwidth on 1st pass only (where page faults happen) suggest that Linux Kernel logic explodes in cost if too many pages are used in this way. ~8K faults for 512 MiB accessed / thread = 64 KiB/fault ... X86-64 has either 4 KiB or 2 MiB for page size. So not using large pages (fail). Probably mapping 16 pages per fault. Not sure if this implies anything about GPU page size (but certainly hoping it isn't 4 KiB, ouch).
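The fault granularity arithmetic above, as a sketch (function names are made up for illustration):

```c
#include <stdint.h>

/* Average bytes made resident per page fault. */
static uint64_t bytes_per_fault(uint64_t bytes_touched, uint64_t faults) {
    return faults ? bytes_touched / faults : 0;
}

/* How many pages of `page_bytes` each fault appears to map. */
static uint64_t pages_per_fault(uint64_t bytes_touched, uint64_t faults, uint64_t page_bytes) {
    return page_bytes ? bytes_per_fault(bytes_touched, faults) / page_bytes : 0;
}
```

512 MiB over 8192 faults is 65536 bytes per fault, which is 16 pages of 4 KiB.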

Some other very rough measured numbers of GTT+USWC with 8 cores splitting 4 GiB of BO.



I think these seem plausible now (so maybe no more code bugs). One takeaway of all this is that you need to pre-warm the page tables for large mapped buffers (by touching all pages) when the user isn't waiting on results. And if you are doing batch jobs that {open device, send data to GPU, get data back, close device} you are screwed!

And lastly (maybe) comparison of GTT and GTT+USWC both 8 threads splitting streaming through 4 GiB of mapping (second pass, no page fault issues):



So an alternative option summary for those who don't want the extra GPU-side GPU to CPU mapped buffer copy step.


If anyone is looking to repro the 30 sec stall: amdgpu_bo_alloc() one 4 GiB GTT+USWC, then one 4 GiB GTT buffer, the second alloc causes the Deck to become unresponsive for 30 seconds.

Note, madvise() with MADV_HUGEPAGE on mapped 4 GiB region doesn't do anything (still faults at 64 KiB granularity), and none of these MADV_{WILLNEED|POPULATE_READ|POPULATE_WRITE} have any effect either (still waits until use before faulting, causing low initial effective bandwidth).

64KiB strided write through 4 GiB GTT+USWC (to pre-fault) costs the same as writing full 4 GiB, roughly 3 seconds. So it is quite literally massive page fault overhead for 1st access. No possible workaround found at this time for initial load time problems.
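A sketch of that strided pre-fault pass. It does not make the cost go away (as measured above), it only lets you pick when to pay it. The function name is hypothetical, and it writes zeros, so it assumes the buffer holds nothing yet. Shown against a plain pointer, since the mapping call itself is outside this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Write one byte every `stride` bytes so every fault granule gets mapped.
   Clobbers the touched bytes with 0. Returns the number of touches. */
static size_t prefault_strided(volatile uint8_t *base, size_t bytes, size_t stride) {
    size_t touches = 0;
    if (stride == 0) return 0;
    for (size_t off = 0; off < bytes; off += stride) {
        base[off] = 0;
        touches++;
    }
    return touches;
}
```

For the 4 GiB mapping with a 64 KiB stride that is 65536 touches.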

If doing two 4 GiB GTT+USWC allocations, there is also a 30 second BO allocate cost on the 2nd one. And this makes the initial page fault cost for access to the first 4 GiB take another 30 seconds. Effectively hangs the machine for a full minute.

Doing two 4 GiB GTT allocations (without USWC) doesn't incur any of the 30 second stalls. So that problem is specific to big USWC allocations. However the initial access page fault problem (extra 3 seconds) is there, so a secondary problem with just mapping lots of APU memory.

Allocation of one 4 GiB GTT first then a 4 GiB GTT+USWC doesn't see the 30 second allocation stall. Almost like anything post a big USWC alloc is poisoned. And after mapping both, then accessing the GTT only, the 30 second time for page faulting comes back. Even if you don't map the 2nd GTT+USWC, the 30 second initial page faulting time is still there, so the act of mapping doesn't matter, simply doing the BO allocation had already doomed the Linux page management.

Despite header docs which imply flag only works on DOMAIN_VRAM, using DOMAIN_GTT+AMDGPU_GEM_CREATE_NO_CPU_ACCESS is apparently what you want for non-mapped GTT allocations. Mapped 4 GiB GTT with 4 GiB GTT+NO_CPU, drops the initial page fault time from 30 seconds to 3 sec.


________________________________________________________________________________________________________________________________
[SPC] Blame Onat

Workspace

One day a friend Onat said to me that on Linux Steam Deck the Vulkan driver is in user space, and it is possible to even have both RADV and the AMD Vulkan drivers running on the system at the same time ...

And That Was Enough to Seed the Idea
An idea that could not be ignored, a platform still exists, where in theory one could ship a game with their own generated shader binary and no driver. A way out of GPU API Hell! This was the day I stopped my other Windows PC projects, and became an exclusive Steam Deck developer.

________________________________


[EAT] Life of a Steak
Sometimes old technology is the best. Like when making Steaks. Salt, short dry age, season, WOOD FIRE, eat.




________________________________


[X68]
This was an idea for a simplified machine-level x86-64 interface ...


Resources


Register Naming
Characters {0-9} and {A-F} reserved for hex numbers. So the 16 registers are mapped {G-V}.
_G_ _H_ _I_ _J_ _K_ _L_ _M_ _N_ _O_ _P_ _Q_ _R_ _S_ _T_ _U_ _V_
rax rcx rdx rbx rsp rbp rsi rdi r8_ r9_ r10 r11 r12 r13 r14 r15
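The letter mapping is just an offset from 'G' into the hardware register encoding order. A small sketch, with hypothetical function names, that also accepts the lowercase g-v form:

```c
#include <stddef.h>

/* X68 letter -> x86-64 register encoding 0..15, or -1 if not a register
   letter (hex digits 0-9 and A-F are deliberately rejected). */
static int x68_reg_index(char c) {
    if (c >= 'g' && c <= 'v') c = (char)(c - 'g' + 'G'); /* accept lowercase */
    if (c < 'G' || c > 'V') return -1;
    return c - 'G';
}

/* X68 letter -> conventional 64-bit register name, or NULL. */
static const char *x68_reg_name(char c) {
    static const char *const names[16] = {
        "rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
        "r8",  "r9",  "r10", "r11", "r12", "r13", "r14", "r15" };
    int i = x68_reg_index(c);
    return i < 0 ? NULL : names[i];
}
```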

Addressing Mode Syntax


Sizing


Maths


Examples


________________________________________________________________________________________________________________________________

The 'Right' Font





X68 at 1:2x2


X68 at 1:1

An alternative for even smaller text? (Outtake)


At 1:2x2


At 1:1

________________________________________________________________________________________________________________________________

Windows/Linux ABIs


For reference...
Reviewing stack operations on x86-64. Stack grows down, RSP points to last written entry.
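A toy model of that rule, using an array index in place of RSP. The names are made up, there are no bounds checks, and each slot stands for one 8-byte entry.

```c
#include <stdint.h>

/* rsp is an index into mem; 64 means "empty" (one past the highest slot). */
typedef struct { uint64_t mem[64]; uint64_t rsp; } ToyStack;

static void toy_init(ToyStack *s) { s->rsp = 64; }

/* PUSH: move RSP down first, then store, so RSP points at the new entry. */
static void toy_push(ToyStack *s, uint64_t v) { s->rsp -= 1; s->mem[s->rsp] = v; }

/* POP: load from RSP first, then move RSP back up. */
static uint64_t toy_pop(ToyStack *s) { uint64_t v = s->mem[s->rsp]; s->rsp += 1; return v; }
```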


Reviewing Windows and Linux ABI register conventions.
 N X86 X68 WIN LXN
 = === === === ===
 0 rax g   r   r   (return value)
 1 rcx h   a0  a3
 2 rdx i   a1  a2
 3 rbx j   sav sav
 4 rsp k   sav sav (stack pointer)
 5 rbp l   sav sav
 6 rsi m   sav a1
 7 rdi n   sav a0
 8 r8  o   a2  a4
 9 r9  p   a3  a5
 a r10 q   vol vol
 b r11 r   vol vol
 c r12 s   sav sav
 d r13 t   sav sav
 e r14 u   sav sav
 f r15 v   sav sav
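The argument columns of the table as lookups, returning the N column (register encoding) for integer/pointer argument i. Function names are hypothetical. Floating point arguments follow different rules and are not covered.

```c
/* Windows x64: a0..a3 in rcx, rdx, r8, r9. Returns -1 for stack arguments. */
static int win64_arg_reg(int i) {
    static const int regs[4] = {1, 2, 8, 9};
    return (i >= 0 && i < 4) ? regs[i] : -1;
}

/* Linux (System V): a0..a5 in rdi, rsi, rdx, rcx, r8, r9. -1 for stack arguments. */
static int sysv_arg_reg(int i) {
    static const int regs[6] = {7, 6, 2, 1, 8, 9};
    return (i >= 0 && i < 6) ? regs[i] : -1;
}
```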

Then the Windows stack conventions. Anything less than RSP can be overwritten any time, thus must move RSP before writing below a set RSP point. Before a CALL, RSP must be 16-byte aligned. There is a 32-byte 'shadow' region reserved for called function usage.
...
[RSP+0x28] A5
[RSP+0x20] A4
[RSP+0x18] not A3 (R9  shadow)
[RSP+0x10] not A2 (R8  shadow)
[RSP+0x08] not A1 (RDX shadow)
[RSP+0x00] not A0 (RCX shadow) ... 16-byte aligned
(return_goes_here)

Linux conventions.
...
[RSP+0x08] A7
[RSP+0x00] A6
(return_goes_here)
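Both layouts reduce to a little arithmetic. A sketch with hypothetical function names, integer/pointer arguments only, offsets measured from RSP just before the CALL:

```c
#include <stdint.h>

/* Offset of stack argument i, or -1 if it travels in a register.
   Windows: slot i sits at 8*i because a0..a3 still own shadow slots. */
static int win64_stack_arg_offset(int i) { return i >= 4 ? 8 * i : -1; }
static int sysv_stack_arg_offset(int i)  { return i >= 6 ? 8 * (i - 6) : -1; }

/* Bytes of outgoing call area for n arguments, rounded up to 16.
   Windows always includes the 32-byte shadow region, even for n = 0. */
static uint32_t win64_call_area(int n) {
    uint32_t bytes = 32u + (n > 4 ? 8u * (uint32_t)(n - 4) : 0u);
    return (bytes + 15u) & ~15u;
}
static uint32_t sysv_call_area(int n) {
    uint32_t bytes = n > 6 ? 8u * (uint32_t)(n - 6) : 0u;
    return (bytes + 15u) & ~15u;
}
```

So a 6 argument call needs 48 bytes on Windows and nothing on Linux.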

________________________________________________________________________________________________________________________________

Windows Terminal Remap


Docs for taking Windows terminal codes and mapping them into simple 8-bit single byte codes (for a portable editor).

//_______________________________________________/WINDOWS:KEYDOC
// _/KEYS\_____
// EN end
// ES escape
// BS backspace
// DN down
// DL delete
// HM home
// IN insert
// LF left
// PD page down
// PU page up
// RN return
// RT right
// SP space
// TB tab
// UP up
// _/EXCEPTIONS\___________________________________
// CTRL+h aliases CTRL+BACKSPACE
// CTRL+i aliases TAB
// CTRL+j aliases CTRL+RETURN
// CTRL+m aliases RETURN
// CTRL+[ aliases ESCAPE aliases control code start
// NO SHIFT+{BACKSPACE,RETURN,SPACE}
// NO  CTRL+{`-=;',.z,TAB,SPACE}
// NO   ALT+{TAB,RETURN}
// _/INPUT\_________________________
// =-=============-==-==-==-==-==-==
// A space........ 1b 20
// A ' ........... 1b 27
// A , ........... 1b 2c
// A - ........... 1b 2d
// A . ........... 1b 2e
// A / ........... 1b 2f
// A = ........... 1b 3d
// A 0 ........... 1b 30
// . . ........... .. ..
// A 9 ........... 1b 39
// A ; ........... 1b 3b
// A [ ........... 1b 5b
// A \ ........... 1b 5c
// A ] ........... 1b 5d
// A ` ........... 1b 60
// A a ........... 1b 61
// . . ........... .. ..
// A z ........... 1b 7a
// A backspace ... 1b 7f
// =-=============-==-==-==-==-==-==
// S tab ......... 1b 5b 5a
// =-=============-==-==-==-==-==-==
//   insert ...... 1b 5b 32 7e
//   delete ...... 1b 5b 33 7e
//   page up ..... 1b 5b 35 7e
//   page down ... 1b 5b 36 7e
//   up .......... 1b 5b 41
//   down ........ 1b 5b 42
//   right ....... 1b 5b 43
//   left ........ 1b 5b 44
//   end ......... 1b 5b 46
//   home ........ 1b 5b 48
// =-=============-==-==-==-==-==-==
// S insert ...... 1b 5b 32 36 32 7e
// S delete ...... 1b 5b 33 36 32 7e
// S page up ..... 1b 5b 35 36 32 7e
// S page down ... 1b 5b 36 36 32 7e
// S up .......... 1b 5b 31 3b 32 41
// S down ........ 1b 5b 31 3b 32 42
// S right ....... 1b 5b 31 3b 32 43
// S left ........ 1b 5b 31 3b 32 44
// S end ......... 1b 5b 31 3b 32 46
// S home ........ 1b 5b 31 3b 32 48
// =-=============-==-==-==-==-==-==
// C insert ...... 1b 5b 32 36 35 7e
// C delete ...... 1b 5b 33 36 35 7e
// C page up ..... 1b 5b 35 36 35 7e
// C page down ... 1b 5b 36 36 35 7e
// C up .......... 1b 5b 31 3b 35 41
// C down ........ 1b 5b 31 3b 35 42
// C right ....... 1b 5b 31 3b 35 43
// C left ........ 1b 5b 31 3b 35 44
// C end ......... 1b 5b 31 3b 35 46
// C home ........ 1b 5b 31 3b 35 48
// =-=============-==-==-==-==-==-==
// _/OUTPUT_MATCHING\_________ _/OUTPUT_CUSTOM\___________
// ==-==- ==-==- ==-==- ==-==- ==-==- ==-==- ==-==- ==-==-
// 00     20 SP  40 @   60 `   80     a0 SPa c0     e0 ` a
// 01 a c 21 !   41 A   61 a   81     a1 INa c1 INc e1 a a
// 02 b c 22 "   42 B   62 b   82     a2 DLa c2 DLc e2 b a
// 03 c c 23 #   43 C   63 c   83     a3 PUa c3 PUc e3 c a 
// 04 d c 24 $   44 D   64 d   84     a4 PDa c4 PDc e4 d a  
// 05 e c 25 %   45 E   65 e   85     a5 UPa c5 UPc e5 e a 
// 06 f c 26 &   46 F   66 f   86     a6 DNa c6 DNc e6 f a  
// 07 g c 27 '   47 G   67 g   87     a7 RTa c7 RTc e7 g a 
// 08 h c 28 (   48 H   68 h   88     a8 LFa c8 LFc e8 h a  
// 09 TB  29 )   49 I   69 i   89 TBs a9 ENa c9 ENc e9 i a  
// 0a j c 2a *   4a J   6a j   8a     aa HMa ca HMc ea j a  
// 0b k c 2b +   4b K   6b k   8b     ab     cb     eb k a  
// 0c l c 2c ,   4c L   6c l   8c     ac , a cc     ec l a  
// 0d RN  2d -   4d M   6d m   8d     ad - a cd     ed m a  
// 0e n c 2e .   4e N   6e n   8e     ae . a ce     ee n a  
// 0f o c 2f /   4f O   6f o   8f     af / a cf     ef o a  
// 10 p c 30 0   50 P   70 p   90     b0 0 a d0     f0 p a
// 11 q c 31 1   51 Q   71 q   91 INs b1 1 a d1 IN  f1 q a
// 12 r c 32 2   52 R   72 r   92 DLs b2 2 a d2 DL  f2 r a
// 13 s c 33 3   53 S   73 s   93 PUs b3 3 a d3 PU  f3 s a  
// 14 t c 34 4   54 T   74 t   94 PDs b4 4 a d4 PD  f4 t a  
// 15 u c 35 5   55 U   75 u   95 UPs b5 5 a d5 UP  f5 u a  
// 16 v c 36 6   56 V   76 v   96 DNs b6 6 a d6 DN  f6 v a  
// 17 w c 37 7   57 W   77 w   97 RTs b7 7 a d7 RT  f7 w a  
// 18 x c 38 8   58 X   78 x   98 LFs b8 8 a d8 LF  f8 x a  
// 19 y c 39 9   59 Y   79 y   99 ENs b9 9 a d9 EN  f9 y a  
// 1a z c 3a :   5a Z   7a z   9a HMs ba     da HM  fa z a  
// 1b ES  3b ;   5b [   7b {   9b     bb ; a db [ a fb    
// 1c \ c 3c <   5c \   7c |   9c     bc     dc \ a fc    
// 1d ] c 3d =   5d ]   7d }   9d     bd = a dd ] a fd    
// 1e     3e >   5e ^   7e ~   9e     be     de     fe    
// 1f / c 3f ?   5f _   7f BS  9f     bf     df     ff BSa
// ==-==- ==-==- ==-==- ==-==- ==-==- ==-==- ==-==- ==-==-
// _/ENCODING_6_CHARS_INTO_32BIT\_________________________________
// char 0 - 7-bit 
// char 1 - 7-bit
// char 2 - 0,31-5a -> 30-5a -> 0-2a -> 0-39 -> 6-bit
// char 3 - 0,36,3b,7e
//  0000000
//  0110110 ... extract 2 bits
//  0111011
//  1111110
//     ab 
// char 4 - 0,32,35
//  000000
//  110010 ... extract 2 bits
//  110101
//     ab
// char 5 - 0,41,,7e - can just use 7-bit
// ---------------------------------------------------------------
// 11111111111111110000000000000000
// fedcba9876543210fedcba9876543210
// ================================
// .0000000........................  char 0 [lower 7-bits]
// ........1111111.................  char 1 [lower 7-bits]
// ...............222222...........  char 2 [clamp(c-0x30,0,0x39)]
// .....................33.........  char 3 [(c>>2)&3]
// .......................44.......  char 4 [(c>>1)&3]
// .........................5555555  char 5 [lower 7-bits]
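A sketch of the packing described by the bit layout above, reading the shifts straight off the diagram. The function names are made up, and padding short sequences with zero bytes is an assumption taken from the leading 0 in each char's value list.

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t clamp_u(int v, int lo, int hi) {
    return (uint32_t)(v < lo ? lo : v > hi ? hi : v);
}

/* Pack up to 6 input bytes (missing bytes count as 0) into one 32-bit key. */
static uint32_t key_pack(const uint8_t *seq, size_t len) {
    uint8_t c[6] = {0, 0, 0, 0, 0, 0};
    for (size_t i = 0; i < len && i < 6; i++) c[i] = seq[i];
    return ((uint32_t)(c[0] & 0x7f) << 24)            /* char 0: bits 30..24 */
         | ((uint32_t)(c[1] & 0x7f) << 17)            /* char 1: bits 23..17 */
         | (clamp_u((int)c[2] - 0x30, 0, 0x39) << 11) /* char 2: bits 16..11 */
         | ((uint32_t)((c[3] >> 2) & 3) << 9)         /* char 3: bits 10..9  */
         | ((uint32_t)((c[4] >> 1) & 3) << 7)         /* char 4: bits  8..7  */
         |  (uint32_t)(c[5] & 0x7f);                  /* char 5: bits  6..0  */
}
```

For example SHIFT+up (1b 5b 31 3b 32 41) and CTRL+up (1b 5b 31 3b 35 41) differ only in the 2-bit char 4 field, so they still land on different keys.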

________________________________________________________________________________________________________________________________

SPIR-V Notes


Aim
The point of this was to look at the possibility to replace GLSL with some simplified virtual assembly language (something that is closer to 1:1 mapping to GCN ISA), and see if that can be expressed in SPIR-V. I believe the answer to that is YES. Notice that multi-component values like uvec4 get reduced to OpLoad and OpStore without using OpPhi given a branch, so a compiler would need to be able to handle optimizing with loads and stores. Which implies it would be easy to just pre-allocate N*4 registers as N multi-component values, then use load/store to access.

Tangentially, it is possible just in the examples below to see how SPIR-V is a great example of bad engineering, as SPIR-V obfuscates the meaning of code with very poor information density.

SPIR-V Reference
SPIR-V 1.0 Spec

Minimal SPIR-V File?
__GLSL__
#version 450
layout(local_size_x=64)in;
void main(){}

__SPIR-V__
; SPIR-V
; Version: 1.0
; Generator: Khronos Glslang Reference Front End; 7
; Bound: 11
; Schema: 0
               OpCapability Shader
          %1 = OpExtInstImport "GLSL.std.450"
               OpMemoryModel Logical GLSL450
               OpEntryPoint GLCompute %main "main"
               OpExecutionMode %main LocalSize 64 1 1
               OpSource GLSL 450
               OpName %main "main"
               OpDecorate %gl_WorkGroupSize BuiltIn WorkgroupSize
       %void = OpTypeVoid
          %3 = OpTypeFunction %void
       %uint = OpTypeInt 32 0
     %v3uint = OpTypeVector %uint 3
    %uint_64 = OpConstant %uint 64
     %uint_1 = OpConstant %uint 1
%gl_WorkGroupSize = OpConstantComposite %v3uint %uint_64 %uint_1 %uint_1
       %main = OpFunction %void None %3
          %5 = OpLabel
               OpReturn
               OpFunctionEnd
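The "; Version", "; Generator", "; Bound", "; Schema" comment lines in the disassembly come from the five-word header of the binary module. A minimal sketch of reading that header and splitting an instruction's first word, assuming the words have already been loaded in host byte order. Struct and function names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { uint32_t major, minor, generator, bound, schema; } SpvHeader;

/* Parse the 5-word module header. Returns 1 on success, 0 on bad input. */
static int spv_parse_header(const uint32_t *w, size_t words, SpvHeader *out) {
    if (words < 5 || w[0] != 0x07230203u) return 0; /* SPIR-V magic number */
    out->major = (w[1] >> 16) & 0xff;
    out->minor = (w[1] >> 8) & 0xff;
    out->generator = w[2];
    out->bound = w[3];
    out->schema = w[4];
    return 1;
}

/* First word of every instruction: word count in the high 16 bits,
   opcode in the low 16 bits. */
static uint32_t spv_word_count(uint32_t first_word) { return first_word >> 16; }
static uint32_t spv_opcode(uint32_t first_word) { return first_word & 0xffffu; }
```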

Adding Buffer Binding
__GLSL__
#version 450
layout(set=0,binding=1,std430)buffer b0_{uvec4 b0[4096];};
layout(local_size_x=64)in;
void main(){}

__SPIR-V__
; SPIR-V
; Version: 1.0
; Generator: Khronos Glslang Reference Front End; 7
; Bound: 17
; Schema: 0
               OpCapability Shader
          %1 = OpExtInstImport "GLSL.std.450"
               OpMemoryModel Logical GLSL450
               OpEntryPoint GLCompute %main "main"
               OpExecutionMode %main LocalSize 64 1 1
               OpSource GLSL 450
               OpName %main "main"
               OpName %b0_ "b0_"
               OpMemberName %b0_ 0 "b0"
               OpName %_ ""
               OpDecorate %_arr_v4uint_uint_4096 ArrayStride 16
               OpMemberDecorate %b0_ 0 Offset 0
               OpDecorate %b0_ BufferBlock
               OpDecorate %_ DescriptorSet 0
               OpDecorate %_ Binding 1
               OpDecorate %gl_WorkGroupSize BuiltIn WorkgroupSize
       %void = OpTypeVoid
          %3 = OpTypeFunction %void
       %uint = OpTypeInt 32 0
     %v4uint = OpTypeVector %uint 4
  %uint_4096 = OpConstant %uint 4096
%_arr_v4uint_uint_4096 = OpTypeArray %v4uint %uint_4096
        %b0_ = OpTypeStruct %_arr_v4uint_uint_4096
%_ptr_Uniform_b0_ = OpTypePointer Uniform %b0_
          %_ = OpVariable %_ptr_Uniform_b0_ Uniform
     %v3uint = OpTypeVector %uint 3
    %uint_64 = OpConstant %uint 64
     %uint_1 = OpConstant %uint 1
%gl_WorkGroupSize = OpConstantComposite %v3uint %uint_64 %uint_1 %uint_1
       %main = OpFunction %void None %3
          %5 = OpLabel
               OpReturn
               OpFunctionEnd

Buffer Load, Component Modify, Buffer Store
__GLSL__
#version 450
layout(set=0,binding=1,std430)buffer b0_{uvec4 b0[4096];};
layout(local_size_x=64)in;
void main(){uvec4 u=b0[0];u.x+=1u;b0[0]=u;}

__SPIR-V__
; SPIR-V
; Version: 1.0
; Generator: Khronos Glslang Reference Front End; 7
; Bound: 32
; Schema: 0
               OpCapability Shader
          %1 = OpExtInstImport "GLSL.std.450"
               OpMemoryModel Logical GLSL450
               OpEntryPoint GLCompute %main "main"
               OpExecutionMode %main LocalSize 64 1 1
               OpSource GLSL 450
               OpName %main "main"
               OpName %u "u"
               OpName %b0_ "b0_"
               OpMemberName %b0_ 0 "b0"
               OpName %_ ""
               OpDecorate %_arr_v4uint_uint_4096 ArrayStride 16
               OpMemberDecorate %b0_ 0 Offset 0
               OpDecorate %b0_ BufferBlock
               OpDecorate %_ DescriptorSet 0
               OpDecorate %_ Binding 1
               OpDecorate %gl_WorkGroupSize BuiltIn WorkgroupSize
       %void = OpTypeVoid
          %3 = OpTypeFunction %void
       %uint = OpTypeInt 32 0
     %v4uint = OpTypeVector %uint 4
%_ptr_Function_v4uint = OpTypePointer Function %v4uint
  %uint_4096 = OpConstant %uint 4096
%_arr_v4uint_uint_4096 = OpTypeArray %v4uint %uint_4096
        %b0_ = OpTypeStruct %_arr_v4uint_uint_4096
%_ptr_Uniform_b0_ = OpTypePointer Uniform %b0_
          %_ = OpVariable %_ptr_Uniform_b0_ Uniform
        %int = OpTypeInt 32 1
      %int_0 = OpConstant %int 0
%_ptr_Uniform_v4uint = OpTypePointer Uniform %v4uint
     %uint_1 = OpConstant %uint 1
     %uint_0 = OpConstant %uint 0
%_ptr_Function_uint = OpTypePointer Function %uint
     %v3uint = OpTypeVector %uint 3
    %uint_64 = OpConstant %uint 64
%gl_WorkGroupSize = OpConstantComposite %v3uint %uint_64 %uint_1 %uint_1
       %main = OpFunction %void None %3
          %5 = OpLabel
          %u = OpVariable %_ptr_Function_v4uint Function
         %18 = OpAccessChain %_ptr_Uniform_v4uint %_ %int_0 %int_0
         %19 = OpLoad %v4uint %18
               OpStore %u %19
         %23 = OpAccessChain %_ptr_Function_uint %u %uint_0
         %24 = OpLoad %uint %23
         %25 = OpIAdd %uint %24 %uint_1
         %26 = OpAccessChain %_ptr_Function_uint %u %uint_0
               OpStore %26 %25
         %27 = OpLoad %v4uint %u
         %28 = OpAccessChain %_ptr_Uniform_v4uint %_ %int_0 %int_0
               OpStore %28 %27
               OpReturn
               OpFunctionEnd

Now With Simple Conditional
__GLSL__
#version 450
layout(set=0,binding=1,std430)buffer b0_{uvec4 b0[4096];};
layout(local_size_x=64)in;
void main(){uvec4 u=b0[0];if(u.x!=0u)u.x+=1u;else u.x+=2u;b0[0]=u;}

__SPIR-V__
; SPIR-V
; Version: 1.0
; Generator: Khronos Glslang Reference Front End; 7
; Bound: 44
; Schema: 0
               OpCapability Shader
          %1 = OpExtInstImport "GLSL.std.450"
               OpMemoryModel Logical GLSL450
               OpEntryPoint GLCompute %main "main"
               OpExecutionMode %main LocalSize 64 1 1
               OpSource GLSL 450
               OpName %main "main"
               OpName %u "u"
               OpName %b0_ "b0_"
               OpMemberName %b0_ 0 "b0"
               OpName %_ ""
               OpDecorate %_arr_v4uint_uint_4096 ArrayStride 16
               OpMemberDecorate %b0_ 0 Offset 0
               OpDecorate %b0_ BufferBlock
               OpDecorate %_ DescriptorSet 0
               OpDecorate %_ Binding 1
               OpDecorate %gl_WorkGroupSize BuiltIn WorkgroupSize
       %void = OpTypeVoid
          %3 = OpTypeFunction %void
       %uint = OpTypeInt 32 0
     %v4uint = OpTypeVector %uint 4
%_ptr_Function_v4uint = OpTypePointer Function %v4uint
  %uint_4096 = OpConstant %uint 4096
%_arr_v4uint_uint_4096 = OpTypeArray %v4uint %uint_4096
        %b0_ = OpTypeStruct %_arr_v4uint_uint_4096
%_ptr_Uniform_b0_ = OpTypePointer Uniform %b0_
          %_ = OpVariable %_ptr_Uniform_b0_ Uniform
        %int = OpTypeInt 32 1
      %int_0 = OpConstant %int 0
%_ptr_Uniform_v4uint = OpTypePointer Uniform %v4uint
     %uint_0 = OpConstant %uint 0
%_ptr_Function_uint = OpTypePointer Function %uint
       %bool = OpTypeBool
     %uint_1 = OpConstant %uint 1
     %uint_2 = OpConstant %uint 2
     %v3uint = OpTypeVector %uint 3
    %uint_64 = OpConstant %uint 64
%gl_WorkGroupSize = OpConstantComposite %v3uint %uint_64 %uint_1 %uint_1
       %main = OpFunction %void None %3
          %5 = OpLabel
          %u = OpVariable %_ptr_Function_v4uint Function
         %18 = OpAccessChain %_ptr_Uniform_v4uint %_ %int_0 %int_0
         %19 = OpLoad %v4uint %18
               OpStore %u %19
         %22 = OpAccessChain %_ptr_Function_uint %u %uint_0
         %23 = OpLoad %uint %22
         %25 = OpINotEqual %bool %23 %uint_0
               OpSelectionMerge %27 None
               OpBranchConditional %25 %26 %33
         %26 = OpLabel
         %29 = OpAccessChain %_ptr_Function_uint %u %uint_0
         %30 = OpLoad %uint %29
         %31 = OpIAdd %uint %30 %uint_1
         %32 = OpAccessChain %_ptr_Function_uint %u %uint_0
               OpStore %32 %31
               OpBranch %27
         %33 = OpLabel
         %35 = OpAccessChain %_ptr_Function_uint %u %uint_0
         %36 = OpLoad %uint %35
         %37 = OpIAdd %uint %36 %uint_2
         %38 = OpAccessChain %_ptr_Function_uint %u %uint_0
               OpStore %38 %37
               OpBranch %27
         %27 = OpLabel
         %39 = OpLoad %v4uint %u
         %40 = OpAccessChain %_ptr_Uniform_v4uint %_ %int_0 %int_0
               OpStore %40 %39
               OpReturn
               OpFunctionEnd

________________________________________________________________________________________________________________________________

Shipping Only 16-Bit


TLDR, going to require 16-bit shader support! One man show, avoiding having both a 32-bit and 16-bit shader variation is highly desired. Don't want to limit the tech based on decisions required to support 32-bit. Would it be practical to develop and ship just a 16-bit version? This is Vulkan only, no XBox (for certain), and no Playstation (lack of time). Don't need 16-bit buffer access, and wave size is managed via spec constants, so relatively easy.

vulkan.gpuinfo.org - VkPhysicalDeviceFeatures::shaderInt16 - VK_KHR_shader_float16_int8::shaderFloat16



Adding up the top cards (all NVIDIA) on Steam Hardware Survey shows at least 30% of the market won't support 16-bit. But also that at least 30% of the market will support 16-bit, and that is good enough for me.

________________________________________________________________________________________________________________________________

X68 Epilogue


Keeping notes here in case I ever choose to revisit ...
The time for this project was replaced by [SBC]. The ability to write GPU binaries and command buffers was too great, and after that there is effectively no need for any CPU logic except the ugly that is interfacing inside a modern OS.

________________________________


[PR0] Random Prototype 0
TODO, out of time perhaps more on these later...

Raw View Without Hole Filling Or Temporal Reconstruction


Majority of white dots are actually holes in the scene. This is using stratified visibility, it doesn't necessarily find an intersection for each pixel.





With Temporal Reconstruction and Grain


This uses a spatial temporal reconstruction that also fills holes and removes noise.









________________________________________________________________________________________________________________________________
[PR0] Octahedral Framebuffer
Implemented a 1024x512 rectangular layout octahedral framebuffer, with a 360 degree cylindrical projection in a three stage pipeline. The intermediate stage samples the octahedron into a VGA-like 720x256 resolution target with a warped vertical. This is to avoid perspective induced undersampling. Final stage applies the CRT shader.

CRT shader has progressively thicker scanlines at top and bottom due to the vertical warpage. Running full 16:9 but with a strong vignette. The shadow mask is blended out towards the center of the screen to increase peak brightness. Horizontal blur is increased towards the right and left for the chromatic aberration.

Both the sampling of the octahedron and the VGA intermediate images are done with linear filtering and a wide gaussian kernel. Monochromatic tonemapping is applied afterwards. Followed by linear-space colorizing of the greyscale. The octahedron output is 32-bit packed {8-bit 1/(1+luma) in gamma 2.0, 13-bit x, 11-bit y}. Sub-pixel position will later be used in filtering.
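A sketch of the 32-bit packed layout, covering only the bit packing of already-quantized fields (8 + 13 + 11 = 32). The field order, with the luma term in the top byte, is an assumption, as are the function names. The gamma 2.0 encode of 1/(1+luma) is left out.

```c
#include <stdint.h>

/* ASSUMPTION: luma in bits 31..24, x in bits 23..11, y in bits 10..0. */
static uint32_t oct_pack(uint32_t luma8, uint32_t x13, uint32_t y11) {
    return ((luma8 & 0xffu) << 24) | ((x13 & 0x1fffu) << 11) | (y11 & 0x7ffu);
}

static void oct_unpack(uint32_t p, uint32_t *luma8, uint32_t *x13, uint32_t *y11) {
    *luma8 = p >> 24;
    *x13 = (p >> 11) & 0x1fffu;
    *y11 = p & 0x7ffu;
}
```

With a 1024x512 target, 10 and 9 bits cover the integer pixel position, which would leave 3 and 2 bits for the sub-pixel part.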

Shots below have linear temporal average of simple ray traced dummy scene, enough to show first pass of post pipeline. Have some changes to try before moving on to the next step in development.

Not planning on doing bloom, due to the wide gaussian kernel bright areas naturally have a slight bleed, that is likely enough hint of brightness. Not going to do DOF, following the strategy, if it cannot be done really well, don't bother. Not doing motion blur, as there are no hard edges for the eye to get stuck on, focusing on peak frame rate instead. Not doing local contrast adaption or sharpening of any kind, don't like the look of negative lobes inverting the edge.






________________________________________________________________________________________________________________________________
[PR0] GPU Program Bring Up
Not yet to the fun part, still laying down foundation. Added a frame counter as a push constant for the single dispatch. This will be enough to branch on to get to even and odd frame permutations. So that double buffering works properly. Dispatch sizing is a fixed 2K workgroups. Which in theory at 64 VGPRs/wave target, is good for a 128 CU machine (in classic GCN arch). Classic GCN has maxed out at a 64 CU machine. Will have to modify this stuff later.
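The 128 CU figure can be checked with the usual classic GCN occupancy arithmetic. A sketch assuming 4 SIMDs per CU, 256 VGPRs per lane per SIMD, a cap of 10 waves per SIMD, VGPRs as the only limiter, and one wave per workgroup. The function names are hypothetical.

```c
/* Waves resident per CU when VGPR count is the only limiter. */
static unsigned gcn_waves_per_cu(unsigned vgprs_per_wave) {
    unsigned per_simd = vgprs_per_wave ? 256u / vgprs_per_wave : 10u;
    if (per_simd > 10u) per_simd = 10u;   /* classic GCN cap per SIMD */
    return 4u * per_simd;                 /* 4 SIMDs per CU */
}

/* CUs that `workgroups` one-wave workgroups fill in a single resident pass. */
static unsigned gcn_cus_filled(unsigned workgroups, unsigned vgprs_per_wave) {
    unsigned w = gcn_waves_per_cu(vgprs_per_wave);
    return w ? workgroups / w : 0;
}
```

At 64 VGPRs/wave that is 4 waves per SIMD and 16 per CU, so 2048 workgroups fill 128 CUs, or two full passes on a 64 CU part.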

Bringing up helper GLSL structure, binding points, etc. Initial testing involved writing to the front buffer a color based on the frame count push constant. Front buffer rendering seems to "work" both when full-screen (low latency) and windowed (high latency). Example screen shot shows {red,black} color due to window compositor reading during program writes.


Software Spin Wait - With pipelined execution, a wait on a "barrier" would be expected to not wait. The barrier only functions as a safe-guard in case something goes wrong. Initial check to see if a barrier is signaled should thus involve a cached read (because there would be an expectation of other waves later reading the same value). Only if the initial read fails should an uncached spin wait be invoked. Can see from the RDNA ISA Guide that hardware supports both in scalar and vector loads GLC=0 reads which hit on the cache, and GLC=1 reads which force a fetch from L2, and evict the line afterwards ("miss-evict"). Ideal spin wait would be the following,

// Want this logic in SALU only (no burning vector ALU cycles).
if(ramR.barrier<signal){        // (A.) SMEM GLC=0 read, only enter if not signaled.
 while(ramRV.barrier<signal){}} // (B.) SMEM GLC=1 read, spin while not signaled.

The first wave would miss on the first (A.) read. If the barrier passes, all future waves will hit and quickly pass. If the spin (B.) is invoked, the second wave will miss on the first (A.) read, because GLC=1 evicts the line post-read. But the third will hit.
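
The hit/miss sequence described above can be played out with a toy model. This is only a sketch in C of a single cache line, assuming GLC=0 fills the line on a miss and GLC=1 always misses and then evicts. It also assumes that the spinning wave needs just one (B.) read before the signal arrives. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

// Toy model of the one cache line holding the barrier value.
typedef struct { bool valid; } Line;

// GLC=0 read: hits if the line is resident, otherwise misses and fills the line.
static bool ReadGlc0(Line *l) { assert(l); bool hit = l->valid; l->valid = true; return hit; }

// GLC=1 read: always fetches from L2 (a miss) and evicts the line afterwards.
static void ReadGlc1(Line *l) { assert(l); l->valid = false; }

// Counts how many waves hit on their (A.) read. If 'firstWaveSpins' is set, wave 0
// found the barrier not signaled and ran one (B.) read that observed the signal.
static int CountAHits(int waves, bool firstWaveSpins) {
  Line line = { false };
  int hits = 0;
  for (int w = 0; w < waves; w++) {
    if (ReadGlc0(&line)) hits++;
    if (w == 0 && firstWaveSpins) ReadGlc1(&line);
  }
  return hits;
}
```

For three waves, the no-spin case gives two hits (waves two and three), while the spin case gives one hit (wave three only).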

API ASK #0 - Ability to provide branch hints (coherent vs divergent, and expected branch outcome). The compiler output (see below) is always the slower option. It keeps the most uncommon path inline resulting in the most inefficient execution.

HARDWARE ASK #0 - Would be nice to have a way to force a miss on read but leave the line in the cache.

AMD DRIVER BUG #0 - AMD driver sees "readonly" then ignores the other memory qualifiers. This is both a correctness and optimization bug. So there is no way to get SMEM loads with GLC=1 set. This pushes the overhead into the VMEM and VALU paths for (B.). If the first wave sees a non-signaled state in (A.) then likely all waves on that cache will always invoke the slower (B.) spin loop, because nothing will be refreshing the K$ line.

// layout(set=0,binding=2,std430)readonly buffer ssbo3_ {RamT ramRV;};
s_buffer_load_dword  s0, s[12:15], null               // 000000000020: F4200006 FA000000

// layout(set=0,binding=2,std430)volatile readonly buffer ssbo3_ {RamT ramRV;};
s_buffer_load_dword  s0, s[12:15], null               // 000000000020: F4200006 FA000000

// layout(set=0,binding=2,std430)coherent readonly buffer ssbo3_ {RamT ramV;};
s_buffer_load_dword  s0, s[12:15], null               // 000000000020: F4200006 FA000000

// layout(set=0,binding=2,std430)volatile buffer ssbo3_ {RamT ramRV;};
buffer_load_dword  v1, v0, s[12:15], 0 dlc glc        // 000000000020: E030C000 80030100

// layout(set=0,binding=2,std430)coherent buffer ssbo3_ {RamT ramRV;};
buffer_load_dword  v1, v0, s[12:15], 0 dlc glc        // 000000000020: E030C000 80030100

AMD DRIVER BUG #1 - The above code with the VMEM spin workaround won't work due to this second bug. AMD driver incorrectly hoists the VMEM GLC=1 load outside the loop, leading to incorrect behavior. In the example below the signal state is zero (instead of less than some number).

if(ramR.barrier!=0u){ // (A.).
 while(ramV.barrier!=0u){}} // (B.).

// This is (A.).
  s_buffer_load_dword  s0, s[12:15], null               // 000000000028: F4200006 FA000000
  s_waitcnt     lgkmcnt(0)                              // 000000000030: BF8CC07F
  s_cmp_eq_i32  s0, 0                                   // 000000000034: BF008000
  s_cbranch_scc1  label_006C                            // 000000000038: BF85000C
// This should be in the (B.) loop.
  buffer_load_dword  v1, v0, s[12:15], 0 dlc glc        // 00000000003C: E030C000 80030100
  s_nop         0x0000                                  // 000000000044: BF800000
  s_nop         0x0000                                  // 000000000048: BF800000
  s_nop         0x0000                                  // 00000000004C: BF800000
  s_nop         0x0000                                  // 000000000050: BF800000
  s_nop         0x0000                                  // 000000000054: BF800000
  s_nop         0x0000                                  // 000000000058: BF800000
  s_nop         0x0000                                  // 00000000005C: BF800000
// This is (B.).
label_0060:
  s_waitcnt     vmcnt(0)                                // 000000000060: BF8C3F70
  v_cmp_eq_i32  vcc_lo, 0, v1                           // 000000000064: 7D040280
  s_cbranch_vccz  label_0060                            // 000000000068: BF86FFFD
label_006C:

AMD DRIVER BUG #2 - Now trying to trick the compiler into doing the right thing. First with 'subgroupElect()', which generates the same bug as prior, but adds another performance bug. This should just be a simple operation to save and set EXEC to 1, then restore afterwards. But instead the compiler acts as if it is already in divergent control flow ('ff1' is find first 1).

if(subgroupElect()){if(ramR.barrier!=0u){while(ramV.barrier!=0u){}}}

  // This is the subgroupElect() code for a wave sized workgroup in known coherent execution.
  v_mbcnt_lo_u32_b32  v1, -1, 0                         // 000000000010: D7650001 000100C1
  s_ff1_i32_b64  s0, exec                               // 000000000018: BE80147E
  v_mbcnt_hi_u32_b32  v1, -1, v1                        // 00000000001C: D7660001 000202C1
  . . .
  s_and_saveexec_b64  s[10:11], vcc                     // 000000000028: BE8A246A

AMD DRIVER BUG #3 - Trying to work around the above performance bug (since running work for just lane 0 will be needed elsewhere). Using 'gl_LocalInvocationID.x' paired with 'layout(local_size_x=32)' won't work either. This example gets 'wave_size(32)' in the disassembly. The compiler still uses VALU work instead of just masking EXEC.

if(gl_LocalInvocationID.x==0u){if(ramR.barrier!=0u){while(ramV.barrier!=0u){}}}

  v_cmp_eq_i32  vcc_lo, 0, v0                           // 000000000010: 7D040080
  s_and_saveexec_b32  s0, vcc_lo                        // 000000000014: BE803C6A
  s_cbranch_execz  label_006C                           // 000000000018: BF880014

AMD DRIVER BUG #4 - Last attempt to work around the performance bug also fails. The driver will always use the slow path burning VALU instruction(s) for what should map to one 'S_AND_SAVEEXEC_B32 s[...],1' scalar instruction on Navi. Using subgroup ops results in 'wave_size(64)' in the disassembly.

if(gl_SubgroupInvocationID==0u){if(ramR.barrier!=0u){while(ramV.barrier!=0u){}}}

  v_mbcnt_lo_u32_b32  v1, -1, 0                         // 000000000010: D7650001 000100C1
  v_mbcnt_hi_u32_b32  v1, -1, v1                        // 000000000018: D7660001 000202C1
  v_cmp_gt_i32  vcc, 1, v1                              // 000000000020: 7D080281
  s_and_saveexec_b64  s[10:11], vcc                     // 000000000024: BE8A246A

AMD DRIVER BUG #5 - There is another obvious bug in the above disassembly. The program is using 'layout(local_size_x=32)' and using 'gl_SubgroupInvocationID' or 'subgroupElect()' causes the compiler to switch to 'wave_size(64)' mode with the high 32 lanes doing nothing. This also means it is using 'V_MBCNT_HI_U32_B32' which wouldn't be needed in wave32 mode.

AMD DRIVER BUG #6 - Using 'VK_EXT_subgroup_size_control' doesn't support forcing wave32 mode on Navi.

Takeaways?
If software prevents you from accessing it, the hardware doesn't actually exist.
All reasonable efforts to optimize on AMD are thwarted by its software stack. No choice but to ship with the slowest path on the hardware.

One optimization is possible however, early in execution, going to process 'gl_LocalInvocationID' to write a waveID into an SGPR. This way the VGPR for 'gl_LocalInvocationID' can be freed. Later will use 'gl_SubgroupInvocationID' when required to build a lane index (which materializes lane index from ALU instead of keeping it in a VGPR).

________________________________________________________________________________________________________________________________
[PR0] ABI Crossing
Thoughts on interfacing a custom language to library calls, didn't end up using the custom language for this project...

Stack Crossing


System ABIs use 16-byte aligned stacks: rsp must be 16-byte aligned at the call instruction, before the return address is pushed.

ABI REQUIREMENTS AFTER A CALL INSTRUCTION
=========================================
[rsp+8] SYSV 7th argument, WIN first entry of 32-byte shadow space, this is 16-byte aligned
[rsp+0] Return address
[rsp-8] Free space

RSP BEFORE A CALL IS THUS 16-BYTE ALIGNED
=========================================
[rsp+0] SYSV 7th argument, WIN first entry of 32-byte shadow space

The custom language will use 8-byte aligned stacks, because they are not used for 16-byte data. The ABI crossing call will need to start by aligning the stack, and restore it before the return.

// Aligned case,
//  [64] [rsp   ] return address
//  [56] [rsp-8 ] save rsp ....... skipped
//  [48] [rsp-16] save rsp ....... final rsp points here 
// -----------------------------------------------------
// Unaligned case,
//  [56] [rsp   ] return address
//  [48] [rsp-8 ] save rsp ....... final rsp points here
//  [40] [rsp-16] save rsp ....... unused
// -----------------------------------------------------
enter:
 mov [rsp-8],rsp
 mov [rsp-16],rsp
 add rsp,-8
 and rsp,~15
 ...
leave:
 mov rsp,[rsp+0]
 ret
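
The address arithmetic of the enter sequence can be checked outside of assembly. A small sketch in C, assuming only what the listing shows ('add rsp,-8' then 'and rsp,~15'). The function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

// Mirrors 'add rsp,-8' followed by 'and rsp,~15' from the enter sequence.
static uint64_t AlignedRsp(uint64_t rsp) {
  assert((rsp & 7) == 0); // incoming stacks are 8-byte aligned
  return (rsp - 8) & ~(uint64_t)15;
}

// Distance below the incoming rsp where the final rsp lands (8 or 16).
// This is why the enter sequence stores rsp at both [rsp-8] and [rsp-16]:
// whichever slot the final rsp points at holds the value to restore.
static uint64_t SaveSlotOffset(uint64_t rsp) { return rsp - AlignedRsp(rsp); }
```

For the two cases in the comment block, an incoming rsp of 64 ends at 48 (16 below), and an incoming rsp of 56 also ends at 48 (8 below).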

ABIs


Only supporting x86-64 in 64-bit mode for this project. There are a few points to cross between my custom non-language and the rest of the system: the C ABI, which is different for Windows vs everyone else, and system calls on non-Windows platforms. The 'a' is an argument (numbered from zero in this table), the 'non' is non-volatile, 'vol' is volatile, and everything else is volatile. The 'WIN' is the Windows ABI, the 'SYSV' is shared across unix/BSDs, the 'KRN' is the Linux kernel syscall convention.

REG       X86-FAIL  WIN     SYSV    KRN
========  ========  ======  ======  =======
r0 (rax)  ........  return  return  num/ret  --- reuse for call address or syscall number
r1 (rcx)  ........  a0 ...  a3 ...  vol ...  --- save
r2 (rdx)  ........  a1 ...  a2 ...  a2 ....  --- save
r3 (rbx)  ........  non ..  non ..  non ...
r4 (rsp)  SIB ....  stack   stack   stack .  --- save
r5 (rbp)  RIP ....  non ..  non ..  non ...
r6 (rsi)  ........  non ..  a1 ...  a1 ....  --- save on non-win
r7 (rdi)  ........  non ..  a0 ...  a0 ....  --- save on non-win
r8 .....  ........  a2 ...  a4 ...  a4 ....  --- save
r9 .....  ........  a3 ...  a5 ...  a5 ....  --- save
r10 ....  ........  vol ..  vol ..  a3 ....  --- save
r11 ....  ........  vol ..  vol ..  vol ...  --- save
r12 ....  SIB ....  non ..  non ..  non ...
r13 ....  RIP ....  non ..  non ..  non ...
r14 ....  ........  non ..  non ..  non ...
r15 ....  ........  non ..  non ..  non ...

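Stacks must be 16-byte aligned at the call instruction. The following show the state of the stack after the call has pushed the return address. Arguments here are numbered from one (a5 is the fifth argument).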

WIN STACK CONVENTION
====================
| [rsp+0x80] a16
  [rsp+0x78] a15
| [rsp+0x70] a14
  [rsp+0x68] a13
| [rsp+0x60] a12
  [rsp+0x58] a11
| [rsp+0x50] a10
  [rsp+0x48] a9
| [rsp+0x40] a8
  [rsp+0x38] a7
| [rsp+0x30] a6
  [rsp+0x28] a5
| [rsp+0x20] Shadow space
  [rsp+0x18] Shadow space
| [rsp+0x10] Shadow space
  [rsp+0x08] Shadow space
| [rsp+0x00] Return address (rsp after call, so rsp+8 is 16-byte aligned)

SYSV STACK CONVENTION
=====================
| [rsp+0x50] a16
  [rsp+0x48] a15
| [rsp+0x40] a14
  [rsp+0x38] a13
| [rsp+0x30] a12
  [rsp+0x28] a11
| [rsp+0x20] a10
  [rsp+0x18] a9
| [rsp+0x10] a8
  [rsp+0x08] a7
| [rsp+0x00] Return address (rsp after call, so rsp+8 is 16-byte aligned)

Expected language crossing granularity is low, so I'm not inclined to do anything other than make it easy to manage. A language crossing will include a stack crossing as well, as I'm not going to keep rsp 16-byte aligned. This is bloody ugly, but it will work. Will have to check if I have any language crossings using floating point.

ENGINE CONVENTION
=================
| [rsp+0xb8] r11
  [rsp+0xb0] r10
| [rsp+0xa8] r9
  [rsp+0xa0] r8
| [rsp+0x98] rdi
  [rsp+0x90] rsi
| [rsp+0x88] rsp
  [rsp+0x80] rdx (save volatile)
----------------
| [rsp+0x78] a16 (args)
  [rsp+0x70] a15
| [rsp+0x68] a14
  [rsp+0x60] a13
| [rsp+0x58] a12
  [rsp+0x50] a11
| [rsp+0x48] a10
  [rsp+0x40] a9
| [rsp+0x38] a8
  [rsp+0x30] a7 (adjusted pointer sysv only)     
| [rsp+0x28] a6 (register copy sysv only)     
  [rsp+0x20] a5 (register copy sysv only)     
| [rsp+0x18] a4 (copied back to registers before the call)  
  [rsp+0x10] a3  
| [rsp+0x08] a2  
  [rsp+0x00] a1  

// On entry, 
//  - rax is the address to call
//  - rcx is future stack pointer for call
entry:
 mov [rcx+0x80],rdx
 mov [rcx+0x88],rsp
#if SYSV
 mov [rcx+0x90],rsi
 mov [rcx+0x98],rdi 
#endif
 mov [rcx+0xa0],r8
 mov [rcx+0xa8],r9
 mov [rcx+0xb0],r10
 mov [rcx+0xb8],r11
#if WIN
 mov rsp,rcx
 mov r9,[rcx+0x18]
 mov r8,[rcx+0x10]
 mov rdx,[rcx+0x08]
 mov rcx,[rcx+0x00]
#endif
#if SYSV
 lea rsp,[rcx+0x30]
 mov r8,[rcx+0x20]
 mov r9,[rcx+0x28]
 mov rdi,[rcx+0x00]
 mov rsi,[rcx+0x08]
 mov rdx,[rcx+0x10]
 mov rcx,[rcx+0x18]
#endif
 call rax
#if WIN
 mov rdx,[rsp+0x80]
 mov r8,[rsp+0xa0]
 mov r9,[rsp+0xa8]
 mov r10,[rsp+0xb0]
 mov r11,[rsp+0xb8]
 mov rsp,[rsp+0x88]
#endif
#if SYSV
 mov rdx,[rsp+0x80-0x30]
 mov rsi,[rsp+0x90-0x30]
 mov rdi,[rsp+0x98-0x30]
 mov r8,[rsp+0xa0-0x30]
 mov r9,[rsp+0xa8-0x30]
 mov r10,[rsp+0xb0-0x30]
 mov r11,[rsp+0xb8-0x30]
 mov rsp,[rsp+0x88-0x30]
#endif
 ret

________________________________


[GPU] Links


________________________________________________________________________________________________________________________________
[GPU] 3D Barycentric
Useful for skinning volumetric data
d=1-(a+b+c) ... coordinates must sum to one
r = point to convert into barycentric 
r{a,b,c,d} = points of tetrahedron
{a,b,c} = inv(T)*(r-rd)

T = ... columns are {ra-rd, rb-rd, rc-rd}, where (x1,y1,z1)=ra through (x4,y4,z4)=rd
 x1-x4 x2-x4 x3-x4
 y1-y4 y2-y4 y3-y4
 z1-z4 z2-z4 z3-z4

  INVERT A 3x3 MATRIX
=======================
 a b c
 d e f = A
 g h i

 (ei-fh) -(bi-ch)  (bf-ce)
-(di-fg)  (ai-cg) -(af-cd) * 1/det(A)
 (dh-eg) -(ah-bg)  (ae-bd)
 
det(A)
  (ei-fh) * a -(di-fg) * b + (dh-eg) * c
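
Putting the two pieces together, a minimal sketch in C of the barycentric solve using the adjugate form above. The 'V3' type and 'Bary3D' name are made up for this example, and doubles are used only to keep the sketch simple.

```c
#include <assert.h>

typedef struct { double x, y, z; } V3;

static V3 Sub(V3 p, V3 q) { V3 r = { p.x - q.x, p.y - q.y, p.z - q.z }; return r; }

// Computes {a,b,c,d} for point r relative to tetrahedron {ra,rb,rc,rd}.
// Returns 0 if the tetrahedron is degenerate (det(T) == 0), else 1.
static int Bary3D(V3 r, V3 ra, V3 rb, V3 rc, V3 rd, double out[4]) {
  assert(out);
  // Columns of T are the edges from rd.
  V3 c0 = Sub(ra, rd), c1 = Sub(rb, rd), c2 = Sub(rc, rd), p = Sub(r, rd);
  // Name entries as in the 3x3 inverse: rows {a b c},{d e f},{g h i}.
  double a = c0.x, b = c1.x, c = c2.x;
  double d = c0.y, e = c1.y, f = c2.y;
  double g = c0.z, h = c1.z, i = c2.z;
  double det = (e*i - f*h)*a - (d*i - f*g)*b + (d*h - e*g)*c;
  if (det == 0.0) return 0;
  double s = 1.0 / det;
  out[0] = s * ( (e*i - f*h)*p.x - (b*i - c*h)*p.y + (b*f - c*e)*p.z);
  out[1] = s * (-(d*i - f*g)*p.x + (a*i - c*g)*p.y - (a*f - c*d)*p.z);
  out[2] = s * ( (d*h - e*g)*p.x - (a*h - b*g)*p.y + (a*e - b*d)*p.z);
  out[3] = 1.0 - (out[0] + out[1] + out[2]); // coordinates must sum to one
  return 1;
}
```

A point is inside the tetrahedron when all four outputs are in the 0 to 1 range.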

________________________________________________________________________________________________________________________________
[GPU] Shader Device Clock
VK_KHR_shader_clock - device clock support

TODO: Is NVIDIA's device shader clock a consistent frequency?

________________________________________________________________________________________________________________________________
[GPU] Wave OPs
Suggestion of API and implementation for wave operations.
This is a copy of what I like for personal development.

=========
  TERMS
=========
P1 ... 'predicate' (bool single component)
I1 ... 'integer' (32-bit signed integer single component)
UI1 .. 'unsigned integer'
W1 ... 'word' (16-bit signed integer)
C .... 'coherent' (function is only called from statically or dynamically uniform control flow)
V .... 'volatile' (function can be called in unknown control flow)

=====================================
  AVOIDING PROBLEMS WITH DIVERGENCE
=====================================
Pass around 'P1 laneMask' and only go into divergent control flow locally.
This requires a different style of programming.

P1 laneMask = ...; // Existing lane mask value.
P1 laneMask2 = laneMask & newMask; // Make a local lane mask for a new subset of active threads.
if(laneMask2){ ... } // Do logic which is limited to a subset of lanes.
f(laneMask,...); // Do logic which is limited to older subset of lanes.

Note above, 'f()' gets the lane mask passed in.
So 'f()' is always called from dynamically uniform control flow.
And can thus do any operations that require all lanes active.
The standard method of possibly having dynamically divergent control flow cannot do that.

=======
  API
=======
NO 'gl_SubgroupInvocationID' or 'WaveGetLaneIndex()'
 - INSTEAD only launch 1D workgroups and compute from 'gl_LocalInvocationID.x' and 'SV_GroupThreadID.x'
 - 2D coordinates are always generated from a 1D workgroup due to needing lane swizzling to get perf
 - Shader can avoid the AND operation if workgroup is known to be wave sized
 - May want to maintain a 16-bit lane index to save space in some cases
 - AMD
    - The 'gl_LocalInvocationID.x' is placed in 'v0' before program launch (fast path)
    - Driver will NOT optimize 'gl_SubgroupInvocationID' to 'gl_LocalInvocationID.x'
    - The 'gl_SubgroupInvocationID' gets built (slow path)
       - Using 2 VALU instructions via V_MBCNT_{HI,LO}_U32_B32 in wave64 (1 op in wave32), which is possibly slower

NO GENERIC SHUFFLE VIA 'subgroupShuffle()' OR 'WaveReadLaneAt(,nonUniformValue)'
 - This is because of min spec hardware portability
 - See 'Quad' and 'WaveXor' cases for constrained shuffle usage
 - AMD
    - DS_SWIZZLE_B32 only works in groups of 32 lanes (on GCN and RDNA)
       - Same with the introduction of DS_*PERMUTE_B32 (note AMD ISA guide has some incorrect descriptions on this)
    - GCN hardware is only wave64
    - No way to easily portably force wave32 on RDNA
       - So no way to guarantee usage of DS_BPERMUTE_B32 (wave32 only path)
       - The wave64 path ends up using a V_READLANE_B32 waterfall loop (could be 64 iterations, so unusably slow)
    - Forced to use LDS for this kind of functionality
    - No need for new API to use LDS

NO USING 'subgroupElect()' OR 'WaveIsFirstLane()'
 - Both have the overhead of needing to find the first lane in possibly divergent control flow
 - They are thus slow
 - Instead manually mask to lane 0 via 'if((gl_LocalInvocationID.x & waveSizeMinusOne)==0)'
 - Where the AND part is skipped for wave sized workgroups
 - AMD
    - Driver output for 'subgroupElect()' is expensive
    - It is not optimized for even compile-time known uniform control flow

NO READ-FIRST-LANE
 - Because on some platforms this implies using 'find-first-one' to figure out the first active lane
 - So instead only call from non-divergent control flow and use explicit lane=0 in function calls
 - This will be slightly less optimal (probably not measurable) on AMD, but better overall
 - AMD
    - 'V_READFIRSTLANE_B32 d,s' is a 32-bit instruction
    - 'V_READLANE_B32 d,s,0' is a 64-bit instruction (slightly less optimal)

WITH SOME EXCEPTIONS, NO LANE SHARING BEYOND GROUPS OF 16 LANES
 - Supported on all AMD and NVIDIA hardware
 - Should in theory work on Intel hardware (they can do wave16)
    - Wave16 is their fast path
 - Exceptions
    - Single lane read/write 
    - Ballot

I1 Quad{0,1,2,3,X,Y,D}CI1(I1 v)
 - {0,1,2,3} selects quad element
    - Separate functions force compile-time uniformity (portable fast path)
    - DX: QuadReadLaneAt(,{0,1,2,3})
    - VK: subgroupQuadBroadcast(,{0,1,2,3})
 - {X,Y,D} swap with directional neighbor
    - DX: QuadReadAcross{X,Y,Diagonal}()
    - VK: subgroupQuadSwap{Horizontal,Vertical,Diagonal}()
 - AMD: all DPP ops (fast path)
 - GL: emulation?
    - Can use dFd{x,y}Fine() functions for emulation
    - See 'GPU Pro 2' "Shader Amortization using Pixel Quad Message Passing"
    - However {ES,WebGL} lacks the 'Fine()' functions 
    - ES?: might be able to use 'GL_FRAGMENT_SHADER_DERIVATIVE_HINT' 'GL_NICEST'

void WavePutCI1(inout I1 dst, I1 src, I1 dynamicallyUniformLaneDst, I1 laneSrc)
 - Emulation is possible because 'C' (only callable from dynamically uniform flow control)
 - This could be useful for storing stacks in a VGPR
 - This is a function since it can be mapped without emulation to a hardware instruction on AMD
 - The 'laneSrc' is passed in to enable fast path if wave-sized workgroups vs multi-wave workgroups
 - Usage,
    WavePutCI1(dst, src, 2, (gl_LocalInvocationID.x & waveSizeMinusOne))
    WavePutCI1(dst, src, 2,  gl_LocalInvocationID.x                    )
 - Emulation,
    if(dynamicallyUniformLaneDst == laneSrc) dst = src;
 - AMD: V_WRITELANE_B32
    - Ignores EXEC mask

I1 WaveGetCI1(I1 src, I1 dynamicallyUniformLaneSrc)
 - Return 'src' value from lane 'dynamicallyUniformLaneSrc'
 - For review the 'C' is 'coherent' meaning can only be called from dynamically uniform control flow
 - For review the 'I1' is 32-bit integer
 - AMD: V_READLANE_B32
    - Works for wave32|wave64
    - Ignores EXEC mask

I1 WaveXor1CI1(I1 src)
I1 WaveXor2CI1(I1 src)
I1 WaveXor4CI1(I1 src)
I1 WaveXor8CI1(I1 src)
 - Designed to support 2D reductions from 4x4:1
 - Requires minimum wave16 support (Intel's fast path)
 - VK: subgroupShuffleXor(src, {1,2,4,8})
 - DX
 - AMD
    - subgroupShuffleXor(,{1,2}) uses DPP quad permute
    - subgroupShuffleXor(,4) uses DPP row XOR mask
    - subgroupShuffleXor(,8) uses DPP row rotate by 8
    - subgroupShuffleXor(,16) uses DS_SWIZZLE_B32 (which requires S_WAITCNT, slower)
    - subgroupShuffleXor(,32) uses a horrible amount of code for wave64 (unusably slow)
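
To illustrate how the four XOR functions combine into a 4x4:1 reduction, here is a CPU-side sketch in C that emulates 16 lanes in an array. It is not GPU code. The 'XorStep' and 'XorReduce16' names are hypothetical, and a sum is used as the example reduction.

```c
#include <assert.h>
#include <stdint.h>

enum { LANES = 16 };

// Emulates 'v += WaveXorNCI1(v)' for one N: every lane adds the value held by lane^mask.
static void XorStep(int32_t v[LANES], uint32_t mask) {
  assert(mask == 1 || mask == 2 || mask == 4 || mask == 8);
  int32_t t[LANES];
  for (uint32_t l = 0; l < LANES; l++) t[l] = v[l ^ mask]; // all lanes read first
  for (uint32_t l = 0; l < LANES; l++) v[l] += t[l];       // then all lanes accumulate
}

// Full 16:1 sum. After the four steps every lane holds the total.
static void XorReduce16(int32_t v[LANES]) {
  XorStep(v, 1); XorStep(v, 2); XorStep(v, 4); XorStep(v, 8);
}
```

Since every lane ends with the total, no separate broadcast is needed after the reduction.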

FOR REFERENCE, THE CORRECT WAY TO DO ATOMIC APPEND
 - This gets around stupid driver behavior of AMD
 - AMD pattern matches atomicAdd(,staticUniform) and turns it into garbage
 - Can fix it by atomicAdd(,dynamicallyUniform) where the compiler doesn't see staticUniform
 - Instead do this,
P1 p=...; // Set to true to append, false to not append
UI4 b=subgroupBallot(p);
UI1 stopStupid=gl_LocalInvocationID.x>>31; // Generate a VGPR zero that the compiler doesn't pattern match
UI1 c=subgroupBallotBitCount(b)+stopStupid; // Force the compiler to promote from SGPR to VGPR
UI1 d=0;
if(gl_LocalInvocationID.x==0)d=atomicAdd(...,c); // Do the atomic on lane zero only
UI1 e=subgroupBallotExclusiveBitCount(b); // Factor this work in the latency window of the atomic add
subgroupBarrier(); // Required to be API safe
d=subgroupBroadcast(d,0); // Fast lane zero broadcast avoids 'find-first-one' overhead
d+=e; // Output position for append
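
The index math of the append can be checked on the CPU. The following is a sketch in C that emulates one wave of up to 32 lanes, with a plain integer standing in for the atomic counter. 'WaveAppend' and 'PopCount' are names made up for this example.

```c
#include <assert.h>
#include <stdint.h>

// Portable popcount so the sketch does not depend on compiler builtins.
static uint32_t PopCount(uint32_t x) { uint32_t n = 0; while (x) { x &= x - 1; n++; } return n; }

// Emulates one wave (up to 32 lanes) appending to a shared counter.
// 'ballot' has bit L set when lane L wants to append. 'counter' stands in for the
// atomicAdd target. Writes each lane's output position to 'slot' (only meaningful
// for lanes with their ballot bit set) and returns the number of items appended.
static uint32_t WaveAppend(uint32_t ballot, uint32_t *counter, uint32_t slot[32]) {
  assert(counter && slot);
  uint32_t c = PopCount(ballot);   // subgroupBallotBitCount(b)
  uint32_t d = *counter;           // lane 0: atomicAdd(...,c) returns the old value
  *counter += c;
  for (uint32_t lane = 0; lane < 32; lane++) {
    uint32_t below = ballot & ((1u << lane) - 1u); // ballot bits of lower lanes
    uint32_t e = PopCount(below);                  // subgroupBallotExclusiveBitCount(b)
    slot[lane] = d + e;                            // broadcast of d, then d+=e
  }
  return c;
}
```

One atomic per wave reserves the whole range, and each appending lane gets a unique position inside it.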

________________________________________________________________________________________________________________________________
[GPU] Float Bool Fixes
There are times when there is a need to do bool logic inside floating point numbers. Here is a good starting point for implementing such logic, with comments about implementation on AMD GPUs.


The following get more expensive (extra op).


They could be faster if there was a way to run the hardware in a mode without NaNs, with modified floating point rules.


Detailed logic using new rules.
EQ
==
saturate((-INF*{ 0.0})+INF) ... saturate(( 0.0)+INF) ... saturate(+INF) ... 1.0
saturate((-INF*{+INF})+INF) ... saturate((-INF)+INF) ... saturate( 0.0) ... 0.0
saturate((-INF*{   +})+INF) ... saturate((-INF)+INF) ... saturate( 0.0) ... 0.0

GEZ
===
saturate((+INF*{ 0.0})+INF) ... saturate(( 0.0)+INF) ... saturate(+INF) ... 1.0
saturate((+INF*{-INF})+INF) ... saturate((-INF)+INF) ... saturate( 0.0) ... 0.0
saturate((+INF*{   -})+INF) ... saturate((-INF)+INF) ... saturate( 0.0) ... 0.0
saturate((+INF*{+INF})+INF) ... saturate((+INF)+INF) ... saturate(+INF) ... 1.0
saturate((+INF*{   +})+INF) ... saturate((+INF)+INF) ... saturate(+INF) ... 1.0

LEZ
===
saturate((-INF*{ 0.0})+INF) ... saturate(( 0.0)+INF) ... saturate(+INF) ... 1.0
saturate((-INF*{-INF})+INF) ... saturate((+INF)+INF) ... saturate(+INF) ... 1.0
saturate((-INF*{   -})+INF) ... saturate((+INF)+INF) ... saturate(+INF) ... 1.0
saturate((-INF*{+INF})+INF) ... saturate((-INF)+INF) ... saturate( 0.0) ... 0.0
saturate((-INF*{   +})+INF) ... saturate((-INF)+INF) ... saturate( 0.0) ... 0.0

Case For Hardware FAMA




________________________________________________________________________________________________________________________________
[GPU] Log Depth Encoding
For reference...
// LOG DEPTH ENCODING
// ==================
// - Don't need too much precision around the minimum traversable coordinate
// - Or alternatively can clip on near plane
//    - This logic improves precision by a good amount (1/118 to 1/174)
//    - When s=2047, n=256, a=1/256, m=2^25
// - Encoding: x=log2(z*a+(1-a*n))*b -> {0 to s}
//    - m ... maximum depth value that can be encoded
//    - n ... minimum depth value that can be encoded
//    - s ... maximum step value
//    - z ... depth value to encode {n to m}
//    - a ... controls distribution close to zero
//    - b ... s/log2(m*a+1-n*a)
// - Decoding: z=exp2(x*(1/b))*(1/a)+(n-(1/a))

Why not just mask part of a FP16 value and use that instead of log depth encoding?
Float Toy
//    - Breakdown
//       fedcba9876543210
//       ================
//       s............... sign (ignore)
//       .eeeee.......... exponent (don't want top bit, due to wasted encoding)
//       ......mmmmmmmmmm mantissa
//       ----------------
//       ..eeeemmmmmmm... possible encoding for simple masking
//       ..11111111111... 1.992 (around 2)
//       ..00000000001... 4.8e-7 (around 1/2M)
//    - Using simple masking burns roughly 1/16 of encoding in a linear region
//    - Complex masking can only approach 1/32 of encoding
//    - This neglects lower 3-bits of precision (gets worse if including more bits)
//    - So NO!  
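
The two endpoint patterns in the breakdown can be decoded by hand or with a few lines of C. This sketch assumes the mask covers bits 13 down to 3, so the all-ones pattern is 0x3FF8 and the lowest step is 0x0008. 'Fp16ToDouble' is a name made up for this example and only handles non-negative finite values.

```c
#include <assert.h>
#include <stdint.h>

// Decodes a non-negative, finite FP16 bit pattern to double (sign, INF, NaN not handled).
static double Fp16ToDouble(uint16_t h) {
  assert((h & 0x8000u) == 0 && ((h >> 10) & 31u) != 31u);
  uint32_t e = (h >> 10) & 31u, m = h & 1023u;
  // Denormals: m * 2^-24. Normals: (1024 + m) * 2^(e - 25).
  double v = (e == 0) ? (double)m : (double)(1024u + m);
  int shift = (e == 0) ? -24 : (int)e - 25;
  for (; shift < 0; shift++) v *= 0.5;
  for (; shift > 0; shift--) v *= 2.0;
  return v;
}
```

Here 0x3FF8 decodes to 1.9921875, and 0x0008 decodes to 8/2^24, which is roughly 4.8e-7. The lowest of the 16 exponent codes is the denormal (linear) range, which is where the 1/16 figure comes from.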

________________________________________________________________________________________________________________________________
[GPU] Ultimate Video Quality


  TERMS
=========
latency ... as in input read on GPU to start of first frame's line on CRT (ignoring H blanking)
gi ........ GPU view independent work
gd ........ GPU view translation dependent work
gc ........ GPU view camera angle dependent work

  MAXED OUT GPU
==================
/_gi5_/_gd5_/_gc5_//_gi6_/_gd6_/_gc6_//_gi7_/_gd7_/_gc7_/
[____scanout_4____][____scanout_5____][____scanout_6____]
      ^     ^      ^
      |     |<---->| ... Camera rotation latency (slightly lower) 
      |            |
      |<---------->| ... Camera translation and button to flash latency (slightly higher)

  MAXED OUT GPU - LATENCY INDEPENDENT OF CPU WORK
===================================================
(_cpu6_)           (_cpu7_)           (_cpu8_)
/_gi5_/_gd5_/_gc5_//_gi6_/_gd6_/_gc6_//_gi7_/_gd7_/_gc7_/   -+
[____scanout_4____][____scanout_5____][____scanout_6____]    |
                                                             |
  or                                                         |- same latency
                                                             |
           (_cpu6_)           (_cpu7_)           (_cpu8_)    |
/_gi5_/_gd5_/_gc5_//_gi6_/_gd6_/_gc6_//_gi7_/_gd7_/_gc7_/   -+
[____scanout_4____][____scanout_5____][____scanout_6____]

  NON-MAXED OUT GPU - LATENCY DEPENDENT ON WHEN GPU WORK STARTS
=================================================================
        /_gi5_/_gd5_/_gc5_/        /_gi6_/_gd6_/_gc6_/        /_gi7_/_gd7_/_gc7_/
[________scanout_4________][________scanout_5________][________scanout_6________]
              ^     ^      ^
              |     |<---->|
              |            | ... Lower latency
              |<---------->|

  vs
  
/_gi5_/_gd5_/_gc5_/        /_gi6_/_gd6_/_gc6_/        /_gi7_/_gd7_/_gc7_/
[________scanout_4________][________scanout_5________][________scanout_6________]
      ^     ^              ^
      |     |<------------>|
      |                    | ... Higher latency
      |<------------------>| 
      

________________________________________________________________________________________________________________________________
[GPU] Unlimited Boy
Fantasy console inspired by the 160x144 pixel and 4 shades of grey Gameboy...

Unlimited Boy Concept


________________________________


[PRG] Links


________________________________________________________________________________________________________________________________
[PRG] Lottes6x16 Font

A Bitmap Terminal Font


Designed for monospace text editing.
Lottes6x16.fon - Easy to find for general app usage like in Notepad2.
LottesTerminal6x16.fon - Special version to make work in windows terminal.

As a programmer who sticks to 1080p displays, I use this bitmap font for source editing and windows terminal. The font was made using Fony. Right click on the file and "Install" to install. Use "6x16" in the terminal.


________________________________________________________________________________________________________________________________
[PRG] Page Warming
Something From an Existing 'C' Engine ...
The desire is for the user to have a hitch-free experience. OS design today seems built more around bloatware than around tiny tight binaries. The problem is that pages are not necessarily resident until needed, and faulting them in can be a latency chain nightmare (hitch fest). To work around this problem, on program launch, and again each time the app gets focus, a background thread walks all pages.
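
The page walk itself can be very small. A sketch in C of what such a walk might look like, assuming a power-of-two page size passed in by the caller. 'WarmPages' is a hypothetical name, not the engine's function, and thread creation and the focus hook are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Touches one byte in every page of [base, base+bytes) and returns the page count.
// The volatile read keeps the compiler from removing the loads.
static size_t WarmPages(const void *base, size_t bytes, size_t pageSize) {
  assert(pageSize != 0 && (pageSize & (pageSize - 1)) == 0); // power of two
  if (bytes == 0) return 0;
  uintptr_t p = (uintptr_t)base;
  uintptr_t first = p & ~(uintptr_t)(pageSize - 1);
  uintptr_t last = (p + bytes - 1) & ~(uintptr_t)(pageSize - 1);
  size_t pages = 0;
  for (uintptr_t a = first; ; a += pageSize) {
    // Clamp to the range so only bytes inside [base, base+bytes) are read.
    uintptr_t t = a < p ? p : a;
    (void)*(const volatile uint8_t *)t;
    pages++;
    if (a == last) break;
  }
  return pages;
}
```

The same walk would be run once over the code range and once over the data range.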



All code is done with ROM_ defined to 1, the source file simply includes itself, wrapped with beginning WrmBas() and ending WrmEnd() functions so it becomes possible to easily know the range of addresses for code.

  #define ROM_ 1
  S_ void WrmBas(void){Crash();}
  #include "nvg0.c"
  S_ void WrmEnd(void){Crash();}
  #undef ROM_

All data is placed into one structure (with RAM_ defined to 1, source including itself), so finding start and end is easy.

  #define RAM_ 1
  typedef struct{
   #include "nvg0.c"
   A_(64) I1 end[1024*1024/4];}RamT;S_ A_(64) L1 ramM[sizeof(RamT)/8];
  #define ramR TR_(RamT,L1_(ramM))
  #define ramV TV_(RamT,L1_(ramM))
  #undef RAM_

________________________________________________________________________________________________________________________________
[PRG] Self Modifying Binary

Single File App


Turns out this still works in Win10, but it is likely to not work in the future (for another post).
The concept is simple: instead of having a binary and data file(s), just have a binary, where the application saves its configuration state directly into the binary. Or the step beyond, saving a RAM snapshot into the binary, so the application can easily start up where it last left off, and the user can have any number of save points by having different binaries. Distribution and install of the application is just placing the file wherever you want to run it from. Uninstall is just deleting the binary. Very easy setup, no registry or config file garbage.

The technique is quite simple.


Not shown below, but on exit, the temp file launch could launch the original binary with a command to delete the temp file. This would work to automatically not leave a garbage file around.

Proof of Concept


//
// SIMPLE SELF-CONTAINED GCC 'C' BASED WIN32 SELF-MOD EXE TEST APP
//
// Compile with: gcc sme.c -march=amdfam10 -std=gnu11 -Ofast -o sme.exe -s -lkernel32 -luser32 -lgdi32 -lwinmm
//
// Language tools.
#define E_(x,y) __builtin_expect(x,y)
#define O_ __attribute__((noreturn))
#define R_ __restrict
#define S_ static
#define W_ __attribute__((__stdcall__)) __attribute__((__force_align_arg_pointer__))
//
// Type system.
typedef unsigned char U1;
typedef unsigned short U2;
typedef unsigned int U4;
typedef unsigned long long U8;
typedef U1 *R_ U1R;
typedef U4 *R_ U4R;
#define U1R_(x) ((U1R)(x))
#define U4R_(x) ((U4R)(x))
#define U8_(x) ((U8)(x))
//
// Win32 API for x86-64.
typedef struct{U8 hProcess;U8 hThread;U4 dwProcessId;U4 dwThreadId;}PROCESS_INFORMATION;
typedef struct{U4 cb;U1R lpReserved;U1R lpDesktop;U1R lpTitle;U4 dwX;U4 dwY;U4 dwXSize;U4 dwYSize;U4 dwXCountChars;
 U4 dwYCountChars;U4 dwFillAttribute;U4 dwFlags;U2 wShowWindow;U2 cbReserved2;U8 lpReserved2;U8 hStdInput;U8 hStdOutput;
 U8 hStdError;}STARTUPINFOA; 
//  
W_ U4 CloseHandle(U8);
W_ U4 CopyFileExA(U1R,U1R,U8,U8,U4R,U4);
W_ U8 CreateFileA(U1R,U4,U4,U8,U4,U4,U8);
W_ U4 CreateProcessA(U1R,U1R,U8,U8,U4,U4,U8,U1R,STARTUPINFOA *R_,PROCESS_INFORMATION *R_); 
W_ void ExitProcess(U4);
W_ U4 ReadFile(U8,U1R,U4,U4R,U1R);
W_ U4 SetFilePointer(U8,U4,U4R,U4);
W_ U4 WriteFile(U8,U1R,U4,U4R,U1R);
//
#define INVALID_HANDLE_VALUE (~U8_(0))
enum{
 FILE_SHARE_WRITE=2,
 GENERIC_READ=0x80000000,
 GENERIC_WRITE=0x40000000,
 OPEN_EXISTING=3};
//
// Initialized global data.
S_ U4 d[2]={0xDEADB175,0x01};
//
// Utility functions.
S_ U1 hex[16]={'0','1','2','3','4','5','6','7','8','9','a','b','c','d','e','f'};
S_ U1R Hex(U1R a,U4 v){a[0]=hex[v&15];return a+1;}
S_ U1R HexU1(U1R a,U4 v){Hex(a,v>>4);Hex(a+1,v);return a+2;}
S_ U1R HexU2(U1R a,U4 v){HexU1(a,v>>8);HexU1(a+2,v);return a+4;}
//
// Entry point.
O_ void main(U4 argc,U1R *R_ argv){
 // If not called with arguments (base 'sme.exe' launch).
 if(argc!=2){
  // Copy 'sme.exe' to 'sme-cpy.exe'.
  U4 c[1];CopyFileExA(U1R_("sme.exe"),U1R_("sme-cpy.exe"),0,0,U4R_(c),0);
  // Launch 'sme-cpy.exe 1'.
  S_ STARTUPINFOA si;
  S_ PROCESS_INFORMATION pi;
  si.cb=sizeof(STARTUPINFOA);
  CreateProcessA(0,U1R_("sme-cpy.exe 1"),0,0,1,0,0,0,&si,&pi);
  // Exit the process.
  ExitProcess(0);}
 // Called with arguments (the 'sme-cpy.exe 1' launch).
 // Open access to standard output into console.
 U8 f=CreateFileA(U1R_("CONOUT$"),GENERIC_WRITE,FILE_SHARE_WRITE,0,OPEN_EXISTING,0,0);
 // Write out value after 0xDEADB175 to console.
 U1 b[4]={'.','.','\n',0};HexU1(b,d[1]);
 U4 r[1];WriteFile(f,U1R_(b),4,U4R_(r),0);
 // Open 'sme.exe' for R/W, loop until this succeeds (just in case 'sme.exe' is still open for execute).
 U8 h;while(1){h=CreateFileA(U1R_("sme.exe"),GENERIC_WRITE|GENERIC_READ,FILE_SHARE_WRITE,0,OPEN_EXISTING,0,0);
  if(h!=INVALID_HANDLE_VALUE)break;}
 // Read 'sme.exe' into memory, using something dumb (know file size is under 64 KiB).
 S_ U4 m[16384]; 
 ReadFile(h,U1R_(m),65536,U4R_(r),0);
 // Find 0xDEADB175 offset (know this happens to be U4 aligned).
 U4 o=0;while(o<16380){if(m[o]==0xDEADB175)break;o++;}o=o*4+4;
 // Write out offset to console.
 U1 b2[6]={'.','.','.','.','\n',0};HexU2(b2,o);
 WriteFile(f,U1R_(b2),6,U4R_(r),0);
 // Increment value after 0xDEADB175.
 d[1]++;
 // Write updated value to 'sme.exe'.
 SetFilePointer(h,o,0,0);
 WriteFile(h,U1R_((d+1)),4,U4R_(r),0);
 // Close the file.
 CloseHandle(h);
 // Exit the process.
 while(1)ExitProcess(0);}

________________________________________________________________________________________________________________________________
[PRG] PE Binary Header

PE


Building a simple 64-bit binary from scratch.
References.


Unfortunately Win 10 breaks compatibility with the small PE tricks which worked on Win 7 and prior. Since I don't want to do my own custom importer (due to the risk of Win 10 breaking that compatibility as well), there is a gap of 5 cachelines between the initial file headers and the imports. Overall this burns 5 cachelines (320 bytes) for headers and imports: 3 for the headers and 2 for the imports. Compared to an ideal 4-byte header, that is an 80:1 bloat factor required to get a binary executing on Windows. In an ideal machine the header would just be 4 bytes which get replaced by a 32-bit pointer to a linker function, with the entry point at offset 4, and the entire binary loaded as R/W/E.

First 512 Bytes


Win10 requires a minimum offset and alignment of 512 for a section (and a section is required to get imports). So this packs non-import data into the first 3 cachelines using structure aliasing. There is a 5 cacheline gap to the start of the imports which can be used for whatever the binary wants. Unused fields are missing the '=' and are zero.

      ====================
        IMAGE_DOS_HEADER
      --------------------
{000} U2 e_magic=0x5A4D;
      U2 e_cblp;
                                ======================
                                  IMAGE_NT_HEADERS64
                                ----------------------
{004} U2 e_cp;                  U4 Signature='PE\0\0';
      U2 e_crlc;
                                =====================
                                  IMAGE_FILE_HEADER
                                ---------------------
{008} U2 e_cparhdr;             U2 Machine=0x8664; // IMAGE_FILE_MACHINE_AMD64
      U2 e_minalloc;            U2 NumberOfSections=1;
{012} U2 e_maxalloc;            U4 TimeDateStamp;
      U2 e_ss;
{016} U2 e_sp;                  U4 PointerToSymbolTable;
      U2 e_csum;
{020} U2 e_ip;                  U4 NumberOfSymbols;
      U2 e_cs;              
{024} U2 e_lfarlc;              U2 SizeOfOptionalHeader=120; // Just two directories with aliasing
      U2 e_ovno;                U2 Characteristics=2; // IMAGE_FILE_EXECUTABLE_IMAGE
                                ===========================
                                  IMAGE_OPTIONAL_HEADER64
                                ---------------------------
{028} U2 e_res[0];              U2 Magic=0x20b; // IMAGE_NT_OPTIONAL_HDR64_MAGIC
      U2 e_res[1];              U1 MajorLinkerVersion;
                                U1 MinorLinkerVersion;
{032} U2 e_res[2];              U4 SizeOfCode;
      U2 e_res[3];
{036} U2 e_oemid;               U4 SizeOfInitializedData;
      U2 e_oeminfo;         
{040} U2 e_res2[0];             U4 SizeOfUninitializedData;
      U2 e_res2[1];         
{044} U2 e_res2[2];             U4 AddressOfEntryPoint=604; // Start the repair tool
      U2 e_res2[3];         
{048} U2 e_res2[4];             U4 BaseOfCode;
      U2 e_res2[5];
{052} U2 e_res2[6];             U8 ImageBase=0x400000; // 4 MiB offset
      U2 e_res2[7];
{056} U2 e_res2[8];         
      U2 e_res2[9];
{060} U4 e_lfanew=4;            U4 SectionAlignment=4;
{064}                           U4 FileAlignment=4;
{068}                           U2 MajorOperatingSystemVersion;
                                U2 MinorOperatingSystemVersion;
{072}                           U2 MajorImageVersion;
                                U2 MinorImageVersion;
{076}                           U2 MajorSubsystemVersion=4; // WinNT Win32 version
                                U2 MinorSubsystemVersion;
{080}                           U4 Win32VersionValue;
{084}                           U4 SizeOfImage=BINARY_SIZE;
{088}                           U4 SizeOfHeaders=512;
{092}                           U4 CheckSum;
{096}                           U2 Subsystem=3; // IMAGE_SUBSYSTEM_WINDOWS_CUI
                                U2 DllCharacteristics;
{100}                           U8 SizeOfStackReserve;
{108}                           U8 SizeOfStackCommit=0x100000; // 1 MiB
{116}                           U8 SizeOfHeapReserve;
{124}                           U8 SizeOfHeapCommit=0x100000; // 1 MiB
{132}                           U4 LoaderFlags;
{136}                           U4 NumberOfRvaAndSizes=2; // Required to get import table
{140}                           U8 DataDirectory[0]=0; // Exports need to be empty
      ========================
        IMAGE_SECTION_HEADER
      ------------------------
{148} U1 Name[8];               U4 DataDirectory[1].VirtualAddress=540; // Imports
{152}                           U4 DataDirectory[1].Size=40; // Enough for 2 entries
{156} U4 VirtualSize=BINARY_SIZE-512;
{160} U4 VirtualAddress=512;
{164} U4 SizeOfRawData=BINARY_SIZE-512;
{168} U4 PointerToRawData=512;
{172} U4 PointerToRelocations;
{176} U4 PointerToLinenumbers;
{180} U2 NumberOfRelocations;
      U2 NumberOfLinenumbers;
{184} U4 Characteristics=0xE0000060;
      // IMAGE_SCN_CNT_CODE|
      // IMAGE_SCN_CNT_INITIALIZED_DATA|
      // IMAGE_SCN_MEM_EXECUTE|
      // IMAGE_SCN_MEM_READ|
      // IMAGE_SCN_MEM_WRITE
{188} U4 Unused;
      ==============
        FREE SPACE
      --------------
{192} to {511} - 5 cachelines.
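
The offsets in the table above can be double checked with a small sketch. This assumes a compiler that accepts '#pragma pack' (GCC does), and the structures are hand-declared here for illustration: the field names follow the Windows SDK, but 'FileHeader', 'OptionalHeader', 'NtHeaders', 'NT_OFF' and 'SectionHeaderOffset' are hypothetical names and not part of the builder. With e_lfanew=4 the optional header should land at 28, and the section header at 148, on top of DataDirectory[1].

```c
#include <stddef.h>

typedef unsigned char U1;
typedef unsigned short U2;
typedef unsigned int U4;
typedef unsigned long long U8;

#pragma pack(push,1)
// IMAGE_FILE_HEADER (20 bytes).
typedef struct{U2 Machine;U2 NumberOfSections;U4 TimeDateStamp;U4 PointerToSymbolTable;
 U4 NumberOfSymbols;U2 SizeOfOptionalHeader;U2 Characteristics;}FileHeader;
// IMAGE_OPTIONAL_HEADER64 cut down to 2 data directories (128 bytes).
typedef struct{U2 Magic;U1 MajorLinkerVersion;U1 MinorLinkerVersion;U4 SizeOfCode;
 U4 SizeOfInitializedData;U4 SizeOfUninitializedData;U4 AddressOfEntryPoint;U4 BaseOfCode;
 U8 ImageBase;U4 SectionAlignment;U4 FileAlignment;U2 MajorOperatingSystemVersion;
 U2 MinorOperatingSystemVersion;U2 MajorImageVersion;U2 MinorImageVersion;
 U2 MajorSubsystemVersion;U2 MinorSubsystemVersion;U4 Win32VersionValue;U4 SizeOfImage;
 U4 SizeOfHeaders;U4 CheckSum;U2 Subsystem;U2 DllCharacteristics;U8 SizeOfStackReserve;
 U8 SizeOfStackCommit;U8 SizeOfHeapReserve;U8 SizeOfHeapCommit;U4 LoaderFlags;
 U4 NumberOfRvaAndSizes;U4 DataDirectory[2][2];}OptionalHeader;
// IMAGE_NT_HEADERS64.
typedef struct{U4 Signature;FileHeader File;OptionalHeader Optional;}NtHeaders;
#pragma pack(pop)

// The DOS header field e_lfanew places the NT headers at file offset 4.
enum{E_LFANEW=4};
// File offset of a field inside the aliased NT headers.
#define NT_OFF(field) ((U4)(E_LFANEW+offsetof(NtHeaders,field)))
// The section header starts right after the declared optional header size.
static U4 SectionHeaderOffset(U4 sizeOfOptionalHeader){
 return NT_OFF(Optional)+sizeOfOptionalHeader;}
```

Declaring SizeOfOptionalHeader as 120 instead of the full 128 is what makes Name[8] of the section header alias DataDirectory[1].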

The One Section


This provides the imports and the rest of the binary code and data. I kept all the import data in the section instead of attempting to place it in the prior 512 byte header, just in case Windows checks VA to section range. This part packs the mess of PE import tables and string data into 2 cachelines. It aliases many of the structures based on knowing parts that are not accessed. The "repair tool" copies the function pointers 16 bytes earlier, then restores the original IMAGE_IMPORT_BY_NAME RVAs as would be seen on disk. This enables the header in memory to be stored back to an executable and still function properly. The IMAGE_IMPORT_BY_NAME is tricky because it wants 16-bit alignment and two leading zeros.

      ======================
        FUNCTION ADDRESSES
      ----------------------
{512} U8 LoadLibraryA
{520} U8 GetProcAddress
      ======================
        IMAGE_THUNK_DATA64
      ----------------------
{528} U8 Function=588; // Offset to "LoadLibraryA"
{536} U8 Function=558; // Offset to "GetProcAddress"  U4 unused;
                                                      ===========================
                                                        IMAGE_IMPORT_DESCRIPTOR
                                                      ---------------------------
                                                      U4 OriginalFirstThunk;
{544} U8 Function=0; // End IAT                       U4 TimeDateStamp;
                                                      U4 ForwarderChain;
{552}                                                 U4 Name=580; // Offset to "kernel32"
      ========================
        IMAGE_IMPORT_BY_NAME
      ------------------------
      U1 unused[2];                                   U4 FirstThunk=528; // Offset to Import Address Table
      U1[2]='\0\0'
{560} U1[4]='GetP'                                    U4 OriginalFirstThunk;
      U1[4]='rocA'                                    U4 TimeDateStamp;
{568} U1[4]='ddre'                                    U4 ForwarderChain;
      U1[4]='ss\0\0';                                 U4 Name;
{576}                                                 U4 FirstThunk=0; // Final entry must be empty
      U1[4]='kern'
{584} U1[8]='el32\0\0Lo'
{592} U1[8]='adLibrar'
{600} U1[4]='yA\0\0';
      ===============================
        REPAIR TOOL AND ENTRY POINT
      -------------------------------
{604} b8 10 02 40 00           mov eax,0x400210; // Start of IMAGE_THUNK_DATA
{609} 48 8b 18                 mov rbx,QWORD PTR [rax]; // Fetch LoadLibraryA pointer
{612} 48 89 58 f0              mov QWORD PTR [rax-0x10],rbx; // Store at [512]
{616} 2E 48 c7 00 4c 02 00 00  mov QWORD PTR cs:[rax],0x24c; // Restore string RVA
{624} 48 8b 58 08              mov rbx,QWORD PTR [rax+0x8]; // Fetch GetProcAddress pointer
      48 89 58 f8              mov QWORD PTR [rax-0x8],rbx; // Store at [520]
{632} 48 c7 40 08 2e 02 00 00  mov QWORD PTR [rax+0x8],0x22e; // Restore string RVA
      =============
        APP START
      -------------
{640}
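
What the repair tool does can be sketched in C against a byte buffer standing in for the loaded image. This is an illustration only: 'RepairImports' is a hypothetical name, and the real tool is the machine code at offset 604 running on the live image. Offsets are the ones from the layout above.

```c
#include <string.h>

typedef unsigned char U1;
typedef unsigned long long U8;

// 'img' points at the image base, offsets are RVAs from the layout.
static void RepairImports(U1 *img){
 U8 p;
 // LoadLibraryA: copy the loader-written pointer from [528] to [512].
 memcpy(&p,img+528,8);memcpy(img+512,&p,8);
 // Restore the on-disk RVA of the "LoadLibraryA" IMAGE_IMPORT_BY_NAME.
 p=588;memcpy(img+528,&p,8);
 // GetProcAddress: copy the loader-written pointer from [536] to [520].
 memcpy(&p,img+536,8);memcpy(img+520,&p,8);
 // Restore the on-disk RVA of the "GetProcAddress" IMAGE_IMPORT_BY_NAME.
 p=558;memcpy(img+536,&p,8);}
```

After this runs, [528] and [536] hold the same values as on disk, which is why the in-memory header can be written back out as a working executable.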

Builder Example


The following C code will build a quick proof of concept binary.

//
// SIMPLE SELF-CONTAINED GCC 'C' BASED WIN32 BINARY BUILDER
//
// Test with,
// gcc b64.c -march=amdfam10 -std=gnu11 -Ofast -o b64.exe -s -lkernel32 -luser32 -lgdi32 -lwinmm
// b64.exe
// tst.exe
// echo %ERRORLEVEL%
//
// Language tools.
#define E_(x,y) __builtin_expect(x,y)
#define O_ __attribute__((noreturn))
#define R_ __restrict
#define S_ static
#define W_ __attribute__((__stdcall__)) __attribute__((__force_align_arg_pointer__))
//
// Type system.
typedef unsigned char U1;
typedef unsigned short U2;
typedef unsigned int U4;
typedef unsigned long long U8;
typedef U1 *R_ U1R;
typedef U2 *R_ U2R;
typedef U4 *R_ U4R;
typedef U8 *R_ U8R;
#define U1R_(x) ((U1R)(x))
#define U2R_(x) ((U2R)(x))
#define U4R_(x) ((U4R)(x))
#define U8R_(x) ((U8R)(x))
#define U1_(x) ((U1)(x))
#define U2_(x) ((U2)(x))
#define U4_(x) ((U4)(x))
#define U8_(x) ((U8)(x))
//
// Win32 API for x86-64.
W_ U4 CloseHandle(U8);
W_ U8 CreateFileA(U1R,U4,U4,U8,U4,U4,U8);
W_ void ExitProcess(U4);
W_ U4 SetFilePointer(U8,U4,U4R,U4);
W_ U4 WriteFile(U8,U1R,U4,U4R,U1R);
//
#define INVALID_HANDLE_VALUE (~U8_(0))
enum{
 CREATE_ALWAYS=2,
 FILE_SHARE_WRITE=2,
 GENERIC_READ=0x80000000,
 GENERIC_WRITE=0x40000000};
//
// Entry point.
O_ void main(U4 argc,U1R *R_ argv){
 // Building a 64 KiB binary (lots of extra space for later).
 // This defaults to zero fill (so zeros need not be written).
 #define BINARY_SIZE 65536
 S_ U8 buf[BINARY_SIZE/8];
 U1R b=U1R_(buf);
 //
 U2R_(b+0)[0]=0x5a4d; // e_magic
 // 
 U1R_(b+4)[0]='P'; // Signature
 U1R_(b+4)[1]='E';
 //
 U2R_(b+8)[0]=0x8664; // Machine
 U2R_(b+10)[0]=1; // NumberOfSections
 U2R_(b+24)[0]=120; // SizeOfOptionalHeader
 U2R_(b+26)[0]=2; // Characteristics
 //
 U2R_(b+28)[0]=0x20b; // Magic
 U4R_(b+44)[0]=604; // AddressOfEntryPoint
 U8R_(b+52)[0]=0x400000; // ImageBase
 U4R_(b+60)[0]=4; // SectionAlignment and e_lfanew
 U4R_(b+64)[0]=4; // FileAlignment
 U2R_(b+76)[0]=4; // MajorSubsystemVersion
 U4R_(b+84)[0]=BINARY_SIZE; // SizeOfImage
 U4R_(b+88)[0]=512; // SizeOfHeaders
 U2R_(b+96)[0]=3; // Subsystem
 U8R_(b+108)[0]=0x100000; // SizeOfStackCommit
 U8R_(b+124)[0]=0x100000; // SizeOfHeapCommit
 U4R_(b+136)[0]=2; // NumberOfRvaAndSizes
 U4R_(b+148)[0]=540; // DataDirectory[1].VirtualAddress
 U4R_(b+152)[0]=40; // DataDirectory[1].Size
 //
 U4R_(b+156)[0]=BINARY_SIZE-512; // VirtualSize
 U4R_(b+160)[0]=512; // VirtualAddress
 U4R_(b+164)[0]=BINARY_SIZE-512; // SizeOfRawData
 U4R_(b+168)[0]=512; // PointerToRawData
 U4R_(b+184)[0]=0xE0000060; // Characteristics
 //
 U8R_(b+528)[0]=588; // Function
 U8R_(b+536)[0]=558; // Function
 U4R_(b+552)[0]=580; // Name
 U4R_(b+556)[0]=528; // FirstThunk
 //
 U1R_(b+560)[0]='G';
 U1R_(b+560)[1]='e';
 U1R_(b+560)[2]='t';
 U1R_(b+560)[3]='P';
 U1R_(b+560)[4]='r';
 U1R_(b+560)[5]='o';
 U1R_(b+560)[6]='c';
 U1R_(b+560)[7]='A';
 U1R_(b+560)[8]='d';
 U1R_(b+560)[9]='d';
 U1R_(b+560)[10]='r';
 U1R_(b+560)[11]='e';
 U1R_(b+560)[12]='s';
 U1R_(b+560)[13]='s';
 //
 U1R_(b+580)[0]='k';
 U1R_(b+581)[0]='e';
 U1R_(b+582)[0]='r';
 U1R_(b+583)[0]='n';
 U1R_(b+584)[0]='e';
 U1R_(b+585)[0]='l';
 U1R_(b+586)[0]='3';
 U1R_(b+587)[0]='2';
 //
 U1R_(b+590)[0]='L';
 U1R_(b+591)[0]='o';
 U1R_(b+592)[0]='a';
 U1R_(b+593)[0]='d';
 U1R_(b+594)[0]='L';
 U1R_(b+595)[0]='i';
 U1R_(b+596)[0]='b';
 U1R_(b+597)[0]='r';
 U1R_(b+598)[0]='a';
 U1R_(b+599)[0]='r';
 U1R_(b+600)[0]='y';
 U1R_(b+601)[0]='A';
 // 
 U1R_(b+604)[0]=0xb8;
 U1R_(b+604)[1]=0x10;
 U1R_(b+604)[2]=0x02;
 U1R_(b+604)[3]=0x40;
 U1R_(b+604)[4]=0x00;
 //
 U1R_(b+604)[5]=0x48;
 U1R_(b+604)[6]=0x8b;
 U1R_(b+604)[7]=0x18;
 //
 U1R_(b+604)[8]=0x48;
 U1R_(b+604)[9]=0x89;
 U1R_(b+604)[10]=0x58;
 U1R_(b+604)[11]=0xf0;
 
 U1R_(b+604)[12]=0x2e;
 U1R_(b+604)[13]=0x48;
 U1R_(b+604)[14]=0xc7;
 U1R_(b+604)[15]=0x00;
 U1R_(b+604)[16]=0x4c;
 U1R_(b+604)[17]=0x02;
 U1R_(b+604)[18]=0x00;
 U1R_(b+604)[19]=0x00;
 //
 U1R_(b+604)[20]=0x48;
 U1R_(b+604)[21]=0x8b;
 U1R_(b+604)[22]=0x58;
 U1R_(b+604)[23]=0x08;
 //
 U1R_(b+604)[24]=0x48;
 U1R_(b+604)[25]=0x89;
 U1R_(b+604)[26]=0x58;
 U1R_(b+604)[27]=0xf8;
 //
 U1R_(b+604)[28]=0x48;
 U1R_(b+604)[29]=0xc7;
 U1R_(b+604)[30]=0x40;
 U1R_(b+604)[31]=0x08;
 U1R_(b+604)[32]=0x2e;
 U1R_(b+604)[33]=0x02;
 U1R_(b+604)[34]=0x00;
 U1R_(b+604)[35]=0x00;
 //
 // Extra code to return lower 32-bits of GetProcAddress
 // mov rax,rbx; ret;
 U1R_(b+640)[0]=0x48;
 U1R_(b+640)[1]=0x89;
 U1R_(b+640)[2]=0xd8;
 U1R_(b+640)[3]=0xc3;
 //
 // Dump binary to file.
 U8 h=CreateFileA(U1R_("tst.exe"),GENERIC_WRITE|GENERIC_READ,FILE_SHARE_WRITE,0,CREATE_ALWAYS,0,0);
 U4 r[1];WriteFile(h,U1R_(buf),BINARY_SIZE,U4R_(r),0);
 CloseHandle(h);
 // Exit the process.
 while(1)ExitProcess(0);}

________________________________________________________________________________________________________________________________
[PRG] Linux?
Tried to get back to using Linux on a PC laptop, failed...


Where I left off on Arch install.


Random Notes


________________________________


[PIX] Resolution vs Pixel
Q. Why Does an LCD Need Much Higher Resolution Than a CRT?
Because of physical pixel shape. A CRT's pixels have smooth partial overlap, so a CRT can reproduce a smooth signal with less physical resolution. An LCD's pixels are quite hard, so an LCD requires substantially more resolution to get to the point where the pixel's hard edge is no longer perceptible.

Q. Why Does the LCD Hard Pixel Fail Even For Text Rendering?
Because a substantial portion of the strokes in a font are neither purely horizontal or vertical, nor pixel aligned. The LCD hard pixel can only reproduce a hard edge (which carries a higher frequency than the physical resolution) on features that are both axis aligned and pixel aligned.

Q. For Moving Images With No Temporal Aliasing, Why is Minimum Feature Size Larger Than a Pixel?
Pixel sized features with sub-pixel motion would alternate between being fully visible when aligned to pixel centers, to being half visible when aligned to pixel edges, or quarter visible if the feature is a point aligned to a pixel corner. This visibility change is the temporal aliasing. The only way to reduce temporal aliasing is to drop off contrast of pixel sized features until the visibility change is not perceptible any more. This is the standard for film rendering, and a requirement to provide a believable image.
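
A minimal sketch of the visibility numbers above, assuming the feature is a 1x1 pixel box and that "visibility" means the largest fraction of the feature landing in any single hard pixel. 'PeakCoverage' is a hypothetical name, and (fx,fy) is the sub-pixel offset from perfect alignment.

```c
// Largest fraction of a unit-square feature landing in any single pixel,
// given sub-pixel offset (fx,fy) in {0.0 to <1.0} from perfect pixel alignment.
// The box overlaps up to 4 pixels, the largest overlap is the product below.
static double PeakCoverage(double fx,double fy){
 double mx=fx>0.5?fx:1.0-fx;
 double my=fy>0.5?fy:1.0-fy;
 return mx*my;}
```

PeakCoverage(0.0,0.0) is 1.0 (aligned), PeakCoverage(0.5,0.0) is 0.5 (on a pixel edge), and PeakCoverage(0.5,0.5) is 0.25 (on a pixel corner).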

Q. Would Phone OEMs Push Peak Resolution Images?
Pushing a peak resolution sensor is fine, but the images generated with such a sensor don't actually provide pixel quality at peak resolution. Images could easily be scaled down substantially and end up with higher quality as artifacts get removed in the process. The aim of keeping images at peak resolution is to more quickly fill the phone's storage, since phone companies get margin on higher storage phones, or optionally to push customers into cloud services with the aim of charging a higher monthly fee.

________________________________________________________________________________________________________________________________
[PIX] DLSS3?
Notes from the original DLSS3 launch in 2022 ...
Related Links


Summary, as a game developer I would never choose to integrate DLSS3 into one of my games.



________________________________________________________________________________________________________________________________
[PIX] Morphological AA Links
Despite temporal techniques, spatial AA is still useful when overriding areas of low convergence ...

________________________________________________________________________________________________________________________________
[PIX] Modified Soft BFI
Display rates to BFI configurations?


"Modified"
Soft BFI is probably limited by non-linearity in display pixel transitions. Meaning {white,black} frames won't necessarily average to {50% grey}. The other issue is the loss of brightness. Assuming enough pixel response linearity, one could redistribute light across frames. Start with repeating the input frame N times on output. Could increase brightness of pixels which are not at peak already in the first frame, and then subtract that amount from the later frame(s). Thus temporal energy conservation. At white, this would act as sample-and-hold, but say at 144 Hz, anything under 50% would act as 72 Hz with BFI.
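
A minimal sketch of the redistribution, assuming values are in linear light and the display responds linearly (the same assumption the text makes). 'SoftBfiSplit' is a hypothetical name, and this front-loads all light into the earliest frames, which is only one possible distribution.

```c
// Split one input value across 'n' repeated output frames, front-loading light.
// 'v' is linear light in {0.0 to 1.0}, out[i] receives the level for frame i.
// The sum of out[] equals n*v (temporal energy conservation).
static void SoftBfiSplit(double v,int n,double *out){
 double energy=v*(double)n;
 for(int i=0;i<n;i++){
  // Each frame takes as much of the remaining energy as it can display.
  double f=energy<1.0?energy:1.0;
  out[i]=f;energy-=f;}}
```

With n=2 on a 144 Hz display, v=0.25 gives {0.5,0.0}, which is a lit frame followed by a black frame (72 Hz with BFI), while v=1.0 gives {1.0,1.0} (sample-and-hold).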

________________________________


[CRT] Links Inventory Misc

TODOs




Links




Downsampling




Signals


VGA (RGBHV)


SCART
  • RGB {0 to 0.7V} 75 Ohm termination, pin {15,11,7} (shared with component)
  • S (composite video, or just composite sync) {0 to 1V} 75 Ohm, pin {20}

    NTSC




    Modelines


    Modeline data provided as {Display, Sync Start, Sync End, Total}
    See: www.mythtv.org/wiki/Modeline_Database

    Translator from modeline to AMD/NV PC driver settings:
      Front Porch = Sync Start - Display
      Sync Width = Sync End - Sync Start
    
    Diagram:
      |<----------------total-------------->|
      |<------------sync-end----------->|   .
      |<---------sync-start------->|    .   .
      |<-------display-------->|   .    .   .
      .                        .   .    .   .
      |XXXXXXXXXXXXXXXXXXXXXXXX|---|____|---|
                               .   .    .
               front porch ....|<->|    .
               sync width .........|<-->|
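
The translation above as a small sketch. 'ModelineAxis', 'DriverTiming' and 'ModelineToDriver' are hypothetical names, applied once for the horizontal numbers and once for the vertical numbers. The back porch is not in the list above; it is added here as Total - Sync End, which follows from the diagram.

```c
// One axis of a modeline: {Display, Sync Start, Sync End, Total}.
typedef struct{int display,syncStart,syncEnd,total;}ModelineAxis;
// Values in the form PC driver custom resolution panels ask for.
typedef struct{int frontPorch,syncWidth,backPorch;}DriverTiming;

static DriverTiming ModelineToDriver(ModelineAxis m){
 DriverTiming t;
 t.frontPorch=m.syncStart-m.display;
 t.syncWidth=m.syncEnd-m.syncStart;
 // Assumption from the diagram: what is left after sync end is the back porch.
 t.backPorch=m.total-m.syncEnd;
 return t;}
```

For the standard 640x480 60 Hz horizontal timing {640, 656, 752, 800} this gives a front porch of 16, a sync width of 96, and a back porch of 48.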
    

    Inventory


    Compaq MV 740 - 17" VGA, {31-70kHz,50-120Hz}, offline, currently dead (won't power on).


    Daewoo DTQ-20U4SC - 20" NTSC TV, like new


    Dell UltraScan 1600HS Series D1626HT - 21" VGA {30-107kHz,48-160Hz}, made by Sony, good tube, cleaned


    Dotronix DSV27 - 27" 480i, from Dotronix, new from new-old-stock LG tube


    eMachines 17fs - 17" VGA, {30-70kHz, 50-160Hz}, new-old-stock


    Future Power 17DB88 - 17" VGA, {?-?kHz,?-?Hz}, new-old-stock


    HP "1024" D2813 - 14" VGA, {?kHz,?Hz}, tube ok, cleaned


    Insignia IS-TV040919 - 20" NTSC TV, new-old-stock


    JVC AV-27D200 - 27" NTSC TV


    JVC i'ART 27" AV-27SF36 - 27" NTSC TV, new-old-stock


    KDS KF-7b - 17" VGA, {30-70kHz, 50-160Hz}, new-old-stock


    MakVision/Wei-YA M2929D4-TS SVGA Arcade Monitor - 29" VGA (480p/600p), {30-40kHz,47-90Hz}, new-old-stock


    MakVision/Wei-YA M3129DS-LG CGA/EGA/VGA Arcade Monitor - 29" VGA (240p/480p), {15/24/31kHz,47-70Hz}, new-old-stock


    MaxTech XT-4800 - 14" VGA, {30-48?kHz,?-87?Hz}, new-old-stock


    Philips 20PT6341/37 - 20" 240p/480i


    Philips 27PT6441/37 - 27" 240p/480i


    Philips 32PT6441/37 - 32" 240p/480i


    Philips 34PW850H37F - 34" 16:9 via Component (Does 480p/720p)


    Sharp 32SC260 - 32" 240p/480i


    Sony BVM-D20F1U - 20" 4:3 multi-format


    Sony Multiscan CPD-E200 - 17" VGA {30-85kHz,48-120Hz}


    Sony PVM-8041Q - 8" 240p/480i


    Sony PVM-1354Q - 13" 240p/480i (HR Tube, 16:9 Toggle)


    Sony PVM-14L2 - 14" 240p/480i (16:9 Toggle)


    Sony PVM-? / Olympus OEV203 - 19" 240p/480i


    Sony PVM-1953ST - 19" 240p/480i, Olympus (HR Tube, Endoscope Monitor)


    Sony PVM-1953MD - 19" 240p/480i (HR Tube)


    Sony PVM-20M2U - 20" 240p/480i (16:9 Toggle)


    Sony PVM-20M2MDU - 20" 240p/480i


    Sony PVM-20M2MDU/ST - 20" 240p/480i


    Sony Wega KD-34XBR970 - 34" 1080i HD TV


    Sony Wega KV-30HS420 - 30" 1080i HD TV


    ViewSonic 17GS - 17" VGA {30-69kHz,50-160Hz}, new-old-stock


    ViewSonic G75f - 17" VGA {30-86kHz,50-180Hz}, cleaned, tube ok


    ViewSonic Optiquest Q95 - 19" VGA {30-86kHz,50-160Hz}, cleaned, good tube


    ViewSonic PS790 - 19" VGA {30-95kHz,50-180Hz}, cleaned, medium tube life


    ________________________________________________________________________________________________________________________________

    Insignia


    This required a lot of fixes to bring back to life.



    UPS Transform, Old New Stock TVs Are Bad Out of the Box, then Finally Warm and Calibrated


    Cracked, and Repaired (Different Parts of the Same Board)

    ________________________________________________________________________________________________________________________________

    Endoscopy PVM


    Have too many PVMs.


    Ebay Endoscopy PVM-1953ST, Picked Up Via Freight, Clean Outside (Bad SNES Running 240p Test)

    ________________________________________________________________________________________________________________________________

    First Philips


    For the love of slot mask.



    Philips 20PT6341/37 - Deceptively Clean on the Inside, Might Need a Recap (Probably SNES That Needs Recap)

    ________________________________


    [CRT] Notes

    Misc


    Intergraph Interview 28hd98 Model TX-D8W71W


    Mitsubishi's Diamond Pro 2070SB (VGA, 140 kHz, 160 Hz)


    NEC DM-2000P


    ________________________________________________________________________________________________________________________________

    APEX/KLH




    ________________________________________________________________________________________________________________________________

    Daewoo




    ________________________________________________________________________________________________________________________________

    JVC


    Earlier (240p | 480i)


    D-Series (240p | 480i)


    JVC DT-V (15-45 kHz, 50-100 Hz)


    JVC I'Art (240p | 480i)


    JVC I'Art Pro (HD)


    ________________________________________________________________________________________________________________________________

    Orion




    ________________________________________________________________________________________________________________________________

    Panasonic




    Panasonic TX80P300A


    ________________________________________________________________________________________________________________________________

    Philips / Magnavox


    No Component (240p | 480i)


    Curved With Component (240p | 480i)


    Flat With Component (240p | 480i)


    Wide Screen 16:9 SD Model (240p | 480i)


    TODO....

    Philips 32PT740H/37A (240p | 480i | 480p | VGA)


    Philips 32PT830H/37A (240p | 480i | 480p | VGA)


    Philips 30PW862H (240p | 480i | 480p)


    Philips 34PW8520 (240p | 480i | 480p | VGA)


    Presentation Monitors (VGA)


    ________________________________________________________________________________________________________________________________

    Prima/Advent/Jensen




    ________________________________________________________________________________________________________________________________

    RCA


    MM Series


    4:3 Proscan


    Misc


    ________________________________________________________________________________________________________________________________

    Samsung


    Samsung Curved (240p | 480i)


    Samsung Dynaflat


    Samsung SlimFit


    ________________________________________________________________________________________________________________________________

    Sanyo




    ________________________________________________________________________________________________________________________________

    Sharp


    Non-Component (240p | 480i)


    Cinema Select (240p | 480i)


    Curved With Component (240p | 480i)


    Sharp X-Flat (240p | 480i)


    ________________________________________________________________________________________________________________________________

    Sony


    Sony BVM | PVM


    Sony KV-20XBR / KV-25XBR


    Sony FW900 (VGA)


    Sony Curved (240p | 480i)


    Sony Wega (240p | 480i)


    Sony Wega (Hi-Scan)


    Sony Wega (Super Fine Pitch)


    ________________________________________________________________________________________________________________________________

    Sylvania




    ________________________________________________________________________________________________________________________________

    Toshiba




    SD


    HD


    Orion


    ________________________________________________________________________________________________________________________________

    Zenith




    ________________________________


    [CSS] Test Area

    Heading Two





    Image Caption

    ________________________________


    [Truck]
    Life of the 1999 Manual Cummins Diesel Dodge 3500 Dually ...

    TODO


    ________________________________


    [Jeep]
    Life of the 2016 Jeep Unlimited Rubicon ...

    TODO




    AC Delete and Rewire (2020)


    Airbags, except the ones in the seats, have been removed. AC has been removed except for the compressor, due to not being able to find an idler pulley replacement. The console metal frame was cut out except for the driver's side. The console itself was cut out except for just the part in front of the driver's side. All inside wiring was minimized, with all electrical tape removed, and rewrapped in a nice sleeve. At first I had gone too far in stripping out wiring, to the point where the Jeep wouldn't start. Got an OBD2 scanner, only had a U110A code, which mapped to steering angle sensor fail. Also had the security dot on the dash, so knew I had snipped out that bus by accident. Those both share the same bus, so found the wires (without a manual), then resoldered and shrink wrapped in a correction. Jeep started up without any problem after that. Have a bunch of wiring left to do (the other side of the firewall, replace the giant terminal with something smaller, etc).


    A Lot More Room For Groceries

    Aero Wheel Install (2020)


    The beadlock is different than other aluminum offroad wheels: there is no lip to center the tire, and I suspect the Patagonia tires have a thicker bead than dirt track racing wheels. Mounting the beadlock seemed strange at first, but after all the bolts had been torqued down, everything seems fine. There are fewer and larger bolts for the Aero wheels, but they have higher torque specs (30-35 ft-lbs). Easy but time consuming to mount the tire. Took about 3 psi to get the tire to seat past the safety ridge on the rim. Running 20 psi now for the street. Might adjust after I get enough miles to retorque the bolts.




    Rear


    Front


    Front Inside

    Possible Mistake (2020)


    The 15"x8" wheels are rated for 3500 lbs circle track racing. JKU stock is a pig at around 4500 lbs. The 15"x10" wheels are rated for 4000 lbs. Probably should have gone with the 15"x10" wheels. Instead going to continue to lighten the jeep.

    Aero Wheel Evaluation (2020)


    Bought a single Aero 53 Series Wheel from Summit for evaluation. This is a 15"x8" with 3" of backspacing with a 5x5 bolt pattern for the JKU. Fits just fine without any grinding of the caliper for my 2016 JKU. Lots of clearance. HSLA steel and only 23 lbs for the wheel. Went and ordered the other 3, only $137/wheel from Summit.



    Seat Covers (2020)


    Went with Bartact Base Line Performance seat covers. Wanted something minimal that would offer enough protection to keep the Jeep doors-free for most of the year. Like the extra zipper pouch in the front, easy to put wallet and phone in there so I don't have to worry about it falling out of the shorts.



    Tire Research For 15" Wheels (2020)


    Turns out no affordable 37" tire for 15x9, so 35" is as big as this story goes!
    General notes.


    Federal Couragia M/T Mud-Terrain


    General Grabber X3


    Ironman All Country M/T Tire - No 35" at 15"

    Milestar Patagonia M/T Mud-Terrain Radial Tire


    Some Wheel Research (2020)


    Notes.


    ________________________________


    [Auto]


    ________________________________


    [FL] Orlando Area


    ________________________________


    [GAME] Inventory and Links
    ADVANCE - DON'T HAVE


    AMIGA - DON'T HAVE


    AMSTRAD CPC - DON'T HAVE


    ARCADE - DON'T HAVE


    ATARI 7800 - DON'T HAVE


    C64 - DON'T HAVE


    FANTASY (PC) - DON'T HAVE


    MEGADRIVE


    NDS - DON'T HAVE


    NEOGEO MVS + SUPERGUN


    NES


    PC


    PGM


    PS1


    PS2


    PS5


    SNES


    ZX SPECTRUM - DON'T HAVE


    ________________________________


    [HW] FPGA Stuff
    This is a very slow burn project to think through my ideal integer ALU vector hardware. Designed for Xilinx 7 series FPGAs (or later). The end goal is to build out a neo-vintage console with my kids some day. Current thoughts are a radical departure from how integer hardware was done in the past.

    Others


    Boards


    Example Per DSP Budget


    ________________________________________________________________________________________________________________________________
    [HW] Ownership
    Once upon a time, when you purchased a computer, it booted into a console for a programming language, came with a manual with language and hardware documentation. The consumer owned the machine, was both supported and free to do whatever they wanted to do with it.

    Since then others have stepped in to remove personal ownership of the machine and exert policy by restricting your access to the machine via software. OS vendors act as if they own your machine. Hardware vendors also act as if they own the hardware. Signed firmware is an obvious example of this. Another important example is the lack of exposure of the hardware in user-accessible APIs, or not being able to run your own binaries.

    A large component of what enables this anti-consumer behavior is an industry which has evolved through amplification of complexity. The hardware got so needlessly complex that a single individual would have a hard time writing code to interface with the machine. Thus few can organize any type of counter option, except via things like herd mentality which often doesn't produce good results.

    I'd personally like to see a return of a simple machine. Something accessible and inexpensive. One which boots into a console with a language. One which can be easily understood and fully interfaced by a single individual. Think of this as a "hyper-calculator".

    ________________________________________________________________________________________________________________________________
    [HW] Multi-Tasking
    Why share?


    Clipboard Doesn't Need An API

    ________________________________________________________________________________________________________________________________
    [HW] 3-Bank Register File?
    Thoughts on manual 3-banks of dRAM for a register file.
     - Created from distributed RAM (dRAM)
     - Either 192 or 96 register configuration
     - 192 registers
        - Each bank: 64 entry x 3-bit Simple Dual Port (SDP, aka one read, one write) in 4 LUTs
        - 18 SLICEM: 18-bit =  6 SLICEMs/bank
        - 24 SLICEM: 24-bit =  8 SLICEMs/bank
        - 30 SLICEM: 30-bit = 10 SLICEMs/bank
        - Likely too many SLICEMs
     - 96 registers
        - Each bank: 32 entry x 6-bit Simple Dual Port (SDP, aka one read, one write) in 4 LUTs
        -  9 SLICEM: 18-bit = 3 SLICEMs/bank
        - 12 SLICEM: 24-bit = 4 SLICEMs/bank
        - 15 SLICEM: 30-bit = 5 SLICEMs/bank 
        - 18 SLICEM: 36-bit = 6 SLICEMs/bank
     - Each bank direct mapped to DSP input {A,B,C} (except maybe C) for high clocks
     - Could have separate store control for all 3 banks
        - Complex to write assembly for
     - Not all values will need to be in all banks
        - And accumulator usage typically won't require stores
        - So when DSP to dRAM is not needed, loads or constants could be stored
     - Likewise stores (not to register file) can use DSP output (instead of fetching from register file)
        - The dRAM stores will need to be filtered through a CLB to mux options
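    The SLICEM counts above can be recomputed with a small sketch. It assumes one SLICEM (4 LUTs) holds either a 64 entry x 3-bit or a 32 entry x 6-bit SDP memory, as listed, and 3 banks. The function names are made up for illustration.

```python
import math

# Bits of SDP storage per SLICEM, keyed by entries per bank (from the list above).
BITS_PER_SLICEM = {64: 3, 32: 6}

def slicems_per_bank(width_bits: int, entries: int) -> int:
    """SLICEMs needed for one bank of the given data width."""
    return math.ceil(width_bits / BITS_PER_SLICEM[entries])

def register_file_cost(width_bits: int, entries: int, banks: int = 3) -> int:
    """Total SLICEMs for all banks of the register file."""
    return banks * slicems_per_bank(width_bits, entries)

for entries in (64, 32):
    for width in (18, 24, 30, 36):
        total = register_file_cost(width, entries)
        print(f"{entries * 3:3d} registers, {width}-bit: {total} SLICEMs")
```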
    

    ________________________________________________________________________________________________________________________________
    [HW] DSP MUX at C
    When always using the MUL path, and using a simple 3 banked register file, there is a clock available to MUX inputs to C due to registering M. Perhaps a preferred way to do input modifications at high clocks.
     - C can have a CLB 
        (bRAM A)__[>A]__(*)__[>M]__(+)__[>P]
        (bRAM B)__[>B]__/          /
        (bRAM C)_______(CLB)_[>C]_/
     - Could MUX in constants/etc to C
        - To avoid needing to burn the register file
    

    ________________________________________________________________________________________________________________________________
    [HW] Left-Justified
    The idea is to always maximize precision by default. Fixed point numbers are all {-1.0 to <1.0}. This kind of thinking completely changes everything (as will be seen in later sections).
     - Fixed point number convention
        MSB.....LSB  CONVENTION
        ===========  ==========
        00000000000   0.0 zero (false)
        01111111111  <1.0 largest positive
        10000000000  -1.0 smallest negative (true) 
     - Traditional 'unsigned' {0.0 to 1.0} values go {0.0 to -1.0} instead
        - Only the signed side has ability to encode the 1.0
     - Any 'unsigned' {0.0 to <1.0} values stay positive {0.0 to <1.0}
     - Memory is {0.0 to <1.0} accessed (left aligned, instead of right aligned indexing)
        - So a 320 entry memory would be accessed {0.0 to 320.0/512.0}, where the 1.0 is the next power of 2 in size
     - Using {-0.5 to 0.5} for {-1.0 to 1.0} ranged data
        - Then extra 'p=p+p' (accumulator added to itself) to renormalize before final output
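    A minimal sketch of the convention, assuming the 11-bit patterns from the table above. The helper names (to_fixed, from_fixed, memory_index) are hypothetical, and the truncating encode is just one possible choice.

```python
BITS = 11  # matches the 11-bit patterns in the convention table

def to_fixed(value: float, bits: int = BITS) -> int:
    """Encode a value in [-1.0, 1.0) as a two's complement bit pattern."""
    scale = 1 << (bits - 1)
    raw = int(value * scale)                 # truncates toward zero
    raw = max(-scale, min(scale - 1, raw))   # clamp to the representable range
    return raw & ((1 << bits) - 1)

def from_fixed(pattern: int, bits: int = BITS) -> float:
    """Decode a bit pattern back to a value in [-1.0, 1.0)."""
    if pattern & (1 << (bits - 1)):
        pattern -= 1 << bits
    return pattern / (1 << (bits - 1))

def memory_index(pattern: int, entries: int, bits: int = BITS) -> int:
    """Left-aligned indexing: 1.0 is the next power of 2 at or above 'entries'."""
    addr_bits = (entries - 1).bit_length()
    return pattern >> (bits - 1 - addr_bits)

print(format(to_fixed(-1.0), "011b"))
print(memory_index(to_fixed(319.0 / 512.0), 320))
```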
    

    Signed MACC


     - 12-bit x 15-bit = 27-bit partial product sign extended to 32-bit (proxy)
        11111111 11111111 00000000 00000000
        fedcba98 76543210 fedcba98 76543210
        ========-========-========-========
        ........ ........ ....bbbb bbbbbbbb  12-bit input for MUL
        ........ ........ ....0111 11111111   2047
        ........ ........ ....1000 00000000  -2048 (representing -1.0)
        ........ ........ .aaaaaaa aaaaaaaa  15-bit input for MUL
        ........ ........ .1000000 00000000  -16384 (representing -1.0)
        .....mmm mmmmmmmm mmmmmmmm mmmmmmmm
        .....010 00000000 00000000 00000000  -16384 * -2048 =  33554432 (maximum positive output)
        .....110 00000000 00000000 00000000                   -33554432
        pppppppp pppppppp pppppppp pppppppp  32-bit accumulator
        sssssmmm mmmmmmmm mmmmmmmm mmmmmmmm  sign-extended multiply result
        ......bb bbbbbbbb bb...... ........  register part used for 12-bit input (MSBs)
        ......aa aaaaaaaa aaaaa... ........  register part used for 15-bit input (MSBs)
    
     - 18-bit x 25-bit = 43-bit partial product sign extended to 48-bit (target)
        22222222 22222222 11111111 11111111 00000000 00000000
        fedcba98 76543210 fedcba98 76543210 fedcba98 76543210
        ========-========-========-========-========-========
        ........ ........ ........ ......bb bbbbbbbb bbbbbbbb  18-bit input for MUL
        ........ ........ ........ ......01 11111111 11111111   131071
        ........ ........ ........ ......10 00000000 00000000  -131072 (representing -1.0)
        ........ ........ .......a aaaaaaaa aaaaaaaa aaaaaaaa  25-bit input for MUL
        ........ ........ .......1 00000000 00000000 00000000  -16777216 (representing -1.0)
        sssssmmm mmmmmmmm mmmmmmmm mmmmmmmm mmmmmmmm mmmmmmmm  sign-extended multiply result
        ......bb bbbbbbbb bbbbbbbb ........ ........ ........  part used for 18-bit input (MSBs)
        ......aa aaaaaaaa aaaaaaaa aaaaaaa. ........ ........  part used for 25-bit input (MSBs)
        ......cc cccccccc cccccccc cccccccc cccccc.. ........  part used for 32-bit register file (or ADD)
        ssssss.. ........ ........ ........ ........ ........  sign extension for C
        ........ ........ ........ ........ ......10 00000000  single one bit to setup for rounded accumulation
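    The 18-bit x 25-bit layout above can be modeled in software. This is a sketch under assumptions, not a DSP48E1 model: 32-bit registers, A takes the top 25 bits, B takes the top 18 bits, C sits at bits 41..10 of the 48-bit accumulator, and the rounding bit is bit 9. Results beyond the 32-bit field (such as -1.0 * -1.0) land in the guard bits and wrap when only the register field is extracted.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low 'bits' of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def macc(reg_a: int, reg_b: int, reg_c: int, round_bit: bool = True) -> int:
    """Return the 32-bit register field of C + A*B (left-justified operands)."""
    a = sign_extend(reg_a, 32) >> 7    # top 25 bits feed the 25-bit input
    b = sign_extend(reg_b, 32) >> 14   # top 18 bits feed the 18-bit input
    c = sign_extend(reg_c, 32) << 10   # register field sits at bits 41..10
    if round_bit:
        c += 1 << 9                    # single one bit just below the field
    p = sign_extend(c + a * b, 48)     # 48-bit accumulator
    return (p >> 10) & 0xFFFFFFFF

HALF = 0x40000000  # 0.5 in the 32-bit left-justified convention
print(hex(macc(HALF, HALF, 0)))  # 0.5 * 0.5
```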
    

    High Precision Shift Left
     - DSP has ability to feed result back into adder at high-precision
        - This can be done without any delay
        - Enables 'P+P = 2*P = P<<1' 
     - This provides a framework for mapping back to normalized multiplies,
        - Example if we know this won't overflow,
          Map 'x*3+x' to '(x*(3/4)+x*(1/4))<<2'
          Which takes 2 'x+x' clocks to renormalize
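    A sketch of the renormalization idea, using exact fractions instead of hardware bit widths. The normalize and evaluate names are hypothetical. It finds how many 'P+P' doublings are needed so every coefficient fits in {-1.0 to <1.0}, and like the example above it assumes the final result does not overflow.

```python
from fractions import Fraction

def normalize(coefs):
    """Return (scaled coefficients, doublings) with every coefficient in [-1.0, 1.0)."""
    shift = 0
    while any(not (-1 <= Fraction(c) / 2**shift < 1) for c in coefs):
        shift += 1
    return [Fraction(c) / 2**shift for c in coefs], shift

def evaluate(x, coefs):
    """Sum of coef*x using normalized multiplies, then 'p += p' per doubling."""
    scaled, shift = normalize(coefs)
    p = sum(c * x for c in scaled)
    for _ in range(shift):
        p += p                      # P+P = P<<1 in the accumulator
    return p

print(normalize([3, 1]))                 # coefficients 3/4 and 1/4, 2 doublings
print(evaluate(Fraction(1, 8), [3, 1]))  # x*3+x for x = 1/8
```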
    

    Rounding
    Truncation moves towards the smaller value. Often a series of MADs feeding into an accumulator will start with an add operand of zero. To setup for rounding, can feed in a 1 bit before the LSB. Showing non-normalized examples below.
     - Unsigned logic examples
         0.4 + 0.5 =  0.9 ->  0.0
         0.6 + 0.5 =  1.1 ->  1.0 
         1.4 + 0.5 =  1.9 ->  1.0
         1.6 + 0.5 =  2.1 ->  2.0 
     - Signed logic examples
        -0.4 + 0.5 =  0.1 ->  0.0
        -0.6 + 0.5 = -0.1 -> -1.0 
        -1.4 + 0.5 = -0.9 -> -1.0
        -1.6 + 0.5 = -1.1 -> -2.0  
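    The examples above can be checked with integers. This sketch assumes 10 fractional bits (an arbitrary choice) and relies on Python's '>>' being an arithmetic shift, which truncates toward the smaller value as described.

```python
FRAC_BITS = 10  # assumed number of bits below the output LSB

def to_q(value: float) -> int:
    """Quantize a float to an integer with FRAC_BITS fractional bits."""
    return round(value * (1 << FRAC_BITS))

def truncated(value: int) -> int:
    """Drop the fractional bits (arithmetic shift moves toward the smaller value)."""
    return value >> FRAC_BITS

def rounded(value: int) -> int:
    """Add a one bit just below the LSB, then truncate."""
    return (value + (1 << (FRAC_BITS - 1))) >> FRAC_BITS

for v in (0.4, 0.6, 1.4, 1.6, -0.4, -0.6, -1.4, -1.6):
    print(f"{v:5.1f} -> truncated {truncated(to_q(v)):2d}, rounded {rounded(to_q(v)):2d}")
```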
    

    ________________________________________________________________________________________________________________________________
    [HW] Negatives
    The DSP48E1 guide uses a different way of documenting ALUMODE, this is my way.
     - For signed 16-bit use {0 to -32768} as the {0.0 to 1.0} convention
        - fedcba9876543210
          ================
          0000000000000000 ...  0
          0111111111111111 ...  32767 (maximum positive)
          1000000000000000 ... -32768 (minimum negative)
          1111111111111111 ... -1 
     - For bools use sign bit, so 16-bit is {0:=false,-32768:=true}
     - No negate, just 'neg(x) = not(x) + 1'
     - No double negate 'd = -a + -b', instead 'd = a + b' then use '-d' later
     - No subtract, instead NOT modifiers
        - a - b = a + not(b) + 1
        - a - b = not(not(a) + b) ... alternative more useful form
     - DSP48E1 when using Multiplier (instead of the A:B high precision add)
        - X options 
           -  P (recirculate past result)
           -  M (multiplier partial result)
           -  0 (zero)
        - Y options
           -  C (input)
           -  M (multiplier other partial result)
           - ~0 (all ones)
           -  0 (zero)
        - Z options
           -  P (recirculate past result)
           -  C (input)
           -  0 (zero)
        - Useful input configurations
           Z   XY
           =   ==
           0 + 0 (zero)
           0 + P (nop)
           0 + C (move)
           0 + M (multiply)
           C + M (multiply add)
           P + M (multiply accumulate)
           P + C (accumulate)
           P + P (high precision double result, aka shift left 1)         
        - ALUMODE
           - 0:     Z +(X+Y+CIN)  ... add
           - 1:   (~Z)+(X+Y+CIN)  ... -Z+(X+Y+CIN)-1 ... if CIN=1 then get -Z+(X+Y)
           - 2: ~(  Z +(X+Y+CIN)) ... not output, '-(Z+X+Y+CIN)-1'
           - 3: ~((~Z)+(X+Y+CIN)) ... sub, 'Z-(X+Y+CIN)'
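    The four modes and the NOT identities can be checked with a small model. This follows the ALUMODE numbering used in the list above and wraps to 48 bits. It is only a sketch of the arithmetic, not of the DSP48E1 itself.

```python
MASK48 = (1 << 48) - 1

def alu(mode: int, z: int, x: int, y: int, cin: int = 0) -> int:
    """48-bit result for the four ALUMODE variants listed above."""
    xy = x + y + cin
    if mode == 0:
        result = z + xy          # add
    elif mode == 1:
        result = ~z + xy         # -Z + (X+Y+CIN) - 1
    elif mode == 2:
        result = ~(z + xy)       # NOT of the sum
    elif mode == 3:
        result = ~(~z + xy)      # Z - (X+Y+CIN)
    else:
        raise ValueError("mode must be 0..3")
    return result & MASK48

a, b = 10, 3
print(alu(3, a, b, 0))                               # a - b via not(not(a) + b)
print(alu(1, a, b, 0, cin=1) == (b - a) & MASK48)    # CIN=1 gives -Z + (X+Y)
```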
    

    Example: Parabolic Sqrt (Left-Justified Logic)
     - Parabolic sqrt(x) estimation
        - Using {0 to MIN_NEGATIVE} as {0.0 to -1.0} convention
        - Result using positives would be '2*x-x*x'
        - But want negatives so '2*x+x*x'
        - Highest precision
           - p=x
           - p+=p ..... Have 5-bit guard so this won't overflow
           - p+=x*x ... This gets 18-bit * 25-bits of precision for multiply
        - Fast, lower precision
           - p=x-(-1.0*x) ... The 'x' in the multiply is 25-bits of precision 
           - p+=x*x ......... This gets 18-bit * 25-bits of precision for multiply
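    A sketch of both sequences using exact fractions rather than the 18-bit x 25-bit hardware precision. The function names are made up. It only shows that the negative-convention form '2*x+x*x' is the negation of the positive form '2*x-x*x', and that the fast path computes the same value.

```python
from fractions import Fraction

def parabolic_positive(x):
    """Reference form for x in {0.0 to 1.0}."""
    return 2 * x - x * x

def parabolic_highest_precision(x):
    """Negative convention, x in {0.0 to -1.0}: p=x, p+=p, p+=x*x."""
    p = x
    p += p          # relies on guard bits so this won't overflow
    p += x * x
    return p

def parabolic_fast(x):
    """Negative convention fast path: p=x-(-1.0*x), p+=x*x."""
    p = x - (Fraction(-1) * x)
    p += x * x
    return p

x = Fraction(1, 4)
print(parabolic_positive(x), parabolic_highest_precision(-x), parabolic_fast(-x))
```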
    

    Example: Simple Filter Kernel (Left-Justified Logic)
     - Could replace a gaussian in some cases
        - Using {0 to MIN_NEGATIVE} as {0.0 to -1.0} convention
        - Result using positives would be '(1-x*x)^2'
        - But want negatives so '0-(-1+x*x)^2'
        - Highest precision
           - s=p=-1+(x*x) ... This gets 18-bit * 25-bits of precision for multiply, need high latency path through store
           - p=0-(s*s) ...... Accumulator 'P' doesn't have direct path into multiplier, so must load
    

    ________________________________________________________________________________________________________________________________
    [HW] Divide
    Strategy for doing divides in a left-justified normalized world is a critical foundation for many things.



    Iterative Solution


    ... WORK IN PROGRESS ...



    Making /0.0 Result in -1.0


    What About RCP
    Same logic, just set 'x=-1.0' and pick an acceptable scaling factor for 's'.

    ________________________________________________________________________________________________________________________________
    [HW] Variable Bit-Width MEM
    Word, Half, Byte, and Nibble
    Simplest form of compression: variable bit-width loads.


     - Load permutations, where '.' is a zero
        11111111111111110000000000000000  ADR
        fedcba9876543210fedcba9876543210  hbn
        ================================  ===
        vutsrqponmlkjihgfedcba9876543210  111
        vutsrqponmlkjihg................  111
        vutsrqpo........................  111
        vuts............................  111
        rqpo............................  110
        nmlkjihg........................  101
        nmlk............................  101
        jihg............................  100
        fedcba9876543210................  011
        fedcba98........................  011
        fedc............................  011
        ba98............................  010
        76543210........................  001   
        7654............................  001
        3210............................  000
        ================================
        xxxx............................   4x 8:1 MUX (8 LUTs = 2 SLICEs)
                                              hbn - address bits (direct map to mux)
        ....xxxx........................   4x 8:1 MUX (8 LUTs = 2 SLICEs)
                                              hb. - address bits
                                              ..0 - 4-bit nibble (output zero)
                                              ..1 - not 4-bit nibble
        ........xxxxxxxx................   8x 4:1 MUX (8 LUTs = 2 SLICEs)
                                              h. - address bit
                                              .0 - 4-bit nibble or 8-bit byte (output zero)
                                              .1 - 16-bit half or 32-bit word
        ................xxxxxxxxxxxxxxxx  16x 2:1 MUX (8 LUTs = 2 SLICEs, using 5 inputs to 2 output LUTs)
                                              0 - not 32-bit word (output zero)
                                              1 - 32-bit word
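    The load table above can be sketched as a function. This assumes the {h,b,n} address bits select the half, byte, and nibble as in the table, and that the selected field is returned left-justified with zeros below. The 'load' name and its arguments are hypothetical.

```python
def load(word: int, size_bits: int, h: int = 1, b: int = 1, n: int = 1) -> int:
    """Return the addressed field of a 32-bit word, left-justified in 32 bits."""
    if size_bits == 32:
        index = 0
    elif size_bits == 16:
        index = h
    elif size_bits == 8:
        index = (h << 1) | b
    elif size_bits == 4:
        index = (h << 2) | (b << 1) | n
    else:
        raise ValueError("size_bits must be 4, 8, 16, or 32")
    field = (word >> (index * size_bits)) & ((1 << size_bits) - 1)
    return field << (32 - size_bits)   # low bits are the '.' zeros in the table

word = 0x87654321
print(hex(load(word, 4, h=0, b=0, n=0)))   # lowest nibble moved to the top
print(hex(load(word, 8, h=1, b=0)))        # byte 2 moved to the top
```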
    
     - Store permutations, where '.' is a zero (ignored, because using byte write mask)
        11111111111111110000000000000000  ADR
        fedcba9876543210fedcba9876543210  hb   byte write mask
        ================================  ==   ===============
        vutsrqponmlkjihgfedcba9876543210  11   1111
        vutsrqponmlkjihg................  11   1100
        vutsrqpo........................  11   1000
        ........vutsrqpo................  10   0100
        ................vutsrqponmlkjihg  01   0011
        ................vutsrqpo........  01   0010
        ........................vutsrqpo  00   0001
        ================================
        xxxxxxxx........................  Direct map (no MUX)
        ........xxxxxxxx................  8x 2:1 MUX (4 LUTs = 1 SLICE, using 5 inputs to 2 output LUTs)
                                             h - address bit
        ................xxxxxxxx........  8x 2:1 MUX (4 LUTs = 1 SLICE, using 5 inputs to 2 output LUTs)
                                             h - address bit
        ........................xxxxxxxx  8x 4:1 MUX (8 LUTs = 2 SLICEs)
                                             hb - address bits
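    The store side can be sketched the same way. This assumes the source value is left-justified and the {h,b} address bits pick the destination as in the table. Bytes outside the write mask are returned as zero here, where the hardware would simply ignore them. The 'store' name is hypothetical.

```python
def store(value: int, size_bits: int, h: int = 1, b: int = 1):
    """Return (32-bit write data, 4-bit byte write mask) for a left-justified source."""
    if size_bits == 32:
        return value & 0xFFFFFFFF, 0b1111
    if size_bits == 16:
        half = (value >> 16) & 0xFFFF
        return half << (16 * h), 0b11 << (2 * h)
    if size_bits == 8:
        byte = (value >> 24) & 0xFF
        index = (h << 1) | b
        return byte << (8 * index), 1 << index
    raise ValueError("size_bits must be 8, 16, or 32")

data, mask = store(0xAB000000, 8, h=1, b=0)
print(hex(data), format(mask, "04b"))
```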
    

    ________________________________________________________________________________________________________________________________
    [HW] XOR Offsetting
    Often don't want to burn an adder for 'base+offset', doing XOR instead could be useful.

    OFF   000 001 010 011 100 101 110 111        0 1 2 3 4 5 6 7 
    BAS   --- --- --- --- --- --- --- ---        - - - - - - - - 
    000 | 000 001 010 011 100 101 110 111    0 | 0 1 2 3 4 5 6 7  ---> zero BAS works like ADD
    001 | 001 000 011 010 101 100 111 110    1 | 1 0 3 2 5 4 7 6  ---
    010 | 010 011 000 001 110 111 100 101    2 | 2 3 0 1 6 7 4 5   ^
    011 | 011 010 001 000 111 110 101 100 -> 3 | 3 2 1 0 7 6 5 4   |   the rest provide various reordering patterns
    100 | 100 101 110 111 000 001 010 011    4 | 4 5 6 7 0 1 2 3   |
    101 | 101 100 111 110 001 000 011 010    5 | 5 4 7 6 1 0 3 2   v
    110 | 110 111 100 101 010 011 000 001    6 | 6 7 4 5 2 3 0 1  ---
    111 | 111 110 101 100 011 010 001 000    7 | 7 6 5 4 3 2 1 0  ---> ~0 BAS inverts the order of OFF
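    The table above is easy to regenerate. The sketch also checks one extra property that is not in the table: when the base has zeros in all the bit positions the offset can occupy, XOR gives the same address as ADD.

```python
def xor_row(base: int, bits: int = 3):
    """Addresses visited for offsets 0..2^bits-1 with the given base."""
    return [base ^ offset for offset in range(1 << bits)]

for base in range(8):
    print(base, "|", " ".join(str(a) for a in xor_row(base)))

# With an aligned base (low 3 bits zero), XOR matches ADD for 3-bit offsets.
aligned_base = 0b101000
print(all((aligned_base ^ off) == (aligned_base + off) for off in range(8)))
```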
    

    ________________________________________________________________________________________________________________________________
    [HW] Port Conflicts


    ________________________________________________________________________________________________________________________________
    [HW] Instruction Palette
    Core Challenges


    Suggests an instruction palette would be useful. Instruction provides a pointer into a palette of instruction data. Can thus use the rest of the bits for constants. Or alternatively reference constants instead of instruction data.

    Register Operand Compression?
     - Examples
        20-bits : 5-bit x 4 {P,A,B,C}
        20-bits : 4-bit (16-reg window) x 4 {P,A,B,C} = 16-bit + 4-bit offset (2 register granularity)
        19-bits : 4-bit (16-reg window) x 4 {P,A,B,C} = 16-bit + 3-bit offset (4 register granularity)
        18-bits : 4-bit (16-reg window) x 4 {P,A,B,C} = 16-bit + 2-bit offset (8 register granularity)
        16-bits : 3-bit ( 8-reg window) x 4 {P,A,B,C} = 12-bit + 4-bit offset (2 register granularity)
        15-bits : 3-bit ( 8-reg window) x 4 {P,A,B,C} = 12-bit + 3-bit offset (4 register granularity)
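    One possible reading of the windowed encodings, as a sketch only. It assumes the window base is 'offset * granularity', that each operand field indexes into the window, and a made-up bit layout with P in the lowest field and the offset in the top bits. None of that layout is specified above.

```python
def decode_operands(encoded: int, field_bits: int = 4, offset_bits: int = 4,
                    granularity: int = 2) -> dict:
    """Expand a compressed operand word into absolute {P,A,B,C} register numbers."""
    field_mask = (1 << field_bits) - 1
    fields = [(encoded >> (field_bits * i)) & field_mask for i in range(4)]
    offset = (encoded >> (4 * field_bits)) & ((1 << offset_bits) - 1)
    base = offset * granularity        # window start, in registers
    return {name: base + field for name, field in zip("PABC", fields)}

# 20-bit form: 4-bit fields, 4-bit offset, 2 register granularity.
encoded = (3 << 16) | (0xC << 12) | (0xB << 8) | (0xA << 4) | 0x1
print(decode_operands(encoded))
```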
    

    Needs more thought ...

    ________________________________________________________________________________________________________________________________
    [HW] Saturating Integer
    Likely too complex for FPGA.
     - Working example for 16-bit machine
     - DSP output {MSB guard bits, 16-bit output, discard bits LSB}
     - Standard signed saturation (32 terms of overflow guard area)
                    guard
                    |||||fedcba9876543210
                    ======---------------
         -1048576 = 100000000000000000000 - Lowest negative before saturate fails
           -32769 = 111110111111111111111 - Underflow
           -32768 = 111111000000000000000
           -32767 = 111111000000000000001
                    ======---------------
            32766 = 000000111111111111110
            32767 = 000000111111111111111
            32768 = 000001000000000000000 - Overflow
          1048575 = 011111111111111111111 - Highest positive before saturate fails
                    ======---------------
     - 5-bits guard + 1-bit of MSB from 16-bit result region
        - 6:1 LUT provides 'out-of-bounds' detection
     - Requires 8 5:2 LUTs to saturate 16-bit output
        - {2-bits output, 1-bit out-of-bounds, 1-bit saturate enable, 1-bit MSB of guard} inputs to LUT
     - This is thus 2 levels of LUT deep
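    A software sketch of the two LUT levels, assuming the 21-bit layout above (5 guard bits over a 16-bit result). The first step mirrors the 6-input out-of-bounds detect, the second picks the rail from the MSB of the guard. The function names are hypothetical.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low 'bits' of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def saturate16(value21: int) -> int:
    """Clamp a 21-bit DSP result (5 guard bits + 16 bits) to signed 16-bit."""
    value21 &= (1 << 21) - 1
    top6 = value21 >> 15                        # 5 guard bits + MSB of the result
    out_of_bounds = top6 not in (0b000000, 0b111111)
    if not out_of_bounds:
        return sign_extend(value21, 16)
    return -32768 if (top6 >> 5) else 32767     # MSB of guard selects the rail

for v in (32767, 32768, 1048575, -32768, -32769, -1048576):
    print(v, "->", saturate16(v))
```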
    

    ________________________________________________________________________________________________________________________________
    [HW] PIMD
    Pipelined Instruction Multiple Data?
    Network of cores, where the instruction flows through the network (PIMD), instead of being applied to the network at the same time (SIMD). In the most simple network, a line, reductions become linear time (vs log time with SIMD), but with high latency from start to finish.

    First problem with PIMD is that ALUs are also pipelined, so would need some way to manage that delay and still have useful execution. Don't want to use memory at each ALU to buffer a series of instructions. One alternative is to replay the instruction locally, but run through a series of registers. So reductions are always vectorized (aka multi-component).

    Not well thought out ...

    Big Program Small Memory


    As ALU density increases, naturally the quantity of local memories increases, and the size of those memories decreases. It becomes possible to run a unique program per node only if the program is tiny. There would be a desire to interleave the program across the memories, to enable large programs to be executed. This also has a secondary goal, to amortize memory access for the program across all the memories. Thus if the machine had N nodes, instructions would only be fetched 1/N times per node during execution.

    Starting with the simplest of networks, a directional torus which is interleaved so there is no long return path. Worst case path length is 2x the node spacing. Showing a simplified example of an 8 node machine in the horizontal axis.

     ,-------. ,-------. ,-------. ,--.
    a    h    b    g    c    f    d   e
     `--' `-------' `-------' `-------'
    

    Which looks like this logically.

    a -> b -> c -> d -> e -> f -> g -> h
    ^                                  |
    `----------------------------------'
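    The interleaving can be expressed as a mapping from logical node to physical slot. A sketch assuming an even node count, with hypothetical function names. It reproduces the 'a h b g c f d e' placement and checks the 2x worst case hop.

```python
def physical_slot(logical: int, nodes: int = 8) -> int:
    """Physical position of a logical node in the interleaved ring (even node count)."""
    if logical < nodes // 2:
        return 2 * logical                   # outbound half takes the even slots
    return 2 * (nodes - 1 - logical) + 1     # return half fills the odd slots

def physical_order(nodes: int = 8) -> str:
    """Node letters in physical left-to-right order."""
    order = [""] * nodes
    for logical in range(nodes):
        order[physical_slot(logical, nodes)] = chr(ord("a") + logical)
    return " ".join(order)

def worst_hop(nodes: int = 8) -> int:
    """Longest physical distance between logically adjacent nodes, wrap included."""
    return max(abs(physical_slot(i, nodes) - physical_slot((i + 1) % nodes, nodes))
               for i in range(nodes))

print(physical_order())
print(worst_hop())
```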
    

    De-pipelined start example on clock 0. Steady state pipelined execution example on clock 8. Example in octal. Each node pulls the instruction to execute from the instruction previously executed by the logically left neighbor. With exception that node 'A' pulls its instruction from the prior instruction latch value of node 'H' (its left neighbor). Each clock the nodes read the instruction latch value from the logically left neighbor. And the latch value is updated on every 8th clock cycle, taking a read port access to the node's tiny local memory. Note for each line of 8 instructions the instruction stream is actually backwards.

             INSTRUCTION LATCH          INSTRUCTION EXECUTE
    clk    A  B  C  D  E  F  G  H     A  B  C  D  E  F  G  H
    ===    == == == == == == == ==    == == == == == == == ==
      0    07 06 05 04 03 02 01 00                               <-- 8 instructions latched every 8 clocks
      1       07 06 05 04 03 02 01    00
      2          07 06 05 04 03 02    01 00
      3             07 06 05 04 03    02 01 00
      4                07 06 05 04    03 02 01 00
      5                   07 06 05    04 03 02 01 00
      6                      07 06    05 04 03 02 01 00
      7                         07    06 05 04 03 02 01 00
      8    17 16 15 14 13 12 11 10    07 06 05 04 03 02 01 00    <-- 8 instructions latched every 8 clocks
      9       17 16 15 14 13 12 11    10 07 06 05 04 03 02 01
     10          17 16 15 14 13 12    11 10 07 06 05 04 03 02
     11             17 16 15 14 13    12 11 10 07 06 05 04 03
     ..              ...                        ...
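    A small simulation of the latch and execute columns, as a sketch of the rules described above. It assumes a fresh group of 8 instructions is latched every 8 clocks, with 'A' receiving the last of the group and 'H' the first. Instructions print in octal, the clock counter in decimal.

```python
def simulate(clocks: int, nodes: int = 8):
    """Return a list of (clk, latch, execute) with None for empty slots."""
    latch = [None] * nodes
    execute = [None] * nodes
    rows = []
    for clk in range(clocks):
        # Node A executes the prior latch value of the last node,
        # every other node executes what its left neighbor executed last clock.
        execute = [latch[-1]] + execute[:-1]
        if clk % nodes == 0:
            base = clk                                   # reload all latches
            latch = [base + nodes - 1 - i for i in range(nodes)]
        else:
            latch = [None] + latch[:-1]                  # shift toward the right
        rows.append((clk, latch, execute))
    return rows

def show(values):
    return " ".join("  " if v is None else f"{v:02o}" for v in values)

for clk, latch, execute in simulate(12):
    print(f"{clk:3d}    {show(latch)}    {show(execute)}")
```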
    

    This extends to 2D directional torus quite easily. For an 8x8 node example, 64 new instructions would be latched on every 64th clock cycle. Every 8 clock cycles the instruction latch would be pulled from the vertical neighbor.

    Noticed though that instruction decode gets expensive, probably don't want to replicate the decode, instead rather fanout the decoded data, so perhaps not useful ...