oo             oo     oo                                                                                            20230824  
  OO     ,oOOo. oOOooo oOOooo ,oOOo. ,oOOOo                                                                           
  OO     OO  OO  OO     OO    OO  OO OO..            THIS IS FICTITIOUS, FROM FICTIONAL PERSONA, NO IDENTIFICATION WITH ACTUAL
  OO.    OO  OO  OO.    OO.   OO"""' `"""OO          PEOPLE|PRODUCTS|EMPLOYERS|BUSINESSES  IS INTENDED  OR SHOULD BE  INFERRED
  `Ooooo `OooO'  `Oooo  `Oooo `Ooooo ooooO'          SPAM HOLE: FIRST AND LAST NAME  JOINED WITHOUT A SPACE AT  PROTONMAIL.COM

________________________________________________________________________________________________________________________________
[SPC]

IN PROGRESS ...

________________________________________________________________________________________________________________________________
[SPC] Page Faults & BO Alloc
Post on the mechanics of CPU/GPU communication. Using AMDgpu based timing results on the SteamDeck as an example, but relating to the larger picture of PC GFX APIs like Vulkan/etc.

Both RADV and AMDVLK: Flush/invalidate mapped memory ranges is a NOP. So bus-crossing dGPU traffic to HOST_VISIBLE is automatically snooping CPU caches. The one without HOST_CACHED, is Write+Combined [WC] on store, and Uncached [UC] on read. The one with HOST_CACHED is non-WC/UC.

In AMDgpu (the kernel driver), likely DEVICE_LOCAL maps to AMDGPU_GEM_DOMAIN_VRAM (also the carve out on APUs) and the non-DEVICE_LOCAL maps to AMDGPU_GEM_DOMAIN_GTT.

AMD+RADV added {DEVICE_COHERENT_BIT_AMD, DEVICE_UNCACHED_BIT_AMD} variations to the core 4 memory types. Likely to support GPU crash debug. But also provides a way to avoid needing to write-back (flush) GPU caches before CPU read. Likely AMDgpu kernel flag mapping below.

This AMDGPU_GEM_CREATE_CPU_GTT_USWC appears to toggle on WriteCombine [WC] for CPU store, and Uncached [UC] for CPU reads (cases of HOST_VISIBLE without HOST_CACHED).
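
A minimal sketch of the guessed mapping described above, written as a lookup. The function name guess_amdgpu_bo is hypothetical, and the mapping itself is only the hypothesis stated in the text (marked "likely"), not verified driver behavior. Only the flag and domain names come from the text.

```python
# Sketch of the *guessed* Vulkan memory type -> amdgpu BO mapping from the text.
# guess_amdgpu_bo is a hypothetical helper; nothing here is verified against
# RADV, AMDVLK, or the kernel driver.

def guess_amdgpu_bo(device_local, host_visible, host_cached):
    """Return (domain_name, [create_flag_names]) for a Vulkan memory type."""
    # DEVICE_LOCAL -> VRAM (the carve out on APUs), otherwise GTT.
    domain = "AMDGPU_GEM_DOMAIN_VRAM" if device_local else "AMDGPU_GEM_DOMAIN_GTT"
    flags = []
    if host_visible and not host_cached:
        # WC on CPU store, UC on CPU read.
        flags.append("AMDGPU_GEM_CREATE_CPU_GTT_USWC")
    return domain, flags

if __name__ == "__main__":
    for dl in (False, True):
        for hc in (False, True):
            print(dl, True, hc, guess_amdgpu_bo(dl, True, hc))
```

The interesting row is non-DEVICE_LOCAL, HOST_VISIBLE, non-HOST_CACHED, which lands on GTT+USWC.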

For review from ChipsAndCheese Deck bandwidths: ~71 GB/s GPU, ~43 GB/s DMA, ~34 GB/s shader copy CPU<->GPU, ~25 GB/s CPU/CPU, and damn, brutal 0.27 GB/s CPU mapped GPU buffer reads, 0.71 GB/s CPU mapped GPU buffer writes.

And going direct to AMDgpu instead of VK on the Deck shows these kinds of bandwidths (non-DEVICE_LOCAL, HOST_VISIBLE with HOST_CACHED and without). So using Write-Combined is amazingly painful for stores.

Tangential Notes

Implies that the choices one might make on dGPU PC don't necessarily port over to APUs at all. Another challenge: it takes almost 7 seconds to zero-fill using a 64-bit store for() loop the 8-GiB of mapped memory. Hints at why load times are such a challenge even in the best case.
This all hints at why PC OS derived systems are lacking in stuffing GPU VRAM. Really need some kind of bus mastered DMA (zero-copy) between non-volatile storage (disk) and GPU DRAM to avoid this CPU-touching performance tax.
Or for an APU, a way to have the storage device write direct to DRAM in pages setup for use on the GPU, because anything the CPU writes that the GPU reads is stuck on the strangled snooping bus or the strangled WC buffer limit on store.
If the app tried to manage GPU memory oversubscription via these strangled CPU mappings you'd literally see many second stalls. One can only assume the kernel can do storage<->DRAM DMA for managing GPU DRAM oversubscription?
Even if one could fill VRAM in 1 sec, just 2x oversubscription of VRAM implies 1 sec delay to context switch (assuming 2 apps accessing all memory). So classic VM multitasking is useless. Only single app focus model with pinned memory makes any useful sense. Release pin on focus change.
The trend is massive res, massive VRAM paired with tiny IO bus. So the case for pinned memory and bus master transfers only grows in importance as things evolve.
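
The arithmetic behind these notes, as a small sketch. The helper names are made up for illustration. The 8 GiB and 7 second inputs are the zero-fill figures quoted above ("almost 7 seconds", so the result is a rough lower bound on bandwidth).

```python
GIB = 1 << 30

def effective_gib_per_s(nbytes, seconds):
    """Bandwidth implied by moving nbytes in the given time."""
    return nbytes / GIB / seconds

def seconds_to_move(nbytes, gib_per_s):
    """Time needed to move nbytes at a sustained bandwidth."""
    return nbytes / GIB / gib_per_s

if __name__ == "__main__":
    # 8 GiB zero-filled in about 7 seconds through the CPU mapping.
    print(effective_gib_per_s(8 * GIB, 7.0))
    # Refilling 4 GiB at 1 GiB/s on every focus change.
    print(seconds_to_move(4 * GIB, 1.0))
```

At roughly 1.14 GiB/s, any scheme that pages GPU memory through a CPU mapping is counted in seconds, which is the argument for pinned memory.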

Back to SteamDeck Numbers
GTT (HOST_VISIBLE) + USWC (non-HOST_CACHED) takes 1 sec for 1st 4 GiB BO alloc, but *30 sec* for 2nd 4 GiB alloc. Kernel driver time for memory allocation (maybe page table related) can be brutal (general comment for PCs too).

Related, I think Chips&Cheese CPU/GPU link DMA and Compute Bandwidths are only measured to a GTT+USWC buffer (only supporting useless uncached CPU reads). Bandwidth exceeds the CPU's bus capacity, implies it is 'garlic' or GPU bus only accesses, direct to DRAM.

Since RADV+AMDVLK don't support user-space CPU mapped flush | invalidate, this implies the only supported mapping to read from DRAM direct is USWC (uncached R and write+combine W). Doing GPU stores to a CPU mapped GTT without USWC would be cripplingly slow (limited by the snooping bus rate). But this is unfortunately the only option available for CPU read back. So if doing a shader store, it better be only a few waves and running in parallel.

"Use GTT because it's as fast as VRAM on the Deck", could only work if GTT+USWC, as that would be only way to get Garlic (direct to DRAM high bandwidth bus). GTT without USWC would need Onion (slow snooping bus) because AMDgpu's only no-CPU map option is for VRAM!

Summary of the theory on best Steam Deck practices. This is the plan for my deck-compute-only driver too. Theory -> as in I haven't yet verified the GPU-side parts (my driver isn't that far along yet).

Possible to do better? MAYBE! CPU readback actually has 2 problems, 1st the slow GPU-side copy (4 GiB copy via snooping bus could be almost 4 sec, but direct to DRAM via USWC might be just 160 ms). Also CPU only has 4 MiB of L3, so the majority of 4 GiB will be uncached later. Believe UC MTYPE (uncached) forces the CPU into serialized behavior. My test was single thread 8-byte/access reads. VMOVDQA_M256_YMM! Looks like Zen2 might be able to get 32-byte/access via VMOVDQA, and going multi-threaded (8 thread), that might be a 32x speed up.

If so, it might be possible to get just under a GB/s for the CPU-side part (UC multithread via VMOVDQA), which would be close enough to the non-USWC case running single threaded using the cache. What you'd really want here as a band-aid workaround is the ability for the GPU to act as if the CPU map was USWC (so go direct to DRAM), but have the CPU map act as non-USWC, so it goes through the cache. Then some kind of hack to flush the tiny 4 MiB L3 and lower caches on the CPU.

CPU readback (reading and summing 8-byte) GTT without USWC.

1.16 GiB/s 1 thread
1.42 GiB/s 4 threads
1.48 GiB/s 8 threads

Going multi-core on cached readback doesn't really help much. Proper test, parked threads waiting on futex, signal, last-1st active core timing.

Now CPU readback GTT+USWC.

0.13 GiB/s 4 threads 8-byte reads
0.24 GiB/s 8 threads 8-byte reads
1.15 GiB/s 8 threads 32-byte MOVNTDQA

So measurements match theory, going multi-core with MOVNTDQA uncached read on GTT+USWC can be made to match 1 thread GTT (without USWC). Both those results above had been using a pair of 4 GiB allocations. The GTT+USWC one used one 4 GiB GTT+USWC for timing, and one 4 GiB GTT (unused). And that test suffered from a 30 sec GTT BO allocation. So something was going very wrong in the page mapping. When I rerun same test with just one 4 GiB GTT+USWC allocation, the 30 sec stall is gone, and the performance also changes.

18.34 GiB/s - 8 threads x 32-byte MOVNTDQA (UC)

Oh!
Perhaps there is some resource limit that kills perf if too much memory gets mapped, page faults? Top with thread cumulative results doesn't show anything significantly different between the 4 GiB and 8 GiB runs in terms of page faults ... suggests it must be something else.

And yet there is obviously a bug in my multi-core tests, you can tell directly from the page fault numbers, only one thread is taking all the faults. So it is back to finding my coding error (fail). Lunch break and fixed the bug. Two runs now and leaving threads open to get TOP results. First run definitely soaks up the page faults, second run is page fault free (expected). Both around only 7 GB/s.

And the 8 GiB of BO mapped, but only 4 GiB used run. The first pass gets only 1 GiB/s and the second gets 7 GiB/s. Page fault number is similar to last run, can only conclude page fault costs exploded?

30 sec BO alloc time + super low bandwidth on 1st pass only (where page faults happen) suggest that Linux Kernel logic explodes in cost if too many pages are used in this way. ~8K faults for 512 MiB accessed / thread = 64 KiB/fault ... X86-64 has either 4 KiB or 2 MiB for page size. So not using large pages (fail). Probably mapping 16 pages per fault. Not sure if this implies anything about GPU page size (but certainly hoping it isn't 4 KiB, ouch).
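
The fault granularity estimate above, spelled out as a sketch. The inputs (~8K faults, 512 MiB per thread) are the rough figures from the text, and the helper name is made up.

```python
KIB = 1 << 10
MIB = 1 << 20

def bytes_per_fault(bytes_touched, faults):
    """Average amount of memory mapped in by each page fault."""
    return bytes_touched // faults

per_fault = bytes_per_fault(512 * MIB, 8 * 1024)   # bytes mapped per fault
pages_per_fault = per_fault // (4 * KIB)           # 4 KiB pages per fault
faults_with_2mib_pages = (512 * MIB) // (2 * MIB)  # if large pages were used

if __name__ == "__main__":
    print(per_fault // KIB, "KiB per fault")
    print(pages_per_fault, "pages per fault")
    print(faults_with_2mib_pages, "faults per thread with 2 MiB pages")
```

With 2 MiB pages the same 512 MiB would need 256 faults per thread instead of roughly 8K.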

Some other very rough measured numbers of GTT+USWC with 8 cores splitting 4 GiB of BO.

~7 GiB/s R
~10 GiB/s W then R
~15 GiB/s W

I think these seem plausible now (so maybe no more code bugs). One takeaway of all this is that you need to pre-warm the page tables for large mapped buffers (by touching all pages) when the user isn't waiting on results. And if you are doing batch jobs that {open device, send data to GPU, get data back, close device} you are screwed!

And lastly (maybe) comparison of GTT and GTT+USWC both 8 threads splitting streaming through 4 GiB of mapping (second pass, no page fault issues):

R> ~14 GiB/s (GTT) and ~7 GiB/s (GTT+USWC)
W+R> ~14 GiB/s (GTT) and ~10 GiB/s (GTT+USWC)
W> ~13 GiB/s (GTT) and ~15 GiB/s (GTT+USWC)

So an alternative option summary for those who don't want the extra GPU-side GPU to CPU mapped buffer copy step.

If anyone is looking to repro the 30 sec stall: amdgpu_bo_alloc() one 4 GiB GTT+USWC, then one 4 GiB GTT buffer, the second alloc causes the Deck to become unresponsive for 30 seconds.

Note, madvise() with MADV_HUGEPAGE on mapped 4 GiB region doesn't do anything (still faults at 64 KiB granularity), and none of these MADV_{WILLNEED|POPULATE_READ|POPULATE_WRITE} have any effect either (still waits until use before faulting, causing low initial effective bandwidth).

64KiB strided write through 4 GiB GTT+USWC (to pre-fault) costs the same as writing full 4 GiB, roughly 3 seconds. So it is quite literally massive page fault overhead for 1st access. No possible workaround found at this time for initial load time problems.
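
The strided pre-fault pass can be sketched like this. This is only an illustration on an anonymous mmap (or any writable buffer), not on a real amdgpu BO mapping, and the prefault name and 64 KiB default are assumptions taken from the fault granularity measured above.

```python
import mmap

def prefault(buf, stride=64 * 1024):
    """Write one byte per stride so each stride-sized chunk gets faulted in.

    Returns the number of locations touched.
    """
    touched = 0
    for offset in range(0, len(buf), stride):
        buf[offset] = 0
        touched += 1
    return touched

if __name__ == "__main__":
    size = 16 * 1024 * 1024
    # Anonymous mapping as a stand-in for a CPU mapped BO.
    with mmap.mmap(-1, size) as m:
        print(prefault(m), "touches for", size, "bytes")
```

On the Deck measurements above this pass costs as much as writing the whole buffer, so it only helps if it can run while the user isn't waiting.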

If doing two 4 GiB GTT+USWC allocations, there is also a 30 second BO allocate cost on the 2nd one. And this makes the initial page fault cost for access to the first 4 GiB take another 30 seconds. Effectively hangs the machine for a full minute.

Doing two 4 GiB GTT allocations (without USWC) doesn't incur any of the 30 second stalls. So that problem is specific to big USWC allocations. However the initial access page fault problem (extra 3 seconds) is there, so a secondary problem with just mapping lots of APU memory.

Allocation of one 4 GiB GTT first then a 4 GiB GTT+USWC doesn't see the 30 second allocation stall. Almost like anything post a big USWC alloc is poisoned. And after mapping both, then accessing the GTT only, the 30 second time for page faulting comes back. Even if you don't map the 2nd GTT+USWC, the 30 second initial page faulting time is still there, so the act of mapping doesn't matter, simply doing the BO allocation had already doomed the Linux page management.

Despite header docs which imply flag only works on DOMAIN_VRAM, using DOMAIN_GTT+AMDGPU_GEM_CREATE_NO_CPU_ACCESS is apparently what you want for non-mapped GTT allocations. Mapped 4 GiB GTT with 4 GiB GTT+NO_CPU, drops the initial page fault time from 30 seconds to 3 sec.

________________________________________________________________________________________________________________________________
[SPC] Blame Onat

Workspace

One day a friend Onat said to me that on Linux Steam Deck the Vulkan driver is in user space, and it is possible to even have both RADV and the AMD Vulkan drivers running on the system at the same time ...

And That Was Enough to Seed the Idea
An idea that could not be ignored, a platform still exists, where in theory one could ship a game with their own generated shader binary and no driver. A way out of GPU API Hell! This was the day I stopped my other Windows PC projects, and became an exclusive Steam Deck developer.

________________________________

[EAT] Life of a Steak
Sometimes old technology is the best. Like when making Steaks. Salt, short dry age, season, WOOD FIRE, eat.

________________________________

[X68]
This was an idea for a simplified machine-level x86-64 interface ...

Subset of x86-64 where instructions are always a multiple of 32-bits
Using ignored segment override prefixes for padding
Instructions mapped to a human readable 8-character string

Resources

Register Naming
Characters {0-9} and {A-F} reserved for hex numbers. So the 16 registers are mapped {G-V}.

_G_ _H_ _I_ _J_ _K_ _L_ _M_ _N_ _O_ _P_ _Q_ _R_ _S_ _T_ _U_ _V_
rax rcx rdx rbx rsp rbp rsi rdi r8_ r9_ r10 r11 r12 r13 r14 r15
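
A small sketch of the register naming as a lookup, assuming the letter is simply the register number offset from G (which is what the table shows). The function name x68_to_x86 is made up.

```python
X86_REGS = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
            "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]

def x68_to_x86(letter):
    """Map an X68 register letter (G..V, either case) to its x86-64 name."""
    index = ord(letter.upper()) - ord("G")
    if len(letter) != 1 or not 0 <= index < 16:
        raise ValueError("not an X68 register: %r" % (letter,))
    return X86_REGS[index]

if __name__ == "__main__":
    print(" ".join(x68_to_x86(c) for c in "GHIJKLMNOPQRSTUV"))
```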

Addressing Mode Syntax

@u ...... [reg(u)]
@u23 .... [reg(u)+0x23]
@# ...... [rip+0x########], where # is the second DWORD
@u# ..... [reg(u)+0x########]
@uv ..... [reg(v)+reg(u)*1]
@2uv .... [reg(v)+reg(u)*2]
@4uv .... [reg(v)+reg(u)*4]
@8uv .... [reg(v)+reg(u)*8]
@4u23 ... [reg(u)*4+0x23]
@4u# .... [reg(u)*4+0x########]
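
One way to read the addressing syntax above is as a tiny grammar: optional scale digit, a register, then either a second register (the base), a 2-digit hex displacement, or # for a DWORD that follows. A sketch of a decoder under that assumption (the decode_mode name and regex are mine, and it only prints Intel-style text, it doesn't emit machine code):

```python
import re

REGS = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
        "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]

_MODE = re.compile(r"@(?P<scale>[248])?(?P<a>[g-v])?"
                   r"(?:(?P<b>[g-v])|(?P<d8>[0-9a-f]{2})|(?P<d32>#))?")

def _reg(letter):
    return REGS[ord(letter) - ord("g")]

def decode_mode(mode):
    """Translate an X68 addressing string like '@4u23' to Intel-style text."""
    m = _MODE.fullmatch(mode.lower())
    if m is None:
        raise ValueError(mode)
    scale, a, b, d8, d32 = m.group("scale", "a", "b", "d8", "d32")
    if a is None:
        # Only the rip-relative form has no register.
        if d32 and scale is None:
            return "[rip+0x########]"
        raise ValueError(mode)
    if b:
        # First register is the index, second is the base.
        return "[%s+%s*%s]" % (_reg(b), _reg(a), scale or "1")
    term = "%s*%s" % (_reg(a), scale) if scale else _reg(a)
    disp = "0x" + d8 if d8 else ("0x########" if d32 else None)
    return "[%s+%s]" % (term, disp) if disp else "[%s]" % term

if __name__ == "__main__":
    for s in ["@u", "@u23", "@#", "@u#", "@uv", "@2uv", "@4u23", "@4u#"]:
        print(s.ljust(6), decode_mode(s))
```

This works without ambiguity because hex digits {0-9,a-f} and register letters {g-v} never overlap.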

Sizing

'v=@u ... reg(v) = BYTE PTR [reg(u)]
"v=@u ... reg(v) = WORD PTR [reg(u)]
v=@u .... reg(v) = DWORD PTR [reg(u)]
:v=@u ... reg(v) = QWORD PTR [reg(u)]
'v`@u ... reg(v) = sign extend BYTE PTR [reg(u)]

Maths

v+u ... ADD (v+=u)
v*u ... IMUL
~v .... NOT
v?u ... CMP
-v .... NEG
v&u ... AND
v^u ... XOR
v=u ... MOV
v$u ... LEA

Examples

u+v ...... 3E4501FE ds add r14d,r15d
u+23 ..... 4183C623 add r14d,0x23
u+# ...... 3E4181C6 ds add r14d,0x######## ... 32-bit immediate # is next DWORD
u*v ...... 450FAFF7 imul r14d,r15d
; ........ C30F1F00 ret; nop DWORD PTR [rax]
@u=v ..... 3E45893E mov DWORD PTR ds:[r14],r15d
@u12=v ... 45897E12 mov DWORD PTR [r14+0x12],r15d
:u+v ..... 3E4D01FE ds add r14,r15
:u*v ..... 4D0FAFF7 imul r14,r15
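
A sketch of how the register-to-register examples above can be assembled, padding to a multiple of 4 bytes with the ignored DS prefix (3E). The helper names are made up, only ADD and IMUL are covered, and a REX byte is always emitted even when the registers wouldn't need one.

```python
def _rex(w, reg, rm):
    # REX = 0100WRXB; X is unused for register-direct operands.
    return 0x40 | (w << 3) | ((reg >> 3) << 2) | (rm >> 3)

def _modrm_rr(reg, rm):
    # mod=11 selects register-direct addressing.
    return 0xC0 | ((reg & 7) << 3) | (rm & 7)

def pad4(code):
    """Prepend DS segment prefixes until the length is a multiple of 4."""
    return bytes([0x3E]) * ((-len(code)) % 4) + code

def add_rr(dst, src, wide=False):
    # ADD r/m, r (opcode 01): destination goes in ModRM.rm.
    return pad4(bytes([_rex(int(wide), src, dst), 0x01, _modrm_rr(src, dst)]))

def imul_rr(dst, src, wide=False):
    # IMUL r, r/m (opcode 0F AF): destination goes in ModRM.reg.
    return pad4(bytes([_rex(int(wide), dst, src), 0x0F, 0xAF,
                       _modrm_rr(dst, src)]))

if __name__ == "__main__":
    U, V = 14, 15  # r14, r15
    print(add_rr(U, V).hex().upper())
    print(imul_rr(U, V).hex().upper())
    print(add_rr(U, V, wide=True).hex().upper())
    print(imul_rr(U, V, wide=True).hex().upper())
```

The legacy prefix has to come before REX, which is why the padding is prepended rather than appended.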

________________________________________________________________________________________________________________________________

The 'Right' Font

Only right angles, no aliasing even if scaled nearest
X68.fon ... 5x10 character cell at 320x240 provides a 64x24 character cell screen
X68D.fon ... Doubled size version (useful when not running on CRT)
Made with Fony

X68 at 1:2x2

X68 at 1:1

An alternative for even smaller text? (Outtake)

At 1:2x2

At 1:1

________________________________________________________________________________________________________________________________

Windows/Linux ABIs

For reference...
Reviewing stack operations on x86-64. Stack grows down, RSP points to last written entry.

CALL rax ... rsp-=8; [rsp]=next(rip); rip=rax;
POP rax .... rax=[rsp]; rsp+=8;
PUSH rax ... rsp-=8; [rsp]=rax;
RET ........ rip=[rsp]; rsp+=8;
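
A toy model of these four operations, just to check that the pointer arithmetic balances. The Machine class and its field names are made up, and memory is a plain dict keyed by address.

```python
class Machine:
    """Minimal model of RSP/RIP behavior for PUSH, POP, CALL and RET."""

    def __init__(self, rsp=0x1000):
        self.rsp = rsp
        self.rip = 0
        self.mem = {}

    def push(self, value):
        self.rsp -= 8              # stack grows down
        self.mem[self.rsp] = value

    def pop(self):
        value = self.mem[self.rsp]  # RSP points at the last written entry
        self.rsp += 8
        return value

    def call(self, target, next_rip):
        self.push(next_rip)        # return address goes on the stack
        self.rip = target

    def ret(self):
        self.rip = self.pop()      # pops the return address, so RSP goes up

if __name__ == "__main__":
    m = Machine()
    m.call(target=0x4000, next_rip=0x2005)
    print(hex(m.rsp), hex(m.rip))
    m.ret()
    print(hex(m.rsp), hex(m.rip))
```

A CALL followed by a RET leaves RSP where it started, as does a PUSH followed by a POP.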

Reviewing Windows and Linux ABI register conventions.

 N X86 X68 WIN LXN
 = === === === ===
 0 rax g   r   r   (return value)
 1 rcx h   a0  a3
 2 rdx i   a1  a2
 3 rbx j   sav sav
 4 rsp k   sav sav (stack pointer)
 5 rbp l   sav sav
 6 rsi m   sav a1
 7 rdi n   sav a0
 8 r8  o   a2  a4
 9 r9  p   a3  a5
 a r10 q   vol vol
 b r11 r   vol vol
 c r12 s   sav sav
 d r13 t   sav sav
 e r14 u   sav sav
 f r15 v   sav sav

Then the Windows stack conventions. Anything less than RSP can be overwritten any time, thus must move RSP before writing below a set RSP point. Before a CALL, RSP must be 16-byte aligned. There is a 32-byte 'shadow' region reserved for called function usage.

...
[RSP+0x28] A5
[RSP+0x20] A4
[RSP+0x18] not A3 (R9  shadow)
[RSP+0x10] not A2 (R8  shadow)
[RSP+0x08] not A1 (RDX shadow)
[RSP+0x00] not A0 (RCX shadow) ... 16-byte aligned
(return_goes_here)

Linux conventions.

...
[RSP+0x08] A7
[RSP+0x00] A6
(return_goes_here)
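
The two conventions can be folded into one lookup. A sketch for integer/pointer arguments only, zero-indexed like the tables above, with stack offsets relative to RSP just before the CALL. The arg_location name and the "win"/"lxn" keys are made up to match the column headers.

```python
WIN_REGS = ["rcx", "rdx", "r8", "r9"]
LXN_REGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]

def arg_location(abi, n):
    """Where integer argument n (0-based) lives right before the CALL."""
    if abi == "win":
        if n < len(WIN_REGS):
            return WIN_REGS[n]
        # The 32-byte shadow region takes the first four slots.
        return "[rsp+0x%02x]" % (n * 8)
    if abi == "lxn":
        if n < len(LXN_REGS):
            return LXN_REGS[n]
        # No shadow region, stack arguments start at RSP.
        return "[rsp+0x%02x]" % ((n - len(LXN_REGS)) * 8)
    raise ValueError(abi)

if __name__ == "__main__":
    for n in range(8):
        print(n, arg_location("win", n).ljust(12), arg_location("lxn", n))
```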

________________________________________________________________________________________________________________________________

Windows Terminal Remap

Docs for taking Windows terminal codes and mapping them into simple 8-bit single byte codes (for a portable editor).

//_______________________________________________/WINDOWS:KEYDOC
// _/KEYS\_____
// EN end
// ES escape
// BS backspace
// DN down
// DL delete
// HM home
// IN insert
// LF left
// PD page down
// PU page up
// RN return
// RT right
// SP space
// TB tab
// UP up
// _/EXCEPTIONS\___________________________________
// CTRL+h aliases CTRL+BACKSPACE
// CTRL+i aliases TAB
// CTRL+j aliases CTRL+RETURN
// CTRL+m aliases RETURN
// CTRL+[ aliases ESCAPE aliases control code start
// NO SHIFT+{BACKSPACE,RETURN,SPACE}
// NO  CTRL+{`-=;',.z,TAB,SPACE}
// NO   ALT+{TAB,RETURN}
// _/INPUT\_________________________
// =-=============-==-==-==-==-==-==
// A space........ 1b 20
// A ' ........... 1b 27
// A , ........... 1b 2c
// A - ........... 1b 2d
// A . ........... 1b 2e
// A / ........... 1b 2f
// A = ........... 1b 3d
// A 0 ........... 1b 30
// . . ........... .. ..
// A 9 ........... 1b 39
// A ; ........... 1b 3b
// A [ ........... 1b 5b
// A \ ........... 1b 5c
// A ] ........... 1b 5d
// A ` ........... 1b 60
// A a ........... 1b 61
// . . ........... .. ..
// A z ........... 1b 7a
// A backspace ... 1b 7f
// =-=============-==-==-==-==-==-==
// S tab ......... 1b 5b 5a
// =-=============-==-==-==-==-==-==
//   insert ...... 1b 5b 32 7e
//   delete ...... 1b 5b 33 7e
//   page up ..... 1b 5b 35 7e
//   page down ... 1b 5b 36 7e
//   up .......... 1b 5b 41
//   down ........ 1b 5b 42
//   right ....... 1b 5b 43
//   left ........ 1b 5b 44
//   end ......... 1b 5b 46
//   home ........ 1b 5b 48
// =-=============-==-==-==-==-==-==
// S insert ...... 1b 5b 32 36 32 7e
// S delete ...... 1b 5b 33 36 32 7e
// S page up ..... 1b 5b 35 36 32 7e
// S page down ... 1b 5b 36 36 32 7e
// S up .......... 1b 5b 31 3b 32 41
// S down ........ 1b 5b 31 3b 32 42
// S right ....... 1b 5b 31 3b 32 43
// S left ........ 1b 5b 31 3b 32 44
// S end ......... 1b 5b 31 3b 32 46
// S home ........ 1b 5b 31 3b 32 48
// =-=============-==-==-==-==-==-==
// C insert ...... 1b 5b 32 36 35 7e
// C delete ...... 1b 5b 33 36 35 7e
// C page up ..... 1b 5b 35 36 35 7e
// C page down ... 1b 5b 36 36 35 7e
// C up .......... 1b 5b 31 3b 35 41
// C down ........ 1b 5b 31 3b 35 42
// C right ....... 1b 5b 31 3b 35 43
// C left ........ 1b 5b 31 3b 35 44
// C end ......... 1b 5b 31 3b 35 46
// C home ........ 1b 5b 31 3b 35 48
// =-=============-==-==-==-==-==-==
// _/OUTPUT_MATCHING\_________ _/OUTPUT_CUSTOM\___________
// ==-==- ==-==- ==-==- ==-==- ==-==- ==-==- ==-==- ==-==-
// 00     20 SP  40 @   60 `   80     a0 SPa c0     e0 ` a
// 01 a c 21 !   41 A   61 a   81     a1 INa c1 INc e1 a a
// 02 b c 22 "   42 B   62 b   82     a2 DLa c2 DLc e2 b a
// 03 c c 23 #   43 C   63 c   83     a3 PUa c3 PUc e3 c a 
// 04 d c 24 $   44 D   64 d   84     a4 PDa c4 PDc e4 d a  
// 05 e c 25 %   45 E   65 e   85     a5 UPa c5 UPc e5 e a 
// 06 f c 26 &   46 F   66 f   86     a6 DNa c6 DNc e6 f a  
// 07 g c 27 '   47 G   67 g   87     a7 RTa c7 RTc e7 g a 
// 08 h c 28 (   48 H   68 h   88     a8 LFa c8 LFc e8 h a  
// 09 TB  29 )   49 I   69 i   89 TBs a9 ENa c9 ENc e9 i a  
// 0a j c 2a *   4a J   6a j   8a     aa HMa ca HMc ea j a  
// 0b k c 2b +   4b K   6b k   8b     ab     cb     eb k a  
// 0c l c 2c ,   4c L   6c l   8c     ac , a cc     ec l a  
// 0d RN  2d -   4d M   6d m   8d     ad - a cd     ed m a  
// 0e n c 2e .   4e N   6e n   8e     ae . a ce     ee n a  
// 0f o c 2f /   4f O   6f o   8f     af / a cf     ef o a  
// 10 p c 30 0   50 P   70 p   90     b0 0 a d0     f0 p a
// 11 q c 31 1   51 Q   71 q   91 INs b1 1 a d1 IN  f1 q a
// 12 r c 32 2   52 R   72 r   92 DLs b2 2 a d2 DL  f2 r a
// 13 s c 33 3   53 S   73 s   93 PUs b3 3 a d3 PU  f3 s a  
// 14 t c 34 4   54 T   74 t   94 PDs b4 4 a d4 PD  f4 t a  
// 15 u c 35 5   55 U   75 u   95 UPs b5 5 a d5 UP  f5 u a  
// 16 v c 36 6   56 V   76 v   96 DNs b6 6 a d6 DN  f6 v a  
// 17 w c 37 7   57 W   77 w   97 RTs b7 7 a d7 RT  f7 w a  
// 18 x c 38 8   58 X   78 x   98 LFs b8 8 a d8 LF  f8 x a  
// 19 y c 39 9   59 Y   79 y   99 ENs b9 9 a d9 EN  f9 y a  
// 1a z c 3a :   5a Z   7a z   9a HMs ba     da HM  fa z a  
// 1b ES  3b ;   5b [   7b {   9b     bb ; a db [ a fb    
// 1c \ c 3c <   5c \   7c |   9c     bc     dc \ a fc    
// 1d ] c 3d =   5d ]   7d }   9d     bd = a dd ] a fd    
// 1e     3e >   5e ^   7e ~   9e     be     de     fe    
// 1f / c 3f ?   5f _   7f BS  9f     bf     df     ff BSa
// ==-==- ==-==- ==-==- ==-==- ==-==- ==-==- ==-==- ==-==-
// _/ENCODING_6_CHARS_INTO_32BIT\_________________________________
// char 0 - 7-bit 
// char 1 - 7-bit
// char 2 - 0,31-5a -> 30-5a -> 0-2a -> 0-39 -> 6-bit
// char 3 - 0,36,3b,7e
//  0000000
//  0110110 ... extract 2 bits
//  0111011
//  1111110
//     ab 
// char 4 - 0,32,35
//  000000
//  110010 ... extract 2 bits
//  110101
//     ab
// char 5 - 0,41,,7e - can just use 7-bit
// ---------------------------------------------------------------
// 11111111111111110000000000000000
// fedcba9876543210fedcba9876543210
// ================================
// .0000000........................  char 0 [lower 7-bits]
// ........1111111.................  char 1 [lower 7-bits]
// ...............222222...........  char 2 [clamp(c-0x30,0,0x39)]
// .....................33.........  char 3 [(c>>2)&3]
// .......................44.......  char 4 [(c>>1)&3]
// .........................5555555  char 5 [lower 7-bits]
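
A sketch of the 6-chars-into-32-bit packing laid out above, written in Python rather than in the editor's own language. The pack_keys name is made up, and sequences shorter than 6 bytes are assumed to be zero padded (the "0" case listed for each char).

```python
def clamp(x, lo, hi):
    return max(lo, min(hi, x))

def pack_keys(seq):
    """Pack up to 6 input bytes of a terminal key sequence into 31 bits."""
    c = list(seq[:6]) + [0] * (6 - len(seq))
    return (((c[0] & 0x7F) << 24) |                    # char 0, lower 7 bits
            ((c[1] & 0x7F) << 17) |                    # char 1, lower 7 bits
            (clamp(c[2] - 0x30, 0, 0x39) << 11) |      # char 2, 6 bits
            (((c[3] >> 2) & 3) << 9) |                 # char 3, 2 bits
            (((c[4] >> 1) & 3) << 7) |                 # char 4, 2 bits
            (c[5] & 0x7F))                             # char 5, lower 7 bits

if __name__ == "__main__":
    table = {
        "up": "1b5b41", "down": "1b5b42", "insert": "1b5b327e",
        "S insert": "1b5b3236327e", "C insert": "1b5b3236357e",
        "S up": "1b5b313b3241", "C up": "1b5b313b3541",
    }
    for name, hexseq in table.items():
        print(name.ljust(9), "%08x" % pack_keys(bytes.fromhex(hexseq)))
```

The packed value is then a single 32-bit key for a lookup into the output table, instead of a string compare on the raw escape sequence.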

________________________________________________________________________________________________________________________________

SPIR-V Notes

Aim
The point of this was to look at the possibility to replace GLSL with some simplified virtual assembly language (something that is closer to 1:1 mapping to GCN ISA), and see if that can be expressed in SPIR-V. I believe the answer to that is YES. Notice that multi-component values like uvec4 get reduced to OpLoad and OpStore without using OpPhi given a branch, so a compiler would need to be able to handle optimizing with loads and stores. Which implies it would be easy to just pre-allocate N*4 registers as N multi-component values, then use load/store to access.

Tangentially, it is possible just in the examples below to see how SPIR-V is a great example of bad engineering as SPIR-V obfuscates the meaning of code with very poor information density.

SPIR-V Reference
SPIR-V 1.0 Spec

Minimal SPIR-V File?
__GLSL__

#version 450
layout(local_size_x=64)in;
void main(){}

__SPIR-V__

; SPIR-V
; Version: 1.0
; Generator: Khronos Glslang Reference Front End; 7
; Bound: 11
; Schema: 0
               OpCapability Shader
          %1 = OpExtInstImport "GLSL.std.450"
               OpMemoryModel Logical GLSL450
               OpEntryPoint GLCompute %main "main"
               OpExecutionMode %main LocalSize 64 1 1
               OpSource GLSL 450
               OpName %main "main"
               OpDecorate %gl_WorkGroupSize BuiltIn WorkgroupSize
       %void = OpTypeVoid
          %3 = OpTypeFunction %void
       %uint = OpTypeInt 32 0
     %v3uint = OpTypeVector %uint 3
    %uint_64 = OpConstant %uint 64
     %uint_1 = OpConstant %uint 1
%gl_WorkGroupSize = OpConstantComposite %v3uint %uint_64 %uint_1 %uint_1
       %main = OpFunction %void None %3
          %5 = OpLabel
               OpReturn
               OpFunctionEnd

Adding Buffer Binding
__GLSL__

#version 450
layout(set=0,binding=1,std430)buffer b0_{uvec4 b0[4096];};
layout(local_size_x=64)in;
void main(){}

__SPIR-V__

; SPIR-V
; Version: 1.0
; Generator: Khronos Glslang Reference Front End; 7
; Bound: 17
; Schema: 0
               OpCapability Shader
          %1 = OpExtInstImport "GLSL.std.450"
               OpMemoryModel Logical GLSL450
               OpEntryPoint GLCompute %main "main"
               OpExecutionMode %main LocalSize 64 1 1
               OpSource GLSL 450
               OpName %main "main"
               OpName %b0_ "b0_"
               OpMemberName %b0_ 0 "b0"
               OpName %_ ""
               OpDecorate %_arr_v4uint_uint_4096 ArrayStride 16
               OpMemberDecorate %b0_ 0 Offset 0
               OpDecorate %b0_ BufferBlock
               OpDecorate %_ DescriptorSet 0
               OpDecorate %_ Binding 1
               OpDecorate %gl_WorkGroupSize BuiltIn WorkgroupSize
       %void = OpTypeVoid
          %3 = OpTypeFunction %void
       %uint = OpTypeInt 32 0
     %v4uint = OpTypeVector %uint 4
  %uint_4096 = OpConstant %uint 4096
%_arr_v4uint_uint_4096 = OpTypeArray %v4uint %uint_4096
        %b0_ = OpTypeStruct %_arr_v4uint_uint_4096
%_ptr_Uniform_b0_ = OpTypePointer Uniform %b0_
          %_ = OpVariable %_ptr_Uniform_b0_ Uniform
     %v3uint = OpTypeVector %uint 3
    %uint_64 = OpConstant %uint 64
     %uint_1 = OpConstant %uint 1
%gl_WorkGroupSize = OpConstantComposite %v3uint %uint_64 %uint_1 %uint_1
       %main = OpFunction %void None %3
          %5 = OpLabel
               OpReturn
               OpFunctionEnd

Buffer Load, Component Modify, Buffer Store
__GLSL__

#version 450
layout(set=0,binding=1,std430)buffer b0_{uvec4 b0[4096];};
layout(local_size_x=64)in;
void main(){uvec4 u=b0[0];u.x+=1u;b0[0]=u;}

__SPIR-V__

; SPIR-V
; Version: 1.0
; Generator: Khronos Glslang Reference Front End; 7
; Bound: 32
; Schema: 0
               OpCapability Shader
          %1 = OpExtInstImport "GLSL.std.450"
               OpMemoryModel Logical GLSL450
               OpEntryPoint GLCompute %main "main"
               OpExecutionMode %main LocalSize 64 1 1
               OpSource GLSL 450
               OpName %main "main"
               OpName %u "u"
               OpName %b0_ "b0_"
               OpMemberName %b0_ 0 "b0"
               OpName %_ ""
               OpDecorate %_arr_v4uint_uint_4096 ArrayStride 16
               OpMemberDecorate %b0_ 0 Offset 0
               OpDecorate %b0_ BufferBlock
               OpDecorate %_ DescriptorSet 0
               OpDecorate %_ Binding 1
               OpDecorate %gl_WorkGroupSize BuiltIn WorkgroupSize
       %void = OpTypeVoid
          %3 = OpTypeFunction %void
       %uint = OpTypeInt 32 0
     %v4uint = OpTypeVector %uint 4
%_ptr_Function_v4uint = OpTypePointer Function %v4uint
  %uint_4096 = OpConstant %uint 4096
%_arr_v4uint_uint_4096 = OpTypeArray %v4uint %uint_4096
        %b0_ = OpTypeStruct %_arr_v4uint_uint_4096
%_ptr_Uniform_b0_ = OpTypePointer Uniform %b0_
          %_ = OpVariable %_ptr_Uniform_b0_ Uniform
        %int = OpTypeInt 32 1
      %int_0 = OpConstant %int 0
%_ptr_Uniform_v4uint = OpTypePointer Uniform %v4uint
     %uint_1 = OpConstant %uint 1
     %uint_0 = OpConstant %uint 0
%_ptr_Function_uint = OpTypePointer Function %uint
     %v3uint = OpTypeVector %uint 3
    %uint_64 = OpConstant %uint 64
%gl_WorkGroupSize = OpConstantComposite %v3uint %uint_64 %uint_1 %uint_1
       %main = OpFunction %void None %3
          %5 = OpLabel
          %u = OpVariable %_ptr_Function_v4uint Function
         %18 = OpAccessChain %_ptr_Uniform_v4uint %_ %int_0 %int_0
         %19 = OpLoad %v4uint %18
               OpStore %u %19
         %23 = OpAccessChain %_ptr_Function_uint %u %uint_0
         %24 = OpLoad %uint %23
         %25 = OpIAdd %uint %24 %uint_1
         %26 = OpAccessChain %_ptr_Function_uint %u %uint_0
               OpStore %26 %25
         %27 = OpLoad %v4uint %u
         %28 = OpAccessChain %_ptr_Uniform_v4uint %_ %int_0 %int_0
               OpStore %28 %27
               OpReturn
               OpFunctionEnd

Now With Simple Conditional
__GLSL__

#version 450
layout(set=0,binding=1,std430)buffer b0_{uvec4 b0[4096];};
layout(local_size_x=64)in;
void main(){uvec4 u=b0[0];if(u.x!=0u)u.x+=1u;else u.x+=2u;b0[0]=u;}

__SPIR-V__

; SPIR-V
; Version: 1.0
; Generator: Khronos Glslang Reference Front End; 7
; Bound: 44
; Schema: 0
               OpCapability Shader
          %1 = OpExtInstImport "GLSL.std.450"
               OpMemoryModel Logical GLSL450
               OpEntryPoint GLCompute %main "main"
               OpExecutionMode %main LocalSize 64 1 1
               OpSource GLSL 450
               OpName %main "main"
               OpName %u "u"
               OpName %b0_ "b0_"
               OpMemberName %b0_ 0 "b0"
               OpName %_ ""
               OpDecorate %_arr_v4uint_uint_4096 ArrayStride 16
               OpMemberDecorate %b0_ 0 Offset 0
               OpDecorate %b0_ BufferBlock
               OpDecorate %_ DescriptorSet 0
               OpDecorate %_ Binding 1
               OpDecorate %gl_WorkGroupSize BuiltIn WorkgroupSize
       %void = OpTypeVoid
          %3 = OpTypeFunction %void
       %uint = OpTypeInt 32 0
     %v4uint = OpTypeVector %uint 4
%_ptr_Function_v4uint = OpTypePointer Function %v4uint
  %uint_4096 = OpConstant %uint 4096
%_arr_v4uint_uint_4096 = OpTypeArray %v4uint %uint_4096
        %b0_ = OpTypeStruct %_arr_v4uint_uint_4096
%_ptr_Uniform_b0_ = OpTypePointer Uniform %b0_
          %_ = OpVariable %_ptr_Uniform_b0_ Uniform
        %int = OpTypeInt 32 1
      %int_0 = OpConstant %int 0
%_ptr_Uniform_v4uint = OpTypePointer Uniform %v4uint
     %uint_0 = OpConstant %uint 0
%_ptr_Function_uint = OpTypePointer Function %uint
       %bool = OpTypeBool
     %uint_1 = OpConstant %uint 1
     %uint_2 = OpConstant %uint 2
     %v3uint = OpTypeVector %uint 3
    %uint_64 = OpConstant %uint 64
%gl_WorkGroupSize = OpConstantComposite %v3uint %uint_64 %uint_1 %uint_1
       %main = OpFunction %void None %3
          %5 = OpLabel
          %u = OpVariable %_ptr_Function_v4uint Function
         %18 = OpAccessChain %_ptr_Uniform_v4uint %_ %int_0 %int_0
         %19 = OpLoad %v4uint %18
               OpStore %u %19
         %22 = OpAccessChain %_ptr_Function_uint %u %uint_0
         %23 = OpLoad %uint %22
         %25 = OpINotEqual %bool %23 %uint_0
               OpSelectionMerge %27 None
               OpBranchConditional %25 %26 %33
         %26 = OpLabel
         %29 = OpAccessChain %_ptr_Function_uint %u %uint_0
         %30 = OpLoad %uint %29
         %31 = OpIAdd %uint %30 %uint_1
         %32 = OpAccessChain %_ptr_Function_uint %u %uint_0
               OpStore %32 %31
               OpBranch %27
         %33 = OpLabel
         %35 = OpAccessChain %_ptr_Function_uint %u %uint_0
         %36 = OpLoad %uint %35
         %37 = OpIAdd %uint %36 %uint_2
         %38 = OpAccessChain %_ptr_Function_uint %u %uint_0
               OpStore %38 %37
               OpBranch %27
         %27 = OpLabel
         %39 = OpLoad %v4uint %u
         %40 = OpAccessChain %_ptr_Uniform_v4uint %_ %int_0 %int_0
               OpStore %40 %39
               OpReturn
               OpFunctionEnd

________________________________________________________________________________________________________________________________

Shipping Only 16-Bit

TLDR, going to require 16-bit shader support! One man show, avoiding having both a 32-bit and 16-bit shader variation is highly desired. Don't want to limit the tech based on decisions required to support 32-bit. Would it be practical to develop and ship just a 16-bit version? This is Vulkan only, no XBox (for certain), and no Playstation (lack of time). Don't need 16-bit buffer access, and wave size is managed via spec constants, so relatively easy.

vulkan.gpuinfo.org - VkPhysicalDeviceFeatures::shaderInt16 - VK_KHR_shader_float16_int8::shaderFloat16

AMD : GCN5 (Vega) and up. Hits Steam Deck (assuming they have 16-bit support in the software stack, RADV does apparently). Note GCN4 (Polaris) has single rate 16-bit so that would be supported if AMD would stop disabling it in software.
Intel : This says 16-bit support since Gen8 (at least in Linux). Beyond Arc, not sure if Intel iGPUs will be fast enough to run this project. Assuming Gen11 baseline maybe, and would need to also have 16-wide wave permutation to be optimal.
NVIDIA : Turing (aka 20 series) and up. NV lacks some of the ISA support, but likely it's emulated. NVIDIA is not a performance limited target, so will take 'no perf uplift' or even a 'slight reduction in perf' to avoid authoring 32-bit only shader variations.

Adding up the top cards (all NVIDIA) on Steam Hardware Survey shows at least 30% of the market won't support 16-bit. But also that at least 30% of the market will support 16-bit, and that is good enough for me.

________________________________________________________________________________________________________________________________

X68 Epilogue

Keeping notes here in case I ever choose to revisit ...
The time for this project was replaced by [SBC]. The ability to write GPU binaries and command buffers was too great, and after that there is effectively no need for any CPU logic except the ugly that is interfacing inside a modern OS.

________________________________

[PR0] Random Prototype 0
TODO, out of time perhaps more on these later...

Raw View Without Hole Filling Or Temporal Reconstruction

Majority of white dots are actually holes in the scene. This is using stratified visibility, it doesn't necessarily find an intersection for each pixel.

With Temporal Reconstruction and Grain

This uses a spatial temporal reconstruction that also fills holes and removes noise.

________________________________________________________________________________________________________________________________
[PR0] Octahedral Framebuffer
Implemented a 1024x512 rectangular layout octahedral framebuffer, with a 360 degree cylindrical projection in a three stage pipeline. The intermediate stage samples the octahedron into a VGA-like 720x256 resolution target with a warped vertical. This is to avoid perspective induced undersampling. Final stage applies the CRT shader.

CRT shader has progressively thicker scanlines at top and bottom due to the vertical warpage. Running full 16:9 but with a strong vignette. The shadow mask is blended out towards the center of the screen to increase peak brightness. Horizontal blur is increased towards the right and left for the chromatic aberration.

Both the sampling of the octahedron and the VGA intermediate images are done with linear filtering and a wide gaussian kernel. Monochromatic tonemapping is applied afterwards. Followed by linear-space colorizing of the greyscale. The octahedron output is 32-bit packed {8-bit 1/(1+luma) in gamma 2.0, 13-bit x, 11-bit y}. Sub-pixel position will later be used in filtering.

Shots below have linear temporal average of simple ray traced dummy scene, enough to show first pass of post pipeline. Have some changes to try before moving on to the next step in development.

Not planning on doing bloom. Due to the wide gaussian kernel, bright areas naturally have a slight bleed, which is likely enough of a hint of brightness. Not going to do DOF, following the strategy: if it cannot be done really well, don't bother. Not doing motion blur, as there are no hard edges for the eye to get stuck on, focusing on peak frame rate instead. Not doing local contrast adaption or sharpening of any kind, don't like the look of negative lobes inverting the edge.

________________________________________________________________________________________________________________________________
[PR0] GPU Program Bring Up
Not yet to the fun part, still laying down foundation. Added a frame counter as a push constant for the single dispatch. This will be enough to branch on to get to even and odd frame permutations. So that double buffering works properly. Dispatch sizing is a fixed 2K workgroups. Which in theory at 64 VGPRs/wave target, is good for a 128 CU machine (in classic GCN arch). Classic GCN has maxed out at a 64 CU machine. Will have to modify this stuff later.
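
The 128 CU figure can be checked with a few lines of C. This is a sketch assuming wave-sized workgroups (one wave each) and the classic GCN numbers of 4 SIMDs per CU, 256 VGPRs per lane per SIMD, and at most 10 waves per SIMD. Function names are made up for illustration.

```c
/* Resident waves per CU on classic GCN for a given VGPR budget per wave. */
static unsigned waves_per_cu(unsigned vgprs_per_wave) {
    unsigned per_simd = 256u / vgprs_per_wave;   /* VGPR-limited occupancy */
    if (per_simd > 10u) per_simd = 10u;          /* hardware cap per SIMD */
    return 4u * per_simd;                        /* 4 SIMDs per CU */
}

/* CUs a dispatch can fill in one go, assuming one wave per workgroup. */
static unsigned cus_filled(unsigned workgroups, unsigned vgprs_per_wave) {
    return workgroups / waves_per_cu(vgprs_per_wave);
}
```

At 64 VGPRs that is 4 waves per SIMD, 16 per CU, so 2048 workgroups cover 128 CUs.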

Bringing up helper GLSL structure, binding points, etc. Initial testing involved writing to the front buffer a color based on the frame count push constant. Front buffer rendering seems to "work" both when full-screen (low latency) and windowed (high latency). Example screen shot shows {red,black} color due to window compositor reading during program writes.

Software Spin Wait - With pipelined execution, a wait on a "barrier" would be expected to not wait. The barrier only functions as a safe-guard in case something goes wrong. Initial check to see if a barrier is signaled should thus involve a cached read (because there would be an expectation of other waves later reading the same value). Only if the initial read fails should an uncached spin wait be invoked. The RDNA ISA Guide shows hardware support, in both scalar and vector loads, for GLC=0 reads which hit on the cache, and GLC=1 reads which force a fetch from L2 and evict the line afterwards ("miss-evict"). Ideal spin wait would be the following,

// Want this logic in SALU only (no burning vector ALU cycles).
if(ramR.barrier<signal){        // (A.) SMEM GLC=0 read, only enter if not signaled.
 while(ramRV.barrier<signal){}} // (B.) SMEM GLC=1 read, spin while not signaled.

The first wave would miss on the first (A.) read. If the barrier passes, all future waves will hit and quickly pass. If the spin (B.) is invoked, the second wave will miss on the first (A.) read, because GLC=1 evicts the line post-read. But the third will hit.
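
The hit/miss sequence above can be walked through with a toy one-line cache model in C. This is only a sketch of the reasoning: a single cache line, waves run one after another, and the type and function names are invented for illustration.

```c
#include <stdbool.h>

typedef struct { bool valid; int hits; int misses; } Line;

/* GLC=0 read: hit if the line is resident, otherwise miss and fill. */
static void read_glc0(Line *l) {
    if (l->valid) { l->hits++; } else { l->misses++; l->valid = true; }
}

/* GLC=1 read: always fetch from L2, then evict the line ("miss-evict"). */
static void read_glc1(Line *l) { l->misses++; l->valid = false; }

/* One wave's wait: the (A.) read, then 'spins' iterations of the (B.) read.
   spins == 0 means the barrier was already signaled at (A.).
   Returns true if the (A.) read hit in the cache. */
static bool wave_wait(Line *l, int spins) {
    bool hit = l->valid;
    read_glc0(l);
    for (int i = 0; i < spins; i++) read_glc1(l);
    return hit;
}
```

With the first wave spinning, the second wave misses on (A.) because the line was evicted, and the third hits.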

API ASK #0 - Ability to provide branch hints (coherent vs divergent, and expected branch outcome). The compiler output (see below) is always the slower option. It keeps the most uncommon path inline resulting in the most inefficient execution.

HARDWARE ASK #0 - Would be nice to have a way to force a miss on read but leave the line in the cache.

AMD DRIVER BUG #0 - AMD driver sees "readonly" then ignores the other memory qualifiers. This is both a correctness and optimization bug. So there is no way to get SMEM loads with GLC=1 set. This pushes the overhead into the VMEM and VALU paths for (B.). If the first wave sees a non-signaled state in (A.) then likely all waves on that cache will always invoke the slower (B.) spin loop, because nothing will be refreshing the K$ line.

// layout(set=0,binding=2,std430)readonly buffer ssbo3_ {RamT ramRV;};
s_buffer_load_dword  s0, s[12:15], null               // 000000000020: F4200006 FA000000

// layout(set=0,binding=2,std430)volatile readonly buffer ssbo3_ {RamT ramRV;};
s_buffer_load_dword  s0, s[12:15], null               // 000000000020: F4200006 FA000000

// layout(set=0,binding=2,std430)coherent readonly buffer ssbo3_ {RamT ramV;};
s_buffer_load_dword  s0, s[12:15], null               // 000000000020: F4200006 FA000000

// layout(set=0,binding=2,std430)volatile buffer ssbo3_ {RamT ramRV;};
buffer_load_dword  v1, v0, s[12:15], 0 dlc glc        // 000000000020: E030C000 80030100

// layout(set=0,binding=2,std430)coherent buffer ssbo3_ {RamT ramRV;};
buffer_load_dword  v1, v0, s[12:15], 0 dlc glc        // 000000000020: E030C000 80030100

AMD DRIVER BUG #1 - The above code with the VMEM spin workaround won't work due to this second bug. AMD driver incorrectly hoists the VMEM GLC=1 load outside the loop, leading to incorrect behavior. In the example below the signal state is zero (instead of less than some number).

if(ramR.barrier!=0u){ // (A.).
 while(ramV.barrier!=0u){}} // (B.).

// This is (A.).
  s_buffer_load_dword  s0, s[12:15], null               // 000000000028: F4200006 FA000000
  s_waitcnt     lgkmcnt(0)                              // 000000000030: BF8CC07F
  s_cmp_eq_i32  s0, 0                                   // 000000000034: BF008000
  s_cbranch_scc1  label_006C                            // 000000000038: BF85000C
// This should be in the (B.) loop.
  buffer_load_dword  v1, v0, s[12:15], 0 dlc glc        // 00000000003C: E030C000 80030100
  s_nop         0x0000                                  // 000000000044: BF800000
  s_nop         0x0000                                  // 000000000048: BF800000
  s_nop         0x0000                                  // 00000000004C: BF800000
  s_nop         0x0000                                  // 000000000050: BF800000
  s_nop         0x0000                                  // 000000000054: BF800000
  s_nop         0x0000                                  // 000000000058: BF800000
  s_nop         0x0000                                  // 00000000005C: BF800000
// This is (B.).
label_0060:
  s_waitcnt     vmcnt(0)                                // 000000000060: BF8C3F70
  v_cmp_eq_i32  vcc_lo, 0, v1                           // 000000000064: 7D040280
  s_cbranch_vccz  label_0060                            // 000000000068: BF86FFFD
label_006C:

AMD DRIVER BUG #2 - Now trying to trick the compiler into doing the right thing. First with 'subgroupElect()', which generates the same bug as prior, but adds another performance bug. This should just be a simple operation to save and set EXEC to 1, then restore afterwards. But instead the compiler acts as if it is already in divergent control flow ('ff1' is find first 1).

if(subgroupElect()){if(ramR.barrier!=0u){while(ramV.barrier!=0u){}}}

  // This is the subgroupElect() code for a wave sized workgroup in known coherent execution.
  v_mbcnt_lo_u32_b32  v1, -1, 0                         // 000000000010: D7650001 000100C1
  s_ff1_i32_b64  s0, exec                               // 000000000018: BE80147E
  v_mbcnt_hi_u32_b32  v1, -1, v1                        // 00000000001C: D7660001 000202C1
  . . .
  s_and_saveexec_b64  s[10:11], vcc                     // 000000000028: BE8A246A

AMD DRIVER BUG #3 - Trying to work around the above performance bug (since running work for just lane 0 will be needed elsewhere). Using 'gl_LocalInvocationID.x' paired with 'layout(local_size_x=32)' won't work either. This example gets 'wave_size(32)' in the disassembly. The compiler still uses VALU work instead of just masking EXEC.

if(gl_LocalInvocationID.x==0u){if(ramR.barrier!=0u){while(ramV.barrier!=0u){}}}

  v_cmp_eq_i32  vcc_lo, 0, v0                           // 000000000010: 7D040080
  s_and_saveexec_b32  s0, vcc_lo                        // 000000000014: BE803C6A
  s_cbranch_execz  label_006C                           // 000000000018: BF880014

AMD DRIVER BUG #4 - Last attempt to workaround the performance bug also fails. The driver will always use the slow path burning VALU instruction(s) for what should map to one 'S_AND_SAVEEXEC_B32 s[...],1' scalar instruction on Navi. Using subgroup ops results in 'wave_size(64)' in the disassembly.

if(gl_SubgroupInvocationID==0u){if(ramR.barrier!=0u){while(ramV.barrier!=0u){}}}

  v_mbcnt_lo_u32_b32  v1, -1, 0                         // 000000000010: D7650001 000100C1
  v_mbcnt_hi_u32_b32  v1, -1, v1                        // 000000000018: D7660001 000202C1
  v_cmp_gt_i32  vcc, 1, v1                              // 000000000020: 7D080281
  s_and_saveexec_b64  s[10:11], vcc                     // 000000000024: BE8A246A

AMD DRIVER BUG #5 - There is another obvious bug in the above disassembly. The program is using 'layout(local_size_x=32)' and using 'gl_SubgroupInvocationID' or 'subgroupElect()' causes the compiler to switch to 'wave_size(64)' mode with the high 32 lanes doing nothing. This also means it is using 'V_MBCNT_HI_U32_B32' which wouldn't be needed in wave32 mode.

AMD DRIVER BUG #6 - 'VK_EXT_subgroup_size_control' doesn't support forcing wave32 mode on Navi.

Takeaways?
If software prevents you from accessing it, the hardware doesn't actually exist.
All reasonable efforts to optimize on AMD are thwarted by its software stack. No choice but to ship with the slowest path on the hardware.

One optimization is possible however, early in execution, going to process 'gl_LocalInvocationID' to write a waveID into an SGPR. This way the VGPR for 'gl_LocalInvocationID' can be freed. Later will use 'gl_SubgroupInvocationID' when required to build a lane index (which materializes lane index from ALU instead of keeping it in a VGPR).

________________________________________________________________________________________________________________________________
[PR0] ABI Crossing
Thoughts on interfacing a custom language to library calls, didn't end up using the custom language for this project...

Stack Crossing

System ABIs use 16-byte aligned stacks.

ABI REQUIREMENTS AFTER A CALL INSTRUCTION
=========================================
[rsp+8] SYSV 7th argument, WIN first entry of 32-byte shadow space, this is 16-byte aligned
[rsp+0] Return address
[rsp-8] Free space

RSP BEFORE A CALL IS THUS 16-BYTE ALIGNED
=========================================
[rsp+0] SYSV 7th argument, WIN first entry of 32-byte shadow space

This will use 8-byte aligned stacks, because they are not used for 16-byte data. The ABI crossing call will need to start by aligning the stack, and restoring it before the return.

// Aligned case,
//  [64] [rsp   ] return address
//  [56] [rsp-8 ] save rsp ....... skipped
//  [48] [rsp-16] save rsp ....... final rsp points here 
// -----------------------------------------------------
// Unaligned case,
//  [56] [rsp   ] return address
//  [48] [rsp-8 ] save rsp ....... final rsp points here
//  [40] [rsp-16] save rsp ....... unused
// -----------------------------------------------------
enter:
 mov [rsp-8],rsp
 mov [rsp-16],rsp
 add rsp,-8
 and rsp,~15
 ...
leave:
 mov rsp,[rsp+0]
 ret
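
The two cases in the comment block can be checked by doing the same pointer arithmetic in C. A sketch only, with made-up function names, mirroring 'add rsp,-8' followed by 'and rsp,~15'.

```c
#include <stdint.h>

/* Final rsp after the 'enter' sequence for an 8-byte aligned incoming rsp. */
static uint64_t enter_rsp(uint64_t rsp) {
    return (rsp - 8u) & ~(uint64_t)15;
}

/* The old rsp was stored at [rsp-8] and [rsp-16]. 'leave' reloads from the
   final rsp, so the final rsp must land on one of those two slots. */
static int saved_slot_ok(uint64_t rsp) {
    uint64_t f = enter_rsp(rsp);
    return f == rsp - 8u || f == rsp - 16u;
}
```

Both incoming alignments end at the same 16-byte aligned address, which is why the old rsp is written twice.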

ABIs

Only supporting x86-64 in 64-bit mode for this project. Have a few points to cross between my custom non-language and the rest of the system. The C ABI is different for Windows vs everyone else, and then there are system calls on non-Windows platforms. The 'a' is an argument (numbered), the 'non' is non-volatile, the 'vol' is volatile, and everything else is volatile. The 'WIN' is the Windows ABI, the 'SYSV' is shared across unix/BSDs, the 'KRN' is the Linux kernel syscall convention.

REG       X86-FAIL  WIN     SYSV    KRN
========  ========  ======  ======  =======
r0 (rax)  ........  return  return  num/ret  --- reuse for call address or syscall number
r1 (rcx)  ........  a0 ...  a3 ...  vol ...  --- save
r2 (rdx)  ........  a1 ...  a2 ...  a2 ....  --- save
r3 (rbx)  ........  non ..  non ..  non ...
r4 (rsp)  SIB ....  stack   stack   stack .  --- save
r5 (rbp)  RIP ....  non ..  non ..  non ...
r6 (rsi)  ........  non ..  a1 ...  a1 ....  --- save on non-win
r7 (rdi)  ........  non ..  a0 ...  a0 ....  --- save on non-win
r8 .....  ........  a2 ...  a4 ...  a4 ....  --- save
r9 .....  ........  a3 ...  a5 ...  a5 ....  --- save
r10 ....  ........  vol ..  vol ..  a3 ....  --- save
r11 ....  ........  vol ..  vol ..  vol ...  --- save
r12 ....  SIB ....  non ..  non ..  non ...
r13 ....  RIP ....  non ..  non ..  non ...
r14 ....  ........  non ..  non ..  non ...
r15 ....  ........  non ..  non ..  non ...

Stacks must be 16-byte aligned before the call, showing state of the stack after a call.

WIN STACK CONVENTION
====================
  [rsp+0x80] a16
| [rsp+0x78] a15
  [rsp+0x70] a14
| [rsp+0x68] a13
  [rsp+0x60] a12
| [rsp+0x58] a11
  [rsp+0x50] a10
| [rsp+0x48] a9
  [rsp+0x40] a8
| [rsp+0x38] a7
  [rsp+0x30] a6
| [rsp+0x28] a5
  [rsp+0x20] Shadow space
| [rsp+0x18] Shadow space
  [rsp+0x10] Shadow space
| [rsp+0x08] Shadow space, must be 16-byte aligned
  [rsp+0x00] Return address (rsp after call)

SYSV STACK CONVENTION
=====================
  [rsp+0x50] a16
| [rsp+0x48] a15
  [rsp+0x40] a14
| [rsp+0x38] a13
  [rsp+0x30] a12
| [rsp+0x28] a11
  [rsp+0x20] a10
| [rsp+0x18] a9
  [rsp+0x10] a8
| [rsp+0x08] a7, must be 16-byte aligned
  [rsp+0x00] Return address (rsp after call)

Expected language crossing granularity is low, so I'm not inclined to do anything other than make it easy to manage. A language crossing will include a stack crossing as well, as I'm not going to keep rsp 16-byte aligned. This is bloody ugly, but it will work. Will have to check if I have any language crossings using floating point.

ENGINE CONVENTION
=================
| [rsp+0xb8] r11
  [rsp+0xb0] r10
| [rsp+0xa8] r9
  [rsp+0xa0] r8
| [rsp+0x98] rdi
  [rsp+0x90] rsi
| [rsp+0x88] rsp
  [rsp+0x80] rdx (save volatile)
----------------
| [rsp+0x78] a16 (args)
  [rsp+0x70] a15
| [rsp+0x68] a14
  [rsp+0x60] a13
| [rsp+0x58] a12
  [rsp+0x50] a11
| [rsp+0x48] a10
  [rsp+0x40] a9
| [rsp+0x38] a8
  [rsp+0x30] a7 (adjusted pointer sysv only)     
| [rsp+0x28] a6 (register copy sysv only)     
  [rsp+0x20] a5 (register copy sysv only)     
| [rsp+0x18] a4 (copied back to registers before the call)  
  [rsp+0x10] a3  
| [rsp+0x08] a2  
  [rsp+0x00] a1  

// On entry, 
//  - rax is the address to call
//  - rcx is future stack pointer for call
entry:
 mov [rcx+0x80],rdx
 mov [rcx+0x88],rsp
#if SYSV
 mov [rcx+0x90],rsi
 mov [rcx+0x98],rdi 
#endif
 mov [rcx+0xa0],r8
 mov [rcx+0xa8],r9
 mov [rcx+0xb0],r10
 mov [rcx+0xb8],r11
#if WIN
 mov rsp,rcx
 mov r9,[rcx+0x18]
 mov r8,[rcx+0x10]
 mov rdx,[rcx+0x08]
 mov rcx,[rcx+0x00]
#endif
#if SYSV
 lea rsp,[rcx+0x30]
 mov r8,[rcx+0x20]
 mov r9,[rcx+0x28]
 mov rdi,[rcx+0x00]
 mov rsi,[rcx+0x08]
 mov rdx,[rcx+0x10]
 mov rcx,[rcx+0x18]
#endif
 call rax
#if WIN
 mov rdx,[rsp+0x80]
 mov r8,[rsp+0xa0]
 mov r9,[rsp+0xa8]
 mov r10,[rsp+0xb0]
 mov r11,[rsp+0xb8]
 mov rsp,[rsp+0x88]
#endif
#if SYSV
 mov rdx,[rsp+0x80-0x30]
 mov rsi,[rsp+0x90-0x30]
 mov rdi,[rsp+0x98-0x30]
 mov r8,[rsp+0xa0-0x30]
 mov r9,[rsp+0xa8-0x30]
 mov r10,[rsp+0xb0-0x30]
 mov r11,[rsp+0xb8-0x30]
 mov rsp,[rsp+0x88-0x30]
#endif
 ret

________________________________

[GPU] Links

________________________________________________________________________________________________________________________________
[GPU] 3D Barycentric
Useful for skinning volumetric data

d=1-(a+b+c) ... coordinates must sum to one
r = point to convert into barycentric 
r{a,b,c,d} = points of tetrahedron
{a,b,c} = inv(T)*(r-rd)

T = 
 x1-x4 x2-x4 x3-x4
 y1-y4 y2-y4 y3-y4
 z1-z4 z2-z4 z3-z4

  INVERT A 3x3 MATRIX
=======================
 a b c
 d e f = A
 g h i

 (ei-fh) -(bi-ch)  (bf-ce)
-(di-fg)  (ai-cg) -(af-cd) * 1/det(A)
 (dh-eg) -(ah-bg)  (ae-bd)
 
det(A)
  (ei-fh) * a -(di-fg) * b + (dh-eg) * c
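
A direct C transcription of the above, as a sketch. The struct and function names are invented. The matrix entries are named a..i exactly as in the inverse layout so the lines can be compared one to one.

```c
typedef struct { double x, y, z; } V3;

static V3 v3_sub(V3 p, V3 q) {
    V3 r = { p.x - q.x, p.y - q.y, p.z - q.z };
    return r;
}

/* Writes barycentric {a,b,c,d} of point r into w[0..3].
   Returns 0 for a degenerate (zero volume) tetrahedron, 1 otherwise. */
static int barycentric(V3 r, V3 ra, V3 rb, V3 rc, V3 rd, double w[4]) {
    V3 c0 = v3_sub(ra, rd), c1 = v3_sub(rb, rd), c2 = v3_sub(rc, rd);
    V3 p = v3_sub(r, rd);
    /* T has columns c0, c1, c2. */
    double a = c0.x, b = c1.x, c = c2.x;
    double d = c0.y, e = c1.y, f = c2.y;
    double g = c0.z, h = c1.z, i = c2.z;
    double det = (e*i - f*h)*a - (d*i - f*g)*b + (d*h - e*g)*c;
    if (det == 0.0) return 0;
    double s = 1.0 / det;
    /* Rows of inv(T) applied to (r - rd). */
    w[0] = ( (e*i - f*h)*p.x - (b*i - c*h)*p.y + (b*f - c*e)*p.z) * s;
    w[1] = (-(d*i - f*g)*p.x + (a*i - c*g)*p.y - (a*f - c*d)*p.z) * s;
    w[2] = ( (d*h - e*g)*p.x - (a*h - b*g)*p.y + (a*e - b*d)*p.z) * s;
    w[3] = 1.0 - (w[0] + w[1] + w[2]);
    return 1;
}
```

The centroid of any non-degenerate tetrahedron should come back as 0.25 in all four coordinates, which makes a quick sanity test.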

________________________________________________________________________________________________________________________________
[GPU] Shader Device Clock
VK_KHR_shader_clock - device clock support

AMD/NVIDIA : Supports device clock on all platforms that have 16-bit support
Intel : No support for shader device clock

TODO: Is NVIDIA's device shader clock a consistent frequency?

________________________________________________________________________________________________________________________________
[GPU] Wave OPs

Suggestion of API and implementation for wave operations.
This is a copy of what I like for personal development.

=========
  TERMS
=========
P1 ... 'predicate' (bool single component)
I1 ... 'integer' (32-bit signed integer single component)
UI1 .. 'unsigned integer'
W1 ... 'word' (16-bit signed integer)
C .... 'coherent' (function is called from statically or dynamically uniform control flow)
V .... 'volatile' (function can be called in unknown control flow)

=====================================
  AVOIDING PROBLEMS WITH DIVERGENCE
=====================================
Pass around 'P1 laneMask' and only go into divergent control flow locally.
This requires a different style of programming.

P1 laneMask = ...; // Existing lane mask value.
P1 laneMask2 = laneMask & newMask; // Make a local lane mask for a new subset of active threads.
if(laneMask2){ ... } // Do logic which is limited to a subset of lanes.
f(laneMask,...); // Do logic which is limited to older subset of lanes.

Note above, 'f()' gets the lane mask passed in.
So 'f()' is always called from dynamically uniform control flow.
And can thus do any operations that require all lanes active.
The standard method of possibly having dynamically divergent control flow cannot do that.

=======
  API
=======
NO 'gl_SubgroupInvocationID' or 'WaveGetLaneIndex()'
 - INSTEAD only launch 1D workgroups and compute from 'gl_LocalInvocationID.x' and 'SV_GroupThreadID.x'
 - 2D coordinates are always generated from a 1D workgroup due to needing lane swizzling to get perf
 - Shader can avoid the AND operation if workgroup is known to be wave sized
 - May want to maintain a 16-bit lane index to save space in some cases
 - AMD
    - The 'gl_LocalInvocationID.x' is placed in 'v0' before program launch (fast path)
    - Driver will NOT optimize 'gl_SubgroupInvocationID' to 'gl_LocalInvocationID.x'
    - The 'gl_SubgroupInvocationID' gets built (slow path)
       - Using 2 VALU instructions via V_MBCNT_{HI,LO}_U32_B32 which is possibly slower (wave64, or 1 op wave32)

NO GENERIC SHUFFLE VIA 'subgroupShuffle()' OR 'WaveReadLaneAt(,nonUniformValue)'
 - This is because of min spec hardware portability
 - See 'Quad' and 'WaveXor' cases for constrained shuffle usage
 - AMD
    - DS_SWIZZLE_B32 only works in groups of 32 lanes (on GCN and RDNA)
       - Same with the introduction of DS_*PERMUTE_B32 (note AMD ISA guide has some incorrect descriptions on this)
    - GCN hardware is only wave64
    - No way to easily portably force wave32 on RDNA
       - So no way to guarantee usage of DS_BPERMUTE_B32 (wave32 only path)
       - The wave64 path ends up using a V_READLANE_B32 waterfall loop (could be 64 iterations, so unusably slow)
    - Forced to use LDS for this kind of functionality
    - No need for new API to use LDS

NO USING 'subgroupElect()' OR 'WaveIsFirstLane()'
 - Both have the overhead of needing to find the first lane in possibly divergent control flow
 - They are thus slow
 - Instead manually mask to lane 0 via 'if((gl_LocalInvocationID.x & waveSizeMinusOne)==0)'
 - Where the AND part is skipped for wave sized workgroups
 - AMD
    - Driver output for 'subgroupElect()' is expensive
    - It is not optimized for even compile-time known uniform control flow

NO READ-FIRST-LANE
 - Because on some platforms this implies using 'find-first-one' to figure out the first active lane
 - So instead only call from non-divergent control flow and use explicit lane=0 in function calls
 - This will be slightly less optimal (probably not measurable) on AMD, but better overall
 - AMD
    - 'V_READFIRSTLANE_B32 d,s' is a 32-bit instruction
    - 'V_READLANE_B32 d,s,0' is a 64-bit instruction (slightly less optimal)

WITH SOME EXCEPTIONS, NO LANE SHARING BEYOND GROUPS OF 16 LANES
 - Supported on all AMD and NVIDIA hardware
 - Should in theory work on Intel hardware (they can do wave16)
    - Wave16 is their fast path
 - Exceptions
    - Single lane read/write 
    - Ballot

I1 Quad{0,1,2,3,X,Y,D}CI1(I1 v)
 - {0,1,2,3} selects quad element
    - Separate functions force compile-time uniformity (portable fast path)
    - DX: QuadReadLaneAt(,{0,1,2,3})
    - VK: subgroupQuadBroadcast(,{0,1,2,3})
 - {X,Y,D} swap with directional neighbor
    - DX: QuadReadAcross{X,Y,Diagonal}()
    - VK: subgroupQuadSwap{Horizontal,Vertical,Diagonal}()
 - AMD: all DPP ops (fast path)
 - GL: emulation?
    - Can use dFd{x,y}Fine() functions for emulation
    - See 'GPU Pro 2' "Shader Amortization using Pixel Quad Message Passing"
    - However {ES,WebGL} lacks the 'Fine()' functions 
    - ES?: might be able to use 'GL_FRAGMENT_SHADER_DERIVATIVE_HINT' 'GL_NICEST'

void WavePutCI1(inout I1 dst, I1 src, I1 dynamicallyUniformLaneDst, I1 laneSrc)
 - Emulation is possible because 'C' (only callable from dynamically uniform flow control)
 - This could be useful for storing stacks in a VGPR
 - This is a function since it can be mapped without emulation to a hardware instruction on AMD
 - The 'laneSrc' is passed in to enable fast path if wave-sized workgroups vs multi-wave workgroups
 - Usage,
    WavePutCI1(dst, src, 2, (gl_LocalInvocationID.x & waveSizeMinusOne))
    WavePutCI1(dst, src, 2,  gl_LocalInvocationID.x                    )
 - Emulation,
    if(dynamicallyUniformLaneDst == laneSrc) dst = src;
 - AMD: V_WRITELANE_B32
    - Ignores EXEC mask

I1 WaveGetCI1(I1 src, I1 dynamicallyUniformLaneSrc)
 - Return 'src' value from lane 'dynamicallyUniformLaneSrc'
 - For review the 'C' is 'coherent' meaning can only be called from dynamically uniform control flow
 - For review the 'I1' is 32-bit integer
 - AMD: V_READLANE_B32
    - Works for wave32|wave64
    - Ignores EXEC mask

I1 WaveXor1CI1(I1 src)
I1 WaveXor2CI1(I1 src)
I1 WaveXor4CI1(I1 src)
I1 WaveXor8CI1(I1 src)
 - Designed to support 2D reductions from 4x4:1
 - Requires minimum wave16 support (Intel's fast path)
 - VK: subgroupShuffleXor(src, {1,2,4,8})
 - DX
 - AMD
    - subgroupShuffleXor(,{1,2}) uses DPP quad permute
    - subgroupShuffleXor(,4) uses DPP row XOR mask
    - subgroupShuffleXor(,8) uses DPP row rotate by 8
    - subgroupShuffleXor(,16) uses DS_SWIZZLE_B32 (which requires S_WAITCNT, slower)
    - subgroupShuffleXor(,32) uses a horrible amount of code for wave64 (unusably slow)
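
CPU-side emulation of the 4x4:1 reduction pattern, as a sketch. A 16-lane wave is modeled as a plain array, 'wave_xor()' stands in for WaveXor{1,2,4,8}CI1, and a sum is used as the example reduction. Names are invented for illustration.

```c
#define LANES 16

/* Each lane reads the value held by lane (lane ^ mask). */
static void wave_xor(const int src[LANES], int dst[LANES], int mask) {
    for (int l = 0; l < LANES; l++) dst[l] = src[l ^ mask];
}

/* Butterfly sum over 16 lanes using masks 1, 2, 4, 8.
   After the four steps every lane holds the total. */
static void reduce16(int v[LANES]) {
    int t[LANES];
    for (int m = 1; m < LANES; m <<= 1) {
        wave_xor(v, t, m);
        for (int l = 0; l < LANES; l++) v[l] += t[l];
    }
}
```

Four shuffle steps cover the 16 lanes, and no step reaches outside the group of 16.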

FOR REFERENCE, THE CORRECT WAY TO DO ATOMIC APPEND
 - This gets around stupid driver behavior of AMD
 - AMD pattern matches atomicAdd(,staticUniform) and turns it into garbage
 - Can fix it by atomicAdd(,dynamicallyUniform) where the compiler doesn't see staticUniform
 - Instead do this,
P1 p=...; // Set to true to append, false to not append
UI4 b=subgroupBallot(p);
UI1 stopStupid=gl_LocalInvocationID.x>>31; // Generate a VGPR zero that the compiler doesn't pattern match
UI1 c=subgroupBallotBitCount(b)+stopStupid; // Force the compiler to promote from SGPR to VGPR
UI1 d=0;
if(gl_LocalInvocationID.x==0)d=atomicAdd(...,c); // Do the atomic on lane zero only
UI1 e=subgroupBallotExclusiveBitCount(b); // Factor this work in the latency window of the atomic add
subgroupBarrier(); // Required to be API safe
d=subgroupBroadcast(d,0); // Fast lane zero broadcast avoids 'find-first-one' overhead
d+=e; // Output position for append
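
What the append computes, emulated on the CPU in C for one 32-lane wave. Sketch only: it models the result (one add of the ballot bit count, then each appending lane offsets by its exclusive bit count), not the compiler workaround. Function names are invented.

```c
#include <stdint.h>

static unsigned popcount32(uint32_t x) {
    unsigned n = 0;
    while (x) { x &= x - 1u; n++; }   /* clear lowest set bit */
    return n;
}

/* 'counter' is the append counter before the wave runs, 'ballot' has one bit
   per lane that wants to append. out[l] receives the output slot for lane l,
   or -1 if the lane does not append. Returns the counter after the wave. */
static uint32_t wave_append(uint32_t counter, uint32_t ballot, int out[32]) {
    uint32_t base = counter;          /* what lane 0's atomicAdd returns */
    for (unsigned l = 0; l < 32u; l++) {
        uint32_t below = ballot & ((1u << l) - 1u);   /* lanes under l */
        out[l] = ((ballot >> l) & 1u) ? (int)(base + popcount32(below)) : -1;
    }
    return counter + popcount32(ballot);
}
```

One atomic per wave instead of one per lane, and the slots handed out are contiguous.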

________________________________________________________________________________________________________________________________
[GPU] Float Bool Fixes
There are times when there is a need to do bool logic inside floating point numbers. Here is a good starting point for implementing such logic, with comments about implementation on AMD GPUs.

1.0 ............ true
0.0 ............ false
And(x,y) ....... min(x,y)
And(x,y) ....... saturate(x*y)
And3(x,y,z) .... min3(x,y,z) ........ 1 op (32-bit/16-bit), 2 ops (packed 16-bit, no packed MIN3)
AndOr(x,y,z) ... saturate(x*y+z)
AndNot(x,y) .... -x*y+1.0
Gt(x,y) ........ Gtz(x-y) ........... 2 ops
Gtz(x) ......... saturate( INF*x) ... {NaN := 0, x GT 0 := 1, else 0}
Lt(x,y) ........ Ltz(x-y) ........... 2 ops
Ltz(x) ......... saturate(-INF*x) ... {NaN := 0, x LT 0 := 1, else 0}
Ne(x,y) ........ Gtz(abs(x-y)) ...... 2 ops (32-bit), 3 ops (16-bit, ABS not free)
Not(x) ......... 1.0-x
Or(x,y) ........ max(x,y)
Or(x,y) ........ saturate(x+y)
Or3(x,y,z) ..... max3(x,y,z) ........ 1 op (32-bit/16-bit), 2 ops (packed 16-bit, no packed MAX3)
Sel(x,y,z) ..... z*y+((-z)*x+x) ..... z==0.0?x:y, 2 ops, preserves precision
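
A few of the entries above written out in C, as a sketch for checking the truth tables on the CPU. 'saturate()' is written with compares so that NaN lands on 0, matching the {NaN := 0} behavior the table relies on. Build without fast-math, otherwise the INF and NaN handling is not guaranteed.

```c
#include <math.h>   /* INFINITY */

/* Clamp to [0,1]. A NaN input fails the first compare and returns 0. */
static float saturate(float x) {
    return (x > 0.0f) ? ((x < 1.0f) ? x : 1.0f) : 0.0f;
}

static float And(float x, float y) { return saturate(x * y); }
static float Or(float x, float y)  { return saturate(x + y); }
static float Not(float x)          { return 1.0f - x; }
static float Gtz(float x)          { return saturate( INFINITY * x); }
static float Ltz(float x)          { return saturate(-INFINITY * x); }
static float Gt(float x, float y)  { return Gtz(x - y); }
static float Lt(float x, float y)  { return Ltz(x - y); }
static float Ne(float x, float y)  { float d = x - y; return Gtz(d < 0.0f ? -d : d); }

/* z == 0.0 selects x, z == 1.0 selects y. */
static float Sel(float x, float y, float z) { return z * y + ((-z) * x + x); }
```

Note Gtz(0.0) goes through INF*0 = NaN and still returns 0.0, which is the case that makes the clamp behavior matter.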

The following get more expensive (extra op).

Le(x,y) ... Not(Gt(x,y))
Ge(x,y) ... Not(Lt(x,y))
Eq(x,y) ... Not(Ne(x,y))

They could be faster if there was a way to run the hardware in a mode without NaNs, with modified floating point rules.

INF-INF := NaN ...... Actual IEEE rule (a problem)
(+/-)INF*0 := NaN ... Actual IEEE rule (a problem)
WOULD RATHER HAVE NO NANs AND INSTEAD THIS LOGIC
(+/-)INF*0 := 0 ..... Desired rule
-INF+INF := 0 ....... Desired rule
-INF-INF := -INF .... Desired rule
+INF+INF := +INF .... Desired rule
SO THE FOLLOWING IS POSSIBLE
Eq(x,y) ............. saturate(-INF*abs(x-y)+INF)
Ge(x,y) ............. Gez(x-y)
Gez(x) .............. saturate(+INF*x+INF)
Le(x,y) ............. Lez(x-y)
Lez(x) .............. saturate(-INF*x+INF)

Detailed logic using new rules.

EQ
==
saturate((-INF*{ 0.0})+INF) ... saturate(( 0.0)+INF) ... saturate(+INF) ... 1.0
saturate((-INF*{+INF})+INF) ... saturate((-INF)+INF) ... saturate( 0.0) ... 0.0
saturate((-INF*{   +})+INF) ... saturate((-INF)+INF) ... saturate( 0.0) ... 0.0

GEZ
===
saturate((+INF*{ 0.0})+INF) ... saturate(( 0.0)+INF) ... saturate(+INF) ... 1.0
saturate((+INF*{-INF})+INF) ... saturate((-INF)+INF) ... saturate( 0.0) ... 0.0
saturate((+INF*{   -})+INF) ... saturate((-INF)+INF) ... saturate( 0.0) ... 0.0
saturate((+INF*{+INF})+INF) ... saturate((+INF)+INF) ... saturate(+INF) ... 1.0
saturate((+INF*{   +})+INF) ... saturate((+INF)+INF) ... saturate(+INF) ... 1.0

LEZ
===
saturate((-INF*{ 0.0})+INF) ... saturate(( 0.0)+INF) ... saturate(+INF) ... 1.0
saturate((-INF*{-INF})+INF) ... saturate((+INF)+INF) ... saturate(+INF) ... 1.0
saturate((-INF*{   -})+INF) ... saturate((+INF)+INF) ... saturate(+INF) ... 1.0
saturate((-INF*{+INF})+INF) ... saturate((-INF)+INF) ... saturate( 0.0) ... 0.0
saturate((-INF*{   +})+INF) ... saturate((-INF)+INF) ... saturate( 0.0) ... 0.0

Case For Hardware FAMA

Leverage all of the typical register read banking (4 way) with 4 operand instructions
FAMA as in FUSED ADD MULTIPLY ADD, introduce a pre adder
e = (a + b) * c + d ... fama(a,b,c,d)
This enables single instruction lerp(a,b,c) := (b-a)*c+a assuming no precision loss
Along with single instruction bool logic: Gtz(), Ltz(), Ne(), Sel() := lerp(), etc
Or3(x,y,z) ... saturate(fama(x,y,1.0,z))
etc
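
Software stand-in for the proposed instruction, as a C sketch. No such hardware op exists, so 'fama()' here is an ordinary unfused expression with normal intermediate rounding, and the helper names are invented.

```c
/* e = (a + b) * c + d, emulated with separate add, multiply, add. */
static float fama(float a, float b, float c, float d) {
    return (a + b) * c + d;
}

static float saturate(float x) {
    return (x > 0.0f) ? ((x < 1.0f) ? x : 1.0f) : 0.0f;
}

/* lerp(a,b,c) := (b-a)*c+a, one fama with the pre-adder doing b + (-a). */
static float lerp(float a, float b, float c) { return fama(b, -a, c, a); }

/* Three input OR on {0.0, 1.0} values. */
static float Or3(float x, float y, float z) { return saturate(fama(x, y, 1.0f, z)); }
```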

________________________________________________________________________________________________________________________________
[GPU] Log Depth Encoding
For reference...

// LOG DEPTH ENCODING
// ==================
// - Don't need too much precision around the minimum traversable coordinate
// - Or alternatively can clip on near plane
//    - This logic improves precision by a good amount (1/118 to 1/174)
//    - When s=2047, n=256, a=1/256, m=2^25
// - Encoding: x=log2(z*a+(1-a*n))*b -> {0 to s}
//    - m ... maximum depth value that can be encoded
//    - n ... minimum depth value that can be encoded
//    - s ... maximum step value
//    - z ... {n to m}
//    - a ... controls distribution close to zero
//    - b ... s/log2(m*a+1-n*a)
// - Decoding: z=exp2(x*(1/b))*(1/a)+(n-(1/a))

Why not just mask part of a FP16 value and use that instead of log depth encoding?
Float Toy

//    - Breakdown
//       fedcba9876543210
//       ================
//       s............... sign (ignore)
//       .eeeee.......... exponent (don't want top bit, due to wasted encoding)
//       ......mmmmmmmmmm mantissa
//       ----------------
//       ..eeeemmmmmmm... possible encoding for simple masking
//       ..11111111111... 1.993 (around 2)
//       ..00000000001... 4.8e-7 (around 1/2M)
//    - Using simple masking burns roughly 1/16 of encoding in a linear region
//    - Complex masking can only approach 1/32 of encoding
//    - This neglects lower 3-bits of precision (gets worse if including more bits)
//    - So NO!

________________________________________________________________________________________________________________________________
[GPU] Ultimate Video Quality

Very Few Dispatches Each Frame - CPU is doing effectively nothing. All the game logic is on the GPU.
Input Sampled on the GPU - Background CPU thread is pushing latest input to GPU readable buffer. GPU is reading CPU input and generating camera translational update right before view-dependent rendering.
Camera Rotation Independent Rendering - Scene is rendered initially into an Octahedron. This rendering is dependent on camera translation but not camera angle. GPU is reading CPU input again and generating camera rotation update right before final view-angle-dependent rendering, which takes the octahedral space and generates the cylindrical projection the user sees.
V-Sync is ON - This is the only way to ensure that consistent motion stays visually consistent in time.

  TERMS
=========
latency ... as in input read on GPU to start of first frame's line on CRT (ignoring H blanking)
gi ........ GPU view independent work
gd ........ GPU view translation dependent work
gc ........ GPU view camera angle dependent work

  MAXED OUT GPU
==================
/_gi5_/_gd5_/_gc5_//_gi6_/_gd6_/_gc6_//_gi7_/_gd7_/_gc7_/
[____scanout_4____][____scanout_5____][____scanout_6____]
      ^     ^      ^
      |     |<---->| ... Camera rotation latency (slightly lower) 
      |            |
      |<---------->| ... Camera translation and button to flash latency (slightly higher)

  MAXED OUT GPU - LATENCY INDEPENDENT OF CPU WORK
===================================================
(_cpu6_)           (_cpu7_)           (_cpu8_)
/_gi5_/_gd5_/_gc5_//_gi6_/_gd6_/_gc6_//_gi7_/_gd7_/_gc7_/   -+
[____scanout_4____][____scanout_5____][____scanout_6____]    |
                                                             |
  or                                                         |- same latency
                                                             |
           (_cpu6_)           (_cpu7_)           (_cpu8_)    |
/_gi5_/_gd5_/_gc5_//_gi6_/_gd6_/_gc6_//_gi7_/_gd7_/_gc7_/   -+
[____scanout_4____][____scanout_5____][____scanout_6____]

  NON-MAXED OUT GPU - LATENCY DEPENDENT ON WHEN GPU WORK STARTS
=================================================================
        /_gi5_/_gd5_/_gc5_/        /_gi6_/_gd6_/_gc6_/        /_gi7_/_gd7_/_gc7_/
[________scanout_4________][________scanout_5________][________scanout_6________]
              ^     ^      ^
              |     |<---->|
              |            | ... Lower latency
              |<---------->|

  vs
  
/_gi5_/_gd5_/_gc5_/        /_gi6_/_gd6_/_gc6_/        /_gi7_/_gd7_/_gc7_/
[________scanout_4________][________scanout_5________][________scanout_6________]
      ^     ^              ^
      |     |<------------>|
      |                    | ... Higher latency
      |<------------------>|
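
The diagrams above reduce to simple arithmetic. A minimal sketch, assuming input is sampled right at the start of 'gd' (translation) and 'gc' (rotation), that the frame becomes visible at the next V-Sync, and that all times are in the same unit. The names GpuWork, TranslationLatency and RotationLatency are made up for illustration.

```c
// Per-frame GPU work split as in the TERMS table (all in the same time unit).
typedef struct { double gi, gd, gc; } GpuWork;

// 'frame' is the refresh period, 'start' is how long after the previous
// V-Sync the GPU begins 'gi'. Assumes start+gi+gd+gc <= frame.
// Translation input is read when 'gd' begins, shown at the next V-Sync.
static double TranslationLatency(GpuWork w, double frame, double start) {
    return frame - start - w.gi;
}

// Rotation input is read when 'gc' begins, shown at the next V-Sync.
static double RotationLatency(GpuWork w, double frame, double start) {
    return frame - start - w.gi - w.gd;
}
```

With a maxed out GPU (start=0 and gi+gd+gc equal to the frame time) this collapses to gd+gc and gc. With a non-maxed GPU, starting the work as late as possible gives the same gd+gc and gc, while starting at V-Sync adds all of the idle time to both latencies.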

________________________________________________________________________________________________________________________________
[GPU] Unlimited Boy
Fantasy console inspired by the 160x144 pixel and 4 shades of grey Gameboy...

Unlimited Boy Concept

Push up to 256x128 = 32768 pixels (Letterboxed NES, better for modern 16:9 displays)
Push up to 3-bit/pixel (8 shade monochrome)
Separate sprite mask from sprite, so sprite can use all 8 shades
Sprite uses 4 bit planes {mask, high bit, med bit, low bit}
8x4 bit plane in one 32-bit integer, full 8x4 sprite in a 16-byte 'uvec4' (single load)
Get 64 million sprites in 1 GiB of buffer memory (uncompressed)
Can fetch a sprite into K$ using one S_LOAD_DWORDX4 operation
----
64 lane workgroup = 32x1 pixels / lane = 256x8 pixel row
16 workgroups / screen ... not enough to fill GPU
Each workgroup working on a subset of the "unlimited" sprite list
Ordered composite of workgroups at end
----
Workgroup works on a sprite at a time
Each lane has 4 uints (one per bit plane) for 32x1 pixels
Extract the associated 8x1 line from sprite for the 4 planes
Shift and mask, then composite with logic ops
----
Can burn sprites for various things (effectively unlimited memory)
Like have sub-pixel shift sprites (4 sprites = 2x2, or 16 sprites = 4x4)
Or different brightness, rotation angles, scaling, etc
----
Without using extra sprite memory, could introduce dither mask modifier
Enables a sprite to have dithered 'transparency'
Could then split sprites into N copies, each with different 'dither'
Where all the dithers add up to the original sprite
So that Z ordered sprites won't have pop, because the 'sprite' is split into N layers
So as sprites occupy similar Z they blend together
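
The "extract 8x1 line, shift and mask, composite with logic ops" step could look roughly like the CPU-side C sketch below. The bit layout is an assumption (row y of a plane in bits 8*y to 8*y+7, bit 0 as the leftmost pixel), and the names Planes, SpriteLine and CompositeLine are hypothetical. A real version would be a compute shader with one lane per 32x1 span.

```c
#include <stdint.h>

// Four bit planes {mask, high bit, med bit, low bit}.
// For a sprite: each word is one 8x4 plane. For a lane: each word is 32x1 pixels.
typedef struct { uint32_t mask, hi, med, lo; } Planes;

// Extract the 8x1 line 'row' (0..3) from a sprite plane and shift it to
// pixel position 'x' (-7..31) inside the lane's 32x1 span.
static uint32_t SpriteLine(uint32_t plane, int row, int x) {
    uint32_t line = (plane >> (8 * row)) & 0xFFu;
    return x >= 0 ? line << x : line >> -x;
}

// Composite one sprite line over the lane's span: where the mask is set the
// sprite's 3 shade bits replace the span's, elsewhere the span is kept.
static void CompositeLine(Planes *span, const Planes *sprite, int row, int x) {
    uint32_t m = SpriteLine(sprite->mask, row, x);
    span->hi  = (span->hi  & ~m) | (SpriteLine(sprite->hi,  row, x) & m);
    span->med = (span->med & ~m) | (SpriteLine(sprite->med, row, x) & m);
    span->lo  = (span->lo  & ~m) | (SpriteLine(sprite->lo,  row, x) & m);
    span->mask |= m;
}
```

Keeping the mask as its own plane is what lets shade 0 be an opaque color, so the sprite gets all 8 shades.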

________________________________

[PRG] Links

________________________________________________________________________________________________________________________________
[PRG] Lottes6x16 Font

A Bitmap Terminal Font

Designed for monospace text editing.
Lottes6x16.fon - Easy to find for general app usage like in Notepad2.
LottesTerminal6x16.fon - Special version to make work in windows terminal.

As a programmer who sticks to 1080p displays, I use this bitmap font for source editing and windows terminal. The font was made using Fony. Right click on the file and "Install" to install. Use "6x16" in the terminal.

________________________________________________________________________________________________________________________________
[PRG] Page Warming
Something From an Existing 'C' Engine ...
The goal is for the user to have a hitch-free experience. OS design today seems built more around bloatware than around tiny tight binaries. The problem is that pages are not necessarily resident until needed, and faulting them in on demand can be a latency chain nightmare (hitch fest). To work around this problem, a background thread walks all pages on program launch, and again each time the app gets focus.

Code is warmed first, with simple read of the first word of each 4K page.
Data is warmed next, with an atomic ADD of a loaded zero of the first word of each 4K page. The atomic forces initially zero-fill pages to be converted from the common zero-fill page to a unique dedicated page. The 'loaded zero' (unknown at compile time) is done to make sure a smart compiler cannot factor out the atomic operation.
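
A minimal sketch of the two warming passes, not the engine's actual code. It assumes 4 KiB pages, a page-aligned 'begin' pointer, and GCC's __atomic builtins. The names WarmCode, WarmData and warmZero are hypothetical.

```c
#include <stdint.h>

#define WARM_PAGE 4096u // Assumed page size.

// Read through 'volatile' at run time, so the compiler cannot prove it is zero.
static volatile uint32_t warmZero = 0;

// Code pass: plain read of the first word of each page.
// Returns the sum only so the reads have a visible use.
static uint32_t WarmCode(const void *begin, const void *end) {
    const uint8_t *p = (const uint8_t *)begin;
    uint32_t sum = 0;
    for (; p < (const uint8_t *)end; p += WARM_PAGE)
        sum += *(const volatile uint32_t *)p;
    return sum;
}

// Data pass: atomic ADD of the loaded zero to the first word of each page.
// The write forces a private page without changing any data.
static void WarmData(void *begin, void *end) {
    uint8_t *p = (uint8_t *)begin;
    uint32_t zero = warmZero;
    for (; p < (uint8_t *)end; p += WARM_PAGE)
        __atomic_fetch_add((uint32_t *)p, zero, __ATOMIC_RELAXED);
}
```

The atomic matters because the warming thread runs in the background while the app may be writing the same words. A plain load then store could lose an update.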

All code is compiled with ROM_ defined to 1. The source file simply includes itself, wrapped between a beginning WrmBas() and an ending WrmEnd() function, so the range of addresses for code is easy to find.

  #define ROM_ 1
  S_ void WrmBas(void){Crash();}
  #include "nvg0.c"
  S_ void WrmEnd(void){Crash();}
  #undef ROM_

All data is placed into one structure (with RAM_ defined to 1, source including itself), so finding start and end is easy.

  #define RAM_ 1
  typedef struct{
   #include "nvg0.c"
   A_(64) I1 end[1024*1024/4];}RamT;S_ A_(64) L1 ramM[sizeof(RamT)/8];
  #define ramR TR_(RamT,L1_(ramM))
  #define ramV TV_(RamT,L1_(ramM))
  #undef RAM_

________________________________________________________________________________________________________________________________
[PRG] Self Modifying Binary

Single File App

Turns out this still works in Win10. But is likely to not work in the future (for another post).
The concept is simple: instead of having a binary and data file(s), just have a binary, where the application saves its configuration state directly into the binary. Or the step beyond, saving a RAM snapshot into the binary, so the application can easily start up where it last left off, and the user can have any number of save points by having different binaries. Distribution and install of the application is just placing the file wherever you want to run it from. Uninstall is just deleting the binary. Very easy setup, no registry or config file garbage.

The technique is quite simple.

When the binary starts it copies itself to a temp file, then launches the temp file, then exits.
The temp file launch runs the application.
Temp file launch is free to modify the original binary (which is no longer running).

Not shown below, but on exit, the temp file launch could launch the original binary with a command to delete the temp file. This would work to automatically not leave a garbage file around.

Proof of Concept

//
// SIMPLE SELF-CONTAINED GCC 'C' BASED WIN32 SELF-MOD EXE TEST APP
//
// Compile with: gcc sme.c -march=amdfam10 -std=gnu11 -Ofast -o sme.exe -s -lkernel32 -luser32 -lgdi32 -lwinmm
//
// Language tools.
#define E_(x,y) __builtin_expect(x,y)
#define O_ __attribute__((noreturn))
#define R_ __restrict
#define S_ static
#define W_ __attribute__((__stdcall__)) __attribute__((__force_align_arg_pointer__))
//
// Type system.
typedef unsigned char U1;
typedef unsigned short U2;
typedef unsigned int U4;
typedef unsigned long long U8;
typedef U1 *R_ U1R;
typedef U4 *R_ U4R;
#define U1R_(x) ((U1R)(x))
#define U4R_(x) ((U4R)(x))
#define U8_(x) ((U8)(x))
//
// Win32 API for x86-64.
typedef struct{U8 hProcess;U8 hThread;U4 dwProcessId;U4 dwThreadId;}PROCESS_INFORMATION;
typedef struct{U4 cb;U1R lpReserved;U1R lpDesktop;U1R lpTitle;U4 dwX;U4 dwY;U4 dwXSize;U4 dwYSize;U4 dwXCountChars;
 U4 dwYCountChars;U4 dwFillAttribute;U4 dwFlags;U2 wShowWindow;U2 cbReserved2;U8 lpReserved2;U8 hStdInput;U8 hStdOutput;
 U8 hStdError;}STARTUPINFOA; 
//  
W_ U4 CloseHandle(U8);
W_ U4 CopyFileExA(U1R,U1R,U8,U8,U4R,U4);
W_ U8 CreateFileA(U1R,U4,U4,U8,U4,U4,U8);
W_ U4 CreateProcessA(U1R,U1R,U8,U8,U4,U4,U8,U1R,STARTUPINFOA *R_,PROCESS_INFORMATION *R_); 
W_ void ExitProcess(U4);
W_ U4 ReadFile(U8,U1R,U4,U4R,U1R);
W_ U4 SetFilePointer(U8,U4,U4R,U4);
W_ U4 WriteFile(U8,U1R,U4,U4R,U1R);
//
#define INVALID_HANDLE_VALUE (~U8_(0))
enum{
 FILE_SHARE_WRITE=2,
 GENERIC_READ=0x80000000,
 GENERIC_WRITE=0x40000000,
 OPEN_EXISTING=3};
//
// Initialized global data.
S_ U4 d[2]={0xDEADB175,0x01};
//
// Utility functions.
S_ U1 hex[16]={'0','1','2','3','4','5','6','7','8','9','a','b','c','d','e','f'};
S_ U1R Hex(U1R a,U4 v){a[0]=hex[v&15];return a+1;}
S_ U1R HexU1(U1R a,U4 v){Hex(a,v>>4);Hex(a+1,v);return a+2;}
S_ U1R HexU2(U1R a,U4 v){HexU1(a,v>>8);HexU1(a+2,v);return a+4;}
//
// Entry point.
O_ void main(U4 argc,U1R *R_ argv){
 // If not called with arguments (base 'sme.exe' launch).
 if(argc!=2){
  // Copy 'sme.exe' to 'sme-cpy.exe'.
  U4 c[1];CopyFileExA(U1R_("sme.exe"),U1R_("sme-cpy.exe"),0,0,U4R_(c),0);
  // Launch 'sme-cpy.exe 1'.
  S_ STARTUPINFOA si;
  S_ PROCESS_INFORMATION pi;
  si.cb=sizeof(STARTUPINFOA);
  CreateProcessA(0,U1R_("sme-cpy.exe 1"),0,0,1,0,0,0,&si,&pi);
  // Exit the process.
  ExitProcess(0);}
 // Called with arguments (the 'sme-cpy.exe 1' launch).
 // Open access to standard output into console.
 U8 f=CreateFileA(U1R_("CONOUT$"),GENERIC_WRITE,FILE_SHARE_WRITE,0,OPEN_EXISTING,0,0);
 // Write out value after 0xDEADB175 to console.
 U1 b[4]={'.','.','\n',0};HexU1(b,d[1]);
 U4 r[1];WriteFile(f,U1R_(b),4,U4R_(r),0);
 // Open 'sme.exe' for R/W, loop until this succeeds (just in case 'sme.exe' is still open for execute).
 U8 h;while(1){h=CreateFileA(U1R_("sme.exe"),GENERIC_WRITE|GENERIC_READ,FILE_SHARE_WRITE,0,OPEN_EXISTING,0,0);
  if(h!=INVALID_HANDLE_VALUE)break;}
 // Read 'sme.exe' into memory, using something dumb (know file size is under 64 KiB).
 S_ U4 m[16384]; 
 ReadFile(h,U1R_(m),65536,U4R_(r),0);
 // Find 0xDEADB175 offset (know this happens to be U4 aligned).
 U4 o=0;while(o<16380){if(m[o]==0xDEADB175)break;o++;}o=o*4+4;
 // Write out offset to console.
 U1 b2[6]={'.','.','.','.','\n',0};HexU2(b2,o);
 WriteFile(f,U1R_(b2),6,U4R_(r),0);
 // Increment value after 0xDEADB175.
 d[1]++;
 // Write updated value to 'sme.exe'.
 SetFilePointer(h,o,0,0);
 WriteFile(h,U1R_((d+1)),4,U4R_(r),0);
 // Close the file.
 CloseHandle(h);
 // Exit the process.
 while(1)ExitProcess(0);}
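
The find-and-patch step above, pulled out as a portable sketch that works on an in-memory copy of the file. Unlike the proof of concept, this version reports when the marker is missing instead of assuming it is there. The names FindAfterMarker and PatchAfterMarker are hypothetical, and the marker is still assumed to be 4-byte aligned in the file.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Scan 4-byte aligned words for 'marker'. Returns the byte offset of the
// word right after it, or 0 when the marker is not found.
static size_t FindAfterMarker(const uint8_t *image, size_t bytes, uint32_t marker) {
    for (size_t o = 0; o + 8 <= bytes; o += 4) {
        uint32_t w;
        memcpy(&w, image + o, 4);
        if (w == marker) return o + 4;
    }
    return 0;
}

// Overwrite the word after 'marker' with 'value'. Returns 1 on success.
static int PatchAfterMarker(uint8_t *image, size_t bytes, uint32_t marker, uint32_t value) {
    size_t o = FindAfterMarker(image, bytes, marker);
    if (o == 0) return 0;
    memcpy(image + o, &value, 4);
    return 1;
}
```

One caveat with any marker search: the compare instruction itself embeds the marker constant, so the scan relies on that immediate not landing on a 4-byte boundary ahead of the real data.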

________________________________________________________________________________________________________________________________
[PRG] PE Binary Header

PE

Building a simple 64-bit binary from scratch.
References.

Unfortunately Win 10 breaks compatibility with the small PE tricks which worked on Win 7 and prior. Since I don't want to do my own custom importer (due to the risk of Win 10 breaking that compatibility as well), there is a gap of 5 cachelines between initial file headers and imports. Overall this burns 5 cachelines total for headers and imports. This is an 80:1 bloat factor required to get a binary executing on Windows. In an ideal machine the header would just be 4 bytes which get replaced with a 32-bit pointer to a linker function, with entry point at offset 4, and the entire binary loaded as R/W/E.

First 512 Bytes

Win10 requires minimum 512 offset and alignment for a section (which is required to get imports). So this packs non-import data into the first 3 cachelines using structure aliasing. There is a 5 cacheline gap to the start of the imports which can be used for whatever the binary wants. Unused fields are missing the '=' and are zero.

      ====================
        IMAGE_DOS_HEADER
      --------------------
{000} U2 e_magic=0x5A4D;
      U2 e_cblp;
                                ======================
                                  IMAGE_NT_HEADERS64
                                ----------------------
{004} U2 e_cp;                  U4 Signature='PE\0\0';
      U2 e_crlc;
                                =====================
                                  IMAGE_FILE_HEADER
                                ---------------------
{008} U2 e_cparhdr;             U2 Machine=0x8664; // IMAGE_FILE_MACHINE_AMD64
      U2 e_minalloc;            U2 NumberOfSections=1;
{012} U2 e_maxalloc;            U4 TimeDateStamp;
      U2 e_ss;
{016} U2 e_sp;                  U4 PointerToSymbolTable;
      U2 e_csum;
{020} U2 e_ip;                  U4 NumberOfSymbols;
      U2 e_cs;              
{024} U2 e_lfarlc;              U2 SizeOfOptionalHeader=120; // Just two directories with aliasing
      U2 e_ovno;                U2 Characteristics=2; // IMAGE_FILE_EXECUTABLE_IMAGE
                                ===========================
                                  IMAGE_OPTIONAL_HEADER64
                                ---------------------------
{028} U2 e_res[0];              U2 Magic=0x20b; // IMAGE_NT_OPTIONAL_HDR64_MAGIC
      U2 e_res[1];              U1 MajorLinkerVersion;
                                U1 MinorLinkerVersion;
{032} U2 e_res[2];              U4 SizeOfCode;
      U2 e_res[3];
{036} U2 e_oemid;               U4 SizeOfInitializedData;
      U2 e_oeminfo;         
{040} U2 e_res2[0];             U4 SizeOfUninitializedData;
      U2 e_res2[1];         
{044} U2 e_res2[2];             U4 AddressOfEntryPoint=604; // Start the repair tool
      U2 e_res2[3];         
{048} U2 e_res2[4];             U4 BaseOfCode;
      U2 e_res2[5];
{052} U2 e_res2[6];             U8 ImageBase=0x400000; // 4 MiB offset
      U2 e_res2[7];
{056} U2 e_res2[8];         
      U2 e_res2[9];
{060} U4 e_lfanew=4;            U4 SectionAlignment=4;
{064}                           U4 FileAlignment=4;
{068}                           U2 MajorOperatingSystemVersion;
                                U2 MinorOperatingSystemVersion;
{072}                           U2 MajorImageVersion;
                                U2 MinorImageVersion;
{076}                           U2 MajorSubsystemVersion=4; // WinNT Win32 version
                                U2 MinorSubsystemVersion;
{080}                           U4 Win32VersionValue;
{084}                           U4 SizeOfImage=BINARY_SIZE;
{088}                           U4 SizeOfHeaders=512;
{092}                           U4 CheckSum;
{096}                           U2 Subsystem=3; // IMAGE_SUBSYSTEM_WINDOWS_CUI
                                U2 DllCharacteristics;
{100}                           U8 SizeOfStackReserve;
{108}                           U8 SizeOfStackCommit=0x100000; // 1 MiB
{116}                           U8 SizeOfHeapReserve;
{124}                           U8 SizeOfHeapCommit=0x100000; // 1 MiB
{132}                           U4 LoaderFlags;
{136}                           U4 NumberOfRvaAndSizes=2; // Required to get import table
{140}                           U8 DataDirectory[0]=0; // Exports needs to be empty
      ========================
        IMAGE_SECTION_HEADER
      ------------------------
{148} U1 Name[8];               U4 DataDirectory[1].VirtualAddress=540; // Imports
{152}                           U4 DataDirectory[1].Size=40; // Enough for 2 entries
{156} U4 VirtualSize=BINARY_SIZE-512;
{160} U4 VirtualAddress=512;
{164} U4 SizeOfRawData=BINARY_SIZE-512;
{168} U4 PointerToRawData=512;
{172} U4 PointerToRelocations;
{176} U4 PointerToLinenumbers;
{180} U2 NumberOfRelocations;
      U2 NumberOfLinenumbers;
{184}  U4 Characteristics=0xE0000060;
       // IMAGE_SCN_CNT_CODE|
       // IMAGE_SCN_CNT_INITIALIZED_DATA|
       // IMAGE_SCN_MEM_EXECUTE|
       // IMAGE_SCN_MEM_READ|
       // IMAGE_SCN_MEM_WRITE
{188} U4 Unused
      ==============
        FREE SPACE
      --------------
{192} to {511} - 5 cachelines.
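
The aliasing in the table follows from a few fixed PE32+ offsets. A small sketch of that arithmetic, assuming the standard layout (e_lfanew at byte 60 of the DOS header, a 4 byte signature, a 20 byte IMAGE_FILE_HEADER, SectionAlignment at byte 32 and DataDirectory at byte 112 of IMAGE_OPTIONAL_HEADER64). The function names are made up for illustration.

```c
enum {
    PE_SIGNATURE_SIZE   = 4,   // 'PE\0\0'
    PE_FILE_HEADER_SIZE = 20,  // IMAGE_FILE_HEADER
    PE_OPT_SECTION_ALIGNMENT = 32,  // Offset inside IMAGE_OPTIONAL_HEADER64
    PE_OPT_DATA_DIRECTORY    = 112, // Offset of DataDirectory[0] (PE32+)
    PE_DATA_DIRECTORY_SIZE   = 8    // {U4 VirtualAddress, U4 Size}
};

// File offset of IMAGE_OPTIONAL_HEADER64, given the e_lfanew value.
static unsigned OptionalHeaderStart(unsigned lfanew) {
    return lfanew + PE_SIGNATURE_SIZE + PE_FILE_HEADER_SIZE;
}

// File offset of the SectionAlignment field.
static unsigned SectionAlignmentField(unsigned lfanew) {
    return OptionalHeaderStart(lfanew) + PE_OPT_SECTION_ALIGNMENT;
}

// File offset of DataDirectory[index].
static unsigned DataDirectoryField(unsigned lfanew, unsigned index) {
    return OptionalHeaderStart(lfanew) + PE_OPT_DATA_DIRECTORY + PE_DATA_DIRECTORY_SIZE * index;
}

// File offset of the first IMAGE_SECTION_HEADER.
static unsigned FirstSectionHeader(unsigned lfanew, unsigned sizeOfOptionalHeader) {
    return OptionalHeaderStart(lfanew) + sizeOfOptionalHeader;
}
```

With e_lfanew=4, SectionAlignment lands on byte 60, the same bytes as e_lfanew itself, which is why both must hold the value 4. With SizeOfOptionalHeader=120, the section header starts at 148, on top of DataDirectory[1], so the unused section Name[8] doubles as the import directory entry.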

The One Section

This provides the imports and the rest of the binary code and data. I kept all the import data in the section instead of attempting to place it in the prior 512 byte header, just in case Windows checks VA to section range. This part packs the mess of PE import tables and string data into 2 cachelines. It aliases many of the structures based on knowing parts that are not accessed. The "repair tool" copies the function pointers 16 bytes earlier, then restores the original IMAGE_IMPORT_BY_NAME RVAs as would be seen on disk. This enables the header in memory to be stored back to an executable and still function properly. The IMAGE_IMPORT_BY_NAME is tricky because it wants 16-bit alignment and two leading zeros.

      ======================
        FUNCTION ADDRESSES
      ----------------------
{512} U8 LoadLibraryA
{520} U8 GetProcAddress
      ======================
        IMAGE_THUNK_DATA64
      ----------------------
{528} U8 Function=588; // Offset to "LoadLibraryA"
{536} U8 Function=558; // Offset to "GetProcAddress"  U4 unused;
                                                      ===========================
                                                        IMAGE_IMPORT_DESCRIPTOR
                                                      ---------------------------
                                                      U4 OriginalFirstThunk;
{544} U8 Function=0; // End IAT                       U4 TimeDateStamp;
                                                      U4 ForwarderChain;
{552}                                                 U4 Name=580; // Offset to "kernel32"
      ========================
        IMAGE_IMPORT_BY_NAME
      ------------------------
      U1 unused[2];                                   U4 FirstThunk=528; // Offset to Import Address Table
      U1[2]='\0\0'
{560} U1[4]='GetP'                                    U4 OriginalFirstThunk;
      U1[4]='rocA'                                    U4 TimeDateStamp;
{568} U1[4]='ddre'                                    U4 ForwarderChain;
      U1[4]='ss\0\0';                                 U4 Name;
{576}                                                 U4 FirstThunk=0; // Final entry must be empty
      U1[4]='kern'
{584} U1[8]='el32\0\0Lo'
{592} U1[8]='adLibrar'
{600} U1[4]='yA\0\0';
      ===============================
        REPAIR TOOL AND ENTRY POINT
      -------------------------------
{604} b8 10 02 40 00           mov eax,0x400210; // Start of IMAGE_THUNK_DATA
{609} 48 8b 18                 mov rbx,QWORD PTR [rax]; // Fetch LoadLibrary pointer
{612} 48 89 58 f0              mov QWORD PTR [rax-0x10],rbx; // Store at [512]
{616} 2E 48 c7 00 4c 02 00 00  mov QWORD PTR cs:[rax],0x24c; // Restore string RVA
{624} 48 8b 58 08              mov rbx,QWORD PTR [rax+0x8]; // Fetch GetProcAddress pointer
      48 89 58 f8              mov QWORD PTR [rax-0x8],rbx; // Store at [520]
{632} 48 c7 40 08 2e 02 00 00  mov QWORD PTR [rax+0x8],0x22e; // Restore string RVA
      =============
        APP START
      -------------
{640}
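
What the repair tool does, written as a C sketch over an in-memory copy of the image instead of machine code. The offsets come from the table above. The function name RepairImports is hypothetical, and memcpy is used only to stay alignment-safe in portable C.

```c
#include <stdint.h>
#include <string.h>

// 'image' points at byte 0 of the loaded binary.
// After the loader runs, [528] and [536] hold resolved function pointers.
static void RepairImports(uint8_t *image) {
    const uint64_t rvaLoadLibraryA   = 588; // 0x24c
    const uint64_t rvaGetProcAddress = 558; // 0x22e
    // Copy both function pointers 16 bytes earlier, to [512] and [520].
    memcpy(image + 512, image + 528, 8);
    memcpy(image + 520, image + 536, 8);
    // Restore the IMAGE_IMPORT_BY_NAME RVAs as they are on disk.
    memcpy(image + 528, &rvaLoadLibraryA, 8);
    memcpy(image + 536, &rvaGetProcAddress, 8);
}
```

After this runs, the bytes from [528] onward match the file on disk again, and the app calls through [512] and [520].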

Builder Example

The following C code will build a quick proof of concept binary.

//
// SIMPLE SELF-CONTAINED GCC 'C' BASED WIN32 BINARY BUILDER
//
// Test with,
// gcc b64.c -march=amdfam10 -std=gnu11 -Ofast -o b64.exe -s -lkernel32 -luser32 -lgdi32 -lwinmm
// b64.exe
// tst.exe
// echo %ERRORLEVEL%
//
// Language tools.
#define E_(x,y) __builtin_expect(x,y)
#define O_ __attribute__((noreturn))
#define R_ __restrict
#define S_ static
#define W_ __attribute__((__stdcall__)) __attribute__((__force_align_arg_pointer__))
//
// Type system.
typedef unsigned char U1;
typedef unsigned short U2;
typedef unsigned int U4;
typedef unsigned long long U8;
typedef U1 *R_ U1R;
typedef U2 *R_ U2R;
typedef U4 *R_ U4R;
typedef U8 *R_ U8R;
#define U1R_(x) ((U1R)(x))
#define U2R_(x) ((U2R)(x))
#define U4R_(x) ((U4R)(x))
#define U8R_(x) ((U8R)(x))
#define U1_(x) ((U1)(x))
#define U2_(x) ((U2)(x))
#define U4_(x) ((U4)(x))
#define U8_(x) ((U8)(x))
//
// Win32 API for x86-64.
W_ U4 CloseHandle(U8);
W_ U8 CreateFileA(U1R,U4,U4,U8,U4,U4,U8);
W_ void ExitProcess(U4);
W_ U4 SetFilePointer(U8,U4,U4R,U4);
W_ U4 WriteFile(U8,U1R,U4,U4R,U1R);
//
#define INVALID_HANDLE_VALUE (~U8_(0))
enum{
 CREATE_ALWAYS=2,
 FILE_SHARE_WRITE=2,
 GENERIC_READ=0x80000000,
 GENERIC_WRITE=0x40000000};
//
// Entry point.
O_ void main(U4 argc,U1R *R_ argv){
 // Building a 64 KiB binary (lots of extra space for later).
 // This defaults to zero fill (so zeros need not be written).
 #define BINARY_SIZE 65536
 S_ U8 buf[BINARY_SIZE/8];
 U1R b=U1R_(buf);
 //
 U2R_(b+0)[0]=0x5a4d; // e_magic
 // 
 U1R_(b+4)[0]='P'; // Signature
 U1R_(b+4)[1]='E';
 //
 U2R_(b+8)[0]=0x8664; // Machine
 U2R_(b+10)[0]=1; // NumberOfSections
 U2R_(b+24)[0]=120; // SizeOfOptionalHeader
 U2R_(b+26)[0]=2; // Characteristics
 //
 U2R_(b+28)[0]=0x20b; // Magic
 U4R_(b+44)[0]=604; // AddressOfEntryPoint
 U8R_(b+52)[0]=0x400000; // ImageBase
 U4R_(b+60)[0]=4; // SectionAlignment and e_lfanew
 U4R_(b+64)[0]=4; // FileAlignment
 U2R_(b+76)[0]=4; // MajorSubsystemVersion
 U4R_(b+84)[0]=BINARY_SIZE; // SizeOfImage
 U4R_(b+88)[0]=512; // SizeOfHeaders
 U2R_(b+96)[0]=3; // Subsystem
 U8R_(b+108)[0]=0x100000; // SizeOfStackCommit
 U8R_(b+124)[0]=0x100000; // SizeOfHeapCommit
 U4R_(b+136)[0]=2; // NumberOfRvaAndSizes
 U4R_(b+148)[0]=540; // DataDirectory[1].VirtualAddress
 U4R_(b+152)[0]=40; // DataDirectory[1].Size
 //
 U4R_(b+156)[0]=BINARY_SIZE-512; // VirtualSize
 U4R_(b+160)[0]=512; // VirtualAddress
 U4R_(b+164)[0]=BINARY_SIZE-512; // SizeOfRawData
 U4R_(b+168)[0]=512; // PointerToRawData
 U4R_(b+184)[0]=0xE0000060; // Characteristics
 //
 U8R_(b+528)[0]=588; // Function
 U8R_(b+536)[0]=558; // Function
 U4R_(b+552)[0]=580; // Name
 U4R_(b+556)[0]=528; // FirstThunk
 //
 U1R_(b+560)[0]='G';
 U1R_(b+560)[1]='e';
 U1R_(b+560)[2]='t';
 U1R_(b+560)[3]='P';
 U1R_(b+560)[4]='r';
 U1R_(b+560)[5]='o';
 U1R_(b+560)[6]='c';
 U1R_(b+560)[7]='A';
 U1R_(b+560)[8]='d';
 U1R_(b+560)[9]='d';
 U1R_(b+560)[10]='r';
 U1R_(b+560)[11]='e';
 U1R_(b+560)[12]='s';
 U1R_(b+560)[13]='s';
 //
 U1R_(b+580)[0]='k';
 U1R_(b+581)[0]='e';
 U1R_(b+582)[0]='r';
 U1R_(b+583)[0]='n';
 U1R_(b+584)[0]='e';
 U1R_(b+585)[0]='l';
 U1R_(b+586)[0]='3';
 U1R_(b+587)[0]='2';
 //
 U1R_(b+590)[0]='L';
 U1R_(b+591)[0]='o';
 U1R_(b+592)[0]='a';
 U1R_(b+593)[0]='d';
 U1R_(b+594)[0]='L';
 U1R_(b+595)[0]='i';
 U1R_(b+596)[0]='b';
 U1R_(b+597)[0]='r';
 U1R_(b+598)[0]='a';
 U1R_(b+599)[0]='r';
 U1R_(b+600)[0]='y';
 U1R_(b+601)[0]='A';
 // 
 U1R_(b+604)[0]=0xb8;
 U1R_(b+604)[1]=0x10;
 U1R_(b+604)[2]=0x02;
 U1R_(b+604)[3]=0x40;
 U1R_(b+604)[4]=0x00;
 //
 U1R_(b+604)[5]=0x48;
 U1R_(b+604)[6]=0x8b;
 U1R_(b+604)[7]=0x18;
 //
 U1R_(b+604)[8]=0x48;
 U1R_(b+604)[9]=0x89;
 U1R_(b+604)[10]=0x58;
 U1R_(b+604)[11]=0xf0;
 
 U1R_(b+604)[12]=0x2e;
 U1R_(b+604)[13]=0x48;
 U1R_(b+604)[14]=0xc7;
 U1R_(b+604)[15]=0x00;
 U1R_(b+604)[16]=0x4c;
 U1R_(b+604)[17]=0x02;
 U1R_(b+604)[18]=0x00;
 U1R_(b+604)[19]=0x00;
 //
 U1R_(b+604)[20]=0x48;
 U1R_(b+604)[21]=0x8b;
 U1R_(b+604)[22]=0x58;
 U1R_(b+604)[23]=0x08;
 //
 U1R_(b+604)[24]=0x48;
 U1R_(b+604)[25]=0x89;
 U1R_(b+604)[26]=0x58;
 U1R_(b+604)[27]=0xf8;
 //
 U1R_(b+604)[28]=0x48;
 U1R_(b+604)[29]=0xc7;
 U1R_(b+604)[30]=0x40;
 U1R_(b+604)[31]=0x08;
 U1R_(b+604)[32]=0x2e;
 U1R_(b+604)[33]=0x02;
 U1R_(b+604)[34]=0x00;
 U1R_(b+604)[35]=0x00;
 //
 // Extra code to return lower 32-bits of GetProcAddress
 // mov rax,rbx; ret;
 U1R_(b+640)[0]=0x48;
 U1R_(b+640)[1]=0x89;
 U1R_(b+640)[2]=0xd8;
 U1R_(b+640)[3]=0xc3;
 //
 // Dump binary to file.
 U8 h=CreateFileA(U1R_("tst.exe"),GENERIC_WRITE|GENERIC_READ,FILE_SHARE_WRITE,0,CREATE_ALWAYS,0,0);
 U4 r[1];WriteFile(h,U1R_(buf),BINARY_SIZE,U4R_(r),0);
 CloseHandle(h);
 // Exit the process.
 while(1)ExitProcess(0);}

________________________________________________________________________________________________________________________________
[PRG] Linux?
Tried to get back to using Linux on a PC laptop, failed...

F2 during boot to get into the BIOS (at least on this machine)
Was forced to set a BIOS 'Supervisor Password' to disable Secure Boot (took a while to figure that one out)
Disable Secure Boot
Found I could disable BIOS password again after disabling Secure Boot (so why require the password prior?)
Disabled 'Fast Boot' but it gets re-enabled after rebooting (what, am I supposed to go into BIOS each boot?)
Turned off Win11's 'Fast Boot' from the control panel, but doesn't change the BIOS behavior
Apparently there is no way to change the 'Boot Mode' to Legacy (damn, UEFI is a nightmare)
Tried to install from Virtual Box but Windows keeps sleeping the USB thumb drive (seriously)
Disabled 'USB selective suspend setting' (of course that didn't work)
Downloaded 'Rufus' to try to make an Arch ISO thumb stick from this Windows machine (also broken)
Bought a new thumb drive, even though the 'old' thumb drive is effectively new too
Eventually got linux to run from Virtual Box
Got blocked attempting to build a Linux thumb drive based system, wouldn't UEFI boot (probably config issue)

Where I left off on Arch install.

cgdisk /dev/sda ... to create 3 partitions {0700 (for FAT32),ef00 (for EFI), 8304 (for /)}
mkfs.fat -F 32 /dev/sda1
mkfs.fat -F 32 /dev/sda2
mkfs.ext4 -O "^has_journal" /dev/sda3
mount /dev/sda3 /mnt
mount --mkdir /dev/sda2 /mnt/boot
pacstrap -K /mnt base linux linux-firmware
genfstab -U /mnt >> /mnt/etc/fstab
arch-chroot /mnt
ln -sf /usr/share/zoneinfo/America/New_York /etc/localtime
hwclock --systohc
pacman -S nano
nano /etc/locale.gen and uncomment en_US.UTF-8 UTF-8
locale-gen
nano /etc/locale.conf and add LANG=en_US.UTF-8
nano /etc/hostname and add ihatelinuxtoo
passwd and set password
pacman -S grub
grub-install --target=x86_64-efi --efi-directory=/boot --removable --recheck
exit
umount -R /mnt
reboot

Random Notes

BusyBox
Cross-Compiled Linux From Scratch - Embedded (x86)
Musl
But apparently there is no way to use anything other than glibc because of the NV binary driver requirements.

________________________________

[PIX] Resolution vs Pixel
Q. Why Does an LCD Need Much Higher Resolution Than a CRT?
Because of physical pixel shape. CRT's pixels have smooth partial overlap. CRT can reproduce a smooth signal with less physical resolution. LCD's pixels are quite hard, LCD requires a substantially larger amount of resolution to get to the point where the pixel's hard edge is not perceptual.

Q. Why Does the LCD Hard Pixel Fail Even For Text Rendering?
Because a substantial portion of the strokes in a font's characters are neither purely horizontal or vertical, nor pixel aligned. The LCD hard pixel can only reproduce a hard edge (a higher frequency than the physical resolution) on features which are both on axis and pixel aligned.

Q. For Moving Images With No Temporal Aliasing, Why is Minimum Feature Size Larger Than a Pixel?
Pixel sized features with sub-pixel motion would alternate between being visible when aligned to pixel centers, to being half visible when aligned to pixel edges, or quarter visible if the feature is a point aligned to a pixel corner. This visibility change is the temporal aliasing. The only way to reduce temporal aliasing is to drop off contrast of pixel sized features until the visibility change is not perceptual any more. This is the standard for film rendering, and a requirement to provide a believable image.
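
The visibility numbers above can be checked with a box-coverage sketch. This assumes a 1x1 pixel feature and a hard square pixel, with the offset measured in pixels from the pixel center. The name Coverage is made up for illustration.

```c
// Fraction of a 1x1 feature which lands inside one pixel, given the
// feature's center offset (dx,dy) from that pixel's center, in pixels.
static double Coverage(double dx, double dy) {
    double ax = dx < 0.0 ? -dx : dx;
    double ay = dy < 0.0 ? -dy : dy;
    double cx = ax < 1.0 ? 1.0 - ax : 0.0; // Horizontal overlap
    double cy = ay < 1.0 ? 1.0 - ay : 0.0; // Vertical overlap
    return cx * cy;
}
```

Coverage(0,0) is 1 (aligned to the pixel center), Coverage(0.5,0) is 0.5 (aligned to an edge), and Coverage(0.5,0.5) is 0.25 (aligned to a corner). The swing between those values as the feature moves is the temporal aliasing.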

Q. Would Phone OEMs Push Peak Resolution Images?
Pushing peak resolution sensor is fine, but the images generated with such a sensor don't actually provide pixel quality at peak resolution. Images could easily be scaled down substantially and end up with higher quality as artifacts get removed in the process. The aim of keeping images at peak resolution is to more quickly fill the phone's storage, since phone companies get margin on higher storage phones, or optionally push customers into cloud services with aim to charge a higher monthly fee.

________________________________________________________________________________________________________________________________
[PIX] DLSS3?
Notes from the original DLSS3 launch in 2022 ...
Related Links

Summary, as a game developer I would never choose to integrate DLSS3 into one of my games.

Added Latency of Extra Reconstruction - Temporal interpolation (doubling) happens at present time. Run logic to generate the interpolated frame, present that. Queue up the game rendered frame to present next. So at a minimum the time to generate the interpolated frame is added to latency.
Added Latency of Interpolation - Button to muzzle flash wouldn't see a latency add, because the interpolated frame would show some amount of the effects of the muzzle flash. However camera motion would see around a half input frame latency add, because the first frame is only half to the rendered frame. It is likely that most reviews miss the point that camera motion latency is worse than button to flash measurements with temporal frame interpolation.
Jittered Motion - DLSS3 depends on Variable Refresh Rate (VRR). DLSS3 latency is substantially worse with V-Sync enabled. However VRR can never produce consistent linear motion, so it is NOT acceptable from an ultimate quality perspective. The amount of motion jitter is relative to both the variability of time it takes to render a frame, and the frame rate of the display.
Power Hungry - The associated latency reduction technology depends on keeping the GPU at peak power, to minimize the time it takes to render a frame. This would not be a good strategy for laptops for example.
BFI is Better to Reduce Motion Blur - At the high frame rates at which DLSS3 starts to have lower perceptual artifacts and more acceptable input latency, v-sync with black frame insertion is a better technology to reduce the perceptual motion blur caused by scan-and-hold displays than simply increasing frame rate.

________________________________________________________________________________________________________________________________
[PIX] Morphological AA Links
Despite temporal techniques, spatial AA is still useful when overriding areas of low convergence ...

Dynamic Temporal Antialiasing and Upsampling in Call of Duty
Filmic SMAA Sharp Morphological and Temporal Antialiasing - Online copy, details search optimizations
Filtering Approaches for Real-Time Anti-Aliasing - Siggraph 2011 Course
FXAA 3.11 Source - Copy on Github

________________________________________________________________________________________________________________________________
[PIX] Modified Soft BFI
Display rates to BFI configurations?

360 Hz ... 90 Hz (/4)
240 Hz ... 80 Hz (/3)
144 Hz ... 72 Hz (/2)
120 Hz ... 60 Hz (/2) ... Probably not useful?

"Modified"
Soft BFI is probably limited by non-linearity in display pixel transitions. Meaning {white,black} frames won't necessarily average to {50% grey}. The other issue is the loss of brightness. Assuming enough pixel response linearity, one could redistribute light across frames. Start with repeating the input frame N times on output. Could increase brightness of pixels which are not at peak already, and then subtract that amount from the later frame(s). Thus temporal energy conservation. At white, this would act as scan-and-hold, but say at 144 Hz, anything under 50% would act as 72 Hz with BFI.
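
One possible reading of the redistribution, as a sketch only. It assumes linear-light pixel values in 0 to 1 and a perfectly linear display response (which is exactly the part in doubt above). The name SoftBfi is hypothetical.

```c
// Spread the light of one input pixel 'v' (linear, 0..1) across 'n' repeated
// output frames, front-loading as much as possible into the earliest frames.
// The sum of out[0..n-1] equals v*n, so temporal energy is conserved.
static void SoftBfi(float v, int n, float *out) {
    float energy = v * (float)n; // Total light owed over the n frames
    for (int i = 0; i < n; i++) {
        float f = energy < 1.0f ? energy : 1.0f; // Clamp to display peak
        out[i] = f;
        energy -= f;
    }
}
```

For n=2 (144 Hz display, 72 Hz content), an input of 0.25 gives {0.5, 0.0}, which is BFI. An input of 1.0 gives {1.0, 1.0}, which is scan-and-hold. An input of 0.75 gives {1.0, 0.5}, somewhere in between.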

________________________________

[CRT] Links Inventory Misc

TODOs

Get remote for Sharp

Downsampling

Signals

VGA (RGBHV)

RGB {0 to 0.7V} 75 Ohm, pin {1,2,3}
HV {0 to 5.0V} TTL compatible, pins {H=13, V=14}
BNC cable {H=grey/yellow, V=white/black}
Pin 9 {5V}

SCART

RGB {0 to 0.7V} 75 Ohm termination, pin {15,11,7} (shared with component)

S (composite video, or just composite sync) {0 to 1V} 75 Ohm, pin {20}

NTSC

TODO: Correct this documentation for half line!
15.734 kHz
29.97 (30/1.001) FPS
59.94 (60/1.001) FPS for 240p modes
VERT
483 visible
2 line - front porch
6 line - sync
34 line - back porch
525 scanlines total
HORZ
52.6 us - video
1.5 us - front porch
4.7 us - back porch
4.7 us - sync
10.9 us - blanking total (sum of porches and sync)
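
A quick arithmetic check of the interlaced numbers above, as a sketch. It only covers the 525 line, 30/1.001 FPS case, not the 240p half-line question in the TODO. The function names are made up.

```c
// Horizontal line rate in Hz from total scanlines per frame and frames per second.
static double LineRateHz(int linesPerFrame, double framesPerSecond) {
    return (double)linesPerFrame * framesPerSecond;
}

// Duration of one scanline in microseconds.
static double LinePeriodUs(double lineRateHz) {
    return 1000000.0 / lineRateHz;
}
```

LineRateHz(525, 30.0/1.001) comes out near 15734.27 Hz, matching the 15.734 kHz figure. The line period is then about 63.56 us, which agrees with 52.6 us of video plus 10.9 us of blanking to within the rounding of the listed values.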

Modelines

Modeline data provided as {Display, Sync Start, Sync End, Total}
See: www.mythtv.org/wiki/Modeline_Database

Translator from modeline to AMD/NV PC driver settings:
  Front Porch = Sync Start - Display
  Sync Width = Sync End - Sync Start

Diagram:
  |<----------------total-------------->|
  |<------------sync-end----------->|   .
  |<---------sync-start------->|    .   .
  |<-------display-------->|   .    .   .
  .                        .   .    .   .
  |XXXXXXXXXXXXXXXXXXXXXXXX|---|____|---|
                           .   .    .
           front porch ....|<->|    .
           sync width .........|<-->|
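
The translation can be sketched in a few lines. The `Timing` type and `to_driver_settings` name are hypothetical. Back porch (Total - Sync End) is not listed above but follows from the same diagram. The example numbers are the common 640x480 @ 60 Hz VGA timing, used here only for illustration.

```python
from typing import NamedTuple

class Timing(NamedTuple):
    """One axis of a modeline: {Display, Sync Start, Sync End, Total}."""
    display: int
    sync_start: int
    sync_end: int
    total: int

def to_driver_settings(t):
    """Convert modeline positions into porch/sync widths."""
    return {
        "active": t.display,
        "front_porch": t.sync_start - t.display,
        "sync_width": t.sync_end - t.sync_start,
        "back_porch": t.total - t.sync_end,   # derived, not in the list above
    }

# Example: 640x480 @ 60 Hz (pixels horizontally, lines vertically)
horz = Timing(640, 656, 752, 800)
vert = Timing(480, 490, 492, 525)

print("H:", to_driver_settings(horz))
print("V:", to_driver_settings(vert))
```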

Inventory

Compaq MV 740 - 17" VGA, {31-70kHz,50-120Hz}, offline, currently dead (won't power on).

FREE IF ANYONE WANTS IT! - Don't have time to look into fixing it

Daewoo DTQ-20U4SC - 20" NTSC TV, like new

Might want to RGB mod this (if possible)
[RGB] Seems possible (no info)
Manual | Service Manual
IN USE - connected to TinyNES and Genesis 2 (both via composite)

Dell UltraScan 1600HS Series D1626HT - 21" VGA {30-107kHz,48-160Hz}, made by Sony, good tube, cleaned

TODO: Grab local copy of manual and service manual!
Service Manual
IN STORAGE - Don't have room on my desk right now

Dotronix DSV27 - 27" 480i, from Dotronix, new from new-old-stock LG tube

The LG DU27FB32C insides destroy 240p signals, so this is a 480i only device
Needs some re-calibration (new state not great)
Overall not impressed, amazingly expensive, seems like it is just a new LG TV in a new metal body
Manual | Service Manual
IN USE - Connected to PS2 via component

eMachines 17fs - 17" VGA, {30-70kHz, 50-160Hz}, new-old-stock

Could use a film to improve black levels (not good ambient reflection)
IN USE - Connected to AMD Laptop

Future Power 17DB88 - 17" VGA, {?-?kHz,?-?Hz}, new-old-stock

IN STORAGE

HP "1024" D2813 - 14" VGA, {?kHz,?Hz}, tube ok, cleaned

OFFLINE - Need to track down sparking problem (maybe flyback)
FREE IF ANYONE WANTS IT! - Don't have time to look into fixing it

Insignia IS-TV040919 - 20" NTSC TV, new-old-stock

Manual | Service Manual
Repaired the PCB cracked by the 'courier transform' (shipping damage), got video, but had a problem with the menu and controller not working?
Reglued the flyback screw housing, but it didn't hold on reassembly
Strengthened the plastic rails holding the board
OFFLINE - Thought this was left in a working state actually (need to check again)

JVC AV-27D200 - 27" NTSC TV

Manual
OFFLINE - Cleaned, but board needs trace repair, so waiting on time to get stuff to do that

JVC i'ART 27" AV-27SF36 - 27" NTSC TV, new-old-stock

OFFLINE - Going to use for 'arcade' setup

KDS KF-7b - 17" VGA, {30-70kHz, 50-160Hz}, new-old-stock

Could use a film to improve black levels (not good ambient reflection)
Plastic body dissolves with isopropyl alcohol, so this ended up with a new 'artistic textured surface'
IN USE - Connected to day job laptop

MakVision/Wei-YA M2929D4-TS SVGA Arcade Monitor - 29" VGA (480p/600p), {30-40kHz,47-90Hz}, new-old-stock

Specs
OFFLINE - Screen shows OSD, but HDFury Nano, and AMD VGA laptop not wanting to drive this now, was fine before

MakVision/Wei-YA M3129DS-LG CGA/EGA/VGA Arcade Monitor - 29" VGA (240p/480p), {15/24/31kHz,47-70Hz}, new-old-stock

Need extra power cord: 49-5179-00
Perhaps Other Model Schematics?
C3129D 15kHz Fix
OFFLINE - The horizontal linearity from new is horrible, no documented adjustment knobs, not usable until fixed

MaxTech XT-4800 - 14" VGA, {30-48?kHz,?-87?Hz}, new-old-stock

Manual (XT-4861)
IN STORAGE

Philips 20PT6341/37 - 20" 240p/480i

This probably needs a re-cap and re-calibration
Having trouble with service menu due to power button issues on remote, lost component input for some reason after a reboot
Wanted to use this during game development for slot-mask 240p testing
CRT Database | Manual | Service Manual
OFFLINE - Until service menu issues are resolved

Philips 27PT6441/37 - 27" 240p/480i

This is bright, might need recap and calibration
Have GBS-C for this, intended to be a future home arcade display
Manual | Service Manual
OFFLINE - Waiting for time to service (needs cleaning too)

Philips 32PT6441/37 - 32" 240p/480i

This one needs tube scratch repair
Cleaned, painted, recapped only the really bad caps
Have GBS-C for this, intended to be a future home arcade display
Manual | Service Manual
OFFLINE - Waiting for time to re-test

Philips 34PW850H37F - 34" 16:9 via Component (Does 480p/720p)

Was intending to use this for SPC, but stalled as it is too heavy
OFFLINE - Dropped this on my foot, waiting until I have help to lift it (needs cleaning)

Sharp 32SC260 - 32" 240p/480i

This is bright, might need recap and calibration
Have GBS-C for this, intended to be a future home arcade display
OFFLINE - Waiting for time to service

Sony BVM-D20F1U - 20" 4:3 multi-format

This needs internal cleaning, and should be connected to a 480p source instead of SNES
Not super bright, been recapped
IN USE - Connected to SNES

Sony Multiscan CPD-E200 - 17" VGA {30-85kHz,48-120Hz}

Sold!
Cleaned, tube good, but prior owner damaged screen anti-glare coating
Specs | Manual | Service Manual
SOLD - This display has an aggressively ringing horizontal sharpening which cannot be turned off!

Sony PVM-8041Q - 8" 240p/480i

This needs internal cleaning, preventive recap
IN USE - Portable PVM

Sony PVM-1354Q - 13" 240p/480i (HR Tube, 16:9 Toggle)

Needs a clean
OFFLINE - Still "Portable" PVM, just not currently using

Sony PVM-14L2 - 14" 240p/480i (16:9 Toggle)

Got a VGA board for this, need to try it
This needs internal cleaning, calibration, preventive recap, and need to fix the power button (sticks)
IN USE - Connected to PS1

Sony PVM-? / Olympus OEV203 - 19" 240/480i

OFFLINE - Haven't had the time to get into this one

Sony PVM-1953ST - 19" 240/480i, Olympus (HR Tube, Endoscope Monitor)

This has had a good external cleaning
This needs internal cleaning, calibration, preventive recap, and fix defaults to re-enable the knobs
OFFLINE - Was using for GBS-C downscaling testing

Sony PVM-1953MD - 19" 240/480i (HR Tube)

Needs to be cleaned
OFFLINE - Haven't had the time to get into this one

Sony PVM-20M2U - 20" 240/480i (16:9 Toggle)

This came with mostly nonfunctional front buttons, need to fix that
Has some convergence issues on bottom half of screen
This was a smoker owned CRT (will never buy those again)
Cleaned, put back together without some front buttons, need a chopstick to use
Added a destructive interference anti-reflective film to tube, works great
OFFLINE - Still needs recap

Sony PVM-20M2MDU - 20" 240/480i

OFFLINE - Good display, but rusted chassis (need to restore)

Sony PVM-20M2MDU/ST - 20" 240/480i

This needs internal cleaning, and preventive recap
Bright, like new, probably needs slight calibration, doesn't have an HR tube (but like it that way)
IN USE - Connected to NeoGeo MVS + Supergun

Sony Wega KD-34XBR970 - 34" 1080i HD TV

This does 720p but has some lag, so it is intended to be used for Netflix/etc in a bedroom
Very bright
Also a backup in case the 30" HDTV dies
OFFLINE - Waiting for time to clean, recap, and calibrate

Sony Wega KV-30HS420 - 30" 1080i HD TV

Needs a degauss, cleaning, preventative recap, and calibration
Want to recalibrate this for less overscan (pain when connected to PC)
Wega List | Manual | Service Manual
Service Menu Info
Want modeline for 540p operation! - Haven't been able to get that to work over HDMI.
Shmups topic says most require component (not HDMI) for 540p
IN USE - Connected to PS5 and computer, has good 720p mode

ViewSonic 17GS - 17" VGA {30-69kHz,50-160Hz}, new-old-stock

IN USE - Connected to NV Laptop

ViewSonic G75f - 17" VGA {30-86kHz,50-180Hz}, cleaned, tube ok

OFFLINE - Plastic is self-destructing, needs a new external chassis

ViewSonic Optiquest Q95 - 19" VGA {30-86kHz,50-160Hz}, cleaned, good tube

Manual
OFFLINE - Need to fix front buttons

ViewSonic PS790 - 19" VGA {30-95kHz,50-180Hz}, cleaned, medium tube life

Manual
OFFLINE - Needs recap

________________________________________________________________________________________________________________________________

Insignia

This required a lot of fixes to bring back to life.

UPS Transform, New Old Stock TVs Are Bad Out of the Box, then Finally Warm and Calibrated

Cracked, and Repaired (Different Parts of the Same Board)

________________________________________________________________________________________________________________________________

Endoscopy PVM

Have too many PVMs.

Ebay Endoscopy PVM-1953ST, Picked Up Via Freight, Clean Outside (Bad SNES Running 240p Test)

________________________________________________________________________________________________________________________________

First Philips

For the love of slot mask.

Philips 20PT6341/37 - Deceptively Clean on the Inside, Might Need a Recap (Probably SNES That Needs Recap)

________________________________

[CRT] Notes

Misc

Intergraph Interview 28hd98 Model TX-D8W71W

Youtube

Mitsubishi's Diamond Pro 2070SB (VGA, 140 kHz, 160 Hz)

CRT Database
Also: NEC FP2141SB, HP P1230, LaCie Electron 22 Blue IV, SGI C220

NEC DM-2000P

CRT Database
RGB (240p) over 34-pin connection

________________________________________________________________________________________________________________________________

APEX/KLH

Some 'Apex PF3220 / KLH KF3228' have a Toshiba Microfilter tube (purple tint)!
PF2025 with component looks great
[RGB] AT2704S
[RGB] AT2408S

________________________________________________________________________________________________________________________________

Daewoo

[Ebay] $500 ($85 shipping) - DTQ-19P2FC new-old-stock

________________________________________________________________________________________________________________________________

JVC

Earlier (240p | 480i)

D-Series (240p | 480i)

Always interested in pre-2001 JVCs for arcade monitor!
1999 had higher quality electronics
2001 27" models lacked full geometry adjustment
Best are model numbers ending in 0
The non D-Series has the same internals
Non-component JVCs can typically be easily RGB modded
RGBable: AV-32D503 AV-32D303 AV-32D203 AV-20D202 AV-36D501 AV-36D201 AV-32D501 AV-32D201 AV-27D501 AV-27D201
[RGB] Poor description
[RGB] AV-32D501
[RGB] AV-27020
Non-RGBable: AV-36320 AV-36330 AV-36360 AV-36S33 AV-36S36

JVC DT-V (15-45 kHz, 50-100 Hz)

CRT Database - JVC DT-V1700CG
DT-V1700 DT-V1710 DT-V1710C DT-V1900 DT-V1910 DT-V1910CG
Ebay: $2500 (CA Pickup) - DT-V1910CG
Ebay: $1200 ($540 shipping from UK) - DT-V1710CG (12700 hr)
Ebay: $1200 ($540 shipping from UK) - DT-V1710CG (9500 hr)
Ebay: $1850 ($540 shipping from UK) - DT-V1710CG (6400 hr)

JVC I'Art (240p | 480i)

CRT Database on AV-14F703 - Need remote to change input, same as Toshiba 14AF42, small Orion TV with component

JVC I'Art Pro (HD)

TODO

________________________________________________________________________________________________________________________________

Orion

[Ebay] $700 ($194 shipping) - TV2501 new-old-stock

________________________________________________________________________________________________________________________________

Panasonic

Avoid most of these. Known 1993 and beyond to have undefeatable sharpening, so only if RGB mod!
[RGB] CT-9R10CT CT-9R11A - Requires RGB amp :(
[RGB] CT-S1390Y CT-13R16V - Needs amp, has problems :(
[RGB] WV-CK2020A CW-CK2020A
[noRGB] CT-27SL14 CT-27L8G CT-20G14A CT-13R37S

Panasonic TX80P300A

Youtube
Has 480p VGA input

________________________________________________________________________________________________________________________________

Philips / Magnavox

No Component (240p | 480i)

Curved With Component (240p | 480i)

Want one of these!
27PS55
[CRT Database] 27PS55 S321 - Like JVC D-Series, but S321 (only) has a service menu that can be calibrated via remote
[CRT Database] 27PS55 S121 - Cannot easily be calibrated
[Ebay] $230 (TX Pickup) - 27PS55 S321
[Ebay] $500 (WI Pickup) - 27PS55S S121 (not the good one)
[Ebay] $? ($150 shipping) - 27PS55 S321 (no remote) - Watching!!!!!!!!!!!!!
Designer Series
27PT543S/37A has a slightly finer pitch and darker tube
[CRT Database] 27PT543S/37A (Designer Series)

Flat With Component (240p | 480i)

These are all good!
Speakers on Bottom, Black Bar Body Style
Better for space...
[CRT Database] 20PT6341/37 (Designer Series)
[Ebay] $200 ($150 shipping) - 20PT6245/37 (no remote) - Watching!!!!!!!!!!!!!
[Ebay] $125 ($40 shipping) - 20MT4405/17 (bright screen?)
Speakers Surround Body Style
Looks good: 27ST6210/27 20MS3442/17 27MS4504/17
[Philips] 20PT6441/37
[Reddit] 20PT6441/37
[Ebay] $180 ($120 shipping) - 20PT643R01 - Watching!!!!!!!!!!!!!!
[Ebay] $140 ($150 shipping) - 20PT643R (no remote)
[Ebay] $135 (IN Pickup) - 27RF50S325
[Ebay] $? (PA Pickup) - 27MS4504/17

Wide Screen 16:9 SD Model (240p | 480i)

Do these have aspect controls, can they do 480p, etc?
26PW6341/37 and 30PW6341/37, better voltage regulation (compared to the 4:3 TVs)

TODO....

Philips 32PT740H/37A (240p | 480i | 480p | VGA)

Curved 4:3 supporting HD resolutions

Philips 32PT830H/37A (240p | 480i | 480p | VGA)

Non-Curved 4:3 supporting HD resolutions

Philips 30PW862H (240p | 480i | 480p)

16:9 set with RGB-HV input

Philips 34PW8520 (240p | 480i | 480p | VGA)

16:9 set with VGA input

Presentation Monitors (VGA)

Would grab one if I could find one, super rare!
4:3 with 31kHz, says 'Multimedia Display' in lower left
PD-5029-S (640x{480,400,350}), PS1127 PS1132 (up to 1024x768), 32PD8000 (480p, 600p)

________________________________________________________________________________________________________________________________

Prima/Advent/Jensen

Prima is a Chinese brand, rebranded as Advent and Jensen
Advent Q1435A: Confirmed slot mask with no processing and component inputs
Advent HT3061A: HDTV 480p and 1080i over component and DVI. Lag? 240p?

________________________________________________________________________________________________________________________________

RCA

MM Series

Pick up anything with the 'digital|HI-RES' label in the upper right!
MM27100 MM36100 - MM27110 MM32110 MM36110
Lowendmac.com article on MM36100
MM36100 Manual
MM36100 36" and does {480p, 600p, 1080i} via VGA
Later MM100's (MM36110) did {480p, 1080i} over component (earlier didn't)

4:3 Proscan

Could be interesting, but won't find one of these!
Could do 1024x768 interlaced at 86Hz
PS32800HR PS36800HR PS36810 PS32810 PS27810

Misc

Some to avoid, some never to be found!
Curved HD and HD Proscan - F38310 (does scaling), PS38000 (?), etc, confusing (scale or not)
Xbox Ready Models (SD Component) F27650 F32650 - Looks good
Home Theatre Premiere F36715 - Looks like component has sharpening, is it defeatable?
SDTV (Curved) - Looks like component has sharpening, is it defeatable?
SDTV Truflat - Looks like component has sharpening, is it defeatable?
HDTV Truflat [4:3] - D36TF30 (480p VGA) D27F570T
HDTV Truflat [16:9] - Scenium D34W135D (480p DVI), Dish HD34-300/310 (480p DVI)

________________________________________________________________________________________________________________________________

Samsung

Samsung Curved (240p | 480i)

Only for RGB mod!
[RGB] TXD1973
[RGB] TXE1970
[RGB] TXH1370 TXH1372 TXH1373 TXH1386 TXH1970 TXH1972 TXH1973 TXH1986

Samsung Dynaflat

Avoid, these all have sharpening which cannot be turned off even in the service menu!
EDTV/HDTV models deinterlaced 240p/480i with lag
TXN3271HF
480p and 1080i are lag free

Samsung SlimFit

TX-3079WH seemed to have real 240p and maybe not always-on sharpening (based on internet images)? Verify.
Note some of the SlimFit CRTs don't have HDMI (like TX-S2783)
[CRT Database] TX-T2793H
[4:3] TX-T2793H
[16:9] TX-R3079WH
[Ebay] $180 (MI Pickup) - TX-R2779H
[Ebay] $130 (CN Pickup) - TX-T2782 (Recapped)

________________________________________________________________________________________________________________________________

Sanyo

Only if have good RGB mod!
The curved component SD sets appear to have sharpening, not sure if it is defeatable
The curved 4:3 HD set has 3 frames of lag at 240p
The flat Vizzon sets appear to have sharpening, not sure if it is defeatable
HD sets appear to scale all signals to 1080i
[Reddit] DS20930 with RGB mod
DS31520 Service Manual
[CRT Database] VM-8614F
[Shmups] Sanyo RGB mod attempts
[RGB] DS31520 DS20930
[noRGB] DS24425

________________________________________________________________________________________________________________________________

Sharp

Non-Component (240p | 480i)

[RGB] 19L-M100 32J-S400 36J-S400
[RGB] CJ13M10 20K-S100
[RGB] 13K-M100B 13K-M150B (service menu without remote!)
[noRGB] CN13M10B 'CN13M10 (2000 and later)' 13N-M100B 13N-M150B 27C240
[Ebay] $350 (NJ pickup) - 25R-S100 new-old-stock
[Shmups] Incomplete RGB mod for 25R-S100

Cinema Select (240p | 480i)

Seems worth trying!
Label upper left, need the remote, or redi-remote (with component input button)
Other models of this era didn't have component, and some look identical except for the label
[Component] 27K-X2000 31HX1000 31HX1200 35HX1000 35HX1200 36K-X2000 CK36S60

Curved With Component (240p | 480i)

Seems worth trying!
Similar to JVC D-Series
VM disable (aka Sharpening) in the service menu
[RGB] 27SC260
[Ebay] $300 ($230 shipping) - 27SC260
[Ebay] $100 (freight) - 27SC26B
[Ebay] $300 ($48 shipping) - 27R-S480

Sharp X-Flat (240p | 480i)

Seems worth trying!
At least one 20" model had no visual VSM (sharpening)
27F541 Review - Sharpening, CRT NA Doc says sharpening can be disabled on this TV
[RGB] 27F640 - Like JVC I'Art, service menu can be done without remote, has VM disable!
27F640 27F541
[Ebay] $500 (free shipping) - Sharp 27F541

________________________________________________________________________________________________________________________________

Sony

Sony BVM | PVM

Sony KV-20XBR / KV-25XBR

My parents had one of these (1985), absolutely amazing (RGB inputs)

Sony FW900 (VGA)

24" HD CRT

Sony Curved (240p | 480i)

[RGB AA-1] KV-27V10 KV-20V50 KV-27V55
[RGB AA-2D] KV-27S22 KV-27S26 KV-27S36 KV-27V22 KV-27V26 KV-27V36 KV-32S22 KV-32S26 KV-32S36 KV-32TW26 KV-32V26 KV-32V36
KV-35S26 KV-35S36 KV-35V36 KV-35V76
[RGB BA-1] KV-13TR28 KV-13TR29 KV-13V50 KV-20TR23 KV-20TS29 KV-20TS32 KV-20TS50 KV-20V50
[RGB BA-4D] KV-13M42 KV-20M42 KV-20S42 KV-20S43 KV-20S90 KV-27S42 KV-27S46 KV-27S66 KV-35S36
[RGB] KV-35S36
[RGB AA-2D] KV-35S36 (detailed)
TODO...
[SCART 27"] KV-2900
[RGB 35"] KV-35S36
[RGB 32"] KV-32S42
[RGB 27"] KV-27S42 KV-27S46 KV-27S66 KV-27V10 KV-27V42 KV-27V55
[RGB 20"] KV-20M40 KV-20M42 KV-20S42 KV-20S43 KV-20S90 KV-20TR23 KV-20TS29 KV-20TS32 KV-20TS50 KV-20V50
[RGB 13"] KV-13M10 KV-13M42 KV-13TR28 KV-13TR29 KV-13V50
[noRGB 20"] KV-20V60
[noRGB 19"] KV-19TR20
[noRGB 13"] KV-13M20 KV-13TR27 KV-13TR24

Sony Wega (240p | 480i)

Only interested in ones which can get RGB mod!
[RGB BA-5] KV-20FS12 KV-20FV12 KV-21FE12 KV-21FM12 KV-27FS13 KV-27FS17 KV-27FV17 KV-29FV17 KV-32FS13 KV-32FS17 KV-34FS17
Suggesting RGB mod because sharpening disable isn't fully effective
FV310 claimed to be the best (higher quality regulators): KV-36FV310 KV-32FV310 KV-27FV310
[noCOMPONENT] KV-27FV15 KV-24FV12 KV-24FV10 KV-20FV12
[RGB 38"] KV-38FS200 KV-38FV250 KV-38FV310
[RGB 36"] KV-36FS100 KV-36FS200 KV-36FS210 KV-36FV300 KV-36FV310
[RGB 34"] KV-34FS17 KV-34FS100 KV-34FS100 KV-34FV250 KV-34FV310
[RGB 32"] KV-32FS13 KV-32FS17 KV-32FS100 KV-32FS200 KV-32FS210 KV-32FV300 KV-32FV310
[RGB 29"] KV-29FA210 KV-29FS100 KV-29FS100 KV-29FV17 KV-29FV300 KV-29FV300 KV-29FV310
[RGB 27"] KV-27FS13 KV-27FS17 KV-27FS100 KV-27FS210 KV-27FV17 KV-27FV300 KV-27FV310
[RGB 24"] KV-24FV10 KV-24FV12
[RGB 21"] KV-21FE12 KV-21FM12
[RGB 20"] KV-20FA210 KV-20FS12 KV-20FV10 KV-20FV12
[RGB 13"] KV-13FM12 KV-13FM13 KV-13FM14
[noRGB 27"] KD-27FS170 KV-27FS100L KV-27FS120
[noRGB 24"] KV-24FS100 KV-24FS120 KV-24FV300
[noRGB 20"] KV-20FS100 KV-20FS120 KV-20FV300
[noRGB 13"] KV-13FS100 KV-13FS110

Sony Wega (Hi-Scan)

These are all 1080i, don't get except for 16x9, as they don't have real 240p modes!
[16:9] KD-34XBR970 KV-34HS420N KV-34HS420 KV-30HS420
[16:9 noHDMI] KV-34XBR800 KV-34HS510 KV-30HS510 KD-34XB2 KW-34D1 - The ones with DVI might do 480p 16:9 native, but with lag
[4:3] KV-40XBR800 KV-36XBR800 KV-36HS510 KV-36HS500 KV-36HS420 KD-32XS945
[4:3] KV-32HS420 KV-32XBR800 KV-32HS510 KV-32HS500 KV-32HV600 KV-27HS420
[4:3] KV-40XBR700 KV-36XBR450H KV-36XBR450 KV-36XBR400
[4:3] KV-36HS20 KV-32XBR450 KV-32XBR400 KV-32HS20

Sony Wega (Super Fine Pitch)

TODO: Notes on lower latency models (HDPT)...
These are also all 1080i, so don't get the 4:3 one!
Wega List
[16:9] KD-34XBR960N KD-34XBR960 KD-34XS955N KD-34XS955 KV-34XBR910 KD-30XS955 KV-30XBR910
[4:3] KD-36XS955
[Reddit] Notes on service settings on KD-36XS955
[Ebay] $450 (Delaware Pickup) - KD-34XBR960
[Ebay] $1100 (NY Pickup) - KD-34XBR960
[Ebay] $550 (IL Pickup) - KV-34XBR910
[Ebay] $700 (Georgia Pickup) - KV-34XBR910

________________________________________________________________________________________________________________________________

Sylvania

[noRGB] 6427GFG 6420FE

________________________________________________________________________________________________________________________________

Toshiba

1990 'Lavender Mask' had lower ambient reflection, results in purple ambient tint
1995 'Microfilter' had higher contrast

Anything 1997-1999 with component and a 'Cinema Series' logo on front would be nice to have!
'FST PERFECT' tubes in Cinema Series line (best), 'FST BLACK' in 32"/35"/36" 'SuperTUBE'
1997 CN36G97 first NA TV with component inputs and 'FST PERFECT' tube
1999 CN36Z71 possibly the last SD 'FST PERFECT' tube
True Toshiba has a square info tag on back

Maybe a pre-2003 TV for 480p
All 16:9 HD sets until 2004 had 'Microfilter', no 4:3 HD sets had it
2000 CW34X92 had no SD (line doubled), no 720p, but lag-free 480p
2003 everything up-scaled to 1080i/540p
The 'timm' doesn't have a 'Microfilter' tube, but does 15kHz and 31kHz

Orion

Maybe a Toshiba 14AF43 if free, not 14AF44 (2004)
[RGB] 13A21 13A22 19A21
[RGB] 14AF43 14AF44 14AF45 14AF46
[RGB] 20AF43 20AF44 20AF45 20AF46
[RGB] 24AF43 24AF44 24AF45 24AF46
[RGB] 27AF43 27AF44 27AF45 27AF46 27AFX54
[RGB] CF19G32
[noRGB] 13A23 13A24 13A25 20A43
2001 all 24" and under, and curved 27" went Orion
2003 all 27" and under went Orion
2004 Orion was cutting costs (worse)
2005 everything was Orion (including HD)
Orion requires remote to change input, only 4 button TVs
Some Orion have VM (Velocity Modulation, sharpening), disable via disconnecting cable from neck board
Orion black crush can be disabled in service menu
[CRT Database] 14AF43 - One of a few 14" with component

________________________________________________________________________________________________________________________________

Zenith

[noRGB] C27J28B

________________________________

[CSS] Test Area

Heading Two

Bold text
Hyperlink
Italic text
Normal text
Monospace
```
Preformatted text
```

Image Caption

________________________________

[Truck]
Life of the 1999 Manual Cummins Diesel Dodge 3500 Dually ...

TODO

[ ] Replace diffs
[ ] Rear LSD is starting to go, figure out replacement diffs
[ ] Fix passenger window mechanism (catching)
[ ] Have shop install clutch and flywheel
[ ] Wipe out inside of truck
[ ] Switch to undertank diesel send, fix fuel gauge send
[ ] Replace filters on lift pump
[ ] Track down interior bubbling sound (heater core maybe)
---
[x] Vacuum/clean inside
[x] Fill up tires
[x] Fix inaccessible air stem
[x] Take to shop
[x] What motor: "1998.5-2002 trucks , have the 5.9L Cummins engine with 24 valves"
[x] What gearbox: "NV4500 5-Speed Manual"
[x] How much power does it make? (was 256/460 hp/lb-ft, estimated 400/880 with quad+injectors)
[x] Order clutch: NMU70279-04-5SCE - Ceramic single disc from Valair
[x] How's ceramic? ... Youtuber says it's fine towing
[x] Replace passenger window mechanism

________________________________

[Jeep]
Life of the 2016 Jeep Unlimited Rubicon ...

TODO

[ ] Replace the tires.
[ ] Get 2016 Willys ECU, because some of those didn't have electronic key.
[ ] Fix front turn bulb.
[ ] Root-cause the electrical problem.
[ ] Remove remaining fuses that are not required for basic functionality.
[ ] Battery died again, remove rest of non-necessary fuses.
[ ] Cut out remaining unused old seat belt retractors, star bolt not able to open.
[ ] Cut off metal tabs for front fenders.
[ ] Look into front bumper with winch?
[ ] Fix rust and paint.
[ ] Remove body fender (requires new attachment for hood latch, or no hood).
[ ] Cut off front bumper excess (requires clean and paint cut edges).
[ ] Remove wheel speed sensors (already have fuse out, no more MPH reading).
[ ] Remove parking brake lines to back wheels.
[ ] Tube the door area, install mesh to avoid de-arm on rollover.
[ ] Metalcloak 3.5" lift?
---
[x] Order Rust Kutter and SteelIt for metal repair (and acetone).
[x] Order new tires.
[x] Charging battery (trying to save it).
[x] 2023 change all fluids (trans, diff, gearbox, etc).
[x] Starter died, replace starter, hopefully that fixes electrical problem (nope).
[x] Installed bypass for nanny (Searchers Rubicon Locker Override JK kit).
[x] Replaced rear seatbelts too, 3 Point Seat Belt - Non-Retractable, from SeatBeltsPlus.com.
[x] Sold OEM wheels and tires.

AC Delete and Rewire (2020)

Airbags, except those in the seats, have been removed. AC has been removed except for the compressor, due to not being able to find an idler pulley replacement. The console metal frame was cut out except for the driver's side. The console itself was cut out except for the part in front of the driver's side. All inside wiring was minimized, with all electrical tape removed, and rewrapped in a nice sleeve. At first I had gone too far in stripping out wiring, to the point where the jeep wouldn't start. Got an OBD2 scanner, only had a U110A code, which mapped to a steering angle sensor fault. Also had the security dot on the dash, so knew I had snipped out that bus by accident. Those both share the same bus, so found the wires (without a manual), then resoldered and shrink-wrapped in a correction. Jeep started up without any problem after that. Have a bunch of wiring left to do (the other side of the firewall, replace the giant terminal with something smaller, etc).

A Lot More Room For Groceries

Aero Wheel Install (2020)

The beadlock is different than other aluminum offroad wheels, there is no lip to center the tire, and I suspect the Patagonia tires have a thicker bead than dirt track racing tires. Mounting beadlock seemed strange at first, but after all the bolts had been torqued down, everything seems fine. There are fewer and larger bolts for the Aero wheels, but they have higher torque specs (30-35 ft-lbs). Easy but time consuming to mount the tire. Took about 3 psi to get the tire to seat past the safety ridge on the rim. Running 20 psi now for the street. Might adjust after I get enough miles to retorque the bolts.

Rear

Front

Front Inside

Possible Mistake (2020)

The 15"x8" wheels are rated for 3500 lbs circle track racing. JKU stock is a pig at around 4500 lbs. The 15"x10" wheels are rated for 4000 lbs. Probably should have gone with the 15"x10" wheels. Instead going to continue to lighten the jeep.

Aero Wheel Evaluation (2020)

Bought a single Aero 53 Series Wheel from Summit for evaluation. This is a 15"x8" with 3" of backspacing with a 5x5 bolt pattern for the JKU. Fits just fine without any grinding of the caliper for my 2016 JKU. Lots of clearance. HSLA steel and only 23 lbs for the wheel. Went and ordered the other 3, only $137/wheel from Summit.

Seat Covers (2020)

Went with Bartact Base Line Performance seat covers. Wanted something minimal that would offer enough protection to keep the Jeep doors-free for most of the year. Like the extra zipper pouch in the front, easy to put wallet and phone in there so I don't have to worry about it falling out of the shorts.

Tire Research For 15" Wheels (2020)

Turns out no affordable 37" tire for 15x9, so 35" is as big as this story goes!
General notes.

Proper Tire Pressure for the Trail - Only need C rating for street, 15" often better than larger wheels.

Federal Couragia M/T Mud-Terrain

$224 - Federal Couragia M/T Mud-Terrain Tire 35x12.50r15 C-ply - 65.6 lbs (35 psi)
Concerns about balancing reported by a few people.

General Grabber X3

$212 - 35x12.50r15 C-ply - 75 lbs (35 psi)

Ironman All Country M/T Tire - No 35" at 15"

Milestar Patagonia M/T Mud-Terrain Radial Tire

$194 - 35x12.50r15 - 64 lbs (?? psi)
Concerns about street wear on low psi (OEM suggesting high psi required for crowning so center lugs pick up load).
Ran a set of these, lasted almost 3 years, good enough!

Some Wheel Research (2020)

Notes.

Allied Racing Wheels Rock-A-Thon Steel Beadlock Competition Wheels
List of 15" on JK
Using Aero 53 beadlock
$195 - Rock 8 Steel Beadlock Wheels (and others) - 15x8 - 3.75" backspacing - (5x5 bolt would be custom item)
$225 - Gatekeeper Beadlock - 15x8 - 3.5" or 4.5" backspacing (was currently unavailable)
$250 - Gatekeeper Beadlock - 17x9 - 3.5" or 4.5" backspacing
$200 - Sidetracked Steel Beadlock - 17x9 - 3.5" backspacing - 5x5 lug
$169 - Allied Rock-a-Thon BeadLock Wheels - 15x8 - 3.75" to 4" backspacing - (5x5 bolt would be custom)
$128 - Bassett Racing D-Hole IMCA Black Powdercoated Wheels 58D54ILK - 15x8 - 4" backspacing - 5x5 bolt pattern (18 bolts)

________________________________

[Auto]

@ Alberto Big Boost - Florida local, shop
@ Rich Rebuilds - Doing a hayabusa conversion
@ Steve Morris - SML/SMX custom engines
@ Super Fast Matt - Sportbike engined things
@ Taylor Ray - Florida local, C6 corvette build, etc
@ Wesley Kagan
800WHP All Motor Z06 8200RPM Encounters ZX14 SuperBike on Highway!(It's Stupid Fast) - 427 cu LS7 14:1 compression. Frankenstein cam and heads. Motor done by KK Performance
Beal Racing Engines - Low price 632 BBC
Blue Print Engines - High volume 'street' engine supplier
Dobbertin Performance - Makes driveshaft yoke adapter for Corvette diff
Ernie Brink - Claims to have fixes for rotary engine issues.
Fasterproms Old C6, Now With Sequential Gearbox - Oviedo backroads? Xineering
Holden Rodeo Track Monster
Jerico C5 Corvette Transaxle
Lingenfelter - They have a 440 LS NA (not listed) with 800 hp (8k RPM, hydraulic roller, 14:1)
Reher Morrison - Racing engines
Steve Schmidt Racing - Engines
Transaxle Engineering

________________________________

[FL] Orlando Area

Beach - Playalinda
Biking - Alafia River State Park (2 hour drive)
Biking - MTB List
Bioluminescent Kayak - bioluminescencetours.com
Bioluminescent Kayak - www.bkadventure.com/florida-bioluminescent-kayaking
Bioluminescent Kayak - getupandgokayaking.com/bioluminescent-kayaking
Drive - Biolab Road (and Black Point Wildlife Drive)
Drive - Lake Apopka Wildlife Drive
Drive - List
Birds/Hiking - Cruickshank Trail (Off Black Point Wildlife Drive)
Eats - Boca
Eats - Farm & Haus (For the Burrito)
Eats - Orlando Meats - Unfortunately this didn't survive COVID
Farmers Market - Lake Eola (Saturday morning)
Farmers Market - Oviedo (1st Saturday of Month 8am-1pm)
Farmers Market - Winter Garden (Saturday 9am-2pm)
Farmers Market - Winter Park (Saturday 7am-1pm)
Hiking - Bear Creek Nature Trail (Small)
Hiking - Econ River Wilderness
Hiking - Spring Hammock Preserve
Hiking - Florida Trail Prairie Lakes Loop (Kenansville, South)
Hiking - Florida Trail Tosohatchee (Christmas 10 Miles, Maybe Some Wading)
Hiking - Florida Trail Croom River (West 6.3 Miles)
Hiking - Tibet-Butler Nature Preserve (3.6 Miles)
Indoor Climbing - www.aiguille.com
Jetski - Extreme Jetski of Orlando (Not on Sunday)
Karting - Orlando Kart Center
Kayak - Cape Canaveral National Seashore
Kayak - Crystal River
Kayak - Lake County Blueways
Kayak - Lake Mills
Kayak - Lake Toho
Kayak - Shingle Creek
Kayak - Wekiwa Springs
Misc - Nona Adventure Park
Mountain Biking - UCF Arboretum Natural Lands
Shows - Blue Man Group

________________________________

[GAME] Inventory and Links
ADVANCE - DON'T HAVE

AMIGA - DON'T HAVE

AMSTRAD CPC - DON'T HAVE

Vespertino (in dev)

ARCADE - DON'T HAVE

ATARI 7800 - DON'T HAVE

Rikki & Vikki (2019)

C64 - DON'T HAVE

FANTASY (PC) - DON'T HAVE

MEGADRIVE

1000 in 1
Aladdin (1000 in 1)
Alien Soldier (1000 in 1)
Adventures of Batman and Robin (1000 in 1)
Astebros - Still in Development
Comix Zone (1000 in 1)
Contra Hard Corps (1000 in 1)
Dynamite Headdy (1000 in 1)
Earthworm Jim (1000 in 1)
Earthworm Jim 2 (1000 in 1)
Gunstar Heroes (1000 in 1)
Mega Turrican (1000 in 1)
Musha (1000 in 1)
Panorama Cotton (1000 in 1)
Phantom Gear - Still in Development
Ranger X (1000 in 1)
Revenge of Shinobi [PAL] (1000 in 1)
Ristar (1000 in 1)
Sonic (1000 in 1)
Sonic 2
Sonic 3 (1000 in 1)
Sonic and Knuckles (1000 in 1)
Streets of Rage 2 (1000 in 1)
Strider (1000 in 1)
Thunderforce 3 (1000 in 1)
Thunderforce 4 (1000 in 1)
Turtles the Hyperstone Heist (1000 in 1)
Xeno Crisis
ZPF - Still in Development

NDS - DON'T HAVE

Contra 4

NEOGEO MVS + SUPERGUN

161 in 1
Metal Slug - Forsale/Trade
Metal Slug 2 - Forsale/Trade

NES

1000 in 1
Alwas Awakening - Don't Have
Batman
Batman Return of the Joker (1000 in 1)
Castlevania 3 (1000 in 1)
Kirbys Adventure (1000 in 1)
Malasombra - In Dev
Metal Storm (1000 in 1)
Micro Mages
Sky Destroyer (1000 in 1)
Snake Rattle n Roll
Vice Project Doom - Don't Have

Battle Axe - Don't Have
Syndicate (1993) - Don't Have

PGM

Bee Storm - Don't Have
Demon Front - Don't Have
Do Don Pachi - Daioujou - Don't Have
Espgaluda - Don't Have
Ketsui - Don't Have
Knights of Valour - Don't Have
Knights of Valour 2 Plus - Don't Have
Martial Masters - Don't Have
Spectral vs Generation - Don't Have
The Gladiator - Don't Have

PS1

Adventures of Little Ralph - Don't Have
Adventures of Lomax
Alundra
Alundra 2
Castlevania: Symphony of the Night
Destruction Derby 2
DoDonPachi - Don't Have
Harmful Park - Don't Have
In the Hunt
Legend of Mana - Don't Have
Megaman X4
Megaman 8 - Don't Have
Rapid Reload (aka Gunners Heaven)
Street Fighter Alpha 3 - Don't Have
Tenchi wo Kurau II - Don't Have

PS2

Capcom Classics Collection Vol 1
Capcom Classics Collection Vol 2
God of War
God of War 2
Jak and Daxter: The Lost Frontier
Jak and Daxter: The Precursor Legacy
Jak 2
Jak 3
Metal Slug 4/5 - Forsale/Trade
Ratchet and Clank
Ratchet and Clank: Deadlocked
Ratchet and Clank: Going Commando
Ratchet and Clank: Size Matters
Ratchet and Clank: Up Your Arsenal
Sky Gunner - Don't Have

PS5

Blazing Chrome - Don't Have
Ratchet and Clank: Rift Apart
Teenage Mutant Ninja Turtles: Shredder's Revenge - Don't Have
Vengeful Guardian: Moonrider - Don't Have

SNES

900 in 1
Aladdin (900 in 1)
Axelay (900 in 1)
Legend of Zelda
The Lost Vikings (900 in 1)
The Magical Quest (900 in 1)
TMNT 4: Turtles in Time
Wild Guns - Don't Have

ZX SPECTRUM - DON'T HAVE

________________________________

[HW] FPGA Stuff
This is a very slow burn project to think through my ideal integer ALU vector hardware. Designed for Xilinx 7 series FPGAs (or later). The end goal is to build out a neo-vintage console with my kids some day. Current thoughts are a radical departure from how integer hardware was done in the past.

Others

Boards

Example Per DSP Budget

7A100T
240 DSPs, 11100 SLICEL, 4750 SLICEM (each SLICE is 4 LUTs and 8 flip-flops)
46 SLICEL/DSP
19 SLICEM/DSP

________________________________________________________________________________________________________________________________
[HW] Ownership
Once upon a time, when you purchased a computer, it booted into a console for a programming language, came with a manual with language and hardware documentation. The consumer owned the machine, was both supported and free to do whatever they wanted to do with it.

Since then others have stepped in to remove personal ownership of the machine and exert policy by restricting your access to the machine via software. OS vendors act as if they own your machine. Hardware vendors also act as if they own the hardware. Signed firmware is an obvious example of this. Another important example is the lack of exposure of the hardware in user-accessible APIs, or not being able to run your own binaries.

A large component of what enables this anti-consumer behavior is an industry which has evolved through amplification of complexity. The hardware got so needlessly complex that a single individual would have a hard time writing code to interface with the machine. Thus few can organize any type of counter option, except via things like herd mentality which often doesn't produce good results.

I'd personally like to see a return of a simple machine. Something accessible and inexpensive. One which boots into a console with a language. One which can be easily understood and fully interfaced by a single individual. Think of this as a "hyper-calculator".

________________________________________________________________________________________________________________________________
[HW] Multi-Tasking
Why share?

Switch with N slots (N is the number of simultaneous programs).
Human input mux {key, mouse, gamepad, etc}, where input is sent to only the machine with focus.
Display mux, where program with focus owns screen.
Display mux with picture in picture for cases where one wants "split screen".
Mux ownership visible to programs (so they know when to sleep, or not display, etc).
Audio mixer that physically mixes output from programs, with per-program knobs.
Input audio mux separate from human input mux.
Program is a cartridge which includes {processors, memory, and flash}.
No need for most "disk" access, program is persistent in flash.
Network input is broadcast to all programs, with physical button to enable/disable.
Network output is accumulated from all programs, with physical button to enable/disable.

Clipboard Doesn't Need An API

Fixed memory location for 'type' of data as integer, with high bit as 'cooperative lock'.
Fixed memory location and size for clipboard data.
Shared by all.
Nothing more is needed.

________________________________________________________________________________________________________________________________
[HW] 3-Bank Register File?
Thoughts on manual 3-banks of dRAM for a register file.

 - Created from distributed RAM (dRAM)
 - Either 192 or 96 register configuration
 - 192 registers
    - Each bank: 64 entry x 3-bit Simple Dual Port (SDP, aka one read, one write) in 4 LUTs
    - 18 SLICEM: 18-bit =  6 SLICEMs/bank
    - 24 SLICEM: 24-bit =  8 SLICEMs/bank
    - 30 SLICEM: 30-bit = 10 SLICEMs/bank
    - Likely too many SLICEMs
 - 96 registers
    - Each bank: 32 entry x 6-bit Simple Dual Port (SDP, aka one read, one write) in 4 LUTs
    -  9 SLICEM: 18-bit = 3 SLICEMs/bank
    - 12 SLICEM: 24-bit = 4 SLICEMs/bank
    - 15 SLICEM: 30-bit = 5 SLICEMs/bank 
    - 18 SLICEM: 36-bit = 6 SLICEMs/bank
 - Each bank direct mapped to DSP input {A,B,C} (except maybe C) for high clocks
 - Could have separate store control for all 3 banks
    - Complex to write assembly for
 - Not all values will need to be in all banks
    - And accumulator usage typically won't require stores
    - So when DSP to dRAM is not needed, loads or constants could be stored
 - Likewise stores (not to register file) can use DSP output (instead of fetching from register file)
    - The dRAM stores will need to be filtered through a CLB to mux options

________________________________________________________________________________________________________________________________
[HW] DSP MUX at C
When always using the MUL path, and using a simple 3 banked register file, there is a clock available to MUX inputs to C due to registering M. Perhaps a preferred way to do input modifications at high clocks.

 - C can have a CLB 
    (bRAM A)__[>A]__(*)__[>M]__(+)__[>P]
    (bRAM B)__[>B]__/          /
    (bRAM C)_______(CLB)_[>C]_/
 - Could MUX in constants/etc to C
    - To avoid needing to burn the register file

________________________________________________________________________________________________________________________________
[HW] Left-Justified
The idea is to always maximize precision by default. Fixed point numbers are all {-1.0 to <1.0}. This kind of thinking completely changes everything (as will be seen in later sections).

 - Fixed point number convention
    MSB.....LSB  CONVENTION
    ===========  ==========
    00000000000   0.0 zero (false)
    01111111111  <1.0 largest positive
    10000000000  -1.0 smallest negative (true) 
 - Traditional 'unsigned' {0.0 to 1.0} values go {0.0 to -1.0} instead
    - Only the signed side has ability to encode the 1.0
 - Any 'unsigned' {0.0 to <1.0} values stay positive {0.0 to <1.0}
 - Memory is {0.0 to <1.0} accessed (left aligned, instead of right aligned indexing)
    - So a 320 entry memory would be accessed {0.0 to 320.0/512.0}, where the 1.0 is the next power of 2 in size
 - Using {-0.5 to 0.5} for {-1.0 to 1.0} ranged data
    - Then extra 'p=p+p' (accumulator added to itself) to renormalize before final output
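
A minimal Python sketch of the convention above, assuming the 11-bit word from the table. The helper names (to_fixed, from_fixed, mem_index) are hypothetical and only for illustration, not part of any actual design.

```python
BITS = 11  # word width used in the convention table above


def to_fixed(x, bits=BITS):
    """Encode x in {-1.0 to <1.0} as a two's complement bit pattern."""
    scale = 1 << (bits - 1)
    v = int(x * scale)                  # truncates toward zero
    v = max(-scale, min(scale - 1, v))  # clamp: +1.0 is not encodable
    return v & ((1 << bits) - 1)


def from_fixed(p, bits=BITS):
    """Decode a bit pattern back to a float in {-1.0 to <1.0}."""
    scale = 1 << (bits - 1)
    if p & scale:                       # sign bit set
        p -= 1 << bits
    return p / scale


def mem_index(frac, log2_size):
    """Left-aligned addressing: a {0.0 to <1.0} fraction of a power-of-2 range."""
    return int(frac * (1 << log2_size))
```

For the 320 entry memory example, the last valid entry is mem_index(319/512, 9), and anything at or above 320/512 is past the end.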

Signed MACC

Designed to work with Xilinx FPGA DSPs
B: One lower bit-width MUL operand (LSBs zeroed, following left-justify convention)
A: One medium bit-width MUL operand (LSBs zeroed, following left-justify convention)
C: One register bit-width ADD operand
P: One high bit-width accumulator (larger than register file width)
Showing first a 32-bit proxy, then the actual 48-bit accumulator target configuration

 - 12-bit x 15-bit = 27-bit partial product sign extended to 32-bit (proxy)
    11111111 11111111 00000000 00000000
    fedcba98 76543210 fedcba98 76543210
    ========-========-========-========
    ........ ........ ....bbbb bbbbbbbb  12-bit input for MUL
    ........ ........ ....0111 11111111   2047
    ........ ........ ....1000 00000000  -2048 (representing -1.0)
    ........ ........ .aaaaaaa aaaaaaaa  15-bit input for MUL
    ........ ........ .1000000 00000000  -16384 (representing -1.0)
    .....mmm mmmmmmmm mmmmmmmm mmmmmmmm
    .....010 00000000 00000000 00000000  -16384 * -2048 =  33554432 (maximum positive output)
    .....110 00000000 00000000 00000000                   -33554432
    pppppppp pppppppp pppppppp pppppppp  32-bit accumulator
    sssssmmm mmmmmmmm mmmmmmmm mmmmmmmm  sign-extended multiply result
    ......bb bbbbbbbb bb...... ........  register part used for 12-bit input (MSBs)
    ......aa aaaaaaaa aaaaa... ........  register part used for 15-bit input (MSBs)

 - 18-bit x 25-bit = 43-bit partial product sign extended to 48-bit (target)
    22222222 22222222 11111111 11111111 00000000 00000000
    fedcba98 76543210 fedcba98 76543210 fedcba98 76543210
    ========-========-========-========-========-========
    ........ ........ ........ ......bb bbbbbbbb bbbbbbbb  18-bit input for MUL
    ........ ........ ........ ......01 11111111 11111111   131071
    ........ ........ ........ ......10 00000000 00000000  -131072 (representing -1.0)
    ........ ........ .......a aaaaaaaa aaaaaaaa aaaaaaaa  25-bit input for MUL
    ........ ........ .......1 00000000 00000000 00000000  -16777216 (representing -1.0)
    sssssmmm mmmmmmmm mmmmmmmm mmmmmmmm mmmmmmmm mmmmmmmm  sign-extended multiply result
    ......bb bbbbbbbb bbbbbbbb ........ ........ ........  part used for 18-bit input (MSBs)
    ......aa aaaaaaaa aaaaaaaa aaaaaaa. ........ ........  part used for 25-bit input (MSBs)
    ......cc cccccccc cccccccc cccccccc cccccc.. ........  part used for 32-bit register file (or ADD)
    ssssss.. ........ ........ ........ ........ ........  sign extension for C
    ........ ........ ........ ........ ......10 00000000  single one bit to setup for rounded accumulation
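
A rough Python model of the 48-bit target layout above, as a sketch only. It assumes 32-bit signed register values, takes the MSBs for A and B, places C at accumulator bits 41..10, and adds the single rounding bit at bit 9. The function names are hypothetical.

```python
def top_bits(reg32, n):
    """Keep the n MSBs of a signed 32-bit register value (arithmetic shift)."""
    return reg32 >> (32 - n)


def macc(a32, b32, c32):
    """Return the accumulator value P = A*B + C + rounding bit."""
    a = top_bits(a32, 25)  # 25-bit operand, -1.0 is -(1 << 24)
    b = top_bits(b32, 18)  # 18-bit operand, -1.0 is -(1 << 17)
    p = a * b              # 43-bit product, 1.0 lands at bit 41
    p += c32 << 10         # C occupies accumulator bits 41..10
    p += 1 << 9            # single one bit just below the register LSB
    return p


def to_register(p):
    """Read accumulator bits 41..10 back as a register value (truncating)."""
    return p >> 10
```

Note that -1.0 * -1.0 produces +1.0, which does not fit the 32-bit register but does fit in the accumulator because of the guard bits above bit 41.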

High Precision Shift Left

 - DSP has ability to feed result back into adder at high-precision
    - This can be done without any delay
    - Enables 'P+P = 2*P = P<<1' 
 - This provides a framework for mapping back to normalized multiplies,
    - Example if we know this won't overflow,
      Map 'x*3+x' to '(x*(3/4)+x*(1/4))<<2'
      Which takes 2 'P+P' clocks to renormalize

Rounding
Truncation moves towards the smaller value (towards negative infinity). Often a series of MADs feeding into an accumulator will start with an add operand of zero. To set up for rounding, that add operand can instead feed in a single 1 bit just below the LSB of the kept result (adding 0.5 before truncation). Showing non-normalized examples below.

 - Unsigned logic examples
     0.4 + 0.5 =  0.9 ->  0.0
     0.6 + 0.5 =  1.1 ->  1.0 
     1.4 + 0.5 =  1.9 ->  1.0
     1.6 + 0.5 =  2.1 ->  2.0 
 - Signed logic examples
    -0.4 + 0.5 =  0.1 ->  0.0
    -0.6 + 0.5 = -0.1 -> -1.0 
    -1.4 + 0.5 = -0.9 -> -1.0
    -1.6 + 0.5 = -1.1 -> -2.0
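
The same idea in integer form, as a small Python sketch. It assumes 4 discarded fractional bits (an arbitrary choice for illustration), so values are in sixteenths and 0.5 is the single bit just below the kept LSB.

```python
FRAC = 4  # number of discarded fractional bits (assumed for illustration)


def truncate(v):
    """Drop the fractional bits; arithmetic shift moves toward the smaller value."""
    return v >> FRAC


def round_half_up(v):
    """Add the one bit below the kept LSB (0.5), then truncate."""
    return (v + (1 << (FRAC - 1))) >> FRAC
```

Using 0.375 and 0.625 (exact in sixteenths) in place of 0.4 and 0.6, the results follow the same pattern as the tables above for both signs.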

________________________________________________________________________________________________________________________________
[HW] Negatives
The DSP48E1 guide uses a different way of documenting ALUMODE, this is my way.

 - For signed 16-bit use {0, -32768} as {0 to 1.0} convention
    - fedcba9876543210
      ================
      0000000000000000 ...  0
      0111111111111111 ...  32767 (maximum positive)
      1000000000000000 ... -32768 (minimum negative)
      1111111111111111 ... -1 
 - For bools use sign bit, so 16-bit is {0:=false,-32768:=true}
 - No negate, just 'neg(x) = not(x) + 1'
 - No double negate 'd = -a + -b', instead 'd = a + b' then use '-d' later
 - No subtract, instead NOT modifiers
    - a - b = a + not(b) + 1
    - a - b = not(not(a) + b) ... alternative more useful form
 - DSP48E1 when using Multiplier (instead of the A:B high precision add)
    - X options 
       -  P (recirculate past result)
       -  M (multiplier partial result)
       -  0 (zero)
    - Y options
       -  C (input)
       -  M (multiplier other partial result)
       - ~0 (all ones)
       -  0 (zero)
    - Z options
       -  P (recirculate past result)
       -  C (input)
       -  0 (zero)
    - Useful input configurations
       Z   XY
       =   ==
       0 + 0 (zero)
       0 + P (nop)
       0 + C (move)
       0 + M (multiply)
       C + M (multiply add)
       P + M (multiply accumulate)
       P + C (accumulate)
       P + P (high precision double result, aka shift left 1)         
    - ALUMODE
       - 0:     Z +(X+Y+CIN)  ... add
       - 1:   (~Z)+(X+Y+CIN)  ... -Z+(X+Y+CIN)-1 ... if CIN=1 then get -Z+(X+Y)
       - 2: ~(  Z +(X+Y+CIN)) ... not output, '-(Z+(X+Y+CIN))-1'
       - 3: ~((~Z)+(X+Y+CIN)) ... sub, 'Z-(X+Y+CIN)'
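
A quick Python check of the identities above, as a sketch. It models only the four ALUMODE forms listed here on a 48-bit wrap-around result; the function name alu and the masking are assumptions for illustration, not the DSP48E1 interface.

```python
MASK48 = (1 << 48) - 1


def alu(mode, z, x, y, cin=0):
    """Model the four ALUMODE forms listed above, wrapped to 48 bits."""
    s = x + y + cin
    if mode == 0:
        r = z + s            # add
    elif mode == 1:
        r = (~z) + s         # -Z + (X+Y+CIN) - 1
    elif mode == 2:
        r = ~(z + s)         # NOT of the sum
    elif mode == 3:
        r = ~((~z) + s)      # Z - (X+Y+CIN)
    else:
        raise ValueError("mode must be 0..3")
    return r & MASK48


def sub_via_not(a, b):
    """The 'a - b = not(not(a) + b)' form, with no subtractor."""
    return ~(~a + b)
```

Mode 3 is the same 'not(not(a) + b)' trick applied with Z as 'a' and X+Y+CIN as 'b'.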

Example: Parabolic Sqrt (Left-Justified Logic)

 - Parabolic sqrt(x) estimation
    - Using {0 to MIN_NEGATIVE} as {0.0 to -1.0} convention
    - Result using positives would be '2*x-x*x'
    - But want negatives so '2*x+x*x'
    - Highest precision
       - p=x
       - p+=p ..... Have 5-bit guard so this won't overflow
       - p+=x*x ... This gets 18-bit * 25-bits of precision for multiply
    - Fast, lower precision
       - p=x-(-1.0*x) ... The 'x' in the multiply is 25-bits of precision 
       - p+=x*x ......... This gets 18-bit * 25-bits of precision for multiply
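
A float-only Python sketch of the sign convention in this example, not of the fixed-point hardware. It follows the 'p=x, p+=p, p+=x*x' accumulator order from the highest precision form; both function names are hypothetical.

```python
def parabolic_pos(u):
    """Positive convention: estimate sqrt(u) for u in {0.0 to 1.0}."""
    return 2.0 * u - u * u


def parabolic_neg(x):
    """Negative convention: x in {0.0 to -1.0}, same accumulator order as above."""
    p = x        # p=x
    p += p       # p+=p
    p += x * x   # p+=x*x
    return p
```

The estimate is exact only at the endpoints; for example an input of 0.25 gives 0.4375 where sqrt gives 0.5.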

Example: Simple Filter Kernel (Left-Justified Logic)

 - Could replace a gaussian in some cases
    - Using {0 to MIN_NEGATIVE} as {0.0 to -1.0} convention
    - Result using positives would be '(1-x*x)^2'
    - But want negatives so '0-(-1+x*x)^2'
    - Highest precision
       - s=p=-1+(x*x) ... This gets 18-bit * 25-bits of precision for multiply, need high latency path through store
       - p=0-(s*s) ...... Accumulator 'P' doesn't have direct path into multiplier, so must load

________________________________________________________________________________________________________________________________
[HW] Divide
Strategy for doing divides in a left-justified normalized world is a critical foundation for many things.

Divide as in doing 'y=x/a'?
Where 'x' and 'a' are {-1.0 to <1.0} ranged
And typically 'a' would be {-1.0 to 0.0} ranged
Instant problem with 'y=x/a' in this case, as dividing by 'abs(a)<=1.0' always grows 'abs(y)', easily beyond the representable {1.0}
Example, if 'x=-1.0' then 'abs(y)' is '1/a' so 'a=1/256' for example would yield 'abs(y)=256'
So really want to do 'y=x/(a*s)' where 's' is the expected maximum useful value of 1/a
Sets the effective minimum value of 'a' before the output gets clamped
Anything 'a<1/s' would result in 'abs(y)=1.0' clamped result

Iterative Solution

This in the end will get expressed as a binary search for the solution 'y'
Which implies computing the inverse to test the solution: 'x==y*a*s'
Since the HW cannot express 's>1.0' values this gets converted to 'x*(1/s)==y*a' where 's>1.0'
Note since 'x' is an input, the 'x*(1/s)' gets factored out of the search iteration
The search is thus for 't==y*a' where 't=x*(1/s)'
Note this has almost the same form as a binary search for 'sqrt(x)' which would be 't==y*y' test with 't=x'
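
A minimal Python sketch of the search as described so far. To keep it readable it assumes unsigned {0.0 to <1.0} inputs rather than the negative convention, 16 fractional bits, and a power-of-2 's' so that 'x*(1/s)' is a shift. All names and sizes are assumptions for illustration.

```python
N = 16  # fractional bits (assumed word size for illustration)


def divide(x, a, log2_s):
    """Approximate y = x / (a * s) with s = 2**log2_s, clamped just under 1.0.

    x and a are integers in [0, 2**N) representing {0.0 to <1.0}.
    """
    t = x >> log2_s                  # t = x*(1/s), factored out of the loop
    target = t << N                  # align t with the 2N-bit product y*a
    y = 0
    for bit in reversed(range(N)):   # binary search, MSB first
        trial = y | (1 << bit)
        if trial * a <= target:      # test the inverse: is y*a still <= t?
            y = trial
    return y
```

With this test direction, a zero divisor keeps every bit and returns the largest encodable value, which is the clamped result described above.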

... WORK IN PROGRESS ...

Making /0.0 Result in -1.0

The 'a' would be {-1.0 to 0.0} ranged, and don't want a sign flip on the 'x/a' based inputs
So this is really 'y=x/(-s*a)' or 'x=-s*a*y' or 'x*(-1/s)==y*a' for testing
Actually want 'x*(1/s)==-(y*a)' instead, so the CMP that was a SUB becomes an ADD
Since this is a signed normalized machine, -1.0 functions as INF
?????????????????????????????????????????????????????????????????????????????????????????????
Start the search at 'y=-1.0' and conditionally move towards 'y=0.0'
The test to advance 'y' at the interval will be 'x*(1/s)>=-(y*a)'
Or more specifically '(x*(1/s))+(y*a)>=0'

What About RCP
Same logic, just set 'x=-1.0' and pick an acceptable scaling factor for 's'.

________________________________________________________________________________________________________________________________
[HW] Variable Bit-Width MEM
Word, Half, Byte, and Nibble
Simplest form of compression: variable bit-width loads.

Variable bit width loads are left aligned (new) instead of right aligned (traditional)
Has an addressing quirk: use highest possible address for a given size (to avoid extra LUT driving MUX)
Requires 8 SLICEs total (for one 32-bit input)
Requires a one clock pipeline stage
Requires 3-bit mask (drives inputs to MUX) from decoded instruction for sizing
Could be super important hardware, because FPGAs are highly memory constrained devices!

 - Load permutations, where '.' is a zero
    11111111111111110000000000000000  ADR
    fedcba9876543210fedcba9876543210  hbn
    ================================  ===
    vutsrqponmlkjihgfedcba9876543210  111
    vutsrqponmlkjihg................  111
    vutsrqpo........................  111
    vuts............................  111
    rqpo............................  110
    nmlkjihg........................  101
    nmlk............................  101
    jihg............................  100
    fedcba9876543210................  011
    fedcba98........................  011
    fedc............................  011
    ba98............................  010
    76543210........................  001   
    7654............................  001
    3210............................  000
    ================================
    xxxx............................   4x 8:1 MUX (8 LUTs = 2 SLICEs)
                                          hbn - address bits (direct map to mux)
    ....xxxx........................   4x 8:1 MUX (8 LUTs = 2 SLICEs)
                                          hb. - address bits
                                          ..0 - 4-bit nibble (output zero)
                                          ..1 - not 4-bit nibble
    ........xxxxxxxx................   8x 4:1 MUX (8 LUTs = 2 SLICEs)
                                          h. - address bit
                                          .0 - 4-bit nibble or 8-bit byte (output zero)
                                          .1 - 16-bit half or 32-bit word
    ................xxxxxxxxxxxxxxxx  16x 2:1 MUX (8 LUTs = 2 SLICEs, using 5 inputs to 2 output LUTs)
                                          0 - not 32-bit word (output zero)
                                          1 - 32-bit word
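
A behavioral Python sketch of the load table above (not the MUX structure). It assumes 'hbn' is a 3-bit nibble address within the 32-bit word and that 'size' is one of 4, 8, 16, 32; the function name and the size encoding are hypothetical.

```python
LOG2_NIBBLES = {4: 0, 8: 1, 16: 2, 32: 3}  # log2 of nibbles per element


def load_left(word, size, hbn):
    """Fetch a size-bit element and left-align it in a 32-bit result."""
    idx = hbn >> LOG2_NIBBLES[size]               # element index within the word
    field = (word >> (idx * size)) & ((1 << size) - 1)
    return field << (32 - size)                   # zeros fill the LSBs
```

Because the low address bits are ignored for larger sizes, using the highest possible address (for example 011 for the low half) gives the same element as any other address inside it.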

 - Store permutations, where '.' is a zero (ignored, because using byte write mask)
    11111111111111110000000000000000  ADR
    fedcba9876543210fedcba9876543210  hb   byte write mask
    ================================  ==   ===============
    vutsrqponmlkjihgfedcba9876543210  11   1111
    vutsrqponmlkjihg................  11   1100
    vutsrqpo........................  11   1000
    ........vutsrqpo................  10   0100
    ................vutsrqponmlkjihg  01   0011
    ................vutsrqpo........  01   0010
    ........................vutsrqpo  00   0001
    ================================
    xxxxxxxx........................  Direct map (no MUX)
    ........xxxxxxxx................  8x 2:1 MUX (4 LUTs = 1 SLICEs, using 5 inputs to 2 output LUTs)
                                         h - address bit
    ................xxxxxxxx........  8x 2:1 MUX (4 LUTs = 1 SLICEs, using 5 inputs to 2 output LUTs)
                                         h - address bit
    ........................xxxxxxxx  8x 4:1 MUX (8 LUTs = 2 SLICEs)
                                         hb - address bits
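
The store side can be sketched the same way in Python. It assumes an unsigned 32-bit left-justified source value, 'hb' as a 2-bit byte address, and 'size' of 8, 16, or 32; it returns the data word plus the 4-bit byte write mask. The function name is hypothetical.

```python
def store_left(value, size, hb):
    """Place the top size bits of value at the addressed element.

    Returns (data, byte_write_mask); bits outside the mask are don't-care.
    """
    nbytes = size // 8
    idx = hb // nbytes                              # element index within the word
    field = (value >> (32 - size)) & ((1 << size) - 1)
    data = field << (idx * size)
    mask = ((1 << nbytes) - 1) << (idx * nbytes)
    return data, mask
```

The returned masks match the byte write mask column in the table above.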

________________________________________________________________________________________________________________________________
[HW] XOR Offsetting
Often don't want to burn an adder for 'base+offset', doing XOR instead could be useful.

OFF   000 001 010 011 100 101 110 111        0 1 2 3 4 5 6 7 
BAS   --- --- --- --- --- --- --- ---        - - - - - - - - 
000 | 000 001 010 011 100 101 110 111    0 | 0 1 2 3 4 5 6 7  ---> zero BAS works like ADD
001 | 001 000 011 010 101 100 111 110    1 | 1 0 3 2 5 4 7 6  ---
010 | 010 011 000 001 110 111 100 101    2 | 2 3 0 1 6 7 4 5   ^
011 | 011 010 001 000 111 110 101 100 -> 3 | 3 2 1 0 7 6 5 4   |   the rest provide various reordering patterns
100 | 100 101 110 111 000 001 010 011    4 | 4 5 6 7 0 1 2 3   |
101 | 101 100 111 110 001 000 011 010    5 | 5 4 7 6 1 0 3 2   v
110 | 110 111 100 101 010 011 000 001    6 | 6 7 4 5 2 3 0 1  ---
111 | 111 110 101 100 011 010 001 000    7 | 7 6 5 4 3 2 1 0  ---> ~0 BAS inverts the order of OFF
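
The table can be regenerated with a few lines of Python, as a sketch. The function name is hypothetical.

```python
def xor_table(bits=3):
    """Row = BAS, column = OFF, entry = BAS xor OFF."""
    n = 1 << bits
    return [[bas ^ off for off in range(n)] for bas in range(n)]


def print_table(bits=3):
    """Print the decimal form of the table, one BAS per row."""
    for bas, row in enumerate(xor_table(bits)):
        print(bas, "|", " ".join(str(v) for v in row))
```

Every row is a permutation of the offsets, and whenever BAS and OFF share no set bits (for example a BAS aligned to a power of 2 larger than any OFF) the XOR equals the ADD.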

________________________________________________________________________________________________________________________________
[HW] Port Conflicts

Reusing the ALU P accumulator as one or more arguments leaves 1 or more register file read ports open for other usage.
The dRAMs (for the register file) have both a regular output and a registered output, and the register has a write enable. So in theory, if there is a post-read pipeline stage (such as nibble/byte/half/word extraction) reusing the registered output, it might be possible to use a non-registered read, although this might result in lower peak clocks.
A direct operand mapped manual multi-port register file (via separate dRAMs) provides multiple store ports. If not all are needed, they could be open for other usage, as a result typically does not need to be reused in all operand slots. Note the address lines are also free to be changed.

________________________________________________________________________________________________________________________________
[HW] Instruction Palette
Core Challenges

When ALU density gets high, also need large programs to do interesting stuff (see GPU trends)
Don't want to burn much RAM for instructions, want some forms of compression
Latency mitigation often involves loop unrolling
Sometimes it will be same set of instructions with a different window of registers
Sometimes it will be same set of instructions with different constants (compute divide or sqrt, etc)

Suggests an instruction palette would be useful. Instruction provides a pointer into a palette of instruction data. Can thus use the rest of the bits for constants. Or alternatively reference constants instead of instruction data.

Register Operand Compression?

 - Examples
    20-bits : 5-bit x 4 {P,A,B,C}
    20-bits : 4-bit (16-reg window) x 4 {P,A,B,C} = 16-bit + 4-bit offset (2 register granularity)
    19-bits : 4-bit (16-reg window) x 4 {P,A,B,C} = 16-bit + 3-bit offset (4 register granularity)
    18-bits : 4-bit (16-reg window) x 4 {P,A,B,C} = 16-bit + 2-bit offset (8 register granularity)
    16-bits : 3-bit ( 8-reg window) x 4 {P,A,B,C} = 12-bit + 4-bit offset (2 register granularity)
    15-bits : 3-bit ( 8-reg window) x 4 {P,A,B,C} = 12-bit + 3-bit offset (4 register granularity)

Needs more thought ...

________________________________________________________________________________________________________________________________
[HW] Saturating Integer
Likely too complex for FPGA.

 - Working example for 16-bit machine
 - DSP output {MSB guard bits, 16-bit output, discard bits LSB}
 - Standard signed saturation (32 terms of overflow guard area)
                guard
                |||||fedcba9876543210
                ======---------------
     -1048576 = 100000000000000000000 - Lowest negative before saturate fails
       -32769 = 111110111111111111111 - Underflow
       -32768 = 111111000000000000000
       -32767 = 111111000000000000001
                ======---------------
        32766 = 000000111111111111110
        32767 = 000000111111111111111
        32768 = 000001000000000000000 - Overflow
      1048575 = 011111111111111111111 - Highest positive before saturate fails
                ======---------------
 - 5-bits guard + 1-bit of MSB from 16-bit result region
    - 6:1 LUT provides 'out-of-bounds' detection
 - Requires 8 5:2 LUTs to saturate 16-bit output
    - {2-bits output, 1-bit out-of-bounds, 1-bit saturate enable, 1-bit MSB of guard} inputs to LUT
 - This is thus 2 levels of LUT deep
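
A behavioral Python sketch of the saturate above, assuming the 5 guard bits and 16-bit output of the working example. The out-of-bounds test looks at the same six bits (5 guard plus the output MSB); names are hypothetical.

```python
GUARD = 5   # guard bits above the output
OUT = 16    # output width


def saturate(v):
    """Clamp a signed (GUARD+OUT)-bit value to the signed OUT-bit range."""
    top = v >> (OUT - 1)       # the 5 guard bits plus the output MSB
    if top in (0, -1):         # all six bits agree: in bounds
        return v
    if v < 0:                  # sign comes from the MSB of the guard
        return -(1 << (OUT - 1))
    return (1 << (OUT - 1)) - 1
```

Values outside the 21-bit range are beyond the guard area, which is the 'saturate fails' case in the table.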

________________________________________________________________________________________________________________________________
[HW] PIMD
Pipelined Instruction Multiple Data?
Network of cores, where the instruction flows through the network (PIMD), instead of being applied to the network at the same time (SIMD). In the most simple network, a line, reductions become linear time (vs log time with SIMD), but with high latency from start to finish.

First problem with PIMD is that ALUs are also pipelined, so would need some way to manage that delay and still have useful execution. Don't want to use memory at each ALU to buffer a series of instructions. One alternative is to replay the instruction locally, but run through a series of registers. So reductions are always vectorized (aka multi-component).

Not well thought out ...

Big Program Small Memory

As ALU density increases, naturally the quantity of local memories increases, and the size of those memories decreases. It becomes possible to run a unique program per node only if the program is tiny. There would be a desire to interleave the program across the memories, to enable large programs to be executed. This also has a secondary goal, to amortize memory access for the program across all the memories. Thus if the machine had N nodes, each node would only fetch an instruction once every N clocks during execution.

Starting with the simplest of networks, a directional torus which is interleaved so there is no long return path. Worst case path length is 2x the node spacing. Showing a simplified example of an 8 node machine in the horizontal axis.

 ,-------. ,-------. ,-------. ,--.
a    h    b    g    c    f    d   e
 `--' `-------' `-------' `-------'

Which looks like this logically.

a -> b -> c -> d -> e -> f -> g -> h
^                                  |
`----------------------------------'

De-pipelined start example on clock 0. Steady state pipelined execution example on clock 8. Instruction numbers are in octal. Each node pulls the instruction to execute from the instruction previously executed by the logically left neighbor. With the exception that node 'A' pulls its instruction from the prior instruction latch value of node 'H' (its left neighbor). Each clock the nodes read the instruction latch value from the logically left neighbor. And the latch value is updated on the 8th clock cycle, taking a read port access to the node's tiny local memory. Note for each line of 8 instructions the instruction stream is actually backwards.

         INSTRUCTION LATCH          INSTRUCTION EXECUTE
clk    A  B  C  D  E  F  G  H     A  B  C  D  E  F  G  H
===    == == == == == == == ==    == == == == == == == ==
  0    07 06 05 04 03 02 01 00                               <-- 8 instructions latched every 8 clocks
  1       07 06 05 04 03 02 01    00
  2          07 06 05 04 03 02    01 00
  3             07 06 05 04 03    02 01 00
  4                07 06 05 04    03 02 01 00
  5                   07 06 05    04 03 02 01 00
  6                      07 06    05 04 03 02 01 00
  7                         07    06 05 04 03 02 01 00
  8    17 16 15 14 13 12 11 10    07 06 05 04 03 02 01 00    <-- 8 instructions latched every 8 clocks
  9       17 16 15 14 13 12 11    10 07 06 05 04 03 02 01
 10          17 16 15 14 13 12    11 10 07 06 05 04 03 02
 11             17 16 15 14 13    12 11 10 07 06 05 04 03
 ..              ...                        ...
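
A small Python simulation of the ring as described, as a sketch only. It assumes the latch values shift one node to the right each clock, a fresh line of instructions is latched (in reverse order) every n_nodes clocks, node 'A' executes what 'H' held in its latch on the prior clock, and every other node executes what its left neighbor executed on the prior clock. The clock column is decimal and instruction numbers print in octal. Function names are hypothetical.

```python
def simulate(n_nodes=8, clocks=12):
    """Return a list of (clk, latch, execute) rows; None means idle."""
    latch = [None] * n_nodes
    execute = [None] * n_nodes
    rows = []
    for clk in range(clocks):
        # A takes H's prior latch value, others take the left neighbor's prior instruction.
        next_execute = [latch[-1]] + execute[:-1]
        if clk % n_nodes == 0:
            # Latch a new line of instructions, stored backwards across the nodes.
            next_latch = [clk + n_nodes - 1 - i for i in range(n_nodes)]
        else:
            next_latch = [None] + latch[:-1]
        latch, execute = next_latch, next_execute
        rows.append((clk, latch, execute))
    return rows


def fmt(row):
    """Format one row like the table above (instruction numbers in octal)."""
    clk, latch, execute = row
    cell = lambda v: "  " if v is None else format(v, "02o")
    return f"{clk:3d}    " + " ".join(map(cell, latch)) + "    " + " ".join(map(cell, execute))
```

Printing fmt(row) for each row of simulate() gives the same staircase fill on clocks 1 to 7 and the full steady state line on clock 8.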

This extends to 2D directional torus quite easily. For an 8x8 node example, 64 new instructions would be latched on every 64th clock cycle. Every 8 clock cycles the instruction latch would be pulled from the vertical neighbor.

Noticed though that instruction decode gets expensive, probably don't want to replicate the decode, instead rather fanout the decoded data, so perhaps not useful ...