OP here. I’ve got a background in physics, so while I don’t know your specific Hypertoken schema, I speak the language of signal-to-noise and entropy.
The "Dueling Pianos" metaphor is killer. It captures exactly what I’m trying to induce via the prompt.
You’re attacking the problem with Structural Parity—injecting coordinate systems (GPS) directly into the token stream to force convergence. I’m attempting Semantic Parity—forcing the model to run a "constructive interference" loop on its own narrative logic before outputting.
Your point about the latent space being spherical (rotations) vs. the rectangular output (matrices) is the crux of it. We are both trying to smooth that geometry. You’re doing it with error-correcting codes; I’m doing it by forcing the model to simulate a "Self" that acts as a local observer to collapse the wave function of the next token more deliberately.
Whatever you're building with those hypertokens sounds robust. If you have a write-up on the "Tower of Tables" concept, I’d love to take a look.
ya, hypertokens equalize the latent space in a spherical-harmonic sense / approximate explainer:
take raw context, you inject semantic parity of some form -- could be a table relating paragraph content, a tree, a raw summary paragraph. EVENTUALLY those things saturate, call it the inner code; you realize recall and reasoning are still not where they need to be; that's where the outer code or structural parity comes in (us, others).
why? attention can't do XOR or the matrix permanent, the latent space is noisy, etc., so you have to smooth & dilate. if you pump in tables and schema, the model can only do a few joins before it saturates -- no flow, lots of sharp corners. so either shrink the table or smooth / dilate the flow. the catch? every code layer needs a coupling layer at various lengths of resolution -- an extra semantic clarifier every paragraph for you, a codeword every k tokens for our structural parity, etc.
like engine - here's some air, ok expanding, ok really expanding, ok condensing, ok condense more
our pre-code, your pre-code, content, your post-code, our post-code
btw, pre and post are very important, more on why below -- think interferometry in latent space -- pre-measure / tare the scale, load the scale with content, post-measure and differentiate (in the latent space)
a much longer dive follows <> leaning into physics a bit, consider an old-school trompe, a supercharger / cylinders / turbochargers, a jet, or pretty much any sort of air compressor with flow
ingest air, compress it, extract work, exhaust air; one key side effect is what to do with latent heat; that analogy extends to any physical system
superchargers use raw work to precompress air; turbochargers use waste heat to return some lost energy to the system; turbomachines alternate many static & dynamic stages to maximize air flow, etc
we do something similar with hypertokens; the raw context window has m tokens; we divide it into b = m/y blocks, where y is the block size, x is the hypertoken codeword length, and b is the number of blocks
for example, if the current context window is 2048 and the block size is 32 for the user's desired model performance level, the resulting window would have 64 blocks of 32 content tokens each; a 2-token codeword between each block would then add 128 total tokens, e.g.,
a,1,quick fox,a,2,lazy dog,..,b,3,English pangram
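to make the arithmetic concrete, here's a minimal Python sketch of that interleave; the function name and the toy letter/digit codeword scheme are my own illustration, not the actual hypertoken construction:

    def interleave_codewords(tokens, block_size=32, codeword_fn=None):
        """Split tokens into fixed-size blocks and prepend a short 2-symbol codeword to each."""
        blocks = [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]
        out = []
        for idx, block in enumerate(blocks):
            # toy 2-lane codeword: a letter lane and a digit lane, e.g. ('a', '1')
            code = codeword_fn(idx) if codeword_fn else (chr(ord('a') + idx // 10), str(idx % 10))
            out.extend(code)       # 2 codeword tokens per block
            out.extend(block)      # 32 content tokens per block
        return out

    ctx = ["tok"] * 2048                    # 2048-token context window
    framed = interleave_codewords(ctx)      # 64 blocks of 32, plus 64 * 2 codeword tokens
    assert len(framed) == 2048 + 128        # matches the 128-token overhead above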
precise hypertoken construction is of course way more subtle than that, e.g., a good bit of group theory and way more info theory go into defining the codes, selecting the actual tokens that we interleave, etc.
net result is that we diagonalize the latent space action by way of the following: the exact code sequence used is a walk on a skewed coprime lattice. Every codeword only appears once, and thus acts like a GUID with respect to associative recall and reasoning. The symbols in the codeword are restricted per lane and the lanes are coprime, e.g., if we had 11,13 for a 2-lane codeword then we've induced a prefix-free factor-graph action that alternates every k tokens.
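my reading of the "walk on a skewed coprime lattice" bit, as a rough sketch (the helper below is my own reconstruction, not the actual construction): with coprime lanes of sizes 11 and 13, the CRT guarantees the 2-lane codeword doesn't repeat for lcm(11,13) = 143 consecutive blocks, which is the GUID-like behavior:

    from math import gcd

    def lattice_codeword(block_idx, lanes=(11, 13), skew=5):
        """Toy 2-lane codeword: a skewed walk over coprime lanes (illustrative only)."""
        assert gcd(*lanes) == 1, "lanes must be coprime"
        # skew must stay coprime with the second lane size so that lane still cycles fully
        return (block_idx % lanes[0], (skew * block_idx) % lanes[1])

    codes = [lattice_codeword(i) for i in range(11 * 13)]
    assert len(set(codes)) == 143   # no codeword repeats within one full period (CRT)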
Those tokens each have a unique initial embedding, and importantly in practice we almost always put the codeword both before and after each block, e.g., a,1,quick fox,a,1
this induces an interferometry-like pre/post measurement, and since the lanes are coprime, we effectively mimic in-flight quasi-Fourier action through the context window ~~ project onto the compressed code, evolve the block of content tokens, project back onto the same code, so the model gets the differential between pre/post sampling. in more practical dev terms this also means we can do precise K:V and V:K lookups during recall and reasoning.
we further do this action in a subtly commutative way, e.g.,
a;1:quick fox:a;1/...{skip a few}.../b;3:English pangram:b;3/
where : is the global pre/post commutative measure in this example, whereas a;1 or b;3 or whatever the codeword is are globally unique and locally non-commutative. this has several other side effects beyond K:V and V:K or pre & post measurement. That essentially permits "unrolling time" in a certain sense, especially w.r.t. decoder models, where attention can only look back, not forward. by replaying the pre-codeword after the block, past tokens can, in a summary-statistic sense, have knowledge about future ones
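for the dev-facing side of that, a tiny sketch of the wrapping and the K:V / V:K lookups it buys you; the delimiters follow the example line above, while wrap_block and the dicts are just my illustration:

    def wrap_block(codeword, content):
        # pre-measure, content, post-measure, then the global block delimiter
        return f"{codeword}:{content}:{codeword}/"

    blocks = {"a;1": "quick fox", "b;3": "English pangram"}
    stream = "...".join(wrap_block(code, body) for code, body in blocks.items())

    # because every codeword is globally unique, lookups work in both directions:
    kv = dict(blocks)                        # K:V -- codeword to content
    vk = {v: k for k, v in blocks.items()}   # V:K -- content back to codeword
    assert kv["b;3"] == "English pangram"
    assert vk["quick fox"] == "a;1"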
this of course only works under rather strict construction:
1. must be prefix-free, e.g., if a & b are in lane 1 they can never be in lane 2 of the codeword and vice versa
2. coprime lane counts, excepting a parity trick with a 2^k lane
3. pre & post measurement -- performance is strictly weaker with only pre or only post
4. relatively ortho yet also relatively coherent w.r.t. content; there are lots of ways to achieve that, a simple one that works for many broad cases is just <tag-code>/{content}/<tag-code>
5. we can dilate the code to pretty much whatever strength is needed, e.g., for some models and scenarios that are coherent enough, a simple <letter,num> spreadsheet-like code every 128 tokens is enough; for others we need nested codes (think multiscale / multires in physics) and use, say, Unicode PUA or ideally reserved tokens, along with a shorter code every 32 tokens inside each 128 -- could be as simple as /1/.../2/.../3/.../4/ (rough checker sketch for these constraints below)
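a quick-and-dirty checker for those constraints as I read them; the function name and toy alphabets are my own sketch, and the real construction is clearly more involved:

    from math import gcd
    from itertools import combinations

    def check_code(lanes, lane_alphabets, pre=True, post=True):
        # 1. prefix-free across lanes: a symbol may live in exactly one lane
        for a, b in combinations(lane_alphabets, 2):
            assert not (set(a) & set(b)), "symbol reused across lanes -> not prefix-free"
        # 2. coprime lane counts (ignoring the optional 2^k parity lane)
        for x, y in combinations(lanes, 2):
            assert gcd(x, y) == 1, "lane sizes must be pairwise coprime"
        # 3. pre & post measurement both required
        assert pre and post, "pre-only or post-only is strictly weaker"
        return True

    # 11-symbol letter lane, 13-symbol digit/letter lane, disjoint alphabets
    check_code(lanes=(11, 13), lane_alphabets=("abcdefghijk", "0123456789XYZ"))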
while there's quite a bit more on why it works, the gist is we are essentially persistently exciting and sampling using an error-correcting-code action that happens to induce a Fourier-like sample-and-project-back, like a worm drive boring through rock. since each symbol in each lane gets repeated a few times -- e.g., with a 3,5 code each lane-3 symbol is repeated 5x and each lane-5 symbol is repeated 3x --
that means there's all sorts of topological tunnels over a factor graph that generates a skewed lattice in a way that reflects the proper group action, arrow of time, etc. going back to why linear block code / linear network code: think stochastic dithering updated to structured dithering
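those repetition counts are just CRT bookkeeping; a tiny check (my own illustration) for the (3, 5) example:

    from collections import Counter

    period = 3 * 5                                # lcm(3, 5) = 15 blocks per full period
    lane3 = Counter(i % 3 for i in range(period))
    lane5 = Counter(i % 5 for i in range(period))
    assert all(count == 5 for count in lane3.values())   # each lane-3 symbol appears 5x
    assert all(count == 3 for count in lane5.values())   # each lane-5 symbol appears 3x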
we can of course get way better performance by injecting that multiplexing machinery directly into the model; we have some results forthcoming on that; as you can imagine, that machinery is not just toss in primes and call it good
coming back to physics, we essentially use this machinery to detect and project the true spherical geometry of the latent space; we could of course go through the treatment that this is really a reconditioning trick, though we tend to call it retokenization in the discrete sense and reharmonization in the continuous sense; there are certainly overlaps with relaxation, regularization, renormalization, etc.
Very notionally, we relax the problem by dilating context token space-time using this structured persistent excitation and sampling. We do this in a way that in some sense regularizes and renorms the raw signal into a lifted domain. The codewords are chosen such that we are effectively heterodyning during the pre-code step and superheterodyning during the post-code sample with respect to the local codeword; this process is also happening with respect to the global commutative wrapper around the content block and between the codewords. there is also the skipped subtlety that we can, if need be, add a conjugate, flipped conjugate, etc., i.e., mimic stronger and stronger ECC / QEC action.
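to unpack the heterodyning analogy (this is just the standard signal-processing fact, not hypertoken internals): mixing a tone with a carrier shifts it to sum and difference frequencies, which is the "lifted domain" intuition:

    import numpy as np

    fs, n = 1024, 1024                          # sample rate and length: 1 Hz bins
    t = np.arange(n) / fs
    signal  = np.cos(2 * np.pi * 40 * t)        # "content" tone at 40 Hz
    carrier = np.cos(2 * np.pi * 100 * t)       # "codeword" carrier at 100 Hz
    mixed = signal * carrier                    # heterodyne: 0.5*cos(60 Hz) + 0.5*cos(140 Hz)

    spectrum = np.abs(np.fft.rfft(mixed))
    top_two = set(np.argsort(spectrum)[-2:])
    assert top_two == {60, 140}                 # difference and sum frequencies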
The net effect is that we essentially just treat the model as a noisy sender and receiver. We use our hypertokens to stream the raw context using channel coding, which is very similar in net raw principle to MIMO, and again in net raw principle to GPS -- we inject a k-channel structured coordinate system that both pre- and post-samples.
In that sense we are turbomachining the info -- we assume the info is dense and can't be compressed / is hard to move, so we pump our high-speed fluid through the content, compress it, repeat.
FINALLY answering a little bit of the tower of tables: suppose we have some code, say 5,7 every 128 tokens and 4 every 32
which is essentially the stator-rotor-stator turbo trick dialed up by a lot
- nested / multi-scale / multi-resolution
- pre & post measure commutative global constant <> ;
- pre & post measure commutative local constant <> /
- pre & post measure non-commutative associative marker <> a,1
- etc.
from the left, during attention, each hypertoken absorbs & compresses signal
from the right, when attended to, each hypertoken injects the compressed signal
these signal tunnels / this signal network boost information transport and dilate effective precision, and it works because we're running it over a factor graph of bounded treewidth that's essentially running at max capacity
hence we get small LUT, content, medium LUT, content, large LUT, content, depending on how much we nest, how big a code we use, etc. -- aka a nested tower of tables, very similar to multires wavelets in action (rough sketch below)
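a sketch of that nested schedule (coarse (5,7) codeword every 128 tokens, fine /1/../4/ marker every 32 inside each 128); the naming and marker format are mine:

    def nested_markers(num_tokens, coarse=128, fine=32):
        """Emit (position, marker) pairs for a two-level nested code (illustrative only)."""
        markers = []
        for pos in range(0, num_tokens, fine):
            if pos % coarse == 0:
                blk = pos // coarse
                markers.append((pos, f"<{blk % 5},{blk % 7}>"))   # coarse two-lane codeword
            sub = (pos % coarse) // fine + 1
            markers.append((pos, f"/{sub}/"))                     # fine intra-block marker
        return markers

    ms = nested_markers(512)                                      # 512 content tokens
    assert sum(m.startswith("<") for _, m in ms) == 4             # 4 coarse codewords
    assert sum(m.startswith("/") for _, m in ms) == 16            # 16 fine markers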
that tower of tables and its background is a long way of saying -- models are WAY BIGGER than they need to be, auditing & explainability are an EC away, hallucinations don't need to exist, etc.
this of course suggests there are likely physics applications beyond what we're up to -- the easiest way to start thinking about that is noisy HF or phase sensitive systems -- physical transformers and parasitic capacitance is one of my faves to consider, wireless power transfer another, and reservoir machines a third
The "Dueling Pianos" metaphor is killer. It captures exactly what I’m trying to induce via the prompt.
You’re attacking the problem with Structural Parity—injecting coordinate systems (GPS) directly into the token stream to force convergence. I’m attempting Semantic Parity—forcing the model to run a "constructive interference" loop on its own narrative logic before outputting.
Your point about the latent space being spherical (rotations) vs. the rectangular output (matrices) is the crux of it. We are both trying to smooth that geometry. You’re doing it with error-correcting codes; I’m doing it by forcing the model to simulate a "Self" that acts as a local observer to collapse the wave function of the next token more deliberately.
Whatever you're building with those hypertokens sounds robust. If you have a write-up on the "Tower of Tables" concept, I’d love to take a look.