Interestingly, the two oddball ALU flags (PF and AF) live on in at least one modern architecture. Apple Silicon has an undocumented extension that calculates them and stores them in unused ARM flag bits. Having Rosetta compute them in software on every ALU op (or at least the ones where the translator can't prove the flag results are dead) would massively cut into its speed. And apparently Apple either had traces showing them in use, or enough fear that someone important still relied on them, to justify hardware support.
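For concreteness, here is roughly what "manually calculating those" means in software, a minimal C sketch rather than anything resembling Rosetta's actual code: PF is the even parity of the result's low byte, and AF is the carry out of bit 3. The names here are made up for illustration.

#include <stdint.h>

/* What an emulator has to do after an 8-bit ADD to recover the two
   "oddball" flags in software. Illustrative only. */
typedef struct {
    unsigned pf;  /* parity flag: set if the result's low byte has even parity */
    unsigned af;  /* adjust flag: carry out of bit 3, the old BCD helper       */
} OddFlags;

static OddFlags flags_after_add8(uint8_t a, uint8_t b)
{
    uint8_t result = (uint8_t)(a + b);
    OddFlags f;

    /* PF: fold the low byte down to its parity. */
    uint8_t p = result;
    p ^= p >> 4;
    p ^= p >> 2;
    p ^= p >> 1;
    f.pf = ~p & 1u;

    /* AF: the carry out of bit 3 falls out of XORing operands and result. */
    f.af = ((a ^ b ^ result) >> 4) & 1u;

    return f;
}

A few extra ops per emulated instruction doesn't sound like much, but it adds up when it has to happen on every ADD/SUB/INC/DEC whose flags the translator can't prove are dead.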
I'm a little bit surprised these flags are used in a way that a simple backward dataflow analysis wouldn't reveal. If the flags are only used within one function (that is, they only need to be preserved across local branches), then such an analysis should find the uses and the corresponding instructions that set them. If they are used across function boundaries (e.g. after a return, or passed as an argument), then it would be much harder to detect, so a conservative emulator would have to emulate the flags correctly across all calls and returns.
But do those uses actually exist? What use cases for these flags are there that aren't just straight-line code, or maybe a few simple branches? Is there code that relies on even weirder cases, like triggering a fault whose handler depends on correctly set AF/PF? Or did Apple just not want to deal with flow analysis in their JIT emulator and decide the hardware route was easier?
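Concretely, the kind of pass I have in mind is just a backward walk over each basic block, something like the sketch below. The types and names are made up for illustration; this is not anything Apple ships.

#include <stdbool.h>
#include <stddef.h>

/* Minimal backward flags-liveness over one basic block. */
enum { FLAG_PF = 1 << 0, FLAG_AF = 1 << 1 };   /* just the two oddballs */

typedef struct {
    unsigned flags_read;     /* flag bits this instruction consumes */
    unsigned flags_written;  /* flag bits this instruction defines  */
    bool needs_slow_flags;   /* filled in by the pass               */
} Insn;

/* An instruction only needs the slow, flag-computing translation if some
   later instruction (or the block exit) still reads a flag it writes. */
static void mark_slow_flags(Insn *block, size_t n, unsigned live_at_exit)
{
    /* live_at_exit is the conservative part: e.g. treat all flags as live
       if the block ends in a call, return, or indirect jump. */
    unsigned live = live_at_exit;
    for (size_t i = n; i-- > 0; ) {
        block[i].needs_slow_flags = (block[i].flags_written & live) != 0;
        live &= ~block[i].flags_written;
        live |= block[i].flags_read;
    }
}

With that, straight-line code whose flags are overwritten before anyone reads them never pays for PF/AF at all; the interesting question is how often real binaries force live_at_exit to stay conservative.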
Famous last words: "simple". What about sequences like:
pushfd
pop eax
and eax, ebx
Now you need to know what ebx is to know for certain that PF and AF are irrelevant. Which is very likely, but not guaranteed, and the halting problem rears its ugly head.
I don't know whether there are good heuristics there or even how common that really is, but my gut feeling is that it may still be easy to "accidentally" drop into the slow AF/PF-preserving case, which can hurt in performance-critical code.
On the other hand, calculating AF/PF in hardware in the first place seems, from my outside view at least, much simpler, and appears to prevent a great deal of potential headache.
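For what it's worth, one common software dodge here is lazy flag evaluation: remember the operands and result of the last flag-setting op, and only derive PF/AF when something like PUSHFD or DAA actually asks for them. A rough C sketch, with names of my own invention and no claim about what Rosetta does:

#include <stdint.h>

/* Lazy flags: the ALU op itself stays cheap; the cost of PF/AF moves to
   the (hopefully rare) instruction that reads them. */
typedef struct {
    uint32_t op1, op2, result;   /* last flag-setting operation */
} LazyFlags;

static void record_add(LazyFlags *lf, uint32_t a, uint32_t b)
{
    lf->op1 = a;
    lf->op2 = b;
    lf->result = a + b;          /* no flag computation here */
}

static unsigned materialize_af(const LazyFlags *lf)
{
    return ((lf->op1 ^ lf->op2 ^ lf->result) >> 4) & 1u;   /* carry out of bit 3 */
}

static unsigned materialize_pf(const LazyFlags *lf)
{
    uint8_t p = (uint8_t)lf->result;
    p ^= p >> 4; p ^= p >> 2; p ^= p >> 1;
    return ~p & 1u;              /* set on even parity of the low byte */
}

The catch is exactly the pushfd case above: once the flags escape into a general-purpose register or onto the stack, the emulator has to materialize them, and hot code that does this in a loop lands right back on the slow path.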
That's kind of what I did here: I "used" the bits on the stack by loading them into eax and ANDing them with whatever's in ebx. But yeah, the value could also be referenced more directly on the stack...
I've written a Prime minicomputer emulator, and it has these kinds of flags. In my experience, if you don't do exact hardware emulation, there will be some oddball program that doesn't work. If someone posted their program that Apple couldn't correctly emulate, it would cast doubt on their emulation in general, even if this oddball program is the only case in the world where it didn't work.
For Apple, I'm guessing that failing to emulate even an oddball program would be an unacceptable risk.
My guess is that there's an x86-ISA JIT, emulator, or threaded-code VM that itself reuses these flag bits (when it can guarantee a stretch of non-ALU code) to smuggle information between HLL ops. So Apple is in effect transparently transporting the smuggled data.
As I understand it, the 8086 has a separate address space for io.
As in, memory address $3f8 is just a RAM location and not the same thing as the base register of the COM1 serial port.
This is unlike 1980s home computers that used a 6502, where things like the keyboard interface appear to software as just another memory location.
Is there anything interesting related to how that is handled in hardware?
To the 8086 hardware, the "IO Space" is a single line[1] on the bus, essentially a 21st address bit. When that is set, the "memory-like" devices know to ignore the transaction and "io" devices know to decode the address.
At the software level, it's accessed through different instructions: instead of a MOV with a memory operand, you use IN or OUT. Notably, those instructions take only a 16-bit port address and no segment selector, so IO space is limited to 64 KB.
Also note that on the IBM PC, IO space was routinely truncated to just 10 bits. I honestly don't know whether that was ever standardized, but that's the way production devices and chipsets have always worked: the extra address lines simply weren't decoded, so higher ports generally aliased onto devices at the bottom of the space that ignored the top bits.
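To make the aliasing concrete, here is a toy C model of the separate spaces and the 10-bit decode; purely illustrative, not how any particular chipset is actually wired:

#include <stdint.h>

/* Separate address spaces: MOV goes to memory, IN/OUT go to a 64 KB
   I/O space addressed by a 16-bit port number with no segment. */
static uint8_t memory[1 << 20];      /* 1 MB, 20 address bits */
static uint8_t io_space[1 << 16];    /* 64 KB of ports        */

static uint8_t read_mem(uint32_t addr)            { return memory[addr & 0xFFFFF]; }
static uint8_t port_in(uint16_t port)             { return io_space[port]; }
static void    port_out(uint16_t port, uint8_t v) { io_space[port] = v; }

/* A PC-era card that only decodes address lines A0..A9 effectively sees
   every port modulo 0x400, so 0x7F8 lands on COM1's 0x3F8
   (0x7F8 & 0x3FF == 0x3F8). */
static uint16_t decode_10_bits(uint16_t port)     { return port & 0x3FF; }

So memory address 0x3F8 and port 0x3F8 are unrelated, and on a 10-bit card ports 0x3F8, 0x7F8, 0xBF8, ... all hit the same register.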
[1] Actually it's multiplexed in a group of 8 bus states addressed by three lines, because Intel was stingy about pins. But logically it's an extra address line. [This is the spot where I repeat my request to Ken to please do a blog post on the insane bus management on this device, which I'd dearly love to see.]
> please do a blog post on the insane bus management on this device
I'm working on it, but it's a difficult topic. The bus management circuitry is both very complicated (a bunch of flip flops making a complex state machine full of special cases) and lacking in general concepts. So it's hard to figure out how to make it comprehensible and interesting.
This reminds me of Chris Tarnovski criticizing some microcontroller designs because the buses run all over the die rather than being gated by a central multiplexer, with all that capacitance and all those logic inputs to drive limiting the maximum clock speed.
The 8086 inherited the separate I/O space from the Datapoint 2200, a "programmable terminal". Since the Datapoint 2200 had a custom TTL processor, it was convenient to build in instructions to manipulate its I/O hardware directly. It doesn't make as much sense for a general-purpose microprocessor, which is why the 6502 etc didn't follow that approach. (Separate I/O instructions make sense in something like the IBM System/360 mainframes, where they were executed by a separate channel controller and gave you more performance.)
As far as the implementation of I/O in the 8086 chip, I haven't come across anything particularly interesting yet. It's essentially the same bus state machine as memory, except it signals an I/O access rather than a memory access.
Yes, there's an output from the Group Decode ROM that indicates the instruction is an IN or OUT instruction. This causes the M/IO line to signal an I/O operation rather than a memory operation.
I think this is my favorite series of articles of yours so far. In part because, purely from the outside as a programmer, I'm very familiar with the 8086 and its successors, and so have always been curious about what's going on behind the curtain, beyond the usual coarse microarchitecture concepts. The level of detail you go into is perfect.
I guess many people here will feel similarly, given the 8086's role.
That chip is too complex for me. My cutoff is chips with one metal layer. I don't have a good process for removing one layer at a time if there are multiple metal layers. Also, Moore's law means that chips get exponentially more complex over time and my skills don't grow exponentially :-) Finally, I'm limited to about 1um features before microscope quality and the wavelength of light become a problem. Thus, the early 1980s is about as far as I can go.
https://bytecellar.com/2022/11/16/a-secret-apple-silicon-ext...