Interestingly, the two oddball ALU flags (PF and AF) live on in at least one modern architecture. Apple Silicon has an undocumented extension that calculates them and stores them in unused ARM flag bits. Having Rosetta compute them in software on every ALU op (or at least the ones where the translator can't prove the flag results are dead) would massively cut into its speed. And apparently Apple either had traces showing them in use, or enough fear that someone important still relied on them, to justify hardware support.
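For concreteness, here is roughly what "manually calculating those" means in software, a minimal C sketch rather than anything resembling Rosetta's actual code: PF is the even parity of the result's low byte, and AF is the carry out of bit 3. The names here are made up for illustration.

#include <stdint.h>

/* What an emulator has to do after an 8-bit ADD to recover the two
   "oddball" flags in software. Illustrative only. */
typedef struct {
    unsigned pf;  /* parity flag: set if the result's low byte has even parity */
    unsigned af;  /* adjust flag: carry out of bit 3, the old BCD helper       */
} OddFlags;

static OddFlags flags_after_add8(uint8_t a, uint8_t b)
{
    uint8_t result = (uint8_t)(a + b);
    OddFlags f;

    /* PF: fold the low byte down to its parity. */
    uint8_t p = result;
    p ^= p >> 4;
    p ^= p >> 2;
    p ^= p >> 1;
    f.pf = ~p & 1u;

    /* AF: the carry out of bit 3 falls out of XORing operands and result. */
    f.af = ((a ^ b ^ result) >> 4) & 1u;

    return f;
}

A few extra ops per emulated instruction doesn't sound like much, but it adds up when it has to happen on every ADD/SUB/INC/DEC whose flags the translator can't prove are dead.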
I'm a little bit surprised these flags are used in a way that a simple backward dataflow analysis wouldn't reveal. If the flags are only used within one function (that is, they only need to be preserved across local branches), then such an analysis should find the uses and the corresponding instructions that set them. If they are used across function boundaries (e.g. after a return, or passed as an argument), then it would be much harder to detect, so a conservative emulator would have to emulate the flags correctly across all calls and returns.
But do those uses actually exist? What use cases for these flags are there that aren't just straight-line code, or maybe a few simple branches? Is there code that relies on even weirder cases, like triggering a fault whose handler depends on correctly set AF/PF? Or did Apple just not want to deal with flow analysis in their JIT emulator and decide the hardware route was easier?
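Concretely, the kind of pass I have in mind is just a backward walk over each basic block, something like the sketch below. The types and names are made up for illustration; this is not anything Apple ships.

#include <stdbool.h>
#include <stddef.h>

/* Minimal backward flags-liveness over one basic block. */
enum { FLAG_PF = 1 << 0, FLAG_AF = 1 << 1 };   /* just the two oddballs */

typedef struct {
    unsigned flags_read;     /* flag bits this instruction consumes */
    unsigned flags_written;  /* flag bits this instruction defines  */
    bool needs_slow_flags;   /* filled in by the pass               */
} Insn;

/* An instruction only needs the slow, flag-computing translation if some
   later instruction (or the block exit) still reads a flag it writes. */
static void mark_slow_flags(Insn *block, size_t n, unsigned live_at_exit)
{
    /* live_at_exit is the conservative part: e.g. treat all flags as live
       if the block ends in a call, return, or indirect jump. */
    unsigned live = live_at_exit;
    for (size_t i = n; i-- > 0; ) {
        block[i].needs_slow_flags = (block[i].flags_written & live) != 0;
        live &= ~block[i].flags_written;
        live |= block[i].flags_read;
    }
}

With that, straight-line code whose flags are overwritten before anyone reads them never pays for PF/AF at all; the interesting question is how often real binaries force live_at_exit to stay conservative.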
Famous last words: "simple". What about sequences like:
pushfd
pop eax
and eax, ebx
Now you need to know what ebx is to know for certain that PF and AF are irrelevant. Which is very likely, but not guaranteed, and the halting problem rears its ugly head.
I don't know whether there are good heuristics there or even how common that really is, but my gut feeling is that it may still be easy to "accidentally" drop into the slow AF/PF-preserving case, which can hurt in performance-critical code.
On the other hand, calculating AF/PF in hardware in the first place seems, from my outside view at least, much simpler, and appears to prevent a great deal of potential headache.
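For what it's worth, one common software dodge here is lazy flag evaluation: remember the operands and result of the last flag-setting op, and only derive PF/AF when something like PUSHFD or DAA actually asks for them. A rough C sketch, with names of my own invention and no claim about what Rosetta does:

#include <stdint.h>

/* Lazy flags: the ALU op itself stays cheap; the cost of PF/AF moves to
   the (hopefully rare) instruction that reads them. */
typedef struct {
    uint32_t op1, op2, result;   /* last flag-setting operation */
} LazyFlags;

static void record_add(LazyFlags *lf, uint32_t a, uint32_t b)
{
    lf->op1 = a;
    lf->op2 = b;
    lf->result = a + b;          /* no flag computation here */
}

static unsigned materialize_af(const LazyFlags *lf)
{
    return ((lf->op1 ^ lf->op2 ^ lf->result) >> 4) & 1u;   /* carry out of bit 3 */
}

static unsigned materialize_pf(const LazyFlags *lf)
{
    uint8_t p = (uint8_t)lf->result;
    p ^= p >> 4; p ^= p >> 2; p ^= p >> 1;
    return ~p & 1u;              /* set on even parity of the low byte */
}

The catch is exactly the pushfd case above: once the flags escape into a general-purpose register or onto the stack, the emulator has to materialize them, and hot code that does this in a loop lands right back on the slow path.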
That's kind of what I did here: I "used" the bits on the stack by loading them into eax and ANDing them with whatever's in ebx. But yeah, the value could also be referenced more directly on the stack...
I've written a Prime minicomputer emulator, and it has these kinds of flags. In my experience, if you don't do exact hardware emulation, there will be some oddball program that doesn't work. If someone posted their program that Apple couldn't correctly emulate, it would cast doubt on their emulation in general, even if this oddball program is the only case in the world where it didn't work.
For Apple, I'm guessing that failing to emulate even an oddball program would be an unacceptable risk.
My guess is that there's an x86-ISA JIT, emulator, or threaded-code VM that itself reuses these flag bits (when it can guarantee a stretch of non-ALU code) to smuggle information between HLL ops. So Apple is in effect transparently transporting the smuggled data.
As I understand it, the 8086 has a separate address space for io.
As in, memory address $3f8 is just a RAM location and not the same thing as the base register of the COM1 serial port.
This is unlike 1980s home computers that used a 6502, where things like the keyboard interface appear to software as just another memory location.
Is there anything interesting related to how that is handled in hardware?
To the 8086 hardware, the "IO Space" is a single line[1] on the bus, essentially a 21st address bit. When that is set, the "memory-like" devices know to ignore the transaction and "io" devices know to decode the address.
At the software level, it's accessed through different instructions: instead of a MOV with a memory operand, you use IN or OUT. Notably, those instructions take only a 16-bit port address and no segment selector, so IO space is limited to 64 KB.
Also note that on the IBM PC, IO space was routinely truncated to just 10 bits. I honestly don't know whether that was ever standardized, but that's the way production devices and chipsets have always worked: the extra address lines simply weren't decoded, so higher ports generally aliased onto devices at the bottom of the space that ignored the top bits.
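To make the aliasing concrete, here is a toy C model of the separate spaces and the 10-bit decode; purely illustrative, not how any particular chipset is actually wired:

#include <stdint.h>

/* Separate address spaces: MOV goes to memory, IN/OUT go to a 64 KB
   I/O space addressed by a 16-bit port number with no segment. */
static uint8_t memory[1 << 20];      /* 1 MB, 20 address bits */
static uint8_t io_space[1 << 16];    /* 64 KB of ports        */

static uint8_t read_mem(uint32_t addr)            { return memory[addr & 0xFFFFF]; }
static uint8_t port_in(uint16_t port)             { return io_space[port]; }
static void    port_out(uint16_t port, uint8_t v) { io_space[port] = v; }

/* A PC-era card that only decodes address lines A0..A9 effectively sees
   every port modulo 0x400, so 0x7F8 lands on COM1's 0x3F8
   (0x7F8 & 0x3FF == 0x3F8). */
static uint16_t decode_10_bits(uint16_t port)     { return port & 0x3FF; }

So memory address 0x3F8 and port 0x3F8 are unrelated, and on a 10-bit card ports 0x3F8, 0x7F8, 0xBF8, ... all hit the same register.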
[1] Actually it's multiplexed in a group of 8 bus states addressed by three lines, because Intel was stingy about pins. But logically it's an extra address line. [This is the spot where I repeat my request to Ken to please do a blog post on the insane bus management on this device, which I'd dearly love to see.]
> please do a blog post on the insane bus management on this device
I'm working on it, but it's a difficult topic. The bus management circuitry is both very complicated (a bunch of flip flops making a complex state machine full of special cases) and lacking in general concepts. So it's hard to figure out how to make it comprehensible and interesting.
This reminds me of Chris Tarnovski criticizing some microcontroller designs because the buses run all over the die rather than being gated by a central multiplexer, with all that capacitance and all those logic inputs to drive limiting the maximum clock speed.
The 8086 inherited the separate I/O space from the Datapoint 2200, a "programmable terminal". Since the Datapoint 2200 had a custom TTL processor, it was convenient to build in instructions to manipulate its I/O hardware directly. It doesn't make as much sense for a general-purpose microprocessor, which is why the 6502 etc didn't follow that approach. (Separate I/O instructions make sense in something like the IBM System/360 mainframes, where they were executed by a separate channel controller and gave you more performance.)
As far as the implementation of I/O in the 8086 chip, I haven't come across anything particularly interesting yet. It's essentially the same bus state machine as memory, except it signals an I/O access rather than a memory access.
Yes, there's an output from the Group Decode ROM that indicates the instruction is an IN or OUT instruction. This causes the M/IO line to signal an I/O operation rather than a memory operation.
I think this is my favorite series of articles of yours so far. In part because, purely from the outside as a programmer, I'm very familiar with the 8086 and its successors, and so have always been curious about what's going on behind the curtain, beyond the usual coarse microarchitecture concepts. The level of detail you go into is perfect.
I guess many people here will feel similarly, given the 8086's role.
That chip is too complex for me. My cutoff is chips with one metal layer. I don't have a good process for removing one layer at a time if there are multiple metal layers. Also, Moore's law means that chips get exponentially more complex over time and my skills don't grow exponentially :-) Finally, I'm limited to about 1um features before microscope quality and the wavelength of light become a problem. Thus, the early 1980s is about as far as I can go.
https://bytecellar.com/2022/11/16/a-secret-apple-silicon-ext...