A Practical Approach to Decoding the x86

LYNN

An image of an Intel 386SX chip, attached to a circuit board.

Now what's this "Computermajig"?

What the heck is a computer? – me, a few hours before working on this project.

Computers are like little guys who get mail that tells them to do a single thing, and then they go do that single thing. Over and over. Some guys get very simple instructions, and some get complex instructions. The 8086 line of processors is in the latter category, and in fact is part of a group of processors called CISC (complex instruction set computers). The alternative to CISC is RISC (reduced instruction set computers), which you may have heard some buzz about recently.

Yes, RISC is very interesting because it is good on battery life, the chips are easier to produce, and it is easier to get started learning. In contrast, CISC is clunky, difficult to learn, and is only really around because everyone just got used to it. I know which one I want to spend a few thousand hours on, learning every single detail. Yup, CISC! Today we're talking about the good ole Intel 80386 processor, released in October 1985.

"RISC Architecture is gonna change everything."

–Angelina Jolie, as Acid Burn, in the movie Hackers. Go watch it, seriously.

Complex?

Here are the specifications of a single instruction:

Prefix	Opcode	ModRM	SIB	Displacement	Immediate
0-4 bytes	1-2 bytes	0-1 byte	0-1 byte	0, 1, 2, or 4 bytes	0, 1, 2, or 4 bytes

Prefix, ModRM, SIB, Displacement, and Immediate are all optional. That means the only required part of an instruction is the Opcode. The Opcode is how the computer identifies which action is being taken, and the optional bytes tell it exactly how that action is applied. Each action has a canonical name (a mnemonic) that programmers know about, and an associated byte that the computer knows about; an example would be the canonical name "AND" and its associated opcode 0x20.

This isn't the whole story; AND actually has quite a few variants on the opcode depending on what we are ANDing.

Each one of these portions of the instruction has a very specific purpose, and you can learn more about it on c-jump, which is where I did most of my early learning. If you are anything like me, though, these examples and diagrams can only go so far.

To complicate things, only some of these optional bytes are present in any given instruction, and some are exclusive to each other. This is probably all laid out very well in the above resource, but instead of reading (boring!) let's learn by doing.

Learning by Doing

For this you'll need a few packages. I'm assuming you are in a Linux environment, and I will be representing my dependencies specifically using the GNU Guix package manager.

guix shell --pure xxd fasm vim

You can pick a different editor, of course. Now, let's vim nop.asm and enter the following code:

;; nop.asm
;; does literally nothing
nop

NOP is the canonical name for "No Operation." In other words, it does nothing other than cycle the CPU. This is useful, I promise, but for now let's just view it as the world's simplest complex instruction.

This file doesn't do much on its own, but with fasm (Flat Assembler) we can make it something useful to us: an x86 instruction! Let's do that now:

fasm nop.asm

You should see an output that says something like '1 bytes.' Besides the grammar issue there, it is important to note: there is only 1 byte in this instruction! Let's crack that single byte open and decode it:

xxd -b nop.bin

First, xxd is a tool that allows us to look at the bytes of a binary file conveniently. We pass the -b flag so it comes out as binary, to make it easier to see every bit (x86 instructions have a lot of bit-based flagging). The output looks like this:

00000000: 10010000

Let's learn how to read it:

file offset	byte value
00000000	10010000

This means that at the start of our binary file (0), we have a single byte: 10010000. If you recall from our definition of instructions earlier, the only required byte of an instruction is the Opcode. That means this is the Opcode! Easy, right? Now let's refer to another useful resource: Wikipedia! According to this listing, NOP, or no operation, is opcode 0x90.

It's actually more common to refer to opcodes in hexadecimal, or base 16, than in binary, base 2.

We could convert our binary to hex by hand, or if we are exceptionally clever we could do it in our heads. I'm going to opt for a simple change to our xxd command:

xxd nop.bin

The default output for xxd is hexadecimal, and you should see something very exciting: a 90 at offset 0!
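If you would rather let a program do the binary-to-hex step, here is a minimal Python sketch. The function name parse_xxd_bits is my own invention, and it assumes the default xxd -b layout: a hexadecimal offset, a colon, then groups of eight bits.

```python
def parse_xxd_bits(line):
    """Parse one line of `xxd -b` output into (offset, raw bytes)."""
    offset_text, _, rest = line.partition(":")
    offset = int(offset_text, 16)  # xxd prints the offset in hexadecimal
    # Keep only groups made of exactly eight 0/1 characters. With default
    # settings xxd shows 6 bytes per line, so its trailing ASCII column is
    # too short to be mistaken for a byte.
    groups = [g for g in rest.split() if len(g) == 8 and set(g) <= {"0", "1"}]
    return offset, bytes(int(g, 2) for g in groups)


offset, data = parse_xxd_bits("00000000: 10010000")
print(offset, data.hex())  # 0 90
```

The int(g, 2) call reads a string of bits as a base 2 number, and data.hex() shows the same byte in base 16.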

And all of the Rest

Okay, so we got a single-byte instruction decoded. Now what? Well, the process I am using here is just writing assembly code that reflects what I have questions about. So, let's do that. Here is another file that is slightly more interesting:

; and.asm
; logically ANDs the registers al and bl
and al, bl

This command actually does something: It logically ANDs the al register with the bl register, and then stores the result in the al register.

Registers are a different topic entirely, so just think of them as "stored values" we can use later.

If we repeat our process of fasm and xxd -b, we should get the following output:

00000000: 00100000 11011000

With this command we have two bytes now: the Opcode, obviously, but what is the second byte? If we refer to the Wikipedia resource, we can see that AND can be a variety of opcodes. The one we just assembled was opcode 0x20, because we used two 8-bit registers. We know that because the first byte is 00100000, which is 0x20 (right?!).

Remember, 0x20 is in base 16, which is 32 in base 10 and 00100000 in base 2. Easy!
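To make "a variety of opcodes" a bit more concrete, here is a small Python lookup table of the one-byte AND opcodes in the 0x20 range. The names AND_OPCODES and describe are just mine for illustration, and the table is not complete: AND also has forms that take an immediate value through the 0x80, 0x81, and 0x83 opcodes, which I'm leaving out.

```python
# One-byte AND opcodes and the operand forms they select.
# "r/m" means "register or memory"; the 16/32 split depends on operand size.
AND_OPCODES = {
    0x20: "and r/m8, r8",
    0x21: "and r/m16/32, r16/32",
    0x22: "and r8, r/m8",
    0x23: "and r16/32, r/m16/32",
    0x24: "and al, imm8",
    0x25: "and ax/eax, imm16/32",
}


def describe(opcode):
    """Return the operand form for a known AND opcode, or a placeholder."""
    return AND_OPCODES.get(opcode, "not an AND opcode in this table")


print(describe(0x20))  # and r/m8, r8
```

Our first byte, 0x20, lands on the "r/m8, r8" row: two 8-bit operands, with the destination written first.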

The second byte comes after the Opcode, so that rules out the Prefix. If we refer to the c-jump reference, we can read the definitions of each optional part of an instruction. It says ModRM is required for instructions that take register or memory operands. We know that our instruction used registers, al and bl, so this is our best bet.

This is the anatomy of the ModRM byte:

Bits 7-6	Bits 5-3	Bits 2-0
Mod	Reg	R/M

Mod lets us know what mode we are in: 00 is register indirect, 01 is a one-byte displacement, 10 is a four-byte displacement (with 32-bit addressing), and 11 is register mode. We're looking for 11 here, to match what we would expect from a register-to-register command.

Reg tells us which register is our source. Here is the look-up table for that:
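Pulling the three fields out of a ModRM byte is just shifting and masking. Here is a minimal Python sketch; the function name split_modrm is made up for this post, and the byte in the usage line is an arbitrary example, not the one from our file.

```python
def split_modrm(byte):
    """Split a ModRM byte into its (mod, reg, rm) bit fields."""
    mod = (byte >> 6) & 0b11   # bits 7-6
    reg = (byte >> 3) & 0b111  # bits 5-3
    rm = byte & 0b111          # bits 2-0
    return mod, reg, rm


print(split_modrm(0b11000001))  # (3, 0, 1)
```

A mod of 3 is binary 11, which is the register mode described above.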

Reg	8bit register	16bit register	32bit register
000	al	ax	eax
001	cl	cx	ecx
010	dl	dx	edx
011	bl	bx	ebx
100	ah	sp	esp
101	ch	bp	ebp
110	dh	si	esi
111	bh	di	edi

I mentioned earlier that we used 8-bit registers; if you were wondering how I know that, this table shows you. Both al and bl are in the 8-bit register column, so we know our instruction was in 8-bit mode. We're looking for 011, as our source was bl.

R/M can either be another register or a memory address. In our case it was another register, so we use the same table above. Our other register was al, so we are looking for 000.
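The look-up table translates nicely into code, since the three-bit value is just an index. This is a sketch in Python; the list names and register_name are my own, and in a real decoder the width would come from the opcode and any prefixes rather than being passed in by hand.

```python
# Index into each list with the 3-bit Reg (or register-mode R/M) value.
REG8 = ["al", "cl", "dl", "bl", "ah", "ch", "dh", "bh"]
REG16 = ["ax", "cx", "dx", "bx", "sp", "bp", "si", "di"]
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def register_name(code, width):
    """Look up a register name from its 3-bit code and operand width."""
    table = {8: REG8, 16: REG16, 32: REG32}[width]
    return table[code]


print(register_name(0b011, 8))   # bl
print(register_name(0b011, 32))  # ebx
```

The same three bits name bl, bx, or ebx depending on the width, which is why we needed to know we were in 8-bit mode first.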

Now, let's look at our xxd output again, specifically our second byte:

00000000: 00100000 11011000

Split into its fields, the second byte 11011000 reads as 11 011 000. Mod here is the value 11, which is what we would want: register addressing mode. Reg is 011, which lines up with our expectation for bl, and R/M is similarly 000 to match our expectation of al.

And there you have it: understanding how instructions are encoded also lets us understand how they are decoded. With a few references and a trusted assembler like fasm, we can take any instruction, work our way backwards from the bytes, and gain some understanding of how our computer handles instructions.
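Putting the whole walk-through together, here is a tiny Python sketch that decodes our two bytes back into text. It only handles opcode 0x20 in register mode, which is an assumption I'm making to keep it short. The name decode_and_r8 is hypothetical and this is not code from a real emulator.

```python
REG8 = ["al", "cl", "dl", "bl", "ah", "ch", "dh", "bh"]


def decode_and_r8(code):
    """Decode a two-byte `and r/m8, r8` instruction in register mode."""
    opcode, modrm = code[0], code[1]
    if opcode != 0x20:
        raise ValueError("only opcode 0x20 is handled in this sketch")
    mod = modrm >> 6            # bits 7-6
    reg = (modrm >> 3) & 0b111  # bits 5-3, the source register
    rm = modrm & 0b111          # bits 2-0, the destination register
    if mod != 0b11:
        raise ValueError("memory operands are not handled in this sketch")
    return f"and {REG8[rm]}, {REG8[reg]}"


print(decode_and_r8(bytes([0b00100000, 0b11011000])))  # and al, bl
```

Note that R/M is printed first: for this opcode it is the destination, and Reg is the source.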

I hope you managed to learn something.

This post was written as I was learning to reverse some binary code while writing an i386 emulator, so expect more posts in this area!