
Posted at 2011-12-16 15:02
Tags: assembly disassembly fridayqna guest objectivec

Friday Q&A 2011-12-16: Disassembling the Assembly, Part 1

by Gwynne Raskind

As a small change of pace, today's post is written by guest author Gwynne Raskind. My last post touched a bit on disassembling object files, and Gwynne wanted to dive deeply into just how to read the output in detail. Without further ado, I present her wonderful in-depth look at reading x86_64 assembly.

In the December 2 edition of his Friday Q&A series, Michael Ash wrote about several tools for object file analysis, based around a simple piece of sample code which he ran through each tool for examples to show.

His article is lacking in only one respect: it doesn't go into detail about what the assembly language that these tools show actually means. It's just common sense that he didn't; it's an advanced and intricate topic, deserving of an article of its own. I decided to write that article.

The Sample Code
I'll be using exactly the same code that Mike did, replicated here:

    // clang -framework Cocoa -fobjc-arc test.m
    #import <Cocoa/Cocoa.h>

    @interface MyClass : NSObject
    {
        NSString *_name;
        int _number;
    }
    - (id)initWithName:(NSString *)name number:(int)number;
    @property (strong) NSString *name;
    @property int number;
    @end

    @implementation MyClass
    @synthesize name = _name, number = _number;

    - (id)initWithName:(NSString *)name number:(int)number
    {
        if((self = [super init]))
        {
            _name = name;
            _number = number;
        }
        return self;
    }
    @end

    NSString *MyFunction(NSString *parameter)
    {
        NSString *string2 = [@"Prefix" stringByAppendingString:parameter];
        NSLog(@"%@", string2);
        return string2;
    }

    int main(int argc, char **argv)
    {
        @autoreleasepool
        {
            MyClass *obj = [[MyClass alloc] initWithName:@"name" number:42];
            NSString *string = MyFunction([obj name]);
            NSLog(@"%@", string);
            return 0;
        }
    }

Some things to notice right away:

This code uses ARC.
Accordingly, this code is 64-bit only and requires a recent version of the Clang compiler.
When run, the program will print "Prefixname" twice.

A Crash Course in x86 Architecture
Before diving into the assembly language itself, here's a quick lesson in the basics of the x86_64 (aka AMD64) architecture. The official reference manuals can be found at the AMD developer website, and cover in extremely technical detail almost everything you'll ever need to know about the underlying workings of the CPU. Several gaps are filled in by the AMD64 Application Binary Interface Specification, which defines the Application Binary Interface (ABI) for C and C++ programs running in 64-bit mode on an x86_64 processor. The AMD64 specifications document the running of the CPU itself, while the ABI spec defines the conventions used by programs running on the CPU.

Where possible, I will speak in general terms about x86_64 architecture. Very little about how programs work at this level is specific to Mac OS X. While the functions called by the Objective-C runtime are very much OS-specific, the assembly language instructions that call those functions follow the same specifications as any x86_64 system.

Note: If you're already familiar with such concepts as virtual memory, the stack, the heap, and CPU registers, you can skip this entire section.

A Model of Memory
First, we look at the memory model of the computer. The x86_64 architecture specifies a "flat, paged memory model", which in simple terms means that all of memory is laid out as one enormous linear block, divided up evenly into equally-sized "pages" of a predefined size. Addresses are always 64 bits long, but software running on current x86_64 implementations can only use 48 bits' worth of virtual address; no shipping CPU implements the full 64 bits. The top 16 bits of an address must all be copies of bit 47 (the "canonical" form), which for the userland addresses we'll be looking at means they are always zero. The x86_64 specification does not provide for any other use of those 16 bits, such as tagging; they are reserved for a time when future implementations have both the need and the capability of addressing more than 256 TB of memory at a time. (Physical addresses are a separate matter: the architecture allows up to 52 bits of physical address, and how many are actually implemented depends on the CPU.)
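
To make the address-width limit concrete, here is a minimal C sketch of the rule a CPU with 48-bit virtual addresses applies: bits 47 through 63 of a 64-bit address must all be equal, which for userland addresses means they are all zero. The function name is made up for this illustration, and the 48-bit width is an assumption matching the implementations discussed here, not something fixed by the address format itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true if addr is usable on an implementation with 48-bit virtual
 * addresses: bits 47..63 must be all zeros or all ones. */
bool is_canonical_48(uint64_t addr)
{
    uint64_t top = addr >> 47;            /* the 17 bits from 47 through 63 */
    return top == 0 || top == 0x1FFFF;    /* all clear, or all set */
}

/* A few self-checks showing where the usable ranges begin and end. */
void check_canonical_examples(void)
{
    assert(is_canonical_48(UINT64_C(0x00007FFFFFFFFFFF)));   /* highest low-half address */
    assert(!is_canonical_48(UINT64_C(0x0000800000000000)));  /* first address in the hole */
    assert(is_canonical_48(UINT64_C(0xFFFF800000000000)));   /* lowest high-half address */
}
```

Everything between the two valid ranges is simply unusable; an attempt to access such an address faults before paging is even consulted.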

x86_64 also requires the implementation to provide virtual memory and protected memory. This means that the OS can set up the system in such a way that every process it runs sees its own complete 64-bit address space (virtual memory), and only its own (protected memory). The OS is responsible for ensuring that an application gets the memory it actually uses by the use of "paging". The CPU intercepts all memory accesses by userland processes ("virtual addresses") and translates them to physical addresses with the help of the OS. A more in-depth description of how virtual memory and paging work is beyond the scope of this article; for now, it's enough to understand that every application has its own individual 64-bit memory space and can't see any other process' space.
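
As a rough illustration of what "paging" means for an individual address, the following C sketch splits a virtual address into a page number and an offset within that page. It assumes 4 KB pages, the smallest page size x86_64 offers; the function names are invented for this example, and the actual translation from page number to physical memory (done by the CPU using tables the OS maintains) is not shown.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12u                             /* 4 KB pages: 2^12 bytes each */
#define PAGE_SIZE  (UINT64_C(1) << PAGE_SHIFT)

/* Number of the virtual page that contains addr. */
uint64_t page_number(uint64_t addr)
{
    return addr >> PAGE_SHIFT;
}

/* Byte offset of addr within its page; translation leaves this part unchanged. */
uint64_t page_offset(uint64_t addr)
{
    return addr & (PAGE_SIZE - 1);
}
```

For example, the address 0x7fff5fbff8a0 lies 0x8a0 bytes into virtual page 0x7fff5fbff. Only the page number is translated; the offset is carried over as-is.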

Note: This is not the same as the "virtual memory" which you may be familiar with, using space on the computer's hard drive as extra RAM, though that kind of virtual memory (a "paging file", or "swap space") is implemented in part by use of the CPU's virtual memory system.

This enormous 64 bits worth of address space is divided up into two areas: The stack and the heap. The stack is an area set aside high in the address space (typically high, anyway; in practice it can be just about anywhere) for the use of subroutine calls and local variable storage. The stack always grows downward; as the amount of information on the stack increases, the address of the top of the stack decreases. On older systems with smaller memory models, it was possible for the stack to grow too far downward and collide with other areas, but while it's still technically possible for this to happen, other things would go wrong long before a heap collision (in particular, the stack would run off the edge of its allocated memory pages and cause a protection fault). The CPU has a few instructions specifically designed for manipulating the stack, though they often go unused in favor of more efficient methods in modern code. You can think of the stack as a moderately large chunk of memory allocated by the system at the launch of your program.
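
The "grows downward" behaviour is easy to model. The sketch below is a toy, not how the hardware is implemented: a small byte array stands in for the stack's memory and an offset stands in for the stack pointer. All of the names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { STACK_BYTES = 256 };

typedef struct {
    uint8_t  memory[STACK_BYTES];   /* the pretend stack region */
    uint64_t sp;                    /* offset of the top of the stack */
} toy_stack;

/* An empty stack has its top just past the highest address. */
void toy_stack_init(toy_stack *s)
{
    memset(s->memory, 0, sizeof s->memory);
    s->sp = STACK_BYTES;
}

/* Push: move the top DOWN by 8 bytes, then store the value there. */
void toy_push(toy_stack *s, uint64_t value)
{
    assert(s->sp >= sizeof value);               /* otherwise: stack overflow */
    s->sp -= sizeof value;
    memcpy(&s->memory[s->sp], &value, sizeof value);
}

/* Pop: read the value at the top, then move the top back UP by 8 bytes. */
uint64_t toy_pop(toy_stack *s)
{
    uint64_t value;
    assert(s->sp + sizeof value <= STACK_BYTES); /* otherwise: stack underflow */
    memcpy(&value, &s->memory[s->sp], sizeof value);
    s->sp += sizeof value;
    return value;
}
```

Pushing two values moves the top from 256 to 248 and then 240; popping them returns the values in reverse order and brings the top back to 256.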

The heap effectively consists of every area of memory that is not the stack; memory from the heap is allocated at runtime by the system for the process' use. The heap contains the stack, in fact, though they are usually considered conceptually separate. All of your executable code is loaded into a section of the heap, as well as copies of any libraries your executable links to. (Note: These are not actually copies, as it would be ridiculously inefficient to copy every library for every loaded process, but it's easier to just think of them as copies until you have a good grasp of virtual memory.) Memory allocated by your process during its execution also comes from the heap.

The CPU and Its Registers
The CPU is the chip that actually does all the work. It fetches, decodes, and executes an instruction stream; what this means in practical terms is you give it a bunch of machine code and it does what the code tells it. Machine code is a bunch of bytes generated from source code by a compiler. A human could build machine code by hand, but it would be an exceedingly arduous process, and it's rarely if ever worth the time to do with any computer made in the last thirty years or so. The intermediate step between source code and machine code is, of course, assembly language, and humans do spend a lot of time working in assembly language for various reasons, mostly having to do with either things source code can't do or things compilers can't optimize as well as humans - yet.

Note: In fact, most compilers work by compiling the source code from C or another high-level language to assembly language and then translating the assembly language to machine code (along with some other intermediate steps). However, you never see the assembly language unless you ask to.

A register is an area of storage set aside inside the CPU itself for effectively instantaneous access. Registers serve a large variety of purposes. An x86_64 CPU has a set of at least 100 registers - whew! Fortunately, an application developer, even working in assembly language, rarely has to be concerned with more than about 20 of them at most. The majority of the registers (including the control, debug, table descriptor, performance, and machine-check registers, to name a few) are accessible only to kernel code. Most of the rest, such as the mmx, xmm, and ymm registers, are only used by vector code, and the fpr registers are only used for floating-point calculations (there are exceptions, but as a rule of thumb it's a safe starting point). In addition, only parts of the rflags register are ever used by application code.

One of the quirks of the x86 architecture, ever since the original 16-bit 8086 instruction set, is that many of the same registers can be addressed by different names which determine which part of the register is being read or written. Most of the general-purpose registers can be addressed down to a single byte. For example, the rax (accumulator) register, the first 64-bit general-purpose register, can also be addressed as eax (the low 32 bits of rax), ax (the low 16 bits of rax), ah (the second lowest 8 bits of rax), and al (the low 8 bits of rax). This capability is useful for handling smaller data types - for example, to handle a signed 32-bit addition requires only a single add instruction based on 32-bit register names rather than several instructions designed to emulate 32-bit sign extension and integer overflow on 64-bit registers.

A register named r*x is 64 bits; e*x is 32 bits, *x is 16 bits, and *h or *l are 8 bits. For the r8-r15 registers, these names are instead rN (64 bits), rNd (32 bits), rNw (16 bits), and rNb (8 bits). rip and rflags can only be accessed as 64-bit registers, and the 8-bit versions of rsi, rdi, rsp, and rbp are named sil, dil, spl, and bpl.
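
The relationship between these names is nothing more than masking and shifting, which the following C sketch spells out for rax. The function names are invented for illustration. The last function also shows one detail not covered above that is worth knowing when reading disassembly: in 64-bit mode, writing a 32-bit register name clears the upper 32 bits of the full register, whereas writing an 8-bit or 16-bit name leaves the other bits alone.

```c
#include <assert.h>
#include <stdint.h>

/* Reading the smaller names: each is a slice of the same 64 bits. */
uint32_t view_eax(uint64_t rax) { return (uint32_t)rax; }          /* bits 0..31 */
uint16_t view_ax(uint64_t rax)  { return (uint16_t)rax; }          /* bits 0..15 */
uint8_t  view_ah(uint64_t rax)  { return (uint8_t)(rax >> 8); }    /* bits 8..15 */
uint8_t  view_al(uint64_t rax)  { return (uint8_t)rax; }           /* bits 0..7  */

/* Writing al replaces only the low byte; the other 56 bits are preserved. */
uint64_t write_al(uint64_t rax, uint8_t value)
{
    return (rax & ~UINT64_C(0xFF)) | value;
}

/* Writing eax replaces the low 32 bits AND zeroes the upper 32 bits. */
uint64_t write_eax(uint64_t rax, uint32_t value)
{
    (void)rax;              /* the old contents do not survive */
    return value;
}
```

With rax holding 0x1122334455667788, eax reads as 0x55667788, ax as 0x7788, ah as 0x77, and al as 0x88.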

The registers that a userland process is concerned with on a regular basis are:

rax, rbx, rcx, rdx, r8-r15 - These are general-purpose registers, used for just about anything at any given moment, though the ABI locks these down to considerably more specific purposes. These registers can also be called the accumulator (rax), the base register (rbx), the count register (rcx), and the data register (rdx).
rsi, rdi - These are technically index registers ('source index' and 'destination index'), but in modern code they are typically used as general purpose registers, within the specification of the ABI.
rbp, rsp - The "base pointer" and "stack pointer" registers. These are used for accessing the stack; the CPU's stack instructions will always assume that rsp holds the address of the top of the stack.
rflags - The flags register, holding a long list of flags indicating the results of calculations done by instructions. The flags register can not be directly addressed. Operations affected by CPU flags are generally part of the instructions themselves; for instance, conditional jump instructions work differently depending on the current flags, and arithmetic operations change the flags. Certain instructions affect the flags directly, such as stc and clc, which respectively set and clear the Carry Flag. It is also possible to read the flags register directly by pushing it to the stack and write to it directly by popping from the stack into it. The flags a userland process can affect are:
- CF - Carry Flag. CF is set when the result of an addition is a carry or the result of a subtraction is a borrow. It is also affected by arithmetic bit shifting instructions and bit test instructions, cleared by bitwise logic instructions, and manipulated directly by the stc, clc, and cmc instructions.
- PF - Parity Flag. PF is set when there are an even number of 1 bits in the low byte of the last result of some operations. It can be used for parity checks.
- AF - Auxiliary Carry Flag. AF is set when an arithmetic or BCD operation generates a carry or borrow from bit 3 of the result. Its use is limited to doing decimal math directly on the CPU, and it sees little use.
- ZF - Zero Flag. ZF is set when the last arithmetic operation had a result of zero. Compare and test instructions also set or clear ZF appropriately. It is often used for equality testing, as it is set when comparing two equal operands.
- SF - Sign Flag. SF is set if the last arithmetic operation had a negative result. More exactly, after an arithmetic operation, SF is set to the value of the most significant bit of the result.
- DF - Direction Flag. DF is used to control whether the string instructions increment or decrement rsi and rdi during their operation, and can be manipulated by the std and cld instructions. This flag is rarely used in modern code, as the string instructions see little use.
- OF - Overflow Flag. OF is set when the sign of the result of the last signed arithmetic operation is different from the signs of both source operands. This means that the result was too big or too small to hold in the destination.
rip - The instruction pointer register. This holds the memory address of the next instruction to be executed by the CPU. rip can be addressed directly in x86_64, but only for use as a memory offset. To write to rip, one must execute one of the many control transfer instructions. As instructions are executed, rip increases by the size of each one (instructions are of very variable size in the x86 architectures), with the exception of control transfer instructions, which work by changing the value of rip according to the transfer target.
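
The flag descriptions can feel abstract until they are worked through for a concrete operation. Here is a small C sketch that computes the flags a 32-bit add instruction would produce, following the definitions in the list above. The struct and function names are made up for this example, and AF is left out since it sees so little use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool cf, pf, zf, sf, of;
} add_flags;

/* Flags resulting from the 32-bit addition a + b. */
add_flags flags_after_add32(uint32_t a, uint32_t b)
{
    uint32_t result = a + b;    /* wraps modulo 2^32, just as the register would */
    add_flags f;

    f.cf = result < a;                              /* a carry out of bit 31 occurred */
    f.zf = result == 0;                             /* the result is zero */
    f.sf = (result >> 31) & 1;                      /* copy of the result's top bit */

    /* Overflow: a and b have the same sign, and the result's sign differs. */
    f.of = ((~(a ^ b) & (a ^ result)) >> 31) & 1;

    /* Parity: fold the low byte onto itself so bit 0 ends up set
     * exactly when the byte has an odd number of 1 bits. */
    uint8_t low = (uint8_t)result;
    low ^= low >> 4;
    low ^= low >> 2;
    low ^= low >> 1;
    f.pf = (low & 1) == 0;                          /* PF means "even" */

    return f;
}
```

Adding 1 to 0xFFFFFFFF sets CF and ZF but not OF: as unsigned math it wrapped around, while as signed math it is simply -1 + 1 = 0. Adding 1 to 0x7FFFFFFF does the opposite, setting OF and SF but not CF, because the largest positive signed value has tipped over into a negative one.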

Calling Conventions
The calling conventions of an architecture, which are typically what people mean when they say ABI, specify the ways that functions receive parameters, return values, manage the stack, and other fundamentals not already part of the CPU architecture. x86_64's calling conventions are somewhat complicated, so I'll include an abbreviated version here which will get you through all of the sample code.

Conveniently, none of the functions in the sample code take non-integer parameters, or any large number of parameters. At this point, one might immediately protest that char **, NSString *, and id certainly are not integers! However, for the purpose of function parameter passing, an integer is a value that fits within the bit width of the architecture, i.e. 64 bits for x86_64. Pointers are exactly that size, while int is smaller (x86_64 is an LP64 architecture, which means that long is 64 bits, but int is 32).

Integer parameters to functions are passed via a series of registers. The first parameter goes in rdi. The second goes in rsi, the third in rdx, then rcx, r8, and r9, in that order. If there are more integer arguments than that, the remainder are pushed onto the stack in right-to-left (reverse) order.

Apart from that, the only quirk of the calling conventions we need to be concerned with for this code is the sequence for variadic functions. A variadic function is one which uses the stdarg interface to take a variable number of parameters. In this case, NSLog is the culprit. There's only one oddity in how variadic functions take parameters, at least at the assembly language level: The byte value of al (the low 8 bits of rax) is used to specify the number of vector registers used to pass arguments to the function. Since no vector registers are used by our sample code, this number is always zero.

Finally, functions return simple integer values (again, remember that for these purposes, a pointer is an integer value) in rax, or in some rarer cases, rdx.

The complete calling conventions are considerably more complicated; if you're curious, have a look at the AMD64 Application Binary Interface Specification.
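
Since this register order comes up in every call sequence examined later, it may help to have it in lookup form. The following C sketch is purely a memory aid; the function name is hypothetical and nothing like it exists in any system library.

```c
#include <assert.h>
#include <string.h>

/* Where the integer or pointer argument at the given position
 * (counting from 0) is passed, per the order described above. */
const char *integer_arg_location(unsigned index)
{
    static const char *const regs[] = { "rdi", "rsi", "rdx", "rcx", "r8", "r9" };
    const unsigned reg_count = sizeof regs / sizeof regs[0];

    return index < reg_count ? regs[index] : "stack";
}
```

For an Objective-C message send this means the receiver (self) travels in rdi, the selector (_cmd) in rsi, and the first explicit method argument in rdx.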
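
For readers who haven't written one, this is roughly what a variadic function looks like from the C side. It is a minimal sketch with an invented function, unrelated to how NSLog itself is implemented. Every call to it passes only integers, so a compiler following the convention just described would set al to zero before each call, exactly as happens for NSLog in the sample code.

```c
#include <assert.h>
#include <stdarg.h>

/* Adds up `count` int arguments supplied after the fixed parameter. */
int sum_ints(int count, ...)
{
    va_list args;
    int total = 0;

    va_start(args, count);              /* begin walking the unnamed arguments */
    for (int i = 0; i < count; i++)
        total += va_arg(args, int);     /* fetch the next one as an int */
    va_end(args);

    return total;
}
```

A call such as sum_ints(3, 1, 2, 3) places 3, 1, 2, and 3 in the first four integer argument registers and returns 6.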

The Assembly Language
The full disassembly of the program is 645 lines long. For sanity's sake, I won't be pasting it here. I'll instead be following along with the code as I explain it. You can disassemble it yourself by running /usr/bin/clang -S test.m -o test.s -fobjc-arc in the directory where you compiled the sample code and viewing the test.s file. This is the compiler's generated assembly code, which is better annotated than anything else, as the compiler doesn't have to guess at anything's name or location.

Looking at the disassembly, one might notice that the code contains eight functions. Eight? The sample code only has three! Where did those other five come from? Four methods are synthesized by the compiler per the @synthesize directive, and the -[MyClass .cxx_destruct] method is created by the compiler to do C++- and ARC-related cleanup.

main
The code for main is:

    int main(int argc, char **argv)
    {
        @autoreleasepool
        {
            MyClass *obj = [[MyClass alloc] initWithName:@"name" number:42];
            NSString *string = MyFunction([obj name]);
            NSLog(@"%@", string);
            return 0;
        }
    }

And the compiler's assembly language output, stripped of several confusing directives for brevity's sake:

    _main:
        pushq   %rbp
        movq    %rsp, %rbp
        subq    $96, %rsp
        leaq    L__unnamed_cfstring_23(%rip), %rax
        leaq    L__unnamed_cfstring_26(%rip), %rcx
        movl    $42, %edx
        leaq    l_objc_msgSend_fixup_alloc(%rip), %r8
        movl    $0, -4(%rbp)
        movl    %edi, -8(%rbp)
        movq    %rsi, -16(%rbp)
        movq    %rax, -48(%rbp)         ## 8-byte Spill
        movq    %rcx, -56(%rbp)         ## 8-byte Spill
        movq    %r8, -64(%rbp)          ## 8-byte Spill
        movl    %edx, -68(%rbp)         ## 4-byte Spill
        callq   _objc_autoreleasePoolPush
        movq    L_OBJC_CLASSLIST_REFERENCES_$_(%rip), %rcx
        movq    %rcx, %rdi
        movq    -64(%rbp), %rsi         ## 8-byte Reload
        movq    %rax, -80(%rbp)         ## 8-byte Spill
        callq   *l_objc_msgSend_fixup_alloc(%rip)
        movq    L_OBJC_SELECTOR_REFERENCES_27(%rip), %rsi
        movq    %rax, %rdi
        movq    -56(%rbp), %rdx         ## 8-byte Reload
        movl    -68(%rbp), %ecx         ## 4-byte Reload
        callq   _objc_msgSend
        movq    %rax, -24(%rbp)
        movq    -24(%rbp), %rax
        movq    L_OBJC_SELECTOR_REFERENCES_28(%rip), %rsi
        movq    %rax, %rdi
        callq   _objc_msgSend
        movq    %rax, %rdi
        callq   _objc_retainAutoreleasedReturnValue
        movq    %rax, %rdi
        movq    %rax, -88(%rbp)         ## 8-byte Spill
        callq   _MyFunction
        movq    %rax, %rdi
        callq   _objc_retainAutoreleasedReturnValue
        movq    %rax, -32(%rbp)
        movq    -88(%rbp), %rax         ## 8-byte Reload
        movq    %rax, %rdi
        callq   _objc_release
        movq    -32(%rbp), %rsi
        movq    -48(%rbp), %rdi         ## 8-byte Reload
        movb    $0, %al
        callq   _NSLog
        movl    $0, -4(%rbp)
        movl    $1, -36(%rbp)
        movq    -32(%rbp), %rdx
        movq    %rdx, %rdi
        callq   _objc_release
        movq    -24(%rbp), %rdx
        movq    %rdx, %rdi
        callq   _objc_release
        movq    -80(%rbp), %rdi         ## 8-byte Reload
        callq   _objc_autoreleasePoolPop
        movl    -4(%rbp), %eax
        addq    $96, %rsp
        popq    %rbp
        ret

Whew! main's pretty long in assembly, huh? There are some important things to recognize here:

Per the ABI, rdi is the first argument register for integer/pointer arguments, and contains the value of argc.
Likewise, rsi contains the value of argv.
Also likewise, rdx has the value of envp. This holds true even though envp is not declared as a parameter to main!
Finally, rcx holds the value of a more mysterious "exec_path" parameter, whose presence I only discovered when I peeked at the disassembly of the start function, part of the C runtime.
And, per x86 convention, rsp points to the top of the stack. Because main is a subroutine of start, the 8 bytes pointed to by rsp are the return address for main, the next instruction in start.

Let's take it one instruction at a time.

pushq %rbp - Starting off pretty simple. Save the base pointer on the stack so we can restore it later. The ABI specifies that rbp must be preserved across function calls, so since it's about to change, it gets saved.
movq %rsp,%rbp - Copy rsp to rbp. This is part of a standard C function's prologue, setting up the stack to hold any local variables that aren't put in registers for whatever reason.
subq $96,%rsp - A number preceded by $ in this assembly syntax is an immediate operand, i.e. a literal value encoded into the instruction itself, so this line subtracts 96 from rsp, growing the stack by 96 bytes. This is how much stack space the compiler has determined it will need for the rest of the function.
leaq L__unnamed_cfstring_23(%rip),%rax - Load the address of L__unnamed_cfstring_23 into rax, using rip as the base. rip-relative addressing is typically used for loading such things as constant strings and selector names, as well as for fast branches. This particular load grabs the string @"%@" from the place it was stored in the executable. This string will later be used as a function parameter.
leaq L__unnamed_cfstring_26(%rip),%rcx - Same as above, but loading @"name" into rcx.
movl $42,%edx - Load the 32-bit value 42 into edx (the low 32 bits of rdx). This value is also used later.
leaq l_objc_msgSend_fixup_alloc(%rip),%r8 - Grab the address of the l_objc_msgSend_fixup_alloc symbol from the Objective-C segment of the executable, and save that address in r8. Once again, this is used later.
movl $0, -4(%rbp) - Store a 32-bit zero at the bottom of main's stack frame.
This serves as a useful reminder that the stack grows downwards; given that we know that %rbp points to the bottom of this function's portion of the stack, i.e. the highest address of its frame, this line is actually setting the last four bytes of the frame to zero.
So what does this actually do? As it turns out, for all intents and purposes, it does absolutely nothing! It's the result of the compiler's determination to make sure no garbage value gets used later, even though this particular zero is overwritten further down before it is ever read.
movl %edi, -8(%rbp) - Save edi, the low 32 bits of rdi, on the stack. As edi holds the first integer argument, this is actually the value of argc. Taken together with the previous instruction, the eight bytes starting at -8(%rbp) now read as argc zero-extended to 64 bits; the same effect could have been achieved by code something like *(int64_t *)(rbp - 8) = ((int64_t)argc & 0x00000000FFFFFFFF);, except that extending and ANDing the value of argc would have been several more operations. (The slot at -4(%rbp) is also the one main's return value is loaded from at the very end, so the zero is probably best read as an initial return value rather than as padding for argc.) Unfortunately for the unoptimizing compiler's track record, this instruction also turns out to be useless, as the value of argc is never actually used.
movq %rsi, -16(%rbp) - Save rsi, also known as argv at the moment, on the stack. A third useless instruction in a row, since argv isn't used either.
```
movq    %rax, -48(%rbp)         ## 8-byte Spill
movq    %rcx, -56(%rbp)         ## 8-byte Spill
movq    %r8, -64(%rbp)          ## 8-byte Spill
movl    %edx, -68(%rbp)         ## 4-byte Spill
```
Save rax (the string @"%@"), rcx (the string @"name"), r8 (a pointer to l_objc_msgSend_fixup_alloc) and edx (the number 42) on the stack as "spill" values.
What in the world is a spill value, you might ask? A register spill takes place when the compiler needs a register to store a value in, typically as a parameter to a function call since parameters go in specific registers, and none are available. The value in the needed register is saved on the stack ("spilled") so it can be restored ("reloaded") later. In this case, where optimization is shut off, the compiler doesn't have any of the data-flow analysis it would need to realize that all this spilling is unnecessary, and everything in useful registers gets spilled.
callq _objc_autoreleasePoolPush - Make a subroutine call to objc_autoreleasePoolPush(). A subroutine call consists of two operations, performed atomically with respect to other instructions (i.e. they can not be preempted halfway through): Push the address of the next instruction to be executed to the stack, and execute a branch to the address of the first instruction of the called function. Since objc_autoreleasePoolPush() doesn't take any parameters, what's in most of the registers doesn't matter. When it returns, however, rax contains its void * return value, a pointer which acts as an opaque handle to the position of the new autorelease pool on the pool stack. This value is invisible to the Objective-C code, which sees only the @autoreleasepool statement.
```
movq    L_OBJC_CLASSLIST_REFERENCES_$_(%rip), %rcx
movq    %rcx, %rdi
movq    -64(%rbp), %rsi         ## 8-byte Reload
movq    %rax, -80(%rbp)         ## 8-byte Spill
callq   *l_objc_msgSend_fixup_alloc(%rip)
```
Load the value at rip + L_OBJC_CLASSLIST_REFERENCES_$_ into rcx, copy rcx into rdi, reload the address of l_objc_msgSend_fixup_alloc from the stack into rsi, spill rax (the autorelease pool handle) to the stack, and finally, make an indirect subroutine call through the pointer stored at l_objc_msgSend_fixup_alloc.
L_OBJC_CLASSLIST_REFERENCES_$_ is the symbol for the MyClass class object. The load into rcx and then the immediate copy to rdi is once again a problem of lack of data-flow analysis; the compiler blindly picks the first available register to load the value into, then stores it in the first integer parameter register from there.
What rules cause it to consider rcx the first available register? rax is still in use as a return value until the next couple of instructions, and rbx isn't considered because its value is preserved across function calls, making it a very un-preferred register for use.
So far, the MyClass class object is parameter 1. The reload from the stack pulls the pointer to l_objc_msgSend_fixup_alloc into argument 2. The spill of rax saves the autorelease pool handle, since rax will be clobbered by the subroutine return. And l_objc_msgSend_fixup_alloc is a vtable call; the address of the real alloc method will be "fixed up" at runtime for optimization purposes.
This sequence therefore amounts to an optimized Objective-C message send. Recall that every Objective-C method takes two hidden arguments, self and _cmd. In this case, self is [MyClass class], and _cmd is alloc (or more exactly, a vtable pointer to a common alloc method for all classes). A very similar sequence follows.
```
movq    L_OBJC_SELECTOR_REFERENCES_27(%rip), %rsi
movq    %rax, %rdi
movq    -56(%rbp), %rdx         ## 8-byte Reload
movl    -68(%rbp), %ecx         ## 4-byte Reload
callq   _objc_msgSend
```
Load the value at rip + L_OBJC_SELECTOR_REFERENCES_27 into rsi, copy rax to rdi, reload @"name" into rdx, reload 42 into ecx and subroutine-call to objc_msgSend.
L_OBJC_SELECTOR_REFERENCES_27 is the selector for -[MyClass initWithName:number:], placed into rsi, or argument 2. rax holds the return value of alloc, which is the new MyClass object, and it's copied into argument 1. The third parameter, rdx, is loaded with the constant NSString @"name", and the fourth parameter with the number 42. Finally, objc_msgSend() is called. This is the call sequence for [<result of alloc> initWithName:@"name" number:42]. The init method will return the value of self in rax.
movq %rax, -24(%rbp) and movq -24(%rbp), %rax - Yes, that's right, the second of these two instructions reloads exactly what the first one just stored. Because -24(%rbp) is used later, it's good for the value to be saved. Unfortunately, the immediate reload back into rax is not justified.
```
movq    L_OBJC_SELECTOR_REFERENCES_28(%rip), %rsi
movq    %rax, %rdi
callq   _objc_msgSend
```
Hopefully, you've got the hang of this by now; this is objc_msgSend(rax, @selector(name));. Return value in rax as usual.
movq %rax, %rdi and callq _objc_retainAutoreleasedReturnValue should be obvious now. objc_retainAutoreleasedReturnValue(obj); is inserted by ARC to keep the return value of the name method alive, since the temporary variable created invisibly by the Objective-C compiler to hold the value is implicitly declared __strong.
```
movq    %rax, %rdi
movq    %rax, -88(%rbp)         ## 8-byte Spill
callq   _MyFunction
movq    %rax, %rdi
callq   _objc_retainAutoreleasedReturnValue
```
Save the return value of name, copy it as the first parameter to MyFunction(), call MyFunction(), call objc_retainAutoreleasedReturnValue() on the return from it.
```
movq    %rax, -32(%rbp)
movq    -88(%rbp), %rax         ## 8-byte Reload
movq    %rax, %rdi
callq   _objc_release
```
Save the return value of MyFunction(). Then, reload the result of -[MyClass name], and call objc_release() on it, as ARC has noticed that it's no longer used.
```
movq    -32(%rbp), %rsi
movq    -48(%rbp), %rdi         ## 8-byte Reload
movb    $0, %al
callq   _NSLog
```
A simple call to NSLog(), with the only odd feature being the setting of al to zero. Because NSLog() is a variadic function, the calling convention specifies that al holds the number of vector registers used when calling it. No vector registers are used, so it's just set to zero.
movl $0, -4(%rbp) and movl $1, -36(%rbp) - I have to admit, I see no reason whatsoever for the compiler to toss a zero and a one onto what look rather like random parts of the stack, here or anywhere else in main(). Nothing like these values is used anywhere in the optimized version of the code. The store of zero at least gets used further down, but the store of a 1 seems entirely meaningless.
```
movq    -32(%rbp), %rdx
movq    %rdx, %rdi
callq   _objc_release
```
Release the return value of MyFunction() - you have set aside a sheet of paper to keep track of which values went in which offsets on the stack and in which registers, haven't you? If not, it'd be little wonder if you were a little lost by now.
```
movq    -24(%rbp), %rdx
movq    %rdx, %rdi
callq   _objc_release
```
Now release obj, the object of class MyClass that we allocated before.
```
movq    -80(%rbp), %rdi         ## 8-byte Reload
callq   _objc_autoreleasePoolPop
```
Reload the autorelease pool handle and pop it by calling objc_autoreleasePoolPop(). This is the code inserted by the closing brace } of the @autoreleasepool statement.
```
movl    -4(%rbp), %eax
addq    $96, %rsp
popq    %rbp
ret
```
Load the zero on the stack into eax as main's return value.
Restore the stack pointer to its original position when main was called.
Pop the original value of rbp off the stack and back into rbp.
Pop the address of the next instruction off the stack into rip, also known as returning from a subroutine call.
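
The instructions that open and close main mirror each other exactly, which is easier to see when the bookkeeping is written out. The C sketch below is a toy model under simplifying assumptions (a 512-byte stack, made-up struct and function names, no real machine state); it tracks only rsp, rbp, and rip through the prologue and epilogue shown above.

```c
#include <assert.h>
#include <stdint.h>

enum { STACK_WORDS = 64 };              /* 512 bytes of pretend stack */

typedef struct {
    uint64_t mem[STACK_WORDS];          /* stack memory, one slot per 8 bytes */
    uint64_t rsp, rbp, rip;             /* byte offsets and pretend addresses */
} toy_cpu;

/* pushq: move the stack top down 8 bytes, then store. */
void push64(toy_cpu *c, uint64_t value)
{
    c->rsp -= 8;
    c->mem[c->rsp / 8] = value;
}

/* popq: load from the stack top, then move it up 8 bytes. */
uint64_t pop64(toy_cpu *c)
{
    uint64_t value = c->mem[c->rsp / 8];
    c->rsp += 8;
    return value;
}

/* pushq %rbp ; movq %rsp, %rbp ; subq $96, %rsp */
void prologue(toy_cpu *c)
{
    push64(c, c->rbp);                  /* save the caller's base pointer */
    c->rbp = c->rsp;                    /* this frame's base */
    c->rsp -= 96;                       /* reserve space for locals and spills */
}

/* addq $96, %rsp ; popq %rbp ; ret */
void epilogue(toy_cpu *c)
{
    c->rsp += 96;                       /* discard locals and spills */
    c->rbp = pop64(c);                  /* restore the caller's base pointer */
    c->rip = pop64(c);                  /* ret: pop the return address into rip */
}
```

If the caller's callq is modelled as a push64 of the return address, then running prologue followed by epilogue leaves rsp and rbp exactly where they were before the call, with rip holding that return address.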

And that's main()! What a long-winded mess.

I must admit at this point that I went out of my way to make this function difficult to understand in one critical respect: I've been working from the unoptimized version of the code generated by the compiler. The code built with -Os is, surprisingly, much easier to understand, with a lot of redundant work completely eliminated and the registers managed much more efficiently. There's also almost no work done on the stack, since the compiler in optimizing mode is free to make use of a larger pool of scratch registers.

I did this because until you can understand the control flow of an unoptimized routine, there's no point in reading optimized code. Starting with the optimized code is a bit like learning to swim in water so shallow you can't even put your face under, except for those times when the compiler does something fantastically tricky to get a speed or size bonus, when it suddenly becomes rather like diving into the deep end of an Olympic pool.

Conclusion
That's the end of this article, but it's only part 1 in a series. Hopefully, you've enjoyed it so far; in part 2 I'll explore the rest of the methods in the sample code, as well as the optimized version of the code and the C runtime's start function.


Comments:

Dave at 2011-12-16 15:23:51:

My goodness this is fantastic. *bookmarked*

Steve Weller at 2011-12-16 15:51:06:

Please repeat for ARM!

BJ Homer at 2011-12-16 16:50:00:

This is very useful. I've been meaning to find a good primer on reading assembly for a while, and this couldn't be more appropriate.

But who in the world came up with these register names? Can we fire them?

mikeash at 2011-12-16 17:06:20:

Fire them, out of a cannon, into the sun, perhaps?

Jeroen at 2011-12-17 00:47:58:

Geesh that brings back the memories! (of the much nicer 68K and PPC though…)

Fire them at the moon so they go splat and remain there as a warning for the future.

Great stuff…

Alistair at 2011-12-17 10:28:10:

What a superb article, thanks for taking the time to write it. As someone else said, yes please, a version for ARM. That really took me back!

Gwynne Raskind at 2011-12-17 14:16:29:

Steve, Alistair: I'm not very familiar with the ARM architecture at the assembly language level, but I guess this is the perfect time to learn! I'll see what I can do about an ARM version once I've finished part 2, since it seems so popular an idea :).

Hugh Fisher at 2011-12-20 01:35:12:

BJ Homer, the register naming conventions on x86 are an evolutionary hangover from the dim, distant days of the 1970s.

Historically mainframe CPUs, like the IBM 360 which is still with us, had 16 or more general purpose registers numbered from 0. Minicomputers like the PDP-11 had fewer but still general purpose registers, also numbered from 0. You can/could see these influences in the PowerPC and Motorola M68K which used similar naming schemes, as did most RISC architectures.

The x86, though, evolved in a pure microprocessor environment. The first 8080 and then 8086 had so few transistors that every register had a unique purpose. You literally could only add numbers in the accumulator (AX) register, you could only use the string index (SI) register to fetch a byte at an offset from an address, and so on. Since there were so few registers and each was different it made sense to give them different names.

The 8 bit 8080 had an 8 bit accumulator while the 8086 was 16 bit. Intel wanted the 8086 to be largely source compatible - and that's assembler source compatible - so made it easy for 8 bit code to use AL. (Why AH, the top 8 bits of a 16 bit accumulator, exists is a mystery to me but presumably it had some purpose.)

With the 386 the architects finally had enough transistors to switch to general purpose registers where you could (almost) apply any operation to any register, improving both aesthetics and performance, but the hideous names had to stay for backwards compatibility. In the 1980s and even 1990s PC programs were still often written in assembly language.

MMX and SSE, being post 386, got general purpose registers from the start and reg#N names. When AMD extended the x86 architecture to 64 bit addressing they also doubled the number of registers and named them R8 to R15.

Computer architecture textbooks in the 1980s and 1990s didn't use the x86 for examples because it was so kludgy, preferring the cleaner 68K/RISC designs. It says something about elegance vs practicality that the x86 is still with us and those others mostly aren't :-(

Scott Little at 2011-12-20 13:11:20:

Great stuff! I was thinking that this type of thing would be great to see after Mike's previous article about getting to the Assembly. I do a lot of hacking and swizzling in Mail and am often looking at the assembly of Mail.app and Message.framework, but mostly I don't understand.

Thanks for this primer and I'm looking forward to Part 2!

vczilla at 2011-12-31 07:30:43:

I'd hate to be an architecture 'nazi' but I couldn't help but notice a mistake concerning memory in the article.

It's stated that all physical addresses are 48 bit wide and that bits 48-63 are all zeroes.

In fact this is only true for linear addresses (which are addresses before paging translation) and only when the processor is a 64 bit processor in compatibility mode.

Physical addresses are 52 bits (and this is implementation dependent).

When the processor is in 64-bit mode all linear addresses must be in what is called a canonical form.

Imagine a 48-bit logic address as signed int which is sign extended to 64 bit when stored.

Or, said otherwise, for that address bits 48 to 63 are a copy of bit 47.

In general, kernel implementations on different operating systems occupy the negative half of the linear address space.

It's a forward looking compatibility feature.
That way if in a future implementation they widen physical or linear addresses everything will be at the same relative offset from zero.
