Movatterモバイル変換

Posted at 2017-07-01 04:23 |RSS feed (Full text feed) |Blog Index
Next article:Friday Q&A 2017-07-14: Swift.Codable
Previous article:More Advanced Swift Workshop, and Blog and Book Updates
Tags:assembly fridayqna objectivec

Friday Q&A 2017-06-30: Dissecting objc_msgSend on ARM64

byMike Ash

We're back! During the week of WWDC, I spoke at CocoaConf Next Door, and one of my talks involved a dissection ofobjc_msgSend's ARM64 implementation. I thought that turning it into an article would make for a nice return to blogging for Friday Q&A.

Overview
Every Objective-C object has a class, and every Objective-C class has a list of methods. Each method has a selector, a function pointer to the implementation, and some metadata. The job ofobjc_msgSend is to take the object and selector that's passed in, look up the corresponding method's function pointer, and then jump to that function pointer.

Looking up a method can be extremely complicated. If a method isn't found on a class, then it needs to continue searching in the superclasses. If no method is found at all, then it needs to call into the runtime's message forwarding code. If this is the very first message being sent to a particular class, then it has to call that class's+initialize method.

Looking up a method also needs to be extremely fast in the common case, since it's done for every method call. This, of course, is in conflict with the complicated lookup process.

Objective-C's solution to this conflict is the method cache. Each class has a cache which stores methods as pairs of selectors and function pointers, known in Objective-C asIMPs. They're organized as a hash table so lookups are fast. When looking up a method, the runtime first consults the cache. If the method isn't in the cache, it follows the slow, complicated procedure, and then places the result into the cache so that the next time can be fast.

objc_msgSend is written in assembly. There are two reasons for this: one is that it's not possible to write a function which preserves unknown arguments and jumps to an arbitrary function pointer in C. The language just doesn't have the necessary features to express such a thing. The other reason is that it's extremely important forobjc_msgSend to be fast, so every last instruction of it is written by hand so it can go as fast as possible.

Naturally, you don't want to write the whole complicated message lookup procedure in assembly langauge. It's not necessary, either, because things are going to be slow no matter what the moment you start going through it. The message send code can be divided into two parts: there's thefast path inobjc_msgSend itself, which is written in assembly, and theslow path implemented in C. The assembly part looks up the method in the cache and jump to it if it's found. If the method is not in the cache, then it calls into the C code to handle things.

Therefore, when looking atobjc_msgSend itself, it does the following:

Get the class of the object passed in.
Get the method cache of that class.
Use the selector passed in to look up the method in the cache.
If it's not in the cache, call into the C code.
Jump to theIMP for the method.

How does it do all of that? Let's see!

Instruction by Instruction
objc_msgSend has a few different paths it can take depending on circumstances. It has special code for handling things like messages tonil, tagged pointers, and hash table collisions. I'll start by looking at the most common, straight-line case where a message is sent to a non-nil, non-tagged pointer and the method is found in the cache without any need to scan. I'll note the various branching-off points as we go through them, and then once we're done with the common path I'll circle back and look at all of the others.

I'll list each instruction or group of instructions followed by a description of what it does and why. Just remember to lookup to find the instruction any given piece of text is discussing.

Each instruction is preceded by its offset from the beginning of the function. This serves as a counter, and lets you identify jump targets.

ARM64 has 31 integer registers which are 64 bits wide. They're referred to with the notationx0 throughx30. It's also possible to access the lower 32 bits of each register as if it were a separate register, usingw0 throughw30. Registersx0 throughx7 are used to pass the first eight parameters to a function. That means thatobjc_msgSend receives theself parameter inx0 and the selector_cmd parameter inx1.

Let's begin!

0x0000cmpx0,#0x00x0004b.le0x6c

This performs a signed comparison ofself with0 and jumps elsewhere if the value is less than or equal to zero. A value of zero isnil, so this handles the special case of messages to nil. This also handlestagged pointers. Tagged pointers on ARM64 are indicated by setting the high bit of the pointer. (This is an interesting contrast with x86-64, where it's the low bit.) If the high bit is set, then the value is negative when interpreted as a signed integer. For the common case ofself being a normal pointer, the branch is not taken.

0x0008ldrx13,[x0]

This loadsself'sisa by loading the 64-bit quantity pointed to byx0, which containsself. Thex13 register now contains theisa.

0x000candx16,x13,#0xffffffff8

ARM64 can usenon-pointer isas. Traditionally theisa points to the object's class, but non-pointerisa takes advantage of spare bits by cramming some other information into theisa as well. This instruction performs a logical AND to mask off all the extra bits, and leaves the actual class pointer inx16.

0x0010ldpx10,x11,[x16,#0x10]

This is my favorite instruction inobjc_msgSend. It loads the class's cache information intox10 andx11. Theldp instruction loadstwo registers' worth of data from memory into the registers named in the first two arguments. The third argument describes where to load the data, in this case at offset16 fromx16, which is the area of the class which holds the cache information. The cache itself looks like this:

typedefuint32_tmask_t;structcache_t{structbucket_t*_buckets;mask_t_mask;mask_t_occupied;}

Following theldp instruction,x10 contains the value of_buckets, andx11 contains_occupied in its high 32 bits, and_mask in its low 32 bits.

_occupied specifies how many entries the hash table contains, and plays no role inobjc_msgSend._mask is important: it describes the size of the hash table as a convenient AND-able mask. Its value is always a power of two minus 1, or in binary terms something that looks like000000001111111 with a variable number of 1s at the end. This value is needed to figure out the lookup index for a selector, and to wrap around the end when searching the table.

0x0014andw12,w1,w11

This instruction computes the starting hash table index for the selector passed in as_cmd.x1 contains_cmd, sow1 contains the bottom 32 bits of_cmd.w11 contains_mask as mentioned above. This instruction ANDs the two together and places the result intow12. The result is the equivalent of computing_cmd % table_size but without the expensive modulo operation.

0x0018addx12,x10,x12,lsl#4

The index is not enough. To start loading data from the table, we need the actual address to load from. This instruction computes that address by adding the table index to the table pointer. It shifts the table index left by4 bits first, which multiplies it by16, because each table bucket is16 bytes.x12 now contains the address of the first bucket to search.

0x001cldpx9,x17,[x12]

Our friendldp makes another appearance. This time it's loading from the pointer inx12, which points to the bucket to search. Each bucket contains a selector and anIMP.x9 now contains the selector for the current bucket, andx17 contains theIMP.

0x0020cmpx9,x10x0024b.ne0x2c

These instructions compare the bucket's selector inx9 with_cmd inx1. If they're not equal then this bucket does not contain an entry for the selector we're looking for, and in that case the second instruction jumps to offset0x2c, which handles non-matching buckets. If the selectors do match, then we've found the entry we're looking for, and execution continues with the next instruction.

0x0028brx17

This performs an unconditional jump tox17, which contains theIMP loaded from the current bucket. From here, execution will continue in the actual implementation of the target method, and this is the end ofobjc_msgSend's fast path. All of the argument registers have been left undisturbed, so the target method will receive all passed in arguments just as if it had been called directly.

When everything is cached and all the stars align, this path can execute in less than 3 nanoseconds on modern hardware.

That's the fast path, how about the rest of the code? Let's continue with the code for a non-matching bucket.

0x002ccbzx9,__objc_msgSend_uncached

x9 contains the selector loaded from the bucket. This instruction compares it with zero and jumps to__objc_msgSend_uncached if it's zero. A zero selector indicates an empty bucket, and an empty bucket means that the search has failed. The target method isn't in the cache, and it's time to fall back to the C code that performs a more comprehensive lookup.__objc_msgSend_uncached handles that. Otherwise, the bucket doesn't match but isn't empty, and the search continues.

0x0030cmpx12,x100x0034b.eq0x40

This instruction compares the current bucket address inx12 with the beginning of the hash table inx10. If they match, it jumps to code that wraps the search back to the end of the hash table. We haven't seen it yet, but the hash table search being performed here actually runs backwards. The search examines decreasing indexes until it hits the beginning of the table, then it starts over at the end. I'm not sure why it works this way rather than the more common approach of increasing addresses that wrap to the beginning, but it's a safe bet that it's because it ends up being faster this way.

Offset0x40 handles the wraparound case. Otherwise, execution proceeds to the next instruction.

0x0038ldpx9,x17,[x12,#-0x10]!

Anotherldp, once again loading a cache bucket. This time, it loads from offset0x10 to the address of the current cache bucket. The exclamation point at the end of the address reference is an interesting feature. This indicates a register write-back, which means that the register is updated with the newly computed value. In this case, it's effectively doingx12 -= 16 in addition to loading the new bucket, which makesx12 point to that new bucket.

0x003cb0x20

Now that the new bucket is loaded, execution can resume with the code that checks to see if the current bucket is a match. This loops back up to the instruction labeled0x0020 above, and runs through all of that code again with the new values. If it continues to find non-matching buckets, this code will keep running until it finds a match, an empty bucket, or hits the beginning of the table.

0x0040addx12,x12,w11,uxtw#4

This is the target for when the search wraps.x12 contains a pointer to the current bucket, which in this case is also the first bucket.w11 contains the table mask, which is the size of the table. This adds the two together, while also shiftingw11 left by 4 bits, multiplying it by16. The result is thatx12 now points to the end of the table, and the search can resume from there.

0x0044ldpx9,x17,[x12]

The now-familiarldp loads the new bucket intox9 andx17.

0x0048cmpx9,x10x004cb.ne0x540x0050brx17

This code checks to see if the bucket matches and jumps to the bucket'sIMP. It's a duplicate of the code at0x0020 above.

0x0054cbzx9,__objc_msgSend_uncached

Just like before, if the bucket is empty then it's a cache miss and execution proceeds into the comprehensive lookup code implemented in C.

0x0058cmpx12,x100x005cb.eq0x68

This checks for wraparoundagain, and jumps to0x68 if we've hit the beginning of the table a second time. In this case, it jumps into the comprehensive lookup code implemented in C:

0x0068b__objc_msgSend_uncached

This is something that should never actually happen. The table grows as entries are added to it, and it's never 100% full. Hash tables become inefficient when they're too full because collisions become too common.

Why is this here? A comment in the source code explains:

Clone scanning loop to miss instead of hang when cache is corrupt. The slow path may detect any corruption and halt later.

I doubt that this is common, but evidently the folks at Apple have seen memory corruption which caused the cache to be filled with bad entries, and jumping into the C code improves the diagnostics.

The existence of this check should have minimal impact on code that doesn't suffer from this corruption. Without it, the original loop could be reused, which would save a bit of instruction cache space, but the effect is minimal. This wraparound handler is not the common case anyway. It will only be invoked for selectors that get sorted near the beginning of the hash table, and then only if there's a collision and all the prior entries are occupied.

0x0060ldpx9,x17,[x12,#-0x10]!0x0064b0x48

The remainder of this loop is the same as before. Load the next bucket intox9 andx17, update the bucket pointer inx12, and go back to the top of the loop.

That's the end of the main body ofobjc_msgSend. What remains are special cases fornil and tagged pointers.

Tagged Pointer Handler
You'll recall that the very first instructions checked for those and jumped to offset0x6c to handle them. Let's continue from there:

0x006cb.eq0xa4

We've arrived here becauseself is less than or equal to zero. Less than zero indicates a tagged pointer, and zero isnil. The two cases are handled completely differently, so the first thing the code does here is check to see whetherself isnil or not. Ifself is equal to zero then this instruction branches to0xa4, which is where thenil handler lives. Otherwise, it's a tagged pointer, and execution continues with the next instruction.

Before we move on, let's briefly discuss how tagged pointers work. Tagged pointers support multiple classes. The top four bits of the tagged pointer (on ARM64) indicate which class the "object" is. They are essentially the tagged pointer's isa. Of course, four bits isn't nearly enough to hold a class pointer. Instead, there's a special table which stores the available tagged pointer classes. The class of a tagged pointer "object" is found by looking up the index in that table which corresponds to the top four bits.

This isn't the whole story. Tagged pointers (at least on ARM64) also support extended classes. When the top four bits are all set to1 the next eight bits are used to index into an extended tagged pointer class table. This allows the runtime to support more tagged pointer classes, at the cost of having less storage for them.

Let's continue.

0x0070movx10,#-0x1000000000000000

This setsx10 to an integer value with the top four bits set and all other bits set to zero. This will serve as a mask to extract the tag bits fromself.

0x0074cmpx0,x100x0078b.hs0x90

This checks for an extended tagged pointer. Ifself is greater than or equal to the value inx10, then that means the top four bits are all set. In that case, branch to0x90 which will handle extended classes. Otherwise, use the primary tagged pointer table.

0x007cadrpx10,_objc_debug_taggedpointer_classes@PAGE0x0080addx10,x10,_objc_debug_taggedpointer_classes@PAGEOFF

This little song and dance loads the address of_objc_debug_taggedpointer_classes, which is the primary tagged pointer table. ARM64 requires two instructions to load the address of a symbol. This is a standard technique on RISC-like architectures. Pointers on ARM64 are 64 bits wide, and instructions are only 32 bits wide. It's not possible to fit an entire pointer into one instruction.

x86 doesn't suffer from this problem, since it has variable-length instructions. It can just use a 10-byte instruction, where two bytes identify the instruction itself and the target register, and eight bytes hold the pointer value.

On a machine with fixed-length instructions, you load the value in pieces. In this case, only two pieces are needed. Theadrp instruction loads the top part of the value, and theadd then adds in the bottom part.

0x0084lsrx11,x0,#60

The tagged class index is in the top four bits ofx0. To use it as an index, it has to be shifted right by60 bits so it becomes an integer in the range0-15. This instruction performs that shift and places the index intox11.

0x0088ldrx16,[x10,x11,lsl#3]

This uses the index inx11 to load the entry from the table thatx10 points to. Thex16 register now contains the class of this tagged pointer.

0x008cb0x10

With the class inx16, we can now branch back to the main code. The code starting with offset0x10 assumes that the class pointer is loaded intox16 and performs dispatch from there. The tagged pointer handler can therefore just branch back to that code rather than duplicating logic here.

0x0090adrpx10,_objc_debug_taggedpointer_ext_classes@PAGE0x0094addx10,x10,_objc_debug_taggedpointer_ext_classes@PAGEOFF

The extended tagged class handler looks similar. These two instructions load the pointer to the extended table.

0x0098ubfxx11,x0,#52,#8

This instruction loads the extended class index. It extracts8 bits starting from bit52 inself intox11.

0x009cldrx16,[x10,x11,lsl#3]

Just like before, that index is used to look up the class in the table and load it intox16.

0x00a0b0x10

With the class inx16, it can branch back into the main code.

That's nearly everything. All that remains is thenil handler.

nil Handler
Finally we get to thenil handler. Here it is, in its entirety.

0x00a4movx1,#0x00x00a8movid0,#00000000000000000x00acmovid1,#00000000000000000x00b0movid2,#00000000000000000x00b4movid3,#00000000000000000x00b8ret

Thenil handler is completely different from the rest of the code. There's no class lookup or method dispatch. All it does fornil is return0 to the caller.

This task is a bit complicated by the fact thatobjc_msgSend doesn't know what kind of return value the caller expects. Is this method returning one integer, or two, or a floating-point value, or nothing at all?

Fortunately, all of the registers used for return values can be safely overwritten even if they're not being used for this particular call's return value. Integer return values are stored inx0 andx1 and floating point return values are stored in vector registersv0 throughv3. Multiple registers are used for returning smallerstructs.

This code clearsx1 andv0 throughv3. Thed0 throughd3 registers refer to the bottom half of the correspondingv registers, and storing into them clears the top half, so the effect of the fourmovi instructions is to clear those four registers. After doing this, it returns control to the caller.

You might wonder why this code doesn't clearx0. The answer to that is simple:x0 holdsself which in this case isnil, so it's already zero! You can save an instruction by not clearingx0 since it already holds the value we want.

What about largerstruct returns that don't fit into registers? This requires a little cooperation from the caller. Largestruct returns are performed by having the caller allocate enough memory for the return value, and then passing the address of that memory inx8. The function then writes to that memory to return a value.objc_msgSend can't clear this memory, because it doesn't know how big the return value is. To solve this, the compiler generates code which fills the memory with zeroes before callingobjc_msgSend.

That's the end of thenil handler, and ofobjc_msgSend as a whole.

Conclusion
It's always interesting to dive into framework internals.objc_msgSend in particular is a work of art, and delightful to read through.

That's it for today. Come back next time for more squishy goodness. Friday Q&A is driven by reader input, so if you have something you'd like to see discussed here,send it in!

Did you enjoy this article? I'm selling whole books full of them! Volumes II and III are now out! They're available as ePub, PDF, print, and on iBooks and Kindle.Click here for more information.

Comments:

Greg Parkerat2017-07-01 06:39:16:

The search examines decreasing indexes until it hits the beginning of the table, then it starts over at the end. I'm not sure why it works this way rather than the more common approach of increasing addresses that wrap to the beginning, but it's a safe bet that it's because it ends up being faster this way.

Incrementing to the end of the cache requires an extra instruction or two to calculate where the end of the cache is. The start of the cache is already known - it's the pointer we loaded from the class - so we decrement towards that.

Clone scanning loop to miss instead of hang when cache is corrupt. The slow path may detect any corruption and halt later.

I doubt that this is common, but evidently the folks at Apple have seen memory corruption which caused the cache to be filled with bad entries, and jumping into the C code improves the diagnostics.

The extra scanned-twice check prevents power-draining infinite loops in some cases of memory corruption or invalid objects. For example, heap corruption could fill the cache with non-zero data, or set the cache mask to zero. Corruption like this would otherwise cause the cache scan loop to run forever without a cache hit or a cache miss. The extra check stops the loop so we can turn the problem into a crash log instead.

There are also cases where another thread simultaneously modifying the cache can cause this thread to neither hit nor miss on the first scan. The C code does extra work to resolve that race. A previous version ofobjc_msgSend handled this incorrectly - it immediately aborted instead of falling back to the C code - which caused rare crashes when the threads were unlucky.

LeoNatanat2017-07-01 09:23:31:

"If this is the very first message being sent to a particular class, then it has to call that class's +load method."

Don't you mean the +initialize method? The +load method is called during the image load phase.

Jean-Danielat2017-07-01 09:27:29:

If this is the very first message being sent to a particular class, then it has to call that class's +load method.

I though +load was unconditionally called at initialization time (even before main(), when metadata are loaded) and that it was the purpose of +initialize to be called before the first objc_msgSend ?

Isn't it the case on ARM64 ?

Matthew Selfat2017-07-01 17:04:35:

Thanks for the analysis!

I wrote an assembly version of objc_msgSend() for the (then new) Motorola 68040 as part of NeXT's "performance month". NeXTSTEP 2.0 was going to be noticeably slower than 1.0 since there was now more code, but no more RAM than before. So the whole SW department worked on performance tweeks to bring in back in line.

The 68040 assembly version was ~40% faster than the previous version, but it's surprising how little difference that made on the memory-bound systems at the time. The code was assentially the same as the version you analyzed here!

One twist that isn't clear from the objc_msgSend() code is that the cache also includes "negative" cache values to record cache misses. It turns out that there is a lot of code that calls respondsToSelector with the result NO (mostly as part of the responder hierarchy). If you don't cache these failed lookups, you spend more time in the misses than the hits.

mikeashat2017-07-01 17:06:31:

Greg Parker: Thanks very much for the inside scoop.

LeoNatan andJean-Daniel: The+load thing was my brain momentarily failing. I corrected it to+initialize Thanks for pointing that out!

Jens Alfkeat2017-07-01 17:09:25:

Thanks for this. Good assembly code is like poetry; you have to put effort into studying it line by line, but it repays the attention.

I’ve never sat down and learned ARM the way I learned PPC back in the day, but it does look like a very nice instruction set.

I went and read the linked article about non-pointer isas, but I still have a few questions. Are you planning on writing on that topic?

mikeashat2017-07-01 17:49:19:

Jens Alfke: I thought Greg Parker's article on non-pointer isas covered it pretty well, but I'm open to writing something myself if there's more to be said. What unanswered questions do you have about them?

Blaine Garstat2017-07-01 20:24:05:

Mike, you missed explaining the greatest secret of the messenger, one that Matt and Dave and I cooked up on the black couches at Seaport: how it is that multiple clients can read a mutable hash table without synchronization. It stumps people to this day. The answer is that the messenger uses a specialized garbage collector! I had put locks in when we first tried to do multi-threaded ObjC and it was miserable and then I read a paper by David Black where the kernel would reset the PC back to the start of a critical range for a mutex, and, well, we use the absense of a PC in critical ranges to deduce safety of reclaiming stale caches.

Rick Roeat2017-07-01 22:13:35:

Even having read Greg’s article, I’m curious to hear more about non-pointer isa. That article is added 2013, so I bet there’s all sorts of Swift bridging fun in there now?

mikeashat2017-07-02 00:51:21:

Blaine Garst: I actually wrote about that before! See here:

https://mikeash.com/pyblog/friday-qa-2015-05-29-concurrent-memory-deallocation-in-the-objective-c-runtime.html

It's an amazing technique and remains one of my favorite programming-related things. The way it takes advantage of the asymmetry between reading and writing is just fantastic. My hat goes off to the three of you for coming up with it.

mahongqiat2018-12-23 16:50:30:

Can i translate this article and post in may blog?

mahongqiat2018-12-23 16:52:02:

Can i translate this article to chinese and post in my blog?

Comments RSS feed for this page

Add your thoughts, post a comment:

Spam and off-topic posts will be deleted without notice. Culprits may be publicly humiliated at my sole discretion.

Code syntax highlighting thanks toPygments.

Name:
The Answer to the Ultimate Question of Life, the Universe, and Everything?
Comment:
	Formatting:`<i> <b> <blockquote> <code>`.
	NOTE: Due to an increase in spam, URLs are forbidden! Please provide search terms or fragment your URLs so they don't look like URLs.