A New Bytecode Format for JavaScriptCore
In revision r237547 we introduced a new bytecode format for JavaScriptCore (JSC). The goals of the new format were to improve memory usage and allow the bytecode to be cached on disk; the previous format was optimized for interpreter throughput at the cost of memory usage.
In this post, we will start with a quick overview of JSC’s bytecode, key aspects of the old bytecode format and the optimizations it enabled. Next, we will look into the new format and how it affects interpreter execution. Finally, we will look at the impact of the new format on memory usage and performance and how this rewrite improved type safety in JavaScriptCore.
Background
Before JSC executes any JavaScript code, it must lex, parse and generate bytecode for it. JSC has 4 tiers of execution:
- Low Level Interpreter (LLInt): the start-up interpreter
- Baseline JIT: a template JIT
- DFG JIT: a low-latency optimizing compiler
- FTL JIT: a high-throughput optimizing compiler
Execution starts by interpreting the bytecode at the lowest tier, and as the code gets executed more, it gets promoted to a higher tier. This is described in a lot more detail in this blog post about the FTL.
The bytecode is the source of truth throughout the whole engine. The LLInt executes the bytecode. The baseline is a template JIT, which emits snippets of machine code for each bytecode instruction. Finally, the DFG and FTL parse the bytecode and emit DFG IR, which is then run through an optimizing compiler.
Because the bytecode is the source of truth, it tends to stay alive in memory throughout the whole program execution. In JavaScript-heavy websites, such as Facebook or Reddit, the bytecode is responsible for 20% of the overall memory usage.
The Bytecode
To make things more concrete, let’s look at a simple JavaScript program, learn how to inspect the bytecode generated by JSC and how to interpret the bytecode dump.
// double.js
function double(a) {
    return a + a;
}
double(2);

If you run the above program with jsc -d double.js, JSC will dump all the bytecode it generates to stderr. The bytecode dump will contain the bytecode generated for double:
[ 0] enter
[ 1] get_scope loc4
[ 3] mov loc5, loc4
[ 6] check_traps
[ 7] add loc7, arg1, arg1, OperandTypes(126, 126)
[ 13] ret loc7

Each line starts with the offset of the instruction in brackets, followed by the opcode name and its operands. Here we can see the operands loc for local variables, arg for function arguments, and OperandTypes, which is metadata about the predicted types of the arguments.
Old Bytecode Format
The old bytecode format had a few issues that we wanted to fix:
- It used too much memory.
- The instruction stream was writable, which prevented memory-mapping the bytecode stream.
- It had optimizations that we no longer benefited from, such as direct threading.
In order to better understand how we addressed these issues in the new format, we need a basic understanding of the old bytecode format. In the old format, instructions could be in one of two forms: unlinked, which is compact and optimized for storage, and linked, which is inflated and optimized for execution, containing memory addresses of runtime objects directly in the instruction stream.
Unlinked Instructions
The instructions were encoded using variable-width encoding. The opcode and each operand took as little space as possible, ranging from 1 to 5 bytes. Take the add instruction from the program above as an example: it took 6 bytes, one for the opcode (add), one for each of the registers (loc7, arg1 and arg1 again) and two for the operand types.
[Figure: the add instruction in the unlinked format, taking 6 bytes]
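To make the decoding concrete, here is a minimal C++ sketch of how a variable-width operand might have been read. The escape value (0xFF) and the helper's shape are assumptions for illustration only; the real decoder handled more cases.

#include <cstdint>
#include <cstring>

// Hypothetical decoder for the old variable-width encoding: values that
// fit in one byte are stored as-is; a reserved escape byte (0xFF here,
// an assumption) signals that the full 4-byte value follows.
int32_t decodeOperand(const uint8_t*& pc)
{
    uint8_t head = *pc++;
    if (head != 0xFF)
        return head; // 1-byte operand
    int32_t value;
    std::memcpy(&value, pc, sizeof(value)); // 5 bytes total: escape + payload
    pc += sizeof(value);
    return value;
}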
Linking / Linked Instructions
Before the bytecode could be executed, it needed to be linked. Linking inflated all the instructions, making the opcode and each of the operands pointer-sized. The opcode was replaced by the actual pointer to the implementation of the instruction, and the profiling-related metadata was replaced with the memory address of newly allocated profiling data structures. The add from above took 40 bytes to represent:
[Figure: the add instruction in the linked format, taking 40 bytes]
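The 40 bytes follow directly from the slot layout. Here is an illustrative C++ struct (hypothetical, not JSC’s actual type) showing why the linked add occupies five pointer-sized slots on a 64-bit machine:

#include <cstdint>

// Illustrative only: in the linked encoding every slot was pointer-sized,
// so the five slots of add occupy 5 * 8 = 40 bytes on 64-bit.
struct LinkedAdd {
    void* handler;           // pointer to the offlineasm implementation of add
    intptr_t dst;            // loc7
    intptr_t lhs;            // arg1
    intptr_t rhs;            // arg1
    void* profilingMetadata; // replaces OperandTypes with a pointer to
                             // newly allocated profiling data structures
};
static_assert(sizeof(LinkedAdd) == 5 * sizeof(void*), "five pointer-sized slots");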
Execution
The bytecode is executed by the LLInt, which is written in a portable assembly called offlineasm. Here are a few notes about offlineasm that may help understand the code snippets that will follow:
- Temporary registers are t0-t5, argument registers are a0-a3, and return registers are r0 and r1. For floating point, the equivalent registers are ft0-ft5, fa0-fa3 and fr. cfp and sp are special registers that hold the call frame and stack pointer respectively.
- Instructions have one of the following suffixes: b for byte, h for 16-bit word, i for 32-bit word, q for 64-bit word and p for pointer (either 32- or 64-bit depending on the architecture).
- Macros are lambda expressions that take zero or more arguments and return code. Macros may be named or anonymous and can be passed as arguments to other macros.
If you want to learn more about offlineasm, LowLevelInterpreter.asm is the main offlineasm file in JSC and it contains a more in-depth explanation of the language at the top.
The old bytecode format had two big advantages related to interpreter throughput: direct threading and inline caching.
Direct Threading
The linked bytecode had the actual pointer to the offlineasm implementation of the instruction in place of the opcode, which made executing the next instruction as simple as advancing the program counter (PC) by the size of the current instruction and making an indirect jump. That is illustrated in the following offlineasm code:
macro dispatch(instructionSize)
    addp instructionSize * PtrSize, PC
    jmp [PC]
end

Notice that dispatch is a macro; this way, the code is duplicated at the end of every instruction. This is very efficient, since it’s just an addition plus an indirect branch. The duplication also reduces branch prediction pollution, since we don’t have all instructions jumping to a common label and sharing the same indirect jump to the next instruction.
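For readers more comfortable with C++ than offlineasm, here is a rough equivalent of direct threading using the GCC/Clang labels-as-values extension (not standard C++). The handler names and the instruction length are made up for illustration:

#include <cstddef>

// Direct threading sketch: handler addresses live directly in the linked
// instruction stream, so dispatch is a pointer advance plus an indirect
// jump, duplicated at the end of every handler.
void run(void** pc)
{
    constexpr size_t addLength = 5; // opcode + 4 operands, all pointer-sized
    goto **pc; // dispatch to the first instruction

op_add:
    // ... perform the addition ...
    pc += addLength;
    goto **pc; // per-handler dispatch: one addition + one indirect branch

op_ret:
    // ... write the return value ...
    return;
}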
Inline Caching
Since the instruction stream is writable and all the arguments are pointer-sized, we can store metadata in the instruction stream itself. The best example of this is for the get_by_id instruction, which is emitted when we load from an object in JavaScript.
object.field

This emits a get_by_id for loading the property field from object. Since this is one of the most common operations in JavaScript, it’s crucial that it be fast. JSC uses inline caching as a way of speeding this up. The way this was done in the interpreter was by reserving space in the instruction stream to cache metadata about the load. More specifically, we recorded the StructureID of the object we are loading from and the memory offset from which we have to load the value. The LLInt implementation of get_by_id looked like the following:
_llint_op_get_by_id:
    // Read operand 2 from the instruction stream as a signed integer,
    // i.e. the virtual register of `object`
    loadisFromInstruction(2, t0)

    // Load `object` from the stack and its structureID
    loadConstantOrVariableCell(t0, t3, .opGetByIdSlow)
    loadi JSCell::m_structureID[t3], t1

    // Read the cached StructureID and verify that it matches the object's
    loadisFromInstruction(4, t2)
    bineq t2, t1, .opGetByIdSlow

    // Read the PropertyOffset of `field` and load the actual value from it
    loadisFromInstruction(5, t1)
    loadPropertyAtVariableOffset(t1, t3, t0)

    // Read the virtual register where the result should be stored and store
    // the value to it
    loadisFromInstruction(1, t2)
    storeq t0, [cfr, t2, 8]
    dispatch(constexpr op_get_by_id_length)

.opGetByIdSlow:
    // Jump to C++ slow path
    callSlowPath(_llint_slow_path_get_by_id)
    dispatch(constexpr op_get_by_id_length)

New Bytecode Format
When designing the new bytecode, we had two major goals: it should be more compact and easily cacheable on disk. With these two goals, we expected significant improvements in memory usage and set out to improve runtime performance through caching. This shaped how we encoded instructions as well as runtime metadata.
The first and biggest change is that we no longer have a separate linked encoding for execution. This immediately means that the bytecode can no longer be direct threaded, since the address of an instruction’s handler changes with every program invocation and therefore cannot be stored to disk. Removing this optimization was a deliberate design choice.
In order to make the single format suitable for both storage and execution, each instruction can be encoded as either narrow or wide.
Narrow Instructions
In a narrow instruction, the opcode and its operands each take 1 byte. Here’s the add instruction again, this time as a narrow instruction in the new format:
[Figure: the add instruction encoded as a narrow instruction]
Ideally, all instructions would be encoded as narrow, but not all operands fit into a single byte. If any of the operands (or the opcode for that matter) requires more than 1 byte, the whole instruction is promoted to wide.
Wide Instructions
A wide instruction consists of a special 1-byte opcode, op_wide, followed by a series of 4-byte slots for the original opcode and each of its arguments.
[Figure: the add instruction encoded as a wide instruction]
An important thing to notice here is that it is not possible to have a single wide operand: if one operand overflows, the whole instruction becomes wide. This is an important compromise, because even though it might seem wasteful at first, it makes the implementation much simpler: the offset of any given operand is the same regardless of whether the instruction is narrow or wide. The only difference is whether the instruction stream is treated as an array of 4-byte values or 1-byte values.
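The following C++ sketch shows how an operand accessor can use the same logical index for both encodings; the exact offsets and the helper itself are assumptions based on the description above:

#include <cstdint>
#include <cstring>

// Hypothetical accessor: operand i has the same logical index in both
// encodings; only the slot width and the base offset differ.
int32_t operandAt(const uint8_t* pc, int i, bool isWide)
{
    if (!isWide)
        return static_cast<int8_t>(pc[1 + i]); // 1-byte opcode, then 1-byte slots
    // Wide: 1-byte op_wide prefix, then 4-byte slots for the opcode and operands.
    int32_t value;
    std::memcpy(&value, pc + 1 + 4 * (1 + i), sizeof(value));
    return value;
}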
Linking / Metadata Table
The other fundamental part of the new bytecode is the metadata table. During linking, instead of constructing a new instruction stream, we initialize a side table with all the writable data associated with any given instruction. Conceptually, the table is 2-dimensional: its first index is the opcode of the instruction, e.g. add, which points to a monomorphic array of metadata entries for that specific instruction. For example, we have a struct called OpAdd::Metadata to hold the metadata for the add instruction, so accessing metadataTable[op_add] will result in a pointer of type OpAdd::Metadata*. Additionally, each instruction has a special operand, metadataID, which is used as the second index in the metadata table.
[Figure: conceptual view of the 2-dimensional metadata table]
For compactness, the metadata table is laid out as a single contiguous block of memory. Here’s a representation of what the table above would actually look like in memory.
[Figure: in-memory layout of the metadata table, header followed by payload]
At the start of the table is the header, an array containing an integer for each opcode representing the offset from the start of the table where the metadata for that opcode starts. The next region is the payload, which contains the actual metadata for each opcode.
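Putting the header and payload together, a metadata lookup boils down to two indexing steps. Here is a simplified C++ sketch, with names and types that are assumptions based on the description above:

#include <cstdint>

constexpr unsigned numOpcodeIDs = 256; // hypothetical opcode count

// Simplified layout: per-opcode byte offsets into the payload, followed
// immediately by the payload itself.
struct MetadataTable {
    uint32_t header[numOpcodeIDs]; // offset where each opcode's entries start

    template<typename Metadata>
    Metadata& get(unsigned opcodeID, unsigned metadataID)
    {
        auto* base = reinterpret_cast<uint8_t*>(this);
        auto* entries = reinterpret_cast<Metadata*>(base + header[opcodeID]);
        return entries[metadataID]; // monomorphic array for this opcode
    }
};
// Usage sketch: table.get<OpAdd::Metadata>(op_add, bytecode.metadataID)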
Execution
The biggest changes to execution are that the interpreter is now indirect threaded, and for any writable or runtime metadata, we need to read from the metadata table. Another interesting aspect is how wide instructions are executed.
Indirect Threading
The process of mapping from opcode to instruction address, which used to be done as part of linking in the old format, is now part of the interpreter dispatch. This means that extra loads are required, and the dispatch macro now looks like the following:
macro dispatch(instructionSize)
    addp instructionSize, PC
    loadb [PC], t0
    leap _g_opcodeMap, t1
    jmp [t1, t0, PtrSize]
end

Metadata Table
The same kind of “inline caching”† still applies to get_by_id, but given that we need to write the metadata at runtime and the instruction stream is now read-only, this data needs to live in the metadata table.
Loads from the metadata table are a bit costly, since the CodeBlock needs to be loaded from the call frame, the metadata table from the CodeBlock, the array for the current opcode from the table, and then finally the metadata entry from the array, based on the current instruction’s metadataID.
† I write “inline caching” in quotes here since it’s not technically inline anymore.
Wide Instruction Execution
The execution of wide instructions is surprisingly simple: there are two functions for each instruction, one for the narrow version and another for the wide. We take advantage of offlineasm to generate the two functions from a single implementation (e.g. the implementation of op_add will automatically generate its wide version, op_add_wide).
By default, the narrow version of an instruction is executed, until the special instruction op_wide is executed. All it does is read the next opcode and dispatch to its wide version, e.g. op_add_wide. Once the wide opcode is done, it goes back to dispatching narrow instructions, since every wide instruction needs the 1-byte op_wide prefix.
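Here is a schematic C++ rendering of this dispatch scheme; the handler tables and the op_wide value are hypothetical stand-ins for what the interpreter actually uses:

#include <cstdint>
#include <cstring>

using Handler = const uint8_t* (*)(const uint8_t*); // returns the next PC
extern const Handler narrowHandlers[256]; // narrowHandlers[op_wide] == handleWide
extern const Handler wideHandlers[256];
constexpr uint8_t op_wide = 0xFE; // assumption, not JSC's actual value

const uint8_t* handleWide(const uint8_t* pc)
{
    // op_wide only re-dispatches: read the real opcode from its 4-byte slot.
    uint32_t opcode;
    std::memcpy(&opcode, pc + 1, sizeof(opcode));
    return wideHandlers[opcode](pc);
}

void interpret(const uint8_t* pc)
{
    // Every dispatch starts narrow; wide handlers are reached only through
    // the op_wide prefix, after which dispatch goes back to narrow.
    while (pc)
        pc = narrowHandlers[*pc](pc);
}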
Memory Usage
The new bytecode format uses approximately 50% less memory, which means a reduction of 10% in overall memory usage for JavaScript-heavy websites, such as Facebook or Reddit.
Here’s a chart comparing the bytecode size for reddit.com, apple.com, facebook.com and gmail.com.
[Chart: bytecode size for reddit.com, apple.com, facebook.com and gmail.com, before and after]
Notice that in the After case the Metadata has been merged with Linked to represent the memory used by the metadata table, and Unlinked represents the actual bytecode instructions. The reason is that a significant part of what lives in the new metadata table used to live in the old linked bytecode. Otherwise, the comparison would look skewed, since the whole Linked bar would be gone and Metadata would seem to have doubled in size, when in reality part of the data from Linked moved into Metadata.
Performance
As discussed earlier, indirect threading increases the overhead of the interpreter dispatch. However, given the average complexity of the bytecode instructions in JSC, we did not expect the dispatch overhead to play a meaningful role in the overall performance of the interpreter. This is further amortized by the multiple JIT tiers. We profiled purely switching from direct to indirect threading in CLoop, the C++ backend for the LLInt, and the results were neutral even with the JITs disabled.
Loads from the metadata table are costly, involving a long chain of dependent loads. In order to shorten this chain, we pin a callee-save register in the interpreter to hold the pointer to the metadata table at all times. Even with this optimization, it still takes three loads to access a metadata entry. This was a 10% interpreter slowdown, but it was neutral when running with all JIT tiers enabled.
Bonus: Type Safety
Changing the bytecode format required changes across the whole engine, so we decided to take this as an opportunity to improve the bytecode-related infrastructure, increasing the type safety, readability and maintainability of the code.
To support all these changes, we first needed to change how instructions were specified. Originally, instructions were declared in a JSON file, with their respective name and length (the number of operands plus one, for the opcode). Here is how the add instruction was declared:
{"name":"op_add","length":5 }When accessing the instruction and its operands from C++ the code would look roughly like this:
SLOW_PATH_DECL(slow_path_add)
{
    JSValue lhs = OP_C(2).jsValue();
    JSValue rhs = OP_C(3).jsValue();
    ...
}

Notice that operands were accessed by their offset: OP_C was a macro that would access the Instruction* pc argument (hidden here by the SLOW_PATH_DECL macro) at the given offset. However, the type of the data at the specified offset is unknown, so the result would have to be cast to the desired type. This is error-prone and makes the code harder to understand. The only way to learn which operands an instruction had (and their types) was to look for usages of it and look at variable names and type casts.
With the new infrastructure, when declaring an instruction, a name and type must be given for each of the operands, and all the data that will be stored in the metadata table must be declared as well.
op :add,
    args: {
        dst: VirtualRegister,
        lhs: VirtualRegister,
        rhs: VirtualRegister,
        operandTypes: OperandTypes,
    },
    metadata: {
        arithProfile: ArithProfile,
    }

With this extra information, it is now possible to generate a fully typed struct for each instruction, which results in safer C++ code:
SLOW_PATH_DECL(slow_path_add)
{
    OpAdd bytecode = pc->as<OpAdd>();
    JSValue lhs = GET_C(bytecode.m_lhs);
    JSValue rhs = GET_C(bytecode.m_rhs);
    ...
}

In the example above, we first need to convert the generic Instruction* pc to the instruction we want to access, which will perform a runtime check. If the opcode matches, it returns an instance of the generated struct, OpAdd, which includes the fields m_lhs and m_rhs, both of type VirtualRegister, as specified in our instruction declaration.
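For illustration, the struct generated from the declaration above might look roughly like this. The type stand-ins and the opcode value are assumptions, and the real generated code also handles narrow/wide decoding:

#include <cstdint>

// Stand-ins for the real JSC types.
struct VirtualRegister { int m_offset; };
struct OperandTypes { uint16_t m_bits; };
struct ArithProfile { uint32_t m_bits; };

struct OpAdd {
    static constexpr uint8_t opcodeID = 42; // hypothetical opcode value

    VirtualRegister m_dst;
    VirtualRegister m_lhs;
    VirtualRegister m_rhs;
    OperandTypes m_operandTypes;

    struct Metadata {
        ArithProfile m_arithProfile; // lives in the metadata table, not the stream
    };
};
// pc->as<OpAdd>() can then check the opcode at pc against OpAdd::opcodeID
// before reinterpreting the operands.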
The extra information also allowed us to replace a lot of mechanical code with safer, generated code. For example, we auto-generate all the type-safe code for writing instructions to the instruction stream, which automatically does the narrow/wide fitting. We also generate all the code to dump the instructions for debugging.
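As a sketch of what the narrow/wide fitting looks like, a generated writer can try the narrow encoding first and transparently fall back to wide. This is not JSC’s actual generated code; the stream type and the op_wide value are assumptions:

#include <cstdint>

// Hypothetical output stream; the real writer is generated per instruction.
struct BytecodeStream {
    void writeByte(uint8_t);
    void write4Bytes(uint32_t);
};

constexpr uint8_t op_wide = 0xFE; // assumption, as before

template<typename... Operands>
void emit(BytecodeStream& out, uint8_t opcode, Operands... operands)
{
    // Narrow only if every operand survives the round-trip through one byte.
    bool fitsNarrow = (... && (static_cast<int32_t>(operands) >= INT8_MIN
        && static_cast<int32_t>(operands) <= INT8_MAX));
    if (fitsNarrow) {
        out.writeByte(opcode);
        (out.writeByte(static_cast<uint8_t>(operands)), ...);
    } else {
        out.writeByte(op_wide); // one overflowing operand promotes everything
        out.write4Bytes(opcode);
        (out.write4Bytes(static_cast<uint32_t>(operands)), ...);
    }
}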
Conclusion
JavaScriptCore has a new bytecode format that uses 50% less memory on average and is available on Safari 12.1 and Safari Technology Preview. The work on the caching API for the new bytecode is already underway in the WebKit repository, and anyone interested is welcome to follow along and contribute on bugs.webkit.org! You can also get in touch with me on Twitter with any questions or comments.