Tutorial part 4: Adding JIT-compilation to a toy interpreter

In this example we construct a “toy” interpreter, and add JIT-compilationto it.

Our toy interpreter

It’s a stack-based interpreter, and is intended as a (very simple) exampleof the kind of bytecode interpreter seen in dynamic languages such asPython, Ruby etc.

For the sake of simplicity, our toy virtual machine is very limited:

  • The only data type isint

  • It can only work on one function at a time (so that the onlyfunction call that can be made is to recurse).

  • Functions can only take one parameter.

  • Functions have a stack ofint values.

  • We’ll implement function call within the interpreter by calling afunction in our implementation, rather than implementing our ownframe stack.

  • The parser is only good enough to get the examples to work.

Naturally, a real interpreter would be much more complicated that this.

The following operations are supported:

Operation

Meaning

Old Stack

New Stack

DUP

Duplicate top of stack.

[...,x]

[...,x,x]

ROT

Swap top two elements of stack.

[...,x,y]

[...,y,x]

BINARY_ADD

Add the top two elements on the stack.

[...,x,y]

[...,(x+y)]

BINARY_SUBTRACT

Likewise, but subtract.

[...,x,y]

[...,(x-y)]

BINARY_MULT

Likewise, but multiply.

[...,x,y]

[...,(x*y)]

BINARY_COMPARE_LT

Compare the top two elements on the stack and push a nonzero/zero if (x<y).

[...,x,y]

[...,(x<y)]

RECURSE

Recurse, passing the top of the stack, and popping the result.

[...,x]

[...,fn(x)]

RETURN

Return the top of the stack.

[x]

[]

PUSH_CONSTarg

Push an int const.

[...]

[...,arg]

JUMP_ABS_IF_TRUEarg

Pop; if top of stack was nonzero, jump toarg.

[...,x]

[...]

Programs can be interpreted, disassembled, and compiled to machine code.

The interpreter reads.toy scripts. Here’s what a simple recursivefactorial program looks like, the scriptfactorial.toy.The parser ignores lines beginning with a#.

# Simple recursive factorial implementation, roughly equivalent to:##  int factorial (int arg)#  {#     if (arg < 2)#       return arg#     return arg * factorial (arg - 1)#  }# Initial state:# stack: [arg]# 0:DUP# stack: [arg, arg]# 1:PUSH_CONST2# stack: [arg, arg, 2]# 2:BINARY_COMPARE_LT# stack: [arg, (arg < 2)]# 3:JUMP_ABS_IF_TRUE9# stack: [arg]# 4:DUP# stack: [arg, arg]# 5:PUSH_CONST1# stack: [arg, arg, 1]# 6:BINARY_SUBTRACT# stack: [arg,  (arg - 1)# 7:RECURSE# stack: [arg, factorial(arg - 1)]# 8:BINARY_MULT# stack: [arg * factorial(arg - 1)]# 9:RETURN

The interpreter is a simple infinite loop with a bigswitch statementbased on what the next opcode is:

staticinttoyvm_function_interpret(toyvm_function*fn,intarg,FILE*trace){toyvm_frameframe;#define PUSH(ARG) (toyvm_frame_push (&frame, (ARG)))#define POP(ARG) (toyvm_frame_pop (&frame))frame.frm_function=fn;frame.frm_pc=0;frame.frm_cur_depth=0;PUSH(arg);while(1){toyvm_op*op;intx,y;assert(frame.frm_pc<fn->fn_num_ops);op=&fn->fn_ops[frame.frm_pc++];if(trace){toyvm_frame_dump_stack(&frame,trace);toyvm_function_disassemble_op(fn,op,frame.frm_pc,trace);}switch(op->op_opcode){/* Ops taking no operand.  */caseDUP:x=POP();PUSH(x);PUSH(x);break;caseROT:y=POP();x=POP();PUSH(y);PUSH(x);break;caseBINARY_ADD:y=POP();x=POP();PUSH(x+y);break;caseBINARY_SUBTRACT:y=POP();x=POP();PUSH(x-y);break;caseBINARY_MULT:y=POP();x=POP();PUSH(x*y);break;caseBINARY_COMPARE_LT:y=POP();x=POP();PUSH(x<y);break;caseRECURSE:x=POP();x=toyvm_function_interpret(fn,x,trace);PUSH(x);break;caseRETURN:returnPOP();/* Ops taking an operand.  */casePUSH_CONST:PUSH(op->op_operand);break;caseJUMP_ABS_IF_TRUE:x=POP();if(x)frame.frm_pc=op->op_operand;break;default:assert(0);/* unknown opcode */}/* end of switch on opcode */}/* end of while loop */#undef PUSH#undef POP}

Compiling to machine code

We want to generate machine code that can be cast to this type andthen directly executed in-process:

typedefint(*toyvm_compiled_code)(int);

The lifetime of the code is tied to that of agcc_jit_result*.We’ll handle this by bundling them up in a structure, so that we canclean them up together by callinggcc_jit_result_release():

structtoyvm_compiled_function{gcc_jit_result*cf_jit_result;toyvm_compiled_codecf_code;};

Our compiler isn’t very sophisticated; it takes the implementation ofeach opcode above, and maps it directly to the operations supported bythe libgccjit API.

How should we handle the stack? In theory we could calculate what thestack depth will be at each opcode, and optimize away the stackmanipulation “by hand”. We’ll see below that libgccjit is able to dothis for us, so we’ll implement stack manipulationin a direct way, by creating astack array andstack_depthvariables, local within the generated function, equivalent to this C code:

intstack_depth;intstack[MAX_STACK_DEPTH];

We’ll also have local variablesx andy for use when implementingthe opcodes, equivalent to this:

intx;inty;

This means our compiler has the following state:

structcompilation_state{gcc_jit_context*ctxt;gcc_jit_type*int_type;gcc_jit_type*bool_type;gcc_jit_type*stack_type;/* int[MAX_STACK_DEPTH] */gcc_jit_rvalue*const_one;gcc_jit_function*fn;gcc_jit_param*param_arg;gcc_jit_lvalue*stack;gcc_jit_lvalue*stack_depth;gcc_jit_lvalue*x;gcc_jit_lvalue*y;gcc_jit_location*op_locs[MAX_OPS];gcc_jit_block*initial_block;gcc_jit_block*op_blocks[MAX_OPS];};

Setting things up

First we create our types:

state.int_type=gcc_jit_context_get_type(state.ctxt,GCC_JIT_TYPE_INT);state.bool_type=gcc_jit_context_get_type(state.ctxt,GCC_JIT_TYPE_BOOL);state.stack_type=gcc_jit_context_new_array_type(state.ctxt,NULL,state.int_type,MAX_STACK_DEPTH);

along with extracting a usefulint constant:

state.const_one=gcc_jit_context_one(state.ctxt,state.int_type);

We’ll implement push and pop in terms of thestack array andstack_depth. Here are helper functions for adding statements toa block, implementing pushing and popping values:

staticvoidadd_push(compilation_state*state,gcc_jit_block*block,gcc_jit_rvalue*rvalue,gcc_jit_location*loc){/* stack[stack_depth] = RVALUE */gcc_jit_block_add_assignment(block,loc,/* stack[stack_depth] */gcc_jit_context_new_array_access(state->ctxt,loc,gcc_jit_lvalue_as_rvalue(state->stack),gcc_jit_lvalue_as_rvalue(state->stack_depth)),rvalue);/* "stack_depth++;".  */gcc_jit_block_add_assignment_op(block,loc,state->stack_depth,GCC_JIT_BINARY_OP_PLUS,state->const_one);}staticvoidadd_pop(compilation_state*state,gcc_jit_block*block,gcc_jit_lvalue*lvalue,gcc_jit_location*loc){/* "--stack_depth;".  */gcc_jit_block_add_assignment_op(block,loc,state->stack_depth,GCC_JIT_BINARY_OP_MINUS,state->const_one);/* "LVALUE = stack[stack_depth];".  */gcc_jit_block_add_assignment(block,loc,lvalue,/* stack[stack_depth] */gcc_jit_lvalue_as_rvalue(gcc_jit_context_new_array_access(state->ctxt,loc,gcc_jit_lvalue_as_rvalue(state->stack),gcc_jit_lvalue_as_rvalue(state->stack_depth))));}

We will support single-stepping through the generated code in thedebugger, so we need to creategcc_jit_location instances, oneper operation in the source code. These will reference the lines ofe.g.factorial.toy.

for(pc=0;pc<fn->fn_num_ops;pc++){toyvm_op*op=&fn->fn_ops[pc];state.op_locs[pc]=gcc_jit_context_new_location(state.ctxt,fn->fn_filename,op->op_linenum,0);/* column */}

Let’s create the function itself. As usual, we create its parameterfirst, then use the parameter to create the function:

state.param_arg=gcc_jit_context_new_param(state.ctxt,state.op_locs[0],state.int_type,"arg");state.fn=gcc_jit_context_new_function(state.ctxt,state.op_locs[0],GCC_JIT_FUNCTION_EXPORTED,state.int_type,funcname,1,&state.param_arg,0);

We create the locals within the function.

state.stack=gcc_jit_function_new_local(state.fn,NULL,state.stack_type,"stack");state.stack_depth=gcc_jit_function_new_local(state.fn,NULL,state.int_type,"stack_depth");state.x=gcc_jit_function_new_local(state.fn,NULL,state.int_type,"x");state.y=gcc_jit_function_new_local(state.fn,NULL,state.int_type,"y");

Populating the function

There’s some one-time initialization, and the API treats the first blockyou create as the entrypoint of the function, so we need to create thatblock first:

state.initial_block=gcc_jit_function_new_block(state.fn,"initial");

We can now create blocks for each of the operations. Most of these willbe consolidated into larger blocks when the optimizer runs.

for(pc=0;pc<fn->fn_num_ops;pc++){charbuf[100];sprintf(buf,"instr%i",pc);state.op_blocks[pc]=gcc_jit_function_new_block(state.fn,buf);}

Now that we have a block it can jump to when it’s done, we can populatethe initial block:

/* "stack_depth = 0;".  */gcc_jit_block_add_assignment(state.initial_block,state.op_locs[0],state.stack_depth,gcc_jit_context_zero(state.ctxt,state.int_type));/* "PUSH (arg);".  */add_push(&state,state.initial_block,gcc_jit_param_as_rvalue(state.param_arg),state.op_locs[0]);/* ...and jump to insn 0.  */gcc_jit_block_end_with_jump(state.initial_block,state.op_locs[0],state.op_blocks[0]);

We can now populate the blocks for the individual operations. We loopthrough them, adding instructions to their blocks:

for(pc=0;pc<fn->fn_num_ops;pc++){gcc_jit_location*loc=state.op_locs[pc];gcc_jit_block*block=state.op_blocks[pc];gcc_jit_block*next_block=(pc<fn->fn_num_ops?state.op_blocks[pc+1]:NULL);toyvm_op*op;op=&fn->fn_ops[pc];

We’re going to have another bigswitch statement for implementingthe opcodes, this time for compiling them, rather than interpretingthem. It’s helpful to have macros for implementing push and pop, so thatwe can make theswitch statement that’s coming up look as much aspossible like the one above within the interpreter:

#define X_EQUALS_POP()\      add_pop (&state, block, state.x, loc)#define Y_EQUALS_POP()\      add_pop (&state, block, state.y, loc)#define PUSH_RVALUE(RVALUE)\      add_push (&state, block, (RVALUE), loc)#define PUSH_X()\      PUSH_RVALUE (gcc_jit_lvalue_as_rvalue (state.x))#define PUSH_Y() \      PUSH_RVALUE (gcc_jit_lvalue_as_rvalue (state.y))

Note

A particularly clever implementation would have anidenticalswitch statement shared by the interpreter and the compiler, withsome preprocessor “magic”. We’re not doing that here, for the sakeof simplicity.

When I first implemented this compiler, I accidentally missed an editwhen copying and pasting theY_EQUALS_POP macro, so that popping thestack intoy instead erroneously assigned it tox, leavingyuninitialized.

To track this kind of thing down, we can usegcc_jit_block_add_comment() to add descriptive commentsto the internal representation. This is invaluable when looking throughthe generated IR for, sayfactorial:

gcc_jit_block_add_comment(block,loc,opcode_names[op->op_opcode]);

We can now write the bigswitch statement that implements theindividual opcodes, populating the relevant block with statements:

switch(op->op_opcode){caseDUP:X_EQUALS_POP();PUSH_X();PUSH_X();break;caseROT:Y_EQUALS_POP();X_EQUALS_POP();PUSH_Y();PUSH_X();break;caseBINARY_ADD:Y_EQUALS_POP();X_EQUALS_POP();PUSH_RVALUE(gcc_jit_context_new_binary_op(state.ctxt,loc,GCC_JIT_BINARY_OP_PLUS,state.int_type,gcc_jit_lvalue_as_rvalue(state.x),gcc_jit_lvalue_as_rvalue(state.y)));break;caseBINARY_SUBTRACT:Y_EQUALS_POP();X_EQUALS_POP();PUSH_RVALUE(gcc_jit_context_new_binary_op(state.ctxt,loc,GCC_JIT_BINARY_OP_MINUS,state.int_type,gcc_jit_lvalue_as_rvalue(state.x),gcc_jit_lvalue_as_rvalue(state.y)));break;caseBINARY_MULT:Y_EQUALS_POP();X_EQUALS_POP();PUSH_RVALUE(gcc_jit_context_new_binary_op(state.ctxt,loc,GCC_JIT_BINARY_OP_MULT,state.int_type,gcc_jit_lvalue_as_rvalue(state.x),gcc_jit_lvalue_as_rvalue(state.y)));break;caseBINARY_COMPARE_LT:Y_EQUALS_POP();X_EQUALS_POP();PUSH_RVALUE(/* cast of bool to int */gcc_jit_context_new_cast(state.ctxt,loc,/* (x < y) as a bool */gcc_jit_context_new_comparison(state.ctxt,loc,GCC_JIT_COMPARISON_LT,gcc_jit_lvalue_as_rvalue(state.x),gcc_jit_lvalue_as_rvalue(state.y)),state.int_type));break;caseRECURSE:{X_EQUALS_POP();gcc_jit_rvalue*arg=gcc_jit_lvalue_as_rvalue(state.x);PUSH_RVALUE(gcc_jit_context_new_call(state.ctxt,loc,state.fn,1,&arg));break;}caseRETURN:X_EQUALS_POP();gcc_jit_block_end_with_return(block,loc,gcc_jit_lvalue_as_rvalue(state.x));break;/* Ops taking an operand.  */casePUSH_CONST:PUSH_RVALUE(gcc_jit_context_new_rvalue_from_int(state.ctxt,state.int_type,op->op_operand));break;caseJUMP_ABS_IF_TRUE:X_EQUALS_POP();gcc_jit_block_end_with_conditional(block,loc,/* "(bool)x".  */gcc_jit_context_new_cast(state.ctxt,loc,gcc_jit_lvalue_as_rvalue(state.x),state.bool_type),state.op_blocks[op->op_operand],/* on_true */next_block);/* on_false */break;default:assert(0);}/* end of switch on opcode */

Every block must be terminated, via a call to one of thegcc_jit_block_end_with_ entrypoints. This has been done for twoof the opcodes, but we need to do it for the other ones, by jumpingto the next block.

if(op->op_opcode!=JUMP_ABS_IF_TRUE&&op->op_opcode!=RETURN)gcc_jit_block_end_with_jump(block,loc,next_block);

This is analogous to simply incrementing the program counter.

Verifying the control flow graph

Having finished looping over the blocks, the context is complete.

As before, we can verify that the control flow and statements are sane byusinggcc_jit_function_dump_to_dot():

gcc_jit_function_dump_to_dot(state.fn,"/tmp/factorial.dot");

and viewing the result. Note how the label names, comments, andvariable names show up in the dump, to make it easier to spoterrors in our compiler.

image of a control flow graph

Compiling the context

Having finished looping over the blocks and populating them withstatements, the context is complete.

We can now compile it, and extract machine code from the result:

gcc_jit_result*jit_result=gcc_jit_context_compile(state.ctxt);gcc_jit_context_release(state.ctxt);toyvm_compiled_function*toyvm_result=(toyvm_compiled_function*)calloc(1,sizeof(toyvm_compiled_function));if(!toyvm_result){fprintf(stderr,"out of memory allocating toyvm_compiled_function\n");gcc_jit_result_release(jit_result);returnNULL;}toyvm_result->cf_jit_result=jit_result;toyvm_result->cf_code=(toyvm_compiled_code)gcc_jit_result_get_code(jit_result,funcname);

We can now run the result:

toyvm_compiled_function*compiled_fn=toyvm_function_compile(fn);toyvm_compiled_codecode=compiled_fn->cf_code;printf("compiler result: %d\n",code(atoi(argv[2])));gcc_jit_result_release(compiled_fn->cf_jit_result);free(compiled_fn);

Single-stepping through the generated code

It’s possible to debug the generated code. To do this we need to both:

  • Set up source code locations for our statements, so that we canmeaningfully step through the code. We did this above bycallinggcc_jit_context_new_location() and using theresults.

  • Enable the generation of debugging information, by settingGCC_JIT_BOOL_OPTION_DEBUGINFO on thegcc_jit_context viagcc_jit_context_set_bool_option():

    gcc_jit_context_set_bool_option(ctxt,GCC_JIT_BOOL_OPTION_DEBUGINFO,1);

Having done this, we can put a breakpoint on the generated function:

$gdb --args ./toyvm factorial.toy10(gdb)break factorialFunction "factorial" not defined.Make breakpoint pending on future shared library load? (y or [n]) yBreakpoint 1 (factorial) pending.(gdb)runBreakpoint 1, factorial (arg=10) at factorial.toy:1414    DUP

We’ve set up location information, which referencesfactorial.toy.This allows us to use e.g.list to see where we are in the script:

(gdb)list910    # Initial state:11    # stack: [arg]1213    # 0:14    DUP15    # stack: [arg, arg]1617    # 1:18    PUSH_CONST 2

and to step through the function, examining the data:

(gdb)n18    PUSH_CONST 2(gdb)n22    BINARY_COMPARE_LT(gdb)print stack$5={10,10,2,0, -7152,32767,0,0}(gdb)print stack_depth$6=3

You’ll see that the parts of thestack array that haven’t beentouched yet are uninitialized.

Note

Turning on optimizations may lead to unpredictable results whenstepping through the generated code: the execution may appear to“jump around” the source code. This is analogous to turning up theoptimization level in a regular compiler.

Examining the generated code

How good is the optimized code?

We can turn up optimizations, by callinggcc_jit_context_set_int_option() withGCC_JIT_INT_OPTION_OPTIMIZATION_LEVEL:

gcc_jit_context_set_int_option(ctxt,GCC_JIT_INT_OPTION_OPTIMIZATION_LEVEL,3);

One of GCC’s internal representations is called “gimple”. A dump of theinitial gimple representation of the code can be seen by setting:

gcc_jit_context_set_bool_option(ctxt,GCC_JIT_BOOL_OPTION_DUMP_INITIAL_GIMPLE,1);

With optimization on and source locations displayed, this gives:

factorial(signedintarg){<unnamedtype>D.80;signedintD.81;signedintD.82;signedintD.83;signedintD.84;signedintD.85;signedinty;signedintx;signedintstack_depth;signedintstack[8];try{initial:stack_depth=0;stack[stack_depth]=arg;stack_depth=stack_depth+1;gotoinstr0;instr0:/* DUP */:stack_depth=stack_depth+-1;x=stack[stack_depth];stack[stack_depth]=x;stack_depth=stack_depth+1;stack[stack_depth]=x;stack_depth=stack_depth+1;gotoinstr1;instr1:/* PUSH_CONST */:stack[stack_depth]=2;stack_depth=stack_depth+1;gotoinstr2;/* etc */

You can see the generated machine code in assembly form via:

gcc_jit_context_set_bool_option(ctxt,GCC_JIT_BOOL_OPTION_DUMP_GENERATED_CODE,1);result=gcc_jit_context_compile(ctxt);

which shows that (on this x86_64 box) the compiler has unrolled the loopand is using MMX instructions to perform several multiplicationssimultaneously:

.file"fake.c".text.Ltext0:.p2align4,,15.globlfactorial.typefactorial,@functionfactorial:.LFB0:.file1"factorial.toy".loc1140.cfi_startproc.LVL0:.L2:.loc1260cmpl$1,%edijle.L13leal-1(%rdi),%edxmovl%edx,%ecxshrl$2,%ecxleal0(,%rcx,4),%esitestl%esi,%esije.L14cmpl$9,%edxjbe.L14leal-2(%rdi),%eaxmovl%eax,-16(%rsp)leal-3(%rdi),%eaxmovd-16(%rsp),%xmm0movl%edi,-16(%rsp)movl%eax,-12(%rsp)movd-16(%rsp),%xmm1xorl%eax,%eaxmovl%edx,-16(%rsp)movd-12(%rsp),%xmm4movd-16(%rsp),%xmm6punpckldq%xmm4,%xmm0movdqa.LC1(%rip),%xmm4punpckldq%xmm6,%xmm1punpcklqdq%xmm0,%xmm1movdqa.LC0(%rip),%xmm0jmp.L5# etc - edited for brevity

This is clearly overkill for a function that will likely overflow theint type before the vectorization is worthwhile - but then again, thisis a toy example.

Turning down the optimization level to 2:

gcc_jit_context_set_int_option(ctxt,GCC_JIT_INT_OPTION_OPTIMIZATION_LEVEL,3);

yields this code, which is simple enough to quote in its entirety:

.file"fake.c".text.p2align4,,15.globlfactorial.typefactorial,@functionfactorial:.LFB0:.cfi_startproc.L2:cmpl$1,%edijle.L8movl$1,%edxjmp.L4.p2align4,,10.p2align3.L6:movl%eax,%edi.L4:.L5:leal-1(%rdi),%eaximull%edi,%edxcmpl$1,%eaxjne.L6.L3:.L7:imull%edx,%eaxret.L8:movl%edi,%eaxmovl$1,%edxjmp.L7.cfi_endproc.LFE0:.sizefactorial,.-factorial.ident"GCC: (GNU) 4.9.0 20131023 (Red Hat 0.2)".section.note.GNU-stack,"",@progbits

Note that the stack pushing and popping have been eliminated, as has therecursive call (in favor of an iteration).

Putting it all together

The complete example can be seen in the source tree atgcc/jit/docs/examples/tut04-toyvm/toyvm.c

along with a Makefile and a couple of sample .toy scripts:

$ls -aldrwxrwxr-x. 2 david david   4096 Sep 19 17:46 .drwxrwxr-x. 3 david david   4096 Sep 19 15:26 ..-rw-rw-r--. 1 david david    615 Sep 19 12:43 factorial.toy-rw-rw-r--. 1 david david    834 Sep 19 13:08 fibonacci.toy-rw-rw-r--. 1 david david    238 Sep 19 14:22 Makefile-rw-rw-r--. 1 david david  16457 Sep 19 17:07 toyvm.c$make toyvmg++ -Wall -g -o toyvm toyvm.c -lgccjit$./toyvm factorial.toy10interpreter result: 3628800compiler result: 3628800$./toyvm fibonacci.toy10interpreter result: 55compiler result: 55

Behind the curtain: How does our code get optimized?

Our example is done, but you may be wondering about exactly how thecompiler turned what we gave it into the machine code seen above.

We can examine what the compiler is doing in detail by setting:

gcc_jit_context_set_bool_option(state.ctxt,GCC_JIT_BOOL_OPTION_DUMP_EVERYTHING,1);gcc_jit_context_set_bool_option(state.ctxt,GCC_JIT_BOOL_OPTION_KEEP_INTERMEDIATES,1);

This will dump detailed information about the compiler’s state to adirectory under/tmp, and keep it from being cleaned up.

The precise names and their formats of these files is subject to change.Higher optimization levels lead to more files.Here’s what I saw (edited for brevity; there were almost 200 files):

intermediate files written to /tmp/libgccjit-KPQbGw$ls /tmp/libgccjit-KPQbGw/fake.c.000i.cgraphfake.c.000i.type-inheritancefake.c.004t.gimplefake.c.007t.omplowerfake.c.008t.lowerfake.c.011t.ehfake.c.012t.cfgfake.c.014i.visibilityfake.c.015i.early_local_cleanupsfake.c.016t.ssa#etc

The gimple code is converted into Static Single Assignment form,with annotations for use when generating the debuginfo:

$less /tmp/libgccjit-KPQbGw/fake.c.016t.ssa
;;Functionfactorial(factorial,funcdef_no=0,decl_uid=53,symbol_order=0)factorial(signedintarg){signedintstack[8];signedintstack_depth;signedintx;signedinty;<unnamedtype>_20;signedint_21;signedint_38;signedint_44;signedint_51;signedint_56;initial:stack_depth_3=0;# DEBUG stack_depth => stack_depth_3stack[stack_depth_3]=arg_5(D);stack_depth_7=stack_depth_3+1;# DEBUG stack_depth => stack_depth_7# DEBUG instr0 => NULL# DEBUG/* DUP */ => NULLstack_depth_8=stack_depth_7+-1;# DEBUG stack_depth => stack_depth_8x_9=stack[stack_depth_8];# DEBUG x => x_9stack[stack_depth_8]=x_9;stack_depth_11=stack_depth_8+1;# DEBUG stack_depth => stack_depth_11stack[stack_depth_11]=x_9;stack_depth_13=stack_depth_11+1;# DEBUG stack_depth => stack_depth_13# DEBUG instr1 => NULL# DEBUG/* PUSH_CONST */ => NULLstack[stack_depth_13]=2;/* etc; edited for brevity */

We can perhaps better see the code by turning offGCC_JIT_BOOL_OPTION_DEBUGINFO to suppress all thoseDEBUGstatements, giving:

$less /tmp/libgccjit-1Hywc0/fake.c.016t.ssa
;;Functionfactorial(factorial,funcdef_no=0,decl_uid=53,symbol_order=0)factorial(signedintarg){signedintstack[8];signedintstack_depth;signedintx;signedinty;<unnamedtype>_20;signedint_21;signedint_38;signedint_44;signedint_51;signedint_56;initial:stack_depth_3=0;stack[stack_depth_3]=arg_5(D);stack_depth_7=stack_depth_3+1;stack_depth_8=stack_depth_7+-1;x_9=stack[stack_depth_8];stack[stack_depth_8]=x_9;stack_depth_11=stack_depth_8+1;stack[stack_depth_11]=x_9;stack_depth_13=stack_depth_11+1;stack[stack_depth_13]=2;stack_depth_15=stack_depth_13+1;stack_depth_16=stack_depth_15+-1;y_17=stack[stack_depth_16];stack_depth_18=stack_depth_16+-1;x_19=stack[stack_depth_18];_20=x_19<y_17;_21=(signedint)_20;stack[stack_depth_18]=_21;stack_depth_23=stack_depth_18+1;stack_depth_24=stack_depth_23+-1;x_25=stack[stack_depth_24];if(x_25!=0)goto<bb4>(instr9);elsegoto<bb3>(instr4);instr4:/* DUP */:stack_depth_26=stack_depth_24+-1;x_27=stack[stack_depth_26];stack[stack_depth_26]=x_27;stack_depth_29=stack_depth_26+1;stack[stack_depth_29]=x_27;stack_depth_31=stack_depth_29+1;stack[stack_depth_31]=1;stack_depth_33=stack_depth_31+1;stack_depth_34=stack_depth_33+-1;y_35=stack[stack_depth_34];stack_depth_36=stack_depth_34+-1;x_37=stack[stack_depth_36];_38=x_37-y_35;stack[stack_depth_36]=_38;stack_depth_40=stack_depth_36+1;stack_depth_41=stack_depth_40+-1;x_42=stack[stack_depth_41];_44=factorial(x_42);stack[stack_depth_41]=_44;stack_depth_46=stack_depth_41+1;stack_depth_47=stack_depth_46+-1;y_48=stack[stack_depth_47];stack_depth_49=stack_depth_47+-1;x_50=stack[stack_depth_49];_51=x_50*y_48;stack[stack_depth_49]=_51;stack_depth_53=stack_depth_49+1;# stack_depth_1 = PHI <stack_depth_24(2), stack_depth_53(3)>instr9:/* RETURN */:stack_depth_54=stack_depth_1+-1;x_55=stack[stack_depth_54];_56=x_55;stack={v}{CLOBBER};return_56;}

Note in the above how all thegcc_jit_block instances wecreated have been consolidated into just 3 blocks in GCC’s internalrepresentation:initial,instr4 andinstr9.

Optimizing away stack manipulation

Recall our simple implementation of stack operations. Let’s examinehow the stack operations are optimized away.

After a pass of constant-propagation, the depth of the stack at eachopcode can be determined at compile-time:

$less /tmp/libgccjit-1Hywc0/fake.c.021t.ccp1
;;Functionfactorial(factorial,funcdef_no=0,decl_uid=53,symbol_order=0)factorial(signedintarg){signedintstack[8];signedintstack_depth;signedintx;signedinty;<unnamedtype>_20;signedint_21;signedint_38;signedint_44;signedint_51;initial:stack[0]=arg_5(D);x_9=stack[0];stack[0]=x_9;stack[1]=x_9;stack[2]=2;y_17=stack[2];x_19=stack[1];_20=x_19<y_17;_21=(signedint)_20;stack[1]=_21;x_25=stack[1];if(x_25!=0)goto<bb4>(instr9);elsegoto<bb3>(instr4);instr4:/* DUP */:x_27=stack[0];stack[0]=x_27;stack[1]=x_27;stack[2]=1;y_35=stack[2];x_37=stack[1];_38=x_37-y_35;stack[1]=_38;x_42=stack[1];_44=factorial(x_42);stack[1]=_44;y_48=stack[1];x_50=stack[0];_51=x_50*y_48;stack[0]=_51;instr9:/* RETURN */:x_55=stack[0];x_56=x_55;stack={v}{CLOBBER};returnx_56;}

Note how, in the above, all thosestack_depth values are now justconstants: we’re accessing specific stack locations at each opcode.

The “esra” pass (“Early Scalar Replacement of Aggregates”) breaksout our “stack” array into individual elements:

$less /tmp/libgccjit-1Hywc0/fake.c.024t.esra
;;Functionfactorial(factorial,funcdef_no=0,decl_uid=53,symbol_order=0)Createdareplacementforstackoffset:0,size:32:stack$0Createdareplacementforstackoffset:32,size:32:stack$1Createdareplacementforstackoffset:64,size:32:stack$2SymbolstobeputinSSAform{D.89D.90D.91}IncrementalSSAupdatestartedatblock:0NumberofblocksinCFG:5Numberofblockstoupdate:4(80%)factorial(signedintarg){signedintstack$2;signedintstack$1;signedintstack$0;signedintstack[8];signedintstack_depth;signedintx;signedinty;<unnamedtype>_20;signedint_21;signedint_38;signedint_44;signedint_51;initial:stack$0_45=arg_5(D);x_9=stack$0_45;stack$0_39=x_9;stack$1_32=x_9;stack$2_30=2;y_17=stack$2_30;x_19=stack$1_32;_20=x_19<y_17;_21=(signedint)_20;stack$1_28=_21;x_25=stack$1_28;if(x_25!=0)goto<bb4>(instr9);elsegoto<bb3>(instr4);instr4:/* DUP */:x_27=stack$0_39;stack$0_22=x_27;stack$1_14=x_27;stack$2_12=1;y_35=stack$2_12;x_37=stack$1_14;_38=x_37-y_35;stack$1_10=_38;x_42=stack$1_10;_44=factorial(x_42);stack$1_6=_44;y_48=stack$1_6;x_50=stack$0_22;_51=x_50*y_48;stack$0_1=_51;# stack$0_52 = PHI <stack$0_39(2), stack$0_1(3)>instr9:/* RETURN */:x_55=stack$0_52;x_56=x_55;stack={v}{CLOBBER};returnx_56;}

Hence at this point, all those pushes and pops of the stack are nowsimply assignments to specific temporary variables.

After some copy propagation, the stack manipulation has been completelyoptimized away:

$less /tmp/libgccjit-1Hywc0/fake.c.026t.copyprop1
;;Functionfactorial(factorial,funcdef_no=0,decl_uid=53,symbol_order=0)factorial(signedintarg){signedintstack$2;signedintstack$1;signedintstack$0;signedintstack[8];signedintstack_depth;signedintx;signedinty;<unnamedtype>_20;signedint_21;signedint_38;signedint_44;signedint_51;initial:stack$0_39=arg_5(D);_20=arg_5(D)<=1;_21=(signedint)_20;if(_21!=0)goto<bb4>(instr9);elsegoto<bb3>(instr4);instr4:/* DUP */:_38=arg_5(D)+-1;_44=factorial(_38);_51=arg_5(D)*_44;stack$0_1=_51;# stack$0_52 = PHI <arg_5(D)(2), _51(3)>instr9:/* RETURN */:stack={v}{CLOBBER};returnstack$0_52;}

Later on, another pass finally eliminatedstack_depth local and theunused parts of thestack` array altogether:

$less /tmp/libgccjit-1Hywc0/fake.c.036t.release_ssa
;;Functionfactorial(factorial,funcdef_no=0,decl_uid=53,symbol_order=0)Released44names,314.29%,removed44holesfactorial(signedintarg){signedintstack$0;signedintmult_acc_1;<unnamedtype>_5;signedint_6;signedint_7;signedintmul_tmp_10;signedintmult_acc_11;signedintmult_acc_13;# arg_9 = PHI <arg_8(D)(0)># mult_acc_13 = PHI <1(0)>initial:<bb5>:# arg_4 = PHI <arg_9(2), _7(3)># mult_acc_1 = PHI <mult_acc_13(2), mult_acc_11(3)>_5=arg_4<=1;_6=(signedint)_5;if(_6!=0)goto<bb4>(instr9);elsegoto<bb3>(instr4);instr4:/* DUP */:_7=arg_4+-1;mult_acc_11=mult_acc_1*arg_4;goto<bb5>;# stack$0_12 = PHI <arg_4(5)>instr9:/* RETURN */:mul_tmp_10=mult_acc_1*stack$0_12;returnmul_tmp_10;}

Elimination of tail recursion

Another significant optimization is the detection that the call tofactorial is tail recursion, which can be eliminated in favor ofan iteration:

$less /tmp/libgccjit-1Hywc0/fake.c.030t.tailr1
;;Functionfactorial(factorial,funcdef_no=0,decl_uid=53,symbol_order=0)SymbolstobeputinSSAform{D.88}IncrementalSSAupdatestartedatblock:0NumberofblocksinCFG:5Numberofblockstoupdate:4(80%)factorial(signedintarg){signedintstack$2;signedintstack$1;signedintstack$0;signedintstack[8];signedintstack_depth;signedintx;signedinty;signedintmult_acc_1;<unnamedtype>_20;signedint_21;signedint_38;signedintmul_tmp_44;signedintmult_acc_51;# arg_5 = PHI <arg_39(D)(0), _38(3)># mult_acc_1 = PHI <1(0), mult_acc_51(3)>initial:_20=arg_5<=1;_21=(signedint)_20;if(_21!=0)goto<bb4>(instr9);elsegoto<bb3>(instr4);instr4:/* DUP */:_38=arg_5+-1;mult_acc_51=mult_acc_1*arg_5;goto<bb2>(initial);# stack$0_52 = PHI <arg_5(2)>instr9:/* RETURN */:stack={v}{CLOBBER};mul_tmp_44=mult_acc_1*stack$0_52;returnmul_tmp_44;}