Movatterモバイル変換


[0]ホーム

URL:


Skip to content
DEV Community
Log in Create account

DEV Community

Shagun Agrawal
Shagun Agrawal

Posted on • Originally published atshagunagrawal.me

Writing a C Compiler in Clojure

I recently found the bookWriting a C Compiler byNora Sandler. The book goes step by step in implementing a subset of C.
New language features are added in each chapter, and at the end of each chapter there aretests which verify that behavior. The tests are extensive and also heavily commented, which help in debugging.
The book is fun to read and implement. All features / steps are explained in detail, with ample pseudocode and explanations, but also not hand holding.
All the algorithms/implementation is in a pseudocode / pythonesque language, but the author has also shared their reference implementation inOCaml.

I have been wanting to make something in Clojure for a while. My first project with Clojure was aNintendo Entertainment System emulator, but after finishing the CPU, I got pretty burnt out.
I had made anemulator before (Gameboy), and it wasn't different enough and I slowly lost interest. Also performing lot of bit manipulations / operations in Clojure was not fun.

Apart from that I did some coding challenges like Advent of Code etc, but otherwise did not create any application using Clojure.
I had earlier seen this presentationWhat's So Hard About Writing A Compiler, Anyway? Oh - Ramsey Nasser
, which shared their general approach of making a compiler written in Clojure.
This talk motivated me to pursue writing the project in Clojure. Working with an AST, which would just be map of data felt like the perfect thing to build with Clojure.

I have always had interest in compilers. Before this, I had completedCrafting Interpreters.
I have completedProgramming Languages course by Dan Grossman, which explains and implements concepts (such as variables, functions, type systems) in different types of languages.
These resources definitely helped me out, as I had prior familiarity with the concepts presented in this book, but even without that I feel the book explains everything for somebody with no prior experience.
I had not written a compiler for C like language, and was not aware of how x86 assembly gets generated. This book turned out be a interesting project, and I learned a lot about C and x86 assembly.

I have implemented the first 12 chapters from the book. The compiler supports the below features. (Each feature roughly corresponds to each chapter in the book).

  • Operators
    • Unary-, ~
    • Binary+, -, *, /, %, <<, >>
    • Relational<, >, >=, <=, == !=
    • Logical!, &&, ||
  • Variables (with storage class specifiers)
  • If/else statements
  • Loops
  • Functions
  • Types (int, unsigned int, long, unsigned long)

This post details my experience writing the compiler. It mentions the problems and what approach I took to solve them.
There are a lot of references to other projects/conference talks/articles throughout, which gave me ideas on how to refactor / make it simpler etc.
The post is sectioned into the major passes of the compiler.
Each subsequent pass takes in as input the return value from the last stage.

The code is available atkaepr/cljcc.

(defnrun[source](->sourcelexparsetackyanalyzeassemblyemit))
Enter fullscreen modeExit fullscreen mode

Lexer

The first stage is the lexer, which converts the source input file to a list of tokens.

(defsource"int main(void) {  return 42;}")(lexsource);; =>{:tokens[{:kind:kw-int,:line2,:col1}{:kind:identifier,:line2,:col5,:literal"main"}{:kind:left-paren,:line2,:col9}{:kind:kw-void,:line2,:col10}{:kind:right-paren,:line2,:col14}{:kind:left-curly,:line2,:col16}{:kind:kw-return,:line3,:col3}{:kind:number,:line3,:col10,:literal"42"}{:kind:semicolon,:line3,:col12}{:kind:right-curly,:line4,:col1}{:kind:eof,:line5,:col1}],:line5,:col1}
Enter fullscreen modeExit fullscreen mode

The lexer is responsible for identifying keywords, numbers, identifiers, operators, ignoring whitespace etc.
Each character from the input file needs to be match for a valid token.
For e.g., if the current character is\n, we can ignore it and move to the next character.
If it's a valid alphabetical character, then we need to parse the remaining characters till we encounter a word break, then we check whether it's a valid keyword or an identifier.
Similar logic is present to tokenize operators, numbers etc.

My initial approach was using an index into the input string, which would point to current character.
I would try to match this character, and from there decide whether to parse it as a number, identifier etc. This is how it looked like.

(defnlex[idxsource](let[ch(nthsourceidx)](cond(digit?ch)(let[end-idx(; use regex to find word break; return that index)digit(subvecsourceidxend-idx)_(valid-digit?digit)token(create-digit-tokendigit)](conj(lex(incend-idx)source)token))(alphabet?ch)(let[end-idx(;use regex to find word break;return that index)identifier(subvecsourceidxend-idx)keyword?(is-keyword?identifier)token(create-tokenidentifierkeyword?)](conj(lex(incend-idx)source)token))(...)(...):else(lexer-erroridxsource))))
Enter fullscreen modeExit fullscreen mode

But this became too complicated.
I was heavily using the index.
In each condition clause, I would try to find a corresponding ending index ( after matching letters / digits etc ).
From both those indexes, would get the identifier and create the token.
I haven't added the code for handling errors above, so each condition was more complex.
I was getting off by one errors and it was hard to debug.
This approach was similar to one present inCrafting Interpreters | Scanner.
I was converting imperative Java code to Clojure, and it wasn't working out.

I had earlier watched this videoCode Review: Clojure Lexer by ThePrimeAgen.
In that video, they were doing code review, for a lexer written in Clojure.
Clojure Lexer implementation by Vikasg7

This lexer was also written for a C like language, thus the main structure for the lexer was also suitable for tokenizing C programs.
I took reference from this code, and implemented the lexer.

(defn-lexer-ctx[]{:tokens[]:line1:col1})(defnlex([source](lexsource(lexer-ctx)))([[chpkth:assource]{:keys[linecol]:asctx}](cond(empty?source)(updatectx:tokens#(conj%(t/create:eoflinecol)))(...)(...)(whitespace?ch)(recur(nextsource)(->ctx(update:colinc)))(letter?ch)(let[[chrsrst](split-withletter-digit?source)lexeme(applystrchrs)cnt(countchrs)kind(t/identifier->kindlexeme)token(if(=:identifierkind)(t/createkindlinecollexeme)(t/createkindlinecol))](recur(applystrrst)(->ctx(update:col#(+%cnt))(update:tokens#(conj%token))))):else(exc/lex-error{:lineline:colcol}))))
Enter fullscreen modeExit fullscreen mode

The main insights which helped simplify was that keeping an index around is not necessary.
I used index to find the current character, but a simpler way of getting the current character is destructuring.
Destructuring made the overall code much simpler.
I can peek the starting characters, and also keep the original string.

[first-charsecond-charthird-char:assource]"abcde"=>\a\b\c"abcde"
Enter fullscreen modeExit fullscreen mode

To simplify the individual conditional cases, instead of manually keeping track of index, used functions which specialized in string / collection processing.

For e.g.

var_name = 10;^       ^start   end
Enter fullscreen modeExit fullscreen mode

The earlier implementation would start from the start index, keep increasing index till it finds whitespace.
Then use these two indexes to build the identifier.
This is really error prone.

Instead of manually finding ending index, a simpler way is to just,

(split-withletter?"var_name = 10;")=>[(\v\a\r\_\n\a\m\e)(\space\=\space\1\0\;)]
Enter fullscreen modeExit fullscreen mode

pass the entire input string to a function which is responsible for matching characters given a predicate.
The first part is then converted to a token, andlex called recursively on rest of the string.
The final result is first token conj'd to result of the recursive call.

Parser

The list of tokens carries no semantic meaning.
The parsing phase is responsible for converting a list of tokens, to a tree which conforms toC's grammar.

(defex"static int x = 20;int main(void) {  int y = 22;  return x + y;}")(parse-from-srcex)=>[{:type:declaration,:declaration-type:variable,:variable-type{:type:int},:storage-class:static,:identifier"x",:initial{:type:exp,:exp-type:constant-exp,:value{:type:int,:value20}}}{:type:declaration,:declaration-type:function,:function-type{:type:function,:return-type{:type:int},:parameter-types[]},:storage-classnil,:identifier"main",:parameters[],:body[{:type:declaration,:declaration-type:variable,:variable-type{:type:int},:storage-classnil,:identifier"y",:initial{:type:exp,:exp-type:constant-exp,:value{:type:int,:value22}}}{:type:statement,:statement-type:return,:value{:type:exp,:exp-type:binary-exp,:binary-operator:plus,:children[:left:right],:left{:type:exp,:exp-type:variable-exp,:identifier"x"},:right{:type:exp,:exp-type:variable-exp,:identifier"y"}}}]}]
Enter fullscreen modeExit fullscreen mode

The book only implements a subset of C, so the parser does not handle all valid code. For e.g.

intx=20,y=22;// not supported by the parser, even though it's valid
Enter fullscreen modeExit fullscreen mode

Each language construct is composed of smaller constructs.
Program is a list of functions. Function is list of declarations or statements and so on.
This tree like data structure needs to be represented in code, and Clojure maps are the perfect to do it.

{:type:exp,:exp-type:binary-exp,:binary-operator:plus,:children[:left:right],:left{:type:exp,:exp-type:variable-exp,:identifier"x"},:right{:type:exp,:exp-type:variable-exp,:identifier"y"}}
Enter fullscreen modeExit fullscreen mode

Each node in this tree contains keys which help identify it's semantic meaning.
The above node represents a binary expression, and has keys for it's children nodes.

Maps are a joy to work with.
Unlike in a typed language (talking about type system like Java, C++) where I would have had to create new classes for each construct in the C grammar, maps help avoid that boilerplate.
I don't have to create specializedget, update like functions for any nodes.
As all the nodes are map, all Clojure functions which operate on maps work.

This was a double edged sword though.
As the compiler cannot guarantee at compile time what keys are present and no autocomplete, it became harder to keep track of the keys present in a node.

Anytime I would have to use something, I would need to go back to the constructor function and remember what name did I gave to those keys.

(defnbinary-exp-node[lrop]{:type:exp:exp-type:binary-exp:binary-operatorop:children[:left:right]:leftl:rightr})
Enter fullscreen modeExit fullscreen mode

This was tedious and error prone, but there is no way to solve this problem ( at least in terms of autocomplete ).
I thought records could help, but apart from giving automatic constructor function, they did not help to solve this problem.
It's still difficult to keep track of, and make sure of the keys present in any map.

To slightly alleviate this problem, I used a schema library.

Using a Schema Library

There are mainly two benefits to using a schema.

  • Central place which defines what keys / value can be in a map.
  • Perform runtime validation

I first triedclojure.spec, and it did work but I felt as I added more constructs it became too complicated, at least for my simple use case.
I just wanted to specify the keys on my map, and what value they should be.

I went withmetosin/malli.
It's way of specifying schema was exactly what I was looking for.

(defBinaryExp[:map[:type[:=:exp]][:exp-type[:=:binary-exp]][:binary-operator`[:enum~@(set(keyst/bin-ops))]][:left[:ref#'Exp]][:right[:ref#'Exp]][:children[:=[:left:right]]][:value-type{:optionaltrue}#'Type]])
Enter fullscreen modeExit fullscreen mode

Above is the schema for a binary expression.
All the keys are there, and what values they can be.

I used malli's validation at the end to verify whether my program's output conforms to the schema.

(m/coerce#'s/Programprogram); Validates whether my program conforms to the program schema
Enter fullscreen modeExit fullscreen mode

This function is present at the boundary of compiler pass.
Before passing the result to the next function, I validate whether the AST created is valid and conforms to the schema.
As the name of the keys are now centralized and with runtime validation, I caught bugs much earlier and was able to add more constructs easily, as the next compiler pass always had valid AST as input.

Parsing

Instaparse

Instead of trying to write my own parser, I usedInstaparse.
And it worked great.
I already had the grammar with me, and was able to use Instaparse to quickly write a parser.
It's a feature rich library and was great, until I had to implement operators with precedence.
It's possible to write a grammar which handles precedence, but it's not fun.
The grammar ends up containing intermediate rules, and as C has a lot of operators this was getting out of hand.
I ended up hand writing the parser.

Handwritten

The parser has some primitive functions.
These functions are then combined in bigger functions, which handle individual rules in the grammar.
My implementation closely resembles the one mentioned in the book.

(defn-expect"Expects the first token in list to be of given kind.  Returns the token and remaining tokens."[kind[token&rst]](if(=kind(:kindtoken))[tokenrst](exc/parser-error"Actual and expected token differ.")))(defn-parse-repeatedly"Repeatedly runs given parse function on input until end-kind encountered.  `parse-f` must return result in form [node remaining-tokens]."[tokensparse-fend-kind](loop[res[]tokenstokens](if(=end-kind(:kind(firsttokens)))[restokens](let[[noderst](parse-ftokens)](recur(conjresnode)rst)))))
Enter fullscreen modeExit fullscreen mode

These functions are called inside main language constructs.

(defn-parse-if-statement[tokens](let[[_tokens](expect:kw-iftokens)[_tokens](expect:left-parentokens)[exp-nodetokens](parse-exptokens)[_tokens](expect:right-parentokens)[then-stmttokens](parse-statementtokens)else?(=:kw-else(:kind(firsttokens)))](if(notelse?)[(if-statement-nodeexp-nodethen-stmt)tokens](let[[_tokens](expect:kw-elsetokens)[else-stmttokens](parse-statementtokens)][(if-statement-nodeexp-nodethen-stmtelse-stmt)tokens]))))(defn-parse-block[tokens](let[[_tokens](expect:left-curlytokens)[block-itemstokens](parse-repeatedlytokensparse-block-item:right-curly)[_tokens](expect:right-curlytokens)][block-itemstokens]))
Enter fullscreen modeExit fullscreen mode

The general structure of each function is expecting certain tokens, such as parenthesis, curly braces etc.
If there is a list of things, then useparse-repeatedly.
Store intermediate AST nodes in thelet blocks.
At the end, return the AST node, and the remaining tokens.

I don't like how this looks, but it works.
It's super simple to add more constructs.
( At least it was for the current implemented grammar. Maybe after pointers, arrays this will become difficult to maintain).
I wanted to implement this in a different way.

I tried to first convert this into some sort of threading macro.

(defnparse-if-statement[tokens](->>tokens(expect:kw-if)(expect:left-paren)))
Enter fullscreen modeExit fullscreen mode

The next step expects an expression, but I need to hold onto that ast node of the expression.
So I would have needed to store that in a let block anyway, or thread it through the remaining functions, which didn't look that simple to me.
I wasn't really able to figure out a better way to write the parser.
Once the general structure was there, it was easy to add new constructs and I kept going.

Below are some resources I found which implement a parser in Clojure.
I haven't been able to give them enough time, but just on a high level overview their implementation looks to be more declarative.
I will refactor the parser in future.


One thing which in general irked me throughout was whether I am doing thingsclojure way.
(Especially with how I implemented the parser.)
There is a section in this talk,Learning and Teaching Clojure on the job at Amperity - Dave Fetterman
, where the speaker mentions that they were writing Lisp, basically like C. Below is a quote from the speaker,

I found something called the let statement, this let me make it (lisp) look just like C, variable thing, variable thing, oh I have an imperative program with parentheses.

I felt the same way. There were certain instances where I felt I had come up with a neat functional way of doing things, but lot of my code, or the surrounding code still felt imperative.
Some functions became huge and unwieldy, but I wasn't really getting how should I refactor it / make it simple.

For example, below is the code for typechecking a local variable declaration.

(defn-typecheck-local-scope-variable-declaration[{:keys[identifiervariable-typestorage-classinitial]:asd}ident->symbol](condp=storage-class:extern(let[_(when(not(nil?initial))(exc/analyzer-error"Initializer on local extern variable declaration."d))prev-symbol(getident->symbolidentifier)prev-type(:typeprev-symbol)_(when(andprev-symbol(not=prev-typevariable-type))(exc/analyzer-error"Redeclared with different types."{:declaration1d:declaration2prev-symbol}))symbols(ifprev-symbolident->symbol(assocident->symbolidentifier(sym/create-symbolvariable-type(sym/static-attribute(sym/no-initializer-iv)true))))]{:declarationd:ident->symbolsymbols}):static(let[initial-value(to-static-initinitialvariable-type)updated-symbols(assocident->symbolidentifier(sym/create-symbolvariable-type(sym/static-attributeinitial-valuefalse)))]{:declarationd:ident->symbolupdated-symbols})(let[updated-symbols(assocident->symbolidentifier(sym/create-symbolvariable-type(sym/local-attribute)))casted-e(if(nil?initial)initial(convert-to-expinitialvariable-type))t-e(typecheck-optional-expressioncasted-eupdated-symbols)]{:declaration(assocd:initialt-e):ident->symbolupdated-symbols})))
Enter fullscreen modeExit fullscreen mode

I could refactor out some functionality above, such as the:static and:extern cases can be in their different functions, but it does not help that much.
I would be hiding these inside some other function, and sometimes I just prefer the entire logic to be present at once.
Another problem was naming, the original function name was already too long, and another specialized case function felt wrong.

There are a lot of such instances, especially in the next pass of the compiler, where I want to refactor.
There seems to be a lot of repetitive code, but it's mixed with the complexity of C specification as well (for e.g. storage class specifiers), so I still haven't been able to refactor it to something I am happy with.

One of the solutions to this problem ( or in general ), I guess would be reading more code.
I have watched a lot of conference talks, but throughout building this I realized I haven't actually read more codebases.
One of the projects which helped me and gave new ideas istools.analyzer. I will describe how it helped in thetacky section.

Listed below are some articles, projects which can help to find other projects, ideas etc for implementing things inclojure way.


Analyze

The analyzer phase consists of three smaller passes, but their core is the same.
All of them work with the parsed AST from previous step, and go through these steps:

  1. Resolve: Validates things such as variables re-declared twice in same scope.
    Variables used before declared, nested function definitions etc.

  2. Label Loop: Loops, break, continue statements require labels to identify them.
    This ends up being in the final assembly to create labels for jumps statements.

  3. Typecheck: Typechecks variable declaration, function calls etc.
    Also adds type information to each AST node.

The main addition in this phase is of the symbol map.
Each identifier in the program is now tracked.
It's name, storage class, type, initial value etc.
How should this map ( which is required for the above validation ) be passed around in different function calls.

For e.g.

Map<String,Attributes>m=newMap();handle_variable_declaration(ast,m){stringvar_name=...;Attributeattr=...;m.put(var_name,attr);}// similarly for functions, compound statements etctypecheck_program(ast,m){handle_variable_declaration(ast,m);handle_function_declaration(ast,m);}
Enter fullscreen modeExit fullscreen mode

There is a map, which we access by reference.
This map contains the type information and other attributes.
This reference is passed around to all the functions.

In imperative language this is simple.
Things such as hashmap are by references, and it can be updated from any function, nested to any levels and it still works.

I wasn't sure how to port this to Clojure.
As maps are immutable, even if Iassoc/update new values onto a map, it won't reflect outside.

(defnhandle-var-decl[astm](...(assocm:key:value))); this change not reflected in the m passed as parameter
Enter fullscreen modeExit fullscreen mode

We can return an updated map from this function.
Then the next function which requires the map for it's processing, would need to require it.
The problem was how to glue these functions together.
How to easily pass the updated map from function to another ?
The solution was usingreduce.

(defntypecheck-declaration[astm](...){:declaration(...):ident->symbol(updated-mm)})(defn-typecheck-program[program](let[rf(fn[accdecl](let[d(typecheck-declarationdecl(:ident->symbolacc))]{:program(conj(:programacc)(:declarationd)):ident->symbol(:ident->symbold)}))](reducerf{:program[]:ident->symbol{:at-top-leveltrue}}program)))
Enter fullscreen modeExit fullscreen mode

Let's say we process a declaration.
This declaration adds a new variable to the map, with it's type, initial value etc.
After we process the declaration, we return the updated symbol map and set it in the accumulator.
This map is then passed to the next declaration, as it's available in the reducer's accumulator.

Same approach for compound statements, which are list of statements.
Each statement processing function, returns the updated map, which is used in the next statement call.

A easier way to do this is usingatom, but I wanted to first try solving this without relying on atoms.
I did use atoms later on in a different phase, but for now returning an updated map from every function worked fine.

Overall this phase of the compiler returns the same AST as previous phase, but with extra type information and the symbol map.

;; same input program as above{:program[{:type:declaration,:declaration-type:variable,:variable-type{:type:int},:storage-class:static,:identifier"x",:initial{:type:exp,:exp-type:constant-exp,:value{:type:int,:value20}}}{:type:declaration,:declaration-type:function,:function-type{:type:function,:return-type{:type:int},:parameter-types[]},:storage-classnil,:identifier"main",:parameters[],:body[{:type:declaration,:declaration-type:variable,:variable-type{:type:int},:storage-classnil,:identifier"y.0",:initial{:type:exp,:exp-type:cast-exp,:target-type{:type:int},:typed-inner{:type:exp,:exp-type:constant-exp,:value{:type:int,:value22},:value-type{:type:int}},:children[:value],:value{:type:exp,:exp-type:constant-exp,:value{:type:int,:value22},:value-type{:type:int}},:value-type{:type:int}}}{:type:statement,:statement-type:return,:value{:type:exp,:exp-type:binary-exp,:binary-operator:plus,:children[:left:right],:left{:type:exp,:exp-type:variable-exp,:identifier"x",:value-type{:type:int}},:right{:type:exp,:exp-type:variable-exp,:identifier"y.0",:value-type{:type:int}},:value-type{:type:int}}}]}],:ident->symbol{"x"{:type{:type:int},:attribute{:type:static,:initial-value{:type:initial,:static-init{:type:int-init,:value20}},:global?false}},"main"{:type{:type:function,:return-type{:type:int},:parameter-types[]},:attribute{:type:fun,:defined?true,:global?true}},"y.0"{:type{:type:int},:attribute{:type:local}}}}
Enter fullscreen modeExit fullscreen mode

Tacky

The book introduces an intermediate IR representation called Tacky.
Expressions from above passes are converted to athree-address code.
Below is an example output from this pass.
The instructions key now has instructions such asjump,copy,add etc.
This is closer to the assembly representation.
This pass reduces the nested expressions to variables.
The operands to each instruction are either a constant value, or a variable.

(defsource"int foo(void) {  int x = 0;  for (; x < 21; x += 1)    ;  return x;}int main(void) { return foo() + foo(); }")(tacky-from-srcex)=>{:program[{:type:declaration,:declaration-type:function,:function-type{:type:function,:return-type{:type:int},:parameter-types[]},:storage-classnil,:identifier"foo",:parameters[],:global?true,:instructions[{:type:copy,:src{:type:constant,:value{:type:int,:value0}},:dst{:type:variable,:value"x.196"}}{:type:label,:identifier"for_start.198"}{:type:binary,:binary-operator:less-than,:src1{:type:variable,:value"x.196"},:src2{:type:constant,:value{:type:int,:value21}},:dst{:type:variable,:value"binary_result_less_than.200"}}{:type:jump-if-zero,:identifier"break_for_label.197",:val{:type:variable,:value"binary_result_less_than.200"}}{:type:label,:identifier"continue_for_label.197"}{:type:binary,:binary-operator:add,:src1{:type:variable,:value"x.196"},:src2{:type:constant,:value{:type:int,:value1}},:dst{:type:variable,:value"binary_result_add.199"}}{:type:copy,:src{:type:variable,:value"binary_result_add.199"},:dst{:type:variable,:value"x.196"}}{:type:jump,:identifier"for_start.198"}{:type:label,:identifier"break_for_label.197"}{:type:return,:val{:type:variable,:value"x.196"}}{:type:return,:val{:type:constant,:value{:type:int,:value0}}}]}{:type:declaration,:declaration-type:function,:function-type{:type:function,:return-type{:type:int},:parameter-types[]},:storage-classnil,:identifier"main",:parameters[],:global?true,:instructions[{:type:fun-call,:identifier"foo",:arguments[],:dst{:type:variable,:value"function_call_result_foo.201"}}{:type:fun-call,:identifier"foo",:arguments[],:dst{:type:variable,:value"function_call_result_foo.202"}}{:type:binary,:binary-operator:add,:src1{:type:variable,:value"function_call_result_foo.201"},:src2{:type:variable,:value"function_call_result_foo.202"},:dst{:type:variable,:value"binary_result_add.203"}}{:type:return,:val{:type:variable,:value"binary_result_add.203"}}{:type:return,:val{:type:constant,:value{:type:int,:value0}}}]}],:ident->symbol{"foo"{:type{:type:function,:return-type{:type:int},:parameter-types[]},:attribute{:type:fun,:defined?true,:global?true}},"x.196"{:type{:type:int},:attribute{:type:local}},"main"{:type{:type:function,:return-type{:type:int},:parameter-types[]},:attribute{:type:fun,:defined?true,:global?true}},"binary_result_add.199"{:type{:type:int},:attribute{:type:local}},"binary_result_less_than.200"{:type{:type:int},:attribute{:type:local}},"function_call_result_foo.201"{:type{:type:int},:attribute{:type:local}},"function_call_result_foo.202"{:type{:type:int},:attribute{:type:local}},"binary_result_add.203"{:type{:type:int},:attribute{:type:local}}}}
Enter fullscreen modeExit fullscreen mode

Each construct in the previous AST node, is converted to instructions in three address code format.

(defnif-statement-handler[ssymbols](let[cond-exp(run-expression-handler(:conditions)symbols)cond-value(:valcond-exp)cond-instructions(:instructionscond-exp)then-instructions(statement->tacky-instruction(:then-statements)symbols)end-label(label"if_end")else-label(label"if_else")else?(:else-statements)](ifelse?[cond-instructions(jump-if-zero-instructioncond-valueelse-label)then-instructions(jump-instructionend-label)(label-instructionelse-label)(statement->tacky-instruction(:else-statements)symbols)(label-instructionend-label)][cond-instructions(jump-if-zero-instructioncond-valueend-label)then-instructions(label-instructionend-label)])))
Enter fullscreen modeExit fullscreen mode

Above is convertingif statements to instructions in TAC form.
It first recursively generates the instructions for expressions inside a statement.
In above it's the condition expression.
The other statements inside theif block are generated, and the result from all these instructions is returned as a list of instructions.

The main change is now usingsymbols parameter. Which is an atom, containing a map from variable identifiers to their type / initial value etc.
Always returning an updated version of this map felt tedious now, and for this phase I used an atom.

The other place which felt repetitive was how I processed the AST nodes.
Below is the schema for an expression.

(defVariableExp[:map[:type[:=:exp]][:exp-type[:=:variable-exp]][:identifierstring?]])(defUnaryExp[:map[:type[:=:exp]][:exp-type[:=:unary-exp]][:unary-operator`[:enum~@t/unary-ops]][:value[:ref#'Exp]][:children[:=[:value]]]])(defBinaryExp[:map[:type[:=:exp]][:exp-type[:=:binary-exp]][:binary-operator`[:enum~@(set(keyst/bin-ops))]][:left[:ref#'Exp]][:right[:ref#'Exp]][:children[:=[:left:right]]]])(defExp[:schema{:registry{::mexp-variable#'VariableExp::mexp-unary#'UnaryExp::mexp-binary#'BinaryExp}}[:multi{:dispatch:exp-type}[:variable-exp#'VariableExp][:unary-exp#'UnaryExp][:binary-exp#'BinaryExp]]])
Enter fullscreen modeExit fullscreen mode

Exp is by it's nature recursive.
A binary expression operands can be any other expression, such as functional call, unary expression etc.

To process expressions, I started with the above approach.

(defmultiexp-handler[e](:typee))(defmethodexp-handler:binary[expsymbols](let[{v1:valinsts1:instructions}(exp-handler(:leftexp)){v2:valinsts2:instructions}(exp-handler(:rightexp))op(binary-operator(:binary-operatorexp))dst(variable(str"binary_result_"op))_(add-var-to-symboldst(tc/get-typeexp)symbols)binary-inst(binary-instructionopv1v2dst)]{:valdst:instructions(flatten[insts1insts2binary-inst])}))
Enter fullscreen modeExit fullscreen mode

Each handler, internally calledexp-handler again.
This felt just extra work, as it's a recursive definition there should be a way to recursively process this map.
When I access the:left on the binary expression, it should already contain the evaluated value.

I looked into how to operate on recursive tree like data structures in Clojure.
I first looked intoclojure.zip.
This blog postClojure Zippers by Ivan Grishaev and talkThe Art of Tree Shaping with Zippers by Arne Brasseur are a great intro.
But the conclusion I came up with was that this seemed to be doing too much, and I definitely do not get the full potential of using zippers.
I just wanted to have pre computed value in my maps.

The second thing wasclojure.walk.
This blog post byLearning to walk with Clojure by Abhinav Omprakash explains the walk functions in details with a lot of examples.

The functionpostwalk seemed to be the solution.
It does post order traversal, calling a suppliedf on child nodes.
It replaces each child node with(f child).
This was exactly what I was looking for.

(clojure.walk/postwalk(fn[x](prnx)(if(vector?x)x(*x2)))[1[2][34[5]]])=>12[4]345[10][68[10]][2[4][68[10]]]
Enter fullscreen modeExit fullscreen mode

The final result is returned at the last, after all the children node are processed.
I can use the same approach for recursively running the expression handlers, but how to identify the children ?
The above example is a nested vector, and each element is valid node. But in the AST map, only specific keys are recursive in nature.

The solution to this was specifying children nodes in the AST itself.
Then use this to write a custom postwalk function.
This solution was suggested byhiredman on Clojurians Slack.
The same approach was first used intools.analyzer.
Its described in these talksTimothy Baldridge - Data All The ASTs
andClojure eXChange 2015 Immutable code analysis with tools analyzer.

The custom postwalk function is below. It know's what are the children nodes from the ast itself, and then recursively processes them first.

(defnpostwalk[astf](f(reduce(fn[acckey](let[value(getacckey)](if(vector?value)(assocacckey(doall(map(fn[node](postwalknodef))value)))(assocacckey(postwalkvaluef)))))ast(:childrenast))))
Enter fullscreen modeExit fullscreen mode

After this, the code for handling expressions is now simpler.

(defnbinary-exp-handler[expsymbols](let[{v1:valinsts1:instructions}(:leftexp); not running the expression-handler again{v2:valinsts2:instructions}(:rightexp)op(binary-operator(:binary-operatorexp))dst(variable(str"binary_result_"op))_(add-var-to-symboldst(tc/get-typeexp)symbols)binary-inst(binary-instructionopv1v2dst)]{:valdst:instructions(flatten[insts1insts2binary-inst])}))
Enter fullscreen modeExit fullscreen mode

Assembly

This pass has the most amount of code.
It converts the Tacky AST to Assembly instructions.
There's a lot of logic and edge cases to be handled, so that the assembly produced is compliant.

  • Functions have to followSystem V calling.First 6 parameters are passed in registers, other on the stack.
  • Stack pointer needs to be a multiple of 16. As different types take up different space, this needs to maintained manually by rounding to nearest offsets.
  • Several assembly instructions cannot take both operands as memory addresses, so they are rewritten using intermediate registers.

There are other smaller details such as above, but apart from that this phase codewise is not doing anything interesting.
It's again dispatch on type of the instruction from the previous Tacky phase, and generating assembly instructions.

;; generates ret instruction in the assembly output(defmethodtacky-instruction->assembly-instructions:return[{return-value:val}m](let[src(tacky-val->assembly-operandreturn-value); return valuereg(reg-operand:ax); which operand to in mov instructionsrc-type(tacky-val->assembly-typereturn-valuem)][(mov-instructionsrc-typesrcreg)(ret-instruction)])); moves value into eax, returns
Enter fullscreen modeExit fullscreen mode

Assembly instructions take operands, but they can only take a specified combination of operands.
Some cannot take in immediate values in the source, some cannot take both operands to be from the stack.
In some instructions if the specified constant value is too large ( out of 32 bit integer range ), then temporary registers needs to be used.
I initially started with usingcond blocks to handle all these if conditions, but it was getting out of hand. I ended up usingcore.match.

These talks were a great introduction to the librarySean Johnson - Pattern Matching in Clojure andUnderstanding pattern matching in Clojure with core.match
by Misophistful
. Example of using match, with maps and having guard conditions.

(defn-fix-cmp-instruction[instruction](let[src(:srcinstruction)dst(:dstinstruction)assembly-type(:assembly-typeinstruction)imm-outside-range?(fn[o](and(=:imm(:operando))(not(util/in-int-range?(:valueo)))))](match[instruction][{:src{:operand(:or:data:stack)}; operand can be either on stack or data section:dst{:operand(:or:data:stack)}}][(mov-instructionassembly-typesrc(reg-operand:r10))(cmp-instructionassembly-type(reg-operand:r10)dst)][({:assembly-type:quadword:src{:operand:imm}:dst{:operand:imm}}; guard on operand being in integer range:guard[(compimm-outside-range?:src)])][(mov-instruction:quadwordsrc(reg-operand:r10))(mov-instruction:quadworddst(reg-operand:r11))(cmp-instruction:quadword(reg-operand:r10)(reg-operand:r11))][({:assembly-type:quadword:src{:operand:imm}}:guard[(compimm-outside-range?:src)])][(mov-instruction:quadwordsrc(reg-operand:r10))(cmp-instruction:quadword(reg-operand:r10)dst)][{:dst{:operand:imm}}][(mov-instructionassembly-typedst(reg-operand:r11))(cmp-instructionassembly-typesrc(reg-operand:r11))]:elseinstruction)))
Enter fullscreen modeExit fullscreen mode

Below is an output from the assembly phase.

(definput"int foo(int x) {  if (x == 42) {    return x;  }  return foo(x + 1);}int main(void) { return foo(0); }")(asembly-from-srcinput)=>{:program[{:op:function,:identifier"foo",:global?true,:instructions[{:op:binary,:binary-operator:sub,:assembly-type:quadword,:src{:operand:imm,:value16},:dst{:operand:reg,:register:sp}}{:op:mov,:assembly-type:longword,:src{:operand:reg,:register:di},:dst{:operand:stack,:value-4}}{:op:cmp,:assembly-type:longword,:src{:operand:imm,:value42},:dst{:operand:stack,:value-4}}{:op:mov,:assembly-type:longword,:src{:operand:imm,:value0},:dst{:operand:stack,:value-8}}{:op:setcc,:operand{:operand:stack,:value-8},:cond-code:e}{:op:cmp,:assembly-type:longword,:src{:operand:imm,:value0},:dst{:operand:stack,:value-8}}{:op:jmpcc,:identifier"if_end.234",:cond-code:e}{:op:mov,:assembly-type:longword,:src{:operand:stack,:value-4},:dst{:operand:reg,:register:ax}}{:op:ret}{:op:label,:identifier"if_end.234"}{:op:mov,:assembly-type:longword,:src{:operand:stack,:value-4},:dst{:operand:reg,:register:r10}}{:op:mov,:assembly-type:longword,:src{:operand:reg,:register:r10},:dst{:operand:stack,:value-12}}{:op:binary,:binary-operator:add,:assembly-type:longword,:src{:operand:imm,:value1},:dst{:operand:stack,:value-12}}{:op:mov,:assembly-type:longword,:src{:operand:stack,:value-12},:dst{:operand:reg,:register:di}}{:op:call,:identifier"foo"}{:op:mov,:assembly-type:longword,:src{:operand:reg,:register:ax},:dst{:operand:stack,:value-16}}{:op:mov,:assembly-type:longword,:src{:operand:stack,:value-16},:dst{:operand:reg,:register:ax}}{:op:ret}{:op:mov,:assembly-type:longword,:src{:operand:imm,:value0},:dst{:operand:reg,:register:ax}}{:op:ret}]}{:op:function,:identifier"main",:global?true,:instructions[{:op:binary,:binary-operator:sub,:assembly-type:quadword,:src{:operand:imm,:value16},:dst{:operand:reg,:register:sp}}{:op:mov,:assembly-type:longword,:src{:operand:imm,:value0},:dst{:operand:reg,:register:di}}{:op:call,:identifier"foo"}{:op:mov,:assembly-type:longword,:src{:operand:reg,:register:ax},:dst{:operand:stack,:value-4}}{:op:mov,:assembly-type:longword,:src{:operand:stack,:value-4},:dst{:operand:reg,:register:ax}}{:op:ret}{:op:mov,:assembly-type:longword,:src{:operand:imm,:value0},:dst{:operand:reg,:register:ax}}{:op:ret}]}],:backend-symbol-table{"foo"{:type:fun-entry,:defined?true},"x.232"{:type:obj-entry,:static?false,:assembly-type:longword},"main"{:type:fun-entry,:defined?true},"binary_result_equal.233"{:type:obj-entry,:static?false,:assembly-type:longword},"binary_result_add.236"{:type:obj-entry,:static?false,:assembly-type:longword},"function_call_result_foo.237"{:type:obj-entry,:static?false,:assembly-type:longword},"function_call_result_foo.238"{:type:obj-entry,:static?false,:assembly-type:longword}}}
Enter fullscreen modeExit fullscreen mode

Emission

This is simplest phase of the compiler.
It just takes in the assembly in ast form, and converts it to assembly instructions.
E.g.

(defn-mov-instruction-emit[instruction](let[atype(:assembly-typeinstruction)opts{:register-width(assembly-type->operand-sizeatype)}src(operand-emit(:srcinstruction)opts)dst(operand-emit(:dstinstruction)opts)suffix(assembly-type->instruction-suffixatype)][(format"    %s%s        %s, %s""mov"suffixsrcdst)]))
Enter fullscreen modeExit fullscreen mode

The generated assembly is slightly different, whether the system is Linux or Mac, but otherwise there isn't much going on in these functions, just string formatting.

Below is an example of final assembly generated.

(definput"int main(void) {  static int x = 1;  if (x == 42)    return x;  x = x + 1;  return main();}")(emitinput)
Enter fullscreen modeExit fullscreen mode
    .data    .balign 4x.251:    .long 1    .globl main    .textmain:    pushq       %rbp    movq        %rsp, %rbp    subq        $16, %rsp    cmpl        $42, x.251(%rip)    movl        $0, -4(%rbp)    sete        -4(%rbp)    cmpl        $0, -4(%rbp)    je         .Lif_end.253    movl        x.251(%rip), %eax    movq        %rbp, %rsp    popq        %rbp    ret    .Lif_end.253:    movl        x.251(%rip), %r10d    movl        %r10d, -8(%rbp)    addl        $1, -8(%rbp)    movl        -8(%rbp), %r10d    movl        %r10d, x.251(%rip)    call        main    movl        %eax, -12(%rbp)    movl        -12(%rbp), %eax    movq        %rbp, %rsp    popq        %rbp    ret    movl        $0, %eax    movq        %rbp, %rsp    popq        %rbp    ret.section .note.GNU-stack,"",@progbits
Enter fullscreen modeExit fullscreen mode

This was my first largeish (for me) project in Clojure, or Lisp in general.
I got familiar with a lot of libraries, talks and patterns.
I tried different editor setups throughout (Vim + Conjure, Vscode + Calva).
Currently using Doomemacs.

Using the REPL for immediate feedback was probably my favorite part in all this.
Trying out my solution without the overhead of writing it in a test, or generating executable and then running it helped me iterate on solutions quicker and led me to experiment more than I probably would have.
Usingflowstorm for debugging is amazing.
It has saved me a lot of time, which would have been spent debugging by printing the values etc.

I am yet to complete the book. Many of the upcoming features are core to C, such as pointers, arrays, structs.
As of now, I feel the base structure of the compiler is there, and each feature can be neatly added on top.
But as those features are more complex than the ones presented above, there is a still a gradual difficulty curve,
and I notice more opportunities to make my code simpler.
Looking forward to implementing the rest of the book !

Top comments(0)

Subscribe
pic
Create template

Templates let you quickly answer FAQs or store snippets for re-use.

Dismiss

Are you sure you want to hide this comment? It will become hidden in your post, but will still be visible via the comment'spermalink.

For further actions, you may consider blocking this person and/orreporting abuse

  • Location
    Bangalore, India
  • Joined

Trending onDEV CommunityHot

DEV Community

We're a place where coders share, stay up-to-date and grow their careers.

Log in Create account

[8]ページ先頭

©2009-2025 Movatter.jp