# smalltt
This project is a demo of several techniques for high-performance elaboration with dependent types. It is a complete rewrite of the old smalltt version which I wrote mostly in 2018-2019. I also include some benchmarks, which are intended to be as similar as possible across Agda, Lean, Coq, Idris 2 and smalltt.

You can skip to the benchmarks below if that's what you're after.
Smalltt is fast, however:

- Smalltt is not nearly as fast as it could possibly be. A lot of tuning based on real-world benchmarking data is missing here, because there isn't any real-world code for smalltt besides the benchmarks that I wrote. A number of plausible optimizations are also missing.
- The primary motivation is to demonstrate designs that can plausibly scale up to feature-complete languages, still yield great performance there, and naturally support a lot more optimization and tuning than what's included here.

This project is not yet finalized. Code and documentation alike may be subject to change, cleanup, extension, or perhaps complete rewrite.
First, clone or download this repository.

Using stack:

- Install stack.
- Run `stack install` in the smalltt directory. If you have LLVM installed, use `stack install --flag smalltt:llvm` instead, which gives some performance boost.

Using cabal:

- Install cabal.
- Run `cabal v2-update`.
- Run `cabal v2-install` in the smalltt directory. If you have LLVM, use `cabal v2-install -fllvm` instead.

Also make sure that the executable is on the PATH. On Linux, the stack install directory is `$HOME/.local/bin`, and the cabal one is `$HOME/.cabal/bin`.

Installation gets you the `smalltt` executable. You can find usage information after starting `smalltt`.
Smalltt is a minimal dependent type theory implementation. Features:
- Type-in-type.
- Dependent functions.
- Agda-style implicit functions with named implicit arguments.
- Basic pattern unification.
- A distinguished top-level scope and local let-definitions.
- An executable which can load a single file at once, and lets us query top-level definitions in several ways.

See Basics.stt for introductory examples.
The core design is based on Coquand's algorithm. This is sometimes called "elaboration with normalization-by-evaluation", or "semantic elaboration". It is becoming the de facto standard design of dependently typed elaboration nowadays, but it is in fact the preferred design for elaboration with any type system which includes binders and substitution in types. It is explained in detail elsewhere; I give a short review here.

Dependent type checking involves execution of arbitrary functional programs at compile time. For executing functional programs, the standard practice (used in virtually all functional languages) is to have
- Immutable program code, which may be machine code or interpreted code.
- Runtime objects, consisting of constructors and closures.
The basic idea is to use the above setup during elaboration, with some extensions.

The elaborator takes as input *raw syntax*, and outputs *core syntax*, which corresponds to immutable program code. When necessary, we evaluate core syntax into *semantic values* (or values in short), which correspond to runtime objects, and indeed they represent function values as closures.

Evaluation and runtime values have to support a somewhat richer feature set than what is typical in functional languages. They must support *open* evaluation, that is, evaluation of programs which may contain free variables. This makes it possible to evaluate code under binders. In such code, free variables cause evaluation to *get stuck*. There are special values corresponding to stuck computations, which are called *neutral* values.

Neutral values make it possible to convert runtime values back to core syntax. This is called *quoting*. Quoting is used in elaboration whenever we need to serialize or store values. For example, since elaboration outputs core syntax, whenever we need to fill a hole in raw syntax, we plug the hole by converting a value to a core term by quoting.

*Normalization by evaluation* (NbE) means normalizing terms by first evaluating then quoting them. The kind of normal forms that we get can vary depending on the details of evaluation and quoting. In particular, it is not mandatory that NbE yields beta-normal terms.
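To make the above concrete, here is a minimal, self-contained NbE sketch for the untyped lambda calculus in Haskell. The names and representations are only illustrative; smalltt's actual term and value types are considerably richer.

```haskell
-- Minimal NbE sketch: terms use de Bruijn indices, values use de Bruijn
-- levels, and stuck computations are represented by neutral values.
data Tm
  = Var Int          -- de Bruijn index
  | Lam Tm
  | App Tm Tm
  deriving Show

data Val
  = VLam Env Tm      -- closure: captured environment + body
  | VVar Int         -- de Bruijn level: a free ("stuck") variable
  | VApp Val Val     -- neutral application, stuck on a variable

type Env = [Val]     -- values of the enclosing binders

eval :: Env -> Tm -> Val
eval env (Var i)   = env !! i
eval env (Lam t)   = VLam env t
eval env (App t u) = case eval env t of
  VLam env' body -> eval (eval env u : env') body   -- beta-reduce
  v              -> VApp v (eval env u)             -- stuck: build a neutral

-- Quote a value back to a term; `l` counts the enclosing binders.
quote :: Int -> Val -> Tm
quote l (VVar x)     = Var (l - x - 1)              -- level -> index
quote l (VApp t u)   = App (quote l t) (quote l u)
quote l (VLam env t) = Lam (quote (l + 1) (eval (VVar l : env) t))

-- Normalization by evaluation: evaluate, then quote.
nf :: Tm -> Tm
nf = quote 0 . eval []
```

For example, `nf (App (Lam (Var 0)) (Lam (Var 0)))` computes the redex and returns `Lam (Var 0)`. In an open setting, `eval` never fails on free variables: they simply accumulate as neutral values, which `quote` turns back into syntax.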
Moreover, values support *conversion checking*. Type equality checking is required in pretty much any type checker. In dependently typed languages, types may include arbitrary programs, and equality checking becomes beta-eta conversion checking of values. At its simplest, this is implemented by recursively walking values. The "open" evaluation makes it possible to get inside closures during conversion checking, so we can check if two functions have beta-eta-convertible bodies.

Elaboration with NbE can be contrasted to elaboration with "naive" evaluation. In this style, compile-time evaluation is performed by term substitution, which is far less efficient than NbE. In some implementations, naive substitution is still used because of its perceived simplicity. However, my experience is that NbE is significantly simpler to implement, and also easier to correctly implement, than capture-avoiding substitution. Furthermore, any attempt to use naive substitution in type checking necessitates additional optimizations, which add more complexity.

For example, Lean uses naive substitution in its kernel, but to recover acceptable performance it has to add extra machinery (memoization, free variable annotations on terms for skipping traversals during substitutions). This ends up being slower and more complicated than a straightforward NbE implementation.

In summary, term substitution should be avoided whenever possible in elaboration implementations. Sometimes it's necessary, but AFAIK only for more niche purposes in more feature-rich systems, and not in performance-critical tasks. Smalltt uses no substitution operation whatsoever, and we can go pretty far without one.

(Remark: cubical type theories are notorious for requiring substitution from the get-go. It's an open research problem how to get rid of naive substitution there.)
Smalltt uses *contextual metavariables*. This means that every metavariable is a function which abstracts over the bound variables in its scope. Take the following surface code.
```
id : (A : U) → A → A = λ A x. x
id2 : (A : U) → A → A = λ A x. id _ x
```
When the elaborator hits the hole in `id2`, it fills it with a fresh metavariable which abstracts over `A` and `x`. The elaboration output is:
```
id : (A : U) → A → A = λ A x. x
?0 = λ A x. A
id2 : (A : U) → A → A = λ A x. id (?0 A x) x
```
Note that `?0` is a fresh top-level definition and the hole gets plugged with it. Smalltt's particular flavor of contextual metavariables puts metas in mutual top-level blocks in the elaboration output. Other setups are possible, including elaborating solved metas to local let-definitions, but those are significantly more complicated to implement.
Also, smalltt uses basic *pattern unification* for producing meta solutions. Tutorials on the basics of contextual metavariables and pattern unification are available elsewhere.
Smalltt does not try very hard to optimize the representation of contextual metas, it just reuses plain lambdas to abstract over scopes. For potential optimizations, see this Coq discussion: coq/coq#12526. As a result, basic operations involving metas are usually linear in the size of the local scope. My benchmarking showed that this is not a significant bottleneck in realistic user-written code, and we don't really have machine-generated code (e.g. by tactics) that could introduce pathologically sized local scopes.
The most basic NbE setup is not adequate for performance. The problem is that we need different features in conversion checking and in quoting:

- In basic conversion checking, we want to evaluate as efficiently as possible.
- In quoting, we want to output terms which are *as small as possible*. The reason is that, through metavariable solutions, the output of quoting is included in the overall elaboration output. So, if quoting returns full beta-normal terms, that reliably destroys performance, as normal forms tend to be extremely large.

The solution is to add control over *definition unfolding* to evaluation and quotation. We call the implementation *glued evaluation*, where the evaluator lazily computes two different values on each unfolding choice. In smalltt we have unfolding control only for top-level definitions. This simplifies implementation, and usually top-level scopes are vastly larger than local scopes, so we already capture the vast majority of size compression by only focusing on top-level unfolding.
See this file for a minimal demo of glued evaluation. In short, top-level variables are evaluated to values which represent a lazy ("non-deterministic") choice between unfolding the definition and not unfolding it. This has a noticeable constant overhead during evaluation but overall the trade-off is well worth it. Later, the quotation function has the choice of visiting either evaluation branch, or both, in which case as much computation as possible is shared between the branches.
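A small self-contained sketch of this idea, with illustrative names (smalltt's actual representation differs in many details):

```haskell
{-# LANGUAGE LambdaCase #-}

import qualified Data.IntMap as IM

-- Terms: de Bruijn indices for local variables, integer ids for top-level
-- definitions.
data Tm = Local Int | Top Int | Lam Tm | App Tm Tm

-- Values: a top-level head remembers both its (reversed) spine and a lazily
-- computed value in which the definition has been unfolded.
data Val
  = VLam Env Tm
  | VLocal Int [Val]       -- stuck on a local variable
  | VTop Int [Val] Val     -- head id, reversed spine, lazy unfolded value

type Env  = [Val]
type Tops = IM.IntMap Val  -- values of the top-level definitions

app :: Tops -> Val -> Val -> Val
app tops t u = case t of
  VLam env body -> eval tops (u : env) body
  VLocal x sp   -> VLocal x (u : sp)
  -- keep both branches: the spine for small quoting, the unfolded value
  -- (a lazy thunk) for fast evaluation
  VTop x sp v   -> VTop x (u : sp) (app tops v u)

eval :: Tops -> Env -> Tm -> Val
eval tops env = \case
  Local i -> env !! i
  Top x   -> VTop x [] (tops IM.! x)  -- the unfolding is not forced here
  Lam t   -> VLam env t
  App t u -> app tops (eval tops env t) (eval tops env u)

-- Quotation then has a choice at every VTop: rebuild the application spine
-- (small output, no unfolding), force the unfolded value (full unfolding),
-- or visit both, sharing as much computation as possible between the branches.
```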
When we need high-performance evaluation during conversion checking, we have it, and when we solve a metavariable, we are able to quote values to terms which are minimal with respect to top-level unfolding. This is also useful in error message reporting and interaction, where we want to be able to display small terms.

Only being able to control top-level unfolding is not quite sufficient for sophisticated interactive proving, but the technique here could be extended to richer unfolding control with modest additional overheads.

Importantly, we can make do with a *single* evaluator for all purposes, with fairly good performance. In contrast, Agda, Coq and Lean all have multiple evaluators, and in all cases only the *slowest* evaluators can be used without restrictions during elaboration. As we'll see in the benchmarks, smalltt is robustly faster than all "slow" evaluators, and can be faster or slower than the Coq bytecode VM depending on workloads.
Usually by "hash consing" we mean a pervasive form of memoization, where certain objects are stored at most once in memory, and any new object construction goes through a table lookup to check if the object already exists. It is frequently mentioned as an optimization technique in typechecking. However, specifically in the context of dependent elaboration, it's not obviously desirable.

Hash consing alone is inadequate for eliminating size explosions. Hash consing merges duplicate objects to a single copy. But it does not handle beta-reduction at all, which is a major source of size explosion! For a simple example, using Peano naturals, it is easy to give a compact definition for `oneMillion`, involving arithmetic operations. But if I normalize `oneMillion`, I get a numeral which is incompressible by hash consing.
If I have something like a first-order term language, hash consing can be very effective. But in dependent type theory, we have higher-order programs with potentially explosive behavior, and it isn't hard to produce size explosions even in the presence of full hash-consing. Considering this, and the performance and complexity overhead of hash consing, I decided to skip it in smalltt.

Hash consing is better suited to more static data, like literals, or types in systems without type-level beta rules, such as simple type theory, Hindley-Milner or System F. In those cases, hash consing fully captures the compression which is possible by rewriting along conversion rules.
In dependently typed elaboration, at least some laziness is essential, because some parts of the program may need to be evaluated, but we don't know anything about *which* parts, until we actually do the elaboration.

At the same time, laziness has significant overhead, so we should limit it to the necessary amount.
Smalltt has the following approach:
- Top-level and local definitions are lazy.
- We instantiate Pi types during elaboration with lazy values.
- Applications headed by top-level variables are lazy.
- Any other function application is call-by-value during evaluation.
The reasoning is the following. First, it does not make sense to have strict evaluation during infer/check, because that would cause the *entire* program to be evaluated during elaboration. Hence the laziness of definitions and Pi instantiation.

On the other hand, evaluation is really only forced by conversion checking. The bulk of the program is never forced by conversion checking, so we might as well make evaluation a bit stricter when it is actually forced, to make it faster.

However, glued evaluation mandates that top-level spines are lazily evaluated. So we keep that lazy, and otherwise have call-by-value function applications.

This seems to work well in practice. While there are some cases where it does superfluous work, in realistic code we still get plenty of laziness through let-definitions and top-level variables.
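As a concrete illustration, here is a variant of the glued evaluator sketched earlier with the above strategy spelled out (illustrative names; Haskell constructor fields and `let` are lazy by default, so only the call-by-value case needs an explicit `$!`):

```haskell
{-# LANGUAGE LambdaCase #-}

import qualified Data.IntMap as IM

data Tm   = Local Int | Top Int | Let Tm Tm | Lam Tm | App Tm Tm
data Val  = VLam Env Tm | VLocal Int [Val] | VTop Int [Val] Val
type Env  = [Val]
type Tops = IM.IntMap Val

app :: Tops -> Val -> Val -> Val
app tops t u = case t of
  VLam env body -> eval tops (u : env) body
  VLocal x sp   -> VLocal x (u : sp)
  VTop x sp v   -> VTop x (u : sp) (app tops v u)

eval :: Tops -> Env -> Tm -> Val
eval tops env = \case
  Local i -> env !! i
  Top x   -> VTop x [] (tops IM.! x)              -- top-level definition: lazy
  Let t u -> eval tops (eval tops env t : env) u  -- local definition: lazy thunk
  Lam t   -> VLam env t
  App t u -> case eval tops env t of
    f@VTop{} -> app tops f (eval tops env u)      -- top-headed spine: lazy argument
    f        -> app tops f $! eval tops env u     -- otherwise: call-by-value
```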
Approximate conversion checking means deciding conversion without computing all beta-redexes. It's an important feature in pretty much every major TT implementation. For example, if I again have `oneMillion` as a definition, checking that `oneMillion` is convertible to itself should immediately return with success, without unfolding the numeral.

- An important property here is whether a system permits *approximate meta solutions*. For example, if I unify `f ?0 = f ?1` where `f` is a defined function, I might skip computing the `f` application, and pretend that `f` is injective, solving `?0` with `?1`. But if `f` is actually a constant function, this causes `?0` and `?1` to be unnecessarily equated. AFAIK Coq and Lean both permit approximate solutions, and Agda does not.
- Another property is how *optimistic* the approximation algorithm is. A very optimistic algorithm might do the following: if we have identical defined head symbols on both sides, first we try to unify spines, and if that fails we retry with unfolding. This algorithm expects that unifiable values are nearby, i.e. reachable after few reductions. The downside of unbounded optimism is that recursive backtracking can cause massive slowdown when unifiable values are not in fact near.
Smalltt:

- Does not allow approximate meta solutions.
- Has bounded approximation: it only performs limited speculation, and switches to full reductions on failure.

Concretely, smalltt has three states in unification: "rigid", "flex" and "full". See `unify` in src/Unification.hs for details.

- "Rigid": this is the starting state. In this state we can solve metas, and can initiate speculation. Whenever we have the same top-level head symbol on both sides, we try to unify the spines in "flex" mode; if that fails, we unfold and evaluate the sides, and unify them in "full" mode. We stay in "rigid" mode when we recurse under canonical type and term formers.
- "Flex": in this state we cannot solve metas, every situation which requires a meta solution fails. Moreover, we cannot unfold any top-level definition; if we have identical defined head symbols, we can recurse into spines, but any situation which requires unfolding also causes failure.
- "Full": in this state we can solve metas, and we always immediately unfold any defined symbol.
Example: we unify `cons oneMillion (cons oneMillion nil)` with itself. Assume that `cons` and `nil` are rigid term formers for lists. We start in rigid mode, which recurses under the `cons`-es, and tries to unify `oneMillion` with itself twice. Both cases succeed speculatively, because head symbols match and `oneMillion` is applied to zero arguments.

Example: we unify `const true true` with `const true false`, where `const` is a top-level definition. We start in rigid mode, and since we have `const` heads on both sides, we try to unify spines in flex mode. This fails, since `true /= false`. So we unfold the `const`-s, and unify sides in full mode.
In short, smalltt unification backtracks at most once on any path leading to a subterm ("sub-value" actually, since we recurse on values).
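As a toy model of this control flow (heavily simplified: the value type is stripped down, and meta solving plus pattern unification are stubbed out; see `unify` in src/Unification.hs for the real thing):

```haskell
-- Conversion states: Rigid can solve metas and speculate, Flex can neither
-- solve metas nor unfold top-level definitions, Full unfolds everything.
data CS = Rigid | Flex | Full

data Val
  = VRigid Int [Val]     -- canonical head (constructor, bound var) + spine
  | VTop Int [Val] Val   -- top-level head + spine + lazy unfolded value
  | VMeta Int [Val]      -- metavariable + spine

data UnifyEx = Mismatch | NeedsUnfolding | NeedsSolving deriving Show

unify :: CS -> Val -> Val -> Either UnifyEx ()
unify st t u = case (t, u) of

  (VRigid x sp, VRigid x' sp')
    | x == x'   -> unifySp st sp sp'
    | otherwise -> Left Mismatch

  -- Same defined head: speculatively unify the spines in Flex mode; if that
  -- fails, unfold both sides and retry in Full mode (one-shot speculation).
  (VTop x sp v, VTop x' sp' v')
    | x == x' -> case st of
        Rigid -> either (\_ -> unify Full v v') Right (unifySp Flex sp sp')
        Flex  -> unifySp Flex sp sp'
        Full  -> unify Full v v'

  -- A defined head that has to be unfolded: Flex mode is not allowed to.
  (VTop _ _ v, u') -> case st of
        Flex -> Left NeedsUnfolding
        _    -> unify st v u'
  (t', VTop _ _ v) -> case st of
        Flex -> Left NeedsUnfolding
        _    -> unify st t' v

  -- Metas can only be solved in Rigid and Full mode; the actual pattern
  -- unification (occurs & scope checking, solution quoting) is omitted here.
  (VMeta _ _, _) -> case st of Flex -> Left NeedsSolving; _ -> Right ()
  (_, VMeta _ _) -> case st of Flex -> Left NeedsSolving; _ -> Right ()

unifySp :: CS -> [Val] -> [Val] -> Either UnifyEx ()
unifySp st sp sp'
  | length sp == length sp' = mapM_ (uncurry (unify st)) (zip sp sp')
  | otherwise               = Left Mismatch
```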
We could have a large number of different speculative algorithms. A natural generalization to smalltt is to parametrize the "rigid" state with the number of shots we get at speculation (smalltt has just one shot). We start in "rigid N" state, and when a speculative (flex) spine unification fails, we continue in "rigid (N-1)", and "rigid 0" corresponds to the "full" state. I had this briefly but did not find much difference in the benchmarks compared to the one-shot speculation. Alternatively, we could parameterize the "flex" mode with a number of allowed unfoldings (currently unfolding is not allowed).

I haven't yet done benchmarking on larger, more realistic codebases. The point is that the current system is compatible with a large number of approximate conversion checking algorithms, so we could adapt it based on more real-world performance data. The main limitation is that we can only suspend top-level unfoldings, and local let-s and immediate local beta-redexes are always computed.
In *infer/check* and in *unification*, instead of using plain values, we use pairs of values, named `data G = G {g1 :: Val, g2 :: Val}` in the source. Hence, `unify` takes two `G`-s, and `infer` returns a `G` for the inferred type.

In `G`, the two values are always convertible, but the first value is always the least reduced available version, and the second one is potentially more reduced.

For example, if we do `check`-ing, the checking type can be headed by a top-level definition, so we have to compute it until we hit a rigid head symbol, to see whether it's a Pi type. This computation yields a new value which is more reduced than what we started with. But we don't want to throw away either of these values! The original version is usually smaller, hence better for printing and meta solutions; the forced version is more efficient to compute with, since we don't want to redo the same forcing later.
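A sketch of the idea, assuming a glued `Val` like in the earlier evaluation sketches (illustrative names; smalltt's real `G` is the record quoted above):

```haskell
data Tm  = Local Int | Top Int | Lam Tm | App Tm Tm
data Val = VLam [Val] Tm | VLocal Int [Val] | VTop Int [Val] Val

-- g1: the least reduced version (best for printing and meta solutions),
-- g2: a more reduced version (best for further computation).
data G = G { g1 :: Val, g2 :: Val }

-- Unfold top-level heads until the head is no longer a defined symbol,
-- while keeping the original value around.
gforce :: Val -> G
gforce v = G v (go v) where
  go (VTop _ _ unfolded) = go unfolded
  go w                   = w
```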
We prefer to get meta solutions which are as eta-short as possible. Eta-expansion increases code size and makes evaluation of code slower.

In the standard implementation of syntax-directed function eta-conversion checking, we do the following (a toy sketch follows the list):

- If we have lambdas on both sides, we recurse under binders.
- If we have a lambda only on one side, we recurse under that lambda, and apply the other side to a fresh variable.
- We only attempt solving metas after we've checked case 2. For example, if we have a lambda on one side, and a meta-headed value on the other side, first we perform eta-expansion according to step 2.
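A toy version of these rules for the pure lambda calculus (illustrative; metas, and the eta-short solution retry described next, are omitted):

```haskell
data Tm  = Local Int | Lam Tm | App Tm Tm
data Val = VLam Env Tm | VLocal Int [Val]
type Env = [Val]

eval :: Env -> Tm -> Val
eval env (Local i) = env !! i
eval env (Lam t)   = VLam env t
eval env (App t u) = app (eval env t) (eval env u)

app :: Val -> Val -> Val
app (VLam env t)  u = eval (u : env) t
app (VLocal x sp) u = VLocal x (u : sp)

-- `l` is the next fresh de Bruijn level.
conv :: Int -> Val -> Val -> Bool
conv l t u =
  let fresh = VLocal l [] in
  case (t, u) of
    -- lambdas on both sides: recurse under the binders
    (VLam _ _, VLam _ _) -> conv (l + 1) (app t fresh) (app u fresh)
    -- lambda on one side only: eta-expand the other side by applying it
    (VLam _ _, _)        -> conv (l + 1) (app t fresh) (app u fresh)
    (_, VLam _ _)        -> conv (l + 1) (app t fresh) (app u fresh)
    -- neutral values: compare heads and spines
    (VLocal x sp, VLocal x' sp') ->
      x == x' && length sp == length sp' && and (zipWith (conv l) sp sp')
```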
In smalltt, this is slightly modified to allow eta-short meta solutions. If we have a meta on one side, and a non-meta on the other side, we immediately attempt a solution. However, this can fail if the sides are eta-convertible. For example, trying to solve `?0` with `λ x. ?0 x` fails because `?0` occurs rigidly in the solution. So in case of solution failure, we just retry with full eta-expansion. Such failure seems to be very rare in practice, so we almost always get the eta-short solutions. This is the place where we retry after a failed eta-short solution.
Furthermore, we do additional eta-contraction in pattern unification. We try to contract meta spines, for example `?0 x y z = ?1 x y z` is contracted to `?0 = ?1`. This is also used in Coq. We have to be careful though not to change pattern conditions by contraction, e.g. not to remove non-linear bound vars by contraction.
Eta-short solutions are also important for preserving top-level unfoldings. For example, assume a function `f : Nat → Nat` defined as a lambda `λ x. t`, where `t` can be a large definition. If I unify `?0 = f`, the eta-long unification would solve `?0 := λ x. t x`, while the eta-short version can preserve the `f` unfolding, and solve simply as `?0 := f`.
Let's look now at the actual process of generating meta solutions. In basic pattern unification, we have problems like `?m x₁ x₂ ... xₙ = rhs`, where `?m` is a meta, the `xᵢ` are distinct bound variables, and `rhs` is a value. We aim to quote `rhs` to a solution term, and at the same time check occurs & scoping conditions on it.
- Scoping: the only bound vars `rhs` can depend on are the `xᵢ` vars in the spine.
- Occurs: `?m` cannot occur in `rhs` (we assume that `rhs` is not headed by `?m`).
If both conditions hold, then it is possible to quote `rhs` to some term `t` which depends only on the `xᵢ` bound variables, so that `λ x₁ x₂ ... xₙ. t` is a well-formed term. In the actual implementation we use a variable *renaming* structure to map De Bruijn levels in `rhs` to the correct De Bruijn indices in the output.
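A sketch of such a renaming structure, loosely following standard presentations of pattern unification (names and types here are illustrative):

```haskell
import qualified Data.IntMap as IM

-- Maps de Bruijn levels of the spine variables (in the context of `rhs`)
-- to de Bruijn levels of the binders of the solution lambda.
data PartialRenaming = PRen
  { dom :: Int            -- size of the solution's context (under λ x₁ ... xₙ)
  , cod :: Int            -- size of the context in which `rhs` lives
  , ren :: IM.IntMap Int  -- partial mapping from `cod` levels to `dom` levels
  }

-- Extend the renaming when quoting goes under a binder in `rhs`.
lift :: PartialRenaming -> PartialRenaming
lift (PRen d c r) = PRen (d + 1) (c + 1) (IM.insert c d r)

-- Rename a variable: a missing entry is a scope error (the variable is not
-- one of the spine variables); the result is converted to a de Bruijn index.
renameVar :: PartialRenaming -> Int -> Maybe Int
renameVar (PRen d _ r) x = do
  x' <- IM.lookup x r
  pure (d - x' - 1)
```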
The naive implementation beta-normalizes `rhs` while quoting, which we want to avoid. In smalltt the `rhs` is quoted without unfolding any top-level definitions or any previously solved meta. However, this is not entirely straightforward, because the `rhs` conditions should still be checked modulo full beta-reductions.
We have three different quotation modes, somewhat similar to what we have seen in unification. See `flexQuote`, `rigidQuote` and `fullCheck` in src/Unification.hs.

- "rigid": the starting mode. We stay in rigid mode when going under canonical type/term formers. Illegal var occurrences cause an error to be thrown. When we hit an unfolding, we recurse into the spine in flex mode; if that returns a possibly invalid term, we check the unfolding in full mode. If that succeeds, we learn that the term is actually valid, and return it.
- "flex": this mode returns a boolean flag alongside a term. A true flag means that the term is definitely valid, a false flag means that it is possibly invalid. Illegal var occurrences cause a special `Irrelevant` term to be returned along with a false flag.
- "full": this mode does not return any term, it just fully traverses the value and throws an error on any illegal var occurrence.
The overall result of this algorithm is that top definitions are *never* unfolded in any meta solution, but we check validity up to full beta-reduction. Consider the `?0 = const {Bool}{Bool} true y` example, where `y` is an illegal variable that `const` discards. This yields the `?0` solution `const {Bool}{Bool} true Irrelevant`. Note that the `Irrelevant` part disappears during evaluation.

In unification, `Irrelevant` immediately unifies with anything, since it signals that we are in an irrelevant evaluation context.
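A toy model of the three modes, where the value type is simplified and the occurs & scope conditions are reduced to a legality predicate on variables (see the actual `rigidQuote`, `flexQuote` and `fullCheck` in src/Unification.hs):

```haskell
{-# LANGUAGE LambdaCase #-}

data Val
  = VLocal Int [Val]      -- bound variable + spine
  | VTop Int [Val] Val    -- top head + spine + lazy unfolded value, which
                          -- already incorporates the spine (glued style)

data Tm
  = Local Int [Tm]
  | Top Int [Tm]
  | Irrelevant            -- placeholder for an illegal but irrelevant subterm
  deriving Show

type LegalVar = Int -> Bool  -- stands in for the occurs & scope conditions

rigidQuote :: LegalVar -> Val -> Either String Tm
rigidQuote ok = \case
  VLocal x sp
    | ok x      -> Local x <$> traverse (rigidQuote ok) sp
    | otherwise -> Left "illegal variable occurrence"
  v@(VTop _ _ unfolded) -> do
    (t, valid) <- flexQuote ok v
    if valid then pure t
             else fullCheck ok unfolded >> pure t  -- validate up to unfolding

-- Quote without unfolding; the flag is True if the result is surely valid.
flexQuote :: LegalVar -> Val -> Either String (Tm, Bool)
flexQuote ok = \case
  VLocal x sp
    | ok x      -> do (ts, bs) <- unzip <$> traverse (flexQuote ok) sp
                      pure (Local x ts, and bs)
    | otherwise -> pure (Irrelevant, False)        -- possibly invalid, don't fail
  VTop x sp _   -> do (ts, bs) <- unzip <$> traverse (flexQuote ok) sp
                      pure (Top x ts, and bs)

-- Fully unfold and only check the conditions, producing no term.
fullCheck :: LegalVar -> Val -> Either String ()
fullCheck ok = \case
  VLocal x sp
    | ok x      -> mapM_ (fullCheck ok) sp
    | otherwise -> Left "illegal variable occurrence"
  VTop _ _ v    -> fullCheck ok v
```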
It would be better in the previous example to solve `?0` with `true`. Smalltt does not bother with performing unfolding for code optimization, but it certainly could; the primary goal is to demonstrate the infrastructure where we have the freedom to unfold in different ways. Additional optimization passes can take advantage of the preserved top-level unfoldings.
Freezing metas means that at certain points during elaboration we mark unsolved metas as unsolvable. This may be used as a performance optimization and/or a way to enforce meta scoping. All major systems use at least some meta freezing. The absence of freezing would mean that metas are solvable across the whole program, across module hierarchies.

Smalltt freezes metas like Agda does: a top-level definition together with its optional type annotation constitutes the elaboration unit where fresh metas are solvable (but in Agda such blocks can be greatly enlarged through mutually recursive definitions).

This enforces a scoping invariant: metas can be grouped into mutual blocks before each top-level definition. Within a mutual block metas can refer to each other freely, but outside of the block they can only refer to in-scope top defs and metas in previous blocks.

An *active* meta is in the current meta block. It can be solved or unsolved, and an unsolved active meta may become solved.

A *frozen* meta is in a previous meta block. A frozen unsolved meta cannot be solved.

This yields a major optimization opportunity in meta occurs checking: an active unsolved meta can only occur in the solution of an active meta, but in no other top-level definition! We exploit this in rigid and flex solution quoting. There, we only look inside solutions of active metas, to do approximate occurs checking.
For example, assume we're checking for `?m` occurrences, and we hit `?n spine`, where `?n` is a solved active meta. It is not enough to check `spine`, we also need to look into the `?n` solution. We do this by simply recursively walking the *term* solution of `?n`, which may lead to looking into solutions of other active metas. Here we employ a very simple caching mechanism: we only visit each active solved meta at most once. So the amount of work done in approximate occurs checking is limited by the total size of all active meta solutions.
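A sketch of this approximate occurs check over solution terms, with a visited set so that each active solved meta is inspected at most once (illustrative; mtl's `State` is used just for brevity):

```haskell
{-# LANGUAGE LambdaCase #-}

import qualified Data.IntMap as IM
import qualified Data.IntSet as IS
import Control.Monad.State.Strict (State, evalState, get, put)

data Tm = Local Int | Top Int | Meta Int | Lam Tm | App Tm Tm

type ActiveSolutions = IM.IntMap Tm   -- solutions of active solved metas

-- Does meta `m` occur in `t`, possibly through active meta solutions?
occurs :: ActiveSolutions -> Int -> Tm -> Bool
occurs solutions m t0 = evalState (go t0) IS.empty where

  go :: Tm -> State IS.IntSet Bool
  go = \case
    Meta n
      | n == m    -> pure True
      | otherwise -> do
          visited <- get
          if IS.member n visited
            then pure False                  -- this solution was already checked
            else do
              put (IS.insert n visited)
              case IM.lookup n solutions of
                Just sol -> go sol           -- look inside the active solution
                Nothing  -> pure False       -- unsolved or frozen meta
    App t u -> (||) <$> go t <*> go u        -- (no short-circuiting, for brevity)
    Lam t   -> go t
    _       -> pure False                    -- Local, Top
```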
As a result, smalltt is able to very quickly check the classic nested pair example:

```
dup : {A} → A → Pair A A = ...

pairTest = let x0 = dup U; let x1 = dup x0; let x2 = dup x1; ... x20
```
At each `dup`, the normal form of the inferred `A` type doubles. In smalltt this benchmark is technically quadratic, since at each `dup` we search all previous active solved metas. But these meta solutions are all tiny, they are of the form `?n := Pair ?(n-1) ?(n-1)`. This takes exponential time in Agda, Coq and Lean, although in Lean the profiling shows that "elaboration" is not exponential, only "compilation" is. See the elaboration asymptotics benchmarks.

More sophisticated caching mechanisms are plausible and probably desirable. For better UX, it could make sense to combine smarter caching with more relaxed meta freezing behavior, like allowing metas to be active within a single module.
Smalltt includes some low-level GHC-specific optimizations. These almost certainly make the code less readable. I included them because:

- A very important goal was to have an implementation in *any* language which robustly beats all existing systems in elaboration speed. Hence I did not want to leave too much performance on the table. Having concise code is also a goal, but I care more about the complexity of the underlying algorithms, and less about the size of supporting code. The optimization boilerplate in smalltt can be easily ignored when we want to look at the core algorithms.
- I wanted to at least try some GHC-specific optimizations, to get an idea about their impact in the first place. Some of these optimizations turned out to have modest impact, but all of them help to some degree.
Setting RTS options is important and often overlooked. The performance gains from the right settings can easily be 30-50%. The default arena size in GHC (1MB, or 4MB starting from GHC 9.2) is very tiny compared to typical RAM sizes. In smalltt I set the default RTS options to be `-A64M -N8`. This means that the effective arena size is 8 × 64MB = 512MB, so smalltt allocates in 512MB chunks. Is this wasteful? RAM sizes below 8GB are getting increasingly rare; 512MB is 1/16th of that, and 1/32nd of 16GB. If we can trade RAM for performance, while still keeping the risk of running out of RAM very low, then we should do it. RAM exists to be used, not to just sit there.

One of the main reasons why smalltt is implemented in GHC Haskell is the RTS performance, which is overall great (the other reason is my prior experience in GHC optimization). I plan to update my old normalization benchmarks at some point; even there GHC performs well, but my newer unstructured benchmarking with newer GHC versions indicates yet more GHC advantage.
This is the dirtiest trick here. I use custom catching and throwing functions, to avoid the overhead of the standard fingerprint mechanism which ensures type safety. See `Exception#` in src/Exceptions.hs. The idea is that my own primitive `Exception#` type includes the standard `SomeException` constructor, but has another one for my own custom exceptions. As a result, I can catch and throw standard exceptions, but also custom exceptions which have zero fingerprint overhead. This relies on the ability to silently and unsafely cast the `SomeException` constructor between the standard `Exception` type and my `Exception#` type. It works because the memory representation is the same.

I had an old benchmark where custom exceptions had roughly 1/5 the overhead of standard exceptions. I haven't yet benchmarked both versions in smalltt; I really should.
src/SymTable.hs is a custom mutable hash table implementation, keyed by source position spans. The reason for writing this is that I have had performance problems with `hashtables`, the primary mutable hashtable package, where it was even outperformed by the immutable `Data.HashMap`. However, this was a few years ago, so I should benchmark my version against alternatives.

I'm also using a custom hash function on bytestrings, which is mostly based on the non-AES "fallback" hasher in `ahash`.

I intend to release in the future a generic variant of my hashtable (which is not specialized to source span keys) along with my hash functions.
I use the following:

- `primdata`: a low-level, minimum-overhead array & mutable reference library. It can be viewed as a replacement for `primitive` with a significantly different API.
- `dynamic-array`: a small dynamic array library, built on top of `primdata`.
- `flatparse`: a high-performance parser combinator library. It is ridiculously faster than all of the parsec libraries. The old smalltt versions before the rewrite used `megaparsec`, which was about 40 times slower. Parsing speed is now typically 2-3 million LOC/s on my system.
All files used in benchmarking are available in the bench directory. The following programs were used.

- `agda` 2.6.2, with options `-vprofile:7 +RTS -M10G`.
- `coq` 8.13.2, used as `time coqtop -l FILE -batch -type-in-type -time`, or dropping the last `-time` option when benchmarking elaboration of large files.
- `lean` 4.0.0 nightly 2021-11-20, commit babcd3563d28. Used as `time lean FILE --profile`.
- `smalltt`: in elaboration benchmarks I use timings for `:r` instead of `:l`; reloading tends to be 20-30% faster, presumably because the RTS is "warmed up" by previous heap allocations. I find this to be more representative of typical usage where we mostly reload files we're actively working on.
- `idris`: I use the version in this pull request which fixes a quadratic parsing bug. Command is `time idris2 -c FILE`.
System: Intel 1165G7 CPU, 16GB 3200 MT/s RAM, CPU set to run at 28 W power draw.
Abbreviations:

- SO: stack overflow.
- TL: "too long", I did not have enough patience.
- OOM: out of memory.

All time figures are in seconds.

Benchmark results are currently in rather crappy hand-pasted markdown tables; I plan to have nicer graphs here.

There are some benchmark entries marked as N/A. In these cases I haven't yet been able to reproduce the exact benchmarks in a given system. There are also some stack overflows that could be avoidable, but I could not appropriately set stack settings. Pull requests are welcome to fix these!
- stlc: Church-coded simply-typed lambda calculus plus some internal definitions. Fairly heavy in terms of sizes of types and the number of metavariables. Comes in 5k and 10k LOC versions, where everything is renamed and copied many times.
- stlcLessImpl: a lighter version, with no implicit arguments inside stlc algebras. This works in Coq, while the heavier version does not, because Coq does not support higher-order implicit arguments (implicit function types are not first-class).
- stlcSmall: an even lighter version which only includes variables, lambdas and applications in the object language. For fun, I also test files with one million lines of code. I don't include these files in the repo, because they're around 50MB each. They can be generated by running the `stlcSmall1M` functions from src/GenTestFiles.hs.
- The different systems do somewhat different kinds of work. Smalltt and coqtop only elaborate input, while Agda and Idris do module serialization, and Lean apparently does some compilation according to its profiling output. However, the numbers should still be indicative of how much we have to wait to have the entire input available for interactive use.
- Note: Agda 2.6.2 has a parsing blowup issue on large files: agda/agda#5670. So only `stlc`, `stlcLessImpl`, and `stlcSmall` are really indicative of elaboration performance.
| | smalltt | Agda | Coq | Lean | Idris 2 |
|---|---|---|---|---|---|
stlc | 0.014 | 0.573 | N/A | 0.194 | 1.18 |
stlc5k | 0.179 | 4.127 | N/A | 6.049 | 43.698 |
stlc10k | 0.306 | 16.160 | N/A | 12.982 | 129.635 |
stlcLessImpl | 0.008 | 0.358 | 0.145 | 0.166 | 0.907 |
stlcLessImpl5k | 0.140 | 4.169 | 0.905 | 4.508 | 20.386 |
stlcLessImpl10k | 0.275 | 17.426 | 1.685 | 8.861 | 52.026 |
stlcSmall | 0.003 | 0.106 | 0.128 | 0.073 | 0.542 |
stlcSmall5k | 0.037 | 4.445 | 0.762 | 2.649 | 6.397 |
stlcSmall10k | 0.072 | 22.8 | 1.388 | 5.244 | 13.496 |
stlcSmall1M | 8.725 | TL | 149 | 615 | OOM |
See the asymptotics files.

- idTest: `id id ... id`, 40 times iterated.
- pairTest: 30 times nested pairs, using local `let`.
- vecTest: length-indexed vector with 960 elements.

We separately consider elaboration time and total command time, because serialization often takes a quadratic hit on vecTest.
| | smalltt | Agda elab | Agda total | Coq elab | Coq total | Lean elab | Lean total | Idris elab | Idris total |
|---|---|---|---|---|---|---|---|---|---|
idTest | 0.000 | TL | TL | TL | TL | TL | TL | TL | TL |
pairTest | 0.000 | TL | TL | TL | TL | TL | TL | TL | TL |
vecTest | 0.078 | 1.128 | 4.098 | 0.769 | 0.935 | 0.244 | 3.65 | 4.465 | 6.277 |
- smalltt is also quadratic on vecTest! But it's fast enough to be a non-issue. The quadratic behavior comes from the occurs checking, which visits a linear number of active metas for each cons cell. Making it overall linear is possible, but it would require smarter meta occurs caching.
- Lean elaboration is actually linear on pairTest, but the task itself does not finish, because the "compilation" phase is exponential.
See the conv_eval files.

- NatConv: conversion checking Church Peano numerals of a given size.
- TreeConv: conversion checking complete Church binary trees of a given depth.
- TreeConvM: same, but the sides contain unsolved metas.
| | smalltt | Agda | Coq | Lean | Idris 2 |
|---|---|---|---|---|---|
NatConv1M | 0.045 | 1.8 | SO | 16.4 | 3.22 |
NatConv5M | 0.188 | 9.6 | SO | 34.6 | 29.65 |
NatConv10M | 0.712 | 19.7 | SO | 61.1 | 173.88 |
TreeConv15 | 0.055 | 0.016 | 0.005 | 0.001 | 0.105 |
TreeConv18 | 0.088 | 0.02 | 0.007 | 0.001 | 3.75 |
TreeConv19 | 0.161 | 0.03 | 0.009 | 0.001 | 4 |
TreeConv20 | 0.408 | 1.7 | 0.618 | 0.001 | 16.65 |
TreeConv21 | 0.834 | 3.4 | 1.161 | 0.001 | 5.8 |
TreeConv22 | 1.722 | 6.4 | 2.315 | 0.001 | 12.1 |
TreeConv23 | 3.325 | 13.7 | 4.699 | 0.001 | 25.38 |
TreeConvM15 | 0.010 | 0.770 | 0.003 | 0.001 | 0.1 |
TreeConvM18 | 0.092 | 6.35 | 0.003 | 0.001 | 2 |
TreeConvM19 | 0.169 | 12.8 | 0.004 | 0.001 | 2.67 |
TreeConvM20 | 0.361 | 26.6 | 0.605 | 0.001 | 8.9 |
TreeConvM21 | 0.835 | 50.8 | 1.273 | 0.001 | 10.83 |
TreeConvM22 | 1.694 | TL | 2.703 | 0.001 | 22.23 |
TreeConvM23 | 3.453 | TL | 5.472 | 0.001 | 45.9 |
- In Coq I haven't yet been able to find a way to increase stack sizes enough for the Nat benchmarks.
- Agda, Coq and Lean all have more aggressive approximate conversion checking than smalltt. Lean can shortcut *all* tree conversion tasks. For the TreeConvM tasks, this requires *approximate meta solutions*. It's apparent that Agda does not do such solutions.
- Agda performance degrades sharply when we throw metas in the mix.
See the conv_eval files again.

- ForceTree: fold over a binary tree with Boolean conjunction.
- NfTree: normalize a tree.
| | smalltt | Agda | Coq vm_compute | Coq compute | Coq lazy | Lean reduce | Lean eval | Idris 2 |
|---|---|---|---|---|---|---|---|---|
ForceTree15 | 0.011 | 0.070 | 0.002 | 0.022 | 0.053 | 0.213 | 0.022 | 0.3 |
ForceTree18 | 0.100 | 0.47 | 0.019 | 0.169 | 0.299 | 1.80 | 0.170 | 3.3 |
ForceTree19 | 0.240 | 0.92 | 0.041 | 0.299 | 0.725 | 3.5 | 0.345 | 7.5 |
ForceTree20 | 0.487 | 1.8 | 0.076 | 0.805 | 1.164 | 6.8 | 0.695 | 17.8 |
ForceTree21 | 1.070 | 3.58 | 0.151 | 1.23 | 2.2662 | 14.6 | 1.38 | 49.4 |
ForceTree22 | 2.122 | 7.37 | 0.299 | 2.492 | 4.55 | 29.4 | 2.75 | OOM |
ForceTree23 | 4.372 | 15.93 | 0.731 | 5.407 | 9.664 | 62.7 | 5.52 | OOM |
NfTree15 | 0.005 | N/A | 0.018 | 0.013 | 0.01 | N/A | N/A | N/A |
NfTree18 | 0.064 | N/A | 0.192 | 0.127 | 0.213 | N/A | N/A | N/A |
NfTree19 | 0.111 | N/A | 0.523 | 0.289 | 0.402 | N/A | N/A | N/A |
NfTree20 | 0.259 | N/A | 0.716 | 0.632 | 0.88 | N/A | N/A | N/A |
NfTree21 | 0.552 | N/A | 1.559 | 1.195 | 1.572 | N/A | N/A | N/A |
NfTree22 | 1.286 | N/A | 2.971 | 2.94 | 3.143 | N/A | N/A | N/A |
NfTree23 | 3.023 | N/A | 5.996 | 4.99 | 7.187 | N/A | N/A | N/A |
- Agda, Lean and Idris 2 NfTree entries are N/A because there is no way to only force the normal forms, without doing printing or conversion checking.
- Coq vm_compute is extremely strong in ForceTree, which is a fairly lightly allocating workload. I note that smalltt with glued evaluation disabled would be 2x faster here, but that would still be just a third of the Coq VM performance.
- On the other hand, smalltt is faster in normalization, a more allocation-heavy task. I attribute this to superior RTS performance.