Appendices¶
Format of .leo files¶
Here are the XML elements that may appear in Leo files:
<?xml>
Leo files start with the following line:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet>
This element is optional. For example:
<?xml-stylesheet ekr_stylesheet?>
<leo_file>
This element opens an element that contains the entire file.
</leo_file>
This element ends the file.
<leo_header>
This element specifies version information and other information that affects how Leo parses the file. For example:
<leo_header file_format="2"/>
The file_format attribute gives the 'major' format number. It is "2" for all newer versions of Leo.
<globals>, <preferences>, <find_panel_settings> and <clone_windows>
These elements are vestigial. Modern versions of Leo ignore them when reading outlines and write empty elements when writing outlines.
<vnodes>
A single <vnodes> element contains nested <v> elements. <v> elements correspond to vnodes. The nesting of <v> elements indicates outline structure in the obvious way.
<v>
This element represents a single vnode and has the following form:
<v ...><vh>sss</vh>(zero or more nested v elements)</v>
The <vh> element specifies the headline text. sss is the headline text, encoded with the usual XML escapes. As shown above, a <v> element may contain nested <v> elements. Zero or more of the following attributes may appear in <v> elements:
t=name.timestamp.n
a="xxx"
The t="Tnnn" attribute specifies the <t> element associated with a <v> element. The a="xxx" attribute specifies vnode attributes. xxx denotes one or more upper-case letters whose meanings are as follows:
C  The vnode is a clone. (Not used in 4.x)
E  The vnode is expanded so its children are visible.
M  The vnode is marked.
T  The vnode is the top visible node.
V  The vnode is the current vnode.
For example, a="EM" specifies that the vnode is expanded and marked.
<tnodes>
A single <tnodes> element contains a non-nested list of <t> elements.
<t>
This element represents the body text of the corresponding <v> element. It has this form:
<t tx="<gnx>">sss</t>
The tx attribute is required. The t attribute of <v> elements refers to this tx attribute. sss is the body text, encoded with the usual XML escapes.
New in 4.0: Plugins and scripts may add attributes to <v> and <t> elements. See Writing plugins for details.
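The element structure described above can be illustrated with a tiny hand-written outline. This is a hypothetical minimal example, not a file Leo itself wrote, parsed with Python's standard library:

```python
import xml.etree.ElementTree as ET

# A hypothetical minimal .leo file showing the elements described above.
LEO_FILE = '''<?xml version="1.0" encoding="UTF-8"?>
<leo_file>
<leo_header file_format="2"/>
<vnodes>
<v t="ekr.20240101000000.1"><vh>Headline</vh></v>
</vnodes>
<tnodes>
<t tx="ekr.20240101000000.1">Body text</t>
</tnodes>
</leo_file>
'''

# fromstring rejects str input that carries an encoding declaration,
# so parse the utf-8 bytes instead.
root = ET.fromstring(LEO_FILE.encode('utf-8'))
v = root.find('./vnodes/v')
t = root.find('./tnodes/t')
assert root.tag == 'leo_file'
assert v.find('vh').text == 'Headline'
assert v.get('t') == t.get('tx')  # the v element's t attribute names the t element's tx
```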
Format of external files¶
This section describes Leo's sentinel lines, the comment lines used in external files.
External files created by @file use gnx's in @+node sentinels. Such gnx's permanently and uniquely identify nodes. Gnx's have the form:

id.yyyymmddhhmmss
id.yyyymmddhhmmss.n

The second form is used if two gnx's would otherwise be identical.
id is a string unique to a developer, e.g., a git id. yyyymmddhhmmss is the node's creation date. n is an integer.
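As a sketch, the two gnx forms can be matched with a regular expression. This regex is illustrative, not Leo's actual code, and it assumes the id itself contains no '.' characters:

```python
import re

# Matches id.yyyymmddhhmmss with an optional .n suffix (illustrative only).
gnx_pat = re.compile(r'^(?P<id>[^.\s]+)\.(?P<timestamp>\d{14})(?:\.(?P<n>\d+))?$')

m = gnx_pat.match('ekr.20150402090121.1')   # second form: disambiguating .n
assert m and m.group('id') == 'ekr' and m.group('n') == '1'
assert gnx_pat.match('ekr.20150402090121')  # first form: no .n
assert not gnx_pat.match('no-timestamp')
```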
Closing sentinels are required for section references and the @all and @others directives, collectively known as embedding constructs. Proof: These constructs do not terminate the node in which they appear. Without a closing sentinel there would be no way to know where the construct ended and the following lines of the enclosing node began.
Here are the sentinels used by modern versions of Leo, in alphabetical order.
@<<
Sentinels of the form @<<section_name>> represent section references. If the reference does not end the line, the sentinel line ending the expansion is followed by the remainder of the reference line. This allows the Read code to recreate the reference line exactly.
@@
This sentinel represents any line starting with @ in body text except @<whitespace>, @doc and @others. Examples:

@@nocolor
@@pagewidth 80
@@tabwidth -4
@@c
@+all and @-all
These sentinels mark the range of @all directives.
@at and @doc
These sentinels mark the range of @doc parts.
@@delims
This sentinel marks the range of an @delims directive. The range continues until the next @delims directive.
Adding, deleting or changing @@delims sentinels will destroy Leo's ability to read the external file. Mistakes in using the @delims directives have no effect on Leo, though such mistakes will thoroughly mess up an external file as far as compilers, HTML renderers, etc. are concerned.
@@section-delims
This sentinel represents the @section-delims directive. It must appear in the root @<file> node.
@+leo
This sentinel marks the start of any external file.
This sentinel has the form:

<opening_delim>@+leo<closing_delim>

The read code uses single-line comments if <closing_delim> is empty. The write code generates single-line comments if possible.
The @+leo sentinel contains other information. For example:

<opening_delim>@+leo-ver=4-thin<closing_delim>
@-leo
This sentinel marks the end of the Leo file.
Nothing but whitespace should follow this directive.
@+node
This sentinel marks the start of a node:

@+node:gnx:<headline>

Nodes continue until the next @+node, @-all or @-others sentinel.
@+others and @-others
These sentinels mark the range of an @others directive.
@verbatim
This sentinel indicates that the next line of the external file is not a sentinel. This escape convention allows body text to contain lines that would otherwise be considered sentinel lines.
The Leonine way to refactor code¶
This paper explains how to use cff (clone-find-flattened) while refactoring code. I could not have completed the refactoring of Leo's atFile write code without continuous, extensive use of cff.
There are two key ideas:
The clones produced by cff are short-term or medium-term data, easily created and easily dispensed with.
Such clones are valuable, but not precious. They will eventually be discarded.
Unlike tags (or any other kind of non-Leonine data), the clones produced by cff can be reorganized.
This is the priceless, unique advantage of clones. You don’t understand clones if you don’t get this.
Example
While refactoring, it is essential to see all actual uses of a symbol (a method, ivar, whatever).
The starting point is to use cff to find all potential uses of the symbol. If multiple files or classes use the symbol, you can use the suboutline-only option to limit the matches created by cff.
After finding all potential uses of the symbol, you can reorganize the resulting clones as follows:
Delete nodes that are completely irrelevant.
Squirrel away likely-irrelevant nodes in a new organizer node.
Highlight the defining node, say by making it the preceding sibling of the cff node.
Leave all actual uses of the symbol where they are.
You have now focused your attention on the nodes that will likely change.
You can now rerun the search only on those cloned nodes to see all instances of the symbol that might be changed. This is a crucial double check on proposed changes.
Summary
I highly recommend that all Leonine programmers use the approach just described when refactoring code.
Neither tags, nor filters, nor refactoring packages can emulate the Leonine way of refactoring.
Theory of operation: c.deletePositionsInList¶
The positions passed to c.deletePositionsInList only specify the desired changes; the only way to make those changes is to operate on vnodes!
Consider this outline, containing no clones:
+ ROOT
  - A
  - B
The fundamental problem is this: if we delete node A, the index of node B in ROOT.children will change. This problem has (almost) nothing to do with clones or positions.
Proof that c.deletePositionsInList is correct
Clearly, deleting a position ultimately means changing vnodes, specifically, v.parents and v.children for v in some set of vnodes. We must show that the particular set of changed vnodes in the new code is the correct set of vnodes, and that the changes made to that set are correct. This is far from obvious at first glance.
The new code
Here is Vitalije’s version of c.deletePositionsInList, without the docstring:
def deletePositionsInList(self, aList):
    c = self
    # Ensure all positions are valid.
    aList = [p for p in aList if c.positionExists(p)]
    if not aList:
        return

    def p2link(p):
        parent_v = p.stack[-1][0] if p.stack else c.hiddenRootNode
        return p._childIndex, parent_v

    links_to_be_cut = sorted(set(map(p2link, aList)), key=lambda x: -x[0])
    # The main loop.
    for i, v in links_to_be_cut:
        child_v = v.children.pop(i)
        child_v.parents.remove(v)
To begin the proof, note that the main loop has the expected “form”. That is, when we delete a node, we will:
Delete child_v = v.children[i] for some v and i.
Delete child_v from its parents array.
v.parents is an unordered array, so child_v.parents.remove(v) is all that is needed. This line might crash if one of the positions is invalid. It will also crash if v is not present in child_v.parents. I’ll discuss this later.
To complete the proof, we must show that the main loop deletes all and only those vnodes implied by the positions in aList. We can tentatively assume that the main loop will work properly if the data in links_to_be_cut is correct, but there are some subtleties lurking here which I’ll discuss later.
The following code creates links_to_be_cut:
def p2link(p):
    parent_v = p.stack[-1][0] if p.stack else c.hiddenRootNode
    return p._childIndex, parent_v

links_to_be_cut = sorted(set(map(p2link, aList)), key=lambda x: -x[0])
This is elegant (compressed) code. Let’s examine it carefully…
p2link produces tuples (childIndex, parent_v). These become bound to i, v in the main loop:

# The main loop.
for i, v in links_to_be_cut:
    child_v = v.children.pop(i)
    child_v.parents.remove(v)
To make this work, parent_v must be the vnode whose i’th child should be deleted:

parent_v = p.stack[-1][0] if p.stack else c.hiddenRootNode

p.stack entries have the form (v, childIndex), so p.stack[-1][0] is the position p’s immediate parent vnode, if p has a parent. Otherwise, c.hiddenRootNode is the proper parent vnode.
p._childIndex is also the correct value for i.
We have just proved the following:
Deleting position p means executing the main loop for (i, v) returned by p2link(p).
Sorting and filtering
The following line filters and sorts the results produced by p2link:
links_to_be_cut = sorted(set(map(p2link, aList)), key=lambda x: -x[0])
Let’s break this into parts, from “inside” to “outside”:

map(p2link, aList) applies p2link to every position in aList. The result is a list of tuples (i, v). This list may contain duplicates.
set(map(p2link, aList)) removes those duplicates. Let theSet be set(map(p2link, aList)).
Finally, sorted(theSet, key=lambda x: -x[0]) creates a sorted list.
The sort delivers the tuples in descending order of i; in particular, when several tuples have the same vnode v, those tuples appear in descending order of i. The sort does not order tuples on v, and there is no need to do so.
It’s easy to understand the intent of this sort order. The main loop contains this line:

child_v = v.children.pop(i)

The sort order means that pop(i) happens before pop(j) if i > j. This condition ensures that the precomputed values of i won’t change in the main loop.
Completing the proof
The algorithm appears sound. The tuples (i, v) computed from p2link(p) correctly describe the work to be done. Filtering ensures that the work is done once. Sorting ensures that child indices i will not change during the main loop.
The tuples are sorted on i, but not v. aList can contain positions in any order, so we know only that for any particular vnode v the main loop will handle tuples (i1, v), (i2, v), ... in the correct order.
There are two final questions to consider:
First, does the order in which the main loop handles vnodes matter?
No. The vnode data are independent. The main loop works regardless ofwhether a vnode has already been unlinked.
Second, could the following lines ever fail?
child_v = v.children.pop(i)
child_v.parents.remove(v)

No. Sorting and filtering ensure that tuples are unique. v is guaranteed to be in child_v.parents because v.children[i] must exist.
This concludes the proof.
Discussion
The algorithm is sound because the tuples (i, v) contain no duplicates. For any particular v, the main loop will process the tuples in descending order of i. This resolves the question of whether deleting already deleted positions is valid.
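The argument can be checked with a self-contained sketch. The VNode class below is a toy stand-in for Leo's vnode class (only the children and parents links), and links_to_be_cut is built by hand rather than from positions:

```python
class VNode:
    # A toy stand-in for Leo's VNode: only the structural links matter here.
    def __init__(self, name):
        self.name = name
        self.children = []
        self.parents = []

def link(parent, child):
    parent.children.append(child)
    child.parents.append(parent)

def cut_links(links_to_be_cut):
    # The main loop of the algorithm: filter duplicates, then pop higher
    # child indices first so the lower precomputed indices stay valid.
    for i, v in sorted(set(links_to_be_cut), key=lambda x: -x[0]):
        child_v = v.children.pop(i)
        child_v.parents.remove(v)

root = VNode('ROOT')
a, b, c = VNode('A'), VNode('B'), VNode('C')
for child in (a, b, c):
    link(root, child)

# Delete A (index 0) and C (index 2); the duplicate tuple is filtered out.
cut_links([(0, root), (2, root), (2, root)])
assert [child.name for child in root.children] == ['B']
assert root not in a.parents and root not in c.parents
```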
Leo’s Colorizer: Theory of Operation¶
Are you sure you want to read this appendix? It is intendedonly for Leo’s core developers.
This appendix contains the theory of operation for the JEditColorizer (jedit) class in leoColorizer.py. Unless qualified, all methods are members of the jedit class.
Executive summary
Use --trace=coloring to see the colorizer in action.
The jedit class collaborates with the qsh, a singleton instance of the QSyntaxHighlighter class.
This collaboration allows the qsh to minimize the calls to jedit.redraw. The qsh calls redraw(s) only if lines could possibly need recoloring.
Leo’s core must never call redraw! These calls must happen automatically. Calling redraw from Leo’s core could cause hard-to-find bugs!
States allow the jedit class to handle constructs like strings or comments that continue from one line to another.
Mode files, python files in the leo/modes directory, tell the jedit class how to colorize specific languages.
Pattern matchers do the actual syntax coloring.
Components of the jedit class¶
The JEditColorizer class consists of drivers, pattern matchers and various helper utilities.
Drivers
The redraw and mainLoop methods are the interface between the qsh and the methods of the jedit class.
qsh.rehighlight calls redraw(s) to colorize a single line s. After initializing the initial state, redraw calls mainLoop.
Leo’s core must never call redraw! Doing so could cause hard-to-find bugs.
redraw(s) calls mainLoop(s) to do the actual syntax coloring.
mainLoop(s) calls rule functions (defined in mode files).
Rule functions delegate their work to exactly one pattern matcher (described next). In other words, pattern matchers are callbacks.
Pattern matchers
Pattern matchers form the bulk of the jedit class. Each pattern matcher does the following:

Examines the string s[i:], where s is the argument to mainLoop(s).
Updates the ending state for successful matches.
Returns an integer return code:

-1: complete failure: mainLoop returns
 0: partial failure: mainLoop tries to match other rules
 n > 0: success: mainLoop increments i by n and tries to match other rules
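The protocol just described can be sketched with a toy rule function. This is hypothetical code, far simpler than the real mainLoop, showing only how return codes drive the scan position:

```python
def match_keyword(s, i):
    # A toy pattern matcher. Returns n > 0 on success (the length of the
    # match) and 0 on partial failure, so other rules could be tried.
    # Real matchers can also return -1 for complete failure.
    for kw in ('def', 'class'):
        if s.startswith(kw, i):
            return len(kw)
    return 0

def main_loop(s):
    # A toy mainLoop: try the rule at each position, advance i by the
    # return code on success, otherwise move ahead one character.
    matched, i = [], 0
    while i < len(s):
        n = match_keyword(s, i)
        if n > 0:
            matched.append(s[i:i + n])
            i += n
        else:
            i += 1
    return matched

assert main_loop('def spam(): class A') == ['def', 'class']
```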
The big challenge¶
The qsh can (and will) call recolor(s) where s is out-of-sequence with the preceding call. For example, the following snippet is typical in @language jupytext nodes:

# %% (like @language python)
def spam():

# %% [markdown] (like @language md)
# Section 1
It was the best of times; it was the worst of times.

The user can click (or cut or paste) anywhere in this text. The qsh will then call recolor(s) starting with the first changed line.
There is no necessary relationship between s and the line that recolor last saw!
About states and restarters¶
States allow the colorizer to handle out-of-order calls to recolor.
The colorizer collaborates with the qsh in only one way, namely by calling self.highlighter.setCurrentBlockState(n), where n is an integer state number.
Internally, the colorizer uses string states. The colorizer’s state utilities handle the conversion between integer and string representations.
States represent, for a particular line:

Which @language directive is in effect.
Which coloring directive (@killcolor, @color, @nocolor, etc.) is in effect.
Whether the coloring for the line continues to the next line. For example, Python docstrings may span multiple lines.

A fine point: The state utilities ensure that all integer state numbers are unique and refer to the proper language in effect. For example, if a node contains multiple @language directives, the default states of the two languages must have different state numbers!
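The idea behind the state utilities can be sketched as an interning table. This is hypothetical code, not the colorizer's actual implementation: each distinct string state gets a unique integer, which is what setCurrentBlockState can store.

```python
class StateTable:
    # A hypothetical sketch of the colorizer's state utilities: intern each
    # distinct string state as a unique integer state number.
    def __init__(self):
        self.string_to_int = {}
        self.int_to_string = {}

    def state_number(self, state_s):
        # Assign each distinct string state a unique integer, exactly once.
        if state_s not in self.string_to_int:
            n = len(self.string_to_int)
            self.string_to_int[state_s] = n
            self.int_to_string[n] = state_s
        return self.string_to_int[state_s]

table = StateTable()
n_python = table.state_number('python; default')
n_md = table.state_number('md; default')
# Default states of two languages get different state numbers...
assert n_python != n_md
# ...and the mapping is stable and reversible.
assert table.state_number('python; default') == n_python
assert table.int_to_string[n_md] == 'md; default'
```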
The colorizer keeps track of states as follows:
recolor first computes the line’s initial state from the ending state of the previous line. By default, the line’s ending state is its initial state.
Pattern matchers usually leave the ending state unchanged.
If a pattern matcher matches an @language or coloring directive, the matcher creates (indirectly, via the state utilities) a new state representing the directive. This is the easy case.
If a pattern matcher matches syntax that continues to the next line, the pattern matcher creates a restarter matcher by calling setRestart. The restarter matchers are essentially closures, bindings of the pattern matcher and all of its kwargs. The details are hairy, but the coding pattern is the same for all cases. See the source code for details.
Mode files¶
The jEdit editor drives its syntax colorer using .xml files. Leo uses .py files instead.
Long ago, Leo’s jEdit2Py script created the python mode files in the leo/modes directory.
Roughly speaking, there is one mode file for every language that Leo can colorize. Over the years, Leo’s devs have modified several of these mode files by hand.
init sets jedit.rulesDict from the mode file for the language in effect. Keys are lead-in characters; values are lists of rule functions (rules for short). Each mode file defines its own rulesDict and rules.
mainLoop(s) uses the rulesDict to retrieve a list of zero or more rules to be used when scanning s. It’s that simple.
Legacy mode files could collaborate using the delegate keyword, but that scheme was (and is) clumsy.
Leo 6.8.3 added leo/modes/jupytext.py, a new mode file for @language jupytext. This mode file introduced a new style of collaboration. The mode files for the jupytext, python and markdown languages collaborate directly using modifications of existing rules in python.py and md.py.
leoAst.py¶
The classes in leoAst.py unify python’s token-based and ast-based worlds by creating two-way links between tokens in the token list and ast nodes in the parse tree.
Are you sure you want to use leoAst?
asttokens (https://asttokens.readthedocs.io/en/latest/user-guide.html) will usually be a better choice:
Support for the leoAst module ends with Python 3.14.
leoAst.py is part of Leo, and can be used completely independently of Leo.
You must use Leo to see the intended outline structure of the code. Without Leo, you will see special sentinel comments that create Leo’s outline structure. These comments have the form:

#@<comment-kind>:<user-id>.<timestamp>.<number>: <outline-level> <headline>

If you have any trouble installing or using this code, please ask for help on Leo’s forum.
Running leoAst.py¶
Running leoAst.py from the command line¶
leoAst.py is designed to be run from the command line:

usage: leoAst.py --help
       leoAst.py [--fstringify | --fstringify-diff | --orange | --orange-diff] PATHS
       leoAst.py --py-cov [ARGS]
       leoAst.py --pytest [ARGS]
       leoAst.py --unittest [ARGS]

examples:
       --py-cov "-f TestOrange"
       --pytest "-f TestOrange"
       --unittest TestOrange

positional arguments:
  PATHS              directory or list of files

optional arguments:
  -h, --help         show this help message and exit
  --fstringify       leonine fstringify
  --fstringify-diff  show fstringify diff
  --orange           leonine Black
  --orange-diff      show orange diff
  --py-cov           run pytest --cov on leoAst.py
  --pytest           run pytest on leoAst.py
  --unittest         run unittest on leoAst.py
Running the code programmatically¶
To access the code, do one of the following:

import leoAst
import leo.core.leoAst as leoAst
You can then run the fstringify commands as follows:
changed = leoAst.Fstringify().fstringify_file(filename)
changed = leoAst.Fstringify().fstringify_diff_files(filename)
Running unit tests and coverage tests programmatically¶
The following runs all unit tests for leoAst.py:
python -m leo.core.leoAst
The following runs coverage tests:
pytest -x --cov-report html --cov-report term-missing --cov=leo.core.leoAst leo/core/leoAst.py
TokenOrder classes: Theory of operation¶
This is the Theory of Operation for the TokenOrderGenerator (TOG) class and related classes.
Token-order classes¶
The TokenOrderGenerator (TOG) class injects two-way links between all tokens and the corresponding ast nodes. The TOG class also injects parent/child links into all ast nodes.
The TOG class defines visitors that visit ast nodes in token order, the traversal order that corresponds to the order of the tokens produced by python’s tokenize module. The only way to ensure this correspondence is to use separate visitors for all ast nodes. All visitors are straightforward generators.
TOG visitors eventually call TOG.sync_token, which checks that tokens are, in fact, visited in the correct order. TOG.sync_token is an ever-present unit test.
The TokenOrderTraversal (TOT) class uses the parent/child links created by the TOG class. TOT.traverse contains a single for-loop that visits all nodes of the parse tree in token order. This loop is extremely fast. Using the TOT class, client code can easily modify the token list or parse tree as desired.
Other classes¶
The Token class represents one token, created by tokenize.tokenize.
The Fstringify class is a re-implementation of the external fstringify project using the TOG class.
The Orange class is a re-implementation of the black project.
The name “Orange” is a play on words: “Orange is the new black”.
The AstDumper class provides an extensive set of tools for examining token lists, parse trees, and the links between them.
The BaseTest class provides common infrastructure for all other test classes. Important: BaseTest.make_data is, all by itself, a very strong unit test.
Significant vs insignificant tokens¶
The distinction between significant and insignificant tokens is crucial. Visitors call TOG.gen_token, TOG.gen_op, etc. only for significant tokens. The is_significant and is_significant_token functions define which tokens are significant.
Visitors can’t know, just by looking at the parse tree, whether the input contains insignificant tokens. For example, the source tokens might contain non-essential parentheses or optional trailing commas, and two statements might or might not be separated by a semicolon.
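The point is easy to demonstrate with the standard library: redundant parentheses change the token list but not the parse tree.

```python
import ast
import io
import tokenize

def token_strings(src):
    # All (type, string) pairs produced by python's tokenizer.
    readline = io.StringIO(src).readline
    return [(tok.type, tok.string) for tok in tokenize.generate_tokens(readline)]

a = 'x = (1 + 2)\n'
b = 'x = 1 + 2\n'
assert ast.dump(ast.parse(a)) == ast.dump(ast.parse(b))  # identical parse trees
assert token_strings(a) != token_strings(b)              # different token lists
```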
Helping TOG visitors¶
The TOG.do_If visitor calls TOG.find_next_significant_token to determine whether TOG.do_If should generate an “if” or an “elif” token. This help is essential, because the following two source files generate identical parse trees!

# File 1:
if 1:
    pass
else:
    if 2:
        pass

# File 2:
if 1:
    pass
elif 2:
    pass
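This claim can be verified with the standard library. ast.dump omits line and column attributes by default, so the dumps compare equal:

```python
import ast

SRC_ELSE_IF = (
    'if 1:\n'
    '    pass\n'
    'else:\n'
    '    if 2:\n'
    '        pass\n'
)
SRC_ELIF = (
    'if 1:\n'
    '    pass\n'
    'elif 2:\n'
    '    pass\n'
)
# "elif" is sugar for a nested If in the orelse list, so both sources
# produce identical parse trees.
assert ast.dump(ast.parse(SRC_ELSE_IF)) == ast.dump(ast.parse(SRC_ELIF))
```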
Similarly, the TOG.do_Str and TOG.do_JoinedStr visitors call TOG.get_concatenated_string_tokens to handle one or more concatenated string tokens.
Finally, TOG.do_slice calls TOG.find_next_significant_token to determine whether a slice without a step contains an optional colon.
TOG.find_next_significant_token and TOG.get_concatenated_string_tokens are crucial inventions. The TOG class would not be possible without them.
Syncing tokens¶
TOG.px is an index into the token list. It is either -1, or it points at the previous significant token. Note: TOG.find_next_significant_token and TOG.get_concatenated_string_tokens use TOG.px, but never change TOG.px.
TOG.sync_token(self, kind, val) associates tokens with ast nodes as follows:

If (kind, val) denote an insignificant token, TOG.sync_token does nothing.
Otherwise, (kind, val) denotes a significant token. TOG.sync_token associates that token, plus all previous insignificant tokens, with self.node, the ast node presently being visited.
In addition, if (kind, val) denotes a significant token, TOG.sync_token checks that the next significant token in the token list has the expected kind and value. This is done as follows:

TOG.sync_token advances TOG.px to point at the next significant token, call it T.
TOG.sync_token raises AssignLinksError if (T.kind, T.value) != (kind, val).
To summarize token syncing:
The TOG.px index tracks the last-seen significant token.
TOG.px advances monotonically through the token list.
TOG.find_next_significant_token and TOG.get_concatenated_string_tokens scan forward through the token list using a private copy of TOG.px. These methods never change TOG.px itself.
This token-syncing machinery is the simplest thing that could possibly work. It is also the fastest thing that could possibly work.
Figures of merit¶
Simplicity:
The distinction between significant and insignificant tokens makes token-order traversals possible. This distinction drastically simplifies TOG visitors. They never have to generate insignificant tokens!
TOG.find_next_significant_token and TOG.get_concatenated_string_tokens use TOG.px to look ahead in the token list.
It took a long time to realize that the parse tree needs help from the token list, not the other way around!
Speed: The TOG creates links between tokens and ast nodes in roughly the time taken by python’s tokenize.tokenize and ast.parse library methods. The TOT class traverses trees annotated with parent/child links even more quickly.
The TOG class avoids both ast.fix_missing_locations and ast.get_source_segment, which are too slow to be useful.
Memory: The TOG class makes no significant demand on python’s resources:
Generators add nothing to python’s call stack.
The only variable-length data created by the TOG is TOG.node_stack. This stack resides in python’s heap, so its length is unimportant. In the worst case, it might contain a few thousand entries.
The TOT uses no variable-length data whatever.
Maintaining leoAst.py¶
New ast nodes are sometimes required to support new language features, especially language features that require new syntax or keywords. Python has added new nodes fairly often in the past. New nodes may be added in the future. When that happens, the following changes will be needed to leoAst.py:
Add a visitor for the new node.
Add one or more unit tests that fully cover the new visitor.
The test_visitors_exist unit test checks that visitors exist for all ast nodes defined by a particular version of python.
See Leo issue #1440 for notes relating to the code.
Unicode reference¶
Leo uses unicode internally for all strings.
Leo converts headline and body text to unicode when reading .leo files and external files. Both .leo files and external files may specify their encoding. The default is utf-8. If the encoding used in an external file is not “utf-8”, it is represented in the @+leo sentinel line. For example:

#@+leo-encoding=iso-8859-1.

The utf-8 encoding is a “lossless” encoding (it can represent all unicode code points), so converting to and from utf-8 plain strings will never cause a problem. When reading or writing a character that is not representable in a “lossy” encoding, Leo converts such characters to '?' and issues a warning.
When writing .leo files and external files Leo uses the same encoding used to read the file, again with utf-8 used as a default.
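Python's built-in codec machinery illustrates the lossless/lossy distinction (this is not Leo's actual code):

```python
s = 'résumé naïve'

# utf-8 is lossless: every unicode code point round-trips.
assert s.encode('utf-8').decode('utf-8') == s

# ascii is lossy for this text: unencodable characters become '?'.
assert s.encode('ascii', errors='replace') == b'r?sum? na?ve'
```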
leoSettings.leo contains the following Unicode settings, with the defaults as shown:

default_derived_file_encoding = UTF-8
new_leo_file_encoding = UTF-8

These control the default encodings used when writing external files and .leo files. Changing the new_leo_file_encoding setting is not recommended; see the comments in leoSettings.leo. You may set default_derived_file_encoding to anything that makes sense for you.
The @encoding directive specifies the encoding used in an external file. You can’t mix encodings in a single external file.
Valid URL’s¶
Leo checks that the URL is valid before attempting to open it. A valid URL is:
3 or more lowercase alphas
followed by one :
followed by one or more of:
$%&'()*+,-./0-9:=?@A-Z_a-z{}~
followed by one of:
$%&'()*+/0-9:=?@A-Z_a-z}~
That is, a comma, hyphen and open curly brace may not be the last character.
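The rule above can be transcribed directly into a regular expression. This regex is illustrative; Leo's actual check may differ in detail:

```python
import re

# A direct transcription of the rule above (illustrative, not Leo's code):
# 3+ lowercase alphas, one ':', zero or more body characters, and a final
# character drawn from the smaller set (no comma, hyphen or '{' at the end).
url_re = re.compile(
    r"^[a-z]{3,}:"
    r"[$%&'()*+,\-./0-9:=?@A-Z_a-z{}~]*"
    r"[$%&'()*+/0-9:=?@A-Z_a-z}~]$"
)

assert url_re.match('http://example.com/a%20b')
assert url_re.match('mailto:user@example.com')
assert not url_re.match('http://example.com/bad,')  # comma may not be last
assert not url_re.match('Http://example.com/')      # scheme must be lowercase
```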
URL’s in Leo should contain no spaces: use %20 to indicate spaces.
You may use any type of URL that your browser supports: http, mailto, ftp, file, etc.
The Mulder/Ream update algorithm¶
This appendix documents the Mulder/Ream update algorithm in detail, with an informal proof of its correctness.
Prior to Leo 5.1, Leo used Bernhard Mulder’s original algorithm to read @shadow files. Starting with Leo 5.1, Leo uses this algorithm to read both @clean and @shadow files. Conceptually, both algorithms work as described in the next section.
In February 2015 EKR realized that the @shadow algorithm could be used to update @clean (@nosent) files. Simplifying the algorithm instantly became a top priority. The new code emerged several days later, made possible by the x.sentinels array. It is an important milestone in Leo’s history.
What the algorithm does¶
For simplicity, this discussion will assume that we are updating an external file, x, created with @clean x. The update algorithm works exactly the same way with @shadow trees.
The algorithm works with any kind of text file. The algorithm uses only difflib. It knows nothing about the text or its meaning. No parsing is ever done.
Suppose file x has been changed outside of Leo. When Leo reads x it does the following:

Recreates the old version of x without sentinels by writing the @clean x outline into a string, as if it were writing the @clean x outline again.
Recreates all the lines of x with sentinels by writing the @clean x outline into a string, as if it were writing an @file node! Let’s call these lines the old sentinels lines.
Uses difflib.SequenceMatcher to create a set of diffs between the old and new versions of x without sentinels.
Terminology: the diffs tell how to change file a into file b. The actual code uses this terminology: a is the set of lines in the old version of x, b is the set of lines in the new version of x.
Creates a set of lines, the new sentinels lines, using the old sentinels lines, the a and b lines, and the diffs.
This is the magic. Bernhard Mulder’s genius was conceiving that a three-way merge of lines could produce the new outline, with sentinels. The code is in x.propagate_changed_lines and its helpers.
Replaces the @clean tree with the new tree created by reading the new sentinels lines with the @file read logic.
Important: The update algorithm never changes sentinels. It never inserts or deletes nodes. The user is responsible for creating nodes to hold new lines, or for deleting nodes that become empty as the result of deleting lines.
Guesses don’t matter¶
There are several boundary cases that the update algorithm can not resolve. For example, if a line is inserted between nodes, the algorithm can not determine whether the line should be inserted at the end of one node or the start of the next node. Let us call such lines ambiguous lines.
The algorithm guesses that ambiguous lines belong at the end of a node rather than at the start of the next node. This is usually what is wanted: we usually insert lines at the end of a node.
Happily, guesses don’t matter, for the following reasons:

The external file that results from writing the @clean x tree will be the same as the updated external file no matter where ambiguous lines are placed. In other words, the update algorithm is sound.
Leo reports nodes that were changed when reading any external file. The user can review changes to @clean and @file trees in the same way.
The user can permanently correct any mistaken guess. Guesses only happen for newly inserted or changed lines. Moving an ambiguous line to the following node will not change the external file. As a result, the next time Leo reads the file the line will be placed in the correct node!

This proves that @shadow and @clean are easy and safe to use. The remaining sections of this document discuss code-level details.
Background of the code¶
The algorithm depends on three simple, guaranteed properties of SequenceMatcher.opcodes. See https://docs.python.org/2/library/difflib.html#sequencematcher-examples
Fact 1: The opcodes tell how to turn x.a (a list of lines) into x.b (another list of lines).
The code uses the a and b terminology. It’s concise and easy to remember.
Fact 2: The opcode indices ai, aj, bi, bj never change because neither x.a nor x.b changes.
Plain lines of the result can be built up by copying lines from x.b to x.results:

'replace'   x.results.extend(x.b[b1:b2])
'delete'    do nothing (b1 == b2)
'insert'    x.results.extend(x.b[b1:b2])
'equal'     x.results.extend(x.b[b1:b2])
Fact 3: The opcodes cover both x.a and x.b, in order, without any gaps.
This is an explicit requirement of sm.get_opcodes:
The first tuple has ai == bi == 0.
Remaining tuples have ai == (aj from the preceding tuple) and bi == (bj from the preceding tuple).
Keep in mind this crucial picture:
The slices x.a[ai:aj] cover the x.a array, in order without gaps.
The slices x.b[bi:bj] cover the x.b array, in order without gaps.
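These facts can be checked directly against the standard library. The file contents below are invented for illustration; a plays the role of the old version of x, b the new version:

```python
import difflib

# a is the old version of x, b the new version (the terminology above).
a = ['alpha\n', 'beta\n', 'gamma\n']
b = ['alpha\n', 'BETA\n', 'gamma\n', 'delta\n']
sm = difflib.SequenceMatcher(None, a, b)
opcodes = sm.get_opcodes()

# Fact 3: the opcodes cover both a and b, in order, without gaps.
ai = bi = 0
for tag, i1, i2, j1, j2 in opcodes:
    assert (i1, j1) == (ai, bi)
    ai, bi = i2, j2
assert (ai, bi) == (len(a), len(b))

# Plain lines of the result come from b alone ('delete' contributes nothing).
results = []
for tag, i1, i2, j1, j2 in opcodes:
    if tag != 'delete':
        results.extend(b[j1:j2])
assert results == b
```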
Aha: the x.sentinels array¶
Mulder’s original algorithm was hard to understand or to change. The culprit was the x.mapping array, which mapped indices into arrays of lines with sentinels to indices into arrays of lines without sentinels.
The new algorithm replaces the x.mapping array with the x.sentinels array. As a result, diff indices never need to be adjusted and handling diff opcodes is easy.
For any index i, x.sentinels[i] is the (possibly empty) list of sentinel lines that precede line a[i]. Computing x.sentinels from old_private_lines is easy. Crucially, x.a and x.sentinels are parallel arrays. That is, len(x.a) == len(x.sentinels), so indices into x.a are also indices into x.sentinels.
Strategy & proof of correctness¶
Given the x.sentinels array, the strategy for creating the results is simple. Given indices ai, aj, bi, bj from an opcode, the algorithm:
Writes sentinels from x.sentinels[i], for i in range(ai, aj).
Writes plain lines from x.b[i], for i in range(bi, bj).
This “just works” because the indices cover both a and b.
The algorithm writes sentinels exactly once (in order) because each sentinel appears in x.sentinels[i] for some i in range(len(x.a)).
The algorithm writes plain lines exactly once (in order) because each plain line appears in x.b[i] for some i in range(len(x.b)).
This completes an informal proof of the correctness of the algorithm.
The leading and trailing sentinel lines are easy special cases. This code, appearing before the main loop, ensures that leading lines are written first, and only once:
x.put_sentinels(0)
x.sentinels[0] = []
Similarly, this line, at the end of the main loop, writes the trailing sentinels:
x.results.extend(x.trailing_sentinels)
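Putting the strategy and the special cases together, the whole write step can be sketched as below. The sentinel text and the hand-built parallel arrays are illustrative assumptions, not Leo's actual code or data:

```python
import difflib

# Illustrative parallel arrays: sentinels[i] precedes a[i].
a = ['old line\n']                           # old plain lines
sentinels = [['#@+leo\n', '#@+node: n1\n']]  # sentinels preceding a[0]
trailing_sentinels = ['#@-leo\n']
b = ['new line\n', 'added line\n']           # updated plain lines

results = []
results.extend(sentinels[0])  # Write the leading sentinels first, once.
sentinels[0] = []             # So the main loop will not write them again.

for tag, ai, aj, bi, bj in difflib.SequenceMatcher(None, a, b).get_opcodes():
    for i in range(ai, aj):
        results.extend(sentinels[i])  # Sentinels that preceded a[i].
    results.extend(b[bi:bj])          # The updated plain lines.

results.extend(trailing_sentinels)    # Write the trailing sentinels last.

assert results == ['#@+leo\n', '#@+node: n1\n',
                   'new line\n', 'added line\n', '#@-leo\n']
```

The coverage facts do all the work: every sentinel list and every plain line is written exactly once, in order.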
Summary¶
The algorithm creates an updated set of lines with sentinels using the @clean outline and the updated external file. These new lines then replace the original @clean tree with a new @clean tree. The algorithm uses only difflib. It will work with any kind of text file. No knowledge of any language is needed.
The algorithm depends on simple, guaranteed properties of indices in SequenceMatcher opcodes.
The algorithm steps through x.sentinels and x.b, extending x.results as it goes.
The algorithm gets all needed data directly from opcode indices into x.sentinels and x.b. Using opcode indices requires neither reader classes nor auxiliary indices.
The algorithm is simple enough to be understood at first reading. I’ll remember its details for the rest of my life.
Why I like Python¶
I wrote this soon after discovering Python in 2001. The conclusions are still valid today.
I’ve known for a while that Python was interesting; I attended a Python conference last year and added Python support to Leo. But last week I realized that Python is something truly remarkable. I wanted to convert Leo from wxWindows to wxPython, so I began work on c2py, a Python script that would help convert from C++ syntax to Python. While doing so, I had an Aha experience. Python is more than an incremental improvement over Smalltalk or C++ or Objective-C; it is “something completely different”. The rest of this post tries to explain this difference.
Clarity¶
What struck me first as I converted C++ code to Python is how much less blah, blah, blah there is in Python. No braces, no stupid semicolons and most importantly, no declarations. No more pointless distinctions between const, char *, char const *, char * and wxString. No more wondering whether a variable should be signed, unsigned, short or long.
Declarations add clutter, declarations are never obviously right and declarations don’t prevent memory allocation tragedies. Declarations also hinder prototyping. In C++, if I change the type of something I must change all related declarations; this can be a huge and dangerous task. With Python, I can change the type of an object without changing the code at all! It’s no accident that Leo’s new log pane was created first in Python.
Functions returning tuples are a “minor” feature with a huge impact on code clarity. No more passing pointers to data, no more defining (and allocating and deallocating) temporary structs to hold multiple values.
Python can’t check declarations because there aren’t any. However, there is a really nifty tool called pylint that does many of the checks typically done by compilers.
Power¶
Python is much more powerful than C++, not because Python has more features, but because Python needs fewer features. Some examples:
Python does everything that the C++ Standard Template Library (STL) does, without any of the blah, blah, blah needed by STL. No fuss, no muss, no code bloat.
Python’s slicing mechanism is very powerful and applies to any sequence (string, list or tuple). Python’s string library does more with far fewer functions because slices replace many functions typically found in other string libraries.
Writing d = {} creates a dictionary (hash table). Hash tables can contain anything, including lists and other hash tables.
Python’s special functions, __init__, __del__, __repr__, __cmp__, etc. are an elegant way to handle any special need that might arise.
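A few lines illustrating the slicing, dictionary and special-method points above (the Point class is just an example):

```python
s = 'hello world'
assert s[6:] == 'world'          # Slices work on any sequence.
assert s[::-1] == 'dlrow olleh'  # Including with a negative stride.

# A dict can hold anything, including lists and other dicts.
d = {'a': [1, 2], 'b': {'nested': True}}

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __repr__(self):  # Customizes how the object prints.
        return f'Point({self.x}, {self.y})'

assert repr(Point(1, 2)) == 'Point(1, 2)'
```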
Safety¶
Before using Python I never fully realized how difficult and dangerous memory allocation is in C++. Try doing:
aList[i:j] = list(aString)
in C. You will write about 20 lines of C code. Any error in this code will create a memory allocation crash or leak.
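For concreteness, here is that one-liner at work; Python resizes the list and manages all the memory itself:

```python
aList = [1, 2, 3, 4, 5]
aString = 'ab'
i, j = 1, 4

# Replace three elements with two; the list shrinks automatically.
aList[i:j] = list(aString)
assert aList == [1, 'a', 'b', 5]
```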
Python is fundamentally safe. C++ is fundamentally unsafe. When I am using Python I am free from worry and anxiety. When I am using C++ I must be constantly “on guard.” A momentary lapse can create a hard-to-find pointer bug. With Python, almost nothing serious can ever go wrong, so I can work late at night, or after a beer. The Python debugger is always available. If an exception occurs, the debugger/interpreter tells me just what went wrong. I don’t have to plan a debugging strategy! Finally, Python recovers from exceptions, so Leo can keep right on going even after a crash!
Speed¶
Python has almost all the speed of C. Other interpretive environments such as Icon and Smalltalk have clarity, power and safety similar to Python. What makes Python unique is its seamless way of making C code look like Python code. Python executes at essentially the speed of C code because most Python modules are written in C. The overhead in calling such modules is negligible. Moreover, if code is too slow, one can always create a C module to do the job.
In fact, Python encourages optimization by moving to higher levels of expression. For example, Leo’s Open command reads an XML file. If this command is too slow I can use Python’s XML parser module. This will speed up Leo while at the same time raising the level of the code.
Conclusions¶
Little of Python is completely new. What stands out is the superb engineering judgment evident in Python’s design. Python is extremely powerful, yet small, simple and elegant. Python allows me to express my intentions clearly and at the highest possible level.
The only hope of making Leo all it can be is to use the best possible tools. I believe Python will allow me to add, at long last, the new features that Leo should have.
Edward K. Ream, October 25, 2001. P.S., September, 2005:
Four years of experience have only added to my admiration for Python. Leo could not possibly be what it is today without Python.