A bit over a year ago I posted this little gem on Hacker News, where I defined a simple programming language in about 450 lines of JavaScript (comments included). I made a mental note at the time to write a more detailed post explaining how it worked, but I never got around to it. So I'm going to go ahead and do it now.
That language was a nameless throwaway experiment, but for the purposes of this post I should probably give it a name. I'm going to go with "Teacup", an unassuming, adorable name, fit for the small and unassuming language we're going to make.
If you don't like it and would rather give it a mightier name like, I don't know, TeaRex or MegaLang3000, you can edit it below:
So without further ado, dear readers, let us write the Teacup programming language.
You can use the `log` function to pretty-print objects in the area to the right of the text editor. However, this is only going to be practical on a desktop.

Overview
We will construct Teacup as a series of transformations, going from the initial source code (a string) to the result of the computation. We will use JavaScript for the implementation, because it is widely known and runs in browsers, but it should be straightforward to translate to your favorite language.
The steps will be as follows:
- We start with the source code, a string.
- The lexer transforms a string into a list of tokens.
- The parser transforms a list of tokens into an abstract syntax tree.
- The interpreter transforms an abstract syntax tree into a result.
Here we define a `Pipeline` class that will make our work easier by composing a series of transformations, passed as parameters:
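In sketch form, it could look something like this (the `process` method name and the exact shape are illustrative, not necessarily the original code):

```js
// A Pipeline composes a series of functions, each one consuming the
// previous one's output.
class Pipeline {
  constructor(...steps) {
    this.steps = steps;
  }
  process(input) {
    // Thread the input through every step, in order
    return this.steps.reduce((value, step) => step(value), input);
  }
}

// e.g. new Pipeline(lex, parse, interpret).process(source)
```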
Lexer
The job of a lexer is to split the input text into a list of discrete tokens (atomic parts like numbers, strings, an opening parenthesis, and so on). Not all parsing methodologies have this step, but for the method we are going to use, it simplifies things.
First we will define a regular expression for each of our token types: we'll have numbers, words (variables), operators, strings (with double quotes) and comments (starting with `#` and going up to the end of the line):
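Something along these lines (the patterns below are illustrative; the operator set in particular is a guess):

```js
// Token types as regular expression *strings*, in matching order:
// number before word, as discussed below.
const tokenTypes = {
  number:   "\\d+(?:\\.\\d+)?",          // 123, 4.5
  word:     "[\\w$]+",                   // variables and keywords
  string:   '"(?:[^"\\\\]|\\\\.)*"',     // double-quoted, with escapes
  comment:  "#[^\\n]*",                  // # up to the end of the line
  operator: "[-+*/^%=<>!.,;:|&@]+|[()\\[\\]{}]",  // symbol runs and brackets
};
```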
The order is important: the regular expression for `word` is a superset of the expression for `number`, for example, which is why `number` comes before `word`. Whatever is not covered by these regular expressions, whitespace for instance, will be ignored by the lexer.
Also, the reason why I define them as strings rather than regular expressions is that I'm going to concatenate them into one big regular expression, which is easiest to do with strings.
Here's the code for transforming a string into a joyous list of tokens:
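In sketch form, assuming the `tokenTypes` table above (the token object shape is illustrative):

```js
// Build one big regular expression with a capture group per token type,
// then split: String.split keeps capture groups in its output, with
// undefined for the groups that did not match.
function lex(code) {
  const types = Object.keys(tokenTypes);
  const re = new RegExp("(" + types.map(t => tokenTypes[t]).join(")|(") + ")");
  const parts = code.split(re);
  const tokens = [];
  let pos = 0;
  parts.forEach((text, i) => {
    // parts[0], parts[n+1], parts[2n+2], ... are the unmatched stretches
    // between tokens (whitespace); the rest map back to a type by position.
    const k = i % (types.length + 1);
    const type = k === 0 ? null : types[k - 1];
    if (text !== undefined && type !== null && type !== "comment")
      tokens.push({ type, text, start: pos, end: pos + text.length });
    if (text !== undefined) pos += text.length;
  });
  return tokens;
}

// lex('f(x) + 1  # neat') -- the comment and whitespace are filtered out
```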
You can change `testCode` to anything else you want to test (hit Ctrl+Enter to run).
Essentially, we build a large regular expression with one matching group for each token type, then we use `String.split` to get our list of tokens, which is a mix of matching strings and the value `undefined` (when there is no match). The type of each token can be figured out from its position in the list. Then we filter out the spurious tokens: those that are comments, have a `null` type, or have `undefined` in the text field.
We also store start and end positions for each token. It is very important to keep track of these numbers so that when things go wrong, we know where they went wrong.
Parser
Alright, so we have a list of tokens. Now we need to organize them into a structure that resembles some kind of program. For Teacup we are going to use a very simple, but elegant system: an operator precedence grammar.
OPGs are more limited than the LL or LR grammars that are used for most languages, but I think these are overkill anyway. OPGs have some interesting properties that in my view make up for the limitations. For instance, the parsing rules are highly local, which means they are, in principle, amenable to parallel parsing.
Also, nobody ever talks about them, so I can differentiate my offering in the unending flood of terrible tutorials that plague the Internet.
Now, the common understanding of operator precedence is that each operator has a precedence number and is either left-associative or right-associative. But forget about this system. It's rubbish. Let me show you a better one:
Operator precedence grammars
Instead, imagine that you have a precedence matrix, with a cell for each (ordered) pair of operators. So if you have the expression `(a + b * c)`, instead of going: "Okay, so `+` has priority 10, and `*` has priority 20, so `*` has higher priority and we must parenthesize it like this: `(a + (b * c))`", you look up the matrix entry `M(+, *)`. There are three possibilities:
- `M(+, *) = -1`: the operator on the left (`+`) has precedence, so the expression should be parenthesized as `((a + b) * c)`.
- `M(+, *) = 1`: the operator on the right (`*`) has precedence, so the expression should be parenthesized as `(a + (b * c))`.
- `M(+, *) = 0`: both operators are equal, so `b` "belongs" to both of them. If you want a better intuition as to what this means: you are effectively making the ternary operator `+*`, and `b` is its middle operand.
If the expression, instead, was `(a * b + c)`, then we would consult `M(*, +)`, which may or may not be the opposite of `M(+, *)`. In fact:
- `M(+, +) = -1` means that `+` is left-associative.
- `M(^, ^) = 1` means that `^` is right-associative.
- For a ternary operator, for example `(a ? b : c)`, then `M(?, :) = 0`.
- The same idea works for brackets: `M([, ]) = 0`.
So you can see this is pretty powerful. You get associativity, ternary operators, brackets, and so on, for free. You could also define a complex operator like `if a then b elseif c ... else d end` if you wanted to (and we will want to do this).
Precedence definitions
Granted, defining a precedence matrix sounds like work. If you have ten operators, the matrix is ten by ten, which means a hundred entries. If you have twenty, well… yeah. We don't want to fill that manually.
There are many solutions to this problem, but a tutorial can only afford so much sophistication, so what we're going to do is this: for each operator `op`, we will have two numeric priorities: a left-priority `L(op)` for when the operator is on the left and a right-priority `R(op)` for when it is on the right. And then we will define `M(op1, op2)` using `L` and `R` as follows:

`M(op1, op2) = sign(R(op2) - L(op1))`
Under this scheme, an operator is left-associative if `L(op) > R(op)` and right-associative if `R(op) > L(op)`. Unlike the one-priority scheme, this preserves our ability to define ternary operators. What follows are the priorities for Teacup. Some of them will seem strange at first glance, but I will explain everything.
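An illustrative version of that table (the concrete numbers differ from the original; only their relative order matters):

```js
const HI = 10000;                     // "very high"
const lassoc = p => [p, p - 1];       // left-priority wins:  L(op) > R(op)
const rassoc = p => [p - 1, p];       // right-priority wins: R(op) > L(op)
const xassoc = p => [p, p];           // equal priorities
const prefix = p => [p, HI];          // very high right-priority
const suffix = p => [HI, p];          // very high left-priority

const priorities = {
  // special forms: the opening keyword is prefix, the closing one suffix,
  // and the middle keywords match them with xassoc
  "(":    prefix(1),   ")":    suffix(1),
  "[":    prefix(1),   "]":    suffix(1),
  "if":   prefix(1),   "then": xassoc(1), "elseif": xassoc(1),
  "else": xassoc(1),   "end":  suffix(1),
  "let":  prefix(1),   "in":   xassoc(1),
  "for":  prefix(1),   "do":   xassoc(1),
  // ordinary operators
  "=":  rassoc(40),
  "->": rassoc(50),
  "==": lassoc(100), "<": lassoc(100), ">": lassoc(100),
  "+":  lassoc(200), "-": lassoc(200),
  "*":  lassoc(300), "/": lassoc(300), "%": lassoc(300),
  "^":  rassoc(400),
  ".":  lassoc(500),
  // atoms are nullary: very high priority on both sides (just above HI,
  // so that R("(") stays below L(f) and f(x) parses as a call)
  word: xassoc(HI + 1), number: xassoc(HI + 1), string: xassoc(HI + 1),
};
```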
Some clarifications: `lassoc` defines a left-associative operator, which means its left-priority is slightly greater than its right-priority. `rassoc` defines a right-associative operator (higher right-priority). `xassoc` defines equal left and right-priorities.
A `prefix` operator is just an operator with a very high right-priority. Imagine for example that the expression `(a + * b)` is in fact `(a + null * b)`, with an implicit "null" token between the two operators. A prefix operator will have a null left operand, whereas a suffix operator will have a null right operand.
Since we compare `L(+)` to `R(*)`, if `R(*)` is very high, it's always going to "win" the comparison. It will always take the `null`, so it will always be prefix. Likewise, a suffix operator would have a very high left-priority.
So:
- Opening brackets and keywords like `if` (which begin special forms) are defined as `prefix` operators with very low priority.
- Closing brackets and the `end` keyword are defined as `suffix` operators with a very low priority that matches the opening bracket/keyword.
- Keywords that are part of a special form, like `else`, have a very low priority that matches the opening and closing forms.
What about the entries for `word`, `number` and `string`? These are not operators, right? Well, to simplify the code, we will assume that they are operators, but these operators are nullary (they are both prefix and suffix).
Note that despite being prefix, `(` can take an operand on its left, for instance if one writes `f(x)`, because `R(()` is not quite as high as `L(f)`.
What follows is the code to get a token's priority and the `order` method, which calculates the matrix entry corresponding to two tokens:
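In sketch form, assuming the `priorities` table above:

```js
// Look up a token's [leftPriority, rightPriority] pair: first by its text
// (operators, keywords), then by its type (word, number, string).
function getPrio(token) {
  return priorities[token.text] || priorities[token.type] || xassoc(HI + 1);
}

// The matrix entry M(t1, t2) = sign(R(t2) - L(t1)):
// -1 closes t1's handle, 0 continues it, 1 opens a new one.
function order(t1, t2) {
  return Math.sign(getPrio(t2)[1] - getPrio(t1)[0]);
}
```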
Algorithm
Now that we have a priority system for our operators, here is the meat: the operator precedence parser. It is not a long algorithm, nor is it very complex. The most difficult part is to properly visualize its operation, but you can uncomment the `log` line in the code to get a useful trace of what the algorithm is doing!
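Here is a sketch of that parser (the bookkeeping details are illustrative; the heart of it is the three-way dispatch on `order`):

```js
// Keep a stack of partial handles, each ending with an operator that is
// still waiting for its right operand. `middle` is the operand sitting
// between the previous token and the next one (null if there is none).
function parse(tokens, finalize) {
  const stack = [];
  let middle = null;
  for (const token of tokens) {
    for (;;) {
      const top = stack[stack.length - 1];
      const ord = top ? order(top[top.length - 1], token) : 1;
      if (ord === 1) {            // new handle: the token binds tighter
        stack.push([middle, token]);
        middle = null;
        break;
      } else if (ord === 0) {     // same handle: the token continues it
        top.push(middle, token);
        middle = null;
        break;
      } else {                    // top handle is complete: finalize it,
        top.push(middle);         // and it becomes the operand on the left
        middle = finalize(stack.pop());
      }
    }
  }
  while (stack.length) {          // end of input closes everything
    const top = stack.pop();
    top.push(middle);
    middle = finalize(top);
  }
  return middle;
}

// parse(lex("a + b * c"), x => x)
// ==> [a, +, [b, *, c]], where each atom is itself a [null, atom, null]
```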
In the right pane, you can see the output of the raw algorithm, which is a deeply nested list. It may seem oddly organized at first glance, but it is actually very neat: notice that in each list, which we will also call a "handle", the odd indexes (0-based) hold the operator tokens, and the even indexes hold the operands. For instance:
- The atom `a` (which as I explained before is parsed as a nullary operator) corresponds to the list `[null, a, null]`. The operator is `a` (in position 1), and it has two `null` operands (in positions 0 and 2).
- `a + b` corresponds to the list `[a, +, b]`: operator `+` in position 1, operands `a` and `b` in positions 0 and 2.
- If you try the code `(a)`, it will produce the list `[null, (, a, ), null]`: operators in positions 1 and 3, operands in positions 0, 2 and 4.
Another neat property of this output is that it is order-preserving: if you flatten it, you end up with the original list of tokens, with a `null` between each token pair.
You have certainly noticed that `parse` takes an additional argument, a function called `finalize` (we use the identity function in the example above). That is the subject of the next section.
Finalizing the parse
Although neat, the raw output of `parse` is not necessarily the one we want. The `finalize` function is given a handle after it is completed and must return the "value" associated to that handle. If you are writing a simple calculator, that can be an evaluation function; if you are writing a compiler or a more elaborate interpreter, it can be a node of your abstract syntax tree (or AST).
Here we will define a `finalize` function that transforms our infix AST nodes into prefix ones and computes location information. We will do this somewhat cleverly, for reasons that will become apparent later (note that in what follows, `"a"` is shorthand for the token object for `a`):
Our transformation rules:
[null, "a", null] ==> "a"
[a, "+", b] ==> ["E + E", a, b]
[null, "+", b] ==> ["_ + E", b]
[null, "(", a, ")", null] ==> ["_ ( E ) _", a]
In other words, we will eliminate nullary operators, and aggregate the operators that we find in odd positions into an expression that describes which of the operands are not null.
(Also, we are not really going to return an expression in prefix notation; instead we will put the function's "signature" in the `type` field and the arguments in the `args` field. That simplifies things a little: our AST will have our old token nodes as leaves, with types `word`, `number` and `string`, and now we're going to add inner nodes with types like `E + E`, so that later on we can simply dispatch on `type`.)
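A sketch implementing these rules (the location bookkeeping is simplified):

```js
// Turn a completed handle into an AST node: atoms pass through unchanged,
// everything else becomes { type, args, start, end } where `type` is the
// handle's signature ("E + E", "_ ( E ) _", ...).
function finalize(handle) {
  if (handle.length === 3 && handle[0] === null && handle[2] === null)
    return handle[1];                      // nullary operator: keep the token
  const sig = [];
  const args = [];
  handle.forEach((x, i) => {
    if (i % 2 === 1) sig.push(x.text);     // operator token (odd position)
    else if (x === null) sig.push("_");    // missing operand
    else { sig.push("E"); args.push(x); }  // actual operand
  });
  const present = handle.filter(x => x !== null);
  return {
    type: sig.join(" "),
    args: args,
    start: present[0].start,               // span of the whole handle
    end: present[present.length - 1].end,
  };
}
```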
Interpreter
Now that we have our AST, there are a few things we could do. We could write a compiler for Teacup, either to JavaScript, or to a different language, or machine code. Or we could write a simple interpreter. We are going to do the latter, because it is a bit more interesting than compiling to JavaScript (and not a pain in the ass like compiling to machine code).
Environment
In our interpreter we will implement variables and lexical scoping. This means we need some kind of structure to map variable names to values: an environment. For this purpose we will use a simple JavaScript object with a `null` prototype (which means it has no built-in fields like `toString` that could interfere with symbol resolution). Then, to implement nested scopes, all we need to do is call `Object.create` on the parent environment. JS's prototype inheritance will do the rest for us.
So we define the `makeEnvironment` function, which takes an optional `bindings` argument (for initial bindings). It will also populate the environment with basic functionality that we want to have in Teacup, like truth, falsehood, arithmetic… that kind of nonsense.
Notice that we populate the environment with some symbolic names like `"+"`. We will make it so that operator applications like `(a + b)` use these bindings.
The `and` and `or` functions are lazy: instead of receiving their arguments directly, these functions will receive functions that evaluate to the arguments, so they may choose to evaluate only a subset. The `lazy` wrapper we define here just sets the `lazy` property to true; we will deal with it explicitly later (see the `runCall` function).
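In sketch form, with a representative handful of bindings:

```js
// Mark a function as lazy: the caller will pass thunks instead of values.
function lazy(fn) {
  fn.lazy = true;
  return fn;
}

function makeEnvironment(bindings) {
  const env = Object.create(null);   // no toString & co. to shadow variables
  Object.assign(env, {
    true: true, false: false, null: null,
    "+": (a, b) => a + b,  "-": (a, b) => a - b,
    "*": (a, b) => a * b,  "/": (a, b) => a / b,
    "==": (a, b) => a === b,  "<": (a, b) => a < b,  ">": (a, b) => a > b,
    // lazy: the arguments come in as zero-argument functions, so the
    // right-hand side is only evaluated when needed
    and: lazy((a, b) => a() && b()),
    or:  lazy((a, b) => a() || b()),
  }, bindings);
  return env;
}

// Nested scopes: Object.create(env) inherits every binding from `env`.
```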
Utilities
Before going any further we will write some utilities to manipulate the AST a bit more easily (their functionality and purpose should be clear enough from the comments):
Algorithm
Now comes the time to write our interpreter for Teacup, which will evaluate a source string into a value. It will have two parameters:

- A `parser` to produce the AST.
- A mapping (`handlers`) from node "type" (`word`, `number`, `E + E`, etc.) to a function that handles that kind of node.
One caveat for the mapping is that we will also map regular expressions to handlers. For instance, we may have a node of "type" `_ if X elseif X elseif X else X end _`. We will therefore want to be able to match an arbitrary number of `elseif`. A regular expression will do just fine.
We will define an `eval` method on the interpreter which will take two arguments: a source code string or an AST node, and the environment we defined previously.
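A sketch of that skeleton (representing the handlers as a list of [type-or-regexp, function] pairs is an assumption that the following sketches stick to):

```js
// `parser` is the lex/parse pipeline; `handlers` is a list of
// [typeOrRegExp, fn] pairs, tried in order against the node's type.
function makeInterpreter(parser, handlers) {
  const interpreter = {
    eval(node, env) {
      if (typeof node === "string")       // source string: parse it first
        node = parser.process(node);
      for (const [key, fn] of handlers) {
        if (key === node.type ||
            (key instanceof RegExp && key.test(node.type)))
          return fn(interpreter, node, env);
      }
      throw Error("Unknown node type: " + node.type);
    },
  };
  return interpreter;
}
```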
That's it, really. All we are missing are the handlers, which are admittedly the most important part, because at the moment our interpreter literally can't interpret anything.
As we define them, we will push the handlers into the `handlers` variable.
Basic handlers
The basic handlers we need are as follows:
| Type | Example | Description |
|---|---|---|
| `word` | `xyz` | Variable lookup |
| `number` | `123` | Numbers |
| `string` | `"hohoho"` | Strings |
| `_ ( E ) _` | `(x)` | Grouping |
| `E ( E ) _` | `f(x)` | Function calls |
| `_ [ E ] _` | `[x, y, z]` | List building |
| `E [ E ] _` | `x[y]` | Indexing |
| `E . E` | `x.y` | Dot notation for indexing |
| `E + E` | `x + y` | Operator application |
| `E * E` | `x * y` | " |
| etc. | etc. | " |
| `_ if E then E else E end _` | `if x then y else z end` | Conditional statement |
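Sketches for a few of these, following the conventions assumed above:

```js
const handlers = [];

handlers.push(
  ["word",   (interp, node, env) => env[node.text]],    // variable lookup
  ["number", (interp, node, env) => Number(node.text)],
  ["string", (interp, node, env) => JSON.parse(node.text)],
  ["_ ( E ) _", (interp, node, env) => interp.eval(node.args[0], env)],
  // binary operator application: delegate to the environment's binding
  [/^E (==|<|>|\+|-|\*|\/) E$/, (interp, node, env) => {
    const fn = env[node.type.split(" ")[1]];
    return fn(interp.eval(node.args[0], env),
              interp.eval(node.args[1], env));
  }],
  ["_ if E then E else E end _", (interp, node, env) =>
    interp.eval(node.args[0], env)
      ? interp.eval(node.args[1], env)
      : interp.eval(node.args[2], env)],
);
```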
Variables and functions
Now that the basics are there, we will add the capability to define variables and functions, as well as simple list comprehension syntax:
| Type | Example | Description |
|---|---|---|
| `E -> E` | `x -> x + 1` | Anonymous function |
| `_ let E in E end _` | `let x = 1 in x * x end` | Variable binding |
| `_ for E in E do E end _` | `for x in list do x * x end` | List comprehension |
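Sketches for these three (simplified to single-variable parameters and bindings):

```js
handlers.push(
  // x -> expr : a closure of one argument over the defining environment
  ["E -> E", (interp, node, env) => (arg) => {
    const scope = Object.create(env);          // new lexical scope
    scope[node.args[0].text] = arg;
    return interp.eval(node.args[1], scope);
  }],
  // let x = expr in body end : args[0] is an "E = E" node
  ["_ let E in E end _", (interp, node, env) => {
    const [binding, body] = node.args;
    const scope = Object.create(env);
    scope[binding.args[0].text] = interp.eval(binding.args[1], env);
    return interp.eval(body, scope);
  }],
  // for x in list do body end : evaluate body once per element
  ["_ for E in E do E end _", (interp, node, env) => {
    const [variable, list, body] = node.args;
    return interp.eval(list, env).map(x => {
      const scope = Object.create(env);
      scope[variable.text] = x;
      return interp.eval(body, scope);
    });
  }],
);

// Assembling the sketched pieces:
//   const parser = new Pipeline(lex, ts => parse(ts, finalize));
//   makeInterpreter(parser, handlers)
//     .eval("let x = 2 in x * x end", makeEnvironment())   // ==> 4
```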
Well, look at that! Teacup is almost complete!
Adjustments
The one small detail we have to fix is unary `-`, which, unfortunately, doesn't work. With our current setup, `(10 + -2)` will parse as `((10 +) - 2)`.
There is an easy fix: we will write a pass that looks for sequences of infix tokens and makes every operator except the first one prefix. We will sandwich that pass between the lexer and the parser.
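A sketch of that pass (marking tokens with a `prefix` flag, which `getPrio` would then honor by substituting a very high right-priority, is one way to do it; the exact mechanism is an implementation detail):

```js
// In any run of consecutive operator tokens, every operator after the
// first one cannot have a left operand, so mark it as prefix.
function fixPrefix(tokens) {
  for (let i = 1; i < tokens.length; i++) {
    if (tokens[i].type === "operator" && tokens[i - 1].type === "operator")
      tokens[i].prefix = true;        // e.g. the "-" in (10 + -2)
  }
  return tokens;
}

// (A matching "_ - E" handler, e.g. negation, completes the picture.)
```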
There. We're done :)
Play pen
You can try the final result here. I have added syntax highlighting and automatic indent for your convenience. The highlighting depends only on the regular expressions defined in the lexer section, so it will adapt to whatever changes you make. Fun!
Note: if the play pen is empty, that's because you evaluated code in some other editor on your way here. If you refresh the page, you will see an example.
Hit Ctrl+Enter to evaluate.
Conclusion
We made Teacup! Yay! That wasn't too hard.
There's a bunch of stuff we could do to tweak or improve it, and which may be the subject of future posts, if there is interest:
- Static type annotations and inference.
- Interesting special forms, like pattern matching.
- Simple macro system.
- Incremental parser.
A git repository with the code is available here.
I also made a JSFiddle for your convenience.
Related links
- The Earl Grey language which I designed using some of the principles in this tutorial.
- "Screw it, I'll make my own!", an article about my experience designing a programming language.