That language was a nameless throwaway experiment, but for the purposes of this post I should probably give it a name. I'm going to go with "Teacup", an unassuming, adorable name, fit for the small and unassuming language we're going to make.
If you don't like it and would rather give it a mightier name like, I don't know, TeaRex or MegaLang3000, you can edit it below:
So without further ado, dear readers, let us write the Teacup programming language.
You can use the log function to pretty-print objects in the area to the right of the text editor. However, this is only going to be practical on a desktop.
The steps will be as follows:
- We start with the source code, a string.
- The lexer transforms a string into a list of tokens.
- The parser transforms a list of tokens into an abstract syntax tree.
- The interpreter transforms an abstract syntax tree into a result.
Here we define a Pipeline class that will make our work easier by composing a series of transformations, passed as a parameter:
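As a minimal sketch of the idea (the post's actual code is not reproduced here, so the details below are my own), each transformation is just a function, and the pipeline feeds each step's output into the next step's input:

```javascript
// A pipeline composes a series of transformations: the output of each
// step becomes the input of the next.
class Pipeline {
  constructor(...steps) {
    this.steps = steps;
  }
  process(input) {
    return this.steps.reduce((data, step) => step(data), input);
  }
}

// Toy example: trim, then uppercase.
const shout = new Pipeline(
  s => s.trim(),
  s => s.toUpperCase()
);
```

For Teacup, the steps would be the lexer, the parser and the interpreter, in that order.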
The job of a lexer is to split the input text into a list of discrete tokens (atomic parts like numbers, strings, an opening parenthesis, and so on). Not all parsing methodologies have this step, but for the method we are going to use, it simplifies things.
First we will define a regular expression for each of our token types: we'll have numbers, words (variables), operators, strings (with double quotes) and comments (starting with # and up to the end of the line).
The order is important: the regular expression for word is a superset of the expression for number, for example, which is why number must come before word. Whatever is not covered by these regular expressions, whitespace for instance, will be ignored by the lexer.
Also, the reason why I define them as strings rather than regular expressions is that I'm going to concatenate them and I don't know how to do this if they are not strings.
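As a rough illustration, the definitions could look like this (these exact patterns are my own guesses, not the post's; what matters is that they are plain strings, in the right order, ready to be concatenated):

```javascript
// One pattern string per token type. Order matters: number must come
// before word, since the word pattern would also match digits.
const patterns = [
  ["number",   "\\d+(?:\\.\\d+)?"],
  ["word",     "\\w+"],
  ["string",   "\"(?:[^\"\\\\]|\\\\.)*\""],
  ["operator", "[-+*/^=<>!.,;()\\[\\]{}]"],
  ["comment",  "#[^\\n]*"],
];

// Because they are strings, concatenating them into one big
// alternation, with one capturing group per type, is trivial:
const bigPattern = patterns.map(([, p]) => `(${p})`).join("|");
```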
Here's the code for transforming a string into a joyous list of tokens. You can change testCode to something else you want to test (hit Ctrl+Enter to run).
Essentially, we build a large regular expression with one matching group for each token type, then we use split to get our list of tokens, which is a mix of matching strings and the value undefined (when there is no match). The type of each token can be figured out from its position in the list. Then we filter out the spurious tokens: those that are comments, have a null type, or have undefined in the text field.
We also store start and end positions for each token. It is very important to keep track of these numbers so that when things go wrong, we know where they went wrong.
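The strategy above can be sketched as follows. This is my own hedged reconstruction, not the post's exact code; it relies on the fact that String#split with capturing groups interleaves the unmatched text with the captured groups, inserting undefined for groups that did not participate in a match:

```javascript
// Order matters: number must come before word.
const types = ["number", "word", "string", "operator", "comment"];
const patterns = [
  "\\d+(?:\\.\\d+)?",           // number
  "\\w+",                       // word
  "\"(?:[^\"\\\\]|\\\\.)*\"",   // string
  "[-+*/^=<>!.,;()\\[\\]{}]",   // operator (single characters here)
  "#[^\\n]*",                   // comment
];
const bigRe = new RegExp(patterns.map(p => `(${p})`).join("|"));

function lex(text) {
  const pieces = text.split(bigRe);
  const tokens = [];
  let pos = 0;
  for (let i = 0; i < pieces.length; i++) {
    const piece = pieces[i];
    if (piece === undefined) continue;  // a group that did not match
    const group = i % (types.length + 1);
    // group 0 is the text between tokens (whitespace etc.): skip it,
    // but still advance the position counter below.
    if (group !== 0 && types[group - 1] !== "comment") {
      tokens.push({
        type: types[group - 1],
        text: piece,
        start: pos,
        end: pos + piece.length,
      });
    }
    pos += piece.length;
  }
  return tokens;
}
```

Note how the start and end fields come for free from tracking how much text we have consumed.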
Alright, so we have a list of tokens. Now we need to organize them into a structure that resembles some kind of program. For Teacup we are going to use a very simple, but elegant system: an operator precedence grammar.
OPGs are more limited than the LL or LR grammars that are used for most languages, but I think these are overkill anyway. OPGs have some interesting properties that in my view make up for the limitations. For instance, the parsing rules are highly local, which means they are, in principle, amenable to parallel parsing.
Also, nobody ever talks about them, so I can differentiate my offering in the unending flood of terrible tutorials that plague the Internet.
Now, the common understanding of operator precedence is that each operator has a precedence number and is either left-associative or right-associative. But forget about this system. It's rubbish. Let me show you a better one:
Operator precedence grammars
Instead, imagine that you have a precedence matrix, with a cell for each (ordered) pair of operators. So if you have the expression (a + b * c), instead of going: "Okay, so + has priority 10, and * has priority 20, so * has higher priority and we must parenthesize it like this: (a + (b * c))", you look up the matrix entry M(+, *). There are three possibilities:
- M(+, *) = -1: the operator on the left (+) has precedence, so the expression should be parenthesized as ((a + b) * c).
- M(+, *) = 1: the operator on the right (*) has precedence, so the expression should be parenthesized as (a + (b * c)).
- M(+, *) = 0: both operators have equal precedence, so b "belongs" to both of them. If you want a better intuition as to what this means, it means you are making + and * into a single ternary operator, of which b is the middle operand.
If the expression, instead, was (a * b + c), then we would consult M(*, +), which may or may not be the opposite of M(+, *). In fact:

- M(+, +) = -1 means that + is left-associative: (a + b + c) is parenthesized as ((a + b) + c).
- M(^, ^) = 1 means that ^ is right-associative: (a ^ b ^ c) is parenthesized as (a ^ (b ^ c)).
- For a ternary operator, for example (a ? b : c), we set M(?, :) = 0.
- The same idea works for brackets: M([, ]) = 0.
So you can see this is pretty powerful. You get associativity, you get
ternary operators, brackets, and so on, for free. You could also
define a complex operator like
if a then b elseif c ... else d end
if you wanted to (and we will want to do this).
Granted, defining a precedence matrix sounds like work. If you have ten operators, the matrix is ten by ten, which means a hundred entries. If you have twenty, well… yeah. We don't want to fill that manually.
There are many solutions to this problem, but a tutorial can only afford so much sophistication, so what we're going to do is this: for each operator op, we will have two numeric priorities: a left-priority L(op) for when the operator is on the left and a right-priority R(op) for when it is on the right. And then we will compute M(op1, op2) using L and R as follows:

M(op1, op2) = sign(R(op2) - L(op1))
Under this scheme, an operator is left-associative if
L(op) > R(op)
and right-associative if
R(op) > L(op). Unlike the one-priority
scheme, this preserves our ability to define ternary operators. What
follows are the priorities for Teacup. Some of them will seem
strange at first glance, but I will explain everything.
lassoc defines a left-associative operator, which means its left-priority is slightly greater than its right-priority. rassoc defines a right-associative operator (higher right-priority). xassoc defines equal left and right-priorities.
A prefix operator is just an operator with very high right-priority. Imagine for example that the expression (a + * b) is really (a + null * b), with an implicit "null" token between the two operators. A prefix operator will have a null left operand, whereas a suffix operator will have a null right operand.
Since we compare priorities pairwise, if R(*) is very high, it's always going to "win" the comparison. It will always take the null, so it will always be prefix. Likewise, a suffix operator would have a very high left-priority.
- Opening brackets and keywords like if (which begin special forms) are defined as prefix operators with very low priority.
- Closing brackets and the end keyword are defined as suffix operators with a very low priority that matches the opening bracket/keyword.
- Keywords that are part of a special form, like else, have a very low priority that matches the opening and closing forms.
What about the entries for word, number and string? These are not operators, right? Well, to simplify the code, we will assume that they are operators, but these operators are nullary (they are both prefix and suffix).
Note that despite being prefix, ( can take an operand on its left, for instance if one writes f(x), which is why R(() is not quite as high as that of other prefix operators.
What follows is the code to get a token's priority, as well as the method which calculates the matrix entry corresponding to two tokens:
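The two-priority scheme can be encoded roughly like this (the helper names and the numeric values are my own assumptions, not the post's exact code):

```javascript
// Each helper produces a {left, right} priority pair for an operator.
const lassoc = n => ({ left: n, right: n - 1 });  // left-assoc: L > R
const rassoc = n => ({ left: n, right: n + 1 });  // right-assoc: R > L
const xassoc = n => ({ left: n, right: n });      // equal priorities
const prefix = n => ({ left: n, right: 10004 }); // very high right-priority
const suffix = n => ({ left: 10004, right: n }); // very high left-priority

// The matrix entry: M(op1, op2) = sign(R(op2) - L(op1)).
const order = (p1, p2) => Math.sign(p2.right - p1.left);

const plus = lassoc(100);
const pow = rassoc(200);
```

With these definitions, order(plus, plus) is -1 (left-associative) and order(pow, pow) is 1 (right-associative), exactly as the matrix entries described above require.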
Now that we have a priority system for our operators, here is the meat: the operator precedence parser. It is not a long algorithm, nor is it very complex. The most difficult part is to properly visualize its operation, but you can uncomment the log line in the code to get a useful trace of what the algorithm is doing!
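Since the post's parser isn't reproduced here, below is a hedged sketch of an operator precedence parser along the lines described: a stack of partial "handles", with implicit null operands between adjacent operators. The priority table and all names are illustrative, not the post's actual values:

```javascript
// Every token is treated as an operator; handles have the shape
// [operand, op, operand, ..., op, operand].
const PRIO = {
  word:   { left: 10005, right: 10005 }, // nullary: very high on both sides
  number: { left: 10005, right: 10005 },
  "+":    { left: 101, right: 100 },     // left-associative: L > R
  "*":    { left: 201, right: 200 },     // binds tighter than +
  "(":    { left: 5, right: 10003 },     // prefix-ish, very low left priority
  ")":    { left: 10005, right: 5 },     // suffix; R()) === L(() so M((, )) = 0
};
const prio = t => t.type === "sentinel"
  ? { left: -1e9, right: -1e9 }          // sentinels lose to everything
  : (PRIO[t.text] || PRIO[t.type]);
const order = (a, b) => Math.sign(prio(b).right - prio(a).left);

function parse(tokens, finalize) {
  const sentinel = { type: "sentinel", text: "" };
  const stack = [[null, sentinel]];      // stack of partial handles
  let current = null;                    // operand accumulated so far
  for (const token of [...tokens, sentinel]) {
    for (;;) {
      const top = stack[stack.length - 1];
      const o = order(top[top.length - 1], token);
      if (o === 1) {                     // right wins: open a nested handle
        stack.push([current, token]);
        current = null;
        break;
      } else if (o === 0) {              // equal: token continues this handle
        top.push(current, token);
        current = null;
        break;
      } else {                           // left wins: close the top handle
        top.push(current);
        current = finalize(stack.pop());
      }
    }
  }
  return stack[0][2];                    // [null, sentinel, result, sentinel]
}
```

With finalize set to the identity function, parsing the tokens for "a + b" yields [[null, a, null], +, [null, b, null]], matching the nested lists described below.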
In the right pane, you can see the output of the raw algorithm, which is a deeply nested list. It may seem oddly organized at first glance, but it is actually very neat: notice that in each list, which we will also call a "handle", the odd indexes are the operator tokens and the even indexes are the operands (indexes are 0-based). For instance:
- The atom a (which as I explained before is parsed as a nullary operator) corresponds to the list [null, a, null]. The operator is a (in position 1), and it has two null operands (in positions 0 and 2).
- The expression a + b corresponds to the list [a, +, b]: operator + in position 1, operands a and b in positions 0 and 2.
- If you try the code (a), it will produce the list [null, (, a, ), null]. Operators in positions 1 and 3, operands in 0, 2 and 4.
Another neat property of this output is that it is order preserving:
if you flatten it, you end up with the original list of tokens, with a
null between each token pair.
You have certainly noticed that parse takes an additional argument, a function called finalize (we use the identity function in the example above). That is the subject of the next section:
Finalizing the parse
Although neat, the raw output of parse is not necessarily the one we want. The finalize function is given a handle after it is completed and must return the "value" associated to that handle. If you are writing a simple calculator, that can be the result of an evaluation; if you are writing a compiler or a more elaborate interpreter, it can be a node of your abstract syntax tree (or AST).
Here we will define a finalize function that transforms our infix AST nodes into prefix ones and computes location information. We will do this somewhat cleverly, for reasons that will become apparent later (note that in what follows, "a" is shorthand for the token object for the text a).

Our transformation rules:
[null, "a", null] ==> "a"
[a, "+", b] ==> ["E + E", a, b]
[null, "+", b] ==> ["_ + E", b]
[null, "(", a, ")", null] ==> ["_ ( E ) _", a]
In other words, we will eliminate nullary operators, and aggregate the operators that we find in odd positions into an expression that describes which of the operands are not null.
(Also, we are not really going to return an expression in prefix notation; instead we will put the function's "signature" in the type field and the arguments in the args field. That simplifies things a little: our AST will have our old token nodes as leaves, with types word, number and string. And now we're going to add inner nodes with types like E + E, so that later on we can simply dispatch on the type.)
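The transformation rules above can be sketched like this (my own reconstruction; location bookkeeping is omitted for brevity):

```javascript
// Collapse nullary handles to their token; turn every other handle
// into a node whose type is a "signature" like "E + E" or "_ ( E ) _",
// where E marks a non-null operand and _ a null one.
function finalize(handle) {
  // [null, a, null] ==> a
  if (handle.length === 3 && handle[0] === null && handle[2] === null)
    return handle[1];
  const parts = [];  // pieces of the signature
  const args = [];   // the non-null operands
  handle.forEach((item, i) => {
    if (i % 2 === 1) parts.push(item.text);     // odd: operator tokens
    else if (item === null) parts.push("_");    // missing operand
    else { parts.push("E"); args.push(item); }  // real operand
  });
  return { type: parts.join(" "), args };
}
```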
In our interpreter we will implement variables and lexical scoping. This means we need some kind of structure to map variable names to values: an environment. For this purpose we will use a plain object with a null prototype (which means it has no built-in fields like toString that could interfere with symbol resolution). Then, to implement nested scopes, all we need to do is call Object.create on the parent environment. JS's prototype inheritance will do the rest for us.
So we define the makeEnvironment function that takes an optional bindings argument (for initial bindings). It will also populate the environment with basic functionality that we want to have in Teacup, like truth, falsehood, arithmetic… that kind of thing.
Notice that we populate the environment with some symbolic names like
"+". We will make it so that operator applications like
(a + b)
will use these bindings.
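A minimal sketch of this setup (the exact set of bindings is my own guess):

```javascript
// A null-prototype object, so built-ins like toString never shadow
// user variables; nested scopes come free via Object.create.
function makeEnvironment(bindings = {}) {
  const env = Object.create(null);
  env["true"] = true;
  env["false"] = false;
  env["+"] = (a, b) => a + b;   // symbolic names for operator application
  env["*"] = (a, b) => a * b;
  Object.assign(env, bindings);
  return env;
}

// Child scope: sees the parent's bindings, can shadow them locally,
// and never mutates the parent.
const parent = makeEnvironment({ x: 1 });
const child = Object.create(parent);
child.x = 2;
```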
Some functions, like and and or, are lazy: instead of receiving their arguments directly, these functions will receive functions that evaluate to the arguments, so they may choose to evaluate only a subset of them. The lazy wrapper we define here just sets a flag to true; we will deal with it explicitly later, when we implement function application.
Before going any further we will write some utilities to manipulate the AST a bit more easily (their functionality and purpose should be clear enough from the comments):
Now comes the time to write our interpreter for Teacup, which will evaluate a source string into a value. It will have two parameters:
- A parser to produce the AST.
- A mapping from node "type" (word, number, E + E, etc.) to a function that handles that kind of node (a handler).
One caveat for the mapping is that we will also map regular
expressions to handlers. For instance, we may have a node of "type"
_ if X elseif X elseif X else X end _. We will therefore want to be
able to match an arbitrary number of
elseif. A regular expression
will do just fine.
We will define an
eval method on the interpreter which will take two
arguments: a source code string or an AST node, and the environment
we defined previously.
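The shape of that interpreter might look as follows. This is a hedged sketch with illustrative names: the handler mapping is represented as a list of [pattern, handler] pairs so that both plain type strings and regular expressions can be matched:

```javascript
// An interpreter parameterized by a parser and a handler table.
// A pattern is either an exact type string or a regular expression
// (useful for types like "_ if E then E elseif E then E else E end _").
function makeInterpreter(parser, handlers) {
  function evaluate(node, env) {
    if (typeof node === "string") node = parser(node); // source string: parse
    for (const [pattern, handler] of handlers) {
      const match = pattern instanceof RegExp
        ? pattern.test(node.type)
        : pattern === node.type;
      if (match) return handler(node, env, evaluate);
    }
    throw new Error(`No handler for node type: ${node.type}`);
  }
  return { eval: evaluate };
}

// A toy instance: numbers plus "E + E" / "E * E" nodes only.
const toyHandlers = [
  ["number", node => Number(node.text)],
  [/^E [+*] E$/, (node, env, ev) => {
    const [a, b] = node.args.map(x => ev(x, env));
    return node.type === "E + E" ? a + b : a * b;
  }],
];
const toy = makeInterpreter(x => x, toyHandlers); // identity "parser"
```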
That's it, really. All we are missing are the handlers, which are admittedly the most important part, because at the moment our interpreter literally can't interpret anything.
As we define them, we will push the handlers into the handler mapping.
The basic handlers we need are as follows (among them, dot notation for indexing):
Variables and functions
Now that the basics are there, we will add the capability to define variables and functions, as well as simple list comprehension syntax:
Well, look at that! Teacup is almost complete!
The one small detail we have to fix is unary minus, which, unfortunately, doesn't work. With our current setup, (10 + -2) will be parsed as ((10 +) - 2). There is an easy fix: we will write a pass that looks for sequences of infix tokens and makes every operator except for the first one prefix. We will sandwich that pass between the lexer and the parser.
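That pass can be sketched like this (my own reconstruction; a real implementation would also exclude closing brackets and other suffix-like tokens from triggering the rule):

```javascript
// Whenever an operator directly follows another operator (or starts
// the input), mark it as prefix. Here we just tag the token; the real
// pass presumably rewrites its priorities to the prefix ones instead.
function fixPrefix(tokens) {
  let prevWasOperator = true; // start of input counts as "after an operator"
  return tokens.map(token => {
    const t = { ...token };
    if (t.type === "operator" && prevWasOperator)
      t.prefix = true;        // e.g. give it a very high right-priority
    prevWasOperator = t.type === "operator";
    return t;
  });
}
```

So in (10 + -2), the - gets marked prefix and takes a null left operand, which is exactly what we want.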
There. We're done :)
You can try the final result here. I have added syntax highlighting and automatic indent for your convenience. The highlighting depends only on the regular expressions defined in the lexer section, so it will adapt to whatever changes you make. Fun!
Note: if the play pen is empty, that's because you evaluated code in some other editor on your way here. If you refresh the page, you will see an example.
Hit Ctrl+Enter to evaluate.
We made Teacup! Yay! That wasn't too hard.
There's a bunch of stuff we could do to tweak or improve it, and which may be the subject of future posts, if there is interest:
- Static type annotations and inference.
- Interesting special forms, like pattern matching.
- Simple macro system.
- Incremental parser.
A git repository with the code is available here.
I also made a JSFiddle for your convenience.
- The Earl Grey language, which I designed using some of the principles in this tutorial.
- "Screw it, I'll make my own!", an article about my experience designing a programming language.