Quaint Development Blog How quaint

20 Nov 2011 (2 comments)

LR – Icon

Here I will review a language which I find conceptually very interesting: the Icon language. Icon is a language with "goal-directed execution" where function calls may either fail or return one or more values, and failure within an expression causes backtracking to see if there may be other values to try.

Goal-directed execution

It is not entirely trivial to describe this mode of execution without examples, so here goes:

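# fact is not shown in the original; this version is an assumption that
# fails for n < 1, as the comments below imply
procedure fact(n)
    if n < 1 then fail
    if n = 1 then return 1
    return n * fact(n - 1)
end

procedure main()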
    write(fact(10))
    # prints 3628800

    write(fact(-10 to 10))
    # tries fact(-10), fact(-9) ... fact(10) until one of them succeeds
    # the first to succeed will be fact(1) = 1, so it prints 1

    every write(fact(-10 to 10))
    # calls all of fact(-10), fact(-9) ... fact(10)
    # prints n! for 1 <= n <= 10 (all n in the range that succeed)

    write(10000 < fact(1 to 100))
    # the < operator fails if the comparison does not hold
    # if the < operator succeeds, it returns the right hand side
    # therefore, this writes 40320, which is the first factorial number
    # to exceed 10000

    if fact(n := 1 to 100) = 3628800 then
        write("n! = 3628800 for n = ", n)
    # tests that fact(n) = 3628800 for n ranging from 1 to 100
    # it keeps trying until the comparison succeeds or all possibilities
    # are exhausted

    L := 20
    every (i := 1 to L)^2 + (j := i to L)^2 = (k := 1 to L)^2 do
        write(i, "^2 + ", j, "^2 = ", k, "^2")
    # prints all (i, j, k) triplets such that i*i + j*j = k*k
    # and i, j, k <= L

end

Essentially, Icon is based on the concept of generators ("1 to 100" is a generator that produces the numbers from 1 to 100). A function can either fail or generate one or more values. An expression such as "f(g(x))" is evaluated as follows: first, g(x) is evaluated. If it fails, the whole expression fails. If it produces a value, f is called on that value. If f succeeds, its return value is the value of the expression. Otherwise, the failure of f causes g(x) to be resumed for its next value. If there are no more values, the expression fails. Otherwise, f is called on the new value, and so on, until it succeeds or exhausts all of its possible arguments. If a function takes several arguments, it will consume all possible combinations of values: for each value of the first argument's generator, the second argument's generator is run afresh, and so forth (so the call f(g(x), h(y)) might actually call h more than once).
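The evaluation rule above maps fairly directly onto Python generators. What follows is only a rough sketch and not Icon itself: to, fact, call, greater_than and first are hypothetical helpers, and fact is assumed to fail for n < 1, as the comments in the example imply.

```python
def to(lo, hi):
    """Icon's `lo to hi`: generate lo, lo+1, ..., hi."""
    yield from range(lo, hi + 1)

def fact(n):
    """Generate n! once, or generate nothing (that is, fail) when n < 1."""
    if n < 1:
        return
    result = 1
    for k in range(2, n + 1):
        result *= k
    yield result

def call(f, args):
    """f(g(x)): feed f each value of `args` in turn, passing on every success."""
    for a in args:
        yield from f(a)

def greater_than(limit, values):
    """`limit < values`: succeed with the right-hand side when it exceeds limit."""
    for v in values:
        if limit < v:
            yield v

def first(results):
    """Plain evaluation stops at the first success; None stands for failure."""
    return next(results, None)

print(first(call(fact, to(-10, 10))))                       # write(fact(-10 to 10))
print(list(call(fact, to(-10, 10))))                        # every write(fact(-10 to 10))
print(first(greater_than(10000, call(fact, to(1, 100)))))   # write(10000 < fact(1 to 100))
```

Because generators are lazy, the last line never computes more factorials than it needs, which is the same reason the Icon version stops at the first value above 10000.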

All control structures are predicated on this idea. The if statement, for instance, rather than working on boolean values, executes the then branch if the conditional *succeeds*, and the else branch if it fails, using the evaluation method described above. The same goes for the while statement. The every statement, as can be seen in the example, exhausts all possibilities, and is more or less equivalent to a function that always fails (because failing causes the generators to keep churning values). It is important to note that the backtracking is done on a per-statement basis. For instance, the following code prints 10:

i := 10 to 100
every i do write(i)

Indeed, the first statement succeeds right away and assigns 10 to i. Then, the every statement operates on i, which is now a plain value.

Icon has many functions that are generators. The example given in all tutorials is the find function: find(substring, string) generates the indexes of all occurrences of substring in string. If one wanted to find the first occurrence of substring in string at a position greater than n, in most languages, this would require passing an extra argument to find. In Icon, however, one can simply write "n < find(substring, string)". For every occurrence whose index is not greater than n, the comparison fails, prompting find to yield the next occurrence, and so on, until the comparison succeeds or no more occurrences are found. If the comparison succeeds, it returns its right hand argument, so it is important to put the call to find on the right. Note that this policy allows one to chain comparisons: "a < b < c" is "(a < b) < c", and "a < b" returns b if it succeeds, so that we can evaluate "b < c" afterwards. If it fails, of course, the whole expression fails.
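To make the "n < find(...)" idiom concrete, here is a small sketch with Python generators. The names find and less_than are stand-ins written for this illustration; they only mimic the Icon behavior described above.

```python
def find(sub, s):
    """Generate the 1-based index of every occurrence of sub in s."""
    i = s.find(sub)
    while i != -1:
        yield i + 1
        i = s.find(sub, i + 1)

def less_than(n, values):
    """`n < values`: pass on each value that exceeds n (the right-hand side)."""
    return (v for v in values if n < v)

text = "The vegetable soup"
print(list(find("e", text)))                      # every occurrence of "e"
print(next(less_than(6, find("e", text)), None))  # first occurrence after position 6
```

The comparison never needs to know anything about find: it simply keeps asking for the next candidate until one is acceptable.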

Icon provides ways to create and compose generators. A function may use the statement "suspend x" to produce the value x. If that value is deemed unsuitable, execution resumes right after that point so that the function can produce another value. (a | b) creates a generator that generates the values of a, then those of b. (|a) generates a over and over again. It is also possible to control backtracking to some extent: for instance, ((1 to 100) \ 20) will only produce 20 values (the inner parentheses matter, since \ binds more tightly than to).
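These building blocks have close analogues in Python, which may help readers who know generators from there. This is a sketch under that analogy only; squares, alternation, repeated and limit are hypothetical names.

```python
from itertools import chain, islice

def to(lo, hi):
    yield from range(lo, hi + 1)

def squares(limit_):
    # `suspend x` corresponds to `yield x`: hand out a value, and pick up
    # right here if the caller asks for another one.
    for n in range(1, limit_ + 1):
        yield n * n

def alternation(a, b):
    # (a | b): every value of a, then every value of b
    return chain(a, b)

def repeated(make):
    # (|a): restart a whenever it runs out; stop if a fresh run yields nothing
    while True:
        produced = False
        for v in make():
            produced = True
            yield v
        if not produced:
            return

def limit(gen, n):
    # (gen \ n): at most n values
    return islice(gen, n)

print(list(limit(to(1, 100), 20)))
print(list(alternation(to(1, 2), to(5, 6))))
print(list(limit(repeated(lambda: to(1, 3)), 7)))
```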

Goal-directed evaluation is a rather unusual feature that allows for code that can be extremely concise. Many problems which would require elaborate nested loops in most programming languages can be written on a single line, because the looping is made implicit by the evaluation policy. Simply consider the example above that computes combinations of integers i, j and k such that i^2 + j^2 = k^2. In most languages, this would require three loops, so there is a big gain here.
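For comparison, this is roughly what the same search looks like with explicit loops. It is a plain Python rendering written for this comparison; the function name triples is made up.

```python
def triples(limit):
    """All (i, j, k) with i <= j, all three <= limit, and i*i + j*j == k*k."""
    found = []
    for i in range(1, limit + 1):
        for j in range(i, limit + 1):
            for k in range(1, limit + 1):
                if i * i + j * j == k * k:
                    found.append((i, j, k))
    return found

for i, j, k in triples(20):
    print(f"{i}^2 + {j}^2 = {k}^2")
```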

This model can however lead to some unexpected behavior. Consider the following code:

s := "hello"
s := f(x)
write(s)

If the call to f(x) fails, the assignment fails, and the code prints "hello". That seems like bug-prone behavior: if one assumes that the call to f(x) will succeed, but it does not, the execution will keep going and erroneous behavior might ensue later on. It might be difficult to track the source of the error in that case. This might not occur all that often in practice, especially since seasoned users will watch out for such circumstances, but it still seems problematic. I would prefer it if all statements had to succeed, so that "a := f(x)" would raise an error if f(x) fails. In order to have the current behavior, one might write "maybe a := f(x)", meaning that the statement might fail and do nothing. I'd be interested in knowing if people fluent in Icon think it is a good idea or not.
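The difference between the two policies can be sketched in Python, with failure modeled as a generator that yields nothing. Everything here (Failure, must, maybe, and the stand-in f) is hypothetical and only illustrates the proposal; it is not how Icon behaves.

```python
class Failure(Exception):
    """Raised when an expression that was required to succeed produces no value."""

def must(results):
    """Proposed default: take the first value, or raise if there is none."""
    for value in results:
        return value
    raise Failure("expression failed")

def maybe(results, current):
    """Proposed `maybe a := f(x)`: keep the current value when f(x) fails."""
    for value in results:
        return value
    return current

def f(x):
    """Stand-in generator that fails for negative arguments."""
    if x >= 0:
        yield x * 2

s = "hello"
s = maybe(f(-1), s)   # f fails, so s is silently left alone
print(s)
try:
    s = must(f(-1))   # the stricter policy reports the failure on the spot
except Failure as err:
    print("error:", err)
```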

String scanning

Icon sports "string scanning" functionality used as follows:

procedure replace_all_e_by_a(s)
    local p   # named p rather than pos, so Icon's built-in pos() stays untouched
    s ? {
        # s is the "subject" of the body
        # it is akin to an implicit argument
        every p := find("e") do {
            writes(tab(p)) # write up to the e, no newline
            move(1)          # skip the e
            writes("a")
        }
        write(tab(0)) # write rest of string
    }
end

procedure main()
    replace_all_e_by_a("The vegetable soup is the best")
    # -> "Tha vagatabla soup is tha bast"
end

Basically, "text ? expression" creates an environment consisting of a subject, which is the string we are processing, and a position, which is an index into the subject string. Functions called within the environment have knowledge of the subject and position, meaning that these parameters can be omitted. The move(i) function increments or decrements the position by i and returns whatever is between the old and new positions. The tab(i) function moves to position i within the subject, again returning what's between the old and new positions. Positions lie between characters and run from 1 (before the first character) to length + 1 (after the last one); they can also be counted from the right, 0 being the position after the last character, -1 the position just before it, etc. Thus tab(0) moves to the end of the subject. The find function can be called with a single argument and will produce the indexes of the occurrences. So tab(find("world")) will move to the position of the first occurrence of "world" in the subject string. Braces {} can be used to group several statements to execute on the subject.
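A scanning environment is essentially a small object holding a subject and a position. The sketch below imitates it in Python under simplifying assumptions: the Scan class and its methods are made up for illustration, and unlike Icon, nothing here fails when a position is out of range.

```python
class Scan:
    """A subject string plus a 1-based position lying between characters."""

    def __init__(self, subject):
        self.subject = subject
        self.pos = 1

    def _between(self, old, new):
        lo, hi = sorted((old, new))
        return self.subject[lo - 1:hi - 1]

    def tab(self, i):
        """Move to position i (0 or negative counts from the right end)."""
        new = i if i > 0 else len(self.subject) + 1 + i
        old, self.pos = self.pos, new
        return self._between(old, new)

    def move(self, n):
        """Shift the position by n characters."""
        old, self.pos = self.pos, self.pos + n
        return self._between(old, self.pos)

    def find(self, sub):
        """Generate the positions of sub, starting from the current position."""
        i = self.subject.find(sub, self.pos - 1)
        while i != -1:
            yield i + 1
            i = self.subject.find(sub, i + 1)

def replace_all_e_by_a(s):
    scan, out = Scan(s), []
    for p in scan.find("e"):
        out.append(scan.tab(p))   # text up to the e
        scan.move(1)              # skip the e
        out.append("a")
    out.append(scan.tab(0))       # rest of the string
    return "".join(out)

print(replace_all_e_by_a("The vegetable soup is the best"))
```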

String scanning is a pretty nifty feature, and I think it would be interesting to generalize it to other types of sequences such as arrays. I will note that it is fairly similar to the "with" statement found in some other languages, though these are usually object oriented languages. Food for thought.

16 Nov 2011 (4 comments)

Syntax

This post will detail two things. First, it will describe the syntax. Second, it will describe the canonical representation (AST) that the syntax maps to.

Tokens

All Unicode characters are split into four groups: symbol characters (abcdef...), operator characters (+-*/...), special characters (()[]{}"«»'⏏) and illegal characters (everything else). The following tokens are defined:

  • Identifier: any sequence of symbol characters where the first character is not a digit. Examples: abc, Zaza, λ, Ωmega, mélimélo, a†
  • Numeral: any sequence of digits and underscores (first character must be a digit), potentially followed by a dot and/or the character e or E followed by more digits. Alternatively, digits followed by the character r or R followed by a number in that base. Examples: 0, 1, 100000, 100_000, 3.14159, 16rDEADBEEF, 2r101011, 013 (this is NOT base 8), 8r13 (this IS base 8).
  • Character: the character ' followed by any character encodes a string of that character. Examples: 'a, 'b, '*, '←, '', '", '(
  • String: any sequence of characters delimited by "", or delimited by «». The latter can nest, and ⏏ may be used as an escape character. Examples: "hello", «world», «he was «fine»», «wfωγφεhe↑←"≥X!≠», "⏏""
  • String interpolation: $identifier, $(expression), $identifier.identifier[expression], etc. represent expressions to be evaluated and inserted in the string that contains them. Examples: "your name is $name", "a + b = $(a + b)", "your name is $user.name", "your name is $database[user].name". Note that the $(...) syntax stops at the closing parenthesis, so anything after that is interpreted as plain text.
  • Operator: any sequence of operator characters. Examples: +, *, /, //, ←, #, &+&, \, %←\^^^
  • Brackets: (, ), [, ], {, }
  • Sequence separator: the comma (,) serves as a sequence separator inside any type of bracket. Line breaks are almost always interpreted as commas, except when a line continuator (\) or a line-continuing operator (: and ~) ends the line or starts the next one. Note that line breaks are interpreted as commas even inside parentheses.
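As an illustration of the numeral rules listed above, here is a small Python sketch. The function parse_numeral is hypothetical, and it assumes two things the post does not spell out: underscores are simply ignored, and digits beyond 9 in a radix numeral are the letters A to Z, as in Python's int().

```python
import re

def parse_numeral(text):
    """Turn a numeral token into a Python number."""
    digits = text.replace("_", "")
    radix = re.fullmatch(r"(\d+)[rR]([0-9A-Za-z]+)", digits)
    if radix:
        # <base>r<digits>, e.g. 16rDEADBEEF or 2r101011
        return int(radix.group(2), int(radix.group(1)))
    if any(ch in digits for ch in ".eE"):
        return float(digits)
    # A leading zero has no special meaning: 013 is thirteen, not octal.
    return int(digits)

for token in ["100_000", "3.14159", "16rDEADBEEF", "2r101011", "013", "8r13"]:
    print(token, "->", parse_numeral(token))
```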

Syntax

Quaint's syntax is basically an operator syntax. For each ordered pair of operators (op1, op2), a priority is defined: either op1 has higher priority, op2 has higher priority, or neither operator has priority over the other. Concretely, given the expression "a + b * c", we look up the (+, *) entry. If the entry is -1, b is subsumed by +, giving "(a + b) * c"; if it is 1, b is subsumed by *, giving "a + (b * c)"; if the entry is 0, a compilation error is raised. This is much more fine-grained than simple numerical priority: indeed, we can define a partial order on operators with respect to either left-priority or right-priority. A left-associative operator is simply one such that the (op, op) entry in the database yields -1 (1 gives right-associativity and 0 non-associativity).
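The lookup described above can be sketched as a small shift-reduce loop in Python. This is only an illustration for binary operators on a flat token list, not Quaint's actual parser; parse and the sample table are made up, and a missing table entry is treated as 0.

```python
def parse(tokens, priority):
    """Parse [operand, op, operand, op, ...] using a pairwise priority table.

    priority[(left_op, right_op)] is -1 if the operand between the two
    operators belongs to left_op and 1 if it belongs to right_op.
    A missing entry counts as 0, which is reported as an error.
    """
    operands, operators = [tokens[0]], []

    def reduce():
        right, left = operands.pop(), operands.pop()
        operands.append((operators.pop(), left, right))

    for op, operand in zip(tokens[1::2], tokens[2::2]):
        while operators:
            order = priority.get((operators[-1], op), 0)
            if order == 0:
                raise SyntaxError(
                    f"no priority defined between {operators[-1]} and {op}")
            if order == 1:
                break      # the newer operator takes the operand
            reduce()       # the older operator takes it first
        operators.append(op)
        operands.append(operand)
    while operators:
        reduce()
    return operands[0]

table = {("+", "+"): -1, ("*", "*"): -1, ("+", "*"): 1, ("*", "+"): -1}
print(parse(["a", "+", "b", "*", "c"], table))
```

With this table, an expression such as "1 .. a + b" is rejected, because no entry exists for the pair (.., +).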

The following figure gives an idea of how priority is defined in the language in practice (this is only a partial diagram omitting several operators, and forgive the poorly aligned arrows):

Priority decreases from top to bottom, so an operator op1 has priority over another operator op2 (left or right) if there is a directed path from op2 to op1 in the graph (if there is no path, no priority is defined). Within the same box, operators have either left-associativity (blue box), right-associativity (orange box), or non-associativity (grey box). Operators in black are binary, operators in red are prefix, and operators in blue are postfix.

I believe that a partial order is the right way to define operator priority, since it enforces readability without limiting the range of possible operators. For instance, the range operator (..) has no priority with arithmetic operators, so "1 .. a + b" is a syntax error. Supposing that we define operations on lists as follows: "[1, 2] + [3, 4] == [4, 6]" and "[1, 2] ++ [3, 4] == [1, 2, 3, 4]", an expression such as "[1, 2] + [3, 4] ++ [5, 6]" is kind of ambiguous. The syntax I have defined, along with the priority graph above, simply says that such an expression is a syntax error. Similarly, priority between union (∪) and intersection (∩) is not necessarily clear (I think it is usual to give intersection higher priority, but would most people feel confident that's how it works?), so none is defined. Custom operators are incomparable with almost all other operators (as well as with each other), with the exception of assignment and juxtaposition. This is practical and easy to remember.

Example

Here is a tentative code sample for Quaint:

def fib[x]: (
    if (x =< 2):
        1
    ~else:
        fib[x-1] + fib[x-2]
)

def main[n]: (
    fibn = fib[int[n]]
    <> "The $(n)th fibonacci number is: $(fibn)"
)

Tokenizing is relatively straightforward. To parse the code, first add a comma at the end of every line except for lines ending with : and lines where the next line starts with ~. Commas that are right after ( or right before ) can be removed, and consecutive commas reduce to just one comma. Once that is done, use the priority graph to parse the expression. I will only give a few tidbits in order to give you an idea:

  • The program is parsed something like ((def _ (fib _ [x]) : ...), (def _ (main _ [n]) : ...)), where _ is the binary juxtaposition or whitespace operator; there are implicit parentheses around the whole expression.
  • The if statement is parsed something like ((if ... : ...) ~ (else : ...)).
  • The parentheses around the definition of fib are not optional. Indeed, the precedence entry for (:, : ) is 0, and that means an expression of the form (a : b : c) is illegal. Here this means the parser cannot determine if you mean (def thing : (if cond : ...)), or ((def thing : if cond) : ...). It would be possible to define a precedence (right-associative would work nicely), but this makes priority for the ~ operator hairy to define (I consider this operator to be pivotal, since it is what allows us to attach elseif/else to if, except/finally to try, etc. without hard-coding them).
  • Again, parentheses are needed around (x ≤ 2), or else Quaint complains that it cannot tell what the priority is between ≤ and :. That is, the entry in the priority table for (≤, : ) is 0. It would not complain if ≤ was after :, because the entry for (:, ≤) is 1. Priority in Quaint is not necessarily symmetric, though I try to make it as intuitive as possible.
    • I also made an effort to make undefined priority errors top-notch: Quaint highlights both operators in different colors and explains that it can't tell priority between them and that the user should add parentheses. In the short term, I plan to also highlight the code between the operators and help the user resolve the ambiguity. Then, I plan on analyzing the code to figure out what the user meant and suggest something sensible, with a short justification. For instance, if the user enters "a < b < c", we can say that "there is no priority between < and < and we think you meant to write 'a < b ∧ b < c': indeed, a < b < c means (a < b) < c, and (a < b) is a boolean value which can't be compared to anything. You probably meant to compare a with b, and then compare b with c." I am very confident that I can produce extremely useful error reports about syntax errors (the current ones are already 100x better than those of most other languages).

In the end, this pass produces a binary tree, except for sequences that can have an arbitrary number of children. Prefix operators can be interpreted as binary operators where the second argument is some special "null" object, ditto for postfix operators except with the first argument.
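The comma-insertion step that precedes parsing can be sketched with a few lines of Python. The function insert_commas is hypothetical, and it makes two simplifications of its own: blank lines are dropped, and the very last line gets no trailing comma.

```python
import re

def insert_commas(source):
    """Make the implicit line-break commas explicit."""
    lines = [ln.rstrip() for ln in source.splitlines() if ln.strip()]
    out = []
    for i, line in enumerate(lines):
        last = i == len(lines) - 1
        next_line = "" if last else lines[i + 1].lstrip()
        if last or line.endswith(":") or next_line.startswith("~"):
            out.append(line)          # the line continues, or nothing follows
        else:
            out.append(line + ",")
    text = "\n".join(out)
    text = re.sub(r",(\s*,)+", ",", text)    # consecutive commas collapse
    text = re.sub(r"\(,", "(", text)         # no comma right after (
    text = re.sub(r",(\s*)\)", r"\1)", text) # no comma right before )
    return text

sample = """def fib[x]: (
    if (x =< 2):
        1
    ~else:
        fib[x-1] + fib[x-2]
)
def main[n]: (
    fibn = fib[int[n]]
    <> "The $(n)th fibonacci number is: $(fibn)"
)"""
print(insert_commas(sample))
```

On the sample program, the only commas that survive are the one separating the two definitions and the one between the two statements of main.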

Canonical representation

Although the AST produced previously is in fact rather simple, the canonical representation of code will be different and will contain some isomorphisms. While this removes some information, I believe that it simplifies processing. Macros will receive the canonical representation rather than the raw AST.

The general idea is that "a OP b" is always equivalent to "(OP){a, b}", except of course for the juxtaposition operator (else I would be defining an infinite recursion). "OP a" is "(OP){∅, a}" and "a OP" is "(OP){a, ∅}", where ∅ is a special object. This reduces all operators to plain application on a code object. The AST is then mapped to sexps as follows:

  • a b → (apply a b)
  • (a) → a
  • (a, b, ...) → (begin a b ...)
  • [a, b, ...] → (table a b ...)
  • {a, b, ...} → (quote a b ...)
  • x → (symbol x)
  • 12.3 → (value {numeral base:10, mantissa:123, exponent: 2})
  • "abcdef" → (value "abcdef")

For instance, "f[x, y]" becomes "(apply (symbol f) (table (symbol x) (symbol y)))", and "a + b" becomes "(apply (symbol +) (quote (symbol a) (symbol b)))". Properly speaking, it is not exactly sexps that are produced, because useful metadata must also be attached to the code. Therefore, for each sexp there is precise location information and a unique "nesting identifier" representing the scope it is supposed to be evaluated in. For example, if the file "test.q" contains "a ++ [b, c]", this is mapped to (app (sym ++) (quote (sym a) (table (sym b) (sym c)))), abbreviating apply and symbol, and the nesting identifier of b is ("test.q" 2 2 1), meaning "go in test.q, parse, go in the second argument of app, then in the second argument of quote, then in the first argument of table". For convenience, there may also be backpointers, but since the nesting identifier already determines the enclosing expression, they are redundant.
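The mapping rules can be written down as a short recursive function. This is a sketch only: the tuple-based AST encoding (name, string, juxt, binop, round, square, curly) is invented for the example, and prefix/postfix operators, numerals and metadata are left out.

```python
def canon(node):
    """Map a tiny AST to the sexp form described in the list above."""
    kind = node[0]
    if kind == "name":                      # x -> (symbol x)
        return ("symbol", node[1])
    if kind == "string":                    # "abc" -> (value "abc")
        return ("value", node[1])
    if kind == "juxt":                      # a b -> (apply a b)
        return ("apply", canon(node[1]), canon(node[2]))
    if kind == "binop":                     # a OP b -> (OP){a, b}
        return ("apply", ("symbol", node[1]),
                ("quote", canon(node[2]), canon(node[3])))
    items = [canon(child) for child in node[1:]]
    if kind == "round":                     # (a) -> a ; (a, b) -> (begin a b)
        return items[0] if len(items) == 1 else ("begin", *items)
    if kind == "square":                    # [a, b] -> (table a b)
        return ("table", *items)
    if kind == "curly":                     # {a, b} -> (quote a b)
        return ("quote", *items)
    raise ValueError(f"unknown node kind: {kind}")

def show(sexp):
    """Render nested tuples in the parenthesized notation used in the text."""
    if isinstance(sexp, tuple):
        return "(" + " ".join(show(s) for s in sexp) + ")"
    return str(sexp)

call = ("juxt", ("name", "f"), ("square", ("name", "x"), ("name", "y")))
plus = ("binop", "+", ("name", "a"), ("name", "b"))
print(show(canon(call)))
print(show(canon(plus)))
```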

As you can see, all source code maps to nested sexps where each expression is one of the core instructions apply, begin, table, quote, symbol and value. Note that source code cannot directly produce all core instructions. For instance, direct production of declare and instantiate is not possible, nor is variable declaration. However, macros may insert them. Therefore, let's see how eval is implemented on the canonical representation.

Evaluation loop

Evaluation works in two phases: first hyper-unquoting, then an interleaving of unquoting and evaluation steps. The idea is that hyper-unquoting slams through quote nodes and "pulls" expressions outside of quote. Technically, this makes all quote statements into "quasiquote" statements, but this is more pervasive, because unquoting in Quaint is legal outside of quasiquote. Essentially, if you unquote in the body of Quaint source outside of a code block, you specify that some statement should be evaluated at the compilation phase and inserted in place. You can even unquote twice, to add a meta-compilation phase before compilation. Needless to say, it's an easy way to implement macros. Hyper-unquoting is the $ operator. That phase is only done once. The second phase, plain unquoting, on the other hand, does *not* go through quote, but it pulls up expressions one level anyhow. Execution is self-explanatory. Here is what this all means:

(hyper-unquote canonical-representation hyper-unquoters)

So, first, we perform hyper-unquoting: $(...) in code (and in strings) is substituted by a ref node and a fill block is created outside of the quote block. Let me put this clearly (the quotes do NOT mean these are strings!): "{a, $(b), c}" is parsed as "(quote a (apply $ b) c)", and this is then "pulled" into "(fill (quote a (ref 1) c) b)". Unquoting can be done many times. For instance, "{a, $$(b), c}" becomes "(fill (fill (quote (quote a (ref 1) c) (ref 1))) b)". The semantics of fill are to replace each reference with the value computed from the corresponding argument. I might add $! and $* to respectively wrap the value in a (value X) node, and flatten the value into the enclosing sexp. The hyper-unquoters here would be functions that match nodes, and if a match is found, they produce a substitution containing ref and a corresponding argument to fill.
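The single-level case can be sketched as follows, with sexps as nested Python tuples and atoms as plain strings. Both hyper_unquote and fill are illustrative guesses at the behavior described above; the sketch handles only one $ per expression and does not attempt the nested $$ case.

```python
def hyper_unquote(node):
    """Pull every (apply $ expr) out of a quote node into a fill node."""
    if not (isinstance(node, tuple) and node and node[0] == "quote"):
        return node
    pulled = []

    def walk(n):
        if isinstance(n, tuple):
            if len(n) == 3 and n[:2] == ("apply", "$"):
                pulled.append(n[2])
                return ("ref", len(pulled))    # refs are numbered from 1
            return tuple(walk(part) for part in n)
        return n

    body = walk(node)
    return ("fill", body, *pulled) if pulled else node

def fill(body, *values):
    """Replace each (ref i) in body by the i-th computed value."""
    if isinstance(body, tuple):
        if len(body) == 2 and body[0] == "ref":
            return values[body[1] - 1]
        return tuple(fill(part, *values) for part in body)
    return body

code = ("quote", "a", ("apply", "$", "b"), "c")     # {a, $(b), c}
pulled_out = hyper_unquote(code)
print(pulled_out)
print(fill(pulled_out[1], "VALUE-OF-b"))
```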

(unquote canonical-representation unquoters)

This works more or less like hyper-unquote. The only differences are that unquote does not match through quote and that different behavior is associated with each unquoter. Typically, the unquoters are actually hard-defined, leading to a call that's roughly the following: (unquote canonical-representation ((: macro-call) (~ macro-agglutinate) (= declare))). An example should clarify things:

"if a: b" maps to (apply : (quote (apply if a) b)), and then the (:) unquoter produces the expression: (fill (quote (ref 1)) (apply if (quote a b))).

(eval-step canonical-representation environment)

Now, eval-step evaluates the result. In the previous example, (apply if (quote a b)) will be evaluated in order to produce something like (if a b) or (cond a (#t b)), a core primitive that the evaluator can work with. The environment is a core environment that defines such primitives as "if". Then, "fill" (another primitive!) is called, puts (if a b) instead of ref, and since that reference is all there is, it returns (if a b).

Now, unquote will be called again on the result. If there is something else to unquote (which may happen if, for instance, b is an application of : ), a new fill statement will be produced, and eval-step will evaluate that again. If, however, unquote finds nothing, then it is a no-op. eval-step will then perform the evaluation. Note that there is a small catch: the unquote and evaluation steps will loop right until the code returned is of the form (value X), because neither unquote nor eval-step may go inside value nodes. It is therefore important to make sure that the evaluation terminates by using correct expansions.

Note that unquote/eval-step is a sort of macro system. However, I find it more elegant than the macro system found in Scheme. Even if we standardize unquoters, there isn't much power loss. For instance, one could easily define "unless cond: body" by doing $(def unless{cond, body}: ...). This ensures that the unless function is defined in the compilation scope, and when the (:) unquoter is triggered, it will try to evaluate unless in that scope. Similarly, one can do $(import: unless) to import that macro in another file for use in the compilation phase (note: there is no global scope in any phase). I believe that "macro args: body ~modifier args: body ..." is a great general-purpose macro syntax (okay, it looks crappy as a one-liner, but what doesn't), and that requiring that macros be defined as normal functions inside $(...) clarifies their syntactic meaning and helps segregate compilation from execution.

Evaluation proper

If you will forgive me, I will talk about this in the next post. This one is getting pretty long, and this is a huge topic in itself.

 

8 Nov 2011 (0 comments)

LR – C

Yeah! C!

What is C?

C was developed at Bell Labs at the beginning of the 70s by Dennis Ritchie (rest in peace) while working on Unix. It is now one of the most widely used languages and has a compiler targeting almost every single hardware architecture. Partly for this reason, a lot of languages compile to C as an intermediate language.

C is a low level programming language, or a high level assembly language, depending on how you look at it. It is imperative, statically typed, and lacks many facilities found in other programming languages, such as closures and automatic garbage collection. Having so little runtime machinery makes it particularly suited to systems that require peak performance and responsiveness (such as operating systems or device drivers).

Many languages were derived from or inspired by C: C++ and Objective-C were made in order to extend C to the object oriented paradigm, albeit using different approaches. Languages such as Java or C#, for their part, took heavy inspiration from C's syntax in order to look more familiar to the demographic they were aiming to conquer. Overall, C had a very deep impact on the current programming language landscape and to some extent, it molded expectations for new languages.

Pointers

One of the basic concepts found in the C language is the pointer. A pointer to type T is essentially an integer holding the address of some object of type T in virtual memory. A pointer variable can be declared by prefixing the variable name with *, the address of a variable can be obtained by prefixing it with &, and the value located at a pointer's address can be obtained by prefixing it with * where it is used (pointer indirection).
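The three operations can be tried out from Python through the standard ctypes module, which exposes C-style pointers. This is just an illustration of the concepts (the comments give the C spelling); it is not C code.

```python
import ctypes

x = ctypes.c_int(5)            # int x = 5;
p = ctypes.pointer(x)          # int *p = &x;
print(p.contents.value)        # *p
p.contents.value = 7           # *p = 7;
print(x.value)                 # x has changed, since p points at it

# Underneath, the pointer is just an address, that is, an integer.
address = ctypes.cast(p, ctypes.c_void_p).value
print(address == ctypes.addressof(x))
```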

These tools essentially allow one to look at and modify the program's data without restriction and without security, for the most part. Indeed, C does not formally verify that a pointer "points" to a valid area of memory, nor that it is pointing to an object of the proper type. Trying to get data from a pointer to invalid memory (chiefly at address 0) typically leads to the infamous "segmentation fault", whereas trying to get data from a pointer to the wrong object likely leads to some undefined, wacky behavior.

In a sense, the concept of pointer undoes a lot of strong typing's advantages: while most type systems try to guarantee that type errors will never occur, C's typing system merely checks that the programmer's declarations are consistent and that the variables of the right types are used at the right places. But since a pointer of type T is essentially an arbitrary integer with the semantics that whatever is in memory at that byte offset should be interpreted as some value of type T (even though the memory space contains, well, everything), the correctness of the program hinges on not making mistakes in the manipulation of these integers. That is not verified by the compiler (and is not really possible to verify either), and the consequences of mistakes are unforgiving. The worst consequences are the so-called buffer overflows, where an attacker abuses knowledge of the system in order to make it write at memory positions it should not ever write to.

While they provide an incredible amount of flexibility (or rope to hang oneself with), pointers are not much more useful than simple references, that is, pointers without pointer arithmetic. The main uses of pointer arithmetic, manipulating strings and other sequences, are better addressed by more structured and more sanely bounded means of sequence building and indexing. C's conceptualization of pointers also prevents possibly useful optimizations, if they involve modifying the layout of structures or moving them around in memory. Except possibly in very particular cases, the concept of pointer arithmetic seems obsolete for general application development.

Explicit memory management

In C, memory has to be managed by the programmer. In a sense, that might seem obvious, but in reality a lot of languages help with this. In C, a data structure may either be allocated on the stack, in which case it will be reclaimed at the end of the block in which it is visible; or it may be allocated on the heap, usually with a call to the standard library function malloc (which returns a pointer to void corresponding to the address of the beginning of the allocated block), in which case it will stay there until the programmer deallocates it with a call to free.

Obviously, allocation without deallocation, over the course of a program, can lead to memory usage inflating until it bursts (i.e. a memory leak). Since allocation on the heap is ubiquitous, especially in the context of complex structures, memory leaks are a major problem. Most high level languages offer facilities to reclaim memory when it can be proven that there is no way to access it by following pointers or references from the data the program has immediate access to, a process called garbage collection. The lack of such a facility in C means that on one hand, the overhead of garbage collection is avoided, but on the other hand, a greater number of errors happen.

Static typing

A language offers static typing if for every variable the programmer may provide a "type annotation" forcing it to be of a certain precise type, be it an integer, a float, or a structure with fields a and b. Such annotations may also be provided for the return type of a function. If we define a "type error" as being the action of putting a value of a type A in a variable of type B, where A is not equal to B, the compiler then tries to guarantee that such errors never happen.

C does not provide guarantees that are as strong as those given by a lot of other languages. For instance, C allows one to explicitly convert any type of pointer into any other type of pointer. Doing that entails that any and all type safety is thrown out the window, but on the other hand it provides some helpful flexibility (e.g. having heterogeneous arrays).
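The effect of such a cast can be observed from Python with the standard ctypes module. The sketch below reinterprets the bytes of a float as an unsigned 32-bit integer, much like `*(uint32_t *)&f` would in C; the printed value assumes IEEE 754 single-precision floats, which is what mainstream platforms use.

```python
import ctypes

f = ctypes.c_float(1.0)
float_ptr = ctypes.pointer(f)

# Convert "pointer to float" into "pointer to unsigned 32-bit integer".
# Nothing checks that the memory really holds an integer.
int_ptr = ctypes.cast(float_ptr, ctypes.POINTER(ctypes.c_uint32))

print(hex(int_ptr.contents.value))   # the raw bit pattern of 1.0f
```

The same four bytes are read back under a different type, and neither C nor ctypes objects to it.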

Braces and semicolons

C's syntax requires semicolons to be appended to all statements in order to terminate them, and uses braces {} to delimit code blocks for function definitions, if statements, for statements, and so on. Thanks to C's influence, many languages have adopted similar conventions: Java, C#, JavaScript, and many others use very similar syntax. Honestly, I dislike that syntax. Braces fill a role that's semantically very similar to that of parentheses (grouping), and since most statements do not span multiple lines, it is silly to require punctuation to terminate a statement, rather than requiring it to continue one on the next line.

Macros

The C preprocessor's macro system is extremely basic and of moderate power. It is essentially text substitution: #define A B replaces all further occurrences of A by B, and #define A(x) B replaces all occurrences of A(...) by B, in which every x is replaced by whatever the ... stands for. The preprocessor is somewhat tricky to use, since for instance "#define A x + y", when used in the expression "A * z", expands to "x + y * z", that is "x + (y * z)", which is probably not what we intended. It is therefore better to wrap the definition itself in parentheses, i.e. "#define A (x + y)". Other preprocessor directives are "#include file", which basically pastes the contents of a file verbatim in its place, and #ifdef/#endif, which can be used for conditional compilation, checking if some symbols are defined or not and swapping code accordingly.
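The precedence pitfall is easy to reproduce with a toy text substituter. The sketch below is in Python and only imitates object-like macros; expand is a made-up helper, not the real preprocessor.

```python
import re

def expand(source, macros):
    """Replace each whole-word macro name by its body, as plain text."""
    for name, body in macros.items():
        pattern = rf"\b{re.escape(name)}\b"
        source = re.sub(pattern, lambda match, body=body: body, source)
    return source

values = {"x": 1, "y": 2, "z": 3}

naive = expand("A * z", {"A": "x + y"})      # like: #define A x + y
safe = expand("A * z", {"A": "(x + y)"})     # like: #define A (x + y)

print(naive, "=", eval(naive, {}, values))
print(safe, "=", eval(safe, {}, values))
```

The substituter knows nothing about expressions, so the multiplication binds to y alone unless the macro body carries its own parentheses.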

Overall, the system has its uses, but it is vastly inferior to a "true" macro system such as Lisp's or Scheme's, which allows one to manipulate an AST in a Turing-complete way and in a safer way than C allows for.

Bottom line

C is a rather ubiquitous language due to how close it is to the machine and how widely it was spread thanks to Unix. Objectively speaking, it is somewhat ill-designed: pointer arithmetic is a bug factory, syntax is mediocre, and structuring code in header, code files and #include statements is not a stellar module system. Essentially, I would say that C was born from pragmatic concerns in order to get something done (Unix), and was good at what it did, but that it aged badly, and that its extensions/successors (C++, Objective-C, etc.) did not really solve any of its core issues.

 

15 Oct 2011 (6 comments)

Unicode syntax

The nice thing about designing a language is that you can start from scratch, and do things that were either never done properly before, and/or could not be tacked onto existing languages without requiring deep changes, and/or were too marginal to gain the slightest pull. One such aspect is source code encoding and the range of characters that one allows to be found in it.

And thus, I am considering (read: my mind is almost made up on this) incorporating Unicode characters in Quaint, for use in identifiers and operators.

First, I will tell you why.

Second, I will tell you how.

Why

ASCII contains few usable characters, and even fewer adequate ones. If one wishes to give standard meanings to operators that have them, reserve some characters as prefixes (e.g. "@x" for dynamically scoped variables, "$x" for pattern matching, etc.), have set operators (membership, union, intersection, difference...), maybe a length or size operator (prefix "#" would do nicely), and so forth, one eventually runs out of ideas. For instance, what to use for the collection membership operator? Nothing in ASCII really makes any sense for that.

Some languages use keywords as operators, chiefly for the "and", "or", "not" and "in" operators. I find that somewhat bothersome, since there have been situations where I wished I could name my variables that way ("in", meant as the opposite of "out", being the worst offender).

Here are some of the issues I have with most existing languages and for which Unicode is a nice solution (I do not say it is the only solution):

  • Keywords being operators. It would be nice to be able to use them as variables.
  • The "=" operator. There is something fundamentally unclean (not to mention dangerous) about assignment being "=", and equality being "==". Using ":=" or "<-" alleviates the problem somewhat, but it would be nice to have an actual left arrow.
  • Too much overloading. Let's say I have two lists of integers. I might want to a) concatenate them, b) add them like vectors, c) get the intersection of their contents or d) do a bitwise and of their elements. In ASCII, at least one or two of these is going to be completely arbitrary.
  • Limited scalability. You can't add new operators without either making new keywords, which is disruptive, or new combinations of operator characters, which is arcane.

Making use of Unicode allows one to sidestep a lot of these issues. There are characters for "not" (¬), "and" (∧), "or" (∨), "in" (∈) and "union" (∪), which frees up a lot of ASCII characters for language design. That is why I decided to give in and figure out a way to make use of Unicode.

How

There are two problems with Unicode in source code. The first problem is to write Unicode characters: only a few are made available on common keyboard layouts. The second problem is displaying Unicode characters. While that problem has mostly subsided, it does occur sometimes, and anyone who has had the misfortune will concur that it is supremely annoying.

This means that in order to support Unicode gracefully, one must make sure that source files are readable and editable by any user that might be visiting from the 90s. That is, one must steer clear of UTF-8. That does not necessarily mean we can't use Unicode - that simply means we have to encode it differently. While Unicode is practically synonymous with UTF-8 nowadays, there is no reason we can't make up a different encoding for it.

So here goes: Quaint will use a special encoding for Unicode characters that is optimized to be human readable and writable. Most of the problems with Unicode are related to the fact that most encodings are unintelligible to humans when read as ASCII (the encoding is meant to be swept under the rug by the editor). For instance, take the following code: "a ≥ 0 ∧ a ≤ 2". If you were to encode this in UTF-8, but read it as ISO 8859-1, it would show up as "a â¥ 0 â§ a â¤ 2". That is impossible to read. In contrast, we could encode the same Unicode string as: "a >= 0 `and` a =< 2", which anybody would agree is both readable and editable.
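The garbling described above is easy to reproduce. The following is a minimal Python sketch (Python is used only because it is convenient here, it has nothing to do with Quaint itself) that encodes the example as UTF-8 and then misreads the bytes as ISO 8859-1:

```python
# Encode the example as UTF-8, then misread the bytes as ISO 8859-1 (Latin-1).
source = "a ≥ 0 ∧ a ≤ 2"
raw = source.encode("utf-8")
misread = raw.decode("latin-1")

# Each of the three operators takes three bytes in UTF-8, so the misread
# text is six characters longer than the original.
print(len(source), len(raw), len(misread))

# repr() makes the invisible control characters in the misread text visible.
print(repr(misread))

# No information is lost: reversing the two steps restores the original.
restored = misread.encode("latin-1").decode("utf-8")
print(restored == source)
```

The damage is purely a matter of presentation, which is exactly why it is so annoying: the file is intact, but a human can no longer read or edit it.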

I have designed an encoding according to the following criteria:

  • Must be ASCII.
  • Must be easy to read and easy to write.
  • Must be exhaustive and guessable - the encoding should be based on the names of the characters.
  • Must be simple. Each character should have a specific encoding, most of ASCII being encoded by itself, and the encoding should be as non-contextual as possible.
    • My encoding actually fails this requirement, as "`alpha`in`beta`" might mean "alpha∈beta" or "αinβ" depending on the surroundings. This seems relatively minor, though, and should not be a problem in practice. It could be solved with an encoding that uses different delimiters to begin and end a character.
  • Must be systematic. The encoding does not disappear inside strings. Therefore, the string literal "a `in` b" would represent the Unicode string "a ∈ b". This does exclude the keyword approach, since most of the keywords we would use are common English words (we would not want the string literal "you are in trouble" to evaluate to the Unicode string "you are ∈ trouble", however funny that might be).

A good hint was given to me by Haskell, which allows any binary function to be used as an operator: "f a b" is equivalent to "a `f` b". I figured that this was usable, but I did it on an encoding level. That is, "`lambda`" would encode the character lambda (λ).

This being said, I intended to use the left arrow as an assignment operator. "`larr`" was not acceptable, so I decided to add digraphs to the mix. A digraph is a sequence of two characters that encodes another (I decided not to have trigraphs, though <-> would have been nice). Thus, "<-" encodes the left arrow "←", "<<" encodes the opening guillemet "«", ">=" encodes greater than or equal to "≥" (note that this is exactly how most programming languages denote greater than or equal to, but the similarity stops there since Quaint encodes less than or equal to ("≤") with "=<"), and "<>" encodes the diamond "♦".
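To make the scheme concrete, here is a rough sketch of a decoder, written in Python. Everything in it is an assumption made for illustration: the names decode, NAMED and DIGRAPHS are hypothetical, the tables only contain the handful of characters mentioned in this post, and resolving the "`alpha`in`beta`" ambiguity by scanning left to right is just one possible choice, not necessarily the one Quaint will make.

```python
import re

# Hypothetical, deliberately tiny tables: only characters mentioned in the post.
NAMED = {"lambda": "λ", "in": "∈", "and": "∧", "alpha": "α", "beta": "β"}
DIGRAPHS = {"<-": "←", "<<": "«", ">=": "≥", "=<": "≤", "<>": "♦"}

# A token is either a backquoted name or one of the digraphs.
TOKEN = re.compile(r"`([a-z]+)`|(<-|<<|>=|=<|<>)")

def decode(text):
    """Translate the ASCII encoding into the Unicode text it stands for."""
    def replace(match):
        name, digraph = match.groups()
        if name is not None:
            # Unknown names are left untouched rather than rejected (an assumption).
            return NAMED.get(name, match.group(0))
        return DIGRAPHS[digraph]
    return TOKEN.sub(replace, text)

print(decode("a >= 0 `and` a =< 2"))
print(decode("x <- `lambda`"))
print(decode("`alpha`in`beta`"))
```

Because the scan is left to right, the last example decodes as "αinβ" and never as "alpha∈beta", which is one way of living with the contextual wrinkle mentioned in the criteria.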

Limitations

While all Unicode characters are acceptable in strings, they are not all acceptable in source code. First, all source code characters should be carefully hand picked so that they are all unambiguous (notwithstanding bad fonts). So while we may allow the use of Greek letters, capital Alpha looks just like capital A, meaning that it should not be allowed in identifiers, or confusion might result. All source code characters should be displayable in a fixed-width font, and should be standard enough to be available in all major Unicode fonts. The set of available characters should be large enough to be useful, but small enough to be learnable in full.

The bottom line

In the end, Quaint will support Unicode characters through the use of a special encoding (the Quaint encoding? Q-ENC?), which is both readable and writable without special editor support. This is paramount to this feature being unobtrusive, and I believe that source code encoded that way doesn't look too bad either.

 

12Oct/110

LR – Python

I have to admit I don't really know how exactly I am supposed to do these language reviews. So I will just... try.

The first language I will look at is Python, mostly because I've been programming in that language almost exclusively for the past few years. I will not repeat the information that's in Wikipedia or online tutorials; we'll see how well that works.

What is Python?

Python is one of the many "dynamic languages" that are occupying an ever growing portion of the development scene. It was developed in the late 80s by Guido van Rossum at CWI, Netherlands, as a successor to the ABC programming language in use at that institution at the time. It is in wide use as an application or web language, and in the scientific community.

Python is for the most part an imperative object-oriented language, promoting a coding style that's a mix of plain functions and classes, though functional tools are available, especially via its list comprehension feature: for instance, [x**2 for x in xs] returns a list of the squares of the elements of the xs list. Python's generators implement a limited version of coroutines, allowing for the generation of a sequence of elements to be mixed with its processing. It is a late binding language, meaning that functions and methods are defined and looked up at runtime. In general, arbitrary attributes can be set on any object. Python is dynamically typed, meaning that all variables can contain values of any type, but it is also strongly typed, meaning that values are not automatically converted from one type to another depending on usage.
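A small sketch of the features just listed may help. The names used (xs, running_total, and so on) are made up for the example, and it is written for Python 3:

```python
xs = [1, 2, 3, 4]

# List comprehension: build the list of squares in a single expression.
squares = [x**2 for x in xs]

# Generator: values are produced one at a time, as the consumer asks for them.
def running_total(values):
    total = 0
    for v in values:
        total += v
        yield total

totals = list(running_total(xs))

# Dynamic typing: the same variable may hold values of different types.
value = 42
value = "forty-two"

# Strong typing: a str and an int are not silently converted to each other.
try:
    "1" + 1
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(squares, totals, mixed_ok)
```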

Expressed in the Zen of Python, there is a certain philosophy underpinning use of the language by its aficionados: code should be simple rather than complex, readable rather than convoluted, explicit rather than implicit. For any task, "there should be one - and preferably only one - obvious way to do it."

Object system

Python's object system is class-based: one can define a class along with a set of methods, and then instantiate it by calling it as if it were a function (rather than using a "new" keyword, as most object oriented languages do). The language recognizes many special methods, all of which follow the pattern: __method__. The __init__ method is called in order to initialize the object, which usually means setting a few properties and perhaps computing some things. The __add__, __mul__, etc. methods define how the +, *, etc. operators behave when objects of that class appear on the left hand side (if this fails, Python tries __radd__, __rmul__, etc. on the right hand side object), whereas the __str__ method controls conversion to a string.
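As an illustration of these special methods, here is a minimal sketch with a hypothetical Vec2 class (the class and its behavior are invented for the example):

```python
class Vec2:
    def __init__(self, x, y):
        # Called when Vec2(x, y) is evaluated; no "new" keyword involved.
        self.x = x
        self.y = y

    def __add__(self, other):
        # Handles `self + other` when self is on the left hand side.
        if isinstance(other, Vec2):
            return Vec2(self.x + other.x, self.y + other.y)
        return NotImplemented

    def __radd__(self, other):
        # Tried for `other + self` when other does not know how to add a Vec2.
        if other == 0:
            return self
        return NotImplemented

    def __str__(self):
        return f"({self.x}, {self.y})"

v = Vec2(1, 2) + Vec2(3, 4)

# sum() starts from the integer 0, so the first addition is `0 + Vec2(1, 2)`,
# which int cannot handle and which therefore falls back to __radd__.
total = sum([Vec2(1, 2), Vec2(3, 4)])

print(str(v), str(total))
```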

Python supports multiple inheritance. When a class C "inherits" from a class B, all methods on B are also defined on C, except for those that C redefines. This allows for code reuse and conceptual extension. For instance, Dog can inherit from Animal. Multiple inheritance means that a class can have several such parent classes. A resolution order is defined for methods following a topological ordering of the inheritance graph, and it is recommended to use the super() built-in in order to call a method on the next class in that order (the parent class, in the single inheritance case).
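The resolution order can be inspected directly. The sketch below uses a small hypothetical hierarchy (Animal, Pet, Canine, Dog) in Python 3 to show how super() walks that order rather than simply jumping to "the" parent:

```python
class Animal:
    def describe(self):
        return ["Animal"]

class Pet(Animal):
    def describe(self):
        return ["Pet"] + super().describe()

class Canine(Animal):
    def describe(self):
        return ["Canine"] + super().describe()

class Dog(Pet, Canine):
    def describe(self):
        return ["Dog"] + super().describe()

# The method resolution order is stored on the class itself.
order = [cls.__name__ for cls in Dog.__mro__]

# Inside Pet.describe, super() refers to Canine when the object is a Dog,
# because Canine comes right after Pet in Dog's resolution order.
chain = Dog().describe()

print(order)
print(chain)
```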

Explicit self

Unlike most object oriented languages, in Python, the "self" or "this" object is not implicit in its methods. Instead, it is passed explicitly as their first argument. This requires attributes to be set as "self.attribute = value". In a way, it makes sense to do this in a dynamic language: indeed, since methods may be removed or added to objects or classes at runtime, if full qualification of a method name was not required inside the definitions of a method, there would be an ambiguity about any free variable in use in any method. Indeed, consider:

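# Written with Python's explicit self so that it actually runs (Python 3);
# the question is what the bare call g() would mean if self were implicit.
def g():
    print("g")

class C:
    def f(self):
        return g()    # a bare name: in Python, always the global g

c = C()
c.f()                 # prints "g"
C.g = lambda self: print("C.g")
c.f()                 # still prints "g" in Python; only c.g() prints "C.g"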

The call to g in f obviously refers to the global function g the first time it is called, but once we add a method g to C, what would it refer to? Ruby has this issue: the Ruby equivalent of this code prints "g" and then "C.g". This is immensely confusing and it could be done accidentally, in which case it would be outright dangerous. This problem does not occur in static languages since the whole set of methods is trivially known by inspecting the source, so it is not a problem to first look at the method's class for name resolution and then in the global space.

This being said, the "self" argument itself could be implicit, and/or a shortcut could be provided to refer to method attributes. For instance, Ruby makes a distinction between methods and object attributes, the latter of which must be prefixed with "@". Using lexical scope would also be an acceptable way to do this, though to me that seems more appropriate for a prototype based object system. The way Python works seems more like a leftover of a previous design or of previous implementation limitations than something that was purposefully designed that way.

Decorators

In order to streamline a common pattern in Python, decorators were added to the language. The syntax is "@decorator" before a function or a method, and the semantics are essentially the same as adding "function = decorator(function)" after the function's definition. Given what it allows one to do, it is a nifty feature, though it feels like an afterthought and is not extremely well integrated into the language. This, by the way, is how methods can be declared as class methods or static methods.
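A short sketch of that equivalence, with a hypothetical shout decorator and a hypothetical Temperature class invented for the example:

```python
def shout(func):
    # A decorator is just a function that takes a function and returns one.
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs).upper()
    return wrapper

@shout
def greet(name):
    return "hello, " + name

# The same thing, spelled out without the @ syntax.
def greet_plain(name):
    return "hello, " + name
greet_plain = shout(greet_plain)

class Temperature:
    @staticmethod
    def to_fahrenheit(celsius):
        # No self and no class argument at all.
        return celsius * 9 / 5 + 32

    @classmethod
    def name(cls):
        # Receives the class rather than an instance.
        return cls.__name__

print(greet("world"), greet_plain("world"))
print(Temperature.to_fahrenheit(100), Temperature.name())
```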

Late binding

Late binding leaves arbitrary the association between a name and the function or method it points to. In general, statically typed languages such as C, Java, OCaml, Haskell, etc. bind names at the time of compilation, so that it is known for sure before executing the program that the name "factorial" corresponds to the function "factorial" we defined in the source, and that the method "bark" of the class "Dog" will execute this particular piece of code. In order to support virtual methods in child classes, the compiler will still translate the name into an offset where to look up the method in a virtual method table rather than use the name itself. This knowledge significantly enhances performance, since function calls can be replaced by a jump to a known address, and method calls by a jump with a small additional indirection.

By contrast, in Python (and most other dynamically typed languages such as Ruby or Scheme), all global variables, including those defining functions, can be redefined at runtime. If we define a function called "factorial", we can, later on, in the body of some other function, run something like "global factorial; factorial = lambda x:x", which will make factorial implement the identity function. This can be used to do some instrumentation, like adding preconditions, postconditions or logging to a function, while preventing any foreknowledge of what exactly is being called without expensive code analysis (which might not be conclusive).
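Here is a minimal sketch of that kind of instrumentation. The helper install_logging and the list calls are hypothetical names for the example; the point is only that the global name factorial is looked up again on every call, including the recursive ones:

```python
def factorial(n):
    return 1 if n <= 1 else n * factorial(n - 1)

calls = []

def install_logging():
    # Rebind the global name to a wrapper that records every argument.
    global factorial
    original = factorial
    def logged(n):
        calls.append(n)
        return original(n)
    factorial = logged

install_logging()
result = factorial(3)

# The recursive calls inside the original body also go through the wrapper,
# because they look up the global name "factorial" at call time.
print(result, calls)
```

This is convenient, but it also shows why a compiler cannot assume anything about what a call to factorial will execute.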

The same goes for objects: in general, any field of any object can be set to anything. There is no language-provided way to control access to object attributes that doesn't add significant overhead (for instance, it is possible to intercept getting or setting attributes, but that adds the overhead of executing these methods each time an access is made). This makes it very easy to define and extend objects, but at the same time makes them particularly difficult to analyze, since their makeup is not explicitly documented and can change at any time.
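The following sketch illustrates both halves of that paragraph in Python 3: attributes appearing on an object without any declaration, and the interception hook that adds a method call to every assignment. Point, Audited and log are names made up for the example:

```python
class Point:
    pass

p = Point()
p.x = 1
p.anything_at_all = "allowed"   # no declaration needed anywhere

log = []

class Audited:
    def __setattr__(self, name, value):
        # Runs on every attribute assignment, which is where the overhead comes from.
        log.append((name, value))
        object.__setattr__(self, name, value)

a = Audited()
a.x = 10

print(vars(p))
print(log, a.x)
```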

All in all, I do not find late binding to provide advantages that outweigh the performance losses it entails. In order to obtain great performance in such languages, it is necessary to use many clever tricks and optimizations, and since it is often possible for any part of the code to modify any other part, these optimizations have to be global, which proves to be a scalability issue. In Python, it is possible to import a module and change the binding of functions defined within it - this makes it unsafe to heavily optimize it without knowing anything about the programs that use it.

A better system would be to bind early by default, but enable late binding for particular symbols. Similarly, promote a more structured way to define objects, while also allowing the construction of unrestricted objects. These compromises do not lower the language's power, but I believe they make it more scalable.

Duck typing

Duck typing, according to Wikipedia, refers to the idea according to which the type of an object is effectively the set of methods and attributes that it possesses. Python naturally makes use of this principle, and in general objects are not type checked - if they have the expected methods, and these methods behave as expected, then there is no problem.

While this system allows for very easy prototyping, it has the disadvantage that type errors can be somewhat uninformative: for instance, you might get an error due to a missing "x" field, or an obtuse error due to some attribute of an attribute violating some part of the semantics. In order to avoid this, one can check type using "isinstance(object, type)", but that is clunky and awkward to use. More recently, abstract base classes that only verify the presence of a certain set of attributes, or some arbitrary properties, were introduced in order to formalize the concept of duck typing a bit better (but that does not make the use of "isinstance" any less awkward).
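A minimal sketch of both sides of the argument, using hypothetical Duck, Robot and Basket classes. The only library piece relied upon is collections.abc.Sized, an abstract base class that recognizes any class defining __len__:

```python
from collections.abc import Sized

class Duck:
    def quack(self):
        return "Quack"

class Robot:
    def quack(self):
        return "Beep"

def make_it_quack(thing):
    # No type check: anything with a quack() method will do.
    return thing.quack()

sounds = [make_it_quack(Duck()), make_it_quack(Robot())]

# The failure mode: the error talks about a missing attribute, not about types.
try:
    make_it_quack(42)
    error = None
except AttributeError as exc:
    error = str(exc)

class Basket:
    def __len__(self):
        return 3

# Basket never inherits from Sized, yet it passes the check because it
# has the required method: duck typing, formalized.
basket_is_sized = isinstance(Basket(), Sized)
duck_is_sized = isinstance(Duck(), Sized)

print(sounds, error, basket_is_sized, duck_is_sized)
```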

Python 3 has introduced a very interesting feature, where the arguments and return value of a function can be annotated, for instance:

def f(x: int, y: int) -> int:
    return x + y

The annotations in this snippet do not actually do anything by themselves, however, because their semantics are up to the developer. They can be used to add individual comments to arguments, or to give them types, which would have to be verified by a decorator or automatically by the class, if it is given the proper metaclass. Unfortunately, making good use of this feature requires a small dose of black magic...
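For the curious, the decorator flavor of that black magic can be sketched as follows. The decorator name enforce_annotations is hypothetical, and the sketch assumes that every annotation is a plain class usable with isinstance; it makes no attempt to handle *args, **kwargs or annotations stored as strings:

```python
import functools
import inspect

def enforce_annotations(func):
    """Check arguments and return value against the function's annotations."""
    sig = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = sig.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = sig.parameters[name].annotation
            if expected is not inspect.Parameter.empty and not isinstance(value, expected):
                raise TypeError(f"{name} should be {expected.__name__}, got {value!r}")
        result = func(*args, **kwargs)
        expected = sig.return_annotation
        if expected is not inspect.Signature.empty and not isinstance(result, expected):
            raise TypeError(f"return value should be {expected.__name__}, got {result!r}")
        return result

    return wrapper

@enforce_annotations
def f(x: int, y: int) -> int:
    return x + y

print(f(1, 2))

try:
    f("a", "b")
except TypeError as exc:
    print("rejected:", exc)
```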

Module system

One of my favorite things about Python is its nice and clean module system, which is inspired by Modula-3's. In order to use modules, Python mostly uses statements of the form "import module" or "from module import function". In the former case, all uses of a function in the module must be fully qualified, i.e. "module.function". In the latter case, the function is imported in the current namespace. This is in contrast with the somewhat kitchen-sink approach of some other languages, or namespace systems that I usually find clunky. It is also possible to import all the symbols from a module, i.e. "from module import *", but this is not recommended except perhaps in the interactive interpreter or throwaway scripts, because of name shadowing issues.

Making a module in Python is as simple as putting a .py file, or a directory with an __init__.py file in it, in one of the directories on the module search path (sys.path, which includes those listed in the PYTHONPATH environment variable). The associated module will be named after the file or the directory.
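Both import forms, and the "just drop a file somewhere on the path" part, can be tried out with the sketch below. The module name quaint_demo is made up, and the sketch adds a temporary directory to sys.path at runtime instead of touching PYTHONPATH, purely to keep the example self-contained:

```python
import importlib
import pathlib
import sys
import tempfile

import math              # uses must be qualified: math.sqrt
from math import sqrt    # sqrt is now a name in the current namespace

qualified = math.sqrt(16.0)
unqualified = sqrt(16.0)

with tempfile.TemporaryDirectory() as tmp:
    # A module is nothing more than a .py file in a directory on the search path.
    pathlib.Path(tmp, "quaint_demo.py").write_text(
        "def hello():\n    return 'hello from quaint_demo'\n"
    )
    sys.path.insert(0, tmp)
    try:
        importlib.invalidate_caches()
        quaint_demo = importlib.import_module("quaint_demo")
        greeting = quaint_demo.hello()
    finally:
        sys.path.remove(tmp)

print(qualified, unqualified, greeting)
```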

Bottom line

I feel like I might have been pretty critical of Python here, even though I really like the language overall. In making Theano, however, I found myself fairly limited, for there was no real possibility to smoothly integrate Theano's mini-language into Python without resorting to parsing it (and unfortunately, it appears that the AST is fairly complex, which makes the whole exercise painful). Automatically converting code using numpy to code using Theano was not really feasible either. In the end, we settled for manually constructing computation graphs.

All in all, I would describe Python as being a great language which is harder to optimize than it should be and whose expressiveness is limited in comparison to languages such as Ruby or Scheme which allow for a slightly laxer syntax and/or more powerful abstractions and/or the possibility to define language features with macros. It is true that it is in line with Python's philosophy of syntactic restriction in order to streamline, but on the other hand it seems to me that a good language should not force such choices upon the programmer, and that it should be possible to restrict it otherwise. For instance, a superset of Python could support macros, and you would have the choice to use the subset rather than the whole language.

 

11Oct/110

Objectives

Whenever one purports to create a new language, the most obvious and most immediate question that pops up is: "why a new language?", or more precisely: "what does your language do that other languages don't?".

That is a very good question, though the answer might not always be very satisfactory. The existence of a language is usually easier to justify a posteriori.

In a sense, with respect to its creator, even though he or she might not fully realize it, a programming language is mind share. Since it is the most basic level at which programmers reason, anyone using a language is implicitly endorsing the way of thinking that it promotes. If we assume that creating a language means to craft a means of combining concepts and ideas that most closely matches our own thought processes, that means that a programming language is essentially a vehicle to bring programming orthodoxy closer to what we believe it should be - and such beliefs fall in a wide spectrum between objectivity and subjectivity.

Many view the design of programming languages as a race to figure out the most expressive, or the "ideal" building blocks with which problems may be solved, one that hits a mythical sweet spot between all sorts of factors ranging from raw expressiveness to ease of debugging or optimization. I believe that view is a red herring: the "ideal language" does not exist, and the subtleties of programming languages lie at a place that is out of view, much less within reach of the large majority of programmers (language users). So while languages such as Scheme or Smalltalk might have been conceived as an attempt to attain a sort of programming Nirvana, I doubt that it matters much in practice or that this is an objective Nirvana, rather than a subjective one.

To create a new language is to offer an alternative to yourself and others that might increase productivity, and this very much depends on whose productivity you wish to increase. For bad programmers, this is impossible: they will be unproductive in all languages. For decent programmers, this is very difficult: as long as they know how to do something, they're content. For good programmers, languages such as Lisp, Smalltalk et al. place us at the point of diminishing returns: they are well-rounded enough that no spectacular improvements can be achieved. And finally, for excellent programmers, this is trivial, because they are abstraction sponges.

Much like ethics, design is not a place where objective truths can be obtained, but rather a battlefield where memes duke it out. Depending on the target audience, different strategies are called for. In the field of mass programming, where bad and decent programmers thrive, language quality is largely irrelevant, and only pragmatic matters count: library size, killer apps, the inane bullet points corporate neckties eat for breakfast, community support and as many zealots as you can get. That's because mass programming is driven by community - I mean, do you think anybody would use PHP otherwise?

Academia is already more interesting, because unlike the code monkeys that populate enterprises (I don't mean to be disparaging! Gotta earn that bread), it is formed of many people who actually like messing around with programming languages. Thanks to this, they are not as likely to reject ideas on the grounds that they are not what they are familiar with, and that certainly makes for a less depressing experience. Now, the question is: how do you actually convince them that what you are doing is worth it? There are three categories of goals that are worth pursuing:

  1. Efficiency: certain design choices can make languages inherently more difficult to execute efficiently, while certain others, on the contrary, can make optimization easier. The more benign a design choice is and the greater the efficiency gain, the more convincing the case is. This is probably the best horse to bet on, because of how objective the measurements are, and because this is pretty much the only property that a language can have that can strong-arm people into using it. For instance, why does anybody use C? Because it's efficient. Besides force of habit, what other reason could you possibly have to use C?
  2. Expressiveness: imagine that you have a large set of random tasks that you must program solvers for in some language you are proficient in. For each language, you can compute the mean and variance of the lengths of the solvers you come up with and the mean and variance of the cognitive load associated to programming each solution. Expressiveness, informally speaking, aims to minimize all of these values: you not only want short programs, but you also want as many problems as possible to have short solutions. You don't want each solution to take too much time (cognitive load) to come up with, and you don't want to get stumped on problems that are just inherently difficult to solve in that language. Expressiveness is ultimately somewhat of a subjective affair, but again, if a convincing case can be made that several programs are made shorter or easier to come up with in a certain language, that's a good show.
  3. Conceptual frugality: all scientists are suckers for conceptual simplicity (and so am I). That is, if it can be shown that features A, B, C, D and E of most languages can be reduced to the single feature X of your language, it makes a good case for its superiority. First, this usually makes the language core easier to implement. Second, such a feature X might be general enough to implement features F, G and H that nobody thought of. This makes it strictly more expressive than an enumeration of hard-coded features, assuming that these new features are ever found out to be useful.

Now, nobody designs a language just to design a language. Somebody designs a language because they are unsatisfied with existing offerings and have one or more ideas that they want to try out. What the previous paragraphs outlined, however, is a rough categorization of useful ideas, that is, ideas that can lead to something new and/or interesting. Ideas that do not clearly fall in that spectrum may be "cool", but they are unlikely to be seen as useful: rather, they will be categorized as idiosyncrasies, and criticized on grounds that have little to do with their merits. Most (but not all) syntactic considerations fall into that category, and as such they are ultimately an insufficient basis to build a language on (other than as a hobby, that is).

For Quaint, I have a list of many ideas that I wish to try out, and the language will be a sort of synthesis of these ideas. I can roughly categorize these ideas into three major groups:

  1. Semantics: the meat of the expected contribution of the language. That is, these are the core semantic concepts that the language will be built around.
  2. Pragmatics: these are ideas to help spread use of the language. The unfortunate but incontrovertible truth is that the quality of a language does not really matter in the usual settings where it is used, but I have some ideas that might translate to wider use.
  3. Cosmetics: I also have some ideas about what the language should look like. I will sneak these ideas into the language. Why? Because I can.

Some ideas might be dropped and new ones might be had, but so far there is still a distinct set I can describe right now.

Paradigm and imports

I will start by describing what one can expect from Quaint and the various ideas that are taken from other languages. I will not give a lot of detail here, but I will later on:

  • At a surface level, it will resemble a dynamic language à la Python or Ruby.
  • Multi-paradigm:
    • Optional lazy evaluation of expressions.
    • Primitives for data and task parallelism.
    • Primitives for backtracking.
    • A form of data flow programming.
  • A kind of macro system (Lisp-like):
    • It will be possible to "unquote" an expression within source code in order to evaluate it at compile time, which will insert the result as a special AST node at that location.
    • There will be syntax for "code" objects.
    • It will be possible to create new control structures (working on the AST - no custom parse rules).
    • Macro use will be more explicit and easier to identify than in Lisp-like languages.
  • Optional type annotations (using a general principle I will describe in the next section).
  • The module system will work like Python/Modula-3's.
  • Several data models will be supported.
    • Class-based and prototype-based objects, algebraic data types.
    • Rather than having a base object class with standard functionality, as many object-oriented languages do, standard operations such as fetching type or converting to string will be function primitives.
  • There will be whitespace-significant rules.

Semantic ideas

These ideas are probably the meatiest of the bunch and those that will require the most attention and formalism. These all aim to give the programmer greater expressiveness and greater control over the language.

  1. Reified variables: the idea is to give the programmer access to objects representing variables that control getting and setting the variable (as well as other things, possibly). Each variable could be declared with custom constructors allowing one to attach arbitrary behavior to operations involving it. That feature could be used to implement the following:
    1. Types: a variable can be typed with a setter that verifies the type of the new value.
    2. Automatic conversion: a variable could be made so that anything put in it is converted to a string, for instance.
    3. Parameter-less functions: a variable's getter can compute a nullary function (for instance, the variable "time" could return the current time on every use).
    4. Enforce constraints: for example, a variable could restrict its value to be between fixed bounds by clipping it to these bounds if needed.
    5. Enhance debugging: a variable could print or log all values it has taken over the course of the execution of a function, or print/log all of its uses.
    6. Implement access restrictions: a variable's setter could accept a value only once, which implements a read-only variable. It might be able to verify what code is trying to access it, thereby implementing the private/protected keywords found in other languages.
    7. Dynamic scoping: a variable's getter/setter could use a global store to fetch and store values, which amounts to dynamic scoping.
  2. Lazy typing: I call "lazy typing" the process of typing a variable without formally verifying that it has that type, instead verifying dynamically that all of its uses conform to it. Strict typing enforces a contract upon the values a variable can take, and attempts to prove that the variable will never take a value that violates the contract. However, it is not always possible to obtain such a proof: while it is easy to verify that something is an integer, it is more difficult to verify that a function has the signature int -> int, or that some mutable object won't change later in a way that will violate the contract. Or sometimes, we might have objects that technically violate the contract, but only in use cases that we know don't happen. In these cases, we would like to use a lazily typed variable which states that actual uses of the variable will never violate the contract. Therefore, a variable lazily typed int->int will take any function and will not raise an error if it is always given an integer and always returns an integer. If either of these conditions is violated, an error will (informatively) point to the lazy type declaration. There are of course some complications to such a system, which I will explain in a standalone article about this feature.
  3. Symbolic tools: I believe that it is a desirable feature of a language to be able to reason on a rich representation of a program or function and automatically transform it into a derived program or function. Most examples I have in mind involve symbolic manipulation on mathematical functions, a bit like a CAS, but not necessarily to solve complex equations - just to perform useful mechanical transformations. One of these is obtaining a function calculating the partial derivative of another function. Another would be to calculate the size of the vector output of a function given the size of its vector input (that transformation would give identity on map and addition on concat), or to simplify a vector function to only compute its ith output. In Quaint, it will be possible to do these things, though it might not be fully general: it might only work on pure functions, but we'll see about that. About this, I have to mention my previous work in Theano, which implements some of this capability in Python (but not directly at a language level).
  4. Rich semantic annotations: it should be possible for the programmer to give the compiler/interpreter as much information as possible on as many functions or objects as possible, so that it can be exploited to increase performance or memory efficiency. For instance, annotating a function as being the inverse of another function can be useful if during execution we often apply one to the result of the other (so that they cancel out). Annotating functions as doing the same thing as each other lets the compiler pick which one to execute. One might also want to describe the big-O complexity of a function in best/average/worst cases, and let the compiler approximate the multiplicative constants. It might be useful to know if a function is commutative or associative. Many properties could be figured out automatically, but certainly not all of them, and if the programmer knows what they are, we could use that knowledge. Of course, we need to be cautious about erroneous annotations. Again, Theano supports several such annotations.
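Quaint does not exist yet, so the reified variable idea can only be approximated in an existing language. The closest thing Python offers is the descriptor protocol, which works on object attributes rather than on plain variables. The sketch below is therefore only an analogy: the names Clipped and Mixer are invented, and the example combines the "enforce constraints" and "enhance debugging" uses from the list above.

```python
class Clipped:
    """An attribute whose setter clips values to [low, high] and logs them."""

    def __init__(self, low, high):
        self.low, self.high = low, high

    def __set_name__(self, owner, name):
        # Remember where the actual value is stored on each instance.
        self.storage = "_" + name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return getattr(obj, self.storage)

    def __set__(self, obj, value):
        clipped = max(self.low, min(self.high, value))
        obj.history.append(clipped)      # debugging aid: every value ever taken
        setattr(obj, self.storage, clipped)

class Mixer:
    volume = Clipped(0, 10)

    def __init__(self):
        self.history = []
        self.volume = 5

m = Mixer()
m.volume = 42    # clipped down to 10
m.volume = -3    # clipped up to 0

print(m.volume, m.history)
```

In Quaint the intent is for this kind of behavior to be attachable to any variable, not just to attributes of a class.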

I believe all of these ideas have interesting implications in practice. Reified variables and lazy typing enhance expressiveness, symbolic tools automate work that programmers would otherwise have to do themselves (like computing the derivative by hand or using Mathematica), and semantic annotations can allow the compiler to reason at a high level and enhance performance in some situations.
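
As a rough illustration of how an "inverse of" annotation could be exploited, the following Python sketch records the annotation in a table and uses it to cancel adjacent inverse calls in a pipeline of functions. Everything here is hypothetical (the inverse_of decorator, the INVERSES table, the simplify and run helpers), and the annotation is trusted blindly, which is exactly the risk with erroneous annotations mentioned above.

```python
# Hypothetical registry mapping each function to its declared inverse.
INVERSES = {}


def inverse_of(other):
    """Decorator declaring that the decorated function and `other` undo each other."""
    def mark(func):
        INVERSES[func] = other
        INVERSES[other] = func
        return func
    return mark


def simplify(pipeline):
    """Remove adjacent pairs of mutually inverse functions.

    `pipeline` is a list of one-argument functions applied left to right.
    """
    out = []
    for func in pipeline:
        if out and INVERSES.get(out[-1]) is func:
            out.pop()  # previous step followed by its inverse: drop both
        else:
            out.append(func)
    return out


def run(pipeline, value):
    """Apply each function of the pipeline in turn."""
    for func in pipeline:
        value = func(value)
    return value


def encode(x):
    return x + 3


@inverse_of(encode)
def decode(x):
    return x - 3


def square(x):
    return x * x


pipeline = [encode, decode, square]
fast = simplify(pipeline)
print(len(fast))                          # prints 1: only square remains
print(run(pipeline, 4) == run(fast, 4))   # prints True
```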

Pragmatic ideas

These ideas, in my mind, are what could make Quaint successful. Of course, that does not mean other ideas do not matter - what it means is that these are the features that would promote its use the most efficaciously, if we look pragmatically at what is likely to make a language take off.

  1. Online versioned package system: the idea is to have a global online repository whence "import" statements would automatically pull code, including arbitrarily old snapshots of modules. Packages would not have to be installed: the language would automatically look for them if they are imported by the code, and cache them on your system (of course, it would be possible to explicitly ask the system to fetch and cache a module and its dependencies, for offline use or development).
    1. Versioning allows programs to explicitly import versions of some external code or module from a certain time, for instance to ensure stability of behavior, while allowing other programs to use the most recent version. Versioning would likely work with simple time tags. For instance, the statement "import oct2011: url" would allow you to get the url package as it was in October of the year 2011.
    2. There would be a user space, to facilitate code sharing: for instance, the statement "import users.breuleux: coolfunc" would allow you to use the cool function I made without any further effort, provided that I committed it to my user space.
  2. Certificates: a "certificate" would apply to source files (or individual functions) in order to statically guarantee certain properties that depend on the certificate chosen, and guide the user into making the appropriate changes to satisfy it. The idea is that restrictions can be put on Quaint by using certificates, producing new languages that are more static, more regular and/or easier to read and maintain. For large projects, certificates could be used in order to guarantee some performance and coding standards among team members. Here are a few examples of certificates that would probably be standardized and actively promoted, although not enforced on the core language:
    1. Type safety: a certificate could require that types can be inferred throughout a source file.
    2. Macro/custom operator limitations: this would serve to limit the macros and operators used in a source file to a relatively small whitelist. Indeed, macros and custom operators can lead to very idiosyncratic code, which is sometimes bad for readability and maintainability.
    3. Style guidelines: this would verify that style matches certain guidelines.

I believe that the package system that I propose would promote code reuse very well: it is akin to having a global code repository for everyone to use, with personal space to publish small or large code snippets. The way I envision it working is that you would write a function called f in a file called mymodule.q, type "quaint push mymodule.q" at the terminal, sending your code to a central server that would publish it in the global package path "users.yourname.mymodule.f". Then you and everybody else could use it by typing "import users.yourname.mymodule: f" - on your machine it would use your local version, and for others it would download from the central repository or a mirror.
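
To give an idea of how time-tagged imports might be resolved, here is a small Python sketch. It assumes, hypothetically, that the repository keeps a date-sorted list of snapshots per package and that a tag such as "oct2011" means "the newest snapshot published before the end of October 2011"; the post does not pin down these details, and the names parse_tag and resolve are invented.

```python
import bisect
from datetime import date

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def parse_tag(tag):
    """Turn a tag like 'oct2011' into an exclusive cutoff date.

    The cutoff is the first day of the following month (assumed semantics).
    """
    month = MONTHS.index(tag[:3].lower()) + 1
    year = int(tag[3:])
    if month == 12:
        return date(year + 1, 1, 1)
    return date(year, month + 1, 1)


def resolve(snapshots, tag=None):
    """Pick a snapshot from a list of (publish_date, payload) sorted by date.

    Without a tag, return the most recent payload; with a tag, return the
    newest payload published strictly before the tag's cutoff.
    """
    if not snapshots:
        raise LookupError("package has no snapshots")
    if tag is None:
        return snapshots[-1][1]
    cutoff = parse_tag(tag)
    dates = [published for published, _ in snapshots]
    count = bisect.bisect_left(dates, cutoff)  # snapshots before the cutoff
    if count == 0:
        raise LookupError(f"no snapshot as old as {tag}")
    return snapshots[count - 1][1]


# Made-up history for a package named "url".
url_history = [
    (date(2011, 9, 3), "url v1"),
    (date(2011, 10, 20), "url v2"),
    (date(2011, 11, 5), "url v3"),
]
print(resolve(url_history, "oct2011"))  # prints url v2
print(resolve(url_history))             # prints url v3
```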

Certificates, I believe, are necessary in order to enhance the quality of large scale open source or enterprise projects where many collaborators have to contribute and read/maintain each other's code. In such a situation, it makes sense to have strict conventions that ensure everything fits together cogently. With standard tools that restrict the language's power to a manageable subset, such projects can simply decide to apply certificates XYZ to all project code, effectively using a subset of Quaint (I expect only a few certificate combinations to prevail in practice). The nice thing is that if something cannot be done well under these restrictions, exceptions can be made on a per-case basis.
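
Since Quaint does not exist yet, the closest runnable analogy for a certificate is a static check over another language's syntax tree. The Python sketch below uses the standard ast module to play the role of an "operator whitelist" certificate: it reports every binary operator that is not on the whitelist, with its line number, so that the user knows what to change. The function name check_operator_whitelist and the default whitelist are assumptions made for the example.

```python
import ast


def check_operator_whitelist(source, allowed=(ast.Add, ast.Sub, ast.Mult)):
    """Return a sorted list of (line, operator name) for disallowed binary operators.

    An empty list means the source satisfies this (hypothetical) certificate.
    """
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.BinOp) and not isinstance(node.op, allowed):
            violations.append((node.lineno, type(node.op).__name__))
    return sorted(violations)


sample = "a = b + c\nd = a @ b\ne = d ** 2\n"
for line, op in check_operator_whitelist(sample):
    print(f"line {line}: operator {op} is not allowed by this certificate")
```

Per-case exceptions, as described above, could be modeled by passing a larger whitelist for specific files.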

Cosmetic ideas

Finally, the following ideas are things I want to try out about the look and feel of the language:

  1. Use of Unicode: I am tired of ASCII and I find it limited. On the other hand, I am aware of all the potential hurdles and stigma associated with using Unicode, so I have made what I believe to be a good plan in order to introduce Unicode symbols and operators in Quaint.
  2. Whitespace significance: I don't really care about whitespace insignificance, because I don't think the ability to organize whitespace in whatever way you want is very valuable, compared to the gains some whitespace-significance rules can offer. I plan on having whitespace significance in the following ways:
    1. Line breaks end statements: that should be clear enough to anyone who managed to read up until here. You can break a statement on several lines by using a backslash at the end of a line that's being continued, or at the beginning of a line that continues the previous one. HOWEVER, unlike several languages where line breaks end statements, in Quaint line breaks will also separate elements of lists, so that you don't need to put commas at the end of a line.
    2. To disambiguate fixity: examples should suffice to show how this works: a + b is parsed as infix+(a, b); a+ b is parsed as postfix+(a)(b); a +b is parsed as a(prefix+(b)). The parser makes no difference between one space and more than one space.
    3. As two juxtaposition priorities: note that in the following, _ represents a space. The idea is that (a)(b)(c)(d) is parsed as itself, that is, (a)(b)(c)(d); (a)(b)_(c)(d), on the other hand, is parsed as ((a)(b))((c)(d)). Basically, not putting spaces between operands binds them tighter. Furthermore, I am considering that whereas (a)(b) + (c)(d) would parse as infix+((a)(b), (c)(d)), (a)_(b) + (c)_(d) would parse like (a)(b + c)(d) (regardless of the space pattern around +).
    4. Significant indent: a bit more general than what Python does: in Quaint, an indented block would be equivalent to putting an opening parenthesis at the end of the line preceding the block, and a closing one at the end of the last line of the block. This being said, this might not actually survive the design process, because auto-indent is a useful feature it would prevent.
  3. Operator-based syntax: besides delimiter grouping (parentheses, square/curly brackets, guillemets), the syntax will be almost purely an operator syntax, based on an operator precedence DAG (note: whitespace is considered the function application operator here, and commas/semicolons/linebreaks are list constructors). What's pretty cool is that in conjunction with the whitespace rules previously listed, it can be made to look almost exactly like Python, just with far fewer restrictions on what is legal syntax. I will post about it soon.
  4. General function application: this might be more about semantics than cosmetics, and the feature's name is probably misleading. The idea is as follows: consider two usual features of programming languages, that is, fetching the attribute of an object, and calling a function. There is typically a syntactic distinction between these two features, but I plan to do away with it, and instead see them as two cases of function application, the only difference being the type of the argument. With "f(x)" representing general function application, "f.x" is parsed as "f(.x)": the application of a function to a symbol, filling the semantic role of fetching an attribute. "f[x]", on the other hand, is parsed as "f([x])": the application of a function to a list of arguments, filling the semantic role of the (standard) function call. What's nifty with this approach is that when the argument to f is neither a symbol nor a list of arguments, the semantics are pretty open. I have some use cases in mind that I will describe another time.
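
The "general function application" item can be approximated in Python with a wrapper that has a single application entry point and dispatches on the type of its argument. This is only a sketch of the idea under assumed names (Symbol, Applicable); it says nothing about how Quaint would actually implement it.

```python
class Symbol:
    """Stand-in for a Quaint symbol such as .x (hypothetical)."""

    def __init__(self, name):
        self.name = name


class Applicable:
    """Wrap a Python object so that it has exactly one operation: application."""

    def __init__(self, target):
        self.target = target

    def __call__(self, argument):
        if isinstance(argument, Symbol):
            # f.x, read as f(.x): applying to a symbol fetches an attribute
            return getattr(self.target, argument.name)
        if isinstance(argument, list):
            # f[x], read as f([x]): applying to a list is the usual call
            return self.target(*argument)
        # Any other argument type is where the semantics are left open
        raise TypeError("no meaning defined for this kind of argument")


number = Applicable(complex(1, 2))
print(number(Symbol("imag")))        # prints 2.0

largest = Applicable(max)
print(largest([3, 9, 4]))            # prints 9

# Attribute fetch followed by a call, both expressed as application
upper = Applicable("hello")(Symbol("upper"))
print(Applicable(upper)([]))         # prints HELLO
```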

I am not certain any of these ideas are interesting per se, but I will push them through anyway.

And that pretty much wraps it up for the time being :)

Wow! Did you really read the whole thing? Here, take this, you deserve it:

 

28Sep/110

Language review

In order to gain inspiration for Quaint features, to make sure that good ideas are not lost, and to avoid repeating some mistakes, I am going to spend some time learning and reviewing many existing languages, at the rate of one or two per week (once I get the hang of it, anyway). I will spend more time on some languages than others, simply because many of them are similar and there is little need to cover the same things more than once.

Here is a preliminary list of the main languages I will look at, along with the reasons I am considering the language:

C - is an imperative language that's very close to the machine, and lacks many features enjoyed by most other languages (such as garbage collection). Since C is essentially the language of choice of anyone who wishes to maximize performance, I feel that it is important to cover it.

Python - is a mature, widely used object-oriented dynamic language. It aims for frugality ("there should be one obvious way to do it") and its package system is solid (in the same vein: Ruby).

JavaScript - is the language of choice for web development, and pretty much the only one. It implements a prototype-based object system (in the same vein: Self, Lua, CoffeeScript).

Scheme - is syntactically minimalist and made extremely flexible by its macro system, allowing for the easy definition of new control structures and/or domain-specific languages. It is somewhat multi-paradigm, but is usually used in a functional way (in the same vein: Common Lisp).

Haskell - is purely functional and lazy. Its approach to evaluation order is not mainstream but has many advantages with respect to code elegance and optimization. The type system of Haskell is rigorous, well fleshed out, and types can be inferred automatically in most cases. Doing imperative/side-effect-inducing code in Haskell is not as intuitive as in most other languages, however.

OCaml - is a functional object-oriented language. It has an elaborate type system supporting OOP features such as inheritance, and can do multi-paradigm rather well.

Smalltalk - is an object-oriented dynamic language based on message passing (where all arguments are named). Its syntax is rather regular and minimalist. It makes the definition of anonymous functions ("blocks") easy, and most control structures are implemented as methods on appropriate types that take blocks (for instance, [count > 0] whileTrue: [count := count - 1]).

Icon - has a unique evaluation method: the evaluation of a function is conditioned on the success of the evaluation of its arguments, and functions may generate more than one result, which allows the language to backtrack and retry until the evaluation succeeds. This allows some things to be written very concisely. Icon incidentally provides an elegant solution to chaining comparison operators.

Prolog - is a logic programming language, based on first order logic and backtracking. Predicates are defined on one or more arguments, e.g. add(x, y, z), and the interpreter may be asked to generate all values that satisfy some predicate given some constraints (e.g. add(4, 5, ?) -> z = 9).

Forth - is a stack-based programming language. "Words" may be defined which, when used, manipulate the topmost elements of the stack and push the result. Parsing is trivial, as tokens are read one after the other, looked up in a code dictionary, and executed on the stack.

Erlang - is a concurrent functional language. It has facilities to spawn local or remote processes, and to hot swap some code for other code on a running system. Processes do not share state, so they have to communicate with each other by passing messages.

Rebol - is based on dialects, which are basically small domain-specific languages. The "do" dialect allows program definition, whereas the "parse" dialect allows the definition of a grammar, the "layout" dialect serves to describe GUIs, and so on. Each dialect, while derived from a common base, has its own particular syntax rules.

APL - is an array-based programming language making use of a wide array of symbols. Somewhat unreadable, but very concise.

Coq - is a dependently typed functional programming language, which allows it to fully tap the Curry-Howard isomorphism (types are propositions, and programs are proofs). The language allows the specification of correctness proofs about most functions.

Eiffel - pioneered design by contract, where functions provide preconditions that must be true prior to execution, postconditions that must be true posterior to execution, as well as an account of the errors and side effects that might occur, and so forth.

Chapel - is a programming language developed by Cray, geared towards supercomputing. As such, it offers several abstractions for data or task parallelism, and "locales" where variables live that may be located on different computers.

VHDL - is a hardware description language. Insofar as compiling code in a high level language to a circuit is an interesting objective, it is useful to know how it is presently done, and to take some inspiration if possible.

Other interesting languages not mentioned above include: Modula-3, Nemerle, Objective-C, REXX

 

23Sep/110

Welcome!

In here I'm going to publicly design a new programming language named Quaint. I shall post about all the various ideas I have and their possible implementation, and I will try to justify the inclusion of all the features I choose and the omission of all the features I reject.

It will be fun. It will be grand. It will be revolutionary.

So, welcome!
