Designing a new programming language is something I have always wanted to do, and seeing the ridiculous number of new programming languages popping up (and disappearing) every few months, I am clearly not the only person who feels inspired by the idea. There is something exciting in designing the very tools that we design libraries and applications with and adapting them to one’s own quirks and expectations.
Although I consider that a lot of languages already exist that are perfectly adequate, none combine all the features I desire: a simple and elegant syntax, infix notation, macros, a solid package system, a design that lends itself well to various optimizations, and being designed by me (I am that vain). Now, that might not be completely true: there are many languages that nobody uses which I find much better than any language that people use; but I figure I might as well make my own, if I see any room for improvement (subjectively speaking, anyway).
There are many aspects of my language that are already designed, but this post will not delve into details. It is more of an overview of what I believe to be sound design principles.
An overarching concern
It is obvious that some languages are inherently harder to optimize than others. A program written in Python or in Ruby will typically run slower than an equivalent program written in C, and in general, there is a chasm between dynamic languages (mostly interpreted) and static languages (mostly compiled). It is not difficult to understand where the difference comes from: static languages require more annotations (to identify types, for instance) and can provide much tighter guarantees about what variable references will resolve to. The much greater laxness of dynamic languages makes these guarantees much harder, if not impossible, to obtain, but at the same time it makes them much more flexible. Productivity is enhanced, though often at the cost of correctness and early bug diagnosis.
I believe that a reasonable balance can be struck between static and dynamic, a balance that neither type of language really tries to achieve, tending instead to the extremes of rigidity or flexibility. I believe that dynamic languages, by default, allow more possibilities than they should, and fail to provide proper tools to enforce restrictions to the benefit of safety and compiler optimization. Static languages, on the other hand, offer no escape from their rigid semantics.
My goal is essentially to make a highly optimizable dynamic language, with restricted behavior in the average case but expressive enough to ask for more. That way, I hope to get the best of both worlds. A few points made in this article (further down) are directly related to this concern for efficiency.
Expressive power
There is something to be said about expressive power, which is that more expressive power benefits program writers, whereas less expressive power benefits program readers and maintainers. Power users of a programming language and creative programmers will usually strive for more features and more ways to do the same things, so that they can optimize their own coding experience. On the other hand, team-oriented programmers, maintainers and managers likely see flexibility as more things to learn and more rope to hang oneself with. The Java programming language is a prime example of a language with abysmal expressive power which is nonetheless staunchly defended by those who are put at ease by its soothing familiarity and monotony, and then proceed to go work for boring companies.
In any case, I would say that the success of a language is mostly circumstantial and that once it catches on, people learn to love it in spite of its faults. JavaScript is pure dreck, but essentially all we have for client-side web programming – and so we use it, and through some inexplicable acclimatization process, we even like it. C++ is a clunky and bloated abomination and its preprocessor is a disgrace. Still in wide use. Okay, so the last few sentences are a bit tongue-in-cheek, but nonetheless it seems clear that languages have success somewhat independently of their merits, and that people will rationalize and defend any feature that they are used to.
This is both good and bad. It is bad to the extent that the factors that a language designer thinks are the main contribution of his or her work are ultimately secondary. On the other hand, it relaxes the need to design a language that will please as many people as possible, so that one can focus on finding an elegant way to express what one wants to express. This being said, violating too many expectations (e.g. the expectation that the mathematical precedence of operators is respected) still seems like a bad idea to me, since it gets in the way of appreciating the rest of the language, and re-acclimatization might then be viewed as an unjustified hardship.
Macros rule
A macro – and by macro, I mean the kind of macros one finds in Lisp or Scheme, not awful C preprocessor macros – is a way to abstract out common patterns in such a way that they can be reused with a minimal amount of code. In that way, they are similar to functions, with the difference that they are essentially code generators: given a few pieces of code, they can produce complex expressions to evaluate. Therefore, they can abstract over patterns that functions cannot. Macros are the ultimate form of expressiveness.
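To make the "code generator" idea a little more concrete, here is a minimal sketch in Python, which has no macro facility of its own. The helpers expand_unless and run_unless are hypothetical names: they use the standard ast module to play the role of a macro expander, taking two pieces of code and producing an "if not ..." statement from them. A plain function could not abstract this pattern, because its arguments would already have been evaluated before the call.

```python
import ast

def expand_unless(condition_src, body_src):
    """Hypothetical macro expander: build `if not <condition>: <body>` from two code fragments."""
    condition = ast.parse(condition_src, mode="eval").body   # a single expression node
    body = ast.parse(body_src, mode="exec").body             # a list of statement nodes
    guarded = ast.If(
        test=ast.UnaryOp(op=ast.Not(), operand=condition),
        body=body,
        orelse=[],
    )
    module = ast.Module(body=[guarded], type_ignores=[])
    return ast.fix_missing_locations(module)

def run_unless(condition_src, body_src, namespace):
    """Expand the pseudo-macro, then compile and run the generated code in `namespace`."""
    tree = expand_unless(condition_src, body_src)
    exec(compile(tree, "<macro>", "exec"), namespace)

calls = []
scope = {"ready": False, "calls": calls}
run_unless("ready", "calls.append('waiting')", scope)   # body runs: ready is false
scope["ready"] = True
run_unless("ready", "calls.append('waiting')", scope)   # body is skipped: ready is true
```

In a language with real macros the fragments would be syntax trees handed over by the compiler rather than strings, but the shape of the transformation is the same.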
A proper toolkit of pre-made abstractions, such as lazy evaluation, closures, variable references, etc., can alleviate the need for a macro facility, if they are designed in such a way that they can be combined effectively. It is also true that abusing macros can lead to unintelligible code, since each new macro is akin to tacking on a new syntax rule. Looking back on my days coding in Scheme, I was quite the offender in that regard.
I personally believe that macro use should be made obvious, in order to make the syntax more predictable and less surprising. In Lisp-like languages, macros are usually impossible to distinguish syntactically from normal function calls: “(function arg)” and “(macro arg)” follow the same syntax. This is rarely a problem, since macro calls are usually strange or meaningless if they are interpreted as normal calls, but I would still rather use something like “(macro: arg)”. In any case, I would like to provide the option of defining macros. Staging is also a neat idea, where code can be written to be executed at compile time (stage 0) or at execution time (stage 1) (of course, nothing precludes having more stages).
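Staging can be approximated in Python too, as a rough sketch. The function stage_power below is a hypothetical example, not part of any proposed design: at "stage 0" it generates and compiles a function specialized for a known exponent, and at "stage 1" the specialized function simply runs, with no loop left in it.

```python
def stage_power(n):
    """Stage 0: generate and compile a function specialized for a known exponent n."""
    product = " * ".join(["x"] * n) if n > 0 else "1"
    source = f"def power(x):\n    return {product}\n"
    namespace = {}
    exec(source, namespace)          # the work that depends on n happens here, once
    return namespace["power"], source

cube, cube_source = stage_power(3)   # stage 0: produces "return x * x * x"
result = cube(2)                     # stage 1: only the multiplications remain
```

A language with syntax for staging would let the compiler do this specialization without going through strings and exec.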
Everything is a function
Many programming languages gravitate around the “everything is an object” paradigm. It is my belief that “everything is a function” is a paradigm which is superior in every way, because it can be made to look and behave almost identically… except that the syntax is simpler.
Consider the following syntax:
- “f x” is the application of the function f to the argument x. Functions always apply to a single argument, and “f x y” means “((f x) y)”.
- “[a, b, c]” is a list of a, b and c (many languages already use this syntax).
- “.m” represents the symbol “m” (many languages have syntax for symbols, though not necessarily that one).
Then, consider how one would translate Python to such a language:
- “f(x)” becomes “f[x]” – what in Python is simply referred to as a function call becomes a function call on a list (of arguments).
- “person.age” stays the same, but it is parsed as “person .age”, or “person(.age)”. What in Python is an attribute lookup becomes a function call on a symbol.
- “f(*args)” becomes “f args”. There is indeed no difference between “f[x]” and “args = [x]; f args”, and no need for any syntactic sugar.
- “getattr(thing, attribute)” becomes “thing attribute”. There is indeed no difference between “foo.bar” and “attr = .bar; foo attr”.
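The translations above can be mimicked in today's Python, as a sketch under a few assumptions. Symbol, Record and Function are hypothetical classes invented for this illustration: every value is callable on exactly one argument, and what happens depends on whether that argument is a symbol or a list of arguments.

```python
class Symbol:
    """Stand-in for the proposed `.m` symbol syntax."""
    def __init__(self, name):
        self.name = name

class Record:
    """An 'object' that is just a function from symbols to values."""
    def __init__(self, **fields):
        self._fields = fields

    def __call__(self, arg):
        if isinstance(arg, Symbol):
            return self._fields[arg.name]
        raise TypeError("a record can only be applied to a symbol")

class Function:
    """A function that is always applied to one argument: a list of arguments."""
    def __init__(self, implementation):
        self._implementation = implementation

    def __call__(self, arg):
        if isinstance(arg, list):
            return self._implementation(*arg)
        raise TypeError("a function can only be applied to an argument list")

person = Record(age=42, name="Ada")
add = Function(lambda a, b: a + b)

age = person(Symbol("age"))     # "person.age", i.e. "person .age"
attr = Symbol("name")
name = person(attr)             # getattr(person, attribute) is just "person attr"
args = [1, 2]
total = add(args)               # f(*args) is just "f args"
```

Neither getattr nor the star syntax is needed here: both cases are ordinary application to a value that happens to be stored in a variable.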
While this does not include a mapping for other languages’ “collection[index]” syntax, the fact is that objects very rarely implement both “()” and “[]”. It seems futile to differentiate them. Furthermore, using square brackets for function calls avoids overloading the meaning of parentheses: their sole use is now to override operator priority. For what it’s worth, square brackets also do not require using the shift key.
The point here is essentially that there is no need to make a syntactic distinction between function calls and attribute lookups. Besides replacing parentheses by square brackets, which I believe is a good idea regardless, the three syntax rules I enumerated previously lead to code that looks exactly like the equivalent expressions in Java, Python, Ruby and others, except for when it is simpler.
Using “f(a, b, c)” as the function call syntax compels the existence of an apply function (or of some syntactic sugar for it). Encoding syntax for attribute lookup “x.y” compels dynamic introspection methods. Instead, I believe that there should only be one syntactic concept, which is calling a function on a single argument, and that this call may be made on an object of type “arguments” or on an object of type “symbol”. This takes care of everything in a very elegant way, and even leaves an interesting gap to fill with other argument types (numbers, hash maps, special objects, etc.).
Definitions ought to be read-only by default
There is one aspect of dynamic languages that kind of ticks me off: when defining a function in the global scope, it is usually possible to overwrite it with another function. In many languages, modifications to the definitions in file X can even be done in file Y, affecting calls in file Z. I would wager that this is extremely rarely done, which in itself is not an argument against the feature. However, this can get rather bad when it is done accidentally (quite easy in JavaScript, since it thinks making variable assignment global by default is a brilliant idea). Conflicts can easily happen if one is not careful. Furthermore, the ability to replace functions can make code intractable, difficult to follow and difficult to optimize (even if the “feature” is not being used). In Python at least, it is also possible to import a package and shuffle around the functions it exports – sure, that is flexible, but so many things could go wrong.
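Here is what this looks like in Python, as a small illustration. The geometry module is built by hand with types.ModuleType so that the example stays self-contained; it stands in for "file X", and the names area and describe_room are made up for the occasion.

```python
import types

# "File X": a module defining a function (built by hand here, purely for illustration).
geometry = types.ModuleType("geometry")
geometry.area = lambda width, height: width * height

# "File Z": code that calls the function through the module.
def describe_room(width, height):
    return f"{geometry.area(width, height)} square metres"

before = describe_room(3, 4)

# "File Y": unrelated code rebinds the definition, and nothing stops it.
geometry.area = lambda width, height: 0

after = describe_room(3, 4)     # same call, different answer
```

Because the lookup of geometry.area happens at every call, neither a reader nor a compiler can assume that describe_room always reaches the original definition.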
I believe it can be sometimes very useful to change functions dynamically. But why is it possible to do this by default? I would much rather have a strict default behavior, such that definitions are immutable, but functions may be defined as mutable (opt-in). Knowledge of immutability gives readers of the code reassuring guarantees and makes it a lot easier to perform complex optimizations. Explicit mutability also helps readers understand the code (“Oh, this could change! I will look for places that might change it!”).
Overall, the default behavior I propose would lead to better code, which would be more predictable and easier to understand. Since it is a mere default, it would not preclude opting out of it, which could be as simple as writing “def mutable” instead of “def”. I do not preclude, either, the possibility of importing files or modules in “debug” mode, where the default would be changed to “mutable”. It could also be possible to import a module twice, once as “debug” and once normally, in which case it would have to be loaded twice, as two independent versions, rather than once.
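A minimal sketch of that default, written in Python for lack of the language itself. The Namespace class and its define and lookup methods are hypothetical; the mutable flag plays the role of “def mutable”.

```python
class Namespace:
    """Hypothetical global scope where definitions are read-only unless declared mutable."""
    def __init__(self):
        self._values = {}
        self._mutable = set()

    def define(self, name, value, mutable=False):
        # Rebinding is only allowed for names that opted in when first defined.
        if name in self._values and name not in self._mutable:
            raise TypeError(f"'{name}' is immutable and cannot be redefined")
        self._values[name] = value
        if mutable:
            self._mutable.add(name)

    def lookup(self, name):
        return self._values[name]

scope = Namespace()
scope.define("greet", lambda: "hello")               # plain "def"
scope.define("hook", lambda: "v1", mutable=True)     # "def mutable"
scope.define("hook", lambda: "v2")                   # allowed: hook was declared mutable

try:
    scope.define("greet", lambda: "goodbye")         # rejected: greet is immutable
    redefinition_rejected = False
except TypeError:
    redefinition_rejected = True
```

In this sketch the check happens at run time; a compiler that sees “def” without “mutable” could instead rely on the binding never changing and optimize calls to it accordingly.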
Eval is ridiculous
Many dynamic programming languages include a handy function called “eval”. That function takes a string (or sometimes an AST) and – by default – parses and interprets it in the current scope, a bit as if the string’s contents were inserted verbatim in place of the call to eval. So for instance, “eval(‘x = 333’); print x” would print 333.
I have two issues with eval. The first issue is that eval is fundamentally dangerous – calling eval on a string coming from an untrusted source is a sure recipe for disaster, and for that reason it should not be done. The second issue is that eval is an optimization nightmare – any function call that might be a call to eval (and keep in mind that most languages allow setting fields and variables to eval) throws off any assumptions you might want to make about the values of the variables in the current scope. Now, eval being dangerous means that it should not be used, but that does not change the fact that it could be, which makes optimization harder regardless.
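The aliasing problem is easy to reproduce in Python. In the sketch below, apply_to_source is a made-up function that calls whatever it is given; nothing in its body says whether the callee can see the local variable secret.

```python
def apply_to_source(func):
    secret = 41
    # Nothing on the next line reveals whether func can read `secret`.
    return func("secret + 1")

harmless = apply_to_source(str.upper)   # func only sees the string itself
leaky = apply_to_source(eval)           # func evaluates the string in apply_to_source's scope
```

Since any call through a variable might turn out to be eval, an optimizer has to keep every local variable observable, even in code that never mentions eval.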
Don’t get me wrong – being able to evaluate a string of source code is something quite useful, and it is a feature I definitely wish to have in any programming language. However, does that string have to be evaluated in the current scope? It seems to me that the eval function should require an argument containing its evaluation environment. This would make it much safer, and much less of an issue for optimization, while preserving most of its legitimate uses. Certainly, by default, it should not have access to the current environment.
Now, should one wish to evaluate something in the current environment using eval, they should have to pass some object that represents it. I do not think there should be a generic function that returns the current environment, because this would lead to the same kind of aliasing problems that occur with eval. I would rather use a keyword or a special syntactic form, so that the compiler (and for that matter, any human reading the source code!) may tractably determine whether it escapes or not. The previous example would then have to be written “eval(‘x = 333’, scope); print x”.
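Python can already be used this way, which gives a rough idea of what I have in mind. Its exec built-in accepts a dictionary to use as the evaluation environment (in Python, an assignment such as “x = 333” has to go through exec, since eval only handles expressions). The wrapper evaluate_in is a hypothetical helper that makes the environment argument mandatory.

```python
def evaluate_in(source, environment):
    """Run source against an explicitly supplied environment, never the caller's scope."""
    exec(source, environment)
    return environment

x = 1
isolated = evaluate_in("x = 333", {})

caller_x = x                 # the caller's x is untouched
isolated_x = isolated["x"]   # the new binding lives only in the supplied environment
```

Note that exec still inserts a reference to the built-ins into the dictionary, so this isolates variable bindings but is not a security sandbox. Python also has a generic globals() function that returns the current environment, which is precisely the kind of thing I would replace with a keyword or special form.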
The bottom line
Dynamic languages are great, but they make optimization needlessly difficult by allowing the overriding of functions in the global/package space by default, and by making “eval” evaluate its argument in the environment in which it is called.
I would say that a good dynamic language, besides addressing these issues, should allow the definition of macros and have syntax for staging. The “everything is an object” paradigm, while interesting in its own right, is dominated by the “everything is a function” paradigm, which can do the exact same things with fewer syntax rules and greater expressive power.
All in all, I think this sets a base for an interesting new programming language. Any feedback, criticism and ideas are welcome!