Originally written for Cohost, but uh... it ain't here anymore and I didn't finish this in time to post it.
It's been a hot minute/year since I wrote one of these, so please accept this humble smörgåsbord of disparate bespoke programming language features, served up hot and fresh.
In no particular order:
Name
The current name I'm going with is Lace.

Aesthetics
A common theme of this project is stealing things from Ada, and syntax is no exception.

- Arrays now use parentheses instead of square brackets. No real justification for this one. I just think it's neat. :)
- Function calls with zero arguments don't need parentheses. This one at least has some vague justification. Consider the case of methods. Say some record has some field `Size` (or some package has an `entrySize`) and later on, as an optimization or something, `Size` doesn't need to be stored and may be cheaply calculated. So you would probably turn it into a method in something like C++ or Rust. The transformation is largely mechanical, replacing `foo.size` with `foo.size()`. When doing this style of change it reminds me of having to change something in C to a pointer and having to replace a bunch of `.` with `->`. Most modern languages agree that this sucks to do, and I am extending this idea to methods.
`let`, `mut`, `var`, `const`, let's call the whole thing off
Mutability is needed[citation needed] in a systems-level language, but is not a reasonable default. So symbols need some way to denote that they refer to mutable data.
Since I already steal so much from Ada's syntax, I've just inverted it. (Ada has mutability by default)
```ada
-- Ada: mutable by default
Some_Mutable_Data : Some_Type := 123;
Some_Immutable_Data : constant Some_Type := 123;
```
```
-- Lace: immutable by default
Some_Mutable_Data : mutable Some_Type := 123
Some_Immutable_Data : Some_Type := 123
```
The general shape of a declaration is `name ':' modes expression? ':=' expression`, and `mutable` lies under this `modes` part. And, spoiler alert, there are more modes to come later.
Later is now
Modes are essentially a way to add extra properties to a declaration. The current modes are:

- `mutable`, as described above, makes the data referred to by a given symbol mutable
- `unused` states that the symbol will not be used. If a symbol is not used but doesn't have the `unused` annotation, an error is emitted
- `public` makes the symbol a public entry of the current package
- `aliased` is complicated
Aliasing
This section is half the current implementation, and half hopeful speculation.
In concrete, code-writing terms, the `aliased` mode lets you use the `'access` and `'address` attributes on a symbol or any of its components, essentially allowing you to take a pointer to it (i.e. access it through a different name, or an alias).
In abstract, specification-esque terms, the `aliased` mode makes the memory layout of the data referred to by the aliased symbol observable. Operations that observe memory layouts are things like taking pointers/`access`es to it, reinterpret casting, and storing to or loading from any memory address. On non-aliased data, the compiler is/should be allowed to rearrange data as it sees fit. Record components need not be stored in order or even next to each other. If an array isn't dynamically indexed, just treat each component as a local variable and do normal register allocation based on their usage. Etc. etc.
I'm sure other languages have similar definitions about data that isn't aliased, but it is nice to have a marker in the language for this kind of thing (at least in my humble opinion).
A note on strict aliasing
While I try not to compare things to C, since it is the lowest bar of both features and UX/DX, I think this is worth mentioning. Strict aliasing is a rule in the C (and C++) standard that essentially states that memory is typed (or *effectively typed*) by either its declaration or, in the case of allocated objects (i.e. `malloc`ed pointers), its first write.
This means that a lot of C code is non-conforming (to almost no one's surprise).
For example, the relatively innocent and common pattern of reading into a local `char` buffer to then reinterpret into some protocol's data is technically incorrect.
```c
char buffer[1024];
ssize_t count = read(socket_fd, buffer, sizeof buffer);
// handle count < 0
struct some_data const *data = ((struct some_data *)buffer);
return data->thingy;
```
Since `buffer`'s effective type is `char[1024]`, reading it as a `struct some_data` is undefined behavior. (Note: even though `buffer` is of a character type, writing to it as a non-character type is still UB.)
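For contrast, here is a sketch of the standard-blessed way to do the same thing, reusing the names from the snippet above: copy the bytes into an object that actually has the target type, so nothing is read through an incompatible effective type.

```c
#include <string.h>

struct some_data { int thingy; };

// Conforming alternative to the cast: memcpy into a properly typed object.
int thingy_from_buffer(const char *buffer) {
    struct some_data data;
    memcpy(&data, buffer, sizeof data);  // data's effective type is correct
    return data.thingy;
}
```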
You may also notice that this rule makes it so that custom allocators in userland are almost always non-conforming unless they call out to `malloc` to actually acquire the memory. This is because the implementation of the standard library is blessed by the standard to do implementation-defined things that would otherwise be non-conformant.
I am striving to avoid having any sort of blessed library that is required to implement basic features in userland (such as dynamic allocators) and allowing data to alias is a big first step.
A note about reinterpreting data
The standard-blessed way of reinterpreting data is via `memcpy` or through a union (only in C, not C++). This is fine for small data, and compilers usually have optimizations for when they realize that a `memcpy` is just being used as a reinterpret. But what about the extremely common case where the data has a flexible array member? (Bonus points if that array is not a character array. Bonus bonus points if that data isn't aligned in the protocol.) While I'm more up on variable-length arrays than most people, they are something that the standard allows the implementation to omit. So we are back to using `memcpy` to copy out chunks of this array?
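To make that concrete, here is a hedged C sketch of strictly-conforming parsing of a wire buffer whose shape matches a flexible-array-member struct (`struct message` is a made-up protocol type): without reinterpret casts, every element has to be `memcpy`'d out individually.

```c
#include <string.h>

// Hypothetical protocol message: a length followed by that many values.
struct message {
    unsigned len;
    unsigned payload[];  // flexible array member
};

// Sums the payload while staying conforming: each field is copied out of
// the raw bytes instead of casting the buffer to struct message *.
unsigned sum_payload(const unsigned char *wire) {
    unsigned len;
    memcpy(&len, wire, sizeof len);
    unsigned sum = 0;
    for (unsigned i = 0; i < len; i++) {
        unsigned v;
        memcpy(&v, wire + sizeof len + i * sizeof v, sizeof v);
        sum += v;
    }
    return sum;
}
```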
On a practical note, since reinterpreting character buffers is so common, no actual compiler is going to do the wrong thing here... probably. Especially across linkage boundaries. So in actual code writing terms this is fine, but in terms of actually defining a language? Goofy.
If you want my take summed up: attempting to write standard-compliant C is a complete farce. Just say you only support a handful of compilers and rely on their implementation-defined behavior and extensions.
Error handling (no, not that kind)
For the ease of implementation, the compiler used to completely bail on the first syntax or semantic error it came across. In terms of writing a compiler, this is great! Something goes wrong? Hit da bricks!
As a user of the compiler? This shit sucks. You're tellin' me since I added an extra comma you gotta start the whole compilation over?
Well now the compiler can sort of handle multiple errors. There are still many cases where it just bails when it should just abandon the current function/block. But hey, it's progress.
Error handling (no, still not that kind)
Don't you love it when all error messages are clumped together in your terminal? And they print out a whole bunch of source code unrelated to the actual error?
```
bingus.c:100:23: error: Some shit happend
    thang x = { blah }
                ^^^^^
bingus.c:100:23: note: This shit sucks dude, try better
    thang x = { blah }
                ^^^^^
bingus.c:420:69: note: Oh brother, this gUY STINKS!
    ...
    ...
```
Yeah, me neither
This is definitely getting better with newer languages discovering the technology of box drawing characters. And Lace is following suit.
Take this code for example:

```
Main : public := function do
 A : Int := 1
 B ::= A -- Unused symbol B

 F (A, B => A) -- Missing argument C
 F (C => 2) -- Missing arguments A, B
end

F ::= function (A : Int, B : Int, C : Int) do
 pragma Discard (A)
 pragma Discard (B)
 pragma Discard (C)
end

Int ::= range 1 .. 10
```
As the comments say, this will produce a few errors. But note how I try to visually group each error into its own distinct bundle via line-drawing characters and whitespace.
```
3 errors in Scratch.Lace, 1 is automatically fixable:
┌ Scratch.Lace:3:2:
│ Symbol ‘B’ should be annotated as unused
│
│ 2│  A : Int := 1
│ 3│> B ::= A -- Unused symbol B
│ 4│
│
│ • This error can be automatically fixed
└──────────────────
┌ Scratch.Lace:5:2:
│ Attempt to call function of type ‘function (A : Int, B : Int, C : Int)’ with 1 fewer argument than needed
│
│ 4│
│ 5│> F (A, B => A) -- Missing argument C
│ 6│  F (C => 2) -- Missing arguments A, B
│
│ Note:
│ • Missing argument ‘C’ of type ‘Int’
└──────────────────
┌ Scratch.Lace:6:2:
│ Attempt to call function of type ‘function (A : Int, B : Int, C : Int)’ with 2 fewer arguments than needed
│
│ 5│  F (A, B => A) -- Missing argument C
│ 6│> F (C => 2) -- Missing arguments A, B
│ 7│ end
│
│ Notes:
│ • Missing argument ‘A’ of type ‘Int’
│ • Missing argument ‘B’ of type ‘Int’
└──────────────────
```
Handling Errors (still not that kind) Automagically
You may have noticed this line in that first error:

```
│ • This error can be automatically fixed
```
This is something that Zig is doing that I really like. Unambiguous errors like having an extra comma in a list, using an `unused` variable, etc. It's that joke that every beginner programmer independently discovers: if the compiler knows that a semicolon is missing, why can't it just put one there?

Well, in these simple cases, the compiler will (when told to via the `--autofix` flag) delete the erroneous characters or add the needed ones to make the code correct.
`unsafe`? Nah. `unchecked`? Now we're talking
I do want this language to be safe. But there are things that the compiler can't verify. Rust has used the `unsafe` keyword to denote places where it is up to the programmer to uphold invariants that the language expects to be upheld. While I do like this decision, I think the word `unchecked` is a little better for this. (Note: this strong opinion is weakly held and may change at a later date.)
```
String : public := array (Index range 1 .. <>) of Byte

Last_Is_Question_Mark : public := function (Data : String) return boolean do
 return Data ('last) = '?'
end
```
```rust
pub fn last_is_question_mark(data: &[u8]) -> bool {
    data[data.len() - 1] == b'?'
}
```
To me, this code is unsafe. Why? What happens when a programmer calls `Last_Is_Question_Mark` with an empty string? In Rust we would get a runtime panic, which is by Rust's definition safe (but not 'panic-safe', a separate category of safety). But I want a stricter definition of safety that includes things like array bounds.
```
┌ Oob.Lace:4:15:
│ This operation requires an ‘unchecked’ annotation
│
│ 3│ Last_Is_Question_Mark : public := function (Data : String) return boolean do
│ 4│>  return Data ('last) = '?'
│ 5│ end
│
│ Notes:
│ • This array's length is unknown, therefore attribute ‘'last’ may produce an invalid value when it is empty
│ • Check that the array's 'length attribute is not 0 to not need ‘unchecked’
└───────────────
```
Since `Data` is an array of runtime-only known length, it may be of length zero. The compiler knows this and knows to check any indexes to try to prove that the array is known to be of a certain length before trying to use any of its components.
Now we could use an `unchecked` annotation like the error says:
```
Last_Is_Question_Mark : public := function (Data : String) return boolean do
 return (unchecked Data ('last)) = '?'
end
```
Or we could check the length up front, like the error's note suggests:

```
Last_Is_Question_Mark : public := function (Data : String) return boolean do
 if Data'length = 0 do
  return false
 end

 return Data ('last) = '?'
end
```
The compiler tracks the `Data'length = 0` condition and knows that by the second `return`, it can't be true. Therefore the array index is safe.
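For comparison, the same shape in C (a sketch, not Lace): the early return is exactly the 'fact' a flow-sensitive checker can latch onto to prove the final index in bounds.

```c
#include <stdbool.h>
#include <stddef.h>

// Handle the empty case explicitly; past the early return, len != 0 holds,
// so data[len - 1] is provably in bounds.
bool last_is_question_mark(const char *data, size_t len) {
    if (len == 0)
        return false;  // fact established: beyond here, len != 0
    return data[len - 1] == '?';
}
```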
Wow, what a cool feature that definitely works all the time and produces correct results!
Now I won't lie and say I've implemented a full SMT solver or anything. In fact the current implementation is quite janky due to the IR not being a proper SSA or sea-of-nodes etc. So it is actually quite fragile. But the idea is there and works in the simple cases.

This is actually a common theme while I'm still designing the language, and is something I often do with any program. The first implementation is essentially built to be thrown away. It will be a janky hack that lets me get the general shape of the feature, that I can write tests around, and that shows how the rest of the implementation needs to accommodate the feature. Then one fateful day it gets ripped out and reimplemented with a more proper implementation that is informed by the mistakes and questions posed by the first.

This approach has been very educational for me in general, as it gives motivations for why other implementations work the way they do. By doing this you often run into the same mistakes that others do but that are glossed over in the literature. Like how trying to do semantic analysis on a hacked-together three-address-code IR instead of SSA was what finally made phi nodes in SSA click for me.
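To make the phi-node point concrete, here is roughly what the length-checked function from earlier looks like in an SSA-style IR (hypothetical instruction names, for illustration only):

```
entry:
  %len      = length %data
  %is_empty = eq %len, 0
  br %is_empty, label %empty, label %nonempty

empty:                        ; fact here: %len = 0
  br label %join

nonempty:                     ; fact here: %len /= 0, so 'last is in bounds
  %last = index %data, last
  %cmp  = eq %last, '?'
  br label %join

join:
  ; the phi picks a value per predecessor: each incoming edge carries the
  ; result computed under that branch's facts
  %result = phi [false, %empty], [%cmp, %nonempty]
  ret %result
```

In three-address code without phi nodes, both branches would write the same variable and the facts from each path get smeared together; SSA keeps each path's value (and its facts) distinct until the explicit merge.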
This ‘fact tracking’, as I've been calling it (if there is a better/established name for this, do let me know), will eventually form the basis of function contracts (that can be checked at compile time).
Function contracts are one of my favorite features of Ada (or any language really, but Ada is the only one I've really used with them).
- It limits the observable behavior of an API. If a behavior is observable, it will be relied upon. And if programmers are not willing to read the documentation (which they usually are only willing to skim), they will run some code, see that the effect they wanted to happen happened, and move on, not knowing that they are invoking undefined behavior.
- It is documentation that is verified by the compiler and/or runtime. If you've never had the pleasure of reading Ada's standard library headers (or specifications), they are beautifully documented. Not with comments, but with descriptive parameter names and contracts. Contracts like ‘If you push to this vector, its length will increase by one, or an exception will be thrown’. In terms of Ada, contracts can be asserted at runtime, or, with SPARK, verified at compile time.
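Since Lace's contracts are still in the future, here is a hedged C sketch of the runtime-checked flavor, in the spirit of the Ada vector contract just described (all names invented for illustration):

```c
#include <assert.h>
#include <stddef.h>

struct vector { size_t len; size_t cap; int *items; };

// Runtime-checked contract sketch: the asserts play the role of Ada's
// Pre/Post aspects.
void push(struct vector *v, int e) {
    size_t old_len = v->len;       // capture the 'Old state for the postcondition
    assert(v->len < v->cap);       // precondition sketch: room must exist
    v->items[v->len++] = e;
    assert(v->len == old_len + 1); // postcondition: length grew by exactly one
}
```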
That's all for now
There is definitely more to talk about, but it's all to do with code generation and backend stuff and deserves its own post.