Another Year of Compiler Development

A look at a year of compiler development

Posted Dec 03 2024


Originally written for Cohost, but uh... It aint here anymore and I didn't finish this in time to post it.

It's been a hot minute/year since I wrote one of these, so please accept this humble smattering of a bunch of disparate shit smörgåsbord of bespoke programming language features served up hot and fresh.

In no particular order:

Name

The current name I'm going with is Lace. Since I'm stealing borrowing so much from Ada, I figure I'd also pay homage with the name. So it was either Love or Lace, and Love is taken by a wonderful Lua 2D game engine, so Lace it is.

Aesthetics

A common theme of this project is stealing things from Ada and syntax is no exception.

let, mut, var, const, let's call the whole thing off

Mutability is needed[citation needed] in a systems-level language, but is not a reasonable default. So symbols need some way to denote that they refer to mutable data.

Since I already steal so much from Ada's syntax, I've just inverted it. (Ada has mutability by default)

Ada code
Some_Mutable_Data   : Some_Type := 123
Some_Immutable_Data : constant Some_Type := 123
Lace code
Some_Mutable_Data   : mutable Some_Type := 123
Some_Immutable_Data : Some_Type := 123

The general shape of a declaration is name ':' modes expression? ':=' expression and mutable lies under this modes part and, spoiler alert, there are more to come later.

Later is now

Modes are essentially a way to add extra properties to a declaration. The current modes are:

Aliasing

This section is half the current implementation, and half hopeful speculation.

In concrete, code-writing terms, the aliased mode lets you use the 'access and 'address attributes on a symbol or any of its components. Essentially, allowing you to take a pointer to it (i.e. access it through a different name, or an alias).

In abstract, specification-esque terms, aliased mode makes the memory layout of the data referred to by the aliased symbol observable. Operations that observe memory layouts are things like taking pointers/access to it, reinterpret casting, and storing to or loading from any memory address. On non-aliased data, the compiler is/should be allowed to rearrange data as it sees fit. Record components need not be stored in order or even next to each other. If an array isn't dynamically indexed, just treat each component as a local variable and do normal register allocation based on their usage. Etc. etc.

I'm sure other languages have similar definitions about data that isn't aliased, but it is nice to have a marker in the language for this kind of thing (at least in my humble opinion).

A note on strict aliasing

While I try not to compare things to C since it is the lowest bar of both features and UX/DX, I think this is worth mentioning. Strict aliasing is a rule in the C (and C++) standard that essentially states that memory is typed (or effectively typed) by either its declaration or in the case of allocated objects (i.e. malloced pointers), its first write.

This means that a lot of C code is non-conforming (to almost no one's surprise).

For example, the relatively innocent and common pattern of reading into a local char buffer to then reinterpret into some protocol's data is technically incorrect.

C code
char buffer[1024];
ssize_t count = read(socket_fd, buffer, sizeof buffer);
// handle count < 0
struct some_data const *data = ((struct some_data *)buffer);
return data.thingy;

Since buffer's effective type is char[1024], reading it as a struct some_data is undefined behavior. (Note: even though buffer is of a character type, writing to it as a non-character type is still UB)

You may also notice that this rule makes it so that custom allocators in userland are almost always non-conforming unless they call out to malloc to actually acquire the memory. This is because the implementation of the standard library is blessed by the standard to do implementation-defined things that would otherwise be non-conformant.

I am striving to avoid having any sort of blessed library that is required to implement basic features in userland (such as dynamic allocators) and allowing data to alias is a big first step.

A note about reinterpreting data

The standard-blessed way of reinterpreting data is via memcpy or through a union (only in C, not C++). This is fine for small data and compilers usually have optimizations for when they realize that a memcpy is just being used as a reinterpret. But what about the extremely common case where the data has a flexible array member? (Bonus points if that array is not a character array. Bonus bonus points if that data isn't aligned in the protocol) While I'm more up on variable-length-arrays than most people, it is something that the standard allows the implementation to omit. So we are back to using memcpy to copy out chunks of this array?

On a practical note, since reinterpreting character buffers is so common, no actual compiler is going to do the wrong thing here... probably. Especially across linkage boundaries. So in actual code writing terms this is fine, but in terms of actually defining a language? Goofy.

If you want my take summed up: Attempting to write standard-compliant C is a complete farce. Just say you only support a handful of compilers and rely on their implementation defined behavior and extensions.

Error handling (no, not that kind)

For the ease of implementation, the compiler used to completely bail on the first syntax or semantic error it came across. In terms of writing a compiler, this is great! Something goes wrong? Hit da bricks!

As a user of the compiler? This shit sucks. You're tellin' me since I added an extra comma you gotta start the whole compilation over?

Well now the compiler can sort of handle multiple errors. There are still many cases where it just bails when it should just abandon the current function/block. But hey, it's progress.

Error handling (no, still not that kind)

Don't you love it when all error messages are clumped together in your terminal? And they print out a whole bunch of source code unrelated to the actual error?

Terminal/Shell output
bingus.c:100:23: error: Some shit happend
              thang x = { blah }
              ^^^^^
bingus.c:100:23: note: This shit sucks dude, try better
              thang x = { blah }
              ^^^^^
bingus.c:420:69: note: Oh brother, this gUY STINKS!
...
...

Yeah, me neither

This is definitely getting better with newer languages discovering the technology of box drawing characters. And Lace is following suit.

Take this code for example:
Lace code
Main : public := function do
	A : Int := 1
	B ::= A       -- Unused symbol B

	F (A, B => A) -- Missing argument C
	F (C => 2)    -- Missing arguments A, B
end

F ::= function (A : Int, B : Int, C : Int) do
	pragma Discard (A)
	pragma Discard (B)
	pragma Discard (C)
end

Int ::= range 1 .. 10

As the comments say, this will produce a few errors. But note how I try to visually group all each error into its own distinct bundle via line-drawing characters and whitespace.

Terminal/Shell output
3 errors in Scratch.Lace, 1 is automatically fixable:
   ┌ Scratch.Lace:3:2:
   │ Symbol ‘B’ should be annotated as unused
   │
   │ 2│      A : Int := 1
   │ 3│>     B ::= A -- Unused symbol B
   │ 4│
   │
   │ • This error can be automatically fixed
   └──────────────────

   ┌ Scratch.Lace:5:2:
   │ Attempt to call function of type ‘function (A : Int, B : Int, C : Int)’ with 1 fewer argument than needed
   │
   │ 4│
   │ 5│>     F (A, B => A) -- Missing argument C
   │ 6│      F (C => 2)    -- Missing arguments A, B
   │
   │ Note:
   │    • Missing argument ‘C’ of type ‘Int’
   └──────────────────

   ┌ Scratch.Lace:6:2:
   │ Attempt to call function of type ‘function (A : Int, B : Int, C : Int)’ with 2 fewer arguments than needed
   │
   │ 5│      F (A, B => A) -- Missing argument C
   │ 6│>     F (C => 2)    -- Missing arguments A, B
   │ 7│   end
   │
   │ Notes:
   │    • Missing argument ‘A’ of type ‘Int’
   │    • Missing argument ‘B’ of type ‘Int’
   └──────────────────
This is a lot nicer for me, personally.

Handling Errors (still not that kind) Automagically

You may have noticed in that first error

Terminal/Shell output
• This error can be automatically fixed

This is something that Zig is doing that I really like. Unambiguous errors like having an extra comma in a list, using an unused variable, etc. It's that joke that every beginner programmer independently discovers: If the compiler knows that a semicolon is missing, why can't it just put one there?

Well in these simple cases, the compiler will (when told to via the --autofix flag) delete the erroneous characters or add the needed ones to make the code correct.

unsafe? Nah. unchecked? Now we're talking

I do want this language to be safe. But there are things that the compiler can't verify. Rust has used the unsafe keyword to denote places where it is up to the programmer to uphold invariants that the language expects to be upheld. While I do like this decision I think the word unchecked is a little better for this. (Note: This strong opinion is weakly held and may change at a later date.)

Take the following code:
Lace code
String : public := array (Index range 1 .. <>) of Byte

Last_Is_Question_Mark : public := function (Data : String) return boolean do
	return Data ('last) = '?'
end
And the equivalent Rust:
Rust code
pub fn last_is_question_mark(data: &[u8]) -> bool {
	data[data.len() - 1] == b'?'
}

To me, this code is unsafe. Why? What happens when a programmer calls Last_Is_Question_Mark with an empty string? In Rust we would get a runtime panic, which is by Rust's definition safe (but not 'panic-safe', a separate category of safety). But I want a stricter definition of safety that includes things like array bounds.

Terminal/Shell output
    ┌ Oob.Lace:4:15:
    │ This operation requires an ‘unchecked’ annotation
    │
    │   3│   Last_Is_Question_Mark : public := function (Data : String) return boolean do
    │   4│>     return Data ('last) = '?'
    │   5│   end
    │
    │ Notes:
    │    • This array's length is unknown, therefore attribute ‘'last’ may produce an invalid value when it is empty
    │    • Check that the array's 'length attribute is not 0 to not need ‘unchecked’
    └───────────────

Since Data is an array of runtime-only known length, it may be of length zero. The compiler knows this and knows to check any indexes to try to prove that the array is known to be of a certain length before trying to use any of its components.

Now we could use an unchecked annotation like the error says

Lace code
Last_Is_Question_Mark : public := function (Data : String) return boolean do
	return (unchecked Data ('last)) = '?'
end
or we could do as the second note says and check that the array's length is non-zero:
Lace code
Last_Is_Question_Mark : public := function (Data : String) return boolean do
	if Data'length = 0 do
		return false
	end
	return Data ('last) = '?'
end
And this code will compile. The compiler is able to track the control flow of the Data'length = 0 condition and knows that by the second return, it can't be true. Therefore the array index is safe.

Wow, what a cool feature that definitely works all the time and produces correct results!

Now I won't lie and say I've implemented a full SMT solver or anything. In fact the current implementation is quite janky due to the ir not being a proper ssa or sea-of-nodes etc. So it is actually quite fragile. But the idea is there and works in the simple cases.

This is actually a common theme while I'm still designing the language and is something I often do with any program. The first implementation is essentially built to be thrown away. It will be a janky hack that lets me get the general shape of the feature that I can write tests around and figure out how the rest of the implementation needs to accomodate the feature. Then one fateful day it gets ripped out and reimplemented with a more proper implementation that is informed by the mistakes and questions posed by the first.

This approach has been very educational for me in general as it gives motivations for why other implementations work the way they do. By doing this you often run into the same mistakes that others do, but are glossed over in the literature. Like how trying to do semantic analysis on a hacked together three address code ir instead of ssa was what finally made phi-nodes in ssa click for me.

This ‘fact tracking’ as I've been calling it (if there is a better/established name for this do let me know) will eventually form the basis of function contracts (that can be checked at compile time).

Function contracts are one of my favorite features of Ada (or any language really, but Ada is the only one I've really used with them).

  1. It limits the observable behavior of an api. If a behavior is observable it will be relied upon. And if programmers are not willing to read the documentation (which they usually are only willing to skim), they will run some code, see that the effect they wanted to happen happened, and move on, not knowing that they are invoking undefined behavior.
  2. It is documentation that is verified by the compiler and/or runtime. If you've never had the pleasure of reading Ada's standard library headers (or specifications), they are beautifully documented. Not with comments, but with descriptive parameter names and contracts. Contracts like ‘If you push to this vector, its length will increase by one, or an exception will be thrown’ In terms of Ada, contracts can be asserted at runtime, or with SPARK, verified at compile time.

That's all for now

There is definitely more to talk about, but it's all to do with code generation and backend stuff and deserves its own post.