More Compiler Progress!

A dive into array types in my language

Posted Dec 03 2024

Written Mar 21 2023


Originally posted on Cohost.

Hey! Long time no compiler-chosting. A short time after writing my first short post about named arguments, I got super frustrated with the C programming language and decided to port my compiler to Zig. The short of it is that I had many memory errors (be it use after free/move (which moves were the worst since lots of things happened to be stored in an arena), use of uninitialized memory) etc. and I got sick of debugging them. Now Zig isn't exactly memory safe either (it does have safety features though), but I'm not proficient enough in Rust to want to rewrite my compiler in it.

Anywho, I am on the tail end of re-implementing arrays, and if you're familiar with Pascal or Ada, you know these aren't your ordinary, run-of-the-mill arrays. These are advanced arrays.

Let's say you want an array of some amount of bytes. In your plain ol' C you may define like so unsigned char my_byte_array[352], or Rust let mut my_byte_array: [u8; 352] = [0; 352];, or Zig var my_byte_array: [352]u8 = .{0} ** 352;. Disgusting. We can do better:

Lace code
My_Byte_Array : Byte_String := { others => 0 }

Byte_String : type := array [Index] of Byte
Index       : type := range 69 .. 420
Byte        : type := mod 2 ^ 8

Beautiful. Electrifying. Effervescent.

As you can clearly see here, 69 based indexing is the future and I don't think these so-called "modern" languages are equipped to handle it.

On a vaguely more serious note, I will die on the hill of 1-indexing, but still recognize the utility of a good 0-index every once and a while. So I'm taking the Ada route that arrays are indexable via any discrete type.

Arrays of Unknown/Runtime-Known size

Usually these aren't allowed to be automatically allocated, and must be dynamically allocated. The usual representation is a pointer to the data with the length next to that pointer. Rust and Zig call this a "slice", C++ calls this a std::span, and C calls this a fuck you an exercise left to the reader.

But while we're on the stealing-from-Ada train, we may as well pick up the concept of Unconstrained Arrays. Let's define our Index type to be a little more reasonable:

Lace code
Index : type := range 1 .. 2^64 - 1

And an array type to go along with it:

Lace code
Byte_String : type := array [Index range <>] of Byte

This allows us to have arrays of varying lengths that all belong to the Byte_String type (Recall that this language is Nominally typed, i.e. types have names and are incompatible if they have different names). We can use by specifying the array bounds for the values we want to store:

Lace code
Greeting : Byte_String [1 .. 13] := "Hello, World!"
Offset_Greeting : Byte_String [35 .. 47] := "Hello, World!"

Or we can let the compiler infer the bounds from the initial value

Lace code
-- Infers [1 .. 13]
Greeting : Byte_String := "Hello, World!"

-- <> means to infer the given bound when we don't
-- want to start from the first `Index`

-- Infers [35 .. 47]
Offset_Greeting   : Byte_String [35 .. <>] := "Hello, World!"

-- Infers [78 .. 90]
Offset_Greeting_2 : Byte_String [<> .. 90] := "Hello, World!"

One issue this creates is we now have to know the range that the array values were created with to get elements out of them. We have two solutions to this currently:

  1. Use the 'first attribute. This attribute get the first value of a discrete type, or of an array's index type.
Lace code
Offset_Greeting_2 [Offset_Greeting_2'first] := 'h'

-- The compiler infers that 'first by default should resolve
-- to Offset_Greeting_2'first so you don't have to type it twice
Offset_Greeting_2 ['first] := 'h'
Offset_Greeting_2 ['first + 1] := 'E'

(which admittedly, feels like 0-indexing with extra steps)

  1. Partially constrain the array type. Instead of having both the lower and upper bounds of the Byte_String type left to the user, we can constrain the lower bound to 1 (or 0, or 69). Or constrain the upper bound to something, if you're into that.
Lace code
Byte_String : type := array [Index range 1 .. <>] of Byte

-- Same as before
Greeting : Byte_String := "Hello, World!"

-- This one is now a compile error since 35 is not 1
Offset_Greeting   : Byte_String [35 .. <>] := "Hello, World!"

-- This one is now a compile error, because the array
-- assigned to it would need to be 90 elements long
Offset_Greeting_2 : Byte_String [<> .. 90] := "Hello, World!"

Zero Indexing

One argument for zero indexing is for when your array needs to act like a ring buffer and just doing a simple modulo operation is sufficient. This brings us to another type class of the language. Recall earlier I said that arrays could be indexed by any "discrete" type. Here "discrete" means having a (conceptually) finite number of (ordered) elements.

A vaguely more formal definition of "discrete"

For those of you more mathematically/theory inclined, a discrete type is a scalar type that represents a finite set of ordered elements, i.e. isomorphic to any finite subset of the integers.

In my handwavey definition, I said "conceptually". In my mind, this excludes floating point types even though in reality they can only represent a finite set of (ordered) values, conceptually they represent a continuum and have a not-so-intuitive successor/predecessor operation, a plethora of NaN representations, equality not being intuitive (the usual 0.1+0.2 /= 0.3), and other quirks that make them a bad fit for being discrete.

I do want to include fixed point types as a language feature, but am currently unsure if they should be considered discrete. Since they can be thought of as representing a continuum maybe they shouldn't, but conceptually they are also useful for representing a discrete set of values that are just not always integral. So I'm leaning more towards yes than no.

But for this wrapping case, we have to define an integer range as such

Lace code
Zero_Indexed_Range ::= range 0 .. Some_Length - 1

Which (imho) can be improved by baking the wrapping behavior into the type. Introducing modular types

Lace code
Wrapping_Index ::= mod Some_Length

As you can see above, the unary mod operator creates a modular type. This is similar to the Zero_Indexed_Range except when doing arithmetic it will automagically wrap around when its bounds are reached.

I've found that most of the time when I want zero indexing, I also want this wrapping behavior.

Lace code
My_Array : array [Wrapping_Index] of Byte := { others => 0 }

I : Wrapping_Index := 'last

My_Array [I] := 1
I := I + 1 -- wraps to zero
My_Array [I] := 0

That's it for now!

I've been in the porting mines for a couple months but now the rubber is starting to hit the road an I'll have more interesting things to post about soon