stz strings

Strings are a powerful beast. They can come in many different forms and encodings. Some have character escaping and others have interpolation. In a previous blog post we realised we needed more than just plain characters for our strings.

The first rule of thumb is that all compile-time strings will be arrays of Unicode code points. Right now that means each character is a 32-bit value. It might become 64-bit in the future if Unicode loses its mind further.

The compiler will then 'reduce' the Unicode string to a more compact representation. If all the code points are ≤127 they can be stored as an array of unicode-character-8s. Likewise, if they are all ≤32767 they can be reduced to unicode-character-16.
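To make the reduction step concrete, here is a minimal sketch in Python rather than stz. The function name narrowest_width is made up for illustration, and the thresholds are simply the ones quoted above; the real compiler may well decide differently.

```python
def narrowest_width(text: str) -> int:
    """Return the smallest element width in bits (8, 16 or 32) that holds every code point.

    The 127 and 32767 thresholds are taken from the post; the function itself
    is a hypothetical stand-in for the compiler's reduction step.
    """
    # Python strings iterate by code point, which matches the 'pure' form.
    highest = max((ord(ch) for ch in text), default=0)
    if highest <= 127:
        return 8
    if highest <= 32767:
        return 16
    return 32


print(narrowest_width("hello"))            # 8
print(narrowest_width("h\u00e9llo"))       # 16, because é is code point 233
print(narrowest_width("hi \U0001F4A9"))    # 32, because the emoji is above 32767
```

Note that a single wide character is enough to push the whole string up a size, which is the trade-off discussed further down.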

The compiler will also cast between string types. If you declare the string as a utf8-string then it will be converted at compilation time. This is especially useful when you want a utf16-string for interfacing with, say, win32 or macOS.

The in-memory representation should be as black-box as possible, with the caveat that this is a systems language: if you want access to the bytes, you can have access to the bytes.

Not all string encodings are made equal. Accessing characters by index in utf8 and utf16 gives you code units rather than whole characters: in utf8 you can land on one byte of a multi-byte sequence, and in utf16 on one half of a surrogate pair, so you have to be aware of both. The 'pure' unicode strings are much easier to work with, but if they are mixed content (i.e. latin characters + the poop emoji) then you suddenly bloom back up to 2x or 4x the size for every character.
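A quick way to see the difference is to encode the same two-character string three ways and count the units. This uses Python's standard codecs purely as a demonstration; none of it is stz code.

```python
import struct

text = "a\U0001F4A9"  # 'a' followed by the poop emoji: two code points

utf8 = text.encode("utf-8")                              # sequence of bytes
utf16 = struct.unpack("<3H", text.encode("utf-16-le"))   # 16-bit code units
utf32 = struct.unpack("<2I", text.encode("utf-32-le"))   # 32-bit code points

print(len(utf8), len(utf16), len(utf32))  # 5 3 2

# Index 1 means something different in each encoding:
print(hex(utf8[1]))   # 0xf0    -> first byte of a four-byte sequence
print(hex(utf16[1]))  # 0xd83d  -> high half of a surrogate pair
print(hex(utf32[1]))  # 0x1f4a9 -> the whole emoji
```

Only the 32-bit form gives back the emoji at index 1, but it also spends four bytes on the plain 'a'.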

It's a trade-off between how easy the string type is to use and how much memory it takes. It's the developer's choice really.

The first string literal we will define then is a 'raw' string. No escaping, no interpolation. It can contain anything except its own terminator:

`a raw string`
eos`a raw string that can contain back-quotes: ` eos`

To solve the problem of needing to escape the character we quoted the string with, we can prefix the string with a string-terminator substring. The string then only ends where that same substring appears followed by the quote character. We're trading compilation time for ease of programming. This allows us to have a back-quoted string and include the back-quote in the string.
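Here is a rough sketch of how a scanner might find the end of such a string. It is Python, not stz, the name read_raw_string is hypothetical, and it assumes the rule is exactly "whatever comes before the opening quote, followed by the quote, ends the string", which is my reading of the examples above.

```python
def read_raw_string(source: str, quote: str = "`"):
    """Read a raw string literal that starts at the beginning of source.

    Returns (body, index just past the literal). Hypothetical helper: it
    assumes the terminator is the text before the opening quote plus the quote.
    """
    open_at = source.index(quote)        # first quote opens the string
    prefix = source[:open_at]            # e.g. "eos", or "" when there is no prefix
    terminator = prefix + quote          # e.g. "eos`"
    body_start = open_at + len(quote)
    end = source.find(terminator, body_start)
    if end == -1:
        raise ValueError(f"unterminated string, expected {terminator!r}")
    return source[body_start:end], end + len(terminator)


body, _ = read_raw_string("`a raw string`")
print(body)  # a raw string

body, _ = read_raw_string("eos`a raw string that can contain back-quotes: ` eos`")
print(body)  # a raw string that can contain back-quotes: `  (with a trailing space)
```

The extra cost is a substring search instead of a single-character comparison, which is the compilation time being traded away.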

Next up we have escape characters. These are things like \n, \r and \t, or a hex code for a unicode character such as \xB000. A raw string cannot use these because it doesn't interpret anything inside of it:

'a string\n\twith escapes'
my-eos'it's nice to be able to use single quote ' where-ever\nand escape characters!\nmy-eos'

And finally, one more kind of string: one with interpolation. This is the one the compiler will split up into many substrings and write some code for you. Put another way, it's a template string:

name = 'Jane'
"Good morning ~[name],\nHow are you today?"
⛷"Good morning ~[name],\nHow are you "today"?⛷"

"Good morning ~{name},\nHow are you today?"
⛷"Good morning ~{name},\nHow are you "today"?⛷"

"Good morning ~(name),\nHow are you today?"
⛷"Good morning ~(name),\nHow are you "today"?⛷"

"Good morning ~“name”,\nHow are you today?"

"Good morning ~‘name’,\nHow are you today?"

"Good morning ~⛷name⛷,\nHow are you today\~?"

You also gain the extra escape code \~ in case you need a literal ~. The way this works is that after an unescaped ~ you can place any character you want. If it's a bracket or 'smart' quote character, the substitution ends at the matching closing bracket or smart quote; otherwise it ends at the next occurrence of the character you started with.
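The delimiter rule can be sketched as a small splitter. Again this is Python rather than stz, and split_template, render and CLOSERS are names I made up for the sketch. It assumes a substitution ends at the first closing delimiter (no nesting), leaves other escape codes untouched, and stands in for "run the code" with a plain name lookup.

```python
# Openers that close with a different character; anything else closes with itself.
CLOSERS = {"[": "]", "{": "}", "(": ")", "\u201c": "\u201d", "\u2018": "\u2019"}


def split_template(body: str):
    """Split a template body into ("text", ...) and ("code", ...) parts."""
    parts, text, i = [], [], 0
    while i < len(body):
        ch = body[i]
        if ch == "\\" and i + 1 < len(body):
            nxt = body[i + 1]
            # \~ becomes a literal ~; other escapes are passed through as-is.
            text.append("~" if nxt == "~" else ch + nxt)
            i += 2
        elif ch == "~" and i + 1 < len(body):
            opener = body[i + 1]
            closer = CLOSERS.get(opener, opener)
            end = body.find(closer, i + 2)  # assumption: first closer wins
            if end == -1:
                raise ValueError(f"substitution opened with {opener!r} never closes")
            if text:
                parts.append(("text", "".join(text)))
                text = []
            parts.append(("code", body[i + 2:end]))
            i = end + 1
        else:
            text.append(ch)
            i += 1
    if text:
        parts.append(("text", "".join(text)))
    return parts


def render(parts, scope):
    """Stand-in for evaluation: look each code part up as a name in scope."""
    return "".join(v if kind == "text" else str(scope[v.strip()]) for kind, v in parts)


parts = split_template("Good morning ~[name], how are you today\\~?")
print(parts)
# [('text', 'Good morning '), ('code', 'name'), ('text', ', how are you today~?')]
print(render(parts, {"name": "Jane"}))
# Good morning Jane, how are you today~?
```

Because the splitter only ever looks at the literal's own text, a substituted value containing ~ or a quote is never re-scanned.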

We want to avoid trapping ourselves in 'escape code hell' by making it as easy as possible to pick alternate quoting both for the string and for the substitutions.

The code between the brackets will run in the same scope as the string. It can be any code you want; however, for your own sanity I recommend pre-computing the values you're going to substitute into the string. It's easier to read, especially if you need to do if statements.

I chose the ~ character because in closures it means that the variable in question is 'external' to the block closure's context. I thought it a nice mirror that here the substitution is 'external' to the string itself.

Substitutions cannot end a string. The end-of-string prefix is only recognised in the hard-coded text of the string, never in the substituted content.

It is high time a version of Smalltalk had 'nice to code with' string templating/interpolation.