stz parsing

Photo by Say Cheeze Studios / Unsplash

Before we can write an interpreter or compiler we need to be able to parse the syntax of stz.

The rules are fairly simple so far. There are a few special characters:

start-bracket (
stop-bracket  )
start-block   [
stop-block    ]
start-make    {
stop-make     }
divider       |
return        ^
string-quote  ' " `
reference     &
keyword-break :
comment-start /*
comment-stop  */
comment-line  //
comma         ,
assignment    =
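As a rough illustration of how that table could drive a scanner, here is a minimal Python sketch. The token names come straight from the table; the function name `scan_special` and the "try two characters before one" strategy are assumptions made for the example, not part of stz.

```python
# Two-character tokens are checked first so "//" is not read as two "/" characters.
TWO_CHAR = {
    "/*": "comment-start",
    "*/": "comment-stop",
    "//": "comment-line",
}

ONE_CHAR = {
    "(": "start-bracket",
    ")": "stop-bracket",
    "[": "start-block",
    "]": "stop-block",
    "{": "start-make",
    "}": "stop-make",
    "|": "divider",
    "^": "return",
    "'": "string-quote",
    '"': "string-quote",
    "`": "string-quote",
    "&": "reference",
    ":": "keyword-break",
    ",": "comma",
    "=": "assignment",
}


def scan_special(source):
    """Return (name, text) pairs for the special tokens, skipping everything else."""
    tokens = []
    i = 0
    while i < len(source):
        two = source[i:i + 2]
        one = source[i]
        if two in TWO_CHAR:
            tokens.append((TWO_CHAR[two], two))
            i += 2
        elif one in ONE_CHAR:
            tokens.append((ONE_CHAR[one], one))
            i += 1
        else:
            i += 1  # not a special character; a real lexer would handle it elsewhere
    return tokens
```

For example, `scan_special("x = (1, 2) // done")` yields the assignment, start-bracket, comma, stop-bracket and comment-line tokens, in that order.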

Then there's the numbers:

integer      -0123456789 123456789
binary        0b01010101
octal        -0o07070707 0o07070707
hexadecimal  -0x0F0F0F0F 0x0F0F0F0F
float        -123456.789 123456.789
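The number forms above could be recognised with one regular expression per form. This is only a sketch in Python: the rules are inferred from the examples, and allowing a leading minus on binary and lower-case hexadecimal digits are both assumptions, since the examples don't show either.

```python
import re

# Prefixed forms come first so "0x0F" is not read as the integer "0" followed by junk.
NUMBER = re.compile(r"""
    (?P<hexadecimal> -?0x[0-9A-Fa-f]+ )
  | (?P<octal>       -?0o[0-7]+ )
  | (?P<binary>      -?0b[01]+ )
  | (?P<float>       -?[0-9]+\.[0-9]+ )
  | (?P<integer>     -?[0-9]+ )
""", re.VERBOSE)


def classify_number(text):
    """Return the name of the number form that matches all of `text`, or None."""
    match = NUMBER.fullmatch(text)
    return match.lastgroup if match else None
```

With these rules `classify_number("-0o07070707")` gives `"octal"` and `classify_number("0b2")` gives `None`.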

Strings will likely need more work to support interpolation and escape codes, but that can wait. It probably has little to do with the syntax of the language itself and more to do with meta-programming in the string library.

If we can parse identifiers and method signatures then we have all the building blocks of the language.

identifier    [a-zA-Z~][a-zA-Z0-9-_!?¿]*
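Transcribed into a real regex engine, the rule needs one small adjustment: an unescaped `9-_` inside a character class is read as a range that would also let in characters such as `:` and `@`. A Python sketch with the hyphen escaped (the helper name `is_identifier` is made up for the example):

```python
import re

# The identifier rule from above, with the hyphen escaped so it is a literal "-".
IDENTIFIER = re.compile(r"[a-zA-Z~][a-zA-Z0-9\-_!?¿]*")


def is_identifier(text):
    """True when the whole of `text` is a valid identifier under the rule above."""
    return IDENTIFIER.fullmatch(text) is not None
```

Under this reading `~self`, `ready?` and `kebab-case` are all identifiers, while `9lives` and `a:b` are not.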

It's a very permissive language so far. I'm not sure that's the wisest of ideas. I guess we'll see soon enough.

keyword        [a-zA-Z][a-zA-Z0-9-!@#$%*_+/\<>~]*
prefix         [!@#$%~]
unary          identifier
binary         [*_+÷/\<>≤≥] prefix*
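To get a feel for how these four rules overlap, here is a small Python sketch that reports every rule a piece of text satisfies. It assumes the `\` in the character classes is meant as a literal backslash, and the hyphens are escaped so they aren't read as ranges; `selector_kinds` is a hypothetical helper, not anything from stz.

```python
import re

# Direct transcriptions of the rules above (unary is just the identifier rule).
IDENTIFIER = re.compile(r"[a-zA-Z~][a-zA-Z0-9\-_!?¿]*")
KEYWORD = re.compile(r"[a-zA-Z][a-zA-Z0-9\-!@#$%*_+/\\<>~]*")
PREFIX = re.compile(r"[!@#$%~]")
BINARY = re.compile(r"[*_+÷/\\<>≤≥][!@#$%~]*")


def selector_kinds(text):
    """Return the name of every selector rule that matches the whole of `text`."""
    rules = {
        "prefix": PREFIX,
        "unary": IDENTIFIER,
        "keyword": KEYWORD,
        "binary": BINARY,
    }
    return [name for name, rule in rules.items() if rule.fullmatch(text)]
```

A plain name such as `add` comes back as both `unary` and `keyword`, so under this transcription it is the keyword-break that follows which tells the two apart. A lone `~` comes back as both `prefix` and `unary`, which is the sort of overlap worth checking before committing to the rules.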

That might be close enough to get something parsing. Let's see if we can define some of our structures:

expression
  method-call
  assignment-expression

method-call
  selector:prefix receiver:identifier?
  receiver:identifier? selector:unary
  receiver:identifier? selector:binary argument:sub-expression
  receiver:identifier? (selector:keyword keyword-break argument:sub-expression)+

sub-expression
  literal
  object-make
  start-bracket expression stop-bracket

literal
  literal-string
  literal-number
  literal-array
  literal-map
  literal-structure
  literal-enumeration
  
assignment-expression
  variable:identifier assignment one-expression (comma assignment-expression)*

method-declaration
  signature:method-call divider types:literal-array-body divider statements

literal-array
  start-bracket (divider type-expression)? literal-array-body stop-bracket

literal-array-body
  nothing
  (type-expression comma)+

This is enough of an exploration to see that parsing the language isn't ambiguous. No fancy backtracking is required. There's some guesswork with ( ... ) and [ ... ], but the moment we see what's between the |'s we know what we're dealing with.
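To make the "no backtracking" claim concrete, here is a minimal recursive-descent sketch in Python. It covers only a small assumed subset of the grammar (unary and binary method calls, integer literals and bracketed expressions), and the class, function and token names are all made up for the example. Every decision is taken by looking at the kind of the next token, and no token is ever re-read.

```python
import re

# Assumed token rules for the subset; tried in order at each position.
TOKEN_RULES = [
    ("number", re.compile(r"-?[0-9]+")),
    ("identifier", re.compile(r"[a-zA-Z~][a-zA-Z0-9\-_!?¿]*")),
    ("binary", re.compile(r"[*_+÷/\\<>≤≥][!@#$%~]*")),
    ("start-bracket", re.compile(r"\(")),
    ("stop-bracket", re.compile(r"\)")),
]


def tokenize(source):
    """Split source into (kind, text) pairs, skipping whitespace."""
    tokens, pos = [], 0
    while pos < len(source):
        if source[pos].isspace():
            pos += 1
            continue
        for kind, rule in TOKEN_RULES:
            match = rule.match(source, pos)
            if match:
                tokens.append((kind, match.group()))
                pos = match.end()
                break
        else:
            raise SyntaxError(f"unexpected character {source[pos]!r}")
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        """Kind of the next token, or None at the end of input."""
        return self.tokens[self.pos][0] if self.pos < len(self.tokens) else None

    def take(self, kind):
        """Consume the next token, which must be of the given kind."""
        if self.peek() != kind:
            raise SyntaxError(f"expected {kind}, found {self.peek()}")
        text = self.tokens[self.pos][1]
        self.pos += 1
        return text

    def method_call(self):
        receiver = self.take("identifier") if self.peek() == "identifier" else None
        if self.peek() == "binary":
            return ("binary", receiver, self.take("binary"), self.sub_expression())
        if self.peek() == "identifier":
            return ("unary", receiver, self.take("identifier"))
        if receiver is not None:
            # A lone identifier is a unary selector with no receiver.
            return ("unary", None, receiver)
        raise SyntaxError(f"expected a method call, found {self.peek()}")

    def sub_expression(self):
        if self.peek() == "number":
            return ("number", self.take("number"))
        self.take("start-bracket")
        inner = self.method_call()
        self.take("stop-bracket")
        return inner


def parse(source):
    return Parser(tokenize(source)).method_call()
```

With this subset, `parse("total + (count * 2)")` returns `("binary", "total", "+", ("binary", "count", "*", ("number", "2")))`, and `parse("list size")` returns `("unary", "list", "size")`.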

Identifiers tend to be described by what they can't have rather than what they do have.

I have half a mind to spend some of the weekend trying to implement the parser. The interpreter would come next. The interpreter would be enough to make a compiler and then the compiler could compile itself. That's always a fun test when making a programming language.