hisham hm

🔗 Parsing and preserving Teal comments

The Teal compiler currently discards comments when parsing, and it constructs an abstract syntax tree (AST) which it then traverses to generate output Lua code. This is fine for running the code, but it would be useful to have the comments around for other purposes. Two things come to mind:

Today I spent some time playing with the lexer and parser and AST, looking into preserving comments from the input, with these two goals in mind.

The compiler does separate lexer and parser steps, both handwritten, and the comments were discarded at the lexer stage. That bit was easy. As I consumed each comment, I stored it as a metadata item to the following token (what if there’s no following token? I’m currently dropping the final comment if it’s the last thing in the file, but that would be trivial to fix).

The hard part of dealing with comments is what to do with them when parsing: the parser stage reads tokens and builds the AST. The question then becomes, to which nodes attach each comment as metadata. This is way trickier than it looks, because comments can appear everywhere:

--[[ comment! ]] if --[[ comment! ]] x --[[ comment! ]] < --[[ comment! ]] 2 --[[ comment! ]]  then --[[ comment! ]]
   --[[ comment! ]]
   --[[ comment! ]] print --[[ comment! ]] ( --[[ comment! ]] x --[[ comment! ]]  ) --[[ comment! ]]
   --[[ comment! ]]
--[[ comment! ]]  end --[[ comment! ]]

After playing a bit with a bunch of heuristics to determine to which AST nodes I should be attaching comments in attempts to preserving them all, I came to the conclusion that this is not the way to go.

The point of the abstract syntax tree is to abstract away the syntax, meaning that all those purely syntactical items such as then and end are discarded. So storing information such as “there’s a –[[ comment! ]] right before “then” and then another one right after, and then another in the following line—no, in fact two of them!—before the next statement” are pretty much antithetical to the concept of an AST.

However, there are comments for which there is an easy match as to which AST node they should attach. And fortunately, those are sufficient for the “TealDoc” documentation generation: it’s easy to catch the comment that appears immediately before a statement node and attach to it. This way, we get pretty much the whole relevant ground covered: type definitions, function declarations, variable declarations. Then, to build a “TealDoc” tool, it would be a matter of traversing the AST and then writing a small separate parser to read the contents of the documentation comments (e.g. reading `@param` tags and the like). I chose to attach comments to statements and fields in data structures, and to stop there. Especially for a handwritten parser, adding comment handling code can add to the noise quickly.

As for the second use-case, a formatter/beautifier, I don’t think that going through the route of the AST is the right way to go. A formatter tool could share the compiler’s lexer, which does read and collect all tokens and comments, but then it should operate purely in the syntactic domain, without abstracting anything. I wrote a basic “token dumper” in the early days of Teal that did some of this work, but it hasn’t been maintained — as the language got more complex, it now needs some more context in order to do its work properly, so I’m guessing it needs a state machine and/or more lookahead. But with the comments preserved by the lexer, it should be possible to extend it into a proper formatter.

The code is in the preserve-comments branch. So, I think with these two additions (comments fully stored by the lexer, and comments partially attached to the AST by the parser), I think we have everything we need to serve as a good base for future documentation and formatter tooling. Any takers on these projects?


🐦 Twitter🐘 MastodonRSS - posts in English, posts em Português, todos / all

Latest posts


Admin area