🔗 Compiler versus Transpiler: what is a compiler, anyway?

Teal was featured on HN today, and one of the comments was questioning the fact that the documentation states that it “compiles Teal into Lua”:

We need better and more rigorous terms in computing science. This use of the compiler word blurs the meaning of interpreted vs compiled languages.

I was under the assumption that it would generate executable machine code, not Lua source code.

I thought that was worth replying to because it allowed to dispel two misconceptions at once.

First, if we want to be rigorous about computer science terms, calling it “interpreted vs compiled languages” is a misnomer, because being interpreted or compiled is not a property of the language, but of the implementation. There have been things such as a C interpreter and an ahead-of-time compiler for PHP which generates machine code.

But then, we get to the main course, the use of “compiler”.

The definition of compiler has never assumed generating executable machine code. Already in the 1970s, Pascal compilers have generated P-code (a form of “bytecode” in Java parlance), which was then interpreted. In the 1980s, Turbo Pascal produced machine code directly.

I’ve seen the neologism “transpiler” being very frowned upon by the academic programming language community precisely because a compiler is a compiler, no matter the output language — my use of “compiler” there was precisely because of my academic background.

I remember people joking around on Academic PL Twitter jokingly calling it the “t-word” even. I just did a quick Twitter search to see if I could find it, and I found a bunch of references dating from 2014 (though I won’t go linking people’s tweets here). But that shows how out-of-date this blog post is! Academia has pretty much settled on not using the “compiler” vs. “transpiler” definition at all by now.

I don’t mind the term “transpiler” myself if it helps non-academics understand it’s a source-to-source compiler, but then, you don’t see people calling the Nim compiler, which generates C code then compiles it into machine code, a “transpiler”, even though it is a source-to-source compiler.

In the end, “compiler” is the all-encompassing term for a program that takes code in one language and produces code in another, be it high-level or machine language — and yes, that means that pedentically an assembler is a compiler as well (but we don’t want to be pedantic, right? RIGHT?). And since we’re talking assembler, most C compilers do not generate executable machine code either: gcc produces assembly, which is then turned into machine code by gas. So gcc is a source-to-source compiler? Is Turbo Pascal more of a compiler than gcc? I could just as well add an output step in the Teal compiler to produce an executable in the output using the same techniques of the Pascal compilers of the 70s. I don’t think that would make it more or less of a compiler.

As you can see, the distinction of “what is a transpiler” reduces to “what is source code” or “what is a high-level language”, the latter especially having a very fuzzy definition, so in the end my sociological observation on the uses of “transpiler” vs. “compiler” tends to boil down to people’s prejudices of “what makes it a Real, Hardcore Compiler”. But being a “transpiler” or not doesn’t say anything about the project’s “hardcoreness” either — I’m sure the TypeScript compiler which generates JavaScript is a lot more complex than a lot of compilers out there which generate machine code.

Posted by hisham on Thursday, February 25, 2021 16:27:01 in en_US, Coding, Computing, Language

🔗 Again on 0-based vs. 1-based indexing

André Garzia made a nice blog post called “Lua, a misunderstood language” recently, and unfortunately (but perhaps unsurprisingly) a bulk of HN comments on it was about the age-old 0-based vs. 1-based indexing debate. You see, Lua uses 1-based indexing, and lots of programmers claimed this is unnatural because “every other language out there” uses 0-based indexing.

I’ll brush aside quickly the fact that this is not true — 1-based indexing has a long history, all the way from Fortran, COBOL, Pascal, Ada, Smalltalk, etc. — and I’ll grant that the vast majority of popular languages in the industry nowadays are 0-based. So, let’s avoid the popularity contest and address the claim that 0-based indexing is “inherently better”, or worse, “more natural”.

It really shows how conditioned an entire community can be when they find the statement “given a list x, the first item in x is x[1], the second item in x is x[2]” to be unnatural. :) And in fact this is a somewhat scary thought about groupthink outside of programming even!

I guess I shouldn’t be surprised by groupthink coming from HN, but it was also alarming how a bunch of the HN comments gave nearly identical responses, all linking to the same writing by Dijkstra defending 0-based indexing as inherently better, as an implicit Appeal to Authority. (Well, I did read Dijkstra’s note years ago and wasn’t particularly convinced by it — not the first time I disagree with Dijkstra, by the way — but if we’re giving it extra weight for coming from one of our field’s legends, then the list of 1-based languages above gives me a much longer list of legends who disagree — not to mention standard mathematical notation which is rooted on a much greater history.)

I think that a better thought, instead of trying to defend 1-based indexing, is to try to answer the question “why is 0-based indexing even a thing in programming languages?” — of course, nowadays the number one reason is tradition and familiarity given other popular languages, and I think even proponents of 0-based indexing would agree, in spite of the fact that most of them wouldn’t even notice that they don’t call it a number zero reason. But if the main reason for something is tradition, then it’s important to know how did the tradition start. It wasn’t with Dijkstra.

C is pointed as the popularizer of this style. C’s well-known history points to BCPL by Martin Richards as its predecessor, a language designed to be simple to write a compiler for. One of the simplifications carried over to C: array indexing and pointer offsets were mashed together.

It’s telling how, whenever people go into non-Appeal-to-Authority arguments to defend 0-based indexes (including Dijkstra himself), people start talking about offsets. That’s because offsets are naturally 0-based, being a relative measurement: here + 0 = here; here + 1 meter = 1 meter away from here, and so on. Just like numeric indexes are identifiers for elements of an ordered object, and thus use the 1-based ordinal numbers: the first card in the deck, the second in the deck, etc.

BCPL, back in 1967, made a shortcut and made it so that p[i] (an index) was equal to p + i an offset. C inherited that. And nowadays, all arguments that say that indexes should be 0-based are actually arguments that offsets are 0-based, indexes are offsets, therefore indexes should be 0-based. That’s a circular argument. Even Dijkstra’s argument also starts with the calculation of differences, i.e., doing “pointer arithmetic” (offsets), not indexing.

Nowadays, people just repeat these arguments over and over, because “C won”, and now that tiny compiler-writing shortcut from the 1960s appears in Java, C#, Python, Perl, PHP, JavaScript and so on, even though none of these languages even have pointer arithmetic.

What’s funny to think about is that if instead C had not done that and used 1-based indexing, people today would certainly be claiming how C is superior for providing both 1-based indexing with p[i] and 0-based pointer offsets with p + i. I can easily visualize how people would argue that was the best design because there are always scenarios where one leads to more natural expressions than the other (similar to having both x++ and ++x), and how newcomers getting them mixed up were clearly not suited for the subtleties of low-level programming in C, and should be instead using simpler languages with garbage collection and without 0-based pointer arithmetic.

Posted by hisham on Monday, January 18, 2021 15:28:58 in en_US, Coding, Philosophy, Computing, Culture, Language

🔗 What’s faster? Lexing Teal with Lua 5.4 or LuaJIT, by hand or with lpeg

I ran a small experiment where I ported my handwritten lexer written in Teal (which translates to Lua) to lpeg, then ran the compiler on the largest Teal source I have (itself).

lexer time / total time
LuaJIT+hw   -  36 ms / 291 ms
LuaJIT+lpeg -  40 ms / 325 ms
Lua5.4+hw   - 105 ms / 338 ms
Lua5.4+lpeg -  66 ms / 285 ms

These are average times from multiple runs of tl check tl.tl, done with hyperfine.

The “lexer time” was done by adding an os.exit(0) call right after the lexer pass. The “total time” is the full run, which includes additional lexer passes on some tiny code snippets that are generated on the fly, so the change between the lpeg lexer and the handwritten lexer affects it a bit as well.

Looks like on LuaJIT my handwritten parser beats lpeg, but the rest of the compiler (lots of AST manipulation and recursive walks) seems to run faster on Lua 5.4.2 than it does on LuaJIT 2.1-git.

Then I decided to move the os.exit(0) further, to after the parser step but before the type checking and code generation steps. These are the results:

lexer+parser time
LuaJIT+hw   -  107 ms ± 20 ms
LuaJIT+lpeg -  107 ms ± 21 ms
Lua5.4+hw    - 163 ms ± 3 ms
Lua5.4+lpeg -  120 ms ± 2 ms

The Lua 5.4 numbers I got seem consistent with the lexer-only tests: the same 40 ms gain was observed when switching to lpeg, which tells me the parsing step is taking about 55 ms with the handwritten parser. With LuaJIT, the handwritten parser seems to take about 65 ms — what was interesting though was the variance reported by hyperfine: Lua 5.4 tests gave be a variation in the order of 2-3 ms, and LuaJIT was around 20 ms.

I re-ran the “total time” tests running the full tl check tl.tl to observe the variance, and it was again consistent, with LuaJIT’s performance jitter (no pun intended!) being 6-8x that of Lua 5.4:

total time
LuaJIT+hw   -  299 ms ± 25 ms
LuaJIT+lpeg -  333 ms ± 31 ms
Lua5.4+hw    - 336 ms ± 4 ms
Lua5.4+lpeg -  285 ms ± 4 ms

My conclusion from this experiment is that converting the Teal parser from the handwritten one into lpeg would probably increase performance in Lua 5.4 and decrease the performance variance in LuaJIT (with the bulk of that processing happening in the more performance-reliable lpeg VM — in fact, the variance numbers of the “lexer tests” in lpeg mode are consistent between LuaJIT and Lua 5.4). It’s unclear to me whether an lpeg-based parser will actually run faster or slower than the handwritten one under LuaJIT — possibly the gain or loss will be below the 8-9% margin of variance we observe in LuaJIT benchmarks, so it’s hard to pay much attention to any variability below that range.

Right now, performance of the Teal compiler is not an immediate concern, but I wanted to get a feel of what’s within close reach. The Lua 5.4 + lpeg combo looks promising and the LuaJIT pure-Lua performance is great as usual. For now I’ll keep using the handwritten lexer and parser, if only to avoid a C-based dependency in the core compiler — it’s nice to be able to take the generated tl.lua from the Github repo, drop it into any Lua project, call tl.loader() to register the package loader and get instant Teal support in your require() calls!

Update: Turns out a much faster alternative is to use LuaJIT with the JIT disabled! Here are some rough numbers:

JIT + lpeg: 327ms
5.4: 310ms
JIT: 277ms
5.4 + lpeg: 274ms
JIT w/ http://jit.off: 173ms
JIT w/ http://jit.off + lpeg: 157ms

I have now implemented a function which disables the JIT on LuaJIT and temporarily disables garbage collection (which provides further speed ups) and merged it into Teal. The function was appropriately named:

turbo(true)

Posted by hisham on Friday, January 8, 2021 19:24:53 in en_US, Coding, Computing, Language

🔗 Parsing and preserving Teal comments

The Teal compiler currently discards comments when parsing, and it constructs an abstract syntax tree (AST) which it then traverses to generate output Lua code. This is fine for running the code, but it would be useful to have the comments around for other purposes. Two things come to mind:

JavaDoc/Doxygen/LDoc/(TealDoc?) style documentation generated from comments
a code formatter that preserves comments

Today I spent some time playing with the lexer and parser and AST, looking into preserving comments from the input, with these two goals in mind.

The compiler does separate lexer and parser steps, both handwritten, and the comments were discarded at the lexer stage. That bit was easy. As I consumed each comment, I stored it as a metadata item to the following token (what if there’s no following token? I’m currently dropping the final comment if it’s the last thing in the file, but that would be trivial to fix).

The hard part of dealing with comments is what to do with them when parsing: the parser stage reads tokens and builds the AST. The question then becomes, to which nodes attach each comment as metadata. This is way trickier than it looks, because comments can appear everywhere:

--[[ comment! ]] if --[[ comment! ]] x --[[ comment! ]] < --[[ comment! ]] 2 --[[ comment! ]]  then --[[ comment! ]]
   --[[ comment! ]]
   --[[ comment! ]] print --[[ comment! ]] ( --[[ comment! ]] x --[[ comment! ]]  ) --[[ comment! ]]
   --[[ comment! ]]
--[[ comment! ]]  end --[[ comment! ]]

After playing a bit with a bunch of heuristics to determine to which AST nodes I should be attaching comments in attempts to preserving them all, I came to the conclusion that this is not the way to go.

The point of the abstract syntax tree is to abstract away the syntax, meaning that all those purely syntactical items such as then and end are discarded. So storing information such as “there’s a –[[ comment! ]] right before “then” and then another one right after, and then another in the following line—no, in fact two of them!—before the next statement” are pretty much antithetical to the concept of an AST.

However, there are comments for which there is an easy match as to which AST node they should attach. And fortunately, those are sufficient for the “TealDoc” documentation generation: it’s easy to catch the comment that appears immediately before a statement node and attach to it. This way, we get pretty much the whole relevant ground covered: type definitions, function declarations, variable declarations. Then, to build a “TealDoc” tool, it would be a matter of traversing the AST and then writing a small separate parser to read the contents of the documentation comments (e.g. reading `@param` tags and the like). I chose to attach comments to statements and fields in data structures, and to stop there. Especially for a handwritten parser, adding comment handling code can add to the noise quickly.

As for the second use-case, a formatter/beautifier, I don’t think that going through the route of the AST is the right way to go. A formatter tool could share the compiler’s lexer, which does read and collect all tokens and comments, but then it should operate purely in the syntactic domain, without abstracting anything. I wrote a basic “token dumper” in the early days of Teal that did some of this work, but it hasn’t been maintained — as the language got more complex, it now needs some more context in order to do its work properly, so I’m guessing it needs a state machine and/or more lookahead. But with the comments preserved by the lexer, it should be possible to extend it into a proper formatter.

The code is in the preserve-comments branch. So, I think with these two additions (comments fully stored by the lexer, and comments partially attached to the AST by the parser), I think we have everything we need to serve as a good base for future documentation and formatter tooling. Any takers on these projects?

Posted by hisham on Monday, December 28, 2020 20:05:50 in en_US, Coding, Language

🔗 Dynamic type systems aren’t even simpler

Alexis King just published a great blog post titled “No, dynamic type systems are not inherently more open”.

That reminded me of the talk I gave last year at FOSDEM, titled “Minimalism versus types”, where I advocated for static types from a slightly different angle. I tried to convince people that types in dynamically-typed programs are often more complicated than people realize. And often more complicated than in typical statically-typed languages.

People often say the opposite, that static type systems are more complicated, and dynamically-typed languages are simpler. At the surface level, this seems true: in a dynamic world you just go merrily declaring variables, assigning values and doing things with them, without ever having to write down any types, no matter how trivial or complex they are. Things can’t get any simpler in the typing department than “doing nothing”, right?

Well, types are nothing more than the shapes and allowed behaviors of your data. So it’s not like you don’t have shapes and behaviors in any program, regardless of the language… so, you have types, whether you write them or not. They are there, even in assembly language, even if at a conceptual level, as the sets of “valid values” your program can manipulate. And you have to think about them, and they have to make sense, and they have to do the right thing. So, in short, in a correct dynamically-typed program the types need to be just as correct as they are in a statically-typed one (or else you’ll get runtime errors).

In other words, the types are there, but you have to run the type checker in your head. And you know what’s funny? When people don’t write down the types, they often end up with types that are often more complicated than the types from people who do write them. The complexity just goes under the radar, so it piles up.

One day you open that module which you haven’t touched in six months, and you see a function call where the third argument is null. You need to remember what kinds of variables you can pass to that third argument, or read the whole source code to figure it out. You follow through the code to see all places that third argument is used and realize the accepted type of the third argument depends on what you give to the second argument. Congratulations, you’re dealing with a dependent type, which means you’ve just surpassed Haskell in terms of type system complexity. Compilers that deal with this kind of type system are so complex they are effectively proof assistants (and are at the forefront of programming language research), and here you are dealing with those data types with your brain (and your faith in your ability to come up with sufficient tests) alone.

Given that there is no mechanical type checker to prescribe what is expressible, and that the dynamic runtime will accept anything as long as the program doesn’t crash, when doing typechecking in your head you essentially have the world’s most powerful and complicated type checker at your disposal. And once you start making use of this power, you end up dealing with the world’s most complicated type system.

And when you give people expressive power, they use it. In my experience, people go wild constructing complicated structures in dynamic languages that they normally wouldn’t in static languages. It’s not that static languages are less powerful (Turing equivalence, blah blah), but they make the things you’re doing more obvious to you (Alexis’s post has some great examples). In a dynamically-typed program people are all to keen to make variables and data structures perform double or triple duty (“this is a X but also a Y under circumstances Z”), but when they have to write down what they’re doing as types, it’s like a little conscience check, and they think twice before creating a more complex type for something that could be described in a simpler way (simple example: they’ll probably make two plain functions instead of making one function that takes a string argument that changes the behavior of other arguments). Static types nudge you towards simpler, less “clever” solutions (and we all know what kind of solution is more maintainable in the long run).

But okay, let’s assume we avoid “clever” and pick the same solutions in either. Writing the same program in a static or a dynamic language to process the same data in the same way, you will end up with values of roughly the same types in both. The fact that the variables have static types or not does not change that.

“But in a dynamic language I don’t have to write the types! It’s less work!”

“Not having to” write types but having to think about them anyway is like doing math “not having to” write anything down and doing all calculations in your head. How is it an advantage to not use pen and paper to track down your train of thought as you do a complex calculation, and instead be restricted to do it only in your head? And how is an advantage to not have a mechanical tool — like a calculator, which can actually compute the things you wrote down — to check whether what you wrote with pen and paper makes sense?

I’m lazy, so I hate doing any math in my head. I’ll take writing things down and have a machine check it for me any day of the week. Why wouldn’t I want the same when programming? That’s what computers are for, right? To save us from computing things in our head. So I’ll write my types, and have the compiler check whether they make sense, thank you very much. It’s less work.

Posted by hisham on Monday, January 20, 2020 03:11:12 in en_US, Coding, Language

🐘 Mastodon ▪ RSS (English), RSS (português), RSS (todos / all)

🔗 Compiler versus Transpiler: what is a compiler, anyway?

🔗 Again on 0-based vs. 1-based indexing

🔗 What’s faster? Lexing Teal with Lua 5.4 or LuaJIT, by hand or with lpeg

🔗 Parsing and preserving Teal comments

🔗 Dynamic type systems aren’t even simpler

Follow

Last 10 entries

Search

Admin