Link to The Pragmatic Bookshelf.
After attending standard so-called “compiler” courses at the university (“formal languages and grammars” and “construction of compilers”), I saw that they were weakly related to the real world.
Real compilers are never written exactly as it is described in standard theoretical books. I was feeling like I lack some knowledge - for example, no grammar will immediately understand what the entry T(i)
means in C++, since in order to do this you need to know what T
and i
mean, and grammars are just not able to do that. 😟
The “Language Implementation Patterns” book is super informative and contains real-life theory with code for implementing language tools. It considers all sorts of examples like: byte code interpreter, static code analyzer, C/C++ compiler (its limited version, more precisely) and much more.
There are 31 “patterns” in the book from more simpler to more complex, each pattern describes either a data structure, or an algorithm, or a typical architecture for language tools.
The patterns are divided into 4 parts: input reading (I
), input analysis (II
), input interpretation (III
), output generation (IV
).
The simplest applications use only I
, complex ones use I+II+III
or I+II+IV
.
An example from my older article: part I
explains “lexical analysis”, part II
- “syntactic analysis”/”parsing”, part IV
- “code generation”.
The book does not invent useless wheels, but rather immediately gives the correct architecture or describes an auxiliary tool 🙂
For example, the entire part I
can be covered by the ANTLR parser generator (by the way the author of the book is the ANTLR developer), and you don’t have to write your own parser.
This saves a lot of time, because a human-readable grammar (for example, in the Backus-Naur form) may have ultimately non-obvious parsing rules, and you’re gonna need to calculate the FIRST and FOLLOW sets yourself, and it’s not so simple, there is a university course which is dedicated to this… But the parser generator does everything for the programmer according to a human-readable grammar description.
It would seem that not so many people need to make a compiler for something? But this book will still be useful to anyone who makes their own DSL, as well as tools for code refactoring, formatting, static analysis, or metrics.
My comments on the book:
- 🤔 Remember that the sample “compiler for C++” developed in the book is made for a limited subset of language. In the real world, the analyzer for C++ is not a LL(1)- or LL(*)-analyzer, but rather LL(k), where k is the file size…
- 🤔 All the code in the book is written in Java (ANTLR is also implemented in it), maybe someone doesn’t like this language.
- 🤔 In one of the patterns, there is an overoptimization for the “token buffer” - tokens read from the file are immediately thrown out of memory. But even in C++, where after the preprocessor stage the file could inflate to millions of LOC (and N times more tokens), everything is stored in memory.
- 🤔 The book advertises the ANTLR parser generator, but using it for a number of complex tasks looks creepy and is more incomprehensible than if it were done by writing the code. An example is when we “optimize” the intermediate representation in the language designed for encoding operations with vectors:
1
4 * [0, 5*0, 3] -> [4*0, 4*5*0, 4*3] -> [0, 0, 4*3]
it is proposed to do this directly via the ANTLR config:
1 2 3
scalarVectorMult : ^('*' INT ^(VEC (e+=.)+)) -> ^(VEC ^('*' INT $e)+) ; zeroX : ^('*' a=INT b=INT {$a.int==0}?) -> $a ; // 0*x -> 0 xZero : ^('*' a=INT b=INT {$b.int==0}?) -> $b ; // x*0 -> 0
- 🤔 The book does not describe machine code generation and complex optimizations at all. This is a separate topic and most people will never need this knowledge. But we must keep in mind that, relatively speaking, the book roughly shows how the C++ code is translated into the LLVM IR, that is, the first half of the compilation pipeline.
Literally on the last page, the LLVM is mentioned and is recommended to use for compilable languages, since it already has all optimization passes and everything else out of the box.