clang-format: an architectural failure? 🩼

I wrote about clang-tidy on this blog previously. It is used to check code quality. It has a modular architecture - anyone can write their own check, and the checks can be used independently of each other. It works at the [AST level](https://clang.llvm.org/docs/IntroductionToTheClangAST.html), that is, the code goes through lexical and syntactic analysis before the tool sees it.

clang-format is another tool, used for formatting source code - so that there is the right number of spaces, the #include directives are sorted, and so on. It works at the token level, that is, the code goes through only lexical analysis before the tool sees it.

As a result, clang-format understands only approximately what the current token does and what needs to be done with it. For example, the code

  
class A: B {

is seen as this sequence of tokens:

(kw_class) (identifier) (colon) (identifier) (l_brace)
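To illustrate what "only lexical analysis" means, here is a minimal sketch of a toy lexer that produces the sequence above. The names TokKind, Tok and lex are hypothetical and chosen for this example only; this is not clang's lexer, just an illustration that a token stream carries no information about which identifier is the class and which is the base.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token kinds, named after the labels used in the post.
enum class TokKind { kw_class, identifier, colon, l_brace, r_brace, unknown };

struct Tok {
    TokKind kind;
    std::string text;
};

// Toy lexer: splits the input into tokens and knows nothing about grammar.
std::vector<Tok> lex(const std::string &src) {
    std::vector<Tok> out;
    std::size_t i = 0;
    while (i < src.size()) {
        unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) {
            ++i;
            continue;
        }
        if (std::isalpha(c) || c == '_') {
            // Read a whole word, then decide: keyword or plain identifier.
            std::size_t start = i;
            while (i < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i])) ||
                    src[i] == '_'))
                ++i;
            std::string word = src.substr(start, i - start);
            TokKind k =
                word == "class" ? TokKind::kw_class : TokKind::identifier;
            out.push_back({k, word});
            continue;
        }
        // Single-character punctuation; anything else is "unknown".
        TokKind k = c == ':'   ? TokKind::colon
                    : c == '{' ? TokKind::l_brace
                    : c == '}' ? TokKind::r_brace
                               : TokKind::unknown;
        out.push_back({k, std::string(1, src[i])});
        ++i;
    }
    return out;
}
```

Calling lex("class A: B {") yields five tokens. Both A and B come out as plain identifier tokens, so any rule that wants to treat them differently has to guess from the neighbouring tokens.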

clang-format then applies a series of hardcoded rules to these tokens, using various ad-hoc devices such as a nesting stack for brackets. There is no modularity at all: all the rules are written somewhere deep in the source code of the tool.

For example, at some point in the middle of its work, the WhitespaceManager::generateReplacements method is called, which fixes up whitespace, and inside it the WhitespaceManager::alignArrayInitializers method is called, which fixes spacing in array initializers.

It is difficult to format tokens without semantics at all, so clang-format “annotates” tokens with additional data before formatting: it maps each Token object to some FormatToken object.

There are a number of ad-hoc fields, such as bool IsArrayInitializer (indicates whether this token is the beginning of an array initialization) or FormatToken *MatchingParen (a pointer to the closing parenthesis).
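As a rough sketch of this kind of annotation, the snippet below fills in a matching-bracket pointer using a nesting stack. AnnotatedToken and matchBrackets are hypothetical names and a heavy simplification; the real FormatToken has many more fields and the real annotator is far more involved. In this sketch the pointer is set in both directions, which is an assumption of the example.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical, heavily simplified stand-in for an annotated token.
struct AnnotatedToken {
    std::string text;
    AnnotatedToken *matchingParen; // partner bracket, or nullptr

    explicit AnnotatedToken(std::string t)
        : text(std::move(t)), matchingParen(nullptr) {}
};

static bool isOpenBracket(const std::string &s) {
    return s == "(" || s == "[" || s == "{";
}

static bool isCloseBracket(const std::string &s) {
    return s == ")" || s == "]" || s == "}";
}

// Links every bracket to its partner with a nesting stack.
// Simplification: the bracket kinds are not checked against each other,
// and unbalanced brackets are silently left without a partner.
// The vector must not be resized afterwards, or the pointers dangle.
void matchBrackets(std::vector<AnnotatedToken> &toks) {
    std::vector<AnnotatedToken *> stack;
    for (AnnotatedToken &t : toks) {
        if (isOpenBracket(t.text)) {
            stack.push_back(&t);
        } else if (isCloseBracket(t.text) && !stack.empty()) {
            AnnotatedToken *open = stack.back();
            stack.pop_back();
            open->matchingParen = &t;
            t.matchingParen = open;
        }
    }
}
```

Every later formatting rule that reads matchingParen silently depends on this pass being right, which is the kind of hidden coupling the rest of the post complains about.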

With this approach, everything works notoriously badly😣. Common errors are inserting a lot of extra spaces and disfiguring lambdas.

There are a bunch of issues related to clang-format, and fixing them is much harder than fixing issues related to clang-tidy.

If you’re fixing a bug in clang-tidy, the “area of potential edits” is the code of a single check (a few hundred lines at most), whereas in clang-format it is the entire code of clang-format.

For example, it is very difficult to fix such a stupid bug as broken formatting of an array initializer nested inside brackets. The thing is that formatting relies on the “annotation” of tokens, and for nested brackets the annotation itself is what comes out wrong. So it is necessary to fix the “annotator”, but that is far too difficult, and there is a risk of breaking something else.

And so it goes for many issues - you start looking into a minor problem, like “why is an extra space being inserted”, you uncover the whole chain of causes, and you end up with a mega-problem that is impossible to fix.

Therefore, try to make modular programs to reduce the “area of potential edits” when fixing a bug 😎
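To make the advice a bit more concrete, here is a minimal sketch of a modular design in which each check lives in its own small class. Check, TrailingSpaceCheck, TabCheck and runAll are hypothetical names invented for this example; this is not clang-tidy's actual API, only the general shape of "one check, one small piece of code".

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical interface: each check inspects a line and reports findings.
struct Check {
    virtual ~Check() = default;
    virtual std::string name() const = 0;
    virtual std::vector<std::string> run(const std::string &line) const = 0;
};

// Reports a line that ends with a space.
struct TrailingSpaceCheck : Check {
    std::string name() const override { return "trailing-space"; }
    std::vector<std::string> run(const std::string &line) const override {
        std::vector<std::string> out;
        if (!line.empty() && line.back() == ' ')
            out.push_back("trailing whitespace");
        return out;
    }
};

// Reports a line that contains a tab character.
struct TabCheck : Check {
    std::string name() const override { return "tab"; }
    std::vector<std::string> run(const std::string &line) const override {
        std::vector<std::string> out;
        if (line.find('\t') != std::string::npos)
            out.push_back("tab character");
        return out;
    }
};

// Runs every registered check; the checks know nothing about each other.
std::vector<std::string>
runAll(const std::vector<std::unique_ptr<Check>> &checks,
       const std::string &line) {
    std::vector<std::string> out;
    for (const auto &check : checks)
        for (const auto &msg : check->run(line))
            out.push_back(check->name() + ": " + msg);
    return out;
}
```

With this layout, a bug in the tab check can only live in TabCheck::run, so the “area of potential edits” is a handful of lines rather than the whole program.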
