Anant Jain

Large-Scale Automated Refactoring Using ClangMR

Paper Review

Abstract

Maintaining large codebases can be a challenging endeavour. As new libraries, APIs and standards are introduced, old code is migrated to use them. To provide as clean and succinct an interface as possible for developers, old APIs are ideally removed as new ones are introduced. In practice, this becomes difficult as automatically finding and transforming code in a semantically correct way can be challenging, particularly as the size of a codebase increases. In this paper, we present a real-world implementation of a system to refactor large C++ codebases efficiently. A combination of the Clang compiler framework and the MapReduce parallel processor, ClangMR enables code maintainers to easily and correctly transform large collections of code. We describe the motivation behind such a tool, its implementation and then present our experiences using it in a recent API update with Google’s C++ codebase.

Highlights

  • A combination of the Clang compiler framework and the MapReduce parallel processor, ClangMR enables code maintainers to easily and correctly transform large collections of code.

  • As software systems evolve, existing code often must be updated to allow for future feature development, removal of old or incompatible interfaces, and further maintenance tasks.

  • Google addresses this challenge by using ClangMR, a tool that uses semantic knowledge from the C++ abstract syntax tree (AST) to make editing decisions.

  • This combination of knowledge, flexibility and speed allows code maintainers to perform complex transformations across millions of lines of C++ code in a matter of minutes.

  • ClangMR is not tool designed to do complex refactoring over C++, but it is unique in its flexibility, speed, and use in industrial applications.

  • Traditional regular-expression based matching tools lack the semantic knowledge that complex transformations often require.

  • Many integrated development such as Eclipse, provide a limited set of refactoring tools. While these tools take advantage of compile-time knowledge, they are generally limited to a single file or package, and only support the operations built into the tool. Since these tools run on a developer’s local workstation, processing large collections of source code is often intractable.

  • Google maintains a large mixed-language codebase in a single repository, with a significant portion written in C++.

  • To help maintain as small an API surface as reasonable, the developers and teams responsible for introducing new core APIs are also tasked with removing old ones and migrating existing callers to the new abstractions.

  • ClangMR helps reduce the accumulated technical debt of a diverse codebase built over the course of more than a decade.

  • The ClangMR implementation consists of three parts:

    1. an indexer which describes how to compile the C++ codebase into a collection of ASTs.
    2. a transformation-specific node matching tool which builds ASTs from the index, matches applicable AST nodes, and outputs editing instructions.
    3. a source code refactorer which consumes editing instructions and modifies the files in a local version control client to effect the desire transformations.
  • This architecture means a single code maintainer needs only implement an appropriate matcher for the desired transform, rather than the entire pipeline.

  • Storing the entire precomputed ASTs would be space-prohibitive, but this index serves to provide a way to quickly construct those ASTs from a snapshot in source code history.

  • Experience shows that it is roughly as fast to compile C++ code into a memory-held AST as it is to read a completely annotated AST out of distant storage.

  • Because each translation unit has a separate root node in the AST, and each source file is generally an independent translation unit, the node matcher can operate on separate translation units in parallel.

  • A developer provides an appropriate node matching expression, and a callback to be invoked when that expression is matched. In practice, these tend to be relatively small: a few hundred lines of C++ code. This technique allows for much more powerful transformations than pure textual substitution.

  • Similar to a text-based patch, these instructions describe edits to the target source file as a series of byte-level offsets and additions or deletions. These instructions are then serialized to disk, and used as input to the source code refactorer.

  • The source refactorer reads the list of edit commands generated by the node matcher callbacks and filters out any duplicate, overlapping or conflicting edit instructions before editing the source code in the local version control client. Each edit is processed serially in the version control client on the local workstation of the developer, which is synchronized to the version of code stored in the compilation index.

  • A final pass with ClangFormat, a Clang-based formatting tool, ensures the resulting code meets formatting and style guide recommendations.

  • ClangMR can only refactor changes which are self-contained within translation units, requires tedious manual review, and requires learning some nuances of the Clang AST, which requires developer investment.

  • The initial ClangMR program transformed about 35,000 callers of SplitStringUsing to strings::Split, and these changes were mailed for review in 3,100 separate chunks, though not all simultaneously. These were often reviewed quite quickly, with an 80th percentile review time of just over two minutes. The bulk of reviews were completed over two months, with a small number requiring another month to complete.

Conclusion

ClangMR allows for fast and versatile refactoring of large C++ codebases, and has been applied to many problems within Google, enabling maintainers to keep millions of lines of C++ code nimble and accessible to thousands of engineers.

PDF


Over the next few Saturdays, I'll be going through some of the foundational papers in Computer Science, and publishing my notes here. This is #30 in this series.