Understanding Compiler's internal architecture

Understanding Compiler's internal architecture

How does GCC work internally?

Becoming a techy genius is not rocket science. You just have to understand what goes on, in what you use and do daily.

Umm... still confused?! Here I am with a new series of blogs for you to become a tech GENIUS.

Let's start!

Prerequisites

  • You should have written at least a few lines of code before.

Introduction

As you already know the boring line- Computers can't understand what we speak or write(or scream). We need to tell it, every small detail about what to do.

So, we write some human-readable codes, which are translated to binary codes(0 and 1) that the computer understands.

As I expect, you are a coder and have written some pieces of code in an IDE. Do you know what happens behind the scenes when you compile/Run the program?

In this article, we'll explore how compilers work internally by examining the architecture of GCC, the GNU Compiler Collection, which has become one of the most popular compilers in the UNIX world!

Architecture of GCC

GCC is a set of compilers and development tools having multiple compilers for different programming languages, like C, C++, Java, etc.

Refer to the diagram above to understand the next section.

How do compilers work?

As you write code, a compiler takes a program written in one language, that is source code, and converts it into a language, the machine can understand, that is object code.

A compiler transforms the code into multiple intermediate forms and gradually converts it into object code.

There are mainly 3 phases in a compiler:

  • front end - receives and validates the syntax of the input program.

  • middle end - analyzes and transforms the program into a more efficient form

  • back end - does the final mapping into object code

Now, let's go one step deeper into the compiler phases.

Front End

Computers are not good with the text we write, so we need to remove all the spaces, comments, and formatting and convert the program into a form of code called intermediate Representation (IR).

These representations are nothing but internal data structures, that can be changed easily as compared to texts.

So, when a normal program(code) comes, Front End parser converts it into a data structure called Abstract Syntax Trees(AST). It is a tree, as the name suggests.

For example, if I write a code-

x = y - z * 3

It will look something like this-

Then, AST is converted into a unified form called generic. A form that is independent of any specific programming language.

So, Frontend verifies the correctness of the text program's syntactical structure, generates internal data structures for variables and data types declared in programs, and constructs AST.

Middle End

As intermediate representation(IR) is sent by the front end, the middle-end proceeds to analyze and transform the program. This is done for 2 reasons:

  • make the object code run as fast as possible (reduce time complexity)

  • make the object code take as little space as possible (reduce space complexity)

First, the tree is converted into another representation called GIMPLE. In this form, each expression contains no more than three operands, which means somewhat like, a = b + c or d = e * f .

Then, it is converted from GIMPLE into Static Single Assignment (SSA) representation.

In SSA, each variable is assigned a value only once but it can be used many times. When a variable is reassigned, the compiler creates a new version of it.

Let's see an example-

In GIMPLE form:

x = 5
if (condition) {
    x = 10
} else {
    x = 20
}

In SSA form:

x1 = 5
if (condition) {
    x2 = 10
} else {
    x3 = 20
}
x4 = PHI(x2, x3)

Frontend doesn't know what the program actually does, like most of us ;) little joke, lol

Optimization can only be possible when the compiler understands the flow of control in the program (control-flow analysis) and how the data is transformed as the program executes (data-flow analysis).

All control flow constructs are represented as combinations of conditional statements and goto operators, arguments of a function call can only be variables

💡
This analysis helps the compiler to improve the run-time performance of the code.

There are more such optimizations like-

  • Algebraic simplifications: for eg- converting a + 1 - a to 1.

  • Constant folding: evaluating expressions, for eg-converting a = 1 + 1 to a = 2

  • Redundancy elimination: This helps with removing redundant codes, and has further types:

    • Loop-invariant code motion- when in a loop, a line of code produces the same result, ex: c = 2+3 in every iteration, it is removed from the loop and written outside.

    • Common sub-expression elimination- removing the old piece of code, for eg: if we define x = 10 and later do x = 2 we can remove the previously defined line.

After the SSA optimization pass, the tree is converted back to the GIMPLE form which is then used to generate a register-transfer language (RTL) form of a tree.

RTL is a hardware-based representation that corresponds to an abstract target architecture with an infinite number of registers. An RTL optimization pass optimizes the tree in the RTL form, on which the backend works thereafter.

Back End

This phase produces machine code for the target architecture.

At this stage, the compiler needs to have very detailed knowledge about the hardware where the program will run.

This means that the intermediate representation of the program is usually converted into a form resembling machine language.

Even after final code generation, the resulting code is typically not ready to run as-is. Some compilers (GCC among them) generate assembly code which is then fed into the assembler for object code generation.

After the object code generation, the linker collects all the different object files and libraries needed to build the final executable.

Conclusion

You are now, one step closer to becoming a tech genius!

You now know how your code works behind the scenes and why it takes time to compile and run.

There is a lot more in-depth about how each phase is connected to other parts of the machine, like OS, CPU, etc.

Continue exploring the tech.

Thanks for reading🫡

Do like and comment on what you learned in this blog. Don't forget to check out my other blogs for in-depth tech content.

Source of learning: Source 1 & source 2

Did you find this article valuable?

Support Agrim Sharma by becoming a sponsor. Any amount is appreciated!