Introduction
C's efficiency and closeness to hardware make it pivotal in systems and application development. This article guides you through the C compilation process, crucial for transforming code into executable programs. Designed for both beginners and experienced developers, it enhances understanding of C programming.
You can follow along with the examples or read for insight, with active engagement recommended for better learning.
Prerequisites include a Linux machine, and examples utilize files from this GitHub repository.
Overview of the C Compilation Process
The C compilation process transforms human-readable code into an executable program through a series of stages. Here's a brief overview of each phase:
- Preprocessing: Resolves directives like
#include
and#define
, preparing the source code for compilation. - Compilation: Converts preprocessed code into assembly language, translating high-level constructs into a lower-level format.
- Assembly: Transforms assembly code into machine code, producing object files with binary code that the processor can understand.
- Linking: Combines object files and libraries into a single executable file, resolving references to create a standalone program.
To demonstrate these steps, we'll use this main.c file that includes various C elements specifically chosen to showcase how the compiler handles different aspects of the language.
Preprocessing
The preprocessing phase is the first step in the C compilation process. It involves the preprocessor analyzing the source code and executing directives before the actual compilation begins. These directives, identified by a #
symbol.
Key Steps in the Preprocessing Process
-
Macro Expansion: Replaces macros defined using
#define
with their corresponding values or code snippets throughout the code. -
File Inclusion: Incorporates the contents of included files using
#include
into the source code. This is used for providing declarations and definitions used in the program as well as improve the code reusability. -
Conditional Compilation: Allows compiling different sections of code depending on certain conditions, enabling more versatile and adaptable code using
if
like constructuions. (#if
,#ifdef
,#ifndef
,#else
,#elif
,#endif
)
Preprocessing paves the way for the next phases of compilation by streamlining the source code, ensuring it's in an ideal format for conversion into machine code.
To preprocess the main.c
file, use the command gcc -E main.c -o main.i
. This generates the preprocessed file, main.i
.
Inspecting the output reveals several key changes:
Expansion of Header Files and Introduction of Linemarkers:
The header files are expanded, and linemarkers are introduced to track file names and line numbers. Linemarkers are a feature of the GCC preprocessor that helps in identifying the origins of each line of code. For example:
# 1 "/usr/include/stdio.h" 1 3 4 // Enters stdio.h header
...
# 1 "/usr/lib/gcc/x86_64-linux-gnu/11/include/stddef.h" 1 3 4 // Enters stddef.h as stdio.h includes stddef.h
# 209 "/usr/lib/gcc/x86_64-linux-gnu/11/include/stddef.h" 3 4 // Sets source and line number
typedef long unsigned int size_t; // The type defined in stddef.h at line 209
# 34 "/usr/include/stdio.h" 2 3 4 // Returns from stddef.h header
Macro Expansion
Macros are expanded to their corresponding definitions. This can be observed in how ERROR(msg)
is processed:
// In main.c
#define ERROR(msg) {printf("Error: %s", msg); exit(-1);}
...
ERROR("No name provided");
// In main.i after preprocessing
{printf("Error: %s", "No name provided"); exit(-1);};
The ERROR
macro in main.c
is replaced in main.i
with its defined content. The argument "No name provided"
replaces msg
in the macro's body. This process simplifies the code and aids in debugging by providing a more readable output.
Understanding and using C preprocessing can provide several advantages:
-
Debugging Imports: By examining the preprocessed file (
main.i
), you can verify which headers are included and in what order. This is particularly useful for resolving issues related to conflicting or missing declarations. -
Debugging Macro Expansions: Seeing how macros expand in the preprocessed code can clarify their impact on the program. This is crucial for debugging and ensuring that macros behave as expected.
-
Debugging Conditional Blocks: Preprocessing reveals how conditional compilation directives (
#if
,#ifdef
, etc.) are resolved. This can help in understanding the flow of compilation and ensuring that the correct code blocks are compiled.
Compilation to Assembly
After preprocessing, the next step in the C compilation process is the actual compilation. This phase involves translating the preprocessed C code into assembly language. A few key aspects of this phase are:
Key Steps in the Compilation Process
-
Translating Instructions: The compiler converts C language constructs and instructions into their corresponding assembly counterparts.
-
Memory Allocation for Variables and Constants: During this phase, the compiler allocates memory for both global and local variables, as well as constants. Global variables are usually placed in a data segment, while local variables are allocated on the stack.
-
Program Section Construction: The compiler organizes the code into various sections, such as
.text
for the executable code,.data
for initialized global and static variables, and.rodata
for read-only data like string literals. -
Generating Debug and Other Sections: Additional sections for debugging information and other metadata are also generated, which can be used for debugging and optimizing the code.
To compile the preprocessed file main.i
into assembly, we use the -S
option with GCC gcc -S -masm=intel main.i -o main.s
. This generates the assembly code using Intel syntax. You can omit the -masm=intel
parameter to output the code in AT&T
flavor.
Let's examine how specific C constructs are represented in the assembly code (main.s
):
Constants Definition
Assembly places constants, such as string literals, in the .rodata
section, ensuring they are read-only.
.section .rodata ; Specify target section for next declarations
.LC0: ; Label for "Anonimus" string
.string "Anonimus" ; String definition, .string instructs the assembler to add an ending 0 to the string constant
.LC1: ; Different labels for different string constants
.string "No name provided"
.LC2:
.string "Error: %s"
.LC3:
.string "Hello %s"
Global Variable Allocation
Global variables are allocated in the .data
section and initialized appropriately. The following snippet shows the "name" variable allocation.
.section .data.rel.local,"aw" ; Set the target section (.data)
.align 8 ; The alignment is 8 bytes (due to 64 bit architecture)
.type name, @object ; Specify the type of the "name" variable
.size name, 8 ; Specify the size of the variable (pointer types has 8 bytes)
name: ; "name" label
.quad .LC0 ; Allocate 8 bytes (.quad) and initialize with .LC0 address
Function Call Translation (Example: strlen
)
The assembly translation of a strlen
function call demonstrates stack space allocation, parameter passing, and function calling convention.
; Original C code
; len = strlen(name);
sub rsp, 32 ; Allocate space on stack for local variable and other intermediary values
...
mov rax, QWORD PTR name[rip] ; Resolve the address or name that is relative to RIP
mov rdi, rax ; Move the address in rdi register (that is the first param for functions)
call strlen@PLT ; Call strlen function
mov DWORD PTR -4[rbp], eax ; Store the function result (eax) in allocated stack variable
Conditional Statement Translation (Example: if
Statement)
The translation of an if
statement into assembly involves condition checking and branching based on the comparison result.
; Original C code
; if(argc != 2)
; {...}
mov DWORD PTR -20[rbp], edi ; Store argc (first param of main function) on stack
...
cmp DWORD PTR -20[rbp], 2 ; Compare the argc with 2
je .L2 ; if equal, jump to label .L2, otherwise continue the execution
... ; positive branch
.L2:
... ; rest of the code after if statement
These examples from main.s
illustrate the assembly translation of constants, global variables, function calls, and conditional statements, highlighting the compiler's role in transforming high-level C code into low-level assembly instructions. Additionally, debug information and compiler-specific data are included in the assembly file, aiding in development and debugging processes.
Assembly to binary object
This phase converts the human-readable assembly code (.s
files) into machine code, encapsulated in binary object files (.o
files). Typically, these object files are in the Executable and Linkable Format (ELF), which provides a versatile standard for storing the compiled code along with necessary metadata for linking.
Key Steps in the Assembly Process:
-
Section Composition: The assembler consolidates data from similar sections across the assembly source into the ELF object file. This ensures that all data, code, and other section types are appropriately merged, leveraging the ELF format's capabilities to organize and manage section information.
-
Instruction Translation: Assembly instructions are translated from human-readable form into binary code (bytecode) that the CPU can directly execute. This process is specific to the target CPU's instruction set, with the ELF object file storing the translated machine instructions.
-
Label Resolution: In the ELF object file, labels within the assembly code, marking function entry points or data locations, are replaced with actual memory addresses. This step is crucial for ensuring accurate execution flow and data access when the program runs.
-
Relocation Table Creation: A relocation table is generated within the ELF object file. Because the final memory addresses of sections or symbols are not determined until the linking stage, the relocation table specifies which addresses in the code need to be updated based on the sections' eventual base addresses.
-
Symbol Table Generation: The ELF object file includes a symbol table, which lists all symbols (function and variable names) used in the assembly code along with their addresses or relocation information. This table is crucial for linking, as it allows the linker to resolve external references between different object files.
-
Debug Information: If included, debug information is stored in specific sections within the ELF object file. This information maps the machine code back to the source code, enabling developers to debug the compiled program at the source level.
The ELF format's design to accommodate these steps makes .o
files particularly suited for the subsequent linking process, moving the assembly code closer to becoming a complete executable program.
To transform assembly code into an object file, the GNU assembler (as
) can be utilized directly:
as main.s -o main.o
While gcc
can also perform this compilation, using as
directly provides a clearer view into the assembly-to-binary object file conversion process.
Since main.o
is a binary file (ELF format), viewing it in a text editor is not informative. Instead, the readelf
and objdump
utility offers a way to inspect ELF files thoroughly.
Inspecting the ELF sections
To explore the sections within main.o
and understand their organization, the following readelf
command can be used:
readelf -W -S main.o
This reveals the sections in main.o
, including their names, offsets, sizes, and attributes, providing insights into the compiler's data and code organization and the preparations for linking.
[Nr] Name Type Address Off Size ES Flg Lk Inf Al
[ 0] NULL 0000000000000000 000000 000000 00 0 0 0
[ 1] .text PROGBITS 0000000000000000 000040 00008a 00 AX 0 0 1
[ 2] .rela.text RELA 0000000000000000 000288 000108 18 I 13 1 8
[ 3] .data PROGBITS 0000000000000000 0000ca 000000 00 WA 0 0 1
[ 4] .bss NOBITS 0000000000000000 0000ca 000000 00 WA 0 0 1
[ 5] .rodata PROGBITS 0000000000000000 0000ca 00002d 00 A 0 0 1
[ 6] .data.rel.local PROGBITS 0000000000000000 0000f8 000008 00 WA 0 0 8
[ 7] .rela.data.rel.local RELA 0000000000000000 000390 000018 18 I 13 6 8
[ 8] .comment PROGBITS 0000000000000000 000100 00002c 01 MS 0 0 1
[ 9] .note.GNU-stack PROGBITS 0000000000000000 00012c 000000 00 0 0 1
[10] .note.gnu.property NOTE 0000000000000000 000130 000020 00 A 0 0 8
[11] .eh_frame PROGBITS 0000000000000000 000150 000038 00 A 0 0 8
[12] .rela.eh_frame RELA 0000000000000000 0003a8 000018 18 I 13 11 8
[13] .symtab SYMTAB 0000000000000000 000188 0000d8 18 14 4 8
[14] .strtab STRTAB 0000000000000000 000260 000025 00 0 0 1
[15] .shstrtab STRTAB 0000000000000000 0003c0 000089 00 0 0 1
Inspecting the .rodata
Section
To view the contents of the .rodata
section, where string constants are stored, use the readelf
command:
readelf -x .rodata main.o
This command dumps the content of the .rodata
section, revealing the raw bytes and the strings added to this section in the assembly file:
Hex dump of section '.rodata':
0x00000000 416e6f6e 696d7573 004e6f20 6e616d65 Anonimus.No name
0x00000010 2070726f 76696465 64004572 726f723a provided.Error:
0x00000020 20257300 48656c6c 6f202573 00 %s.Hello %s.
Inspecting the .text
Section
Similarly, to inspect the .text
section, which contains the executable code, run:
readelf -x .text main.o
Hex dump of section '.text':
0x00000000 f30f1efa 554889e5 4883ec20 897dec48 ....UH..H.. .}.H
0x00000010 8975e083 7dec0274 28488d05 00000000 .u..}..t(H......
0x00000020 4889c648 8d050000 00004889 c7b80000 H..H......H.....
0x00000030 0000e800 000000bf ffffffff e8000000 ................
0x00000040 00488b45 e0488b40 08488905 00000000 .H.E.H.@.H......
0x00000050 488b0500 00000048 89c7e800 00000089 H......H........
0x00000060 45fc488b 05000000 004889c6 488d0500 E.H......H..H...
0x00000070 00000048 89c7b800 000000e8 00000000 ...H............
0x00000080 bf000000 00e80000 0000 ..........
To translate the bytecode into a more understandable format, use objdump
to decompile it:
objdump -M intel -d -j .text main.o
This provides a disassembled view of the .text
section, translating bytecode back into assembly instructions:
0000000000000000 <main>:
0: f3 0f 1e fa endbr64
4: 55 push rbp
5: 48 89 e5 mov rbp,rsp
8: 48 83 ec 20 sub rsp,0x20
c: 89 7d ec mov DWORD PTR [rbp-0x14],edi
f: 48 89 75 e0 mov QWORD PTR [rbp-0x20],rsi
13: 83 7d ec 02 cmp DWORD PTR [rbp-0x14],0x2
17: 74 28 je 41 <main+0x41>
19: 48 8d 05 00 00 00 00 lea rax,[rip+0x0] # 20 <main+0x20>
20: 48 89 c6 mov rsi,rax
23: 48 8d 05 00 00 00 00 lea rax,[rip+0x0] # 2a <main+0x2a>
...
Inspecting Symbols
To display the symbols in main.o
, including defined and imported functions, use the readelf
command with the -s
option:
readelf -s main.o
This command prints the symbol table, showing symbols like main
, name
, and imported functions such as printf
or exit
:
Symbol table '.symtab' contains 9 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
1: 0000000000000000 0 FILE LOCAL DEFAULT ABS main.c
2: 0000000000000000 0 SECTION LOCAL DEFAULT 1 .text
3: 0000000000000000 0 SECTION LOCAL DEFAULT 5 .rodata
4: 0000000000000000 8 OBJECT GLOBAL DEFAULT 6 name
5: 0000000000000000 138 FUNC GLOBAL DEFAULT 1 main
6: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND printf
7: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND exit
8: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND strlen
Inspecting the Relocations
To view the relocation entries in main.o
, which are crucial for linking, use the readelf
command with the -r
option:
readelf -r main.o
This command displays the relocation sections, showing how symbols in the code are adjusted during the linking process:
Relocation section '.rela.text' at offset 0x288 contains 11 entries:
Offset Info Type Sym. Value Sym. Name + Addend
00000000001c 000300000002 R_X86_64_PC32 0000000000000000 .rodata + 5
000000000026 000300000002 R_X86_64_PC32 0000000000000000 .rodata + 16
000000000033 000600000004 R_X86_64_PLT32 0000000000000000 printf - 4
00000000003d 000700000004 R_X86_64_PLT32 0000000000000000 exit - 4
00000000004c 000400000002 R_X86_64_PC32 0000000000000000 name - 4
000000000053 000400000002 R_X86_64_PC32 0000000000000000 name - 4
...
Each entry specifies how and where to adjust symbol references. For example, the relocation at offset 0x1c
in the .text
section (lea rax,[rip+0x0]
) instructs the loader to update the address for the string "No name provided", based on the actual location of .rodata
at runtime. This process ensures that references to variables, functions, and constants are correctly resolved, regardless of where sections are loaded in memory.
More detailed information on relocation types and their purposes can be found in the ELF specification for the x86-64 architecture: ELF x86_64 ABI.
Linking
The linking process is the final stage in the compilation process, where multiple binary object files are combined into a single executable.
Key Aspects of Linking:
-
Combining Sections: The linker merges similar sections from all the object files into unified sections in the final executable. For instance, all
.text
sections from different object files are combined into a single.text
section. -
Resolving Imported Symbols: The linker resolves symbols that are imported from other object files or libraries. This includes linking function calls to their definitions, whether those are included in static libraries included at compile time or dynamic libraries loaded at runtime.
-
Creating Segments: The linker also organizes the combined sections into segments. Segments are portions of the executable that are loaded into memory during execution. This organization is essential for the runtime execution of the program, distinguishing between code, data, and other necessary information.
Due to the complexity of directly invoking the linker (ld
), including the need to specify a linker script, include C libraries, and ensure the C runtime is correctly initialized, it is common practice to use gcc
for this task. gcc
abstracts away these complexities and provides a more straightforward interface for linking:
gcc main.o -o main
Inspecting the segments
In the linking process, one important aspect is understanding the segments (also referred to as program headers) that make up the final executable. These segments define how the executable is organized in memory when it's loaded and executed. You can inspect these segments using the readelf
command:
readelf -l main
The output provides detailed information about the segments, including their loading address in memory, size, size in memory, and alignment.
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
PHDR 0x000040 0x0000000000000040 0x0000000000000040 0x0002d8 0x0002d8 R 0x8
INTERP 0x000318 0x0000000000000318 0x0000000000000318 0x00001c 0x00001c R 0x1
[Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
LOAD 0x000000 0x0000000000000000 0x0000000000000000 0x0006b0 0x0006b0 R 0x1000
LOAD 0x001000 0x0000000000001000 0x0000000000001000 0x000221 0x000221 R E 0x1000
LOAD 0x002000 0x0000000000002000 0x0000000000002000 0x000110 0x000110 R 0x1000
LOAD 0x002da8 0x0000000000003da8 0x0000000000003da8 0x000270 0x000278 RW 0x1000
DYNAMIC 0x002db8 0x0000000000003db8 0x0000000000003db8 0x0001f0 0x0001f0 RW 0x8
NOTE 0x000338 0x0000000000000338 0x0000000000000338 0x000030 0x000030 R 0x8
NOTE 0x000368 0x0000000000000368 0x0000000000000368 0x000044 0x000044 R 0x4
GNU_PROPERTY 0x000338 0x0000000000000338 0x0000000000000338 0x000030 0x000030 R 0x8
GNU_EH_FRAME 0x002034 0x0000000000002034 0x0000000000002034 0x000034 0x000034 R 0x4
GNU_STACK 0x000000 0x0000000000000000 0x0000000000000000 0x000000 0x000000 RW 0x10
GNU_RELRO 0x002da8 0x0000000000003da8 0x0000000000003da8 0x000258 0x000258 R 0x1
Section to Segment mapping:
Segment Sections...
00
01 .interp
02 .interp .note.gnu.property .note.gnu.build-id .note.ABI-tag .gnu.hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt
03 .init .plt .plt.got .plt.sec .text .fini
04 .rodata .eh_frame_hdr .eh_frame
05 .init_array .fini_array .dynamic .got .data .bss
06 .dynamic
07 .note.gnu.property
08 .note.gnu.build-id .note.ABI-tag
09 .note.gnu.property
10 .eh_frame_hdr
11
12 .init_array .fini_array .dynamic .got
For instance, the .data section, which contains the "name" variable, is part of the 5th segment. Note that the MemSiz of this segment is greater than FileSiz by 8 bytes, indicating that space for the variable is allocated at runtime but is not present in the ELF file.
Conclusion
In wrapping up our exploration of the C compilation process, it's vital to recognize how each stage - preprocessing, compiling, assembling, and linking - serves as a cornerstone in transforming C code into a functioning program. This journey from code to executable is not just a technical necessity but a foundation for better programming practices. By understanding the intricacies of each phase, developers can optimize their code, troubleshoot more efficiently, and ultimately enhance the performance and reliability of their software. Embrace this knowledge to unlock new levels of programming mastery.
Comments
Post a Comment