The C++ source file is processed by the compiler as if the following phases take place, in this exact order:
Phase 1
1) The individual bytes of the source code file are mapped (in an implementation-defined manner) to the characters of the
basic source character set. In particular, OS-dependent end-of-line indicators are replaced by newline characters.
2) The basic source character set consists of 96 characters:
a) 5 whitespace characters (space, horizontal tab, vertical tab, form feed, new-line)
b) 10 digit characters from '0' to '9'
c) 52 letters from 'a' to 'z' and from 'A' to 'Z'
d) 29 punctuation characters: _ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '
3) Any source file character that cannot be mapped to a character in the basic source character set is replaced by its universal character name (\uXXXX) or by some internal form that is handled equivalently.
Phase 2
1) Whenever a backslash appears at the end of a line (immediately followed by the newline character), both the backslash and the newline are deleted, combining two physical source lines into one logical source line. This is a single-pass operation: a line ending in two backslashes followed by an empty line does not combine three lines into one. If a universal character name (\uXXXX) is formed in this phase, the behavior is undefined.
2) If a non-empty source file does not end with a newline character after this step (whether it had no newline originally, or it ended with a backslash):
- the behavior is undefined (until C++11)
- a terminating newline character is added (since C++11)
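The line splicing described in phase 2 can be sketched as follows. This is a minimal illustration rather than part of the specification: the macro `SUM_OF_THREE` and the functions `spliced_sum`, `spliced_text`, and `check_splicing` are hypothetical names chosen for this example.

```cpp
#include <cassert>
#include <string>

// The macro body spans two physical lines. Phase 2 deletes the
// backslash-newline pair, so the preprocessor sees one logical line.
#define SUM_OF_THREE(a, b, c) \
    ((a) + (b) + (c))

inline int spliced_sum() {
    return SUM_OF_THREE(1, 2, 3);
}

// Splicing also applies inside a string literal: the backslash-newline
// pair is removed and the next physical line continues the literal.
// The continuation starts at column 0 so no extra spaces are added.
inline std::string spliced_text() {
    return "phase \
two";
}

// Runtime checks of the behavior described in the comments.
inline void check_splicing() {
    assert(spliced_sum() == 6);
    assert(spliced_text() == "phase two");
}
```

Calling `check_splicing()` from `main` should pass silently. Note that the backslash must be the very last character on the line for the splice to occur.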
Phase 3
1) The source file is decomposed into comments, sequences of whitespace characters (space, horizontal tab, new-line, vertical tab, and form-feed), and preprocessing tokens, which are the following:
a) header names: <iostream> or "myfile.h"
b) identifiers
c) numbers
d) character and string literals
e) operators and punctuators (including alternative tokens), such as +, <<=, new, <%, ##, or and
f) individual non-whitespace characters that do not fit in any other category
2) Each comment is replaced by one space character.
3) Newlines are kept, and it's implementation-defined whether non-newline whitespace sequences may be collapsed into single space characters.
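Two of the phase 3 rules can be observed directly in source code: a comment behaves like a single space, and alternative tokens are ordinary preprocessing tokens. The sketch below is illustrative only; `answer`, `both_positive`, and `check_tokens` are hypothetical names, and some compilers may need a standards-conformance mode to accept `and` as a token.

```cpp
#include <cassert>

// The comment between `int` and `answer` is replaced by one space in
// phase 3, so the declaration reads as `int answer = 42;`.
int/* comment acts as a token separator */answer = 42;

// `<%` and `%>` are alternative tokens for `{` and `}`;
// `and` is an alternative token for `&&`.
inline bool both_positive(int a, int b) <%
    return a > 0 and b > 0;
%>

// Runtime checks of the behavior described in the comments.
inline void check_tokens() {
    assert(answer == 42);
    assert(both_positive(1, 2));
    assert(!both_positive(1, -2));
}
```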
Phase 4
1) The preprocessor is executed.
2) Each file introduced with the #include directive goes through phases 1 through 4, recursively.
3) At the end of this phase, all preprocessor directives are removed from the source.
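As a rough illustration of phase 4, the following sketch uses a few common directives; by the time compilation proper starts, only the expanded code remains. All macro and function names here (`SQUARE`, `STRINGIZE`, `PHASES_DEMO_ALTERNATE_BRANCH`, `selected_branch`, and so on) are hypothetical, and `PHASES_DEMO_ALTERNATE_BRANCH` is assumed not to be defined anywhere.

```cpp
#include <cassert>
#include <string>

// Function-like macro: expanded textually during phase 4.
#define SQUARE(x) ((x) * (x))
// The # operator turns the macro argument into a string literal.
#define STRINGIZE(x) #x

// Conditional inclusion: only one of the two definitions survives.
#ifdef PHASES_DEMO_ALTERNATE_BRANCH  // assumed to be undefined
inline int selected_branch() { return 1; }
#else
inline int selected_branch() { return 2; }
#endif

inline int squared_five() { return SQUARE(2 + 3); }
inline std::string stringized() { return STRINGIZE(a + b); }

// Runtime checks of the behavior described in the comments.
inline void check_preprocessing() {
    assert(squared_five() == 25);       // ((2 + 3) * (2 + 3))
    assert(stringized() == "a + b");
    assert(selected_branch() == 2);
}
```

The parentheses in `SQUARE` matter because the expansion is purely textual; without them `SQUARE(2 + 3)` would expand to `2 + 3 * 2 + 3`.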
Phase 5
2) Escape sequences and universal character names in character literals and non-raw string literals are expanded and converted to the execution character set.
If the character specified by a universal character name isn't a member of the execution character set, the result is implementation-defined, but is guaranteed not to be a null (wide) character.
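The difference between ordinary and raw string literals in this phase can be checked at compile time. The sketch below assumes a C++11 or later compiler (raw string literals and `constexpr`); the names `ordinary_size`, `raw_size`, and `check_escapes` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// sizeof a string literal counts its characters plus the terminating null.
// In the ordinary literal, the escape sequence \n becomes one character:
// 'a', new-line, 'b', null.
constexpr std::size_t ordinary_size = sizeof("a\nb");

// In the raw literal, escape sequences are not expanded:
// 'a', backslash, 'n', 'b', null.
constexpr std::size_t raw_size = sizeof(R"(a\nb)");

static_assert(ordinary_size == 4, "escape sequence is a single character");
static_assert(raw_size == 5, "raw literal keeps the backslash and the n");

// Runtime checks mirroring the static assertions.
inline void check_escapes() {
    assert(ordinary_size == 4);
    assert(raw_size == 5);
}
```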
Phase 6
Adjacent string literals are concatenated.
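A small sketch of the concatenation, with the hypothetical names `greeting` and `check_concatenation` chosen for the example:

```cpp
#include <cassert>
#include <cstring>

// Three adjacent literals, separated only by whitespace and newlines,
// become a single string literal in phase 6.
inline const char* greeting() {
    return "Hello, "
           "translation "
           "phases";
}

// "ab" "cd" is one literal "abcd": four characters plus the null.
static_assert(sizeof("ab" "cd") == 5, "adjacent literals are concatenated");

// Runtime check of the behavior described in the comments.
inline void check_concatenation() {
    assert(std::strcmp(greeting(), "Hello, translation phases") == 0);
}
```

This is commonly used to break a long string across several source lines without introducing newline characters into the string itself.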
Phase 7
Compilation takes place: the tokens are syntactically and semantically analyzed and translated as a translation unit.
Phase 8
Each translation unit is examined to produce a list of required template instantiations, including the ones requested by explicit instantiations. The definitions of the templates are located, and the required instantiations are performed to produce instantiation units.
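The two ways an instantiation can end up on that list might look like this in source form. This is a sketch under the assumption of a simple function template; `twice`, `twice_of_half`, and `check_instantiations` are hypothetical names.

```cpp
#include <cassert>

// Hypothetical template used only for illustration.
template <typename T>
T twice(T value) {
    return value + value;
}

// Explicit instantiation definition: requests twice<int> even if
// nothing in this translation unit calls it.
template int twice<int>(int);

// Implicit instantiation: this call requires twice<double>.
inline double twice_of_half() {
    return twice(0.5);
}

// Runtime checks of the behavior described in the comments.
inline void check_instantiations() {
    assert(twice(21) == 42);
    assert(twice_of_half() == 1.0);
}
```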
Phase 9
Translation units, instantiation units, and library components needed to satisfy external references are collected into a program image which contains information needed for execution in its execution environment.