Chapter 1. Overview
1.1. Character sets
Source code character set processing in C and related languages is rather complicated. The C standard
discusses two character sets, but there are really at least four.
The files input to CPP might be in any character set at all. CPP’s very first action, before it even looks
for line boundaries, is to convert the file into the character set it uses for internal processing. That
set is what the C standard calls the source character set. It must be isomorphic with ISO 10646, also
known as Unicode. CPP uses the UTF-8 encoding of Unicode.
At present, GNU CPP does not implement conversion from arbitrary file encodings to the source
character set. Use of any encoding other than plain ASCII or UTF-8, except in comments, will cause
errors. Use of encodings that are not strict supersets of ASCII, such as Shift JIS, may cause errors
even if non-ASCII characters appear only in comments. We plan to fix this in the near future.
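Given the restriction above, it can be useful to check whether a source file is plain ASCII before handing it to the preprocessor. The following is only a minimal sketch; the function name is_plain_ascii is hypothetical and is not part of CPP or GCC.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of GCC: returns 1 if every byte in
   buf[0..len) is 7-bit ASCII, and 0 as soon as a byte with the high
   bit set is found. */
int is_plain_ascii(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (buf[i] > 0x7F)
            return 0;   /* byte outside the ASCII range */
    }
    return 1;
}
```

A file that fails this check may still be acceptable if it is valid UTF-8; the sketch only identifies the files that need no further thought.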
All preprocessing work (the subject of the rest of this manual) is carried out in the source character
set. If you request textual output from the preprocessor with the -E option, it will be in UTF-8.
After preprocessing is complete, string and character constants are converted again, into the execution
character set. This character set is under control of the user; the default is UTF-8, matching the source
character set. Wide string and character constants have their own character set, which is not called out
specifically in the standard. Again, it is under control of the user. The default is UTF-16 or UTF-32,
whichever fits in the target’s wchar_t type, in the target machine’s byte order. (1)
Octal and hexadecimal escape sequences do not undergo conversion; '\x12' has the value 0x12 regardless of the currently
selected execution character set. All other escapes are replaced by the character in the source character
set that they represent, then converted to the execution character set, just like unescaped characters.
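The conversion rules above can be observed from a small C translation unit. This is a sketch rather than a definitive test: the function names are made up for illustration, and the result for the narrow string assumes the default UTF-8 execution character set has not been overridden.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hexadecimal and octal escapes supply the numeric value directly;
   no character set conversion is applied to them. */
int hex_escape_value(void)   { return '\x12'; }  /* 0x12 */
int octal_escape_value(void) { return '\101'; }  /* octal 101 == 0x41 */

/* A \u escape names a character, so it IS converted to the execution
   character set. Assuming the default (UTF-8), U+00E9 occupies two
   bytes in a narrow string. */
size_t e_acute_narrow_bytes(void) { return strlen("\u00e9"); }

/* Width of wchar_t in bits, which decides whether wide constants
   default to UTF-16 or UTF-32 on a given target. */
size_t wide_char_bits(void) { return sizeof(wchar_t) * CHAR_BIT; }
```

On a target with a 32-bit wchar_t the last function returns 32, and on one with a 16-bit wchar_t it returns 16.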
GCC does not permit the use of characters outside the ASCII range, nor \u and \U escapes, in identifiers.
We hope this will change eventually, but there are problems with the standard semantics of such
"extended identifiers" which must be resolved through the ISO C and C++ committees first.
1.2. Initial processing
The preprocessor performs a series of textual transformations on its input. These happen before all
other processing. Conceptually, they happen in a rigid order, and the entire file is run through each
transformation before the next one begins. CPP actually does them all at once, for performance reasons.
These transformations correspond roughly to the first three "phases of translation" described in
the C standard.
1. The input file is read into memory and broken into lines.
Different systems use different conventions to indicate the end of a line. GCC accepts the ASCII
control sequences LF, CR LF and CR as end-of-line markers. These are the canonical sequences
used by Unix, by DOS and VMS, and by the classic Mac OS (before OS X), respectively. You may
therefore safely copy source code written on any of those systems to a different one and use
it without conversion. (GCC may lose track of the current line number if a file doesn’t consistently
use one convention, as sometimes happens when it is edited on computers with different
conventions that share a network file system.)
If the last line of any input file lacks an end-of-line marker, the end of the file is considered to
implicitly supply one. The C standard says that this condition provokes undefined behavior, so
GCC will emit a warning message.
2. If trigraphs are enabled, they are replaced by their corresponding single characters. By default
GCC ignores trigraphs, but if you request a strictly conforming mode with the -std option, or
you specify the -trigraphs option, then it converts them.
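Trigraph replacement (step 2 above) can be sketched as a single in-place pass over a buffer. This is an illustration only and not GCC's implementation; the function name replace_trigraphs and the two lookup tables are hypothetical. The nine trigraphs defined by the C standard all start with two question marks, so the sketch only needs to look up the third character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Third character of each of the nine standard trigraphs, and the
   single character that replaces it at the same index. The trigraphs
   are not written out literally so that this file is unaffected when
   it is itself compiled with trigraphs enabled. */
static const char TRI_THIRD[] = "=(/)'<!>-";
static const char TRI_REPL[]  = "#[\\]^{|}~";

/* Replace trigraphs in the NUL-terminated string s, in place.
   Returns the new length of the string. */
size_t replace_trigraphs(char *s)
{
    assert(s != NULL);
    size_t in = 0, out = 0;
    while (s[in] != '\0') {
        const char *hit = NULL;
        /* Two question marks followed by a non-NUL third character. */
        if (s[in] == '?' && s[in + 1] == '?' && s[in + 2] != '\0')
            hit = strchr(TRI_THIRD, s[in + 2]);
        if (hit != NULL) {
            s[out++] = TRI_REPL[hit - TRI_THIRD];
            in += 3;            /* consume the whole trigraph */
        } else {
            s[out++] = s[in++]; /* ordinary character, copy as-is */
        }
    }
    s[out] = '\0';
    return out;
}
```

Because every replacement is shorter than the sequence it replaces, the output never overtakes the input and no second buffer is needed.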
Footnote 1. UTF-16 does not meet the requirements of the C standard for a wide character set, but the choice of 16-bit
wchar_t is enshrined in some system ABIs, so we cannot fix this.
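The line-breaking rules of step 1 above can be illustrated with a short routine that counts logical lines. It is a sketch under stated assumptions, not the code GCC uses: the name count_lines is hypothetical, and the routine treats LF, CR LF and a lone CR each as one end-of-line marker and lets the end of the buffer terminate an unterminated final line.

```c
#include <assert.h>
#include <stddef.h>

/* Count logical lines in buf[0..len). LF, CR LF and a lone CR each
   end exactly one line. A non-empty final line with no marker is
   still counted, since the end of the file implicitly supplies one. */
size_t count_lines(const char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    size_t lines = 0;
    size_t i = 0;
    while (i < len) {
        if (buf[i] == '\r') {
            /* CR LF is a single marker, so skip the LF as well. */
            if (i + 1 < len && buf[i + 1] == '\n')
                i++;
            lines++;
        } else if (buf[i] == '\n') {
            lines++;
        }
        i++;
    }
    /* Unterminated last line: end of file acts as the marker. */
    if (len > 0 && buf[len - 1] != '\n' && buf[len - 1] != '\r')
        lines++;
    return lines;
}
```

A buffer that mixes conventions is still counted consistently here, which is the simple case; keeping line numbers correct in diagnostics for such files is harder.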