C, like any language, uses a standard grammar and character set. The elements that make up this grammar and character set are described in the sections that follow.
C compilers interpret source code as a stream of characters from the source file. These characters are grouped into tokens, which can be punctuators, operators, identifiers, keywords, string literals, or constants. A token is the smallest lexical element of the language. The compiler forms the longest token possible from a given string of characters; the token ends when white space is encountered, or when the next character cannot be part of the token.
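The longest-token rule can be seen in a short sketch. The names munch_result and munch_demo are hypothetical and exist only for this illustration. The expression a+++b contains no white space, so the compiler has to decide where each token ends.

```c
#include <assert.h>

/* Hypothetical demo type: the values left behind by the expression below. */
struct munch_result {
    int sum;      /* value of a+++b */
    int a_after;  /* a once the expression has been evaluated */
    int b_after;  /* b once the expression has been evaluated */
};

struct munch_result munch_demo(void)
{
    struct munch_result r;
    int a = 1;
    int b = 2;

    /* Longest-token rule: after "a", the scanner takes "++" rather than
       "+", so the tokens are  a  ++  +  b,  that is, (a++) + b. */
    r.sum = a+++b;

    /* Only a was incremented; a reading of a + (++b) would change b. */
    assert(a == 2 && b == 2);

    r.a_after = a;
    r.b_after = b;
    return r;
}
```

Calling munch_demo() yields a sum of 3, with a left at 2 and b still at 2. Writing the expression as a++ + b makes the same grouping visible to a human reader.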
White space can be a space character, new-line character, tab character, form-feed character, or vertical tab character. Comments are also considered white space. Section 1.1 lists all the white space characters. White space is used as a token separator (except within quoted strings), but is otherwise ignored in the character stream, and is used mainly for human readability. White space may also be significant in preprocessor directives (see Chapter 8).
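As a small illustration of these rules, the sketch below uses a comment as a token separator, spreads one declaration across several lines, and compares two string literals that differ only in their spacing. All identifiers (pad_a, pad_b, one_space, two_spaces, white_space_check) are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <string.h>

/* A comment separates tokens exactly as a space would: int, then pad_a. */
int/* separator */pad_a = 7;

/* Extra spaces, tabs, and new-lines between tokens change nothing. */
int
pad_b   =
        7   ;

/* Inside a quoted string, every space is part of the literal. */
const char *one_space  = "a b";
const char *two_spaces = "a  b";

/* Hypothetical self-check that states the claims above as assertions. */
void white_space_check(void)
{
    assert(pad_a == pad_b);
    assert(strlen(one_space) == 3);
    assert(strlen(two_spaces) == 4);
}
```

The two integer declarations produce the same token sequence apart from the identifier, while the two strings are different values because the white space inside the quotation marks is kept.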
Consider the following source code line:
static int x=0; /* Could also be written "static int x = 0;" */
The compiler breaks the previous line into the following tokens (shown one per line):
static
int
x
=
0
;
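The way such a line is split can be approximated with a very small scanner. This is only a sketch under stated assumptions, not how any particular compiler is implemented: next_token is a hypothetical helper that recognizes identifiers and keywords, runs of decimal digits, comments, and one-character punctuators, and nothing else (no string literals, no multi-character operators).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies the next token of src into tok (capacity
   cap) and returns a pointer just past it, or NULL when no token remains.
   A token longer than cap - 1 characters is split, which is acceptable
   only because this is a sketch. */
const char *next_token(const char *src, char *tok, size_t cap)
{
    size_t n = 0;

    assert(cap >= 2);

    /* Skip white space; a comment counts as white space too. */
    for (;;) {
        while (isspace((unsigned char)*src))
            src++;
        if (src[0] == '/' && src[1] == '*') {
            const char *end = strstr(src + 2, "*/");
            if (end == NULL)
                return NULL;              /* unterminated comment */
            src = end + 2;
        } else {
            break;
        }
    }
    if (*src == '\0')
        return NULL;

    if (isalpha((unsigned char)*src) || *src == '_') {
        /* Longest run of letters, digits, and underscores. */
        while ((isalnum((unsigned char)*src) || *src == '_') && n + 1 < cap)
            tok[n++] = *src++;
    } else if (isdigit((unsigned char)*src)) {
        /* Longest run of decimal digits. */
        while (isdigit((unsigned char)*src) && n + 1 < cap)
            tok[n++] = *src++;
    } else {
        tok[n++] = *src++;                /* one-character punctuator */
    }
    tok[n] = '\0';
    return src;
}
```

Called repeatedly on the example line, the helper returns static, int, x, =, 0, and ; in that order and then NULL, because the trailing comment is skipped as white space. The token x ends at = without any intervening space, since = cannot be part of an identifier.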
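As the compiler processes the input character stream, it identifies tokens and locates error conditions. The compiler can identify three types of errors: lexical errors, in which the characters cannot be formed into a legal token; syntax errors, in which the tokens are legal but do not form a grammatically valid construct; and semantic errors, in which the construct is grammatical but violates another rule of the language. For example, the compiler rejects each of the following declarations:
char x[3] = (1,2,3);
int *x = 5.7;
The first uses parentheses where an array initializer requires braces; the second initializes a pointer from a floating-point constant.
Logical errors are not identified by the compiler.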
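The difference between errors the compiler reports and logical errors can be sketched as follows. The sketch assumes the usual division of compile-time errors into lexical, syntax, and semantic errors; the rejected declarations are kept inside a comment so that the example still compiles. The functions sum_to_buggy and sum_to_fixed are hypothetical names invented for this illustration.

```c
#include <assert.h>

/*
 * Each declaration below is left inside this comment because enabling it
 * would stop compilation:
 *
 *   int n = 5 @ 3;     lexical:  '@' cannot begin any C token
 *   int n = (5 + 3;    syntax:   legal tokens, but ')' is missing
 *   int *p = 5.7;      semantic: grammatical, but a pointer cannot be
 *                                initialized from a floating constant
 */

/* Logical error: meant to add 1 through n, but "<" stops one short.
   The compiler accepts it because the code is valid C. */
int sum_to_buggy(int n)
{
    int i;
    int sum = 0;

    assert(n >= 0);
    for (i = 1; i < n; i++)
        sum += i;
    return sum;
}

/* Intended behavior: "<=" includes n itself. */
int sum_to_fixed(int n)
{
    int i;
    int sum = 0;

    assert(n >= 0);
    for (i = 1; i <= n; i++)
        sum += i;
    return sum;
}
```

For n equal to 4, sum_to_buggy returns 6 where 10 was intended. Both functions compile without error, so only testing or code review reveals the difference.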
An important concept throughout C is the idea of a compilation unit, which is one or more files compiled by the compiler.
The smallest acceptable compilation unit is one external definition. The ANSI C standard defines several key concepts in terms of compilation units. Section 2.2 discusses compilation units in detail.
A compilation unit with no declarations is accepted with a compiler warning in all modes except for the strict ANSI standard mode.