1 Lexicon

C, like any language, uses a standard grammar and character set. The elements that make up this grammar and character set are described in the following sections:

Character set (Section 1.1)
Rules for identifiers in C (Section 1.2)
Use of comments in a program (Section 1.3)
Keywords (Section 1.4)
Use of C operators (Section 1.5)
Use of punctuation characters (Section 1.6)
Use of character strings in a program (Section 1.7)
Interpretation of constant values (Section 1.8)
Inclusion of function declarations and other definitions, common to multiple source files, in a separate header file or module (Section 1.9)
The limits imposed on a conforming program by the ANSI C standard (Section 1.10)

C compilers interpret source code as a stream of characters from the source file. These characters are grouped into tokens, which can be punctuators, operators, identifiers, keywords, string literals, or constants. A token is the smallest lexical element of the language. The compiler forms the longest token possible from a given string of characters; the token ends when white space is encountered, or when the next character cannot be part of the token.
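The longest-token rule can be seen in a short sketch. The function name `longest_token_demo` and its parameters are hypothetical, chosen only for illustration. The expression `a+++b` contains no white space, so the compiler first forms the longest token it can (`++`) and then continues with `+`:

```c
/* Hypothetical helper: shows how the characters "a+++b" are tokenized.
 * The compiler forms the longest token first, so the characters are
 * grouped as  a ++ + b  (post-increment, then addition),
 * not as  a + ++b. */
int longest_token_demo(int *a_after, int *b_after)
{
    int a = 1;
    int b = 10;
    int sum = a+++b;   /* parsed as (a++) + b, which is 1 + 10 */

    *a_after = a;      /* a was incremented to 2 */
    *b_after = b;      /* b is unchanged at 10 */
    return sum;
}
```

Writing the expression as `a++ + b` or `a + ++b` makes the intended grouping explicit and is usually easier to read.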

White space can be a space character, new-line character, tab character, form-feed character, or vertical tab character. Comments are also considered white space. Section 1.1 lists all the white space characters. White space separates tokens (except within quoted strings, where it is preserved) but is otherwise ignored in the character stream; it serves mainly to make source code readable. White space may also be significant in preprocessor directives (see Chapter 8).
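As a minimal sketch of these rules, the following declarations use a comment as a token separator, spread one declaration across several lines, and keep white space inside a string literal. The names `y`, `z`, `spaced`, and `spaced_length` are assumptions made for this example:

```c
#include <string.h>

/* A comment counts as white space, so it separates "int" from "y". */
int/* comment */y = 3;

/* New-line characters between tokens are ignored like any other
 * white space; this is the same as "int z = 4;". */
int
z
=
4
;

/* White space inside a quoted string is preserved, not ignored. */
const char spaced[] = "a  b";   /* 'a', two spaces, 'b', then the null */

/* Returns the number of characters in the string, spaces included. */
size_t spaced_length(void)
{
    return strlen(spaced);
}
```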

Consider the following source code line:

```
static int x=0;  /* Could also be written "static int x = 0;"   */
```

The compiler breaks the previous line into the following tokens (shown one per line):

static
int
x
=
0
;
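The token breakdown above can be reproduced with a deliberately simplified tokenizer. This is only a sketch, not how any particular compiler is implemented: the function `split_tokens` and the limits `MAX_TOKENS` and `MAX_TOKEN_LEN` are hypothetical, and it handles only identifiers or keywords, decimal integer constants, single-character punctuators and operators, white space, and comments.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_TOKENS 16      /* hypothetical limit for this sketch */
#define MAX_TOKEN_LEN 32   /* hypothetical limit for this sketch */

/* Simplified tokenizer: copies each token found in src into out and
 * returns the number of tokens. Not a full C lexer. */
size_t split_tokens(const char *src, char out[MAX_TOKENS][MAX_TOKEN_LEN])
{
    size_t count = 0;
    const char *p = src;

    while (*p != '\0' && count < MAX_TOKENS) {
        const char *start;
        size_t len;

        if (isspace((unsigned char)*p)) {      /* white space separates tokens */
            p++;
            continue;
        }
        if (p[0] == '/' && p[1] == '*') {      /* a comment is white space too */
            const char *end = strstr(p + 2, "*/");
            p = (end != NULL) ? end + 2 : p + strlen(p);
            continue;
        }

        start = p;
        if (isalpha((unsigned char)*p) || *p == '_') {
            /* identifier or keyword: take the longest possible run */
            while (isalnum((unsigned char)*p) || *p == '_')
                p++;
        } else if (isdigit((unsigned char)*p)) {
            /* decimal integer constant */
            while (isdigit((unsigned char)*p))
                p++;
        } else {
            p++;                               /* single-character token */
        }

        len = (size_t)(p - start);
        if (len >= MAX_TOKEN_LEN)
            len = MAX_TOKEN_LEN - 1;           /* truncate overly long tokens */
        memcpy(out[count], start, len);
        out[count][len] = '\0';
        count++;
    }
    return count;
}
```

Called on the source line shown earlier, `split_tokens` yields the six tokens `static`, `int`, `x`, `=`, `0`, and `;`, and skips the trailing comment.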

As the compiler processes the input character stream, it identifies tokens and locates error conditions. The compiler can identify three types of errors:

Lexical errors, which occur when the compiler cannot form a legal token from the character stream (such as when an illegal character is used).
Parsing (syntax) errors, which occur when a legal token can be formed, but the compiler cannot make a legal statement from the tokens. For example, the following line contains incorrect punctuation surrounding an initializer list:
```
char x[3] = (1,2,3);
```
Semantic errors, which occur when a statement is grammatically correct but breaks another C language rule. For example, the following line shows an attempt to assign a floating-point value to a pointer type:
```
int *x = 5.7;
```
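For comparison, here is one plausible way to correct the two erroneous lines above. The names `fixed_array`, `target`, and `fixed_pointer` are hypothetical, and the pointer fix is an assumption, since the text does not say what the original line was meant to do:

```c
/* Syntax fix: an initializer list is enclosed in braces, not parentheses. */
char fixed_array[3] = {1, 2, 3};

/* Semantic fix (assumed intent): a pointer to int is initialized with
 * the address of an int object rather than a floating-point value. */
int target = 5;
int *fixed_pointer = &target;
```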

Logical errors are not identified by the compiler.
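A short illustration of a logical error, using hypothetical function names: `mean_buggy` is lexically, syntactically, and semantically valid, so it compiles, yet it does not compute what its author presumably intended.

```c
/* Intended to return the mean of two values, but the division is done
 * in integer arithmetic, so any fractional part is discarded before
 * the result is converted to double. The compiler accepts this. */
double mean_buggy(int a, int b)
{
    return (a + b) / 2;
}

/* Dividing by a floating-point constant keeps the fractional part. */
double mean_fixed(int a, int b)
{
    return (a + b) / 2.0;
}
```

For the arguments 1 and 2, `mean_buggy` returns 1.0 while `mean_fixed` returns 1.5. Errors of this kind have to be found by testing or review.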

An important concept throughout C is the idea of a compilation unit, which is one or more files compiled by the compiler.

Note: The ANSI C standard refers to compilation units as translation units. This text treats these terms as equivalent.

The smallest acceptable compilation unit is one external definition. The ANSI C standard defines several key concepts in terms of compilation units. Section 2.2 discusses compilation units in detail.
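As a minimal sketch, a source file containing nothing but the following line is a compilation unit with exactly one external definition. The name `only_definition` is hypothetical:

```c
/* The entire source file: a single external definition of an int object. */
int only_definition = 0;
```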

A compilation unit with no declarations is accepted with a compiler warning in all modes except for the strict ANSI standard mode.
