lex
lex - Generates programs for lexical tasks
SYNOPSIS
lex [-ct] [-n | -v] [file ...]
[Digital] The following syntax applies when the CMD_ENV
environment variable is set to svr4:
lex [-crt] [-n | -v] [-V] [-Qy | -Qn] [file ...]
STANDARDS
Interfaces documented on this reference page conform to
industry standards as follows:
lex: XPG4, XPG4-UNIX
Refer to the standards(5) reference page for more informa-
tion about industry standards and associated tags.
FLAGS
-c   Writes C code to the file lex.yy.c. This is the default.
-n   Suppresses the statistics summary. When you set your own
     table sizes for the finite state machine, lex automatically
     produces this summary if you do not select this flag.
-r   [Digital] Writes RATFOR code to the file lex.yy.r.
     (There is no RATFOR compiler for DIGITAL UNIX.)
-t   Writes to standard output instead of writing to a file.
-v   Provides a summary of the generated finite state machine
     statistics.
-V   [Digital] Outputs the lex version number to standard error.
     Requires the environment variable CMD_ENV to be set to svr4.
-Qy | -Qn
     [Digital] Determines whether the lex version number is
     written to the output file. -Qn does not do so and is the
     default. Requires the environment variable CMD_ENV to be
     set to svr4.
DESCRIPTION
The lex command uses the rules and actions contained in
file to generate a program, lex.yy.c, which can be com-
piled with the cc command. That program can then receive
input, break the input into the logical pieces defined by
the rules in file, and run program fragments contained in
the actions in file.
The generated program is a C Language function called
yylex(). The lex command stores yylex() in a file named
lex.yy.c. You can use yylex() alone to recognize simple,
1-word input, or you can use it with other C Language pro-
grams to perform more difficult input analysis functions.
For example, you can use lex to generate a program that
tokenizes an input stream before sending it to a parser
program generated by the yacc command.
The yylex() function analyzes input using a program
structure called a finite state machine. This structure
allows the program to exist in only one state (or
condition) at a time. A finite number of states is
allowed. The rules in file determine how the program
moves from one state to another in response to the input
that the program receives.
The lex command reads its skeleton finite state machine
from the file /usr/ccs/lib/ncpform or /usr/ccs/lib/ncform.
Use the environment variable LEXER to specify another
location for lex to read from.
If you do not specify a file, lex reads standard input.
It treats multiple files as a single file.
Input File Format
The input file can contain three sections: definitions,
rules, and user subroutines. Each section must be
separated from the others by a line containing only the
delimiter, %%. The format is as follows:
     definitions
     %%
     rules
     %%
     user_subroutines
The purpose and format of each of these sections are
described under the headings that follow.
Definitions Section
If you want to use variables in rules, you must define
them in the definitions section. The variables make up
the left column, and their definitions make up the right
column. For example, to define D as a numerical digit,
enter:
     D    [0-9]
You can use a defined variable in the rules section by
enclosing the variable name in braces, {D}.
In the definitions section, you can set either of the
following two mutually exclusive declarations:
%array
     Declare the type of yytext to be a null-terminated
     character array.
%pointer
     Declare the type of yytext to be a pointer to a
     null-terminated character string. Use of the %pointer
     definition selects the /usr/ccs/lib/ncpform skeleton.
In the definitions section, you can also set table sizes
for the resulting finite state machine. The default sizes
are large enough for small programs. You may want to set
larger sizes for more complex programs:
%p number   Number of positions is number (default 5000)
%n number   Number of states is number (default 2500)
%e number   Number of parse tree nodes is number (default 2000)
%a number   Number of transitions is number (default 5000)
%k number   Number of packed character classes is number
            (default 2000)
%o number   Number of output slots is number (default 5000)
If extended characters appear in regular expression
strings, you may need to reset the output array size with
the %o declaration, because the number of extended
characters is large relative to the number of ASCII
characters.
Rules Section
The rules section is required, and it must be preceded by
the %% delimiter, even if you do not have a definitions
section. The lex command does not recognize rules without
the delimiter.
In this section, the left column contains the pattern to
be recognized in an input file to yylex(). The right col-
umn contains the C program fragment executed when that
pattern is recognized.
Patterns can include extended characters with one excep-
tion: extended characters may not appear in range specifi-
cations within character class expressions surrounded by
brackets.
The columns are separated by a tab. For example, to
search files for the word LEAD and replace it with GOLD,
perform the following steps:
1.   Create a file called transmute.l containing the lines:
          %%
          (LEAD)  printf("GOLD");
2.   Then issue the following commands to the shell:
          lex transmute.l
          cc -o transmute lex.yy.c -ll
You can test the resulting program with the command:
     transmute <transmute.l
This command echoes the contents of transmute.l, with the
occurrences of LEAD changed to GOLD.
Each pattern may have a corresponding action, that is, a
fragment of C source code to execute when the pattern is
matched. Each statement must end with a ; (semicolon).
If you use more than one statement in an action, you must
enclose all of them in {} (braces). A second delimiter,
%%, must follow the rules section if you have a user sub-
routine section.
When yylex() matches a string in the input stream, it
copies the matched text to an external character array,
yytext, before it executes any actions in the rules sec-
tion.
You can use the following operators to form patterns that
you want to match:
x, y   Matches the characters written.
[ ]    Matches any one character in the enclosed range
       ([.-.]) or the enclosed list ([...]). [abcx-z]
       matches a, b, c, x, y, or z.
" "    Matches the enclosed character or string even if it
       is an operator. "$" prevents lex from interpreting
       the $ character as an operator.
\      Acts the same as double quotes. \$ prevents lex from
       interpreting the $ character as an operator.
*      Matches zero or more occurrences of the
       single-character regular expression immediately
       preceding it. x* matches zero or more repeated
       literal characters x.
+      Matches one or more occurrences of the
       single-character regular expression immediately
       preceding it.
?      Matches either zero or one occurrence of the
       single-character regular expression immediately
       preceding it.
^      Matches the character only at the beginning of a
       line. ^x matches an x at the beginning of a line.
[^]    Matches any character except for the characters
       following the ^. [^xyz] matches any character but
       x, y, or z.
.      Matches any character except the newline character.
$      Matches the end of a line.
|      Matches either of two characters. x|y matches either
       x or y.
/      Matches one extended regular expression (ERE) only
       when followed by a second ERE. It reads only the
       first token into yytext. Given the regular
       expression a*b/cc and the input aaabcc, yytext would
       contain the string aaab on this match.
( )    Matches the pattern in the ( ) (parentheses). This
       is used for grouping. It reads the whole pattern
       into yytext. A group in parentheses can be used in
       place of any single character in any other pattern.
       (xyz123) matches the pattern xyz123 and reads the
       whole string into yytext.
{ }    Matches the character as defined in the definitions
       section. If D is defined as numeric digits, {D}
       matches all numeric digits.
{m,n}  Matches m-to-n occurrences of the specified
       character. x{2,4} matches 2, 3, or 4 occurrences
       of x.
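The effect of the / operator on yytext can be sketched in C
as follows. This is an illustration only, not the code that
lex generates; the function name match_trailing_context is
hypothetical. It hand-codes the single pattern a*b/cc and
returns the number of characters that would be placed in
yytext.

```c
#include <stddef.h>

/*
 * Hypothetical helper for illustration: hand-coded matcher for the
 * pattern a*b/cc applied at the start of s. Returns the length of the
 * text that would be placed in yytext, or 0 if the pattern does not
 * match. The trailing "cc" must be present but is not counted.
 */
size_t match_trailing_context(const char *s)
{
    size_t n = 0;

    while (s[n] == 'a')            /* a* : zero or more 'a' */
        n++;
    if (s[n] != 'b')               /* b  : exactly one 'b' */
        return 0;
    n++;
    if (s[n] != 'c' || s[n + 1] != 'c')   /* /cc : required lookahead */
        return 0;
    return n;                      /* lookahead is excluded from the match */
}
```

For the input aaabcc the function returns 4, the length of
aaab; the two c characters remain in the input for the next
match.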
If a line begins with only a space, lex copies it to the
lex.yy.c output file. If the line is in the definitions
section of file, lex copies it to the declarations section
of lex.yy.c. If the line is in the rules section, lex
copies it to the program code section of lex.yy.c.
User Subroutines Section
The lex library has three subroutines defined as macros
that you can use in the rules:
input()    Reads a character from yyin.
unput()    Replaces a character after it is read.
output()   Writes a character to yyout.
You can override these three macros by writing your own
code for these routines in the user subroutines section.
But if you write your own routines, you must undefine
these macros in the definitions section as follows:
     %{
     #undef input
     #undef unput
     #undef output
     %}
When you are using lex as a simple transformer/recognizer
for stdin to stdout piping, you can avoid writing the
framework by using libl.a (the lex library). It has a main
routine that calls yylex() for you.
External names generated by lex all begin with the prefix
yy, as in yyin, yyout, yylex, and yytext.
Putting Spaces in an Expression
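Normally, spaces or tab characters end a rule and,
therefore, the expression that defines a rule. However,
you can enclose the spaces or tab characters in ""
(double quotes) to include them in the expression. Use
quotes around all spaces in expressions that are not
already within sets of [ ] (brackets).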
Other Special Characters
The lex program recognizes many of the normal C language
special characters. These character sequences are as
follows:
Sequence    Meaning
\n          Newline
\t          Tab
\b          Backspace
\\          Backslash
\digits     The character whose encoding is represented by
            the three-digit octal number
\xdigits    The character whose encoding is represented by
            the hexadecimal integer
Do not use the actual newline character in an expression.
When using these special characters in an expression, you
do not need to enclose them in quotes. Every character,
except these special characters and the previously
described operator symbols, is always a text character.
Matching Rules
When more than one expression can match the current input,
lex chooses the longest match first. Among rules that
match the same number of characters, the rule that occurs
first is chosen. For example:
     integer   keyword action...;
     [a-z]+    identifier action...;
If the preceding rules are given in that order and inte-
gers is the input word, lex matches the input as an iden-
tifier because [a-z]+ matches eight characters, while
integer matches only seven. However, if the input is
integer, both rules match seven characters. The keyword
rule is selected because it occurs first. A shorter
input, such as int, does not match the expression rule
integer and causes lex to select the rule identifier.
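The selection rule described above can be sketched in C as
follows. This is a minimal illustration, not the code that
lex generates. The names pick_rule, match_keyword,
match_identifier, and the enum values are hypothetical, and
the [a-z] range test assumes an ASCII-compatible character
set.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical rule identifiers, listed in the order the rules appear. */
enum rule { RULE_NONE = 0, RULE_KEYWORD = 1, RULE_IDENTIFIER = 2 };

/* Length of the prefix of s matched by the literal pattern "integer". */
static size_t match_keyword(const char *s)
{
    static const char kw[] = "integer";
    size_t n = sizeof kw - 1;

    return strncmp(s, kw, n) == 0 ? n : 0;
}

/* Length of the prefix of s matched by [a-z]+ (ASCII assumed). */
static size_t match_identifier(const char *s)
{
    size_t n = 0;

    while (s[n] >= 'a' && s[n] <= 'z')
        n++;
    return n;
}

/*
 * Choose a rule for the input at s: the longest match wins, and among
 * equal-length matches the rule that occurs first wins. The matched
 * length is stored in *len when len is not NULL.
 */
enum rule pick_rule(const char *s, size_t *len)
{
    size_t k = match_keyword(s);
    size_t i = match_identifier(s);
    size_t best = 0;
    enum rule r = RULE_NONE;

    if (k > best) { best = k; r = RULE_KEYWORD; }
    if (i > best) { best = i; r = RULE_IDENTIFIER; }  /* strictly longer only */
    if (len != NULL)
        *len = best;
    return r;
}
```

The later rule replaces the earlier one only when its match
is strictly longer, which is what gives the first rule
priority on a tie.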
Matching a String with Wildcard Characters
Because lex chooses the longest match first, do not use
rules containing expressions like .* (for example):
     '.*'
The preceding rule might seem like a good way to recognize
a string in single quotes. However, the lexical analyzer
reads far ahead, looking for a distant single quote to
complete the long match. If a lexical analyzer with such
a rule gets the following input, it matches the whole
string:
     'first' quoted string here, 'second' here
To find the smaller strings, first and second, use the
following rule:
     '[^'\n]*'
This rule stops after matching 'first'.
The . (dot) operator does not match a newline character.
Therefore, expressions like .* stop on the current line.
Do not try to defeat this with expressions like [.\n]+.
The lexical analyzer tries to read the entire input file,
and an internal buffer overflow occurs.
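The difference between the two quoted-string rules can be
sketched in C as follows. This is an illustration only; the
function names match_greedy_quote and match_bounded_quote
are hypothetical, and each function hand-codes one pattern
rather than using a lex-generated analyzer.

```c
#include <stddef.h>

/*
 * Hypothetical helper: length matched at the start of s by '.*'
 * (where . excludes newline), or 0 if there is no match. Because the
 * longest match is taken, it runs to the LAST single quote on the line.
 */
size_t match_greedy_quote(const char *s)
{
    size_t i, last = 0;

    if (s[0] != '\'')
        return 0;
    for (i = 1; s[i] != '\0' && s[i] != '\n'; i++)
        if (s[i] == '\'')
            last = i;              /* remember the most distant quote */
    return last != 0 ? last + 1 : 0;
}

/*
 * Hypothetical helper: length matched at the start of s by '[^'\n]*',
 * or 0 if there is no match. It stops at the FIRST closing quote.
 */
size_t match_bounded_quote(const char *s)
{
    size_t i;

    if (s[0] != '\'')
        return 0;
    for (i = 1; s[i] != '\0' && s[i] != '\n' && s[i] != '\''; i++)
        ;                          /* skip characters that are not ' or newline */
    return s[i] == '\'' ? i + 1 : 0;
}
```

For the input line shown above, match_greedy_quote returns
36 (everything through the quote that closes 'second'),
while match_bounded_quote returns 7 (just 'first').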
Finding Strings within Strings
The lex program partitions the input stream and does not
search for all possible matches of each expression. Each
character is accounted for once and only once. For
example, to count occurrences of both she and he in an
input text, try the following rules:
     she   s++;
     he    h++;
     \n    |
     .     ;
The last two rules ignore everything besides he and she.
However, because she includes he, lex does not recognize
the instances of he that are included in she.
To override this choice, use the REJECT action. This
directive tells lex to go to the next rule. lex then
adjusts the position of the input pointer to where it was
before the first rule was executed, and executes the
second choice rule. For example, to count the included
instances of he, use the following rules:
     she   {s++; REJECT;}
     he    {h++; REJECT;}
     \n    |
     .     ;
After counting the occurrences of she, lex rejects the
input stream and then counts the occurrences of he. In
this case, you can omit the REJECT action on he because
she includes he but not vice versa. In other cases, it may
be difficult to determine which input characters are in
both classes.
In general, REJECT is useful whenever the purpose of lex
is not to partition the input stream but to detect all
examples of some items in the input, and the instances of
these items may overlap or include each other.
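The counts produced by the she and he rules, with and
without REJECT, can be sketched in C as follows. This is an
illustration of the resulting counts only, not of how lex
implements REJECT; the function names count_partitioned and
count_with_reject are hypothetical.

```c
#include <string.h>

/*
 * Hypothetical helper: counts as the rules WITHOUT REJECT do. The input
 * is partitioned, so a matched "she" consumes the "he" it contains.
 */
void count_partitioned(const char *text, int *s, int *h)
{
    const char *p = text;

    *s = 0;
    *h = 0;
    while (*p != '\0') {
        if (strncmp(p, "she", 3) == 0) {        /* longest match first */
            (*s)++;
            p += 3;
        } else if (strncmp(p, "he", 2) == 0) {
            (*h)++;
            p += 2;
        } else {
            p++;                                /* \n and . : ignore */
        }
    }
}

/*
 * Hypothetical helper: counts as the rules WITH REJECT do. Every
 * occurrence is counted, including the "he" inside each "she", and the
 * scan then advances by a single character.
 */
void count_with_reject(const char *text, int *s, int *h)
{
    const char *p;

    *s = 0;
    *h = 0;
    for (p = text; *p != '\0'; p++) {
        if (strncmp(p, "she", 3) == 0)
            (*s)++;
        if (strncmp(p, "he", 2) == 0)
            (*h)++;
    }
}
```

For the text "she sells the shells he found", both functions
count 2 for she; count_partitioned counts 2 for he, while
count_with_reject counts 4.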
Environment Variables
The following environment variables affect the behavior of
lex:
LANG         Provides a default value for the locale
             category variables that are not set or null.
LC_ALL       If set, overrides the values of all other
             locale variables.
LC_COLLATE   Determines the order in which output is sorted
             for the -x option.
LC_CTYPE     Determines the locale for the interpretation of
             byte sequences as characters (single-byte or
             multi-byte) in input parameters and files.
LC_MESSAGES  Determines the locale used to affect the format
             and contents of diagnostic messages displayed
             by the command.
NLSPATH      Determines the location of message catalogs for
             the processing of LC_MESSAGES.
NOTES
Because lex uses fixed names for intermediate and output
files, you can have only one lex-generated program in a
given directory. If the -t option is not specified,
informational, error, and warning messages are written to
stdout. If the -t option is specified, informational,
error, and warning messages are written to stderr.
The default size of the yytext array is 200, controlled by
the constant YYLMAX. If the programmer needs to allow a
larger array, the YYLMAX constant may be redefined as
follows from within the lex command file:
     %{
     #undef YYLMAX
     #define YYLMAX 8192
     %}
Two other arrays use YYLMAX: yysbuf and yylstate.
EXAMPLES
The following command draws lex instructions from the file
lexcommands and places the output in lex.yy.c:
     lex lexcommands
The file lexcommands contains an example of a lex program
that would be put into a lex command file. The following
program converts uppercase to lowercase, removes spaces at
the end of a line, and replaces multiple spaces with
single spaces:
     %%
     [A-Z]   putchar(tolower(yytext[0]));
     [ ]+$   ;
     [ ]+    putchar(' ');
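The effect of these three rules on a block of text can be
sketched as a hand-written C function. This is an
illustration only, not the output of lex; the function name
squeeze_text is hypothetical, and the [A-Z] range test
assumes an ASCII-compatible character set.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper: applies the three example rules to the
 * null-terminated string in and writes the result to out, which must
 * have room for at least as many bytes as in (including the null).
 * Returns the length of the result.
 */
size_t squeeze_text(const char *in, char *out)
{
    size_t i = 0, o = 0;

    while (in[i] != '\0') {
        if (in[i] == ' ') {
            while (in[i] == ' ')          /* [ ]+  : consume the whole run */
                i++;
            if (in[i] != '\n')            /* [ ]+$ : drop a run that ends a line */
                out[o++] = ' ';
        } else if (in[i] >= 'A' && in[i] <= 'Z') {
            out[o++] = (char)tolower((unsigned char)in[i]);   /* [A-Z] */
            i++;
        } else {
            out[o++] = in[i++];           /* default action: copy unchanged */
        }
    }
    out[o] = '\0';
    return o;
}
```

A run of spaces is dropped only when a newline follows it,
which mirrors the $ operator; a run of spaces at the very
end of input with no newline is reduced to a single space.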
FILES
libl.a
     Run-time library.
/usr/ccs/lib/ncform
     Default C language skeleton finite state machine for
     lex.
/usr/ccs/lib/ncpform
     Default C language skeleton finite state machine for
     lex, implemented with the pointer definition of yytext.
A default RATFOR language skeleton finite state machine
for lex is also provided.
RELATED INFORMATION
Commands: yacc(1)
Guides: Programming Support Tools
Standards: standards(5)