Introduction to Compiler Design
Torben Ægidius Mogensen (auth.)
The second edition of this textbook has been fully revised and adds material about loop optimisation, function call optimisation and data-flow analysis. It presents techniques for making realistic compilers for simple programming languages, using techniques that are close to those used in "real" compilers, albeit in places slightly simplified for presentation purposes. All phases required for translating a high-level language to symbolic machine language are covered, including lexing, parsing, type checking, intermediate-code generation, machine-code generation, register allocation and optimisation; interpretation is covered briefly.
To remain neutral with respect to implementation languages, the book presents its algorithms in pseudocode rather than in any specific programming language, but in many cases it suggests how these can be realised in different language flavours. Introduction to Compiler Design is intended for an introductory course in compiler design, suitable for both undergraduate and graduate courses depending on which chapters are used.
Publisher:
Springer International Publishing
ISBN 13:
978-3-319-66966-3
Series:
Undergraduate Topics in Computer Science
Undergraduate Topics in Computer Science
Series editor
Ian Mackie
Advisory board
Samson Abramsky, University of Oxford, Oxford, UK
Karin Breitman, Pontifical Catholic University of Rio de Janeiro, Rio de Janeiro, Brazil
Chris Hankin, Imperial College London, London, UK
Dexter C. Kozen, Cornell University, Ithaca, USA
Andrew Pitts, University of Cambridge, Cambridge, UK
Hanne Riis Nielson, Technical University of Denmark, Kongens Lyngby, Denmark
Steven S Skiena, Stony Brook University, Stony Brook, USA
Iain Stewart, University of Durham, Durham, UK
Undergraduate Topics in Computer Science (UTiCS) delivers high-quality instructional content for undergraduates studying in all areas of computing and information science. From core foundational and theoretical material to final-year topics and applications, UTiCS books take a fresh, concise, and modern approach and are ideal for self-study or for a one- or two-semester course. The texts are all authored by established experts in their fields, reviewed by an international advisory board, and contain numerous examples and problems. Many include fully worked solutions.
More information about this series at http://www.springer.com/series/7592
Torben Ægidius Mogensen
Introduction to Compiler Design
Second Edition
Torben Ægidius Mogensen
Datalogisk Institut
Københavns Universitet
Copenhagen
Denmark
ISSN 1863-7310
ISSN 2197-1781 (electronic)
Undergraduate Topics in Computer Science
ISBN 978-3-319-66965-6
ISBN 978-3-319-66966-3 (eBook)
https://doi.org/10.1007/978-3-319-66966-3
Library of Congress Control Number: 2017954288
1st edition: © Springer-Verlag London Limited 2011
2nd edition: © Springer International Publishing AG 2017
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Printed on acid-free paper
This Springer imprint is published by Springer Nature
The registered company is Springer International Publishing AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
Language is a process of free creation; its laws and principles are fixed, but the manner in which the principles of generation are used is free and infinitely varied. Even the interpretation and use of words involves a process of free creation.
Noam Chomsky (1928–)
In order to reduce the complexity of designing and building computers, nearly all of these are made to execute relatively simple commands (but do so very quickly). A program for a computer must be built by combining these very simple commands into a program in what is called machine language. Since this is a tedious and error-prone process, most programming is, instead, done using a high-level programming language. This language can be very different from the machine language that the computer can execute, so some means of bridging the gap is required. This is where the compiler comes in.
A compiler translates (or compiles) a program written in a high-level programming language, that is suitable for human programmers, into the low-level machine language that is required by computers. During this process, the compiler
will also attempt to detect and report obvious programmer mistakes.
Using a highlevel language for programming has a large impact on how fast
programs can be developed. The main reasons for this are:
• Compared to machine language, the notation used by programming languages is
closer to the way humans think about problems.
• The compiler can detect some types of programming mistakes.
• Programs written in a highlevel language tend to be shorter than equivalent
programs written in machine language.
Another advantage of using a highlevel language is that the same program can be
compiled to many different machine languages and, hence, be brought to run on
many different machines.
On the other hand, programs that are written in a high-level language and automatically translated to machine language may run somewhat slower than programs that are hand-coded in machine language. Hence, some time-critical programs are still written partly in machine language. A good compiler will, however, be able to get very close to the speed of hand-written machine code when translating well-structured programs.
The Phases of a Compiler
Since writing a compiler is a non-trivial task, it is a good idea to structure the work. A typical way of doing this is to split the compilation into several phases with well-defined interfaces between them. Conceptually, these phases operate in sequence (though in practice, they are often interleaved), each phase (except the first) taking the output from the previous phase as its input. It is common to let each phase be handled by a separate program module. Some of these modules are written by hand, while others may be generated from specifications. Often, some of the modules can be shared between several compilers.

A common division into phases is described below. In some compilers, the ordering of phases may differ slightly, some phases may be combined or split into several phases, or some extra phases may be inserted between those mentioned below.
Lexical analysis: This is the initial part of reading and analyzing the program text: the text is read and divided into tokens, each of which corresponds to a symbol in the programming language, e.g., a variable name, keyword, or number. Lexical analysis is often abbreviated to lexing.

Syntax analysis: This phase takes the list of tokens produced by the lexical analysis and arranges these in a tree structure (called the syntax tree) that reflects the structure of the program. This phase is often called parsing.

Type checking: This phase analyses the syntax tree to determine if the program violates certain consistency requirements, e.g., if a variable is used but not declared, or if it is used in a context that does not make sense given the type of the variable, such as trying to use a Boolean value as a function pointer.

Intermediate code generation: The program is translated to a simple machine-independent intermediate language.

Register allocation: The symbolic variable names used in the intermediate code are translated to numbers, each of which corresponds to a register in the target machine code.

Machine code generation: The intermediate language is translated to assembly language (a textual representation of machine code) for a specific machine architecture.

Assembly and linking: The assembly language code is translated into binary representation, and addresses of variables, functions, etc., are determined.
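As a toy illustration of this phase structure, the pipeline can be sketched as a chain of functions for a tiny expression language with only integer literals and addition. This is a hypothetical sketch for illustration only, not code from the book (which deliberately presents its algorithms in pseudocode); register allocation, assembly, and linking are omitted, and the stack-machine instruction set is made up.

```python
# Hypothetical sketch of the phase pipeline described above, for a tiny
# expression language with only integer literals and '+'. Each phase
# consumes the previous phase's output, as the text describes.

def lex(source):
    """Lexical analysis: divide the text into tokens."""
    return source.replace("+", " + ").split()

def parse(tokens):
    """Syntax analysis: arrange tokens into a (left-leaning) syntax tree."""
    tree = ("num", tokens[0])
    i = 1
    while i < len(tokens):
        if tokens[i] != "+":
            raise SyntaxError(f"expected '+', got {tokens[i]!r}")
        tree = ("add", tree, ("num", tokens[i + 1]))
        i += 2
    return tree

def type_check(tree):
    """Type checking: verify that every leaf is an integer literal."""
    if tree[0] == "num":
        if not tree[1].lstrip("-").isdigit():
            raise TypeError(f"not an integer: {tree[1]!r}")
    else:
        type_check(tree[1])
        type_check(tree[2])
    return tree

def gen_code(tree, code=None):
    """Code generation: emit instructions for a made-up stack machine."""
    if code is None:
        code = []
    if tree[0] == "num":
        code.append(("PUSH", int(tree[1])))
    else:
        gen_code(tree[1], code)
        gen_code(tree[2], code)
        code.append(("ADD",))
    return code

def compile_expr(source):
    """Run the phases in sequence, frontend to backend."""
    return gen_code(type_check(parse(lex(source))))

print(compile_expr("1 + 2 + 3"))
# [('PUSH', 1), ('PUSH', 2), ('ADD',), ('PUSH', 3), ('ADD',)]
```

Note how each phase can assume invariants established by the previous one: `gen_code` never re-checks that its leaves are integers, because `type_check` has already guaranteed it.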
The first three phases are collectively called the frontend of the compiler and the last three phases are collectively called the backend. The middle part of the compiler is, in this context, only the intermediate-code generation, but this often includes various optimisations and transformations on the intermediate code.
Each phase, through checking and transformation, establishes invariants on the
data it passes on to the next phase. For example, the type checker can assume the
absence of syntax errors, and the code generation can assume the absence of type
errors. These invariants can reduce the burden of writing the later phases.
Assembly and linking are typically done by programs supplied by the machine
or operating system vendor, and are hence not part of the compiler itself. We will
not further discuss these phases in this book, but assume that a compiler produces
its result as symbolic assembly code.
Interpreters
An interpreter is another way of implementing a programming language.
Interpretation shares many aspects with compiling. Lexing, parsing, and type checking are done in an interpreter just as in a compiler. But instead of generating
code from the syntax tree, the syntax tree is processed directly to evaluate
expressions, execute statements, and so on. An interpreter may need to process the
same piece of the syntax tree (for example, the body of a loop) many times and,
hence, interpretation is typically slower than executing a compiled program. But
writing an interpreter is often simpler than writing a compiler, and an interpreter is
easier to move to a different machine, so for applications where speed is not of
essence, or where each part of the program is executed only once, interpreters are
often used.
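To make the contrast with compilation concrete, a tree-walking interpreter for a tiny addition-only expression language might look like the following sketch; the node shapes `("num", n)` and `("add", l, r)` are assumptions made for this illustration, not the book's own representation.

```python
# A minimal tree-walking interpreter: the syntax tree is evaluated
# directly instead of being translated to code first. The node shapes
# ("num", n) and ("add", l, r) are assumed for this sketch only.

def eval_tree(tree):
    """Evaluate an expression tree recursively, revisiting subtrees as needed."""
    if tree[0] == "num":
        return tree[1]
    if tree[0] == "add":
        return eval_tree(tree[1]) + eval_tree(tree[2])
    raise ValueError(f"unknown node kind: {tree[0]!r}")

# (1 + (2 + 3)) evaluated straight off the tree:
tree = ("add", ("num", 1), ("add", ("num", 2), ("num", 3)))
print(eval_tree(tree))  # 6
```

If this tree were the body of a loop, `eval_tree` would traverse it once per iteration, which is exactly why interpretation is typically slower than running compiled code.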
Compilation and interpretation may be combined to implement a programming
language. For example, the compiler may produce intermediatelevel code which is
then interpreted rather than compiled to machine code. In some systems, there may
even be parts of a program that are compiled to machine code, some parts that are
compiled to intermediate code that is interpreted at runtime, while other parts may
be interpreted directly from the syntax tree. Each choice is a compromise between
speed and space: compiled code tends to be bigger than intermediate code, which tends to be bigger than syntax trees, but each step of translation improves running speed.
Using an interpreter is also useful during program development, where it is more important to be able to test a program modification quickly than to run the program efficiently. And since interpreters do less work on the program before execution starts, they are able to start running the program more quickly. Furthermore, since an interpreter works on a program representation that is closer to the source code than is compiled code, error messages can be more precise and informative.
We will discuss interpreters briefly in Chap. 4, but they are not the main focus of
this book.
Why Learn About Compilers?
Few people will ever be required to write a compiler for a general-purpose language like C, Java, or SML. So why do most computer science institutions offer compiler courses and often make these mandatory? Some typical reasons are:
(a) It is considered a topic that you should know in order to be “well-cultured” in
computer science.
(b) A good craftsman should know his tools, and compilers are important tools for
programmers and computer scientists.
(c) The techniques used for constructing a compiler are useful for other purposes as
well.
(d) There is a good chance that a programmer or computer scientist will need to write a compiler or interpreter for a domain-specific language.
The first of these reasons is somewhat dubious, though something can be said for “knowing your roots”, even in such a rapidly changing field as computer science.

Reason “b” is more convincing: Understanding how a compiler is built will allow programmers to get an intuition about what their high-level programs will look like when compiled, and use this intuition to tune programs for better efficiency. Furthermore, the error reports that compilers provide are often easier to understand when one knows about and understands the different phases of compilation, such as knowing the difference between lexical errors, syntax errors, type errors, and so on.
The third reason is also quite valid. In particular, the techniques used for reading (lexing and parsing) the text of a program and converting this into a form (abstract syntax) that is easily manipulated by a computer can be used to read and manipulate any kind of structured text, such as XML documents, address lists, etc.
Reason “d” is becoming more and more important as domain-specific languages (DSLs) are gaining in popularity. A DSL is a (typically small) language designed for a narrow class of problems. Examples are database query languages, text-formatting languages, scene description languages for ray tracers, and languages for setting up economic simulations. The target language for a compiler for a DSL may be traditional machine code, but it can also be another high-level language for which compilers already exist, a sequence of control signals for a machine, or formatted text and graphics in some printer-control language (e.g., PostScript), and DSLs are often interpreted instead of compiled. Even so, all DSL compilers and interpreters will have frontends for reading and analyzing the program text that are similar to those used in compilers and interpreters for general-purpose languages.
In brief, the methods needed to make a compiler frontend are more widely
applicable than the methods needed to make a compiler backend, but the latter is
more important for understanding how a program is executed on a machine.
About the Second Edition of the Book
The second edition has been extended with material about optimisations for function calls and loops, and about data-flow analysis, which can be used for various optimisations. This extra material is aimed at advanced BSc-level courses or MSc-level courses.
To the Lecturer
This book was written for use in the introductory compiler course at DIKU, the
Department of Computer Science at the University of Copenhagen, Denmark.
At times, standard techniques from compiler construction have been simplified
for presentation in this book. In such cases, references are made to books or articles
where the full version of the techniques can be found.
The book aims at being “language neutral”. This means two things:
• Little detail is given about how the methods in the book can be implemented in any specific language. Rather, the description of the methods is given in the form of algorithm sketches and textual suggestions of how these can be implemented in various types of languages, in particular imperative and functional languages.
• There is no single through-going example of a language to be compiled. Instead, different small (sub-)languages are used in various places to cover exactly the points that the text needs. This is done to avoid drowning in detail, hopefully allowing the readers to “see the wood for the trees”.
Each chapter has a section on further reading, which suggests additional reading material for interested students. Each chapter has a set of exercises. Few of these require access to a computer; they can be solved on paper or blackboard. After some of the sections in the book, a few easy exercises are listed as suggested exercises. It is recommended that the student attempts to solve these exercises before continuing reading, as the exercises support understanding of the previous sections.
Teaching with this book can be supplemented with project work, where students write simple compilers. Since the book is language neutral, no specific project is given. Instead, the teacher must choose relevant tools and select a project that fits the level of the students and the time available. Depending on the amount of project work and on how much of the advanced material added in the second edition is used, the book can support course sizes ranging from 5 to 10 ECTS points.

The following link contains extra material for the book, including solutions to selected exercises: http://www.diku.dk/~torbenm/ICD/
Copenhagen, Denmark
Torben Ægidius Mogensen
Acknowledgements
“Most people return small favors, acknowledge medium ones and repay greater ones—with ingratitude.”
Benjamin Franklin (1706–1790)
The author wishes to thank all people who have been helpful in making this book a
reality. This includes the students who have been exposed to earlier versions of the
book at the compiler courses “Dat1E”, “Oversættere”, “Implementering af programmeringssprog” and “Advanced Language Processing” at DIKU, and who have found numerous typos and other errors in the earlier versions. I would also like to thank co-teachers and instructors at these courses, who have pointed out places
where things were not as clear as they could be.
Copenhagen, Denmark
August 2017
Torben Ægidius Mogensen
Contents
1
Lexical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . .
1.1
Regular Expressions . . . . . . . . . . . . . . . . . .
1.1.1 Shorthands . . . . . . . . . . . . . . . . . . .
1.1.2 Examples . . . . . . . . . . . . . . . . . . . .
1.2
Nondeterministic Finite Automata . . . . . . . .
1.3
Converting a Regular Expression to an NFA
1.3.1 Optimisations . . . . . . . . . . . . . . . . .
1.4
Deterministic Finite Automata . . . . . . . . . . .
1.5
Converting an NFA to a DFA . . . . . . . . . . .
1.5.1 Solving Set Equations . . . . . . . . . . .
1.5.2 The Subset Construction . . . . . . . . .
1.6
Size Versus Speed . . . . . . . . . . . . . . . . . . . .
1.7
Minimisation of DFAs . . . . . . . . . . . . . . . . .
1.7.1 Example . . . . . . . . . . . . . . . . . . . . .
1.7.2 Dead States . . . . . . . . . . . . . . . . . .
1.8
Lexers and Lexer Generators . . . . . . . . . . . .
1.8.1 Lexer Generators . . . . . . . . . . . . . .
1.9
Properties of Regular Languages . . . . . . . . .
1.9.1 Relative Expressive Power . . . . . . .
1.9.2 Limits to Expressive Power . . . . . . .
1.9.3 Closure Properties . . . . . . . . . . . . .
1.10 Further Reading . . . . . . . . . . . . . . . . . . . . .
1.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . .
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
1
2
4
5
7
9
10
12
13
14
16
19
20
21
23
25
29
30
30
32
32
33
34
38
2
Syntax Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.1
ContextFree Grammars . . . . . . . . . . . . . . . . . .
2.1.1 How to Write Context Free Grammars .
2.2
Derivation . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.2.1 Syntax Trees and Ambiguity . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
39
40
42
44
45
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
xiii
xiv
Contents
2.3
Operator Precedence . . . . . . . . . . . . . . . . . . . . . . . .
2.3.1 Rewriting Ambiguous Expression Grammars
2.4
Other Sources of Ambiguity . . . . . . . . . . . . . . . . . .
2.5
Syntax Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.6
Predictive Parsing . . . . . . . . . . . . . . . . . . . . . . . . . .
2.7
Nullable and FIRST . . . . . . . . . . . . . . . . . . . . . . . . .
2.8
Predictive Parsing Revisited . . . . . . . . . . . . . . . . . . .
2.9
FOLLOW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.10 A Larger Example . . . . . . . . . . . . . . . . . . . . . . . . . .
2.11 LL(1) Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.11.1 Recursive Descent . . . . . . . . . . . . . . . . . . .
2.11.2 TableDriven LL(1) Parsing . . . . . . . . . . . .
2.11.3 Conﬂicts . . . . . . . . . . . . . . . . . . . . . . . . . .
2.12 Rewriting a Grammar for LL(1) Parsing . . . . . . . . . .
2.12.1 Eliminating LeftRecursion . . . . . . . . . . . . .
2.12.2 LeftFactorisation . . . . . . . . . . . . . . . . . . . .
2.12.3 Construction of LL(1) Parsers Summarised .
2.13 SLR Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.14 Constructing SLR Parse Tables . . . . . . . . . . . . . . . .
2.14.1 Conﬂicts in SLR ParseTables . . . . . . . . . . .
2.15 Using Precedence Rules in LR Parse Tables . . . . . . .
2.16 Using LRParser Generators . . . . . . . . . . . . . . . . . . .
2.16.1 Conﬂict Handling in Parser Generators . . . .
2.16.2 Declarations and Actions . . . . . . . . . . . . . . .
2.16.3 Abstract Syntax . . . . . . . . . . . . . . . . . . . . .
2.17 Properties of ContextFree Languages . . . . . . . . . . .
2.18 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.19 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
48
50
53
54
54
55
59
61
63
65
66
68
70
70
70
72
73
74
77
81
82
84
84
86
86
89
90
91
95
3
Scopes and Symbol Tables . . . . . . . . . . . . . . . . . .
3.1
Symbol Tables . . . . . . . . . . . . . . . . . . . . . .
3.1.1 Implementation of Symbol Tables . .
3.1.2 Simple Persistent Symbol Tables . . .
3.1.3 A Simple Imperative Symbol Table .
3.1.4 Efﬁciency Issues . . . . . . . . . . . . . . .
3.1.5 Shared or Separate Name Spaces . . .
3.2
Further Reading . . . . . . . . . . . . . . . . . . . . .
3.3
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . .
Reference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
97
98
98
99
100
101
101
102
102
102
4
Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.1
The Structure of an Interpreter . . . . . . . . . . . . . . . . . . . . . . . . 104
4.2
A Small Example Language . . . . . . . . . . . . . . . . . . . . . . . . . 104
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Contents
xv
4.3
An Interpreter for the Example Language . . . . .
4.3.1 Evaluating Expressions . . . . . . . . . . . .
4.3.2 Interpreting Function Calls . . . . . . . . .
4.3.3 Interpreting a Program . . . . . . . . . . . .
4.4
Advantages and Disadvantages of Interpretation
4.5
Further Reading . . . . . . . . . . . . . . . . . . . . . . .
4.6
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . .
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
105
106
108
108
110
111
111
113
5
Type Checking . . . . . . . . . . . . . . . . . . . . . . . .
5.1
The Design Space of Type Systems . . . .
5.2
Attributes . . . . . . . . . . . . . . . . . . . . . . .
5.3
Environments for Type Checking . . . . . .
5.4
Type Checking of Expressions . . . . . . . .
5.5
Type Checking of Function Declarations
5.6
Type Checking a Program . . . . . . . . . . .
5.7
Advanced Type Checking . . . . . . . . . . .
5.8
Further Reading . . . . . . . . . . . . . . . . . .
5.9
Exercises . . . . . . . . . . . . . . . . . . . . . . .
References . . . . . . . . . . . . . . . . . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
115
115
117
117
118
120
121
122
124
125
126
6
IntermediateCode Generation . . . . . . . . . . . . . . .
6.1
Designing an Intermediate Language . . . . . .
6.2
The Intermediate Language . . . . . . . . . . . . .
6.3
SyntaxDirected Translation . . . . . . . . . . . . .
6.4
Generating Code from Expressions . . . . . . .
6.4.1 Examples of Translation . . . . . . . . .
6.5
Translating Statements . . . . . . . . . . . . . . . . .
6.6
Logical Operators . . . . . . . . . . . . . . . . . . . .
6.6.1 Sequential Logical Operators . . . . . .
6.7
Advanced Control Statements . . . . . . . . . . .
6.8
Translating Structured Data . . . . . . . . . . . . .
6.8.1 FloatingPoint Values . . . . . . . . . . .
6.8.2 Arrays . . . . . . . . . . . . . . . . . . . . . .
6.8.3 Strings . . . . . . . . . . . . . . . . . . . . . .
6.8.4 Records/Structs and Unions . . . . . . .
6.9
Translation of Declarations . . . . . . . . . . . . .
6.9.1 Simple Local Declarations . . . . . . . .
6.9.2 Translation of Function Declarations
6.10 Further Reading . . . . . . . . . . . . . . . . . . . . .
6.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . .
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
127
128
129
131
131
135
136
138
140
142
143
144
144
149
149
150
151
151
152
152
155
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
xvi
Contents
7
MachineCode Generation . . . . . . . . . . .
7.1
Conditional Jumps . . . . . . . . . . .
7.2
Constants . . . . . . . . . . . . . . . . . .
7.3
Exploiting Complex Instructions .
7.3.1 TwoAddress Instructions
7.4
Optimisations . . . . . . . . . . . . . . .
7.5
Further Reading . . . . . . . . . . . . .
7.6
Exercises . . . . . . . . . . . . . . . . . .
References . . . . . . . . . . . . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
157
158
159
159
163
164
166
166
167
8
Register Allocation . . . . . . . . . . . . . . . . . . . . . .
8.1
Liveness . . . . . . . . . . . . . . . . . . . . . . . . .
8.2
Liveness Analysis . . . . . . . . . . . . . . . . . .
8.3
Interference . . . . . . . . . . . . . . . . . . . . . . .
8.4
Register Allocation by Graph Colouring . .
8.5
Spilling . . . . . . . . . . . . . . . . . . . . . . . . .
8.6
Heuristics . . . . . . . . . . . . . . . . . . . . . . . .
8.6.1 Removing Redundant Moves . . . .
8.6.2 Using Explicit Register Numbers .
8.7
Further Reading . . . . . . . . . . . . . . . . . . .
8.8
Exercises . . . . . . . . . . . . . . . . . . . . . . . .
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
169
170
171
174
176
177
179
180
181
182
182
184
9
9 Functions  185
  9.1 The Call Stack  185
  9.2 Activation Records  186
  9.3 Prologues, Epilogues and Call Sequences  187
  9.4 Letting the Callee Save Registers  190
  9.5 Caller-Saves Versus Callee-Saves  191
  9.6 Using Registers to Pass Parameters  192
  9.7 Interaction with the Register Allocator  194
  9.8 Local Variables  196
  9.9 Accessing Non-local Variables  196
    9.9.1 Global Variables  197
    9.9.2 Call-by-Reference Parameters  198
  9.10 Functions as Parameters  199
  9.11 Variants  199
    9.11.1 Variable-Sized Frames  199
    9.11.2 Variable Number of Parameters  200
    9.11.3 Direction of Stack Growth and Position of FP  200
    9.11.4 Register Stacks  200
  9.12 Optimisations for Function Calls  201
    9.12.1 Inlining  201
    9.12.2 Tail-Call Optimisation  203
  9.13 Further Reading  207
  9.14 Exercises  207
  References  209
10 Data-Flow Analysis and Optimisation  211
  10.1 Data-Flow Analysis  211
  10.2 How to Design a Data-Flow Analysis  212
  10.3 Liveness Analysis  212
    10.3.1 Improving Liveness Analysis  213
  10.4 Generalising from Liveness Analysis  214
  10.5 Common Subexpression Elimination  215
    10.5.1 Available Assignments  215
    10.5.2 Example of Available-Assignments Analysis  217
    10.5.3 Using Available Assignment Analysis for Common Subexpression Elimination  220
  10.6 Index-Check Elimination  221
  10.7 Jump-to-Jump Elimination  224
  10.8 Resources Used by Data-Flow Analysis  225
  10.9 Pointer Analysis  227
  10.10 Limitations of Data-Flow Analyses  231
  10.11 Further Reading  232
  10.12 Exercises  232
  References  233
11 Optimisations for Loops  235
  11.1 Loops  235
  11.2 Code Hoisting  236
  11.3 Memory Prefetching  238
  11.4 Incrementalisation  240
    11.4.1 Rules for Incrementalisation  242
  11.5 Further Reading  244
  11.6 Exercises  244
  Reference  245

Appendix A: Set Notation and Concepts  247

Index  255
List of Figures

1.1 Regular expressions and their derivation  3
1.2 Some algebraic properties of regular expressions  4
1.3 Example of an NFA  8
1.4 Constructing NFA fragments from regular expressions  10
1.5 NFA for the regular expression (a|b)*ac  11
1.6 Optimised NFA construction for regular expression shorthands  11
1.7 Optimised NFA for [0-9]+  12
1.8 Example of a DFA  12
1.9 DFA constructed from the NFA in Fig. 1.5  19
1.10 Non-minimal DFA  21
1.11 Minimal DFA  23
1.12 Combined NFA for several tokens  26
1.13 Combined DFA for several tokens  27
1.14 A 4-state NFA that gives 15 DFA states  31
2.1 From regular expressions to context-free grammars  42
2.2 Simple expression grammar  43
2.3 Simple statement grammar  43
2.4 Example grammar  45
2.5 Derivation of the string aabbbcc using Grammar 2.4  45
2.6 Leftmost derivation of the string aabbbcc using Grammar 2.4  45
2.7 Syntax tree for the string aabbbcc using Grammar 2.4  46
2.8 Alternative syntax tree for the string aabbbcc using Grammar 2.4  47
2.9 Unambiguous version of Grammar 2.4  47
2.10 Fully reduced tree for the syntax tree in Fig. 2.7  48
2.11 Preferred syntax tree for 2+3*4 using Grammar 2.2, and the corresponding fully reduced tree  49
2.12 Unambiguous expression grammar  52
2.13 Syntax tree for 2+3*4 using Grammar 2.12, and the corresponding fully reduced tree  52
2.14 Unambiguous grammar for statements  53
2.15 Fixed-point iteration for calculation of Nullable  57
2.16 Fixed-point iteration for calculation of FIRST  58
2.17 Recursive descent parser for Grammar 2.9  66
2.18 Tree-building recursive descent parser for Grammar 2.9  67
2.19 LL(1) table for Grammar 2.9  68
2.20 Program for table-driven LL(1) parsing  68
2.21 Input and stack during table-driven LL(1) parsing  69
2.22 Tree-building program for table-driven LL(1) parsing  69
2.23 Removing left-recursion from Grammar 2.12  72
2.24 Left-factorised grammar for conditionals  73
2.25 Example shift-reduce parsing  75
2.26 SLR table for Grammar 2.9  76
2.27 Algorithm for SLR parsing  77
2.28 Example SLR parsing  77
2.29 Example grammar for SLR-table construction  78
2.30 NFAs for the productions in Grammar 2.29  78
2.31 Combined NFA for Grammar 2.29: Epsilon transitions are added, and A is the only start state  79
2.32 DFA constructed from the NFA in Fig. 2.31  79
2.33 DFA table for Grammar 2.9, equivalent to the DFA in Fig. 2.32  80
2.34 Summary of SLR parse-table construction  80
2.35 Textual representation of NFA states  85
4.1 Example language for interpretation  105
4.2 Evaluating expressions  107
4.3 Evaluating a function call  109
4.4 Interpreting a program  109
5.1 The design space of type systems  116
5.2 Type checking of expressions  119
5.3 Type checking a function declaration  121
5.4 Type checking a program  122
6.1 The intermediate language  130
6.2 A simple expression language  132
6.3 Translating an expression  134
6.4 Statement language  136
6.5 Translation of statements  137
6.6 Translation of simple conditions  137
6.7 Example language with logical operators  140
6.8 Translation of sequential logical operators  141
6.9 Translation for one-dimensional arrays  145
6.10 A two-dimensional array  146
6.11 Translation of multi-dimensional arrays  148
6.12 Translation of simple declarations  151
7.1 Pattern/replacement pairs for a subset of the MIPS instruction set  162
8.1 Example program for liveness analysis and register allocation  171
8.2 Gen and kill sets  172
8.3 succ, gen and kill for the program in Fig. 8.1  173
8.4 Fixed-point iteration for liveness analysis  174
8.5 Interference graph for the program in Fig. 8.1  175
8.6 Algorithm 8.3 applied to the graph in Fig. 8.5  178
8.7 Program from Fig. 8.1 after spilling variable a  179
8.8 Interference graph for the program in Fig. 8.7  179
8.9 Colouring of the graph in Fig. 8.8  179
9.1 Simple activation record layout  187
9.2 Prologue for the header f(p1, ..., pm) using the frame layout shown in Fig. 9.1  188
9.3 Epilogue for the instruction RETURN result using the frame layout shown in Fig. 9.1  188
9.4 Call sequence for x := CALL g(a1, ..., an) using the frame layout shown in Fig. 9.1  189
9.5 Activation record layout for callee-saves  190
9.6 Prologue for the header f(p1, ..., pm) using callee-saves  190
9.7 Epilogue for the instruction RETURN result using callee-saves  190
9.8 Call sequence for x := CALL g(a1, ..., an) using callee-saves  190
9.9 Possible division of registers for a 16-register architecture  191
9.10 Activation record layout for the register division shown in Fig. 9.9  192
9.11 Prologue for the header f(p1, ..., pm) using the register division shown in Fig. 9.9  193
9.12 Epilogue for the instruction RETURN result using the register division shown in Fig. 9.9  193
9.13 Call sequence for x := CALL g(a1, ..., an) using the register division shown in Fig. 9.9  194
9.14 Variable capture when inlining  202
9.15 Renaming variables when inlining  202
9.16 Recursive inlining  202
10.1 Gen and kill sets for available assignments  216
10.2 Example code for available-assignments analysis  218
10.3 pred, gen and kill for the program in Fig. 10.2  218
10.4 Fixed-point iteration for available-assignment analysis  219
10.5 The program in Fig. 10.2 after common subexpression elimination  220
10.6 gen and kill sets for index-check elimination  223
10.7 Intermediate code for for-loop with index check  224
10.8 Equations for pointer analysis  229
11.1 Incrementalisation of nested loop  241
11.2 Eliminating weakly dead variables  242
Chapter 1
Lexical Analysis
I am not yet so lost in lexicography as to forget that words are
the daughters of earth, and that things are the sons of heaven.
Language is only the instrument of science, and words are but
the signs of ideas.
Samuel Johnson (1709–1784)
The word “lexical” in the traditional sense means “pertaining to words”. In terms of
programming languages, words are entities like variable names, numbers, keywords
etc. Such word-like entities are traditionally called tokens.
A lexical analyser, also called a lexer or scanner, will as input take a string of
individual letters and divide this string into a sequence of classified tokens. Additionally, it will filter out whatever separates the tokens (the so-called whitespace),
i.e., layout characters (spaces, newlines etc.) and comments.
The main purpose of lexical analysis is to make life easier for the subsequent
syntax analysis phase. In theory, the work that is done during lexical analysis can be
made an integral part of syntax analysis, and in simple systems this is indeed often
done. However, there are reasons for keeping the phases separate:
• Efficiency: A specialised lexer may do the simple parts of the work faster than the
parser, using more general methods, can. Furthermore, the size of a system that
is split in two phases may be smaller than a combined system. This may seem
paradoxical but, as we shall see, there is a non-linear factor involved which may
make a separated system smaller than a combined system.
• Modularity: The syntactical description of the language need not be cluttered with
small lexical details such as whitespace and comments.
• Tradition: Languages are often designed with separate lexical and syntactical
phases in mind, and the standard documents of such languages typically separate lexical and syntactical elements of the languages.
It is usually not terribly difficult to write a lexer by hand: You first read past initial
whitespace, then you, in sequence, test to see if the next token is a keyword, a
© Springer International Publishing AG 2017
T.Æ. Mogensen, Introduction to Compiler Design, Undergraduate Topics
in Computer Science, https://doi.org/10.1007/9783319669663_1
number, a variable or whatnot. However, this is not a very good way of handling
the problem: You may read the same part of the input repeatedly while testing each
possible token, and in some cases it may not be clear where the next token ends.
Furthermore, a hand-written lexer may be complex and difficult to maintain. Hence,
lexers are normally constructed by lexer generators that transform human-readable
specifications of tokens and whitespace into efficient programs.
We will see the same general strategy in the chapter about syntax analysis: Specifications in a well-defined human-readable notation are transformed into efficient
programs.
For lexical analysis, specifications are traditionally written using regular expressions: An algebraic notation for describing sets of strings. The generated lexers are
in a class of extremely simple programs called finite automata.
This chapter will describe regular expressions and finite automata, their properties
and how regular expressions can be converted to finite automata. Finally, we discuss
some practical aspects of lexer generators.
1.1 Regular Expressions
The set of all integer constants or the set of all variable names are examples of
sets of strings, where the individual digits or letters used to form these constants or
names are taken from a particular alphabet, i.e., a set of characters. A set of strings
is called a language. For integers, the alphabet consists of the digits 0–9 and for
variable names the alphabet contains both letters and digits (and perhaps a few other
characters, such as underscore).
Given an alphabet, we will describe sets of strings by regular expressions, an
algebraic notation that is compact and relatively easy for humans to use and understand. The idea is that regular expressions that describe simple sets of strings can
be combined to form regular expressions that describe more complex sets of strings.
Regular expressions are often called “regexps” for short.
When talking about regular expressions, we will use the letters r, s and t in italics
to denote unspecified regular expressions. When letters stand for themselves (i.e., in
regular expressions that describe strings that use these letters) we will use typewriter
font, e.g., a or b. The letters u, v and w in italics will be used to denote unspecified
single strings, i.e., members of some language. As an example, abw denotes any
string starting with ab. When we say, e.g., “The regular expression s” (in typewriter font) we mean the regular expression that describes a single one-letter string “s”, but when we say “The regular expression s” (in italics), we mean a regular expression of any form, which we just happen to call s. We use the notation L(s) to denote the language (i.e., set of strings)
described by the regular expression s. For example, L(a) is the set {“a”}.
To find L(s) for a given regular expression s, we use derivation: rules that rewrite a
regular expression into a string of letters. These rules allow a single regular expression
to be rewritten into several different strings, so L(s) is the set of strings that s can
be rewritten to using these rules. L(s) is often an infinite set, but each string in the
set is finite and can be obtained by a finite number of derivation steps. Figure 1.1
shows the different forms of regular expression, the derivation rules for these, and an
informal description of what the regular expression forms mean.

Fig. 1.1 Regular expressions and their derivation:

  Regexp  Derivation rules                   Informal description
  a       (none; already a string)           The one-letter string a.
  ε       (none; already a string)           The empty string.
  s|t     s|t ⇒ s  and  s|t ⇒ t              Either s or t. Note that this allows
                                             multiple different derivations.
  st      st ⇒ s′t′, if s ⇒ s′ and t ⇒ t′    Something derived from s followed by
                                             something derived from t.
  s*      s* ⇒ ε  and  s* ⇒ s(s*)            A concatenation of any number
                                             (including 0) of strings derived from s.
                                             The number depends on how many times
                                             the second rule is used.

Note that we use a double arrow (⇒) to denote derivation. In addition to the specific
derivation rules in Fig. 1.1, we also use some general rules to make derivation
reflexive and transitive:

  s ⇒ s                           Derivation is reflexive
  r ⇒ t if r ⇒ s and s ⇒ t        Derivation is transitive
Note that, while we use the same notation for concrete strings and regular expressions
denoting one-string languages, the context will make it clear which is meant. We will
often show strings and sets of strings without using quotation marks, e.g., write {a,
bb} instead of {“a”, “bb”}. When doing so, we sometimes use ε to denote the empty
string, so the derivation s∗ ⇒ shown in Fig. 1.1 can also be written as s∗ ⇒ ε.
We can use the derivation rules to find the language for a regular expression. As
an example, L(a(b|c)) = {ab, ac} because a(b|c) ⇒ a(b) = ab and a(b|c) ⇒
a(c) = ac. L((a|b)*) is infinite and contains any sequence of as and bs, including
the empty sequence. For example, the string ab is in L((a|b)*) because (a|b)* ⇒
(a|b)(a|b)* ⇒ a(a|b)* ⇒ a(a|b)(a|b)* ⇒ ab(a|b)* ⇒ ab.
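The bounded part of this definition can be made concrete in code: since every string in L(s) is obtained in finitely many derivation steps, we can enumerate all strings of L(s) up to a given length. The sketch below is a minimal Python illustration; the tuple encoding of regular expressions is our own and not from the book.

```python
# A regular expression is represented as a nested tuple:
#   ('char', c), ('eps',), ('alt', s, t), ('seq', s, t), ('star', s)

def lang(r, k):
    """Return the set of strings in L(r) of length at most k."""
    tag = r[0]
    if tag == 'char':
        return {r[1]} if k >= 1 else set()
    if tag == 'eps':
        return {''}
    if tag == 'alt':                      # L(s|t) = L(s) ∪ L(t)
        return lang(r[1], k) | lang(r[2], k)
    if tag == 'seq':                      # a string from s followed by one from t
        return {u + v
                for u in lang(r[1], k)
                for v in lang(r[2], k - len(u))}
    if tag == 'star':                     # grow {ε, s, ss, ...} to a fixed point
        result = {''}
        while True:
            new = result | {u + v for u in lang(r[1], k)
                                  for v in result if len(u + v) <= k}
            if new == result:
                return result
            result = new

# L(a(b|c)) = {ab, ac}
abc = ('seq', ('char', 'a'), ('alt', ('char', 'b'), ('char', 'c')))
print(sorted(lang(abc, 2)))    # ['ab', 'ac']
```

Running `lang` on ('star', …) shows why L(s) is usually infinite: raising the bound k keeps adding longer strings without ever exhausting the language.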
Parentheses and Precedence Rules

When we use the symbols above to construct composite regular expressions such
as a|ab*, it is not a priori clear how the different subexpressions are grouped. We
will sometimes use parentheses to make the grouping of symbols explicit such as in
(a(a|b))*. Additionally, we use precedence rules, similar to the algebraic convention
that multiplication binds stronger than addition, so 3 + 4 × 5 is equivalent to 3 + (4 × 5)
and not (3 + 4) × 5. For regular expressions, we use the following conventions: *
binds tighter than concatenation, which binds tighter than alternative (|). The example
a|ab* from above is, hence, equivalent to a|(a(b*)).
Fig. 1.2 Some algebraic properties of regular expressions:

  (r|s)|t = r|s|t = r|(s|t)    | is associative.
  s|t = t|s                    | is commutative.
  s|s = s                      | is idempotent.
  s? = s|ε                     by definition.
  (rs)t = rst = r(st)          concatenation is associative.
  sε = εs = s                  ε is a neutral element for concatenation.
  r(s|t) = rs|rt               concatenation distributes over |.
  (r|s)t = rt|st               concatenation distributes over |.
  (s*)* = s*                   * is idempotent.
  s*s* = s*                    0 or more twice is still 0 or more.
  ss* = s+ = s*s               by definition.
  (s+)+ = s+                   + is idempotent.
The | operator is associative and commutative. Concatenation is associative (but
obviously not commutative) and distributes over |. Figure 1.2 shows these and other
algebraic properties of regular expressions, including properties of some of the shorthands introduced below.
Suggested Exercise: 1.1.
1.1.1 Shorthands
While the constructions in Fig. 1.1 suffice to describe e.g., number strings and variable
names, we will often use extra shorthands for convenience. For example, if we want
to describe nonnegative integer constants, we can do so by saying that it is one or
more digits, which is expressed by the regular expression
(0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*
The large number of different digits makes this expression rather verbose. It gets
even worse when we get to variable names, where we must enumerate all alphabetic
letters (in both upper and lower case).
Hence, we introduce a shorthand for sets of letters. A sequence of letters enclosed
in square brackets represents the set of these letters. For example, we use [ab01]
as a shorthand for a|b|0|1. Additionally, we can use interval notation to abbreviate
[0123456789] to [0–9]. We can combine several intervals within one bracket
and for example write [a–zA–Z] to denote all alphabetic letters in both lower
and upper case.
When using intervals, we must be aware of the ordering for the symbols involved.
For the digits and letters used above, there is usually no confusion. However, if we
write, e.g., [0–z] it is not immediately clear what is meant. When using such notation
in lexer generators, a character set encoding such as ASCII, ISO 8859-1, or UTF-8 is
usually implied, so the symbols are ordered as defined by these encodings. To avoid
confusion, we will in this book use the interval notation only for intervals of digits
or alphabetic letters.
Getting back to the example of integer constants above, we can now write this
much shorter as [0–9][0–9]∗ .
Since s∗ denotes zero or more occurrences of s, we needed to write the set of
digits twice to describe that one or more digits are allowed. Such non-zero repetition
is quite common, so we introduce another shorthand, s+ , to denote one or more
occurrences of s. With this notation, we can abbreviate our description of integers to
[0–9]+ . On a similar note, it is common that we can have zero or one occurrence of
something (e.g., an optional sign to a number). Hence we introduce the shorthand s?
for s|ε. The shorthand symbols + and ? bind with the same precedence as *.
We must stress that these shorthands are just that. They do not allow more
languages to be described, they just make it possible to describe some languages
more compactly. In the case of s+, it can even make an exponential difference:
if + is nested n deep, recursive expansion of s+ to ss* yields 2ⁿ − 1 occurrences
of * in the expanded regular expression. For example, (((a+)b)+c)+ expands to
aa*b(aa*b)*c(aa*b(aa*b)*c)*.
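That the shorthands are definable in terms of the five basic forms can be shown by writing the desugaring down. The Python sketch below uses an ad hoc tuple encoding of regular expressions (our own, not from the book) and also counts stars to demonstrate the exponential growth of nested +.

```python
# Regular expressions as nested tuples:
#   ('char', c), ('eps',), ('alt', s, t), ('seq', s, t), ('star', s)
# The shorthands are not primitive; each desugars to the core forms.

def plus(s):             # s+  =  s s*
    return ('seq', s, ('star', s))

def opt(s):              # s?  =  s | ε
    return ('alt', s, ('eps',))

def charset(chars):      # [abc]  =  a|b|c
    r = ('char', chars[0])
    for c in chars[1:]:
        r = ('alt', r, ('char', c))
    return r

def interval(lo, hi):    # [0-9]  =  0|1|...|9
    return charset([chr(c) for c in range(ord(lo), ord(hi) + 1)])

def count_stars(r):
    """Number of 'star' nodes in the desugared expression."""
    return (r[0] == 'star') + sum(count_stars(x)
                                  for x in r[1:] if isinstance(x, tuple))

# Each level of + doubles the star count plus one, so nesting + four
# levels deep, ((((a+)+)+)+), expands to 2^4 - 1 = 15 stars.
r = ('char', 'a')
for _ in range(4):
    r = plus(r)
print(count_stars(r))    # 15
```

With these definitions, [0-9]+ is simply plus(interval('0', '9')): a convenient spelling of an expression the core notation could already describe.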
1.1.2 Examples
We have already seen how we can describe nonnegative integer constants using
regular expressions. Here are a few examples of other typical programming language
elements:
Keywords. A keyword like if is described by a regular expression that looks
exactly like that keyword, e.g., the regular expression if (which is the concatenation of the two regular expressions i and f).
Variable names. In the programming language C, a variable name consists of letters, digits and the underscore symbol and it must begin with a letter or underscore.
This can be described by the regular expression [a–zA–Z_][a–zA–Z_0–9]∗ .
Integers. An integer constant is an optional sign followed by a nonempty sequence
of digits: [+|-]?[0-9]+. In some languages, a signed constant is not a single token,
but a concatenation of two tokens: the sign and an unsigned number constant.
This will usually allow whitespace between the sign and the number, which is not
possible with the above.
Floats. A floatingpoint constant can have an optional sign. After this, the mantissa
part is described as a sequence of digits followed by a decimal point and then
another sequence of digits. Either one (but not both) of the digit sequences can
be empty. Finally, there is an optional exponent part, which is the letter e (in
upper or lower case) followed by an (optionally signed) integer constant. If there
is an exponent part to the constant, the mantissa part can be written as an integer
constant (i.e., without the decimal point). Some examples:

    3.14    3.    .23    3e+4    11.22e-3

This rather involved format can be described by the following regular expression:

    [+|-]?((([0-9]+.[0-9]* | .[0-9]+)([eE][+|-]?[0-9]+)?)
        | [0-9]+[eE][+|-]?[0-9]+)
This regular expression is complicated by the fact that the exponent is optional if
the mantissa contains a decimal point, but not if it does not (as that would make
the number an integer constant). We can make the description simpler if we make
the regular expression for floats also include integers, and instead use other means
of distinguishing integers from floats (see Sect. 1.8 for details). If we do this, the
regular expression can be simplified to
    [+|-]?(([0-9]+(.[0-9]*)? | .[0-9]+)([eE][+|-]?[0-9]+)?)
Some languages require digits on both sides of the decimal point (if there is a
decimal point). This simplifies the description considerably, as there are fewer
special cases:
    [+|-]?([0-9]+(.[0-9]+)?([eE][+|-]?[0-9]+)?)
String constants. A string constant starts with a quotation mark followed by a
sequence of symbols and finally another quotation mark. There are usually some
restrictions on the symbols allowed between the quotation marks. For example,
linefeed characters or quotes are typically not allowed, though these may be
represented by special “escape” sequences of other characters, such as “\n\n”
for a string containing two linefeeds. As a (much simplified) example, we can
by the following regular expression describe string constants where the allowed
symbols are alphanumeric characters and sequences consisting of the backslash
symbol followed by a letter (where each such pair is intended to represent a
nonalphanumeric symbol):
    “([a-zA-Z0-9]|\[a-zA-Z])*”
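These token descriptions can be tried out directly in any language with a regular-expression library. The sketch below is a rough translation into Python's re syntax (where the literal decimal point must be escaped as \.), testing the simplified float regex that also covers integer constants:

```python
import re

# The simplified floating-point regex from above, which also matches
# integer constants. re uses the same |, *, +, ? operators; \. is a
# literal decimal point, and fullmatch requires the whole string to match.
float_re = re.compile(r'[+\-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)([eE][+\-]?[0-9]+)?')

# C-style variable names: a letter or underscore followed by letters,
# digits and underscores.
name_re = re.compile(r'[a-zA-Z_][a-zA-Z_0-9]*')

# Only '.e5' is rejected: both mantissa alternatives require at least
# one digit next to the decimal point.
for s in ['3.14', '3.', '.23', '3e+4', '11.22e-3', '42', '.e5']:
    print(s, bool(float_re.fullmatch(s)))
```

A lexer generator compiles such patterns ahead of time into a finite automaton instead of interpreting them repeatedly, but the languages described are the same.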
Suggested Exercises: 1.2, 1.11 (a).
1.2 Nondeterministic Finite Automata
In our quest to transform regular expressions into efficient programs, we use a stepping stone: Nondeterministic finite automata. By their nondeterministic nature, these
are not quite as close to “real machines” as we would like, so we will later see how
these can be transformed into deterministic finite automata, which are easily and
efficiently executable on normal hardware.
A finite automaton is, in the abstract sense, a machine that has a finite number of
states and a finite number of transitions between pairs of states. A transition between
two states is usually labelled by a character from the input alphabet, but we will also
use transitions marked with ε, the so-called epsilon transitions.
A finite automaton can be used to decide if an input string is a member in some
particular set of strings. To do this, we select one of the states of the automaton as the
starting state. We start in this state, and in each step we can do one of the following:
• Follow an epsilon transition to another state, or
• Read a character from the input and follow a transition labelled by that character.
When all characters from the input are read, we see if the current state is marked as
being accepting. If this is the case, the string we have read from the input is in the
language defined by the automaton. Otherwise, it is not.
At each step, we may have a choice of several actions: We can choose between
either an epsilon transition or a transition on an alphabet character, and if there are
several transitions with the same symbol, we can choose between these. This makes
the automaton nondeterministic, as the choice of action is not determined solely by
looking at the current state and the next input character. It may be that some choices
lead to an accepting state while others do not. This does, however, not mean that the
string is sometimes in the language and sometimes not: We will include a string in
the language if it is possible to make a sequence of choices that makes the string
lead to an accepting state.
You can think of it as solving a maze with symbols written in the corridors. If you
can find the exit while walking over the letters of the string in the correct order, the
string is recognised by the maze.
We can formally define a nondeterministic finite automaton by:
Definition 1.1 A nondeterministic finite automaton consists of a set S of states.
One of these states, s0 ∈ S, is called the starting state of the automaton, and a subset
F ⊆ S of the states are accepting states. Additionally, we have a set T of transitions.
Each transition t connects a pair of states s1 and s2 and is labelled with a symbol,
which is either a character c from the alphabet Σ, or the symbol ε, which indicates
an epsilon-transition. A transition from state s to state t on the symbol c is written
as s →c t (an arrow from s to t labelled c).
Starting states are sometimes called initial states and accepting states can also be
called final states (which is why we use the letter F to denote the set of accepting
states). We use the abbreviations FA for finite automaton, NFA for nondeterministic
finite automaton and (later in this chapter) DFA for deterministic finite automaton.
We will mostly use a graphical notation to describe finite automata. States are
denoted by circles, optionally containing a number or name that identifies the state.
This name or number has, however, no operational significance, it is solely used for
identification purposes. Accepting states are denoted by using a double circle instead
of a single circle. The initial state is marked by an unlabelled arrow pointing to it
from outside the automaton.
A transition is denoted by an arrow connecting two states. Near its midpoint, the
arrow is labelled by the symbol (possibly ε) that triggers the transition. Note that the
arrow that marks the initial state is not a transition and is, hence, not labelled by a
symbol.
Repeating the maze analogy, the circles (states) are rooms and the arrows (transitions) are one-way corridors. The double circles (accepting states) are exits, while
the unlabelled arrow pointing to the starting state is the entrance to the maze.
Figure 1.3 shows an example of a nondeterministic finite automaton having three
states. State 1 is the starting state, and state 3 is accepting. There is an epsilon-transition from state 1 to state 2, transitions on the symbol a from state 2 to states 1
and 3, and a transition on the symbol b from state 1 to state 3. This NFA recognises
the language described by the regular expression a∗(a|b). As an example, the string
aab is recognised by the following sequence of transitions:
from  to  by
  1    2   ε
  2    1   a
  1    2   ε
  2    1   a
  1    3   b
At the end of the input we are in state 3, which is accepting. Hence, the string is
accepted by the NFA. You can check this by placing a coin at the starting state and
following the transitions by moving the coin.
Note that we sometimes have a choice of several transitions. If we are in state
2 and the next symbol is an a, we can, when reading this, either go to state 1 or
to state 3. Likewise, if we are in state 1 and the next symbol is a b, we can either
read this and go to state 3, or we can use the epsilon transition to go directly to state
Fig. 1.3 Example of an NFA
2 without reading anything. If we, in the example above, had chosen to follow the
a-transition to state 3 instead of state 1, we would have been stuck: We would have
no legal transition, and yet we would not be at the end of the input. But, as previously
stated, it is enough that there exists a path leading to acceptance, so the string aab
is accepted by the NFA.
A program that decides if a string is accepted by a given NFA will have to check
all possible paths to see if any of these accepts the string. This requires either backtracking until a successful path is found, or simultaneously following all possible paths.
Both of these methods are too time-consuming to make NFAs suitable for efficient
recognisers. We will, hence, use NFAs only as a stepping stone between regular
expressions and the more efficient DFAs. We use this stepping stone because it
makes the construction simpler than direct construction of a DFA from a regular
expression.
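The "simultaneously following all possible paths" strategy can be sketched in a few lines. This is a minimal illustration, not the book's code; the NFA representation (a dictionary mapping (state, symbol) to a set of states, with None standing for ε) is an assumption:

```python
# "Simultaneously following all possible paths": keep a set of states the
# NFA could be in, and advance the whole set on each input symbol. The
# representation is an assumption: transitions maps (state, symbol) to a
# set of states, with symbol None standing for an epsilon-transition.

def eps_reach(states, transitions):
    """All states reachable from `states` via epsilon-transitions only."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in transitions.get((s, None), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def nfa_accepts(string, start, accepting, transitions):
    current = eps_reach({start}, transitions)
    for c in string:
        step = set()
        for s in current:
            step |= transitions.get((s, c), set())
        current = eps_reach(step, transitions)
    return bool(current & accepting)

# The NFA of Fig. 1.3: epsilon from 1 to 2, a from 2 to 1 and 3, b from 1 to 3
trans = {(1, None): {2}, (2, 'a'): {1, 3}, (1, 'b'): {3}}
print(nfa_accepts("aab", 1, {3}, trans))  # True
```

Note that this sketch is exactly the set-based simulation that Sect. 1.5 later turns into a precomputed DFA.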
1.3 Converting a Regular Expression to an NFA
We will construct an NFA compositionally from a regular expression, i.e., we will
construct the NFA for a composite regular expression from the NFAs constructed
from its subexpressions.
To be precise, we will from each subexpression construct an NFA fragment and
then combine these fragments into bigger fragments. A fragment is not a complete
NFA, so we complete the construction by adding the necessary components to make
a complete NFA.
An NFA fragment consists of a number of states with transitions between these
and additionally two incomplete transitions: One pointing into the fragment and
one pointing out of the fragment. The incoming half-transition is not labelled by a
symbol, but the outgoing half-transition is labelled by either ε or an alphabet symbol.
These half-transitions are the entry and exit to the fragment and are used to connect
it to other fragments or additional “glue” states.
Construction of NFA fragments for regular expressions is shown in Fig. 1.4. The
construction follows the structure of the regular expression by first making NFA
fragments for the subexpressions, and then joining these to form an NFA fragment
for the whole regular expression. The NFA fragments for the subexpressions are
shown as dotted ovals with the incoming half-transition on the left and the outgoing
half-transition on the right. The symbol on the outgoing half-transition is not shown
when an NFA fragment is shown as a dotted oval (it is “hidden” inside the oval).
When an NFA fragment has been constructed for the whole regular expression, the
construction is completed by connecting the outgoing half-transition to an accepting
state. The incoming half-transition serves to identify the starting state of the completed NFA. Note that, even though we allow an NFA to have several accepting states,
an NFA constructed using this method will have only one: the one added at the end
of the construction.
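The compositional construction can be sketched in code. This is a Thompson-style variant of Fig. 1.4's fragments, where the half-transition bookkeeping is simplified into plain epsilon-transitions; the tuple encoding of regular expressions is an assumption:

```python
import itertools

# A Thompson-style sketch of the compositional construction (the book's
# half-transition bookkeeping from Fig. 1.4 is simplified into plain
# epsilon-transitions). The tuple encoding of regular expressions is an
# assumption: ('sym', c), ('seq', r1, r2), ('alt', r1, r2), ('star', r).

counter = itertools.count()

def build(regex, trans):
    """Build an NFA fragment for `regex`; returns its (entry, exit) states."""
    def eps(a, b):
        trans.setdefault((a, None), set()).add(b)
    s, e = next(counter), next(counter)
    kind = regex[0]
    if kind == 'sym':                      # a single character
        trans.setdefault((s, regex[1]), set()).add(e)
    elif kind == 'seq':                    # r1 r2: chain the two fragments
        s1, e1 = build(regex[1], trans)
        s2, e2 = build(regex[2], trans)
        eps(s, s1); eps(e1, s2); eps(e2, e)
    elif kind == 'alt':                    # r1|r2: branch and rejoin
        for sub in regex[1:]:
            s1, e1 = build(sub, trans)
            eps(s, s1); eps(e1, e)
    elif kind == 'star':                   # r*: loop back, allow empty
        s1, e1 = build(regex[1], trans)
        eps(s, s1); eps(s, e); eps(e1, s1); eps(e1, e)
    return s, e

# (a|b)* a c, the regular expression of Fig. 1.5
trans = {}
start, end = build(('seq', ('seq', ('star', ('alt', ('sym', 'a'), ('sym', 'b'))),
                            ('sym', 'a')), ('sym', 'c')), trans)

def accepts(string):
    """Simulate the NFA: follow all paths at once."""
    def close(states):
        stack, seen = list(states), set(states)
        while stack:
            x = stack.pop()
            for y in trans.get((x, None), ()):
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        return seen
    cur = close({start})
    for ch in string:
        cur = close(set().union(*[trans.get((x, ch), set()) for x in cur]))
    return end in cur

print(accepts("abac"), accepts("ab"))  # True False
```

The fragment produced by `build` has exactly one entry and one exit, mirroring the incoming and outgoing half-transitions of the book's fragments.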
[Figure: NFA fragments for a single character a, concatenation s t, alternation s|t, and repetition s∗]
Fig. 1.4 Constructing NFA fragments from regular expressions
An NFA constructed this way for the regular expression (a|b)∗ac is shown in
Fig. 1.5. We have numbered the states for future reference.
1.3.1 Optimisations
We can use the construction in Fig. 1.4 for any regular expression by expanding out
all shorthand, e.g. converting s+ to ss∗, [0−9] to 0|1|2| · · · |9, s? to s|ε, and so on.
Fig. 1.5 NFA for the regular expression (a|b)∗ac
Fig. 1.6 Optimised NFA
construction for regular
expression shorthands
However, this will result in very large NFAs for some expressions, so we use a few
optimised constructions for the shorthands, as shown in Fig. 1.6. Additionally, we
show an alternative construction for the regular expression ε. This construction does
not quite follow the formula used in Fig. 1.4, as it does not have two half-transitions.
Rather, the line-segment notation is intended to indicate that the NFA fragment for
ε just connects the half-transitions of the NFA fragments that it is combined with.
In the construction for [0−9], the vertical ellipsis is meant to indicate that there is a
transition for each of the digits in [0−9]. This construction generalises in the obvious
Fig. 1.7 Optimised NFA for
[0−9]+
way to other sets of characters, e.g., [a−zA−Z0−9]. We have not shown a special
construction for s? as s|ε will do fine when we use the optimised construction for ε.
As an example, an NFA for [0−9]+ is shown in Fig. 1.7. Note that while this is
optimised, it is not optimal. You can (in several different ways) make an NFA for
this language using only two states.
Suggested Exercises: 1.3(a), 1.11(b).
1.4 Deterministic Finite Automata
Nondeterministic automata are, as mentioned earlier, not quite as close to “the
machine” as we would like. Hence, we now introduce a more restricted form of
finite automaton: The deterministic finite automaton, or DFA for short. DFAs are
special cases of NFAs that obey a number of additional restrictions:
• There are no epsilontransitions.
• There may not be two identically labelled transitions out of the same state.
This means that we never have a choice of several next-states: The state and the next
input symbol uniquely determine the transition (or lack of same). This is why these
automata are called deterministic. Figure 1.8 shows a DFA equivalent to the NFA in
Fig. 1.3. Using the maze analogy, finding an exit is easy, as you are never in doubt
about which corridor to follow.
The transition relation of a DFA is a partial function, and we often write it as a
function: move(s, c) is the state (if any) that is reached from state s by a transition
on the symbol c. If there is no such transition, move(s, c) is undefined.
Fig. 1.8 Example of a DFA
It is very easy to implement a DFA on a computer: A two-dimensional table can
be crossindexed by state and symbol to yield the next state (or an indication that
there is no transition), essentially implementing the move function by table lookup.
Another (one-dimensional) table can indicate which states are accepting.
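The table-driven implementation reads as a short loop. The DFA below, which accepts a∗b, is purely illustrative and not a figure from the book:

```python
# Table-driven DFA execution, as described above: `move` plays the role
# of the two-dimensional table indexed by state and symbol, and
# `accepting` is the one-dimensional table of accepting states. The DFA
# itself (accepting a*b) is an illustration, not a figure from the book.

move = {
    (0, 'a'): 0,
    (0, 'b'): 1,
}
accepting = {0: False, 1: True}

def dfa_accepts(string):
    state = 0
    for c in string:
        state = move.get((state, c))
        if state is None:      # undefined transition: reject immediately
            return False
    return accepting[state]

print(dfa_accepts("aaab"))   # True
print(dfa_accepts("aabba"))  # False
```

Each input symbol costs one table lookup, which is why the running time is a small constant per character, independent of the size of the DFA.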
DFAs have the same expressive power as NFAs: A DFA is a special case of
NFA, and any NFA can (as we shall shortly see) be converted to an equivalent
DFA. However, the benefit of deterministic transitions comes at a cost: The resulting
DFA can be exponentially larger than the NFA (see Sect. 1.9). In practice (i.e., when
describing tokens for a programming language) the increase in size is usually modest,
which is why most lexical analysers are based on DFAs.
Suggested Exercises: 1.8(a, b), 1.9.
1.5 Converting an NFA to a DFA
As promised, we will show how NFAs can be converted to DFAs such that we, by
combining this with the conversion of regular expressions to NFAs shown in Sect. 1.3,
can convert any regular expression to a DFA.
The conversion is done by simulating all possible transitions in an NFA at the
same time. This means that we operate with sets of NFA states: When we have
several choices of a next state, we take all of the choices simultaneously and form
a set of the possible next-states. Given a set of NFA states and a symbol, we follow
all transitions on that symbol from all states in the set, which gives us a new set of
NFA states. So we get transitions from sets of NFA states to sets of NFA states. The
transitions are deterministic because we from one set of NFA states and one symbol
have exactly one (possibly empty) set of NFA states that the transition moves to. The
idea is that different sets of NFA states become different single states in the DFA
that we construct.
Epsilon-transitions complicate the construction a bit: Whenever we are in an NFA
state with an outgoing epsilon-transition, we can always choose to follow the epsilon-transition without reading any symbol. Hence, given a symbol, a next-state can be
found by either following a transition with that symbol, or by first doing any number
of epsilon-transitions and then a transition with the symbol. We handle this in the
construction by extending sets of NFA states by adding all NFA states that can be
reached from states in the set using only epsilon-transitions. We define the epsilon-closure of a set of NFA states as the set extended with all NFA states that can be
reached from these using any number of epsilon-transitions. More formally:
Definition 1.2 Given a set M of NFA states, we define ε-closure(M) to be the least
(in terms of the subset relation) set X that is a solution to the set equation
X = M ∪ {t | s ∈ X and s →ε t ∈ T}
where T is the set of transitions in the NFA.
We will later on see several examples of set equations like the one above, so we
will spend some time discussing how such equations can be solved.
1.5.1 Solving Set Equations
The following is a very brief description of how to solve set equations like the above.
If you find it confusing, you can read the Appendix and in particular Sect. A.4 first.
In general, a set equation over a single set-valued variable X has the form
X = F(X )
where F is a function from sets to sets. Not all such equations are solvable, so we
will restrict ourselves to special cases, which we will describe below. We will use
calculation of epsilon-closure as the driving example.
In Definition 1.2, we must find a set X that solves the equation
X = M ∪ {t | s ∈ X and s →ε t ∈ T}
To cast this equation into the form X = F(X) for a function F, we define FM to be
FM(X) = M ∪ {t | s ∈ X and s →ε t ∈ T}
There may be several solutions to the equation X = FM (X ). For example, if the
NFA has a pair of states that connect to each other by epsilon transitions, adding this
pair to a solution that does not already include the pair will create a new solution.
The epsilon-closure of M is the least solution to the equation (i.e., the smallest set
X that satisfies the equation).
FM has a property that is essential to our solution method: If X ⊆ Y then FM (X ) ⊆
FM (Y ). We say that FM is monotonic.
When we have an equation of the form X = F(X ) and F is monotonic, we can
find the least solution to the equation in the following way: We first guess that the
solution is the empty set and check to see if we are right: We compare ∅ with F(∅).
If these are equal, we are done and ∅ is the solution. If not, we use the following
properties:
• The least solution S to the equation satisfies S = F(S)
• ∅ ⊆ S implies that F(∅) ⊆ F(S)
to conclude that F(∅) ⊆ S. Hence, F(∅) is either S or a subset of S, so we can use
it as a new guess. We now form the chain
∅ ⊆ F(∅) ⊆ F(F(∅)) ⊆ . . .
If at any point an element in the sequence is identical to the previous, we have a
fixed-point, i.e., a set S such that S = F(S). This fixed-point of the sequence will be
the least (in terms of set inclusion) solution to the equation. This is not difficult to
verify, but we will omit the details. Since we are iterating a function until we reach
a fixed-point, we call this process fixed-point iteration.
If we are working with sets over a finite domain (e.g., sets of NFA states), we will
eventually reach a fixed-point, as there can be no infinite chain of strictly increasing
sets.
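The iteration just described can be sketched generically. The table of epsilon-transitions below is an assumed reading of Fig. 1.5, chosen to be consistent with the worked calculation that follows:

```python
# Fixed-point iteration as described: start from the empty set and apply
# F until the result stops changing. The epsilon-transition table is an
# assumed reading of Fig. 1.5 (the figure itself is not reproduced here).

def fixed_point(F):
    x = set()
    while True:
        y = F(x)
        if y == x:        # nothing changed: x is the least fixed-point
            return x
        x = y

eps = {1: {2, 5}, 5: {6, 7}}   # assumed epsilon-transitions of Fig. 1.5

def F_M(M):
    return lambda X: M | {t for s in X for t in eps.get(s, set())}

print(fixed_point(F_M({1})))   # {1, 2, 5, 6, 7}
```

Because F_M is monotonic and the domain (sets of NFA states) is finite, the loop is guaranteed to terminate.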
We can use this method for calculating the epsilon-closure of the set {1} with
respect to the NFA shown in Fig. 1.5. Since we want to find ε-closure({1}), M = {1},
so FM = F{1} . We start by guessing that X is the empty set:
F{1}(∅) = {1} ∪ {t | s ∈ ∅ and s →ε t ∈ T}
= {1}
As ∅ ≠ {1}, we continue.
F{1}(F{1}(∅)) = F{1}({1})
= {1} ∪ {t | s ∈ {1} and s →ε t ∈ T}
= {1} ∪ {2, 5} = {1, 2, 5}
F{1}(F{1}(F{1}(∅))) = F{1}({1, 2, 5})
= {1} ∪ {t | s ∈ {1, 2, 5} and s →ε t ∈ T}
= {1} ∪ {2, 5, 6, 7} = {1, 2, 5, 6, 7}
F{1}(F{1}(F{1}(F{1}(∅)))) = F{1}({1, 2, 5, 6, 7})
= {1} ∪ {t | s ∈ {1, 2, 5, 6, 7} and s →ε t ∈ T}
= {1} ∪ {2, 5, 6, 7} = {1, 2, 5, 6, 7}
We have now reached a fixed-point and found our solution. Hence, we conclude that
ε-closure({1}) = {1, 2, 5, 6, 7}.
We have done a good deal of repeated calculation in the iteration above: We have
calculated the epsilon-transitions from state 1 three times and those from states 2 and
5 twice each. We can make an optimised fixed-point iteration by exploiting that the
function is not only monotonic, but also distributive: F(X ∪ Y ) = F(X ) ∪ F(Y ).
This means that, when we during the iteration add elements to our set, we in the next
iteration need only calculate F for the new elements and add the result to the set. In
the example above, we get
F{1}(∅) = {1} ∪ {t | s ∈ ∅ and s →ε t ∈ T}
= {1}
F{1}({1}) = {1} ∪ {t | s ∈ {1} and s →ε t ∈ T}
= {1} ∪ {2, 5} = {1, 2, 5}
F{1}({1, 2, 5}) = F{1}({1}) ∪ F{1}({2, 5})
= {1, 2, 5} ∪ ({1} ∪ {t | s ∈ {2, 5} and s →ε t ∈ T})
= {1, 2, 5} ∪ ({1} ∪ {6, 7}) = {1, 2, 5, 6, 7}
F{1}({1, 2, 5, 6, 7}) = F{1}({1, 2, 5}) ∪ F{1}({6, 7})
= {1, 2, 5, 6, 7} ∪ ({1} ∪ {t | s ∈ {6, 7} and s →ε t ∈ T})
= {1, 2, 5, 6, 7} ∪ ({1} ∪ ∅) = {1, 2, 5, 6, 7}
We can use this principle to formulate a worklist algorithm for finding the least
fixed-point for an equation over a distributive function F. The idea is that we step by step build a set that eventually becomes our solution. In the first step, we calculate
F(∅). The elements in this initial set are unmarked. In each subsequent step, we take
an unmarked element x from the set, mark it and add F({x}) (unmarked) to the set.
Note that if an element already occurs in the set (marked or not), it is not added again.
When, eventually, all elements in the set are marked, we are done.
This is perhaps best illustrated by an example (the same as before). We start by
calculating F{1} (∅) = {1}. The element 1 is unmarked, so we pick this, mark it and
calculate F{1} ({1}) and add the new elements 2 and 5 to the set. As we continue, we
get this sequence of sets:
{1}
{1✓, 2, 5}
{1✓, 2✓, 5}
{1✓, 2✓, 5✓, 6, 7}
{1✓, 2✓, 5✓, 6✓, 7}
{1✓, 2✓, 5✓, 6✓, 7✓}
Since all elements in the last set are marked, this is a solution to the equation.
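The worklist algorithm above can be sketched directly, with marking represented by moving elements from a pending set to a done set. The epsilon-transitions are again an assumed reading of Fig. 1.5:

```python
# The worklist algorithm sketched above: marking an element corresponds
# to moving it from `pending` to `done`, and F is only applied to newly
# added elements (this relies on F being distributive). The
# epsilon-transitions are an assumed reading of Fig. 1.5.

def worklist_fixed_point(F):
    result = F(set())            # the initial, all-unmarked set
    pending = set(result)        # unmarked elements
    done = set()                 # marked elements
    while pending:
        x = pending.pop()
        done.add(x)
        for y in F({x}):         # distributivity: apply F to {x} only
            if y not in result:
                result.add(y)
                pending.add(y)
    return result

eps = {1: {2, 5}, 5: {6, 7}}     # assumed epsilon-transitions of Fig. 1.5
F = lambda X: {1} | {t for s in X for t in eps.get(s, set())}
print(worklist_fixed_point(F))   # {1, 2, 5, 6, 7}
```

Each element is processed exactly once, which avoids the repeated calculation noted above.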
We will later also need to solve simultaneous equations over sets, i.e., several
equations over several sets. These can also be solved by fixed-point iteration in the
same way as single equations, though the worklist version of the algorithm becomes
a bit more complicated.
1.5.2 The Subset Construction
After this brief detour into the realm of set equations, we are now ready to continue
with our construction of DFAs from NFAs. The construction is called the subset
construction, as each state in the DFA is a subset of the states from the NFA.
Algorithm 1.3 (The subset construction) Given an NFA N with states S, starting
state s0 ∈ S, accepting states F ⊆ S, transitions T, and alphabet Σ, we construct an
equivalent DFA D with states S′, starting state s0′, accepting states F′, and a transition
function move by:
s0′ = ε-closure({s0})
move(s′, c) = ε-closure({t | s ∈ s′ and s →c t ∈ T})
S′ = {s0′} ∪ {move(s′, c) | s′ ∈ S′, c ∈ Σ}
F′ = {s′ ∈ S′ | s′ ∩ F ≠ ∅}
The DFA uses the same alphabet Σ as the NFA.
A little explanation:
• The starting state s0′ of the DFA is the epsilon-closure of the set containing just
the starting state s0 of the NFA, i.e., the states that are reachable from s0 solely by
epsilon-transitions.
• A transition in the DFA on a symbol c is done by finding the set s′ of NFA states
that comprise the DFA state, following all transitions on c in the NFA from all
NFA states s in s′, combining the resulting sets of NFA states, and finally closing
this under epsilon-transitions.
• The set S′ of states in the DFA is the set of DFA states that can be reached from
s0′ using the move function. S′ is defined as a set equation, which can be solved as
described in Sect. 1.5.1.
• A state s′ in the DFA is an accepting state if at least one of the NFA states in s′ is
accepting.
As an example, we will convert the NFA in Fig. 1.5 to a DFA.
The initial state in the DFA is ε-closure({1}), which we have already calculated to
be s0′ = {1, 2, 5, 6, 7}. This is now entered into the set S′ of DFA states as unmarked
(following the worklist algorithm from Sect. 1.5.1).
We now pick an unmarked element from the uncompleted S′. We have only one
choice: s0′. We now mark this and calculate the transitions for it. We get
move(s0′, a) = ε-closure({t | s ∈ {1, 2, 5, 6, 7} and s →a t ∈ T})
= ε-closure({3, 8})
= {3, 8, 1, 2, 5, 6, 7}
= s1′
move(s0′, b) = ε-closure({t | s ∈ {1, 2, 5, 6, 7} and s →b t ∈ T})
= ε-closure({8})
= {8, 1, 2, 5, 6, 7}
= s2′
move(s0′, c) = ε-closure({t | s ∈ {1, 2, 5, 6, 7} and s →c t ∈ T})
= ε-closure({})
= {}
Note that the empty set of NFA states is not a DFA state, so there will be no transition
from s0′ on c.
We now add s1′ and s2′ to our incomplete S′, which now is {s0′✓, s1′, s2′}. We now
pick s1′, mark it and calculate its transitions:
move(s1′, a) = ε-closure({t | s ∈ {3, 8, 1, 2, 5, 6, 7} and s →a t ∈ T})
= ε-closure({3, 8})
= {3, 8, 1, 2, 5, 6, 7}
= s1′
move(s1′, b) = ε-closure({t | s ∈ {3, 8, 1, 2, 5, 6, 7} and s →b t ∈ T})
= ε-closure({8})
= {8, 1, 2, 5, 6, 7}
= s2′
move(s1′, c) = ε-closure({t | s ∈ {3, 8, 1, 2, 5, 6, 7} and s →c t ∈ T})
= ε-closure({4})
= {4}
= s3′
We have seen s1′ and s2′ before, so only s3′ is added: {s0′✓, s1′✓, s2′, s3′}. We next pick s2′:
move(s2′, a) = ε-closure({t | s ∈ {8, 1, 2, 5, 6, 7} and s →a t ∈ T})
= ε-closure({3, 8})
= {3, 8, 1, 2, 5, 6, 7}
= s1′
move(s2′, b) = ε-closure({t | s ∈ {8, 1, 2, 5, 6, 7} and s →b t ∈ T})
= ε-closure({8})
= {8, 1, 2, 5, 6, 7}
= s2′
move(s2′, c) = ε-closure({t | s ∈ {8, 1, 2, 5, 6, 7} and s →c t ∈ T})
= ε-closure({})
= {}
No new elements are added, so we pick the remaining unmarked element s3′:
move(s3′, a) = ε-closure({t | s ∈ {4} and s →a t ∈ T})
= ε-closure({})
= {}
Fig. 1.9 DFA constructed
from the NFA in Fig. 1.5
move(s3′, b) = ε-closure({t | s ∈ {4} and s →b t ∈ T})
= ε-closure({})
= {}
move(s3′, c) = ε-closure({t | s ∈ {4} and s →c t ∈ T})
= ε-closure({})
= {}
Since all states are now marked, this completes the construction of S′ = {s0′, s1′, s2′, s3′}.
Only s3′ contains the accepting NFA state 4, so this is the only accepting state of our
DFA. Figure 1.9 shows the completed DFA.
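Algorithm 1.3 can be sketched as a whole. The NFA representation (transitions keyed by (state, symbol), with None for ε) is an assumption, and each DFA state is a frozenset of NFA states; the example is the small NFA of Fig. 1.3 rather than Fig. 1.5:

```python
# A sketch of Algorithm 1.3 (the subset construction). The NFA
# representation is an assumption: `trans` maps (state, symbol) to a set
# of states, with symbol None for epsilon-transitions; each DFA state is
# a frozenset of NFA states.

def eps_closure(states, trans):
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in trans.get((s, None), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def subset_construction(start, accepting, trans, alphabet):
    s0 = eps_closure({start}, trans)
    dfa_states, move, pending = {s0}, {}, [s0]
    while pending:                       # worklist of unmarked DFA states
        S = pending.pop()
        for c in alphabet:
            step = set()
            for s in S:
                step |= trans.get((s, c), set())
            T = eps_closure(step, trans)
            if not T:                    # empty set: no transition on c
                continue
            move[(S, c)] = T
            if T not in dfa_states:
                dfa_states.add(T)
                pending.append(T)
    dfa_accepting = {S for S in dfa_states if S & accepting}
    return s0, dfa_states, move, dfa_accepting

# The NFA of Fig. 1.3: epsilon from 1 to 2, a from 2 to 1 and 3, b from 1 to 3
trans = {(1, None): {2}, (2, 'a'): {1, 3}, (1, 'b'): {3}}
s0, states, move, acc = subset_construction(1, {3}, trans, 'ab')
print(len(states))  # 3 DFA states
```

Using frozensets makes DFA states hashable, so the "have we seen this subset before?" test is a plain set membership check.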
Suggested Exercises: 1.3(b), 1.5.
1.6 Size Versus Speed
In the above example, we get a DFA with 4 states from an NFA with 8 states. However, as the states in the constructed DFA are (non-empty) sets of states from the NFA,
there may potentially be 2ⁿ − 1 states in a DFA constructed from an n-state NFA.
It is not too difficult to construct classes of NFAs that expand exponentially in this
way when converted to DFAs, as we shall see in Sect. 1.9.1. Since we are mainly
interested in NFAs that are constructed from regular expressions as in Sect. 1.3, we
might ask ourselves if these NFAs might not be in a suitably simple class that does not
risk exponential-sized DFAs. Alas, this is not the case. Just as we can construct a
class of NFAs that expand exponentially, we can construct a class of regular expressions where the smallest equivalent DFAs are exponentially larger. This happens
rarely when we use regular expressions or NFAs to describe tokens in programming
languages, though.
It is possible to avoid the blow-up in size by operating directly on regular expressions or NFAs when testing strings for inclusion in the languages these define. However, there is a speed penalty for doing so. A DFA can be run in time k ∗ |v|, where
|v| is the length of the input string v and k is a small constant that is independent
of the size of the DFA.¹ Regular expressions and NFAs can be run in time close to
c ∗ |N| ∗ |v|, where |N| is the size of the NFA (or regular expression) and the constant
c typically is larger than k. All in all, DFAs are a lot faster to use than NFAs or regular
expressions, so it is only when the size of the DFA is a real problem that one should
consider using NFAs or regular expressions directly.
1.7 Minimisation of DFAs
Even though the DFA in Fig. 1.9 has only four states, it is not minimal. It is easy to
see that states s0′ and s2′ are equivalent: Neither is accepting, and they have identical
transitions. We can hence collapse these states into a single state and get a three-state
DFA.
DFAs constructed from regular expressions through NFAs are often non-minimal,
though they are rarely very far from being minimal. Nevertheless, minimising a DFA
is not terribly difficult and can be done fairly fast, so many lexer generators perform
minimisation.
An interesting property of DFAs is that any regular language (a language that can
be expressed by a regular expression, NFA or DFA) has a unique equivalent minimal
DFA. Hence, we can decide equivalence of two regular expressions (or NFAs or
DFAs) by converting both to minimal DFAs and comparing the results.
As hinted above, minimisation of DFAs is done by collapsing equivalent states.
However, deciding whether two states are equivalent is not just done by testing if
their immediate transitions are identical, since transitions to different states may be
equivalent if the target states turn out to be equivalent. Hence, we use a strategy
where we first assume all states to be equivalent and then distinguish them only if
we can prove them different. We use the following rules for this:
• An accepting state is not equivalent to a nonaccepting state.
• If two states s1 and s2 have transitions on the same symbol c to states t1 and t2
that we have already proven to be different, then s1 and s2 are different. This also
applies if only one of s1 or s2 has a defined transition on c.
This leads to the following algorithm.
Algorithm 1.4 (DFA minimisation) Given a DFA D over the alphabet Σ with states
S, where F ⊆ S is the set of the accepting states, we construct a minimal DFA Dmin ,
where each state is a group of equivalent states from D. The groups in the minimal
DFA are consistent: For any pair of states s1, s2 in a group G and any symbol c, move(s1, c)
and move(s2, c) are both in the same group, or both are undefined. In other words,
we cannot tell s1 and s2 apart by looking at their transitions.
¹ If memory access is assumed to be constant time, regardless of memory size.
We minimise the DFA D in the following way:
(1) We start with two groups: the set of accepting states F and the set of non-accepting states S\F. Both these groups are initially unmarked.
(2) We pick any unmarked group G and check if it is consistent. If it is, we mark it.
If G is not consistent, we split it into maximal consistent subgroups and replace
G by these. All groups are then unmarked. A consistent subgroup is maximal if
adding any other state from G to it will make it inconsistent.
(3) If there are no unmarked groups left, we are done, and the remaining groups are
the states of the minimal DFA. Otherwise, we go back to step 2.
The starting state of the minimal DFA is the group that contains the original starting
state, and any group of accepting states is an accepting state in the minimal DFA.
The time needed for minimisation using Algorithm 1.4 depends on the strategy used
for picking groups in step 2. With random choices, the worst case is quadratic in the
size of the DFA, but there exist strategies for choosing groups and data structures for
representing these that guarantee a worst-case time that is O(n × log(n)), where n is
the number of states in the (non-minimal) DFA. In other words, the method can be
implemented so it uses little more than linear time to do minimisation. We will not
here go into further detail but just refer to [1] for the optimal algorithm.
We will, however, note that we can make a slight optimisation to Algorithm 1.4:
A group that consists of a single state needs never be split, so we need never select
such in step 2, and we can stop when all unmarked groups are singletons.
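Algorithm 1.4 can be sketched as naive partition refinement, splitting every inconsistent group on each pass instead of picking one group at a time (the result is the same; the bookkeeping is simpler). The example DFA is the one constructed in Sect. 1.5.2, with the states named s0..s3 (primes dropped for brevity):

```python
# Algorithm 1.4 sketched as naive partition refinement: in each pass,
# split every group whose states disagree on which group (or "undefined")
# each symbol leads to. `move` maps (state, symbol) to a state, and is
# simply missing for undefined transitions.

def minimise(states, accepting, move, alphabet):
    groups = [g for g in (frozenset(accepting), frozenset(states - accepting)) if g]
    changed = True
    while changed:
        changed = False
        def group_of(s):             # current group of s; None if s is None
            for g in groups:
                if s in g:
                    return g
            return None
        new_groups = []
        for g in groups:
            by_sig = {}              # split g by each state's transition signature
            for s in g:
                sig = tuple(group_of(move.get((s, c))) for c in alphabet)
                by_sig.setdefault(sig, set()).add(s)
            new_groups.extend(frozenset(x) for x in by_sig.values())
            if len(by_sig) > 1:
                changed = True
        groups = new_groups
    return groups

# The DFA constructed in Sect. 1.5.2 (Fig. 1.9), accepting state s3
move = {('s0', 'a'): 's1', ('s0', 'b'): 's2',
        ('s1', 'a'): 's1', ('s1', 'b'): 's2', ('s1', 'c'): 's3',
        ('s2', 'a'): 's1', ('s2', 'b'): 's2'}
groups = minimise({'s0', 's1', 's2', 's3'}, {'s3'}, move, 'abc')
print(sorted(sorted(g) for g in groups))  # [['s0', 's2'], ['s1'], ['s3']]
```

As predicted above, s0 and s2 end up in the same group, giving a three-state minimal DFA.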
1.7.1 Example
As an example of minimisation, take the DFA in Fig. 1.10.
We now make the initial division into two groups: The accepting and the non-accepting states.
Fig. 1.10 Non-minimal DFA
G1 = {0, 6}
G2 = {1, 2, 3, 4, 5, 7}
These are both unmarked. We next pick any unmarked group, say G1. To check if
this is consistent, we make a table of its transitions:
G1 a b
0 G2 −
6 G2 −
This is consistent, so we just mark it and select the remaining unmarked group G2
and make a table for this:
G2 a b
1 G2 G2
2 G2 G2
3 − G2
4 G1 G2
5 G2 G2
7 G1 G2
G2 is evidently not consistent, so we split it into maximal consistent subgroups and
erase all marks (including the one on G1):
G1 = {0, 6}
G3 = {1, 2, 5}
G4 = {3}
G5 = {4, 7}
We now pick G3 for consideration:
G3 a b
1 G5 G3
2 G4 G3
5 G5 G3
This is not consistent either, so we split again and get
G1 = {0, 6}
G4 = {3}
G5 = {4, 7}
G6 = {1, 5}
G7 = {2}
We now pick G5 and check this:
G5 a b
4 G1 G6
7 G1 G6
This is consistent, so we mark it and pick another group, say, G6:
G6 a b
1 G5 G7
5 G5 G7
This, also, is consistent, so we have only one unmarked non-singleton group left: G1.
G1 a b
0 G6 −
6 G6 −
As we mark this, we see that there are no unmarked groups left other than singletons.
Hence, the groups now form a minimal DFA equivalent to the one in Fig. 1.10. The
minimised DFA is shown in Fig. 1.11.
1.7.2 Dead States
Algorithm 1.4 works under some, as yet, unstated assumptions:
• The move function is total, i.e., there are transitions on all symbols from all states,
or
• There are no dead states in the DFA.
A dead state is a state from which no accepting state can be reached. Dead states
do not occur in DFAs constructed from NFAs without dead states, and NFAs with
dead states can not be constructed from regular expressions by the method shown in
Sect. 1.3. Hence, as long as we use minimisation only on DFAs constructed by this
process, we are safe. However, if we get a DFA of unknown origin, we risk that it
may contain both dead states and undefined transitions.
Fig. 1.11 Minimal DFA
A transition to a dead state should rightly be equivalent to an undefined transition,
as neither can yield future acceptance. The only difference is that we discover this
earlier on an undefined transition than when we make a transition to a dead state.
However, Algorithm 1.4 will treat these differently and may hence decree a group to
be inconsistent even though it is not. This will make the algorithm split a group
that does not need to be split, hence producing a non-minimal DFA. Consider, for
example, the following DFA:
[Figure: a three-state DFA with starting state 1 and further states 2 and 3]
States 1 and 2 are, in fact, equivalent, as starting from either one, any sequence of as
(and no other sequences) will lead to an accepting state. A minimal equivalent DFA
consists of a single accepting state with a transition to itself on a.
But Algorithm 1.4 will see a transition on b out of state 2 but no transition on b
out of state 1, so it will not keep states 1 and 2 in the same group. As a result, no
reduction in the DFA is made.
There are two solutions to this problem:
(1) Make sure there are no dead states. This can be ensured by invariant, as is the
case for DFAs constructed from regular expressions by the methods shown in this
chapter, or by explicitly removing dead states before minimisation. Dead states
can be found by a simple reachability analysis for directed graphs (if you can’t
reach an accepting state from state s, s is a dead state). In the above example,
state 3 is dead and can be removed (including the transition to it). This makes
states 1 and 2 stay in the same group during minimisation.
(2) Make sure there are no undefined transitions. This can be achieved by adding a
new dead state (which has transitions to itself on all symbols) and replacing all
undefined transitions by transitions to this dead state. After minimisation, the
group that contains the added dead state will contain all dead states from the
original DFA. This group can now be removed from the minimal DFA (which
will once more have undefined transitions). In the above example, a new (non-accepting) state 4 has to be added. State 1 has a transition to state 4 on b, state 3
has a transition to state 4 on both a and b, and state 4 has transitions to itself
on both a and b. After minimisation, states 1 and 2 will be joined, as will states 3
and 4. Since state 4 is dead, all states joined with it are also dead, so we can
remove the combined state 3 and 4 from the resulting minimised automaton.
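Finding dead states, as in solution (1), is a straightforward backward reachability search from the accepting states. The three-state example is an assumed reading of the DFA discussed above (transitions 1 →a 2, 2 →a 1, 2 →b 3, with 1 and 2 accepting):

```python
# Dead states found by reachability, as in solution (1): search backwards
# from the accepting states; any state not reached is dead. The example
# DFA is an assumed reading of the one above: 1 -a-> 2, 2 -a-> 1, 2 -b-> 3,
# with 1 and 2 accepting and state 3 having no outgoing transitions.

def dead_states(states, accepting, move):
    preds = {s: set() for s in states}   # reverse the transition relation
    for (s, _c), t in move.items():
        preds[t].add(s)
    live, stack = set(accepting), list(accepting)
    while stack:                          # backward search from accepting states
        s = stack.pop()
        for p in preds[s]:
            if p not in live:
                live.add(p)
                stack.append(p)
    return states - live

move = {(1, 'a'): 2, (2, 'a'): 1, (2, 'b'): 3}
print(dead_states({1, 2, 3}, {1, 2}, move))  # {3}
```

Removing the returned states (and all transitions into them) before minimisation avoids the spurious group splits described above.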
Suggested Exercises: 1.6, 1.11(c).
1.8 Lexers and Lexer Generators
We have, in the previous sections, seen how we can convert a language description
written as a regular expression into an efficiently executable representation (a DFA).
What we want is something more: A program that does lexical analysis, i.e., a lexer:
• A lexer has to distinguish between several different types of tokens, e.g., numbers,
variables and keywords. Each of these is described by its own regular expression.
• A lexer does not check if its entire input is included in the languages defined by
the regular expressions. Instead, it has to cut the input into pieces (tokens), each
of which is included in one of the languages.
• If there are several ways to split the input into legal tokens, the lexer has to decide
which of these it should use.
A program that takes a set of token definitions (each consisting of a regular expression
and a token name) and generates a lexer is called a lexer generator.
The simplest approach would be to generate a DFA for each token definition and
apply the DFAs one at a time to the input. This can, however, be quite slow, so we will
instead from the set of token definitions generate a single DFA that tests for all the
tokens simultaneously. This is not difficult to do: If the tokens are defined by regular
expressions r1 , r2 , . . . , rn , then the regular expression r1  r2  . . .  rn describes the
union of the languages r1 , r2 , . . . , rn and the DFA constructed from this combined
regular expression will scan for all token types at the same time.
However, we also wish to distinguish between different token types, so we must
be able to know which of the many tokens was recognised by the combined DFA.
We can accomplish this with the following construction of a combined DFA:
(1) Construct NFAs N1 , N2 , . . . , Nn for each of r1 , r2 , . . . , rn .
(2) Mark the accepting states of the NFAs by the name of the tokens they accept.
(3) Combine the NFAs to a single NFA by adding a new starting state which has
epsilon-transitions to each of the starting states of the NFAs.
(4) Convert the combined NFA to a DFA.
(5) Each accepting state of the DFA consists of a set of NFA states, at least one
of which is an accepting state which we marked by token type in step 2. These
marks are used to mark the accepting states of the DFA, so each of these will
indicate all the token types it accepts.
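The five steps above can be sketched in code. The following is a minimal illustration of my own (the NFA encoding, the fresh start state `"S0"`, and the tiny IF/ID example are illustrative assumptions, not the book's): per-token NFAs are joined with epsilon-transitions from a new start state, and the subset construction carries the token marks onto the accepting DFA states, listed in priority order.

```python
EPS = ""  # label used for epsilon-transitions

def epsilon_closure(states, trans):
    """All NFA states reachable from `states` via epsilon-transitions."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in trans.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def combine_and_determinise(nfas, alphabet, priority):
    """nfas: list of (start, transitions, marks); marks maps accepting NFA
    states to token names. `priority` lists token names, earliest first."""
    trans = {("S0", EPS): [start for start, _, _ in nfas]}    # step 3
    marks = {}
    for _, t, m in nfas:
        for key, targets in t.items():
            trans.setdefault(key, []).extend(targets)
        marks.update(m)                                       # step 2
    start = epsilon_closure({"S0"}, trans)
    dfa, accept, todo = {}, {}, [start]
    while todo:                                               # step 4
        d = todo.pop()
        if d in dfa:
            continue
        dfa[d] = {}
        found = {marks[s] for s in d if s in marks}
        if found:                                             # step 5
            accept[d] = [t for t in priority if t in found]
        for c in alphabet:
            move = {t for s in d for t in trans.get((s, c), ())}
            if move:
                nxt = epsilon_closure(move, trans)
                dfa[d][c] = nxt
                todo.append(nxt)
    return start, dfa, accept

# Two toy token NFAs: the keyword "if" and identifiers over {i, f}
nfa_if = (1, {(1, "i"): [2], (2, "f"): [3]}, {3: "IF"})
nfa_id = (4, {(4, "i"): [5], (4, "f"): [5],
              (5, "i"): [5], (5, "f"): [5]}, {5: "ID"})
start, dfa, accept = combine_and_determinise([nfa_if, nfa_id], "if", ["IF", "ID"])
# After reading "if", the DFA state accepts both IF and ID; IF appears first
# in the priority order, so it is the mark that survives.
```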
If the same accepting state in the DFA can accept several different token types, it is
because these overlap. This is not unusual, as keywords usually overlap with variable
names and a description of floating point constants may include integer constants as
well. In such cases, we can do one of two things:
• Let the lexer generator generate an error and require the user to make sure the
tokens are disjoint.
• Let the user of the lexer generator choose which of the tokens is preferred.
It can be quite difficult (though always possible) with regular expressions to define,
e.g., the set of names that are not keywords. Hence, it is common to let the lexer
choose according to a prioritised list. Normally, the order in which tokens are defined
in the input to the lexer generator indicates priority (earlier defined tokens take
precedence over later defined tokens). Hence, keywords are usually defined before
variable names, which means that, for example, the string “if” is recognised as
a keyword and not a variable name. When an accepting state in a DFA contains
accepting NFA states with different marks, the mark corresponding to the highest
priority (earliest defined) token is used. Hence, we can simply erase all but one mark
from each accepting state. This is a very simple and effective solution to the problem.
When we described minimisation of DFAs, we used two initial groups: One for
the accepting states and one for the non-accepting states. As there are now several
kinds of accepting states (one for each token), we must use one group for each token,
so we will have a total of n + 1 initial groups when we have n different tokens.
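Forming these n + 1 initial groups is a matter of partitioning the states by their marks. A small sketch of my own (the state and token names are illustrative, not from the book's figures):

```python
from collections import defaultdict

def initial_groups(states, mark):
    """`mark` maps each accepting state to its token; non-accepting states
    are absent from `mark` and all fall into the same group (key None)."""
    groups = defaultdict(set)
    for s in states:
        groups[mark.get(s)].add(s)
    return list(groups.values())

groups = initial_groups({"A", "B", "C", "D"},
                        {"B": "NUM", "C": "NUM", "D": "ID"})
# two tokens -> 2 + 1 = 3 groups: {"B", "C"}, {"D"}, and {"A"}
```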
To illustrate the precedence rule, Fig. 1.12 shows an NFA made by combining
NFAs for variable names, the keyword if, integers and floats, as described by the
regular expressions in Sect. 1.1.2. The individual NFAs are (simplified versions of)
what you get from the method described in Sect. 1.4. When a transition is labelled
by a set of characters, it is a shorthand for a set of transitions each labelled by a
single character. The accepting states are labelled with token names as described
above. The corresponding minimised DFA is shown in Fig. 1.13.

Fig. 1.12 Combined NFA for several tokens

Fig. 1.13 Combined DFA for several tokens

Note that state G is a combination of states 9 and 12 from the NFA, so it can
accept both NUM and FLOAT, but since integers take priority over floats, we have
marked G with NUM only.
Splitting the Input Stream
As mentioned, the lexer must cut the input into tokens. This may be done in several
ways. For example, the string if17 can be split in many different ways:
• As one token, which is the variable name if17.
• As the variable name if1 followed by the number 7.
• As the keyword if followed by the number 17.
• As the keyword if followed by the numbers 1 and 7.
• As the variable name i followed by the variable name f17.
• And several more.
A common convention is that it is the longest prefix of the input that matches any
token which will be chosen. Hence, the first of the above possible splittings of if17
will be chosen. Note that the principle of the longest match takes precedence over the
order of definition of tokens, so even though the string starts with the keyword if,
which has higher priority than variable names, the variable name is chosen because
it is longer.
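Hand-written lexers often approximate both rules with a single alternation using named groups. The following sketch (the patterns and token names are my own, not the book's) uses Python's re module: alternatives are tried left to right, so listing the keyword first gives it priority, and the \b word boundary makes the keyword alternative fail on longer identifiers such as if17, so the longest-match rule also comes out right for these inputs.

```python
import re

# Keyword before identifier: priority by order of definition.
# The \b makes "if" fail to match inside "if17", deferring to ID.
token_re = re.compile(r"(?P<IF>if\b)|(?P<ID>[a-zA-Z][a-zA-Z0-9]*)|(?P<NUM>[0-9]+)")

token_re.match("if").lastgroup     # -> "IF": the keyword takes precedence
token_re.match("if17").lastgroup   # -> "ID": the longer match wins
token_re.match("if17").group()     # -> "if17"
```

Note that re alternation by itself picks the first alternative that matches, not the longest overall match, which is why the word boundary is needed here; lexer generators implement the longest-match rule directly in the DFA, as described next.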
Modern languages like C, Java or SML follow this convention, and so do most
lexer generators, but some (mostly older) languages like FORTRAN do not. When
other conventions are used, lexers must either be written by hand to handle these
conventions, or the conventions used by the lexer generator must be sidestepped.
Some lexer generators allow the user to have some control over the conventions used.
The principle of the longest matching prefix is handled by letting the DFA read
as far as it can, until it either reaches the end of the input, or no transition is defined
on the next input symbol. If the current state at this point is accepting, we are in
luck, and can simply output the corresponding token. If not, we must go back to the
last time we were in an accepting state and output the token indicated by this. The
characters read since then are put back in the input stream. Hence, the lexer must
retain the symbols it has read since the last accepting state, so that it can reinsert
them into the input in such situations. If we are not at the end of the input stream, we restart
the DFA (in its initial state) on the remaining input to find the next tokens.
As an example, consider lexing of the string 3e-y with the DFA in Fig. 1.13.
We get to the accepting state G after reading the digit 3. However, we can continue
making legal transitions to state I on e and then to state J on - (as these could be the
start of the exponent part of a real number). It is only when we, in state J, find that
there is no transition on y that we realise that this is not the case. We must now go
back to the last accepting state (G) and output the number 3 as the first token and
reinsert - and e in the input stream, so we can continue with e-y when we look
for the subsequent tokens.
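The process just traced can be sketched as follows. This is my own helper, not the book's pseudocode, and the hand-built transition table is only a loose guess at the NUM/FLOAT fragment of Fig. 1.13: the DFA runs as far as it can, remembers the token and position of the last accepting state, and returns the unread suffix so it can be "reinserted" into the input.

```python
def next_token(start, trans, accept, s):
    """Longest match: return (token, lexeme, remaining input) or raise."""
    state, last = start, None
    for i, c in enumerate(s):
        state = trans.get((state, c))
        if state is None:
            break                       # no transition: stop reading
        if state in accept:
            last = (accept[state], i + 1)
    if last is None:
        raise ValueError("lexical error: no prefix of %r is a token" % s)
    token, pos = last
    return token, s[:pos], s[pos:]      # s[pos:] goes back to the input stream

trans = {}
for d in "0123456789":
    trans[("A", d)] = "G"; trans[("G", d)] = "G"   # integer part -> NUM
    trans[("I", d)] = "K"; trans[("J", d)] = "K"   # exponent digits -> FLOAT
    trans[("K", d)] = "K"
trans[("G", "e")] = "I"                            # start of an exponent
trans[("I", "-")] = "J"                            # exponent sign
accept = {"G": "NUM", "K": "FLOAT"}

next_token("A", trans, accept, "3e-y")   # -> ("NUM", "3", "e-y")
next_token("A", trans, accept, "3e-4")   # -> ("FLOAT", "3e-4", "")
```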
Lexical Errors
If no prefix of the input string forms a valid token, a lexical error has occurred.
When this happens, the lexer will usually report an error. At this point, it may stop
reading the input or it may attempt continued lexical analysis by skipping characters
until a valid prefix is found. The purpose of the latter approach is to try finding
further lexical errors in the same input, so several of these can be corrected by the
user before rerunning the lexer. Some of these subsequent errors may, however, not
be real errors, but may be caused by the lexer not skipping enough characters (or
skipping too many) after the first error is found. If, for example, the start of a comment
is ill-formed, the lexer may try to interpret the contents of the comment as individual
tokens, and if the end of a comment is ill-formed, the lexer will read until the end of
the next comment (if any) before continuing, hence skipping too much text.
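The skip-and-retry recovery described above can be sketched as follows. This is my own self-contained illustration: the `longest_match` stub stands in for the DFA-based longest-prefix matcher described earlier and, purely for illustration, only knows digit tokens.

```python
def longest_match(s):
    """Return (token, length) for the longest token prefix of s, or None.
    Stand-in for the DFA matcher; here it only recognises NUM tokens."""
    n = 0
    while n < len(s) and s[n].isdigit():
        n += 1
    return ("NUM", n) if n > 0 else None

def lex_with_recovery(s):
    """Tokenise all of s; on a lexical error, record the offending
    character, skip it, and resume scanning from the next character."""
    tokens, errors = [], []
    while s:
        m = longest_match(s)
        if m is None:
            errors.append(s[0])        # report the bad character ...
            s = s[1:]                  # ... then skip it and try again
        else:
            token, n = m
            tokens.append((token, s[:n]))
            s = s[n:]
    return tokens, errors

lex_with_recovery("12x34")   # -> ([("NUM", "12"), ("NUM", "34")], ["x"])
```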
When the lexer finds an error, the consumer of the tokens that the lexer produces
(e.g., the rest of the compiler) cannot usually produce a valid result itself. However,
the compiler may try to find other errors in the remaining input, again allowing the
user to find several errors in one edit-compile cycle. Again, some of the subsequent
errors may really be spurious errors caused by lexical error(s), so the user will have
to guess at the validity of every error message except the first, as only the first error
message is guaranteed to be a real error. Nevertheless, such error recovery has,
when the input is so large that restarting the lexer from the start of input incurs a
considerable time over
