
The Theory and Practice of Compiler Writing

Jean-Paul Tremblay Paul G. Sorenson

Department of Computational Science
University of Saskatchewan
Saskatoon, Canada

McGraw-Hill Book Company

New York St. Louis San Francisco Auckland Bogota Hamburg Johannesburg London Madrid Mexico Montreal New Delhi Panama Paris Sao Paulo Singapore Sydney Tokyo Toronto

This book was set in Times Roman by Science Typographers, Inc. The editors were Eric M. Munson, Kaye Pace, Ellen W. MacElree, and Jo Satloff; the production supervisor was Joe Campanella. The drawings were done by Volt Information Sciences, Inc. R.R. Donnelley & Sons Company was printer and binder.


Copyright © 1985 by McGraw-Hill, Inc. All rights reserved. Printed in the United States of America. Except as permitted under the United States Copyright Act of 1976, no part of this publication may be reproduced or distributed in any form or by any means, or stored in a data base or retrieval system, without the prior written permission of the publisher.


ISBN 0-07-065161-2

Library of Congress Cataloging in Publication Data

Tremblay, Jean-Paul, date.
    The theory and practice of compiler writing.
    (McGraw-Hill computer science series)
    Includes bibliographies and index.
    1. Electronic digital computers-Programming.  2. Compiling (Electronic computers)  I. Sorenson, P. G.  II. Title.  III. Series.
    QA76.6.T734  1985  001.64'2  84-25096
    ISBN 0-07-065161-2

Dedicated

To the memory of my father, Philippe Tremblay

To my father, Chester, and in memory of my mother, Josephine Sorenson


Contents

Chapter 1  Introduction
    1-1  Programming Languages
    1-2  Translators
    1-3  Model of a Compiler
    Bibliography

Chapter 2  Notation and Concepts for Languages and Grammars
    2-1  Sets and Strings
        2-1.1  Basic Concepts of Set Theory
            Exercises 2-1.1
        2-1.2  Relations
            Exercises 2-1.2
        2-1.3  Strings
    2-2  Discussion of Grammars
        Exercises 2-2
    2-3  Classification of Grammars
        Exercises 2-3
    2-4  Context-Free Grammars and Parsing
        2-4.1  Syntax Terminology
            Exercises 2-4.1
        2-4.2  A View of Parsing
            Exercises 2-4.2
        2-4.3  Reduced Grammars and Grammars with ε-Rules
            Exercises 2-4.3
        2-4.4  Extended BNF Notation
            Exercises 2-4.4
    Bibliography



Chapter 3  Programming-Language Design
    3-1  Overview of the Problem
    3-2  Preliminary Considerations
    3-3  Sources of Ideas
    3-4  Goals and Design Philosophies of Programming Languages
        3-4.1  Human Communication
        3-4.2  Prevention and Detection of Errors
        3-4.3  Usability
        3-4.4  Programming Effectiveness
        3-4.5  Compilability
        3-4.6  Efficiency
        3-4.7  Machine Independence
        3-4.8  Simplicity
        3-4.9  Uniformity
        3-4.10  Orthogonality
        3-4.11  Generalization and Specialization
        3-4.12  Other Design Philosophies
    3-5  Detailed Design
        3-5.1  Microstructure
        3-5.2  Expression Structure
        3-5.3  Data Structure
        3-5.4  Control Structure
        3-5.5  Compile Structure
        3-5.6  I/O Structure
    3-6  Reduction of Size
    3-7  Pragmatics
    3-8  Comments on the Design of Ada
        3-8.1  Evaluation of Ada from a Language Design Viewpoint
            Usability and Program Effectiveness
            Machine Independence and Portability
            Efficiency
            Modularity and Maintainability
            Compilability and Compile Structure
            Simplicity
        3-8.2  Ada Programming Support Environment
        3-8.3  Software Technology for Adaptable Reliable Systems (STARS)
    Chapter Exercises
    Bibliography

Chapter 4  Scanners
    4-1  The Scanning Process
    4-2  An Elementary Scanner Design and Its Implementation
        Exercises 4-2
    4-3  Regular Grammars and Regular Expressions
        Exercises 4-3
    4-4  Finite-State Acceptors
        4-4.1  Deterministic Finite-State Acceptors
        4-4.2  Nondeterministic Finite-State Acceptors




        4-4.3  Nondeterministic Finite-State Acceptors with ε-Transitions
        Exercises 4-4
    4-5  Equivalence of Regular Grammars and Finite-State Acceptors
        Exercises 4-5
    4-6  Equivalence of Regular Expressions and Finite-State Acceptors
        Exercises 4-6
    4-7  A Scanner Generator
        Exercises 4-7
    Bibliography

Chapter 5  Compile-Time Error Handling
    5-1  Overview of Error Handling
    5-2  Error Detection
        5-2.1  The Nature of Syntax Errors
        5-2.2  How Errors Are Detected
        5-2.3  Where Errors Are Detected
    5-3  Error Reporting
    5-4  Error Recovery
        5-4.1  Ad Hoc Recovery Mechanisms
        5-4.2  Syntax-Directed Recovery
        5-4.3  Secondary Error Recovery
        5-4.4  Context-Sensitive Recovery
    5-5  Error Repair
        5-5.1  Ad Hoc Repair
        5-5.2  Syntax-Directed Repair
        5-5.3  Context-Sensitive Repair
        5-5.4  Spelling Repair
    Chapter Exercises
    Bibliography

Chapter 6  Top-Down Parsing
    6-1  General Top-Down Parsing Strategies
        6-1.1  Brute-Force Approach
            Exercises 6-1.1
        6-1.2  Recursive-Descent Parsers
            Recursive-Descent Parsing Algorithm
            Error Detection in Recursive-Descent Parsers
            Exercises 6-1.2
        6-1.3  Notions of Top-Down Parsing with Limited Backup
        6-1.4  Top-Down Parsing with Limited Backup
            Exercises 6-1.4
    6-2  Top-Down Parsing with No Backup
        6-2.1  Notions of Parsing with No Backup
        6-2.2  Simple LL(1) Grammars
            Exercises 6-2.2
        6-2.3  LL(1) Grammars without ε-Rules
            Exercises 6-2.3
        6-2.4  LL(1) Grammars with ε-Rules


            Exercises 6-2.4
        6-2.5  Error Handling for LL(1) Parsers
            Exercises 6-2.5
    Chapter Exercises
    Bibliography

Chapter 7  Bottom-Up Parsing
    7-1  Polish Expressions and Their Compilation
        7-1.1  Polish Notation
        7-1.2  Conversion of Infix Expressions to Polish Notation
        7-1.3  Error Detection for Infix Expressions
        Exercises 7-1
    7-2  Operator Precedence Grammars
        7-2.1  Notions and Use of Operator Precedence Relations
        7-2.2  Formal Definition of Operator Precedence Relations
        7-2.3  Operator Precedence Parsing Algorithm
        7-2.4  Precedence Functions in Operator Precedence Parsers
        7-2.5  Error Detection in Operator Precedence Parsers
        Exercises 7-2
    7-3  Simple Precedence Grammars
        7-3.1  Notions and Use of Precedence Relations
        7-3.2  Formal Definition of Precedence Relations
        7-3.3  Parsing Algorithm for Simple Precedence Grammars
        7-3.4  Precedence Functions for Simple Precedence Grammars
        7-3.5  Error Recovery for Simple Precedence Parsers
        7-3.6  Notions of Extended Precedence
        Exercises 7-3
    7-4  LR Grammars
        7-4.1  Concepts and Terminology
            Exercises 7-4.1
        7-4.2  LR(0) Parsers
            Exercises 7-4.2
        7-4.3  SLR(1) Parsers
            Exercises 7-4.3
        7-4.4  Canonical LR(1) Parsers
            Exercises 7-4.4
        7-4.5  LALR(1) Parsers
            Exercises 7-4.5
        7-4.6  Efficient Generation of Lookahead Sets for LALR(1) Parsers
            Exercises 7-4.6
        7-4.7  Representation and Optimization of LR Parsers
            Sparse-Matrix Representations
            Optimization Transformations
        7-4.8  Error Detection and Recovery in LR Parsers
            Early Methods of Error Recovery
            Application of the Graham-Rhodes Method to LR Parsers
            Other LR Error-Recovery Methods


    7-5  Comparison of Parsing Methods
    Bibliography

Chapter 8  Symbol-Table-Handling Techniques
    8-1  Perspective and Motivation
    8-2  When to Construct and Interact with the Symbol Table
    8-3  Symbol-Table Contents
    8-4  Operations on Symbol Tables
    8-5  Symbol-Table Organizations for Non-Block-Structured Languages
        8-5.1  Unordered Symbol Tables
        8-5.2  Ordered Symbol Tables
        8-5.3  Tree-Structured Symbol Tables
        8-5.4  Hash Symbol Tables
    8-6  Symbol-Table Organizations for Block-Structured Languages
        8-6.1  Block-Structured Language Concepts
        8-6.2  Stack Symbol Tables
        8-6.3  Stack-Implemented Tree-Structured Symbol Tables
        8-6.4  Stack-Implemented Hash-Structured Symbol Tables
    Chapter Exercises
    Bibliography

Chapter 9  Run-Time Storage Organization and Management
    9-1  Static Storage Allocation
        Exercises 9-1
    9-2  Dynamic Storage Allocation
        9-2.1  Activation Records
        9-2.2  Parameter Area
        9-2.3  Display Area
        9-2.4  Run-Time Address Calculation
        9-2.5  Handling Recursive Procedures
        Exercises 9-2
    9-3  Heap Storage Allocation
        9-3.1  Implicit Storage Requests
        9-3.2  Explicit Storage Requests
        9-3.3  Management of Fixed-Length Blocks
        9-3.4  Management of Variable-Length Blocks
            First-Fit Heap Storage-Management Strategy
            Boundary-Tag Heap Storage-Management Strategy
            Buddy-System Heap Storage-Management Strategy
        9-3.5  Free-as-You-Go Storage Release
        9-3.6  Garbage Collection
        9-3.7  Compaction
        Exercises 9-3
    Bibliography

Chapter 10  Intermediate Forms of Source Programs
    10-1  Polish Notation
    10-2  N-tuple Notation
    10-3  Abstract Syntax Trees
    10-4  Threaded Code
    10-5  Abstract Machine Code
        10-5.1  Portability and Abstract Machine
        10-5.2  The P-Code Abstract Machine for PASCAL
    Chapter Exercises
    Bibliography

Chapter 11  Semantic Analysis and Code Generation
    11-1  What Is Meant by Semantic Analysis
    11-2  Implicit Stacking in Recursive-Descent Compilation
    11-3  Semantic Stacks in Bottom-Up Compilation
    11-4  Action Symbols in Top-Down Compilation
    11-5  Attributed Translation
        11-5.1  Attributed Translation Grammar
        11-5.2  L-Attributed Translation Grammar
    11-6  Example Language Constructs
        11-6.1  Declarations
            Constant Type
            Variable Declarations
            Procedure Declarations
        11-6.2  Expressions
        11-6.3  Assignment Statements
        11-6.4  Control Statements
            Case-Selection Statement
            Repeat-While Statement
            For Loop Statement
        11-6.5  Procedure Calls and Returns
            Procedure Calls
            Return Statements and Procedure Termination
        11-6.6  Input and Output Statements
            Input Statements
            Output Statements
        11-6.7  Compiler Aids
    Chapter Exercises
    Bibliography

Chapter 12  Code Optimization
    12-1  Introduction
    12-2  Basic Blocks
    12-3  Folding
    12-4  Redundant-Subexpression Elimination
    12-5  Optimization within Iterative Loops
        12-5.1  Loop Unrolling
        12-5.2  Frequency Reduction


        12-5.3  Strength Reduction
        12-5.4  Combining Loop-Optimization Techniques
    12-6  Global Optimization through Flowgraph Analysis
        12-6.1  Flowgraph Construction
        12-6.2  Flowgraph Analysis
            Flowgraph-Analysis Problems
            Flow-Analysis Algorithms
        12-6.3  Applications to Program Optimization
        12-6.4  Implementation and Further Considerations
    Chapter Exercises
    Bibliography

Chapter 13  Machine-Dependent Optimization
    13-1  Introduction to Machine-Dependent Optimization
    13-2  Register-Allocation Optimization
        13-2.1  Register Allocation in Single-Register Machines
        13-2.2  Register Allocation in Multiregister Machines
    13-3  Machine Architecture and the Generation of Real Code
        13-3.1  The PDP-11
        13-3.2  The VAX-11
        13-3.3  The MC68000
    Chapter Exercises
    Bibliography

Chapter 14  Compiler-Compilers
    14-1  Introduction to Compiler-Compilers
    14-2  Examples of Parser Generators
        14-2.1  YACC: A LALR(1) Parser Generator
        14-2.2  An Attributed LL(1) Parser Generator
    14-3  Machine-Independent Code Generation
        14-3.1  The Production-Quality Compiler-Compiler System
            Introduction
            The Formalization of Instruction-Set Processors and TCOL
            The Code-Generation Process: Its Phases and Organization
        14-3.2  The Table-Driven Generator of Graham and Glanville
            Introduction
            The Target-Machine Description
            The Code-Generation Process
            Results and Extensions
        14-3.3  Other Code-Generation Systems
            Fraser's Knowledge-Based Code-Generator Generator
            The Finite-State Approach of Donegan
    Bibliography

Appendix  Algorithmic Notation
    A-1  Format Conventions
        A-1.1  Name of Algorithm
        A-1.2  Introductory Comment

        A-1.3  Steps
        A-1.4  Comments
    A-2  Statements and Control Structures
        A-2.1  Assignment Statement
        A-2.2  If Statement
        A-2.3  Case Statement
        A-2.4  Repeat Statement
        A-2.5  Go To and Exitloop Statements
        A-2.6  Exit Statement
        A-2.7  Variable Names
    A-3  Data Structures
        A-3.1  Arrays
        A-3.2  Dynamic Storage
    A-4  Arithmetic Operations and Expressions
    A-5  Strings and String Operations
    A-6  Relations and Relational Operators
    A-7  Logical Operations and Expressions
    A-8  Input and Output
    A-9  Subalgorithms
        A-9.1  Functions
        A-9.2  Procedures
        A-9.3  Parameters

Indexes
    Name Index
    Subject Index


This book is intended as a text for a one- or two-semester course in compiler design at the senior undergraduate or introductory graduate level. It can also be used as a self-study and reference book in compiler design. The purpose of this text is to cover the underlying concepts and techniques used in compiler design. Some of these techniques can be used in software engineering or software design. The reader should have at least one year of experience in programming in a high-level language and an assembly language. In addition, a familiarity with elementary data structures and discrete mathematics is a definite asset. The text, however, is reasonably self-contained.

Chapter 1 contains an overview of compiler design and gives a model for a compiler. Chapter 2 covers the basic notations and concepts of discrete mathematics, grammars, and languages. This material is used in the remainder of the book. Students who are familiar with this material can quickly peruse it or ignore it. They need not have a complete mastery of the concepts on their first reading. They can, however, refer to the material when it is used in subsequent chapters.

In Chapter 3 we introduce the elements of programming-language design. We attempt to give a unified treatment of the topics in sufficient depth to be useful to persons actually designing a programming language. Much of this material appears in a compiler book for the first time. Chapter 4 details the informal and the formal approaches to lexical analyzers (i.e., scanners). Chapter 5 gives an overview of error handling. Although it doesn't contain many detailed algorithms on error recovery and repair strategies, it discusses important factors and considerations of error recovery and repair in general. Detailed algorithms for particular strategies are given in Chapters 6 and 7.

Chapter 6 is the first of two chapters on parsing. The chapter deals with top-down parsing. In the first section we examine a number of ad hoc methods.
The second section gives a detailed description of syntax-directed translation using LL(1) grammars.



Chapter 7 discusses several syntax-directed bottom-up parsing techniques. The discussion includes several precedence and LR parsers along with associated error detection and recovery strategies.

In Chapter 8 we first outline the important functions of a symbol table in the translation process and then give the details of several symbol-table organizations for block-structured and non-block-structured languages.

Chapter 9 provides a description of the static, dynamic, explicit, and implicit storage-allocation strategies that are frequently adopted when compiling programs for a wide variety of existing languages.

In Chapter 10 we discuss the advantages of using intermediate source forms. Five types of intermediate source forms are examined in detail.

Chapter 11 deals with semantic analysis and code generation. This chapter presents four types of syntax-directed processing techniques, including translation grammars and attributed translation.

Chapter 12 presents several machine-independent optimization strategies, including folding, redundant-subexpression elimination, optimization within iterative loops, and global optimization through flow analysis.

Chapter 13 covers several machine-dependent optimization techniques such as register allocation and peephole optimization.

Chapter 14 gives a brief overview of compiler-compiler systems. The main focus is on the PQCC and Graham-Glanville systems. Much of this material appears in a compiler book for the first time.

As we mentioned earlier, the book can be used in a two-semester course. Depending on the level of such a course, the instructor may wish to supplement the text with reading assignments from the references. Several one-semester courses are possible. For example, the following selection of topics constitutes such a course:

Chapters 1 and 2
Chapter 3 (can be omitted if students don't do their own language design)
Sections 4-1 and 4-2
Chapter 5
Sections 6-1.2 and 6-2 (except Sec. 6-2.5)
Sections 7-4.1, 7-4.2, 7-4.3, and 7-4.6
Sections 8-5.3 and 8-5.4 and parts of Sec. 8-6
Chapter 9 (select one or two methods)
Chapter 10
Chapter 11 (select a few methods of code generation)

An important aspect of teaching a course in compiler writing is to illustrate the key theoretical concepts. A frequently used strategy for achieving this goal is to have a student design a simple programming language and implement a compiler for this language. Many texts in compiler writing do not, because of size considerations and level of presentation, adequately present details that illustrate the implementation of a compiler. In a separate volume, entitled An Implementation Guide to Compiler Writing (published by McGraw-Hill in 1982), we present, in a case-study manner, the development of a small compiler typical of those developed in a first compiler course. Instructors can use this book as a guide to illustrate difficulties that commonly arise when implementing a compiler. It can also be used as a sample document that is similar to what students should produce for their own compilers. It is often necessary for instructors teaching a course in compiler writing to set up their own "toy" compiler as a vehicle for illustrating techniques for implementing a compiler. Also, many of the problems that are encountered in writing a compiler for a newly designed language must be illustrated. In this guide we attempt to fulfill this role.

The guide begins with a documented description of a simple programming language, GAUSS, which serves as an example for the student. GAUSS is a block-structured language whose design was influenced by ALGOL, PASCAL, PL/I, and FORTRAN. It also contains string-manipulation facilities capable of manipulating variable-length strings. The guide then illustrates the topics of scanner generation, LL(1) parsing, symbol-table construction, code generation, semantic analysis, and error detection and repair for the GAUSS language. Finally, the guide describes an interpreter for the code generated by the GAUSS compiler and discusses the run-time environment in which the machine code is executed. The code generated by the compiler executes on a hypothetical stack-oriented machine.


We would like to thank the many people who assisted us in the preparation of this manuscript. Brent Clark assisted in the preparation of some parts of Chapters 12 and 13. Jack Copeck contributed to earlier drafts of Chapters 6 and 12. Howard Hamilton assisted in program flow analysis techniques. Rod Nickel contributed to the part of Chapter 7 that deals with error detection and recovery in precedence parsers. Lyle Opseth assisted in error detection and recovery in Chapter 7 and a preliminary version of Chapter 14. Darwyn Peachey assisted in the preparation of Chapter 5 and some error detection and recovery techniques in Chapters 6 and 7. Henry Spencer contributed to Chapter 3, and Joe Wald contributed to the attributed translation part of Chapter 11. Special thanks is due to John DeDourek, who class-tested the manuscript and proofread the entire galley form of the book. We also wish to thank the reviewers of the manuscript: Michael Bauer, James Harp, John Lowther, and Tom Pennello. Thanks also to Chris Fraser for his comments on Chapter 14. We are grateful to Sue Wotton, who did such an excellent job of typing the manuscript, and to Gail Walker for her typing support. Finally, we are very grateful for the comments of our students in the Department of Computational Science at the University of Saskatchewan, who have class-tested preliminary versions of the book during the last seven years.

Jean-Paul Tremblay
Paul G. Sorenson




Interactions involving humans are most effectively carried out through the medium of language. Language permits the expression of thoughts and ideas, and without it, communication as we know it would be very difficult indeed.

In computer programming, a programming language serves as a means of communication between the person with a problem and the computer used to help solve it. An effective programming language enhances both the development and the expression of computer programs. It must bridge the gap between the often unstructured nature of human thought and the precision required for computer execution. A program solution to a given problem will be easier and more natural to obtain if the programming language used is close to the problem. That is, the language should contain constructs which reflect the terminology and elements used in describing the problem and are independent of the computer used. Such a programming language is usually high-level. Digital computers, on the other hand, accept and understand only their own special low-level language, consisting typically of long sequences of zeros and ones. These sequences are generally unintelligible to humans. This type of low-level language is quite different from the high-level language used to describe a problem in a given area. A hierarchy of programming languages based on increasing machine independence includes the following:

1. Machine-level languages
2. Assembly languages
3. Higher-level or user-oriented languages
4. Problem-oriented languages



A machine-level language is the lowest form of computer language. Each instruction in a program is represented by a numeric code, and numerical addresses are used throughout the program to refer to memory locations in the computer's memory. All bookkeeping aspects of the program are the sole responsibility of the machine-language programmer. Finally, all diagnostics and programming aids must be supplied by the programmer. Also included as machine-level programs are programs written in microcode (i.e., microprograms). Microcode allows for the expression of some of the more powerful machine-level instructions in terms of a set of basic machine instructions.

Assembly language is essentially a symbolic version of a machine-level language. Each operation code is given a symbolic code such as ADD for addition and MUL for multiplication. Moreover, memory locations are given symbolic names such as PAY and RATE. Some assembly languages contain macroinstructions which are at a higher level than assembly-language instructions. Assembly-language systems offer certain diagnostic and debugging assistance that is normally not available at the machine level.

A high-level language such as FORTRAN, PASCAL, or PL/I offers most of the features of an assembly language. While some facilities for accessing system-level features may not be provided, a high-level language offers a more enriched set of language features such as structured control constructs, nested statements, blocks, and procedures.

A problem-oriented language provides for the expression of problems in a specific application or problem area. Examples of such languages are SEQUEL for database retrieval applications and COGO for civil engineering applications. In this book we concern ourselves exclusively with the development of translators for high-level algebraic languages such as ALGOL, PASCAL, and FORTRAN.
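The relationship between assembly language and machine language described above can be sketched as a toy two-pass assembler. The opcode numbers, the (opcode, address) instruction format, and the base address 100 are illustrative assumptions, not any real machine's encoding:

```python
# Toy assembler: translate symbolic opcodes (ADD, MUL, ...) and
# symbolic memory names (PAY, RATE, ...) into numeric machine code.
# All numeric encodings here are invented for illustration.

OPCODES = {"LOAD": 1, "ADD": 2, "MUL": 3, "STORE": 4}

def assemble(program):
    """Translate (opcode, symbolic-operand) pairs into numeric pairs."""
    # Pass 1: assign each symbolic name a numeric memory address.
    symbols = {}
    for _, operand in program:
        if operand not in symbols:
            symbols[operand] = 100 + len(symbols)  # assumed base address
    # Pass 2: emit numeric (opcode, address) machine instructions.
    return [(OPCODES[op], symbols[operand]) for op, operand in program]

source = [("LOAD", "RATE"), ("MUL", "HOURS"), ("STORE", "PAY")]
print(assemble(source))  # [(1, 100), (3, 101), (4, 102)]
```

The two passes mirror the bookkeeping that assemblers take over from the machine-language programmer: first build the symbol table, then emit code. Diagnostics and macro expansion, mentioned above, are omitted from the sketch.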
Although the class of high-level languages includes both procedural and nonprocedural languages, this book concentrates on procedural languages. Unlike procedural programs, the position of a particular statement in a nonprocedural program has no bearing on its execution; that is, the interchange of any two statements in a nonprocedural program will not affect the results obtained through its execution. Examples of nonprocedural languages are PSL, a requirements statement language, and CSMP, a simulation and modeling language. Advantages of high-level languages over low-level languages such as machine and assembly languages include the following:

1. High-level languages are easier to learn than their lower-level counterparts. The learning of many high-level languages requires little or no computer hardware background because such languages are relatively machine-independent. Furthermore, these languages are closer to their problem areas than lower-level languages.

2. The programmer does not have to be concerned with clerical tasks involving numerical or symbolic references to instructions, memory locations, constants, etc. Such tasks are handled by a system which translates the high-level language program into machine language.


3. A programmer is not required to know how to convert data from external forms to various internal forms within the memory of a computer. The ability of a programmer to convert, say, numeric data into internal forms such as floating-point numbers and packed-decimal numbers should be irrelevant.

4. Most high-level languages offer a programmer a variety of control structures which are not available in low-level languages. High-level languages offer several of the following language constructs:

   Conditional statements (such as IF-THEN-ELSE and CASE statements)
   Looping statements (for both counted and conditional loops)
   Nested statements
   Block structures

   These control structures improve programming style and facilitate certain programming approaches such as structured programming. Resulting programs are easier to read, understand, and alter. This results in reduced programming costs because programs are less complicated.

5. Programs written in a high-level language are usually more easily debugged than their machine- or assembly-language equivalents. High-level languages offer constructs that eliminate or reduce certain kinds of programming errors that occur in low-level languages. For example, the declaration of variables in a program adds a certain degree of redundancy which is useful in detecting errors in the improper use of variables. Languages often enforce a disciplined use of pointers. Moreover, a structured program is much more easily debugged than its unstructured counterpart.

6. Since most high-level languages offer more powerful control and data-structuring capabilities than low-level languages, the former class of languages facilitates the expression of a solution to a particular problem.

7. Because of the availability of certain language features such as procedures, high-level languages permit a modular and hierarchical description of programming tasks.
   These tasks can then be assigned to a team of programmers, thus facilitating a division of labor with a minimum of disruption and effort. Such an approach permits better documentation of a problem. Also, increased compatibility among programs and programmers can be realized.

8. Finally, high-level languages are relatively machine-independent. Consequently, certain programs such as FORTRAN and COBOL programs are portable. These programs can be executed on different computers with little, if any, change even though these computers have different internal architectures and machine-language instruction sets. Program portability reduces costs and, in general, protects an installation against computer obsolescence. Often when a computer is replaced, all associated assembly- and machine-language programs become obsolete.

Points 1 through 5 have been stressed in the past and found to be very important. Today, points 6 through 8 are receiving considerable attention and promise to be dominant considerations in the immediate future.


High-level language programs, of course, must be translated automatically to equivalent machine-language programs. The notions of such translators are examined in the next section.


A translator inputs and then converts a source program into an object or target program. The source program is written in a source language and the object program belongs to an object language.

If the source language being translated is assembly language and the object program is machine language, the translator is called an assembler. Assembly language closely resembles machine language. In an elementary assembly language, most of its instructions are symbolic representations of corresponding machine-language instructions. Each assembly instruction contains an ordered set of fields. For example, the first field might represent a label which is followed immediately by an operation field. This field in turn might be followed by one or more operand fields. More sophisticated assembly languages, however, support macroinstructions. A macroinstruction can be expanded to a sequence of simple machine-language instructions. High-level language constructs such as case statements, nested statements, and blocks are not usually contained in assembly languages.

A translator which transforms a high-level language such as FORTRAN, PASCAL, or COBOL into a particular computer's machine or assembly language is called a compiler. The time at which the conversion of a source program to an object program occurs is called compile time. The object program is executed at run time. Figure 1-1 illustrates the compilation process. Note that the source program and data are processed at different times, namely, compile time and run time, respectively. Another kind of translator, called an interpreter, processes an internal form of the source program and data at the same time. That is, interpretation of the


Figure 1-1 Compilation process. (The compiler translates the source program into an object program at compile time; at run time the executing computer processes the object program together with the data.)
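The compile-time/run-time split that Figure 1-1 depicts can be sketched in miniature: a translation step runs once and produces an object form of the program, and only that object form meets the data later. The postfix object language and the small operator set below are assumptions for illustration, not the book's notation:

```python
# translate_to_postfix plays the compiler: it runs once, at "compile
# time", and produces an object program in postfix form. evaluate
# plays the executing computer: it runs later, with the data.

PREC = {"+": 1, "-": 1, "*": 2, "/": 2}

def translate_to_postfix(tokens):
    """Compile-time phase: infix token list -> postfix object program."""
    output, stack = [], []
    for t in tokens:
        if t in PREC:
            # Pop operators of equal or higher precedence first.
            while stack and PREC[stack[-1]] >= PREC[t]:
                output.append(stack.pop())
            stack.append(t)
        else:                      # an operand (a variable name)
            output.append(t)
    while stack:
        output.append(stack.pop())
    return output

def evaluate(postfix, data):
    """Run-time phase: execute the object program against the data."""
    stack = []
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
           "*": lambda a, b: a * b, "/": lambda a, b: a / b}
    for t in postfix:
        if t in ops:
            b, a = stack.pop(), stack.pop()
            stack.append(ops[t](a, b))
        else:
            stack.append(data[t])
    return stack[0]

obj = translate_to_postfix(["A", "+", "B", "*", "C"])  # ['A','B','C','*','+']
print(evaluate(obj, {"A": 2, "B": 3, "C": 4}))         # 14
```

Parenthesized expressions and error handling are omitted; the point is only that translation and execution happen at different times, with the object program as the bridge between them.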



Figure 1-2 Interpretive process. (The interpreter processes an internal form of the source program and the data together at run time; no object program is produced.)

internal source form occurs at run time and no object program is generated. Figure 1-2 illustrates this interpretive process. Some interpreters analyze each source statement every time it is to be executed. Such an approach is very time-consuming and is seldom used. A more efficient approach involves applying compilation techniques to the translation of the source program to an intermediate source form. This intermediate source form is then interpreted by the interpreter program.

Interpreters have become increasingly popular lately, particularly in microcomputer environments where the overhead of interpretation seems to be significantly less for the user. For example, a main reason why languages such as BASIC, APL, LISP, and SMALLTALK-80 have become very popular is that they have been implemented in an interpretive environment.

In this book, we concentrate on compiler construction techniques. Although a complete compiler will not be found in this text, the reader is referred to the accompanying text, An Implementation Guide to Compiler Writing, where a compiler for the high-level programming language GAUSS is described in detail. A compiler can be described in a modular fashion. The next section gives an overview of such an approach.
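The more efficient interpretive approach just described, translating the source once into an intermediate form and interpreting only that form, can be sketched as follows. The three-address tuple format and the operation names are assumptions for illustration, and the translation step itself is not shown:

```python
# Interpreter for an assumed intermediate source form: a list of
# (op, arg1, arg2, result) tuples produced earlier by compilation
# techniques. Only this form, never the source text, is re-examined
# at run time.

def interpret(tuples, env):
    """Execute intermediate tuples against an environment of values."""
    for op, a, b, r in tuples:
        if op == "ADD":
            env[r] = env[a] + env[b]
        elif op == "MUL":
            env[r] = env[a] * env[b]
        elif op == "COPY":          # second operand unused
            env[r] = env[a]
    return env

# Assumed intermediate form of: PAY = RATE * HOURS + BONUS
code = [("MUL", "RATE", "HOURS", "t1"),
        ("ADD", "t1", "BONUS", "t2"),
        ("COPY", "t2", None, "PAY")]
env = interpret(code, {"RATE": 10, "HOURS": 8, "BONUS": 5})
print(env["PAY"])  # 85
```

Because the loop dispatches on pre-analyzed tuples rather than re-scanning source text, each statement is analyzed only once, which is the efficiency argument made above. Tuple-style intermediate forms are treated in Chapter 10.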


The task of constructing a compiler for a particular source language is complex. The complexity and nature of the compilation process depend, to a large extent, on the source language. Compiler complexity can often be reduced if a programming-language designer takes various design factors into consideration. Several of these programming-language design factors are discussed in Chap. 3. Since we are dealing with high-level source languages such as ALGOL and PASCAL, however, a basic model of a compiler can be formulated. Such a model is given in Fig. 1-3. Although this model may vary for the compilation of different high-level languages, it is nevertheless representative of the compilation process.

A compiler must perform two major tasks: the analysis of a source program and the synthesis of its corresponding object program. The analysis task deals with the decomposition of the source program into its basic parts. Using these


[Figure 1-3: the source program passes through the ANALYSIS components (lexical analyzer, syntactic analyzer, semantic analyzer) and then the SYNTHESIS components (code generator, code optimizer), producing the object program.]

Figure 1-3 Components of a compiler.

parts, the synthesis task builds their equivalent object-program modules. The performance of these tasks is realized more easily by building and maintaining several tables. More will be said about the contents of these tables shortly. A source program is a string of symbols each of which is generally a letter, a digit, or certain special symbols such as +, -, and (,). A source program contains elementary language constructs such as variable names, labels, constants, keywords, and operators. It is therefore desirable for the compiler to identify these various types as classes. These language constructs are given in the definition of the language. The source program (see Fig. 1-3) is input to a lexical analyzer or scanner whose purpose is to separate the incoming text into pieces or tokens such as constants, variable names, keywords (such as DO, IF, and THEN in PL/I), and operators. In essence, the lexical analyzer performs low-level syntax analysis. For efficiency reasons, each class of tokens is given a unique internal representation number. For example, a variable name may be given a representation number of 1, a constant a value of 2, a label the number 3, the addition operator (+) a value of 4, etc. For example, the PL/I statement


would be translated by the lexical analyzer into the following sequence of tokens (and associated representation numbers):

TEST 3   26   IF 20   1   15   1   21   1   10   1   27

Note that in scanning the source statement and generating the representation number of each token we have ignored spaces (or blanks) in the statement. The lexical analyzer must, in general, process blanks and comments. Both of these classes of items typically do not represent executable parts of a program and are ignored, for example, in PL/I. Certain programming languages allow the continuation of statements over multiple lines. Lexical analyzers must then handle the input processing of such multiple-line statements. Also, some scanners place constants, labels, and variable names in appropriate tables. A table entry for a variable, for example, may contain its name, type (e.g., REAL, INTEGER, or BOOLEAN), object-program address, value, and line in which it is declared. The lexical analyzer supplies tokens to the syntax analyzer. These tokens may take the form of a pair of items. The first item gives the address or location of the token in some symbol table. The second item is the representation number of the token. Such an approach offers a distinct advantage to the syntax analyzer; namely, all tokens are represented by fixed-length information: an address (or pointer) and an integer. Chapter 4 describes the scanning process in more detail. The syntax analyzer is much more complex than the lexical analyzer. Its function is to take the source program (in the form of tokens) from the lexical analyzer and determine the manner in which it is to be decomposed into its constituent parts. That is, the syntax analyzer determines the overall structure of the source program. This process is analogous to determining the structure of a sentence in the English language. In such an instance we are interested in identifying certain classes such as "subject," "predicate," "verb," "noun," and "adjective." In syntax analysis we are concerned with grouping tokens into larger syntactic classes such as expression, statement, and procedure.
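The scanning scheme just described can be sketched as a small table-driven scanner. The token classes and representation numbers below (identifier = 1, constant = 2, operator = 4, keyword = 20) are illustrative choices patterned on the text's example, not the numbering of any real compiler:

```python
import re

# (class name, pattern, representation number); keywords are listed before
# identifiers so that IF is not scanned as a variable name.
TOKEN_SPEC = [
    ("KEYWORD", r"(IF|THEN|DO)\b",        20),
    ("ID",      r"[A-Za-z][A-Za-z0-9]*",   1),
    ("CONST",   r"[0-9]+",                  2),
    ("PLUS",    r"\+",                      4),
    ("BLANK",   r"\s+",                  None),   # blanks produce no token
]

def scan(source):
    """Return the list of (lexeme, representation number) pairs."""
    tokens, pos = [], 0
    while pos < len(source):
        for name, pattern, rep in TOKEN_SPEC:
            m = re.match(pattern, source[pos:])
            if m:
                if rep is not None:
                    tokens.append((m.group(), rep))
                pos += m.end()
                break
        else:
            raise ValueError(f"unrecognized character {source[pos]!r}")
    return tokens

print(scan("IF X + 10"))   # [('IF', 20), ('X', 1), ('+', 4), ('10', 2)]
```

Note that the output pairs (lexeme, number) correspond to the fixed-length (pointer, integer) pairs the scanner hands to the syntax analyzer.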
The syntax analyzer (or parser) outputs a syntax tree (or its equivalent) whose leaves are the tokens and in which every nonleaf node represents a syntactic class type. For example, an analysis of the source statement

(A + B)*(C + D)

can produce the syntactic classes <factor>, <term>, and <expression> as exhibited in the syntax tree given in Fig. 1-4. The syntax-tree approach is described in Chap. 2, where a set of rules known as a grammar is used to define precisely the source language. A grammar can be used by the syntax analyzer to determine the


[Figure 1-4: the root <expression> derives a single <term>, which expands as <term> * <factor>; the left <term> derives the parenthesized expression ( A + B ) through <expression> + <term>, <term>, and <factor> nodes, and the right <factor> similarly derives ( C + D ).]
Figure 1-4 A syntax tree for the expression (A + B)*(C + D).

structure of the source program. This recognition process is called parsing, and consequently we often refer to syntax analyzers as parsers. Several classes of parsers are discussed in detail in Chaps. 6 and 7. The syntax tree produced by the syntax analyzer is used by the semantic analyzer. The function of the semantic analyzer is to determine the meaning (or semantics) of the source program. Although it is conceptually desirable to separate the syntax of a source program from its semantics, the syntax and semantic analyzers work in close cooperation. Nevertheless, the semantic-analysis process is a different and unique process in a compiler. For an expression such as (A + B) * (C + D), for example, the semantic analyzer must determine what actions are specified by the arithmetic operators of addition and multiplication. When the parser recognizes an operator such as "+" or "*", it invokes a semantic routine which specifies the action to be performed. This routine may check that the two operands to be added have been declared, that they have the same type (if not, the routine would probably make them the same), and that both operands have values. The semantic analyzer often interacts with the various tables of the compiler in performing its task. The semantic-analyzer actions may involve the generation of an intermediate form of source code. For the expression (A + B) * (C + D), the intermediate source code might be the following set of quadruples:

( +, A, B, T1)
( +, C, D, T2)
( *, T1, T2, T3)


where ( +, A, B, T1) is interpreted to mean "add A and B and place the result in temporary T1," ( +, C, D, T2) is interpreted to mean "add C and D and place this result in T2," and ( *, T1, T2, T3) is interpreted to mean "multiply T1 and T2 and place the result in T3." The exact form of the intermediate source language used depends, to a large extent, on how it is to be processed during the synthesis task. An infix expression may be converted to an intermediate form called Polish notation. Using this approach, the infix expression (A + B)*(C + D) would be converted to the equivalent suffix-Polish expression AB + CD + *. The latter expression contains the arithmetic operators in the order in which they are to be executed. Another type of intermediate form of source language can be a syntax tree which represents the parse of the source program. (See Chap. 10.) The output of the semantic analyzer is passed on to the code generator. At this point the intermediate form of the source-language program is usually translated to either assembly language or machine language. As an example, the translation of the three quadruples for the previous expression can yield the following sequence of single-address, single-accumulator assembly-language instructions:

LDA A    Load the contents of A into the accumulator.
ADD B    Add the contents of B to that of the accumulator.
STO T1   Store the accumulator contents in temporary storage T1.
LDA C    Load the contents of C into the accumulator.
ADD D    Add the contents of D to that of the accumulator.
STO T2   Store the accumulator contents in temporary storage T2.
LDA T1   Load the contents of T1 into the accumulator.
MUL T2   Multiply the contents of T2 by that of the accumulator.
STO T3   Store the accumulator contents in temporary storage T3.

The topic of code generation is presented in Chap. 11. The output of the code generator is passed on to a code optimizer. This process is present in more sophisticated compilers. Its purpose is to produce a more efficient object program. Certain optimizations that are possible at a local level include the evaluation of constant expressions, the use of certain operator properties such as associativity, commutativity, and distributivity, and the detection of common subexpressions. Because of the commutativity of the multiplication operator, the previous assembly code can be reduced to the following:

LDA A
ADD B
STO T1
LDA C
ADD D
MUL T1
STO T2
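The path from the suffix-Polish form through quadruples to the unoptimized nine-instruction sequence can be sketched in a few lines of Python. The temporary-naming scheme (T1, T2, ...) follows the text; the load/compute/store pattern deliberately makes no attempt to remember what is already in the accumulator, which is precisely the kind of improvement the code optimizer then makes:

```python
def quads_from_postfix(postfix):
    """Build quadruples (op, arg1, arg2, result) from a suffix-Polish
    string whose operands are single letters, using an operand stack."""
    stack, quads, temps = [], [], 0
    for sym in postfix:
        if sym.isalpha():
            stack.append(sym)
        else:
            temps += 1
            b, a = stack.pop(), stack.pop()   # right operand is on top
            result = f"T{temps}"
            quads.append((sym, a, b, result))
            stack.append(result)
    return quads

def gen_code(quads):
    """Naive single-accumulator code: load, operate, store per quadruple."""
    op_map = {"+": "ADD", "*": "MUL"}
    code = []
    for op, a, b, result in quads:
        code += [f"LDA {a}", f"{op_map[op]} {b}", f"STO {result}"]
    return code

quads = quads_from_postfix("AB+CD+*")
print(quads)           # the three quadruples for (A + B)*(C + D)
print(gen_code(quads)) # the nine unoptimized instructions
```

Running this on AB + CD + * reproduces the three quadruples and the nine instructions shown above; eliminating the redundant STO T1 / LDA T1 pair is left to the optimizer.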

Note that the expression evaluated by this optimized sequence is (C + D)*(A + B). More global optimizations can also be performed. These include the single evaluation of multiple occurrences of identical subexpressions and removing


statements that are invariant inside a loop and placing them outside of that loop. These kinds of optimizations are machine-independent. There are, however, certain machine-dependent optimizations which can also be performed. Optimal register allocation is an example of this kind of optimization. A good code optimizer can produce as good or even better code than an experienced assembly-language programmer. Optimization techniques are the topic of Chaps. 12 and 13. The model of a compiler given in Fig. 1-3 makes distinctions between its successive phases. In certain compilers, however, some of these phases are combined. In our discussion of this model the details are omitted on how the five processes interact with each other. Let us examine some possible interactions between the lexical and syntax analyzers. One possibility is that the scanner generates a token for the syntax analyzer for processing. The syntax analyzer then "calls" the scanner when the next token is required. Another possibility is for the scanner to produce all the tokens corresponding to the source program before passing control to the syntax analyzer. In this case the scanner has examined the entire source program; this is called a separate pass. Some compilers make as few as one pass while other compilers have been known to make more than 30 passes (e.g., some of IBM's first PL/I compilers). Factors which influence the number of passes to be used in a particular compiler include the following:

1. Available memory 2. Speed and size of compiler 3. Speed and size of object program 4. Debugging features required 5. Error-detection and -recovery techniques desired 6. Number of people and time required to accomplish the compiler writing project

Several educationally oriented or student compilers such as WATFIV and PL/C are essentially one-pass compilers. In these compilers very little (if any) code optimization is performed. The reason for this is the belief that programs will be compiled more times than executed. Such programs are often executed once and discarded. Also, the semantic-analyzer and code-generation phases are invariably combined into one phase. Great emphasis is placed on debugging, error detection, error recovery, and diagnostic capabilities. Certain source languages, however, cannot be compiled in a single pass. Compilers for PL/I and FORTRAN often contain many passes. In particular, the optimization phase may require that several passes of the source program (or its intermediate form) be made. In addition, the optimization phase is often spread throughout the other passes of the compilation process (e.g., certain constant folding, as described in Chap. 12, has been incorporated in the lexical analyzer of some compilers). Other combinations are possible. Sometimes the syntax-analyzer, semantic-analyzer, and code-generation phases can be combined into a single phase.


An aspect of compiler construction which was omitted in Fig. 1-3 deals with the area of error detection and error recovery. Error recovery is most closely associated with the syntax-analysis phase. The aim of error recovery is to prolong the compilation life of the program as long as possible before the compiler "gives up" on the source program. A strategy that is often adopted is that, if at all possible, the object program is executed. This approach can reduce the number of times a source program has to be compiled before becoming error-free. Some error-recovery approaches insert and/or discard tokens in attempting to correct the portion of the source program which is syntactically incorrect. We attach significant importance to this topic in this book. In particular, Chap. 5 introduces the general notions of error detection and error recovery. Also, particular strategies for different parsing techniques are illustrated in Chaps. 6 and 7. Another aspect of compiling which has been ignored in the current discussion deals with the various symbol tables that must be created and maintained during the generation of the required object program. The topic of symbol tables is presented in Chap. 8. Also, certain language constructs in the source language imply that several data structures must be stored and maintained in the memory of the computer. The design and implementation of these structures is called run-time organization and implementation or storage management. For block-structured languages, string-oriented languages such as SNOBOL, and languages which permit the declaration and manipulation of programmer-defined data structures, the run-time considerations are very significant and complex. Details of these problems are dealt with in Chap. 9. In closing, a comment on the relative difficulty of implementing each of the phases in Fig. 1-3 is in order. The lexical-analysis phase is perhaps the most straightforward.
Next in difficulty comes the syntactic-analysis or parsing phase. A lot of research, based on formal language theory, has been done on these two phases. As a result, the scanning and parsing phases can be largely automated. The theory of scanning is developed in Chap. 4. Chapters 6 and 7, on the other hand, describe the theory of parsing. The real difficulties in compiler construction are in the semantic-analysis, code-generation, and code-optimization phases. Another problem area is the portability of programs. These difficulties, which are often language- and machine-dependent, are examined in Chaps. 10, 11, 12, and 13. Several attempts have been made at automating compiler construction. Some of these efforts have enjoyed limited success. These translator-writing or compiler-compiler systems are briefly discussed in Chap. 14.


BIBLIOGRAPHY

Aho, A. V., and J. D. Ullman: Principles of Compiler Design, Reading, Mass.: Addison-Wesley, 1977.
Barrett, W. A., and J. D. Couch: Compiler Construction: Theory and Practice, Chicago: Science Research Associates, Inc., 1979.
Calingaert, P.: Assemblers, Compilers and Program Translators, Rockville, Md.: Computer Science Press, Inc., 1979.
Cocke, J., and J. T. Schwartz: Programming Languages and Their Compilers, Courant Institute of Mathematical Sciences, New York University, April 1970.
Gries, D.: Compiler Construction for Digital Computers, New York: John Wiley & Sons, 1971.
Knuth, D. E.: "A History of Writing Compilers," Computers and Automation, Vol. 11, No. 12, December 1962, pp. 8-14.
Lewis, P. M. II, D. J. Rosenkrantz, and R. E. Stearns: Compiler Design Theory, Reading, Mass.: Addison-Wesley, 1976.
Pollack, B. W. (ed.): Compiler Techniques, Princeton, N.J.: Auerbach, 1972.
Rosen, S.: Programming Systems and Languages, New York: McGraw-Hill, 1967.
Tremblay, J. P., and P. G. Sorenson: An Implementation Guide to Compiler Writing, New York: McGraw-Hill, 1982.



This chapter covers basic material which we will use extensively throughout the remainder of this book. Rather than spreading this material over various chapters of the text, thereby forcing the reader to read at least those parts of each chapter where the notions are introduced, we felt that a better alternative would be to put in this chapter basic fundamental material which is common to most of the remaining chapters. In this way the text can be read in a modular manner, thus allowing the reader to select certain chapters without loss of continuity. The first section deals with basic material concerning sets and strings. Section 2-2 introduces the notion of a grammar and its use in describing the syntax of a language. In the third section, a classification of grammars is given and the classification is shown to form a hierarchy. The material dealing with context-free grammars of Sec. 2-4 introduces basic notions such as parsing, syntax tree, and ambiguity. The problem of specifying the syntax of a programming language by the use of a metalanguage is discussed at some length. The material in Secs. 2-1, 2-2, 2-4.1, and 2-4.2 is drawn primarily from Tremblay and Manohar (1975). Readers who are familiar with the notions of sets, relations, and strings can proceed to Sec. 2-2. Readers who have been exposed to grammars can begin at Sec. 2-4.




A programming language consists of a set of programs (or strings). As we will see shortly, it is convenient to use a grammar as a formal vehicle for generating these programs. Several relations can be defined on the rules of a grammar, and these relations can lead to efficient compilation algorithms for the language associated with that grammar. Consequently, the purpose of this section is to give a brief introduction to the basic elements of sets and strings. In particular, the concepts of sets and operations on sets are given in Sec. 2-1.1. The notion of a relation is introduced in Sec. 2-1.2. Other important aspects about relations which are introduced in this subsection include the graph and transitive closure of a relation. The section ends with an elementary treatment of strings.

2-1.1 Basic Concepts of Set Theory

By a set we mean a collection of objects of any sort. The word "object" is used here in a very broad sense to include even abstract objects. A fundamental concept of set theory is that of membership or belonging to a set. Any object belonging to a set is called a member or an element of that set. If an element p belongs to a set A, we write p ∈ A, which is read as "p is an element of the set A" or "p belongs to the set A." If there exists an object q which does not belong to the set A, we express this fact as q ∉ A. There are many ways of specifying a set. For example, a set consisting of the decimal digits is generally written as

{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}

The names of the elements are enclosed in braces and separated by commas. If we wish to denote this set as D, we write

D = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}

where the equality sign indicates that D is the set {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}. This method of specifying a set is not always convenient. For example, the set D can be more easily described by using a predicate as follows:

D = {x | x is a decimal digit}

where the symbol | is read as "such that." In this case, the predicate is "is a decimal digit." If we let P(x) denote any predicate, then {x | P(x)} defines a set and is read "the set of all x such that P(x) is true." An element a belongs to this set if P(a) is true; otherwise a does not belong to the set. If we denote the set {x | P(x)} by B, then B = {x | P(x)}. Sets which are specified by listing their elements can also be characterized by means of a predicate. For example, the set {a, b, 1, 9} can be defined as

{x | (x = a) ∨ (x = b) ∨ (x = 1) ∨ (x = 9)}

where the symbol ∨ denotes a logical disjunction (or). A predicate can also contain the logical connectives ∧ (logical conjunction or and), ¬ (logical negation), → (if then), and ↔ (if and only if), and the relational operators <, ≤, =, ≠, >, and ≥. Furthermore, a predicate can also contain the existential quantifier (∃, meaning "there exists") and the universal quantifier (∀, representing "for every"). Although it is possible to characterize any set by a predicate, it is sometimes convenient to specify sets by another method, such as

C = {1, 3, 5, ...}    S = {a, a², a³, ...}

In this representation the missing elements can be determined from the elements present and from the context. The number of distinct elements present in a set may be finite or infinite. We call a set finite if it contains a finite number of distinguishable elements; otherwise a set is infinite. Note that no restriction has been placed on the objects that can be members of a set. It is not unusual to have sets whose members are themselves sets, such as A = {0, {a, b}, 1, {p}}. It is important, however, to distinguish between the set {p}, which is an element of A, and the element p, which is a member of {p} but not a member of A. For any two sets A and B, if every element of A is an element of B, then A is called a subset of B, or A is said to be included in B, or B includes A. Symbolically, this relation is denoted by A ⊆ B, or equivalently by B ⊇ A. For any two sets A and B, note that A ⊆ B does not necessarily imply that B ⊆ A except for the following case. Two sets A and B are said to be equal iff (if and only if) A ⊆ B and B ⊆ A, so that A = B. The set A is called a proper subset of a set B if A ⊆ B and A ≠ B. This relation is represented by A ⊂ B. We now introduce two special sets; the first includes every set under discussion while the second is included in every set under discussion. A set is called a universal set (E) if it includes every set under discussion. For example, the universal set for natural numbers may be E = {0, 1, 2, ...}. A set which does not contain any element is called an empty set or a null set. An empty set will be denoted by ∅. Given any set A, we know that the null set ∅ and the set A are both subsets of A. Also for any element p ∈ A, the set {p} is a subset of A. Similarly, we can consider other subsets of A. Rather than finding individual subsets of A, we would like to say something about the set of all subsets of A. For any set A, a collection or family of all subsets of A is called the power set of A.
The power set of A is denoted by ρ(A) or 2^A, so that ρ(A) = 2^A = {x | x ⊆ A}. We now introduce the concept of an indexed set. Let J = {s₁, s₂, s₃, ...} and let A be a family of sets A = {A_s₁, A_s₂, A_s₃, ...} such that for any sᵢ ∈ J there corresponds a set A_sᵢ ∈ A, and also A_sᵢ = A_sⱼ iff sᵢ = sⱼ. A is then called an indexed set, J the index set, and any subscript such as the sᵢ in A_sᵢ is called an index.


An indexed family of sets can also be written as A = {Aᵢ}, i ∈ J. In particular, if J = {1, 2, 3, ...}, then A = {A₁, A₂, A₃, ...}. Also, if J = {1, 2, 3, ..., n}, then A = {A₁, A₂, A₃, ..., Aₙ} = {Aᵢ}, i ∈ Iₙ, where Iₙ = {1, 2, ..., n}. For a set S containing n elements, the power set ρ(S) is written as the indexed set ρ(S) = {Bᵢ}, i ∈ J, where J = {0, 1, 2, ..., 2ⁿ − 1}. Let us now consider some operations on sets. In particular, we emphasize the set operations of intersection, union, and complement. Using these operations, one can construct new sets by combining the elements of given sets. The intersection of any two sets A and B, written as A ∩ B, is the set consisting of all elements which belong to both A and B. Symbolically,

A ∩ B = {x | x ∈ A ∧ x ∈ B}

For any indexed set A = {Aᵢ}, i ∈ J,

∩_{i∈J} Aᵢ = {x | x ∈ Aᵢ for every i ∈ J}

For J = Iₙ = {1, 2, ..., n}, we can write

∩_{i∈Iₙ} Aᵢ = ∩_{i=1}^{n} Aᵢ = A₁ ∩ A₂ ∩ ... ∩ Aₙ

Two sets A and B are called disjoint iff A ∩ B = ∅. A collection of sets is called a disjoint collection if, for every possible pair of sets in the collection, the two sets are disjoint. The elements of a disjoint collection are said to be mutually disjoint. Let A be an indexed set A = {Aᵢ}, i ∈ J. The set A is a disjoint collection iff Aᵢ ∩ Aⱼ = ∅ for all i, j ∈ J, i ≠ j. For any two sets A and B, the union of A and B, written as A ∪ B, is the set of all elements which are members of the set A or the set B or both. Symbolically, it is written as

A ∪ B = {x | x ∈ A ∨ x ∈ B}

For any indexed set A = {Aᵢ}, i ∈ J,

∪_{i∈J} Aᵢ = {x | x ∈ Aᵢ for at least one i ∈ J}

For J = Iₙ = {1, 2, 3, ..., n}, we may write

∪_{i=1}^{n} Aᵢ = A₁ ∪ A₂ ∪ ... ∪ Aₙ
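These union and intersection operations, including their indexed forms, behave exactly like Python's set operators, which gives a quick way to experiment with them. The particular sets below are arbitrary illustrative choices, not examples from the text:

```python
from functools import reduce

# Two predicate-defined sets over a small universe.
A = {x for x in range(10) if x % 2 == 0}   # {x | x is even and 0 <= x < 10}
B = {x for x in range(10) if x % 3 == 0}   # {x | x is divisible by 3}

print(sorted(A & B))   # A ∩ B -> [0, 6]
print(sorted(A | B))   # A ∪ B -> [0, 2, 3, 4, 6, 8, 9]

# Indexed forms over a family {A1, A2, A3}: fold the binary operation
# across the family, exactly as the recursive definitions suggest.
family = [{1, 2, 3}, {2, 3, 4}, {2, 5}]
print(sorted(reduce(set.intersection, family)))   # -> [2]
print(sorted(reduce(set.union, family)))          # -> [1, 2, 3, 4, 5]
```

The `reduce` calls mirror the indexed definitions: the n-fold intersection and union are just repeated applications of the two-set operations.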

The relative complement of B in A (or of B with respect to A), written as A − B, is the set consisting of all elements of A which are not elements of B, that is,

A − B = {x | x ∈ A ∧ x ∉ B}

The relative complement of B in A is also called the difference of A and B. The relative complement of A with respect to the universal set E, that is, E − A, is called the absolute complement of A. So far we have been concerned with sets, their equality, and operations on sets to form new sets. We now introduce the notion of an ordered pair. An ordered pair consists of two objects in a given fixed order. Note that an ordered pair is not a set consisting of two elements. The ordering of the two objects is important. We denote an ordered pair by (x, y).


The equality of two ordered pairs (x, y) and (u, v) is defined by

(x, y) = (u, v) ⇔ ((x = u) ∧ (y = v))

where the symbol "⇔" denotes logical equivalence; that is, A ⇔ B means that A is equivalent to B. At this point we want to also define the term "imply." A is said to imply B, written A ⇒ B, if and only if A → B is always true. In mathematics, the notations A → B and A ⇒ B are often used interchangeably. The idea of an ordered pair can be extended to define an ordered triple and, more generally, an n-tuple. We write an n-tuple as (x₁, x₂, ..., xₙ). An important idea in set theory is the notion of a cartesian product. Given two sets A and B, the set of all ordered pairs such that the first member of the ordered pair is an element of A and the second member is an element of B is called the cartesian product of A and B and is written as A × B. Accordingly

A × B = {(x, y) | (x ∈ A) ∧ (y ∈ B)}
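The cartesian product can be checked mechanically; Python's itertools.product yields exactly the set of ordered pairs (x, y) with x ∈ A and y ∈ B:

```python
from itertools import product

A, B = {1, 3}, {3, 4}

# A × B = {(x, y) | x ∈ A ∧ y ∈ B}
AxB = set(product(A, B))

print(sorted(AxB))   # [(1, 3), (1, 4), (3, 3), (3, 4)]
```

Because the pairs are ordered, (3, 4) and (4, 3) are distinct objects, which is why A × B and B × A generally differ.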

This notion of cartesian product can be extended to any finite number of sets. Let A = {Aᵢ}, i ∈ Iₙ, be an indexed set and Iₙ = {1, 2, ..., n}. We denote the cartesian product of the sets A₁, A₂, ..., Aₙ by

×_{i∈Iₙ} Aᵢ = A₁ × A₂ × ... × Aₙ

which is defined recursively as

×_{i∈I₁} Aᵢ = A₁
×_{i∈Iₘ} Aᵢ = (×_{i∈Iₘ₋₁} Aᵢ) × Aₘ    for m = 2, 3, ..., n

Our definition of cartesian product of n sets is related to the definition of n-tuples in the sense that

A₁ × A₂ × ... × Aₙ = {(x₁, x₂, ..., xₙ) | (x₁ ∈ A₁) ∧ (x₂ ∈ A₂) ∧ ... ∧ (xₙ ∈ Aₙ)}

As an example, consider the following sets:

A₁ = {1, 3},    A₂ = {3, 4},    A₃ = {1, 3, 4, 6}

Several set operations are exhibited in the following:

A₁ ∪ A₂ = {1, 3, 4}
A₁ ∩ A₃ = {1, 3}
∪_{i=1}^{3} Aᵢ = {1, 3, 4, 6}
∩_{i=1}^{3} Aᵢ = {3}
A₃ − A₁ = {4, 6}
ρ(A₁) = {∅, {1}, {3}, {1, 3}}
A₁ × A₂ = {(1, 3), (1, 4), (3, 3), (3, 4)}
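These results can be replayed with Python's built-in set type as a quick sanity check; the power set ρ(A₁) is written out with frozenset members, since ordinary sets cannot contain other sets:

```python
from functools import reduce

A1, A2, A3 = {1, 3}, {3, 4}, {1, 3, 4, 6}

assert A1 | A2 == {1, 3, 4}                               # A1 ∪ A2
assert A1 & A3 == {1, 3}                                  # A1 ∩ A3
assert reduce(set.union, [A1, A2, A3]) == {1, 3, 4, 6}    # indexed union
assert reduce(set.intersection, [A1, A2, A3]) == {3}      # indexed intersection
assert A3 - A1 == {4, 6}                                  # relative complement
assert {(x, y) for x in A1 for y in A2} == {(1, 3), (1, 4), (3, 3), (3, 4)}

# ρ(A1) = {∅, {1}, {3}, {1, 3}}, represented as a set of frozensets
power_A1 = {frozenset(), frozenset({1}), frozenset({3}), frozenset({1, 3})}
assert len(power_A1) == 2 ** len(A1)

print("all results from the example verified")
```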


In this section we have introduced the basic concepts of set theory. We now turn our attention to the notions of relations and orderings.


EXERCISES 2-1.1

1 Give another description of the following sets and indicate those which are infinite sets.
(a) {x | x is an integer and 5 ≤ x ≤ 12}
(b) {2, 4, 8, ...}
(c) All the countries of the world
2 Given S = {2, a, {3}, 4} and R = {{a}, 3, 4, 1}, indicate whether the following are true or false.
(a) {a} ∈ S
(b) {a} ∈ R
(c) {a, 4, {3}} ⊆ S
(d) {{a}, 1, 3, 4} ⊂ R
(e) R = S
(f) {a} ⊆ S
(g) {a} ⊆ R
(h) ∅ ⊂ R
(i) ∅ ⊆ E
(j) {∅} ⊆ S
(k) ∅ ∈ R
(l) ∅ ⊆ {{3}, 4}
3 Show that (R ⊆ S) ∧ (S ⊆ Q) ⇒ R ⊆ Q is always true. Is it correct to replace R ⊆ Q by R ⊂ Q? Explain your answer.
4 Give the power sets of the following.
(a) {a, {b}}
(b) {1, ∅}
(c) {X, Y, Z}
5 Given A = {x | x is an integer and 1 ≤ x ≤ 5}, B = {3, 4, 5, 17}, and C = {1, 2, 3, ...}, find A ∩ B, A ∩ C, A ∪ B, and A ∪ C.
6 Show that A ⊆ A ∪ B and A ∩ B ⊆ A.
7 Show that A ⊆ B ⇔ A ∪ B = B.
8 If S = {a, b, c}, find nonempty disjoint sets A₁ and A₂ such that A₁ ∪ A₂ = S. Find other solutions to this problem.
9 The symmetric difference of two sets A and B is the set A + B defined by A + B = (A − B) ∪ (B − A). Given A = {2, 3, 4}, B = {1, 2}, and C = {4, 5, 6}, find A + B, B + C, A + B + C, and (A + B) + (B + C). Show that A + B + C is associative.
10 Give examples of sets A, B, C such that A ∪ B = A ∪ C, but B ≠ C.
11 Write the members of {a, b} × {1, 2, 3}.
12 Write A × B × C, B², A³, B² × A, and A × B where A = {1}, B = {a, b}, and C = {2, 3}.
13 Show by means of an example that A × B ≠ B × A and (A × B) × C ≠ A × (B × C).
14 Show that for any two sets A and B

ρ(A) ∪ ρ(B) ⊆ ρ(A ∪ B)
ρ(A) ∩ ρ(B) = ρ(A ∩ B)

Show by means of an example that

ρ(A) ∪ ρ(B) ≠ ρ(A ∪ B)


15 Prove the identities

16 Show that A × (B ∩ C) = (A × B) ∩ (A × C).
17 Show that A × B = B × A ⇔ (A = ∅) ∨ (B = ∅) ∨ (A = B).
18 Show that (A ∩ B) ∪ C = A ∩ (B ∪ C) iff C ⊆ A.
19 Show that (A − B) − C = (A − C) − (B − C).
20 Prove that (A ∩ B) × (C ∩ D) = (A × C) ∩ (B × D).

2-1.2 Relations

The concept of a relation is a basic concept in mathematics as well as in everyday life. Associated with a relation is the act of comparing objects which are related to one another. The ability of a computer to perform different tasks based upon the result of a comparison is an important attribute used several times during the execution of a typical program. In this subsection we first formalize the concept of a relation and then discuss methods of representing a relation by using a matrix or its graph. The relation matrix is useful in determining the properties of a relation and also in representing a relation on a computer. Various basic properties of a relation are given, and certain important classes of relations are introduced. The word "relation" suggests some familiar examples of relations such as the relation of father to son, sister to brother, or uncle to nephew. Familiar examples in arithmetic are relations such as less than, greater than, or that of equality between two real numbers. We also know the relation between the area of an equilateral triangle and the length of one of its sides and the area of a square and the length of a side. These examples suggest relationships between two objects. Throughout the discussion we consider relations, called binary relations, between a pair of objects. Any set of ordered pairs defines a binary relation. We call a binary relation simply a relation. It is sometimes convenient to express a particular ordered pair, say, (x, y) ∈ R, where R is a relation, as xRy. In mathematics, relations are often denoted by special symbols rather than by capital letters. A familiar example is the relation "less than" for real numbers. This relation is denoted by <. In fact, < should be considered as the name of a set whose elements are ordered pairs. More precisely, the relation < is

< = {(x, y) | x, y are real numbers and x is less than y}

Let S be a binary relation. The set D(S) of all objects x such that for some y, (x, y) ∈ S is called the domain of S, that is,

D(S) = {x | (∃y)((x, y) ∈ S)}

where the symbol ∃ denotes existential quantification. Similarly, the set R(S) of all objects y such that for some x, (x, y) ∈ S is called the range of S, that is,

R(S) = {y | (∃x)((x, y) ∈ S)}
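Because a relation is simply a set of ordered pairs, its domain and range fall out of two comprehensions. The relation S below is a made-up example, not one from the text:

```python
# A small binary relation as a set of ordered pairs.
S = {(1, "a"), (2, "a"), (2, "b")}

D_S = {x for (x, y) in S}   # domain: all first coordinates
R_S = {y for (x, y) in S}   # range: all second coordinates

print(sorted(D_S), sorted(R_S))   # [1, 2] ['a', 'b']
```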


Since a relation has been defined as a set of ordered pairs, it is therefore possible to apply the usual operations of sets to relations as well. The resulting sets will also be ordered pairs and will define some relations. If R and S denote two relations, then R ∩ S defines a relation such that

x(R ∩ S)y ⇔ xRy ∧ xSy

Similarly, R ∪ S is a relation such that

x(R ∪ S)y ⇔ xRy ∨ xSy

and R − S is a relation such that

x(R − S)y ⇔ xRy ∧ xS̸y

where xS̸y denotes that x is not related to y in the relation S. A binary relation R in a set X is reflexive if, for every x ∈ X, xRx, that is, (x, x) ∈ R. The relation R is symmetric if, for every x and y in X, whenever xRy, then yRx. R is said to be transitive if, for every x, y, and z in X, whenever xRy and yRz, then xRz. R is irreflexive if, for every x ∈ X, (x, x) ∉ R. Finally, R is said to be antisymmetric if, for every x and y in X, whenever xRy and yRx, then x = y. Several important classes of relations having one or more of the properties given here will be discussed later in this subsection. A relation R from a finite set X to a finite set Y can also be represented by a matrix called the relation matrix of R. Let X = {x₁, x₂, ..., xₘ}, Y = {y₁, y₂, ..., yₙ}, and let R be a relation from X to Y. The relation matrix can be obtained by first constructing a table whose rows are labeled by successive elements of X and whose columns are labeled by successive elements of Y. If xᵢRyⱼ, then we enter a 1 in the ith row and jth column; if xᵢR̸yⱼ, then we enter a zero in the ith row and the jth column. As a special case, consider m = 2, n = 3, and R given by

R = {(x_1, y_1), (x_2, y_1), (x_2, y_3)}

The required table for R is Table 2-1. If we assume that the elements of X and Y appear in a certain order, then the relation R can be represented by a matrix whose elements are 1s and 0s. This matrix can be written down from the table constructed or can be defined in the

Table 2-1
        y_1   y_2   y_3
x_1      1     0     0
x_2      1     0     1


following manner:

r_ij = 1   if x_i R y_j
r_ij = 0   if x_i is not related to y_j

where r_ij is the element in the ith row and jth column. The matrix obtained in this way is the relation matrix. If X and Y have m and n elements, respectively, then the matrix is an m × n matrix. For the relation R just given, the relation matrix is

[ 1  0  0 ]
[ 1  0  1 ]


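The construction of a relation matrix and the property tests defined above are easy to mechanize. The following Python sketch is illustrative only; the function names and the sample relation are not from the text (the sample is the "divides" relation of Exercise 9):

```python
# Sketch (not from the text): representing a relation on a finite set
# as a 0/1 relation matrix and testing the properties defined above.

def relation_matrix(pairs, elems):
    """Build r with r[i][j] = 1 iff (elems[i], elems[j]) is in the relation."""
    index = {e: i for i, e in enumerate(elems)}
    n = len(elems)
    r = [[0] * n for _ in range(n)]
    for x, y in pairs:
        r[index[x]][index[y]] = 1
    return r

def is_reflexive(r):
    # every diagonal entry must be 1
    return all(r[i][i] == 1 for i in range(len(r)))

def is_symmetric(r):
    n = len(r)
    return all(r[i][j] == r[j][i] for i in range(n) for j in range(n))

def is_antisymmetric(r):
    # if r_ij = 1 then r_ji = 0 for i != j
    n = len(r)
    return all(not (r[i][j] == 1 and r[j][i] == 1)
               for i in range(n) for j in range(n) if i != j)

# The "divides" relation D on {1, 2, 3, 6}: reflexive, antisymmetric,
# but not symmetric (1 divides 2, yet 2 does not divide 1).
elems = [1, 2, 3, 6]
D = [(x, y) for x in elems for y in elems if y % x == 0]
mat = relation_matrix(D, elems)
print(is_reflexive(mat), is_symmetric(mat), is_antisymmetric(mat))  # True False True
```

Any finite relation can be checked the same way once its matrix has been built.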
One can not only write a relation matrix when a relation R is given but also obtain the relation if the relation matrix is given. Throughout the remainder of this subsection, unless otherwise stated, we will assume X = Y; that is, the relations are defined in a set X. Thus the relation matrices are square. A relation matrix reflects some of the properties of a relation in a set. If a relation is reflexive, all the diagonal entries must be 1. If a relation is symmetric, the relation matrix is symmetric. If a relation is antisymmetric, its matrix is such that if r_ij = 1, then r_ji = 0 for i ≠ j.
A relation can also be represented pictorially by drawing its graph. Although we shall introduce some of the concepts of graph theory which are discussed in later chapters, here we shall use graphs only as a tool to represent relations. Let R be a relation in a set X = {x_1, x_2, ..., x_n}. The elements of X are represented by points or circles called nodes. The nodes corresponding to x_i and x_j are labeled x_i and x_j, respectively. These nodes may also be called vertices. If x_i R x_j, that is, if (x_i, x_j) ∈ R, then we connect nodes x_i and x_j by an arc and put an arrow on this arc in the direction from x_i to x_j. When all the nodes corresponding to the ordered pairs in R are connected by arcs with proper arrows, we get a directed graph of the relation R. If x_i R x_j and x_j R x_i, then we draw two arcs between x_i and x_j. For the sake of simplicity, we may replace the two arcs by one arc with arrows pointing in both directions. If x_i R x_i, we get an arc which starts from node x_i and returns to x_i. Such an arc is called a sling. From the graph of a relation it is possible to observe some of its properties. Several examples are given in Fig. 2-1.
Another very important notion which is used in later chapters is that of a partition. Let S be a given set and A = {A_1, A_2, ..., A_n}, where each A_i, i = 1, 2, ..., n, is a subset of S and A_1 ∪ A_2 ∪ ... ∪ A_n = S; then the set A is called a covering of S, and the sets A_1, A_2, ..., A_n are said to cover S. If, in addition, the elements of A, which are subsets of S, are mutually disjoint, then A is called a partition of S, and the sets A_1, A_2, ..., A_n are called the blocks of the partition.
We now proceed to define an important class of relations which partitions a set. A relation R in a set X is called an equivalence relation if it is reflexive, symmetric, and transitive. If R is an equivalence relation in a set X, then D(R), the domain of R, is X itself. Therefore, R will be called a relation on X. Some

















Figure 2-1 Graphs of relations.

examples of equivalence relations are equality of numbers on a set of real numbers, similarity of triangles on a set of triangles, and the relation of lines being parallel on a set of lines in a plane. Another example of an equivalence relation, which will be used in the discussion of hashing functions (see Chap. 8), is that of equality in a modular number system. Let I denote the set of all positive integers, and let n be a positive integer. For x ∈ I and y ∈ I, we define R as

R = {(x, y) | x − y is divisible by n}

Note that "x − y is divisible by n" is equivalent to the statement that both x and y have the same remainder when each is divided by n.
Perhaps the most important idea about relations that is used in formal parsing techniques involves taking the "transitive closure" of a relation. We now develop this idea. Since a binary relation is a set of ordered pairs, the usual operations such as union and intersection on these sets produce other relations. We now consider another operation on relations: the formation of relations in two or more stages. Familiar examples of such relations are the relation of being a nephew (a brother's or sister's son), the relation of being an uncle (a father's or mother's brother), and the relation of being a grandfather (a father's or mother's father). These relations can be produced in the following manner. Let R be a relation from X to Y and S be a relation from Y to Z. Then the relation written as R ∘ S is called the composite relation of R and S, where

R ∘ S = {(x, z) | x ∈ X ∧ z ∈ Z ∧ (∃y)(y ∈ Y ∧ (x, y) ∈ R ∧ (y, z) ∈ S)}

The operation of obtaining R ∘ S from R and S is called the composition of relations.




Figure 2-2 Relations R, S, and R ∘ S.

Note that R ∘ S is empty if the intersection of the range of R and the domain of S is empty. R ∘ S is nonempty if there is at least one ordered pair (x, y) ∈ R such that the second member y ∈ Y of the ordered pair is a first member of an ordered pair in S. For the relation R ∘ S, the domain is a subset of X and the range is a subset of Z. In fact, the domain is a subset of the domain of R, and the range is a subset of the range of S. From the graphs of R and S one can easily construct the graph of R ∘ S. As an example, see Fig. 2-2.
The operation of composition is a binary operation on relations, and it produces a relation from two relations. The same operation can be applied again to produce other relations. For example, let R be a relation from X to Y, S a relation from Y to Z, and P a relation from Z to W. Then R ∘ S is a relation from X to Z. We can form (R ∘ S) ∘ P, which is a relation from X to W. Similarly, we can also form R ∘ (S ∘ P), which again is a relation from X to W. Let us assume that (R ∘ S) ∘ P is nonempty, and let (x, y) ∈ R, (y, z) ∈ S, and (z, w) ∈ P. This assumption means (x, z) ∈ R ∘ S and (x, w) ∈ (R ∘ S) ∘ P. Of course, (y, w) ∈ S ∘ P and (x, w) ∈ R ∘ (S ∘ P), which shows that

(R ∘ S) ∘ P = R ∘ (S ∘ P)

This result states that the operation of composition on relations is associative. We may therefore delete the parentheses and write

(R ∘ S) ∘ P = R ∘ (S ∘ P) = R ∘ S ∘ P

We know that the relation matrix of a relation R from a set X = {x_1, x_2, ..., x_m} to a set Y = {y_1, y_2, ..., y_n} is given by a matrix having m rows and n columns. We shall denote the relation matrix of R by M_R. M_R has entries which are 1s and 0s. Similarly, the relation matrix M_S of a relation S from the set Y to a set Z = {z_1, z_2, ..., z_p} is an n × p matrix. The relation matrix of R ∘ S can be obtained from the matrices M_R and M_S in the following manner. From the definition it is clear that (x_i, z_k) ∈ R ∘ S if there is at least one element of Y, say y_j, such that (x_i, y_j) ∈ R and (y_j, z_k) ∈ S. There may be more than one element of Y which has properties similar to those of y_j; for example, (x_i, y_r) ∈ R and (y_r, z_k) ∈ S. In all such cases, (x_i, z_k) ∈ R ∘ S. Thus when we scan the ith row of M_R along with the kth column of M_S and we come across at least one j such that the entries in the jth location of both the row and the column under consideration are 1s, then the entry in the ith row and kth column of the matrix of R ∘ S is 1; otherwise it is 0. Scanning a row of M_R along with every column of M_S gives one row of the matrix of R ∘ S. In this way, we can obtain all its rows. In general, let the relations A and B be represented by n × m and m × r matrices, respectively. Then the composition A ∘ B, which we denote by the relation matrix C, is expressed as

c_ij = (a_i1 ∧ b_1j) ∨ (a_i2 ∧ b_2j) ∨ ... ∨ (a_im ∧ b_mj)    i = 1, 2, ..., n; j = 1, 2, ..., r

where ∧ and ∨ indicate bit-ANDing (i.e., 1 ∧ 0 = 0 ∧ 1 = 0 ∧ 0 = 0, 1 ∧ 1 = 1) and bit-ORing (i.e., 1 ∨ 1 = 1 ∨ 0 = 0 ∨ 1 = 1, 0 ∨ 0 = 0), respectively.
Let us now consider some distinct relations R_1, R_2, R_3, R_4 in a set X = {a, b, c} given by

R_1 = {(a,b), (a,c), (c,b)}
R_2 = {(a,b), (b,c), (c,a)}
R_3 = {(a,b), (b,c), (c,c)}
R_4 = {(a,b), (b,a), (c,c)}

Denoting the composition of a relation with itself as

R ∘ R = R^2    R ∘ R ∘ R = R ∘ R^2 = R^3    ...    R ∘ R^(m-1) = R^m

let us write the powers of the given relations. Clearly

R_1^2 = {(a,b)}    R_1^3 = φ    R_1^4 = φ
R_2^2 = {(a,c), (b,a), (c,b)}    R_2^3 = {(a,a), (b,b), (c,c)}    R_2^4 = R_2    R_2^5 = R_2^2    R_2^6 = R_2^3
R_3^2 = {(a,c), (b,c), (c,c)} = R_3^3
R_4^2 = {(a,a), (b,b), (c,c)}    R_4^3 = R_4    R_4^4 = R_4^2    R_4^5 = R_4^3

Given a finite set X containing n elements, and a relation R in X, we can interpret R^m (m = 1, 2, ...) in terms of its graph. This interpretation is done for a number of applications throughout the text. With the help of such an interpretation, or from the examples given here, it is possible to say that there are at most n distinct powers of R; any R^m with m > n can be expressed in terms of R, R^2, ..., R^n. Our next step is to construct the relation in X given by

R^+ = R ∪ R^2 ∪ R^3 ∪ ...

Naturally, this construction requires only a finite number of powers of R to be calculated, and these calculations can easily be performed by using the matrix representation of the relation R and the Boolean multiplication of these matrices. Let us now see what the corresponding relations R_1^+, R_2^+, R_3^+, and R_4^+ are:

R_1^+ = R_1 ∪ R_1^2 ∪ R_1^3 ∪ ... = R_1
R_2^+ = R_2 ∪ R_2^2 ∪ R_2^3 ∪ ... = {(a,b), (b,c), (c,a), (a,c), (b,a), (c,b), (a,a), (b,b), (c,c)}
R_3^+ = {(a,b), (b,c), (c,c), (a,c)}
R_4^+ = {(a,b), (b,a), (c,c), (a,a), (b,b)}

Observe that the relations R_1^+, R_2^+, ..., R_4^+ are all transitive and that R_1 ⊆ R_1^+, R_2 ⊆ R_2^+, ..., R_4 ⊆ R_4^+. From the graphs of these relations one can easily see that R_i^+ is obtained from R_i (i = 1, 2, 3, 4) by adding only those ordered pairs to R_i such that R_i^+ is transitive. We now define R^+ in general.
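The powers and closures of R_1 through R_4 can be checked mechanically by composing relations represented as sets of pairs. The Python sketch below is illustrative and not from the text:

```python
# Sketch: composition of relations as sets of ordered pairs, powers,
# and the transitive closure R+ = R u R^2 u ... (at most |X| powers).

def compose(r, s):
    """R o S = {(x, z) | (x, y) in R and (y, z) in S for some y}."""
    return {(x, z) for (x, y1) in r for (y2, z) in s if y1 == y2}

def transitive_closure(r, n):
    """Union of the first n powers of r (n = number of elements suffices)."""
    power, closure = set(r), set(r)
    for _ in range(n - 1):
        power = compose(power, r)   # next power of r (powers of r commute)
        closure |= power
    return closure

R2 = {('a', 'b'), ('b', 'c'), ('c', 'a')}
R4 = {('a', 'b'), ('b', 'a'), ('c', 'c')}

print(compose(R2, R2) == {('a', 'c'), ('b', 'a'), ('c', 'b')})           # True
print(transitive_closure(R2, 3) == {(x, y) for x in 'abc' for y in 'abc'})  # True
print(transitive_closure(R4, 3) ==
      {('a', 'b'), ('b', 'a'), ('c', 'c'), ('a', 'a'), ('b', 'b')})      # True
```

The same union of powers is what Warshall's algorithm, given next, computes far more efficiently on the matrix representation.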

Definition 2-1 Let X be any finite set and R be a relation in X. The relation R^+ = R ∪ R^2 ∪ R^3 ∪ ... in X is called the transitive closure of R in X.

Theorem 2-1 The transitive closure R^+ of a relation R in a finite set X is transitive. Also, for any other transitive relation P in X such that R ⊆ P, we have R^+ ⊆ P. In this sense, R^+ is the smallest transitive relation containing R.

Based on this theorem, the transitive closure of the relation matrix for some relation A can easily be computed by using the following algorithm due to Warshall.

Procedure WARSHALL (A, n, P). Given the relation matrix A with n rows, the following steps produce the transitive closure of A, which is denoted by P. The variables i, j, and k are local integer variables.

1. [Initialize] P ← A
2. [Perform a pass] Repeat through step 4 for k = 1, 2, ..., n
3. [Process rows] Repeat step 4 for i = 1, 2, ..., n
4. [Process columns] Repeat for j = 1, 2, ..., n:
       P_ij ← P_ij ∨ (P_ik ∧ P_kj)
5. [Finished] Exit
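Procedure WARSHALL translates almost line for line into Python. The sketch below is illustrative (0/1 matrices as lists of lists; the names are not from the text):

```python
# Sketch of Warshall's algorithm: P starts as the relation matrix A,
# and pass k admits node k as an intermediate node on a path.

def warshall(a):
    n = len(a)
    p = [row[:] for row in a]          # step 1: P <- A
    for k in range(n):                 # step 2: one pass per node
        for i in range(n):             # step 3: process rows
            for j in range(n):         # step 4: P_ij <- P_ij v (P_ik ^ P_kj)
                p[i][j] = p[i][j] | (p[i][k] & p[k][j])
    return p

# Relation R2 = {(a,b), (b,c), (c,a)} on {a, b, c} is a cycle, so its
# transitive closure relates every element to every element.
a = [[0, 1, 0],
     [0, 0, 1],
     [1, 0, 0]]
print(warshall(a))  # [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
```

The three nested loops give the O(n^3) bit-operation cost that makes this the standard way of computing transitive closures in a compiler.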


To show that this algorithm produces the required matrix, note that step 1 produces a matrix in which P_ij = 1 if there is a path of length 1 from v_i to v_j. Assume that for a fixed k, the intermediate matrix P produced by steps 3 and 4 of the algorithm is such that the element in the ith row and jth column of this matrix is 1 if and only if there is a path from v_i to v_j which includes only nodes from v_1, v_2, ..., v_k as intermediate nodes. Now, with an updated value of k, we find that P_ij = 1 either if P_ij = 1 in an earlier step or if there is a path from v_i to v_j which passes through v_{k+1}. This means that P_ij = 1 if and only if there is a path from v_i to v_j which includes only nodes from v_1, v_2, ..., v_{k+1} as intermediate nodes.
Instances of transitive closures of relations will be given in Chaps. 6 and 7, where such relations are derived from a grammar and used in the parsing phase. Also, the transitive closure of a graph of a program can be used to perform certain optimizations in that program. This aspect of compiler writing is the topic of Chaps. 12 and 13.
In terminating this subsection we introduce the notion of the converse of a relation. Given a relation R from X to Y, a relation R̃ from Y to X is called the converse of R, where the ordered pairs of R̃ are obtained by interchanging the members in each of the ordered pairs of R. This means, for x ∈ X and y ∈ Y, that

xR̃y  ⇔  yRx

From the definition of R̃ it follows that the converse of R̃ is R itself. The relation matrix M_R̃ of R̃ can be obtained by simply interchanging the rows and columns of M_R. Such a matrix is called the transpose of M_R. Therefore,

M_R̃ = transpose of M_R

The graph of R̃ is also obtained from that of R by simply reversing the arrows on each arc. We now consider the converse of a composite relation. For this purpose, let R be a relation from X to Y and S be a relation from Y to Z. Obviously, R̃ is a relation from Y to X and S̃ is a relation from Z to Y; R ∘ S is a relation from X to Z, and its converse is a relation from Z to X. Also the relation S̃ ∘ R̃ is from Z to X. We now show that the converse of R ∘ S is S̃ ∘ R̃. If xRy and ySz, then x(R ∘ S)z, and z is related to x in the converse of R ∘ S. But zS̃y and yR̃x, so that z(S̃ ∘ R̃)x. This is true for any x ∈ X and z ∈ Z; hence the required result. The same rule can be expressed in terms of the relation matrices by saying that the transpose of the matrix of R ∘ S is the same as the matrix of S̃ ∘ R̃, which in turn can be obtained from the matrices M_S̃ and M_R̃.
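The rule that the converse of a composite relation is the composite of the converses in reverse order can be confirmed on a small example. The Python sketch below (illustrative data and names, not from the text) checks it:

```python
# Sketch: the converse of a relation, and the rule that the converse of
# R o S equals (converse of S) o (converse of R).

def converse(r):
    return {(y, x) for (x, y) in r}

def compose(r, s):
    return {(x, z) for (x, y1) in r for (y2, z) in s if y1 == y2}

R = {(1, 'a'), (2, 'a'), (2, 'b')}    # a relation from X to Y
S = {('a', 'u'), ('b', 'v')}          # a relation from Y to Z

print(converse(compose(R, S)) == compose(converse(S), converse(R)))  # True
```

On relation matrices the same check amounts to transposing the Boolean product of the two matrices.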


Later in the text, the concept of a relation will be applied to a grammar. In fact there are several interesting relations which can be obtained from a grammar. Many of the operations which are to be performed on these relations have been discussed in this subsection.


1 Give an example of a relation which is neither reflexive nor irreflexive.
2 Give an example of a relation which is both symmetric and antisymmetric.
3 If relations R and S are both reflexive, show that R ∪ S and R ∩ S are also reflexive.
4 If relations R and S are reflexive, symmetric, and transitive, show that R ∩ S is also reflexive, symmetric, and transitive.
5 Determine whether the following relations are transitive:

R_1 = {(1,1)}    R_2 = {(1,2), (2,2)}    R_3 = {(1,2), (2,3), (1,3), (2,1)}

6 Given S = {1,2,3,4} and a relation R on S defined by R = {(1,2), (4,3), (2,2), (2,1), (3,1)}

show that R is not transitive. Find a relation R_1 ⊇ R such that R_1 is transitive. Can you find another relation R_2 ⊇ R which is also transitive?
7 Given S = {1, 2, ..., 10} and a relation R on S where

R = {(x, y) | x + y = 10}

what are the properties of the relation?
8 Let R be a relation on the set of positive real numbers so that its graphical representation consists of points in the first quadrant of the cartesian plane. What can we expect if R is (a) reflexive, (b) symmetric, and (c) transitive?
9 Let L denote the relation "less than or equal to" and D denote the relation "divides," where xDy means "x divides y." Both L and D are defined on the set {1, 2, 3, 6}. Write L and D as sets, and find L ∩ D.
10 Show that the relations L and D given in Exercise 9 are both reflexive, antisymmetric, and transitive. Give another example of such a relation. Draw the graphs of these relations.
11 Let R denote a relation on the set of ordered pairs of positive integers such that (x, y) R (u, v) iff xv = yu. Show that R is an equivalence relation.
12 Given a set S = {1, 2, 3, 4, 5}, find the equivalence relation on S which generates the partition whose equivalence classes are the sets {1, 2}, {3}, and {4, 5}. Draw the graph of the relation.
13 Prove that the relation "congruence modulo m" given by

≡ = {(x, y) | x − y is divisible by m}

over the set of positive integers is an equivalence relation. Show also that if x_1 ≡ y_1 and x_2 ≡ y_2, then (x_1 + x_2) ≡ (y_1 + y_2).
14 Prove the following equivalences and equalities:
(a) the converse of R̃ is R
(b) R = S ⇔ R̃ = S̃
(c) R ⊆ S ⇔ R̃ ⊆ S̃
(d) the converse of R ∪ S is R̃ ∪ S̃
(e) the converse of R ∩ S is R̃ ∩ S̃
15 Show that if a relation R is reflexive, then R̃ is also reflexive. Show also that similar remarks hold if R is transitive, irreflexive, symmetric, or antisymmetric.
16 What nonzero entries are there in the relation matrix of R ∩ R̃ if R is an antisymmetric relation in a set X?
17 Given the relation matrix M_R of a relation R on the set {a, b, c}, find the relation matrices of R̃, R^2 = R ∘ R, R^3 = R ∘ R ∘ R, and R ∘ R̃.

M R = U r


18 Two equivalence relations R and S are given by their relation matrices M_R and M_S. Show that R ∘ S is not an equivalence relation.

MR=[i i


Ms = [

r n

Obtain equivalence relations R_1 and R_2 on {1, 2, 3} such that R_1 ∘ R_2 is also an equivalence relation.
19 Using Warshall's algorithm, obtain the transitive closure of the relation whose graph is given in Fig. 2-3.
20 For the graph of the relation R given in Fig. 2-4, determine M_R̃, M_R ∧ M_R̃, and M_R̃ ∧ M_R.




Figure 2-3




Figure 2-4


2-1.3 Strings

In this subsection we are concerned with some elementary properties of strings. There are many interesting properties exhibited by string operations, just as there are interesting properties for arithmetic operations over the natural numbers. To refamiliarize ourselves with some of the properties associated with operations, let us consider the operation of addition on the natural numbers. This operation can be represented, in general, by a functional system in two variables:

f(x, y) = x + y

where x and y are natural numbers. This system exhibits certain interesting properties. First, the sum of any two natural numbers is a natural number. This property is called closure. Closure is a necessary property for a system (i.e., a set and an operation on that set) to be classified as an algebraic system. Second, (x + y) + z = x + (y + z) = x + y + z when x, y, and z are natural numbers; accordingly, the operation of addition is said to be associative. Third, there exists a number i such that for every natural number x, x + i = x. This number is zero, and it is called the unit element or identity of the additive system. Many other important properties, such as distributivity and commutativity, exist when arithmetic operations such as addition and multiplication are applied to the set of natural numbers.
We begin a discussion of strings by formally defining a string. To do so, we must introduce the notion of an alphabet and the operation of concatenation. Simply stated, an alphabet V is a finite nonempty set of symbols. The set V = {a, b, c, ..., z} is a familiar example of an alphabet, and {α, β, γ, ε} is a four-character alphabet (which is a subalphabet of the Greek alphabet). The concatenation of two alphabetic characters, say 'a' and 'b', is said to form a sequence of characters, namely, 'ab'. (Note that henceforth when we refer to a character from an alphabet or a sequence of such characters, they are enclosed in single quote marks.) The operation of concatenation also applies to sequences of characters. For example, 'ab' concatenated with 'ab' is 'abab'. We denote the concatenation operator by the special symbol ∘. This allows us to write expressions such as 'ab' ∘ 'a', which is identical in value to 'aba'.
A string (or sentence) over an alphabet V is either a letter from the alphabet V or a sequence of letters derived from the concatenation of zero or more characters from the alphabet V. Examples of strings over an alphabet V = {a, b, c} are 'a', 'ca', 'ccba', and 'bbb'. Let V ∘ V = V^2 designate all strings of length 2 on V, V ∘ V ∘ V = V^2 ∘ V = V^3 designate all strings of length 3 on V, and, in general, V ∘ V ∘ ... ∘ V = V^n designate all strings of length n on V. Then the closure of V, denoted V^+, is defined as

V^+ = V ∪ V^2 ∪ V^3 ∪ ...

For completeness, a special string ε called the empty (or null) string is often combined with V^+ to form the closure set V* of V. That is, V* = {ε} ∪ V ∪ V^2 ∪ V^3 ∪ ... = {ε} ∪ V^+. The string ε has the identity property (i.e., x ∘ ε = ε ∘ x = x for any string x which is an element of V*), and it is called the identity element in the system formed by the set V* and the operation of concatenation. Associativity is another property of this system (that is, (x ∘ y) ∘ z = x ∘ (y ∘ z) = x ∘ y ∘ z for x, y, z ∈ V*). As an example, consider the set of strings V* that can be generated from an alphabet V = {x, y}. Some subsets of V* are

V^2 = {'xx', 'xy', 'yx', 'yy'}
V^3 = {'xxx', 'xxy', 'xyx', 'xyy', 'yxx', 'yxy', 'yyx', 'yyy'}
V^4 = {'xxxx', 'xxxy', 'xxyx', 'xxyy', 'xyxx', 'xyxy', 'xyyx', 'xyyy', 'yxxx', 'yxxy', 'yxyx', 'yxyy', 'yyxx', 'yyxy', 'yyyx', 'yyyy'}

We will, on many occasions, refer to the closure set V* of an alphabet.
As another example of a string, let us examine FORTRAN. The FORTRAN alphabet consists of 26 letters, 10 digits, and a set of special characters, such as '(', ')', ',', '=', and '+'. It is only these characters that are used in writing a FORTRAN program. Hence a program can be viewed as the concatenation of characters over an alphabet to yield an arbitrarily long string. In Chap. 4, we see that the problem of ensuring that only the proper set of characters appears in a program is handled by the scanning phase of a compiler.
Let x, y, and z be strings over an alphabet where z = xy. The string x is called a prefix or head of z. If y ≠ ε, then x is called a proper prefix or proper head. Similarly, y is called a suffix or tail of z, and if x ≠ ε, then y is called a proper suffix or proper tail. Some of the concepts introduced here are used in the next section, where the notions of grammars and languages are introduced.
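The sets V^n and the finite layers of V^+ can be enumerated directly. The following Python sketch is illustrative (the helper names are not from the text):

```python
# Sketch: concatenation closure of an alphabet. V^n is the set of all
# strings of length n over V; V+ is the union of the V^n for n >= 1.
from itertools import product

def power(v, n):
    """All strings of length n over alphabet v (the set V^n)."""
    return {''.join(p) for p in product(sorted(v), repeat=n)}

V = {'x', 'y'}
print(sorted(power(V, 2)))                 # ['xx', 'xy', 'yx', 'yy']
print(len(power(V, 3)), len(power(V, 4)))  # 8 16

# First three layers of V+; V* would add the empty string '' as identity.
v_plus_upto_3 = set().union(*(power(V, n) for n in range(1, 4)))
print(len(v_plus_upto_3))                  # 14
```

Note that V^+ itself is infinite; only finitely many layers can ever be materialized, which is precisely why grammars are needed as finite specifications of infinite languages.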


Programming languages must be precisely defined. Unfortunately, for some of the earlier programming languages the existence of a particular compiler finally provided the precise definition of such a language. The proper specification of a programming language involves the definition of the following:

1. The set of symbols (or alphabet) that can be used to construct correct programs
2. The set of all syntactically correct programs
3. The "meaning" of all syntactically correct programs

In this section we shall be concerned with the first two items in the specification of programming languages. A language L can be considered a subset of the closure (including the empty string) of an alphabet. The language consisting of this closure set is not particularly interesting, since it is too large. Our definition of a language L is a set of strings or sentences over some finite alphabet V_T, so that L ⊆ V_T*.
How can a language be represented? A language consists of a finite or an infinite set of sentences. Finite languages can be specified by exhaustively enumerating all their sentences. However, for infinite languages, such an enumeration is not possible. On the other hand, any means of specifying a language should be finite. One method of specification which satisfies this requirement uses a generative device called a grammar. A grammar consists of a finite nonempty set of rules or productions which specify the syntax of the language. As well, a grammar imposes structure on the sentences of a language. Many grammars may generate the same language but impose different structures on the sentences of that language. The study of grammars constitutes an important subarea of computer science called formal language theory. This area emerged in the mid-1950s as a result of the efforts of Noam Chomsky (1959), who gave a mathematical model of a grammar in connection with his study of natural languages. In 1960, the concept of a grammar became important to programmers because the syntax of ALGOL 60 was described by a grammar.
A second method of language specification is to have a machine, called an acceptor, determine whether a given sentence belongs to the language. This approach is discussed further in Chaps. 4, 6, and 7 along with some very interesting and important relationships that exist between grammars and acceptors. In this section we are concerned with a grammar as a mathematical system for defining languages and as a device for giving some useful structure to sentences in a language.
It was mentioned earlier that a grammar imposes a structure on the sentences of a language. For a sentence in English such a structure is described in terms of subject, predicate, phrase, noun, and so on.
On the other hand, for a program, the structure is given in terms of procedures, statements, expressions, etc. In any case, it may be desirable to describe all such structures and to obtain a set of all the correct or admissible sentences in a language. For example, we may have a set of correct sentences in English or a set of valid ALGOL programs. The grammatical structure of a language helps us determine whether a particular sentence does or does not belong to the set of correct sentences. The grammatical structure of a sentence is generally studied by analyzing the various parts of a sentence and their relationships to one another.
Consider the sentence "the cat ate a mouse." Its structure (or parse) is shown in Fig. 2-5. This diagram of a parse displays the syntax of a sentence in a manner similar to a tree and is therefore called a syntax tree. Each node in the diagram represents a phrase of the syntax. The words such as "the" and "cat" are the basic symbols, or primitives, of the language. The syntax of a small subset of the English language can be described by using the symbols

S: sentence

V: verb

O: object

A: article

N: noun

SP: subject phrase

VP: verb phrase

NP: noun phrase


in the following rules:

S → SP VP
SP → A N
VP → V O
O → NP
NP → A N
A → 'a'
A → 'the'
V → 'ate'
V → 'climbs'
N → 'mouse'
N → 'tree'
N → 'monkey'
N → 'banana'
N → 'cat'

These rules state that a sentence is composed of a "subject phrase" followed by a "verb phrase"; the "subject phrase" is composed of an "article" followed by a "noun"; a "verb phrase" is composed of a "verb" followed by an "object"; and so on. The structure of a language is discussed by using symbols such as "sentence," "verb," "subject phrase," and "verb phrase," which represent syntactic classes of elements. Each syntactic class consists of a number of alternative structures, and each structure consists of an ordered set of items which are either primitives (of the language) or syntactic classes. These alternative structures are called productions, rules of syntax, or replacement rules. For example, the production

S → SP VP

defines a "sentence" to be composed of a "subject phrase" followed by a "verb phrase." The symbol → separates the syntactic class "sentence" from its definition. The syntactic classes and the arrow symbol, along with the interpretation of a production, enable us to describe a language. A system or language which describes another language is known as a metalanguage. The metalanguage used to teach German at most universities is




sentence
    subject phrase
        article: the
        noun: cat
    verb phrase
        verb: ate
        object
            noun phrase
                article: a
                noun: mouse



Figure 2-5 Syntax tree for an English sentence.


English, while the metalanguage used to teach English is English. The diagram of the parse of a sentence describes its syntax but not its meaning or semantics. At this time, we are mainly concerned with the syntax of a language, and the device which we have just defined to give the syntactic definition of the language is called a grammar.
Using the grammatical rules for our example, we can either produce (generate) or derive a sentence in the language. A computer programmer is concerned with producing programs which adhere to the productions (grammatical rules) of the language. The compiler of a language, on the other hand, is faced with the problem of determining whether a given sentence (source program) is syntactically correct based upon the given grammatical rules. If the syntax is correct, the compiler produces object code.
Consider the problem of trying to generate or produce the sentence "the cat ate the mouse" from the set of productions given. It is accomplished by starting first with the syntactic class symbol S and looking for a production which has S to the left of the arrow. There is only one such production, namely,

S → SP VP

We have replaced the class S by its only possible composition. We then take the string SP VP and look for a production whose left-hand side is SP and replace it with the right-hand side of that production. The application of the only production possible produces the string

A N VP

We next look for a production whose left part is A, and two such productions are found. By selecting the production A → 'the' and substituting its right-hand side in the string A N VP, we obtain the string

the N VP

This enumerative process is continued until we arrive at the correct sentence. At this point, the sentence contains only primitive or terminal elements of the language (no classes). A complete derivation or generation of the sentence "the cat ate a mouse" is as follows:




S ⇒ SP VP
  ⇒ A N VP
  ⇒ the N VP
  ⇒ the cat VP
  ⇒ the cat V O
  ⇒ the cat ate O
  ⇒ the cat ate NP
  ⇒ the cat ate A N
  ⇒ the cat ate a N
  ⇒ the cat ate a mouse


Here the symbol ⇒ denotes that the string on the right-hand side of the symbol can be obtained by applying one replacement rule to the previous string.
The rules for the example language can produce a number of sentences. Examples of such sentences are

The cat ate a banana.
The monkey climbs a tree.
The monkey climbs the tree.
The banana ate a cat.

The last of these sentences, although grammatically correct, does not make sense because of its semantics. This situation is often allowed in the specification of languages. There are many valid FORTRAN and PASCAL programs that do not make sense. It is easier to define languages if certain sentences of questionable validity are allowed by the replacement rules.
There is another important point to note concerning the generation of strings (i.e., sentences) from a grammar. We have neglected (for legibility reasons) to put into the terminal symbols of the grammar the blank characters which normally appear between the words in the sentences. Throughout the book, we assume, unless it is obvious from the context of the text, that a blank delimits all terminal symbols.
The set of sentences that can be generated by the rules of the previous example is finite. Any interesting language usually consists of an infinite set of sentences. As a matter of fact, the importance of a finite device such as a grammar is that it permits the study of the structure of a language consisting of an infinite set of sentences. Let the symbols L, D, and I denote the classes L: letter, D: digit, and I: identifier. The productions which follow are recursive and produce an infinite set of names because the syntactic class I is present on both the left and the right sides of certain productions.

I → L        I → ID        I → IL
D → 0        D → 1





It is easily seen that the class I defines an infinite set of strings or names in which each name consists of a letter followed by any number of letters or digits. This set is a consequence of using recursion in the definition of the productions I → ID and I → IL. In fact, recursion is fundamental to the definition of an infinite language by the use of a grammar.
Let us now formalize the idea of a grammar and how it is used. For this purpose, let V_T be a finite nonempty set of symbols called the terminal alphabet. The symbols in V_T are called terminal symbols. The metalanguage which is used to generate strings in the language is assumed to contain a set of syntactic classes or variables called nonterminal symbols. The set of nonterminal symbols is denoted by V_N, and the elements of V_N are used to define the syntax (structure) of the language. Furthermore, the sets V_N and V_T are assumed to be disjoint. The set V_N ∪ V_T, consisting of nonterminal and terminal symbols, is called the vocabulary of the language. In much of the discussion in this section we use capital letters such as A, B, C, ..., X, Y, Z to denote nonterminal symbols, and S_1, S_2, ... to represent the elements of the vocabulary. The strings of terminal symbols are denoted by lowercase letters x, y, z, ..., while strings of symbols over the vocabulary are given by α, β, γ, ... . The length of a string α will be denoted by |α|.

Definition 2-2 A grammar is defined by a 4-tuple G = (V_N, V_T, S, Φ) where V_T and V_N are sets of terminal and nonterminal (syntactic class) symbols, respectively. S, a distinguished element of V_N and therefore of the vocabulary, is called the starting symbol. Φ is a finite nonempty subset of the relation from (V_T ∪ V_N)*V_N(V_T ∪ V_N)* to (V_T ∪ V_N)*. In general, an element (α, β) is written as α → β and is called a production rule or a rewriting rule.

For our example defining an identifier, we may write the grammar as G₁ = (V_N, V_T, S, Φ) in which

V_N = {I, L, D}
V_T = {a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
S = I
Φ = {I → L, I → IL, I → ID, L → a, L → b, ..., L → z, D → 0, D → 1, ..., D → 9}
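The 4-tuple of Definition 2-2 can also be written down as a data structure. The following Python sketch (the representation choices are ours) encodes G₁ with each production as a (left side, right side) pair:

```python
# Grammar G1 as a Python 4-tuple (V_N, V_T, S, Phi).
# Representing productions as (left, right) string pairs is our own choice.
import string

V_N = {"I", "L", "D"}
V_T = set(string.ascii_lowercase) | set(string.digits)
S = "I"
Phi = (
    [("I", "L"), ("I", "IL"), ("I", "ID")]
    + [("L", c) for c in string.ascii_lowercase]   # L -> a, ..., L -> z
    + [("D", d) for d in string.digits]            # D -> 0, ..., D -> 9
)

G1 = (V_N, V_T, S, Phi)
print(len(Phi))  # 3 + 26 + 10 = 39 productions
```

Even this small example shows why the "..." abbreviations in the text are convenient: written out in full, Φ contains 39 productions.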


Definition 2-3 Let G = (V_N, V_T, S, Φ) be a grammar. For σ, ψ ∈ V*, σ is said to be a direct derivative of ψ, written as ψ ⇒ σ, if there are strings φ₁ and φ₂ (including possibly empty strings) such that ψ = φ₁αφ₂ and σ = φ₁βφ₂, and α → β is a production of G.


If ψ ⇒ σ, we may also say that ψ directly produces σ or that σ directly reduces to ψ. For grammar G₁ of our example, we have listed in Table 2-2 some illustrations of direct derivations. The concepts can now be extended to produce a string σ not necessarily directly but in a number of steps from a string ψ.

36 THE THEORY AND PRACTICE OF COMPILER WRITING

Table 2-2
ψ        σ        Rule used    φ₁    φ₂
I        L        I → L        ε     ε
LL       Lx       L → x        L     ε
LDL      L1L      D → 1        L     L
LDDL     L2DL     D → 2        L     DL
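Definition 2-3 can be sketched as a function that enumerates every direct derivative of a string ψ, together with the decomposition φ₁, φ₂ used at each step (a Python sketch; the names are ours):

```python
def direct_derivatives(psi, productions):
    """All (sigma, rule, phi1, phi2) with psi => sigma in one step:
    psi = phi1 + alpha + phi2 and sigma = phi1 + beta + phi2
    for some production alpha -> beta (Definition 2-3)."""
    results = []
    for alpha, beta in productions:
        start = 0
        # find every occurrence of alpha in psi; each gives one derivative
        while (i := psi.find(alpha, start)) != -1:
            phi1, phi2 = psi[:i], psi[i + len(alpha):]
            results.append((phi1 + beta + phi2, (alpha, beta), phi1, phi2))
            start = i + 1
    return results

# Reproduce the LDL row of Table 2-2 with the rule D -> 1:
print(direct_derivatives("LDL", [("D", "1")]))
# [('L1L', ('D', '1'), 'L', 'L')]
```

Applying the rule D → 2 to LDDL gives two direct derivatives, one for each occurrence of D; Table 2-2 lists the one with φ₁ = L, φ₂ = DL.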

Definition 2-4 Let G = (V_N, V_T, S, Φ) be a grammar. The string ψ produces σ (σ reduces to ψ, or σ is the derivative of ψ), written as ψ ⇒⁺ σ, if there are strings φ₀, φ₁, ..., φₙ (n > 0) such that ψ = φ₀ ⇒ φ₁ ⇒ ... ⇒ φₙ = σ.

The relation ⇒⁺ is the transitive closure of the relation ⇒. If we let n ≥ 0, we can define the reflexive transitive closure ⇒* of ⇒ as

ψ ⇒* σ ⟺ ψ ⇒⁺ σ or ψ = σ

Returning to grammar G₁, we show that the string a13 is derived from I by following the derivation sequence

I ⇒ ID ⇒ IDD ⇒ LDD ⇒ aDD ⇒ a1D ⇒ a13

Note that as long as we have a nonterminal symbol in the string, we can produce a new string from it (assuming no rules of the form A → A). On the other hand, if a string contains only terminal symbols, then the derivation is complete, and we cannot produce any further strings from it.
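A derivation in the sense of Definition 2-4 can be replayed mechanically: start from a sentential form and apply one production per step. The sketch below (the helper name and the particular derivation chosen are ours; it rewrites the leftmost occurrence of the rule's left side) derives a13 from I in grammar G₁:

```python
# Replaying a derivation I =>+ a13 step by step (Definition 2-4).
# apply_rule is our own helper: it rewrites the leftmost occurrence
# of alpha, i.e. it picks phi1 as short as possible.

def apply_rule(psi, alpha, beta):
    i = psi.find(alpha)
    assert i != -1, f"{alpha} does not occur in {psi}"
    return psi[:i] + beta + psi[i + len(alpha):]

# One valid derivation sequence for a13 in G1:
steps = [("I", "ID"), ("I", "ID"), ("I", "L"),
         ("L", "a"), ("D", "1"), ("D", "3")]
form = "I"
for alpha, beta in steps:
    form = apply_rule(form, alpha, beta)
    print(form)
# Prints the successive sentential forms:
# ID, IDD, LDD, aDD, a1D, a13
```

Each intermediate string here is a sentential form in the sense of Definition 2-5; the final string a13 contains only terminal symbols, so the derivation is complete.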

Definition 2-5 A sentential form is any derivative of the unique nonterminal symbol S. The language L generat