Principal
The Theory and Practice of Compiler Writing
The Theory and Practice of Compiler Writing
JeanPaul Tremblay, Paul G. SorensonCategories:
Año:
1985
Editorial:
McgrawHill College
Idioma:
english
Páginas:
813
ISBN 10:
1441665250
ISBN 13:
9788178001838
Series:
McGrawHill computer science series
File:
DJVU, 6.44 MB
Descarga (djvu, 6.44 MB)
 Checking other formats...
 Please login to your account first

Need help? Please read our short guide how to send a book to Kindle.
The file will be sent to your email address. It may take up to 15 minutes before you receive it.
The file will be sent to your Kindle account. It may takes up to 15 minutes before you received it.
Please note you need to add our email km@bookmail.org to approved email addresses. Read more.
Please note you need to add our email km@bookmail.org to approved email addresses. Read more.
You can write a book review and share your experiences. Other readers will always be interested in your opinion of the books you've read. Whether you've loved the book or not, if you give your honest and detailed thoughts then people will find new books that are right for them.
1

2

. . .. . . COMPILER 11_11 11111111 11111111 11111111 ! I I A .T T.T . ATA     WRTNG JEANPAUL TREMBLAY PAUL G. SORENSON THE THEORY AND PRACTICE OF COMPILER WRITING JeanPaul Tremblay Paul G. Sorenson Department of Computational Science University of Saskatchewan, Saskatoon Canada McGraw Hill Book Company New York St. Louis San Francisco Auckland Bogota Hamburg Johannesburg London Madrid Mexico Montreal New Delhi Panama Paris Sao Paulo Singapore Sydney Tokyo Toronto This book was set in Times Roman by Science Typographers, Inc. The editors were Eric M. Munson, Kaye Pace, Ellen W. MacElree, and Jo Satloff; the production supervisor was Joe Campanella. The drawings were done by Volt Information Sciences, Inc. R.R. Donnelley & Sons Company was printer and binder. THE THEORY AND PRACflCE OF COMPILER WRITING Copyright @ 1985 by McGrawHill, Inc. All rights reserved. Printed in the United States of America. Except as permitted under the United States Copyright Act of 1976, no part of this publication may be reproduced or distributed in any form or by any means, or stored in a data base or retrieval system, without the prior written permission of the publisher. 234567890DOCDOC898765 ISBN 0070651612 Library of Congress Cataloging in Publication Data Tremblay, JeanPaul, date The theory and practice of compiler writing. (McGrawHill computer science series) Includes bibliographies and index. 1. Electronic digital computersProgramming. 2. Com piling (Electronic computers) I. Sorenson, P. G. II. Title. III. Series. QA 76.6.T734 1985 001.64'2 8425096 ISBN 0070651612 Dedie Dedicated A la memoire de mon pere Philippe Tremblay To my father, Chester, and in memory of my mother, Josephine Sorenson CONTENTS Chapter 1 11 12 13 Chapter 2 21 22 23 24 Preface XVll Introduction Programming Languages Translators Model of a Compiler Bibliography 1 1 4 5 11 Notation and Concepts for Language; s and Grammars Sets and Strings 21.1 Basic Concepts of Set Theory Exercises 21.1 21.2 Relations Exercises 21.2 21.3 Strings Discussion of Grammars Exercises 22 Classification of Grammars Exercises 23 ContextFree Grammars and Parsing 24.1 Syntax Terminology Exercises 24.1 24.2 A View of Parsing Exercises 24.2 24.3 Reduced Grammars and Grammars with eRules Exercises 24.3 24.4 Extended BNF Notation Exercises 24.4 Bibliography 13 14 14 18 19 27 29 30 37 37 42 42 43 50 52 56 57 63 64 66 67 ix x CONTENTS Chapter 3 31 32 33 34 35 36 37 38 Chapter 4 41 42 43 44 Programming Language Design Overview of the Problem Preliminary Considerations Sources of Ideas Goals and Design Philosophies of Programming Languages 34.1 Human CommunicatiQU 34.2 Prevention and Detection of Errors 34.3 Usability 34.4 Programming Effectiveness 34.5 Compilability 34.6 Efficiency 34. 7 Machi e Independence 34.8 Simplici ty 34.9 Uniformity 34.10 Orthogonality 34.11 Generalization and Specialization 34.12 Other Design Philosophies Detailed Design 35.1 Microstructure 35.2 Expression Structure 35.3 Data Structure 35.4 Control Structure 35.5 Compile Structure 35.6 I/O Structure Reduction of Size Pragmatics Comments on the Design of Ada 38.1 Evaluation of Ada from a Language Design Viewpoint Usability and Program Effectiveness Machine Independence and Portability Efficiency Modularity and Maintainability Compilability and Compile Structure Simplicity 38.2 Ada Programming Support Environment 38.3 Software Technology for Adaptable Reliable Systems (STARS) Chap ter Exercises Bibliography Scanners The Scanning Process An Elementary Scanner Design and Its Implementation Exercises 42 Regular Grammars and Regular Expressions Exercises 43 FiniteState Acceptors 44.1 Deterministic FiniteState Acceptors 44.2 Nondeterministic FiniteState Acceptors 68 69 70 72 74 74 75 78 78 80 81 83 84 84 85 86 86 88 88 92 93 102 113 114 116 116 120 122 122 129 129 129 130 130 131 132 134 135 138 139 141 150 151 155 156 156 157 45 46 47 Chapter 5 51 52 53 54 55 Chapter 6 61 62 CONTENTS xi 44.3 Nondeterministic FiniteState Acceptors with f Transi tions Exercises 44 Equivalence of Regular Grammars and FiniteState Acceptors Exercises 4 5 Equivalence of Regular Expressions and FiniteState Acceptors Exercises 46 A Scanner Generator Exercises 4 7 Bibliography CompileTime Error Handling Overview of Error Handling Error Detection 52.1 The Nature of Syntax Errors 52.2 How Errors Are Detected 52.3 Where Errors Are Detected Error Reporting Error Recovery 54.1 Ad Hoc Recovery Mechanisms 54.2 SyntaxDirected Recovery 54.3 Secondary Error Recovery 54.4 ContextSensitive Recovery Error Repair 5 5.1 Ad Hoc Repair 55.2 SyntaxDirected Repair 55.3 ContextSensitive Repair 55.4 Spelling Repair Chapter Exercises Bibliography T op Down Parsing General TopDown Parsing Strategies 61.1 BruteForce Approach Exercises 61.1 61.2 RecursiveDescent Parsers RecursiveDescent Parsing Algorithm Error Detection in RecursiveDescent Parsers Exercises 61.2 61.3 Notions of TopDown Parsing with Limited Backup 61.4 TopDown Parsing with Limited Backup Exercises 61.4 TopDown Parsing with No Backup 62.1 Notions of Parsing with No Backup 62.2 Simple LL(I) Grammars Exercises 62.2 62.3 LL(I) Grammars without (Rules Exercises 62.3 62.4 LL(I) Grammars with (Rules 162 165 166 169 170 176 177 181 181 182 182 185 185 187 189 190 195 195 197 199 200 200 201 201 202 203 205 206 208 208 209 219 219 219 224 228 228 229 234 235 235 236 241 242 248 249 xii CONTENTS Chapter 7 71 72 73 74 Exercises 62.4 62.5 Error Handling for LL(I) Parsers Exercises 62.5 Chap ter Exercises Bibliography BottomUp Parsing Polish Expressions and Their Compilation 71.1 Polish Notation 71.2 Conversion of Infix Expressions to Polish Notation 71.3 Error Detection for Infix Expressions Exercises 71 Operator Precedence Grammars 72.1 Notions and Use of Operator Precedence Relations 72.2 Formal Definition of Operator Precedence Relations 72.3 Operator Precedence Parsing Algorithm 72.4 Precedence Functions in Operator Precedence Parsers 72.5 Error Detection in Operator Precedence Parsers Exercises 72 Simple Precedence Grammars 73.1 Notions and Use of Precedence Relations 73.2 Formal Definition of Precedence Relations 73.3 Parsing Algorithm for Simple Precedence Grammars 73.4 Precedence Functions for Simple Precedence Grammars 73.5 Error Recovery for Simple Precedence Parsers 73.6 Notions of Extended Precedence Exercises 73 LR Grammars 74.1 Concepts and Terminology Exercises 74.1 74.2 LR(O) Parsers Exercises 74.2 74.3 SLR (1) Parsers Exercises 74.3 74.4 Canonical LR(I) Parsers Exercises 74.4 74.5 LA LR (1) Parsers Exercises 74.5 74.6 Efficient Generation of Lookahead Sets for LALR(I) Parsers Exercises 74.6 74.7 Representation and Optimization of LR Parsers Sparse Matrix Representations Optimization Transformations 7 4.8 Error Detection and Recovery in LR Parsers Early Methods of Error Recovery Application of GrahamRhodes Method to LR Parsers Other LR ErrorRecovery Methods 258 260 272 272 273 274 275 276 278 283 285 286 287 293 294 298 300 301 302 302 306 309 311 320 332 339 341 341 346 346 352 353 361 362 369 370 375 375 383 383 384 386 389 389 392 410 75 Chapter 8 81 82 83 84 85 86 Chapter 9 91 92 93 CONTENTS xiii Comparison of Parsing Methods Bibliography Symbol Table Handling Techniques Perspective and Motivation When to Construct and Interact with the Symbol Table SymbolTable Contents Operations on Symbol Tables SymbolTable Organizations for NonBlockStructured Languages 85.1 Unordered Symbol Tables 85.2 Ordered Symbol Tables 85.3 TreeStructured Symbol Tables 85.4 Hash Symbol Tables Symbol Table Organizations for BlockStructured Languages 86.1 BlockStructured Language Concepts 86.2 Stack Symbol Tables 86.3 StackImplemented TreeStructured Symbol Tables 86.4 StackImplemented HashStructured Symbol Tables Chapter Exercises Bibliography RunTime Storage Organization and Management Static Storage Allocation Exercises 91 Dynamic Storage Allocation 92.1 Activation Records 92.2 Parameter Area 92.3 Display Area 92.4 RunTime Address Calculation 92.5 Handling Recursive Procedures Exercises 92 Heap Storage Allocation 93.1 Implicit Storage Requests 93.2 Explicit Storage Requests 93.3 Management of FixedLength Blocks 93.4 Management of VariableLength Blocks FirstFit Heap StorageManagement Strategy BoundaryTag Heap StorageManagement Strategy BuddySystem Heap StorageManagement Strategy 93.5 Freeas YouGo Storage Release 93.6 Garbage Collection 9 3.7 Compaction Exercises 9 3 Bibliography 410 415 418 418 419 422 426 428 428 429 433 450 464 465 467 468 471 474 476 477 477 480 480 482 483 484 487 488 491 492 493 493 495 496 497 501 506 512 513 517 519 520 xiv CONTENTS Chapter 10 Intermediate Forms of Source Programs 521 101 Polish Notation 522 102 Ntuple Notation 523 103 Abstract Syntax Trees 525 104 Threaded Code 527 105 Abstract Machine Code 531 10 5.1 Portabili ty and Abstract Machine 531 10 5.2 The PCode Abstract Machine for PASCAL 532 Chapter Exercises 534 Bibliography 535 Chapter 11 Semantic Analysis and Code Generation 536 111 What Is Meant by Semantic Analysis 537 112 Implicit Stacking in RecursiveDescent Compilation 540 113 Semantic Stacks in BottomUp Compilation 548 114 Action Symbols in TopDown Compilation 554 115 Attributed Translation 559 115.1 Attributed Translation Grammar 560 115.2 LAttributed Translation Grammar 563 116 Example Language Constructs 568 116.1 Declarations 568 Constant Type 570 Variable Declarations 571 Procedure Declarations 578 116.2 Expressions 579 116.3 Assignment Statements 591 116.4 Control Statements 592 CaseSelection Statement 593 Repeat  While Statement 595 For Loop Statement 596 116.5 Procedure Calls and Returns 598 Procedure Calls 598 Return Statements and Procedure Termination 603 116.6 Input and Output Statements 603 Input Statements 603 Output Statements 607 116.7 Compiler Aids 607 Chapter Exercises 609 Bibliography 609 Chapter 12 Code Optimization 610 121 Introduction 610 122 Basic Blocks 611 123 Folding 612 124 RedundantSubexpression Elimination 620 125 Optimization within Iterative Loops 631 125.1 Loop Unrolling 635 125.2 Frequency Reduction 637 126 Chapter 13 131 132 133 Chapter 14 141 142 143 Appendix AI CONTENTS xv 125.3 Strength Reduction 125.4 Combining LoopOptimization Techniques Global Optimization through Flowgraph Analysis 126.1 Flowgraph Construction 126.2 Flowgraph Analysis FlowgraphAnalysis Problems FlowA nalysis Algorithms 126.3 Applications to Program Optimization 126.4 Implementation and Further Considerations Chapter Exercises Bibliography MachineDependent Optimization Introduction to MachineDependent Optimization RegisterAllocation Optimization 132.1 . Register Allocation in SingleRegister Machines 132.2 Register Allocation in Multiregister Machines Machine Architecture and the Generation of Real Code 133.1 The PDPll 133.2 The VAXII 133.3 The MC68000 Chapter Exercises Bibliography CompilerCompilers Introduction to CompilerCompilers Examples of Parser Generators 142.1 Y ACC: A LALR(I) Parser Generator 142.2 An Attributed LL(I) Parser Generator MachineIndependent Code Generation 143.1 The ProductionQuality CompilerCompiler System Introduction The Formalization of InstructionSet Processors and TCOL The CodeGeneration Process: Its Phases and Organization 143.2 The TableDriven Generator of Graham and Glanville Introduction The TargetMachine Description The CodeGeneration Process Results and Extensions 143.3 Other CodeGeneration Systems Fraser's KnowledgeBased CodeGenerator Generator The FiniteState Approach of Donegan Bibliography Algorithmic Notation Format Conventions Al.l Name of Algorithm Al.2 Introductory Comment 646 651 655 655 660 661 667 678 680 681 683 684 684 686 686 694 702 703 710 716 720 721 722 723 726 727 735 742 743 743 745 749 755 756 757 758 760 761 761 763 764 767 768 768 768 xvi CONTENTS Al.3 Steps 768 Al.4 Comments 768 A2 Statements and Control Structures 768 A2.1 Assignment Statement 769 A2.2 If Statement 769 A2.3 Case Statement 770 A2.4 Repeat Statement 770 A2.5 Go To and Exitloop Statements 772 A2.6 Exit Statement 772 A2.7 Variable Names 773 A3 I>ata Structures 773 A3.1 Arrays 773 A3.2 Dynamic Storage 774 A4 Arithmetic Operations and Expressions 774 A5 Strings and String Operations 775 A6 Relations and Relational Operators 775 A7 Logical Operations and Expressions 776 A8 Input and Output 776 A9 Subalgori thms 777 A9.1 Functions 777 A9.2 Procedures 778 A9.3 Parameters 778 Indexes 781 Name Index 781 Subject Index 785 PREFACE This book is intended as a text for a one or twosemester course in compiler design at the senior undergraduate or introductory graduate level. It can also be used as a selfstudy and reference book in compiler design. The purpose of this text is to cover the underlying concepts and techniques used in compiler design. Some of these techniques can be used in software engineering or software design. The reader should have at least one year of experience in programming a highlevel language and an assembly language. In addition, a familiarity with elementary data structures and discrete mathematics is a definite asset. The text, however, is reasonably selfcontained. Chapter 1 contains an overview of compiler design and gives a model for a compiler. Chapter 2 covers the basic notations and concepts of discrete mathematics, grammars, and languages. This material is used in the remainder of the book. Students who are familiar with this material can quickly peruse it or ignore it. They need not have a complete mastery of the concepts on their first reading. They can, however, refer to the material when it is used in subsequent chapters. In Chapter 3 we introduce the elements of programming language design. We attempt to give a unified treatment of the topics in sufficient depth to be useful to persons actually designing a programming language. Much of this material is appearing in a compiler book for the first time. Chapter 4 details the informal and the formal approaches to lexical analyzers (i.e., scanners). Chapter 5 gives an overview of error handling. Although it doesn't contain many detailed algorithms on error recovery and repair strategies, it discusses important factors and considerations of error recovery and repair in general. Detailed algorithms for particular strategies are given in Chapters 6 and 7. Chapter 6 is the first of two chapters on parsing. The chapter deals with topdown parsing. In the first section we examine a number of ad hoc methods. The second section gives a detailed description of syntaxdirected translation using LL(l) grammars. xvii xviii PREFACE Chapter 7 discusses several syntaxdirected bottomup parsing techniques. The discussion includes several precedence and LR parsers along with associated error detection and recovery strategies. In Chapter 8, first we outline the important functions of a symbol table in the translation process and then give the details of several symboltable organizations for blockstructured and nonblockstructured languages. Chapter 9 provides a description of the static, dynamic, explicit, and implicit storageallocation strategies that are frequently adopted when compiling pro grams for a wide variety of existing languages. In Chapter 10 we discuss the advantages of using intermediate source forms. Five types of intermediate source forms are examined in detail. Chapter 11 deals with semantic analysis and code generation. This chapter presents four types of syntaxdirected processing techniques, which include trans lation grammars and attribute translation. Chapter 12 presents several machineindependent optimization strategies, including folding, redundant subexpression elimination, optimization within itera tive loops, and global optimization through flow analysis. Chapter 13 covers several machinedependent optimization techniques such as register allocation and peephole optimization Chapter 14 gives a brief overview of compilercompiler systems. The main focus is on the PQCC and GrahamGlanville systems. Much of this material appears in a compiler book for the first time. As we mentioned earlier, the book can be used in a twosemester course. Depending on the level of such a course, the instructor may wish to supplement the text with reading assignments from the references. Several onesemester courses are possible. For example, the following selection of topics constitutes such a course: Chapters 1 and 2 Chapter 3 (can be omitted if students don't do their own language design) Sections 41 and 42 Chapter 5 Sections 61.2 and 62 (except Sec. 62.5) Sections 74.1, 74.2, 74.3, and 74.6 Sections 85.3 and 85.4 and parts of Sec. 86 Chapter 9 (select one or two methods) Chapter 10 Chapter 11 (select a few methods of code generation) An important aspect of teaching a course in compiler writing is to illustrate the key theoretical concepts. A frequently used strategy for achieving this goal is to have a student design a simple programming language and implement a compiler for this language. Many texts in compiler writing do not, because of size considerations and level of presentation, adequately present details that illustrate the implementation of a compiler. In a separate volume, entitled An Implementa tion Guide to Compiler Writing (published with McGrawHill in 1982), we present, PREFACE xix in a case study manner, the development of a small compiler typical of those developed in a first compiler course. Instructors can use this book as a guide to illustrate difficulties that commonly arise when implementing a compiler. It can also be used as a sample document that is similar to what students should produce for their own compilers. I t is often necessary for instructors teaching a course in compiler writing to set up their own" toy" compiler as a vehicle for illustrating techniques for implementing a compiler. Also, many of the problems that are encountered in writing a compiler for a newly designed language must be illustrated. In this guide we attempt to fulfill this role. The guide begins with a documented description of a simple program language, GAUSS, which serves as an example for the student. GAUSS is a blockstructured language whose design was influenced by ALGOL, PASCAL, PL/I, and FORTRAN. It also contains stringmanipulation facilities capable of manipulating variablelength strings. It then illustrates the topics of scanner generation, LL(I) parsing, symboltable construction, code generation, semantic analysis, error detection, and repair for the GAUSS language. Finally, the guide describes an interpreter for the code generated by the GAUSS compiler and discusses the runtime environment in which the machine code is executed. The code generated by the compiler executes on a hypothetical machine which is stack oriented. ACKNOWLEDGMENTS We would like to thank the many people who assisted us in the preparation of this manuscript. Brent Clark assisted in the preparation of some parts of Chapters 12 and 13. Jack Copeck contributed to earlier drafts of Chapters 6 and 12. Howard Hamilton assisted in program flow analysis techniques. Rod Nickel contributed to the part of Chapter 7 that deals with error detection and recovery in precedence parsers. Lyle Opseth assisted in error detection and recovery in Chapter 7 and a preliminary version of Chapter 14. Darwyn Peachey assisted in the preparation of Chapter 5 and some error detection and recovery techniques in Chapters 6 and 7. Henry Spencer contributed to Chapter 3, and Joe Wald contributed to the attributed translation part of Chapter 11. Special thanks is due to John DeDourek, who classtested the manuscript and proofread the entire galley form of the book. We also wish to thank the reviewers of the manuscript: Michael Bauer, James Harp, John Lowther, and Tom Pennello. A thanks to Chris Fraser for his comments on Chapter 14. We are grateful to Sue Wotton, who did such an excellent job of typing the manuscript, and to Gail Walker for her typing support. Finally, we are very grateful for the comments of our students in the Department of Computational Science at the University of Saskatchewan, who have classtested preliminary versions of the book during the last seven years. leanPaul Tremblay Paul G. Sorenson CHAPTER ONE INTRODUCTION 11 PROGRAMMING LANGUAGES Interactions involving humans are most effectively carried out through the medium of language. Language permits the expression of thoughts and ideas, and without it, communication as we know it would be very difficult indeed. In computer programming, a programming language serves as a means of communication between the person with a problem and the computer used to help solve it. An effective programming language enhances both the development and the expression of computer programs. It must bridge the gap between the often unstructured nature of human thought and the precision required for computer execution. A program solution to a given problem will be easier and more natural to obtain if the programming language used is close to the problem. That is, the language should contain constructs which reflect the terminology and elements used in describing the problem and are independent of the computer used. Such a programming language is usually highlevel. Digital computers, on the other hand, accept and understand only their own special lowlevel language, consisting typically of long sequences of zeros and ones. These sequences are generally unintelligible to humans. This type of lowlevel language is quite different from the highlevel language used to describe a problem in a given area. A hierarchy of programming languages based on increasing machine indepen dence includes the following: 1. Machinelevel languages 2. Assembly languages 3. Higherlevel or useroriented languages 4. Problemoriented languages 1 2 THE THEORY AND PRACTICE OF COMPILER WRITING A machinelevel language is the lowest form of computer language. Each instruction in a program is represented by a numeric code, and numerical addresses are used throughout the program to refer to memory locations in the computer's memory. All bookkeeping aspects of the program are the sole respon sibility of the machinelanguage programmer. Finally, all diagnostics and programming aids must be supplied by the programmer. Also included as machinelevel programs are programs written in microcode (i.e., microprograms). Microcode allows for the expression of some of the more powerful machinelevel instructions in terms of a set of basic machine instructions. Assembly language is essentially a symbolic version of a machinelevel lan guage. Each operation code is given a symbolic code such as ADD for addition and MUL for multiplication. Moreover, memory locations are given symbolic names such as PAY and RATE. Some assembly languages contain macroinstruc tions which are at a higher level than assemblylanguages instructions. Assembly language systems offer certain diagnostic and debugging assistance that is normally not available at the machine level. A highlevel language such as FORTRAN, PASCAL, or PL/I offers most of the features of an assembly language. While some facilities for accessing system level features may not be provided, a highlevel language offers a more enriched set of language features such as structured control constructs, nested statements, blocks, and procedures. A problemoriented language provides for the expression of problems in a specific application or problem area. Examples of such languages are SEQUEL for database retrieval applications and COGO for civil engineering applications. In this book we concern ourselves exclusively with the development of translators for highlevel algebraic languages such as ALGOL, PASCAL, and FORTRAN. Although the class of highlevel languages includes both procedural and nonprocedural languages, this book concentrates on procedural languages. Unlike procedural programs, the position of a particular statement in a nonproce dural program has no bearing on its execution; that is, the interchange of any two statements in a nonprocedural program will not affect the results obtained though its execution. Examples of nonprocedural languages are PSL, a requirements statement language, and CSMP, a simulation and modeling language. Advantages of highlevel languages over lowlevel languages such as machine and assembly languages include the following: 1. Highlevel languages are easier to learn than their lowerlevel counterparts. The learning of many highlevel languages requires little or no computer hardware background because such languages are relatively machineindepen dent. Furthermore, these languages are closer to their problem areas than lowerlevel languages. 2. The programmer does not have to be concerned with clerical tasks involving numerical or symbolic references to instructions, memory locations, con stants, etc. Such tasks are handled by a system which translates the highlevel language program into machine language. INTRODUCTION 3 3. A programmer is not required to know how to convert data from external forms to various internal forms within the memory of a computer. The ability of a programmer to convert, say, numeric data into internal forms such as floatingpoint numbers and packeddecimal numbers should be irrelevant. 4. Most highlevel languages offer a programmer a variety of control structures which are not available in lowlevel languages. Highlevel languages offer several of the following language constructs: Conditional statements (such as IFTHENELSE and CASE statements) Looping statements (for both counted and conditional loops) Nested statements Block structures These control structures improve programming style and facilitate certain programming approaches such as structured programming. Resulting pro grams are easier to read, understand, and alter. This results in reduced programming costs because programs are less complicated. 5. Programs written in a highlevel language are usually more easily debugged than their machine or assemblylanguage equivalents. Highlevel languages offer constructs that eliminate or reduce certain kinds of programming errors that occur in lowlevel languages. For example, the declaration of variables in a program adds a certain degree of redundancy which is useful in detecting errors in the improper use of variables. Languages often enforce a disciplined use of pointers. Moreover, a structured program is much more easily de bugged than its unstructured counterpart. 6. Since most highlevel languages offer more powerful control and data structuring capabilities than lowlevel languages, the former class of lan guages facilitates the expression of a solution to a particular problem. 7. Because of the availability of certain language features such as procedures, highlevel languages permit a modular and hierarchical description of pro gramming tasks. These tasks can then be assigned to a team of programmers, thus facilitating a division of labor with a minimum of disruption and effort. Such an approach permits better documentation of a problem. Also, in creased compatibility among programs and programmers can be realized. 8. Finally, highlevel languages are relatively machineindependent. Conse quently, certain programs such as FORTRAN and COBOL programs are portable. These programs can be executed on different computers with little, if any, change even though these computers have different internal architec tures and machinelanguage instruction sets. Program portability reduces costs and, in general, protects an installation against computer obsolescence. Often when a computer is replaced, all associated assembly and machine language programs become obsolete. Points 1 through 5 have been stressed in the past and found to be very important. Today, points 6 through 8 are receiving considerable attention and promise to be dominant considerations in the immediate future. 4 THE THEORY AND PRACTICE OF COMPILER WRITING Highlevel language programs, of course, must be translated automatically to equivalent machinelanguage programs. The notions of such translators are examined in the next section. 12 TRANSLATORS A translator inputs and then converts a source program into an object or target program. The source program is written in a source language and the object program belongs to an object language. If the source language being translated is assembly language and the object program is machine language, the translator is called an assembler. Assembly language resembles closely machine language. In an elementary assembly lan guage, most of its instructions are symbolic representations of corresponding machinelanguage instructions. Each assembly instruction contains an ordered set of fields. For example, the first field might represent a label which is followed immediately by an operation field. This field in turn might be followed by one or more operand fields. More sophisticated assembly languages, however, support macroinstructions. A macroinstruction can be expanded to a sequence of simple machinelanguage instructions. Highlevel language constructs such as case state ments, nested statements, and blocks are not usually contained in assembly languages. A translator which transforms a highlevel language such as FORTRAN, PASCAL, or COBOL into a particular computer's machine or assembly language is called a compiler. The time at which the conversion of a source program to an object program occurs is called compile time. The object program is executed at run time. Figure 11 illustrates the compilation process. Note that the source program and data are processed at different times, namely, compile time and run time, respectively. Another kind of translator, called an interpreter, processes an internal form of the source program and data at the same time. That is, interpretation of the Data Sou rce program Compiler Object program Executing com pu ter Results Compile time Run time Figure 11 Compilation process. INTRODUCTION 5 Data Sou rce program Interpreter Results Figure 12 Interprctive process. internal source form occurs at run time and no object program is generated. Figure 12 illustrates this interpretive process. Some interpreters analyze each source statement every time it is to be executed. Such an approach is very timeconsuming and is seldom used. A more efficient approach involves applying compilation techniques to the translation of the source program to an in ter mediate source form. This intermediate source form is then interpreted by the in terpreter program. Interpreters have become increasingly popular lately, particularly in micro computer environments where the overhead of interpretation seems to be signif candy less for the user. For example, a main reason why languages such as BASIC, APL, LISP, and SMALL T ALK80 have become very popular is because they have been implemented in an interpretive environment. In this book, we concentrate on compiler construction techniques. Although a complete compiler will not be found in this text, the reader is referred to the accompanying text, An Implementation Guide to Compiler Writing, where a compiler for the highlevel programming language GAUSS is described in detail. A compiler can be described in a modular fashion. The next section gives an overview of such an approach. 13 MODEL OF A COMPILER The task of constructing a compiler for a particular source language is complex. The complexity and nature of the compilation process depend, to a large extent, on the source language. Compiler complexity can often be reduced if a program minglanguage designer takes various design factors into consideration. Several of these programminglanguage design factors are discussed in Chap. 3. Since we are dealing with highlevel source languages such as ALGOL and PASCAL, however, a basic model of a compiler can be formulated. Such a model is given in Fig. 13. Although this model may vary for the compilation of different highlevel lan guages, it is nevertheless representative of the compilation process. A compiler must perform two major tasks: the analysis of a source program and the synthesis of its corresponding object program. The analysis task deals with the decomposition of the source program into its basic parts. Using these 6 THE THEORY AND PRACTICE OF COMPILER WRITING Source Object program program ANAL YSIS SYNTHESIS , r Lex ical  Syntactic .... Semantic .... Code .. Code analyzer  analyzer  analyzer  generator  optimizer , , , ,  , , , , TAB LES Figure 13 Components of a compiler. parts, the synthesis task builds their equivalent object program modules. The performance of these tasks is realized more easily by building and maintaining several tables. More will be said about the contents of these tables shortly. A source program is a string of symbols each of which is generally a letter, a digi t, or certain special symbols such as +, , and (,). A source program contains elementary language constructs such as variable J}ames, labels, constants, keywords, and operators. It is therefore desirable for the compiler to identify these various types as classes. These language constructs are given in the defini tion of the language. The source program (see Fig. 13) is input to a lexical analyzer or scanner whose purpose is to separate the incoming text into pieces or tokens such as constants, variable names, keywords (such as DO, IF, and THEN in PL/I), and operators. In essence, the lexical analyzer performs lowlevel syntax analysis. For efficiency reasons, each class of tokens is given a unique internal representation number. For example, a variable name may be given a representation number of 1, a constant a value of 2, a label the number 3, the addition operator ( + ) a value of 4, etc. For example, the PL/I statement TEST: IF A > B THEN X = Y; would be translated by the lexical analyzer into the following sequence of tokens (and associated representation numbers): TEST 3 26 IF 20 INTRODUCTION 7 A > B THEN X y 1 15 1 21 1 10 1 27 Note that in scanning the source statement and generating the representation number of each token we have ignored spaces (or blanks) in the statement. The lexical analyzer must, in general, process blanks and comments. Both of these classes of items typically do not represent executable parts of a program and are ignored, for example, in PL/I. Certain programming languages allow the continuation of statements over multiple lines. Lexical analyzers must then handle the input processing of such multipleline statements. Also, some scanners place constants, labels, and variable names in appropriate tables. A table entry for a variable, for example, may contain its name, type (e.g., REAL, INTEGER, or BOOLEAN), object program address, value, and line in which it is declared. The lexical analyzer supplies tokens to the syntax analyzer. These tokens may take the form of a pair of items. The first item gives the address or location of the token in some symbol table. The second item is the representation number of the token. Such an approach offers a distinct advantage to the syntax analyzer; namely, all tokens are represented by fixedlength information: an address (or pointer) and an integer. Chapter 4 describes the scanning process in more detail. The syntax analyzer is much more complex than the lexical analyzer. Its function is to take the source program (in the form of tokens) from the lexical analyzer and determine the manner in which it is to be decomposed into its constituent parts. That is, the syntax analyzer determines the overall structure of the source program. This process is analogous to determining the structure of a sentence in the English language. In such an instance we are interested in identifying certain classes such as "subject," "predicate," "verb," "noun," and " adjective." In syntax analysis we are concerned with grouping tokens into larger syn tactic classes such as expression, statement, and procedure. The syntax analyzer (or parser) outputs a syntax tree (or its equivalent) in which its leaves are the tokens and every nonleaf node represents a syntactic class type. For example, an analysis of the source statement (A + B)*(C + D) can produce the syntactic classes (factor), (term), and (expression) as exhibited in the syntax tree given in Fig. 14. The syntaxtree approach is described in Chap. 2, where a set of rules known as a grammar is used to define precisely the source language. A grammar can be used by the syntax analyzer to determine the 8 THE THEORY AND PRACTICE OF COMPILER WRITING < expression> I < term> * < factor> « < expression> + < term> I I < term> < factor> I I < factor> 0 I c < term> I < factor> ) < expression> + < term> I I < term> < factor> I I < factor> B I A Figure 14 A syntax tree for the expression (A + B)*(C + D). structure of the source program. This recognition process is called parsing, and consequently we often refer to syntax analyzers as parsers. Several classes of parsers are discussed in detail in Chaps. 6 and 7. The syntax tree produced by the syntax analyzer is used by the semantic analyzer. The function of the semantic analyzer is to determine the meaning (or semantics) of the source program. Although it is conceptually desirable to separate the syntax of a source program from its semantics, the syntax and semantic analyzers work in close cooperation. Nevertheless, the semanticanalysis process is a different and unique process in a compiler. For an expression such as (A + B) * (C + D), for example, the semantic analyzer must determine what actions are specified by the arithmetic operators of addition and multiplication. When the parser recognizes an operator such as "+" or "*," it invokes a semantic routine which specifies the action to be performed. This routine may check that the two operands to be added have been declared, that they have the same type (if not, the routine would probably make them the same), and that both operands have values. The semantic analyzer often interacts with the various tables of the compiler in performing its task. The semanticanalyzer actions may involve the generation of an intermediate form of source code. For the expression (A + B) * (C + D), the intermediate source code might be the following set of quadruples: ( +, A, B, Tl) ( + , C, D, T2) ( * , Tl, T2, T3) INTRODUCTION 9 where ( + , A, B, T1) is interpreted to mean "add A and B and place the result in temporary T1," (+, C, 0, T2) is interpreted to mean "add C and 0 and place this result in T2," and (*, T1, T2, T3) is interpreted to mean" multiply Tl and T2 and place the result in T3." The exact form of the intermediate source language used depends, to a large extent, on how it is to be processed during the synthesis task. An infix expression may be converted to an intermediate form called Polish notation. Using this approach, the infix expression (A + B)*(C + D) would be converted to the equivalent suffixPolish expression AB + CD + *. The latter expression contains the arithmetic operators in the order in which they are to be executed. Another type of intermediate form of source language can be a syntax tree which represents the parse of the source program. (See Chap. 10.) The output of the semantic analyzer is passed on to the code generator. At this point the intermediate form of the sourcelanguage program is usually translated to either assembly language or II).achine language. As an example, the translation of the three quadruples for the previous expression can yield the following sequence of singleaddress, singleaccumulator assemblylanguage in structions: LOA A Load the contents of A into the accumulator. ADD B Add the contents of B to that of the accumulator. STO T1 Store the accumulator contents in temporary storage T1. LOA C Load the contents of C into the accumulator. ADD 0 Add the contents of 0 to that of the accumulator. STO T2 Store the accumulator contents in temporary storage T2. LOA T1 Load the contents of T1 into the accumulator. MUL T2 Multiply the contents of T2 by that of the accumulator. STO T3 Store accumulator contents in temporary storage T3. The topic of code generation is presented in Chap. 11. The output of the code generator is passed on to a code optimizer. This process is present in more sophisticated compilers. Its purpose is to produce a more efficient object program. Certain optimizations that are possible at a local level include the evaluation of constant expressions, the use of certain operator properties such as associativity, commutativity, and distributivity, and the detection of common subexpressions. Because of the commutativity of the multiplication operator, the previous assem bly code can be reduced to the following: LOA A ADDB STO Tl LDAC ADD 0 MUL T1 STO T2 Note that the expression evaluated by this code is (C + D)*(A + B). More global optimizations can also be performed. These include the single evaluation of multiple occurrences of identical subexpressions and removing 10 THE THEORY AND PRACTICE OF COMPILER WRITING statements that are invariant inside a loop and placing them outside of that loop. These kinds of optimizations are machineindependent. There are, however, certain machinedependent optimizations which can also be performed. Optimal register allocation is an example of this kind of optimization. A good code optimizer can produce as good or even better code than an experienced assemblylanguage programmer. Optimization techniques are the topic of Chaps. 12 and 13. The model of a compiler given in Fig. 13 makes distinctions between its successive phases. In certain compilers, however, some of these phases are combined. In our discussion of this model the details are omitted on how the five processes interact with each other. Let us examine some possible interactions between the lexical and syntax analyzers. One possibility is that the scanner generates a token for the syntax analyzer for processing. The syntax analyzer then "calls" the scanner when the next token is required. Another possibility is for the scanner to produce all the tokens corresponding (0 the source program before passing control to the syntax analyzer. In this case the scanner has examined the entire source programthis is called a separate pass. Some compilers make as little as one pass while other compilers have been known to make more than 30 passes (e.g., some of IBM's first PL/I compilers). Factors which influence the number of passes to be used in a particular compiler include the following: 1. Available memory 2. Speed and size of compiler 3. Speed and size of object program 4. Debugging features required 5. Errordetection and recovery techniques desired 6. Number of people and time required to accomplish the compiler writing project Several educationally oriented or student compilers such as W A TFIV and PL/C are essentially onepass compilers. In these compilers very little (if any) code optimization is performed. The reason for this is the belief that programs will be compiled more times than executed. Such programs are often executed once and discarded. Also the semanticanalyzer and codegeneration phases are invariably combined into one phase. Great emphasis is placed on debugging, error detection, error recovery, and diagnostic capabilities. Certain source languages, however, cannot be compiled in a single pass. Compilers for PL/I and FORTRAN often contain many passes. In particular, the optimization phase may require that several passes of the source program (or its intermediate form) be made. In addition, the optimization phase is often spread throughout the other passes of the compilation process (e.g., certain constant folding, as described in Chap. 12, has been incorporated in the lexical analyzer of some compilers). Other combinations are possible. Sometimes the syntaxanalyzer, semantic analyzer, and codegeneration phases can be combined into a single phase. INTRODUCTION 11 An aspect of compiler construction which was omitted in Fig. 13 deals with the area of error detection and error recovery. Error recovery is most closely associated with the syntaxanalysis phase. The aim of error recovery is to prolong the compilation life of the program as long as possible before the compiler "gives up" on the source program. A strategy that is often adopted is that if at all possible, the object program is executed. This approach can reduce the number, of times a source program has to be compiled before becoming errorfree. Some errorrecovery approaches insert and/or discard tokens in attempting to correct the portion of the source program which is syntactically incorrect. We attach significant importance to this topic in this book. In particular, Chap. 5 introduces the general notions of error detection and error recovery. Also, particular strate gies for different parsing techniques are illustrated in Chaps. 6 and 7. Another aspect of compiling which has been ignored in the current discussion deals with the various symbol tables that must be created and maintained during the executions of the required object program. The topic of symbol tables is presented in Chap. 8. Also, certain language constructs in the source language imply that several data structures must be stored and maintained in the memory of the computer. The design and implementation of these structures is called runtime organization and implementation or storage management. For block structured languages, stringoriented langUages such as SNOBOL, and languages which permit the declaration and manipulation of programmerdefined data structures, the runtime considerations are very significant and complex. Details of these problems are dealt with in Chap. 9. In closing, a comment on the relative difficulty of implementing each of the phases in Fig. 13 is in order. The lexicalanalysis phase is perhaps the most straightforward. Next in difficulty comes the syntacticanalysis or parsing phase. A lot of research, based on formal language theory, has been done on these two phases. As a result, the scanning and parsing t>liases can be largely automated. The theory of scanning is developed in Chap. 4. Chapters 6 and 7, on the other hand, describe the theory of parsing. The real difficulties in compiler construction are in the semanticanalysis, codegeneration, and codeoptimization phases. Another problem area is the portability of programs. These difficulties, which are often language and machinedependent, are examined in Chaps. 10, 11, 12, and 13. Several attempts have been made at automating compiler construction. Some of these efforts have enjoyed limited success. These translatorwriting or compiler compiler systems are briefly discussed in Chap. 14. BIBLIOGRAPHY Aho, A. V., and J. D. Ullman: Principles of Compiler Design, Reading, Mass.: AddisonWesley, 1977. Barrett, W. A., and 1. D. Couch: Compiler Construction, Theory and Practice, Chicago: Science Research Associates, Inc., 1979. 12 THE THEORY AND PRACTICE OF COMPILER WRITING Calingaert, P.: Assemblers, Compilers and Program Translators, Rockville, Md.: Computer Science Press, Inc., 1979. Cocke, J., and J. T. Schwartz: Programming Languages and Their Compilers, Courant Institute of Mathematical Sciences, New York University, April 1970. Gries, D.: Compiler Construction for Digital Computers, New York: John Wiley & Sons, 1971. Knuth, D. E.: "A History of Writing Compilers," Computers and Automation, Vol. 11, No. 12, December 1962, pp. 814. Lewis, P. M. II, D. J. Rosenkrantz, and R. E. Stearns: Compiler Design Theory, Reading, Mass.: AddisonWesley, 1976. Pollack, B. W. (ed.): Compiler Techniques, Princeton, N.J.: Auerbach, 1972. Rosen, S., Programming Systems and Languages, New York: McGrawHill, 1967. Tremblay, J. P., and P. G. Sorenson: An Implementation Guide To Compiler Writing, New York: McGraw Hill, 1982. CHAPTER TWO NOTATION AND CONCEPTS FOR LANGUAGES AND GRAMMARS This chapter covers basic material which we will use extensively throughout the remainder of this book. Rather than spreading this material in various chapters of the text, therefore forcing the reader to read at least those parts of each chapter where the notions are introduced, we felt that a better alternative would be to put in this chapter basic fundamental material which is common to most of the remaining chapters. In this way the text can be read in a modular manner, thus allowing the reader to select certain chapters without loss of continuity. The first section deals with basic material concerning sets and strings. Section 22 introduces the notion of a grammar and its use in describing the syntax of a language. In the third section, a classification of grammars is given and the classification is shown to form a hierarchy. The material dealing with contextfree grammars of Sec. 24 introduces basic notions such as parsing, syntax tree, and ambiguity. The problem of specifying the syntax of a programming lan guage by the use of a metalanguage is discussed at some length. The material in Sees. 21, 22, 24.1, and 24.2 is drawn primarily from Tremblay and Manohar (1975). Readers who are familiar with the notions of sets, relations, and strings can proceed to Sec. 22. Readers who have been exposed to grammars can begin at Sec. 24. 13 14 THE THEORY AND PRACTICE OF COMPILER WRITING 21 SETS AND STRINGS A programming language consists of a set of programs (or strings). As we will see shortly, it is convenient to use a grammar as a formal vehicle for generating these programs. Several relations can be defined on the rules of a grammar, and these relations can lead to efficient compilation algorithms for the language associated with that grammar. Consequently, the purpose of this section is to give a brief introduction to the basic elements of sets and strings. In particular, the concepts of sets and operations on sets are given in Sec. 21.1. The notion of a relation is introduced in Sec. 21.2. Other important aspects about relations which are introduced in this subsection include the graph and transitive closure of a relation. The section ends with an elementary treatment of strings. 21.1 Basic Concepts of Set Theory By a set we mean a collection of objects of any sort. The word "object" is used here in a very broad sense to include even abstract objects. A fundamental concept of set theory is that of membership or belonging to a set. Any object belonging to a set is called a member or an element of that set. If an element p belongs to a set A, we write pEA, which is read as "p is an element of the set A" or "p belongs to the set A." If there exists an object q which does not belong to the set A, we express this fact as q A. There are many ways of specifying a set. For example, a set consisting of the decimal digits is generally written as {O,1,2,3,4,5,6,7,8,9} The names of the elements are enclosed in braces and separated by commas. If we wish to denote this set as D, we write D = {O,1,2,3,4,5,6,7,8,9} where the equality sign indicates that D is the set {O, 1,2, 3,4,5,6, 7, 8, 9}. This method of specifying a set is not always convenient. For example, the set D can be more easily described by using a predicate as follows: D = {xix is a decimal digit} where the symbol I is read as "such that." In this case, the predicate is " is a decimal digit." If we let P(x) denote any predicate, then {xIP(x)} defines a set and is read" the set of all x such that P( x) is true." An element a belongs to this set if P( a) is true; otherwise a does not belong to the set. If we denote the set {xIP(x)} by B, then B =. {xIP(x)}. Sets which are specified by listing their elements can also be characterized by means of a predicate. For example, the set {a, b, 1, 9} can be defined as {xl(x = a) V(x = b) v(x = 1) V(x = 9)} where the symbol V denotes a logical disjunction (or). A predicate can also NOTATION AND CONCEPTS FOR LANGUAGES AND GRAMMARS 15 contain the logical connectives /\ (logical conjunction or and), , (logical nega tion), (if then), (if and only if), and the relational operators <, < , = , =1=, > , and >. Furthermore, a predicate can also contain the existential quantifier (3, meaning" there exists") and the universal quantifier ('V, representing "for every"). Although it is possible to characterize any set by a predicate, it is sometimes convenient to specify sets by another method, such as C = {1,3,5,...} S = {a,a 2 ,a 3 ,...} In this representation the missing elements can be determined from the elements present and from the context. The number of distinct elements present in a set may be finite or infinite. We call a set finite if it contains a finite number of distinguishable elements; otherwise a set is infinite. Note that no restriction has been placed on the objects that can be members of a set. It is not unusual to have sets whose members are themselves sets, such as A = {O, {a, b}, 1, { p}}. It is important, however, to distinguish between the set { p }, which is an element of A, and the element p, which is a member of { p } but not a member of A. For any two sets A and B, if every element of A is an element of B, then A is called a subset of B, or A is said to be included in B, or B includes A. Symbolically, this relation is denoted by A c B, or equivalently by B :) A. For any two sets A and B, note that A c B does not necessarily imply that B c A except for the following case. Two sets A and B are said to be equal iff (if and only if) A c Band B c A so that A = B. The set A is called a proper subset of a set B if A c B and A =1= B. This relation is represented by A c B. We now introduce two special sets; the first includes every set under discussion while the second is included in every set under discussion. A set is called a universal set (E) if it includes every set under discussion. For example, the universal set for natural numbers may be E = {O, 1,2,. . . }. A set which does not contain any element is called an empty set or a null set. An empty set will be denoted by </>. Given any set A, we know that the null set </> and the set A are both subsets of A. Also for any element pEA, the set { p } is a subset of A. Similarly, we can consider other subsets of A. Rather than finding individual subsets of A, we would like to say something about the set of all subsets of A. For any set A, a collection or family of all subsets of A is called the power set of A. The power set of A is denoted by p(A) or 2 A , so that p(A) = 2 A = {xix c A}. We now introduce the concept of an indexed set. Let J = {Sl' S2, S3,. . . } and A be a family of sets A = {ASl' A S2 ' A S3 ' . . .} such that for any Sj E J there corresponds a set As E A, and also As = As iff Sj = S J o . A is then called an I I J indexed set, J the index set, and any subscript such as the s j in A s is called an I index. 16 THE THEORY AND PRACTICE OF COMPILER WRITING An indexed family of sets can also be written as A = {A j } j E J. In particular, if J = {1,2,3,...}, then A = {A 1 ,A 2 ,A 3 ,...}. Also, if J = {1,2,3,...,n}, then A = {A I' A 2' A 3' . . . , An} = {A j } j E I where In = {I, 2, . . . , n } . For a set S n containing n elements, the power set p(S) is written as the indexed set p(S) = { Bj } j E J where J = {O, 1,2, . . . , 2 n  I}. Let us now consider some operations on sets. In particular, we emphasize the set operations of intersection, union, and complement. Using these operations, one can construct new sets by combining the elements of given sets. The intersection of any two sets A and B, written as A n B, is the set consisting of all elements which belong to both A and B. Symbolically, A n B = {xix E A /\ x E B} For any indexed set A = {Aj}jEJ' n Aj = {xix E Aj for every i E J} iEJ For J = In = {1,2,...,n}, we can write n n Aj = n Aj = Al n A 2 n ... nA n i=1 iEl n Two sets A and B are called disjoint iff A n B = </>. A collection of sets is called a disjoint collection if, for every possible pair of sets in the collection, the two sets are disjoint. The elements of a disjoint collection are said to be mutually disjoint. Let A be an indexed set A = {A j } j E J. The set A is a disjoint collection iff A j n A j = </> for all i, j E J, i =1= j. For any two sets A and B, the union of A and B, written as A U B, is the set of all elements which are members of the set A or the set B or both. Symbolically, it is written as A U B = {xix E A V X E B}. For any indexed set A = {Aj}jEJ' UAj= {xlxEAjforatleastoneiEJ} iEJ For J = In = {I, 2, 3, . . . , n}, we may write n U A.=A UA U ... UA I 1 2 n i= 1 The relative complement of B in A (or of B with respect to A), written as A  B, is the set consisting of all elements of A which are not elements of B, that IS, A  B = {xix E A /\ x B} The relative complement of B in A is also called the difference of A and B. The relative complement of A with respect to the universal set E, that is, E  A, is called the absolute complement of A. So far we have been concerned with sets, their equality, and operations on sets to form new sets. We now introduce the notion of an ordered pair. An ordered pair consists of two objects in a given fixed order. Note that an ordered pair is not a set consisting of two elements. The ordering of the two objects is important. We denote an ordered pair by (x, y). NOTATION AND CONCEPTS FOR LANGUAGES AND GRAMMARS 17 The equality of two ordered pairs (x, y) and (u, v) is defined by (x,y) = (u,v) <=> «x = u) I\(y = v)) where the symbol "<=> " denotes logical equivalence; that is, A <=> B means that A is equivalent to B. At this point we want to also define the term "imply." A is said to imply B written A B, if and only if A + B is always true. In mathematics, the notations A + B and A B are often used interchangeably. The idea of an ordered pair can be extended to define an ordered triple and, more generally, an ntuple. We write an ntuple as (x I' X 2' . . . , X n). An important idea in set theory is the notion of a cartesian product. Given two sets A and B, the set of all ordered pairs such that the first member of the ordered pair is an element of A and the second member is an element of B is called the cartesian product of A and B and is written as A X B. Accordingly A xB= {(x,Y)I(XEA)I\(yEB)} This notion of cartesian product can be extended to any finite number of sets. Let A = {A j } j E I be an indexed set and In = {I, 2, . . ., n}. We denote the cartesian n product of the sets AI' A 2 , . . ., An by X A. = A X A 2 X ... xA I I n jEI n which is defined recursively as X Aj = Al jE II and i Ai = (iE _l Ai) X Am form = 2,3,...,n Our definition of cartesian product of n sets is related to the definition of ntuples in the sense that Al XA 2 X ... XA n = {(X I ,X 2 ,...,x n )l(x i E AI) l\(x 2 E A 2 ) 1\ ... I\(x n E An)} As an example, consider the following sets: Al = {1,3}, A 2 = {3,4}, A3 = {1,3,4,6} Several set operations are exhibited in the following: Al U A 2 = {1,3,4} Al n A3 = {1,3} 3 U Aj = {I, 3,4,6} ;= 1 3 nA j ={3} j= 1 A 3 A I ={4,6} p(A I ) = {cp, {I}, {3}, {I, 3}} Al X A 2 = {(I, 3), (1,4), (3,3), (3,4)} 18 THE THEORY AND PRACTICE OF COMPILER WRITING In this section we have introduced the basic concepts of set theory. We now turn our attention to the notions of relations and orderings. EXERCISES 21.1 1 Give another description of the following sets and indicate those which are infinite sets. ( a) {x I x is an in teger and 5 x 12} ( h) {2, 4, 8, .. . } «(') All the countries of the world 2 Given S = {2,a,{3},4} and R = {{a},3,4,1}, indicatewhcthcr the following arc true or false. (a) {a}ES (b) {a}ER (c) {a,4,{3}} S (d) {{a},1,3,4} c R (e) R = S (f) {a} S (g) {a} R (h)cpcR (i) cp {{a}} R E (j) {CP} S (k) cp E R (/) cp {{3},4} 3 Show that (R S) /\ (S C Q) R c Q is always true. Is it correct to replace R c Q by R Q? Explain your answer. 4 Give the power sets of the following. (a) {a,{b}} (b) {l,CP} (c) {X, Y, Z} 5 Given A = {x I x is an integer and 1 x 5}, B = {3, 4,5, 17}, and C = {I, 2, 3, .. . }, find A n B, A n C, A u B, and A U C. 6 Show that A A u B and A n B A. 7 Show that A B A u B = B. 8 If S = {a,b,c}, find nonempty disjoint sets Al and A 2 such that Al u A 2 = S. find other solutions to this problem. 9 The symmetric difference of two sets A and B is the set A + B defined by A + B = (A  B) U (B  A). Given A = {2,3,4}, B = {1,2}, and C = {4,5,6}, find A + B, B + C, A + B + C, and (A + B) + (B + C). Show that A + B + C is associative. 10 Give examples of sets A, B, C such that A u B = A u C, but B =1= C. 11 Write the members of {a,b} X {1,2,3}. 12 Write A X B X C, B 2 , A 3 , B 2 X A, and A X B where A = {I}, B = {a,h}, and C = {2,3}. 13 Show by means of an example that A X B =1= B X A and (A X B) X C =1= A X (B X C). 14 Show that for any two sets A and B p(A) u p(B) p(A U B) p(A) n p(B) = p(A n B) Show by means of an example that p(A) U p(B) =1= p(A U B) NOTATION AND CONCEPTS FOR LANGUAGES AND GRAMMARS 19 15 Prove the identities AnA=A Ancp=cp AnE=A and AuE=E 16 Show that A X(BnC)=(A XB)n(A XC). 17 Show that A X B = B X A (A = cp) v (B = cp) v (A = B) 18 Show that (A n B) u C = A n (B U C) iff C A 19 Show that (A  B)  C = (A  C)  (B  C) 20 Prove that (A n B) X (C n D) = (A XC) n (B X D) 21.2 Relations The concept of a relation is a basic concept in mathematics as well as in everyday life. Associated with a relation is the act of comparing objects which are related to one another. The ability of a computer to perform different tasks based upon the result of a comparison is an important attribute used several times during the execution of a typical program. In this subsection we first formalize the concept of a relation and then discuss methods of representing a relation by using a matrix or its graph. The relation matrix is useful in determining the properties of a relation and also in representing a relation on a computer. Various basic properties of a relation are given, and certain important classes of relations are in trod uced. The word "relation" suggests some familiar examples of relations such as the relation of father to son, sister to brother, or uncle to nephew. Familiar examples in arithmetic are relations such as less than, greater than, or that of equality between two real numbers. We also know the relation between the area of an equilateral triangle and the length of one of its sides and the area of a square and the length of a side. These examples suggest relationships between two objects. Throughout the discussion we consider relations, called binary relations, between a pair of objects. Any set of ordered pairs defines a binary relation. We call a binary relation simply a relation. It is sometimes convenient to express a particular ordered pair, say, (x, y) E , where R is a relation, as xRy. In mathematics, relations are often denoted by special symbols rather than by capital letters. A familiar example is the relation "less than" for real numbers. This relation is denoted by <. In fact, < should be considered as the name of a set whose elements are ordered pairs. More precisely the relation < is < = {(x, y) Ix, yare real numbers and x is less than y} Let S be a binary relation. The set D(S) of all objects x such that for some y, (x, y) E S is called the domain of S, that is, D(S) = {xl(3y)«x,y) E S)} where the symbol 3 denotes existential quantification. Similarly, the set R(S) of all objects y such that for some x, (x, y) E S is called the range of S, that is, R(S) = {YI(3x)«x,y) E S)} 20 THE THEORY AND PRACTICE OF COMPILER WRITING Since a relation has been defined as a set of ordered pairs, it is therefore possible to apply the usual operations of sets to relations as well. The resulting sets will also be ordered pairs and will define some relations. If Rand S denote two relations, then R n S defines a relation such that x(R n S)y xRy /\ xSy Similarly, R U S is a relation such that x(R U S)y xRy V xSy Also x(R  S)y xRy /\ x$y where x$y denotes that x is not related to y in relation S. A binary relation R in a set X is reflexive if, for every x E X, xRx, that is, (x, x) E R. The relation R is symmetric if, for every x and y in X, whenever xRy, then yRx. R is said to be transitive if, for every x, y, and z in X, whenever xRy and yRz, then xRz. R is irreflexive if, for every x E X, (x, x) R. Finally, R is said to be antisymmetric if, for every x and y in X, whenever xRy and yRx, then x = y. Several important classes of relations having one or more of the properties gi ven here will be discussed later in this subsection. A relation R from a finite set X to a finite set Y can also be represented by a matrix called the relation matrix of R. Let X = {Xl' X2'...' x m }, Y = {YI' Y2,..., Yn}' and R be a relation from X to Y. The relation matrix can be obtained by first constructing a table whose rows are labeled by successive elements of X and whose columns are labeled by successive elements of Y. If xiRYJ' then we enter a 1 in the ith row and jth column. If xp$yq, then we enter a zero in the pth row and the qth column. As a special case, consider m = 2, n = 3, and R given by R = {( X I' YI)' (x 2' YI)' (x 2' Y3) } The required table for R is Table 21. If we assume that the elements of X and Y appear in a certain order, then the relation R can be represented by a matrix whose elements are Is and Os. This matrix can be written down from the table constructed or can be defined in the Table 21 YI Y2 Y3 Xl 1 0 0 x2 1 0 1 NOTATION AND CONCEPTS FOR LANGUAGES AND GRAMMARS 21 following manner: r. . = ( 1, IJ 0, if x.R y . I J if xi$.Yj where r ij is the element in the ith row and the jth column. The matrix obtained this way is the relation matrix. If X and Y have m and n elements, respectively, then the matrix is an m X n matrix. For the relation R just given, the.. relation matrix is [ o o ] One not only can write a relation matrix when a relation R is given but also obtain the relation if the relation matrix is given. Throughout the remainder of this subsection, unless otherwise stated, we will assume X = Y; that is, the relations are defined in a set X. Thus the relation matrices are square. A relation matrix reflects some of the properties of a relation in a set. If a relation is reflexive, all the diagonal entries must be 1. If a relation is symmetric, the relation matrix is symmetric. If a relation is an tisymmetric, its matrix is such that if r ij = 1, then 'j.i = 0 for i =1= j. A relation can also be represented pictorially by drawing its graph. Although we shall introduce some of the concepts of graph theory which are discussed in later chapters, here we shall use graphs only as a tool to represent relations. Let R be a relation in a set X = {Xl' X2'. . ., x n }. The elements of X are represented by points or circles called nodes. The nodes corresponding to Xi and x j are labeled Xi and Xj' respectively. These nodes may also be called vertices. If xiRx j , that is, if (Xi' Xj) E R, then we connect nodes Xi and x j by means of an arc and put an arrow on this arc in the direction from Xi to Xi" When all the nodes correspond ing to the ordered pairs in R are connected by arcs with proper arrows, we get a directed graph of the relation R. If xiRx j and xjRx i , then we draw two arcs between Xi and xi" For the sake of simplicity, we may replace the two arcs by one arc with arrows pointing in both directions. If xiRx i , we get an arc which starts from node Xi and returns to Xi' Such an arc is called a sling. From the graph of a relation it is possible to observe some of its properties. Several examples are given in Fig. 21. Another very important notion which is used in later chapters is that of a partition. Let S be a given set and A = {AI' A 2 , . . . , An} where each A" i = 1,2,..., n, is a subset of S and U == I A, = S, then the set A is called a covering of S, and the sets AI' A 2 , . . . , An are said to cover S. If, in addition, the elements of A, which are subsets of S, are mutually disjoint, then A is called a partition of S, and the sets AI' A 2 , . . . An are called the blocks of the partition. We now proceed to define an important class of relations which partitions a set. A relation R in a set X is called an equivalence relation if it is reflexive, symmetric, and transitive. If R is an equivalence relation in a set X, then D( R), the domain of R, is X itself. Therefore, R will be called a relation on X. Some 22 THE THEORY AND PRACTICE OF COMPILER WRITING y y o y Z x x yRx yRy XRZAZRxAyRz y Z y x XRYAXRzAZRy XRXAXRYAyRx Figure 21 Graphs of relations. examples of equivalence relations are equality of numbers on a set of real numbers, similarity of triangles on a set of triangles, and the relation of lines being parallel on a set of lines in a plane. Another example of an equivalence relation which will be used in the discussion on hashing functions (see Chap. 8) is that of equality in a modular number system. Let I denote the set of all positive integers, and let n be a positive integer. For x E I and y E I, we define R as R = {(x, Y)lx  Y is divisible by n} Note that "x  y is divisible by n" is equivalent to the statement that both x and y have the same remainder when each is divided by n. Perhaps the most important idea about relations that is used in formal parsing techniques involves taking the" transitive closure" of a relation. We now develop this idea. Since a binary relation is a set of ordered pairs, the usual operations such as union and intersection on these sets produce other relations. We now consider another operation on relationsrelations which are formed in two or more stages. Familiar examples of such relations are the relation of being a nephew or a brother's or sister's son, the relation of an uncle or a father's or mother's brother, and the relation of being a grandfather or a father's or mother's father. These relations can be produced in the following manner. Let R be a relation from X to Y and S be a relation from Y to Z. Then a relation written as R 0 S is called a composite relation of Rand S where RoS = {(x,z)lx EX /\ z E Z /\(3y)(y E Y /\(x,y) E R /\(y,z) E S)} The operation of obtaining R 0 S from Rand S is called composition of relations. NOTATION AND CONCEPTS FOR LANGUAGES AND GRAMMARS 23 R s R 0 S x y z x . Z x, x, X2 X2 X3 X3 X40 ° x4 x5 x5 Figure 22 Relations R, S, and R 0 S. Note that R 0 S is empty if the intersection of the range of R and the domain of S is empty. R 0 S is nonempty if there is at least one ordered pair (x, y) E R such that the second member y E Y of the ordered pair is a first member in an ordered pair in S. For the relation R 0 S, the domain is a subset of X and the range is a subset of Z. In fact, the domain is a subset of the domain of R, and its range is a subset of the range of S. From the graphs of Rand S one can easily construct the graph of R 0 S. As an example, see Fig. 22. The operation of composition is a binary operation on relations, and it produces a relation from two relations. The same operations can be applied again to produce other relations. For example, let R be a relation from X to Y, S a relation from Y to Z, and P a relation from Z to W. Then R 0 S is a relation from X to Z. We can form (R 0 S)o P, which is a relation from X to W. Similarly, we can also form R o(S 0 P), which again is a relation from X to W. Let us assume that (R 0 S)o P is nonempty, and let (x, y) E R, (y, z) E S, and (z, w) E P. This assumption means (x, z) E R 0 S and (x, w) E (R 0 S)o P. Of course, (y, w) E Sop and (x, w) E R o(S 0 P), which shows that (RoS)oP=Ro(SoP) This result states that the operation of composition on relations is associative. We 24 THE THEORY AND PRACTICE OF COMPILER WRITING may delete the parentheses in writing (R 0 S) 0 P, so that (RoS)oP = Ro(Sop) = RoSoP We know that the relation matrix of a relation R from a set X = {Xl' X 2 , . . . , Xm} to a set Y = {Yl, Y2, . . . , Yn} is given by a matrix having m rows and n columns. We shall denote the relation matrix of R by MR. M R has entries which are Is and Os. Similarly the relation matrix Ms of a relation S from the set Y to a set Z = {Zl' Z2'. . ., zp} is an n X p matrix. The relation matrix of R 0 S can be obtained from the matrices M Rand Ms in the following manner. From the definition it is clear that (Xi' Z k) E R 0 S if there is at least one element of Y, say, Yj' such that (Xi' Yj) E Rand (Yj' Zk) E S. There may be more than one element of Y which has properties similar to those of Yj' for example, (Xi' Yr) E Rand (Yr' Zk) E S. In all such cases, (Xi' Zk) E R 0 S. Thus when we scan the ith row of M R and kth column of Ms, and we come across at least one j, such that the entries in the jth location of the row as well as the column under consideration are Is, then in this instance, the entry in the ith row and k th column of M R 0 s is also 1; otherwise it is o. Scanning a row of M R along with every column of Ms gives one row of M R 0 s. In this way, we can obtain all the rows of M R 0 s. In general, let the relations A and B be represented by n X m and m X r matrices, respectively. Then the composition A 0 B which we denote by the relation matrix C is expressed as m Coo = V aO k /\ b k o i = 1,2,...,n; j = 1,2,...,r IJ I J k=l where a ik /\ b kj and V := 1 indicate bitANDing (i.e., 1 /\ 0 = 0 /\ 1 = 0 /\ 0 = 0, 1 /\ 1 = 1) and bitORing (i.e., 1 V 1 = 1 V 0 = 0 V 1 = 1,0 V 0 = 0), re spectively. Let us now consider some distinct relations R 1 , R 2 , R 3 , R4 in a set X = { a, b, c} given by Rl = {(a,b),(a,c),(c,b)} R 2 = {(a,b),(b,c),(c,a)} R3 = {(a,b), (b,c), (c,c)} R4 = {(a,b), (b,a), (c,c)} Denoting the composition of a relation by itself as R 0 R = R 2 R 0 R 0 R = R 0 R 2 = R 3 R 0 R m  l = Rm let us write the powers of the given relations. Clearly Rf= {(a,b)} Ri=cf> Ri=cf> R = {(a,c), (b,a), (c,b)} R = {(a, a), (b,b), (c,c)} R 4  R R 5 = R 2 R 6  R 3 2 2 2 2 2 2 R = {(a,c),(b,c),(c,c)} =R =Rj=R R = {(a,a),(b,b),(c,c)} R =R4 R 5  R 2 4  4 NOTATION AND CONCEPTS FOR LANGUAGES AND GRAMMARS 25 Given a finite set X, containing n elements, and a relation R in X, we can interpret Rm (m = 1,2,...) in terms of its graph. This interpretation is done for a number of applications throughout the text. With the help of such an interpre tation or from the examples given here, it is possible to say that there are at most n distinct powers of R, for Rm, m > n, that can be expressed in terms of R, R 2, . . . , R n. Our next step is to construct the relation in X given by R+= R U R 2 U R 3 U ... Naturally, this construction will require only a finite number of powers of R to be calculated, and these calculations can easily be performed by using the matrix representation of the relation R and the Boolean multiplication of these matrices. Let us now see what the corresponding relations R:, R;, R;, and R; are R: = Rl U Rl U Ri ... = Rl R; = R 2 U R U R . .. = R 2 U R U R = {(a, b), (b,c), (c,a), (a,c), (b,a), (c,b), (a, a), (b,b), (c,c)} R; = {(a, b), (b,c), (c,c), (a,c)} R; = {(a,b),(b,a),(c,c),(a,a),(b,b)} Observe that the relations R:, R; , . . ., R; are all transItIve and that Rl c Ri, R 2 c R;,..., R4 C R;. From the graphs of these relations one can easily see that R is obtained from R; (i = 1,2,3,4) by adding only those ordered pairs to R i such that R: is transitive. We now define R + in general. Definition 21 Let X be any finite set and R be a relation in X. The relation R+= R U R 2 U R 3 U ... in X is called the transitive closure of R in X. Theorem 21 The transitive closure R + of a relation R in a finite set X is transitive. Also for any other transitive relation P in X such that R c P, we have R + c P. In this sense, R + is the smallest transitive relation contain ing R. Based on this theorem, the transitive closure of the relation matrix for some relation A can easily be computed by using the following algorithm due to Warshall. Procedure WARSHALL (A, n, P). Given the relation matrix A with n rows, the following steps produce the transitive closure of A, which is denoted by P. The variables i, j, and k are local integer variables. 1. [Ini tialize] P A 2. [Perform a pass] Repeat through step 4 for k = 1, 2,...,n 3. [Process rows] Repeat step 4 for i = 1, 2,...,n 26 THE THEORY AND PRACTICE OF COMPILER WRITING 4. [Process columns] Repeat for j = 1, 2,...,n Pij Pij V (Pik /\ Pkj). 5. [Finished] Exit o To show that this algorithm produces the required matrix, note that step 1 produces a matrix in which Pij = 1 if there is a path of length 1 from Vi to '1. Assume that for a fixed k, the intermediate matrix P produced by steps 3 and 4 of the algorithm is such that the element in the ith row and jth column in this matrix is 1 if and only if there is a path from Vi to '1 which includes only nodes from v 1 , v 2 ,,,.,V k as intermediate nodes. Now with an updated value of k, we find that Pij = 1 either if Pij = 1 in an earlier step or if there is a path from Vi to v j which traverses through V k + 1. This means that Pij = 1 if and only if there is a path from Vi to v j which includes only nodes from v 1 , V 2 ,...,V k + 1 as intermediate nodes. Instances of transitive closures of relations will be given in Chaps. 6 and 7, where such relations are derived from a grammar and used in the parsing phase. Also, the transitive closure of a graph of a program can be used to perform certain optimizations in that program. This aspect of compiler writing is the topic of Chaps. 12 and 13. In terminating this subsection we introduce the notion of the converse of a relation. Given a relation R from X to Y, a relation R from Y to X is called the converse of R, where the ordered pairs of R are obtained by interchanging the members in each of the ordered pairs of R. This means, for x E X and y E Y, that xRy yRx. _ From the definition of R it follows that R = R. The relation matrix Mk of R can be obtained by simply interchanging the rows and columns of MR. Such a matrix is called the transpose of MR. Therefore, M k = transpose of M R The graph of R is also obtained from that of R by simply reversing the arrows on each arc. We now consider the converse of a composite relation. For this purpose, let R be a relation from X to Yand S be a relation from Y to Z. Obviously, R is a relation from Y to X, S from Z to Y; R 0 S is a relation from X to Z, and R 0 S is a relation from Z to X. Also the relation So R is from Z to X. We now show that RoS=SoR If xRy and ySz, then x(R 0 S)z and z(R 0 S)x. But zSy and yRx, so that z(S 0 R)x. This is true for any x E X and z E Z; hence the required result. The same rule can be expressed in terms of the relation matrices by saying that the transpose of M R 0 s is the same as the matrix Ms 0 k. The matrix Ms 0 k can be obtained from the matrices Ms and M k , which in turn can be obtained from the matrices Ms and MR. NOTATION AND CONCEPTS FOR LANGUAGES AND GRAMMARS 27 Later in the text, the concept of a relation will be applied to a grammar. In fact there are several interesting relations which can be obtained from a grammar. Many of the operations which are to be performed on these relations have been discussed in this subsection. EXERCISES 21.2 1 Give an example of a relation which is neither reflexive nor irreflexive. 2 Give an example of a relation which is both symmetric and antisymmetric. 3 If relations Rand S are both reflexive, show that R U Sand R n S are also reflexive. 4 If relations Rand S are reflexive, symmetric, and transitive, show that R n S is also reflexive, symmetric, and transitive. 5 Determine whether the following relations are transitive: Rl = {(1,1)} R 2 = {(1,2), (2,2)} R3 = {(1,2), (2,3), (1,3), (2,1)} 6 Given S = {1,2,3,4} and a relation R on S defined by R = {(1,2), (4,3), (2,2), (2,1), (3,1)} show that R is not transitive. Find a relation Rl d R such that Rl is transitive. Can you find another relation R 2 d R which is also transitive? 7 Given S = {I, 2,..., 10} and a relation R on S where R = {(x,Y)lx + Y = 10} what are the properties of the relation? 8 Let R be a relation on the set of positive real numbers so that its graphical representation consists of points in the first quadrant of the cartesian plane. What can we expect if R is (a) reflexive, (b) symmetric, and (c) transitive? 9 Let L denote the relation" less than or equal to" and D denote the relation" divides," where xDy means "x divides y." Both Land D are defined on the set {I, 2, 3, 6}. Write Land D as sets, and find L n D. 10 Show that the relations L and D given in Exercise 9 are both reflexive, antisymmetric, and transitive. Give another example of such a relation. Draw the graphs of these relations. 11 Let R denote a relation on the set of ordered pairs of positive integers such that (x, y) R (u, v) iff xv = yu. Show that R is an equivalence relation. 12 Given a set S = {I, 2, 3,4, 5}, find the equivalence relation on S which generates the partition whose equivalence classes are the spts {I, 2 }, {3}, and {4, 5 }. Draw the graph of the relation. 13 Prove that the relation "congruence modulo m" given by == = {( x, y) I x  y is divisible by m} over the set of positive integers is an equivalence relation. Show also that if Xl == Yl and X 2 == Y2' then (Xl + x2) == (Yl + Y2)' 14 Prov the following equivalences and equalities: (a) k = R (b)R=S k=s (c) R s k s (d) R U S = k u S (e) R n S = k n S 28 THE THEORY AND PRACTICE OF COMPILER WRITING 15 Show that if a relation R is reflexive, then k is also reflexive. Show also that similar remarks hold if R is transitive, irreflexive, symmetric, or antisymmetric. 16 What nonzero entries are there in the relation matrix of R n k if R is an antisymmetric relation in a set X? 17 Given the relation matrix M R of a relation R on the set {a, b, c}, find the relation matrices of k, R 2 = R 0 R, R 3 = R 0 R 0 R, and R 0 k. M R = U r ] 18 Two equivalence relations Rand S are given by their relation matrices M R and Ms. Show that R 0 S is not an equivalence relation. MR=[i i ] Ms = [ r n Obtain equivalence relations R 1 and R 2 on {I, 2, 3} such that RIo R 2 is also an equivalence relation. 19 Using Warshall's algorithm, obtain the transitive closure of the relation whose graph is given in Fig. 23. 20 For the graph of the relation R given in Fig. 24, determine Mk, M R 1\ Mk, and Mk 1\ MR' V1 V4 V5 Figure 23 V1 V3 V5 Figure 24 NOTATION AND CONCEPTS FOR LANGUAGES AND GRAMMARS 29 21.3 Strings In this subsection we are concerned with some elementary properties of strings. There are many interesting properties exhibited by string operations, just as there are interesting properties for arithmetic operations over the natural numbers. To refamiliarize ourselves with some of the properties associated with operations, let us consider the operation of addition on the natural numbers. This operation can be represented, in general, by a functional system in two variables: j(x,y)=x+y where x and yare natural numbers. This system exhibits certain interesting properties. First, the sum of any two natural numbers is a natural number. This property is called closure. Closure is a necessary property for a system (i.e., a set and an operation on that set) to be classified as an algebraic system. Second, (x + y) + z = x + (y + z) = x + y + z when x, y, and z are natural numbers; accordingly the operation of addition is said to be associative. Third, there exists a number i such that for every natural number x, x + i = x. This number is zero and is called the unit element or identity of the additive system. Many other important properties, such as distributivity and commutativity, exist when arith metic operations such as addition and multiplication are applied to the set of natural numbers. We begin a discussion of strings by formally defining a string. To do so, we must introduce the notion of an alphabet and the operation of concatenation. Simply stated, an alphabet V is a finite nonempty set of symbols. The set V = {a,b, c,..., z} is a familiar example of an alphabet and {a, /3, y, E} is a fourchara ter alphabet (which is a sub alphabet of the Greek alphabet). The concatenation of two alphabetic characters, say 'a' and 'b', is said to form a sequence of characters, namely, 'ab'. (Note that henceforth when we refer to a character from an alphabet or a sequence of such characters, they are enclosed in single quote marks.) The operation of concatenation also applies to sequences of characters. For example, 'ab' concatenated with 'ab' is 'abab'. We denote the concatenation operator by the special symbol o. This allows us to write expressions such as 'ab' 0 'a', which is identical in value to 'aba'. A string (or sentence) over an alphabet V is either a letter from the alphabet V or a sequence of letters derived from the concatenation of zero or more characters from the alphabet V. Examples of strings over an alphabet V = {a, b, c} are 'a', 'ca', 'ccba', and 'bbb'. Let V 0 V = V 2 designate all strings of length 2 on V, V 0 V 0 V = V 2 0 V = V 3 designate all strings of length 3 on V, and, in general, V 0 V 0 ... 0 V = V n designate all strings of length n on V. Then the closure of V, denoted as V+, is defined as V+= V U V 2 U V 3 U ... For completeness, a special string E called the empty (or null) string is often combined with V+ to form the closure set V* of V. That is, V* = {E} U V U V 2 U V 3 U . .. = {E} U V+. The string E has the identity property (i.e., x 0 E = 30 THE THEORY AND PRACTICE OF COMPILER WRITING EO X = X for any string x which is an element of V*), and it is called the identity element in the system formed by the set V* and the operation of concatenation. Associativity is another property of this system (that is, (x 0 y) 0 z = x 0 (y 0 z) = x 0 y 0 z for x, y, Z E V*). As an example, consider the set of strings V* that can be generated from an alphabet V = {x, y}. Some subsets of V* are V 2 = { 'xx', 'xy', 'yx', 'yy'} V 3 = {'xxx', 'xxy', 'xyx', 'xyy', 'yxx', 'yxy', 'yyx', 'yyy'} V 4 = { 'xxxx', 'xxxy', 'xxyx', 'xxyy', 'xyxx', 'xyxy', 'xyyx', 'xyyy', 'yxxx', 'yxxy', 'yxyx', 'yxyy', 'yyxx', 'yyxy', 'yyyx', 'yyyy'} We will, on many occasions, refer to the closure set V* of an alphabet. As another example of a string, let us examine FORTRAN. The FORTRAN alphabet consists of 26 letters, 10 digits, and a set of special characters, such as '( " ')', ',', '= " and '+ '. It is only these characters that are used in writing a FORTRAN program. Hence a program can be viewed as the concatenation of characters over an alphabet to yield an arbitrarily long string. In Chap. 4, we see that the problem of ensuring that only the proper set of characters appears in a program is handled by the scanning phase of a compiler. Let x, y, and z be strings over an alphabet where z = xy. The string x is called a prefix or head of z. If y =1= E, then x is called a proper prefix or proper head. Similarly, y is called a suffix or tail of z, and if x =1= E, then y is called a proper suffix or proper tail. Some of the concepts introduced here are used in the next section, where the notions of grammars and languages are introduced. 22 DISCUSSION OF GRAMMARS Programming languages must be preciiely defined. Unf rtunately, for some of the earlier programming languages the existence of a particular compiler finally provided the precise definition of such a language. The proper specification of a programming language involves the definition of the following: 1. The set of symbols (or alphabet) that can be used to construct correct programs 2. The set of all syntactically correct programs 3. The" meaning" of all syntactically correct programs In this section we shall be concerned with the first two items in the specification of programming languages. A language L can be considered a subset of the closure (including the empty string) of an alphabet. The language consisting of this closure set is not particu NOTATION AND CONCEPTS FOR LANGUAGES AND GRAMMARS 31 larly interesting, since it is too large. Our definition of a language L is a set of strings or sentences over some finite alphabet V T , so that L c Vi. How can a language be represented? A language consists of a finite or an infinite set of sentences. Finite languages can be specified by exhaustively enumerating all their sentences. However, for infinite languages, such an enumer ation is not possible. On the other hand, any means of specifying a language should be finite. One method of specification which satisfies this requirement uses a generative device called a grammar. A grammar consists of a finite nonempty set of rules or productions which specify the syntax of the language. As well, a grammar imposes structure on the sentences of a language. Many grammars may generate the same language but impose different structures on the sentences of that language. The study of grammars constitutes an important subarea of computer science called formal language theory. This area emerged in the mid 1950s as a result of the efforts of Noam Chomsky (1959), who gave a mathemati cal model of a grammar in connection with his study of natural languages. In 1960, the concept of a grammar became important to programmers because the syntax of ALGOL 60 was described by a grammar. A second method of language specification is to have a machine, called an acceptor, determine whether a given sentence belongs to the language. This approach is discussed further in Chaps. 4, 6, and 7 along with some very interesting and important relationships that exist between grammars and accep tors. In this section we are concerned with a grammar as a mathematical system for defining languages and as a device for giving some useful structure to sentences in a language. It was mentioned earlier that a grammar imposes a structure on the sentences of a language. For a sentence in English such a structure is described in terms of subject, predicate, phrase, noun, and so on. On the other hand, for a program, the structure is given in terms of procedures, statements, expressions, etc. In any case, it may be desirable to describe all such structures and to obtain a set of all the correct or admissible sentences in a language. For example, we may have a set of correct sentences in English or a set of valid ALGOL programs. The grammatical structure of a language helps us determine whether a particular sentence does or does not belong to the set of correct sentences. The gramlnatical structure of a sentence is generally studied by analyzing the various parts of a sentence and their relationships to one another. Consider the sentence" the cat ate a mouse." Its structure (or parse) is shown in Fig. 25. This diagram of a parse displays the syntax of a sentence in a manner similar to a tree and is therefore called a syntax tree. Each node in the diagram represents a phrase of the syntax. The words such as "the" and "cat" are the basic symbols, or primitives, of the language. The syntax of a small subset of the English language can be described by using the symbols S: sentence V: verb 0: object A: article N: noun SP: subject phrase VP: verb phrase NP: noun phrase 32 THE THEORY AND PRACTICE OF COMPILER WRITING in the following rules: S + SP VP N+ ' mouse' SP + A N N+ ' tree ' A+ 'a ' VP + V 0 A+ ' the' V+ ' ate' N+ ' monkey' V+ ' climbs' N+ 'banana' o + NP N+ 'cat' NP + A N These rules state that a sentence is composed of a "subject phrase" followed by a "verb phrase"; the" subject phrase" is composed of an "article" followed by a " noun"; a verb phrase is composed of a "verb" followed by an "object"; and so on. The structure of a language is discussed by using symbols such as "sentence," "verb," "subject phrase," and" verb phrase," which represent syntactic classes of elements. Each syntactic class consists of a number of alternative structures, and each structure consists of an ordered set of items which are either primitives (of the language) or syntactic classes. These alternative structures are called produc tions or rules of syntax, or replacement rules. For example, the production, S + SP VP defines a "sentence" to be composed of a "subject phrase" followed by a "verb phrase." The symbol + separates the syntactic class" sentence" from its definition. The syntactic class and the arrow symbol along with the interpreta tion of a production enable us to describe a language. A system or language which describes another language is known as a metalanguage. The metalanguage used to teach German at most universities is sentence the cat verb phrase verb object I noun phrase article noun I I ate a mouse subject phrase article noun Figure 25 Syntax tree for an English sentence. NOTATION AND CONCEPTS FOR LANGUAGES AND GRAMMARS 33 English, while the metalanguage used to teach English is English. The diagram of the parse of a sentence describes its syntax but not its meaning or semantics. At this time, we are mainly concerned with the syntax of a language, and the device which we have just defined to give the syntactic definition of the language is called a grammar. Using the grammatical rules for our example, we can either produce (gener ate) or derive a sentence in the language. A computer programmer is concerned with producing programs which adhere to the productions (grammatical rules) of the language. The compiler of a language, on the other hand, is faced with the problem of determining whether a given sentence (source program) is syntacti cally correct based upon the given grammatical rules. If the syntax is correct, the compiler produces object code. Consider the problem of trying to generate or produce the sentence" the cat ate the mouse" from the set of productions given. It is accomplished by starting first with the syntactic class symbol S and looking for a production which has S to the left of the arrow. There is only one such production, namely, S + SP VP We have replaced the class S by its only possible composition. We then take the string SPVP and look for a production whose lefthand side is SP and then replace it with the righthand side of that production. The application of the only production pos sible produces the string ANVP We next look for a production whose left part is A, and two such productions are found. By selecting the production A + 'the' and upon substituting its righthand side in the string A N VP, we obtain the string the N VP This enumerative process is continued until we arrive at the correct sentence. At this point, the sentence contains only primitive or terminal elements of the language (no classes). A complete derivation or generation of the sentence" the cat ate a mouse" is as follows: S SP VP A N VP the N VP the cat VP the cat V 0 the cat ate 0 the cat ate NP the cat ate A N the cat ate a N the cat ate a mouse 34 THE THEORY AND PRACTICE OF COMPILER WRITING Here the symbol denotes that the string on the righthand side of the symbol can be obtained by applying one replacement rule to the previous string. The rules for the example language can produce a number of sentences. Examples of such sentences are The cat ate a banana. The monkey climbs a tree. The monkey climbs the tree. The banana ate a cat. The last of these sentences, although grammatically correct, does not make sense because of its semantics. This situation is often allowed in the specification of languages. There are many valid FORTRAN and PASCAL programs that do not make sense. It is easier to define languages if certain sentences of questionable validi ty are allowed by the replacement rules. There is another important thing to note concerning the generation of strings (i.e., sentences) from a grammar. We have neglected (for legibility reasons) to put into the terminal symbols of the grammar the blank characters which normally appear between the words in the sentences. Throughout the book, we assume, unless it is obvious from context of the text, that a blank delimits all terminal symbols. The set of sentences that can be generated by the rules of the previous example is finite. Any interesting language usually consists of an infinite set of sentences. As a matter of fact, the importance of a finite device such as a grammar is that it permits the study of the structure of a language consisting of an infinite set of sentences. Let the symbols L, 0, and I denote the classes L: letter, D: digit, and I: identifier. The productions which follow are recursive and produce an infinite set of names because the syntactic class I is present on both the left and the right sides of certain productions. I + L 0+0 0+1 I + 10 I + IL L+a 0+9 L+b L+z I t is easily seen that the class I defines an infinite set of strings or names in which each name consists of a letter followed by any number of letters or digits. This set is a consequence of using recursion in the definition of the productions NOTATION AND CONCEPTS FOR LANGUAGES AND GRAMMARS 35 I 10 and I IL. In fact, recursion is fundamental to the definition of an infinite language by the use of a grammar. Let us now formalize the idea of a grammar and how it is used. For this purpose, let V T be a finite nonempty set of symbols called the terminal alphabet. The symbols in V T are called terminal symbols. The metalanguage which is used to generate strings in the language is assumed to contain a set of syntactic classes or variables called nonterminal symbols. The set of non terminal symbols is denoted by V N , and the elements of V N are used to define the syntax (structure) of the language. Furthermore, the sets V N and V T are assumed to be disjoint. The set V N U V T consisting of nonterminal and terminal symbols is called the vocabul ary of the language. In much of the discussion in this section we use capital letters such as A, B, C, . . ., X, Y, Z to denote non terminal symbols, and SI' S2, . .. to represent the elements of the vocabulary. The strings of terminal symbols are denoted by lowercase letters x, y, z, . . ., while strings of symbols over the vocabulary are given by a, 13, y, . .. . The length of a string a will be denoted by lal. Definition 22 A grammar is defined by a 4tuple G = (V N , V T , S, 4» where V T and V N are sets of terminal and non terminal (syntactic class) symbols, respectively. S, a distinguished element of V N and therefore of the vocabul ary, is called the starting symbol. 4> is a finite nonempty subset of the relation from(VTU VN)*VN(VTU VN)*tO(VTU V N )*.lngeneral,anelement(a,l3) is written as a 13 and is called a production rule or a rewriting rule. For our example defining an identifier, we may write the grammar as G 1 = (V N , V T , S, 4» in which V N = {I,L,D} V T = {a, b, c, d, e, f, g, h, i, j, k, I, m, n, 0, p, q, r, s , t , u, v, w, x, y, z, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 } S=I 4> = {I L, I IL, I ID, L a, L b,..., L z, D 0, D 1,..., D 9} Definition 23 Let G = (V N' V T' S, 4» be a grammar . For (J, E V*, (J is said to be a direct derivative of , written as (J, if there are strings <PI and <P2 (including possibly empty strings) such that = <Pla<P2 and (J = <P113<p2 and a 13 is a production of G. If (J, we may also say that directly produces (J or (J directly reduces to . For grammar G 1 of our example, we have listed in Table 22 some illustrations of direct derivations. The concepts can now be extended to produce a string (J not necessarily directly but in a number of steps from a string . 36 THE THEORY AND PRACTICE OF COMPILER WRITING Table 22 1/; (J Rule used <1>1 <1>2 1 L 1+ L f f LL Lx L+x L f LDL LIL D+l L L LDDL L2DL D+2 L DL Definition 24 Let G = (V N , V T , S, <1» be a grammar. The string produces (J «(J reduces to , or (J is the derivative of ), written as (J, if there are strings <l>o,<I>I""'CPn (n > 0) such that = CPo <1>1,<1>1 CP2""'CPnl CPn = (J. The relation is the transitive closure of the relation . If we let n = 0, we can define the reflexive transitive closure of as * + (J <=> (J or = (J Returning to grammar G 1 , we show that the string a13 is derived from I by following the derivation sequence: I ID IDD LDD aDD aID a13 Note that as long as we have a non terminal symbol in the string, we can produce a new string from it (assuming no rules of the form A + A). On the other hand, if a string contains only terminal symbols, then the derivation is complete, and we cannot produce any further strings from it. Definition 25 A sentential form is any derivative of the unique non terminal symbol S. The language L generat