Principal The Theory and Practice of Compiler Writing

The Theory and Practice of Compiler Writing

ISBN 10:
ISBN 13:
McGraw-Hill computer science series
DJVU, 6.29 MB
Descarga (djvu, 6.29 MB)

You may be interested in


Most frequently terms

You can write a book review and share your experiences. Other readers will always be interested in your opinion of the books you've read. Whether you've loved the book or not, if you give your honest and detailed thoughts then people will find new books that are right for them.

THE THEORY AND PRACTICE OF COMPILER WRITING Jean-Paul Tremblay Paul G. Sorenson Department of Computational Science University of Saskatchewan, Saskatoon Canada McGraw-Hill Book Company New York St. Louis San Francisco Auckland Bogota Hamburg Johannesburg London Madrid Mexico Montreal New Delhi Panama Paris Sao Paulo Singapore Sydney Tokyo Toronto

This book was set in Times Roman by Science Typographers, Inc. The editors were Eric M. Munson, Kaye Pace, Ellen W. MacElree, and Jo Satloff; the production supervisor was Joe Campanella. The drawings were done by Volt Information Sciences, Inc. R.R. Donnelley & Sons Company was printer and binder. THE THEORY AND PRACTICE OF COMPILER WRITING Copyright © 1985 by McGraw-Hill, Inc. All rights reserved. Printed in the United States of America. Except as permitted under the United States Copyright Act of 1976, no part of this publication may be reproduced or distributed in any form or by any means, or stored in a data base or retrieval system, without the prior written permission of the publisher. 234567890DOCDOC898765 ISBN 0-07-0fc,Slbl-2 Library of Congress Cataloging in Publication Data Tremblay, Jean-Paul, date The theory and practice of compiler writing. (McGraw-Hill computer science series) Includes bibliographies and index. 1. Electronic digital computers—Programming. 2. Compiling (Electronic computers) I. Sorenson, P. G. II. Title. III. Series. QA76.6.T734 1985 001.64'2 84-25096 ISBN 0-07-065161-2

Dedie—Dedicated A la memoire de mon pere Philippe Tremblay To my father, Chester, and in memory of my mother, Josephine Sorenson

CONTENTS Preface xvii Chapter 1 Introduction 1 1-1 Programming Languages 1 1-2 Translators 4 1-3 Model of a Compiler 5 Bibliography 11 Chapter 2 Notation and Concepts for Languages and Grammars 13 2-1 Sets and Strings 14 2-1.1 Basic Concepts of Set Theory 14 Exercises 2-1.1 18 2-1.2 Relations 19 Exercises 2-1.2 27 2-1.3 Strings 29 2-2 Discussion of Grammars 30 ; Exercises 2-2 37 2-3 Classification of Grammars 37 Exercises 2-3 42 2-4 Context-Free Grammars and Parsing 42 2-4.1 Syntax Terminology 43 Exercises 2-4.1 50 2-4.2 A View of Parsing 52 Exercises 2-4.2 56 2-4.3 Reduced Grammars and Grammars with e-Rules 57 Exercises 2-4.3 63 2-4.4 Extended BNF Notation 64 Exercises 2-4.4 66 Bibliography 67

X CONTENTS Chapter 3 3-1 3-2 3-3 3-4 3-5 3-6 3-7 3-8 Chapter 4 4-1 4-2 4-3 4-4 Programming-Language Design Overview of the Problem Preliminary Considerations Sources of Ideas Goals and Design Philosophies of Programming Languages 3-4.1 Human Communicatiqp 3-4.2 Prevention and Detection of Errors 3-4.3 Usability 3-4.4 Programming Effectiveness 3-4.5 Compilability 3-4.6 Efficiency 3-4.7 Machine Independence 3-4.8 Simplicity 3-4.9 Uniformity 3-4.10 Orthogonality 3-4.11 Generalization and Specialization 3-4.12 Other Design Philosophies Detailed Design 3-5.1 Microstructure 3-5.2 Expression Structure 3-5.3 Data Structure 3-5.4 Control Structure 3-5.5 Compile Structure 3-5.6 I/O Structure Reduction of Size Pragmatics Comments on the Design of Ada 3-8.1 Evaluation of Ada from a Language Design Viewpoint Usability and Program Effectiveness Machine Independence and Portability Efficiency Modularity and Maintainability Compilability and Compile Structure Simplicity 3-8.2 Ada Programming Support Environment 3-8.3 Software Technology for Adaptable Reliable Systems (STARS) Chapter Exercises Bibliography Scanners The Scanning Process An Elementary Scanner Design and Its Implementation Exercises 4-2 Regular Grammars and Regular Expressions Exercises 4-3 Finite-State Acceptors 4-4.1 Deterministic Finite-State Acceptors 4-4.2 Nondeterministic Finite-State Acceptors 68 69 70 72 74 74 75 78 78 80 81 83 84 84 85 86 86 88 88 92 93 102 113 114 116 116 120 122 122 129 129 129 130 130 131 132 134 135 138 139 141 150 151 155 156 156 157

CONTENTS Xi 4-4.3 Nondeterministic Finite-State Acceptors with e-Transitions 162 Exercises 4-4 165 4-5 Equivalence of Regular Grammars and Finite-State Acceptors 166 Exercises 4-5 169 4-6 Equivalence of Regular Expressions and Finite-State Acceptors 170 Exercises 4-6 176 4-7 A Scanner Generator 177 Exercises 4-7 181 Bibliography 181 Chapter 5 Compile-Time Error Handling 182 5-1 Overview of Error Handling 182 5-2 Error Detection 185 5-2.1 The Nature of Syntax Errors 185 5-2.2 How Errors Are Detected 187 5-2.3 Where Errors Are Detected 189 5-3 Error Reporting 190 5-4 Error Recovery 195 5-4.1 Ad Hoc Recovery Mechanisms 195 5-4.2 Syntax-Directed Recovery 197 5-4.3 Secondary Error Recovery 199 5-4.4 Context-Sensitive Recovery 200 5-5 Error Repair 200 5-5.1 Ad Hoc Repair 201 5-5.2 Syntax-Directed Repair 201 5-5.3 Context-Sensitive Repair 202 5-5.4 Spelling Repair 203 Chapter Exercises 205 Bibliography 206 Chapter 6 Top-Down Parsing 208 6-1 General Top-Down Parsing Strategies 208 6-1.1 Brute-Force Approach 209 Exercises 6-1.1 219 6-1.2 Recursive-Descent Parsers 219 Recursive-Descent Parsing Algorithm 219 Error Detection in Recursive-Descent Parsers 224 Exercises 6-1.2 228 6-1.3 Notions of Top-Down Parsing with Limited Backup 228 6-1.4 Top-Down Parsing with Limited Backup 229 Exercises 6-1.4 234 6-2 Top-Down Parsing with No Backup 235 6-2.1 Notions of Parsing with No Backup 235 6-2.2 Simple LL(\) Grammars 236 Exercises 6-2.2 241 6-2.3 LL(l) Grammars without c-Rules 242 Exercises 6-2.3 248 6-2.4 LL(1) Grammars with c-Rules 249

Xll CONTENTS Exercises 6-2.4 258 6-2.5 Error Handling for LL(1) Parsers 260 Exercises 6-2.5 272 Chapter Exercises 272 Bibliography 273 Chapter 7 Bottom-Up Parsing 274 7-1 Polish Expressions and Their Compilation 275 7-1.1 Polish Notation 276 7-1.2 Conversion of Infix Expressions to Polish Notation 278 7-1.3 Error Detection for Infix Expressions 283 Exercises 7-1 285 7-2 Operator Precedence Grammars 286 7-2.1 Notions and Use of Operator Precedence Relations 287 7-2.2 Formal Definition of Operator Precedence Relations 293 7-2.3 Operator Precedence Parsing Algorithm 294 7-2.4 Precedence Functions in Operator Precedence Parsers 298 7-2.5 Error Detection in Operator Precedence Parsers 300 Exercises 7-2 301 7-3 Simple Precedence Grammars 302 7-3.1 Notions and Use of Precedence Relations 302 7-3.2 Formal Definition of Precedence Relations 306 7-3.3 Parsing Algorithm for Simple Precedence Grammars 309 7-3.4 Precedence Functions for Simple Precedence Grammars 311 7-3.5 Error Recovery for Simple Precedence Parsers 320 7-3.6 Notions of Extended Precedence 332 Exercises 7-3 339 7-4 LR Grammars 341 7-4.1 Concepts and Terminology 341 Exercises 7-4.1 346 7-4.2 LR(0) Parsers 346 Exercises 7-4.2 352 7-4.3 SLR(l) Parsers 353 Exercises 7-4.3 361 7-4.4 Canonical LR(l) Parsers 362 Exercises 7-4.4 369 7-4.5 LALR{\) Parsers 370 Exercises 7-4.5 375 7-4.6 Efficient Generation of Lookahead Sets for LALR(\) Parsers 375 Exercises 7-4.6 383 7-4.7 Representation and Optimization of LR Parsers 383 Sparse-Matrix Representations 384 Optimization Transformations 386 7-4.8 Error Detection and Recovery in LR Parsers 389 Early Methods of Error Recovery 389 Application of Graham-Rhodes Method to LR Parsers 392 Other LR Error-Recovery Methods 410

CONTENTS Xiii 7-5 Comparison of Parsing Methods 410 Bibliography 415 Chapter 8 Symbol-Table-Handling Techniques 418 8-1 Perspective and Motivation 418 8-2 When to Construct and Interact with the Symbol Table 419 8-3 Symbol-Table Contents 422 8-4 Operations on Symbol Tables 426 8-5 Symbol-Table Organizations for Non-Block-Structured Languages 428 8-5.1 Unordered Symbol Tables 428 8-5.2 Ordered Symbol Tables 429 8-5.3 Tree-Structured Symbol Tables 433 8-5.4 Hash Symbol Tables 450 8-6 Symbol-Table Organizations for Block-Structured Languages 464 8-6.1 Block-Structured Language Concepts 465 8-6.2 Stack Symbol Tables 467 8-6.3 Stack-Implemented Tree-Structured Symbol Tables 468 8-6.4 Stack-Implemented Hash-Structured Symbol Tables 471 Chapter Exercises 474 Bibliography 476 Chapter 9 Run-Time Storage Organization and Management 477 9-1 9-2 9-3 Static Storage Allocation Exercises 9-1 Dynamic Storage Allocation 9-2.1 Activation Records 9-2.2 Parameter Area 9-2.3 Display Area 9-2.4 Run-Time Address Calculation 9-2.5 Handling Recursive Procedures Exercises 9-2 Heap Storage Allocation 9-3.1 Implicit Storage Requests 9-3.2 Explicit Storage Requests 9-3.3 Management of Fixed-Length Blocks 9-3.4 Management of Variable-Length Blocks First-Fit Heap Storage-Management Strategy Boundary-Tag Heap Storage-Management Strategy Buddy-System Heap Storage-Management Strategy 9-3.5 Free-as-You-Go Storage Release 9-3.6 Garbage Collection 9-3.7 Compaction Exercises 9-3 Bibliography 477 480 480 482 483 484 487 488 491 492 493 493 495 496 497 501 506 512 513 517 519 520

XIV CONTENTS Chapter 10 10-1 10-2 10-3 10-4 10-5 Chapter 11 n-i 11-2 11-3 11-4 11-5 11-6 Chapter 12 12-1 12-2 12-3 12-4 12-5 Intermediate Forms of Source Programs Polish Notation N-tupk j Notation Abstract Syntax Trees Threaded Code Abstract Machine Code 10-5.1 10-5.2 Portability and Abstract Machine The P-Code Abstract Machine for PASCAL Chapter Exercises Bibliography Semantic Analysis and Code Generation What Is Meant by Semantic Analysis Implicit Stacking in Recursive-Descent Compilation Semantic Stacks in Bottom-Up Compilation Action Symbols in Top-Down Compilation Attributed Translation 11-5.1 11-5.2 Examp 11-6.1 11-6.2 11-6.3 11-6.4 11-6.5 11-6.6 11-6.7 Attributed Translation Grammar L-Attributed Translation Grammar le Language Constructs Declarations Constant Type Variable Declarations Procedure Declarations Expressions Assignment Statements Control Statements Case-Selection Statement Repeat- While Statement For Loop Statement Procedure Calls and Returns Procedure Calls Return Statements and Procedure Termination Input and Output Statements Input Statements Output Statements Compiler Aids Chapter Exercises Bibliography Code Optimization Introduction Basic Blocks Folding Redundant-Subexpression Elimination Optimization within Iterative Loops 12-5.1 12-5.2 Loop Unrolling Frequency Reduction 521 522 523 525 527 531 531 532 534 535 536 537 540 548 554 559 560 563 568 568 570 571 578 579 591 592 593 595 596 598 598 603 603 603 607 607 609 609 610 610 611 612 620 631 635 637

CONTENTS XV 12-5.3 Strength Reduction 646 12-5.4 Combining Loop-Optimization Techniques 651 12-6 Global Optimization through Flowgraph Analysis 655 12-6.1 Flowgraph Construction 655 12-6.2 Flowgraph Analysis 660 Flowgraph-Analysis Problems 661 Flow-Analysis Algorithms 667 12-6.3 Applications to Program Optimization 678 12-6.4 Implementation and Further Considerations 680 Chapter Exercises 681 Bibliography 683 Chapter 13 Machine-Dependent Optimization 684 13-1 Introduction to Machine-Dependent Optimization 684 13-2 Register-Allocation Optimization 686 13-2.1 Register Allocation in Single-Register Machines 686 13-2.2 Register Allocation in Multiregister Machines 694 13-3 Machine Architecture and the Generation of Real Code 702 13-3.1 The PDP-11 703 13-3.2 TheVAX-11 710 13-3.3 The MC68000 716 Chapter Exercises 720 Bibliography 721 Chapter 14 Compiler-Compilers 722 14-1 Introduction to Compiler-Compilers 723 14-2 Examples of Parser Generators 726 14-2.1 YACC: A LALR(l) Parser Generator 727 14-2.2 An Attributed LL(1) Parser Generator 735 14-3 Machine-Independent Code Generation 742 14-3.1 The Production-Quality Compiler-Compiler System 743 Introduction 743 The Formalization of Instruction-Set Processors and TCOL 745 The Code-Generation Process: Its Phases and Organization 749 14-3.2 The Table-Driven Generator of Graham and Glanville 755 Introduction 756 The Target-Machine Description 757 The Code-Generation Process 758 Results and Extensions 760 14-3.3 Other Code-Generation Systems 761 Fraser*s Knowledge-Based Code-Generator Generator 761 The Finite-State Approach of Donegan 763 Bibliography 764 Appendix Algorithmic Notation 767 A-l Format Conventions 768 A-l.l Name of Algorithm 768 A-l.2 Introductory Comment 768

XVI CONTENTS A-1.3 Steps 768 A-1.4 Comments 768 A-2 Statements and Control Structures 768 A-2.1 Assignment Statement 769 A-2.2 If Statement 769 A-2.3 Case Statement 770 A-2.4 Repeat Statement 770 A-2.5 Go To and Exitloop Statements 772 A-2.6 Exit Statement 772 A-2.7 Variable Names 773 A-3 Data Structures 773 A-3.1 Arrays 773 A-3.2 Dynamic Storage 774 A-4 Arithmetic Operations and Expressions 774 A-5 Strings and String Operations 775 A-6 Relations and Relational Operators 775 A-7 Logical Operations and Expressions 776 A-8 Input and Output 776 A-9 Subalgorithms 777 A-9.1 Functions 777 A-9.2 Procedures 778 A-9.3 Parameters 778 Indexes 781 Name Index 781 Subject Index 785

PREFACE This book is intended as a text for a one- or two-semester course in compiler design at the senior undergraduate or introductory graduate level. It can also be used as a self-study and reference book in compiler design. The purpose of this text is to cover the underlying concepts and techniques used in compiler design. Some of these techniques can be used in software engineering or software design. The reader should have at least one year of experience in programming a high-level language and an assembly language. In addition, a familiarity with elementary data structures and discrete mathematics is a definite asset. The text, however, is reasonably self-contained. Chapter 1 contains an overview of compiler design and gives a model for a compiler. Chapter 2 covers the basic notations and concepts of discrete mathematics, grammars, and languages. This material is used in the remainder of the book. Students who are familiar with this material can quickly peruse it or ignore it. They need not have a complete mastery of the concepts on their first reading. They can, however, refer to the material when it is used in subsequent chapters. In Chapter 3 we introduce the elements of programming language design. We attempt to give a unified treatment of the topics in sufficient depth to be useful to persons actually designing a programming language. Much of this material is appearing in a compiler book for the first time. Chapter 4 details the informal and the formal approaches to lexical analyzers (i.e., scanners). Chapter 5 gives an overview of error handling. Although it doesn't contain many detailed algorithms on error recovery and repair strategies, it discusses important factors and considerations of error recovery and repair in general. Detailed algorithms for particular strategies are given in Chapters 6 and 7. Chapter 6 is the first of two chapters on parsing. The chapter deals with top-down parsing. In the first section we examine a number of ad hoc methods. The second section gives a detailed description of syntax-directed translation using LL(\) grammars. xvii

Xviii PREFACE Chapter 7 discusses several syntax-directed bottom-up parsing techniques. The discussion includes several precedence and LR parsers along with associated error detection and recovery strategies. In Chapter 8, first we outline the important functions of a symbol table in the translation process and then give the details of several symbol-table organizations for block-structured and non-block-structured languages. Chapter 9 provides a description of the static, dynamic, explicit, and implicit storage-allocation strategies that are frequently adopted when compiling programs for a wide variety of existing languages. In Chapter 10 we discuss the advantages of using intermediate source forms. Five types of intermediate source forms are examined in detail. Chapter 11 deals with semantic analysis and code generation. This chapter presents four types of syntax-directed processing techniques, which include translation grammars and attribute translation. Chapter 12 presents several machine-independent optimization strategies, including folding, redundant subexpression elimination, optimization within iterative loops, and global optimization through flow analysis. Chapter 13 covers several machine-dependent optimization techniques such as register allocation and peephole optimization Chapter 14 gives a brief overview of compiler-compiler systems. The main focus is on the PQCC and Graham-Glanville systems. Much of this material appears in a compiler book for the first time. As we mentioned earlier, the book can be used in a two-semester course. Depending on the level of such a course, the instructor may wish to supplement the text with reading assignments from the references. Several one-semester courses are possible. For example, the following selection of topics constitutes such a course: Chapters 1 and 2 Chapter 3 (can be omitted if students don't do their own language design) Sections 4-1 and 4-2 Chapter 5 Sections 6-1.2 and 6-2 (except Sec. 6-2.5) Sections 7-4.1, 7-4.2, 7-4.3, and 7-4.6 Sections 8-5.3 and 8-5.4 and parts of Sec. 8-6 Chapter 9 (select one or two methods) Chapter 10 Chapter 11 (select a few methods of code generation) An important aspect of teaching a course in compiler writing is to illustrate the key theoretical concepts. A frequently used strategy for achieving this goal is to have a student design a simple programming language and implement a compiler for this language. Many texts in compiler writing do not, because of size considerations and level of presentation, adequately present details that illustrate the implementation of a compiler. In a separate volume, entitled An Implementation Guide to Compiler Writing (published with McGraw-Hill in 1982), we present,

PREFACE XIX in a case study manner, the development of a small compiler typical of those developed in a first compiler course. Instructors can use this book as a guide to illustrate difficulties that commonly arise when implementing a compiler. It can also be used as a sample document that is similar to what students should produce for their own compilers. It is often necessary for instructors teaching a course in compiler writing to set up their own "toy" compiler as a vehicle for illustrating techniques for implementing a compiler. Also, many of the problems that are encountered in writing a compiler for a newly designed language must be illustrated. In this guide we attempt to fulfill this role. The guide begins with a documented description of a simple program language, GAUSS, which serves as an example for the student. GAUSS is a block-structured language whose design was influenced by ALGOL, PASCAL, PL/I, and FORTRAN. It also contains string-manipulation facilities capable of manipulating variable-length strings. It then illustrates the topics of scanner generation, LL{\) parsing, symbol-table construction, code generation, semantic analysis, error detection, and repair for the GAUSS language. Finally, the guide describes an interpreter for the code generated by the GAUSS compiler and discusses the run-time environment in which the machine code is executed. The code generated by the compiler executes on a hypothetical machine which is stack oriented. ACKNOWLEDGMENTS We would like to thank the many people who assisted us in the preparation of this manuscript. Brent Clark assisted in the preparation of some parts of Chapters 12 and 13. Jack Copeck contributed to earlier drafts of Chapters 6 and 12. Howard Hamilton assisted in program flow analysis techniques. Rod Nickel contributed to the part of Chapter 7 that deals with error detection and recovery in precedence parsers. Lyle Opseth assisted in error detection and recovery in Chapter 7 and a preliminary version of Chapter 14. Darwyn Peachey assisted in the preparation of Chapter 5 and some error detection and recovery techniques in Chapters 6 and 7. Henry Spencer contributed to Chapter 3, and Joe Wald contributed to the attributed translation part of Chapter 11. Special thanks is due to John DeDourek, who class-tested the manuscript and proofread the entire galley form of the book. We also wish to thank the reviewers of the manuscript: Michael Bauer, James Harp, John Lowther, and Tom Pennello. A thanks to Chris Fraser for his comments on Chapter 14. We are grateful to Sue Wotton, who did such an excellent job of typing the manuscript, and to Gail Walker for her typing support. Finally, we are very grateful for the comments of our students in the Department of Computational Science at the University of Saskatchewan, who have class-tested preliminary versions of the book during the last seven years. Jean-Paul Tremblay Paul G. Sorenson

CHAPTER ONE INTRODUCTION 1-1 PROGRAMMING LANGUAGES Interactions involving humans are most effectively carried out through the medium of language. Language permits the expression of thoughts and ideas, and without it, communication as we know it would be very difficult indeed. In computer programming, a programming language serves as a means of communication between the person with a problem and the computer used to help solve it. An effective programming language enhances both the development and the expression of computer programs. It must bridge the gap between the often unstructured nature of human thought and the precision required for computer execution. A program solution to a given problem will be easier and more natural to obtain if the programming language used is close to the problem. That is, the language should contain constructs which reflect the terminology and elements used in describing the problem and are independent of the computer used. Such a programming language is usually high-level. Digital computers, on the other hand, accept and understand only their own special low-level language, consisting typically of long sequences of zeros and ones. These sequences are generally unintelligible to humans. This type of low-level language is quite different from the high-level language used to describe a problem in a given area. A hierarchy of programming languages based on increasing machine independence includes the following: 1. Machine-level languages 2. Assembly languages 3. Higher-level or user-oriented languages 4. Problem-oriented languages 1

2 THE THEORY AND PRACTICE OF COMPILER WRITING A machine-level language is the lowest form of computer language. Each instruction in a program is represented by a numeric code, and numerical addresses are used throughout the program to refer to memory locations in the computer's memory. All bookkeeping aspects of the program are the sole responsibility of the machine-language programmer. Finally, all diagnostics and programming aids must be supplied by the programmer. Also included as machine-level programs are programs written in microcode (i.e., microprograms). Microcode allows for the expression of some of the more powerful machine-level instructions in terms of a set of basic machine instructions. Assembly language is essentially a symbolic version of a machine-level language. Each operation code is given a symbolic code such as ADD for addition and MUL for multiplication. Moreover, memory locations are given symbolic names such as PAY and RATE. Some assembly languages contain macroinstruc- tions which are at a higher level than assembly-languages instructions. Assembly- language systems offer certain diagnostic and debugging assistance that is normally not available at the machine level. A high-level language such as FORTRAN, PASCAL, or PL/I offers most of the features of an assembly language. While some facilities for accessing system- level features may not be provided, a high-level language offers a more enriched set of language features such as structured control constructs, nested statements, blocks, and procedures. A problem-oriented language provides for the expression of problems in a specific application or problem area. Examples of such languages are SEQUEL for database retrieval applications and COGO for civil engineering applications. In this book we concern ourselves exclusively with the development of translators for high-level algebraic languages such as ALGOL, PASCAL, and FORTRAN. Although the class of high-level languages includes both procedural and nonprocedural languages, this book concentrates on procedural languages. Unlike procedural programs, the position of a particular statement in a nonprocedural program has no bearing on its execution; that is, the interchange of any two statements in a nonprocedural program will not affect the results obtained though its execution. Examples of nonprocedural languages are PSL, a requirements statement language, and CSMP, a simulation and modeling language. Advantages of high-level languages over low-level languages such as machine and assembly languages include the following: 1. High-level languages are easier to learn than their lower-level counterparts. The learning of many high-level languages requires little or no computer hardware background because such languages are relatively machine-independent. Furthermore, these languages are closer to their problem areas than lower-level languages. 2. The programmer does not have to be concerned with clerical tasks involving numerical or symbolic references to instructions, memory locations, constants, etc. Such tasks are handled by a system which translates the high-level language program into machine language.

INTRODUCTION 3 3. A programmer is not required to know how to convert data from external forms to various internal forms within the memory of a computer. The ability of a programmer to convert, say, numeric data into internal forms such as floating-point numbers and packed-decimal numbers should be irrelevant. 4. Most high-level languages offer a programmer a variety of control structures which are not available in low-level languages. High-level languages offer several of the following language constructs: Conditional statements (such as IF-THEN-ELSE and CASE statements) Looping statements (for both counted and conditional loops) Nested statements Block structures These control structures improve programming style and facilitate certain programming approaches such as structured programming. Resulting programs are easier to read, understand, and alter. This results in reduced programming costs because programs are less complicated. 5. Programs written in a high-level language are usually more easily debugged than their machine- or assembly-language equivalents. High-level languages offer constructs that eliminate or reduce certain kinds of programming errors that occur in low-level languages. For example, the declaration of variables in a program adds a certain degree of redundancy which is useful in detecting errors in the improper use of variables. Languages often enforce a disciplined use of pointers. Moreover, a structured program is much more easily debugged than its unstructured counterpart. 6. Since most high-level languages offer more powerful control and data- structuring capabilities than low-level languages, the former class of languages facilitates the expression of a solution to a particular problem. 7. Because of the availability of certain language features such as procedures, high-level languages permit a modular and hierarchical description of programming tasks. These tasks can then be assigned to a team of programmers, thus facilitating a division of labor with a minimum of disruption and effort. Such an approach permits better documentation of a problem. Also, increased compatibility among programs and programmers can be realized. 8. Finally, high-level languages are relatively machine-independent. Consequently, certain programs such as FORTRAN and COBOL programs are portable. These programs can be executed on different computers with little, if any, change even though these computers have different internal architectures and machine-language instruction sets. Program portability reduces costs and, in general, protects an installation against computer obsolescence. Often when a computer is replaced, all associated assembly- and machine- language programs become obsolete. Points 1 through 5 have been stressed in the past and found to be very important. Today, points 6 through 8 are receiving considerable attention and promise to be dominant considerations in the immediate future.

4 THE THEORY AND PRACTICE OF COMPILER WRITING High-level language programs, of course, must be translated automatically to equivalent machine-language programs. The notions of such translators are examined in the next section. 1-2 TRANSLATORS A translator inputs and then converts a source program into an object or target program. The source program is written in a source language and the object program belongs to an object language. If the source language being translated is assembly language and the object program is machine language, the translator is called an assembler. Assembly language resembles closely machine language. In an elementary assembly language, most of its instructions are symbolic representations of corresponding machine-language instructions. Each assembly instruction contains an ordered set of fields. For example, the first field might represent a label which is followed immediately by an operation field. This field in turn might be followed by one or more operand fields. More sophisticated assembly languages, however, support macroinstructions. A macroinstruction can be expanded to a sequence of simple machine-language instructions. High-level language constructs such as case statements, nested statements, and blocks are not usually contained in assembly languages. A translator which transforms a high-level language such as FORTRAN, PASCAL, or COBOL into a particular computer's machine or assembly language is called a compiler. The time at which the conversion of a source program to an object program occurs is called compile time. The object program is executed at run time. Figure 1-1 illustrates the compilation process. Note that the source program and data are processed at different times, namely, compile time and run time, respectively. Another kind of translator, called an interpreter, processes an internal form of the source program and data at the same time. That is, interpretation of the Source program Compiler Compile time Object program Executing computer Run time Results Figure 1-1 Compilation process.

INTRODUCTION 5 Source program Interpreter Results Figure 1-2 Interpretive process. internal source form occurs at run time and no object program is generated. Figure 1-2 illustrates this interpretive process. Some interpreters analyze each source statement every time it is to be executed. Such an approach is very time-consuming and is seldom used. A more efficient approach involves applying compilation techniques to the translation of the source program to an intermediate source form. This intermediate source form is then interpreted by the interpreter program. Interpreters have become increasingly popular lately, particularly in microcomputer environments where the overhead of interpretation seems to be signif- cantly less for the user. For example, a main reason why languages such as BASIC, APL, LISP, and SMALLTALK-80 have become very popular is because they have been implemented in an interpretive environment. In this book, we concentrate on compiler construction techniques. Although a complete compiler will not be found in this text, the reader is referred to the accompanying text, An Implementation Guide to Compiler Writing, where a compiler for the high-level programming language GAUSS is described in detail. A compiler can be described in a modular fashion. The next section gives an overview of such an approach. 1-3 MODEL OF A COMPILER The task of constructing a compiler for a particular source language is complex. The complexity and nature of the compilation process depend, to a large extent, on the source language. Compiler complexity can often be reduced if a program- ming-language designer takes various design factors into consideration. Several of these programming-language design factors are discussed in Chap. 3. Since we are dealing with high-level source languages such as ALGOL and PASCAL, however, a basic model of a compiler can be formulated. Such a model is given in Fig. 1-3. Although this model may vary for the compilation of different high-level languages, it is nevertheless representative of the compilation process. A compiler must perform two major tasks: the analysis of a source program and the synthesis of its corresponding object program. The analysis task deals with the decomposition of the source program into its basic parts. Using these

6 THE THEORY AND PRACTICE OF COMPILER WRITING Source program 1 i Lexical analyzer I ANALYSIS Syntactic analyzer \ i r i Semantic analyzer 1 f 1 Object program A SYNTHESIS Code generator A Code optimizer | \r TABLES Figure 1-3 Components of a compiler. parts, the synthesis task builds their equivalent object program modules. The performance of these tasks is realized more easily by building and maintaining several tables. More will be said about the contents of these tables shortly. A source program is a string of symbols each of which is generally a letter, a digit, or certain special symbols such as +, —, and (,). A source program contains elementary language constructs such as variable names, labels, constants, keywords, and operators. It is therefore desirable for the compiler to identify these various types as classes. These language constructs are given in the definition of the language. The source program (see Fig. 1-3) is input to a lexical analyzer or scanner whose purpose is to separate the incoming text into pieces or tokens such as constants, variable names, keywords (such as DO, IF, and THEN in PL/I), and operators. In essence, the lexical analyzer performs low-level syntax analysis. For efficiency reasons, each class of tokens is given a unique internal representation number. For example, a variable name may be given a representation number of 1, a constant a value of 2, a label the number 3, the addition operator (4-) a value of 4, etc. For example, the PL/I statement TEST: IF A > B THEN X = Y; would be translated by the lexical analyzer into the following sequence of tokens (and associated representation numbers): TEST 3 : 26 IF 20

INTRODUCTION 7 A > B THEN X = Y ? 1 15 1 21 1 10 1 27 Note that in scanning the source statement and generating the representation number of each token we have ignored spaces (or blanks) in the statement. The lexical analyzer must, in general, process blanks and comments. Both of these classes of items typically do not represent executable parts of a program and are ignored, for example, in PL/I. Certain programming languages allow the continuation of statements over multiple lines. Lexical analyzers must then handle the input processing of such multiple-line statements. Also, some scanners place constants, labels, and variable names in appropriate tables. A table entry for a variable, for example, may contain its name, type (e.g., REAL, INTEGER, or BOOLEAN), object program address, value, and line in which it is declared. The lexical analyzer supplies tokens to the syntax analyzer. These tokens may take the form of a pair of items. The first item gives the address or location of the token in some symbol table. The second item is the representation number of the token. Such an approach offers a distinct advantage to the syntax analyzer; namely, all tokens are represented by fixed-length information: an address (or pointer) and an integer. Chapter 4 describes the scanning process in more detail. The syntax analyzer is much more complex than the lexical analyzer. Its function is to take the source program (in the form of tokens) from the lexical analyzer and determine the manner in which it is to be decomposed into its constituent parts. That is, the syntax analyzer determines the overall structure of the source program. This process is analogous to determining the structure of a sentence in the English language. In such an instance we are interested in identifying certain classes such as "subject," "predicate," "verb," "noun," and "adjective." In syntax analysis we are concerned with grouping tokens into larger syntactic classes such as expression, statement, and procedure. The syntax analyzer (or parser) outputs a syntax tree (or its equivalent) in which its leaves are the tokens and every nonleaf node represents a syntactic class type. For example, an analysis of the source statement (A + B)*(C + D) can produce the syntactic classes (factor), (term), and (expression) as exhibited in the syntax tree given in Fig. 1-4. The syntax-tree approach is described in Chap. 2, where a set of rules known as a grammar is used to define precisely the source language. A grammar can be used by the syntax analyzer to determine the

8 THE THEORY AND PRACTICE OF COMPILER WRITING < expression > < factor > * < factor > cexpression>) ( < expression > ) < expression > + <term> /N < expression > + < term > < term > < factor > < factor > <factor> <factor> Figure 1-4 A syntax tree for the expression (A + B)*(C + D). structure of the source program. This recognition process is called parsing, and consequently we often refer to syntax analyzers as parsers. Several classes of parsers are discussed in detail in Chaps. 6 and 7. The syntax tree produced by the syntax analyzer is used by the semantic analyzer. The function of the semantic analyzer is to determine the meaning (or semantics) of the source program. Although it is conceptually desirable to separate the syntax of a source program from its semantics, the syntax and semantic analyzers work in close cooperation. Nevertheless, the semantic-analysis process is a different and unique process in a compiler. For an expression such as (A + B)*(C + D), for example, the semantic analyzer must determine what actions are specified by the arithmetic operators of addition and multiplication. When the parser recognizes an operator such as " +" or " *," it invokes a semantic routine which specifies the action to be performed. This routine may check that the two operands to be added have been declared, that they have the same type (if not, the routine would probably make them the same), and that both operands have values. The semantic analyzer often interacts with the various tables of the compiler in performing its task. The semantic-analyzer actions may involve the generation of an intermediate form of source code. For the expression (A + B)*(C + D), the intermediate source code might be the following set of quadruples: ( + ,A,B,T1) ( + ,C,D,T2) (*,T1,T2,T3)

INTRODUCTION 9 where (+, A, B, Tl) is interpreted to mean "add A and B and place the result in temporary Tl," (+, C, D, T2) is interpreted to mean "add C and D and place this result in T2," and (*,T1,T2,T3) is interpreted to mean "multiply Tl and T2 and place the result in T3." The exact form of the intermediate source language used depends, to a large extent, on how it is to be processed during the synthesis task. An infix expression may be converted to an intermediate form called Polish notation. Using this approach, the infix expression (A + B)*(C + D) would be converted to the equivalent suffix-Polish expression AB + CD + *. The latter expression contains the arithmetic operators in the order in which they are to be executed. Another type of intermediate form of source language can be a syntax tree which represents the parse of the source program. (See Chap. 10.) The output of the semantic analyzer is passed on to the code generator. At this point the intermediate form of the source-language program is usually translated to either assembly language or machine language. As an example, the translation of the three quadruples for the previous expression can yield the following sequence of single-address, single-accumulator assembly-language instructions: LDA A Load the contents of A into the accumulator. ADD B Add the contents of B to that of the accumulator. STO Tl Store the accumulator contents in temporary storage Tl. LDA C Load the contents of C into the accumulator. ADD D Add the contents of D to that of the accumulator. STO T2 Store the accumulator contents in temporary storage T2. LDA Tl Load the contents of Tl into the accumulator. MUL T2 Multiply the contents of T2 by that of the accumulator. STO T3 Store accumulator contents in temporary storage T3. The topic of code generation is presented in Chap. 11. The output of the code generator is passed on to a code optimizer. This process is present in more sophisticated compilers. Its purpose is to produce a more efficient object program. Certain optimizations that are possible at a local level include the evaluation of constant expressions, the use of certain operator properties such as associativity, commutativity, and distributivity, and the detection of common subexpressions. Because of the commutativity of the multiplication operator, the previous assembly code can be reduced to the following: LDA A ADDB STOT1 LDAC ADDD MULT1 STOT2 Note that the expression evaluated by this code is (C + D) * (A + B). More global optimizations can also be performed. These include the single evaluation of multiple occurrences of identical subexpressions and removing

10 THE THEORY AND PRACTICE OF COMPILER WRITING statements that are invariant inside a loop and placing them outside of that loop. These kinds of optimizations are machine-independent. There are, however, certain machine-dependent optimizations which can also be performed. Optimal register allocation is an example of this kind of optimization. A good code optimizer can produce as good or even better code than an experienced assembly-language programmer. Optimization techniques are the topic of Chaps. 12 and 13. The model of a compiler given in Fig. 1-3 makes distinctions between its successive phases. In certain compilers, however, some of these phases are combined. In our discussion of this model the details are omitted on how the five processes interact with each other. Let us examine some possible interactions between the lexical and syntax analyzers. One possibility is that the scanner generates a token for the syntax analyzer for processing. The syntax analyzer then "calls" the scanner when the next token is required. Another possibility is for the scanner to produce all the tokens corresponding to the source program before passing control to the syntax analyzer. In this case the scanner has examined the entire source program—this is called a separate pass. Some compilers make as little as one pass while other compilers have been known to make more than 30 passes (e.g., some of IBM's first PL/I compilers). Factors which influence the number of passes to be used in a particular compiler include the following: 1. Available memory 2. Speed and size of compiler 3. Speed and size of object program 4. Debugging features required 5. Error-detection and -recovery techniques desired 6. Number of people and time required to accomplish the compiler writing project Several educationally oriented or student compilers such as WATFIV and PL/C are essentially one-pass compilers. In these compilers very little (if any) code optimization is performed. The reason for this is the belief that programs will be compiled more times than executed. Such programs are often executed once and discarded. Also the semantic-analyzer and code-generation phases are invariably combined into one phase. Great emphasis is placed on debugging, error detection, error recovery, and diagnostic capabilities. Certain source languages, however, cannot be compiled in a single pass. Compilers for PL/I and FORTRAN often contain many passes. In particular, the optimization phase may require that several passes of the source program (or its intermediate form) be made. In addition, the optimization phase is often spread throughout the other passes of the compilation process (e.g., certain constant folding, as described in Chap. 12, has been incorporated in the lexical analyzer of some compilers). Other combinations are possible. Sometimes the syntax-analyzer, semantic- analyzer, and code-generation phases can be combined into a single phase.

INTRODUCTION 11 An aspect of compiler construction which was omitted in Fig. 1-3 deals with the area of error detection and error recovery. Error recovery is most closely associated with the syntax-analysis phase. The aim of error recovery is to prolong the compilation life of the program as long as possible before the compiler "gives up" on the source program. A strategy that is often adopted is that if at all possible, the object program is executed. This approach can reduce the number of times a source program has to be compiled before becoming error-free. Some error-recovery approaches insert and/or discard tokens in attempting to correct the portion of the source program which is syntactically incorrect. We attach significant importance to this topic in this book. In particular, Chap. 5 introduces the general notions of error detection and error recovery. Also, particular strategies for different parsing techniques are illustrated in Chaps. 6 and 7. Another aspect of compiling which has been ignored in the current discussion deals with the various symbol tables that must be created and maintained during the executions of the required object program. The topic of symbol tables is presented in Chap. 8. Also, certain language constructs in the source language imply that several data structures must be stored and maintained in the memory of the computer. The design and implementation of these structures is called run-time organization and implementation or storage management. For block- structured languages, string-oriented languages such as SNOBOL, and languages which permit the declaration and manipulation of programmer-defined data structures, the run-time considerations are very significant and complex. Details of these problems are dealt with in Chap. 9. In closing, a comment on the relative difficulty of implementing each of the phases in Fig. 1-3 is in order. The lexical-analysis phase is perhaps the most straightforward. Next in difficulty comes the syntactic-analysis or parsing phase. A lot of research, based on formal language theory, has been done on these two phases. As a result, the scanning and parsing phases can be largely automated. The theory of scanning is developed in Chap. 4. Chapters 6 and 7, on the other hand, describe the theory of parsing. The real difficulties in compiler construction are in the semantic-analysis, code-generation, and code-optimization phases. Another problem area is the portability of programs. These difficulties, which are often language- and machine-dependent, are examined in Chaps. 10, 11, 12, and 13. Several attempts have been made at automating compiler construction. Some of these efforts have enjoyed limited success. These translator-writing or compiler- compiler systems are briefly discussed in Chap. 14. BIBLIOGRAPHY Aho, A. V., and J. D. Ullman: Principles of Compiler Design, Reading, Mass.: Addison-Wesley, 1977. Barrett, W. A., and J. D. Couch: Compiler Construction, Theory and Practice, Chicago: Science Research Associates, Inc., 1979.

12 THE THEORY AND PRACTICE OF COMPILER WRITING Calingaert, P.: Assemblers, Compilers and Program Translators, Rockville, Md.: Computer Science Press, Inc., 1979. Cocke, J., and J. T. Schwartz: Programming Languages and Their Compilers, Courant Institute of Mathematical Sciences, New York University, April 1970. Gries, D.: Compiler Construction for Digital Computers, New York: John Wiley & Sons, 1971. Knuth, D. E.: "A History of Writing Compilers," Computers and Automation, Vol. 11, No. 12, December 1962, pp. 8-14. Lewis, P. M. II, D. J. Rosenkrantz, and R. E. Stearns: Compiler Design Theory, Reading, Mass.: Addison-Wesley, 1976. Pollack, B. W. (ed.): Compiler Techniques, Princeton, N.J.: Auerbach, 1972. Rosen, S., Programming Systems and Languages, New York: McGraw-Hill, 1967. Tremblay, J. P., and P. G. Sorenson: An Implementation Guide To Compiler Writing, New York: McGraw-Hill, 1982.

CHAPTER TWO NOTATION AND CONCEPTS FOR LANGUAGES AND GRAMMARS This chapter covers basic material which we will use extensively throughout the remainder of this book. Rather than spreading this material in various chapters of the text, therefore forcing the reader to read at least those parts of each chapter where the notions are introduced, we felt that a better alternative would be to put in this chapter basic fundamental material which is common to most of the remaining chapters. In this way the text can be read in a modular manner, thus allowing the reader to select certain chapters without loss of continuity. The first section deals with basic material concerning sets and strings. Section 2-2 introduces the notion of a grammar and its use in describing the syntax of a language. In the third section, a classification of grammars is given and the classification is shown to form a hierarchy. The material dealing with context-free grammars of Sec. 2-4 introduces basic notions such as parsing, syntax tree, and ambiguity. The problem of specifying the syntax of a programming language by the use of a metalanguage is discussed at some length. The material in Sees. 2-1, 2-2, 2-4.1, and 2-4.2 is drawn primarily from Tremblay and Manohar (1975). Readers who are familiar with the notions of sets, relations, and strings can proceed to Sec. 2-2. Readers who have been exposed to grammars can begin at Sec. 2-4. 13

14 THE THEORY AND PRACTICE OF COMPILER WRITING 2-1 SETS AND STRINGS A programming language consists of a set of programs (or strings). As we will see shortly, it is convenient to use a grammar as a formal vehicle for generating these programs. Several relations can be defined on the rules of a grammar, and these relations can lead to efficient compilation algorithms for the language associated with that grammar. Consequently, the purpose of this section is to give a brief introduction to the basic elements of sets and strings. In particular, the concepts of sets and operations on sets are given in Sec. 2-1.1. The notion of a relation is introduced in Sec. 2-1.2. Other important aspects about relations which are introduced in this subsection include the graph and transitive closure of a relation. The section ends with an elementary treatment of strings. 2-1.1 Basic Concepts of Set Theory By a set we mean a collection of objects of any sort. The word "object" is used here in a very broad sense to include even abstract objects. A fundamental concept of set theory is that of membership or belonging to a set. Any object belonging to a set is called a member or an element of that set. If an element p belongs to a set A, we write p e A, which is read as "/? is an element of the set A" or "/? belongs to the set A" If there exists an object q which does not belong to the set A9 we express this fact as q £ A. There are many ways of specifying a set. For example, a set consisting of the decimal digits is generally written as {0,1,2,3,4,5,6,7,8,9} The names of the elements are enclosed in braces and separated by commas. If we wish to denote this set as Z>, we write D= {0,1,2,3,4,5,6,7,8,9} where the equality sign indicates that D is the set {0,1,2,3,4,5,6,7,8,9}. This method of specifying a set is not always convenient. For example, the set D can be more easily described by using a predicate as follows: D = {x\x is a decimal digit} where the symbol | is read as "such that." In this case, the predicate is " is a decimal digit." If we let P(x) denote any predicate, then {x\P(x)} defines a set and is read "the set of all x such that P(x) is true." An element a belongs to this set if P(a) is true; otherwise a does not belong to the set. If we denote the set {jc|P(jc)} by B, then B = {jc|P(jc)}. Sets which are specified by listing their elements can also be characterized by means of a predicate. For example, the set {a, b, 1,9} can be defined as {x\(x = a) V(x = b) V(x = 1) V(x = 9)} where the symbol V denotes a logical disjunction (or). A predicate can also

NOTATION AND CONCEPTS FOR LANGUAGES AND GRAMMARS 15 contain the logical connectives A (logical conjunction or and), -, (logical negation), -> (if then), <=> (if and only if), and the relational operators <,<,=, =£ , > , and > . Furthermore, a predicate can also contain the existential quantifier (3, meaning "there exists") and the universal quantifier (V, representing "for every"). Although it is possible to characterize any set by a predicate, it is sometimes convenient to specify sets by another method, such as C= {1,3,5,...} S= {a,a\a\...} In this representation the missing elements can be determined from the elements present and from the context. The number of distinct elements present in a set may be finite or infinite. We call a set finite if it contains a finite number of distinguishable elements; otherwise a set is infinite. Note that no restriction has been placed on the objects that can be members of a set. It is not unusual to have sets whose members are themselves sets, such as A = (0, {a, b}, 1, { /?}}. It is important, however, to distinguish between the set { p}, which is an element of A, and the element /?, which is a member of { p} but not a member of A. For any two sets A and B, if every element of A is an element of B, then A is called a subset of 2?, or A is said to be included in B, or B includes A. Symbolically, this relation is denoted by A c B, or equivalently by B d A. For any two sets A and B, note that A c B does not necessarily imply that B c A except for the following case. Two sets A and B are said to be equal iff (if and only if) A c B and B c A so that A = B. The set A is called a proper subset of a set B if A c B and A =£ B. This relation is represented by A c B. We now introduce two special sets; the first includes every set under discussion while the second is included in every set under discussion. A set is called a universal set (E) if it includes every set under discussion. For example, the universal set for natural numbers may be E = (0,1,2,...}. A set which does not contain any element is called an empty set or a null set. An empty set will be denoted by <j>. Given any set A, we know that the null set <j> and the set A are both subsets of A. Also for any element p e A, the set { p} is a subset of A. Similarly, we can consider other subsets of A. Rather than finding individual subsets of A, we would like to say something about the set of all subsets of A. For any set A9 a collection or family of all subsets of A is called the power set of A. The power set of A is denoted by p(A) or 2A, so that p(A) = 2A = {x\x c A}. We now introduce the concept of an indexed set. Let J = {sx, s2, s3,...} and A be a family of sets A = {A ,A 9A ,...} such that for any s, e7 there corresponds a set As e ,4, and also As = ^45 iff s, = Sj. A is then called an indexed set, J the index set, and any subscript such as the s, in As> is called an index.

16 THE THEORY AND PRACTICE OF COMPILER WRITING An indexed family of sets can also be written as A = {^4, },<=/• In particular, if J = {1,2,3,...}, then A = {Al9A29A39...}. Also, if J = (1,2,3,..., n}9 then A = {Al9 A29 A39..., An} = {Ai}i&In where In = (1,2,..., n}. For a set S containing n elements, the power set p(S) is written as the indexed set p(S) = {Bt}i&Jwhere J = (0,1,2,. ..,2"-l}. Let us now consider some operations on sets. In particular, we emphasize the set operations of intersection, union, and complement. Using these operations, one can construct new sets by combining the elements of given sets. The intersection of any two sets A and B, written as A n 2?, is the set consisting of all elements which belong to both A and B. Symbolically, A O B = [x\x e A A x e B) For any indexed set A = {^4, },<=/, P| Ai = { x\x e ,4, for every iGi} For J = In= (1,2,..., n}, we can write /=i /<=/„ Two sets ^4 and 5 are called disjoint iff A n B = <j>. A collection of sets is called a disjoint collection if, for every possible pair of sets in the collection, the two sets are disjoint. The elements of a disjoint collection are said to be mutually disjoint. Let A be an indexed set A = [A^^j. The set A is a disjoint collection iff At O Aj = <f> for all i, y g/,/^y. For any two sets ^4 and 5, the union of ^4 and 2?, written as A U 2?, is the set of all elements which are members of the set A or the set B or both. Symbolically, it is written as A U B = (x|x e A Vjcg5}. For any indexed set A = {^4/}/ey, |J Ai; = { x|x e ,4. for at least one /g7} For J = In= (1,2,3,..., n}, we may write U^ = A U^2U ••• UAn i = l The relative complement of 2? in ^4 (or of B with respect to A), written as A — B, is the set consisting of all elements of A which are not elements of B, that is, A - B = {x\x ^ A A x £ B) The relative complement of B in A is also called the difference of A and 2?. The relative complement of A with respect to the universal set £, that is, E — A, is called the absolute complement of ^4. So far we have been concerned with sets, their equality, and operations on sets to form new sets. We now introduce the notion of an ordered pair. An ordered pair consists of two objects in a given fixed order. Note that an ordered pair is not a set consisting of two elements. The ordering of the two objects is important. We denote an ordered pair by (x, y).

NOTATION AND CONCEPTS FOR LANGUAGES AND GRAMMARS 17 The equality of two ordered pairs (x, y) and (w, v) is defined by (x,y) = (u,v) ~ ((x = u) /\(y = v)) where the symbol "<=> " denotes logical equivalence', that is, A <=> B means that A is equivalent to B. At this point we want to also define the term "imply." A is said to imply B written A => B, if and only if A -* B is always true. In mathematics, the notations A -* B and A => 2? are often used interchangeably. The idea of an ordered pair can be extended to define an ordered triple and, more generally, an /i-tuple. We write an /i-tuple as (xx, x2,..., xn). An important idea in set theory is the notion of a cartesian product. Given two sets A and B, the set of all ordered pairs such that the first member of the ordered pair is an element of A and the second member is an element of B is called the cartesian product of A and B and is written as A X B. Accordingly A XB = {(x,y)\(x^A)A(y^B)} This notion of cartesian product can be extended to any finite number of sets. Let A = { Ai }t; e j^ be an indexed set and In= (1,2,..., n}. We denote the cartesian product of the sets Al9 A2,..., Anby X Ai = AlXA2X ••• XAn which is defined recursively as XAi = Ax and X Ak= ( X a\xAm for m = 2,3,..., * Our definition of cartesian product of n sets is related to the definition of /i-tuples in the sense that = {(x1,x2,...,x„)|(x1 g^,) a(jc2s^2) A ••• A(xn(EAn)} As an example, consider the following sets: ^={1,3}, ^2 = {3,4}, A3 ={1,3,4,6} Several set operations are exhibited in the following: A,UA2= {1,3,4} AxnA,= {1,3} 3 LM,= {1,3,4,6} ru = {3} y=i A3-Ax- {4,6} pK) ={*.{!}, {3}, {1,3}} AX/12= {(1,3), (1,4), (3,3), (3,4)}

18 THE THEORY AND PRACTICE OF COMPILER WRITING In this section we have introduced the basic concepts of set theory. We now turn our attention to the notions of relations and orderings. EXERCISES 2-1.1 1 Give another description of the following sets and indicate those which are infinite sets. (a) {x\x is an integer and 5 < x < 12} (/>) {2,4,8,...} (c) All the countries of the world 2 Given S = {2, a, {3},4} and R = {{#},3,4,1}, indicate whether the following are true or false. (a) {a}eS (b) {a}^R (c) {a,4,{3}}cS (d) {{<i},l,3,4}c* (e) R = S (/) WCS (g) {*}<=* (/?) <f>c R (i) <*><= {{a}}QRQE U) {<M<=S (k) <j> e R (/) 4>c{{3},4} 3 Show that (R cz S) A (S cz Q) -^ R cz Q is always true. Is it correct to replace R c Q by /? c g? Explain your answer. 4 Give the power sets of the following. (*) (MM} (b) {1,*} (c) {*,r,z} 5 Given ^ = {jc|jc is an integer and 1 < x < 5}, B = {3,4,5,17}, and C = {1,2,3,...}, find A n 5, ^ n C, /I U 5, and A U C. 6 Show that /I c /i U B and ^ n B c /i. 7 Show that AczB**AuB=B. 8 If S = {tf,Z?,c}, find nonempty disjoint sets Ax and /12 such that AYKJ A2 = S. Find other solutions to this problem. 9 The symmetric difference of two sets A and 5 is the set A + B defined by A + B = (A - B)U {B - A). Given A = {2,3,4}, B = {1,2}, and C = {4,5,6}, find A + £, £ + C, /* + B + C, and (/I + 5) + (5 + C). Show that /i + B + C is associative. 10 Give examples of sets A, B, C such that A U 5 = A U C, but 5 * C. 11 Write the members of {a,Z>} X {1,2,3}. 12 Write /IX5XCJ2, /i3, £2 X >f, and A X 5 where A = {1}, 5= {«,£}, and C = {2,3}. 13 Show by means of an example that A X B * B X A and (/i Xfi)xC^^ X (5 X C). 14 Show that for any two sets A and B p(A) U p(B) c p(A U 5) P(/l)np(5) = p(^5) Show by means of an example that p(A) U p(B) * p(A U B)

NOTATION AND CONCEPTS FOR LANGUAGES AND GRAMMARS 19 15 Prove the identities A O A = A A O </> = </> A O E = A and A U E = E 16 Show that A X (B n C) = (A X B)n(A X C). 17 Show that A X B = B X A ~ {A = </>) V (B = </>) V (A = B) 18 Show that (A n B)U C = A n (5 U C) iff C c /i 19 Show that (/* - 5) - C = (/* - C) - (B - C) 20 Prove that (A n B) X (C n D) = (A X C)n(B X D) 2-1.2 Relations The concept of a relation is a basic concept in mathematics as well as in everyday life. Associated with a relation is the act of comparing objects which are related to one another. The ability of a computer to perform different tasks based upon the result of a comparison is an important attribute used several times during the execution of a typical program. In this subsection we first formalize the concept of a relation and then discuss methods of representing a relation by using a matrix or its graph. The relation matrix is useful in determining the properties of a relation and also in representing a relation on a computer. Various basic properties of a relation are given, and certain important classes of relations are introduced. The word "relation" suggests some familiar examples of relations such as the relation of father to son, sister to brother, or uncle to nephew. Familiar examples in arithmetic are relations such as less than, greater than, or that of equality between two real numbers. We also know the relation between the area of an equilateral triangle and the length of one of its sides and the area of a square and the length of a side. These examples suggest relationships between two objects. Throughout the discussion we consider relations, called binary relations, between a pair of objects. Any set of ordered pairs defines a binary relation. We call a binary relation simply a relation. It is sometimes convenient to express a particular ordered pair, say, (x, y) e R, where R is a relation, as xRy. In mathematics, relations are often denoted by special symbols rather than by capital letters. A familiar example is the relation "less than" for real numbers. This relation is denoted by <. In fact, < should be considered as the name of a set whose elements are ordered pairs. More precisely the relation < is < = {(x, y)|x, y are real numbers and x is less than y} Let S be a binary relation. The set D(S) of all objects x such that for some y9 (x, y) e S is called the domain of S, that is, D(S)={x\(3y)((x,y)<ES)} where the symbol 3 denotes existential quantification. Similarly, the set R(S) of all objects y such that for some x, (x, y) e S is called the range of S, that is, R(S)= {y\(3x)((x,y)eS)}

20 THE THEORY AND PRACTICE OF COMPILER WRITING Since a relation has been defined as a set of ordered pairs, it is therefore possible to apply the usual operations of sets to relations as well. The resulting sets will also be ordered pairs and will define some relations. If R and S denote two relations, then R n S defines a relation such that x(R O S)y <=* xRy A xSy Similarly, R U S is a relation such that x(R U S)y ~ xRy V xSy Also x(R - S)y <=» xRy A x$y where x$y denotes that x is not related to y in relation S. A binary relation R in a set X is reflexive if, for every xgJ, xRx, that is, (x, x) ^ R. The relation R is symmetric if, for every x and y in A", whenever xRy, then >>/£x. R is said to be transitive if, for every x, j>, and z in A", whenever x/ty and yRz9 then xRz. /£ is irreflexive if, for every x ^ X, (x, x) £ R. Finally, /£ is said to be antisymmetric if, for every x and y in A", whenever xRy and yRx, then x = y. Several important classes of relations having one or more of the properties given here will be discussed later in this subsection. A relation R from a finite set A" to a finite set Y can also be represented by a matrix called the relation matrix of R. Let X = { xl9 x2, • • •, xm}, y = { yx, y2,..., yn}, and R be a relation from X to y. The relation matrix can be obtained by first constructing a table whose rows are labeled by successive elements of X and whose columns are labeled by successive elements of Y. If x,/ty, then we enter a 1 in the /th row and yth column. If xpfiyq, then we enter a zero in the pth row and the gth column. As a special case, consider m = 2, n = 3, and R given by R = {(*1> -Vl)' (*2> .Vl), (*2> ^3)} The required table for R is Table 2-1. If we assume that the elements of X and Y appear in a certain order, then the relation R can be represented by a matrix whose elements are Is and Os. This matrix can be written down from the table constructed or can be defined in the Table 2-1 y\ yi ^3 xx 1 0 0 x2 1 0 1

NOTATION AND CONCEPTS FOR LANGUAGES AND GRAMMARS 21 following manner: /l, if XiRyj ^ (O, ifxjyj where rtj is the element in the /th row and the jth column. The matrix obtained this way is the relation matrix. If X and Y have m and n elements, respectively, then the matrix is an m X n matrix. For the relation R just given, the* relation matrix is fl 0 0] Ll 0 lj One not only can write a relation matrix when a relation R is given but also obtain the relation if the relation matrix is given. Throughout the remainder of this subsection, unless otherwise stated, we will assume X = Y; that is, the relations are defined in a set X. Thus the relation matrices are square. A relation matrix reflects some of the properties of a relation in a set. If a relation is reflexive, all the diagonal entries must be 1. If a relation is symmetric, the relation matrix is symmetric. If a relation is antisymmetric, its matrix is such that if rtj = 1, then rji = 0 for / =£ j. A relation can also be represented pictorially by drawing its graph. Although we shall introduce some of the concepts of graph theory which are discussed in later chapters, here we shall use graphs only as a tool to represent relations. Let R be a relation in a set X = {xl9 x2,..., xn}. The elements of X are represented by points or circles called nodes. The nodes corresponding to xt and Xj are labeled xt and xJ9 respectively. These nodes may also be called vertices. If XjRXj, that is, if (xi9 Xj) e R, then we connect nodes xt and Xj by means of an arc and put an arrow on this arc in the direction from xt to Xj. When all the nodes corresponding to the ordered pairs in R are connected by arcs with proper arrows, we get a directed graph of the relation R. If xtRxj and XjRxt, then we draw two arcs between xt and Xj. For the sake of simplicity, we may replace the two arcs by one arc with arrows pointing in both directions. If x^Rx^ we get an arc which starts from node xt and returns to xt. Such an arc is called a sling. From the graph of a relation it is possible to observe some of its properties. Several examples are given in Fig. 2-1. Another very important notion which is used in later chapters is that of a partition. Let 5 be a given set and A = {Av A2,..., An) where each At, /= 1,2,...,/!, is a subset of S and U".^, = S, then the set A is called a covering of S, and the sets Al9 A2, ...,An are said to cover S. If, in addition, the elements of A, which are subsets of S, are mutually disjoint, then A is called a partition of S, and the sets Al9A29...,A„ are called the blocks of the partition. We now proceed to define an important class of relations which partitions a set. A relation R in a set X is called an equivalence relation if it is reflexive, symmetric, and transitive. If R is an equivalence relation in a set X, then D(R)9 the domain of R, is X itself. Therefore, R will be called a relation on X. Some

22 THE THEORY AND PRACTICE OF COMPILER WRITING y y / P *c> x yRx yRy xRzazRxayRz y xQ > Oz xRyAxRzAzRy xRxAxRyAyRx Figure 2-1 Graphs of relations. examples of equivalence relations are equality of numbers on a set of real numbers, similarity of triangles on a set of triangles, and the relation of lines being parallel on a set of lines in a plane. Another example of an equivalence relation which will be used in the discussion on hashing functions (see Chap. 8) is that of equality in a modular number system. Let / denote the set of all positive integers, and let n be a positive integer. For x e / and y e /, we define R as R = {(*, y)\x - y is divisible by n) Note that "x — y is divisible by n " is equivalent to the statement that both x and y have the same remainder when each is divided by n. Perhaps the most important idea about relations that is used in formal parsing techniques involves taking the " transitive closure" of a relation. We now develop this idea. Since a binary relation is a set of ordered pairs, the usual operations such as union and intersection on these sets produce other relations. We now consider another operation on relations—relations which are formed in two or more stages. Familiar examples of such relations are the relation of being a nephew or a brother's or sister's son, the relation of an uncle or a father's or mother's brother, and the relation of being a grandfather or a father's or mother's father. These relations can be produced in the following manner. Let R be a relation from X to Y and S be a relation from Y to Z. Then a relation written as R ° S is called a composite relation of R and S where RoS = {(x,z)\x e X A z e Z A(3y)(y e Y A(x,y) <e R A(y,z) e S)} The operation of obtaining R ° S from R and S is called composition of relations. £

NOTATION AND CONCEPTS FOR LANGUAGES AND GRAMMARS 23 R S X ^Y ► Z Ro S X ► Z Figure 2-2 Relations R, S, and R ° S. Note that R ° S is empty if the intersection of the range of R and the domain of S is empty. R ° S is nonempty if there is at least one ordered pair (x9 y) e R such that the second member y ^ Y of the ordered pair is a first member in an ordered pair in S. For the relation i?°5, the domain is a subset of X and the range is a subset of Z. In fact, the domain is a subset of the domain of R, and its range is a subset of the range of S. From the graphs of R and S one can easily construct the graph of R <> S. As an example, see Fig. 2-2. The operation of composition is a binary operation on relations, and it produces a relation from two relations. The same operations can be applied again to produce other relations. For example, let R be a relation from X to Y, S a relation from Y to Z, and P a relation from Z to W. Then R ° S is a relation from X to Z. We can form (R°S)°P, which is a relation from X to W. Similarly, we can also form R°(S ° P), which again is a relation from X to JK. Let us assume that (R° S)° P is nonempty, and let (x, y) e /?, (^, z) e S, and (z, w) e P. This assumption means (x,z)g/?o5 and (x, w) e (/? o S)<> p. Of course, (y,w) ^ S° P and (x, w) e Ro(S° P), which shows that (#°S)o/> = Ro(SoP) This result states that the operation of composition on relations is associative. We

24 THE THEORY AND PRACTICE OF COMPILER WRITING may delete the parentheses in writing (R ° S)° P, so that (RoS)oP = Ro(SoP) = RoSop We know that the relation matrix of a relation R from a set X = { xl9 xl9. • •, xm} to a set Y = { yl9 yl9..., yn} is given by a matrix having w rows and n columns. We shall denote the relation matrix of R by MR. MR has entries which are Is and Os. Similarly the relation matrix Ms of a relation S from the set y to a set Z = {zl9z29...9zp} is an /i X /? matrix. The relation matrix of R ° S can be obtained from the matrices MR and M5 in the following manner. From the definition it is clear that (*,-, zk) e Ro s if there is at least one element of Y9 say, yj9 such that (*,., >>•) e R and (j;., z^) e S. There may be more than one element of Y which has properties similar to those of yJ9 for example, (■*/> yr) G ^ an^ (yr> zk) G ^- ^n all suc^ cases» (xn zk) G ^ ° ^- Thus when we scan the /th row of MR and /cth column of M5, and we come across at least one y, such that the entries in the j th location of the row as well as the column under consideration are Is, then in this instance, the entry in the /th row and /cth column of MRoS is also 1; otherwise it is 0. Scanning a row of MR along with every column of Ms gives one row of MR 0 s. In this way, we can obtain all the rows of MR 0 s. In general, let the relations A and B be represented by n X m and m X r matrices, respectively. Then the composition A ° B which we denote by the relation matrix C is expressed as m Ca= V aikAbkj / = l,2,...,/i; y = l,2,...,r k = \ where aik A bkj and V ^=1 indicate bit-ANDing (i.e., 1a0 = 0a1=0a0 = 0,1 A 1 = 1) and bit-ORing (i.e., 1 V 1 = 1 V 0 = 0 V 1 = 1,0 V 0 = 0), respectively. Let us now consider some distinct relations Rx, R2, R3, R4 in a set X = {a9b,c} given by Rx= {(a,b)9(a9c)9(c,b)} R2= {(a9b),(b9c)9(c,a)} R3= {(a9b)9(b9c)9(c9c)} R4= {(a9b)9(b9a)9(c9c)} Denoting the composition of a relation by itself as RoR = R2 RoRoR = Ro R2 = R3 ... R o R»*-l = R™ let us write the powers of the given relations. Clearly Rl={(a,b)} Rl-* R* = <t> R\= {(a,c),(b,a),(c,b)} R\= {(a,a),(b,b),(c,c)} d4 n n5 p2 n6 n3 l\2 — l\2 I\2 — I\2 l\2 — l\2 *•• R23- {(a,c),(b,c),(c,c)} = R] = R* = R> R24= {(a,a),(b,b),(c,c)) R\ = R4 R54 = R24

NOTATION AND CONCEPTS FOR LANGUAGES AND GRAMMARS 25 Given a finite set X9 containing n elements, and a relation R in X, we can interpret Rm (m = 1,2,...)in terms of its graph. This interpretation is done for a number of applications throughout the text. With the help of such an interpretation or from the examples given here, it is possible to say that there are at most n distinct powers of R9 for Rm9 m > n9 that can be expressed in terms of R9R29...9Rn. Our next step is to construct the relation in X given by R + = RU R2U R3 U ••• Naturally, this construction will require only a finite number of powers of R to be calculated, and these calculations can easily be performed by using the matrix representation of the relation R and the Boolean multiplication of these matrices. Let us now see what the corresponding relations Rf, R29 R3 , and R% are R? = Rx U R2 U Rl • • • = Rx R J = R2 U R\ U R\ • • • = R2 U R\ U R\ = {(a9b)9(b9c)9 (c9a)9 (a9c)9 (b9a)9(c9b)9(a9a)9(b9b)9 {c9c)} Rt = {(a9b)9(b9c)9(c9c)9(a9c)} *; = {(a9b)9(b9a)9(c9c)9(a9a)9(b9b)} Observe that the relations R?, R29..., R% are all transitive and that Rx c Ri , R2 £ R2,...9 R4C: R4. From the graphs of these relations one can easily see that Rf is obtained from Rl (i = 1,2, 3,4) by adding only those ordered pairs to Rt such that R* is transitive. We now define R+ in general. Definition 2-1 Let X be any finite set and i^bea relation in X. The relation R + = RU R2U R3 U ••• in* is called the transitive closure of # in X. Theorem 2-1 The transitive closure R+ of a relation /? in a finite set X is transitive. Also for any other transitive relation P in X such that R c P, we have R + (i P. In this sense, /?+ is the smallest transitive relation containing R. Based on this theorem, the transitive closure of the relation matrix for some relation A can easily be computed by using the following algorithm due to Warshall. Procedure WARSHALL (A, n, P). Given the relation matrix A with n rows, the following steps produce the transitive closure of A, which is denoted by P. The variables i, j, and k are local integer variables. 1. [Initialize] P<- A 2. [Perform a pass] Repeat through step 4 for k = 1,2,...,n 3. [Process rows] Repeat step 4 for j = 1,2,...,n

26 THE THEORY AND PRACTICE OF COMPILER WRITING 4. [Process columns] Repeat for j = 1,2,...,n Pij «- Pij V (Pik A Pkj). 5. [Finished] Exit □ To show that this algorithm produces the required matrix, note that step 1 produces a matrix in which pt] = 1 if there is a path of length 1 from Vj to vr Assume that for a fixed k, the intermediate matrix P produced by steps 3 and 4 of the algorithm is such that the element in the ith row and jth column in this matrix is 1 if and only if there is a path from vf to v} which includes only nodes from v1} v2,...,vk as intermediate nodes. Now with an updated value of k, we find that p^ = 1 either if p^ = 1 in an earlier step or if there is a path from Vj to Vj which traverses through vk + 1. This means that Pjj = 1 if and only if there is a path from Vj to Vj which includes only nodes from v1,v2,...,vk + 1 as intermediate nodes. Instances of transitive closures of relations will be given in Chaps. 6 and 7, where such relations are derived from a grammar and used in the parsing phase. Also, the transitive closure of a graph of a program can be used to perform certain optimizations in that program. This aspect of compiler writing is the topic of Chaps. 12 and 13. In terminating this subsection we introduce the notion of the converse of a relation. Given a relation R from X to Y, a relation R from Y to X is called the converse of R, where the ordered pairs of R are obtained by interchanging the members in each of the ordered pairs of R. This means, for x e X and y e Y, that xRy <=> ykx. From the definition of R it follows that R = R. The relation matrix MR of R can be obtained by simply interchanging the rows and columns of MR. Such a matrix is called the transpose of MR. Therefore, MR = transpose of MR The graph of R is also obtained from that of R by simply reversing the arrows on each arc. We now consider the converse of a composite relation. For this purpose, let R be Si relation from X to Y and S be a relation from Y to Z. Obviously, R is a relation from Y to X, S from Z to Y; R ° S is a relation from X to Z, and R s S is a relation from Z to X. Also the relation S ° R is from Z to X. We now show that RoS = SoR If xRy and ySz9 then x(R ° S)z and z(R 5 S)x. But zSy and ykx, so that z(5 ° R)x. This is true for any x ^ X and z ^ Z\ hence the required result. The same rule can be expressed in terms of the relation matrices by saying that the transpose of MRoS is the same as the matrix M^oR. The matrix M^oR can be obtained from the matrices M$ and MR, which in turn can be obtained from the matrices Ms and MR.

NOTATION AND CONCEPTS FOR LANGUAGES AND GRAMMARS 27 Later in the text, the concept of a relation will be applied to a grammar. In fact there are several interesting relations which can be obtained from a grammar. Many of the operations which are to be performed on these relations have been discussed in this subsection. EXERCISES 2-1.2 1 Give an example of a relation which is neither reflexive nor irreflexive. 2 Give an example of a relation which is both symmetric and antisymmetric. 3 If relations R and S are both reflexive, show that R U S and R n S are also reflexive. 4 If relations R and S are reflexive, symmetric, and transitive, show that R n S is also reflexive, symmetric, and transitive. 5 Determine whether the following relations are transitive: /l,-{(1,1)} R2- {(1,2), (2,2)} R3 = {(1,2), (2,3), (1,3), (2,1)} 6 Given S = {1,2,3,4} and a relation R on S defined by R- {(1,2), (4,3), (2,2), (2,1), (3,1)} show that R is not transitive. Find a relation R^ D R such that Rx is transitive. Can you find another relation R2 2 R which is also transitive? 7 Given S = {1,2,..., 10} and a relation R on S where /?= {(x,y)\x+y = \0} what are the properties of the relation? 8 Let R be a relation on the set of positive real numbers so that its graphical representation consists of points in the first quadrant of the cartesian plane. What can we expect if R is (a) reflexive, (b) symmetric, and (c) transitive? 9 Let L denote the relation "less than or equal to" and D denote the relation "divides," where xDy means "x divides y." Both L and D are defined on the set {1,2,3,6}. Write L and D as sets, and find Ln D. 10 Show that the relations L and D given in Exercise 9 are both reflexive, antisymmetric, and transitive. Give another example of such a relation. Draw the graphs of these relations. 11 Let R denote a relation on the set of ordered pairs of positive integers such that (jc, y)R(u, v) iff xv = yu. Show that R is an equivalence relation. 12 Given a set S = {1,2,3,4,5}, find the equivalence relation on S which generates the partition whose equivalence classes are the spts {1,2}, {3}, and {4,5}. Draw the graph of the relation. 13 Prove that the relation "congruence modulo m" given by = ={(jc,j>)|jc-j>is divisible by m } over the set of positive integers is an equivalence relation. Show also that if xx = yl and x2 = y2, then (*i + xi) = (y\ + y2)- 14 Prove the following equivalences and equalities: (a) R = R (b) R = S ~ R = S (c) R c S ~ £ c S (d) R 0 S = R U S (e) R n S = R O S

28 THE THEORY AND PRACTICE OF COMPILER WRITING 15 Show that if a relation R is reflexive, then R is also reflexive. Show also that similar remarks hold if R is transitive, irreflexive, symmetric, or antisymmetric. 16 What nonzero entries are there in the relation matrix of R n R if R is an antisymmetric relation in a set XI 17 Given the relation matrix MR of a relation R on the set {a, b, c}, find the relation matrices of R, R2 = R o R, R3 = R o R o Ry and /? ° fl. A/„ = 1 1 Li 0 i i i 0 1 18 Two equivalence relations R and S are given by their relation matrices MR and Ms. Show that R o s is not an equivalence relation. Mr 1 1 0 1 1 0 0 0 1 Ms = 1 0 .0 0 1 1 0 1 1 Obtain equivalence relations RY and R2 on {1,2,3} such that RY ° R2 is also an equivalence relation. 19 Using Warshall's algorithm, obtain the transitive closure of the relation whose graph is given in Fig. 2-3. 20 For the graph of the relation R given in Fig. 2-4, determine MR, MR A MR, and MR A MR. Figure 2-3 v1' v4 Figure 2-4

NOTATION AND CONCEPTS FOR LANGUAGES AND GRAMMARS 29 2-1.3 Strings In this subsection we are concerned with some elementary properties of strings. There are many interesting properties exhibited by string operations, just as there are interesting properties for arithmetic operations over the natural numbers. To refamiliarize ourselves with some of the properties associated with operations, let us consider the operation of addition on the natural numbers. This operation can be represented, in general, by a functional system in two variables: f(x,y) = x + y where x and y are natural numbers. This system exhibits certain interesting properties. First, the sum of any two natural numbers is a natural number. This property is called closure. Closure is a necessary property for a system (i.e., a set and an operation on that set) to be classified as an algebraic system. Second, (x+y) + z = x + {y + z) = x+y + z when x9 y, and z are natural numbers; accordingly the operation of addition is said to be associative. Third, there exists a number / such that for every natural number x, x + / = x. This number is zero and is called the unit element or identity of the additive system. Many other important properties, such as distributivity and commutativity, exist when arithmetic operations such as addition and multiplication are applied to the set of natural numbers. We begin a discussion of strings by formally defining a string. To do so, we must introduce the notion of an alphabet and the operation of concatenation. Simply stated, an alphabet V is a finite nonempty set of symbols. The set V = {a,b, c,..., z) is a familiar example of an alphabet and (a, /?, y, e) is a four-character alphabet (which is a subalphabet of the Greek alphabet). The concatenation of two alphabetic characters, say %a% and *b\ is said to form a sequence of characters, namely, 'ab'. (Note that henceforth when we refer to a character from an alphabet or a sequence of such characters, they are enclosed in single quote marks.) The operation of concatenation also applies to sequences of characters. For example, 'ab' concatenated with 'ab' is *abab\ We denote the concatenation operator by the special symbol <>. This allows us to write expressions such as 'ab' ° *a\ which is identical in value to *aba\ A string (or sentence) over an alphabet V is either a letter from the alphabet For a sequence of letters derived from the concatenation of zero or more characters from the alphabet V. Examples of strings over an alphabet V = { a, b, c) are %a\ %ca\ %ccba\ and *bbb\ Let V o V = V2 designate all strings of length 2 on V, V ° V ° V = V2<>V = V3 designate all strings of length 3 on V, and, in general, F°F<> ••• «F= Vn designate all strings of length n on V. Then the closure of V, denoted as K+, is defined as v+= vu v2u v3u ••• For completeness, a special string e called the empty (or null) string is often combined with V+ to form the closure set V* of V. That is, V* = {e} U V U V2 \j V3 U - - • = {e} U V+. The string e has the identity property (i.e., x ° e =

30 THE THEORY AND PRACTICE OF COMPILER WRITING e o x = x for any string x which is an element of V*), and it is called the identity element in the system formed by the set V* and the operation of concatenation. Associativity is another property of this system (that is, (x° y)° z = x°(y°z) = x o y o z for x, y9 z e F*). As an example, consider the set of strings V* that can be generated from an alphabet V = (x, y). Some subsets of F* are V2 = {'xx\ 'xy', %yx\ %yy%) V3 = {'jocx', %xxy\ *xyx\ *xyy\ *yxx\ 'yxy', %yyx\ 'yyy'} V4 = { %xxxx\ 'xxxy*9 %xxyx\ *xxyy\ %xyxx\ %xyxy\ *xyyx\ %xyyy\ *yxxx\ %yxxy\ *yxyx\ *yxyy*, %yyxx\ *yyxy\ *yyyx\ %yyyy%) We will, on many occasions, refer to the closure set V* of an alphabet. As another example of a string, let us examine FORTRAN. The FORTRAN alphabet consists of 26 letters, 10 digits, and a set of special characters, such as '(', ')', V, ,==\ and ' + '. It is only these characters that are used in writing a FORTRAN program. Hence a program can be viewed as the concatenation of characters over an alphabet to yield an arbitrarily long string. In Chap. 4, we see that the problem of ensuring that only the proper set of characters appears in a program is handled by the scanning phase of a compiler. Let x, y, and z be strings over an alphabet where z = xy. The string x is called a prefix or head of z. If y i=- e, then x is called a proper prefix or proper head. Similarly, y is called a suffix or tail of z, and if x =£ e, then y is called a proper suffix or proper tail. Some of the concepts introduced here are used in the next section, where the notions of grammars and languages are introduced. 2-2 DISCUSSION OF GRAMMARS Programming languages must be precisely defined. Unfortunately, for some of the earlier programming languages the existence of a particular compiler finally provided the precise definition of such a language. The proper specification of a programming language involves the definition of the following: 1. The set of symbols (or alphabet) that can be used to construct correct programs 2. The set of all syntactically correct programs 3. The "meaning" of all syntactically correct programs In this section we shall be concerned with the first two items in the specification of programming languages. A language L can be considered a subset of the closure (including the empty string) of an alphabet. The language consisting of this closure set is not particu-

NOTATION AND CONCEPTS FOR LANGUAGES AND GRAMMARS 31 larly interesting, since it is too large. Our definition of a language L is a set of strings or sentences over some finite alphabet VT, so that L c K£. How can a language be represented? A language consists of a finite or an infinite set of sentences. Finite languages can be specified by exhaustively enumerating all their sentences. However, for infinite languages, such an enumeration is not possible. On the other hand, any means of specifying a language should be finite. One method of specification which satisfies this requirement uses a generative device called a grammar. A grammar consists of a finite nonempty set of rules or productions which specify the syntax of the language. As well, a grammar imposes structure on the sentences of a language. Many grammars may generate the same language but impose different structures on the sentences of that language. The study of grammars constitutes an important subarea of computer science called formal language theory. This area emerged in the mid- 1950s as a result of the efforts of Noam Chomsky (1959), who gave a mathematical model of a grammar in connection with his study of natural languages. In 1960, the concept of a grammar became important to programmers because the syntax of ALGOL 60 was described by a grammar. A second method of language specification is to have a machine, called an acceptor, determine whether a given sentence belongs to the language. This approach is discussed further in Chaps. 4, 6, and 7 along with some very interesting and important relationships that exist between grammars and acceptors. In this section we are concerned with a grammar as a mathematical system for defining languages and as a device for giving some useful structure to sentences in a language. It was mentioned earlier that a grammar imposes a structure on the sentences of a language. For a sentence in English such a structure is described in terms of subject, predicate, phrase, noun, and so on. On the other hand, for a program, the structure is given in terms of procedures, statements, expressions, etc. In any case, it may be desirable to describe all such structures and to obtain a set of all the correct or admissible sentences in a language. For example, we may have a set of correct sentences in English or a set of valid ALGOL programs. The grammatical structure of a language helps us determine whether a particular sentence does or does not belong to the set of correct sentences. The grammatical structure of a sentence is generally studied by analyzing the various parts of a sentence and their relationships to one another. Consider the sentence "the cat ate a mouse." Its structure (or parse) is shown in Fig. 2-5. This diagram of a parse displays the syntax of a sentence in a manner similar to a tree and is therefore called a syntax tree. Each node in the diagram represents a phrase of the syntax. The words such as "the" and "cat" are the basic symbols, or primitives, of the language. The syntax of a small subset of the English language can be described by using the symbols S: sentence V: verb O: object A: article N: noun SP: subject phrase VP: verb phrase NP: noun phrase

32 THE THEORY AND PRACTICE OF COMPILER WRITING in the following rules: S -> SP VP SP-> AN A -> 'a' A -» •the1 N -* 'monkey1 N -> 'banana' N -» 'cat' N -> 'mouse' N -» 'tree' VP-» VO V-» 'ate' V -* 'climbs' 0-»NP NP-» AN These rules state that a sentence is composed of a "subject phrase" followed by a "verb phrase"; the "subject phrase" is composed of an "article" followed by a "noun"; a verb phrase is composed of a "verb" followed by an "object"; and so on. The structure of a language is discussed by using symbols such as "sentence," "verb," "subject phrase," and "verb phrase," which represent syntactic classes of elements. Each syntactic class consists of a number of alternative structures, and each structure consists of an ordered set of items which are either primitives (of the language) or syntactic classes. These alternative structures are called productions or rules of syntax, or replacement rules. For example, the production, S -> SP VP defines a "sentence" to be composed of a "subject phrase" followed by a "verb phrase." The symbol -* separates the syntactic class "sentence" from its definition. The syntactic class and the arrow symbol along with the interpretation of a production enable us to describe a language. A system or language which describes another language is known as a metalanguage. The metalanguage used to teach German at most universities is subject phrase verb phrase article noun verb article the cat ate a Figure 2-5 Syntax tree for an English sentence. object noun phrase

NOTATION AND CONCEPTS FOR LANGUAGES AND GRAMMARS 33 English, while the metalanguage used to teach English is English. The diagram of the parse of a sentence describes its syntax but not its meaning or semantics. At this time, we are mainly concerned with the syntax of a language, and the device which we have just defined to give the syntactic definition of the language is called a grammar. Using the grammatical rules for our example, we can either produce (generate) or derive a sentence in the language. A computer programmer is concerned with producing programs which adhere to the productions (grammatical rules) of the language. The compiler of a language, on the other hand, is faced with the problem of determining whether a given sentence (source program) is syntactically correct based upon the given grammatical rules. If the syntax is correct, the compiler produces object code. Consider the problem of trying to generate or produce the sentence " the cat ate the mouse" from the set of productions given. It is accomplished by starting first with the syntactic class symbol S and looking for a production which has S to the left of the arrow. There is only one such production, namely, S -» SP VP We have replaced the class S by its only possible composition. We then take the string SPVP and look for a production whose left-hand side is SP and then replace it with the right-hand side of that production. The application of the only production possible produces the string ANVP We next look for a production whose left part is A, and two such productions are found. By selecting the production A -* ' the' and upon substituting its right-hand side in the string A N VP, we obtain the string the N VP This enumerative process is continued until we arrive at the correct sentence. At this point, the sentence contains only primitive or terminal elements of the language (no classes). A complete derivation or generation of the sentence " the cat ate a mouse" is as follows: S => SP VP =* ANVP =* the N VP => the cat VP => the cat V O => the cat ate O => the cat ate NP => the cat ate A N => the cat ate a N => the cat ate a mouse

34 THE THEORY AND PRACTICE OF COMPILER WRITING Here the symbol => denotes that the string on the right-hand side of the symbol can be obtained by applying one replacement rule to the previous string. The rules for the example language can produce a number of sentences. Examples of such sentences are The cat ate a banana. The monkey climbs a tree. The monkey climbs the tree. The banana ate a cat. The last of these sentences, although grammatically correct, does not make sense because of its semantics. This situation is often allowed in the specification of languages. There are many valid FORTRAN and PASCAL programs that do not make sense. It is easier to define languages if certain sentences of questionable validity are allowed by the replacement rules. There is another important thing to note concerning the generation of strings (i.e., sentences) from a grammar. We have neglected (for legibility reasons) to put into the terminal symbols of the grammar the blank characters which normally appear between the words in the sentences. Throughout the book, we assume, unless it is obvious from context of the text, that a blank delimits all terminal symbols. The set of sentences that can be generated by the rules of the previous example is finite. Any interesting language usually consists of an infinite set of sentences. As a matter of fact, the importance of a finite device such as a grammar is that it permits the study of the structure of a language consisting of an infinite set of sentences. Let the symbols L, D, and I denote the classes L: letter, D: digit, and I: identifier. The productions which follow are recursive and produce an infinite set of names because the syntactic class I is present on both the left and the right sides of certain productions. I -» L D-»0 I -* ID D -» 1 I -» IL L -> a D^9 L ->b L -»z It is easily seen that the class I defines an infinite set of strings or names in which each name consists of a letter followed by any number of letters or digits. This set is a consequence of using recursion in the definition of the productions

NOTATION AND CONCEPTS FOR LANGUAGES AND GRAMMARS 35 I -> ID and I -* IL. In fact, recursion is fundamental to the definition of an infinite language by the use of a grammar. Let us now formalize the idea of a grammar and how it is used. For this purpose, let VT be a finite nonempty set of symbols called the terminal alphabet. The symbols in VT are called terminal symbols. The metalanguage which is used to generate strings in the language is assumed to contain a set of syntactic classes or variables called nonterminal symbols. The set of nonterminal symbols is denoted by VN9 and the elements of VN are used to define the syntax (structure) of the language. Furthermore, the sets VN and VT are assumed to be disjoint. The set VN U VT consisting of nonterminal and terminal symbols is called the vocabulary of the language. In much of the discussion in this section we use capital letters such as A9 B9 C,..., X9 Y, Z to denote nonterminal symbols, and Sl9 S2,. • • to represent the elements of the vocabulary. The strings of terminal symbols are denoted by lowercase letters x9 y9 z,..., while strings of symbols over the vocabulary are given by a, /?, y,... . The length of a string a will be denoted by \a\. Definition 2-2 A grammar is defined by a 4-tuple G = (VN9 VT9 S9 <£) where VT and VN are sets of terminal and nonterminal (syntactic class) symbols, respectively. S9 a distinguished element of VN and therefore of the vocabulary, is called the starting symbol. $ is a finite nonempty subset of the relation from (VT U VN)*VN(VT U VN)* to (VT U VN)*. In general, an element (a, /3) is written as a -* /? and is called a production rule or a rewriting rule. For our example defining an identifier, we may write the grammar as Gl = (VN,VT9S,<b) in which VN= {I,L,D} VT= {a9b,c9d,e9f9g9h9i9j9k9l9m9n,o9p9q9r9 s919 u9v9w9x9 y9 z, 0,1,2,3,4,5,6,7,8,9} S = I $ = {/ -» L9I - IL9I - ID9L-> a9L-*b9...9L^ z, D -»0,Z)-> l,...,Z)-»9} Definition 2-3 Let G = (VN9 VT9 S, $) be a grammar. For a, ^ e V*9 a is said to be a direct derivative of \p9 written as \p => a, if there are strings <j>l and </>2 (including possibly empty strings) such that \f/ = <t>i<x<t>2 and a = ^/ty^ anc* a -> /? is a production of G. If \p => a, we may also say that ^ directly produces a or a directly reduces to \f/. For grammar Gx of our example, we have listed in Table 2-2 some illustrations of direct derivations. The concepts can now be extended to produce a string a not necessarily directly but in a number of steps from a string \p.

36 THE THEORY AND PRACTICE OF COMPILER WRITING Table 2-2 + / LL LDL LDDL a L Lx L\L L2DL Rule used /- L L- x D^ 1 D - 2 *i e L L L <t>2 e e L DL Definition 2-4 Let G = (VN, VT, S, <£) be a grammar. The string \f/ produces o (a reduces to \f/9 or a is the derivative of \p), written as \f/ => a, if there are strings </>0, </>!,...,<J>„ (/i > 0) such that ^ = <f>0 ^ *i»*i ^ *2> • • • >*n-i =* *w = a. The relation => is the transitive closure of the relation =>. If we let n = 0, we can define the reflexive transitive closure of => as i//=»a <=> ^=»a or \p = a Returning to grammar Gx, we show that the string a\3 is derived from I by following the derivation sequence: /=>/!)=> jdd => LZ>Z) => aZ)Z) => AID => al3 Note that as long as we have a nonterminal symbol in the string, we can produce a new string from it (assuming no rules of the form A -* A). On the other hand, if a string contains only terminal symbols, then the derivation is complete, and we cannot produce any further strings from it. Definition 2-5 A sentential form is any derivative of the unique nonterminal symbol S. The language L generated by a grammar G is the set of all sentential forms whose symbols are terminal, i.e., L(G) = (a|S^>aandae V*} Therefore, the language is merely a subset of the set of all terminal strings (sentences) over VT. We conclude this section by introducing the Backus Naur form (BNF) of metalanguage, which is slightly different from the metalanguage which was previously used. The metavariables or syntactic classes will be enclosed by the symbols ( and >. Using this terminology, the symbol (sentence) is a symbol of VN and the symbol "sentence" is an element of VT. In this way, no confusion or ambiguity arises when we attempt to distinguish the two symbols. BNF has been used extensively in the formal definition of many programming languages. A popular language described using BNF is ALGOL. For example, the definition of an identifier in BNF is given as (identifier):: = (letter) | (identifier) (letter) | (identifier) (digit) (letter)::= a|b|c| ••• |y|z (digit>::=0|l|2|...|8|9

NOTATION AND CONCEPTS FOR LANGUAGES AND GRAMMARS 37 Note that the symbol ::= replaces the symbol -> in the grammar notation, and | is used to separate different right-hand sides of productions corresponding to the same left-hand side. The symbols ::= and | are interpreted as "is defined as" and "or," respectively. In Sec. 2-4.4 we examine an extended version of BNF. In this section we have formally introduced the concept of a grammar. There are a number of restrictions that can be placed on the rewriting rules of a grammar. The result of imposing these restrictions is to classify grammars into four distinct classes. The next section looks at this classification of grammars. EXERCISES 2-2 1 Using the example grammar in the text which described a small subset of the English language, give derivations for the following sentences: the monkey ate a banana the cat climbs a tree the cat ate the monkey 2 Consider the following grammar with the set of terminal symbols {a,b}\ S -> a S -+ Sa S -> b S -> bS