Principal An Implementation Guide to Compiler Writing

An Implementation Guide to Compiler Writing

Año:
1982
Editorial:
Mcgraw-Hill (Tx)
Idioma:
english
Páginas:
259 / 268
ISBN 10:
0070651663
ISBN 13:
9780070651661
File:
DJVU, 1.71 MB
Descarga (djvu, 1.71 MB)
 
You can write a book review and share your experiences. Other readers will always be interested in your opinion of the books you've read. Whether you've loved the book or not, if you give your honest and detailed thoughts then people will find new books that are right for them.
AN IMPLEMENTATION GUIDE TO COMP LER VVRT NG



JEAN-PAUL TREMBLAY AND PAUL G SORENSON



;/\



AN IMPLEMENTATION GUIDE TO COMPILER WRITING



Jean-Paul T rernblay Paul G. Sorenson



Department of Computational Science University of Saskatchewan, Saskatoon Canada



McGRAW-HILL BOOK COMPANY



New York St. Louis San Francisco Auckland Bogota Hamburg Johannesburg London Madrid Mexico Montreal New Delhi Panama Paris Sao Paulo Singapore Sydney Tokyo Toronto



The editor was James E. Vastyan; the production supervisor was Dennis J. Conroy. The cover was designed by Anne Canevari Green. The Whitlock Press, Inc., was printer and binder.



An Implementation Guide to Compiler Writing



Copyright @ 1982 by McGraw-Hili, Inc. All rights reserved. Printed in the United States of America. Except as permitted under the United States Copyright Act of 1976, no part of this publication may be reproduced or distributed in any form or by any means, or stored in a data base or retrieval system, without the prior written permission of the publisher. 1234567890WHWH 898765432



ISBN 0-07-065166-3



Library of Congress Cataloging in Publication Data Tremblay, Jean-Paul, date An implementation guide to compiler writing.



Bibliography: p. Includes index. 1. Compiling (Electronic computers) P. G. II. Title. QA76.6.T728 001.64 '25 ISBN 0-07-065166-3



I. Sorenson,



82-7119 AACR2



CONTENTS



PREFACE IX



CHAPTER 1 THE DESIGN AND IMPLEMENTATION OF THE PROGRAMMING LANGUAGE GAUSS 1



1.1 Introduction 1 1.2 Program 2 1.3 Declarations 3 1-4 Procedures 6 1-4.1 Procedure Definition 6 1-4.2 Procedure Calls 7 1-4.3 Procedure Returns . 8 1-5 Statements 8 1-6 Control Structures 9 1-6.1 Choice 9 1.6.2 Repetition 10 1.6.3 Procedure and Program Termination 10 1-7 Assignment Statement 10 1-8 Expressions 12 1.8.1 Integer Expressions 12 1-8.2 String Expressions 13 1-8.3 Logical Expressions 13 1-9 Input and Output 16 1-10 Lexical Structure 18 1-11 Control Commands and Compiler Aids 19 1-12 Sample Programs 22 Bibliography 26



vi CONTENT; S



CHAPTER 2 THE IMPLEMENTATION OF A COMPILER FOR GAUSS 27



2-1 Introduction 27 2-2 Scanner 28 2.2.1 Introduction 29 2-2.2 The Scanner as a Finite-State Machine 29 2-2.3 Top-down Description of Scanner 31 2-2.4 Structural Representation of Data 31 2-2.5 Implementation of the Scanner 32 2-2.6 Scanner Source Code 35 2-3 Parser 45 2-3.1 Introduction 46 2-3.2 LL(I) Grammar 46 2-3.3 Parsing Function 57 2-3.4 Top-down Description of Parser 59 2-3.5 Implementation of the Parser 59 2-3.6 Syntactic Error Recovery 62 2-3.7 LL(I) Parser Source Code 66 2-4 Table Handler 99 2-4.1 Introduction 99 2-4.2 Symbol Table Organization 100 2-4.3 Top-down Description of Table Handler 102 2-4.4 LOOKUP Procedure 102 2-4.5 SYM-REM Procedure 103 2-4.6 Table Handler Source Code 104 2-5 Code Generator 109 2-5.1 Intermediate Code Environment 109 2-5.1.1 Hypothetical Machine Organization 110 2-5.1.2 Activation Records 110 2-5.1.3 Address 111 2-5.1.4 Strings 111 2-5.2 Code Generation and Semantic Analysis Strategy 111 2-5.2.1 Compile-Time Environment 113 2-5.2.2 Augmented Grammar 113 2-5.3 Top-down Description of Code Generator 116 2-5.4 Declaration Semantics 117 2-5.4.1 Arrays 117 2-5.4.2 Variable Declarations 118 2-5.4.3 Constant Declarations 121 2-5.4.4 Procedure Declarations 123 2-5.4.5 Declaration Examples 125 2-5.5 Procedure Semantics 126 2-5.5.1 Procedure Call 126 2-5.5.2 Procedure Body 130 2-5.5.3 Procedure Return 130 2-5.6 Expressions 131 2-5.6.1 Primaries 131 2-5.6.2 Operations 133 2-5.7 Assignment Semantics 136



CONTENTS



2-5.8 Control Structure Semantic 2-5.8.1 If-then-else 2-5.8.2 Loop-while 2-5.8.3 Stop 2-5.9 Input Semantics 2-5.10 Output Semantics 2-5.11 Semantic Error Recovery 2-5.12 Action Routines Source Listing 2-5.13 Compilation Errors 2-6 Gauss Source Listing Bibliography



CHAPTER 3 THE RUN-TIME ENVIRONMENT



3-1 Introduction 3-2 The Implementation of the Hypothetical Machine Interpreter 3-2.1 The Run-Time Environment 3-2.2 String-Space Organization 3-2.3 Garbage Collection of String Space 3-2.4 Software Organization 3-2.5 Sample Programs 3-2.5.1 Factorial Program 3-2.5.2 Simple Sort Program 3-3 Run-Time Errors



APPENDIX DESCRIPTION OF THE OBJECT-MACHINE LANGUAGE



1 Notation 2 Instructions



INDEX



.. VII



138 138 139 139 139 140 141 142 176 178 192



193



193 194 194 197 199 199 234 234 237 237



244



244 245



256



PREF ACE



An important aspect of teaching a course in compiler writing is to illustrate the key theoretical concepts normally taught in such a course. A frequently used strategy for achieving this goal is to have a student design a simple programming language and implement a compiler for this language. Many texts in compiler writing do not, because of size considerations and level of presentation, adequately present details that illustrate the implementation of a compiler. This book presents, in a case study manner, the development of a small compiler typical of those developed in a first compiler course. Instructors can use this book as a guide to illustrate difficulties that commonly arise when implementing a compiler. They can also use the guide as an example document that is similar to what students should produce for their own compilers. It is necessary for instructors that teach a course in compiler writing to set up their own "toy" compiler as a vehicle for illustrating techniques for implementing a compiler. Also, many of the problems that are encountered in writing a compiler for a newly designed language must be illustrated. This guide is an attempt to fulfill this role. Although this book has been written to accompany our main text entitled "The Theory and Practice of Compiler Writing", it can also be used with any other book that contains the key theoretical concepts of compiler writing. In particular, this guide requires a basic knowledge of scanners, LL(I) parsing, attributed grammar translation, error detection and repair, symbol-table methods, run-time storage management, code generation and interpreters. This book begins with a documented description of a simple programming language, GAUSS, which serves as an example for the student. GAUSS is a block- structured language whose design was influenced by ALGOL, PASCAL, PL/I and FORTRAN. It also contains string-manipulation facilities capable of manipulating variable-length strings. Chapter 2 illustrates the topics of scanner generation, LL(I) parsing, symbol table construction; code generation, semantic analysis, error detection and repair for the GAUSS language. Chapter 3 describes an interpreter for the code generated by the GAUSS compiler and discusses the run-time environment in which the machine code is executed. The code generated by the compiler executes on a hypothetical machine which is stack-oriented. The compiler and interpreter are programmed in PL/I. We owe a debt of thanks to the many people who assisted in the preparation of this manuscript and the implementation of the GAUSS compiler and hypothetical-machine interpreter. Doug Bulbeck and Darwin Peachey assisted in the design of GAUSS and the implementation of the scanner, parser and the error handler. Joe Wald contributed to the code generation phase, attribute translation and the design of the hypothetical machine. Lyle Opseth assisted throughout the preparation of the manuscript and, in particular, wrote the interpreter for the hypothetical machine. Beth Protsko assisted in the preparation of the index. We are also grateful for the comments of our students in the Dep

rtment of Computational Science at the University of Saskatchewan, who have class-tested preliminary versions of the book over the last five years.



Jean-Paul Tremblay



Paul G. Sorenson



Chapter 1



The Design and Description of the Programming Language GAUSS



1-1 INTRODUCTION



This chapter describes a small, reasonably well-structured programming language named GAUSS. GAUSS incorporates many of the features one would hope to find in a small programming language whose data types are integers, strings, and logical values, and arrays of these types. Programming languages in the family of ALGOL-derived languages require integers for counting and indexing arrays and strings, and require logical values to facilitate their flow of control, so GAUSS without its string data type could be looked upon as a sort of minimum ALGOL-like language. Thus, if we look for the area of application to which GAUSS is tailored, string manipulation is our only candidate. GAUSS is advanced beyond PL/I in string manipulation because GAUSS contains strings which may vary freely in length from zero to some implementation- defined upper bound. GAUSS also contains a few operators which exist in PL/Ionly as built-in functions; this tends to make the GAUSS formulation of string- manipulation tasks easier and more readable than the PL/I formulation. It cannot be said, however, that GAUSS is oriented toward string manipulation to the extent that SNOBOL4 is. In fact, the present form of GAUSS is primarily due to the desire



2



THE DESIGN AND DESCRIPTION OF GAUSS



to design a programming language whose overall shape is similar to the typical ALGOL-like language. Thus the problems posed in the task of constructing a compiler for GAUSS are a reasonable subset of those encountered in constructing a compiler for the most commonly impleMented programming languages. As one might expect from the preceding discussion, specific features found in GAUSS are derived from several of the more common programming languages. The overall format of the language is influenced by ALGOL-W (Sites [1972]) and C (Ritchie [1974]). Some details of the operators, expressions, and syntax are derived from PL/I (IBM), PASCAL (Jensen and Wirth [1975]), EULER (Wirth and Weber [1966]), and others. The design of a programming language is usually motivated by an intended area of application. The only application area intepded for GAUSS ic; its implementation by computer science students. The choice of language constructs was made primarily on the basis of their implementation difficulty. The synta

of some constructs was modified so that it would be consistent throughout the language. Some constructs were omitted because they were considered detrimental to structured programming techniques (e.g.. the GOTO construct). The metalanguage used to describe the syntax of the GAUSS language is an extended BNF form. For a complete description of this metalanguage, see Sec. 2-4.4 of the text 'The Theory and Practice of Compiler Writing." (Tremblay and Sorenson [1982]).



}..2 PROGRAM



Programs in GAUSS have the following form:



<program> ::= {<declaration>;} <procedure>; {<procedure>;}



A program is defined as any number of declarations followed by one or more procedures. The global declarations are discusserl in Sec. 1-3 and the procedures are described in Sec. 1-4. The program is initiated by a call to the procedure called MAIN. This procedure must occur in the source text, and it must be parameterless. Note that separate compilation is not available in GAUSS. Th;c; means that all procedures of a program must be present in the source file. The following is an example of a GAUSS program:



$$ THIS PROGRAM READS INTEGERS AND $$ PRINTS THEIR FACTORIAL VALUE



INTEGER PROC FACTORIAL (INTEGER PARM) FORWARD; PROC MAIN INTEGER I;



LOOP WHILE NOT END_OF ..FILE;



READ I;



IF I < 0



1-3 DECLARATIONS



3



THEN WRITE '1LLEGAL VALUE READ IN:", I US ING " s X2 IN". , " ,



ELSE WRITE FACTORIAL (I) USING "110, N"; ENDIF; ENDLOOP; ENDPROC;



$$ 'FACTORIAL' RETURNS THE FACTORIAL VALUE OF ITS ARGUMENT.



INTEGER PROC FACTORIAL (INTEGER I) INTEGER LMINUS_ONE;



IF I > 1 THEN LMINUS_ONE := I - 1; RETURN I * FACTORIAL (I.-MINUS_ONE); ELSE RETURN 1; ENDIF; ENDPROC;



This program computes the factorial of the given integers. The MAIN procedure of the program begins by reading in an integer. If the integer read in is greater than or equal to zero, the factorial value is computed by calling the FACTORIAL procedure. This value is then printed. If the integer read in is less than zero, the program prints an error message. This process is then repeated. When the input stream is exhausted (when END__OF --FILE becomes TRUE), the program stops. The factoriaf of the integer read in is computed by the recursive FACTORIAL procedure.



1-3 DECLARATIONS



This section describes the syntax for declaring variable identifiers and constant identifiers in GAUSS. These declarations are used to associate a specific type with an identifier. The three basic types are INTEGER, LOGICAL and STRING. Values of integer type have the precision allowed by the machine on which the compiler runs (e.g., -32768 to 32767 on a PDPll/40). The logical type has either TRUE or FALSE as a value. String-type values are character strings of arbitrary dynamically varying length. The only aggregate data type is the ARRAY type. Arrays are multi-dimensional and consist of simple elements. In order to discuss the scope of identifiers, it is necessary to define "global" and "Iocal" identifiers. Global identifiers are those that are declared at the beginning of the program, before any procedures are defined. These identifiers are available to every procedure in the program. The local identifiers are those declared at the beginning of a procedure, and they are known only within the procedure (see Sec. 1-4.1 on procedure definitions). To define the scope of identifiers in GAUSS we introduce the notion of an entity. An entity is a procedure name, a global identifier, a local identifier, or a pararrleter. The scope of a global identifier or a procedure name extends from the



4



THE DESIGN AND DESCRIPTION OF GAUSS



point at which it is declared to the end of the program. The scope of a local identifier or a parameter extends from the point at which it is declared to the end of the procedure in which it is declared. At any point in the program no two entities may be named by the same identifier. ' Consider the following program example:



INTEGER COUNT;



PROC MAIN INTEGER I;



ENDPROC;



PROC ALPHA (LOGICAL PARM) STRING I;



ENDPROC;



The scope of the global variable COUNT declared in the first line of the program extends from the point where it is declared to the end of the program. Thus it would be illegal to have an identifier COUNT within either of the two procedures, or to have a procedure called COUNT. The scope of the identifier I declared in the MAIN procedure extends only to the end of that procedure. Therefore, the second declaration of the identifier I in the ALPHA procedure is perfectly valid. The parameter PARM extends from its declaration in the parameter list of the AL.PHA procedure to the end of that procedure. Note that the ALPHA procedure is not declared until after the MAIN procedure. Therefore, the scope of the ALPHA procedure does not include the MAIN procedure. It follows that MAIN cannot have a call to ALPHA. All identifiers which are used must be previously declared. There are no default declarations. Declarations in GAUSS are defined as follows:



<declaration> ::= <type specifier> <identifier list> I <type specifier> CONST <const list> <const list> ::= <identifier> = <constant> {, <identifier> = <constant>} <type specifier> ::= <simple type> I <array> <simple type> ::= INTEGER I STRING I LOGICAL <array> ::= ARRAY «bound pair> {,<bound pair>} ) OF <simple type> <bound pair> ::= [<integer expression> :] <integer expression> <constant> ::= <simple constant> I <array constant> <simple constant> ::= [-]<number> I <literal> I TRUE I FALSE <array constant>::= «simple constant> {, simple constant>}) <identifier list> ::= <identifier> {, <identifier>}



Some examples of declarations as they might appear in a program follow. They are



1-3 DECLARATIONS



5



discussed in the subsequent paragraphs.



INTEGER I; STRING ALPHA, BETA, GAMMA; LOGICAL CONST TEST = TRUE, FAIL = FALSE; ARRAY (5,10, -12 : 33) OF LOGICAL HOSPITALDRUG

ECORD; ARRAY (2, 3) OF INTEGER CONST CARDS = (1, 3, 6,9, 12, 15);



The first example declares the identifier I to be an integer type variable. The next example declares ALPHA, BETA, and GAMMA to be string variables. The third example declares TEST and FAIL to be logical constant identifiers, with the values of TRUE and FALSE respectively. The next line gives a declaration of a three- dimensional array whose name is HOSPIT AL...DRUG

ECORD. Each of the 2,300 elements can contain one logical value. The next array declaration declares a constant array CARDS with the given integer values. As can be seen, a list of identifiers can be associated with one type specification. This eliminates the need to repeat the specifications for every identifier when several identifiers are of the same type. There are two basic kinds of declarations in GAUSS. The first kind, the variable declaration, is used to associate an identifier with a storage location whose contents may vary over its lifetime. The <type specifier> indicates the data type of the storage location, that is, the set of values it may contain. The possible data types are INTEGER, STRING, LOGICAL, or an ARRAY of type INTEGER, STRING, or LOGICAL. The second kind, the constant declaration, is used to specify constant identifiers. The CONST keyword indicates that each identifier in the list is associated with the specified value and retains that association throughout its lifetime. The constants assigned to the identifiers in the list must be of the same type as the <simple type> specified at the beginning of the declaration. Constants for arrays must be all of the same type, and are listed in row-major order. These constants are enclosed in parentheses and separated by commas. The <bound pair> for an array specifies the lower and upper bounds of one particular dimension of the array. The number of such <bound pairs> is the number of dimensions in the array. The upper bound must be greater than or equal to the lower bound. If the lower bound is missing, it is assumed to be 1. Note that the integer expressions which specify the bounds of global arrays or constant arrays must be defined at compile time. Therefore, these expressions must be constants or constant identifiers. The bounds of local arrays can be defined at run time; specifically, at the time the procedure is called. Therefore, these expressions can be arbitrarily complex. The referencing of array elements is done by including the subscript list. An example of declaring an array and referencing one of its elements follows:



ARRAY (31, 12, 1925 : 1979) OF STRING HISTORY; HISTORY (1, 1, 1930) := ''THE THIRTIES BEGIN.";



The declaration sets up a 31-by-12-by-55 array of strings, where the dimensions in this case are used to represent the day, month, and year, respectively, for the years from 1925 to 1979 inclusive. The second statement then assigns a string to one of these positions, which corresponds to 1 January 1930.



6



THE DESIGN AND DESCRIPTION OF GAUSS



1-4 PROCEDURES



The main vehicle which permits the writing of modular programs in GAUSS is the procedure. Descriptions of how procedures are defined and subsequently invoked are given in this section. The specification of return of control from a procedure is also presented.



1-4.1 Procedure Definition



This section defines a procedure, which is the

basic unit of a GAUSS program. GAUSS supports nonrecursive and recursive procedures, both of which are defined in the same way. The procedure definition is:



<procedure> ::= [<simple type>] PROC <identifier> <parameter list> <procedure tail> <procedure tail> ::= FORWARD I <procedure body> <procedure body> ::= { <declaration> ; } <statement>; {<statement>;} ENDPROC <parameter list> ::= [«parameter>, {<parameter>})] <parameter> ::= <parameter type> <identifier> <parameter type> ::= <simple type> I ARRAY ( I,} ) OF <simple type>



In the following discussion some examples of procedure declarations are presented.



INTEGER PROC ALPHA (ARRAY (,,) OF STRING CHARS, LOGICAL TEST)



ENDPROC;



This first example defines a procedure called ALPHA which returns an integer value to the calling routine. It has two parameters, the first of which is a three-dimensional array of strings called CHARS. Note that the bounds on the dimensions of the array are obtained from the calling procedure. The second parameter is a logical variable called TEST.



The next example shows the declaration of a procedure that does not return a value to the calling procedure, and has no parameters.



PROC CHECK FORWARD;



The following discussion examines the procedure declaration in a more detailed manner. A <procedure> consists of a "procedure head" followed by a <procedure tail>. The <procedure tail> can be the keyword FORWARD, indicating that the procedure is defined later. A FORWARD declaration is required only when a procedure call is encountered before its definition appears during compilation. This is necessary to allow one-pass compilation of the GAUSS programs. A <procedure tail> can also be a <procedure body>, that is, a qroup of



1-4 PROCEDURES



7



declarations and statements defining a unit of execution to be performed when the procedure is called at run time. The <procedure body> consists of any number of <declaration>s, followed by at least one <statement>. The <procedure body> must be ended with the ENDPROC keyword. The <simple type> specification in the procedure head refers to the type of the value to be returned by the procedure. This must be present for a procedure that acts as a function, where a returned value is needed. For a procedure which does not return a value, the <simple type> must be omitted. Note that procedures with a <simple type> specifier must always return a value. After the optional <simple type> comes the keyword PROC which specifies the beginning of the procedure. The identifier following the keyword PROC is the name of the procedure, which must not be defined previous to this point in the program, except if a FORWARD declaration has been used. The <parameter list>, if present, is a list of <parameter>S enclosed in parentheses. If there are no <parameter>s, then there is no <parameter list> in the procedure definition. Argument-parameter correspondence between the procedure call and the parameter list of the procedure is determined by relative order (i.e., the ith argument is associated with the ith parameter). Parameters consist of a <parameter type> followed by an identifier. The appearance of the identifier in the parameter list constitutes its declaration. Corresponding parameters and arguments must, of course, have the same type and dimension. Parameters of type ARRAY are specified without explicit bounds. The number of commas between the parentheses indicates the number of dimensions of the array parameter (i.e., the number of commas plus one). The bounds for each dimension of the parameter are copied from the bounds of the argument at run time.



1-4.2 Procedure Calls



A procedure call can be used like a statement, or like an expression. The difference depends on whether or not the optional <simple type> specification is present before the PROC keyword in the procedure declaration. The procedure call is a statement when the <simple type> specifier has been omitted. The procedure call can be used as an expression if and only if the <simple type> specifier is present. The procedure call is defined as:



<procedure call> ::= <identifier> <argument list> <argument list> ::= [«argument> {, <argument>})] <argument> ::= <identifier> [ <sublist > ]



Pass-by-reference is the standard parameter passing convention used by GAUSS. This means that whenever a variable has its value changed in the called procedure, it is changed in the calling procedure as well. Therefore, all arguments in the argument list must be variable identifiers (as opposed to constant identifiers or expressions). Pass-by-value can be easily simulated by a GAUSS programmer. A dummy variable in the calling procedure is assigned the desired value and then used in the procedure call. Some examples of procedure calls are as follows:



8



THE DESIGN AND DESCRIPTION OF GAUSS



MOVE (SOURCE, DESTINATION, COUNT); X = PI. ,



The first call is to a procedure called MOVE which takes three arguments. This call is used as a statement and, therefore, the called procedure cannot return a value. The second procedure call is to a procedure called PI which does not take any arguments. This call is used as an expression, and therefore the called procedure must return a value. The identifier in the procedure call is the name of the procedure. It is illegal to call a procedure which has not been previously declared. The <argument list> is a possibly empty list of <argument>s. An argument must be a variable identifier whose type and dimension are consistent with the corresponding parameter. Also, the number of arguments and parameters must be the same. The optional <sublist>, which is used to select a single array element, is of the form:



<sublist> ::= «integer expression> {, <integer expression> })



The number of <integer expression>s in the <sublist> must be equal to the number of <bound pair>s in the declaration for the array named by the identifier. Furthermore, the value of each expression must fall between the bounds specified by the corresponding <bound pair> in the declaration. The identifier must be associated by a prior declaration with a variable of array type. The optional <sublist> is used to pass one element of an array to a procedure. If the optional <sublist> is not used with an array identifier, the entire array is passed to the procedure. The <sublist> is valid only if it is used with an array identifier.



1-4.3 Procedure Returns



The return statement is used to stop execution of the procedure and return to the calling procedure, or to the run-time support system, in the case of a return from the MAIN procedure. The return statement can also return a value to the point of call in the calling procedure. The return statement must return a value if the procedure is declared to return a value (see Sec. 1-4.1). The value of the expression returned must be of the same <simple type> as that specified in the procedure head. If the procedure is not declared to return a value, then no value can be returned. The return statement is specified as follows:



<return stmt> ::= RETURN [<expression>]



If a procedure executes its last statement and tries to proceed beyond the ENDPROC delimiter, an automatic return takes place. If the procedure has been declared to return a value, it must execute an explicit return statement, which contains the value of an expression of the appropriate type.



1-5 STATEMENTS



The following is a definition of the executable statements of the GAUSS language. The syntax of each statement is defined in subsequent sections, except



1-6 CONTROL STRUCTURES



9



for the <procedure call> and the <return stmt> which have already been described.



<statement> ::= <procedure call> I <return stmt> I <if stmt> I <while stmt> I <stop stmt> I <assignment stmt> I <read stmt> I <write stmt>



The <procedure call> must be to a procedure which does not return a value.



1-6 CONTROL STRUCTURES



GAUSS provides three major control structures which can be used in a procedure. These structures allow choice, repetition and termination in the flow of control of a program.



1-6.1 Choice



The choice control structure is the IF-THEN-ELSE-ENDIF construct whose syntax is as follows:



<if stmt> ::= IF <logical expression> <then clause> [<else clause>] END IF <then clause> ::= THEN <statement> ; { <statement> ; } <else clause> ::= ELSE <statement> ; { <statement> ; }



The <logical expression> is evaluated at execution time and, if it is true, the group of statements following the THEN is executed. If the <logical expression> evaluates to false, then the group of statements following the ELSE is executed. If the optional <else clause> is not present, and the <logical expression> evaluates to false, then no statements are executed, and execution continues with the statement following the ENDIF keyword. An example of the IF construct is as follows:



IFA<B THEN K := A; A := B; ELSE K := B; B := A; ENDIF



This construct has the result of making A and B equal to the larger of the values of A and B, and setting K to the smaller of the values of A and B.



10 THE DESIGN AND DESCRIPTION OF GAUSS



1-6.2 Repetition



The control construct for affecting repetition is the LOOP-WHILE construct whose syntax is as follows:



<while stmt> ::= LOOP {<statement> ;} WHILE <logical expression> ; {<statement> ;} ENDLOOP



This loop construct is executed at run time in the following manner. First, the statements before the WHILE portion of the loop are executed. The logical expression is then evaluated. If it is true, then execution continues with the statements between the WHILE part of the loop and the ENDLOOP delimiter. When the ENDLOOP delimiter is encountered, execution continues from the beginning of the loop construct, and the process is repeated. If the <logical expression> evaluates to false, then the loop is terminated, and the next statement executed is the one following the END LOOP delimiter. The following illustration is an example of a WHILE statement which reads in a student number and uses the student number as an array index. This is done for each iteration of the loop. When the student number is the dummy value 9999, the loop terminates.



LOOP READ STUDENT J'JO; WHILE STUDENTJ'JO <> 9999; WRITE CLASS (STUDENT J'JO) USING "S, N"; ENDLOOP;



1-6.3 Procedure and Program Termination



The STOP control construct is provided to allow the programmer to end execution immediately. This is useful when a deeply nested procedure activation determines that the program should terminate. The only way such a construct may be avoided is by means of an elaborate use of returned flags which indicate the status of procedures upon their return. Such complexity is well worth avoiding, especially since the STOP statement provides no real challenge for the compiler. The syntax of the STOP statement is as follows:



<stop stmt> ::= STOP



As described in the previous section on procedures, GAUSS contains a RETURN statement which causes the termination of a procedure, with or without a returned value.



1-' ASSIGNMENT STATEMENT



The GAUSS assignment statement has the form



<assignment stmt> ::= <target> := <expression>



1-7 ASSIGNMENT STATEMENT 11



The value of the <expression> is copied into the <target> which has the following form:



<target> ::= <identifier> [ <sublist> ] [ <substring> ]



The identifier in the <target> must be a variable identifier. The optional <sublist> is given if and only if the identifier designates an array variable. The <sublist> has been discussed previously, but note that in this case, the <subIist> must be present if the identifier is an array variable (see Sec. 1-4.2). The optional <substring> may be given only if the identifier is type string or array of string. The optional <substring> has the form



<substring> ::= ( <start> '1 ' [ <length> ] ) <start> ::= <integer expression> <length> ::= <integer expression>



The <substring> specifies the start and length of a portion of the contents of a string variable. The integer expressions given by <start> and <length> and enclosed in parentheses, denote the starting position and length of the substring. The position of the first symbol in the string is designated by the integer 1. The <length> expression is optional. If it is missing the substring extends from the <start> position to the end of the string. It is illegal to specify a substring which does not lie entirely within the current bounds of the string. A <string expression> of arbitrary' length may be assigned to a <target> with the optional <substring> specification. The resulting value of the string (of which the substring is a part) is the original value with the substring replaced by the value of the right-hand side of the assignment statement. For example, the assignment, ZAP(51 0) := "THINGS";, will insert the string "THINGS" between the fourth and fifth characters of the string called ZAP. Note that in all cases of assignment, the expression must be of the same type as the target. The following illustration demonstrates some simple assignment statements. It has been assumed that X and Y have been declared INTEGER, A has been declared STRING, and FOREVER has been declared to have LOGICAL type.



X := 37 * Y + 3; A := "HELLO WORLD"; FOREVER := FALSE;



The next example shows assignments to array elements. It is assumed that the identifiers have had the proper declarations, and that the indices into an array are within the bounds for each dimension of the array.



TABLE (3, 4) := TABLE (3, 1) / TABLE (1, 4); FLAGS (M - N + 1) := LAND S OR (FLAGS (M) AND FLAGS (N));



The next examples show how an assignment to a substring is made. Again, it is assumed all identifiers have been declared with the appropriate types.



SA (41 2) := "THIS"; STR..ARRA Y (2, 2) (41 0) := "REALL Y ";



12 THE DESIGN AND DESCRIPTION OF GAUSS



If SA originally had the value



"OF IT"



after the assignment, it would have the value



"OF THIS"



This could also have been accomplished using the assignment



SA (41 ) := "THIS";



If STR..ARRA Y (2, 2) originally had the value



"IT IS"



then after the assignment, this element will have the value



'1T REALLY IS"



1-8 EXPRESSIONS



The expression is an important construct in GAUSS because it is the structural unit which may occupy the right-hand side of an assignment statement. Since an assignment is the most commonly used operation in the ALGOL family of languages (ALGOL-60, ALGOL-W, ALGOL-68, PL/I, ...) the importance of the expression is obvious.



<expression> ::= <integer expression> I <logical expression> I <string expression>



1-8.1 Integer Expressions



The following description defines the syntax of the <integer expression> as well as the precedence and associativity of the allowed operators.



<integer expression> ::= <integer expression> + <term> I <integer expression> - <term> I <t€rm> <term> ::= <term> * <factor> I <term> / <factor> I <term> % <factor> I <factor>



The binary operators just presented have the following meanings:



1-8 EXPRESSIONS 13



+ Signed integer addition - Signed integer subtraction * Signed integer multiplication / Signed integer division (the result is truncated) % Mod or remainder function



The following statements demonstrate the evaluation of the mod function.



-10 % 3 = 2, 10 % -3 = -2 10 % 3 = 1, -10 % -3 = -1 -(a % b) = -a % -b -a % b = b - (a % b) a % -b = (a % b) - b



The sole unary operation on integer values is the unary minus ' -' which has the effect of multiplying its operand by -1. Its syntax is as follows:



<factor> ::= [-] <integer primary> <integer primary> ::= «integer expression» I <identifier> [<sublist>] I <number> I <procedure call> I <string expression> @ <string expression> I # <string expression>



Some of the possible <integer primary>s require further qualification. The identifier must be associated by a prior declaration with an object of integer type. The optional <sublist> is to be used if and only if the identifier is used to designate an array of integer type. Note that constant array elements are allowed as integer primaries. The <procedure call:>'must be to a procedure which has been declared to have an integer value, Le., a procedure of the form



INTEGER PROC whatever ...



The <integer primary> '<string expression> @ <string expression>' which provides some pattern matching capabilities is illustrated below.



STRING_l @ STRING...2



This expression gives the number of the position in STRING_l where the first occurrence of STRING...2 in STRING_l begins. If STRING...2 does not occur in STRING_l then the value of the expression is O. The @ operator behaves exactly like the built -in INDEX function of PL/I. The final possible <integer primary> is of the form



# STRING



and it yields the current length of the string.



14 THE DESIGN AND DESCRIPTION OF GAUSS



1-8.2 String Expressions



The syntax of a string expression is as follows:



<string expression> ::= <string expression> & <string primary> I <string primary>



The operator & specifies the concatenation of the <string primary> onto the end of the <string expression>.



<string primary> ::= «string expression» I <literal> I <identifier> [<sub list>] [<substring>] I <procedure call>



Some examples of <string expressions> and assignments are:



DATE := MONTH & DAY & YEAR; STR....ARRAY (2,3) (11 4) := "GOOD-BYE CRUEL WORLD" & NOTE (11 # NOTE / 2);



The first expression concatenates the values of the three string variables, MONTH, DAY and YEAR together, and assigns the resulting string to DATE. The second example concatenates the given literal, plus the first half of the string value of NOTE, and assigns the resulting string to a substring of an array element. The identifier must be associated by a prior declaration with an object of type string. The optional <sublist> is used if and only if the identifier is an array identifier. The optional substring selection has been described in Sec. 1-7. A <procedure call> may be a <string primary> if the procedure is defined to return a string value.



1-8.3 Logical Expressions



The syntax of the logical expression is as follows:



<logical expression> ::= <logical part> <relation> <logical part> I <integer expression> <relation> <integer expression> I <string expression> <relation> <string expression> I <logical part> <logical part> ::= <logical part> OR <logical term> I <logical term> <logical term> ::= <logical term> AND <logical factor> I <logical factor> <logical factor> ::= NOT <logical primary> I <logical primary>



This description defines the construction of <logical expression>s from



1-8 EXPRESSIONS 15



<logical primary>S using the three logical operators OR, AND, and NOT. The precedences of the operators are specified by the syntax as NOT> AND> OR. The NOT operator inverts the truth value of its operand. The OR and AND operators differ from those of symbolic logic in the following manner. If the left operand of the OR is true, or the left operand of the AND is false, the operators yield the values TRUE and FALSE respectively, without evaluating their right operands. This means that a compiler for GAUSS must generate code for logical expressions such that the left operand of an AND or OR is always evaluated before the right operand; no optimization can be done which involves reversing the order of evaluation. (In fact, operands in all expressions are always evaluated in left-to-right order in the absence of parentheses or precedence differences.) The <relation> between the expressions found in <logical primary> is defined as



<relatioA> ::= '=' I '<' I '>' I '<=' I '>=' 1'0'



The relations have their usual meanings, with' 0' indicating the test for inequality. All of the relations can be used on string operands as well as on integers. For logical operands, the only relations that can be tested are equality ('=') and inequality ('<>'). All other relations are invalid for logical operands. F or string operands, the "Iess than" and "greater than" comparisons are based on the collating sequence of the machine. A string value X is less than a string value Y if and only if one of the following conditions hold.



1. If X is a leading substring of Y, such as when X = "STR" and Y = "STRING" then X is less than Y.



2. Otherwise, if the two strings are equal up to the ith position for some i and the two strings differ in the ith position then X is less than Y if the ith character of X is less than the ith character of Y with respect to the collating sequence. For example, when X and Y have the values



X = "ABCD" Y = "ABCR"



the two strings are equal up to, but not including, the fourth character. Assuming that D is less than R in the collating sequence, then X is less than Y. The logical primary is defined as follows:



<logical primary> ::= TRUE I FALSE I ( <logical expression> )



16 THE DESIGN AND DESCRIPTION OF GAUSS



I <identifier> [<sublist>] I <procedure call>



The identifier must be associated by a prior declaration with an object of type logical. The optional <sub list> is used if and only if the identifier is an array identifier. A <procedure call> can be a <logical primary> if the procedure called is defined to return a logical value.



1-9 INPUT AND OUTPUT



GAUSS input/output primitives are designed to be as simple as possible while still allowing easy performance of the usual input/output operations which occur in small programs. The input primitive is the READ statement whose syntax is as follows:



<read stmt> ::= READ <input goal list> <input goal list> ::= <input goal> {, <input goal> } <input goal> ::= <identifier> [ <sublist> ]



The identifier in the <input goal> can be any variable identifier. The optional <sublist> is to be used if and only if the identifier is an array identifier. Thus only one element of an array can be read at a time; it is impossible to read an entire array with the single execution of one READ statement. The <sublist> has been defined in Sec. 1-4.2. The READ statement examines the input stream for <simple constants> as defined in Sec. 1-3. Briefly, a <simple constant> can be one of the following.



[-] <number> <literal> TRUE FALSE



All characters which are not part of the simple constants are ignored (with a warning). The ith <input goal> in the <input goal list> is assigned the value of the ith <simple constant> in the input stream. If the type of the identifier and the constant differ, an error message is printed, and the value of the <input goal> is unchanged. No constant can run over the end of a line or card in the input stream. This restriction guards against unbounded strings "gobbling up" the entire input stream, and other such holocausts. Logical values and strings which try to run over the end of an input line are not interpreted as valid <simple constant>s, and thus are ignored (with a warning). The question of how to handle synchronous interrupts (which arise in a repeatable way from the execution of particular statements) in programming languages has not been clearly solved. PL/I has ON-units; other languages provide various flags and predicates to detect overflows, end-of-file, and so on. GAUSS provides no facilities for detecting or recovering from overflows, divisions by zero, and so on, within executing programs. However, an input end-of-file clearly cannot be handled in this lazy manner. ON-units are better avoided because they do not



1-9 INPUT AND OUTPUT 17



allow someone reading a program to understand easily the operation of the program at a particular point, since an ON-unit (which may be located in a distant piece of code) may be invoked at that point. Furthermore, ON-units have the flavor, and some of the basic problems, of the undisciplined GOTO statement. To handle the end-of-file condition, GAUSS provides a logical procedure called END_OF --FILE which, when called, returns TRUE if the input stream is exhausted, and returns FALSE otherwise. It is important to note that the input system scans ahead, enabling END_OF --FILE to return TRUE as soon as the last valid constant has been read. It is not necessary to attempt to read past the end of file before the procedure returns true. The following example would typically be placed before some READ statement in the program:



IF END_OF --FILE THEN WRITE "NOT ENOUGH DATA CARDS" USING "S, N"; STOP; ENDIF; READ whatever ...



The output primitive (as seen in the current example) is the WRITE statement, which has the following syntax:



<write stmt> ::= WRITE [<expression list>] USING <format list> <expression list> ::= <expression> {, <expression>}



The <format list> must have a particular structure defined by the following BNF:



<format list> ::= " <format item> {, <format item> } " <format item> ::= S [<number>] I I [<number>] I L I X <number> I T <number> IN



The WRITE statement is designed to allow formatted output. If the <expression list> is empty, then the purpose of the write statement is to space, tab, or advance to the next line. For every element of the <expression list>, there must be a corresponding <format item>. The <format item> dictates the type of the variable that is being printed and the field width of the printed item. The S format is for printing strings. If the optional field width <number> is left out, then the field width is assumed to be the length of the string. If the field width is present, the string is padded with blanks or chopped on the right, as required. The I format is used to print out integers. Again, if the field width is omitted, then the number of digits in the integer is taken to be the field width. If the field width is included, then integers are padded on the left with blanks, as required. If the integer is longer than the given field width, the entire integer will still be printed out with an accompanying warning message. Printing the number carries more information than filling the field with question marks or some other symbol which indicates an error.



18 THE DESIGN AND DESCRIPTION OF GAUSS



The L format is used to print out logical values. The output is either TRUE or FALSE and is printed in a field that is five characters wide. The X format produces n spaces, where n is the number following the X. The T format causes the next output to begin at the nth column, where n is the number that follows the T. If n is less than the current column position, a tab to the nth column of the next line occurs. The N format item is used to advance to the next line. Note that in all cases, the number following any of the format items must be positive. A field width of zero is considered to be a programmer error. The following illustrations give a few examples of GAUSS input and output statements:



READ ALPHA, BETA, GAMMA; READ NUMBER (I, J); WRITE "THE VALUE OF X IS:", X USING "T 16, S, X 2, I 5, N"; WRITE 3 + FACTORIAL (I) * 27 USING "1";



The first READ statement reads values for the identifiers ALPHA, BETA and GAMMA. The second READ statement reads the (I, J) element for the array NUMBER. The first WRITE statement prints out the given literal followed by the value of the identifier X. The format causes the string to be printed out starting in column 16, followed by two spaces, then the value of X is printed in a field of width five, and finally, a "new line" is "printed" so that the next WRITE statement prints on the next line of output. If no N was used in the format list, the next write statement would continue on the same line. The second write statement prints out the value of the integer expression in a field of the width required to hold the value.



1-10 LEXICAL STRUCTURE



This section describes the lexical structure of GAUSS, i.e., the symbols which form the basic units of the language.



<identifier> ::= <letter> {<letter> I <digit>} <letter> ::= AI B I ... I Y I Z I - <number> ::= <digit> {<digit>} <digit> ::= 0 I 11 21 31 41 51 61 71 81 9 <literal> ::= "{ <symbol>}" <symbol> ::= <character> I III' <character> ::= < any character on the machine except "">



Identifiers are of arbitrary length; however, only the first sixteen characters are significant. Keywords in GAUSS are reserved words; thus no identifier may be the same as any keyword. Identifiers, numbers, or literals cannot be split across input line boundaries. Also note that imbedded blanks are not allowed in identifiers or in numbers. An imbedded blank causes the identifier or number to be regarded as two separate identifiers or numbers. Numbers are a series of digits interpreted in base 10 notation. The value of the number must not exceed the capacity of the machine. Real or floating point numbers are not allowed.



1-11 CONTROL COMMANDS AND COMPILER AIDS 19



Literals (string constants) are enclosed in double quotes, and if it is necessary to place a double quote within the string, it is encoded as "". Thus a literal consisting of a double quote mark is encoded as ''''''''. Comments begin with $$. They can begin anywhere in a line and extend to the end of the line.



1-11 CONTROL COMMANDS AND COMPILER AIDS



There are three control commands that can be used with the GAUSS language to allow batching of programs, and to specify options to the compiler. These are the ?PROGRAM command, the ?OPTION command, and the ?DATA command. The ?PROGRAM command must be in the first line of any program. This is to allow batching of GAUSS programs, so that the compiler needs to be invoked only once to compile several programs. The ?PROGRAM keyword must begin in column 1 of the input line. If there is any other information on the line, it is ignored. The ?OPTION command is used to specify which options the user would like with the compilation of the program. This command can appear anywhere within a GAUSS source program, thus allowing options to be turned on and off for certain pieces of code. The ?OPTION keyword must begin in the first column of an input line. The options to be changed are specified on the rest of the line. The options, which are listed below, may begin with either a + or a -. If the option begins with the + sign, or if neither the + or - are specified, then this option is enabled. The - indicates that the option is to be disabled. The options in each ?OPTION command must be separated by one or more blanks. The possible options are as follows:



LIST When this option is enabled, the source listing of the GAUSS program being compiled is given. If the option is disabled, only the error messages (if any) are printed out. Initially, this option is enabled by the compiler. XREF When this option is enabled, a cross reference listing of the GAUSS program being compiled is given. A separate table is generated for local identifiers of each procedure and another table for global identifiers. Each table is in alphabetical order and contains the following domains about identifiers: name, type, use, dimension, and line numbers at which they were referenced. The use of an identifier is either as a simple identifier, simple constant identifier, array identifier, array constant identifier, or procedure identifier. The first line number in the reference list of an identifier specifies its first declaration. The only irregularity to this regards the names of the MAIN procedure and procedures which are not declared by the GAUSS programmer (Le., END_OF -FILE). In such a case a dummy declaration line number of zero is used. Initially this option is disabled by the compiler.



Note that the following options are intended for the use of the systems programmer only, since they produce information that is relevant only to someone who is responsible for maintaining the GAUSS compiler.



SCANNER..DEBUG This option is designed to help debug the scanner. Each time that the scanner is called, the value of the token, and the type of token that is found are printed out.



20 THE DESIGN AND DESCRIPTION OF GAUSS



LLL..DEBUG This option is used for debugging the LL(I) parser. When enabled, it causes the stack contents, the input token, and the parse action for every parse operation (Le., a grammar expansion, a pop, an error, accept or call OP) to be printed out. REPAIR-DEBUG This option is used to debug the error recovery portion of the compiler. Each time an error recovery is required, the current stack contents, the input symbol, stack insertions and deletions, and the final stack contents and input symbol are printed out. CODE-DEBUG This option can be used to help debug the code generator of the compiler. When it is enabled, the generated code is printed out. As an aid to understanding the code, the GLOBAL area and PROCINFO area are also output. For truly effective debugging of the code the SYM-DEBUG flag should be enabled also. SYM_DEBUG This option is used to obtain a listing of the symbol table entries generated by the compiler. A separate table is generated for local identifiers of each procedure and another table for global identifiers. The tables are ordered by hash location and contain the following domains about identifiers: name, hash location, link, and address. The hash location and link field illustrate the structure of the symbol table. They are useful for debugging the symbol table routines or fine-tuning the hashing function. The address field requires further qualification because of its many purposes. If the identifier is a procedure, the address field contains an index into the PROCINFO area (discussed in Sec. 2-5.3). If the identifier is a simple constant identifier, the address field contains the value of the constant. All other information about the identifier can be displayed with the XREF option.



$$ THIS PROGRAM READS IN A SERIES OF WORDS, EACH OF WHICH $$ HAS AN ASSOCIATED PAGE NUMBER. THE WORDS ARE THEN $$ SORTED SO THAT THEY CAN BE PRINTED IN ALPHABETICAL $$ ORDER, EACH WITH A LIST OF PAGE NUMBERS ASSOCIATED WITH IT. $$ EACH WORD IS PRINTED ONLY ONCE, BUT ALL PAGE NUMBERS $$ ASSOCIATED WITH EACH OCCURRENCE OF THE WORD WILL BE $$ PRINtED DUPLICATE PAGE NUMBERS IN THE INPUT WILL APPEAR $$ DUPLICATED IN THE OUTPUT AS WELL.



PROC SORT(INTEGER N) FORWARD;



PROC MAIN INTEGER N;



READ N; $$ THE NUMBER OF WORDS IS KNOWN BEFOREHAND. SORT(N); ENDPROC;



Fig. 1-1 Simple sort program



1-12 CONTROL COMMANDS AND COMPILER AIDS 21



$$ THIS SORT ROUTINE DOES ALL THE WORK, IN THAT IT READS, SORTS, $$ AND PRINTS THE WORDS. $$ A SIMPLE SELECTION SORT IS USED, AND NAMES ARE PRINTED AS $$ THEY ARE SORTED.



PROC SORT(lNTEGER N) ARRA Y(N) OF STRING NAME; ARRA Y(N) OF STRING PAGE; STRING FIRST, PAGELIST; INTEGER I, J; STRING CONST LARGE = "ZZ", VERY -LARGE = "ZZZ"; I := 1; LOOP WHILE I <= N; READ NAME (I), PAGE (I); I := I + 1; ENDLOOP;



$$ NOW DO THE SORT AND PRINT: LOOP FIRST := LARGE; I := 1; J := 0; LOOP WHILE I <= N; IF NAME (I) < FIRST THEN IF J <> 0 THEN NAME (J) := FIRST; PAGE (J) := PAGELIST; ENDIF;



J := I; FIRST := NAME (I); NAME (I) := VERY -LARGE; PAGEUST := PAGE (I); ELSE IF NAME (I) = FIRST THEN NAME (I) := VERY -LARGE; PAGELIST := PAGELIST & "," & PAGE (I); ENDIF; ENDIF; I := I + 1; ENDLOOP; WHILE FIRST <> LARGE; WRITE FIRST, PAGELIST USING "S20,X3,S40,N"; ENDLOOP; ENDPROC; Fig. 1-1 Simple sort program (cont' d.)



22 THE DESIGN AND DESCRIPTION OF GAUSS



The ?DA T A command indicates the beginning of the data for a GAUSS program. The command and the data following this command must come immediately after the end of the program. If there is no data for the program, this command may be omitted. The keyword ?DA T A must begin in column one.



1-12 SAMPLE PROGRAMS



This section gives two sample GAUSS programs. With each program is a short explanation of what the program does. The program given in Fig. 1-1 works as follows. The main procedure reads in the number of words and associated page numbers that the program has to sort. Then the SORT routine is called. The SORT routine begins by reading in the words and the page numbers into the string arrays NAME and PAGE respectively. The rest of the procedure consists of two nested WHILE loops. The outer loop is repeated once for each word to be printed. Each iteration prints the next word according to the alphabetical ordering. The inner loop goes through all of the words finding the next one to be printea and collecting the page numbers. It works in. the following way. If the name being looked at is less than any encountered so far, then the name being looked at is the one that should be printed. The current name and its



$$ $$ THIS PROGRAM GENERATES ALL PERMUTATIONS OF $$ THE LETTERS OF AN N-LETTER WORD (STRING) BY MEANS OF $$ ADJACENT TRANSPOSITIONS. THE TRANSPOSITIONS ARE $$ PERFORMED SIMUL T ANEOUSL Y ON THE FIRST N POSITIVE INTEGERS $$ AND ON THE SYMBOLS OF THE STRING.



STRING SUBJECT; $$ GLOBAL VARIABLE CONTAINING THE STRING TO BE $$ PERMUTED. INTEGER PROC LARGEST .-MOBILE (ARRAY 0 OF INTEGER PERM, ARRAY 0 OF LOGICAL DIRECTION, INTEGER N) FORWARD; PROC PERMUTE (INTEGER N) FORWARD;



$$ THE MAIN PROCEDURE READS IN THE STRING TO BE PERMUTED, $$ AND INVOKES THE PERMUTE PROCEDURE



PROC MAIN INTEGER N; READ SUBJECT; N := # SUBJECT; WRITE SUBJECT USING "S,N"; N := #SUBJECT; PERMUTE(N); ENDPROC;



Fig. 1-2 Permutation program



1-12 SAMPLE PROGRAMS 23



page list are put back in the array at the first occurrence of the current name. This position is kept in the variable J. Then the name being looked at and its page number become the new current name and page list. As each name becomes the new current name, it is replaced by a special token so that it will not be used again. This special token is the string "ZZ". Thus the program assumes that no name will begin with "ZZ". When the name being looked at is the same as the current name, then it is also replaced by a special token, and its page number is added to the page list of the current name. When the inner loop is done, the current name in FIRST and its associated page list are then printed. The program given in Fig. 1-2 is designed to generate all permutations of the letters of a given string by means of adjacent transpositions. This program is based on Algorithm 1.1 of Even [1973]. It is assumed that the reader is familiar with this method of generating such permutations.



$$ PERMUTE PERFORMS THE ACTUAL TRANSPOSITIONS, AND PRINTS $$ ALL OF THE PERMUTATIONS.



PROC PERMUTE (INTEGER N) INTEGER CONST INFINITY = 32767;



$$ LOCAL VARIABLES:



ARRAY (-1:N+1) OF INTEGER PERM; $$ A PERMUTATION OF THE $$ FIRST N POSITIVE INTEGERS, WITH GUARD $$ VALUES IN 0, -1, N+1, TO $$ ELIMINATE SOME CHECKS. ARRAY(N) OF LOGICAL DIRECTION; $$ DIRECTION OF MOVEMENT $$ OF THE CORRESPONDING NUMBER IN PERM. INTEGER ILM; $$ INDEX OF THE LARGEST MOBILE ELEMENT IN $$ PERM.



INTEGER I; STRING S;



$$ INITIALIZE PERM AND DIRECTION. $$ DIRECTION (I) = FALSE MEANS THAT PERM (I) IS TENDING $$ TO MOVE LEFTWARD.



I := 1; LOOP WHILE I <= N; PERM (I) := I; DIRECTION (I) := FALSE; I := I + 1; END LOOP;



$$ SET UP GUARD VALUES:



Fig. 1-2 Permutation program (cont'd.)



24 THE DESIGN AND DESCRIPTION OF GAUSS



PERM (-1) := 0; PERM (0) := INFINITY; PERM (N + 1) := INFINITY;.



LOOP WHILE TRUE; $$LOCATE THE LARGEST MOBILE INTEGER IN PERM.



ILM := LARGEST .-MOBILE (PERM, DIRECTION, N); IF ILM = -1 $$ NO MOBILE ELEMENT THEN RETURN; ENDIF; $$ PERFORM TRANSPOSITION ON PERM AND SUBJECT



I := PERM (ILM); S := SUBJECT (ILM I 1);



IF DIRECTION (ILM) THEN PERM (ILM) := PERM (ILM + 1); SUBJECT (ILM I 1) := SUBJECT (ILM + 11 1); PERM (ILM + 1) := I; SUBJECT (lLM + 11 1) := S; ELSE PERM (lLM) := PERM (lLM - 1); SUBJECT (lLM I 1) := SUBJECT (ILM - 11 1); PERM (ILM - 1) := I; SUBJECT (lLM - 11 1) := S; ENDIF;



ILM := I; $$ VALUE OF LARGEST MOBILE INTEGER



I := 1 ;



LOOP WHILE I <= N; IF PERM (I) > ILM THEN DIRECTION (I) := NOT DIRECTION (I); ENDIF; I := I + 1; ENDLOOP; WRITE SUBJECT USING "S,N"; ENDLOOP; ENDPROC;



Fig. 1-2 Permutation program (cont'd.)



1-12 SAMPLE PROGRAMS 25



$$ LARGEST.-MOBILE (PERM, DIRECTON, N) LOCATES THE LARGEST $$ MOBILE ELEMENT IN PERM, AND RETURNS THE INDEX OF IT.



INTEGER PROC LARGEST .-MOBILE (ARRAY 0 OF INTEGER PERM, ARRAY 0 OF LOGICAL DIRECTION, INTEGER N)



INTEGER I, ILM;



ILM := -1; I := 1;



LOOP WHILE I <= N; IF DIRECTION (I) THEN IF PERM (I) > PERM (I + 1) THEN IF PERM (I) > PERM (ILM) THEN ILM := I; ENDIF; ENDIF; ELSE IF PERM (I) > PERM (I - 1) THEN IF PERM (I) > PERM (ILM) THEN ILM := I; ENDIF; ENDIF; ENDIF; I := I + 1; ENDLOOP; RETURN ILM; ENDPROC;



Fig. 1-2 Permutation program (cont'd.)



The MAIN procedure reads the subject string and calls the PERMUTE procedure to obtain the permutations. Note that the permutations are actually performed on an array of integers, called PERM. The permutations on the string are obtained by moving the same elements in the string as those moved in PERM. The PERMUTE procedure first initializes the PERM and DIRECTION arrays. The DIRECTION array is used to determine which direction (left or right) the corresponding element of the PERM array is moving. After initialization, the PERMUTE procedure goes. into a loop which first transposes the correct two elements (specified by ILM - index of the largest mobile element and the corresponding entry in DIRECTION), and then reverses the direction of the



26 THE DESIGN AND DESCRIPTION OF GAUSS



elements in PERM (as indicated by DIRECTION) that are greater than the element at PERM (ILM)). The new permutation is then printed out and the loop repeats until there is no longer a mobile element, indicated by ILM = -1. When this happens, all permutations have been computed and the program terminates. ILM is calculated by the LARGEST .-MOBILE procedure, which simply looks for the largest element k of PERM such that there is an integer smaller than k adjacent to k on the side that is in the same direction as k is moving. The index of the element k in PERM is then returned, with -1 being returned if such a k does not exist.



BIBLIOGRAPHY



EVEN, S., "Algorithmic Combinatorics," The Macmillan Company, New York, 1973. IBM, "System/360 Operating System PL/I (F) Language Reference Manual," Form GC28-8201-3. JENSEN, K., and N. WIRTH, "PASCAL User Manual and Report," 2d ed., Springer-Verlag, New York, 1975. RITCHIE, D. M., "c Reference Manual," in "Documents for Use with the UNIX Time-sharing System," Bell Laboratories, Murray Hill, N. J., 1974. SITES, R. L., "ALGOL W Reference Manual," Technical Report STAN-CS-71-230, Computer Science Department, Stanford University, Feb. 1972. TREMBLAY, J.P., and P.G. SORENSON, "The Theory and Practice of Compiler Writing," McGraw-Hili Book Company, New York, 1982. WIRTH, N., and H. WEBER, "EULER: A Generalization of Algol and its Formal Definition, Parts I and II," Communications of the ACM, Vol. 9, No. 12, Jan. - Feb. 1966, pp. 13 - 35, 89 - 99.



Chapter 2



The Implementation of A Compiler For GAUSS



2-1 INTRODUCTION



Implementing a compiler presents many problems, especially to the novice J compiler writer. Since a guide of this type can not hope to answer all of the questions related to compiler writing for different languages and different types of implementations, perhaps the best strategy is to give one example of a compiler, and hope that in this way some of the questions a student may have about writing his or her own compiler can be answered. This chapter illustrates one particular implementation of a compiler for the GAUSS language. More specifically, the compiler developed is a single pass compiler that produces high-level, stack- oriented intermediate language instructions. In each section of the chapter, an attempt is made to explore some basic problems involved in 'writing a certain portion of the compiler as well as to present the solution to these problems. First, the goal of the particular part of the compiler is determined. Then, the method and data structures used to implement this section of the compiler are explored. Finally, each section presents the PL/I source code which implements the function that has been discussed. This allows the reader to examine the programming techniques that were used to put the algorithms and data structures together in a workable fashion.



28 THE IMPLEMENTATION OF A COMPILER FOR GAUSS



In Fig. 2-1 a breakdown of the compiler into four basic submodules is presented. These are the scanner, the parser, the table handler, and the code generator. The scanner reads the GAUSS program as input, and returns a token representation of the input to the parsing module. The parser performs the

syntactic analysis of the GAUSS program being compiled. The routines of the table handler look after the insertion, look-up and deletion of symbols from the symbol table. The code generator must analyze the semantics of the program and generate appropriate code which will perform the required operations in order to execute the GAUSS program being compiled. Figure 2-1 illustrates how the procedures can invoke each other. The parser invokes the scanner and the code generation routines. The scanner and code generation routines both invoke the table handler. This chapter follows the basic order outlined in the preceding paragraph. Section 2-2 describes the scanner. Section 2-3 presents the parser, and Sec. 2-4 presents the table handler, which maintains the symbol table. In Sec. 2-5, the code generator and the semantic analysis necessary to ensure the appropriate generation of code are presented. Section 2-6 provides a centralized description of compilation errors in the scanner, table handler, parser, and code generation routines.



2-2 SCANNER



The scanner is invoked by the parser to read the next lexical unit of source text and return an input token representing this lexical unit to the parser. This section describes the implementation of a scanner for the GAUSS language. In Sec. 2-2.1, some of the goals of a scanner are introduced. In Sec. 2-2.2, a finite state machine representation of the scanner is described, and in Sec. 2-2.3, a top-level



MAIN



INIT



PARSER



CODE GENERATOR



TABLE HANDLER



Fig. 2-1 Structure chart of a GAUSS compiler



2-2 SCANNER 29



breakdown of the routines in the scanner is given. Section 2-2.4 describes the nature of the information that must be handled and stored by the scanner and Sec. 2-2.5 deals with the implementation of the scanner. Finally, the PL/I source code for the scanner is given in Sec. 2-2.6.



2-2.1 Introduction



The purpose of the scanner is to return the next input token to the parse module. Tokens consist of all keywords, identifiers, and symbols found in a program. To be able to return a token, the scanner must isolate the next sequence of characters in the input stream which designate a valid token. Thus, the basic task of the scanner is that of editing the source program so as to remove information that is not important to the parsing and code generating phases of the compiler, and passing on to the parser an intermediate representation of the 't source program. This means that the scanner must be able to ignore such things as blanks and comments. Also, the scanner is responsible for differentiating between an identifier and a keyword of the language. The scanner should not only find a token, but should make sure that it has found the complete token. For example, ZAP 3 in the input stream should cause the tokens for an identifier and a number to be returned. On the other hand, ZAP3 should cause a single token for an identifier to be returned. Also, 3ZAP and 3 ZAP should both cause a token for a number and an identifier to be returned since identifiers cannot begin with a digit in GAUSS. Having introduced the notion of a scanner, let us formalize a method which will perform the required lexical analysis.



2-2.2 The Scanner as a Finite-State Machine



Since the scanner's output is a function of the input, and there are only a finite number of actions which the scanner can take for any input, the scanner can be easily described by a finite-state machine. The scanner is initially in the start state whenever it is invoked. The transition from its start state depends on the next character of the input stream. The operation of a scanner for GAUSS is shown in the state transition diagram in Fig. 2-2. Most of the arcs of this diagram are labelled with the input symbol which causes the transition. Arcs that are not labelled indicate the transition to be made whenever the input symbol is not one of those marked on the other arcs leaving a particular state. A set of symbols listed on an arc indicates that any member of the set causes the transition. If the input symbol is such that it corresponds to no arc leaving the state, the symbol is invalid and the scanner prints an error message. The arcs are labelled with the actions to be performed when a transition is made. The following two' types of actions occur



1. RETURN (token) signifies that token should be returned to the parser as the input token. 2. READ means that the next character from the input stream should be read. This action is discussed in more detail in Sec. 2-2.5.



Given this basic design for the operation of the scanner, a procedure can be implemented which will emulate the actions of the finite state machine given in Fig. 2-2. First, however, a breakdown of the scanner into its major components and their purpose is given.



30 THE IMPLEMENTATION OF A COMPILER FOR GAUSS



lIS



REPORT ERROR



$



RETURN (0#' L T ) ; > READ i RETURN ( #: NE ) ; . READ ; RETURN (

LE); RETURN (# LPAREN ); RETURN (#' PLUS ) ; RETURN (# SUBSTR ); RETURN (#' CONCAT ) ; IF NOT NEWLINE



<



(



+



a



*



RETURN (P MULT);



RETURN (#' RPAREN );



/



RETURN (#' SEMI) ; IF LAST TOKEN WAS IDENTIFIER OR ) THEN RETURN (# MINUS ) ; ELSE RETURN (#" UMINUS ) ; RETURN (N DIV ) ;



RETURN (# COMMA ) ;



0/0



>



RETURN (# MOD ) ; IF IS KYWD 16 THEN RETURN (KEYWORD TOKEN) ; {A ..... Z .-.O,...9} ELSE RETURN (IDENTIFIER TOKEN); RETURN (# G T ) ; . READ; RETURN (#' GE ) ;



{ A,.... Z . - }



. .



RETURN (-# COLON ) ; READ i RETURN (# ASSIGN) ; RETURN (." LENGTH );



-#



0)



RETURN (#' INDEX) ;



.



II



RETURN (#' EQ ) ; IF NEWLINE REPORT ERROR RETURN (-# LITERAL) ; RETURN (* LITERAL) ;



{O,...,9}



(# NUMBER) ;



Fig. 2-2 Finite-state diagram of a scanner for GAUSS



2-2 SCANNER 31



SCANNER



GET NUM



Fig. 2-3 Structure chart of the GAUSS scanner



2-2.3 Top-down Description of Scanner



The scanner is designed to be an interface between the input stream and the parser for the compiler. The scanner may call three subordinate procedures, namely, get--<:har, getJ1um, and ;s-*ywd. The purpose of these procedures is now introduced and they will be discussed fully in Sec. 2-2.5. The get.-ehar routine is used to read in the source program and return characters from the input, one at a time. Adjacent blanks are merged. It is also responsible for processing the ?PROGRAM and ?OPTION commands which were discussed in Chap. 1. These commands are processed by the option routine. The getJ1um routine is used to read in the character representation of a number from the input stream and convert it to its corresponding numeric representation. The is-.kywd procedure takes a given identifier and determines whether it is actually a keyword, or an identifier. Figure 2-3 demonstrates the calling sequences of the scanner and its subordinate procedures. In the diagram, a procedure invokes those procedures at a lower level with which it is connected. For example, the scanner invokes the get.-ehar procedure, and the get.-ehar procedure invokes the option procedure.



2-2.4 Structural Representation of Data



Before examining the scanner and each of its submodules in more detail, it is necessary to describe how the necessary data can be structured. The first data structure to be discussed is the representation of the tokens. Although the tokens are not represented by a complex data structure, their use is so important that they are discussed here. When a token is encountered, some representation of that token must be returned to the parser. The easiest and simplest way to do this is to have a unique number for each token. Unfortunately, this representation makes the code for the compiler difficult to understand, since the numerical representation for each token must be remembered. To overcome this difficulty, our implementation uses a macro facility to define a literal text replacement for each token. Thus for a keyword such as ARRAY, there is a corresponding macro token called #array which is defined to be a certain integer value, say 2. When the keyword ARRAY is encountered in the input stream by the scanner, the scanner's action is to return the macro token #array (i.e., RETURN (#array);) which is equivalent, in this case, to returning a value of 2 (i.e., RETURN (2);).



32 THE IMPLEMENTATION OF A COMPILER FOR GAUSS



There are macros for the symbols of the GAUSS language as well. For example, the macro #assign is associated with the := symbol. The use of macros not only enhances the readability of the source code for the compiler, but it also makes it much easier to change the value associated with a particular token, if this should be required. Such a change only needs to be done once in the macro definition instead of each ,time it is used in the source code. The next important data structure to be discussed is the char -class vector. When the scanner is entered, the next state decision is made on the basis of the next input character as indicated by the state diagram given earlier . To perform the state transitions, a mapping from the set of possible input characters to the set of possible states is required. This is realized through the char -class vector. This vector contains 256 entries, one for each of the possible members of the EBCDIC character set. The EBCDIC value of the input character gives the index into this vector for the corresponding entry of the given character. Each entry in this vector contains the next state number for the given input symbol. For example, if a > symbol is the next character, it is converted into its EBCDIC encoded value (i.e., decimal 110) by the PL/I function UNSPEC. char -class (110) then yields a result of 17. Thus the initial state transition with > as the input, is to state 17. The state for each character can be determined by inspection from the state transition diagram. For example, since any uppercase letter of the alphabet, or the underscore, causes a transition to state 16, the corresponding entries for each of these characters in the char -class table is 16. It is important to note that the char -class vector is used only to determine the initial state transition ""hen the scanner is entered, and plays no role in determining subsequent state transitions. The label array state is associated with the char -class vector. It contains the addresses of the code which is to be executed for that particular state. For example, state (17) gives the address of the code which processes the input for state 17. In order to determine when a keyword has been read, it is necessary to compare the given string with the possible keywords. In order to do this, a list of the keywords is required for comparis9n. This list is found in the initial segment of the token-str vector. This vector contains the string representation for all keywords, special symbols, and nonterminal symbol names. For example, the entry for a keyword such as ARRAY is 'ARRAY'. For a special symbol, such as the assignment operator, the entry is := and for a nonterminal symbol name, such as <program>, the entry is 'PROGRAM'. The only use of this vector in the scanner is for identifying keywords. In the parser (Sec. 2-3) this vector is also used for debugging and error recovery. When the scanner encounters a string which may be a keyword, the part of the token-str vector which contains the keywords is searched. If there is a match with the input string, then the input token is a keyword; otherwise, it is some other type of token. The tokens have been chosen such that if the matching keyword is the ith one in this vector, then the value of the token for this keyword is i.



2-2.5 Implementation of the Scanner



The scanner may invoke anyone of three subordinate procedures. These are the get-char, getJ1um, and is-.kywd procedures.



2-2 SCANNER 33



The get.-ehar procedure is used to return the next character of the input stream to the scanner. Adjacent blanks are merged. This is done by reading one input line at a time into the global string variable line-1Juf. Because a token is not allowed to be split across lines, a blank, which acts as a delimiter between tokens, is appended to line-1Juf. The global variable line..ptr is used as an index into line-1Juf. Therefore, when this procedure is called, line..ptr is incremented and the character in line.1Juf pointed to by line-ptr is returned. If the get.-ehar routine is called when line..ptr points to the last blank character of line-1Juf, then the new...1ine flag is set to true and a blank is returned. The next time the get.-ehar routine is called with the new...1ine flag set to true, the error messages associated with the previous line are printed, a new line is read into line.1Juf, and the new...1ine flag is set to false. If the new line is a ?PROGRAM command (i.e., the start of a new GAUSS program) or a ?DA T A command (i.e., the data for a just compiled program), then a logical end-of-file condition is indicated by setting the global flag eof to true and returning a hlank. A physical end-of-file condition is indicated by setting the global variables real.-eo! and eo! to true and returning a blank. If the line is a ?OPTION command, then the option procedure is called to set the appropriate option flags. The line is then printed providing the list-flag is on, and the next character is returned. The getJ1um procedure simply reads in the digits of a number from the input stream, and converts them from their string representation into numerical form. Checking is also done to ensure that the number is not too large for the machine for which the compiler generates code. When an identifier is read by the scanner, it must be determined whether it is a GAUSS keyword or a user identifier. This is done by the is-.kywd procedure. This procedure takes the given string and compares it with the keyword elements of the vector token-str using a binary search (see Tremblay and Sorenson [1976]). If the given string matches an element in token-str, then it is a keyword, and the index of the string in token-str is returned, since this is the same as the value of the token for the given keyword. If the given string is not an element of token-str, then a 0 is returned, which indicates to the scanner that the token is an identifier. Before the scanner is called for the first time, a line is read into line.1Juf, line..ptr is set to 0, and next.-ehar is initialized to a blank. next.-ehar, a global character variable, contains the next character to be processed by the scanner. The scanner procedure works in the following manner. next.-ehar is converted into its binary representation by the UNSPEC function. This number is then used as an index into the char .-elass vector, which contains the number of the state to which the scanner should transfer. The states are entered by means of a GOTO and the state label array by the statement



GO TO state (x)



where x represents the value returned from the char .-elass vector. Although the use of a GOTO statement is considered a bad programming practice, it can be very useful when it is used to simulate a construct, such as a CASE statement, which is not available in the programming language being used. The use of a GOTO statement in this case makes. the program clearer and easier to read than if some other construct, such as multiple IF statements, had been used.



34 THE IMPLEMENTATION OF A COMPILER FOR GAUSS



Once in the appropriate state, the scanner proceeds to process as much input as necessary in order to obtain the next token. The processing of the input for most of the states is fairly simple and will not be discussed here. Two exceptions, however, require further explanation. When a minus sign is read, it must be determined whether it is a binary or unary minus sign. This can be done by looking at the token returned previously, since the set of items that can precede a unary minus is disjoint from the set of items that can precede a binary minus. The parser stores the tokens returned by the scanner in the global variable nextJoken. Thus, if nextJoken is one of #rparen, #simpident, #simpconident, or #procident, then the token to be returned is a binary minus sign. Otherwise, the token is a unary minus sign. If the input token is an identifier, then the scanner must determine what type of identifier is to be returned. This is done by calling the look-Llp routine which finds the symbol table entry of the given identifier and returns its address in the symbol table. The scanner then determines what token is to be returned by examining the use information of the symbol table entry. The possible tokens that can be returned for identifiers, along with their meanings are



#newident #simpident #simpconident #arrayident #arrayconident #procident



a new identifier a simple identifier a simple constant identifier an array identifier an array constant identifier a procedure identifier



Note that after the initial state transition, which is based on the particular input character, there are no further explicit state transitions. The state is imbedded into the code. For example, when state 17 is entered with the input symbol '>', the code basically reads as follows:



next...char = get...char; IF next...char = '=' THEN next...char = get...char; RETURN (#ge); ELSE RETURN (#gt);



The state entered is still decided by the next input character, but the transition is determined by the THEN and ELSE clauses of the IF statement. Note that in this example, when the token returned is the token for '>', the next input character has already been read into next.-ehar. However, when the token returned is the token for '>=', the next input character is unknown and hence it must be read into next.-ehar before returning. This corresponds to the READ operation shown in the state transition diagram. In three cases the scanner must also provide some additional information besides returning a token. This occurs whenever the token is an identifier, an integer constant, or a string constant. For identifiers, the symbol table entry location is placed in the global variable idptr. For integer constants, the value is placed in the global variable value. The value of the integer constant is obtained by the getJ1um procedure. The value of a string constant is its reference pointer. The calculation of the reference pointer is postponed until later in the code generation



2-2 SCANNER 35



phase (see Sec. 2-5.2). Instead, the string constant is stored by the scanner in the global string variable strbuf. To allow debugging of the scanner, all states converge to the point labelled out before returning the token to the parser. At this point, if the scanner Jiebug flag is on, the token value and its corresponding string representation are printed. The following section gives the source code for the scanner.



2-2.6 Scanner Source Code



Figure 2-4 contains the source code for the scanner.



/........................................ /



/. /. /.



SCANNER



./ ./ ./



/........................................ /



/. The scanner is used to return the tokens to the parser. Every time that it is called, it will return the next token. The scanner works by using the first character read to find out what class the next token will be. This is done by using the table as described below. Then the scanner processes the rest of the token, according to its class. The predefined token is then returned (actually a number) to the calling routine. · /



scanner: PROCEDURE RETURNS (FIXED BIN (15»;



DECLARE geLchar ENTRY RETURNS (CHARACTER (1 », geLnum ENTRY RETURNS (FIXED BIN (15», is

ywd ENTRY (CHARACTER (.) VARYING) RETURNS (FIXED BIN (15»;



DECLARE class FIXED BIN (1 5), /. class of the char · / state (1 : 23) LABEL, /. 23 state labels · / idbuf CHAR (idlen) VAR, /. holds identifiers · / char

nt FIXED BIN (1 5), /. into rep. of char · / (i,z) FIXED BIN (15), /. temporaries · / token FIXED BIN (15); /. token returned · /



/. The following table is used to map the characters returned by the geLchar routine into their various classes. Le., if the character returned



Fig. 2-4 Source code for the scanner



36 THE IMPLEMENTATION OF A COMPILER FOR GAUSS



is a comma, its integer value (1 07 or 6B hex) is used as an index into this table. The integer that is found at this location (14 in this case) represents the state that the scanner should go into with this character as the first character of the token being scanned. 0 entries indicate an illegal character. · /



DECLARE char_class (0 : 255) FIXED BIN (15) STATIC INITIAL «64) 1, 2, (11) 1, 3, 4, 5, 6,7, (9) 1,1, 8,9,10,11,1,12,13, (9) 1,14,15,16,17, 1 , (10) 1, 18, 19, 20, 1, 21, 22, (65) 1, (9) 16, (7) 1, (9) 16, (8) 1 , (8) 16, (6) 1 , (10) 23, (6) 1);



case: IF (eof) THEN DO; token = #eof; GO TO out;



END. ,



UNSPEC (char_int) = 'OOOOOOOO'B II UNSPEC (nexLchar); class = char_class (char_int); begin_tok = line_ptr;



/. The following go to, along with the state label · / /. array is used to simulate a case statement. · /



GO TO state (class);



state(1 ): /. unknown character · / CAll error (1 , 0, 0, 0); nexLchar = geLchar; GO TO case;



state(2): /. a blank character · / nexLchar = geLchar; GO TO case;



state(3): /. a less than sign «) · / nexLchar = geLchar; IF (nexLchar = '>') THEN DO; token = #ne; nexLchar = geLchar;



END. , ELSE



Fig. 2-4 Source code for the scanner (cont'd.)



IF (nexLchar = ' =') THEN DO; token = #Ie; nexLchar = geLchar;



END; ELSE token = #It; GO TO out;



state( 4): /. a left parenthesis · / token = #Iparen; nexLchar = geLchar; GO TO out;



state(5): /. a plus sign · / token = #plus; nexLchar = geLchar; GO TO out;



state(6): /. substring operator (I) · / token = #substr; nexLchar = geLchar; GO TO out;



state(7): /. a concatenation operator (&) · / token = #concat; nexLchar = geLchar; GO TO out;



state(8): /. a comment delimiter · / nexLchar = geLchar; IF (nexLchar -- '$') THEN CALL error (2, 0, 0, 0); DO WHILE (hew

ine); /. end of line · / nexLchar = geLchar;



END. , nexLchar = geLchar; GO TO case;



state(9): /. multiplication operator · / token = #mult; nexLchar = geLchar; GO TO out;



state(10): /. right parenthesis · / token = #rparen; nexLchar = geLchar; GO TO out;



Fig. 2-4 Source code for the scanner (cont'd.)



2-2 SCANNER 37



38 THE IMPLEMENTATION OF A COMPILER FOR GAUSS



state(11 ): /. a semi colon · / token == #semi; nexLchar == g

Lchar; GO TO out;



state(1 2): /. a minus sign · / /. check previous token · / IF (nexLtoken == #rparen I nexLtoken = #simpconident I nexLtoken = #simpident I nexLtoken = #procident) THEN token == #minus; ELSE token = #uminus; nexLchar = geLchar; GO TO out;



state(13): /. a division sign · / token = #div; nexLchar = geLchar; GO TO out;



state(14): /. a comma · / token = #comma; nexLchar = geLchar; GO TO out;



state(1 5): /. a mod operator (%) · / token = #mod; nexLchar = geLchar; GO TO out;



state(1 6): /. an alphabetic character · / idbuf = "; DO WHILE(class = 16 I class = 23); IF (length (idbuf) < idlen) THEN



idbuf == idbuf II nexLchar; nexLchar = geLchar; UNSPEC (char

nt) = 'OOOOOOOO'b II UNSPEC (nexLchar); class = char_class (char_int);



END; token == i8-kywd (idbuf); IF (token = 0) THEN DO; idptr == lookup(idbuf); token= symtab(idptr).use;



END- , GO TO out;



Fig. 2-4 Source code for the scanner (cont'd.)



state(17): /. a greater than sign · / nexLchar == geLchar; IF (nexLchar == ' ==') THEN DO; token == #ge; nexLchar == geLchar;



END' , ELSE token == #gt; GO TO out;



state(18): /. a colon · / nexLchar == geLchar; IF (nexLchar == ' ==') THEN DO; token == #assign; nexLchar == geLchar;



END' , ELSE token == #colon; GO TO out;



state(19): /. a string length operator (#) · / token == #Iength; nexLchar == geLchar; GO TO out;



state(20): /. a string index operator (@) · / token == #index; nexLchar == geLchar; GO TO out;



state(21 ): /. an equals sign · / token == #eq; nexLchar == geLchar; GO TO out;



state(22): /. a double quote mark · / strbuf == " ; z == 0; nexLchar == geLchar; strloop: DO WHILE (nexLchar -- ' , & "hew_line); IF (z < strlen) THEN strbuf == strbuf II nexLchar; z==z+1' , nexLchar == geLchar;



END' ,



IF (new_line)



Fig. 2-4 Source code for the scanner (cont'd.)



2-2 SCANNER 39



40 THE IMPLEMENTATION OF A COMPILER FOR GAUSS



THEN CALL error (4, 0, 0, 0); ELSE DO; nexLchar = geLchar; IF (nexLchar = ' ') THEN DO; IF (z < strlen) THEN strbuf = strbuf II ' '; z=z+1. , nexLchar = geLchar; GO TO strloop;



END- ,



END' ,



IF (z > strlen) THEN CALL error (3, 0, 0, 0); token = # literal; GO TO out;



state(23): /. a number · / value = geLnum; token = #number;



out: /. print token for debugging · / IF (scanner_debug) THEN DO; PUT EDIT ('token <', token, " " token_str (token), '>')(X(5), A, F(3), A, A, A); IF (token = #number) THEN PUT EDIT (value)(F(7»; IF (token = #literal) THEN PUT EDIT (strbuf)(A); IF (token = #newident I token = #simpident I token = #simpconident I token = #arrayident I token = #arrayconident I token = #procident) THEN PUT EDIT (idbuf)(A); PUT EDIT (' ')(A);



END. , RETURN (token);



/........................................ /



/. /. /.



geLchar



./ ./ ./



/........................................ /



/. The geLchar routine will return the next character in the input stream. The input is read in 80 character lines at



Fig. 2-4 Source code for the scanner (cont'd.)



2-2 SCANNER 41



a time. Before each line is read, the error messages (if any) from the previous statement are printed out. The next line of 80 characters is then read in, and also printed out in the listing if the IisLflag is true. If the line read is a '?OPTION' line, then the option routine is called to pick off the options from the line and set appropriate flags. If a '?PROGRAM' line is read, then a fake end of file is returned to indicate the end of the previous program. When all the programs have been read, then 'real_eof' is set to true, and the end of file marker is returned. Since the '?' is used as an end of file to be passed to the scanner, but '?' is an invalid character as far as GAUSS is concerned each input character must be checked against '?'. · /



geLchar: PROCEDURE RETURNS (CHARACTER (1 »; DECLARE



cha CHARACTER (1), /. character to be returned · / i FIXED BIN (1 5); /. index into line_buf · /



ON ENDFILE (SYSIN) BEGIN; real_eof = '1'B; line_buf = '?PROGRAM';



END. ,



IF (new_line) THEN DO; IF (line_errors> 0) THEN CALL lisLerr; geLline;



END' ,



IF (line_ptr = 0) THEN DO;



DO WHILE (SUBSTR (Iine_buf, 1, 7) = '?OPTION'); CALL option;



END. ,



IF (SUBSTR (line_buf, 1, 8) = '?PROGRAM' I SUBSTR (line_buf, 1, 5) = '?DATA') THEN DO; eof = '1'B; RETURN (' ');



END. ,



END. ,



Fig. 2-4 Source code for the scanner (cont' d.)



42 THE IMPLEMENTATION OF A COMPILER FOR GAUSS



i == VERIFY (SUBSTR (line_buf,line-ptr + 1), ' '); IF (i == 1) THEN DO; line-ptr == line-ptr + 1; cha == SUBSTR (Iine_buf,line-ptr,1); RETURN (cha);



END. ,



IF (i = 0) THEN DO; new.-line == '1'B; RETURN (' ');



END. ,



line-ptr == line-ptr + i - 1 ; RETURN (' ');



/........................................ /



/. /. /.



option



./ ./ ./



/........................................ /



/. The option routine will pick off the options from the '?OPTION' line and set the appropriate flags. · / option: PROCEDURE; DECLARE flag CHARACTER (1), /. the' +' or '-' sign · / enable BIT (1 ), /. turn opt. on or off · / name CHAR (15) VARYING; /. option name · /



line_buf == SUBSTR (line_buf, 8); line-ptr == VERIFY (line_buf, ' ');



DO WHILE (line-ptr -- 0);



/. isolate the flag, if present · /



line_buf == SUBSTR (line_buf, line-ptr); flag == SUBSTR (line_buf, 1, 1); IF (flag == ' +') THEN DO; enable == '1'B; line_buf == SUBSTR (line_buf, 2);



END. , ELSE



Fig. 2-4 Source code for the scanner (cont'd.)



2-2 SCANNER 43



IF (flag = '-') THEN DO; enable = 'O'B; line_buf = SUBSTR (line_buf, 2);



END' , ELSE



enable = '1'B;



/. isolate the option to be set · /



line-ptr = INDEX (lioe_buf, ' '); begin_tok = 80 - LENGTH (line_buf) + 1; IF (line-ptr == 0) THEN DO; CALL error (9, 0, 0, 0); name = ";



END. , ELSE DO; name = SUBSTR (line-buf, 1, line-ptr - 1); line_buf == SUBSTR (line_buf, line-ptr);



END;



/. now set the appropriate flag · /



IF (name = 'list') THEN IisLflag = enable; ELSE IF (name = 'xref') THEN xref_flag = enable; ELSE IF (name == 'scanner_debug') THEN scanner_debug == enable; ELSE IF (name == '1I1_debug') THEN 111_debug = enable; ELSE IF (name == 'code_debug') THEN code_debug == enable; ELSE IF (name == 'sYrTLdebug') THEN sym_debug == enable; ELSE IF (name = 'inLdebug') THEN inLdebug = enable; ELSE CALL error (1 0, 0, 0, 0);



line-ptr == VERIFY (line_buf, ' ');



END- , IF (line_errors> 0) THEN CALL IisLerr; geLline; END option; END geLchar;



Fig. 2-4 Source code for the scanner (cont'd.)



44 THE IMPLEMENTATION OF A COMPILER FOR GAUSS



/........................................ /



/. /. /.



geLnum



./ ./ ./



/........................................ /



/. The geLnum procedure reads a number and converts it from character format to integer format. If the number is too big to convert, an error message is printed, and the largest possible number is used. · /



geLnum: PROCEDURE RETURNS (FIXED BIN (15»;



DECLARE number FiXeD BIN (15), /. number to be returned · / digit FIXED BIN (1 5); /. converted digit · /



number = 0;



ON FIXEDOVERFLOW BEGIN; CALL error (5, 0, 0, 0); DO WHILE (nexLchar >= '0' & nexLchar <= '9' ): nexLchar = geLchar;



END. , number = 32767; GO TO done;



END. ,



DO WHILE (nexLchar >= '0' & nexLchar <= '9'); GET STRING (nexLchar) EDIT (digit) (f(1 »; number = number · 1 0 + digit; nexLchar = geLchar;



END. ,



done: RETURN (number); END geLnum;



/ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . * /



/. /. /.



i8-kywd



*/ */ */



/ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . * /



/. This procedure is designed to take the given string and decide whether or not it is a keyword. This is done by using a binary search on a list of the keywords



Fig. 2-4 Source code for the scanner (cont'd.)



2-3 PARSER 45



from token_str to find the match. If the string is a keyword, the token value of the keyword is simply its index in the vector. This value is then returned as the value of the keyword. If the string is not a keyword, then 0 is returned. · /



is

ywd: PROCEDURE (string) RETURNS (FIXED BIN (15»;



DECLARE string CHARACTER (. ) VARYING, /. given string · / first FIXED BIN (1 5), /. first record in search · / last FIXED BIN (1 5), /. last record in search · / i FIXED BIN (15); /. used for indexing · /



first = 1; last = #_of_keywords;



DO WHILE (first <= last); i = FLOOR «first + last) / 2.0); IF (string < token_str (i» THEN



last = i - 1 ;



ELSE



IF (string > token_str (i» THEN



fi rst = i + 1;



ELSE



RETURN (i);



END;



RETURN (0); END i8-kywd; END scanner;



Fig. 2-4 Source code for the scanner (cont' d.)



2-3 PARSER



The parser's task is to analyze the syntax of the GAUSS program being compiled. This section describes the parser used in this implementation of a GAUSS compiler. In Sec. 2-3.1, the basic concepts of a parser, and the type of parser that is used for this implementation are introduced. Section 2-3.2 describes the formalization of LL(I) grammars, the class of grammars for which the parser is constructed. Also given in this section is the LL(I) grammar for the GAUSS language. This section includes a discussion of how the FIRST and FOLLOW sets are constructed. Section 2-3.3 presents the parsing function which contains the basic information required to parse a GAUSS program. A top-down description of the parser is given in Sec. 2-3.4. In Sec. 2-3.5 the implementation of the parser is



46 THE IMPLEMENTATION OF A COMPILER FOR GAUSS



described, and in Sec. 2-3.6 the method used to recover from parsing errors is outlined. Finally, Sec. 2-3.7 presents the PL/I source code used in the implementation of the parser as described in this section.



2-3.1 Introduction



The purpose of the parser is to decide whether or not the given GAUSS program follows the syntax rules for the GAUSS language as described in Chap. 1. This process of validating the syntactic correctness of a program is accomplished by reading the input stream and deciding whether the next input symbol can legally follow what has already been read. The parser obtains tokens from the input stream by calling the scanner. For certain tokens, it will also expect additional information to be available in a global variable provided for this purpose. This happens whenever the input token is a number, a literal, or an identifier as was discussed in the description of the scanner (see Sec. 2-2.5). As the parser analyses the GAUSS program, it must at certain times call appropriate semantic routines to generate code for the program being parsed. After a semantic routine has completed its processing, the parser continues parsing the program. F or this implementation of a compiler for GAUSS, it was decided to perform a top-down parse with no back-up. The information required during such a parse can be characterized in terms of an LL(I) grammar. LL(I) grammars have been used for defining actual programming languages for several years, and the parsing algorithm for this class of grammars is efficient. Thus, a parser of this type should provide an efficient and practical example of a parsing technique. The LL(I) grammar for the GAUSS language is described in the next subsection.



2-3.2 LL(I) Grammar



An LL(I) grammar allows us to determine the production to be applied given that it is known which nonterminal is being expanded and what the next input symbol is. This 'is the problem that arises when a top-down parse with no back-up is attempted. Consider the following LL(I) example grammar:



o. S' ::= A# 4. S ::= [eC] 1. A ::= iB - e 5. S ..= i . . . 2. B ::= SB 6. C ::= eC 3. B ::= E 7. C ::= E



To illustrate the notions of a top-down parse using an LL(I) grammar, a derivation for the input string i[ e ]-e# is given. The sentential forms obtained during this parse are illustrated as follows.



1. S' ==> A# 2. ==> iB - e# 3. ==> iSB - e# 4. ==> i[eC]B - e# 5. ==> i[e]B - e# 6. ==> i[e] - e#



3 PARSER 47



In the first two expansions of this parse, the choice of which production to expand 'is trivial, since both S' and A have only one right-hand side. When the third expansion is about to be made, a decision must be made as to whether B should be expanded to SB or to E. The next input symbol to be matched by the expansion of B is a [. By examining the grammar, it can be seen that the expansion that should be made is given by rule number 2. Now S is the first symbol to be expanded and the decision of what expansion to make must again be made. The input symbol that must be matched is still a [. By examining the grammar, it can be seen that production 4 must be applied if the input symbol is to be matched. When the nonterminal C is to be expanded in step 4 of the parse given previously, the input symbol that is to be matched is a ]. By examining the grammar, it can be seen that if production 6 was applied, the symbol e would be generated instead of a ]. Thus production 7 must be the one to be applied. Finally, the second B must be expanded. Since the input symbol to be matched is now - , rule 3 of the grammar is used to expand B. Once this is done, all nonterminals have been expanded, and the parse is finished. Now that an informal strategy for parsing an LL(I) grammar has been introduced, there are several questions that must be answered. For example, how do you tell if a grammar is LL(I)? How can the parsing algorithm conveniently determine from the grammar which production to apply next? What is the actual parsing algorithm for this type of grammar? These questions are answered in the subsections that follow. First, however, we will present the LL(I) grammar for GAUSS. The grammar given in Chap. 1 to define the GAUSS language was designed to be easily understood. Unfortunately, this grammar is not LL(I). Thus the grammar had to be changed so that it would be LL(I), but still equivalent to the grammar which defines the GAUSS language. When the attempt was made to change the grammar to an LL(I) grammar, it was found that this could not be done easily. In order to simplify matters, the grammar is changed to allow a slightly larger language than that specified by the original grammar, and then to restrict these expanded areas in the semantic portion of the compiler. The compiler as a whole, however, accepts only the language specified by the original grammar. The two expanded areas affected by these changes involve the parsing of the type of the first procedure encountered, and the parsing of expressions. Because global variables are declared before procedures, the new grammar will allow the first procedure to have an ARRAY type. This is invalid, and is checked for in the semantic portion of the compiler. The grammar for the expressions allows the operands of an operator to be of any type. For example,



x = (3 + "ABC") < TRUE;



will parse correctly, since the only thing wrong with this expression is that the types of the operands do not match the types required by the operators. The type checking of operands is done in the semantic portion of the compiler. The revised grammar is given here for a reference.



1 <program> ::= <global> #eof 2 <global> ::= <procedure>; <procedure list> 3 I <type specifier> <core> 4 <core> ::= <procedure>; <procedure list> 5 I <entity list>; <global>



48 THE IMPLEMENTATION OF A COMPILER FOR GAUSS



6 <declaration list> ::= <declaration>; <declaration list> 7 IE 8 <procedure list> ::= <procedure>; <procedure list> 9 I <simple type> <procedure>; <procedure list> 10 IE 11 <declaration> ::= <type spec