Formal Languages and Compilation (3rd Edition)
Stefano Crespi Reghizzi, Luca Breveglieri, Angelo Morzenti
This classroom-tested and clearly written textbook presents a focused guide to the conceptual foundations of compilation, explaining the fundamental principles and algorithms used for defining the syntax of languages and for implementing simple translators.
This significantly updated and expanded third edition has been enhanced with additional coverage of regular expressions, visibly pushdown languages, bottom-up and top-down deterministic parsing algorithms, and new grammar models.
Topics and features:

- Describes the principles and methods used in designing syntax-directed applications such as parsing and regular expression matching
- Covers translations, semantic functions (attribute grammars), and static program analysis by data flow equations
- Introduces an efficient method for string matching and parsing suitable for ambiguous regular expressions (NEW)
- Presents a focus on extended BNF grammars with their general parser and with LR(1) and LL(1) parsers (NEW)
- Introduces a parallel parsing algorithm that exploits multiple processing threads to speed up the syntax analysis of large files
- Discusses recent formal models of input-driven automata and languages (NEW)
- Includes extensive use of theoretical models of automata, transducers, and formal grammars, and describes all algorithms in pseudocode
- Contains numerous illustrative examples, and supplies a large set of exercises with solutions at an associated website
Advanced undergraduate and graduate students of computer science will find this reader-friendly textbook to be an invaluable guide to the essential concepts of syntax-directed compilation. The fundamental paradigms of language structures are elegantly explained in terms of the underlying theory, without requiring the use of software tools or knowledge of implementation, and through algorithms simple enough to be practiced with paper and pencil.
Year: 2019
Edition: 3rd
Publisher: Springer
Language: English
Pages: 499 / 509
ISBN-13: 978-3-030-04879-2
Series: Texts in Computer Science
File: PDF, 9.35 MB
Texts in Computer Science

Stefano Crespi Reghizzi, Luca Breveglieri, Angelo Morzenti
Formal Languages and Compilation, Third Edition

Series Editors: David Gries, Department of Computer Science, Cornell University, Ithaca, NY, USA; Orit Hazzan, Faculty of Education in Technology and Science, Technion—Israel Institute of Technology, Haifa, Israel. More information about this series at http://www.springer.com/series/3191

Stefano Crespi Reghizzi, Luca Breveglieri, Angelo Morzenti: Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milan, Italy

ISSN 1868-0941, ISSN 1868-095X (electronic)
Texts in Computer Science
ISBN 978-3-030-04878-5, ISBN 978-3-030-04879-2 (eBook)
https://doi.org/10.1007/978-3-030-04879-2
Library of Congress Control Number: 2018965427

1st & 2nd editions: © Springer-Verlag London 2009, 2013
3rd edition: © Springer Nature Switzerland AG 2019

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

Preface

This third edition updates, enriches, and revises the 2013 printing [1], without altering the original structure of the book. The selection of materials and the presentation style reflect many years of teaching compiler courses and of doing research on formal language theory and formal methods, on compiler and language design, and, to a lesser extent, on natural language processing. Amid the turmoil of information technology developments, the subject of the book has rested on the same fundamental principles for half a century, and it preserves its conceptual importance and practical relevance. This state of affairs of a topic, which is central to computer science and information processing and is based on established principles, might lead some people to believe that the corresponding textbooks are by now consolidated, much as the classical books on calculus or physics. In reality, this is not the case: there exist fine classical books on the mathematical aspects of language and automata theory, but, as concerns the application to compiling and data representation, the best books are sorts of encyclopedias of algorithms, design methods, and practical tricks used in compiler design.
Indeed, a compiler is a microcosm: it features many different aspects, ranging from algorithmic wisdom to computer hardware, system software, and network interfaces. As a consequence, the textbooks have grown in size and compete with respect to their coverage of the latest developments in programming languages, processor and network architectures, and clever mappings from the former to the latter. To put things into order, it is better to separate such complex topics into two parts, which we may call basic or fundamental and advanced or technological; to some extent, they correspond to the two subsystems that make up a compiler: the user-language-specific front-end and the machine-specific back-end. The basic part is the subject of this book; it covers the principles and algorithms to be used for defining the syntax of languages and for implementing simple translators. It does not include: the specialized know-how needed for various classes of programming languages (procedural, functional, object oriented, etc.), the computer architecture related aspects, and the optimization methods used to improve the performance of both the compilation process and the executable code produced.

In other textbooks on compilers, the bias toward practical aspects has reduced the attention to fundamental concepts. This has prevented their authors from taking advantage of the improvements and simplifications made possible by decades of both extensive use and theoretical polishing, and from avoiding the irritating variants and repetitions that are found in the historical fundamental papers. Moving from these premises, we decided to present, in a simple minimalist way, the principles and methods currently used in designing syntax-directed applications such as parsing and regular expression matching.
Thus, we have substantially improved the standard presentation of parsing algorithms for both regular expressions and grammars, by unifying the concepts and notations used in various approaches, and by extending method coverage with a reduced definitional apparatus. An example that expert readers should appreciate is the unification of top-down and bottom-up parsing algorithms within a new rigorous yet practical framework. In this way, the reduced effort and space needed to cover classical concepts have made room for new advanced methods typically not present in similar textbooks. First, we have introduced in this edition a new efficient algorithm for string matching with regular expressions, which are increasingly important in such applications as Web browsing and data filtering. Second, our presentation of parsing algorithms focuses on the more expressive extended BNF grammars, which are the de facto standard in language reference manuals. Third, we present a parallel parsing algorithm that takes advantage of the many processing units of modern multicore devices to speed up the syntax analysis of large files. Last, we mention the addition in this edition of the recent formal models of input-driven automata and languages (also known as the visibly pushdown model), which suit markup languages (such as XML and JSON) and have also been developed in other domains to formally verify critical systems.

The presentation is illustrated by many small yet realistic examples, to ease the understanding of the theory and the transfer to application. Theoretical models of automata, transducers, and formal grammars are used extensively, whenever practical motivations exist. Algorithms are described in pseudocode to avoid the distracting details of a programming language, yet they are straightforward to convert into executable procedures; for some of them, we have made their code available on the Web. The book is not restricted to syntax.
The study of translations, semantic functions (attribute grammars), and static program analysis by data flow equations enlarges the understanding of the compilation process and covers the essentials of a syntaxdirected translator. A large collection of problems and solutions is available at the authors’ course site. This book should be welcome by those willing to teach or learn the essential concepts of syntaxdirected compilation, without needing to rely on software tools and implementations. We believe that learning by doing is not always the best approach and that an overcommitment to practical work may sometimes obscure the conceptual foundations. In the case of formal languages, the elegance and essentiality of the underlying theory allows students to acquire the fundamental paradigms of language structures, to avoid pitfalls such as ambiguity, and to adequately map structure to meaning. In this ﬁeld, many if not all relevant algorithms Preface vii are simple enough to be practiced by paper and pencil. Of course, students should be encouraged to enroll in a handson laboratory and to experiment syntaxdirected tools (like flex and bison) on realistic cases. The authors thank the colleagues Alessandro Barenghi and Marcello Bersani, who have taught compilers to computer engineering students by using this book. The ﬁrst author remembers the late Antonio Grasselli, the computer science pioneer who ﬁrst fascinated him with the subject of this book, combining linguistic, mathematical, and technological aspects. Milan, Italy April 2019 Stefano Crespi Reghizzi Luca Breveglieri Angelo Morzenti Reference 1. Crespi Reghizzi S, Breveglieri L, Morzenti A (2013) Formal languages and compilation, 2nd edn. Springer Contents 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1 Intended Scope and Readership . . . . . . . . . . . . . . . . . . . . . . . . . 1.2 Compiler Parts and Corresponding Concepts . . . . . . . . . . . . . . 
2 Syntax
  2.1 Introduction
    2.1.1 Artificial and Formal Languages
    2.1.2 Language Types
    2.1.3 Chapter Outline
  2.2 Formal Language Theory
    2.2.1 Alphabet and Language
    2.2.2 Language Operations
    2.2.3 Set Operations
    2.2.4 Star and Cross
    2.2.5 Language Quotient
  2.3 Regular Expressions and Languages
    2.3.1 Definition of Regular Expression
    2.3.2 Derivation and Language
    2.3.3 Other Operators
    2.3.4 Closure Properties of Family REG
  2.4 Linguistic Abstraction: Abstract and Concrete Lists
    2.4.1 Lists with Separators and Opening/Closing Marks
    2.4.2 Language Substitution
    2.4.3 Hierarchical or Precedence Lists
  2.5 Context-Free Generative Grammars
    2.5.1 Limits of Regular Languages
    2.5.2 Introduction to Context-Free Grammars
    2.5.3 Conventional Grammar Representations
    2.5.4 Derivation and Language Generation
    2.5.5 Erroneous Grammars and Useless Rules
    2.5.6 Recursion and Language Infinity
    2.5.7 Syntax Tree and Canonical Derivation
    2.5.8 Parenthesis Languages
    2.5.9 Regular Composition of Context-Free Languages
    2.5.10 Ambiguity in Grammars
    2.5.11 Catalogue of Ambiguous Forms and Remedies
    2.5.12 Weak and Structural Equivalence
    2.5.13 Grammar Transformations and Normal Forms
  2.6 Grammars of Regular Languages
    2.6.1 From Regular Expression to Grammar
    2.6.2 Linear and Unilinear Grammars
    2.6.3 Linear Language Equations
  2.7 Comparison of Regular and Context-Free Languages
    2.7.1 Limits of Context-Free Languages
    2.7.2 Closure Properties of Families REG and CF
    2.7.3 Alphabetic Transformations
    2.7.4 Grammars with Regular Expressions
  2.8 More General Grammars and Language Families
    2.8.1 Chomsky Classification of Grammars
    2.8.2 Further Developments and Final Remarks
  References

3 Finite Automata as Regular Language Recognizers
  3.1 Introduction
  3.2 Recognition Algorithms and Automata
    3.2.1 A General Automaton
  3.3 Introduction to Finite Automata
  3.4 Deterministic Finite Automata
    3.4.1 Error State and Total Automaton
    3.4.2 Clean Automaton
    3.4.3 Minimal Automaton
    3.4.4 From Automaton to Grammar
  3.5 Nondeterministic Automata
    3.5.1 Motivation of Nondeterminism
    3.5.2 Nondeterministic Recognizers
    3.5.3 Automata with Spontaneous Moves
    3.5.4 Correspondence Between Automata and Grammars
    3.5.5 Ambiguity of Automata
    3.5.6 Left-Linear Grammars and Automata
  3.6 From Automaton to Regular Expression: BMC Method
  3.7 Elimination of Nondeterminism
    3.7.1 Elimination of Spontaneous Moves
    3.7.2 Construction of Accessible Subsets
  3.8 From Regular Expression to Recognizer
    3.8.1 Thompson Structural Method
    3.8.2 Berry–Sethi Method
  3.9 String Matching of Ambiguous Regular Expressions
    3.9.1 Trees for Ambiguous r.e. and Their Phrases
    3.9.2 Local Recognizer of Linearized Syntax Trees
    3.9.3 The Berry–Sethi Parser
  3.10 Expressions with Complement and Intersection
    3.10.1 Product of Automata
  3.11 Summary of the Relations Between Regular Expressions, Grammars and Automata
  References

4 Pushdown Automata and Parsing
  4.1 Introduction
  4.2 Pushdown Automaton
    4.2.1 From Grammar to Pushdown Automaton
    4.2.2 Definition of Pushdown Automaton
    4.2.3 One Family for Context-Free Languages and Pushdown Automata
    4.2.4 Intersection of Regular and Context-Free Languages
  4.3 Deterministic Pushdown Automata and Languages
    4.3.1 Closure Properties of Deterministic Languages
    4.3.2 Nondeterministic Languages
    4.3.3 Determinism and Language Unambiguity
    4.3.4 Subclasses of Deterministic Pushdown Automata and Languages
  4.4 Syntax Analysis: Top-Down and Bottom-Up Constructions
  4.5 Grammar as Network of Finite Automata
    4.5.1 Syntax Charts
    4.5.2 Derivation for Machine Nets
    4.5.3 Initial and Look-Ahead Characters
  4.6 Bottom-Up Deterministic Analysis
    4.6.1 From Finite Recognizers to Bottom-Up Parser
    4.6.2 Construction of ELR(1) Parsers
    4.6.3 ELR(1) Condition
    4.6.4 Simplifications for BNF Grammars
    4.6.5 Parser Implementation Using a Vector Stack
    4.6.6 Lengthening the Look-Ahead
  4.7 Deterministic Top-Down Parsing
    4.7.1 ELL(1) Condition
    4.7.2 Step-by-Step Derivation of ELL(1) Parsers
    4.7.3 Direct Construction of Top-Down Predictive Parsers
    4.7.4 A Graphical Method for Computing Guide Sets
    4.7.5 Increasing Look-Ahead in Top-Down Parsers
  4.8 Operator-Precedence Grammars and Parallel Parsing
    4.8.1 Floyd Operator-Precedence Grammars and Parsers
    4.8.2 Sequential Operator-Precedence Parser
    4.8.3 Comparisons and Closure Properties
    4.8.4 Parallel Parsing Algorithm
  4.9 Deterministic Language Families: A Comparison
  4.10 Discussion of Parsing Methods
  4.11 A General Parsing Algorithm
    4.11.1 Introductory Presentation
    4.11.2 Earley Algorithm
    4.11.3 Syntax Tree Construction
  4.12 Managing Syntactic Errors and Changes
    4.12.1 Errors
    4.12.2 Incremental Parsing
  References
5 Translation Semantics and Static Analysis
  5.1 Introduction
  5.2 Translation Relation and Function
  5.3 Transliteration
  5.4 Purely Syntactic Translation
    5.4.1 Infix and Polish Notation
    5.4.2 Ambiguity of Source Grammar and Translation
    5.4.3 Translation Grammar and Pushdown Transducer
    5.4.4 Syntax Analysis with Online Translation
    5.4.5 Top-Down Deterministic Translation by Recursive Procedures
    5.4.6 Bottom-Up Deterministic Translation
  5.5 Regular Translation
    5.5.1 Two-Input Automaton
    5.5.2 Translation Function and Finite Transducer
    5.5.3 Closure Properties of Translation
  5.6 Semantic Translation
    5.6.1 Attribute Grammars
    5.6.2 Left and Right Attributes
    5.6.3 Definition of Attribute Grammar
    5.6.4 Dependence Graph and Attribute Evaluation
    5.6.5 One-Sweep Semantic Evaluation
    5.6.6 Other Evaluation Methods
    5.6.7 Combined Syntax and Semantic Analysis
    5.6.8 Typical Applications of Attribute Grammar
  5.7 Static Program Analysis
    5.7.1 A Program as an Automaton
    5.7.2 Liveness Intervals of a Variable
    5.7.3 Reaching Definition
  References

Index

1 Introduction

1.1 Intended Scope and Readership

The information technology revolution was made possible by the invention of electronic digital machines, yet without programming languages their use would have been restricted to the few people able to write binary machine code. A programming language, as a text, contains features coming both from human languages and from mathematical logic. The translation from a programming language to machine code is known as compilation.¹ Language compilation is a very complex process, which would be impossible to master without systematic design methods. Such methods and their theoretical foundations are the subject of this book. They make up a consistent and largely consolidated body of concepts and algorithms, which are applied not just in compilers, but also in other fields. Automata theory is pervasive in all the branches of informatics, as a model for situations or phenomena classifiable as discrete-time and discrete-space systems. Formal grammars, on the other hand, originated in linguistic research and, together with automata, are widely applied in document processing and in data representation and searching, in particular for the Web.
Coming to the prerequisites, the reader should have a good background in programming, though a detailed knowledge of a specific programming language is not required, because our presentation of algorithms relies on self-explanatory pseudocode. The reader is expected to be familiar with basic mathematical theories and notations, namely from set theory, algebra, and logic. The above prerequisites are typically met by computer science/engineering or mathematics students with a university education of two or more years.

¹ This term may sound strange; it originates in the early approach to the compilation of correspondence tables between a command in the language and a series of machine operations.

The selection of topics, and the presentation based on rigorous definitions and algorithms illustrated by many motivating examples, qualify the book for a university course aiming to expose students to the importance of good theories and of efficient algorithms for designing effective systems. In our experience, some fifty hours of lecture suffice to cover the entire book. The authors' long experience in teaching the subject to different audiences brings out the importance of combining theoretical concepts and examples. Moreover, it is advisable that students take advantage of well-known and documented software tools (such as the classical Flex and Bison generators of lexical and syntactic analyzers) to implement and experiment with the main algorithms on realistic case studies.

With regard to its reach and limits, the book covers the essential concepts and methods needed to design simple translators based on the very common syntax-directed paradigm.
Such systems analyze a source text, written in a formally specified language, into syntactical units, and translate it into a different target representation. The translation function is nicely formulated as the composition of simpler functions keyed to the syntactical units of the source text. It goes without saying that a real compiler for a programming language includes other technological aspects and know-how, in particular related to processor and computer architecture, which are not covered. Such know-how is essential for automatically translating a program into machine instructions and for transforming a program in order to make the best use of the computational resources of a computer. The study of methods for program transformation and optimization is a more advanced topic, which follows the present introduction to compiler methods. The next section outlines the contents of the book.

1.2 Compiler Parts and Corresponding Concepts

There are two external interfaces to a compiler: the source language to be analyzed and translated, and the target language produced by the translator. Chapter 2 describes the so-called syntactic methods that are universally adopted in order to provide a rigorous definition of the texts (or character strings) written in the source language. The methods to be presented are regular expressions and context-free (BNF) grammars. To avoid the pitfall of ambiguous definitions, we examine the relevant causes of ambiguity in regular expressions and in grammars. Useful transformations from regular expressions to grammars and between grammars of different forms are covered, which are needed in the following chapters. In particular, we introduce the combination of regular expressions and grammars, which is the central model for the parsing algorithms. Chapter 2 ends with a critical hint at more powerful, but rarely applied, models: the Chomsky context-sensitive grammars and the recently proposed conjunctive grammars.
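To make the pitfall of ambiguity concrete: an illustrative sketch, not an example from the book, is the classic grammar E → E '-' E | 'a', which is ambiguous. The brute-force enumerator below (the function name and the tuple representation of trees are assumptions of this sketch) finds every parse tree of a string; any string with more than one tree witnesses the ambiguity.

```python
def parses(s):
    """Enumerate parse trees of s under the ambiguous grammar E -> E '-' E | 'a'.

    A tree is either the leaf 'a' or a tuple (left_tree, '-', right_tree).
    """
    if s == "a":
        return ["a"]
    trees = []
    for i, ch in enumerate(s):
        if ch == "-":  # try every position of the top-level '-'
            for left in parses(s[:i]):
                for right in parses(s[i + 1:]):
                    trees.append((left, "-", right))
    return trees

print(len(parses("a-a-a")))  # -> 2: the trees (a-a)-a and a-(a-a)
```

Since "a-a-a" has two distinct syntax trees, the grammar cannot map each sentence to a unique structure, which is exactly the defect the book's Chap. 2 teaches how to detect and remedy.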
The first task of a compiler is to check the correctness of the source text, that is, whether the text agrees with the syntactic definition of the source language. In order to perform such a check, the algorithm scans the source text character by character, and at the end it rejects or accepts the input depending on the result of the analysis. By a minimalist abstract approach, such recognition algorithms are conveniently described as mathematical machines or automata, in the tradition of the well-known Turing machine. Different automaton types correspond to the various families of formal languages, and the types practically relevant for compilation are considered in the following two chapters. Chapter 3 covers finite automata, which are machines with a finite random access memory. They are the recognizers of the languages defined by regular expressions. Within compilation, they are used for lexical analysis or scanning, to extract from the source text the keywords, numbers, identifiers, and in general the pieces of text that encode the lexical units of the language, called lexemes or tokens. The basic models of finite automata, deterministic and nondeterministic, real-time and with spontaneous moves, are presented along with the transformations between the different models. The algorithms for converting a finite automaton into a regular expression and conversely are well covered. The other main topics of Chap. 3 are the string-matching problem for ambiguous regular expressions, and the use of the Boolean operations of intersection and complement in regular expressions. Chapter 4 is devoted to the recognition problem for the languages defined by context-free grammars, which is the backbone of syntax-directed translators and justifies the salient length of the chapter. The chapter starts with the model of nondeterministic pushdown automata and its direct correspondence to context-free grammars.
The deterministic version of the model follows, which defines the leading family of deterministic languages. As a special case, the input-driven or visibly pushdown automata have been included, because of their relevance for markup languages such as XML and JSON. A rich choice of parsing algorithms is presented. First, there is a unified modern presentation of the deterministic parsers, both bottom-up (or LR(1)) and top-down (or LL(1)), for context-free grammars extended with regular languages. The third deterministic parser is the operator-precedence method, which is here presented in the parallel version suitable for multi-core computers. The last parsing method, due to Earley, is the most general and accepts any context-free language. Chapter 4 includes comparisons of the different language families suitable for each parsing method, a discussion of grammar transformations, and the basic concepts of error handling by parsers. The ultimate job of a compiler is to translate a source text into another language. The compiler module responsible for finalizing the verification of the source language rules and for producing the translation is called the semantic analyzer. It operates on the structural representation, the syntax tree, produced by the parser. The formal translation models and the methods used to implement semantic analyzers are illustrated in Chap. 5. Actually, we describe two kinds of transformations. First, the pure syntactic translations are modeled by transducers, either finite automata or pushdown automata extended with an output, which compute a string. Second, the semantic translations are performed by functions or methods that operate on the syntax tree of the source text and may produce as output not just a string, but any sort of data values. Semantic translations are specified by a practical semi-formal extension to context-free grammars, which is called attribute grammar.
This approach, by combining the accuracy of formal syntax and the flexibility of programming, conveniently expresses the typical analysis and transformation of syntax trees that is required by compilers. To give a concrete idea of compilation, typical simple examples are included: the type consistency check between the variables declared and used in a programming language, the translation of high-level statements into machine instructions, and semantics-directed parsing. For sure, compilers do much more than syntax-directed translation. Static program analysis is an important example: it consists in examining a program to determine, ahead of execution, some of its properties, or to detect errors not caught by syntactic and semantic analysis. The purpose is to improve the robustness, reliability, and efficiency of the program. An example of error detection is the identification of uninitialized variables used in arithmetic expressions. For code improvement, an example is the elimination of useless assignment statements. Chapter 5 terminates with an introduction to the static analysis of programs modeled by their control-flow graph, viewed as a finite automaton. Several interesting problems can be formalized and statically analyzed by a common approach, based on flow equations and their solution by iterative approximations converging to the least fixed point. Last, the book comes with an up-to-date essential bibliography at the end of each chapter and is accompanied by a detailed index.

2 Syntax

2.1 Introduction

2.1.1 Artificial and Formal Languages

Many centuries after the spontaneous emergence of natural language for human communication, mankind has purposively constructed other communication systems and languages, to be called artificial, intended for very specific tasks.
A few artificial languages, like the logical propositions of Aristotle or the music sheet notation of Guittone d'Arezzo, are very ancient, but their number has exploded with the invention of computers. Many artificial languages are intended for man–machine communication, to instruct a programmable machine to do some task: to perform a computation, to prepare a document, to search a database or the Web, to control a robot, and so on. Other languages serve as interfaces between devices or between software applications; e.g., PDF is a language that many document and picture processors read and write. Any designed language is artificial by definition, but not all artificial languages are formalized: thus, a programming language like Java is formalized, but Esperanto, although designed by man, is not so. But how should we classify the language of genes or DNA, which deals with natural biological systems though it has been invented by scientists? The structures of molecular biology are three-dimensional, yet viewing them as RNA sequences has opened the way to the methods of formal languages.

Footnote 1: For a discussion of the role of linguistic methods in molecular biology, see Searls [1].

For a language to be formalized (or formal), the form of its sentences (or syntax) and their meaning (or semantics) must be precisely and algorithmically defined. In other words, it should be possible for a computer to check that the sentences are grammatically correct, and to determine their meaning. Meaning, however, is a difficult and controversial notion. For our purposes, the meaning of a sentence can be taken to be its translation into another language, which is known to the computer or to a well-established software system.
For instance, the meaning of a Java program is its translation into the machine language of the computer executing the program. In this book, the term formal language is used in a narrower sense, which excludes semantics. In the field of syntax, a formal language is a mathematical structure, defined on top of an alphabet by means of certain axiomatic rules (a formal grammar) or by using abstract machines such as the famous one due to A. Turing. The notions and methods of formal languages are analogous to those used in number theory and in logic. Thus, formal language theory is primarily concerned with the form or syntax of sentences, not with meaning, at least not directly. A string (or text) is either valid or illegal, that is, it either belongs to the formal language or it does not. Such a theory makes a first important step toward the ultimate goal: the study of language translation and meaning, which will however require additional methods.

2.1.2 Language Types

In this book, a language is a one-dimensional communication medium, made by sequences of symbolic elements of an alphabet, called terminal characters. Actually, people often apply the term language to other, non-textual communication media, which are more or less formalized by means of rules. Thus, iconic languages comprise road traffic signs or video display icons. Musical language is concerned with sounds, rhythm, and harmony. Architects and industrial designers of buildings and things are interested in their spatial relations, which they describe as the language of design. Early child drawings are often considered as sentences of a pictorial language, which can be partially formalized in accordance with developmental psychology theories. Certainly, the formal approach to syntax to be developed here has some interest for non-textual languages too, which however are out of scope for this book.
Within computer science, the term language applies, as said, to a text made by a set of characters orderly written from, say, left to right. In addition, the term is used to refer to other discrete structures, such as graphs, trees, or arrays of pixels describing a digital picture. Formal language theories have been proposed and used more or less successfully also for such non-textual languages.

Footnote 2: Just three examples and their references: graph grammars and languages [2], tree languages [3,4], and picture (or two-dimensional) languages [5,6].

Reverting to the mainstream of textual languages, a frequent request directed to the specialist is to define and specify an artificial language. The specification may have several uses: as a language reference manual for future users, as an official standard definition, or as a contractual document for compiler designers, to ensure consistency of specification and implementation. It is not an easy task to write a complete and rigorous definition of a language. Of course, the exhaustive approach of listing all the possible sentences or phrases is unfeasible, because the possibilities are infinitely many, since the length of the sentences is usually unbounded. As a native language speaker, a programmer is not constrained by any strict limit on the length of the phrases he writes. The ensuing problem of representing an infinite number of cases by a finite description can be addressed by an enumeration procedure, as in logic. When executed, the procedure generates longer and longer sentences, in an unending process if the language to be modeled is not finite. This chapter presents a simple and established manner to express the rules of the enumeration procedure, in the form of the rules of a generative grammar (or syntax).

2.1.3 Chapter Outline

The chapter starts with the basic components of language theory: alphabet, string, and operations, such as concatenation and repetition, on strings and sets of strings.
The first model presented is the regular expression, which defines the family of regular languages. Then lists are introduced as a fundamental and pervasive syntax structure in all kinds of languages. From the exemplification of list variants, the idea of linguistic abstraction grows out naturally. This is a powerful reasoning tool for reducing the varieties of existing languages to a few paradigms. After discussing the limits of regular languages, the chapter moves to context-free grammars. Following the basic definitions, the presentation focuses on structural properties, namely equivalence, ambiguity, and recursion. Exemplification continues with important linguistic paradigms such as hierarchical lists, parenthesized structures, Polish notations, and operator-precedence expressions. Their combination produces the variety of forms to be found in artificial languages. Then a classification of some common forms of ambiguity and the corresponding remedies is offered as a practical guide for grammar designers. Various rule transformations (normal forms) are introduced, which should familiarize the reader with the modifications, often needed by the analysis algorithms studied in Chap. 4, that adjust a grammar without affecting the language. Returning to regular languages from the grammar perspective, the chapter evidences the greater descriptive capacity of context-free grammars. The comparison of regular and context-free languages continues by considering the operations that may cause a language to exit or remain in one or the other family. Alphabetical transformations anticipate the operations studied in Chap. 5 as translations. A discussion of the unavoidable regularities found in very long strings completes the theoretical picture.
The last section mentions the Chomsky classification of grammar types and exemplifies context-sensitive grammars, stressing the difficulty of such a rarely used model and discussing some recent attempts at more expressive grammar models.

2.2 Formal Language Theory

Formal language theory starts from the elementary notions of alphabet, string operations, and aggregate operations on sets of strings. By such operations, complex languages can be obtained starting from simpler ones.

2.2.1 Alphabet and Language

An alphabet is a finite set of elements called terminal symbols or characters. Let Σ = { a1, a2, …, ak } be an alphabet with k elements, i.e., its cardinality is |Σ| = k. A string, also called a word, is a sequence, i.e., an ordered set possibly with repetitions, of characters.

Example 2.1 (binary strings) Let Σ = { a, b } be an alphabet. Some strings are: a a b a, a a a, a b a a, and b.

A language is a set of strings over a specified alphabet.

Example 2.2 Three sample languages over the same alphabet Σ = { a, b }:

L1 = { a a, a a a }
L2 = { a b a, a a b }
L3 = { a b, b a, a a b b, a b a b, …, a a a b b b, … } = set of the strings having as many letters a as letters b

Notice that a formal language viewed as a set has two layers. At the first level, there is an unordered set of non-elementary elements, the strings. At the second level, each string is an ordered set of atomic elements, the terminal characters. Given a language, a string belonging to it is called a sentence or phrase. Thus, b b a a ∈ L3 is a sentence of L3, whereas a b b ∉ L3 is an incorrect string. The cardinality or size of a language is the number of sentences it contains. For instance, |L2| = |{ a b a, a a b }| = 2. If the cardinality is finite, the language is called finite, too. Otherwise, there is no finite bound on the number of sentences, and the language is termed infinite. To illustrate, languages L1 and L2 are finite, but language L3 is infinite.
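As an illustrative sketch not taken from the book, finite languages can be modeled as Python sets of strings, and membership in the infinite language L3 can be tested through its characteristic predicate (the names L1, L2, and in_L3 are ours):

```python
# Finite languages over the alphabet {a, b} as Python sets of strings.
L1 = {"aa", "aaa"}
L2 = {"aba", "aab"}
assert len(L2) == 2          # cardinality |L2| = 2

# L3 is infinite, so it cannot be listed; test membership by its
# characteristic predicate: as many letters a as letters b.
def in_L3(x: str) -> bool:
    return x.count("a") == x.count("b")

assert in_L3("bbaa")         # a sentence of L3
assert not in_L3("abb")      # an incorrect string
```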
One can observe that a finite language is essentially a collection of words, sometimes called a vocabulary.

Footnote 3: In mathematical writings, the terms word and string are synonymous, while in linguistics a word is a string that has a meaning.

A special finite language is the empty set or language ∅, which contains no sentence, i.e., |∅| = 0. Usually, when a language contains just one element, the set braces are omitted, by writing, e.g., a b b instead of { a b b }. It is convenient to introduce the notation |x|_b for the number of characters b present in a string x. For instance:

|a a b|_a = 2    |a b a|_a = 2    |b a a|_c = 0

The length |x| of a string x is the number of characters it contains, e.g., |a b| = 2 and |a b a a| = 4. Two strings of length h and k, respectively:

x = a1 a2 … ah    y = b1 b2 … bk

are equal if h = k and ai = bi (1 ≤ i ≤ h). In words, by examining the strings from left to right, their respective characters coincide. Thus, we obtain:

a b a ≠ b a a    b a a ≠ b a

2.2.1.1 String Operations

In order to manipulate strings, it is convenient to introduce several operations. For the strings:

x = a1 a2 … ah    y = b1 b2 … bk

concatenation is defined as:

x · y = a1 a2 … ah b1 b2 … bk

Footnote 4: Also termed product in mathematical works.

The dot '·' may be dropped, writing x y in place of x · y. This operation, essential for formal languages, plays the role addition has in number theory.

Example 2.3 (string concatenation) For the strings:

x = well    y = in    z = formed

we obtain:

x y = wellin
y x = inwell ≠ x y
(x y) z = wellin · formed = x (y z) = well · informed = wellinformed

Concatenation is clearly non-commutative, that is, the identity x y = y x does not hold in general. The associative property holds:

(x y) z = x (y z)

Associativity permits writing the concatenation of three or more strings without parentheses. The length of the result is the sum of the lengths of the concatenated strings:

|x y| = |x| + |y|    (2.1)
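As a quick sketch in Python (not part of the text), concatenation is the `+` operator on strings, the length |x| is `len`, and the count |x|_b is `str.count`:

```python
x = "well"
y = "in"
z = "formed"

# Concatenation is associative but not commutative.
assert (x + y) + z == x + (y + z) == "wellinformed"
assert x + y != y + x                    # wellin vs. inwell

# The length of a concatenation is the sum of the lengths, identity (2.1).
assert len(x + y) == len(x) + len(y)

# |x|_b, the number of occurrences of the character b in the string x:
assert "aab".count("a") == 2
assert "baa".count("c") == 0
```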
2.2.1.2 Empty String

It is useful to introduce the concept of the empty (or null) string, denoted by the Greek epsilon ε, as the only string satisfying, for every string x, the identity:

x ε = ε x = x

From (2.1), it follows that the empty string has length zero: |ε| = 0. From an algebraic perspective, the empty string ε is the neutral element with respect to concatenation, because any string is unaffected by concatenating ε to its left or right. The empty string should not be confused with the empty set ∅. In fact, the empty set as a language contains no string, whereas the set { ε } contains one, the empty string; equivalently, the cardinalities of the sets ∅ and { ε } are 0 and 1, respectively. A language L is said to be nullable if and only if it includes the empty string, i.e., ε ∈ L.

2.2.1.3 Substring

Let the string x = u y v be written as the concatenation of three, possibly empty, strings u, y, and v. Then the strings u, y, and v are substrings of x. Moreover, string u is a prefix of x and string v is a suffix of x. A nonempty substring (or prefix or suffix) is called proper if it does not coincide with string x. Let x be a nonempty string of length at least k, i.e., |x| ≥ k ≥ 1. The notation Ini_k(x) denotes the prefix u of x having length k, to be termed the initial (of x) of length k.

Example 2.4 The string x = a a b a c b a contains the following components:

prefixes: a, a a, a a b, a a b a, a a b a c, a a b a c b, and a a b a c b a
suffixes: a, b a, c b a, a c b a, b a c b a, a b a c b a, and a a b a c b a
substrings: all the prefixes and suffixes listed above, and the internal strings such as a, a b, b a, b a c b, …

Notice that the pair b c is not a substring of x, although both letters b and c occur in x. The initial of length two is Ini_2(a a b a c b a) = a a.

2.2.1.4 Reversal or Mirror Reflection

The characters of a string are usually read from left to right, but it is sometimes requested to reverse the order.
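The prefix, suffix, substring, and Ini_k notions above map naturally onto Python slicing; the following is a hedged sketch with helper names of our own choosing:

```python
def prefixes(x: str) -> set:
    """The nonempty prefixes of x, including x itself."""
    return {x[:k] for k in range(1, len(x) + 1)}

def suffixes(x: str) -> set:
    """The nonempty suffixes of x, including x itself."""
    return {x[k:] for k in range(len(x))}

def substrings(x: str) -> set:
    """All nonempty substrings of x."""
    return {x[i:j] for i in range(len(x)) for j in range(i + 1, len(x) + 1)}

def ini(k: int, x: str) -> str:
    """Ini_k(x): the prefix of x of length k, for |x| >= k >= 1."""
    return x[:k]

x = "aabacba"
assert ini(2, x) == "aa"
assert "bacb" in substrings(x)
assert "bc" not in substrings(x)    # b and c occur in x, but never adjacently
assert prefixes(x) <= substrings(x) and suffixes(x) <= substrings(x)
```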
The reversal or reflection of a string x = a1 a2 … ah is the string x^R = ah ah−1 … a1. For instance, for:

x = roma    x^R = amor

The following string identities are immediate:

(x^R)^R = x    (x y)^R = y^R x^R    ε^R = ε

2.2.1.5 Repetition

When a string contains repetitions, it is handy to have an operator expressing them. The m-th power x^m (for an integer m ≥ 1) of a string x is the concatenation of x with itself for m − 1 times. By stipulation, the zero power of any string is defined to be the empty string. Thus, the complete definition is:

x^m = x^(m−1) · x for m ≥ 1    x^0 = ε

Here are a few examples:

x = a b:    x^0 = ε    x^1 = x = a b    x^2 = (a b)^2 = a b a b
y = a^2 = a a:    y^3 = a^2 a^2 a^2 = a^6
ε^0 = ε    ε^2 = ε

When writing formulas, the string to be repeated must be parenthesized if it is longer than one character. Thus, to express the 2nd power of the string a b, i.e., a b a b, one should write (a b)^2, not a b^2, which is the string a b b. Expressed differently, we assume that the power operation takes precedence over concatenation. Similarly, reversal takes precedence over concatenation; e.g., a b^R returns a b since b^R = b, while (a b)^R = b a.

2.2.2 Language Operations

It is straightforward to extend an operation, originally defined on strings, to an entire language: just apply the operation to each one of the language sentences. By means of this general principle, the string operations previously defined can be revisited, starting from those that have one argument. The reversal L^R of a language L is the set of the strings that are the reversal of a sentence of L:

L^R = { x | x = y^R ∧ y ∈ L }

Here the strings x are specified by the property expressed in the so-called characteristic predicate.
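Reversal and repetition can be sketched in Python as follows (helper names are ours; slicing with step −1 reverses, `*` repeats):

```python
def reverse(x: str) -> str:
    """x^R: the mirror reflection of x."""
    return x[::-1]

def power(x: str, m: int) -> str:
    """x^m: x concatenated with itself m times; x^0 is the empty string."""
    return x * m

assert reverse("roma") == "amor"
assert reverse(reverse("roma")) == "roma"            # (x^R)^R = x
x, y = "ab", "ba"
assert reverse(x + y) == reverse(y) + reverse(x)     # (x y)^R = y^R x^R
assert power("ab", 2) == "abab"                      # (a b)^2
assert "a" + power("b", 2) == "abb"                  # a b^2 = a b b, not (a b)^2
assert power("ab", 0) == "" == reverse("")           # x^0 = ε and ε^R = ε
```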
Similarly, the set of the proper prefixes of a language L is:

prefixes(L) = { y | x = y z ∧ x ∈ L ∧ y ≠ ε ∧ z ≠ ε }

Example 2.5 (prefix-free language) In some applications, the loss of one or more final characters of a language sentence is required to produce an incorrect string. The motivation is that the compiler is then able to detect the inadvertent truncation of a sentence. A language is prefix-free if none of the proper prefixes of its sentences is in the language, i.e., if the set prefixes(L) is disjoint from L. Thus, the language L1 = { x | x = a^n b^n ∧ n ≥ 1 } is prefix-free, since every proper prefix takes the form a^n b^m with n > m ≥ 0 and does not satisfy the characteristic predicate. On the other hand, the language L2 = { a^m b^n | m > n ≥ 1 } contains a^3 b^2 as well as its proper prefix a^3 b.

Similarly, the operations on two strings can be extended to two languages, by letting the former and latter argument span the respective language. For instance, the concatenation of the languages L′ and L′′ is defined as:

L′ · L′′ = { x · y | x ∈ L′ ∧ y ∈ L′′ }

From this, extending the m-th power operation to a language is straightforward (m ≥ 0):

L^0 = { ε }    L^m = L^(m−1) · L for m ≥ 1

Some special cases follow from the previous definitions:

∅^0 = { ε }    L · { ε } = { ε } · L = L    L · ∅ = ∅ · L = ∅

Example 2.6 (language concatenation) Consider the languages L1 and L2:

L1 = { a^i | i ≥ 0 ∧ i is even } = { ε, a a, a a a a, … }
L2 = { b^j a | j ≥ 1 ∧ j is odd } = { b a, b b b a, … }

We obtain the concatenation L1 · L2:

L1 · L2 = { a^i b^j a | (i ≥ 0 ∧ i is even) ∧ (j ≥ 1 ∧ j is odd) } = { b a, a^2 b a, a^4 b a, …, b^3 a, a^2 b^3 a, a^4 b^3 a, … }

A common error when computing a power of a base language is to take the same string m times. The result is a different set, included in the power of the base:

{ x | x = y^m ∧ y ∈ L } ⊆ L^m for m ≥ 2    (2.2)

Thus, in (2.2) for L = { a, b } and m = 2, the set to the left of '⊆' is { a a, b b }, while the value of L^m is { a a, a b, b a, b b }.
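A hedged Python sketch of language concatenation and power over finite sets (the helper names are ours), including the special cases and the inclusion (2.2):

```python
from itertools import product

def concat(L1: set, L2: set) -> set:
    """Language concatenation: { x . y | x in L1 and y in L2 }."""
    return {x + y for x, y in product(L1, L2)}

def lang_power(L: set, m: int) -> set:
    """L^m, with L^0 = { eps } represented by the empty Python string."""
    result = {""}
    for _ in range(m):
        result = concat(result, L)
    return result

L = {"a", "b"}
assert lang_power(L, 0) == {""}                    # L^0 = { eps }
assert concat(L, set()) == set()                   # L . empty set = empty set
assert concat(L, {""}) == L                        # L . { eps } = L
# The common error: repeating the SAME string m times gives a smaller set,
# strictly included in L^m, as stated by (2.2).
same = {y * 2 for y in L}
assert same == {"aa", "bb"}
assert lang_power(L, 2) == {"aa", "ab", "ba", "bb"}
assert same < lang_power(L, 2)
```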
Example 2.7 (strings of finite length) The power operation allows a concise definition of the strings of length not exceeding some integer k ≥ 0. Consider the alphabet Σ = { a, b }. For k = 3, the language L:

L = { ε, a, b, a a, a b, b a, b b, a a a, a a b, a b a, a b b, b a a, b a b, b b a, b b b } = Σ^0 ∪ Σ^1 ∪ Σ^2 ∪ Σ^3

can also be defined as:

L = { ε, a, b }^3

Notice that the sentences shorter than 3 are obtained by using the empty string ε of the base language. By slightly changing the example, the language L = { x | 1 ≤ |x| ≤ 3 } is defined, with concatenation and power, by the formula:

L = { a, b } · { ε, a, b }^2

2.2.3 Set Operations

Since a language is a set, the classical set operations of union '∪', intersection '∩', and difference '\' apply to languages. The set relations of inclusion '⊆', strict inclusion '⊂', and equality '=' apply as well. Before introducing the complement of a language, the notion of universal language is needed: it is defined as the set of all the strings over an alphabet Σ, of any length including zero. Clearly, the universal language is infinite, and it can be viewed as the union of all the powers of the alphabet:

L_universal = Σ^0 ∪ Σ ∪ Σ^2 ∪ …

The complement of a language L over an alphabet Σ, denoted by ¬L, is the set difference:

¬L = L_universal \ L

that is, the set of the strings over alphabet Σ that are not in L. When the alphabet is understood, the universal language can be expressed as the complement of the empty language:

L_universal = ¬∅

Example 2.8 (language complement) The complement of a finite language is always infinite; for instance, the set of the strings of any length except two is:

¬{ a, b }^2 = { ε } ∪ { a, b } ∪ { a, b }^3 ∪ …
On the other hand, the complement of an infinite language may or may not be finite, as shown on one side by the complement of the universal language, and on the other side by the complement of the set of the strings of even length over the alphabet { a }:

L = { a^(2n) | n ≥ 0 }    ¬L = { a^(2n+1) | n ≥ 0 }

Moving to set difference, consider the alphabet Σ = { a, b, c } and the languages:

L1 = { x | |x|_a = |x|_b = |x|_c ≥ 0 }    L2 = { x | |x|_a = |x|_b ∧ |x|_c = 1 }

Then the set differences are:

L1 \ L2 = { ε } ∪ { x | |x|_a = |x|_b = |x|_c ≥ 2 }

which represents the set of the strings that have the same number of occurrences of the letters a, b, and c, that number being different from one, and:

L2 \ L1 = { x | |x|_a = |x|_b ≠ |x|_c ∧ |x|_c = 1 }

which is the set of the strings that have exactly one c and the same number of occurrences of a and b, that number being different from one.

2.2.4 Star and Cross

Most artificial and natural languages include sentences that can be lengthened at will, so causing the number of sentences in the language to be unbounded. On the other hand, all the operations defined so far, with the exception of complement, do not allow writing a finite formula that denotes an infinite language. In order to enable the definition of an infinite language, the next essential development extends the power operation to the limit.

2.2.4.1 Star

The star operation '*' is defined as the union of all the powers of the base language:

L* = ∪ (h = 0 … ∞) L^h = L^0 ∪ L^1 ∪ L^2 ∪ … = { ε } ∪ L ∪ L^2 ∪ …

Footnote 5: Also known as Kleene star, and as reflexive transitive closure by concatenation.

Example 2.9 (language star) For the language L = { a b, b a }, the star is:

L* = { ε, a b, b a, a b a b, a b b a, b a a b, b a b a, … }

Every nonempty string of the 'starred' language L* can be segmented into substrings, which are sentences of the base language L. Notice that, if the base language contains at least one nonempty string, the starred language is infinite.
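Since L* is infinite, a program can only materialize a finite slice of it. The sketch below (our own helper, not from the book) collects the strings of L* up to a length bound, reproducing Example 2.9, and uses the star of the alphabet as a slice of the universal language for computing a complement:

```python
def star_up_to(L: set, n: int) -> set:
    """The strings of L* having length at most n (a finite slice of L*)."""
    result, frontier = {""}, {""}
    while frontier:
        # Extend every newly found string by one more sentence of the base.
        frontier = {x + y for x in frontier for y in L
                    if len(x + y) <= n} - result
        result |= frontier
    return result

L = {"ab", "ba"}
assert star_up_to(L, 4) == {"", "ab", "ba", "abab", "abba", "baab", "baba"}

# The star of an alphabet is (a slice of) the universal language, and the
# complement of a language is a set difference against it.
universe = star_up_to({"a", "b"}, 3)
not_len2 = universe - {"aa", "ab", "ba", "bb"}     # ¬{a,b}^2, up to length 3
assert "" in not_len2 and "ab" not in not_len2
```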
It may happen that the starred and base languages are identical, as in:

L = { a^(2n) | n ≥ 0 }    L* = { a^(2n) | n ≥ 0 } = L

An interesting special case occurs when the base is an alphabet Σ; then the star Σ* contains all the strings obtained by concatenating terminal characters. This language is the same as the previously defined universal language of the alphabet Σ. It is obvious that any formal language is a subset of the universal language over the same alphabet, and the relation L ⊆ Σ* is often written to say that L is a language over the alphabet Σ. Some useful properties of star follow:

L ⊆ L* (monotonicity)
if x ∈ L* ∧ y ∈ L* then x y ∈ L* (closure by concatenation)
(L*)* = L* (idempotence)
(L*)^R = (L^R)* (commutativity with reversal)

Example 2.10 (idempotence) The monotonicity property states that any language is included in its star. But for the language L = { a^(2n) | n ≥ 0 }, the equality L* = L follows from the idempotence property and from the fact that L can be equivalently defined by the starred formula { a a }*.

For the empty language ∅ and the empty string ε, we have the set identities:

∅* = { ε }    { ε }* = { ε }

Footnote 6: The length of a sentence in Σ* is unbounded, yet it may not be considered infinite. A specialized branch of formal language theory (see Perrin and Pin [7]) is devoted to the so-called infinitary or omega-languages, which include also sentences of infinite length. They effectively model the situations when a perpetual system can receive or produce messages of infinite length.

Footnote 7: Another name for it is free monoid. In algebra, a monoid is a structure provided with an associative composition law (concatenation) and a neutral element (empty string).

Example 2.11 (identifiers) Many artificial languages assign a name or identifier to each entity, i.e., variable, file, document, subprogram, and object. A common naming rule prescribes that an identifier is a string that has the initial character in the set { A, B, …
, Z } and contains any number of letters or digits { 0, 1, …, 9 }, e.g., LOOP3A2. By using the alphabets ΣA and ΣN:

ΣA = { A, B, …, Z }    ΣN = { 0, 1, …, 9 }

the language of identifiers I ⊆ (ΣA ∪ ΣN)* is:

I = ΣA (ΣA ∪ ΣN)*

To introduce a variant, prescribe that the length of the identifiers should not exceed five. By defining the alphabet Σ = ΣA ∪ ΣN, the language is:

I5 = ΣA (Σ^0 ∪ Σ^1 ∪ Σ^2 ∪ Σ^3 ∪ Σ^4) = ΣA (ε ∪ Σ ∪ Σ^2 ∪ Σ^3 ∪ Σ^4)

The formula expresses the concatenation of the language ΣA, the sentences of which are single characters, with the language constructed as a union of powers. A more elegant writing is:

I5 = ΣA (ε ∪ Σ)^4

2.2.4.2 Cross

A useful though dispensable operator, derived from star, is the cross '+':

L+ = ∪ (h = 1 … ∞) L^h = L ∪ L^2 ∪ …

Footnote 8: Or irreflexive closure by concatenation.

Cross differs from star because the union excludes the power zero. The following relations hold:

L+ ⊆ L*    ε ∈ L+ if and only if ε ∈ L    L+ = L L* = L* L

Star and cross are called iteration operators.

Example 2.12 (language cross) Two applications of cross to finite sets:

{ a b, b b }+ = { a b, b^2, a b^3, b^2 a b, a b a b, b^4, … }
{ ε, a a }+ = { ε, a^2, a^4, … } = { a^(2n) | n ≥ 0 }

Not surprisingly, a language can usually be defined by various formulas, which differ in their use of operators.

Example 2.13 (controlling string length) Two ways of defining the strings of four or more characters:

• concatenating the strings of length 4 with arbitrary strings: Σ^4 · Σ*
• constructing the 4th power of the set of the nonempty strings: (Σ+)^4

2.2.5 Language Quotient

Operations like concatenation, star, or union lengthen the strings or increase the cardinality of the set of strings they operate upon. On the other hand, given two languages, the right quotient operation L′ /R L′′ shortens the sentences of the first language L′ by cutting a suffix, which is a sentence of the second language L′′.
The right quotient of L′ with respect to L′′ is defined as:

L = L′ /R L′′ = { y | ∃z such that y z ∈ L′ ∧ z ∈ L′′ }

Example 2.14 (quotient language) Consider the two languages L′ and L′′:

L′ = { a^(2n) b^(2n) | n > 0 }    L′′ = { b^(2n+1) | n ≥ 0 }

Their right quotients are:

L′ /R L′′ = { a^r b^s | (r ≥ 2 is even) ∧ (1 ≤ s < r is odd) } = { a^2 b, a^4 b, a^4 b^3, … }
L′′ /R L′ = ∅

A dual operation is the left quotient L′ /L L′′, which shortens the sentences of the language L′ by cutting a prefix, which is a sentence of the language L′′. More operations will be introduced later, in order to transform or translate a formal language by replacing the terminal characters with other characters or strings.

2.3 Regular Expressions and Languages

Theoretical investigation on formal languages has invented various categories of languages, in a way reminiscent of the classification of numerical domains introduced much earlier by number theory. Such categories are characterized by mathematical and algorithmic properties. The first family of formal languages to be considered is called regular (or rational), and it can be defined by an astonishing number of different approaches. Regular languages have been independently discovered in disparate scientific fields: the study of the input signals that drive a sequential circuit to a certain state, the lexicon of programming languages modeled by simple grammar rules, and the simplified analysis of neural behavior. Later, such approaches have been complemented by a logical definition based on a restricted form of predicates. To introduce the family, the first definition will be algebraic, by using the operations of union, concatenation, and star; then the family will be defined again by means of certain simple grammar rules; last, Chap. 3 describes the algorithm for recognizing regular languages in the form of an abstract machine, the finite automaton.

2.3.1 Definition of Regular Expression

A language over the alphabet Σ = { a1, a2, …
, an } is regular if it can be expressed by applying, a finite number of times, the operations of concatenation, union, and star, starting from the unitary languages { a1 }, { a2 }, …, { an } or the empty string ε (a unitary language contains one sentence). More precisely, a regular expression (r.e.) is a string r containing the terminal characters of the alphabet Σ and the following metasymbols:

‘ ∪ ’ union    ‘ · ’ concatenation    ‘ ∗ ’ star    ‘ ε ’ empty (or null) string    ‘ ( ) ’ parentheses

in accordance with the following rules, where the symbols s and t denote regular (sub)expressions:

1. r = ε — empty (or null) string
2. r = a — unitary language
3. r = ( s ∪ t ) — union of expressions
4. r = ( s · t ) or r = ( s t ) — concatenation of expressions
5. r = ( s )∗ — iteration (star) of an expression

In order to prevent confusion between terminals and metasymbols, the latter should not be in the alphabet; if not so, the metasymbols must be suitably recoded to make them distinguishable.

For expressivity, the metasymbol cross is allowed in an r.e., since it can be expressed using rules 4 and 5 as ( s )+ = ( s ( s )∗ ). For economy, some parentheses can be dropped by stipulating the following precedence between operators: first apply star, then concatenation, and last union. Furthermore, the union of three (or more) terms, e.g., ( ( r ∪ s ) ∪ t ) and ( r ∪ ( s ∪ t ) ), can be simplified to ( r ∪ s ∪ t ) because union is an associative operation. The same remark applies to concatenation. It is customary to write the cup ‘∪’ symbol as a vertical bar ‘|’, called an alternative.
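Rules 1–5 have a direct computational reading. As an illustrative sketch (the tuple encoding and the function name are ours, not from the text), an r.e. can be represented as a nested tuple and its denoted language computed up to a length bound k, so that even the star yields a finite, inspectable set:

```python
# Each rule 1-5 becomes a tuple constructor: ("eps",), ("sym", a),
# ("union", s, t), ("cat", s, t), ("star", s).
def denote(r, k: int) -> set:
    """Strings of the language denoted by r, truncated to length <= k."""
    op = r[0]
    if op == "eps":
        return {""}
    if op == "sym":
        return {r[1]} if len(r[1]) <= k else set()
    if op == "union":
        return denote(r[1], k) | denote(r[2], k)
    if op == "cat":
        return {x + y for x in denote(r[1], k)
                for y in denote(r[2], k) if len(x + y) <= k}
    if op == "star":
        base = denote(r[1], k) - {""}      # power zero handled separately
        result, last = {""}, {""}
        while last:                        # lengths strictly grow, so this stops
            last = {x + y for x in last for y in base if len(x + y) <= k}
            last -= result
            result |= last
        return result
    raise ValueError(op)
```

For instance, the r.e. ( 1 1 1 )∗ of the next section denotes, up to length 7, exactly the strings ε, 111, 111111.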
In more theoretical books on formal languages, the metasymbol ‘∅’ (empty set) is allowed in an r.e., which permits obtaining the empty string ‘ε’ from the basic operations through the identity { ε } = ∅∗. We prefer the more direct and practical presentation shown above. The rules from 1 to 5 compose the syntax of the r.e., to be formalized later by means of a grammar (Example 2.31 on p. 33). The meaning or denotation of an r.e. r is a language Lr over the alphabet Σ, defined by the correspondence shown in Table 2.1.

Table 2.1 Language Lr denoted by a regular expression r:

1. r = ε denotes Lr = { ε }
2. r = a, where a ∈ Σ, denotes Lr = { a }
3. r = s ∪ t, more often written s | t, denotes Lr = Ls ∪ Lt
4. r = s · t, more often written s t, denotes Lr = Ls · Lt
5. r = s∗ (resp. s+) denotes Lr = Ls∗ (resp. Ls+)

Example 2.15 (language of the multiples of 3) Let the alphabet be Σ = { 1 }, where character ‘1’ may be viewed as a pulse or signal. An r.e. e and the language Le it denotes are:

e = ( 1 1 1 )∗    Le = { 1^n | n mod 3 = 0 }

Language Le contains the character sequences whose length is a multiple of three. Notice that when dropping the parentheses the language changes, due to the precedence of star over concatenation:

e1 = 1 1 1∗ = 1 1 ( 1 )∗    Le1 = { 1^n | n ≥ 2 }

Example 2.16 (language of integers) Let the alphabet be Σ = { +, −, d }, where character d denotes any decimal digit 0, 1, …, 9. The expression:

e = ( + ∪ − ∪ ε ) d d∗ ≡ ( + | − | ε ) d d∗

produces the language Le = { +, −, ε } { d } { d }∗ of the integers with or without sign, such as +353, −5, 969, and +001.

Actually, the correspondence between an r.e. and its denoted language is so direct that it is customary to refer to the language Le by the r.e. e itself. We introduce the established terminology for two language families. A language is regular if it is denoted by a regular expression. The empty language ∅ is considered a regular language as well, although there is no r.e. that defines it. The collection of all regular languages is called the family REG of regular languages.
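The r.e. of Examples 2.15 and 2.16 map almost verbatim onto the notation of Python's re module. A minimal sketch (the function names are ours; fullmatch anchors the pattern, so the whole input string must belong to the language):

```python
import re

# Example 2.15: e = (111)*, strings of 1s whose length n satisfies n mod 3 = 0.
MULT3 = re.compile(r"(?:111)*")

# Example 2.16: e = (+ | - | eps) d d*, integers with an optional sign.
INTEGER = re.compile(r"[+-]?[0-9][0-9]*")

def in_mult3(s: str) -> bool:
    return MULT3.fullmatch(s) is not None

def in_integer(s: str) -> bool:
    return INTEGER.fullmatch(s) is not None
```

Note that ε is a sentence of Le in Example 2.15 (the case n = 0), and fullmatch accepts the empty string accordingly.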
Another simple family of languages is the collection of all the finite languages, called FIN. Thus, a language is in the family FIN if its cardinality is finite, as for instance the language of 32-bit binary numbers. By comparing the families REG and FIN, it is easy to see that every finite language is regular, i.e., FIN ⊆ REG. In fact, a finite language is the union of finitely many strings x1, x2, …, xk, each one being the concatenation of finitely many characters, i.e., xi = a1 a2 … a_ni. Then an r.e. producing such a finite language is simply the union of k terms xi, each one concatenating ni characters. Since family REG includes non-finite languages, too, the inclusion between the two families is proper, i.e., FIN ⊂ REG. More language families will be introduced later and compared with REG.

2.3.2 Derivation and Language

We formalize the mechanism through which a given r.e. e produces the denoted language. For the moment, we suppose that the r.e. e is fully parenthesized (except for the atomic terms), and we introduce the notion of subexpression (s.e.) in the next example:

e0 = ( a ∪ ( b b ) )∗ ( c+ ∪ ( a ∪ ( b b ) ) )

This r.e. e0 is structured as the concatenation of two parts, e1 = ( a ∪ ( b b ) )∗ and e2 = ( c+ ∪ ( a ∪ ( b b ) ) ), to be called subexpressions (of e0). In general, an s.e. f of an r.e. e is a well-parenthesized substring immediately occurring inside the outermost parentheses. This means that there is not another well-parenthesized substring of e that contains f. In the example, the substring s = ( a ∪ ( b b ) ) occurring inside e2 is not an s.e. of e0, but it is an s.e. of e2. When an r.e. is not fully parenthesized, in order to identify its subexpressions one has to insert (or to imagine) the missing parentheses, in agreement with the operator precedence. We recall that three or more terms combined by union do not need to be pairwise parenthesized, because union is an associative operation. The same remark applies to three or more concatenated terms.
A union or iteration (star and cross) operator offers different choices for producing strings. By making a choice, one obtains an r.e. that defines a less general language, which is included in the original one. We say that an r.e. is a choice of another one in the following three cases:

1. r.e. ek (with 1 ≤ k ≤ m and m ≥ 2) is a choice of the union ( e1 ∪ … ∪ ek ∪ … ∪ em )
2. r.e. e^m = e · … · e (m times, with m ≥ 1) is a choice of the star e∗ or cross e+
3. the empty string ε is a choice of the star e∗

Let e′ be an r.e. By substituting some choice for e′, a new r.e. e′′ can be derived from e′. The corresponding relation between two regular expressions e′ and e′′, called derivation (also called implication), is defined next.

Definition 2.17 (derivation) We say that an r.e. e′ derives an r.e. e′′, written e′ ⇒ e′′, if one of the two conditions below holds true:

1. r.e. e′′ is a choice of e′
2. r.e. e′ is the concatenation of m ≥ 2 s.e., as follows: e′ = e1 … ek … em, and r.e. e′′ is obtained from e′ by substituting an s.e., say ek, with a choice of ek, say e′k, as shown below: ∃ k, 1 ≤ k ≤ m, such that e′k is a choice of ek ∧ e′′ = e1 … e′k … em

Such a derivation ⇒ is called immediate as it makes exactly one choice. If two or more immediate derivations are applied in series, thus making as many choices altogether, we have a multi-step derivation. We say that an r.e. e0 derives an r.e. en in n ≥ 1 steps, written e0 ⇒^n en, if the following immediate derivations apply:

e0 ⇒ e1    e1 ⇒ e2    …    en−1 ⇒ en

The notation e′ ⇒^+ e′′ says that an r.e. e′ derives an r.e. e′′ in n ≥ 1 steps, without specifying n. If the number of steps is n = 0, then the identity e0 = en holds and says that the derivation relation is reflexive. We also write e′ ⇒^∗ e′′ if it holds e′ ⇒^+ e′′ or e′ = e′′.

Example 2.18 (immediate and multi-step derivation) Immediate derivations:

a∗ ∪ b+ ⇒ b+
a∗ ∪ b+ ⇒ a∗
( a∗ ∪ b b )∗ ⇒ ( a∗ ∪ b b ) ( a∗ ∪ b b )
Notice that the substrings of the r.e. considered must be chosen in order from external to internal, if one wants to produce all possible derivations. For instance, it would be unwise, starting from e′ = ( a∗ ∪ b b )∗, to choose ( a^2 ∪ b b )∗, because a∗ is not an s.e. of e′. Although the value 2 is a correct choice for the star, such a premature choice would rule out the derivation of a valid sentence such as a^2 b b a^3. Multi-step derivations:

a∗ ∪ b+ ⇒ a∗ ⇒ ε    that is    a∗ ∪ b+ ⇒^2 ε    or also    a∗ ∪ b+ ⇒^+ ε
a∗ ∪ b+ ⇒ b+ ⇒ b b b    or also    a∗ ∪ b+ ⇒^+ b b b

Some expressions produced by derivation from an expression r contain the metasymbols of union, star, and cross. Others just contain terminal characters or the empty string, and possibly redundant parentheses, which can be canceled. The latter expressions compose the language denoted by the r.e. The language defined by a regular expression r is:

Lr = { x ∈ Σ∗ | r ⇒^∗ x }

Two r.e. are equivalent if they define the same language. The coming example shows that different derivation orders may produce the same sentence.

Example 2.19 (derivation orders) Compare derivations 1 and 2 below:

1: a∗ ( b ∪ c ∪ d ) f+ ⇒ a a a ( b ∪ c ∪ d ) f+ ⇒ a a a c f+ ⇒ a a a c f
2: a∗ ( b ∪ c ∪ d ) f+ ⇒ a∗ c f+ ⇒ a a a c f+ ⇒ a a a c f

At line 1, the first step takes the leftmost s.e. ( a∗ ) and chooses a^3 = a a a, whereas at line 2 it takes another s.e. ( b ∪ c ∪ d ) and chooses c. Since such choices are mutually independent, they can be applied in any order. Thus, by the second step, we obtain the same r.e. a a a c f+. The third step takes s.e. f+, chooses f, and produces the sentence a a a c f. Since the last step is independent of the other two, it could be performed before, after, or between them.

2.3.2.1 Ambiguity of Regular Expressions

The next example conceptually differs from the preceding one with respect to how different derivations produce the same sentence.
Example 2.20 (ambiguous regular expression) The language over alphabet { a, b } such that every sentence contains one or more letters a is defined by:

( a ∪ b )∗ a ( a ∪ b )∗

where the compulsory presence of an a is evident. Clearly, every sentence containing two or more occurrences of a can be obtained by multiple derivations, which differ with respect to the character identified as the compulsory one. For instance, sentence a a offers two possibilities:

( a ∪ b )∗ a ( a ∪ b )∗ ⇒ ( a ∪ b ) a ( a ∪ b )∗ ⇒ a a ( a ∪ b )∗ ⇒ a a ε = a a
( a ∪ b )∗ a ( a ∪ b )∗ ⇒ ε a ( a ∪ b )∗ ⇒ ε a ( a ∪ b ) ⇒ ε a a = a a

A sentence, and the r.e. that derives it, is said to be ambiguous if, and only if, it can be obtained via two structurally different derivations. Thus, sentence a a is ambiguous, while sentence b a is not ambiguous, because only one set of choices is possible, corresponding to the derivation:

( a ∪ b )∗ a ( a ∪ b )∗ ⇒ ( a ∪ b ) a ( a ∪ b )∗ ⇒ b a ( a ∪ b )∗ ⇒ b a ε = b a

Ambiguous definitions are a source of trouble in many settings. In general, they should be avoided, although they may have the advantage of concision over unambiguous definitions and may be useful in certain applications. We would like to have an algorithm for checking whether an r.e. is ambiguous, but for that we have to wait until Chap. 3. For the time being, we just state a simple sufficient condition for a sentence and its r.e. to be ambiguous. To this end, given an r.e. f, we number its letters and obtain a numbered r.e., which we denote by f′:

f′ = ( a1 ∪ b2 )∗ a3 ( a4 ∪ b5 )∗

Notice that the numbered r.e. f′ defines a language over the numbered alphabet { a1, b2, a3, a4, b5 }, in general larger than the original one. An r.e. f is ambiguous if the language defined by the numbered r.e. f′ associated with f contains two distinct strings x and y that become identical when the numbers are erased.
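For short sentences, the numbering criterion can be tested mechanically by brute force. In the sketch below (the encoding is ours: the numbered letters a1, b2, a3, a4, b5 are written as the single characters A, B, C, D, E), we enumerate the strings of the numbered language that erase to a given word; two or more preimages witness ambiguity.

```python
import re
from itertools import product

# Numbered r.e. f' = (a1 | b2)* a3 (a4 | b5)*, encoded A=a1, B=b2, C=a3, D=a4, E=b5.
NUMBERED = re.compile(r"[AB]*C[DE]*")
ERASE = str.maketrans("ABCDE", "abaab")  # erase the numbers: A,C,D -> a; B,E -> b

def preimages(word: str) -> set[str]:
    """Strings of the numbered language that erase to `word` (same length)."""
    return {"".join(t)
            for t in product("ABCDE", repeat=len(word))
            if NUMBERED.fullmatch("".join(t))
            and "".join(t).translate(ERASE) == word}
```

Running it on the sentences of Example 2.20 confirms that a a has the two preimages a1 a3 and a3 a4, while b a has only one.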
For instance, the strings a1 a3 and a3 a4 of language f′ prove the ambiguity of the string a a of language f. The same criterion can be applied to an r.e. that includes the metasymbol ε. For instance, consider the r.e. e = ( a | ε )+. Then the numbered r.e. is e′ = ( a1 | ε2 )+, from which the two distinct strings a1 ε2 and ε2 a1 can be derived, which are mapped to the same ambiguous string a.

Notice that the sufficient condition does not cover cases like the following. The r.e. ( a+ | b )+ b, numbered ( a1+ | b2 )+ b3, ambiguously derives the string a a b with the unique numbering a1 a1 b3, yet with two distinct derivations:

( a+ | b )+ b ⇒ ( a+ | b ) ( a+ | b ) b ⇒^+ a a b
( a+ | b )+ b ⇒ ( a+ | b ) b ⇒ a+ b ⇒ a a b

Similarly, the r.e. ( a∗ | b∗ ) c has the ambiguous derivations:

( a∗ | b∗ ) c ⇒ a∗ c ⇒ ε c = c    and    ( a∗ | b∗ ) c ⇒ b∗ c ⇒ ε c = c

which are not detected by our condition. The concept of ambiguity will be thoroughly studied for grammars.

2.3.3 Other Operators

When regular expressions are used in practice, it is convenient to add to the basic operators of union, concatenation, and star the derived operators of power and cross. Moreover, for better expressivity, other derived operators may be practical:

• repetition from k ≥ 0 to n > k times: [ a ]_k^n = a^k ∪ a^k+1 ∪ … ∪ a^n
• option: [ a ] = [ a ]_0^1 = a^0 ∪ a^1 = ε ∪ a
• interval of an ordered set: the short notation ( 0 … 9 ) represents any digit in the ordered set { 0, 1, …, 9 }; similarly, the notations ( a … z ) and ( A … Z ) respectively represent any lowercase/uppercase letter

Sometimes other set operations are also used: intersection, set difference, and complement; the regular expressions using such operators are called extended, although the name is not standard and one has to specify the operators allowed case by case.

Example 2.21 (extended r.e. with intersection) This operator provides a straightforward formulation when a string, to be valid, must obey two or more conditions.
To illustrate, let the alphabet be { a, b } and assume a valid string must (i) contain the substring b b and (ii) have an even length. Condition (i) is imposed by the r.e.:

( a | b )∗ b b ( a | b )∗

condition (ii) by the r.e.:

( ( a | b )^2 )∗

and the language is defined by the r.e. extended with intersection:

( ( a | b )∗ b b ( a | b )∗ ) ∩ ( ( a | b )^2 )∗

The same language can be defined by a basic r.e. without intersection, but the formula is more complicated. It says that the substring b b can be surrounded by two strings, both of even length or both of odd length:

( ( a | b )^2 )∗ b b ( ( a | b )^2 )∗ | ( a | b ) ( ( a | b )^2 )∗ b b ( a | b ) ( ( a | b )^2 )∗

Furthermore, it is sometimes simpler to define the sentences of a language ex negativo, by stating a property they should not have.

Example 2.22 (extended r.e. with complement) Let L be the set of strings over alphabet { a, b } not containing a a as substring. Its complement is:

¬L = { x ∈ ( a | b )∗ | x contains substring a a }

easily defined by the r.e. ( a | b )∗ a a ( a | b )∗, whence an extended r.e. for L is:

L = ¬( ( a | b )∗ a a ( a | b )∗ )

We also show a definition of language L by a basic r.e.:

L = ( a b | b )∗ ( a | ε )

which most readers are likely to find less readable. Actually, it is no coincidence that the languages in Examples 2.21 and 2.22 also admit an r.e. without intersection or complement. A theoretical result, to be presented in Chap. 3, states that the language produced by any r.e. extended with complement and intersection is regular, and therefore by definition can be produced by a non-extended r.e. as well.

2.3.4 Closure Properties of Family REG

Let op be an operator to be applied to one or two languages, to produce another language. A language family is closed under operator op if the language obtained by applying op to any languages of the family belongs to the same family.
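Operationally, the extended operators of Examples 2.21 and 2.22 are easy to simulate with plain patterns: intersection as the conjunction of two membership tests, complement as the negation of one. A minimal sketch (the function names are ours; the inputs are assumed to range over { a, b }):

```python
import re

HAS_BB = re.compile(r"[ab]*bb[ab]*")   # condition (i): contains bb
EVEN = re.compile(r"(?:[ab][ab])*")    # condition (ii): even length

def in_intersection(s: str) -> bool:
    """Member of the intersection-extended r.e. iff both plain r.e. accept."""
    return HAS_BB.fullmatch(s) is not None and EVEN.fullmatch(s) is not None

HAS_AA = re.compile(r"[ab]*aa[ab]*")   # the complement language of Example 2.22

def in_no_aa(s: str) -> bool:
    """Member of L = complement of (a|b)* aa (a|b)* iff that r.e. rejects."""
    return HAS_AA.fullmatch(s) is None
```

This mirrors the semantics of the extended operators exactly, without having to write the more complicated basic r.e. by hand.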
Property 2.23 (closure of REG) The family REG of regular languages is closed under the operators of concatenation, union, and star; therefore, it is closed also under any derived operator, such as cross.

This closure property descends from the very definitions of r.e. and of family REG (p. 20). In spite of its theoretical connotation, Property 2.23 is very relevant in practice: two regular languages can be combined by using the above operations, at no risk of losing the nice features of family REG. This has an important practical consequence: it permits the compositional design of the algorithms used to check whether an input string is valid for a language. Furthermore, we anticipate that the family REG is closed under intersection, complement, and reversal too, which will be proved later. From the above property, we derive an alternative definition of family REG.

Property 2.24 (characterization of REG) The family of regular languages, REG, is the smallest language family such that: (i) it contains all finite languages and (ii) it is closed under concatenation, union, and star.

Proof By contradiction, assume there is a smaller family F, i.e., F ⊂ REG, that meets conditions (i) and (ii). Take any language Le ∈ REG defined by an r.e. e. Language Le is obtained by starting from unitary languages and orderly applying the operators present in e. From the assumption, it follows that Le belongs also to family F. Then F contains every regular language and thus contradicts the strict inclusion F ⊂ REG.

We anticipate that other language families exist which are closed under the same operators of Property 2.23. Chief among them is the family CF of context-free languages, to be introduced soon. From Property 2.24, a (strict) containment relation follows between these two families, i.e., REG ⊂ CF.
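Property 2.23 also has an operational flavor: given the pattern texts of two regular languages, the patterns of their union, concatenation, and star are obtained purely syntactically, with no change of formalism. A sketch with Python pattern strings (our own helpers; the non-capturing groups (?: ) serve only as parentheses):

```python
import re

def union(r: str, s: str) -> str:
    """Pattern of the union of the two languages."""
    return f"(?:{r})|(?:{s})"

def concat(r: str, s: str) -> str:
    """Pattern of the concatenation of the two languages."""
    return f"(?:{r})(?:{s})"

def star(r: str) -> str:
    """Pattern of the star of the language."""
    return f"(?:{r})*"

# The combination is again a single pattern: the family is closed.
pattern = re.compile(star(union("ab", "ba")))
```

For instance, the pattern above recognizes exactly the concatenations of blocks ab and ba, including the empty string.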
2.4 Linguistic Abstraction: Abstract and Concrete Lists

If one recalls the programming languages one is familiar with, one may observe that, although superficially different in their use of keywords and separators, they are often quite similar at a deeper level. By shifting focus from concrete to abstract syntax, we can reduce the bewildering variety of language constructs to a few essential structures. The verb ‘to abstract’ means:

to consider a concept without thinking of a specific example. (WordNet 2.1)

By abstracting away from the actual characters that represent a language construct, we perform a linguistic abstraction. This is a language transformation that replaces the terminal characters of the concrete language with others taken from an abstract alphabet. Abstract characters should be simpler and suitable to represent similar constructs from different artificial languages. (The idea of language abstraction is inspired by the research line in linguistics that aims at discovering the underlying similarities between human languages, disregarding morphological and syntactic differences.) By this approach, the abstract syntax structures of existing artificial languages are easily described as compositions of a few elementary paradigms, by means of standard language operations: union, iteration, and substitution (defined later). Starting from the abstract language, a concrete or real language is obtained by the reverse transformation, metaphorically also called ‘coating with syntax sugar.’

Factoring a language into its abstract and concrete syntax pays off in several ways. When studying different languages, it affords much conceptual economy. When designing compilers, abstraction helps for portability across different languages, if the compiler functions are designed to process abstract, instead of concrete, language constructs.
Thus, parts of, say, a C compiler can be reused for similar languages, like Java. Since the number of abstract paradigms in use is surprisingly small, we will be able to present most of them in this chapter, starting from the ones conveniently specified by regular expressions, i.e., the lists.

An abstract list contains an unbounded number of elements e of the same type. It is defined by the r.e. e+, or by the r.e. e∗ if an empty list is permitted. For the time being, a list element is viewed as a terminal character. In later refinements, however, an element may be a string from another formal language; for instance, think of a list of numbers.

2.4.1 Lists with Separators and Opening/Closing Marks

In many real cases, adjacent elements must be divided by a string called a separator, denoted s in abstract syntax. Thus, in a list of numbers, a separator is needed to delimit the end of a number and the beginning of the next one. A list with separators is defined by the r.e. e ( s e )∗, saying that the first element e can be followed by zero or more pairs s e. The equivalent definition ( e s )∗ e differs by giving evidence to the last element.

In many concrete cases, there is another requirement, intended for legibility or computer processing: to make the start and end of the list easily recognizable by respectively prefixing and suffixing some special signs: in the abstract, the initial character or opening mark i, and the final character or closing mark f. Lists with separators and opening/closing marks are defined as:

i e ( s e )∗ f

Example 2.25 (some concrete lists) Lists are everywhere in languages, as shown by typical examples.

Instruction block, as in begin instr1 ; instr2 ; … ; instrn end, where instr may stand for assignment, goto, if-statement, write-statement, etc.; the corresponding abstract and concrete terms are:

abstract alphabet: i e s f
concrete alphabet: begin instr ; end

Procedure parameters, as in procedure WRITE ( par1 , par2 , … ,
parn ), where the abstract symbols correspond as: i = procedure WRITE (, e = par, s = ‘ , ’, f = ‘ ) ’. Furthermore, should an empty parameter list be legal, as in a declaration like procedure WRITE ( ), the r.e. becomes i [ e ( s e )∗ ] f, with the element part optional.

Array definition, as in array MATRIX [ int1 , int2 , … , intn ], where each int is an interval such as 10 … 50, and the marks are i = array MATRIX [, s = ‘ , ’, f = ‘ ] ’.

2.4.2 Language Substitution

The above examples have illustrated the mapping from concrete to abstract symbols. Moreover, language designers find it useful to work by stepwise refinement, as done in any branch of engineering when a complex system is divided into its components, atomic or otherwise. To this end, we introduce the new language operation of substitution, which replaces a terminal character of a language, called the source, with a sentence of another language, called the target. As always, the source alphabet is Σ and the source language is L ⊆ Σ∗. Consider a sentence of L containing one or more occurrences of a source character b:

x = a1 a2 … an ∈ L    where ai = b for some 1 ≤ i ≤ n

Let Δ be another alphabet, called target, and let Lb ⊆ Δ∗ be the image language of b. The substitution of language Lb for b in string x produces a set of strings:

{ y | y = y1 y2 … yn ∧ ( if ai ≠ b then yi = ai else yi ∈ Lb ) }

which is a language over the alphabet ( Σ \ { b } ) ∪ Δ. Notice that all the characters other than b are left unchanged. By the usual approach, we can extend the substitution operation from one string to the whole source language, by applying it to every source sentence.

Example 2.26 (Example 2.25 continued) Resuming the case of a parameter list, the abstract syntax is:

i e ( s e )∗ f

and the substitutions to be applied are tabulated below:

i — image language Li = procedure ⟨procedure identifier⟩ (
e — image language Le = ⟨parameter identifier⟩
s — image language Ls = ‘ , ’
f — image language Lf = ‘ ) ’

Thus, the opening mark i is replaced with a string of language Li, where the procedure identifier has to agree with the reference manual of the programming language.
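For finite image languages, the substitution operation can be executed exhaustively. The sketch below (our own helper, not from the text) replaces every occurrence of each abstract character with any sentence of its image language, leaving the other characters unchanged:

```python
from itertools import product

def substitute(word: str, images: dict[str, set[str]]) -> set[str]:
    """Apply a substitution to one source sentence: each character c that has
    an image language images[c] is replaced by any of its sentences;
    characters not in `images` stay unchanged."""
    choices = [images.get(c, {c}) for c in word]
    return {"".join(p) for p in product(*choices)}
```

Applied to the abstract sentence i e s e f (written "iesef") with the image languages of Example 2.26 shrunk to finite samples, it produces concrete parameter lists such as WRITE(x,y).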
Clearly, the target languages of such substitutions depend on the ‘syntax sugar’ of the intended concrete language. Notice that the four above substitutions are independent of one another and can be applied in any order.

Example 2.27 (identifiers with dash) In certain programming languages, long mnemonic identifiers can be constructed by appending alphanumeric strings separated by a (low) dash ‘ _ ’: thus, string LOOP3_OF_35 is a legal identifier. More precisely, the leftmost word, LOOP3, must initiate with a letter, the others may also initiate with a digit; adjacent dashes, as well as a trailing dash, are forbidden. As a first approximation, each language sentence is a nonempty list of elements e, which are alphanumeric words, separated by a dash:

e ( ‘ _ ’ e )∗

However, the first word must be different from the others and may be taken to be the opening mark i of a possibly empty list:

i ( ‘ _ ’ e )∗

By substituting for i the language ( A … Z ) ( A … Z | 0 … 9 )∗, and for e the language ( A … Z | 0 … 9 )+, the final r.e. is obtained. This is an overly simple instance of syntax design by abstraction and stepwise refinement, a method to be further developed now and after the introduction of grammars. Other language transformations are studied in Chap. 5.

2.4.3 Hierarchical or Precedence Lists

A frequently recurrent construct is a list such that each element is in turn a list of a different type. The first list is attached to level 1, the second to level 2, and so on. However, the present abstract paradigm, called hierarchical list, is restricted to lists having a bounded number of levels. The case of unbounded levels is studied later by using grammars, under the name of nested structures. A hierarchical list is also called a list with precedence, because the list at level l ≥ 2 binds its elements more strongly than the list at level l − 1.
In other words, the elements at the higher level, i.e., with higher precedence (or more priority), must be assembled into a list, and each such list becomes an element at the next lower level, i.e., with lower priority. Of course, each level may have its own opening/closing marks and separator. Such delimiters are usually distinct level by level, in order to avoid confusion. The structure of a hierarchical list with k ≥ 2 levels is:

list1 = i1 list2 ( s1 list2 )∗ f1 — level 1, lowest precedence/priority
list2 = i2 list3 ( s2 list3 )∗ f2 — level 2
…
listk = ik ek ( sk ek )∗ fk — level k, highest precedence/priority

In this schema, only the highest level may contain atomic elements, ek. But a common variant permits, at any level l (1 ≤ l < k), atomic elements el to occur side by side with lists of level l + 1. A few concrete examples follow.

Example 2.28 (two hierarchical lists) Structures in programming languages:

Block of print instructions, as in begin instr1 ; instr2 ; … ; instrn end, where instr is a print instruction, e.g., WRITE ( var1 , var2 , … , varn ), i.e., a list (from Example 2.25); there are two levels (k = 2):

level 1 (low): list of instructions instr separated by semicolon ‘;’, opened by begin and closed by end
level 2 (high): list of variables var separated by comma ‘,’, with i2 = WRITE ( and f2 = )

Arithmetic expression without parentheses: the precedences of the arithmetic operators determine how many levels the hierarchy has; for instance, the operators ‘×’, ‘÷’ and ‘+’, ‘−’ are layered on two levels (k = 2), and the expression:

3 + 5 × 7 × 4 − 8 × 2 ÷ 5 + 8 + 3

(whose terms are 3, 5 × 7 × 4, 8 × 2 ÷ 5, 8, and 3, each term being a list of numbers n)
is a two-level list, with neither opening nor closing marks; at level 1, we find a list of terms (term = list2), separated by the signs ‘+’ and ‘−’, i.e., by the low-precedence operators; at level 2, we see a list of numbers (n = e2), separated by the signs ‘×’ and ‘÷’, i.e., by the high-precedence operators; one may go further and introduce a level 3 with an exponentiation sign ‘∗∗’ as separator.

Of course, hierarchical structures are omnipresent in natural languages as well. Think of a list of nouns:

father, mother, son and daughter

Here we may observe a difference with respect to the abstract paradigm: the penultimate element uses a distinct separator, ‘and’, possibly in order to warn the listener of an utterance that the list is approaching its end. Furthermore, the items in the list may be enriched by second-level qualifiers, such as a list of adjectives. In all sorts of documents and written media, hierarchical lists are extremely common. For instance, a book is a list of chapters, separated by white pages, between a front and a back cover. A chapter is a list of sections, a section is a list of paragraphs, and so on.

2.5 Context-Free Generative Grammars

We start the study of the family of context-free languages, which plays the central role in compilation. Initially invented by linguists for natural languages in the 1950s, context-free grammars have proved themselves extremely useful in computer science applications: all existing technical languages have been defined using such grammars. Moreover, since the early 1960s, efficient algorithms have been found to analyze, recognize, and translate sentences of context-free languages. This chapter presents the relevant properties of context-free languages, illustrates their application by many typical cases, and compares and combines them with regular languages.
At last, we position the context-free model in the classical hierarchy of grammars and computational models due to Chomsky, and we briefly examine some context-sensitive models.

2.5.1 Limits of Regular Languages

Regular expressions are very practical for describing lists and related paradigms, but fall short of the capacity needed to define other frequently occurring constructs. A relevant case are the block structures (or nested parentheses) present in many technical languages, schematized by:

begin begin begin … end begin … end … end end

where the outer block encloses two inner blocks.

Example 2.29 (simple block structure) Let the alphabet be shortened to { b, e } and consider a quite limited case of nested structure, such that all opening marks precede all closing marks. Clearly, the opening/closing marks must have identical count:

L1 = { b^n e^n | n ≥ 1 }

We argue that language L1 cannot be defined by a regular expression, but we have to defer the formal proof to a later section. In fact, since each string must have all letters b left of any letter e, either we write an overly general r.e. such as b+ e+, which unfortunately defines illegal strings like b^3 e^5, or we write a too restricted r.e. that exhaustively lists a finite sample of strings up to a bounded length. On the other hand, if we comply with the condition that the count of the two letters is the same by writing ( b e )+, illegal strings like b e b e creep in. For defining this and other useful languages, regular or not, we move to the formal model of generative grammars.

2.5.2 Introduction to Context-Free Grammars

A generative grammar or syntax is a set of simple rules that can be repeatedly applied in order to generate all and only the valid strings. (Sometimes the term grammar has a broader connotation than syntax, as when some rules for computing the meaning of sentences are added to the rules for enumerating them. When necessary, the intended meaning of the term will be made clear.)

Example 2.30 (palindromes) The language L, over the alphabet Σ = { a, b }, is first defined, using the reversal operation, as:

L = { u u^R | u ∈ Σ∗ } = { ε, a a, b b, a b b a, b a a b, …, a b b b b a, …
}

It contains strings having specular symmetry, called palindromes, of even length. The following grammar G contains three rules:

pal → ε    pal → a pal a    pal → b pal b

The arrow ‘→’ is a metasymbol, exclusively used to separate the left part of a rule from the right part. To derive the strings, just replace the symbol pal, called a nonterminal, with the right part of a rule, for instance:

pal ⇒ a pal a ⇒ a b pal b a ⇒ a b b pal b b a ⇒ …

The derivation process can be chained and terminates when the last string obtained no longer contains a nonterminal symbol. At that moment, the generation of the sentence is concluded. We complete the derivation:

a b b pal b b a ⇒ a b b ε b b a = a b b b b a

Incidentally, the language of palindromes is not regular.

Next, we enrich the example into a list of palindromes separated by commas, exemplified by the sentence ‘a b b a , b b a a b b , a a’. The grammar adds two list-generating rules to the previous ones:

list → pal ‘ , ’ list    list → pal    pal → ε    pal → a pal a    pal → b pal b

The first of the new rules says: the concatenation of a palindrome, a comma, and a list produces a (longer) list; notice that the comma is between quotes, to stress that it is a terminal symbol like a and b. The second says that a list can be made of one palindrome. Now there are two nonterminal symbols: list and pal. The former is termed the axiom because it defines the intended language. The latter defines certain component substrings, also called constituents of the language: the palindromes.

Example 2.31 (metalanguage of regular expressions) A regular expression that defines a language over a fixed terminal alphabet, say Σ = { a, b }, is a formula, i.e., a string over the alphabet Σr.e.
= { a, b, ‘∪’, ‘∗’, ‘ε’, ‘(’, ‘)’ }

where the quotes around the r.e. metasymbols indicate their use as grammar terminals (concatenation ‘·’ could be added as well). Such formulas are in turn the sentences of the language L_r.e. defined by the next grammar. Looking back at the definition of r.e. on p. 18, we write the rules of grammar G_r.e.:

1: expr → ‘ε’
2: expr → a
3: expr → b
4: expr → ‘(’ expr ‘∪’ expr ‘)’
5: expr → ‘(’ expr expr ‘)’
6: expr → ‘(’ expr ‘)’ ‘∗’

where the numbering is only for reference. A derivation is (quotes are omitted, and the subscript of each arrow is the number of the rule applied):

expr ⇒_4 ( expr ∪ expr ) ⇒_5 ( ( expr expr ) ∪ expr ) ⇒_2 ( ( a expr ) ∪ expr )
⇒_6 ( ( a ( expr )∗ ) ∪ expr ) ⇒_4 ( ( a ( ( expr ∪ expr ) )∗ ) ∪ expr )
⇒_2 ( ( a ( ( a ∪ expr ) )∗ ) ∪ expr ) ⇒_3 ( ( a ( ( a ∪ b ) )∗ ) ∪ expr )
⇒_3 ( ( a ( ( a ∪ b ) )∗ ) ∪ b ) = e

Since the last string can be interpreted as an r.e. e, it denotes a second language L_e over the alphabet Σ, which is the set of the strings starting with letter a, plus the string b:

L_e = { a, b, a a, a b, a a a, a b a, . . . }

A word of caution: this example displays two language levels, since the syntax defines certain strings that are in turn to be understood as definitions of other languages. To avoid terminological confusion, we say that the syntax G_r.e. stays at the metalinguistic level, that is, above the linguistic level, or, equivalently, that the syntax is a metagrammar. To set the two levels apart, it helps to consider the alphabets: at the metalevel the alphabet is Σ_r.e. = { a, b, ‘∪’, ‘∗’, ‘ε’, ‘(’, ‘)’ }, whereas the final language has alphabet Σ = { a, b }, devoid of metasymbols. An analogy with human language may also clarify the issue. A grammar of Russian can be written, say, in English; then it contains both Cyrillic and Latin characters. Here English is the metalanguage and Russian the final language, which only contains Cyrillic characters.
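The claim that such formulas are exactly the sentences of L_r.e. can also be checked mechanically. The following Python sketch (ours, not the book's; the function names is_formula and expr are our own choices) is a recursive-descent recognizer that follows the six rules of G_r.e. literally, with the Unicode characters ‘∪’, ‘∗’, and ‘ε’ as terminals:

```python
# Recognizer for the metalanguage L_r.e. of Example 2.31 (illustrative sketch).
# Rules followed literally:
#   1: expr -> 'ε'                2: expr -> a              3: expr -> b
#   4: expr -> ( expr ∪ expr )    5: expr -> ( expr expr )  6: expr -> ( expr )∗

def expr(s: str, i: int) -> int:
    """Parse one expr starting at position i; return the position just after it."""
    if i < len(s) and s[i] in ('a', 'b', 'ε'):      # rules 1, 2, 3
        return i + 1
    if i < len(s) and s[i] == '(':
        j = expr(s, i + 1)                          # first subexpression
        if j < len(s) and s[j] == '∪':              # rule 4: union
            k = expr(s, j + 1)
            if k < len(s) and s[k] == ')':
                return k + 1
        elif j < len(s) and s[j] == ')':            # rule 6: star
            if j + 1 < len(s) and s[j + 1] == '∗':
                return j + 2
        else:                                       # rule 5: concatenation
            k = expr(s, j)
            if k < len(s) and s[k] == ')':
                return k + 1
    raise ValueError(f'not a formula at position {i}')

def is_formula(s: str) -> bool:
    """True iff the whole string s is a sentence of L_r.e."""
    try:
        return expr(s, 0) == len(s)
    except ValueError:
        return False
```

Note the design choice: after the first subexpression inside a parenthesis, a single lookahead symbol (‘∪’, ‘)’, or anything else) decides among rules 4, 6, and 5, so no backtracking is needed; bare concatenations such as a b are rejected because rule 5 demands enclosing parentheses.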
Another example of language versus metalanguage is provided by XML, the metanotation used to define a variety of Web document types such as HTML.

Definition 2.32 (context-free grammar) A context-free (CF) (or type 2 or BNF^16) grammar G is defined by four entities:

V — the nonterminal alphabet, a set of symbols termed nonterminals or metasymbols
Σ — the terminal alphabet, a set of symbols termed terminals, disjoint from the preceding set
P — a set of syntactic rules (or productions)
S ∈ V — a particular nonterminal termed the axiom

A rule of set P is an ordered pair X → α, with X ∈ V and α ∈ ( V ∪ Σ )∗. Two or more rules (n ≥ 2):

X → α1    X → α2    . . .    X → αn

with the same left part X can be concisely grouped in:

X → α1 | α2 | · · · | αn    or    X → α1 ∪ α2 ∪ · · · ∪ αn

We say that the strings α1, α2, …, αn are the alternatives of X.

2.5.3 Conventional Grammar Representations

To prevent confusion, the metasymbols ‘→’, ‘|’, ‘∪’, and ε should not be used for terminal or nonterminal symbols (Example 2.31 is a motivated exception). Moreover, the terminal and nonterminal alphabets must be disjoint, i.e., Σ ∩ V = ∅.

^16 Type 2 comes from the Chomsky classification (p. 105). Backus Normal Form, or also Backus–Naur Form, comes from the names of John Backus and Peter Naur, who pioneered the use of such grammars for defining programming languages.
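The four entities of Definition 2.32 map directly onto a data structure. The following Python sketch (ours; all names are our own choices) stores a grammar as the 4-tuple (V, Σ, P, S) and checks the well-formedness conditions just stated, using the palindrome grammar of Example 2.30 as the instance:

```python
# A context-free grammar as the 4-tuple (V, Sigma, P, S) of Definition 2.32
# (illustrative sketch). Each rule is a pair (X, alpha), where alpha is a
# tuple of symbols; the empty tuple encodes the empty string ε.

def is_well_formed(V, Sigma, P, S):
    """Check the conditions of Definition 2.32 on the 4-tuple."""
    if V & Sigma:                       # alphabets must be disjoint: Σ ∩ V = ∅
        return False
    if S not in V:                      # the axiom is a nonterminal
        return False
    for X, alpha in P:
        if X not in V:                  # left part: a single nonterminal
            return False
        if any(sym not in V | Sigma for sym in alpha):  # right part in (V ∪ Σ)*
            return False
    return True

# Palindrome grammar of Example 2.30: pal -> ε | a pal a | b pal b
V = {'pal'}
Sigma = {'a', 'b'}
P = [('pal', ()), ('pal', ('a', 'pal', 'a')), ('pal', ('b', 'pal', 'b'))]
S = 'pal'
```

Grouping alternatives with ‘|’ is only a notational convenience: the list P above already stores each alternative of pal as a separate rule.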
Table 2.2 Different typographic styles for writing grammar rules

Style 1 — Nonterminal: a compound word between angle brackets, e.g., ⟨if sentence⟩. Terminal: a word written as it is, without any special marks. Example:
⟨if sentence⟩ → if ⟨cond⟩ then ⟨sentence⟩ else ⟨sentence⟩

Style 2 — Nonterminal: a word written as it is, without any special marks; it may not contain blank spaces, e.g., sentence or sentence_list. Terminal: a word written in bold, in italic, or quoted, e.g., then or ‘then’. Example:
if_sentence → if cond then sentence else sentence (terminals in bold or italic), or
if_sentence → ‘if’ cond ‘then’ sentence ‘else’ sentence

Style 3 — Nonterminal: an uppercase Latin letter. Terminal: a word written as it is, without any special marks. Example:
F → if C then D else D

In professional and scientific practice, different styles are used to represent terminals and nonterminals, as specified in Table 2.2. In the first style, the grammar of Example 2.30 becomes:

⟨sentence⟩ → ε
⟨sentence⟩ → a ⟨sentence⟩ a
⟨sentence⟩ → b ⟨sentence⟩ b

Alternative rules may be grouped together:

⟨sentence⟩ → ε | a ⟨sentence⟩ a | b ⟨sentence⟩ b

If a technical grammar is large, of the order of a few hundred rules, it should be written with care, to facilitate searching for specific definitions, making changes, and cross-referencing. The nonterminals should be identified by self-explanatory names, and the rules should be divided into sections and numbered for reference. On the other hand, in very simple examples the third style of Table 2.2 is more suitable, that is, with disjoint short symbols for terminals and nonterminals. In this book, for simplicity, we often adopt the following style:

• lowercase Latin letters near the beginning of the alphabet { a, b, . . . } for terminal characters
• uppercase Latin letters { A, B, . . . , Z } for nonterminal symbols
• lowercase Latin letters near the end of the alphabet { . . . , r, s, . . .
, z } for strings over the alphabet Σ, i.e., including only terminals
• lowercase Greek letters { α, β, . . . , ω } for strings over the combined alphabet V ∪ Σ

Rule Types

In grammar studies, the rules may be classified depending on their form, with the aim of making the study of language properties more immediate. For future reference, Table 2.3 lists some common rule forms along with their technical names.

Table 2.3 Classification of grammar rule forms

Terminal — either RP contains only terminals or is the empty string: → u | ε
Empty or null — RP is the empty string: → ε
Initial or axiomatic — LP is the grammar axiom: S → . . .
Recursive — LP occurs in RP: A → α A β
Self-nesting or self-embedding — LP occurs in RP between nonempty strings: A → α A β, with α, β ≠ ε
Left-recursive — LP is a prefix of RP: A → A β
Right-recursive — LP is a suffix of RP: A → β A
Left-right-recursive or two-side recursive — the two cases above together: A → A β A
Copy, renaming, or categorization — RP consists of one nonterminal: A → B
Identity — LP and RP are identical: A → A
Linear — RP contains at most one nonterminal (the rest is terminal): → u B v | w
Left-linear or type 3 — as the linear case, but the nonterminal is a prefix: → B v | w
Right-linear or type 3 — as the linear case, but the nonterminal is a suffix: → u B | w
Homogeneous normal — RP consists of either n ≥ 2 nonterminals or one terminal: → A1 . . . An | a
Chomsky normal or homogeneous of degree 2 — RP consists of either two nonterminals or one terminal: → B C | a
Greibach normal or real-time — RP is one terminal possibly followed by nonterminals: → a σ | b
Operator form — RP consists of two nonterminals separated by one terminal, which is called the operator (more generally, RP is devoid of adjacent nonterminals): → A a B
Each rule form in Table 2.3 is schematized with symbols adhering to the following stipulations: a and b are terminals; u, v, and w denote strings of terminals and may be empty; A, B, and C are nonterminals; α and β denote strings containing a mix of terminals and nonterminals, and may be empty; lastly, σ denotes a string of nonterminals. The classification of Table 2.3 is based on the form of the right part RP of a rule, except for the recursive rules, which may also consider the left part LP. We omit all the rule parts that are irrelevant for the classification. By the Chomsky classification, the left-linear and right-linear forms are also known as type 3 grammars. Most rule forms listed in Table 2.3 will occur in the book, and the remaining ones are listed for general reference. We will see that some of the grammar forms can be forced on any given grammar while leaving the language unchanged. Such forms are called normal.

2.5.4 Derivation and Language Generation

We reconsider and formalize the notion of string derivation. Let β = δ A η be a string containing a nonterminal A, where δ and η are any strings, possibly empty. Let A → α be a rule of grammar G, and let γ = δ α η be the string obtained by replacing the nonterminal A in β with the rule right part α. The relation between two such strings is called derivation. We say that string β derives string γ for grammar G, and we write:

β ⇒_G γ

or simply β ⇒ γ when the grammar name is understood. Rule A → α is applied in such a derivation, and string α reduces to nonterminal A. Now consider a chain of derivations of length n ≥ 0:

β0 ⇒ β1 ⇒ · · · ⇒ βn

which can be shortened to:

β0 ⇒^n βn

If n = 0, for every string β we posit β ⇒^0 β, that is, the derivation relation is reflexive. To express derivations of any length, we write:

β0 ⇒^∗ βn    or    β0 ⇒^+ βn

if the length of the chain is n ≥ 0 or n ≥ 1, respectively.
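The one-step relation β ⇒ γ is easy to execute. The Python sketch below (ours, not the book's) applies a chosen rule A → α at the leftmost occurrence of A, and chains steps to reproduce the palindrome derivation of Example 2.30:

```python
# One derivation step: gamma = delta alpha eta is obtained from beta = delta A eta
# by replacing the leftmost occurrence of A (an illustrative sketch with our own
# names). Strings over V ∪ Σ are kept as lists of symbols.

def derive_step(beta, rule):
    """Apply rule (A, alpha) to the leftmost A in beta; return the derived string."""
    A, alpha = rule
    i = beta.index(A)                   # leftmost occurrence of the nonterminal
    return beta[:i] + list(alpha) + beta[i + 1:]

# Palindrome grammar of Example 2.30, rules numbered 0..2.
rules = [('pal', []),                   # pal -> ε
         ('pal', ['a', 'pal', 'a']),    # pal -> a pal a
         ('pal', ['b', 'pal', 'b'])]    # pal -> b pal b

# Chain: pal => a pal a => a b pal b a => a b b pal b b a => a b b b b a
form = ['pal']
for r in (1, 2, 2, 0):
    form = derive_step(form, rules[r])
# ''.join(form) == 'abbbba'
```

Choosing the leftmost occurrence is only one possible strategy; the derivation relation itself allows the replaced nonterminal to occur anywhere in β.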
The language generated or defined by a grammar G, starting from nonterminal A, is the set of the terminal str