file: 04.1.syntax.txt
author: Bob Muller

CS3366 Programming Languages
Lecture Notes for Meeting 1 of Week 4

Today:
- Syntax, Grammars and Derivations

Context:
- We're interested in the design, specification and implementation of
  programming languages.
- We're more or less following the slogan: Language = Syntax + Semantics.
- We don't have much to say about the design of syntax other than the
  common-sense idea that it shouldn't interfere with understanding and,
  if you're lucky, it should be in reasonable taste.

Specification of Syntax

Notation: We're going to use the LaTeX symbols \alpha, \beta, etc. for
the Greek letters that we cannot render in plain text.

Consider a finite set of symbols T = {a, b, c, ...}. We're going to
view a language over T as a subset of T* (i.e., a set of sequences of
symbols drawn from T). T* itself has no constraints, so we'll employ a
device, a context-free grammar (CFG), to define a reasonable subset
of T*.

A context-free grammar G is a 4-tuple

  G = (N, T, P, S)

where N is a finite set of nonterminal symbols, T is a finite set of
terminal symbols, P \subseteq N X (N U T)* is a finite set of
productions, and S \in N is a distinguished element of N called the
start symbol.

We'll follow well-established conventions, using

- uppercase letters A, B, ... to range over N,
- lowercase letters at the front of the alphabet a, b, ... to range over T,
- lowercase letters at the end of the alphabet w, u, v to range over T*,
- Greek letters \alpha, \beta, \gamma to range over (N U T)*.

Example:

  G0 = ({S, A}, {a, b}, {(S, aA), (A, aA), (A, b)}, S)

The elements of P are usually written as follows:

  S --> aA                     S ::= aA
  A --> aA    or sometimes as  A ::= aA
  A --> b                      A ::= b

Given our conventions, we can say things like A ::= \alpha \in P.

Derives in One Step

Let A ::= \gamma be in P. Then \alpha A \beta ==> \alpha \gamma \beta.
The double arrow is pronounced "derives in one step".

Example: For G0 we have that aaA ==> aaaA, using \alpha = aa,
\beta = the empty string and \gamma = aA.
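For a small grammar like G0, the derives-in-one-step relation is easy to compute. Here is a minimal Python sketch; the dictionary encoding of the productions and the name derive_one_step are our own illustrative choices, not part of the notes.

```python
# Grammar G0, with nonterminals as uppercase single-character strings
# and terminals as lowercase ones.  Each key maps a nonterminal to the
# list of right-hand sides of its productions.
G0 = {
    "S": ["aA"],          # S ::= aA
    "A": ["aA", "b"],     # A ::= aA | b
}

def derive_one_step(productions, form):
    """Return every string reachable from `form` in one ==> step:
    each occurrence of a nonterminal, replaced by each of its
    right-hand sides."""
    results = []
    for i, symbol in enumerate(form):
        if symbol in productions:                      # a nonterminal
            for rhs in productions[symbol]:
                results.append(form[:i] + rhs + form[i + 1:])
    return results

print(derive_one_step(G0, "aaA"))  # ['aaaA', 'aab']
```

The first result corresponds to the example in the notes, aaA ==> aaaA (using A ::= aA); the second uses A ::= b instead.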
Derives in Zero or More Steps

The relation ==>* is the reflexive and transitive closure of ==>.

Example: For G0 we have that aaA ==>* aaA (derives in zero steps),
that aaA ==>* aaaA (derives in one step) and that aaA ==>* aaab
(derives in two steps).

Definition: sentential form

Let G = (N, T, P, S) be a CFG. Then the set of sentential forms of G is

  { \alpha | S ==>* \alpha }

Intuitively, a sentential form \alpha is a string of terminals and
nonterminals derivable from the start symbol.

Definition: L(G) -- the language of G

Let G = (N, T, P, S) be a CFG. Then the language defined by G is

  L(G) = { w | S ==>* w }

Definition: parser for L(G)

A parser for L(G) is a function parse(G, w) that returns true if
w \in L(G) and false otherwise.

Comments:

1. L(G) is considered as a set of sentences. So from this perspective,
   programs are sentences.
2. It's easy to see that L(G) \subseteq T* --- we've thrown out a lot
   (but definitely not all!) of the nonsensical stuff.
3. Given a grammar G and a string of symbols w, a derivation S ==>* w
   can be seen as a proof that w \in L(G). (Technically speaking, it's
   actually called a "witness".)

Conventions:

1. If A ::= \alpha is in P and A ::= \beta is in P, we usually write

     A ::= \alpha | \beta

   where the vertical bar is pronounced "or".
2. Given a set of productions P, the grammar G = (N, T, P, S) is easy
   to infer. So from here on, we'll write grammars by writing just
   their productions.

Leftmost/Rightmost Derivations

A derivation step in which the leftmost nonterminal is replaced,

  w A \beta ==> w \alpha \beta    where A ::= \alpha is in P,

is called leftmost. A derivation \alpha ==>* \beta is leftmost if each
step in the derivation is leftmost. And likewise for rightmost.

Ambiguity

A grammar G is ambiguous if there exists a w \in L(G) that has two
different leftmost (rightmost) derivations.

Derivations and Trees

Let the Xi range over N U T, let A ::= X1 X2 ... Xn be a production
such that Xi ==>* \betai with \beta = \beta1 ... \betan, and let D be
the derivation A ==>* \beta. Then Tree(D) is the tree structure with
root A and with subtrees

  Tree(X1 ==>* \beta1), ..., Tree(Xn ==>* \betan).

Example:

  E ::= (E + E) | a

Let w = (a + a) and D = E ==> (E + E) ==> (a + E) ==> (a + a). Then
Tree(D) is

         E
    / /  |  \ \
   (  E  +  E  )
      |     |
      a     a

The tree of a derivation is called a -concrete syntax tree-.

Abstract Syntax

The phrase "abstract syntax tree" (AST) was coined by John McCarthy,
the inventor of LISP. The abstract syntax tree has only the essential
information contained in the concrete syntax tree. For the example
above, the AST is

     +
    / \
   a   a
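The relationship between the concrete syntax tree and the AST can be sketched in Python. The nested-tuple encoding of trees and the function abstract below are our own illustrative choices for the grammar E ::= (E + E) | a, not a standard representation.

```python
# Concrete syntax tree for w = (a + a): a node is a pair
# (nonterminal, list of children); terminals are plain strings.
# The root E has five children: '(', E, '+', E, ')'.
concrete = ("E", ["(", ("E", ["a"]), "+", ("E", ["a"]), ")"])

def abstract(tree):
    """Map a concrete tree for this grammar to its AST, dropping the
    punctuation '(' , ')' and keeping only the operator and operands."""
    _, children = tree
    if children == ["a"]:
        return "a"                    # E ::= a -- a leaf in the AST
    _, left, _, right, _ = children   # unpack '(', E, '+', E, ')'
    return ("+", abstract(left), abstract(right))

print(abstract(concrete))  # ('+', 'a', 'a')
```

The result ('+', 'a', 'a') is exactly the two-leaf AST drawn above: the parentheses, which exist only to guide parsing, have disappeared.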
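Earlier we defined a parser for L(G) as a function parse(G, w) that returns true iff w \in L(G). For a tiny grammar like G0 this can be sketched as a brute-force search over sentential forms; the encoding and the length-based pruning are ours, and real parsers are of course far more efficient.

```python
from collections import deque

G0 = {"S": ["aA"], "A": ["aA", "b"]}   # S ::= aA,  A ::= aA | b

def parse(productions, start, w):
    """True iff start ==>* w.  Breadth-first search over sentential
    forms, pruning any form longer than w -- sound here because no
    production of G0 shrinks the sentential form."""
    seen = {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        if form == w:
            return True
        for i, symbol in enumerate(form):
            if symbol in productions:              # expand a nonterminal
                for rhs in productions[symbol]:
                    new = form[:i] + rhs + form[i + 1:]
                    if len(new) <= len(w) and new not in seen:
                        seen.add(new)
                        queue.append(new)
    return False

print(parse(G0, "S", "aab"))   # True:  S ==> aA ==> aaA ==> aab
print(parse(G0, "S", "abb"))   # False: abb is not in L(G0)
```

The search that succeeds for "aab" is itself a derivation S ==>* aab, i.e., a witness that aab \in L(G0), in the sense of Comment 3 above.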