file: 04.1.syntax.txt
author: Bob Muller

CS3366 Programming Languages
Lecture Notes for Meeting 1 of Week 4

Today:
- Syntax, Grammars and Derivations

Context:
- We're interested in the design, specification and implementation of
  programming languages.
- We're more or less following the slogan: Language = Syntax + Semantics.
- We don't have much to say about the design of syntax other than the
  common-sense idea that it shouldn't interfere with understanding and,
  if you're lucky, it should be in reasonable taste.

Specification of Syntax

Notation: We're going to use the LaTeX symbols \alpha, \beta, etc. for
the Greek letters that we cannot render in plain text.

Consider a finite set of symbols T = {a, b, c, ...}. We're going to
view a language over T as a subset of T* (i.e., a set of sequences of
symbols drawn from T). T* itself has no constraints, so we'll employ a
device, a context-free grammar (CFG), to define a reasonable subset
of T*.

A context-free grammar G is a 4-tuple

  G = (N, T, P, S)

where N is a finite set of nonterminal symbols, T is a finite set of
terminal symbols, P \subseteq N X (N U T)* is a finite set of
productions, and S \in N is a distinguished element of N called the
start symbol.

We'll follow well-established conventions, using

- uppercase letters A, B, ... to range over N,
- lowercase letters at the front of the alphabet a, b, ... to range over T,
- lowercase letters at the end of the alphabet w, u, v to range over T*,
- Greek letters \alpha, \beta, \gamma to range over (N U T)*.

Example:

  G0 = ({S, A}, {a, b}, {(S, aA), (A, aA), (A, b)}, S)

The elements of P are usually written as follows:

  S --> aA                     S ::= aA
  A --> aA    or sometimes as  A ::= aA
  A --> b                      A ::= b

Given our conventions, we can say things like A ::= \alpha \in P.

Derives in One Step

Let A ::= \gamma be in P. Then \alpha A \beta ==> \alpha \gamma \beta.
The double arrow is pronounced "derives in one step".

Example: For G0 we have that aaA ==> aaaA, using \alpha = aa,
\beta = the empty string and \gamma = aA.
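For a small grammar like G0, the derives-in-one-step relation is easy to compute. Here is a minimal Python sketch; the dictionary encoding of the productions and the name derive_one_step are our own illustrative choices, not part of the notes.

```python
# Grammar G0, with nonterminals as uppercase single-character strings
# and terminals as lowercase ones.  Each key maps a nonterminal to the
# list of right-hand sides of its productions.
G0 = {
    "S": ["aA"],          # S ::= aA
    "A": ["aA", "b"],     # A ::= aA | b
}

def derive_one_step(productions, form):
    """Return every string reachable from `form` in one ==> step:
    each occurrence of a nonterminal, replaced by each of its
    right-hand sides."""
    results = []
    for i, symbol in enumerate(form):
        if symbol in productions:                      # a nonterminal
            for rhs in productions[symbol]:
                results.append(form[:i] + rhs + form[i + 1:])
    return results

print(derive_one_step(G0, "aaA"))  # ['aaaA', 'aab']
```

The first result corresponds to the example in the notes, aaA ==> aaaA (using A ::= aA); the second uses A ::= b instead.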
Derives in Zero or More Steps

The relation ==>* is the reflexive and transitive closure of ==>.

Example: For G0 we have that aaA ==>* aaA (derives in zero steps),
that aaA ==>* aaaA (derives in one step) and that aaA ==>* aaab
(derives in two steps).

Definition: sentential form

Let G = (N, T, P, S) be a CFG. Then the set of sentential forms of G is

  { \alpha | S ==>* \alpha }

Intuitively, a sentential form \alpha is a string of terminals and
nonterminals derivable from the start symbol.

Definition: L(G) -- the language of G

Let G = (N, T, P, S) be a CFG. Then the language defined by G is

  L(G) = { w | S ==>* w }

Definition: parser for L(G)

A parser for L(G) is a function parse(G, w) that returns true if
w \in L(G) and false otherwise.

Comments:

1. L(G) is considered as a set of sentences. So from this perspective,
   programs are sentences.
2. It's easy to see that L(G) \subseteq T* --- we've thrown out a lot
   (but definitely not all!) of the nonsensical stuff.
3. Given a grammar G and a string of symbols w, a derivation S ==>* w
   can be seen as a proof that w \in L(G). (Technically speaking, it's
   actually called a "witness".)

Conventions:

1. If A ::= \alpha is in P and A ::= \beta is in P, we usually write

     A ::= \alpha | \beta

   where the vertical bar is pronounced "or".
2. Given a set of productions P, the grammar G = (N, T, P, S) is easy
   to infer. So from here on, we'll write grammars by writing just
   their productions.

Leftmost/Rightmost Derivations

A derivation step in which the leftmost nonterminal is replaced,

  w A \beta ==> w \alpha \beta    where A ::= \alpha is in P,

is called leftmost. A derivation \alpha ==>* \beta is leftmost if each
step in the derivation is leftmost. And likewise for rightmost.

Ambiguity

A grammar G is ambiguous if there exists a w \in L(G) that has two
different leftmost (rightmost) derivations.

Derivations and Trees

Let the Xi range over N U T, let A ::= X1 X2 ... Xn be a production
such that Xi ==>* \betai with \beta = \beta1 ... \betan, and let D be
the derivation A ==>* \beta. Then Tree(D) is the tree structure with
root A and with subtrees

  Tree(X1 ==>* \beta1), ..., Tree(Xn ==>* \betan).

Example:

  E ::= (E + E) | a

Let w = (a + a) and D = E ==> (E + E) ==> (a + E) ==> (a + a). Then
Tree(D) is

         E
    / /  |  \ \
   (  E  +  E  )
      |     |
      a     a

The tree of a derivation is called a -concrete syntax tree-.

Abstract Syntax

The phrase "abstract syntax tree" (AST) was coined by John McCarthy,
the inventor of LISP. The abstract syntax tree has only the essential
information contained in the concrete syntax tree. For the example
above, the AST is

     +
    / \
   a   a
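The relationship between the concrete syntax tree and the AST can be sketched in Python. The nested-tuple encoding of trees and the function abstract below are our own illustrative choices for the grammar E ::= (E + E) | a, not a standard representation.

```python
# Concrete syntax tree for w = (a + a): a node is a pair
# (nonterminal, list of children); terminals are plain strings.
# The root E has five children: '(', E, '+', E, ')'.
concrete = ("E", ["(", ("E", ["a"]), "+", ("E", ["a"]), ")"])

def abstract(tree):
    """Map a concrete tree for this grammar to its AST, dropping the
    punctuation '(' , ')' and keeping only the operator and operands."""
    _, children = tree
    if children == ["a"]:
        return "a"                    # E ::= a -- a leaf in the AST
    _, left, _, right, _ = children   # unpack '(', E, '+', E, ')'
    return ("+", abstract(left), abstract(right))

print(abstract(concrete))  # ('+', 'a', 'a')
```

The result ('+', 'a', 'a') is exactly the two-leaf AST drawn above: the parentheses, which exist only to guide parsing, have disappeared.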
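Earlier we defined a parser for L(G) as a function parse(G, w) that returns true iff w \in L(G). For a tiny grammar like G0 this can be sketched as a brute-force search over sentential forms; the encoding and the length-based pruning are ours, and real parsers are of course far more efficient.

```python
from collections import deque

G0 = {"S": ["aA"], "A": ["aA", "b"]}   # S ::= aA,  A ::= aA | b

def parse(productions, start, w):
    """True iff start ==>* w.  Breadth-first search over sentential
    forms, pruning any form longer than w -- sound here because no
    production of G0 shrinks the sentential form."""
    seen = {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        if form == w:
            return True
        for i, symbol in enumerate(form):
            if symbol in productions:              # expand a nonterminal
                for rhs in productions[symbol]:
                    new = form[:i] + rhs + form[i + 1:]
                    if len(new) <= len(w) and new not in seen:
                        seen.add(new)
                        queue.append(new)
    return False

print(parse(G0, "S", "aab"))   # True:  S ==> aA ==> aaA ==> aab
print(parse(G0, "S", "abb"))   # False: abb is not in L(G0)
```

The search that succeeds for "aab" is itself a derivation S ==>* aab, i.e., a witness that aab \in L(G0), in the sense of Comment 3 above.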