CSC347 Sep08


Contents

  1. Overview
  2. Administration
  3. Languages
  4. Texts
  5. Prerequisites
  6. Grading
  7. Class Web Page
  8. Thumbnail Evolution of Programming Languages
  9. Imperative Languages
  10. Functional Languages
  11. Object-Oriented Languages
  12. Other Paradigms
  13. Some Language Evaluation Criteria
  14. Some Influences on Programming Language Design
  15. Translating from high-level to low-level languages
  16. Compilers
  17. Interpreters
  18. Virtual Machines (or Interpretive Compilers)
  19. Just-in-time compilers
  20. Describing Syntax
  21. Structure of Programming Languages
  22. Programming Language Descriptions
  23. Lexical Structure
  24. Specify Lexical Structure
  25. Syntactic Structure
  26. Abstract Syntax Trees
  27. Generating Code
  28. Grammars
  29. Productions
  30. Context-Free Grammars
  31. Context-Free Grammars
  32. Context-Free Grammar Examples
  33. Context-Free Grammars for Expressions
  34. Regular Grammars
  35. Regular Grammar Examples
  36. Lexical Analysis and Syntactic Analysis Programs (written in Java)
  37. Derivations
  38. Derivation
  39. Example Derivations
  40. Parse Trees
  41. Parse Tree Examples
  42. Left Most Derivations
  43. Parsing Problem
  44. Regular Expressions
  45. Regular Expressions/Example
  46. Finite State Automata and Grammars

Overview[2]

Why should we study programming languages?

Administration[3]

Office: CTI, Room 832

Office Hours: Monday, Wednesday 12:00pm - 2:00pm and by appointment

Email: lancaster@cti.depaul.edu

Phone: 312 362-8718

Fax: 312 362-6116

Languages[4]

Imperative: Perl (some C)

Functional: ML

Object-oriented: Java (some C++ and Perl)

There will be some programming assignments in Perl and in ML.

Texts[5]

Sebesta, Concepts of Programming Languages, Addison-Wesley.

Other References

Prerequisites[6]

Java or C++ through data structures.

That is, experience with an object-oriented language (Java or C++), including standard abstract data structures: lists, stacks, queues, and trees.

Grading[7]

Homework: 40%

Midterm: 30%

Final: 30%

Class Web Page[8]

See

http://condor.depaul.edu/~glancast

Thumbnail Evolution of Programming Languages[9]

Binary machine language: 01100001 01010001

Assembly language:

; code for A = A + B
MOV A, r1
MOV B, r2
ADD r2,r1
MOV r1, A

Imperative programming: x = x + y;

Structured programming: while, for, but goto considered harmful.

Functional programming: assignment considered harmful.

OO programming: encapsulate state and code together in objects.

Other paradigms...

Imperative Languages[10]

Fortran I, Fortran II, Fortran IV, Fortran 77, Fortran 90

ALGOL 58, ALGOL 60, ALGOL W

Pascal, MODULA-2, MODULA-3, Oberon

CPL, BCPL, B, C, ANSI C

BASIC, Visual Basic

COBOL

Ada

Perl

Functional Languages[11]

Lisp, Scheme, Common Lisp

ML, Standard ML, CAML, ML97

Lazy ML

Miranda

Haskell, Gofer

Object-Oriented Languages[12]

Simula

Smalltalk

C++

Eiffel

Java

AspectJ

Perl

Objective Caml

Other Paradigms[13]

Logic

Concurrent

Constraint

Visual

Some Language Evaluation Criteria[14]

Syntax

Control Structures

Data types

Support for Abstraction

Expressiveness

Type System

Efficiency

Some Influences on Programming Language Design[15]

Application Domain

Available Hardware

Programming Methodology

Implementation Methods

Programming Environments

Translating from high-level to low-level languages[16]

Programming languages are much higher-level now than in the 1950s.

Machine architecture hasn't changed much (additions: virtual memory, caches, pipelines).

How do we get from high-level:

  x := y + z;

to low-level:

  MOV Y, R1
  MOV Z, R2
  ADD R2, R1
  MOV R1, X

Compilers[17]

Compilers use the traditional compile/link/run strategy.

Examples: C, C++, ML.

Interpreters[18]

Interpreters execute the source code directly.

This can be on the order of 100 times slower than executing compiled machine code.

Examples: BASIC, Perl, Tcl/Tk, ML.

Virtual Machines (or Interpretive Compilers)[19]

Virtual machines (or interpretive compilers) use an intermediate byte code rather than native object code.

The source files are compiled (e.g., with javac) to the intermediate byte code. The byte code is then interpreted by the corresponding language-specific interpreter program; e.g., java.

This is usually 'only' about 10 times slower than executing code compiled for the target machine.

Examples: Pascal p-code, Java.

Just-in-time compilers[20]

Just-in-time compilers are like VMs, but compile the byte code to native code as the program runs.

This speeds up execution to roughly 5 times slower than executing compiled machine code.

Examples: Java.

Describing Syntax[21]

Structure of Programming Languages[22]

Syntax: does it `look like a program'?

tmp = x;   // Looks like two assignments in Java
x = y;

but

x = y     // Doesn't look like an assignment; semicolon is missing
y = tmp;   

Static semantics: does it compile?

int[] x;
int y;

x = y; // Looks like an assignment, but doesn't compile because 
       // types of x and y are incompatible

Dynamic semantics: what does it do when it runs?


// syntax is ok and it compiles, but 
Emp e;
Manager m;
e = new Emp("Bob");
m = new Manager("Ted");

e.display(); // What 'display' method should this use (Emp's or
             // Manager's) at run time?
e = m;
e.display(); // Same expression at compile time; which 'display'
             // should be used at run time?
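To make the dynamic-dispatch point concrete, here is a minimal, runnable version of the example. (The class bodies are hypothetical; only the names Emp, Manager, and display come from the example above.)

class Emp {
    protected String name;
    Emp(String name) { this.name = name; }
    void display() { System.out.println("Employee: " + name); }
}

class Manager extends Emp {
    Manager(String name) { super(name); }
    @Override
    void display() { System.out.println("Manager: " + name); }
}

class DispatchDemo {
    public static void main(String[] args) {
        Emp e = new Emp("Bob");
        Manager m = new Manager("Ted");
        e.display();    // run-time type is Emp: prints "Employee: Bob"
        e = m;
        e.display();    // same expression, but run-time type is now
                        // Manager: prints "Manager: Ted"
    }
}

Java answers the question dynamically: the display method is chosen by the run-time type of the object, not by the declared type of the variable e.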

Programming Language Descriptions[23]

Tutorials

Reference Manuals

Formal Definitions: formal descriptions of syntax, static semantics, and dynamic semantics.

In this course, we shall concentrate on syntax, and `wave our hands' about the other two.

Lexical Structure[24]

The lexical structure of a program defines the lexical tokens of the language, including identifiers, keywords, operators, literals, and punctuation symbols.

For example:

  return -x;

the tokens are:

token    type
return   keyword
-        operator MINUS
x        identifier
;        semicolon

The lexical structure also says which characters should be ignored, such as white space or comments.

Specify Lexical Structure[25]

The lexical structure of a language can be specified in a number of ways. For example, the lexical structure of identifiers in a given language might be described in three different ways:

Informally

Informally: Identifiers must begin with an alphabetic character,
followed by 0 or more alphabetic characters, digits, or the underscore
character. 

Using a regular grammar

<identifier> -> alpha <more>
<more> -> alpha <more> | 
          digit <more> |
          _ <more> | λ
          
where 
alpha represents any character in the set 
	{a, b, ... , z , A , B , ... , Z},

digit represents any character in  { 0, 1, ... , 9}, and

λ denotes the empty sequence of symbols.
          

Using regular expressions

Identifier = alpha ( alpha | digit | '_' )*
alpha = a-z | A-Z
digit = 0-9

The * means zero or more repetitions.

All three of these descriptions define a set of acceptable strings. The set of all these strings is called the associated formal language. For example, the following strings are acceptable:

a b1 euros HoBBit A__B 

but not

@ 1b euro$ _HoBBit _A_B 
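For comparison, the regular-expression description can be tested directly in Java with java.util.regex. (A small sketch of our own; the class and method names here are not from the course materials.)

import java.util.regex.Pattern;

class IdentifierCheck {
    // Same language as: Identifier = alpha ( alpha | digit | '_' )*
    static final Pattern IDENT = Pattern.compile("[a-zA-Z][a-zA-Z0-9_]*");

    static boolean isIdentifier(String s) {
        return IDENT.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] tests = { "a", "b1", "euros", "HoBBit", "A__B",    // accepted
                           "@", "1b", "euro$", "_HoBBit", "_A_B" }; // rejected
        for (String t : tests)
            System.out.println(t + " : " + isIdentifier(t));
    }
}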

Syntactic Structure[26]

The syntactic structure of a language specifies how and in what order the tokens or lexical elements may be combined into legal sentences of the language.

For programming languages, a formal sentence in this sense would be an entire program.

The syntactic structure, like the lexical structure, can be described by a formal grammar, but in general not by a regular grammar. Instead a second type of grammar is needed: a context-free grammar.

Abstract Syntax Trees[27]

Once the syntactic structure of an expression has been determined, the result is often represented as an abstract syntax tree.

For example, consider the expression:

( a + b ) * c

The lexical phase of a compiler identifies the tokens:

token    type
(        left parenthesis
a        identifier
+        operator PLUS
b        identifier
)        right parenthesis
*        operator TIMES
c        identifier

The syntactic phase must determine that this sequence (in this order) is a valid expression and determine its syntactic structure (e.g., what are the operands of each operator). The result can be represented as an abstract syntax tree:

        *
       / \
      +   c
     / \
    a   b

Generating Code[28]

The abstract syntax tree can be used by the code generating phase of the compiler.

A postfix traversal of a binary tree will a) visit the left subtree, then b) visit the right subtree, and c) finally, visit the root of the tree.

A postfix traversal of the abstract syntax tree above for (a+b)*c can output code for a stack-based interpreter like this:

Node Visited            Output
  a                     Push a
  b                     Push b
  +                     Add
  c                     Push c
  *                     Mult
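Here is a small Java sketch of this code generator. (The Node class is our own illustration, not the course's sample code; a real compiler would handle more operators.)

// Postfix (post-order) traversal of an AST, emitting instructions
// for a stack-based machine.
class Node {
    String value;       // an identifier, or the operator + or *
    Node left, right;   // both null for a leaf (an operand)

    Node(String value, Node left, Node right) {
        this.value = value; this.left = left; this.right = right;
    }

    void emit() {
        if (left == null && right == null) {
            System.out.println("Push " + value);   // operand
        } else {
            left.emit();                           // a) left subtree
            right.emit();                          // b) right subtree
            System.out.println(value.equals("+") ? "Add" : "Mult"); // c) root
        }
    }

    public static void main(String[] args) {
        // The AST for ( a + b ) * c
        Node ast = new Node("*",
            new Node("+", new Node("a", null, null), new Node("b", null, null)),
            new Node("c", null, null));
        ast.emit();   // Push a, Push b, Add, Push c, Mult
    }
}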

Grammars[29]

A formal grammar is a 4-tuple (N, T, Start, P) where

  1. N is a finite set of symbols - non-terminals
  2. T is a finite set of symbols disjoint from N - terminals
  3. Start is a designated element of N - the start symbol
  4. P is a finite set of rules or productions
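The 4-tuple maps directly onto a data structure. A rough Java sketch of our own (requires Java 16+ for records):

import java.util.List;
import java.util.Set;

// A rule a -> b, where each side is a sequence of symbols.
record Production(List<String> lhs, List<String> rhs) {}

// A formal grammar as a 4-tuple (N, T, Start, P).
record Grammar(Set<String> nonTerminals,    // N
               Set<String> terminals,       // T, disjoint from N
               String start,                // Start, an element of N
               List<Production> rules) {}   // P

For instance, Example 1 in the context-free grammar examples below would have N = {S, A}, T = {z, i, p}, Start = S, and three productions.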

Productions[30]

A grammar rule or production is of the form:

a  -> b

where a and b are strings in (N union T)*

Note: The idea will be to use the rules on strings of symbols; this rule says that the string a can be replaced by the string b.

Context-Free Grammars[31]

This definition of formal grammars is too general for application to programming languages, which are more structured than natural languages.

Two special types of grammars are, however, useful. These are regular grammars and context-free grammars.

Context-Free Grammars[32]

A context-free grammar is a grammar with the added restriction that the rules must all be of the form:

X -> b

where X is in N, and b is still allowed to be in (N union T)*.

Context-Free Grammar Examples[33]

Example 1: A context-free grammar:

S ---> z A
A ---> i A 
A ---> p

Example 1b: Essentially the same grammar as 1, but with BNF notation for the grammar:

<S> ---> z <A>
<A> ---> i <A> 
<A> ---> p

Example 2: This grammar is also context-free.

S -> i S e S
S -> i S
S -> b

Example 3: This grammar is not context-free. Why?

 S ---> aSBC | λ
CB ---> BC
BB ---> b
bB ---> bb
bC ---> b

where λ is the null string of symbols.

Context-Free Grammars for Expressions[34]

Consider expressions like

a + b * c 
( a + b ) * c

Here are two possible grammars for such expressions. (Here 'id' represents any identifier, like 'a', 'b', etc.)

Grammar E1:

E -> E + E | E * E | ( E ) | id

Grammar E2:

 E -> E + T | T  (rules 1 and 2 for E)
 T -> T * F | F  (rules 3 and 4 for T)
 F -> id | ( E ) (rules 5 and 6 for F)

The first would seem preferable, but it doesn't distinguish the precedence of * and +. It has another undesirable property as well: it is ambiguous about how to group the tokens of

id + id * id
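To see the ambiguity concretely, here are two different left-most derivations of id + id * id in grammar E1:

E => E + E => id + E => id + E * E => id + id * E => id + id * id

E => E * E => E + E * E => id + E * E => id + id * E => id + id * id

The first groups the string as id + (id * id); the second groups it as (id + id) * id. Two different parse trees for the same sentence means E1 is ambiguous. Grammar E2 fixes both problems: it is unambiguous, and its rules build in the precedence of * over +.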

Regular Grammars[35]

A regular grammar is a grammar where each rule must be of one of the following forms:

X -> t
X -> t Y
X -> λ

where X and Y are any elements of N and t is any element of T.

Note that X and Y are allowed to be the same non-terminal.

Regular Grammar Examples[36]

Examples 1 and 1b above for context-free grammars are also regular.

Every regular grammar is also context-free.

Example 2 above is context-free, but not regular. Why?

Lexical Analysis and Syntactic Analysis Programs (written in Java)[37]

See the sample programs: Lex.java and Parser.java.

Derivations[38]

The rules of a grammar provide a set of templates for forming grammatically correct sequences of terminals. Note that grammatically correct means with reference to this particular grammar.

Definition:

A sentence for a given grammar is a sequence of terminals which is grammatically correct for that grammar.

The language of a given grammar is the collection of all possible sentences for that grammar.

Derivation[39]

A derivation of a sentence is a sequence of steps that

  1. Begins with a result-string consisting of just the start-symbol,
  2. At each subsequent step, one non-terminal symbol in the result-string is chosen and one rule with that non-terminal appearing on the left-hand-side is chosen. The non-terminal is then replaced by the corresponding right-hand-side of the chosen rule in the result-string.
  3. Step 2 is repeated until there are no more non-terminal symbols.

Definition:

A sequence of terminal symbols is grammatically correct if and only if there is a derivation that produces the sequence as the result.

Notation

Each derivation step (replacement of a non-terminal symbol) is denoted by string-before-replacement => string-after-replacement
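A derivation can be carried out mechanically. Here is a small Java illustration of our own (not one of the course's sample programs) that performs a random derivation using the grammar of Example 1 (S -> zA, A -> iA | p):

import java.util.List;
import java.util.Map;
import java.util.Random;

class Derive {
    // Upper-case letters are non-terminals; lower-case are terminals.
    static final Map<Character, List<String>> RULES = Map.of(
        'S', List.of("zA"),
        'A', List.of("iA", "p"));

    public static void main(String[] args) {
        Random rnd = new Random();
        String result = "S";                        // step 1: start symbol
        System.out.println(result);
        int i;
        while ((i = firstNonTerminal(result)) >= 0) {   // step 3: repeat
            // step 2: choose a rule for the chosen non-terminal and
            // replace the non-terminal with that rule's right-hand side
            List<String> options = RULES.get(result.charAt(i));
            String rhs = options.get(rnd.nextInt(options.size()));
            result = result.substring(0, i) + rhs + result.substring(i + 1);
            System.out.println("=> " + result);
        }
    }

    static int firstNonTerminal(String s) {
        for (int i = 0; i < s.length(); i++)
            if (Character.isUpperCase(s.charAt(i))) return i;
        return -1;
    }
}

A typical run prints S => zA => ziA => zip, which is a derivation of the sentence zip.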

Example Derivations[40]

Example derivations for grammar E2 above:

Derivation 1:

Derivation of id + id:
                           Rule Used
E =>1 E + T                1: E -> E + T
  =>2 T + T                2: E -> T
  =>4 F + T                4: T -> F
  =>5 id + T               5: F -> id
  =>4 id + F               4: T -> F
  =>5 id + id              5: F -> id

 (The subscripts indicate the substitution rule that 
  was used at each derivation step.)

Derivation 2:

Derivation of (id):
                           Rule Used
E =>2 T                    2: E -> T
  =>4 F                    4: T -> F
  =>6 ( E )                6: F -> ( E )
  =>2 ( T )                2: E -> T
  =>4 ( F )                4: T -> F
  =>5 ( id )               5: F -> id

Parse Trees[41]

Every derivation corresponds to a parse tree. The nodes of the parse trees are just the grammar symbols.

The root of the parse tree is the start symbol.

For each derivation step that expands a non-terminal, draw the non-terminal as a node with the expansion symbols as its direct children.

Parse Tree Examples[42]

The parse tree for "id + id":

The parse tree for the derivation of "( id )" would be

Left Most Derivations[43]

Here is the derivation of id + id given above:

                           Rule Used
E =>1 E + T                1: E -> E + T
  =>2 T + T                2: E -> T
  =>4 F + T                4: T -> F
  =>5 id + T               5: F -> id
  =>4 id + F               4: T -> F
  =>5 id + id              5: F -> id

And here is another derivation of the same string:


E =>1 E + T                1: E -> E + T
  =>4 E + F                4: T -> F
  =>5 E + id               5: F -> id
  =>2 T + id               2: E -> T
  =>4 F + id               4: T -> F
  =>5 id + id              5: F -> id


The first derivation is a left-most derivation and
the second is a right-most. 

They differ only in the order that the nodes of the parse tree are expanded.

Parsing Problem[44]

Derivations are perfectly suitable for defining sentences of a grammar.

However, derivations begin with the start-symbol and produce some sentence. This is backwards from what a compiler or programmer needs.

Namely, the compiler is given a sequence of terminals (the source code of a program, partitioned into tokens) and must find a derivation of that particular sequence.

It is not sufficient for the compiler to produce a derivation of some other valid program.

Instead of compiler error messages, would you be happy if the compiler reported, "I can't seem to compile your program, but here is a different program which I can compile. I hope you like it."?

The parsing problem is to discover a derivation for a given input sequence.
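One classic attack on the parsing problem is recursive descent: one parsing method per non-terminal. Below is a minimal recognizer for grammar E2, a sketch of our own (see Lex.java and Parser.java for the course's actual sample programs). The left-recursive rules E -> E + T and T -> T * F are handled with loops, which accept the same strings.

class ExprParser {
    private final String[] tokens;   // e.g., produced by a lexer
    private int pos = 0;

    ExprParser(String[] tokens) { this.tokens = tokens; }

    private String peek() { return pos < tokens.length ? tokens[pos] : "$"; }

    private void expect(String t) {
        if (!peek().equals(t))
            throw new RuntimeException("expected " + t + ", saw " + peek());
        pos++;
    }

    void parseE() {                          // E -> T { + T }
        parseT();
        while (peek().equals("+")) { pos++; parseT(); }
    }

    void parseT() {                          // T -> F { * F }
        parseF();
        while (peek().equals("*")) { pos++; parseF(); }
    }

    void parseF() {                          // F -> id | ( E )
        if (peek().equals("id")) {
            pos++;
        } else {
            expect("(");
            parseE();
            expect(")");
        }
    }

    public static void main(String[] args) {
        // Token sequence for ( id + id ) * id
        ExprParser p = new ExprParser(
            new String[] { "(", "id", "+", "id", ")", "*", "id" });
        p.parseE();
        p.expect("$");                       // all input must be consumed
        System.out.println("valid expression");
    }
}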

Regular Expressions[45]

Regular expressions represent sets of strings formed from some specified alphabet.

Sometimes the alphabet is just the set of all ASCII characters.

Three operations are defined on regular expressions to form new regular expressions:

Alphabet = Some set of symbols

Operators: (1) concatenation, (2) alternation, (3) closure

Operator         Symbol
--------         ------
concatenation    (None, just write expressions next to each other)
alternation        |
closure            *

Regular Expressions/Example[46]

Here are some regular expressions defined over the alphabet { a, b, c }

                                     Set of Strings
Regular Expression              Denoted by this Reg. Exp.
    a                            { a }
    b | c                        { b, c }
    ab                           { ab }
    a(b | c)                     { ab, ac }
    (a | b)(c | d)               { ac, ad, bc, bd }
    a*                           { λ, a, aa, aaa, aaaa, ... }
    (a | b)*                     { λ, a, b, aa, ab, ba, bb, aaa,
                                     aab, aba, abb, baa, bab, bba,
                                     bbb, aaaa, ... }
    (ab | c)*                    { λ, ab, c, abab, abc, cab, cc,
                                   ababab, ababc, abcab, abcc,
                                   cabab, cabc, ccab, ccc, ... }
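These sets can be checked mechanically; here is a quick Java illustration of our own for the last expression:

import java.util.regex.Pattern;

class RegexDemo {
    public static void main(String[] args) {
        // The regular expression (ab | c)* in Java's regex syntax.
        Pattern p = Pattern.compile("(ab|c)*");
        String[] tests = { "", "ab", "c", "abab", "cc", "ababc",  // in the set
                           "ba", "ac", "b" };                     // not in the set
        for (String t : tests)
            System.out.println("\"" + t + "\" : " + p.matcher(t).matches());
    }
}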

Finite State Automata and Grammars[47]

It is not always easy to find a grammar even when you can describe the language informally.

However, if the language can be described by a Finite State Automaton, then it is easy to write the grammar.

Below is a simple FSA (the diagram is not reproduced here: it has states S, F, and X, with S the start state and F the only accepting state; transitions on a or b go from S to F, from F to F, and from X to F, and a transition on c goes from F to X). I want to write a grammar that derives the same set of strings as are accepted by this FSA.

Grammar for this FSA is:

S -> a F | b F
F -> a F | b F | c X | λ
X -> a F | b F

Rules?

What are the rules for converting the FSA to a grammar?
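As a sanity check, here is a direct Java simulation of the FSA, reconstructed from the grammar above. Note the correspondence it suggests: one state per non-terminal, a transition from X to Y on t for each rule X -> t Y, and X accepting exactly when there is a rule X -> λ.

class FsaDemo {
    // States S, F, X correspond to the grammar's non-terminals;
    // F is the only accepting state (because of the rule F -> λ).
    static boolean accepts(String input) {
        char state = 'S';
        for (char ch : input.toCharArray()) {
            switch (state) {
                case 'S':
                    if (ch == 'a' || ch == 'b') state = 'F';
                    else return false;
                    break;
                case 'F':
                    if (ch == 'a' || ch == 'b') state = 'F';
                    else if (ch == 'c') state = 'X';
                    else return false;
                    break;
                case 'X':
                    if (ch == 'a' || ch == 'b') state = 'F';
                    else return false;
                    break;
            }
        }
        return state == 'F';
    }

    public static void main(String[] args) {
        String[] tests = { "a", "ab", "abca", "c", "abc" };  // last two rejected
        for (String t : tests)
            System.out.println(t + " : " + accepts(t));
    }
}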