CSC444 Oct09

slide version

single file version

Contents

  1. Regular Languages Pumping Lemma
  2. Example of Proof Idea of the Pumping Lemma
  3. Example 1 Using the Pumping Lemma
  4. Example 2 Using the Pumping Lemma
  5. Quiz Remarks/Questions
  6. Context-Free Grammars
  7. Examples
  8. Derivations
  9. Parse Trees
  10. Yields
  11. Context-Free Languages (CFL)
  12. Closure Properties for Regular Languages
  13. Closure Properties for CFLs
  14. Closure of CFLs under Union
  15. Closure of CFLs under Concatenation
  16. Closure of CFLs under Star
  17. Chomsky Normal Form
  18. Example 1
  19. Example 2
  20. Converting to Chomsky Normal Form
  21. Example
  22. Parsing Problem
  23. Parsing Algorithms
  24. LL(1) Grammars
  25. LL(1) Grammars are a Proper Subset of Context-Free Grammars
  26. LL(1) Languages are a Proper Subset of CFLs
  27. Push Down Automaton

Regular Languages Pumping Lemma[2] [top]

Pumping Lemma for Regular Languages: If A is a regular language, then there is a number p such that if s is any string in A of length at least p, then s may be divided into three pieces, s = xyz with the properties:

  1. for each i >= 0, xyiz ∈ A,
  2. |y| > 0, and
  3. |xy| <= p

Example of Proof Idea of the Pumping Lemma[3] [top]

The integer p associated with the pumping lemma is just the number of states a DFA that recognizes the regular language in question.


               a         b         c
    ---> [q0] ---> [q1] ---> [q2] ---> [[q3]]
                    ^                    |
                    |                    |
                    +--------------------+
                              a

    p = 4

    So we need a string of length at least 4

    w = abcabc is accepted and has length 6 >= 4.

    The first state that is repeated when w is input to this DFA
    is  q1. 

    x = string up to the first occurrence of repeated state q1
    y = string after x up to the second occurence of q1

    So x = a and y = bca which means z = bc

    w = |a|bca|bc|
         x  y  z

   Pumping lemma says these string are also in the language

   w0 = xy0z = xz = a bc
   w1 = xy1z = xyz = a bca bc
   w2 = xy2z = xyyz = a bca bca bc
                      

        

Example 1 Using the Pumping Lemma[4] [top]

   L = { aibaj | 0 <= i < j }

Proof is by contradiction, using the pumping lemma to get the contradiction.

Assume L is regular and let p be the constant given by the pumping lemma.

The string w = apbap+1 is in L and has length > p.

By the pumping lemma w = xyz with |xy| <= p and |y| > 0.

But this means the prefix xy must come before the 'b' and consist only of a's.

So this means y consists of 1 or more a's (x might be empty).

By the pumping lemma, w2 = xy2z must also be in L, but the number of a's before the b in w2 must be at least p + 1, while the number of a's after b is still p + 1.

But this contradicts the condition for w2 being in L and so the assumption that L is regular is false.

Example 2 Using the Pumping Lemma[5] [top]

   L = { aibaj | i >  j >= 0 }

Proof is by contradiction again, using the pumping lemma to get the contradiction, but has to work slightly differently.

Adding more y's will not work because now the condition is i > j., the leading a's are greater in number than the ones after the b.

Assume L is regular and let p be the constant given by the pumping lemma.

The string w = ap+1bap is in L and has length > p.

By the pumping lemma w = xyz with |xy| <= p and |y| > 0.

But this again means the prefix xy must come before the 'b' and consist only of a's.

So this means y consists of 1 or more a's (x might be empty).

By the pumping lemma, w0 = xz must also be in L, but the number of a's before the b in w0 must be no more than p since we have removed at least 1 a from w to get w0. But the number of a's after b in w0 is still p.

But this contradicts the condition for w0 being in L and so the assumption that L is regular is false.

Quiz Remarks/Questions[6] [top]

Quiz

Context-Free Grammars[7] [top]


A Context Free Grammar is a 5-tuple (V, ∑, R, S)

  V   Finite set of variables (also called nonterminals)
  ∑   Finite Set of terminal symbols disjoint from V
  R   A finite set of grammar rules 
      (also called productions) 
       formally a subset of V x (V U ∑)*
  S   A designated element of V, called the start symbol

  However, (X, w) in R is usually written with one of the following
  notations:

    X -> w

or

    X ::= w

and if there are multiple rules with the same X on the left, the
notation is used to abbreviate listing X multiple times. E.g.,

    X -> w1 | w2

instead of

    X -> w1
    X -> w2

Note that the right side of a rule can be ε

Examples[8] [top]

1. S -> a S b | ε

2. S -> S + M | M
   M -> M * T | T
   T -> n

V = {S,M,T}
∑ = { n }  (n represents any integer}
Start smbol is S

Derivations[9] [top]

A derivation of an element of w in ∑* is a sequence of strings w0 w1 ... wn n >=1 or more derivation steps where

  1. each wi is in (V U ∑)*
  2. w0 is the string consisting of the start symbol alone
  3. For each i = 0,1,...n-1, wi+1 is obtained from wi by replacing some nonterminal X in wi with the corresponding right side u of a rule X -> u in the grammar.
  4. wn = w

Parse Trees[10] [top]

A derivation can be represented graphically as a parse tree.

The terminals of the derived string w are leaves of the tree.

The Start symbol is the root of the tree.

The nonterminals are interior nodes of the tree.

Each derivation step using a rule X -> u1 u2 ... uk makes the ui's be child nodes of X in the associated parse tree.

Yields[11] [top]

A replacement of a nonterminal can be applied to a portion of a parse tree or in general to a string of the form uXv where u and v are in (V U ∑)* and X is in V to give uwv where X -> w is a grammar rule. If 0 or more repeated replacements of this type on uXv result in a string z, we say that uXv yields z (instead of derives z).

Context-Free Languages (CFL)[12] [top]

The language of a context free grammar is the set of all terminal strings that can be derived with the grammar.

A language is defined to be a context-free language (CFL) if it is the language of some context-free grammar.

Closure Properties for Regular Languages[13] [top]

Recall that regular languages are defined slightly differently, but we have equivalent properties.

A language is regular if it is

  1. recognized by a DFA
  2. recognized by an NFA
  3. described by a regular expression
  4. the language of some regular grammar

Regular languages are closed under these operations:

  1. union
  2. concatenation
  3. star
  4. intersection
  5. complement

Closure Properties for CFLs[14] [top]

Noting the regular grammar characterization of regular languages, one might expect various similarities including closure.

Context-free languages, it turns out, are closed under these operations:

  1. union
  2. concatenation
  3. star

Its actually pretty easy using grammars to see these closure properties. (How do you show concatenation, for example?)

Context-free languages are not closed under intersection. OK

But they are also not closed under complement, which may be a bit more surprising.

Closure of CFLs under Union[15] [top]

Let L1 and L2 be CFLs. To show that L1 ∪ L2 is a CFL, we must construct a context-free grammar whose language is L1 ∪ L2.

Since L1 and L2 are CFLs, there are grammars G1 = (V1, ∑1, R1, S1) and G2 = (V2, ∑2, R2, S2) with L(G1) = L1 and L(G2) = L2.

The idea is to introduce a new start symbol, S, and a rule:

S -> S1 | S2

More formally, let G = ( V1 ∪ V2, ∑1 ∪ ∑2, R1 ∪ R2 ∪ {S -> S1 | S2}, S).

It is assumed that V1 and V2 are disjoint. If not, then renaming the non-terminals so they are disjoint doesn't change the languages L1 and L2.

The language of G is easily seen to be L1 ∪ L2.

Closure of CFLs under Concatenation[16] [top]

To see that CFLs are closed under concatenation, use a similar construction to the one just used for union. The only difference is that the rule for the new start symbol S is

 S -> S1 S2

Closure of CFLs under Star[17] [top]

To construct a grammar whose language is L1* the idea is to introduce a new start symbol S and the rules:

S -> S1 S | ε

Formally, let G = (V1, ∑1, R1 ∪ {S -> S1 S | ε }, S)

The recursive rule for the new start symbol will generate 1 or more instances of L1 concatenated. Together with the ε rule, the language of G is easily seen to be L1*

Chomsky Normal Form[18] [top]

A grammar is in Chomsky Normal if the rules are all of one of the forms:

The right side of the rule for the start symbol is also allowed to be ε.

Example 1[19] [top]

    S -> X B
    X -> A S
    A -> a
    B -> b

Chomsky normal form

Example 2[20] [top]

    S -> X B | ε
    X -> A S
    A -> a
    B -> b

This is also in Chomsky normal form, but only the start symbol can
have a rule with ε right side.

Converting to Chomsky Normal Form[21] [top]

Intuitively, and in abbreviated form, the conversion of any context-free grammar to Chomsky Normal form is accomplished by these three steps:

  1. Introduce a new Start symbol and a rule.
  2. Replace all ε rules, except for the start symbol
  3. Replace all unit rules ( A unit rule is of the form X -> Y where Y is a single nonterminal)
  4. Convert rules with multiple right sides to Chomsky normal form by introducing additional nonterminals as neccessary.

Example[22] [top]

    S -> a B b 
    B -> a B b | ε

// Introduce a new start symbol
1.  P -> S
    S -> a B b 
    B -> a B b | ε
   
// Replace ε rules:
2.  P -> S
    S -> a B b | a b
    B -> a B b | a b

// Replace unit rules:
3.  P -> a B b | a b    
    B -> a B b | a b

// Convert to Chomsky Normal Form by adding nonterminals
4.  P -> X Y | X Z
    X -> a
    Y -> B Z
    Z -> b
    B -> X Y | X Z

See example on pages 108 - 109.

Parsing Problem[23] [top]

The parsing problem for a grammar is: Given a string, determine if it is in the language of the grammar.

This means we have to determine if there is a derivation for the string or not.

Parsing Algorithms[24] [top]

There are many parsing algorithms including ones with cryptic notational names:

  1. LL(1)
  2. LL(k) for k > 1
  3. LR(1)

Roughly the LL versions try to imitate a derivation; they begin with the start symbol and are called top down parsers.

The LR parsers try to imitate a derivation in reverse; they work their way backwards to the start symbol. The parse tree is built bottom up.

These algorithms are deterministic.

LL(1) Grammars[25] [top]

   S -> x A | y B
   A -> x A z | w
   B -> y B z | w

This is an LL(1) grammar. 
There is always only one choice for the next rule to use in a
left-most derivation.  

Predict(S -> x A) = { x } and Predict(S -> y B) = { y }

These "Predict" sets tell when to use the grammar rule when giving a left-most derivation with S as the left-most nonterminal.

If the next input is 'x', then use rule S -> x A; if 'y', use S -> y B; else the input is rejected.

A grammar is LL(1) if for each nonterminal X, the Predict sets for grammar rules for X are pariwise disjoint.

An LL(1) grammar can parse its input deterministically with no backup.

However, not every context-free grammar is LL(1).

LL(1) Grammars are a Proper Subset of Context-Free Grammars[26] [top]

Problem with left-recursion.

    S -> S '+' M | M
    M -> M '*' T | T
    T -> n

The left-recursion means that Predict(S -> S + M) = Predict(S -> M)

Predict(S -> M) is the set of terminal symbols that begin strings derived using this rule.

Predict(S -> M) is not { M } since M is a nonterminal. Instead we must examine what strings M can derive. So Predict(S -> M) = Predict(M -> M * T) ∪ Predict(M -> T) = Predict(T -> n) = { n }.

But now the left recursion means that to see what terminal symbols can begin strings derived using S -> S + M, we have to eventually replace the leading nonterminal S using S -> M.

Consequently, left-recursion means Predict(S -> S + M) and S -> M), far from being disjoint are eqal.

So left-recursion in a grammar guarantees the grammar is not LL(1).

We can often convert a context-free grammar to an equivalent LL(1) grammar, especially context-free grammars used to describe programming languages.

However, ...

LL(1) Languages are a Proper Subset of CFLs[27] [top]

A language is LL(1) if it is the language of some LL(1) grammar.

Unfortunately, LL(1) languages are a proper subset of CFLs.

That is, we can't hope to convert an arbitrary context-free grammar to LL(1).

Here is an example,


    { wwR | w in {a,b}* and wR is the reverse of w}

    E.g. for w = aaabab,

    wwR = aaababbabaaa

Here is a grammar for this language:

   S -> a S a | b S b | ε

This grammar is not LL(1). The S -> ε rule should be applied when the first half has been seen (the w part of wwR). But with only 1 input symbol look ahead, this can't be determined.

This language is an example of a non-deterministic CFL.

But every LL(1) language is deterministic

So there can be no transformation of the grammar to an equivalent LL(1) grammar.

Push Down Automaton[28] [top]

As we will see a language is context-free if it is the language recognized by a particular kind of automaon, the push-down automaton (PDA).

For regular languages there is no difference between the languages defined for the deterministic and non-deterministic finite state automata.

The situation change however, for context-free languages.

Nondeterministic and deterministic versions of push down automata don't recognize the same class of languages.

Next time.