Pumping Lemma for Regular Languages: If A is a regular language, then there is a number p such that if s is any string in A of length at least p, then s may be divided into three pieces, s = xyz, with the properties:

1. x y^i z ∈ A for each i >= 0,
2. |y| > 0, and
3. |xy| <= p.
The integer p associated with the pumping lemma can be taken to be the number of states of a DFA that recognizes the regular language in question.
           a         b         c
---> [q0] ---> [q1] ---> [q2] ---> [[q3]]
                ^                    |
                |          a         |
                +--------------------+

p = 4, so we need a string of length at least 4.

w = abcabc is accepted and has length 6 >= 4.

The first state that is repeated when w is input to this DFA is q1.

x = the string up to the first occurrence of the repeated state q1
y = the string after x up to the second occurrence of q1

So x = a and y = bca, which means z = bc:

w = |a|bca|bc|
     x  y   z

The pumping lemma says these strings are also in the language:

w0 = x y^0 z = xz   = a bc
w1 = x y^1 z = xyz  = a bca bc
w2 = x y^2 z = xyyz = a bca bca bc
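This decomposition can be found mechanically: run the string through the DFA and split at the first repeated state. Here is a minimal Python sketch of that idea; the transition-table encoding and the name pump_split are illustrative choices, not from the notes (missing transitions, i.e. a dead state, are omitted for brevity).

# Find the pumping decomposition w = xyz by running a string through a
# DFA and splitting at the first repeated state. The table below encodes
# the 4-state machine pictured above.

delta = {
    ('q0', 'a'): 'q1',
    ('q1', 'b'): 'q2',
    ('q2', 'c'): 'q3',
    ('q3', 'a'): 'q1',
}

def pump_split(w, start='q0'):
    """Return (x, y, z) where the DFA revisits a state between x and xy."""
    state = start
    seen = {state: 0}           # state -> position in w where first seen
    for i, ch in enumerate(w):
        state = delta[(state, ch)]
        if state in seen:       # first repeated state found
            j = seen[state]
            return w[:j], w[j:i + 1], w[i + 1:]
        seen[state] = i + 1
    return w, '', ''            # no repeat (only possible when |w| < p)

x, y, z = pump_split('abcabc')
print(x, y, z)                  # prints: a bca bc
for i in range(3):              # pumping y stays in the language
    print(x + y * i + z)        # a bc, a bca bc, a bca bca bc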
L = { a^i b a^j | 0 <= i < j }
Proof is by contradiction, using the pumping lemma to get the contradiction.
Assume L is regular and let p be the constant given by the pumping lemma.
The string w = a^p b a^(p+1) is in L and has length > p.
By the pumping lemma w = xyz with |xy| <= p and |y| > 0.
But this means the prefix xy must come before the 'b' and consist only of a's.
So this means y consists of 1 or more a's (x might be empty).
By the pumping lemma, w2 = x y^2 z must also be in L. But the number of a's before the b in w2 is at least p + 1, while the number of a's after the b is still p + 1.
This contradicts the condition i < j for membership in L, and so the assumption that L is regular is false.
L = { a^i b a^j | i > j >= 0 }
The proof is by contradiction again, using the pumping lemma to get the contradiction, but it has to work slightly differently.
Adding more y's will not work, because now the condition is i > j: the leading a's are greater in number than the ones after the b, and pumping up only adds more leading a's. Instead, we pump down.
Assume L is regular and let p be the constant given by the pumping lemma.
The string w = a^(p+1) b a^p is in L and has length > p.
By the pumping lemma w = xyz with |xy| <= p and |y| > 0.
But this again means the prefix xy must come before the 'b' and consist only of a's.
So this means y consists of 1 or more a's (x might be empty).
By the pumping lemma, w0 = x y^0 z = xz must also be in L. But the number of a's before the b in w0 is no more than p, since we have removed at least one a from w to get w0, while the number of a's after the b is still p.
This contradicts the condition i > j for membership in L, and so the assumption that L is regular is false.
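To see the shape of these arguments concretely, here is a small Python sketch (the helper names and the fixed small p are illustrative only; checking one value of p is a demonstration, not a proof). For the first language it enumerates every decomposition w = xyz allowed by the lemma and verifies that pumping up always leaves the language:

# Mechanically check the pumping argument for L = { a^i b a^j | 0 <= i < j }.

def in_L(s):
    """Membership test for { a^i b a^j | 0 <= i < j }."""
    if s.count('b') != 1:
        return False
    front, back = s.split('b')
    return (set(front) <= {'a'} and set(back) <= {'a'}
            and len(front) < len(back))

p = 4
w = 'a' * p + 'b' + 'a' * (p + 1)   # the witness string from the proof

# Try every split w = xyz with |xy| <= p and |y| > 0.
for xy_len in range(1, p + 1):
    for x_len in range(xy_len):
        x, y, z = w[:x_len], w[x_len:xy_len], w[xy_len:]
        # The proof pumps up: x y^2 z must leave the language.
        assert not in_L(x + y * 2 + z), (x, y, z)

print("every allowed split fails when pumped up, as the proof predicts")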
A Context-Free Grammar is a 4-tuple (V, ∑, R, S):

V   Finite set of variables (also called nonterminals)
∑   Finite set of terminal symbols, disjoint from V
R   A finite set of grammar rules (also called productions), formally a subset of V x (V ∪ ∑)*
S   A designated element of V, called the start symbol

However, (X, w) in R is usually written with one of the following notations:

X -> w   or   X ::= w

and if there are multiple rules with the same X on the left, the | notation is used to abbreviate listing X multiple times. E.g.,

X -> w1 | w2

instead of

X -> w1
X -> w2

Note that the right side of a rule can be ε.
1. S -> a S b | ε

2. S -> S + M | M
   M -> M * T | T
   T -> n

   V = {S, M, T}
   ∑ = { n, +, * }   (n represents any integer)
   Start symbol is S
A derivation of an element w of ∑* is a sequence of strings w0 w1 ... wn (n >= 1 derivation steps) where w0 = S, wn = w, and each wi+1 is obtained from wi by replacing some nonterminal X in wi with the right side of a grammar rule X -> u.
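For example, with grammar 2 above, the string n + n * n has the left-most derivation

S => S + M => M + M => T + M => n + M => n + M * T => n + T * T => n + n * T => n + n * n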
A derivation can be represented graphically as a parse tree.
The terminals of the derived string w are leaves of the tree.
The Start symbol is the root of the tree.
The nonterminals are interior nodes of the tree.
Each derivation step using a rule X -> u1 u2 ... uk makes the ui's the child nodes of X in the associated parse tree.
A replacement of a nonterminal can be applied to a portion of a parse tree or, in general, to a string of the form uXv, where u and v are in (V ∪ ∑)* and X is in V, to give uwv, where X -> w is a grammar rule. If 0 or more repeated replacements of this type on uXv result in a string z, we say that uXv yields z (instead of derives z).
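For example, with the rules of grammar 1 above, the string a S b yields a a S b b (one replacement using S -> a S b), and it also yields a a b b (a further replacement using S -> ε).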
The language of a context free grammar is the set of all terminal strings that can be derived with the grammar.
A language is defined to be a context-free language (CFL) if it is the language of some context-free grammar.
Recall that regular languages are defined slightly differently, but we have equivalent characterizations.
A language is regular if it is the language recognized by some DFA (equivalently, some NFA), or described by some regular expression, or generated by some regular grammar.
Regular languages are closed under these operations: union, concatenation, Kleene star, intersection, and complement.
Noting the regular grammar characterization of regular languages, one might expect various similarities, including closure properties.
Context-free languages, it turns out, are closed under these operations: union, concatenation, and Kleene star.
It's actually pretty easy, using grammars, to see these closure properties. (How do you show concatenation, for example?)
Context-free languages are not closed under intersection.
But they are also not closed under complement, which may be a bit more surprising.
Let L1 and L2 be CFLs. To show that L1 ∪ L2 is a CFL, we must construct a context-free grammar whose language is L1 ∪ L2.
Since L1 and L2 are CFLs, there are grammars G1 = (V1, ∑1, R1, S1) and G2 = (V2, ∑2, R2, S2) with L(G1) = L1 and L(G2) = L2.
The idea is to introduce a new start symbol, S, and a rule:
S -> S1 | S2
More formally, let G = (V1 ∪ V2 ∪ {S}, ∑1 ∪ ∑2, R1 ∪ R2 ∪ {S -> S1 | S2}, S).
It is assumed that V1 and V2 are disjoint. If not, then renaming the non-terminals so they are disjoint doesn't change the languages L1 and L2.
The language of G is easily seen to be L1 ∪ L2.
To see that CFLs are closed under concatenation, use a similar construction to the one just used for union. The only difference is that the rule for the new start symbol S is
S -> S1 S2
To construct a grammar whose language is L1*, the idea is to introduce a new start symbol S and the rules:
S -> S1 S | ε
Formally, let G = (V1 ∪ {S}, ∑1, R1 ∪ {S -> S1 S | ε}, S).
The recursive rule for the new start symbol generates one or more concatenated strings from L1. Together with the ε rule, the language of G is easily seen to be L1*.
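All three constructions are simple enough to carry out mechanically. Here is a minimal Python sketch; the dictionary encoding of a grammar is an illustrative choice, not from the notes, and it assumes the two nonterminal sets are disjoint and that 'S' is a fresh symbol.

# Union, concatenation, and star constructions on CFGs. A grammar is
# encoded as (rules, start), where rules maps a nonterminal to a list of
# right-hand sides (each a list of symbols; [] is ε).

def union(g1, g2):
    (r1, s1), (r2, s2) = g1, g2
    return {**r1, **r2, 'S': [[s1], [s2]]}, 'S'    # S -> S1 | S2

def concat(g1, g2):
    (r1, s1), (r2, s2) = g1, g2
    return {**r1, **r2, 'S': [[s1, s2]]}, 'S'      # S -> S1 S2

def star(g1):
    r1, s1 = g1
    return {**r1, 'S': [[s1, 'S'], []]}, 'S'       # S -> S1 S | ε

# Example: G1 with L(G1) = { a^n b^n } and G2 with L(G2) = { c^n }.
g1 = ({'A': [['a', 'A', 'b'], []]}, 'A')
g2 = ({'C': [['c', 'C'], []]}, 'C')
rules, start = union(g1, g2)
print(start, rules)   # S -> A | C plus the rules of both grammars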
A grammar is in Chomsky Normal Form if the rules are all of one of the forms:

X -> Y Z   where Y and Z are nonterminals
X -> a     where a is a terminal
The right side of the rule for the start symbol is also allowed to be ε.
S -> X B
X -> A S
A -> a
B -> b

This grammar is in Chomsky normal form.
S -> X B | ε
X -> A S
A -> a
B -> b

This is also in Chomsky normal form, but only the start symbol can have a rule with ε right side.
Intuitively, and in abbreviated form, the conversion of any context-free grammar to Chomsky Normal Form is accomplished by these four steps:
S -> a B b
B -> a B b | ε

// Introduce a new start symbol:
1. P -> S
   S -> a B b
   B -> a B b | ε

// Replace ε rules:
2. P -> S
   S -> a B b | a b
   B -> a B b | a b

// Replace unit rules:
3. P -> a B b | a b
   B -> a B b | a b

// Convert to Chomsky Normal Form by adding nonterminals:
4. P -> X Y | X Z
   X -> a
   Y -> B Z
   Z -> b
   B -> X Y | X Z

See example on pages 108 - 109.
The parsing problem for a grammar is: Given a string, determine if it is in the language of the grammar.
This means we have to determine if there is a derivation for the string or not.
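One standard way to settle this question is dynamic programming over a grammar in Chomsky Normal Form, the CYK algorithm (not developed in these notes; shown here only as an illustration). A minimal Python sketch, hard-coding the CNF grammar from step 4 above:

# CYK membership test for: P -> XY | XZ, B -> XY | XZ, X -> a, Z -> b, Y -> BZ.

unit = {'a': {'X'}, 'b': {'Z'}}   # inverted X -> a rules: terminal -> nonterminals
binary = {('X', 'Y'): {'P', 'B'}, ('X', 'Z'): {'P', 'B'}, ('B', 'Z'): {'Y'}}

def cyk(w, start='P'):
    n = len(w)
    if n == 0:
        return False          # handle a start rule S -> ε separately if present
    # table[i][j] = set of nonterminals deriving the substring w[i : i+j+1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(w):
        table[i][0] = set(unit.get(ch, ()))
    for length in range(2, n + 1):              # substring length
        for i in range(n - length + 1):         # start position
            for k in range(1, length):          # split point
                for A in table[i][k - 1]:
                    for B in table[i + k][length - k - 1]:
                        table[i][length - 1] |= binary.get((A, B), set())
    return start in table[0][n - 1]

print(cyk('aabb'))   # True:  a a b b is a^2 b^2
print(cyk('abab'))   # False: not of the form a^n b^n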
There are many parsing algorithms, including ones with cryptic notational names: LL(1), LL(k), LR(0), LR(1), LALR(1), and so on.
Roughly, the LL versions try to imitate a derivation; they begin with the start symbol and are called top-down parsers.
The LR parsers try to imitate a derivation in reverse; they work their way backwards to the start symbol. The parse tree is built bottom-up.
These algorithms are deterministic.
S -> x A | y B
A -> x A z | w
B -> y B z | w

This is an LL(1) grammar. There is always only one choice for the next rule to use in a left-most derivation.

Predict(S -> x A) = { x } and Predict(S -> y B) = { y }
These "Predict" sets tell when to use the grammar rule when giving a left-most derivation with S as the left-most nonterminal.
If the next input is 'x', then use rule S -> x A; if 'y', use S -> y B; else the input is rejected.
A grammar is LL(1) if, for each nonterminal X, the Predict sets for the grammar rules for X are pairwise disjoint.
An LL(1) grammar can parse its input deterministically with no backup.
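The Predict sets translate directly into a deterministic parser with no backtracking. Here is a minimal recursive-descent sketch in Python for the grammar above; the function names are illustrative.

# Predictive (LL(1)) parser for: S -> x A | y B, A -> x A z | w, B -> y B z | w.
# The next input symbol alone picks the rule to apply.

def parse(s):
    pos = 0

    def peek():
        return s[pos] if pos < len(s) else None

    def eat(ch):
        nonlocal pos
        if peek() != ch:
            raise SyntaxError(f"expected {ch!r} at position {pos}")
        pos += 1

    def S():
        if peek() == 'x':          # Predict(S -> x A) = { x }
            eat('x'); A()
        elif peek() == 'y':        # Predict(S -> y B) = { y }
            eat('y'); B()
        else:
            raise SyntaxError("expected 'x' or 'y'")

    def A():
        if peek() == 'x':          # Predict(A -> x A z) = { x }
            eat('x'); A(); eat('z')
        else:                      # Predict(A -> w) = { w }
            eat('w')

    def B():
        if peek() == 'y':          # Predict(B -> y B z) = { y }
            eat('y'); B(); eat('z')
        else:                      # Predict(B -> w) = { w }
            eat('w')

    S()
    return pos == len(s)           # all input must be consumed

print(parse('xxwz'))   # True:  S => x A => x x A z => x x w z
print(parse('yywzz'))  # True:  S => y B => y y B z => y y w z z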
However, not every context-free grammar is LL(1).
Problem with left-recursion.
S -> S '+' M | M
M -> M '*' T | T
T -> n
The left-recursion means that Predict(S -> S + M) = Predict(S -> M)
Predict(S -> M) is the set of terminal symbols that begin strings derived using this rule.
Predict(S -> M) is not { M } since M is a nonterminal. Instead we must examine what strings M can derive. So Predict(S -> M) = Predict(M -> M * T) ∪ Predict(M -> T) = Predict(T -> n) = { n }.
But now the left recursion means that to see what terminal symbols can begin strings derived using S -> S + M, we have to eventually replace the leading nonterminal S using S -> M.
Consequently, left-recursion means Predict(S -> S + M) and Predict(S -> M), far from being disjoint, are equal.
So left-recursion in a grammar guarantees the grammar is not LL(1).
We can often convert a context-free grammar to an equivalent LL(1) grammar, especially context-free grammars used to describe programming languages.
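For example, the standard left-recursion-removal transformation (the usual construction, not developed in these notes) rewrites the expression grammar above as

S  -> M S'
S' -> '+' M S' | ε
M  -> T M'
M' -> '*' T M' | ε
T  -> n

which generates the same language and is LL(1).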
However, ...
A language is LL(1) if it is the language of some LL(1) grammar.
Unfortunately, LL(1) languages are a proper subset of CFLs.
That is, we can't hope to convert an arbitrary context-free grammar to LL(1).
Here is an example:

{ w w^R | w in {a,b}* and w^R is the reverse of w }

E.g., for w = aaabab, w w^R = aaababbabaaa.
Here is a grammar for this language:
S -> a S a | b S b | ε
This grammar is not LL(1). The S -> ε rule should be applied when the first half has been seen (the w part of w w^R). But with only one symbol of lookahead, this can't be determined.
This language is an example of a non-deterministic CFL.
But every LL(1) language is deterministic.
So there can be no transformation of the grammar to an equivalent LL(1) grammar.
As we will see, a language is context-free if it is the language recognized by a particular kind of automaton, the push-down automaton (PDA).
For regular languages there is no difference between the languages defined for the deterministic and non-deterministic finite state automata.
The situation changes, however, for context-free languages.
Nondeterministic and deterministic versions of push down automata don't recognize the same class of languages.
Next time.