Compiler Design: LL(1) Parser in Python
Prerequisites: construction of the LL(1) parsing table, classification of top-down parsers, FIRST sets, FOLLOW sets
In this article, we will look at how to design an LL(1) parser in Python.
LL(1) Grammar
The first "L" in LL(1) stands for scanning the input from left to right, the second "L" stands for producing a leftmost derivation, and the "1" stands for using one input symbol of lookahead at each step to make parsing decisions. LL(1) grammars are parsed with a top-down approach. For the class of grammars called LL(1) we can construct predictive parsers, which are recursive-descent parsers that need no backtracking.
Eliminating Left Recursion
A grammar is left-recursive if it has a non-terminal A such that there is a derivation A → A α | β. Top-down parsing methods cannot handle left-recursive grammars, so a transformation is needed to eliminate the left recursion. It can be eliminated by rewriting the rule as follows (A' is a new non-terminal and ε denotes epsilon):
A → β A’
A’ → α A’ | ε
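To make the transformation concrete, here is a minimal Python sketch (separate from the full removeLeftRecursion() function in the implementation below); the helper name and the expression-grammar example are only for illustration, and '#' stands for ε as in the rest of the code.

# Minimal sketch: remove immediate left recursion for one non-terminal.
# Sub-rules are lists of symbols; '#' stands for epsilon.
def remove_immediate_left_recursion(lhs, subrules):
    alpha = [r[1:] for r in subrules if r[0] == lhs]   # A -> A alpha
    beta = [r for r in subrules if r[0] != lhs]        # A -> beta
    if not alpha:                                      # nothing to do
        return {lhs: subrules}
    new_lhs = lhs + "'"
    return {
        lhs: [b + [new_lhs] for b in beta],                  # A  -> beta A'
        new_lhs: [a + [new_lhs] for a in alpha] + [['#']],   # A' -> alpha A' | #
    }

# E -> E + T | T   becomes   E -> T E',  E' -> + T E' | #
print(remove_immediate_left_recursion('E', [['E', '+', 'T'], ['T']]))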
Left Factoring
Left factoring is a grammar transformation that produces a grammar suitable for predictive, or top-down, parsing. When the choice between two alternative productions is not clear, we rewrite the productions to defer the decision until enough of the input has been seen to make the right choice.
For example, the grammar rule A → α β1 | α β2 is left-factored into:
A → α A’
A’ → β1 | β2
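The same idea as a small Python sketch (a simplified stand-in for the LeftFactoring() function below), under the assumption that the common prefix is a single leading symbol:

# Minimal sketch: left-factor the productions of one non-terminal,
# grouping alternatives by their first symbol; '#' stands for epsilon.
def left_factor(lhs, subrules):
    groups = {}
    for r in subrules:
        groups.setdefault(r[0], []).append(r)
    result = {lhs: []}
    suffix = "'"
    for head, group in groups.items():
        if len(group) > 1:                     # common prefix -> factor it out
            new_lhs = lhs + suffix
            suffix += "'"                      # keep generated names unique
            result[lhs].append([head, new_lhs])                 # A  -> alpha A'
            result[new_lhs] = [r[1:] or ['#'] for r in group]   # A' -> beta1 | beta2
        else:
            result[lhs].append(group[0])
    return result

# A -> a b1 | a b2 | k   becomes   A -> a A' | k,  A' -> b1 | b2
print(left_factor('A', [['a', 'b1'], ['a', 'b2'], ['k']]))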
The Concepts of FIRST and FOLLOW
The construction of a top-down parser is aided by the functions FIRST and FOLLOW associated with a grammar G. During top-down parsing, FIRST and FOLLOW allow us to choose which production to apply based on the next input symbol.
Rules for computing FIRST:
1) If x is a terminal, then FIRST(x) = {x}
2) If X → ε is a production, then add ε to FIRST(X)
3) If X is a non-terminal and X → PQR is a production, then FIRST(X) = FIRST(P);
if FIRST(P) contains ε, then
FIRST(X) = (FIRST(P) – {ε}) ∪ FIRST(QR)
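The three rules translate almost directly into a short recursive function. The following self-contained sketch uses a toy grammar chosen only for illustration, with '#' standing for ε as in the full code below:

# A compact FIRST computation over a grammar dictionary; '#' stands for epsilon.
# Toy grammar: S -> A b,  A -> a | #   (no left recursion, which this simple
# recursion cannot handle).
grammar = {'S': [['A', 'b']], 'A': [['a'], ['#']]}
terminals = {'a', 'b'}

def FIRST(symbols):
    if not symbols:                      # the empty string derives epsilon
        return {'#'}
    head = symbols[0]
    if head in terminals or head == '#':
        return {head}                    # rules 1 and 2
    result = set()
    for production in grammar[head]:     # union of FIRST over all productions
        result |= FIRST(production)
    if '#' in result:                    # rule 3: head is nullable, look further right
        result = (result - {'#'}) | FIRST(symbols[1:])
    return result

print(FIRST(['S']))   # {'a', 'b'}  (A can derive epsilon, so 'b' is reachable)
print(FIRST(['A']))   # {'a', '#'}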
Rules for computing FOLLOW:
Note: Epsilon (ε) can never appear in the FOLLOW set of any non-terminal.
1) For the start symbol S, place $ in FOLLOW(S)
2) If A → α B, then add everything in FOLLOW(A) to FOLLOW(B)
3) If A → α B β, then
if ε is not in FIRST(β),
add FIRST(β) to FOLLOW(B)
otherwise,
add (FIRST(β) – {ε}) ∪ FOLLOW(A) to FOLLOW(B)
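Below is a self-contained sketch of these rules on the same toy grammar. It uses a fixed-point iteration instead of the recursion used by the follow() function in the full code, but it applies exactly the rules listed above; the grammar and FIRST sets are assumed here for illustration:

# Minimal FOLLOW sketch via fixed-point iteration; '#' stands for epsilon.
# Toy grammar: S -> A b,  A -> a | #   (start symbol S, FIRST sets given).
grammar = {'S': [['A', 'b']], 'A': [['a'], ['#']]}
first = {'S': {'a', 'b'}, 'A': {'a', '#'}}      # assumed precomputed

def first_of_string(symbols):
    # FIRST of a symbol string; keeps '#' only if every symbol is nullable
    result = set()
    for s in symbols:
        f = first[s] if s in grammar else {s}
        result |= f - {'#'}
        if '#' not in f:
            return result
    return result | {'#'}

follow = {nt: set() for nt in grammar}
follow['S'].add('$')                            # rule 1: $ in FOLLOW(start symbol)
changed = True
while changed:
    changed = False
    for lhs, productions in grammar.items():
        for prod in productions:
            for i, sym in enumerate(prod):
                if sym not in grammar:          # FOLLOW is defined for non-terminals only
                    continue
                f_beta = first_of_string(prod[i + 1:])
                add = f_beta - {'#'}            # rules 2/3: FIRST(beta) minus epsilon
                if '#' in f_beta:               # beta empty or nullable:
                    add |= follow[lhs]          # - everything in FOLLOW(A) as well
                if not add <= follow[sym]:
                    follow[sym] |= add
                    changed = True

print(follow)   # {'S': {'$'}, 'A': {'b'}}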
The Parsing Table
After constructing the parsing table, if any cell, i.e. any (non-terminal row, terminal column) pair, contains more than one production rule, the grammar is not LL(1). Otherwise the grammar is considered LL(1).
Rules for constructing the parsing table:
Step 1: For each production A → α of the given grammar, perform Step 2 and Step 3.
Step 2: For each terminal symbol 'a' in FIRST(α), add A → α to the cell T[A, a], where 'A' determines the row and 'a' determines the column.
Step 3: If ε is present in FIRST(α), then find FOLLOW(A) and add A → α at every column 'b' in FOLLOW(A), i.e. T[A, b].
Step 4: If ε is in FIRST(α) and $ is in FOLLOW(A), add A → α to T[A, $].
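A minimal sketch of Steps 1 through 4 on the same toy grammar (FIRST of each right-hand side and the FOLLOW sets are written out by hand here, just for illustration):

# Minimal parse-table sketch for the toy grammar S -> A b,  A -> a | #.
# Each entry: (lhs, alpha, FIRST(alpha)); FOLLOW sets assumed known.
productions = [
    ('S', ['A', 'b'], {'a', 'b'}),
    ('A', ['a'],      {'a'}),
    ('A', ['#'],      {'#'}),
]
follow = {'S': {'$'}, 'A': {'b'}}

table = {}                               # (non-terminal, terminal) -> production
is_ll1 = True
for lhs, alpha, first_alpha in productions:
    targets = first_alpha - {'#'}        # Step 2: every terminal in FIRST(alpha)
    if '#' in first_alpha:               # Steps 3 and 4: alpha can derive epsilon,
        targets |= follow[lhs]           # - so also every symbol in FOLLOW(lhs)
    for t in targets:
        if (lhs, t) in table:            # two rules in one cell -> not LL(1)
            is_ll1 = False
        table[(lhs, t)] = (lhs, alpha)

print("LL(1):", is_ll1)                  # LL(1): True
for (nt, t), (lhs, alpha) in sorted(table.items()):
    print(f"T[{nt}][{t}] = {lhs} -> {' '.join(alpha)}")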
Assumptions in the code:
a) The LHS symbol of the first rule is taken as the start symbol.
b) '#' represents the epsilon symbol.
Approach
- There are seven functions plus the driver code, which together perform the computation. The code takes the grammar rules as input, and the user has to specify which symbols are terminals (list: term_userdef) and which are non-terminals (list: nonterm_userdef). The expected rule format is illustrated in the small sketch after this list.
- For demonstration, the code shown here runs on sample set 7 (the uncommented set in the driver code), a grammar that contains both left recursion and left factoring. computeAllFirsts() is called first; this function is responsible for calling removeLeftRecursion() and LeftFactoring(), which work according to the rules described above. Then first() is called for every non-terminal; its recursive logic is based purely on the FIRST computation rules listed above.
- The base condition of first() is encountering an epsilon or a terminal symbol; every non-terminal leads into the recursive logic. If FIRST contains more than one symbol a list is returned, otherwise a string is returned, and a few extra lines in between simply keep the program type-safe.
- The FIRST computation is a prerequisite for the FOLLOW computation, because follow() calls first() many times. start_symbol is the LHS symbol of the first rule in the 'rules' list (this can be modified). For the FOLLOW computation, computeAllFollows() is called, which in turn calls follow() on every non-terminal. follow() is also recursive; its base condition is that a call on start_symbol yields the '$' symbol.
- All remaining cases are handled according to the FOLLOW computation rules described above. For the target non-terminal, all rules are traversed and the required FIRST and FOLLOW sets are computed along the way. Intermediate results are accumulated during the traversal, and the final FOLLOW list is returned at the end. Here, too, the base case returns a string while multiple symbols are returned as a list, so a few extra lines add type safety.
- After FIRST and FOLLOW have been computed, createParseTable() is called. The FIRST and FOLLOW sets are printed in a formatted way, and then a two-dimensional list named 'mat' is prepared for the parsing table, with the non-terminals forming the rows and the terminals forming the columns, plus '$' as an extra column. The parse-table construction rules described above form the basis of the implementation.
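As a tiny illustration of the assumed input format, this is how a rule string from the 'rules' list is split into the 'diction' entry that all of the functions above operate on (the example rule is taken from sample set 5 in the listing below):

# How a rule string from the 'rules' list becomes a 'diction' entry.
rule = "A -> B C c | g D B"
lhs, rhs = [part.strip() for part in rule.split("->")]
diction_entry = [alt.strip().split() for alt in rhs.split("|")]
print(lhs, diction_entry)   # A [['B', 'C', 'c'], ['g', 'D', 'B']]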
String Validation Approach
- After the parsing table has been generated, we have to validate an input string against the given grammar. For this we use a stack and a buffer. If the grammar is not LL(1), string validation cannot be performed. Both the stack and the buffer get '$' at the bottom; the input string is loaded into the buffer in reverse order, and then the START SYMBOL is pushed onto the top of the stack.
- We then repeatedly look at the current state of the stack and the buffer and consult the entry Table[x][y], where 'x' is the symbol on top of the stack and 'y' is the symbol at the back of the buffer (the current lookahead). We take the corresponding rule from the table and expand the non-terminal at the top of the stack in the next iteration.
- If x and y are the same terminal symbol, we pop them from the stack and the buffer and continue. If only the '$' symbol remains in both the stack and the buffer, every input symbol has been matched and the string belongs to the given grammar. If there is no rule in the parsing table, or x and y are different terminal symbols, the string does not belong to the given grammar.
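The following compact, self-contained sketch shows this stack-and-buffer loop on the toy table built in the sketch above (grammar S -> A b, A -> a | #); the full validateStringUsingStackBuffer() in the implementation below additionally prints a step-by-step trace:

# Minimal table-driven validation with a stack and a reversed-input buffer.
table = {('S', 'a'): ['A', 'b'], ('S', 'b'): ['A', 'b'],
         ('A', 'a'): ['a'],      ('A', 'b'): ['#']}
non_terminals = {'S', 'A'}

def parse(tokens, start='S'):
    stack = [start, '$']
    buffer = ['$'] + list(reversed(tokens))
    while True:
        if stack == ['$'] and buffer == ['$']:
            return True                              # everything matched
        top, lookahead = stack[0], buffer[-1]
        if top in non_terminals:
            rhs = table.get((top, lookahead))
            if rhs is None:
                return False                         # no table entry -> reject
            expansion = [] if rhs == ['#'] else rhs
            stack = expansion + stack[1:]            # expand non-terminal at TOS
        elif top == lookahead:
            stack, buffer = stack[1:], buffer[:-1]   # matched a terminal
        else:
            return False                             # terminal mismatch

print(parse(['a', 'b']))   # True
print(parse(['b', 'a']))   # False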
Below is the complete implementation:
Python3
# LL(1) parser code in python
def removeLeftRecursion(rulesDiction):
# for rule: A->Aa|b
# result: A->bA',A'->aA'|#
# 'store' has new rules to be added
store = {}
# traverse over rules
for lhs in rulesDiction:
# alphaRules stores subrules with left-recursion
# betaRules stores subrules without left-recursion
alphaRules = []
betaRules = []
# get rhs for current lhs
allrhs = rulesDiction[lhs]
for subrhs in allrhs:
if subrhs[0] == lhs:
alphaRules.append(subrhs[1:])
else:
betaRules.append(subrhs)
# alpha and beta containing subrules are separated
# now form two new rules
if len(alphaRules) != 0:
# to generate new unique symbol
# add ' till unique not generated
lhs_ = lhs + "'"
while (lhs_ in rulesDiction.keys()) \
or (lhs_ in store.keys()):
lhs_ += "'"
# make beta rule
for b in range(0, len(betaRules)):
betaRules[b].append(lhs_)
rulesDiction[lhs] = betaRules
# make alpha rule
for a in range(0, len(alphaRules)):
alphaRules[a].append(lhs_)
alphaRules.append(['#'])
# store in temp dict, append to
# - rulesDiction at end of traversal
store[lhs_] = alphaRules
# add newly generated rules
# - after removing left recursion
for left in store:
rulesDiction[left] = store[left]
return rulesDiction
def LeftFactoring(rulesDiction):
# for rule: A->aDF|aCV|k
# result: A->aA'|k, A'->DF|CV
# newDict stores newly generated
# - rules after left factoring
newDict = {}
# iterate over all rules of dictionary
for lhs in rulesDiction:
# get rhs for given lhs
allrhs = rulesDiction[lhs]
# temp dictionary helps detect left factoring
temp = dict()
for subrhs in allrhs:
if subrhs[0] not in list(temp.keys()):
temp[subrhs[0]] = [subrhs]
else:
temp[subrhs[0]].append(subrhs)
# if value list count for any key in temp is > 1,
# - it has left factoring
# new_rule stores new subrules for current LHS symbol
new_rule = []
# temp_dict stores new subrules for left factoring
tempo_dict = {}
for term_key in temp:
# get value from temp for term_key
allStartingWithTermKey = temp[term_key]
if len(allStartingWithTermKey) > 1:
# left factoring required
# to generate new unique symbol
# - add ' till unique not generated
lhs_ = lhs + "'"
while (lhs_ in rulesDiction.keys()) \
or (lhs_ in tempo_dict.keys()):
lhs_ += "'"
# append the left factored result
new_rule.append([term_key, lhs_])
# add expanded rules to tempo_dict
ex_rules = []
for g in temp[term_key]:
ex_rules.append(g[1:])
tempo_dict[lhs_] = ex_rules
else:
# no left factoring required
new_rule.append(allStartingWithTermKey[0])
# add (possibly factored) rules for current LHS
newDict[lhs] = new_rule
# add newly generated rules after left factoring
for key in tempo_dict:
newDict[key] = tempo_dict[key]
return newDict
# calculation of first
# epsilon is denoted by '#' (hash symbol)
# pass rule in first function
def first(rule):
global rules, nonterm_userdef, \
term_userdef, diction, firsts
# recursion base condition
# (for terminal or epsilon)
if len(rule) != 0 and (rule is not None):
if rule[0] in term_userdef:
return rule[0]
elif rule[0] == '#':
return '#'
# condition for Non-Terminals
if len(rule) != 0:
if rule[0] in list(diction.keys()):
# fres temporary list of result
fres = []
rhs_rules = diction[rule[0]]
# call first on each rule of RHS
# fetched (& take union)
for itr in rhs_rules:
indivRes = first(itr)
if type(indivRes) is list:
for i in indivRes:
fres.append(i)
else:
fres.append(indivRes)
# if no epsilon in result
# - received return fres
if '#' not in fres:
return fres
else:
# apply epsilon
# rule => f(ABC)=f(A)-{e} U f(BC)
newList = []
fres.remove('#')
if len(rule) > 1:
ansNew = first(rule[1:])
if ansNew != None:
if type(ansNew) is list:
newList = fres + ansNew
else:
newList = fres + [ansNew]
else:
newList = fres
return newList
# if result is not already returned
# - control reaches here
# lastly if eplison still persists
# - keep it in result of first
fres.append('#')
return fres
# calculation of follow
# use 'rules' list, and 'diction' dict from above
# follow() takes the Non-Terminal
# - whose FOLLOW we want to compute
def follow(nt):
global start_symbol, rules, nonterm_userdef, \
term_userdef, diction, firsts, follows
# for start symbol return $ (recursion base case)
solset = set()
if nt == start_symbol:
# return '$'
solset.add('$')
# check all occurrences
# solset - is result of computed 'follow' so far
# For input, check in all rules
for curNT in diction:
rhs = diction[curNT]
# go for all productions of NT
for subrule in rhs:
if nt in subrule:
# call for all occurrences on
# - non-terminal in subrule
while nt in subrule:
index_nt = subrule.index(nt)
subrule = subrule[index_nt + 1:]
# empty condition - call follow on LHS
if len(subrule) != 0:
# compute first if symbols on
# - RHS of target Non-Terminal exists
res = first(subrule)
# if epsilon in result apply rule
# - (A->aBX)- follow of -
# - follow(B)=(first(X)-{ep}) U follow(A)
if '#' in res:
newList = []
res.remove('#')
ansNew = follow(curNT)
if ansNew != None:
if type(ansNew) is list:
newList = res + ansNew
else:
newList = res + [ansNew]
else:
newList = res
res = newList
else:
# when nothing follows the non-terminal,
# - take follow of the LHS instead,
# - but only if LHS differs from nt (avoids infinite recursion)
if nt != curNT:
res = follow(curNT)
# add follow result in set form
if res is not None:
if type(res) is list:
for g in res:
solset.add(g)
else:
solset.add(res)
return list(solset)
def computeAllFirsts():
global rules, nonterm_userdef, \
term_userdef, diction, firsts
for rule in rules:
k = rule.split("->")
# remove unnecessary spaces
k[0] = k[0].strip()
k[1] = k[1].strip()
rhs = k[1]
multirhs = rhs.split('|')
# remove unnecessary spaces
for i in range(len(multirhs)):
multirhs[i] = multirhs[i].strip()
multirhs[i] = multirhs[i].split()
diction[k[0]] = multirhs
print(f"\nRules: \n")
for y in diction:
print(f"{y}->{diction[y]}")
print(f"\nAfter elimination of left recursion:\n")
diction = removeLeftRecursion(diction)
for y in diction:
print(f"{y}->{diction[y]}")
print("\nAfter left factoring:\n")
diction = LeftFactoring(diction)
for y in diction:
print(f"{y}->{diction[y]}")
# calculate first for each rule
# - (call first() on all RHS)
for y in list(diction.keys()):
t = set()
for sub in diction.get(y):
res = first(sub)
if res != None:
if type(res) is list:
for u in res:
t.add(u)
else:
t.add(res)
# save result in 'firsts' dictionary
firsts[y] = t
print("\nCalculated firsts: ")
key_list = list(firsts.keys())
index = 0
for gg in firsts:
print(f"first({key_list[index]}) "
f"=> {firsts.get(gg)}")
index += 1
def computeAllFollows():
global start_symbol, rules, nonterm_userdef,\
term_userdef, diction, firsts, follows
for NT in diction:
solset = set()
sol = follow(NT)
if sol is not None:
for g in sol:
solset.add(g)
follows[NT] = solset
print("\nCalculated follows: ")
key_list = list(follows.keys())
index = 0
for gg in follows:
print(f"follow({key_list[index]})"
f" => {follows[gg]}")
index += 1
# create parse table
def createParseTable():
import copy
global diction, firsts, follows, term_userdef
print("\nFirsts and Follow Result table\n")
# find space size
mx_len_first = 0
mx_len_fol = 0
for u in diction:
k1 = len(str(firsts[u]))
k2 = len(str(follows[u]))
if k1 > mx_len_first:
mx_len_first = k1
if k2 > mx_len_fol:
mx_len_fol = k2
print(f"{{:<{10}}} "
f"{{:<{mx_len_first + 5}}} "
f"{{:<{mx_len_fol + 5}}}"
.format("Non-T", "FIRST", "FOLLOW"))
for u in diction:
print(f"{{:<{10}}} "
f"{{:<{mx_len_first + 5}}} "
f"{{:<{mx_len_fol + 5}}}"
.format(u, str(firsts[u]), str(follows[u])))
# create matrix of row(NT) x [col(T) + 1($)]
# create list of non-terminals
ntlist = list(diction.keys())
terminals = copy.deepcopy(term_userdef)
terminals.append('$')
# create the initial empty state of the matrix
mat = []
for x in diction:
row = []
for y in terminals:
row.append('')
# ('$' is already in terminals, giving the extra column)
mat.append(row)
# Classifying grammar as LL(1) or not LL(1)
grammar_is_LL = True
# rules implementation
for lhs in diction:
rhs = diction[lhs]
for y in rhs:
res = first(y)
# epsilon is present,
# - take union with follow
if '#' in res:
if type(res) == str:
firstFollow = []
fol_op = follows[lhs]
if fol_op is str:
firstFollow.append(fol_op)
else:
for u in fol_op:
firstFollow.append(u)
res = firstFollow
else:
res.remove('#')
res = list(res) +\
list(follows[lhs])
# add rules to table
ttemp = []
if type(res) is str:
ttemp.append(res)
res = copy.deepcopy(ttemp)
for c in res:
xnt = ntlist.index(lhs)
yt = terminals.index(c)
if mat[xnt][yt] == '':
mat[xnt][yt] = mat[xnt][yt] \
+ f"{lhs}->{' '.join(y)}"
else:
# if rule already present
if f"{lhs}->{y}" in mat[xnt][yt]:
continue
else:
grammar_is_LL = False
mat[xnt][yt] = mat[xnt][yt] \
+ f",{lhs}->{' '.join(y)}"
# final state of parse table
print("\nGenerated parsing table:\n")
frmt = "{:>12}" * len(terminals)
print(frmt.format(*terminals))
j = 0
for y in mat:
frmt1 = "{:>12}" * len(y)
print(f"{ntlist[j]} {frmt1.format(*y)}")
j += 1
return (mat, grammar_is_LL, terminals)
def validateStringUsingStackBuffer(parsing_table, grammarll1,
table_term_list, input_string,
term_userdef,start_symbol):
print(f"\nValidate String => {input_string}\n")
# if any cell of the parsing table
# - has more than one entry, the grammar is not LL(1)
if grammarll1 == False:
return f"\nInput String = " \
f"\"{input_string}\"\n" \
f"Grammar is not LL(1)"
# implementing stack buffer
stack = [start_symbol, '$']
buffer = []
# reverse input string store in buffer
input_string = input_string.split()
input_string.reverse()
buffer = ['$'] + input_string
print("{:>20} {:>20} {:>20}".
format("Buffer", "Stack","Action"))
while True:
# end loop if all symbols matched
if stack == ['$'] and buffer == ['$']:
print("{:>20} {:>20} {:>20}"
.format(' '.join(buffer),
' '.join(stack),
"Valid"))
return "\nValid String!"
elif stack[0] not in term_userdef:
# take front of buffer (y) and top of stack (x)
x = list(diction.keys()).index(stack[0])
y = table_term_list.index(buffer[-1])
if parsing_table[x][y] != '':
# format table entry received
entry = parsing_table[x][y]
print("{:>20} {:>20} {:>25}".
format(' '.join(buffer),
' '.join(stack),
f"T[{stack[0]}][{buffer[-1]}] = {entry}"))
lhs_rhs = entry.split("->")
lhs_rhs[1] = lhs_rhs[1].replace('#', '').strip()
entryrhs = lhs_rhs[1].split()
stack = entryrhs + stack[1:]
else:
return f"\nInvalid String! No rule at " \
f"Table[{stack[0]}][{buffer[-1]}]."
else:
# stack top is Terminal
if stack[0] == buffer[-1]:
print("{:>20} {:>20} {:>20}"
.format(' '.join(buffer),
' '.join(stack),
f"Matched:{stack[0]}"))
buffer = buffer[:-1]
stack = stack[1:]
else:
return "\nInvalid String! " \
"Unmatched terminal symbols"
# DRIVER CODE - MAIN
# NOTE: To test any of the sample sets, uncomment ->
# 'rules' list, 'nonterm_userdef' list, 'term_userdef' list
# and for any String validation uncomment following line with
# 'sample_input_String' variable.
sample_input_string = None
# sample set 1 (Result: Not LL(1))
# rules=["A -> S B | B",
# "S -> a | B c | #",
# "B -> b | d"]
# nonterm_userdef=['A','S','B']
# term_userdef=['a','c','b','d']
# sample_input_string="b c b"
# sample set 2 (Result: LL(1))
# rules=["S -> A | B C",
# "A -> a | b",
# "B -> p | #",
# "C -> c"]
# nonterm_userdef=['A','S','B','C']
# term_userdef=['a','c','b','p']
# sample_input_string="p c"
# sample set 3 (Result: LL(1))
# rules=["S -> A B | C",
# "A -> a | b | #",
# "B-> p | #",
# "C -> c"]
# nonterm_userdef=['A','S','B','C']
# term_userdef=['a','c','b','p']
# sample_input_string="a c b"
# sample set 4 (Result: Not LL(1))
# rules = ["S -> A B C | C",
# "A -> a | b B | #",
# "B -> p | #",
# "C -> c"]
# nonterm_userdef=['A','S','B','C']
# term_userdef=['a','c','b','p']
# sample_input_string="b p p c"
# sample set 5 (With left recursion)
# rules=["A -> B C c | g D B",
# "B -> b C D E | #",
# "C -> D a B | c a",
# "D -> # | d D",
# "E -> E a f | c"
# ]
# nonterm_userdef=['A','B','C','D','E']
# term_userdef=["a","b","c","d","f","g"]
# sample_input_string="b a c a c"
# sample set 6
# rules=["E -> T E'",
# "E' -> + T E' | #",
# "T -> F T'",
# "T' -> * F T' | #",
# "F -> ( E ) | id"
# ]
# nonterm_userdef=['E','E\'','F','T','T\'']
# term_userdef=['id','+','*','(',')']
# sample_input_string="id * * id"
# example string 1
# sample_input_string="( id * id )"
# example string 2
# sample_input_string="( id ) * id + id"
# sample set 7 (left factoring & recursion present)
rules=["S -> A k O",
"A -> A d | a B | a C",
"C -> c",
"B -> b B C | r"]
nonterm_userdef=['S','A','B','C']
term_userdef=['k','O','d','a','c','b','r']
sample_input_string="a r k O"
# sample set 8 (Multiple char symbols T & NT)
# rules = ["S -> NP VP",
# "NP -> P | PN | D N",
# "VP -> V NP",
# "N -> championship | ball | toss",
# "V -> is | want | won | played",
# "P -> me | I | you",
# "PN -> India | Australia | Steve | John",
# "D -> the | a | an"]
#
# nonterm_userdef = ['S', 'NP', 'VP', 'N', 'V', 'P', 'PN', 'D']
# term_userdef = ["championship", "ball", "toss", "is", "want",
# "won", "played", "me", "I", "you", "India",
# "Australia","Steve", "John", "the", "a", "an"]
# sample_input_string = "India won the championship"
# diction - stores the input rules
# firsts - store computed firsts
diction = {}
firsts = {}
follows = {}
# computes all FIRSTs for all non terminals
computeAllFirsts()
# assuming first rule has start_symbol
# start symbol can be modified in below line of code
start_symbol = list(diction.keys())[0]
# computes all FOLLOWs for all occurrences
computeAllFollows()
# generate formatted first and follow table
# then generate parse table
(parsing_table, result, tabTerm) = createParseTable()
# validate string input using stack-buffer concept
if sample_input_string != None:
validity = validateStringUsingStackBuffer(parsing_table, result,
tabTerm, sample_input_string,
term_userdef,start_symbol)
print(validity)
else:
print("\nNo input String detected")
# Author: Tanmay P. Bisen
Output:

Rules:
S->[['A', 'k', 'O']]
A->[['A', 'd'], ['a', 'B'], ['a', 'C']]
C->[['c']]
B->[['b', 'B', 'C'], ['r']]
After elimination of left recursion:
S->[['A', 'k', 'O']]
A->[['a', 'B', "A'"], ['a', 'C', "A'"]]
C->[['c']]
B->[['b', 'B', 'C'], ['r']]
A'->[['d', "A'"], ['#']]
After left factoring:
S->[['A', 'k', 'O']]
A->[['a', "A''"]]
A''->[['B', "A'"], ['C', "A'"]]
C->[['c']]
B->[['b', 'B', 'C'], ['r']]
A'->[['d', "A'"], ['#']]
Calculated firsts:
first(S) => {'a'}
first(A) => {'a'}
first(A'') => {'c', 'r', 'b'}
first(C) => {'c'}
first(B) => {'r', 'b'}
first(A') => {'d', '#'}
Calculated follows:
follow(S) => {'$'}
follow(A) => {'k'}
follow(A'') => {'k'}
follow(C) => {'d', 'c', 'k'}
follow(B) => {'d', 'c', 'k'}
follow(A') => {'k'}
Firsts and Follow Result table
Non-T FIRST FOLLOW
S {'a'} {'$'}
A {'a'} {'k'}
A'' {'c', 'r', 'b'} {'k'}
C {'c'} {'d', 'c', 'k'}
B {'r', 'b'} {'d', 'c', 'k'}
A' {'d', '#'} {'k'}
Generated parsing table:
k O d a c b r $
S S->A k O
A A->a A''
A'' A''->C A' A''->B A' A''->B A'
C C->c
B B->b B C B->r
A' A'-># A'->d A'
Validate String => a r k O
Buffer Stack Action
$ O k r a S $ T[S][a] = S->A k O
$ O k r a A k O $ T[A][a] = A->a A''
$ O k r a a A'' k O $ Matched:a
$ O k r A'' k O $ T[A''][r] = A''->B A'
$ O k r B A' k O $ T[B][r] = B->r
$ O k r r A' k O $ Matched:r
$ O k A' k O $ T[A'][k] = A'->#
$ O k k O $ Matched:k
$ O O $ Matched:O
$ $ Valid
Valid String!
Inference
Sample set 7 is used for the output shown above because it covers all aspects of LL(1) parsing. The grammar is printed after left-recursion removal and again after left factoring. After that come the FIRST and FOLLOW computation results. The parsing table is then generated, and if no cell Table[NT][T] contains more than one entry, we say the grammar is LL(1). Finally, the sample input string is validated using the stack-and-buffer procedure.