📜  Compiler Design LL(1) Parser in Python

📅  Last modified: 2022-05-13 01:54:34.862000             🧑  Author: Mango

Prerequisites: Construction of LL(1) Parsing Table, Classification of Top-Down Parsers, FIRST Set, FOLLOW Set

In this article, we will see how to design an LL(1) parser in Python.

LL(1) grammar

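In LL(1), the first "L" stands for scanning the input from left to right, the second "L" stands for producing a leftmost derivation, and the "1" stands for using one lookahead input symbol at each step to decide the parsing action. LL(1) grammars follow a top-down parsing approach. For the class of grammars called LL(1), we can construct a predictive parser, that is, a recursive-descent parser that needs no backtracking.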


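A grammar is left-recursive if it has a non-terminal A with a production of the form A → A α | β. Top-down parsing methods cannot handle left-recursive grammars, so a transformation is needed to eliminate the left recursion. Left recursion can be eliminated by rewriting the rule as follows (A' is a new non-terminal and ε denotes epsilon):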

A → β A'
A' → α A' | ε
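The rewrite above can be sketched in a few lines. This is a minimal illustration, not the article's implementation: the function name eliminate_immediate_left_recursion is hypothetical, productions are assumed to be lists of symbols, '#' stands for epsilon (as in the code further down), and the new non-terminal is assumed not to clash with an existing name.

```python
def eliminate_immediate_left_recursion(nt, productions):
    """Rewrite A -> A alpha | beta as A -> beta A', A' -> alpha A' | #.

    `productions` is a list of right-hand sides, each a list of symbols.
    Only immediate left recursion is handled in this sketch.
    """
    # alpha parts: what follows the leading A in left-recursive alternatives
    alphas = [rhs[1:] for rhs in productions if rhs and rhs[0] == nt]
    # beta parts: alternatives that do not start with A
    betas = [rhs for rhs in productions if not rhs or rhs[0] != nt]
    if not alphas:
        return {nt: productions}  # no left recursion, nothing to rewrite
    new_nt = nt + "'"  # assumption: this name is not already in use
    return {
        nt: [beta + [new_nt] for beta in betas],
        new_nt: [alpha + [new_nt] for alpha in alphas] + [["#"]],
    }


# E -> E + T | T   becomes   E -> T E',  E' -> + T E' | #
result = eliminate_immediate_left_recursion("E", [["E", "+", "T"], ["T"]])
print(result)
```

The two list comprehensions correspond directly to the α and β parts of the rule.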



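Left factoring is the companion transformation for alternatives that share a common prefix, where one lookahead symbol cannot decide between them. For example, if we have the grammar rule A → α β1 | α β2, it is rewritten as:
A → α A'
A' → β1 | β2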
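As a rough sketch of left factoring, the snippet below groups alternatives by their first symbol only, which mirrors the one-symbol prefix used by the code later in this article. The function name left_factor_once is hypothetical, '#' is assumed to denote epsilon, and fresh names are formed by appending apostrophes without checking for clashes.

```python
from collections import defaultdict


def left_factor_once(nt, productions):
    """Factor out a shared first symbol: A -> a X | a Y | k
    becomes A -> a A' | k and A' -> X | Y."""
    groups = defaultdict(list)
    for rhs in productions:
        groups[rhs[0]].append(rhs)  # group alternatives by leading symbol

    new_rules = {nt: []}
    fresh = nt
    for head, group in groups.items():
        if len(group) == 1:
            new_rules[nt].append(group[0])  # unique prefix, keep as is
        else:
            fresh += "'"  # assumption: the generated name is unused
            new_rules[nt].append([head, fresh])
            # an empty remainder becomes an epsilon alternative
            new_rules[fresh] = [rhs[1:] or ["#"] for rhs in group]
    return new_rules


# A -> a D F | a C V | k   becomes   A -> a A' | k,  A' -> D F | C V
factored = left_factor_once("A", [["a", "D", "F"], ["a", "C", "V"], ["k"]])
print(factored)
```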


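The construction of a top-down parser is aided by the FIRST and FOLLOW functions associated with a grammar G. During top-down parsing, FIRST and FOLLOW let us choose which production to apply based on the next input symbol.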



Note: Epsilon (ε) can never appear in the FOLLOW set of any non-terminal.
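One common way to compute both sets is to iterate until nothing changes. The sketch below uses that fixed-point style rather than the recursive style of the article's code. The names compute_first_follow and first_of_sequence are hypothetical, the grammar is assumed to be a dict mapping each non-terminal to a list of symbol lists, and '#' is assumed to denote epsilon.

```python
EPS = "#"


def first_of_sequence(symbols, first):
    """FIRST of a symbol sequence, given FIRST sets of the non-terminals."""
    out = set()
    for sym in symbols:
        sym_first = first.get(sym, {sym})  # a terminal is its own FIRST
        out |= sym_first - {EPS}
        if EPS not in sym_first:
            return out
    out.add(EPS)  # every symbol can vanish (or the sequence is empty)
    return out


def compute_first_follow(grammar, start):
    first = {nt: set() for nt in grammar}
    follow = {nt: set() for nt in grammar}
    follow[start].add("$")
    changed = True
    while changed:  # repeat until no set grows any more
        changed = False
        for nt, alternatives in grammar.items():
            for rhs in alternatives:
                f = first_of_sequence(rhs, first)
                if not f <= first[nt]:
                    first[nt] |= f
                    changed = True
                for i, sym in enumerate(rhs):
                    if sym not in grammar:
                        continue  # FOLLOW is defined for non-terminals only
                    tail = first_of_sequence(rhs[i + 1:], first)
                    add = tail - {EPS}  # epsilon never enters FOLLOW
                    if EPS in tail:
                        add |= follow[nt]
                    if not add <= follow[sym]:
                        follow[sym] |= add
                        changed = True
    return first, follow


grammar = {
    "E": [["T", "E'"]],
    "E'": [["+", "T", "E'"], ["#"]],
    "T": [["F", "T'"]],
    "T'": [["*", "F", "T'"], ["#"]],
    "F": [["(", "E", ")"], ["id"]],
}
first, follow = compute_first_follow(grammar, "E")
print(first)
print(follow)
```

The line that subtracts EPS before adding to a FOLLOW set is what enforces the note above.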


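After the parsing table is constructed, if any cell holds more than one production, that is, some non-terminal row has multiple rules under the same terminal column, the grammar is not LL(1). Otherwise, the grammar is LL(1).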
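The check can be illustrated with a small table builder. This is a sketch under assumptions: build_table and first_of_sequence are hypothetical names, the table is a dict keyed by (non-terminal, terminal) rather than the 2D list used later in the article, FIRST and FOLLOW are supplied by the caller, and '#' denotes epsilon.

```python
EPS = "#"


def first_of_sequence(symbols, first):
    """FIRST of a symbol sequence, given FIRST sets of the non-terminals."""
    out = set()
    for sym in symbols:
        sym_first = first.get(sym, {sym})  # a terminal is its own FIRST
        out |= sym_first - {EPS}
        if EPS not in sym_first:
            return out
    out.add(EPS)
    return out


def build_table(grammar, first, follow):
    """Return (table, is_ll1); a cell with two productions means not LL(1)."""
    table = {}
    is_ll1 = True
    for nt, alternatives in grammar.items():
        for rhs in alternatives:
            lookahead = first_of_sequence(rhs, first)
            if EPS in lookahead:
                # a nullable alternative is chosen on FOLLOW of its LHS
                lookahead = (lookahead - {EPS}) | follow[nt]
            for terminal in lookahead:
                cell = table.setdefault((nt, terminal), [])
                if rhs not in cell:
                    cell.append(rhs)
                if len(cell) > 1:
                    is_ll1 = False  # conflict: more than one rule in a cell
    return table, is_ll1


# S -> a S | b : one production per cell
ok_table, ok = build_table({"S": [["a", "S"], ["b"]]},
                           {"S": {"a", "b"}}, {"S": {"$"}})
# S -> a S | a : both alternatives land in cell (S, a)
bad_table, bad = build_table({"S": [["a", "S"], ["a"]]},
                             {"S": {"a"}}, {"S": {"$"}})
print(ok, bad)
```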




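  • The code consists of eight functions plus driver code, which together perform the computation. It takes the grammar rules as input. The user must specify which symbols are terminals (list: term_userdef) and which are non-terminals (list: nonterm_userdef).
  • For demonstration, the code is run on sample set 7, whose grammar contains left recursion, as the driver code shows. computeAllFirsts() is called first; it invokes removeLeftRecursion() and LeftFactoring(), which apply the transformations described above. It then calls first() on every production of every non-terminal. The recursive logic follows the FIRST computation rules.
  • The base case of first() is a terminal or epsilon at the front of the rule. For a non-terminal, the function recurses. If FIRST contains several symbols a list is returned, otherwise a single string, so the extra type checks in the code exist only to handle both return types safely.
  • FIRST must be computed before FOLLOW, because follow() calls first() repeatedly. start_symbol is the LHS symbol of the first rule in the 'rules' list (this can be changed). For the FOLLOW computation, computeAllFollows() is called, which calls follow() on every non-terminal. follow() is recursive; its base case is that a call on start_symbol contributes the $ symbol.
  • All remaining cases follow the FOLLOW computation rules. For the target non-terminal, every rule is traversed, the required FIRST is computed, and FOLLOW of the LHS is taken where needed. Intermediate results are accumulated during the traversal and the final FOLLOW list is returned at the end. Here too, an intermediate result may be a string or a list, so additional lines handle both types.
  • After FIRST and FOLLOW are computed, createParseTable() is called. It first prints FIRST and FOLLOW in a formatted table. It then prepares a 2D list named 'mat' for the parsing table, with non-terminals as rows, terminals as columns, and '$' as an extra column. The logic implements the rules for constructing the parsing table.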
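To make the input format concrete, here is a small sketch of how rule strings such as "A -> A d | a B" can be turned into the dictionary form the functions operate on. The helper name parse_rules is hypothetical; it assumes symbols are separated by spaces and alternatives by '|', as in the sample sets in the code.

```python
def parse_rules(rules):
    """Turn ["A -> x y | z", ...] into {"A": [["x", "y"], ["z"]], ...}."""
    grammar = {}
    for rule in rules:
        lhs, rhs = rule.split("->", 1)  # split only on the first arrow
        # each alternative becomes a list of whitespace-separated symbols
        grammar[lhs.strip()] = [alt.split() for alt in rhs.split("|")]
    return grammar


parsed = parse_rules(["S -> A k O",
                      "A -> A d | a B | a C",
                      "C -> c",
                      "B -> b B C | r"])
print(parsed)
```

Because Python dictionaries preserve insertion order, the first key is the LHS of the first rule, which is what the driver code relies on when it picks the start symbol.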


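  • Once the parsing table is generated, we validate the input string against the given grammar using a stack and a buffer. If the grammar is not LL(1), string validation cannot be performed. Both the stack and the buffer start with '$' as the bottom marker. The input string is placed in the buffer in reverse order on top of '$', and the START SYMBOL is then pushed on top of the stack.
  • We then iterate over the current state of the stack and buffer and look up the entry Table[x][y], where 'x' is the symbol on top of the stack and 'y' is the symbol at the rear of the buffer. We take the corresponding rule from the table and expand the non-terminal at the top of the stack (TOS) in the next iteration.
  • If x and y are the same terminal, we pop both from the stack and buffer and continue. If only the '$' symbol remains in both the stack and the buffer, all input symbols have been matched and the string belongs to the grammar. If the parsing table has no rule for the entry, or x and y are unequal terminals, the string does not belong to the grammar.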
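The loop described above can be sketched compactly. This is an illustration under assumptions, not the article's function: ll1_accepts is a hypothetical name, the table is a dict keyed by (non-terminal, terminal) holding one right-hand side per cell, the top of the stack is the end of the Python list, and '#' denotes epsilon.

```python
def ll1_accepts(table, start, nonterminals, tokens):
    """Table-driven LL(1) recognition with an explicit stack and buffer."""
    stack = ["$", start]             # top of stack = last element
    buffer = ["$"] + tokens[::-1]    # next input symbol = last element
    while True:
        top, look = stack[-1], buffer[-1]
        if top == "$" and look == "$":
            return True              # everything matched
        if top in nonterminals:
            rhs = table.get((top, look))
            if rhs is None:
                return False         # no rule in Table[top][look]
            stack.pop()
            # push the RHS reversed so its first symbol ends up on top;
            # an epsilon production pushes nothing
            stack.extend(sym for sym in reversed(rhs) if sym != "#")
        elif top == look:
            stack.pop()              # matching terminals are consumed
            buffer.pop()
        else:
            return False             # unequal terminals


# Grammar S -> a S b | #  with FOLLOW(S) = {b, $}
table = {
    ("S", "a"): ["a", "S", "b"],
    ("S", "b"): ["#"],
    ("S", "$"): ["#"],
}
print(ll1_accepts(table, "S", {"S"}, "a a b b".split()))
print(ll1_accepts(table, "S", {"S"}, "a b b".split()))
```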


# LL(1) parser code in python
def removeLeftRecursion(rulesDiction):
    # for rule: A->Aa|b
    # result: A->bA',A'->aA'|#
    # 'store' has new rules to be added
    store = {}
    # traverse over rules
    for lhs in rulesDiction:
        # alphaRules stores subrules with left-recursion
        # betaRules stores subrules without left-recursion
        alphaRules = []
        betaRules = []
        # get rhs for current lhs
        allrhs = rulesDiction[lhs]
        for subrhs in allrhs:
            if subrhs[0] == lhs:
                alphaRules.append(subrhs[1:])
            else:
                betaRules.append(subrhs)
        # alpha and beta containing subrules are separated
        # now form two new rules
        if len(alphaRules) != 0:
            # to generate new unique symbol
            # add ' till unique not generated
            lhs_ = lhs + "'"
            while (lhs_ in rulesDiction.keys()) \
                    or (lhs_ in store.keys()):
                lhs_ += "'"
            # make beta rule
            for b in range(0, len(betaRules)):
                betaRules[b].append(lhs_)
            rulesDiction[lhs] = betaRules
            # make alpha rule
            for a in range(0, len(alphaRules)):
                alphaRules[a].append(lhs_)
            alphaRules.append(['#'])
            # store in temp dict, append to
            # - rulesDiction at end of traversal
            store[lhs_] = alphaRules
    # add newly generated rules generated
    # - after removing left recursion
    for left in store:
        rulesDiction[left] = store[left]
    return rulesDiction
def LeftFactoring(rulesDiction):
    # for rule: A->aDF|aCV|k
    # result: A->aA'|k, A'->DF|CV
    # newDict stores newly generated
    # - rules after left factoring
    newDict = {}
    # iterate over all rules of dictionary
    for lhs in rulesDiction:
        # get rhs for given lhs
        allrhs = rulesDiction[lhs]
        # temp dictionary helps detect left factoring
        temp = dict()
        for subrhs in allrhs:
            if subrhs[0] not in list(temp.keys()):
                temp[subrhs[0]] = [subrhs]
            else:
                temp[subrhs[0]].append(subrhs)
        # if value list count for any key in temp is > 1,
        # - it has left factoring
        # new_rule stores new subrules for current LHS symbol
        new_rule = []
        # tempo_dict stores new subrules for left factoring
        tempo_dict = {}
        for term_key in temp:
            # get value from temp for term_key
            allStartingWithTermKey = temp[term_key]
            if len(allStartingWithTermKey) > 1:
                # left factoring required
                # to generate new unique symbol
                # - add ' till unique not generated
                lhs_ = lhs + "'"
                while (lhs_ in rulesDiction.keys()) \
                        or (lhs_ in tempo_dict.keys()):
                    lhs_ += "'"
                # append the left factored result
                new_rule.append([term_key, lhs_])
                # add expanded rules to tempo_dict
                ex_rules = []
                for g in temp[term_key]:
                    ex_rules.append(g[1:])
                tempo_dict[lhs_] = ex_rules
            else:
                # no left factoring required
                new_rule.append(allStartingWithTermKey[0])
        # add original rule
        newDict[lhs] = new_rule
        # add newly generated rules after left factoring
        for key in tempo_dict:
            newDict[key] = tempo_dict[key]
    return newDict
# calculation of first
# epsilon is denoted by '#' (hash)
# pass rule in first function
def first(rule):
    global rules, nonterm_userdef, \
        term_userdef, diction, firsts
    # recursion base condition
    # (for terminal or epsilon)
    if len(rule) != 0 and (rule is not None):
        if rule[0] in term_userdef:
            return rule[0]
        elif rule[0] == '#':
            return '#'
    # condition for Non-Terminals
    if len(rule) != 0:
        if rule[0] in list(diction.keys()):
            # fres temporary list of result
            fres = []
            rhs_rules = diction[rule[0]]
            # call first on each rule of RHS
            # fetched (& take union)
            for itr in rhs_rules:
                indivRes = first(itr)
                if type(indivRes) is list:
                    for i in indivRes:
                        fres.append(i)
                else:
                    fres.append(indivRes)
            # if no epsilon in result
            # - received return fres
            if '#' not in fres:
                return fres
            else:
                # apply epsilon
                # rule => f(ABC)=f(A)-{e} U f(BC)
                newList = []
                fres.remove('#')
                if len(rule) > 1:
                    ansNew = first(rule[1:])
                    if ansNew != None:
                        if type(ansNew) is list:
                            newList = fres + ansNew
                        else:
                            newList = fres + [ansNew]
                    else:
                        newList = fres
                    return newList
                # if result is not already returned
                # - control reaches here
                # lastly if epsilon still persists
                # - keep it in result of first
                fres.append('#')
                return fres
# calculation of follow
# use 'rules' list, and 'diction' dict from above
# follow function input is the split result on
# - Non-Terminal whose Follow we want to compute
def follow(nt):
    global start_symbol, rules, nonterm_userdef, \
        term_userdef, diction, firsts, follows
    # for start symbol return $ (recursion base case)
    solset = set()
    if nt == start_symbol:
        # return '$'
        solset.add('$')
    # check all occurrences
    # solset - is result of computed 'follow' so far
    # For input, check in all rules
    for curNT in diction:
        rhs = diction[curNT]
        # go for all productions of NT
        for subrule in rhs:
            if nt in subrule:
                # call for all occurrences on
                # - non-terminal in subrule
                while nt in subrule:
                    index_nt = subrule.index(nt)
                    subrule = subrule[index_nt + 1:]
                    # empty condition - call follow on LHS
                    if len(subrule) != 0:
                        # compute first if symbols on
                        # - RHS of target Non-Terminal exists
                        res = first(subrule)
                        # if epsilon in result apply rule
                        # - (A->aBX)- follow of -
                        # - follow(B)=(first(X)-{ep}) U follow(A)
                        if '#' in res:
                            newList = []
                            res.remove('#')
                            ansNew = follow(curNT)
                            if ansNew != None:
                                if type(ansNew) is list:
                                    newList = res + ansNew
                                else:
                                    newList = res + [ansNew]
                            else:
                                newList = res
                            res = newList
                    else:
                        # when nothing in RHS, go circular
                        # - and take follow of LHS
                        # only if (NT in LHS)!=curNT
                        # reset res so no stale or undefined value is used
                        res = None
                        if nt != curNT:
                            res = follow(curNT)
                    # add follow result in set form
                    if res is not None:
                        if type(res) is list:
                            for g in res:
                                solset.add(g)
                        else:
                            solset.add(res)
    return list(solset)
def computeAllFirsts():
    global rules, nonterm_userdef, \
        term_userdef, diction, firsts
    for rule in rules:
        k = rule.split("->")
        # remove un-necessary spaces
        k[0] = k[0].strip()
        k[1] = k[1].strip()
        rhs = k[1]
        multirhs = rhs.split('|')
        # remove un-necessary spaces
        for i in range(len(multirhs)):
            multirhs[i] = multirhs[i].strip()
            multirhs[i] = multirhs[i].split()
        diction[k[0]] = multirhs
    print(f"\nRules: \n")
    for y in diction:
        print(f"{y}->{diction[y]}")
    print(f"\nAfter elimination of left recursion:\n")
    diction = removeLeftRecursion(diction)
    for y in diction:
        print(f"{y}->{diction[y]}")
    print("\nAfter left factoring:\n")
    diction = LeftFactoring(diction)
    for y in diction:
        print(f"{y}->{diction[y]}")
    # calculate first for each rule
    # - (call first() on all RHS)
    for y in list(diction.keys()):
        t = set()
        for sub in diction.get(y):
            res = first(sub)
            if res != None:
                if type(res) is list:
                    for u in res:
                        t.add(u)
                else:
                    t.add(res)
        # save result in 'firsts' list
        firsts[y] = t
    print("\nCalculated firsts: ")
    key_list = list(firsts.keys())
    index = 0
    for gg in firsts:
        print(f"first({key_list[index]}) "
              f"=> {firsts.get(gg)}")
        index += 1
def computeAllFollows():
    global start_symbol, rules, nonterm_userdef,\
        term_userdef, diction, firsts, follows
    for NT in diction:
        solset = set()
        sol = follow(NT)
        if sol is not None:
            for g in sol:
                solset.add(g)
        follows[NT] = solset
    print("\nCalculated follows: ")
    key_list = list(follows.keys())
    index = 0
    for gg in follows:
        print(f"follow({key_list[index]})"
              f" => {follows[gg]}")
        index += 1
# create parse table
def createParseTable():
    import copy
    global diction, firsts, follows, term_userdef
    print("\nFirsts and Follow Result table\n")
    # find space size
    mx_len_first = 0
    mx_len_fol = 0
    for u in diction:
        k1 = len(str(firsts[u]))
        k2 = len(str(follows[u]))
        if k1 > mx_len_first:
            mx_len_first = k1
        if k2 > mx_len_fol:
            mx_len_fol = k2
    print(f"{{:<{10}}} "
          f"{{:<{mx_len_first + 5}}} "
          f"{{:<{mx_len_fol + 5}}}"
          .format("Non-T", "FIRST", "FOLLOW"))
    for u in diction:
        print(f"{{:<{10}}} "
              f"{{:<{mx_len_first + 5}}} "
              f"{{:<{mx_len_fol + 5}}}"
              .format(u, str(firsts[u]), str(follows[u])))
    # create matrix of row(NT) x [col(T) + 1($)]
    # create list of non-terminals
    ntlist = list(diction.keys())
    terminals = copy.deepcopy(term_userdef)
    # '$' forms the extra column
    terminals.append('$')
    # create the initial empty state of matrix
    mat = []
    for x in diction:
        row = []
        for y in terminals:
            row.append('')
        # of $ append one more col
        mat.append(row)
    # Classifying grammar as LL(1) or not LL(1)
    grammar_is_LL = True
    # rules implementation
    for lhs in diction:
        rhs = diction[lhs]
        for y in rhs:
            res = first(y)
            # epsilon is present,
            # - take union with follow
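            if '#' in res:
                if type(res) == str:
                    firstFollow = []
                    fol_op = follows[lhs]
                    if type(fol_op) is str:
                        firstFollow.append(fol_op)
                    else:
                        for u in fol_op:
                            firstFollow.append(u)
                    res = firstFollow
                else:
                    res.remove('#')
                    res = list(res) +\
                          list(follows[lhs])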
            # add rules to table
            ttemp = []
            if type(res) is str:
                ttemp.append(res)
                res = copy.deepcopy(ttemp)
            for c in res:
                xnt = ntlist.index(lhs)
                yt = terminals.index(c)
                if mat[xnt][yt] == '':
                    mat[xnt][yt] = mat[xnt][yt] \
                                   + f"{lhs}->{' '.join(y)}"
                else:
                    # if rule already present
                    if f"{lhs}->{' '.join(y)}" in mat[xnt][yt]:
                        continue
                    else:
                        grammar_is_LL = False
                        mat[xnt][yt] = mat[xnt][yt] \
                                       + f",{lhs}->{' '.join(y)}"
    # final state of parse table
    print("\nGenerated parsing table:\n")
    frmt = "{:>12}" * len(terminals)
    print(frmt.format(*terminals))
    j = 0
    for y in mat:
        frmt1 = "{:>12}" * len(y)
        print(f"{ntlist[j]} {frmt1.format(*y)}")
        j += 1
    return (mat, grammar_is_LL, terminals)
def validateStringUsingStackBuffer(parsing_table, grammarll1,
                                   table_term_list, input_string,
                                   term_userdef, start_symbol):
    print(f"\nValidate String => {input_string}\n")
    # for more than one entries
    # - in one cell of parsing table
    if grammarll1 == False:
        return f"\nInput String = " \
               f"\"{input_string}\"\n" \
               f"Grammar is not LL(1)"
    # implementing stack buffer
    stack = [start_symbol, '$']
    buffer = []
    # reverse input string store in buffer
    input_string = input_string.split()
    input_string.reverse()
    buffer = ['$'] + input_string
    print("{:>20} {:>20} {:>20}".
          format("Buffer", "Stack","Action"))
    while True:
        # end loop if all symbols matched
        if stack == ['$'] and buffer == ['$']:
            print("{:>20} {:>20} {:>20}"
                  .format(' '.join(buffer),
                          ' '.join(stack),
                          "Valid"))
            return "\nValid String!"
        elif stack[0] not in term_userdef:
            # take front of buffer (y) and tos (x)
            x = list(diction.keys()).index(stack[0])
            y = table_term_list.index(buffer[-1])
            if parsing_table[x][y] != '':
                # format table entry received
                entry = parsing_table[x][y]
                print("{:>20} {:>20} {:>25}".
                      format(' '.join(buffer),
                             ' '.join(stack),
                             f"T[{stack[0]}][{buffer[-1]}] = {entry}"))
                lhs_rhs = entry.split("->")
                lhs_rhs[1] = lhs_rhs[1].replace('#', '').strip()
                entryrhs = lhs_rhs[1].split()
                stack = entryrhs + stack[1:]
            else:
                return f"\nInvalid String! No rule at " \
                       f"Table[{stack[0]}][{buffer[-1]}]."
        else:
            # stack top is Terminal
            if stack[0] == buffer[-1]:
                print("{:>20} {:>20} {:>20}"
                      .format(' '.join(buffer),
                              ' '.join(stack),
                              f"Matched:{stack[0]}"))
                buffer = buffer[:-1]
                stack = stack[1:]
            else:
                return "\nInvalid String! " \
                       "Unmatched terminal symbols"
# NOTE: To test any of the sample sets, uncomment ->
# 'rules' list, 'nonterm_userdef' list, 'term_userdef' list
# and for any String validation uncomment following line with
# 'sample_input_String' variable.
sample_input_string = None
# sample set 1 (Result: Not LL(1))
# rules=["A -> S B | B",
#        "S -> a | B c | #",
#        "B -> b | d"]
# nonterm_userdef=['A','S','B']
# term_userdef=['a','c','b','d']
# sample_input_string="b c b"
# sample set 2 (Result: LL(1))
# rules=["S -> A | B C",
#        "A -> a | b",
#        "B -> p | #",
#        "C -> c"]
# nonterm_userdef=['A','S','B','C']
# term_userdef=['a','c','b','p']
# sample_input_string="p c"
# sample set 3 (Result: LL(1))
# rules=["S -> A B | C",
#        "A -> a | b | #",
#        "B-> p | #",
#        "C -> c"]
# nonterm_userdef=['A','S','B','C']
# term_userdef=['a','c','b','p']
# sample_input_string="a c b"
# sample set 4 (Result: Not LL(1))
# rules = ["S -> A B C | C",
#          "A -> a | b B | #",
#          "B -> p | #",
#         "C -> c"]
# nonterm_userdef=['A','S','B','C']
# term_userdef=['a','c','b','p']
# sample_input_string="b p p c"
# sample set 5 (With left recursion)
# rules=["A -> B C c | g D B",
#        "B -> b C D E | #",
#        "C -> D a B | c a",
#        "D -> # | d D",
#        "E -> E a f | c"
#       ]
# nonterm_userdef=['A','B','C','D','E']
# term_userdef=["a","b","c","d","f","g"]
# sample_input_string="b a c a c"
# sample set 6
# rules=["E -> T E'",
#        "E' -> + T E' | #",
#        "T -> F T'",
#        "T' -> * F T' | #",
#        "F -> ( E ) | id"
# ]
# nonterm_userdef=['E','E\'','F','T','T\'']
# term_userdef=['id','+','*','(',')']
# sample_input_string="id * * id"
# example string 1
# sample_input_string="( id * id )"
# example string 2
# sample_input_string="( id ) * id + id"
# sample set 7 (left factoring & recursion present)
rules=["S -> A k O",
       "A -> A d | a B | a C",
       "C -> c",
       "B -> b B C | r"]
nonterm_userdef=['S','A','C','B']
term_userdef=['k','O','d','a','c','b','r']
sample_input_string="a r k O"
# sample set 8 (Multiple char symbols T & NT)
# rules = ["S -> NP VP",
#          "NP -> P | PN | D N",
#          "VP -> V NP",
#          "N -> championship | ball | toss",
#          "V -> is | want | won | played",
#          "P -> me | I | you",
#          "PN -> India | Australia | Steve | John",
#          "D -> the | a | an"]
# nonterm_userdef = ['S', 'NP', 'VP', 'N', 'V', 'P', 'PN', 'D']
# term_userdef = ["championship", "ball", "toss", "is", "want",
#                 "won", "played", "me", "I", "you", "India",
#                 "Australia","Steve", "John", "the", "a", "an"]
# sample_input_string = "India won the championship"
# diction - store rules inputed
# firsts - store computed firsts
diction = {}
firsts = {}
follows = {}
# computes all FIRSTs for all non terminals
computeAllFirsts()
# assuming first rule has start_symbol
# start symbol can be modified in below line of code
start_symbol = list(diction.keys())[0]
# computes all FOLLOWs for all occurrences
computeAllFollows()
# generate formatted first and follow table
# then generate parse table
(parsing_table, result, tabTerm) = createParseTable()
# validate string input using stack-buffer concept
if sample_input_string != None:
    validity = validateStringUsingStackBuffer(parsing_table, result,
                                              tabTerm, sample_input_string,
                                              term_userdef, start_symbol)
    print(validity)
else:
    print("\nNo input String detected")
# Author: Tanmay P. Bisen


Rules: 

S->[['A', 'k', 'O']]
A->[['A', 'd'], ['a', 'B'], ['a', 'C']]
C->[['c']]
B->[['b', 'B', 'C'], ['r']]

After elimination of left recursion:

S->[['A', 'k', 'O']]
A->[['a', 'B', "A'"], ['a', 'C', "A'"]]
C->[['c']]
B->[['b', 'B', 'C'], ['r']]
A'->[['d', "A'"], ['#']]

After left factoring:

S->[['A', 'k', 'O']]
A->[['a', "A''"]]
A''->[['B', "A'"], ['C', "A'"]]
C->[['c']]
B->[['b', 'B', 'C'], ['r']]
A'->[['d', "A'"], ['#']]

Calculated firsts: 
first(S) => {'a'}
first(A) => {'a'}
first(A'') => {'c', 'r', 'b'}
first(C) => {'c'}
first(B) => {'r', 'b'}
first(A') => {'d', '#'}

Calculated follows: 
follow(S) => {'$'}
follow(A) => {'k'}
follow(A'') => {'k'}
follow(C) => {'d', 'c', 'k'}
follow(B) => {'d', 'c', 'k'}
follow(A') => {'k'}

Firsts and Follow Result table

Non-T      FIRST                FOLLOW              
S          {'a'}                {'$'}               
A          {'a'}                {'k'}               
A''        {'c', 'r', 'b'}      {'k'}               
C          {'c'}                {'d', 'c', 'k'}     
B          {'r', 'b'}           {'d', 'c', 'k'}     
A'         {'d', '#'}           {'k'}               

Generated parsing table:

           k           O           d           a           c           b           r           $
S                                         S->A k O                                                
A                                         A->a A''                                                
A''                                                    A''->C A'   A''->B A'   A''->B A'            
C                                                         C->c                                    
B                                                                 B->b B C        B->r            
A'        A'->#                A'->d A'                                                            

Validate String => a r k O

              Buffer                Stack               Action
           $ O k r a                  S $        T[S][a] = S->A k O
           $ O k r a              A k O $        T[A][a] = A->a A''
           $ O k r a          a A'' k O $            Matched:a
             $ O k r            A'' k O $     T[A''][r] = A''->B A'
             $ O k r           B A' k O $            T[B][r] = B->r
             $ O k r           r A' k O $            Matched:r
               $ O k             A' k O $          T[A'][k] = A'->#
               $ O k                k O $            Matched:k
                 $ O                  O $            Matched:O
                   $                    $                Valid

Valid String!


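Sample set 7 is used to show the code output because it covers every aspect of LL(1) parsing. The grammar is printed after left-recursion removal and again after left factoring, followed by the computed FIRST and FOLLOW sets. The parsing table is then generated; if no cell (Table[NT][T]) holds more than one entry, the grammar is LL(1). Finally, the sample input string is validated using the stack and buffer.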