从正则式到LR分析表：手把手复现一个简易词法&语法分析器（Python实现）

写手一条城

234人浏览 · 2026-06-10 12:33:29

写手一条城 · 2026-06-10 12:33:29 发布

从正则式到LR分析表：手把手复现一个简易词法&语法分析器（Python实现）

在编程语言的世界里，编译原理就像是一座连接人类思维与机器执行的桥梁。而这座桥梁的第一道关卡，就是词法分析和语法分析——它们如同语言的"解码器"，将我们编写的代码转化为计算机能够理解的结构。本文将以Python为工具，带你从零实现一个完整的编译前端，涵盖从正则表达式到LR分析表的核心算法。

1. 词法分析器构建基础

词法分析器的核心任务是将字符流转化为有意义的词素（token）。这个过程就像英语阅读时把字母组合成单词，我们需要先定义编程语言的基本词汇规则。

1.1 定义词法规则

我们先为简易语言定义几种基本token类型：

from enum import Enum

class TokenType(Enum):
    IDENTIFIER = 1      # 标识符
    NUMBER = 2          # 数字
    KEYWORD = 3         # 关键字
    OPERATOR = 4        # 运算符
    DELIMITER = 5       # 分隔符
    EOF = 6             # 文件结束

对应的正则表达式规则可以表示为：

Token类型	正则模式	示例
IDENTIFIER	`[a-zA-Z_]\w*`	`count`
NUMBER	`\d+`	`42`
KEYWORD	`if	else
OPERATOR	`[+\-*/=<>]`	`+=`
DELIMITER	`[();{},]`	`(`

1.2 从正则到NFA的转换

正则表达式到NFA的转换有一套系统化的规则。以下是将正则片段 (a|b)*abb 转换为NFA的Python实现：

class NFAState:
    def __init__(self):
        self.transitions = {}  # 字符到状态集合的映射
        self.epsilon_trans = set()  # ε转移

def regex_to_nfa(pattern):
    # 实现正则到NFA的转换
    stack = []
    # 这里简化处理，实际需要完整实现Thompson算法
    for char in pattern:
        if char == '|':
            right = stack.pop()
            left = stack.pop()
            new_nfa = _handle_union(left, right)
            stack.append(new_nfa)
        elif char == '*':
            operand = stack.pop()
            new_nfa = _handle_kleene_star(operand)
            stack.append(new_nfa)
        else:
            stack.append(_create_basic_nfa(char))
    return stack[0] if stack else None

提示：NFA的状态转换可以用图结构表示，每个状态记录其出边和转移条件。

2. NFA到DFA的转换与优化

NFA虽然直观，但效率较低。我们需要通过子集构造法将其转换为确定性的DFA。

2.1 子集构造法实现

def nfa_to_dfa(nfa_start):
    dfa_states = {}
    initial_closure = epsilon_closure({nfa_start})
    dfa_start = DFAState(initial_closure)
    unmarked_states = [dfa_start]
    
    while unmarked_states:
        current_dfa = unmarked_states.pop()
        for symbol in get_alphabet():
            nfa_states = move(current_dfa.nfa_states, symbol)
            if not nfa_states:
                continue
            new_closure = epsilon_closure(nfa_states)
            existing_dfa = find_dfa_state(dfa_states, new_closure)
            if not existing_dfa:
                existing_dfa = DFAState(new_closure)
                dfa_states[frozenset(new_closure)] = existing_dfa
                unmarked_states.append(existing_dfa)
            current_dfa.transitions[symbol] = existing_dfa
    return dfa_start

2.2 DFA最小化算法

DFA最小化通过合并等价状态来优化自动机：

初始划分：将状态分为接受态和非接受态
不断细分：对每个分区，检查同一符号转换后是否仍在同一分区
合并不可区分状态

def minimize_dfa(dfa):
    partitions = [set(dfa.accept_states), 
                 set(dfa.states) - set(dfa.accept_states)]
    changed = True
    while changed:
        changed = False
        new_partitions = []
        for partition in partitions:
            split_dict = {}
            for state in partition:
                key = tuple(get_partition_index(partitions, state.transitions[s]) 
                          for s in alphabet)
                if key not in split_dict:
                    split_dict[key] = set()
                split_dict[key].add(state)
            if len(split_dict) > 1:
                changed = True
            new_partitions.extend(split_dict.values())
        partitions = new_partitions
    return build_minimized_dfa(partitions)

3. LR分析表生成原理

LR分析是自底向上语法分析的主流方法，其核心是构造LR分析表。

3.1 LR(0)项集族构造

LR(0)项的形式为 A → α·β ，点号表示当前分析位置。以下是项集闭包的计算：

def closure(items, grammar):
    closure_set = set(items)
    changed = True
    while changed:
        changed = False
        for item in list(closure_set):
            dot_pos = item.dot_position
            if dot_pos < len(item.production.rhs):
                next_symbol = item.production.rhs[dot_pos]
                if next_symbol in grammar.nonterminals:
                    for prod in grammar.productions_for(next_symbol):
                        new_item = LR0Item(prod, 0)
                        if new_item not in closure_set:
                            closure_set.add(new_item)
                            changed = True
    return frozenset(closure_set)

3.2 SLR分析表生成

SLR在LR(0)基础上利用FOLLOW集解决冲突：

def build_slr_table(grammar):
    canonical_collection = build_canonical_lr0_items(grammar)
    action_table = {}
    goto_table = {}
    
    for i, state in enumerate(canonical_collection):
        for item in state:
            if item.at_end():
                # 规约项目
                if item.production.lhs == grammar.augmented_start:
                    action_table[(i, '$')] = ('accept',)
                else:
                    for terminal in grammar.follow(item.production.lhs):
                        action_table[(i, terminal)] = ('reduce', item.production)
            else:
                next_symbol = item.next_symbol()
                next_state = goto(state, next_symbol, grammar)
                if next_symbol in grammar.terminals:
                    action_table[(i, next_symbol)] = ('shift', canonical_collection.index(next_state))
                else:
                    goto_table[(i, next_symbol)] = canonical_collection.index(next_state)
    return action_table, goto_table

4. 完整分析器实现与测试

现在我们将词法分析和语法分析模块整合，构建完整的分析流程。

4.1 分析器架构设计

class CompilerFrontend:
    def __init__(self, grammar):
        self.lexer = Lexer(grammar.token_rules)
        self.grammar = grammar
        self.action_table, self.goto_table = build_slr_table(grammar)
    
    def parse(self, source_code):
        tokens = self.lexer.tokenize(source_code)
        stack = [0]  # 初始状态
        cursor = 0
        
        while True:
            state = stack[-1]
            current_token = tokens[cursor] if cursor < len(tokens) else Token('$', TokenType.EOF)
            action = self.action_table.get((state, current_token.type))
            
            if not action:
                raise SyntaxError(f"Unexpected token {current_token}")
                
            if action[0] == 'shift':
                stack.append(action[1])
                cursor += 1
            elif action[0] == 'reduce':
                prod = action[1]
                for _ in range(len(prod.rhs)):
                    stack.pop()
                state = stack[-1]
                stack.append(self.goto_table[(state, prod.lhs)])
            elif action[0] == 'accept':
                return True

4.2 测试案例与调试

我们定义一个简单的算术表达式文法进行测试：

E → E + T | T
T → T * F | F
F → ( E ) | id

测试输入 id + id * id 的解析过程：

栈状态	输入	动作
[0]	id + id * id $	shift 5
[0, 5]	+ id * id $	reduce F → id
[0, 3]	+ id * id $	reduce T → F
[0, 2]	+ id * id $	reduce E → T
[0, 1]	+ id * id $	shift 6
[0, 1, 6]	id * id $	shift 5
[0, 1, 6, 5]	* id $	reduce F → id
[0, 1, 6, 3]	* id $	reduce T → F
[0, 1, 6, 9]	* id $	shift 7
[0, 1, 6, 9, 7]	id $	shift 5
[0, 1, 6, 9, 7, 5]	$	reduce F → id
[0, 1, 6, 9, 7, 10]	$	reduce T → T * F
[0, 1, 6, 9]	$	reduce E → E + T
[0, 1]	$	accept

在实现过程中，常见的调试技巧包括：

打印每个步骤的栈状态和输入
可视化DFA和LR项集族
对冲突项进行优先级和结合性调整

通过这个项目，我们不仅实现了编译原理的核心算法，更重要的是理解了如何将形式语言理论转化为实际可运行的代码。这种从理论到实践的转化能力，正是高级开发者区别于初学者的关键所在。

亚马逊云科技技术品牌专区

更多推荐

AI Agent 面试题 730：Agent安全的全生命周期管理和持续改进

对齐技术是 AI Agent 技术体系中的重要组成部分。简单来说，它涉及到 Agent 如何在 Agent安全与对齐层面实现智能化的行为和决策。在实际应用中，对齐技术的核心目标是让 Agent 能够更加高效、准确地完成特定任务。这需要我们深入理解其底层原理和实现机制。从学术角度来看，对齐技术的研究可以追溯到人工智能的早期阶段。早在 1950 年代，Alan Turing 就提出了关于机器智

亚马逊云科技技术品牌专区

AI Agent 面试题 711：Agent的Prompt注入防御的实时监控和告警

Prompt 注入攻击与防御是 AI Agent 技术体系中的重要组成部分。简单来说，它涉及到 Agent 如何在 Agent安全与对齐层面实现智能化的行为和决策。在实际应用中，Prompt 注入攻击与防御的核心目标是让 Agent 能够更加高效、准确地完成特定任务。这需要我们深入理解其底层原理和实现机制。从学术角度来看，Prompt 注入攻击与防御的研究可以追溯到人工智能的早期阶段。早在