从正则式到LR分析表:手把手复现一个简易词法&语法分析器(Python实现)
·
从正则式到LR分析表:手把手复现一个简易词法&语法分析器(Python实现)
在编程语言的世界里,编译原理就像是一座连接人类思维与机器执行的桥梁。而这座桥梁的第一道关卡,就是词法分析和语法分析——它们如同语言的"解码器",将我们编写的代码转化为计算机能够理解的结构。本文将以Python为工具,带你从零实现一个完整的编译前端,涵盖从正则表达式到LR分析表的核心算法。
1. 词法分析器构建基础
词法分析器的核心任务是将字符流转化为有意义的词素(token)。这个过程就像英语阅读时把字母组合成单词,我们需要先定义编程语言的基本词汇规则。
1.1 定义词法规则
我们先为简易语言定义几种基本token类型:
from enum import Enum
class TokenType(Enum):
IDENTIFIER = 1 # 标识符
NUMBER = 2 # 数字
KEYWORD = 3 # 关键字
OPERATOR = 4 # 运算符
DELIMITER = 5 # 分隔符
EOF = 6 # 文件结束
对应的正则表达式规则可以表示为:
| Token类型 | 正则模式 | 示例 |
|---|---|---|
| IDENTIFIER | [a-zA-Z_]\w* |
count |
| NUMBER | \d+ |
42 |
| KEYWORD | `if | else |
| OPERATOR | [+\-*/=<>] |
+= |
| DELIMITER | [();{},] |
( |
1.2 从正则到NFA的转换
正则表达式到NFA的转换有一套系统化的规则。以下是将正则片段 (a|b)*abb 转换为NFA的Python实现:
class NFAState:
def __init__(self):
self.transitions = {} # 字符到状态集合的映射
self.epsilon_trans = set() # ε转移
def regex_to_nfa(pattern):
# 实现正则到NFA的转换
stack = []
# 这里简化处理,实际需要完整实现Thompson算法
for char in pattern:
if char == '|':
right = stack.pop()
left = stack.pop()
new_nfa = _handle_union(left, right)
stack.append(new_nfa)
elif char == '*':
operand = stack.pop()
new_nfa = _handle_kleene_star(operand)
stack.append(new_nfa)
else:
stack.append(_create_basic_nfa(char))
return stack[0] if stack else None
提示:NFA的状态转换可以用图结构表示,每个状态记录其出边和转移条件。
2. NFA到DFA的转换与优化
NFA虽然直观,但效率较低。我们需要通过子集构造法将其转换为确定性的DFA。
2.1 子集构造法实现
def nfa_to_dfa(nfa_start):
dfa_states = {}
initial_closure = epsilon_closure({nfa_start})
dfa_start = DFAState(initial_closure)
unmarked_states = [dfa_start]
while unmarked_states:
current_dfa = unmarked_states.pop()
for symbol in get_alphabet():
nfa_states = move(current_dfa.nfa_states, symbol)
if not nfa_states:
continue
new_closure = epsilon_closure(nfa_states)
existing_dfa = find_dfa_state(dfa_states, new_closure)
if not existing_dfa:
existing_dfa = DFAState(new_closure)
dfa_states[frozenset(new_closure)] = existing_dfa
unmarked_states.append(existing_dfa)
current_dfa.transitions[symbol] = existing_dfa
return dfa_start
2.2 DFA最小化算法
DFA最小化通过合并等价状态来优化自动机:
- 初始划分:将状态分为接受态和非接受态
- 不断细分:对每个分区,检查同一符号转换后是否仍在同一分区
- 合并不可区分状态
def minimize_dfa(dfa):
partitions = [set(dfa.accept_states),
set(dfa.states) - set(dfa.accept_states)]
changed = True
while changed:
changed = False
new_partitions = []
for partition in partitions:
split_dict = {}
for state in partition:
key = tuple(get_partition_index(partitions, state.transitions[s])
for s in alphabet)
if key not in split_dict:
split_dict[key] = set()
split_dict[key].add(state)
if len(split_dict) > 1:
changed = True
new_partitions.extend(split_dict.values())
partitions = new_partitions
return build_minimized_dfa(partitions)
3. LR分析表生成原理
LR分析是自底向上语法分析的主流方法,其核心是构造LR分析表。
3.1 LR(0)项集族构造
LR(0)项的形式为 A → α·β ,点号表示当前分析位置。以下是项集闭包的计算:
def closure(items, grammar):
closure_set = set(items)
changed = True
while changed:
changed = False
for item in list(closure_set):
dot_pos = item.dot_position
if dot_pos < len(item.production.rhs):
next_symbol = item.production.rhs[dot_pos]
if next_symbol in grammar.nonterminals:
for prod in grammar.productions_for(next_symbol):
new_item = LR0Item(prod, 0)
if new_item not in closure_set:
closure_set.add(new_item)
changed = True
return frozenset(closure_set)
3.2 SLR分析表生成
SLR在LR(0)基础上利用FOLLOW集解决冲突:
def build_slr_table(grammar):
canonical_collection = build_canonical_lr0_items(grammar)
action_table = {}
goto_table = {}
for i, state in enumerate(canonical_collection):
for item in state:
if item.at_end():
# 规约项目
if item.production.lhs == grammar.augmented_start:
action_table[(i, '$')] = ('accept',)
else:
for terminal in grammar.follow(item.production.lhs):
action_table[(i, terminal)] = ('reduce', item.production)
else:
next_symbol = item.next_symbol()
next_state = goto(state, next_symbol, grammar)
if next_symbol in grammar.terminals:
action_table[(i, next_symbol)] = ('shift', canonical_collection.index(next_state))
else:
goto_table[(i, next_symbol)] = canonical_collection.index(next_state)
return action_table, goto_table
4. 完整分析器实现与测试
现在我们将词法分析和语法分析模块整合,构建完整的分析流程。
4.1 分析器架构设计
class CompilerFrontend:
def __init__(self, grammar):
self.lexer = Lexer(grammar.token_rules)
self.grammar = grammar
self.action_table, self.goto_table = build_slr_table(grammar)
def parse(self, source_code):
tokens = self.lexer.tokenize(source_code)
stack = [0] # 初始状态
cursor = 0
while True:
state = stack[-1]
current_token = tokens[cursor] if cursor < len(tokens) else Token('$', TokenType.EOF)
action = self.action_table.get((state, current_token.type))
if not action:
raise SyntaxError(f"Unexpected token {current_token}")
if action[0] == 'shift':
stack.append(action[1])
cursor += 1
elif action[0] == 'reduce':
prod = action[1]
for _ in range(len(prod.rhs)):
stack.pop()
state = stack[-1]
stack.append(self.goto_table[(state, prod.lhs)])
elif action[0] == 'accept':
return True
4.2 测试案例与调试
我们定义一个简单的算术表达式文法进行测试:
E → E + T | T
T → T * F | F
F → ( E ) | id
测试输入 id + id * id 的解析过程:
| 栈状态 | 输入 | 动作 |
|---|---|---|
| [0] | id + id * id $ | shift 5 |
| [0, 5] | + id * id $ | reduce F → id |
| [0, 3] | + id * id $ | reduce T → F |
| [0, 2] | + id * id $ | reduce E → T |
| [0, 1] | + id * id $ | shift 6 |
| [0, 1, 6] | id * id $ | shift 5 |
| [0, 1, 6, 5] | * id $ | reduce F → id |
| [0, 1, 6, 3] | * id $ | reduce T → F |
| [0, 1, 6, 9] | * id $ | shift 7 |
| [0, 1, 6, 9, 7] | id $ | shift 5 |
| [0, 1, 6, 9, 7, 5] | $ | reduce F → id |
| [0, 1, 6, 9, 7, 10] | $ | reduce T → T * F |
| [0, 1, 6, 9] | $ | reduce E → E + T |
| [0, 1] | $ | accept |
在实现过程中,常见的调试技巧包括:
- 打印每个步骤的栈状态和输入
- 可视化DFA和LR项集族
- 对冲突项进行优先级和结合性调整
通过这个项目,我们不仅实现了编译原理的核心算法,更重要的是理解了如何将形式语言理论转化为实际可运行的代码。这种从理论到实践的转化能力,正是高级开发者区别于初学者的关键所在。
更多推荐
所有评论(0)