从上下文无关语法生成字符串时如何防止重复
How to prevent duplications when generating strings from context free grammar
我需要定义上下文无关语法并从该语言生成字符串。
包含所有带匹配括号的字符串的语言的语法:
s->ss
s->(s)
s->()
示例:(), ()(), (()),…
为了生成字符串,我通常使用 Breadth-first search。根是 s
。生产规则是边缘。每个终端节点都是一个字符串。
问题:在某个时候存在节点:sss
。从这个节点存在 3 个不同的边来得到相同的节点 ssss
:
[s->ss]ss = s[s->ss]s=ss[s->ss]
修剪重复边缘的正确方法是什么?
- 消除语法歧义。这门语言怎么做?
- 更改生成程序(BFS)?
代码:
from collections import deque
# rules
rules1 = [('s', 'ss'),
('s', '(s)'),
('s', '()')]
# tree
def get_edges(node, rules):
edges = []
for count, item in enumerate(node):
for rule in rules:
if item == rule[0]:
edges.append((count, rule[1], rule[0]))
return edges
def get_nodes(node, edges, rules):
nodes = map(lambda edge: node[:edge[0]] + edge[1] + node[edge[0]+len(edge[2]):], edges)
return list(nodes)
def get_adjacent_nodes(node, rules):
edges = get_edges(node, rules)
nodes = get_nodes(node, edges, rules)
return nodes
# search
def BFS_simple(G, root, goal, max_nodes_visited=10 ** 4):
Q = deque()
explored = set()
explored.add(root)
Q.append(root)
log_total_nodes_visited = 0
log_number_of_duplications = 0
while len(Q) > 0:
v = Q.popleft()
if v == goal:
print('total nodes visited = ', log_total_nodes_visited, 'duplications = ', log_number_of_duplications)
return v
for node in G(v):
if node not in explored:
explored.add(node)
Q.append(node)
else:
log_number_of_duplications += 1
log_total_nodes_visited += 1
if log_total_nodes_visited > max_nodes_visited:
return False
return False
BFS_simple(lambda node: get_adjacent_nodes(node, rules1), 's', '(()()())')
结果:total nodes visited = 1244 duplications = 6527
BFS 上的维基百科条目说:(强调)
Breadth-first search can be generalized to graphs, when the start node (sometimes referred to as a 'search key') is explicitly given, and precautions are taken against following a vertex twice.
并且伪代码实现包括这样的预防措施(参见下面复制的第 10 行和第 11 行):
9 for all edges from v to w in G.adjacentEdges(v) do
10 if w is not labeled as explored then
11 label w as explored
12 Q.enqueue(w)
它没有提到如何“将 w 标记为已探索”;两种常见的策略是使用将节点映射到布尔值的附加数据结构(字典或数组,具体取决于节点的识别方式)或使用附加布尔字段扩充节点数据结构。
这适用于任何语法,但对于明确的语法则没有必要。由于为 Dyck language 找到明确的语法是微不足道的,您可以通过使用它而不是拒绝重复的结果来节省一些时间。但我建议以任何方式编写面向图的 BFS,因为开销很小并且它允许更广泛的可能语法。
根据 @rici 的评论,我进行了 2 项改进并显着提高了搜索速度:
消除语法歧义和正确顺序:[('s', '()'), ('s', '()s'), ('s', '(s)'), ('s', '(s)s')]
不是将每个可能的替换都视为边,而是使用第一个非终端节点 (left most derivation) 的可能替换。
以下代码只是对我的 post 代码的扩展,以显示解决方案。 (可以跳过过滤并只为最左边的非终端节点生成边)
def get_leftmost_non_terminal(edges):
if len(edges) == 0:
return edges
return filter(lambda edge: edge[0] == edges[0][0], edges)
def get_adjacent_nodes(node, rules):
edges = get_edges(node, rules)
left_most = get_leftmost_non_terminal(edges)
nodes = get_nodes(node, left_most, rules)
return nodes
性能测试:
s = '(()(())())'
BFS_simple(lambda node: get_adjacent_nodes(node, rules1), 's', s)
#total nodes visited = 656 duplications = 188
BFS_simple(lambda node: get_adjacent_nodes(node, rules_unambiguous_empty), 's', s)
#total nodes visited = 537 duplications = 490
BFS_simple(lambda node: get_adjacent_nodes(node, rules_unambiguous), 's', s)
#total nodes visited = 536 duplications = 0
BFS_simple(lambda node: get_adjacent_nodes(node, rules_unambiguous_ordered), 's', s)
#total nodes visited = 361 duplications = 0
我需要定义上下文无关语法并从该语言生成字符串。
包含所有带匹配括号的字符串的语言的语法:
s->ss
s->(s)
s->()
示例:(), ()(), (()),…
为了生成字符串,我通常使用 Breadth-first search。根是 s
。生产规则是边缘。每个终端节点都是一个字符串。
问题:在某个时候存在节点:sss
。从这个节点存在 3 个不同的边来得到相同的节点 ssss
:
[s->ss]ss = s[s->ss]s=ss[s->ss]
修剪重复边缘的正确方法是什么?
- 消除语法歧义。这门语言怎么做?
- 更改生成程序(BFS)?
代码:
from collections import deque
# rules
rules1 = [('s', 'ss'),
('s', '(s)'),
('s', '()')]
# tree
def get_edges(node, rules):
edges = []
for count, item in enumerate(node):
for rule in rules:
if item == rule[0]:
edges.append((count, rule[1], rule[0]))
return edges
def get_nodes(node, edges, rules):
nodes = map(lambda edge: node[:edge[0]] + edge[1] + node[edge[0]+len(edge[2]):], edges)
return list(nodes)
def get_adjacent_nodes(node, rules):
edges = get_edges(node, rules)
nodes = get_nodes(node, edges, rules)
return nodes
# search
def BFS_simple(G, root, goal, max_nodes_visited=10 ** 4):
Q = deque()
explored = set()
explored.add(root)
Q.append(root)
log_total_nodes_visited = 0
log_number_of_duplications = 0
while len(Q) > 0:
v = Q.popleft()
if v == goal:
print('total nodes visited = ', log_total_nodes_visited, 'duplications = ', log_number_of_duplications)
return v
for node in G(v):
if node not in explored:
explored.add(node)
Q.append(node)
else:
log_number_of_duplications += 1
log_total_nodes_visited += 1
if log_total_nodes_visited > max_nodes_visited:
return False
return False
BFS_simple(lambda node: get_adjacent_nodes(node, rules1), 's', '(()()())')
结果:total nodes visited = 1244 duplications = 6527
BFS 上的维基百科条目说:(强调)
Breadth-first search can be generalized to graphs, when the start node (sometimes referred to as a 'search key') is explicitly given, and precautions are taken against following a vertex twice.
并且伪代码实现包括这样的预防措施(参见下面复制的第 10 行和第 11 行):
9 for all edges from v to w in G.adjacentEdges(v) do 10 if w is not labeled as explored then 11 label w as explored 12 Q.enqueue(w)
它没有提到如何“将 w 标记为已探索”;两种常见的策略是使用将节点映射到布尔值的附加数据结构(字典或数组,具体取决于节点的识别方式)或使用附加布尔字段扩充节点数据结构。
这适用于任何语法,但对于明确的语法则没有必要。由于为 Dyck language 找到明确的语法是微不足道的,您可以通过使用它而不是拒绝重复的结果来节省一些时间。但我建议以任何方式编写面向图的 BFS,因为开销很小并且它允许更广泛的可能语法。
根据 @rici 的评论,我进行了 2 项改进并显着提高了搜索速度:
消除语法歧义和正确顺序:
[('s', '()'), ('s', '()s'), ('s', '(s)'), ('s', '(s)s')]
不是将每个可能的替换都视为边,而是使用第一个非终端节点 (left most derivation) 的可能替换。
以下代码只是对我的 post 代码的扩展,以显示解决方案。 (可以跳过过滤并只为最左边的非终端节点生成边)
def get_leftmost_non_terminal(edges):
if len(edges) == 0:
return edges
return filter(lambda edge: edge[0] == edges[0][0], edges)
def get_adjacent_nodes(node, rules):
edges = get_edges(node, rules)
left_most = get_leftmost_non_terminal(edges)
nodes = get_nodes(node, left_most, rules)
return nodes
性能测试:
s = '(()(())())'
BFS_simple(lambda node: get_adjacent_nodes(node, rules1), 's', s)
#total nodes visited = 656 duplications = 188
BFS_simple(lambda node: get_adjacent_nodes(node, rules_unambiguous_empty), 's', s)
#total nodes visited = 537 duplications = 490
BFS_simple(lambda node: get_adjacent_nodes(node, rules_unambiguous), 's', s)
#total nodes visited = 536 duplications = 0
BFS_simple(lambda node: get_adjacent_nodes(node, rules_unambiguous_ordered), 's', s)
#total nodes visited = 361 duplications = 0