从上下文无关语法生成字符串时如何防止重复

How to prevent duplications when generating strings from context free grammar

我需要定义上下文无关语法并从该语言生成字符串。

包含所有带匹配括号的字符串的语言的语法:

s->ss
s->(s)
s->()

示例:(), ()(), (()),…

为了生成字符串,我通常使用 Breadth-first search。根是 s。生产规则是边缘。每个终端节点都是一个字符串。

问题:在某个时候存在节点:sss。从这个节点存在 3 个不同的边来得到相同的节点 ssss:

[s->ss]ss = s[s->ss]s=ss[s->ss]

修剪重复边缘的正确方法是什么?

  1. 消除语法歧义。这门语言怎么做?
  2. 更改生成程序(BFS)?

代码:

from collections import deque

# rules
rules1 = [('s', 'ss'), 
         ('s', '(s)'),
          ('s', '()')]

# tree
def get_edges(node, rules):
  edges = []
  for count, item in enumerate(node):
    for rule in rules:
      if item == rule[0]:
        edges.append((count, rule[1], rule[0]))

  return edges

def get_nodes(node, edges, rules):
  nodes = map(lambda edge: node[:edge[0]] + edge[1] + node[edge[0]+len(edge[2]):], edges)
  
  return list(nodes)

def get_adjacent_nodes(node, rules):
  edges = get_edges(node, rules)
  nodes = get_nodes(node, edges, rules) 

  return nodes

# search
def BFS_simple(G, root, goal, max_nodes_visited=10 ** 4):
  Q = deque()
  explored = set()

  explored.add(root)
  Q.append(root)

  log_total_nodes_visited = 0
  log_number_of_duplications = 0

  while len(Q) > 0:
    v = Q.popleft()

    if v == goal:
      print('total nodes visited = ', log_total_nodes_visited, 'duplications = ', log_number_of_duplications)
      return v

    for node in G(v):
      if node not in explored:
        explored.add(node)
        Q.append(node)

      else:
        log_number_of_duplications += 1

    log_total_nodes_visited += 1
    if log_total_nodes_visited > max_nodes_visited:
      return False

  return False

BFS_simple(lambda node: get_adjacent_nodes(node, rules1), 's', '(()()())')

结果:total nodes visited = 1244 duplications = 6527

BFS 上的维基百科条目说:(强调)

Breadth-first search can be generalized to graphs, when the start node (sometimes referred to as a 'search key') is explicitly given, and precautions are taken against following a vertex twice.

并且伪代码实现包括这样的预防措施(参见下面复制的第 10 行和第 11 行):

  9  for all edges from v to w in G.adjacentEdges(v) do
  10   if w is not labeled as explored then
  11     label w as explored
  12     Q.enqueue(w)

它没有提到如何“将 w 标记为已探索”;两种常见的策略是使用将节点映射到布尔值的附加数据结构(字典或数组,具体取决于节点的识别方式)或使用附加布尔字段扩充节点数据结构。

这适用于任何语法,但对于明确的语法则没有必要。由于为 Dyck language 找到明确的语法是微不足道的,您可以通过使用它而不是拒绝重复的结果来节省一些时间。但我建议以任何方式编写面向图的 BFS,因为开销很小并且它允许更广泛的可能语法。

根据 @rici 的评论,我进行了 2 项改进并显着提高了搜索速度:

  1. 消除语法歧义和正确顺序:[('s', '()'), ('s', '()s'), ('s', '(s)'), ('s', '(s)s')]

  2. 不是将每个可能的替换都视为边,而是使用第一个非终端节点 (left most derivation) 的可能替换。

以下代码只是对我的 post 代码的扩展,以显示解决方案。 (可以跳过过滤并只为最左边的非终端节点生成边)

def get_leftmost_non_terminal(edges):
  if len(edges) == 0:
    return edges
  return filter(lambda edge: edge[0] == edges[0][0], edges)


def get_adjacent_nodes(node, rules):
  edges = get_edges(node, rules)
  left_most = get_leftmost_non_terminal(edges)
  nodes = get_nodes(node, left_most, rules) 
  
  return nodes

性能测试:

s = '(()(())())'

BFS_simple(lambda node: get_adjacent_nodes(node, rules1), 's', s)
#total nodes visited =  656 duplications =  188

BFS_simple(lambda node: get_adjacent_nodes(node, rules_unambiguous_empty), 's', s)
#total nodes visited =  537 duplications =  490

BFS_simple(lambda node: get_adjacent_nodes(node, rules_unambiguous), 's', s)
#total nodes visited =  536 duplications =  0

BFS_simple(lambda node: get_adjacent_nodes(node, rules_unambiguous_ordered), 's', s)
#total nodes visited =  361 duplications =  0