How can I find the best fit subsequences of a large string?

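Suppose I have a large string and a set of substrings that, when concatenated, equal the large string (with minor differences).

For example (note the minor differences between the strings):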

large_str = ("hello, this is a long string, that may be made up of multiple"
             " substrings that approximately match the original string")

sub_strs = ["hello, ths is a lng strin", ", that ay be mad up of multiple",
            "subsrings tat aproimately ", "match the orginal strng"]

What is the best way to align the strings so as to produce a new set of substrings taken from the original large_str? For example:

["hello, this is a long string", ", that may be made up of multiple",
 "substrings that approximately ", "match the original string"]

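Additional information

The use case is to locate page breaks in an original text, using the existing page breaks in text extracted from a PDF document. The text extracted from the PDF went through OCR and contains small errors compared with the original text, but the original text has no page breaks. The goal is to paginate the original text accurately while avoiding the OCR errors in the PDF text.

(The additional information makes much of the following unnecessary. It was written for the case where the substrings provided could be any permutation of the order in which they appear in the main string.)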

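There is a dynamic programming solution for a problem very close to this one. In the dynamic programming algorithm that gives you edit distance, the state is (a, b), where a is the offset into the first string and b is the offset into the second string. For each pair (a, b) you work out the smallest possible edit distance that matches the first a characters of the first string with the first b characters of the second string, computing (a, b) from (a-1, b-1), (a-1, b), and (a, b-1).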
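As an illustration of that recurrence, here is a minimal sketch of the standard edit distance table. The function name `edit_distance` and the unit costs for insertion, deletion, and substitution are assumptions made for this example.

```python
def edit_distance(first, second):
    """Smallest number of single-character edits turning `first` into `second`."""
    rows, cols = len(first) + 1, len(second) + 1
    # dist[a][b] = edit distance between first[:a] and second[:b]
    dist = [[0] * cols for _ in range(rows)]
    for a in range(rows):
        dist[a][0] = a  # delete all a characters
    for b in range(cols):
        dist[0][b] = b  # insert all b characters
    for a in range(1, rows):
        for b in range(1, cols):
            substitution = 0 if first[a - 1] == second[b - 1] else 1
            dist[a][b] = min(
                dist[a - 1][b - 1] + substitution,  # from (a-1, b-1)
                dist[a - 1][b] + 1,                 # from (a-1, b)
                dist[a][b - 1] + 1,                 # from (a, b-1)
            )
    return dist[-1][-1]


print(edit_distance("hello, ths is a lng strin", "hello, this is a long string"))
```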

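You can now write a similar algorithm with state (a, n, m, b), where a is the total number of characters consumed by substrings so far, n is the index of the current substring, m is the position within the current substring, and b is the number of characters matched in the second string. This solves the problem of matching b against a string made up by pasting together any number of copies of any of the available substrings.

This is a different problem, because if you are trying to reconstitute a long string from fragments, you might get a solution that uses the same fragment more than once; but if so, you might hope that the answer is obvious enough that the collection of substrings it produces happens to be a permutation of the collection given to it.

Because the edit distance returned by this approach will always be at least as good as the best edit distance when you force a permutation, you could also use it to compute a lower bound on the best possible edit distance for a permutation, and run a branch-and-bound algorithm to find the best permutation.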
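The relaxed problem (fragments may be reused any number of times) can be sketched as follows. This is not the (a, n, m, b) table itself but a simpler and slower formulation of the same idea, written under the assumption of unit edit costs: `best[b]` holds the cheapest way to cover the first b characters of the target with whole fragments. The names `prefix_edit_distances` and `relaxed_distance` are hypothetical.

```python
def prefix_edit_distances(fragment, text):
    """Return d where d[k] is the edit distance between fragment and text[:k]."""
    prev = list(range(len(text) + 1))  # empty fragment vs text[:k] costs k edits
    for i, fc in enumerate(fragment, start=1):
        cur = [i]  # fragment[:i] vs empty text costs i edits
        for k, tc in enumerate(text, start=1):
            cur.append(min(prev[k] + 1, cur[k - 1] + 1, prev[k - 1] + (fc != tc)))
        prev = cur
    return prev


def relaxed_distance(target, fragments):
    """Edit distance from target to the closest concatenation of fragments,
    where each fragment may be used any number of times."""
    inf = float("inf")
    best = [inf] * (len(target) + 1)
    best[0] = 0  # nothing consumed, no fragments used
    for start in range(len(target)):
        if best[start] == inf:
            continue
        for fragment in fragments:
            d = prefix_edit_distances(fragment, target[start:])
            # Let this fragment cover target[start:start + k] for every k >= 1.
            for k in range(1, len(d)):
                best[start + k] = min(best[start + k], best[start] + d[k])
    return best[len(target)]


print(relaxed_distance("abxcd", ["ab", "cd"]))
```

Because reuse is allowed, the result can only be less than or equal to the distance for any arrangement that uses each fragment exactly once.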

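  1. Concatenate the substrings
  2. Align the concatenation with the original string
  3. Keep track of which positions in the original string are aligned with the boundaries between the substrings
  4. Split the original string at the positions aligned with those boundaries

An implementation using Python's difflib: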

from difflib import SequenceMatcher
from itertools import accumulate

large_str = "hello, this is a long string, that may be made up of multiple substrings that approximately match the original string"

sub_strs = [
  "hello, ths is a lng strin",
  ", that ay be mad up of multiple",
  "subsrings tat aproimately ",
  "match the orginal strng"]

sub_str_boundaries = list(accumulate(len(s) for s in sub_strs))

sequence_matcher = SequenceMatcher(None, large_str, ''.join(sub_strs), autojunk = False)

match_index = 0
matches = [''] * len(sub_strs)

for tag, i1, i2, j1, j2 in sequence_matcher.get_opcodes():
  if tag != 'equal':
    # 'delete', 'insert' or 'replace': keep the original text for this region
    # and attribute all of it to the current substring.
    matches[match_index] += large_str[i1:i2]
    # Walk through the concatenation, moving to the next substring
    # whenever a boundary is reached.
    while j1 < j2:
      submatch_len = min(sub_str_boundaries[match_index], j2) - j1
      while submatch_len == 0:
        match_index += 1
        submatch_len = min(sub_str_boundaries[match_index], j2) - j1
      j1 += submatch_len
  else:
    # 'equal': copy the original text, splitting it wherever a substring
    # boundary falls inside the matching region.
    while j1 < j2:
      submatch_len = min(sub_str_boundaries[match_index], j2) - j1
      while submatch_len == 0:
        match_index += 1
        submatch_len = min(sub_str_boundaries[match_index], j2) - j1
      matches[match_index] += large_str[i1:i1+submatch_len]
      j1 += submatch_len
      i1 += submatch_len

print(matches)

Output:

['hello, this is a long string', 
 ', that may be made up of multiple ', 
 'substrings that approximately ', 
 'match the original string']
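For the page-break use case, the pieces can be turned into character offsets in the original text. The following is a small sketch that assumes the pieces are exactly the ones printed above; the names `break_offsets`, `starts`, and `pieces` are introduced here for illustration.

```python
from itertools import accumulate

large_str = "hello, this is a long string, that may be made up of multiple substrings that approximately match the original string"

matches = [
  'hello, this is a long string',
  ', that may be made up of multiple ',
  'substrings that approximately ',
  'match the original string']

# End offset of each piece within large_str; the last one equals len(large_str).
break_offsets = list(accumulate(len(m) for m in matches))

# Splitting large_str at those offsets gives back the same pieces.
starts = [0] + break_offsets[:-1]
pieces = [large_str[s:e] for s, e in zip(starts, break_offsets)]

print(break_offsets)
```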

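You are trying to solve a sequence alignment problem. In your case it is a "local" sequence alignment. It can be solved with the Smith-Waterman approach. One possible implementation is here. If you run it, you will get: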

large_str = "hello, this is a long string, that may be made up of multiple substrings that approximately match the original string"
sub_strs = ["hello, ths is a lng sin", ", that ay be md up of mulple", "susrings tat aproimately ", "manbvch the orhjgnal strng"]

for sbs in sub_strs:
    water(large_str, sbs)


 >>>

Identity = 85.185 percent
Score = 210
hello, this is a long strin
hello, th s is a l ng s  in
hello, th-s is a l-ng s--in

Identity = 84.848 percent
Score = 255
, that may be made up of multiple
, that  ay be m d  up of mul  ple
, that -ay be m-d- up of mul--ple

Identity = 83.333 percent
Score = 225
substrings that approximately 
su s rings t at a pro imately 
su-s-rings t-at a-pro-imately 

Identity = 75.000 percent
Score = 175
ma--tch the or-iginal string
ma   ch the or  g nal str ng
manbvch the orhjg-nal str-ng

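The middle line shows the matching characters. If you need the positions, look at the max_i value to get the ending position in the original string. The starting position will be the value of i at the end of the water() function.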
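For readers who want the positions directly, here is a minimal, independent sketch of Smith-Waterman local alignment that returns them. It is not the water() function referred to above; the function name `smith_waterman` and the scoring values (match 2, mismatch -1, gap -1) are assumptions chosen for illustration, so the scores will not match the ones printed above.

```python
def smith_waterman(text, pattern, match=2, mismatch=-1, gap=-1):
    """Return (score, start, end) so that text[start:end] is the region of
    `text` that best aligns locally with `pattern`."""
    rows, cols = len(text) + 1, len(pattern) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def diagonal(i, j):
        # Score of aligning text[i-1] with pattern[j-1].
        return score[i - 1][j - 1] + (match if text[i - 1] == pattern[j - 1] else mismatch)

    # Fill the table; local alignment never lets a cell drop below zero.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(0, diagonal(i, j), score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score reaches zero.
    i, j = best_pos
    end = i
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == diagonal(i, j):
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, i, end


text = "xxhello worldyy"
score, start, end = smith_waterman(text, "helo world")
print(score, text[start:end])
```

Running it once per substring against the large string gives a start and end offset for each piece, which is what the page-break use case needs.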