如何获得 difflib.SequenceMatcher 的多个匹配项？

Question

我正在使用 difflib 识别较长序列中短字符串的所有匹配项。然而，当有多个匹配项时，difflib 似乎只有 returns 一个：

> sm = difflib.SequenceMatcher(None, a='ACT', b='ACTGACT')
> sm.get_matching_blocks()
[Match(a=0, b=0, size=3), Match(a=3, b=7, size=0)]

我预期的输出是：

[Match(a=0, b=0, size=3), Match(a=0, b=4, size=3), Match(a=3, b=7, size=0)]

事实上，字符串 ACTGACT 包含 ACT 的两个匹配项，位于位置 0 和 4，大小均为 3（加上字符串末尾的另一个大小为 0 的匹配项）。

如何获得多个匹配项？我期待 difflib 到 return 两个位置。

Answer 1

为什么要使用 difflib？您应该能够只使用标准正则表达式。

import re
pattern = "ACT"
text = "ACTGACT"
matches = [m.span() for m in re.finditer(pattern, text)]

这会给你：

[(0, 3), (4, 7)]

或者这是否出于某种原因没有包含您感兴趣的信息？它当然不是 return difflib return 的最后一个空匹配，但您可以轻松地创建它。

Answer 2

正如 Jerry 所指出的，并且 k-nut 正确回答，您对问题使用了错误的算法。老实说，k-nut 的回答并没有那么糟糕，但它并不是解决这个 class 问题的最有效方法。我是一名生物信息学家，考虑到你的问题和示例案例，你似乎很想解决 "our" classical DNA 序列 alignment/search 问题（参见 scientific literature由 Altschul 或 "Gene" Myers 等科学巨星就此问题发表评论，如果您对细节感兴趣并想阅读有史以来被引用次数最多的论文之一）。

在长段数据库中有效地查找短段正是 Altschul 现在著名的 BLAST algorithm solves heuristically and/or can be done using Smith-Waterman 的 exact 查找。在 Python 中执行此操作的最有效方法可能是使用 BioPython, and in particular, you might want to look at the section describing how to set up a local NCBI BLAST+ instance。如果你不是 "married" 到 Python，今天有更快的 BLAST 实现，比如 FSA-BLAST.

另一方面，如果您需要精确匹配（与 BLAST 的启发式算法相反），如果您不介意较长的查询时间，则可能会出现这种情况并且有一个小的参考序列（B 在你的例子中），你可以使用官方的 Smith-Waterman (SW) 比对。如果不是，并且您仍然需要精确匹配，请先使用 BLAST 过滤匹配，然后使用候选人的 SW 比对减少您的集合。

你可以在纯 Python 中实现 SW，甚至只使用任何现有的纯 Python 实现，但我只推荐纯粹的路径教育目的（查看 swalign on GitHub, for example). If you nonetheless want a reasonably strong Python-based implementation check out scikit-bio for SW alignments, although scikit-bio is still in alpha-status. But first read up on the SW WikiPedia page already linked above, and depending on the hardware you have, you might instead use a GPU- or at least SIMD-optimized implementation in CUDA or C++. If you want a nice version with a Python wrapper, check out the SSWlib.

如何获得 difflib.SequenceMatcher 的多个匹配项？

how to get multiple matches with difflib.SequenceMatcher?

python

regex

difflib