Python regex problems with string matching

Question

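Sorry for the long post.

EDIT--

Modified from Norman's solution to print and return if we find an exact match, otherwise print all approximate matches. For the specific example of searching for etnse in the dictionary file provided in the third pastebin link below, I am currently still only getting 83/85 matches.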

import re
import time


def doMatching(file, origPattern):
    entireFile = file.read()
    patterns = []
    startIndices = []

    begin = time.time()

    # get all of the patterns associated with the given phrase
    for pattern in generateFuzzyPatterns(origPattern):
        patterns.append(pattern)
        for m in re.finditer(pattern, entireFile):
            startIndices.append((m.start(), m.end(), m.group()))
        # if the first pattern (exact match) is valid, then just print the results and we're done
        if len(startIndices) != 0 and startIndices[0][2] == origPattern:
            print("\nThere is an exact match at: [{}:{}] for {}".format(*startIndices[0]))
            return

    print('Used {} patterns:'.format(len(patterns)))
    for i, p in enumerate(patterns, 1):
        print('- [{}]  {}'.format(i, p))

    # list for all non-overlapping matches
    nonOverlapping = []
    # hold the last match's ending position
    lastEnd = 0
    # find non-overlapping matches by comparing each match's starting index to the previous match's ending index
    # if the starting index >= the previous match's ending index they aren't overlapping
    for start in sorted(startIndices):
        print(start)
        if start[0] >= lastEnd:
            # start[1] is the ending index from the current match's tuple
            lastEnd = start[1]
            nonOverlapping.append(start)

    print()
    print('Found {} matches:'.format(len(startIndices)))
    # each entry is a tuple (<starting index>, <ending index>, <string at those indices>)
    for start in sorted(startIndices):
        # *start unpacks the tuple so that format receives its items as 3 inputs
        # for explanation, see: 
        print('- [{}:{}]  {}'.format(*start))

    print()
    print('Found {} non-overlapping matches:'.format(len(nonOverlapping)))
    for ov in nonOverlapping:
        print('- [{}:{}]  {}'.format(*ov))

    end = time.time()
    print(end-begin)

def generateFuzzyPatterns(origPattern):
    # Escape individual symbols.
    origPattern = [re.escape(c) for c in origPattern]

    # Find exact matches.
    pattern = ''.join(origPattern)
    yield pattern

    # Find matches with changes. (replace)
    for i in range(len(origPattern)):
        t = origPattern[:]
        # replace with a wildcard for each index
        t[i] = '.'
        pattern = ''.join(t)
        yield pattern

    # Find matches with deletions. (omitted)
    for i in range(len(origPattern)):
        t = origPattern[:]
        # remove a char for each index
        t[i] = ''
        pattern = ''.join(t)
        yield pattern

    # Find matches with insertions.
    for i in range(len(origPattern) + 1):
        t = origPattern[:]
        # insert a wildcard between adjacent chars for each index
        t.insert(i, '.')
        pattern = ''.join(t)
        yield pattern

    # Find two adjacent characters being swapped.
    for i in range(len(origPattern) - 1):
        t = origPattern[:]
        if t[i] != t[i + 1]:
            t[i], t[i + 1] = t[i + 1], t[i]
            pattern = ''.join(t)
            yield pattern

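Original: http://pastebin.com/bAXeYZcD - the actual function

http://pastebin.com/YSfD00Ju - data to use, should be 8 matches for 'ware' but only gets 6

http://pastebin.com/S9u50ig0 - data to use, should get 85 matches for 'etnse' but only gets 77

I left all of the original code in the function because I'm not sure exactly what is causing the problem.

You can search for 'Board:isFull()' on anything to get the error stated below.

Examples:

Assume you saved the second pastebin as 'someFile.txt' in a folder named files, in the same directory as the .py file.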

file = open('./files/someFile.txt', 'r')
doMatching(file, "ware")

OR

file = open('./files/someFile.txt', 'r')
doMatching(file, "Board:isFull()")

OR

Assume you saved the third pastebin as 'dictionary.txt' in a folder named files, in the same directory as the .py file.

file = open('./files/dictionary.txt', 'r')
doMatching(file, "etnse")

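--EDIT

The function parameters work like so:

file is the location of a file.

origPattern is a phrase.

The function is basically supposed to be a fuzzy search. It's supposed to take the pattern and search through a file to find matches that are either exact or off by 1 character, i.e. 1 missing character, 1 extra character, 1 replaced character, or 1 character swapped with an adjacent character.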
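To make the four kinds of deviation concrete, here is a minimal sketch of a direct check between two strings. It is not the regex approach used in this post; the function name within_one_edit is made up for illustration, and it only compares two whole strings rather than searching a file.

```python
def within_one_edit(pattern, candidate):
    """Return True if candidate equals pattern or differs from it by one
    replaced, missing, or extra character, or one adjacent swap."""
    if pattern == candidate:
        return True
    lp, lc = len(pattern), len(candidate)
    if abs(lp - lc) > 1:
        return False

    # Skip the common prefix; i is the first position where they differ.
    i = 0
    while i < min(lp, lc) and pattern[i] == candidate[i]:
        i += 1

    if lp == lc:
        # One replaced character: everything after position i is identical.
        if pattern[i + 1:] == candidate[i + 1:]:
            return True
        # Two adjacent characters swapped at positions i and i + 1.
        return (i + 1 < lp
                and pattern[i] == candidate[i + 1]
                and pattern[i + 1] == candidate[i]
                and pattern[i + 2:] == candidate[i + 2:])
    if lp > lc:
        # One character of the pattern is missing from the candidate.
        return pattern[i + 1:] == candidate[i:]
    # The candidate contains one extra character.
    return pattern[i:] == candidate[i + 1:]


for word in ['ware', 'wre', 'wares', 'wire', 'wrae', 'awer', 'waress']:
    print(word, within_one_edit('ware', word))
```

A check like this can be handy for verifying by hand which entries of a word list ought to count as matches for 'ware' or 'etnse'.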

For the most part it works, but I'm running into a few problems.

First, when I try to use something like 'Board:isFull()' for origPattern I get the following:

    raise error, v # invalid expression
sre_constants.error: unbalanced parenthesis

The above is from the re library.

I've tried using re.escape() but it doesn't change anything.
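An error like this is consistent with a pattern in which one parenthesis of a pair was removed while the pattern was being modified. The sketch below reproduces that situation with the short string 'Fun()'; the helper compiles is hypothetical, and the exact error text depends on the Python version, so it only reports whether the pattern is valid.

```python
import re


def compiles(pattern):
    """Return True if pattern is a valid regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


raw = 'Fun()'

# Unescaped, '()' is an empty group: the pattern is valid,
# but it matches the text 'Fun' rather than 'Fun()'.
print(compiles(raw), re.search(raw, 'Fun').group())

# Dropping one parenthesis leaves a lone '(' or ')', which is invalid.
print(compiles('Fun('))
print(compiles('Fun)'))

# Escaping turns both parentheses into literal characters.
escaped = re.escape(raw)
print(compiles(escaped), re.search(escaped, 'call Fun() here').group())
```

The first print also hints at why an unescaped 'Fun()' can report matches at places that do not contain the parentheses at all.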

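Second, when I try other things like 'Fun()', it says there is a match at some index, but that position doesn't contain anything like it; it's just a line of '*'.

Third, when it does find matches, it doesn't always find all of them. For example, I have one file that should give 85 matches but only 77 come up, and another that should give 8 but only 6 come up. However, those files are just alphabetical word lists, so it's likely just a problem with how I do the search.

Any help is appreciated.

I also can't use fuzzyfinder.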

Answer 1

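I found some issues in the code:

re.escape() seems to not work because its result is not assigned.
Do origPattern = re.escape(origPattern).
Once the pattern is correctly escaped, be careful not to break the escaping when manipulating the pattern.
Example: re.escape('Fun()') yields the string Fun\(\). Neither \( nor \) may be split up: never remove, replace, or swap a \ without the character it escapes.
Bad manipulations: Fun(\) (removal), Fu\n(\) (swap), Fun\.{0,2}\) (replacement).
Good manipulations: Fun\) (removal), Fu\(n\) (swap), Fun.{0,2}\) (replacement).
You find too few matches because you only look for fuzzy matches if there are no exact matches. (See the line if indices.__len__() != 0:.) You must always look for them.
The loops inserting '.{0,2}' produce too many patterns, e.g. 'ware.{0,2}' for ware. Unless you intend that, this pattern will find wareXY, which has two insertions.
The patterns with .{0,2} don't work as described; they allow one change and one insertion.
I'm not sure about the code involving difflib.Differ. I don't understand it, but I suspect there should be no break statements.
Even though you use a set to store indices, matches from different regexes may still overlap.
You don't use word boundaries (\b) in your regexes, though for natural language that would make sense.
Not a bug, but: why do you call magic methods explicitly?
(E.g. indices.__len__() != 0 instead of len(indices) != 0.)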
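Two of the points above, keeping escape sequences intact and using word boundaries, can be illustrated with a short sketch. The sample strings are invented for the demonstration. Escaping each character separately means every list item is one indivisible unit, so deleting or swapping list items cannot separate a backslash from the character it escapes.

```python
import re

# Escape each character on its own: every list item is one unit.
units = [re.escape(c) for c in 'Fun()']   # ['F', 'u', 'n', '\\(', '\\)']

# Deleting a unit (here 'u') keeps the remaining escapes intact: Fn\(\)
deleted = ''.join(units[:1] + units[2:])

# Swapping two adjacent units ('n' and '\(') is equally safe: Fu\(n\)
swapped_units = units[:]
swapped_units[2], swapped_units[3] = swapped_units[3], swapped_units[2]
swapped = ''.join(swapped_units)

text = 'Fun() Fn() Fu(n)'
print(re.findall(deleted, text))
print(re.findall(swapped, text))

# Word boundaries: without \b the pattern also matches inside longer words.
words = 'wares ware software'
loose = [m.group() for m in re.finditer('ware', words)]
strict = [m.group() for m in re.finditer(r'\bware\b', words)]
print(len(loose), len(strict))
```

Note that \b marks a transition between a word character and a non-word character, so it suits patterns such as ware better than patterns that end in punctuation, such as Fun().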

I rewrote your code a bit to address any issues I saw:

import re


def doMatching(file, origPattern):
    entireFile = file.read()
    patterns = []
    startIndices = {}

    for pattern in generateFuzzyPatterns(origPattern):
        patterns.append(pattern)
        startIndices.update((m.start(), (m.end(), m.group())) for m in re.finditer(pattern, entireFile))

    print('Used {} patterns:'.format(len(patterns)))
    for i, p in enumerate(patterns, 1):
        print('- [{}]  {}'.format(i, p))

    nonOverlapping = []
    lastEnd = 0
    for start in sorted(startIndices):
        if start >= lastEnd:
            lastEnd = startIndices[start][0]
            nonOverlapping.append(start)

    print()
    print('Found {} matches:'.format(len(startIndices)))
    for i in sorted(startIndices):
        print('- [{}:{}]  {}'.format(i, *startIndices[i]))

    print()
    print('Found {} non-overlapping matches:'.format(len(nonOverlapping)))
    for i in nonOverlapping:
        print('- [{}:{}]  {}'.format(i, *startIndices[i]))


def generateFuzzyPatterns(origPattern):
    # Escape individual symbols.
    origPattern = [re.escape(c) for c in origPattern]

    # Find exact matches.
    pattern = ''.join(origPattern)
    yield pattern

    # Find matches with changes.
    for i in range(len(origPattern)):
        t = origPattern[:]
        t[i] = '.'
        pattern = ''.join(t)
        yield pattern

    # Find matches with deletions.
    for i in range(len(origPattern)):
        t = origPattern[:]
        t[i] = ''
        pattern = ''.join(t)
        yield pattern

    # Find matches with insertions.
    for i in range(len(origPattern) + 1):
        t = origPattern[:]
        t.insert(i, '.')
        pattern = ''.join(t)
        yield pattern

    # Find two adjacent characters being swapped.
    for i in range(len(origPattern) - 1):
        t = origPattern[:]
        if t[i] != t[i + 1]:
            t[i], t[i + 1] = t[i + 1], t[i]
            pattern = ''.join(t)
            yield pattern
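One detail of the code above worth checking: startIndices is a dict keyed by the start index, so two matches that begin at the same position replace each other. The sketch below is a possible variant, not part of the answer, that stores (start, end, text) tuples in a set and then applies the same greedy non-overlap filter. The text, the pattern list, and the name non_overlapping are made up for illustration.

```python
import re


def non_overlapping(matches):
    """Walk the matches in sorted order and keep each one that starts
    at or after the end of the previously kept match."""
    kept = []
    last_end = 0
    for start, end, text in sorted(matches):
        if start >= last_end:
            kept.append((start, end, text))
            last_end = end
    return kept


text = 'software wares'
patterns = ['ware', 'war', '.ware']   # hypothetical fuzzy patterns

# A set of full tuples keeps matches that share a start index
# but differ in length or content.
matches = set()
for p in patterns:
    for m in re.finditer(p, text):
        matches.add((m.start(), m.end(), m.group()))

print(len(matches), 'matches,', len({m[0] for m in matches}), 'distinct start indices')
for start, end, found in non_overlapping(matches):
    print('- [{}:{}]  {}'.format(start, end, repr(found)))
```

In this small example a dict keyed by start index would hold fewer entries than the set, which may be one source of a lower match count; whether that explains the missing matches in the question depends on the actual data.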

python

regex

fuzzy-search

fuzzy-logic

python-3.x