How can one find a set of keywords in a document, all/some of them at a certain distance cutoff?

I have a set of keywords, about 10 of them. I want to search a very long document and check whether this set of keywords can be found in it, not just whether each keyword occurs somewhere in the text, but also whether all of them, or some subset of them, lie within a certain distance cutoff, for example 3 sentences, or 30 words, or any other proximity measure. How can this be done? I was about to write some Python code that finds one of the keywords and then checks whether any of the other keywords occur within roughly 3 lines of text, but that would take a lot of computing power and would be inefficient.

To determine whether a set of keywords all occur within a given distance of each other in a larger document, you can use a sliding window whose length equals the given distance and move it through the document. As the window moves, keep track of each word that enters and leaves it. If at any point the window contains all of the keywords, the condition is satisfied. This approach runs in O(len(document)) time and uses O(len(window)) memory.

Here is an example implementation of the above approach in Python:

from collections import defaultdict

def isInProximityWindow(doc, keywords, windowLen):
    words = doc.split()
    wordsLen = len(words)
    if windowLen > wordsLen:
        windowLen = wordsLen

    keywordsLen = len(keywords)
    allKeywordLocs = defaultdict(set)   # keyword -> positions currently inside the window
    numKeywordsInWindow = 0
    locKeyword = {}                     # position -> keyword seen at that position
    for i in range(wordsLen):
        windowContents = sorted(k for k in allKeywordLocs if allKeywordLocs[k])
        print("On beginning of iteration #%i, window contains '%s'" % (i, ','.join(windowContents)))

        # the word at position i - windowLen slides out of the window
        oldKeyword = locKeyword.pop(i - windowLen, None)
        if oldKeyword:
            keywordLocs = allKeywordLocs[oldKeyword]
            keywordLocs.remove(i - windowLen)
            if not keywordLocs:
                print("'%s' fell out of window" % oldKeyword)
                numKeywordsInWindow -= 1
        # the word at position i slides into the window
        word = words[i]
        print("Next word is '%s'" % word)
        if word in keywords:
            locKeyword[i] = word
            keywordLocs = allKeywordLocs[word]
            if not keywordLocs:
                print("'%s' fell in window" % word)
                numKeywordsInWindow += 1
                if numKeywordsInWindow == keywordsLen:
                    return True
            keywordLocs.add(i)
    return False

Sample output:

>>> isInProximityWindow("the brown cow jumped over the moon and the red fox jumped over the black dog", {"fox", "over", "the"}, 4)
On beginning of iteration #0, window contains ''
Next word is 'the'
'the' fell in window
On beginning of iteration #1, window contains 'the'
Next word is 'brown'
On beginning of iteration #2, window contains 'the'
Next word is 'cow'
On beginning of iteration #3, window contains 'the'
Next word is 'jumped'
On beginning of iteration #4, window contains 'the'
'the' fell out of window
Next word is 'over'
'over' fell in window
On beginning of iteration #5, window contains 'over'
Next word is 'the'
'the' fell in window
On beginning of iteration #6, window contains 'over,the'
Next word is 'moon'
On beginning of iteration #7, window contains 'over,the'
Next word is 'and'
On beginning of iteration #8, window contains 'over,the'
'over' fell out of window
Next word is 'the'
On beginning of iteration #9, window contains 'the'
Next word is 'red'
On beginning of iteration #10, window contains 'the'
Next word is 'fox'
'fox' fell in window
On beginning of iteration #11, window contains 'fox,the'
Next word is 'jumped'
On beginning of iteration #12, window contains 'fox,the'
'the' fell out of window
Next word is 'over'
'over' fell in window
On beginning of iteration #13, window contains 'fox,over'
Next word is 'the'
'the' fell in window
True

My suggestion for solving this is to create a (hash) map, entering each word as a key and appending the word's position to the list that serves as that key's value in the map.

For the text The quick brown fox jumps over the lazy dog, this would yield a model like the one below (shown in JSON format).

Note: here all words are added to the index as if they were lowercase.

{
    "document": [
        {
            "key": "the",
            "value": [
                {
                    "location": 1
                },
                {
                    "location": 7
                }
            ]
        },
        {
            "key": "quick",
            "value": [
                {
                    "location": 2
                }
            ]
        },
        {
            "key": "brown",
            "value": [
                {
                    "location": 3
                }
            ]
        },
        {
            "key": "fox",
            "value": [
                {
                    "location": 4
                }
            ]
        },
        {
            "key": "jumps",
            "value": [
                {
                    "location": 5
                }
            ]
        },
        {
            "key": "over",
            "value": [
                {
                    "location": 6
                }
            ]
        },
        {
            "key": "lazy",
            "value": [
                {
                    "location": 8
                }
            ]
        },
        {
            "key": "dog",
            "value": [
                {
                    "location": 9
                }
            ]
        }
    ] 
}

Once this index has been created, it is easy to see how far apart different words are, as shown for the word the, which occurs at positions 1 and 7.

In addition, the number of times a word occurs in the text can easily be obtained from the number of positions recorded for it.

Hint: add additional location information, such as chapter/section/page, etc. A minimal sketch of this idea is shown below.
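
Here is a minimal sketch of this idea in Python, assuming whitespace tokenization and lowercasing as in the note above; build_index and within_distance are illustrative names, not part of the original answer:

from collections import defaultdict

def build_index(doc):
    # map each lowercased word to the list of positions (1-based) where it occurs
    index = defaultdict(list)
    for pos, word in enumerate(doc.lower().split(), start=1):
        index[word].append(pos)
    return index

def within_distance(index, keywords, max_distance):
    # True if some window of max_distance consecutive word positions contains every keyword
    # merge every keyword occurrence into one sorted list of (position, keyword)
    occurrences = sorted((pos, kw) for kw in keywords for pos in index.get(kw, []))
    counts = defaultdict(int)
    left = 0
    for pos, kw in occurrences:
        counts[kw] += 1
        # shrink the window from the left while it spans max_distance positions or more
        while pos - occurrences[left][0] >= max_distance:
            counts[occurrences[left][1]] -= 1
            left += 1
        if all(counts[k] > 0 for k in keywords):
            return True
    return False

index = build_index("The quick brown fox jumps over the lazy dog")
print(within_distance(index, {"fox", "over", "lazy"}, 5))   # True: positions 4, 6 and 8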

I ran some simple benchmarks under these conditions:

  • Python 3.4 on Windows
  • 150 distinct random words, 5 - 16 characters long
  • 10 search words, all of which must be found
  • window length of 75
  • iterated over 50 million words, about 514 million characters in total

Word generation:

import numpy as np

# word(i), n_distinct_words and n_words come from the benchmark setup described above
# (150 distinct random words of 5 - 16 characters, 50 million words in total)
def generator(gen_salt):
    words = [word(i) for i in range(n_distinct_words)]
    np.random.seed(123)

    for i in range(int(n_words)):
        yield words[np.random.randint(0, n_distinct_words)]

Search code, with words = generator, search_words = set, window_len = int:

from collections import deque
from time import time

def deque_window(words, search_words, window_len):
    start = time()
    result = []
    pos = 0

    # a deque with maxlen automatically drops the oldest word as new ones are appended
    window = deque([], window_len)

    for word in words:
        window.append(word)

        if word in search_words:
            # check that every search word is currently somewhere in the window
            all_found = True
            for search_word in search_words:
                if search_word not in window:
                    all_found = False
                    break

            if all_found:
                result.append(pos)

        pos += 1

    return result, time() - start
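
A hypothetical driver for the benchmark, assuming the generator above; the choice of search words here is purely illustrative, while the window length of 75 mirrors the benchmark conditions:

from itertools import islice

stream = generator(0)
# take the first 10 words from the stream as search words (may contain duplicates)
search_words = set(islice(stream, 10))
matches, elapsed = deque_window(stream, search_words, 75)
print("%d matching positions found in %.1f seconds" % (len(matches), elapsed))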

In fact, even just counting the total number of characters took 31 seconds, while finding the window indices that contained all of the search words took only 48 seconds. I'm not sure whether randint or the list lookup is really that slow. I would need a more efficient generator; perhaps I'll store the result on disk and try to read it from there (which would be much closer to your scenario).

The total length was computed like this:

sum(len(w) for w in words)

You just need the open source Apache Solr software.

Apache Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene™

Follow this link for more information. Trust me, this gives fast results even for terabytes of data.
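
As a rough sketch, Solr's standard query syntax supports phrase slop, which expresses exactly this kind of proximity constraint. A minimal query from Python might look like the following; the core name docs, the field name text, and the default endpoint are assumptions for illustration, not from the original answer:

import requests

# phrase slop ~30: all quoted terms must occur within roughly 30 positions of each other
params = {
    "q": 'text:"fox over the"~30',
    "fl": "id,score",
    "wt": "json",
}
response = requests.get("http://localhost:8983/solr/docs/select", params=params)
for doc in response.json()["response"]["docs"]:
    print(doc["id"], doc["score"])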