How can one find a set of keywords in a document, with all/some of them within a certain distance cutoff?
I have a set of keywords, around 10 of them. I want to run a search over a very long document and check not just whether the keywords occur in the text, but also whether all/some of them, or a subset of them, lie within a distance cutoff of, say, 3 sentences, or 30 words, or any other proximity measure. How can this be done? I was about to write some Python code that finds one of the keywords and then checks whether any of the other keywords occur within roughly 3 lines of text, but that would take a lot of computing power and be inefficient.
To determine whether a set of keywords occurs within a given distance of each other in a larger document, you can use a sliding window whose length equals that distance and move it across the document. As you move the window, keep track of every word that enters and leaves it. If at any point the window contains all of the keywords, the condition is satisfied. This approach runs in O(len(document)) time and uses O(len(window)) memory.
Here is a sample implementation of this approach in Python:
from collections import defaultdict

def isInProximityWindow(doc, keywords, windowLen):
    words = doc.split()
    wordsLen = len(words)
    if windowLen > wordsLen:
        windowLen = wordsLen
    keywordsLen = len(keywords)
    allKeywordLocs = defaultdict(set)  # keyword -> positions of that keyword currently inside the window
    numKeywordsInWindow = 0
    locKeyword = {}  # position -> keyword that was seen at that position
    for i in range(wordsLen):
        windowContents = sorted([k for k in allKeywordLocs.keys() if allKeywordLocs[k]])
        print("On beginning of iteration #%i, window contains '%s'" % (i, ','.join(windowContents)))
        # The word at position i - windowLen slides out of the window on this iteration
        oldKeyword = locKeyword.pop(i - windowLen, None)
        if oldKeyword:
            keywordLocs = allKeywordLocs[oldKeyword]
            keywordLocs.remove(i - windowLen)
            if not keywordLocs:
                print("'%s' fell out of window" % oldKeyword)
                numKeywordsInWindow -= 1
        word = words[i]
        print("Next word is '%s'" % word)
        if word in keywords:
            locKeyword[i] = word
            keywordLocs = allKeywordLocs[word]
            if not keywordLocs:
                print("'%s' fell in window" % word)
                numKeywordsInWindow += 1
                if numKeywordsInWindow == keywordsLen:
                    return True
            keywordLocs.add(i)
    return False
Sample output:
>>> isInProximityWindow("the brown cow jumped over the moon and the red fox jumped over the black dog", set(["fox", "over", "the"]), 4)
On beginning of iteration #0, window contains ''
Next word is 'the'
'the' fell in window
On beginning of iteration #1, window contains 'the'
Next word is 'brown'
On beginning of iteration #2, window contains 'the'
Next word is 'cow'
On beginning of iteration #3, window contains 'the'
Next word is 'jumped'
On beginning of iteration #4, window contains 'the'
'the' fell out of window
Next word is 'over'
'over' fell in window
On beginning of iteration #5, window contains 'over'
Next word is 'the'
'the' fell in window
On beginning of iteration #6, window contains 'over,the'
Next word is 'moon'
On beginning of iteration #7, window contains 'over,the'
Next word is 'and'
On beginning of iteration #8, window contains 'over,the'
'over' fell out of window
Next word is 'the'
On beginning of iteration #9, window contains 'the'
Next word is 'red'
On beginning of iteration #10, window contains 'the'
Next word is 'fox'
'fox' fell in window
On beginning of iteration #11, window contains 'fox,the'
Next word is 'jumped'
On beginning of iteration #12, window contains 'fox,the'
'the' fell out of window
Next word is 'over'
'over' fell in window
On beginning of iteration #13, window contains 'fox,over'
Next word is 'the'
'the' fell in window
True
A suggestion for solving this is to build a (Hash)Map with each word as the key and a list of the word's positions as the value.
For the text The quick brown fox jumps over the lazy dog this produces a model like the following (in JSON format).
Note: all words are added to the index as if they were lowercase.
{
    "document": [
        {
            "key": "the",
            "value": [
                { "location": 1 },
                { "location": 7 }
            ]
        },
        {
            "key": "quick",
            "value": [
                { "location": 2 }
            ]
        },
        {
            "key": "brown",
            "value": [
                { "location": 3 }
            ]
        },
        {
            "key": "fox",
            "value": [
                { "location": 4 }
            ]
        },
        {
            "key": "jumps",
            "value": [
                { "location": 5 }
            ]
        },
        {
            "key": "over",
            "value": [
                { "location": 6 }
            ]
        },
        {
            "key": "lazy",
            "value": [
                { "location": 8 }
            ]
        },
        {
            "key": "dog",
            "value": [
                { "location": 9 }
            ]
        }
    ]
}
Once this index has been created, it is easy to see how far apart different words are, as shown for the word the, which occurs at positions 1 and 7.
The number of times a word occurs in the text also comes for free: it is simply the number of positions stored for that word.
Hint: store additional position information as well, such as chapter/section/page, etc.
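As a rough sketch of this idea in Python (the names build_index and within_window are mine, and it assumes simple whitespace tokenization, lower-casing, and 1-based positions as in the JSON above):

from collections import defaultdict
from itertools import product

def build_index(doc):
    # Map each lower-cased word to the list of its 1-based positions in the document
    index = defaultdict(list)
    for pos, word in enumerate(doc.split(), start=1):
        index[word.lower()].append(pos)
    return index

def within_window(index, keywords, window_len):
    # True if every keyword occurs and some combination of their occurrences fits
    # inside window_len consecutive words. Brute force over the position lists,
    # which is fine for ~10 keywords with short position lists.
    positions = []
    for kw in keywords:
        locs = index.get(kw.lower())
        if not locs:
            return False  # keyword missing from the document entirely
        positions.append(locs)
    return any(max(combo) - min(combo) < window_len for combo in product(*positions))

index = build_index("The quick brown fox jumps over the lazy dog")
print(index["the"])                                      # [1, 7]
print(within_window(index, ["fox", "over", "lazy"], 5))  # True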
I ran some simple benchmarks under these conditions:
- Python 3.4 on Windows
- 150 distinct random words, 5 to 16 characters long
- 10 search words, all of which must be found
- window length of 75
- iterated over 50 million words, about 514 million characters in total
Word generation:
import numpy as np

def generator(gen_salt):
    # word(), n_distinct_words and n_words are defined elsewhere in the benchmark setup
    words = [word(i) for i in range(n_distinct_words)]
    np.random.seed(123)
    for i in range(int(n_words)):
        yield words[np.random.randint(0, n_distinct_words)]
Search code (words = generator, search_words = set, window_len = int):
from collections import deque
from time import time

def deque_window(words, search_words, window_len):
    start = time()
    result = []
    pos = 0
    # A deque with maxlen=window_len automatically keeps only the last window_len words
    window = deque([], window_len)
    for word in words:
        window.append(word)
        if word in search_words:
            # Check whether every search word is currently somewhere in the window
            all_found = True
            for search_word in search_words:
                if search_word not in window:
                    all_found = False
                    break
            if all_found:
                result.append(pos)
        pos += 1
    return result, time() - start
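A hypothetical driver for this benchmark could look as follows (how the 10 search words are chosen is my assumption; word(), n_distinct_words and the generator come from the setup above):

search_words = set(word(i) for i in range(10))  # assume the 10 search words are among the distinct words
hits, elapsed = deque_window(generator("salt"), search_words, 75)
print("%d windows contained all search words in %.1f s" % (len(hits), elapsed))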
Actually, even just counting the total number of characters took 31 seconds, while finding the window indices that contained all the search words took only 48 seconds. I'm not sure whether randint or the list lookup is really that slow. I need a more efficient generator; maybe I will store the words on disk and read them from there (which would be closer to your scenario).
The total character count was computed like this:
sum(len(w) for w in words)
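If the words were stored on disk, as suggested above, a streaming generator along these lines (the file name and whitespace tokenization are assumptions) would keep memory usage flat:

def words_from_file(path):
    # Stream whitespace-separated words from a text file without loading it all into memory
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield from line.split()

# e.g. hits, elapsed = deque_window(words_from_file("corpus.txt"), search_words, 75)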
You just need the open-source Apache Solr software.
Apache Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene™
Click this link to learn more. Believe me, it gives fast results even for terabytes of data.
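For reference, the Lucene query syntax that Solr uses supports proximity searches via phrase slop, so once the document is indexed, a query along these lines could express the proximity condition (the core name docs and field name text are placeholders, and the slop ~30 roughly means "within 30 positions of each other"):

import urllib.parse, urllib.request

# Query a hypothetical local Solr core "docs" whose documents store their text in a "text" field
params = urllib.parse.urlencode({"q": 'text:"fox over the"~30', "wt": "json"})
with urllib.request.urlopen("http://localhost:8983/solr/docs/select?" + params) as resp:
    print(resp.read().decode("utf-8"))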