Calculate positional proximity of two multiword exact phrases inside a large text in Python
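How can I use Python to calculate the minimum positional distance between two multiword, exact phrases inside a large text (for example, an article)?
Assume that both phrases may occur multiple times.
To avoid misunderstanding: this is not a question about fuzzy string matching, edit distance, word lists, etc. It is about calculating the positional proximity/distance between two exact phrases in a text.
Edit (solution as modified by https://whosebug.com/users/2359945/razzle-shazl):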
import re

def str_to_raw(s):
    # Map control characters back to their escape sequences, so that a
    # pattern passed as '\bred fox\b' (non-raw) still works as a regex.
    raw_map = {8:r'\b', 7:r'\a', 12:r'\f', 10:r'\n', 13:r'\r', 9:r'\t', 11:r'\v'}
    return r''.join(i if ord(i) > 32 else raw_map.get(ord(i), i) for i in s)

def find_smallest_distance1(sentence, word1, word2):
    dist = float('inf')
    p1 = str_to_raw(word1)
    p2 = str_to_raw(word2)
    s = sentence
    # Equivalent hard-coded version:
    # f1 = re.finditer(r'\bred fox\b', s, re.I)
    # f2 = re.finditer(r'\bblue hen\b', s, re.I)
    f1 = re.finditer(p1, s, re.I)
    f2 = re.finditer(p2, s, re.I)
    _f1 = _f2 = None
    while True:
        try:
            _f1 = next(f1)
        except StopIteration:
            break
        if _f2 is None:
            try:
                _f2 = next(f2)
            except StopIteration:
                break
        if _f1.span()[0] > _f2.span()[0]:
            # we want f1 to always be closer to start / lower start index
            f1, f2 = f2, f1
            _f1, _f2 = _f2, _f1
        dist = min(dist, _f2.span()[0] - _f1.span()[1])
    return dist
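I'm wondering how to modify this so that the distance to phrase2 (word2) is calculated only to the left, or only to the right, of phrase1 (word1)?
Let's find the indices of both substrings. Then we can walk both index lists in a single pass and compute the minimum distance.
I would use regular expressions because they are flexible (think of future maintainers) and powerful.
We create two iterators that return the matches of the two substrings. Then we advance whichever iterator has the lower value (in this case, the lower start index).
When this "shorter" iterator is finally exhausted, we can skip checking the rest of the other iterator, because those indices would give worse distances than the one already obtained.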
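Because the question is about exact phrases, the patterns can also be built from plain strings instead of hand-written regexes. The following is only a minimal sketch: `phrase_pattern` is a hypothetical helper name, and it assumes each phrase starts and ends with a word character so that `\b` anchors make sense.

```python
import re

def phrase_pattern(phrase: str) -> str:
    """Hypothetical helper: turn a literal phrase into a whole-word regex."""
    # re.escape neutralises regex metacharacters (such as '.' or '?') in the
    # phrase; the \b anchors stop 'red fox' from matching inside 'red foxglove'.
    return r'\b' + re.escape(phrase) + r'\b'

text = 'The red fox met the blue hen. A red foxglove grew nearby.'
starts = [m.start() for m in re.finditer(phrase_pattern('red fox'), text)]
print(starts)  # [4]
```

The resulting pattern strings can be passed wherever the hand-written `r'\bred fox\b'` style patterns are used.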
Shortest distance
import re

def positionalProximity(re1: str, re2: str, s: str, bidir: bool = True, regexFlags: int = 0) -> float:
    # returns shortest positional distance between re1 and re2
    # (float('inf') when either pattern has no match)
    # when not bidirectional, then search for re1 only to the left of re2
    dist = float('inf')
    f1 = re.finditer(re1, s, regexFlags)
    f2 = re.finditer(re2, s, regexFlags)
    _f1 = _f2 = None
    while True:
        try:
            _f1 = next(f1)
        except StopIteration:
            break
        if _f2 is None:
            try:
                _f2 = next(f2)
            except StopIteration:
                break
        if bidir and _f1.span()[0] > _f2.span()[0]:
            # we want f1 to always be closer to start / lower start index
            f1, f2 = f2, f1
            _f1, _f2 = _f2, _f1
        if bidir or _f2.span()[0] > _f1.span()[1]:
            dist = min(dist, _f2.span()[0] - _f1.span()[1])
    return dist

# Whitespace inside the literals is significant for the reported distances.
s = ('The red fox took stock of the blue hen,'
     ' and the blue hen took stock of the red fox.'
     '    "Blue hen!" cried red fox.  "Blue hen!"')
re1 = r'\bred fox\b'
re2 = r'\bblue hen\b'
print(f'dist = {positionalProximity(re1, re2, s, False)}')
print(f'dist = {positionalProximity(re1, re2, s, regexFlags=re.I)}')
Output:
dist = 19
dist = 4
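One caveat, as far as I can tell: with `bidir=False` the loop above never advances the second iterator, so only the first match of `re2` is ever compared. A hedged sketch of an alternative for the "only to the left / only to the right" case follows. The name `directional_gap` and its `side` parameter are made up for illustration; it collects all spans up front and uses `bisect` to find the nearest match on the requested side.

```python
import re
from bisect import bisect_left, bisect_right

def directional_gap(re1, re2, s, side='right', flags=0):
    """Smallest gap from a re1 match to a re2 match on one side only.

    side='right': the re2 match must start at or after the end of re1.
    side='left':  the re2 match must end at or before the start of re1.
    Returns float('inf') when no such pair exists.
    """
    spans1 = [m.span() for m in re.finditer(re1, s, flags)]
    spans2 = [m.span() for m in re.finditer(re2, s, flags)]
    # finditer yields non-overlapping matches left to right,
    # so both lists below are already sorted in ascending order.
    starts2 = [a for a, _ in spans2]
    ends2 = [b for _, b in spans2]
    best = float('inf')
    for a1, b1 in spans1:
        if side == 'right':
            i = bisect_left(starts2, b1)    # first re2 start >= end of re1
            if i < len(starts2):
                best = min(best, starts2[i] - b1)
        else:
            i = bisect_right(ends2, a1)     # count of re2 ends <= start of re1
            if i > 0:
                best = min(best, a1 - ends2[i - 1])
    return best

text = 'blue hen, red fox, then later a blue hen'
print(directional_gap(r'\bred fox\b', r'\bblue hen\b', text, 'right'))  # 15
print(directional_gap(r'\bred fox\b', r'\bblue hen\b', text, 'left'))   # 2
```

This trades the single-pass iterator walk for two lists held in memory, which should be fine for article-sized texts but is an assumption worth checking for very large inputs.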
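If you are curious about span(), it returns the start index (inclusive) and end index (exclusive) of each match:
f1 = re.finditer(re1, s, re.I)
f2 = re.finditer(re2, s, re.I)
print([f.span() for f in f1])
print([f.span() for f in f2])
Output:
[(4, 11), (75, 82), (105, 112)]
[(30, 38), (48, 56), (88, 96), (116, 124)]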
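To make the span() convention concrete, here is a small self-contained check on a made-up sentence (not the sample text above). It shows that the end index is exclusive, so slicing with the span gives back the matched text, and that "next start minus previous end" counts the characters between the two phrases.

```python
import re

text = 'red fox and blue hen'
m1 = re.search(r'\bred fox\b', text)
m2 = re.search(r'\bblue hen\b', text)

print(m1.span(), m2.span())    # (0, 7) (12, 20)

start, end = m1.span()
print(text[start:end])         # red fox

# Gap between the phrases: the 5 characters of ' and '
print(m2.start() - m1.end())   # 5
```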