How to get all the words around a word within a fixed proximity

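I have texts of variable size (1k-100k characters). I want to get all the words around a given word within a fixed proximity. The given word comes from a regex match, so I have its start and end positions.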

For example:

import re

PROXIMITY_LENGTH = 10  # the fixed proximity
my_text = 'some random words 1123 word1 word123 a'
start, stop = re.search(r'\b1123\b', my_text).span()

print(f'start = {start}, stop = {stop}')
print(my_text[start - PROXIMITY_LENGTH: start]) 
print(my_text[stop: stop + PROXIMITY_LENGTH])

left_limit = my_text[:start - PROXIMITY_LENGTH].rfind(' ') + 1
right_limit = stop + PROXIMITY_LENGTH + my_text[stop + PROXIMITY_LENGTH:].find(' ') 

print('\n')
print(my_text[left_limit: start]) 
print(my_text[stop: right_limit])

Output:

start = 18, stop = 22
dom words 
 word1 wor


random words 
 word1 word123

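The problem is at the limits: the fixed distance can cut through the last word on the left or right edge. In the example above I tried to come up with a solution, but it fails when the words are separated by a tab or a newline, for example:

For my_text = 'some\trandom words 1123 word1 word123 a' my solution returns 'some\trandom words ' on the left side, which is wrong.

Any help is appreciated! Thanks!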

Instead of looking at characters, I would look at words. That way you can say: find my target, plus N words before and after it:

PROXIMITY_LENGTH = 2  # the fixed proximity
my_text = 'some random words 1123 word1 word123 a \t1123 this too will work'.split()

found = [x.find('1123') for x in my_text]

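# max(0, ...) keeps a match near the start of the text from producing a negative slice index
k = [' '.join(my_text[max(0, index - PROXIMITY_LENGTH):index + PROXIMITY_LENGTH + 1]) for index, item in enumerate(found) if item == 0]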


print(k)

# ['random words 1123 word1 word123', 'word123 a 1123 this too']

Using a regex, we can build the found variable like this instead:


found = []
for x in my_text:
    if re.search(r'\b1123\b',x):
        found.append(0)
    else:
        found.append(-1)

My only idea was to split the string into a list :)
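
The word-based idea above can be wrapped into a small function. This is only a sketch: the name words_around and its parameters are made up for illustration, and it assumes words are separated by whitespace only. The output joins words with single spaces, so the original tabs and newlines are not preserved.

```python
import re

def words_around(text, pattern, n_words):
    """Return each token matching `pattern` with up to `n_words` tokens on either side."""
    tokens = text.split()  # splits on any run of whitespace: spaces, tabs, newlines
    target = re.compile(pattern)
    contexts = []
    for index, token in enumerate(tokens):
        if target.search(token):
            lo = max(0, index - n_words)  # clamp so a negative index cannot wrap around
            contexts.append(' '.join(tokens[lo:index + n_words + 1]))
    return contexts

print(words_around('1123 a\tb\nc 1123 d', r'\b1123\b', 2))
# ['1123 a b', 'b c 1123 d']
```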

This can be done by simply extending your regex pattern to include the desired number of words around the target match:

import re

L = 2  # using a proximity length of just 2 words for demo
my_text = 'some random words 1123 word1 word123 a'
print(re.search(r'(\w+\s+){{0,{0}}}\b1123\b(\s+\w+){{0,{0}}}'.format(L), my_text).group())

This outputs:

random words 1123 word1 word123
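
A slightly more defensive variant of the same idea, as a sketch. The function name context_by_regex is hypothetical. It assumes the target is a literal word (hence re.escape), uses non-capturing groups, and swaps \w+ for \S+ so that neighbors with punctuation attached (such as "words,") still count as words. That last choice is mine, not part of the pattern above.

```python
import re

def context_by_regex(text, word, n_words):
    """Return `word` with up to `n_words` whitespace-separated neighbors per side, or None."""
    # Doubled braces are literal braces in an f-string, so this yields e.g. {0,2}.
    pattern = (rf'(?:\S+\s+){{0,{n_words}}}'
               rf'\b{re.escape(word)}\b'
               rf'(?:\s+\S+){{0,{n_words}}}')
    match = re.search(pattern, text)
    return match.group() if match else None

print(context_by_regex('some\trandom words 1123 word1 word123 a', '1123', 2))
# random words 1123 word1 word123
```

Note that the limit here is a number of words, not a number of characters, so it answers a slightly different question than the fixed character proximity.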

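This applies if you want the proximity measured in characters (the distance from start/stop) and want to extend to the whole word whenever the proximity boundary falls in the middle of a word.

In that case, I suggest searching for the first character that is neither a letter nor a digit. Try the code below: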

import re
import string

def get_left_limit(left_string, proximity):
    # number of characters to take to the left, extended to the start of a cut word
    if proximity >= len(left_string):
        return len(left_string)

    start_diff = 0
    for letter in reversed(list(left_string[:-proximity])):
        if letter not in (string.ascii_letters + string.digits):
            break
        start_diff += 1
    return proximity + start_diff

def get_right_limit(right_string, proximity):
    if proximity >= len(right_string):
        return len(right_string)

    end_diff = 0
    for letter in list(right_string[proximity:]):
        if letter not in (string.ascii_letters + string.digits):
            break
        end_diff += 1
    return proximity + end_diff


PROXIMITY_LENGTH = 10  # the fixed proximity


# example 1
print('Example: 1')
my_text = 'some random words 1123 word1 word123 a'
start, stop = re.search(r'\b1123\b', my_text).span()
print(f'start = {start}, stop = {stop}')
#
left_proximity = get_left_limit(my_text[:start], PROXIMITY_LENGTH)
right_proximity = get_right_limit(my_text[stop:], PROXIMITY_LENGTH)
print(my_text[start - left_proximity:start])
print(my_text[stop:stop + right_proximity])

# example 2
print()
print('Example: 2')
my_text = 'some\trandom words 1123 word1 word123 a'
start, stop = re.search(r'\b1123\b', my_text).span()
print(f'start = {start}, stop = {stop}')
#
left_proximity = get_left_limit(my_text[:start], PROXIMITY_LENGTH)
right_proximity = get_right_limit(my_text[stop:], PROXIMITY_LENGTH)
print(my_text[start - left_proximity:start])
print(my_text[stop:stop + right_proximity])

The code above produces:

Example: 1
start = 18, stop = 22
random words 
 word1 word123

Example: 2
start = 18, stop = 22
random words 
 word1 word123
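
For comparison, the same whole-word extension can be sketched directly on indices. The name expand_to_words is hypothetical, and str.isalnum() is used as the word test, which also accepts non-ASCII letters, unlike the string.ascii_letters check above. It only extends when the cut really falls between two word characters.

```python
import re

def expand_to_words(text, start, stop, proximity):
    """Return (left_context, right_context), widened so that no word is cut in half."""
    left = max(0, start - proximity)
    right = min(len(text), stop + proximity)
    # Move the left edge outwards while it sits between two word characters.
    while left > 0 and text[left - 1].isalnum() and text[left].isalnum():
        left -= 1
    # Same for the right edge.
    while right < len(text) and text[right - 1].isalnum() and text[right].isalnum():
        right += 1
    return text[left:start], text[stop:right]

my_text = 'some\trandom words 1123 word1 word123 a'
start, stop = re.search(r'\b1123\b', my_text).span()
print(expand_to_words(my_text, start, stop, 10))
# ('random words ', ' word1 word123')
```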
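
  • Create an indexed list of the word-separator spans ('\s+').
  • Use the .span() of the found match to locate the start/end positions of the search substring in that list.
  • Taking the required number of items to the left and right of those positions gives the left and right "limits" in the text.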

Code:

import re

text = ' some random\twords 123 123 - 123 some other random words.'
regex = r'\b\d((\s*|\s*-\s*)\d){8}\b'
neighbor = 2

search_b, search_e = re.search(regex, text).span()
# raw string for the pattern, so '\s' is not treated as an invalid string escape
splitted = [(0, 0)] + [m.span(0) for m in re.finditer(r'\s+', text)] + [(len(text), len(text))]
left_limit, right_limit = None, None
for ix, (beg, end) in enumerate(splitted):
    if left_limit is None and beg >= search_b:
        left_limit = splitted[max(0, ix - 1 - neighbor)][1]
    if right_limit is None and search_e <= end:
        right_limit = splitted[min(len(splitted)-1, ix + neighbor)][0]
print(text[left_limit:right_limit])


>>>
random  words 123 123 - 123 some other
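
The same lookup can also be sketched with word spans instead of separator spans, which avoids the sentinel entries at both ends. The function name neighbors_by_spans is hypothetical, and the sketch assumes the match is non-empty and contains at least one non-whitespace character.

```python
import re

def neighbors_by_spans(text, start, stop, n_words):
    """Return the match spanning [start, stop) plus up to `n_words` words per side."""
    words = [m.span() for m in re.finditer(r'\S+', text)]
    # First and last words that overlap the match.
    first = next(i for i, (beg, end) in enumerate(words) if end > start)
    last = max(i for i, (beg, end) in enumerate(words) if beg < stop)
    lo = max(0, first - n_words)
    hi = min(len(words) - 1, last + n_words)
    # Slicing the original text keeps tabs and newlines between words intact.
    return text[words[lo][0]:words[hi][1]]

text = ' some random\twords 123 123 - 123 some other random words.'
start, stop = re.search(r'\b\d((\s*|\s*-\s*)\d){8}\b', text).span()
print(repr(neighbors_by_spans(text, start, stop, 2)))
# 'random\twords 123 123 - 123 some other'
```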

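All the answers were very helpful, but I came up with a simple approach that takes all the words within the proximity range except the ones sitting on the limits: if the proximity limit would cut a word, that word is left out entirely. I found this approach more efficient: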

import re

text = ' some random\twords 123 123 - 123 some other random words.'
regex = r'\b\d((\s*|\s*-\s*)\d){8}\b'
PROXIMITY_LENGTH = 10
REGEX_NO_START_END_WORD = r'\W.+\W'

start, end = re.search(regex, text).span()

left_limit = start - PROXIMITY_LENGTH
if left_limit < 0:
    left_limit = 0

right_limit = end + PROXIMITY_LENGTH
if right_limit > len(text):
    right_limit = len(text)

text_within_proximity = text[left_limit: right_limit]
re.search(REGEX_NO_START_END_WORD, text_within_proximity, flags=re.DOTALL).group()

Output:

'\twords 123 123 - 123 some '
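
One caveat with the \W.+\W trim: it always drops the first and last word of the window, even when they were not cut. For example, if the match sits at the very start of the text, the match itself is stripped. A possible variant that trims a side only when the cut lands inside a word is sketched below. The names trimmed_window and is_word_char are hypothetical, and is_word_char only approximates what \w matches.

```python
import re

def is_word_char(ch):
    # Rough equivalent of \w for a single character: letters, digits, underscore.
    return ch.isalnum() or ch == '_'

def trimmed_window(text, start, end, proximity):
    """Return the character window around [start, end) without partially cut words."""
    left = max(0, start - proximity)
    right = min(len(text), end + proximity)
    # Left edge: if the character just outside is a word character, skip the cut word.
    if left > 0 and is_word_char(text[left - 1]):
        while left < start and is_word_char(text[left]):
            left += 1
    # Right edge: if the character just outside is a word character, drop the cut word.
    if right < len(text) and is_word_char(text[right]):
        while right > end and is_word_char(text[right - 1]):
            right -= 1
    return text[left:right]

text = ' some random\twords 123 123 - 123 some other random words.'
start, end = re.search(r'\b\d((\s*|\s*-\s*)\d){8}\b', text).span()
print(repr(trimmed_window(text, start, end, 10)))
# '\twords 123 123 - 123 some '
print(repr(trimmed_window('1123 word1 word123', 0, 4, 10)))
# '1123 word1 '
```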