Python regex to get n characters before and after a keyword in a line of text
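I am trying to parse a file and search each line for keywords taken from a list of strings. I need to return 'n' characters before and after every occurrence. I got it working without regex, but it is not efficient. Any idea how to do the same thing with regex and findall? lookup is a list of strings. Here is what I have without regex: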
with open(file, 'r') as temp:
    for num, line in enumerate(temp, 1):
        for string in lookup:
            if string in line:
                # Split the line once, around the first occurrence only
                tmp1, tmp2 = line.split(string, 1)
                # Truncate only 'n' characters before and after the keyword
                tmp = tmp1[-n:] + string + tmp2[:n]
                # Do something here...
Here is my start with regex:
import re

with open(file, 'r') as temp:
    for num, line in enumerate(temp, 1):
        for string in lookup:
            # Regex search, ignoring case
            searchObj = re.findall(string, line, re.M | re.I)
            if searchObj:
                print("search --> : {}".format(searchObj))
                # Loop through searchObj and get n characters
From https://docs.python.org/2/library/re.html:
start([group])
end([group])
Return the indices of the start and end of the substring matched by
group; group defaults to zero (meaning the whole matched substring).
Return -1 if group exists but did not contribute to the match. For a
match object m, and a group g that did contribute to the match, the
substring matched by group g (equivalent to m.group(g)) is
m.string[m.start(g):m.end(g)]
Note that m.start(group) will equal m.end(group) if group matched a
null string. For example, after m = re.search('b(c?)', 'cba'),
m.start(0) is 1, m.end(0) is 2, m.start(1) and m.end(1) are both
2, and m.start(2) raises an IndexError exception.
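The example from the quoted documentation can be checked directly. This is only a small sketch that reuses the pattern and string from the excerpt; the variable names (whole, group1, matched, raised) are made up for illustration.

```python
import re

# Same pattern and input as the documentation excerpt
m = re.search('b(c?)', 'cba')

# Indices of the whole match (group 0) and of the optional group 1
whole = (m.start(0), m.end(0))
group1 = (m.start(1), m.end(1))

# Slicing m.string with those indices gives back the matched text
matched = m.string[m.start(0):m.end(0)]

# Asking for a group that does not exist raises IndexError
try:
    m.start(2)
    raised = False
except IndexError:
    raised = True

print(whole, group1, repr(matched), raised)
```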
Use re.finditer to get an iterator of MatchObject instances, then use these methods to find the start and end of each matched substring.
I got it working. In case anyone needs it, here is the code:
import re

with open(file, 'r') as temp:
    for num, line in enumerate(temp, 1):
        for string in lookup:
            # finditer yields one match object per occurrence of the keyword
            for match in re.finditer(string, line, re.M | re.I):
                # Start and end indices of the keyword
                start, end = match.span()
                # Keep only 'n' characters before and after the keyword;
                # max() stops a negative start from wrapping to the end of the line
                tmp = line[max(0, start - n):end + n] + '\n'
                print(tmp)
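For anyone who wants to try this without a file, here is a minimal self-contained sketch of the same idea. The function name keyword_contexts and the sample sentence are hypothetical, and it assumes the keywords are plain text rather than regex patterns, which is why it calls re.escape; drop that call if the lookup entries are meant to be patterns.

```python
import re

def keyword_contexts(line, keywords, n):
    """Return (keyword, snippet) pairs with up to n characters around each match."""
    results = []
    for keyword in keywords:
        # Assumption: keywords are literal text, so escape regex metacharacters
        pattern = re.compile(re.escape(keyword), re.IGNORECASE)
        for match in pattern.finditer(line):
            start, end = match.span()
            # Clamp at 0 so a match near the start of the line does not
            # produce a negative index that counts from the end
            snippet = line[max(0, start - n):end + n]
            results.append((keyword, snippet))
    return results

sample = "The cat sat on the mat with another Cat."
for keyword, snippet in keyword_contexts(sample, ["cat"], 4):
    print(keyword, repr(snippet))
```

Slicing past the end of a string is safe in Python, so only the lower bound needs clamping.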