Using Python (acora) to find lines containing keywords
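I am writing a program that reads a directory of text files and finds specific combinations of strings that overlap (i.e. are shared across all files). My current approach is to take one file from this directory, parse it, build a list of every string combination, and then search the other files for each combination. For example, if I have ten files, I read one file, parse it, store the keywords I need, and then search the other nine files for that combination. I repeat this for every file (making sure a file is never searched against itself). To do this, I am trying to use Python's acora module.
My code so far is: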
def match_lines(f, *keywords):
    """Taken from [https://pypi.python.org/pypi/acora/], FAQs and Recipes #3."""
    builder = AcoraBuilder('\r', '\n', *keywords)
    ac = builder.build()
    line_start = 0
    matches = False
    for kw, pos in ac.filefind(f):  # Modified from original function; search a file, not a string.
        if kw in '\r\n':
            if matches:
                yield f[line_start:pos]
                matches = False
            line_start = pos + 1
        else:
            matches = True
    if matches:
        yield f[line_start:]

def find_overlaps(f_in, fl_in, f_out):
    """f_in: input file to extract string combo from & use to search other files.
    fl_in: list of other files to search against.
    f_out: output file that'll have all lines and file names that contain the matching string combo from f_in.
    """
    string_list = build_list(f_in)  # Open the first file, read each line & build a list of tuples (string #1, string #2). The "build_list" function isn't shown in my pasted code.
    found_lines = []  # Create a list to hold all the lines (and file names, from fl_in) that are found to have the matching (string #1, string #2).
    for keywords in string_list:  # For each tuple (string #1, string #2) in the list of tuples
        for f in fl_in:  # For each file in the input file list
            for line in match_lines(f, *keywords):
                found_lines.append(line)
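As you may already know, I used the function match_lines from #3 of the "FAQs and recipes" section of the acora web page. I also use it in the mode that parses files (with ac.filefind()), which I also found on that page.
The code seems to work, but it only gives me the names of the files that have the matching string combination. The output I want is the entire line, written out, from each of the other files that contains my matching string combination (tuple).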
I don't see where this would produce file names, as you say it does.
Anyway, to get line numbers, you just need to count the lines as they pass by in match_lines():
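from acora import AcoraBuilder

def match_lines(f, *keywords):
    builder = AcoraBuilder('\r', '\n', *keywords)
    ac = builder.build()
    line_start = 0
    line_number = 0  # zero-based count of the line breaks seen so far
    matches = False
    with open(f, 'r') as handle:  # Read the content so lines are sliced from the text, not from the file name.
        text = handle.read()
    # Assumes the positions reported by filefind() line up with indexes into text.
    for kw, pos in ac.filefind(f):  # Modified from original function; search a file, not a string.
        if kw in '\r\n':
            if matches:
                yield line_number, text[line_start:pos]
                matches = False
            line_start = pos + 1
            line_number += 1
        else:
            matches = True
    if matches:
        yield line_number, text[line_start:]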
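For comparison, here is a minimal sketch of the same counting idea that also keeps the file name next to each matching line, which is what the question asks to write out. It rests on several assumptions: plain `in` substring checks stand in for acora's Aho-Corasick search, a "combination" is taken to mean that every keyword of the tuple appears on the same line, `require_all` is a hypothetical switch, and `find_overlaps` here takes the keyword tuples directly because `build_list` is not shown. The sample file contents are made up.

```python
import os
import tempfile


def match_lines(path, *keywords, require_all=True):
    """Yield (line_number, line) for lines of `path` that contain the keywords.

    Substring checks replace acora here; `require_all` is a hypothetical
    switch between "all keywords on the line" and "any keyword on the line".
    """
    check = all if require_all else any
    # Text mode translates '\r\n' and '\r' to '\n', so each line is counted once.
    with open(path, "r") as handle:
        for line_number, line in enumerate(handle, start=1):
            text = line.rstrip("\n")
            if check(kw in text for kw in keywords):
                yield line_number, text


def find_overlaps(keyword_tuples, paths):
    """Collect (path, line_number, line) for every keyword tuple and file."""
    found = []
    for keywords in keyword_tuples:
        for path in paths:
            for line_number, line in match_lines(path, *keywords):
                found.append((path, line_number, line))
    return found


# Demo on a throwaway file (contents are invented for illustration).
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "other.txt")
    with open(sample, "w") as handle:
        handle.write("alpha beta\nalpha only\nbeta then alpha\n")
    results = find_overlaps([("alpha", "beta")], [sample])
    lines = [(number, line) for _, number, line in results]
```

This rescans each file once per keyword tuple, so it is only reasonable for small inputs; with many keywords, a single-pass matcher such as acora is the better fit.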