Python: Access a "field" in a line
I have the following .txt file (a modified bash EMBOSS dreg report; the original report is in seqtable format):
Start End Strand Pattern Sequence
43392 43420 + regex:[T][G][A][TC][C][CTG]\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCACGCCGAATGGAAACACGTTTT
52037 52064 + regex:[T][G][A][TC][C][CTG]\D{15,17}[CA][G][T][AT][AT][CTA] TGACCCTGCTTGGCGATCCCGGCGTTTC
188334 188360 + regex:[T][G][A][TC][C][CTG]\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCGCAACTGCAGCGGGAGTTAC
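I only want to access the elements under "Sequence", compare them with some variables, and delete the whole line if the comparison does not give the desired result (I use the Levenshtein distance for the comparison).
But I can't even get started... :(
I am looking for something like the Linux -f option, to get directly to the right "field" in the line for the comparison.
I came across re.split: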
with open(textFile) as f:
    for line in f:
        cleaned = re.split(r'\t', line)
        print(cleaned)
This results in:
[' Start End Strand Pattern Sequence\n']
['\n']
[' 43392 43420 + regex:[T][G][A][TC][C][CTG]\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCACGCCGAATGGAAACACGTTTT\n']
['\n']
[' 52037 52064 + regex:[T][G][A][TC][C][CTG]\D{15,17}[CA][G][T][AT][AT][CTA] TGACCCTGCTTGGCGATCCCGGCGTTTC\n']
['\n']
[' 188334 188360 + regex:[T][G][A][TC][C][CTG]\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCGCAACTGCAGCGGGAGTTAC\n']
['\n']
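That is the closest I got to "splitting my lines into elements". I feel like I am completely on the wrong track, but searching Stack Overflow and Google turned up nothing :(
I have never worked with the seqtable format before, so I tried to handle it as a .txt file. Maybe there is another, better way to handle it?
Python is the main language I am learning and I am not very confident with Bash, but a Bash answer to the problem would be fine with me too.
Thanks for any hint/link/help :)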
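The format itself seems to use blank lines as separators between rows, and your r'\t' pattern is not doing anything: the regex engine treats it as a tab character, and, judging by what you have pasted, the data is not tab-delimited. It uses a variable number of spaces to pad the table instead.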
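The difference is easy to check on a single line. The sketch below uses one data row copied from the question; the variable names are only illustrative.

```python
import re

# One data row from the question: padded with spaces, no tab characters.
line = r" 43392 43420 + regex:[T][G][A][TC][C][CTG]\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCACGCCGAATGGAAACACGTTTT" + "\n"

# Splitting on tabs finds nothing to split on, so the whole line comes back
# as a single element.
tab_split = re.split(r"\t", line)
print(len(tab_split))

# str.split() with no arguments splits on any run of whitespace and drops
# the leading/trailing whitespace, including the newline.
fields = line.split()
print(len(fields))
print(fields[-1])
```

The first call returns a one-element list, while `split()` yields the five columns, with the sequence as the last one.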
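To remedy both, you can read the file, treat the first line as a header (if you need it), then read the rest line by line, strip the leading and trailing whitespace, check whether any data is left and, if so, split it further on whitespace to get your row elements: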
with open("your_data", "r") as f:
    header = f.readline().split()  # read the first line as a header
    for line in f:  # read the rest of the file line-by-line
        line = line.strip()  # first clear out the whitespace
        if line:  # check if there is any content left or is it an empty line
            elements = line.split()  # split the data on whitespace to get your elements
            print(elements[-1])  # print the last element
For the sample data this prints:
TGATCGCACGCCGAATGGAAACACGTTTT
TGACCCTGCTTGGCGATCCCGGCGTTTC
TGATCGCGCAACTGCAGCGGGAGTTAC
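As a bonus, since you have a header, you can turn it into a map and then use 'proxied' named access to get the element you are looking for, so you don't need to worry about element positions: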
with open("your_data", "r") as f:
    # read the header and turn it into a value:index map
    header = {v: i for i, v in enumerate(f.readline().split())}
    for line in f:  # read the rest of the file line-by-line
        line = line.strip()  # first clear out the whitespace
        if line:  # check if there is any content left or is it an empty line
            elements = line.split()
            print(elements[header["Sequence"]])  # print the Sequence element
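You can also use a header map to turn your rows into dict structures for even easier access.
UPDATE: Here is how to create a header map and then use it to build a dict from each of your lines: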
with open("your_data", "r") as f:
    # read the header and turn it into an index:value map
    header = {i: v for i, v in enumerate(f.readline().split())}
    for line in f:  # read the rest of the file line-by-line
        line = line.strip()  # first clear out the whitespace
        if line:  # check if there is any content left or is it an empty line
            # split the line, iterate over it and use the header map to create a dict
            row = {header[i]: v for i, v in enumerate(line.split())}
            print(row["Sequence"])  # ... or you can append it to a list for later use
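If you would rather collect the rows for later use than print them, the same idea can be wrapped in a small helper. This is only a sketch: `read_rows` and the inline `sample` lines are hypothetical names and data for illustration (two rows copied from the question), and `zip` is used in place of the index map, which gives the same pairing of header names and values.

```python
def read_rows(lines):
    """Turn whitespace-padded table lines into a list of dicts keyed by header name."""
    lines = iter(lines)
    header = next(lines).split()  # the first line holds the column names
    rows = []
    for line in lines:
        elements = line.split()  # an empty or whitespace-only line gives []
        if elements:
            rows.append(dict(zip(header, elements)))
    return rows


PATTERN = r"regex:[T][G][A][TC][C][CTG]\D{15,17}[CA][G][T][AT][AT][CTA]"
sample = [
    " Start End Strand Pattern Sequence\n",
    "\n",
    " 43392 43420 + " + PATTERN + " TGATCGCACGCCGAATGGAAACACGTTTT\n",
    "\n",
    " 52037 52064 + " + PATTERN + " TGACCCTGCTTGGCGATCCCGGCGTTTC\n",
]

rows = read_rows(sample)
print([row["Sequence"] for row in rows])
```

Because an open file is itself an iterable of lines, passing the file object from `open("your_data")` instead of `sample` should work the same way.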
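As for how to 'delete' lines that you don't want for some reason: you have to create a temporary file, loop through your original file, compare your values, write the lines you want to keep to the temporary file, delete the original file, and finally rename the temporary file to match your original file, e.g.: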
import shutil
from tempfile import NamedTemporaryFile

SOURCE_FILE = "your_data"  # path to the original file to process

def compare_func(seq):  # a simple comparison function for our sequence
    return not seq.endswith("TC")  # use Levenshtein distance or whatever you want instead

# open a temporary file for writing and our source file for reading
with NamedTemporaryFile(mode="w", delete=False) as t, open(SOURCE_FILE, "r") as f:
    header_line = f.readline()  # read the header
    t.write(header_line)  # write the header immediately to the temporary file
    header = {v: i for i, v in enumerate(header_line.split())}  # create a header map
    last_line = ""  # a var to store the whitespace to keep the same format
    for line in f:  # read the rest of the file line-by-line
        row = line.strip()  # first clear out the whitespace
        if row:  # check if there is any content left or is it an empty line
            elements = row.split()  # split the row into elements
            # now let's call our comparison function
            if compare_func(elements[header["Sequence"]]):  # keep the line if True
                t.write(last_line)  # write down the last whitespace to the temporary file
                t.write(line)  # write down the current line to the temporary file
        else:
            last_line = line  # store the whitespace for later use
shutil.move(t.name, SOURCE_FILE)  # finally, overwrite the source with the temporary file
This produces the same file minus the second row of your example, since its sequence ends in TC and our compare_func() returns False in that case.
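Since the question mentions the Levenshtein distance, here is a rough sketch of a comparison function built on it. It is a plain dynamic-programming implementation with no third-party package; `TARGET` and `MAX_DISTANCE` are hypothetical values that you would replace with your own reference sequence and threshold.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                   # delete ch_a
                current[j - 1] + 1,                # insert ch_b
                previous[j - 1] + (ch_a != ch_b),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


TARGET = "TGATCGCACGCCGAATGGAAACACGTTTT"  # hypothetical reference sequence
MAX_DISTANCE = 3                          # hypothetical threshold


def compare_func(seq):
    """Keep a row only if its sequence is close enough to TARGET."""
    return levenshtein(seq, TARGET) <= MAX_DISTANCE
```

A function with this shape can stand in for the `endswith("TC")` check, since the filtering loop only needs something that returns True for rows to keep.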
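To reduce complexity, you could instead load the whole source file into working memory and then write over it, without a temporary file. That only works for files that fit into your working memory, though, while the approach above can handle files as large as your free storage space.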
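A minimal sketch of that in-memory variant, assuming the file comfortably fits in memory. `filter_in_memory` and its `keep` parameter are hypothetical names for illustration; `keep` is any function that takes a sequence string and returns True for rows to retain.

```python
def filter_in_memory(path, keep):
    """Rewrite the file at `path`, dropping data rows whose Sequence fails keep()."""
    with open(path, "r") as f:
        lines = f.readlines()  # the whole file is held in memory here
    if not lines:
        return  # nothing to do for an empty file

    # map each column name to its position, based on the header line
    header = {v: i for i, v in enumerate(lines[0].split())}
    kept = [lines[0]]  # always keep the header line
    for line in lines[1:]:
        elements = line.split()
        # blank lines are copied through unchanged; data rows must pass keep()
        if not elements or keep(elements[header["Sequence"]]):
            kept.append(line)

    with open(path, "w") as f:  # overwrite the original file in place
        f.writelines(kept)
```

Calling `filter_in_memory("your_data", compare_func)` would then filter the file in place. Unlike the temporary-file version, this sketch leaves every blank separator line where it was, so a dropped row leaves two consecutive blank lines behind, and a crash during the final write can leave the file truncated.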