Extract a part from a file between specific lines
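I would like to know how to extract some data within a specific range from a large data file. Is there a way to read the content that starts and ends with particular "buzzwords"?
I want to read, line by line, everything between *NODE and **: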
*NODE
13021145, 2637.6073002472617, 55.011929824413045, 206.0394346892517
13021146, 2637.6051226039867, 55.21115693303926, 206.05686503802065
13021147, 2634.226986419154, 54.98263035830583, 205.9520084547658
13021148, 2634.224808775879, 55.181857466932044, 205.96943880353476
**
There are several thousand lines before *NODE and after **...
I know it should look something like this:
a = []
with open('file.txt') as file:
    for line in file:
        if line.startswith('*NODE'):
            # NOW THERE SHOULD FOLLOW SOMETHING LIKE:
            # Go to next line and "a.append" till there comes the "magical"
            # "**"
Any ideas? I am new to Python. Thanks for the help!
I hope you understand what I mean.
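You almost had it. The only thing missing is that, once you have found the beginning, you look for the end of the sequence and, until you reach it, append each line you iterate over to your list. I.e.: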
data = None  # a placeholder to store your lines
with open("file.txt", "r") as f:  # avoid the name `file` (a built-in in Python 2)
    for line in f:  # iterate over the lines
        if data is None:  # we haven't found `*NODE` yet
            if line[:5] == "*NODE":  # search for `*NODE` at the line beginning
                data = []  # make `data` an empty list to begin collecting
        elif line[:2] == "**":  # data initialized, we look for the sequence's end
            break  # no need to iterate over the file anymore
        else:  # data initialized but not at the end...
            data.append(line)  # append the line to our data
Now data will contain the list of lines between *NODE and **, or None if the sequence was not found.
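The same loop can be wrapped in a function so that it works on any iterable of lines, not only an open file. This is only a sketch: the name extract_block and its start/end parameters are made up for illustration, and stripping the trailing newline is an extra step that the answer above does not take.

```python
def extract_block(lines, start="*NODE", end="**"):
    """Return the lines between the first `start` line and the next `end` line.

    `lines` is any iterable of strings (for example an open file).
    Returns None if no line begins with `start`.
    """
    data = None  # stays None until the start marker is seen
    for line in lines:
        if data is None:
            if line.startswith(start):
                data = []  # start marker found, begin collecting
        elif line.startswith(end):
            break  # end marker found, stop reading
        else:
            data.append(line.rstrip("\n"))  # assumption: drop the newline
    return data


# Small in-memory stand-in for the real file, using rows from the question.
sample = [
    "some header line\n",
    "*NODE\n",
    "13021145, 2637.6073002472617, 55.011929824413045, 206.0394346892517\n",
    "13021146, 2637.6051226039867, 55.21115693303926, 206.05686503802065\n",
    "**\n",
    "some trailing line\n",
]
print(extract_block(sample))
```

With a real file the call would be extract_block(f) inside the with open(...) as f: block.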
Try this:
with open('file.txt') as file:
    a = []
    running = False  # avoid NameError when 'if' statement below isn't reached
    for line in file:
        if line.startswith('*NODE'):
            running = True  # show that we are starting to add values
            continue  # make sure we don't add '*NODE'
        if line.startswith('**'):
            running = False  # show that we're done adding values
            continue  # make sure we don't add '**'
        if running:  # only add the values if 'running' is True
            a.extend([i.strip() for i in line.split(',')])
The output is a list containing the following strings (I used print('\n'.join(a))):
13021145
2637.6073002472617
55.011929824413045
206.0394346892517
13021146
2637.6051226039867
55.21115693303926
206.05686503802065
13021147
2634.226986419154
54.98263035830583
205.9520084547658
13021148
2634.224808775879
55.181857466932044
205.96943880353476
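The flat list above no longer shows which values belong to the same row. If the rows should stay grouped and numeric, something like the following might help. It is a sketch that assumes the first field of each row is an integer identifier and the remaining fields are floating-point values; the question does not say what the columns mean, and parse_node_line is a hypothetical helper name.

```python
def parse_node_line(line):
    """Split one comma-separated row into (int id, tuple of floats).

    Assumption: the first field is an integer id, the rest are floats.
    """
    node_id, *values = (field.strip() for field in line.split(","))
    return int(node_id), tuple(float(v) for v in values)


# Two rows copied from the question, as they would be read from the file.
rows = [
    "13021145, 2637.6073002472617, 55.011929824413045, 206.0394346892517\n",
    "13021146, 2637.6051226039867, 55.21115693303926, 206.05686503802065\n",
]

# Map each id to its values so a row can be looked up directly.
nodes = dict(parse_node_line(row) for row in rows)
print(nodes[13021145])
```

A dict is used here only because lookups by id are convenient; a list of tuples would preserve the file order just as well.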
We can iterate over the lines until none are left or until we have reached the end of the block, like
a = []
with open('file.txt') as file:
    for line in file:
        if line.startswith('*NODE'):
            # collect block-related lines
            while True:
                try:
                    line = next(file)
                except StopIteration:
                    # there are no lines left
                    break
                if line.startswith('**'):
                    # we've reached the end of block
                    break
                a.append(line)
            # stop iterating over file
            break
which gives us
print(a)
['13021145, 2637.6073002472617, 55.011929824413045, 206.0394346892517\n',
'13021146, 2637.6051226039867, 55.21115693303926, 206.05686503802065\n',
'13021147, 2634.226986419154, 54.98263035830583, 205.9520084547658\n',
'13021148, 2634.224808775879, 55.181857466932044, 205.96943880353476\n']
Or we can write helper predicates like
def not_a_block_start(line):
    return not line.startswith('*NODE')

def not_a_block_end(line):
    return not line.startswith('**')
and then use the brilliance of the itertools module, like
from itertools import (dropwhile,
                       takewhile)

with open('file.txt') as file:
    block_start = dropwhile(not_a_block_start, file)
    # skip block start line (the default avoids StopIteration if there is no '*NODE')
    next(block_start, None)
    a = list(takewhile(not_a_block_end, block_start))
This gives us the same value of a.
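Since the file is described as large, it may be worth keeping the itertools version lazy instead of building a list right away. The sketch below packages the same dropwhile/takewhile idea into a function; iter_block and its parameters are hypothetical names, not part of itertools.

```python
from itertools import dropwhile, takewhile


def iter_block(lines, start="*NODE", end="**"):
    """Lazily yield the lines between the first `start` line and the next `end` line."""
    # Skip everything before the start marker.
    remaining = dropwhile(lambda line: not line.startswith(start), lines)
    # Consume the start marker itself; the default makes this a no-op
    # when the marker was never found.
    next(remaining, None)
    # Yield lines until the end marker (or the end of the input).
    return takewhile(lambda line: not line.startswith(end), remaining)


# In-memory stand-in for the file, using rows from the question.
sample = [
    "some header line\n",
    "*NODE\n",
    "13021147, 2634.226986419154, 54.98263035830583, 205.9520084547658\n",
    "13021148, 2634.224808775879, 55.181857466932044, 205.96943880353476\n",
    "**\n",
    "some trailing line\n",
]
for line in iter_block(sample):
    print(line, end="")
```

When the input is an open file, the returned iterator has to be consumed inside the with block, because nothing is read until it is iterated.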