Parse online text file for only most recent data
Edit 8/23:
Thanks for the replies and for code that is probably more efficient than mine. However, I didn't do a great job of describing exactly what I'm trying to capture.
@DarkKnight is correct that the important tokens I'm querying are in column 5. But for each of those important tokens, I need to parse up to 15 lines of text to capture a complete model run. Using "TVCN" as an example, I need to capture all of these:
AL, 07, 2021082118, 03, TVCN, 0, 197N, 995W, 0
AL, 07, 2021082118, 03, TVCN, 12, 194N, 1026W, 0
AL, 07, 2021082118, 03, TVCN, 24, 191N, 1055W, 0
AL, 07, 2021082118, 03, TVCN, 36, 198N, 1084W, 0
AL, 07, 2021082118, 03, TVCN, 48, 202N, 1113W, 0
AL, 07, 2021082118, 03, TVCN, 60, 204N, 1139W, 0
AL, 07, 2021082118, 03, TVCN, 72, 208N, 1164W, 0
AL, 07, 2021082118, 03, TVCN, 84, 210N, 1188W, 0
AL, 07, 2021082118, 03, TVCN, 96, 211N, 1209W, 0
AL, 07, 2021082118, 03, TVCN, 108, 206N, 1230W, 0
AL, 07, 2021082118, 03, TVCN, 120, 201N, 1251W, 0
Column 3 is the date/time of the model run (yyyymmddhh), and column 6 is the forecast hour. So, to plot a forecast through time while capturing only the latest model run, I need to return all TVCN instances where the date is "2021082118". Of course, the date value updates each time the model runs again. Does that make sense?
My code partially does what I need, but I've been struggling to get it exactly where I want it. I'm pulling comma-separated data from an online text file, and my code throws out the lines I don't want. These are raw data from hurricane forecast models. However, the online text file stores every model run for a given storm, and I only want to extract the latest run for each of my chosen models. Each model has multiple lines of text per run (forecasts for t+12, t+24, etc.). Is this doable?
Here is my partially working code:
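The requirement described above (keep every forecast-hour row belonging to the latest run of each selected model) can be sketched in two passes over the same data: first find the latest run date per model code, then keep only the rows matching it. This is a minimal, hypothetical sketch — the inline sample rows stand in for the real downloaded file:

```python
# Two-pass sketch: keep all rows of the latest run per model code.
# The sample data below stands in for the real .dat file.
sample = """\
AL, 07, 2021082112, 03, TVCN, 0, 195N, 990W, 0
AL, 07, 2021082118, 03, TVCN, 0, 197N, 995W, 0
AL, 07, 2021082118, 03, TVCN, 12, 194N, 1026W, 0
"""

codes = {"TVCN"}

# Pass 1: find the latest run date (column 3, yyyymmddhh) per model
# code (column 5). Fixed-width yyyymmddhh strings compare correctly.
latest = {}
rows = []
for line in sample.splitlines():
    parts = [p.strip() for p in line.split(',')]
    if len(parts) > 5 and parts[4] in codes:
        rows.append(parts)
        latest[parts[4]] = max(latest.get(parts[4], ''), parts[2])

# Pass 2: keep only rows whose run date matches the latest for that code.
kept = [r for r in rows if r[2] == latest[r[4]]]
```

With the sample above, `latest` ends up as `{'TVCN': '2021082118'}` and `kept` holds the two rows from that run.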
import urllib.request

webf = urllib.request.urlopen("http://hurricanes.ral.ucar.edu/realtime/plots/northatlantic/2021/al072021/aal072021.dat")
lines = webf.readlines()

important_codes = ["AEM2", "AEMI", "AVNI", "CEM2", "COT2", "CTC1", "DSHP", "EGR2", "HMON", "HWFI", "NNIB", "LGEM", "NNIC", "OFCI", "OFCL", "SHIP", "TVCN", "UKX2"]

def is_important_line(line):
    return any(code in line for code in important_codes)

output_lines = []
for line in lines:
    decoded_line = line.decode("utf-8")
    if not is_important_line(decoded_line):
        continue
    output_lines.append(decoded_line)

f = open('test.txt', 'w')
f.write("".join(output_lines))
f.close()
It would probably be better to write the output file while iterating over the input data. The "important" tokens appear to be in column 5 (base 1). Matching anywhere in the line could give ambiguous results — for example, if 'AVNI' happened to occur elsewhere in a line. Try this:-
import requests

IC = ["AEM2", "AEMI", "AVNI", "CEM2", "COT2", "CTC1", "DSHP", "EGR2", "HMON",
      "HWFI", "NNIB", "LGEM", "NNIC", "OFCI", "OFCL", "SHIP", "TVCN", "UKX2"]

with open('test.txt', 'w') as outfile:
    with requests.Session() as session:
        response = session.get(
            'http://hurricanes.ral.ucar.edu/realtime/plots/northatlantic/2021/al072021/aal072021.dat')
        response.raise_for_status()
        for line in response.text.splitlines():
            try:
                # column 5 (index 4) holds the model code
                if line.split(',')[4].strip() in IC:
                    outfile.write(f'{line}\n')
            except IndexError:
                pass  # line has fewer than 5 columns
print('Done')
Edit: If you're only interested in the most recent occurrence of each "important" token, then you could do this:-
import requests

IC = ["AEM2", "AEMI", "AVNI", "CEM2", "COT2", "CTC1", "DSHP", "EGR2", "HMON",
      "HWFI", "NNIB", "LGEM", "NNIC", "OFCI", "OFCL", "SHIP", "TVCN", "UKX2"]

with open('test.txt', 'w') as outfile:
    TD = {}
    with requests.Session() as session:
        response = session.get(
            'http://hurricanes.ral.ucar.edu/realtime/plots/northatlantic/2021/al072021/aal072021.dat')
        response.raise_for_status()
        for line in response.text.splitlines():
            try:
                # walrus assignment: k is the model code from column 5
                if (k := line.split(',')[4].strip()) in IC:
                    TD[k] = line  # later lines overwrite earlier ones
            except IndexError:
                pass
    for v in TD.values():
        outfile.write(f'{v}\n')
print('Done')
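Note that the dict approach above keeps only the single most recent line per code, while the question (after the 8/23 edit) asks for every line of the latest run. One way to extend the same idea — a sketch, with inline sample lines standing in for `response.text` — is to store the run date alongside a list of lines for each code:

```python
# Sketch: keep ALL lines of each code's latest run, not just the last
# line seen. The sample text stands in for the downloaded file.
text = """\
AL, 07, 2021082112, 03, TVCN, 0, 195N, 990W, 0
AL, 07, 2021082118, 03, TVCN, 0, 197N, 995W, 0
AL, 07, 2021082118, 03, TVCN, 12, 194N, 1026W, 0
"""
IC = {"TVCN"}

runs = {}  # model code -> (run_date, [lines of that run])
for line in text.splitlines():
    parts = [p.strip() for p in line.split(',')]
    if len(parts) <= 5 or parts[4] not in IC:
        continue
    code, date = parts[4], parts[2]
    if code not in runs or date > runs[code][0]:
        runs[code] = (date, [line])    # newer run seen: start over
    elif date == runs[code][0]:
        runs[code][1].append(line)     # same run: keep this line too
```

After the loop, `runs['TVCN']` holds the '2021082118' run with both of its forecast-hour lines; writing the lists out replaces the `TD.values()` loop above.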
OK, I was filtering on the wrong column. This should work:
from itertools import groupby

output_lines = []
for line in lines:  # 'lines' as read by urllib in the question's code
    line = line.decode("utf-8")
    line = line.split(', ')[:-1]
    if len(line) < 5 or line[4] not in important_codes:
        continue
    output_lines.append(line)

# Sort by model code so groupby sees each code as one contiguous group
output_lines = sorted(output_lines, key=lambda x: x[4])

new_output = []
for code, group in groupby(output_lines, key=lambda x: x[4]):
    best_date = 0
    temp_entries = []
    # Within each code, keep only the entries from the latest run date
    for date, entries in groupby(group, key=lambda x: x[2]):
        date = int(date)
        if date > best_date:
            best_date = date
            temp_entries = list(entries)
    for entry in temp_entries:
        new_output.append(', '.join(entry))

with open('mydata.dat', 'w') as f:
    f.write('\n'.join(new_output))