Python: how to split an irregularly delimited file?
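I have an interesting problem caused by some newly formatted, badly structured data files. They are text files, comma delimited, containing multiple sets of data, each with its own unique header. Originally I used genfromtxt to read a single instance of the data with only one header. Now that there are multiple instances, genfromtxt can't handle it. What is the best way to split the file and feed each individual instance to genfromtxt? Here is a sample of the file. The data from the first instance butts up directly against the header of the second instance. This repeats about 20 times per file. I haven't been able to find a common delimiter that would separate them.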
0.8 9999.0 999.0 999.0 999.0 9999.0 9999.0 999.0 999.0 999.0 9999.000 999.000 999.0 999.0 99999.0 9.0 9.0 9.0 9.0 9.0 9.0
0.5 9999.0 999.0 999.0 999.0 9999.0 9999.0 999.0 999.0 999.0 72.380 -7.761 999.0 999.0 99999.0 9.0 9.0 9.0 9.0 9.0 9.0
0.3 9999.0 999.0 999.0 999.0 9999.0 9999.0 999.0 999.0 999.0 9999.000 999.000 999.0 999.0 99999.0 9.0 9.0 9.0 9.0 9.0 9.0
0.0 9999.0 999.0 999.0 999.0 9999.0 9999.0 999.0 999.0 999.0 72.381 -7.760 999.0 999.0 99999.0 9.0 9.0 9.0 9.0 9.0 9.0
-1.0 906.7 20.0 18.9 92.8 -10.1 -3.7 10.7 70.0 999.0 72.380 -7.761 999.0 999.0 953.8 1.0 1.0 1.0 1.0 1.0 9.0
Data Type: AVAPS SOUNDING DATA, Channel 2/Descending
Project ID: DYNAMO
Release Site Type/Site ID: NOAA P3/N43RF 20111116I1
Release Location (lon,lat,alt): 072 12.04'E, 08 11.50'S, 72.201, -8.192, 966.4
UTC Release Time (y,m,d,h,m,s): 2011, 11, 16, 04:22:07
Reference Launch Data Source/Time: IWGADTS Format (IWG1)/04:22:07
Sonde Id: 110355308
System Operator/Comments: TMR/none, Good Drop
Post Processing Comments: Aspen Version 3.1; Created on 01 Feb 2012 23:18 UTC; Configuration research-dropsonde
/
/
Nominal Release Time (y,m,d,h,m,s):2011, 11, 16, 04:22:07
Time Press Temp Dewpt RH Ucmp Vcmp spd dir Wcmp Lon Lat Ele Azi Alt Qp Qt Qrh Qu Qv QdZ
sec mb C C % m/s m/s m/s deg m/s deg deg deg deg m code code code code code code
------ ------ ----- ----- ----- ------ ------ ----- ----- ----- -------- ------- ----- ----- ------- ---- ---- ---- ---- ---- ----
89.8 1011.6 27.3 23.9 81.0 9999.0 9999.0 999.0 999.0 999.0 9999.000 999.000 999.0 999.0 0.0 1.0 1.0 1.0 9.0 9.0 9.0
You could adapt code along these lines... (Python 3 warning: if you want to run this code under Python 2.7+, replace range() with xrange() for efficiency.)
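def readSacredAttribute(holyInput):
    # Split "Name (a,b,c): v1, v2, ..." into the name followed by its values.
    raw = [ x.strip() for x in holyInput.readline().rstrip('\n').split(':') ]
    newRaw = []
    for i in range(len(raw) - 1):
        for x in [ x.strip() for x in raw[ i + 1 ].split(',') ]:
            newRaw.append(x)
    raw[ 1 : ] = newRaw
    parameters = {}
    if '(' in raw[0]:
        # Pair the names listed in parentheses with the values, in order.
        # raw[1:] is already split on ',' (and ':'), so it is used directly.
        base = raw[0].index('(') + 1
        to = raw[0].index(')')
        names = [ x.strip() for x in raw[0][ base : to ].split(',') ]
        # A repeated name (e.g. 'm' for month and minute) keeps the later value.
        for x, value in zip(names, raw[ 1 : ]):
            parameters[x] = value
    return (raw, parameters)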
def splitThisStupidMess(holyInput):
    # Data rows that precede the header (5 in the sample above; adjust as needed).
    holyHeader = []
    for i in range(5):
        holyHeader.append([ float(x) for x in holyInput.readline().split()])
    # The 9 "Name: value" lines, from "Data Type" to "Post Processing Comments".
    sacredAttributes = { x[0][0] : (x[0][1], x[1]) for x in [ readSacredAttribute(holyInput) for i in range(9) ] }
    # Ignore the '/' lines
    for i in range(2):
        holyInput.readline()
    nominalTime = readSacredAttribute(holyInput)
    sacredAttributes[nominalTime[0][0]] = (nominalTime[0][1], nominalTime[1])
    divineNames = holyInput.readline().split()
    divineUnits = holyInput.readline().split()
    holyInput.readline() # Avoid decoration...
    divineValues = [ float(x) for x in holyInput.readline().split() ]
    divineFooter = { divineNames[i] : (divineUnits[i], divineValues[i]) for i in range(len(divineNames)) }
    return (holyHeader, sacredAttributes, divineFooter)
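
Since every header in the sample starts with a "Data Type:" line and ends with a dashed rule, one way to get at the repeated instances is to cut the file on those two markers. The sketch below assumes that pattern holds for the whole file, which the sample alone cannot confirm. split_soundings, HEADER_START and RULE_PREFIX are illustrative names, not part of the code above.

```python
HEADER_START = "Data Type:"  # assumption: every header block begins with this line
RULE_PREFIX = "------"       # assumption: a dashed rule is the last header line


def split_soundings(lines):
    """Split an iterable of text lines into (header_lines, data_lines) pairs.

    Data rows seen before the first header (the tail of a previous
    instance, as in the sample) come back with an empty header list.
    """
    blocks = []
    header, data = [], []
    in_header = False
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        if stripped.startswith(HEADER_START):
            # A new header begins: close the block collected so far.
            if header or data:
                blocks.append((header, data))
            header, data = [stripped], []
            in_header = True
        elif in_header:
            header.append(stripped)
            if stripped.startswith(RULE_PREFIX):
                in_header = False  # everything after the rule is data
        else:
            data.append(stripped)
    if header or data:
        blocks.append((header, data))
    return blocks
```

Each data_lines list can then be handed to numpy.genfromtxt, which accepts a list of strings as well as a file name, so no temporary files are needed. The header_lines of each block can be parsed separately if the metadata is wanted.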