Reading a data file with a new header in the middle of the file in Python

I have a .txt data file containing many columns under different headers. I can read the file with all columns and rows. My problem, however, is that the file contains an extra three-column header, appended after the last data row of the initial header. How do I separate the last three columns from the first ones? Furthermore, I want to drop the first of the three columns, since it is a duplicate of the first column, and append the other two column-wise to the columns at the top of the file. I have been reading the file with pandas like this:

c = pd.read_csv(r'C:\filepath.txt', sep=',', header=None, names=['<Title1>','<Title2>','<Title3>','<Title4>','<Title5>','<Title6>','<Title7>','<Title8>','<Title9>','<Title10>','<Title11>','<Title12>'], skiprows=[0,1])

The result is:

                         <Title1>  ... <Title12>
134849000   -0.420384078515376  ...    244.507248
135016000   -0.406915327374619  ...    244.507248
135183000   -0.406915327374619  ...    244.507248
135349000   -0.406915327374619  ...    244.507248
135516000   -0.406915327374619  ...    244.507248
...                        ...  ...           ... <-- (somewhere in here there is a new header with three columns)
2316226000   0.349323222511261  ...           NaN
2316393000   0.359268272664523  ...           NaN
2316560000   0.346797179431672  ...           NaN
2316726000   0.291363936474923  ...           NaN
2316893000   0.256587672540276  ...           NaN

[26188 rows x 12 columns] 

As can be seen, the "4th quadrant" of the dataset (columns 4-12 of the lower rows, 1-indexed) holds NaN values: the three extra columns were appended below the last data row of the first header, so they stay empty because the file counts 12 columns from the top. Furthermore, both headers span two lines, of which the first is not needed, so those lines have to be skipped.

Sample file:

<Header1>
<Title1><Title2><Title3><Title4><Title5><Title6><Title7><Title8><Title9><Title10><Title11><Title12><Title13>
134849000,-0.420384078515376,-0.46532291072594,53.3941583535493,3.94861381238115,0.999999938482075,-0.000223083188831434,-0.000166347560402173,3.08661080398315E-06,304.11793518,274.23748016,189.97101594,244.50724792
135016000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999910346576,-0.000180534505822662,-0.000206991530844074,2.40981161937076E-06,304.0821228,274.15297698,189.97101594,244.50724792
135183000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999992511006,-0.000151940021895918,-0.000103313480817761,1.89050478219266E-06,304.0821228,274.15297698,189.97101594,244.50724792
135349000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999945135159,-0.000162536174319313,-7.40562207892995E-05,2.04948428941809E-06,304.0821228,274.15297698,189.97101594,244.50724792
135516000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.99999997640256,-0.000243086633501367,-6.9024988784798E-05,3.36047709420528E-06,304.0821228,274.15297698,189.97101594,244.50724792
135683000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.99999997640256,-0.000243086633501367,-6.9024988784798E-05,3.36047709420528E-06,304.0821228,274.15297698,189.97101594,244.50724792
135849000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999931122814,-0.000250245794219842,-0.000134729677676283,3.5093405085021E-06,304.0821228,274.15297698,189.97101594,244.50724792
136016000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999952747184,-0.000248275760427849,-0.000209879516698194,3.49816745295883E-06,304.0821228,274.15297698,189.97101594,244.50724792
136183000,-0.420384078515376,-0.46532291072594,53.3941583535493,3.94861381238115,0.99999992607031,-0.000294028840627048,-0.000210060717325711,4.25711234103981E-06,304.11793518,274.23748016,189.97101594,244.50724792
136349000,-0.420916391233475,-0.442738942185795,53.3941583535493,3.94861381238115,0.999999919180309,-0.00029795985581717,-0.000124844955889991,4.29227691325224E-06,304.11935424,274.17742156,189.97101594,244.50724792
136516000,-0.420384078515376,-0.46532291072594,53.3941583535493,3.94861381238115,0.999999888009148,-0.000316878274912839,-3.29402653026431E-05,4.57532859246546E-06,304.11793518,274.23748016,189.97101594,244.50724792
136683000,-0.420916391233475,-0.442738942185795,53.3941583535493,3.94861381238115,0.999999944701863,-0.000302288971167524,-0.000119271820769005,4.36801259359743E-06,304.11935424,274.17742156,189.97101594,244.50724792
136849000,-0.405802775661793,-0.444669714471277,53.3941583535493,3.94861381238115,0.999999944701863,-0.000302288971167524,-0.000119271820769005,4.36801259359743E-06,304.0791626,274.18255616,189.97101594,244.50724792
137016000,-0.420916391233475,-0.442738942185795,53.3941583535493,3.94861381238115,0.99999991055272,-0.00029252456348538,-0.000168782643050744,4.22385527217017E-06,304.11935424,274.17742156,189.97101594,244.50724792
137183000,-0.412309946883439,-0.450987020223235,53.3941583535493,3.94861381238115,0.999999942521442,-0.000255490185269549,-0.00024667166566595,3.6414759449141E-06,304.09646606,274.19935608,189.97101594,244.50724792
137349000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999876479583,-0.000264577733448331,-0.000298287883815869,3.80576077658318E-06,304.0821228,274.15297698,189.97101594,244.50724792
137516000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999903983449,-0.000251750438760731,-0.000355224963982992,3.60887866227011E-06,304.0821228,274.15297698,189.97101594,244.50724792
137683000,-0.391801749871831,-0.435460567656641,53.3941583535493,3.94861381238115,0.999999885967664,-0.000231035684436353,-0.000293282668086245,3.24666448882349E-06,304.04193116,274.1580658,189.97101594,244.50724792
137849000,-0.406915327374619,-0.433547012456629,53.3941583535493,3.94861381238115,0.999999885967664,-0.000231035684436353,-0.000293282668086245,3.24666448882349E-06,304.0821228,274.15297698,189.97101594,244.50724792
<Header2>
<Title13(same as Title 1)><Title14><Title15>
134849000,0.120862187115588,0
135016000,0.171543242833847,0
135183000,0.146335932645973,0
135349000,0.09773669641824,0
135516000,0.0882672298282907,0
135683000,0.124406962864472,0
135849000,0.186013875486258,0
136016000,0.219045896500945,0
136183000,0.197246332120462,0
136349000,0.150083583561413,0
136516000,0.0838562129822536,0
136683000,0.00269632558524612,0
136849000,-0.0447052988191479,0
137016000,-0.00496292706410619,0
137183000,0.0799457149607322,0
137349000,0.137388731956788,0
137516000,0.142305654943302,0
137683000,0.115943857754048,0
137849000,0.0991913228381935,0

Here are two solutions: the first one generates a new file, the second one fixes the header during the read_csv operation. You might use the first if the file is going to be processed many times, but it requires reading all lines at least twice. If you only need to read many large files once, the second approach is preferable.

Solution 1: pre-process your file

Parse the file once to remove the redundant headers.

# create a second file with a unique header
import pandas as pd

with open('file.csv', 'r') as f_in, open('file_single_header.csv', 'w') as f_out:
    header = f_in.readline()      # first line of the file
    f_out.write(header)
    for line in f_in:             # remaining lines
        if line != header:        # drop any repeated header line
            f_out.write(line)

# then read the corrected csv file
pd.read_csv('file_single_header.csv')
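
Note that this removes a repeated header only when it is byte-identical to the first line; in the sample file above, where each header block spans two different lines, Solution 3 below is the better fit. A quick sanity check on the output (a minimal sketch, counting lines that still look like <...> header lines):

with open('file_single_header.csv') as f:
    leftovers = sum(line.startswith('<') for line in f)
print(leftovers)  # more than 1 means some header lines were not exact duplicates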

Solution 2: treat the header lines as comments and assign the header manually

# read the first line of the file to get the header and split the names
import re
import pandas as pd

with open('file.csv', 'r') as f:
    header = re.split(r'\s+', f.readline().strip())

# exclude header lines and assign names manually
pd.read_csv('file.csv', comment='<', names=header)

Note: your example does not make it entirely clear whether you really have comma-separated values or some other delimiter. If it is whitespace, you need to adjust read_csv as follows. Furthermore, if the index is in the csv file, you need to add a name for it (here None) for option 2.

# option 1
pd.read_csv('file_single_header.csv', sep=r'\s+')

# option 2
pd.read_csv('file.csv',
            comment='<',
            names=[None] + header,  # added None for the index
            sep=r'\s+',
            index_col=0
            ).dropna(axis=0, how='all')

Solution 3: fix the broken csv file

with open('file.csv', 'r') as f_in, open('file_single_header.csv', 'w') as f_out:
    i = 0
    for line in f_in:
        if line.strip().startswith('<'):   # a header line
            if i == 1:                     # the first title line: turn '<A><B>' into 'A,B'
                f_out.write(','.join(line.strip('<>\n').split('><')) + '\n')
            i += 1
        else:                              # a data line: copy as-is
            f_out.write(line)
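
Reading the repaired file back then gives a single 13-column frame in which the rows from the second block only fill the first three columns, so they can be split off and merged back column-wise. A minimal sketch, assuming the repaired file produced above and the column layout of the sample:

import pandas as pd

df = pd.read_csv('file_single_header.csv')
top = df[df.iloc[:, -1].notna()]                # first block: all 13 columns filled
bottom = df[df.iloc[:, -1].isna()].iloc[:, :3]  # second block: only the first 3 columns filled
merged = top.merge(bottom, on=df.columns[0])    # join on the shared timestamp column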

A possible solution:

Temporarily reset the index:

c.reset_index(inplace=True)

Find the rows of the new columns after the second header:

# the first all-NaN row marks the second header; + 2 skips its two lines
newcols = c.iloc[c[c.iloc[:, 1].isna()].index.min() + 2:, [1, 2]].reset_index(drop=True)

Rename the new columns:

newcols.rename(columns={'<Title1>' : '<Title14>', '<Title2>' : '<Title15>'}, inplace=True)

Add the new columns, drop the rows belonging to the second header, and restore the original index:

c = pd.concat([c, newcols], axis=1).dropna().set_index('index')
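
Put together, the whole fix reads as follows (a sketch, assuming c is the DataFrame produced by the read_csv call in the question):

import pandas as pd

c.reset_index(inplace=True)                    # timestamps become a regular 'index' column
cut = c[c.iloc[:, 1].isna()].index.min() + 2   # first data row after the two second-header lines
newcols = c.iloc[cut:, [1, 2]].reset_index(drop=True)
newcols.rename(columns={'<Title1>': '<Title14>', '<Title2>': '<Title15>'}, inplace=True)
c = pd.concat([c, newcols], axis=1).dropna().set_index('index')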

I managed to find a relatively robust and simple solution for this particular dataset.

After reading the data, skipping the first header:

raw_data = pd.read_csv('C:datafile.txt', sep=',', header=None, skiprows=[0,1])

I checked the first column for non-numerical values to find out where the next header is located:

a = pd.to_numeric(raw_data[0], errors='coerce').isnull()

Result:

0        False
1        False
2        False
3        False
4        False
         ...  
26183    False
26184    False
26185    False
26186    False
26187    False
Name: 0, Length: 26188, dtype: bool

Then I found the indices where the statement is True:

import numpy as np

a = np.where(a)[0]

Result:

[13093 13094]

From here I could simply slice out the data belonging to each of the two headers, using the indices:

d = raw_data.iloc[:raw_data.index.get_loc(a[0])]         # first block, all 13 columns
e = raw_data.iloc[raw_data.index.get_loc(a[0])+2:, :3]   # second block, only its 3 columns

For e I also sliced the columns, since the second header only has three of them.

Results:

d =

               0                   1   ...          11          12
0       134849000  -0.420384078515376  ...  189.971016  244.507248
1       135016000  -0.406915327374619  ...  189.971016  244.507248
2       135183000  -0.406915327374619  ...  189.971016  244.507248
3       135349000  -0.406915327374619  ...  189.971016  244.507248
4       135516000  -0.406915327374619  ...  189.971016  244.507248
...           ...                 ...  ...         ...         ...
13088  2316226000   -0.30945361835179  ...  188.914284  243.942856
13089  2316393000    -0.4099956694033  ...  188.914284  243.942856
13090  2316560000    -0.4099956694033  ...  188.914284  243.942856
13091  2316726000    -0.4099956694033  ...  188.914284  243.942856
13092  2316893000  -0.429752713005517  ...  188.914284  243.942856

[13093 rows x 13 columns]

e =

                0                   1  2
13095   134849000   0.120862187115588  0
13096   135016000   0.171543242833847  0
13097   135183000   0.146335932645973  0
13098   135349000    0.09773669641824  0
13099   135516000  0.0882672298282907  0
...           ...                 ... ..
26183  2316226000   0.349323222511261  0
26184  2316393000   0.359268272664523  0
26185  2316560000   0.346797179431672  0
26186  2316726000   0.291363936474923  0
26187  2316893000   0.256587672540276  0

[13093 rows x 3 columns]

Since both datasets have a common column (the first column under each header), I used a merge to append the bottom dataset to the top one:

f = pd.merge(d, e, on=0)

Result:

                0                 1_x  ...                 1_y 2_y
0       134849000  -0.420384078515376  ...   0.120862187115588   0
1       135016000  -0.406915327374619  ...   0.171543242833847   0
2       135183000  -0.406915327374619  ...   0.146335932645973   0
3       135349000  -0.406915327374619  ...    0.09773669641824   0
4       135516000  -0.406915327374619  ...  0.0882672298282907   0
...           ...                 ...  ...                 ...  ..
13088  2316226000   -0.30945361835179  ...   0.349323222511261   0
13089  2316393000    -0.4099956694033  ...   0.359268272664523   0
13090  2316560000    -0.4099956694033  ...   0.346797179431672   0
13091  2316726000    -0.4099956694033  ...   0.291363936474923   0
13092  2316893000  -0.429752713005517  ...   0.256587672540276   0

[13093 rows x 15 columns]
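
The merge relies on the timestamps matching one-to-one between the two blocks; that assumption can be made explicit with pandas' validate argument (available since pandas 0.21), which raises if it does not hold:

f = pd.merge(d, e, on=0, validate='one_to_one')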

Now I have the correct dataset, with my own headers defined, and can save it with .to_csv!
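
For completeness, a sketch of that last step; the column names below are placeholders, not the actual titles:

f.columns = ['<Title%d>' % i for i in range(1, 16)]  # hypothetical names for the 15 columns
f.to_csv('datafile_fixed.csv', index=False)          # output filename is an assumption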