How to save and retrieve Pandas dataframes with hierarchical indexing?
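I need to create and save Pandas dataframes with hierarchical indexing. Below, I create two dataframes and then concatenate them to produce a new dataframe with a hierarchical index.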
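import numpy as np
import pandas as pd

data1 = np.random.rand(5, 5)
data2 = np.random.rand(5, 5)
df1 = pd.DataFrame(data1, columns=['a', 'b', 'c', 'd', 'e'], index=['i1', 'i2', 'i3', 'i4', 'i5'])
df2 = pd.DataFrame(data2, columns=['a', 'b', 'c', 'd', 'e'], index=['i1', 'i2', 'i3', 'i4', 'i5'])
# Concatenating with keys adds an outer index level ('first' / 'second').
df = pd.concat([df1, df2], keys=['first', 'second'])
print("Original Data frame")
print(df)
# Save to file.
df.to_csv('test')
# Read from file. DataFrame.from_csv was removed in pandas 1.0;
# read_csv with index_col=0 uses only the first column as the index, as from_csv did by default.
df_new = pd.read_csv('test', index_col=0)
print("Saved Data frame")
print(df_new)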
This is the output I get:
Original Data frame
a b c d e
first i1 0.926553 0.180306 0.182887 0.783061 0.832914
i2 0.899054 0.130367 0.615534 0.965580 0.669495
i3 0.931004 0.425528 0.068938 0.166522 0.714399
i4 0.082365 0.587194 0.993864 0.187864 0.066035
i5 0.668671 0.294744 0.136317 0.358732 0.529674
second i1 0.916310 0.361423 0.700380 0.386119 0.273667
i2 0.102542 0.454106 0.565760 0.259323 0.104743
i3 0.410280 0.379986 0.288921 0.177819 0.919343
i4 0.447279 0.113711 0.032273 0.335358 0.717824
i5 0.995781 0.356817 0.146785 0.972401 0.169360
Saved Data frame
Unnamed: 1 a b c d e
first i1 0.926553 0.180306 0.182887 0.783061 0.832914
first i2 0.899054 0.130367 0.615534 0.965580 0.669495
first i3 0.931004 0.425528 0.068938 0.166522 0.714399
first i4 0.082365 0.587194 0.993864 0.187864 0.066035
first i5 0.668671 0.294744 0.136317 0.358732 0.529674
second i1 0.916310 0.361423 0.700380 0.386119 0.273667
second i2 0.102542 0.454106 0.565760 0.259323 0.104743
second i3 0.410280 0.379986 0.288921 0.177819 0.919343
second i4 0.447279 0.113711 0.032273 0.335358 0.717824
second i5 0.995781 0.356817 0.146785 0.972401 0.169360
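When I save this new dataframe to a csv file ('test') and read it back, I lose the hierarchical index. Is there a way to save the data to a file so that the hierarchical index is preserved when I read it back?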
Save it in a format other than csv. For example, pickle:
df.to_pickle('dataframe.pickle')
This preserves the hierarchical index. To read it back:
pd.read_pickle('dataframe.pickle')
Pandas has several IO methods; you can read about them in the documentation.
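As a rough check of the pickle approach above, the following sketch rebuilds a frame like the one in the question, writes it to a pickle file, and reads it back. The temporary directory and the variable name `restored` are illustrative choices, not part of the original answer.

```python
import os
import tempfile

import numpy as np
import pandas as pd

cols = ['a', 'b', 'c', 'd', 'e']
rows = ['i1', 'i2', 'i3', 'i4', 'i5']
df1 = pd.DataFrame(np.random.rand(5, 5), columns=cols, index=rows)
df2 = pd.DataFrame(np.random.rand(5, 5), columns=cols, index=rows)
df = pd.concat([df1, df2], keys=['first', 'second'])

# Write to a throwaway location (assumption: any writable path works the same way).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'dataframe.pickle')
    df.to_pickle(path)
    restored = pd.read_pickle(path)

# The index should still be a two-level MultiIndex and the values unchanged.
print(isinstance(restored.index, pd.MultiIndex))
print(restored.equals(df))
```

Pickle files are Python-specific and should only be loaded from sources you trust, so this suits local round trips better than data exchange.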
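You can:
reset the index and save the DataFrame as csv, read it back from csv, and then
set the index back to the original (in place). Because the index levels are unnamed, reset_index() labels them level_0 and level_1.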
df
Out[11]:
a b c d e
first i1 0.935478 0.455757 0.607418 0.850291 0.704326
i2 0.675752 0.339017 0.999949 0.508480 0.888817
i3 0.463371 0.803389 0.048469 0.599697 0.423603
i4 0.935294 0.933699 0.843289 0.182535 0.255847
i5 0.321236 0.120010 0.647876 0.000517 0.032592
second i1 0.172044 0.691660 0.799164 0.194785 0.302880
i2 0.432988 0.511229 0.451268 0.203145 0.560563
i3 0.442584 0.771483 0.839945 0.716374 0.533183
i4 0.167898 0.962646 0.152245 0.400280 0.210355
i5 0.736365 0.511057 0.256672 0.619250 0.790739
df.reset_index()
Out[12]:
level_0 level_1 a b c d e
0 first i1 0.935478 0.455757 0.607418 0.850291 0.704326
1 first i2 0.675752 0.339017 0.999949 0.508480 0.888817
2 first i3 0.463371 0.803389 0.048469 0.599697 0.423603
3 first i4 0.935294 0.933699 0.843289 0.182535 0.255847
4 first i5 0.321236 0.120010 0.647876 0.000517 0.032592
5 second i1 0.172044 0.691660 0.799164 0.194785 0.302880
6 second i2 0.432988 0.511229 0.451268 0.203145 0.560563
7 second i3 0.442584 0.771483 0.839945 0.716374 0.533183
8 second i4 0.167898 0.962646 0.152245 0.400280 0.210355
9 second i5 0.736365 0.511057 0.256672 0.619250 0.790739
df.reset_index().to_csv('test.csv', index=False)
df3 = pd.read_csv('test.csv')
df3.set_index(['level_0', 'level_1'], inplace=True)
df3
Out[15]:
a b c d e
level_0 level_1
first i1 0.935478 0.455757 0.607418 0.850291 0.704326
i2 0.675752 0.339017 0.999949 0.508480 0.888817
i3 0.463371 0.803389 0.048469 0.599697 0.423603
i4 0.935294 0.933699 0.843289 0.182535 0.255847
i5 0.321236 0.120010 0.647876 0.000517 0.032592
second i1 0.172044 0.691660 0.799164 0.194785 0.302880
i2 0.432988 0.511229 0.451268 0.203145 0.560563
i3 0.442584 0.771483 0.839945 0.716374 0.533183
i4 0.167898 0.962646 0.152245 0.400280 0.210355
i5 0.736365 0.511057 0.256672 0.619250 0.790739
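If staying with csv is preferred, a possible shortcut is to tell `read_csv` which columns make up the index via `index_col=[0, 1]`, which should rebuild the two-level index without the reset/set round trip. This is a sketch under the assumption that the first two columns of the file are the index levels; it uses an in-memory `io.StringIO` buffer in place of a file, and `restored` is an illustrative name.

```python
import io

import numpy as np
import pandas as pd

cols = ['a', 'b', 'c', 'd', 'e']
rows = ['i1', 'i2', 'i3', 'i4', 'i5']
df1 = pd.DataFrame(np.random.rand(5, 5), columns=cols, index=rows)
df2 = pd.DataFrame(np.random.rand(5, 5), columns=cols, index=rows)
df = pd.concat([df1, df2], keys=['first', 'second'])

# Write the frame, index included, to an in-memory buffer standing in for a file.
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)

# Use the first two columns as index levels when reading back.
restored = pd.read_csv(buffer, index_col=[0, 1])

print(isinstance(restored.index, pd.MultiIndex))
print(list(restored.index) == list(df.index))
# Compare values with a tolerance, since text parsing of floats may not be bit-exact.
print(np.allclose(restored.to_numpy(), df.to_numpy()))
```

Unlike pickle, csv stores no type information, so index levels that are not plain strings (dates, for instance) may need extra parsing options when read back.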