Pandas Pivot Table formatting column names
I used the pandas.pivot_table function on a pandas DataFrame, and my output looks similar to this:
                   Winners            Runnerup
year               2016 2015 2014     2016 2015 2014
Country Sport
india   badminton
india   wrestling
What I actually need is something like this:
Country  Sport      Winners_2016  Winners_2015  Winners_2014  Runnerup_2016  Runnerup_2015  Runnerup_2014
india    badminton  1             1             1             1              1              1
india    wrestling  1             0             1             0              1              0
I have a lot of columns and years, so I can't edit them by hand. Can anyone tell me how to do this?
Try this:
df.columns=['{}_{}'.format(x,y) for x,y in zip(df.columns.get_level_values(0),df.columns.get_level_values(1))]
get_level_values is what you need here: it returns just one level of the resulting MultiIndex.
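For anyone who wants to try this end to end, here is a minimal sketch. The input frame `raw` and its values are made up for illustration (the question does not show its input), and `aggfunc="sum"` is an assumption about how the pivot was built.

```python
import pandas as pd

# Hypothetical input data; the original question does not show its source frame.
raw = pd.DataFrame({
    "Country": ["india", "india", "india", "india"],
    "Sport": ["badminton", "badminton", "wrestling", "wrestling"],
    "year": [2016, 2015, 2016, 2015],
    "Winners": [1, 1, 1, 0],
    "Runnerup": [1, 1, 0, 1],
})

# Pivoting on "year" with two value columns gives two-level column labels.
df = pd.pivot_table(
    raw,
    index=["Country", "Sport"],
    columns="year",
    values=["Winners", "Runnerup"],
    aggfunc="sum",
)

# Level 0 holds the value names, level 1 holds the years.
level0 = df.columns.get_level_values(0)
level1 = df.columns.get_level_values(1)

# format() converts the integer years to text while building each flat name.
df.columns = ["{}_{}".format(x, y) for x, y in zip(level0, level1)]

# Move Country and Sport out of the index so they become ordinary columns.
flat = df.reset_index()
print(flat.columns.tolist())
```

Because `format` stringifies each label, this variant should work whether the years are integers or strings.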
Side note: you could try working with the data as it is. For a long time I really hated the pandas MultiIndex, but it has been growing on me.
You can also use a list comprehension:
df.columns = ['_'.join(col) for col in df.columns]
print (df)
                   Winners_2016  Winners_2015  Winners_2014  Runnerup_2016  \
Country Sport
india   badminton             1             1             1              1
        wrestling             1             1             1              1

                   Runnerup_2015  Runnerup_2014
Country Sport
india   badminton              1              1
        wrestling              1              1
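One thing to watch with the join version: `'_'.join(col)` raises a TypeError if any label in the tuple is not a string, which can happen when the year level holds integers. A small sketch of a workaround, using a hand-built frame (the values are copied from the desired output in the question, and the integer year labels are an assumption):

```python
import pandas as pd

# Hypothetical two-level columns with integer years.
columns = pd.MultiIndex.from_product(
    [["Winners", "Runnerup"], [2016, 2015, 2014]], names=[None, "year"]
)
index = pd.MultiIndex.from_tuples(
    [("india", "badminton"), ("india", "wrestling")], names=["Country", "Sport"]
)
df = pd.DataFrame(
    [[1, 1, 1, 1, 1, 1], [1, 0, 1, 0, 1, 0]], index=index, columns=columns
)

# Convert every label to str before joining so integer years do not break join().
df.columns = ["_".join(map(str, col)) for col in df.columns]

# reset_index turns the Country/Sport index levels into regular columns.
flat = df.reset_index()
print(flat)
```

If the year labels are already strings, the plain `'_'.join(col)` is enough.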
Another solution is to convert the columns with to_series and then call join:
df.columns = df.columns.to_series().str.join('_')
print (df)
                   Winners_2016  Winners_2015  Winners_2014  Runnerup_2016  \
Country Sport
india   badminton             1             1             1              1
        wrestling             1             1             1              1

                   Runnerup_2015  Runnerup_2014
Country Sport
india   badminton              1              1
        wrestling              1              1
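A short sketch of what the to_series route produces, on a hand-built frame with string year labels (the frame is hypothetical). As far as I can tell from the pandas documentation for `Series.str.join`, a tuple containing a non-string item yields NaN instead of a joined name, so this route assumes every label is a string.

```python
import pandas as pd

# Hypothetical columns; the years are strings so that str.join can concatenate them.
columns = pd.MultiIndex.from_product([["Winners", "Runnerup"], ["2016", "2015"]])
df = pd.DataFrame([[1, 1, 1, 1], [1, 0, 0, 1]], columns=columns)

# to_series() gives a Series whose values are the column tuples;
# .str.join("_") then joins the items of each tuple.
joined = df.columns.to_series().str.join("_")
print(joined.tolist())

# Assigning the Series back replaces the two-level columns with flat names.
df.columns = joined
```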
I was curious about the timings:
In [45]: %timeit ['_'.join(col) for col in df.columns]
The slowest run took 7.82 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 4.05 µs per loop
In [44]: %timeit ['{}_{}'.format(x,y) for x,y in zip(df.columns.get_level_values(0),df.columns.get_level_values(1))]
The slowest run took 4.56 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 131 µs per loop
In [46]: %timeit df.columns.to_series().str.join('_')
The slowest run took 4.31 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 452 µs per loop
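If you want to rerun the comparison outside IPython, something like the following sketch should do, using the standard timeit module on a hypothetical six-column MultiIndex. The absolute numbers depend on the machine and pandas version, so treat them as relative only.

```python
import timeit

import pandas as pd

# Hypothetical columns, shaped like the ones in the question.
columns = pd.MultiIndex.from_product(
    [["Winners", "Runnerup"], ["2016", "2015", "2014"]]
)

# The three approaches from the answers above, each wrapped in a callable.
candidates = {
    "join": lambda: ["_".join(col) for col in columns],
    "format": lambda: [
        "{}_{}".format(x, y)
        for x, y in zip(columns.get_level_values(0), columns.get_level_values(1))
    ],
    "to_series": lambda: columns.to_series().str.join("_"),
}

timings = {}
for name, func in candidates.items():
    # Best of 3 repeats of 1000 calls each, converted to microseconds per call.
    best = min(timeit.repeat(func, number=1000, repeat=3))
    timings[name] = best / 1000 * 1e6
    print(f"{name}: {timings[name]:.2f} µs per call")
```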