pandas crosstab and normalising indexes
How do you normalise a pandas crosstab with a MultiIndex?
Suppose you have a df like this:
# RANDOM DATA
import random
import numpy as np
import pandas as pd
np.random.seed(2)
Year = [2020,2019,2018,2017]*1000
col1 = ['A','B','C','D']*1000
col2 = np.random.randint(0,2,4000)
weight = np.random.randint(1,3,4000)
random.shuffle(Year)
random.shuffle(col1)
random.shuffle(col2)
random.shuffle(weight)
column_names = ['Year', 'weight', 'col1', 'col2']
df = pd.DataFrame(columns=column_names)
df['Year'] = Year
df['col1'] = col1
df['col2'] = col2
df['weight'] = weight
Now you build a crosstab from it:
pd.crosstab(index=[df['Year']],
columns=[df['col1'], df['col2']],
values=df['weight'],
aggfunc='sum')
col1         A                   B                   C                   D
col2         0         1         0         1         0         1         0         1
Year
2017  0.117962  0.128686  0.128016  0.130697  0.137399  0.122654  0.115282  0.119303
2018  0.116832  0.111551  0.120132  0.118152  0.138614  0.125413  0.131353  0.137954
2019  0.137584  0.126846  0.127517  0.108725  0.114765  0.138255  0.114765  0.131544
2020  0.116356  0.134309  0.113032  0.143617  0.121676  0.118351  0.121676  0.130984
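One thing worth noting: with integer weights, `aggfunc='sum'` alone would give integer totals, so the fractions shown above look like crosstab's own `normalize='index'` was applied, which divides each row by its total across all eight columns rather than within each `col1` group. A minimal sketch of that behaviour (synthetic data and variable names are my own):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'Year': rng.choice([2017, 2018], 100),
    'col1': rng.choice(['A', 'B'], 100),
    'col2': rng.integers(0, 2, 100),
    'weight': rng.integers(1, 3, 100),
})

# normalize='index' divides each row by its total across ALL columns,
# so each Year row sums to 1 over the whole (col1, col2) grid --
# not within each col1 group
ct = pd.crosstab(index=df['Year'],
                 columns=[df['col1'], df['col2']],
                 values=df['weight'],
                 aggfunc='sum',
                 normalize='index')
```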
How do you normalise within the MultiIndex?
My expected output is:
col1 A A B B C C D D
col2 0 1 0 1 0 1 0 1
Year
2017 0.478 0.522 0.495 0.505 0.528 0.472 0.491 0.509
2018 0.512 0.488 0.504 0.496 0.525 0.475 0.488 0.512
2019 0.520 0.480 0.540 0.460 0.454 0.546 0.466 0.534
2020 0.464 0.536 0.440 0.560 0.507 0.493 0.482 0.518
I think you can use groupby().value_counts with normalize=True:
#### Random data ####
import numpy as np
import pandas as pd
np.random.seed(2)
Year = [2020,2019,2018,2017]*1000
col1 = np.repeat(['A','B','C','D'],1000)
col2 = np.random.randint(0,2,4000)
column_names = ['Year', 'col1', 'col2']
df = pd.DataFrame(columns=column_names)
df['Year'] = Year
df['col1'] = col1
df['col2'] = col2
#### End random data ####
(df.groupby(['Year','col1'])['col2']
.value_counts(normalize=True)
.unstack(['col1','col2'])
)
Output:
col1 A B C D
col2 0 1 1 0 0 1 1 0
Year
2017 0.524 0.476 0.528 0.472 0.540 0.460 0.512 0.488
2018 0.496 0.504 0.484 0.516 0.560 0.440 0.524 0.476
2019 0.496 0.504 0.552 0.448 0.484 0.516 0.508 0.492
2020 0.500 0.500 0.540 0.460 0.436 0.564 0.484 0.516
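Note that `value_counts` orders each group's `col2` values by frequency, so the column order above is inconsistent between groups (`0 1 1 0 ...`). Appending a `sort_index(axis=1)`, as sketched here, gives a fixed `(A,0), (A,1), (B,0), ...` layout matching the expected output:

```python
import numpy as np
import pandas as pd

np.random.seed(2)
df = pd.DataFrame({
    'Year': [2020, 2019, 2018, 2017] * 1000,
    'col1': np.repeat(['A', 'B', 'C', 'D'], 1000),
    'col2': np.random.randint(0, 2, 4000),
})

out = (df.groupby(['Year', 'col1'])['col2']
         .value_counts(normalize=True)
         .unstack(['col1', 'col2'])
         .sort_index(axis=1))  # fixed (A,0), (A,1), (B,0), ... column order
```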
Update: for a weighted value_counts:
new_df = df.groupby(['Year','col1','col2'])['weight'].sum()
# divide each (Year, col1, col2) sum by its (Year, col1) group total
# (Series.sum(level=...) was removed in pandas 2.0, so use groupby instead)
new_df /= new_df.groupby(level=['Year','col1']).transform('sum')
new_df = new_df.unstack(['col1','col2'])
Output:
col1         A                   B                   C                   D
col2         0         1         0         1         0         1         0         1
Year
2017  0.441176  0.558824  0.542500  0.457500  0.474286  0.525714  0.482857  0.517143
2018  0.500000  0.500000  0.474114  0.525886  0.481268  0.518732  0.489461  0.510539
2019  0.521622  0.478378  0.474490  0.525510  0.498801  0.501199  0.531343  0.468657
2020  0.493671  0.506329  0.479339  0.520661  0.515789  0.484211  0.513021  0.486979
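Alternatively, the same weighted normalisation can be done straight from the crosstab itself: stack the columns into the index, divide each (Year, col1) group by its own total, and unstack back. A minimal sketch on synthetic data (variable names are my own):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    'Year': rng.choice([2019, 2020], 200),
    'col1': rng.choice(['A', 'B'], 200),
    'col2': rng.integers(0, 2, 200),
    'weight': rng.integers(1, 3, 200),
})

ct = pd.crosstab(index=df['Year'],
                 columns=[df['col1'], df['col2']],
                 values=df['weight'],
                 aggfunc='sum')

# stack the columns into the index, normalise within each (Year, col1)
# group, then restore the original MultiIndex column layout
s = ct.stack(['col1', 'col2'])
norm = (s / s.groupby(level=['Year', 'col1']).transform('sum')).unstack(['col1', 'col2'])
```

Every (Year, col1) pair in `norm` now sums to 1 across its two `col2` columns.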