Groupby and normalize selected columns of a Pandas DataFrame
I have a sample DataFrame that I want to normalize based on 2 conditions.
Creating the sample DF:
import numpy as np
import pandas as pd

sample_df = pd.DataFrame(np.random.randint(1,20,size=(10, 3)), columns=list('ABC'))
sample_df["date"]= ["2020-02-01","2020-02-01","2020-02-01","2020-02-01","2020-02-01",
"2020-02-02","2020-02-02","2020-02-02","2020-02-02","2020-02-02"]
sample_df["date"] = pd.to_datetime(sample_df["date"])
sample_df.set_index(sample_df["date"],inplace=True)
del sample_df["date"]
sample_df["A_cat"] = ["ind","sa","sa","sa","ind","ind","sa","sa","ind","sa"]
sample_df["B_cat"] = ["sa","ind","ind","sa","sa","sa","ind","sa","ind","sa"]
print(sample_df)
Output:
A B C A_cat B_cat
date
2020-02-01 14 11 7 ind sa
2020-02-01 19 17 3 sa ind
2020-02-01 19 6 3 sa ind
2020-02-01 3 16 5 sa sa
2020-02-01 12 6 16 ind sa
2020-02-02 1 8 12 ind sa
2020-02-02 10 13 19 sa ind
2020-02-02 17 2 7 sa sa
2020-02-02 9 13 17 ind ind
2020-02-02 17 16 3 sa sa
Conditions for normalization:
1. Group by the index, and
2. Normalize the selected columns
For example, if the selected columns are ["A","B"], the data should first be grouped by the index (e.g. 2020-02-01), and then the selected columns should be normalized within the 5 rows of that group.
Other inputs:
selected_column = ["A","B"]
I can do this with a for loop, by iterating over the groups and concatenating the normalized values (sketched below), so any suggestion for a more efficient / pandas-based approach would be great.
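A minimal sketch of that loop-based baseline (assuming the selected_column list above; ddof=0 is used so the result matches np.std):

pieces = []
for date, group in sample_df.groupby(sample_df.index.get_level_values(0)):
    cols = group[selected_column]
    # z-score within the group: (x - mean) / std, population std (ddof=0)
    pieces.append((cols - cols.mean()) / cols.std(ddof=0))
result = pd.concat(pieces)
print(result)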
Code tried with Pandas:
from sklearn.preprocessing import StandardScaler
dfg = StandardScaler()
sample_df.groupby([sample_df.index.get_level_values(0)])[selected_column].transform(dfg.fit_transform)
Error:
('Expected 2D array, got 1D array instead:\narray=[14. 19. 19. 3. 12.].\nReshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.', 'occurred at index A')
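The error occurs because transform passes each selected column to dfg.fit_transform one at a time as a 1D Series, while StandardScaler expects a 2D input. One possible workaround (a sketch only, assuming the same sample_df, dfg and selected_column as above) is to apply the scaler to each group's whole sub-frame instead:

# Sketch: scale the 2D sub-frame of each group rather than column-by-column,
# then rebuild a DataFrame with the original index and column labels.
# group_keys=False keeps the original date index without adding a group level.
scaled = (
    sample_df.groupby(sample_df.index.get_level_values(0), group_keys=False)[selected_column]
    .apply(lambda g: pd.DataFrame(dfg.fit_transform(g), index=g.index, columns=g.columns))
)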
I hope I got your question right. Do you just want to group by the index, select the values from A and B, and compute percentages?
sample_df.reset_index(inplace=True)
sample_df['date']=pd.to_datetime(sample_df['date'])
sample_df.set_index('date', inplace=True)
df2=sample_df[(sample_df['A']>10)&(sample_df['B']>5)]
df2.groupby(df2.index.month)['A_cat'].value_counts(normalize=True)
If you want all the remaining columns other than A and B, try:
df2.groupby(df2.index.month).agg({i:'value_counts' for i in df2.columns[2:]}).groupby(level=0).transform(lambda x: x.div(x.sum()))
Or, after selecting A and B into the dataframe, drop columns A and B and apply pd.Series.value_counts:
df2.drop(columns=['A','B'], inplace=True)
df2.apply(pd.Series.value_counts).transform(lambda x: x.div(x.sum()))
This works:
sample_df.groupby([sample_df.index.get_level_values(0)])[selected_column].transform(lambda x: (x-np.mean(x))/(np.std(x)))
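A note on the design choice: np.std uses ddof=0 (population standard deviation), whereas pandas' Series.std defaults to ddof=1, so the same normalization written purely with pandas methods would look roughly like this sketch:

# Equivalent per-group z-score using only pandas methods; ddof=0 matches np.std.
sample_df.groupby(sample_df.index.get_level_values(0))[selected_column].transform(
    lambda x: (x - x.mean()) / x.std(ddof=0)
)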