Total sum of top 3 value occurrences in groups in pandas
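For each group of W, X, Y, I want the number of occurrences of each value in column Z, keeping only the top 3. The occurrences of all other values in the group should be summed and put under "Other".
I am able to get the count for each value in a new column COUNT, but I am not sure how to limit it to the top 3 and how to group all the other values under "Other". Any help would be appreciated...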
data_grouped = data.groupby(["W", "X", "Y"])
for group_name, group in data_grouped:
    res = group.groupby(["Z"]).size().reset_index(name="COUNT")
    # More processing stuff and store in db...
Input
| W | X | Y | Z |
| - | - | - | - |
| a | d | x | |
| b | d | f | h |
| b | d | f | h |
| a | d | f | |
| b | d | f | h |
| b | d | f | h |
| b | d | f | h |
| b | d | f | h |
| b | d | f | h |
| b | d | f | h |
| b | d | f | h |
| b | d | f | h |
| b | d | f | h |
| b | d | f | h |
| b | d | f | i |
| b | d | f | i |
| b | d | f | i |
| b | d | f | i |
| b | d | f | i |
| b | d | f | i |
| b | d | f | i |
| b | d | f | j |
| b | d | f | j |
| b | d | f | j |
| b | d | f | k |
| b | d | f | k |
| b | d | f | l |
| b | d | f | l |
| b | d | f | m |
| b | d | f | m |
| b | d | f | n |
| b | d | f | |
| b | d | f | |
| b | d | f | |
| a | d | f | |
| a | d | f | |
| c | e | g | |
| c | e | g | |
| c | e | g | |
Expected output
| Z | W | X | Y | COUNT |
| ----- | - | - | - | ----- |
| h | b | d | f | 12 |
| i | b | d | f | 7 |
| j | b | d | f | 3 |
| Other | b | d | f | 7 | <-- sum of k,l,m,n
and so on...
Here is one way to do this:
import pandas as pd

df = df[df["Z"] != " "]  # EDIT: drop rows with a blank Z
data_grouped = df.groupby(["W", "X", "Y"])
outputs = []  # one DataFrame per (W, X, Y) group
for group_name, group in data_grouped:
    # count occurrences of each Z value in this group
    res = group.groupby(["Z"]).size().reset_index(name="COUNT")
    # create dataframe of res and W, X, Y columns
    output = pd.concat([pd.DataFrame([list(group_name)]*len(res), columns=["W", "X", "Y"]), res], axis=1, ignore_index=True)
    output.columns = ["W", "X", "Y", "Z", "COUNT"]
    # sort and sum
    output = output.sort_values(["COUNT", "Z"], ascending=False)
    if len(output) > 3:
        others = output.iloc[3:]["COUNT"].sum()
        output = pd.concat([output.iloc[:3], pd.DataFrame([list(group_name)+["other", others]], columns=["W", "X", "Y", "Z", "COUNT"])])
    # collect per-group results (DataFrame.append was removed in pandas 2.0)
    outputs.append(output)
grand_output = pd.concat(outputs)[["Z", "W", "X", "Y", "COUNT"]]
grand_output  # Edited with blank Z rows dropped
#Out:
#       Z  W  X  Y COUNT
#0      h  b  d  f    12
#1      i  b  d  f     7
#2      j  b  d  f     3
#0  other  b  d  f     7
You can use value_counts to find the counts, then groupby.head to get the top 3. Then filter out the top 3 values and use groupby.sum to get the total for "Other". Finally, concatenate that back onto top3:
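counts = df.value_counts(['W','X','Y','Z'])
top3 = counts.groupby(level=[0,1,2]).head(3)
# everything outside the top 3, relabelled as 'Other' and summed per group
other = (counts[~counts.index.isin(top3.index)].reset_index(level='Z')
         .assign(Z='Other').set_index('Z', append=True).iloc[:, 0]
         .groupby(level=[0,1,2,3]).sum())
# Series.append was removed in pandas 2.0, so use pd.concat
out = pd.concat([top3, other]).reset_index(name='COUNT')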
Output:
   W  X  Y      Z  COUNT
0  b  d  f      h     12
1  b  d  f      i      7
2  b  d  f      j      3
3  b  d  f  Other      7
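For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same idea (count, keep the top 3 per group, lump the rest). It is not either answer's exact method: it relabels everything past the third rank using groupby.cumcount instead. The helper name top_n_with_other and its parameters are made up for illustration, and the sample data only reproduces the (b, d, f) group from the question, assuming rows with a blank Z have already been dropped.

```python
import pandas as pd

# Sample data mirroring the (b, d, f) group from the question.
# Assumption: rows with a blank Z were already removed.
z_counts = {"h": 12, "i": 7, "j": 3, "k": 2, "l": 2, "m": 2, "n": 1}
df = pd.DataFrame(
    [("b", "d", "f", z) for z, n in z_counts.items() for _ in range(n)],
    columns=["W", "X", "Y", "Z"],
)


def top_n_with_other(df, keys, value_col, n=3, other_label="Other"):
    """Hypothetical helper: count value_col per group of keys,
    keep the n most frequent values and sum the rest under other_label."""
    counts = df.groupby(keys + [value_col]).size().reset_index(name="COUNT")
    # Put the largest counts first inside every group.
    counts = counts.sort_values(
        keys + ["COUNT"], ascending=[True] * len(keys) + [False]
    )
    # Position of each row inside its group: 0, 1, 2, ...
    rank = counts.groupby(keys).cumcount()
    # Keep the label for the first n rows, replace it for the others.
    counts[value_col] = counts[value_col].where(rank < n, other_label)
    # Relabelled rows collapse into a single row per group.
    return (
        counts.groupby(keys + [value_col], sort=False)["COUNT"]
        .sum()
        .reset_index()
    )


out = top_n_with_other(df, ["W", "X", "Y"], "Z")
print(out)
```

Ties at the cutoff are broken by whatever order the sort leaves them in, so if that matters for your data you would want to add an explicit tie-breaking column to the sort.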