Total sum of top 3 value occurrences in groups in pandas

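I want to count how often each value occurs in column Z within every W, X, Y group, and keep only the top 3. The occurrences of all other values in the group should be summed under "Other".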

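I am able to get the count of each value in a new column COUNT, but I am not sure how to limit it to the top 3 and how to group all the remaining values under "Other". Any help would be appreciated...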

data_grouped = data.groupby(["W", "X", "Y"])

for group_name, group in data_grouped:
    # count how often each Z value occurs within this (W, X, Y) group
    res = group.groupby(["Z"]).size().reset_index(name="COUNT")

    # More processing stuff and store in db...

Input

| W | X | Y | Z |
| - | - | - | - |
| a | d | x |   |
| b | d | f | h |
| b | d | f | h |
| a | d | f |   |
| b | d | f | h |
| b | d | f | h |
| b | d | f | h |
| b | d | f | h |
| b | d | f | h |
| b | d | f | h |
| b | d | f | h |
| b | d | f | h |
| b | d | f | h |
| b | d | f | h |
| b | d | f | i |
| b | d | f | i |
| b | d | f | i |
| b | d | f | i |
| b | d | f | i |
| b | d | f | i |
| b | d | f | i |
| b | d | f | j |
| b | d | f | j |
| b | d | f | j |
| b | d | f | k |
| b | d | f | k |
| b | d | f | l |
| b | d | f | l |
| b | d | f | m |
| b | d | f | m |
| b | d | f | n |
| b | d | f |   |
| b | d | f |   |
| b | d | f |   |
| a | d | f |   |
| a | d | f |   |
| c | e | g |   |
| c | e | g |   |
| c | e | g |   |
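
To experiment with the table above, the frame can be rebuilt from its row counts. This is only a sketch: the name `spec` is introduced here for illustration, blank Z cells are assumed to be missing values (`None`) rather than a space or an empty string, and the rows come out grouped rather than in the exact order of the table, which does not affect the counts.

```python
import pandas as pd

# (W, X, Y, Z, number of rows) read off the input table; None marks a blank Z
spec = [
    ("a", "d", "x", None, 1),
    ("a", "d", "f", None, 3),
    ("b", "d", "f", "h", 12),
    ("b", "d", "f", "i", 7),
    ("b", "d", "f", "j", 3),
    ("b", "d", "f", "k", 2),
    ("b", "d", "f", "l", 2),
    ("b", "d", "f", "m", 2),
    ("b", "d", "f", "n", 1),
    ("b", "d", "f", None, 3),
    ("c", "e", "g", None, 3),
]

# repeat each (W, X, Y, Z) tuple as many times as it appears in the table
rows = [(w, x, y, z) for w, x, y, z, n in spec for _ in range(n)]
df = pd.DataFrame(rows, columns=["W", "X", "Y", "Z"])
```

If the real data stores blanks as `" "` instead, they would need to be filtered out or converted to missing values before counting.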

Expected output

| Z     | W | X | Y | COUNT |
| ----- | - | - | - | ----- |
| h     | b | d | f |  12   |
| i     | b | d | f |  7    | 
| j     | b | d | f |  3    | 
| Other | b | d | f |  7    |  <-- sum of k,l,m,n
and so on...

Here is one way to do it:

import pandas as pd

df = df[df["Z"] != " "]   # EDIT: drop rows with a blank Z (assumes blanks are stored as a single space)

data_grouped = df.groupby(["W", "X", "Y"])
outputs = []  # one result frame per (W, X, Y) group

for group_name, group in data_grouped:
    # count occurrences of each Z value within the group
    res = group.groupby(["Z"]).size().reset_index(name="COUNT")
    # create dataframe of W, X, Y columns and res
    output = pd.concat([pd.DataFrame([list(group_name)]*len(res), columns=["W", "X", "Y"]), res], axis=1, ignore_index=True)
    output.columns = ["W", "X", "Y", "Z", "COUNT"]
    # sort by count (ties broken by Z), largest first
    output.sort_values(["COUNT", "Z"], ascending=False, inplace=True)
    if len(output) > 3:
        # collapse everything past the top 3 into a single "other" row
        others = output.iloc[3:]["COUNT"].sum()
        output = pd.concat([output.iloc[:3], pd.DataFrame([list(group_name)+["other", others]], columns=["W", "X", "Y", "Z", "COUNT"])])
    outputs.append(output)

# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
grand_output = pd.concat(outputs)[["Z", "W", "X", "Y", "COUNT"]]

grand_output  # Edited with blank Z rows dropped
#Out: 
#       Z  W  X  Y COUNT
#0      h  b  d  f    12
#1      i  b  d  f     7
#2      j  b  d  f     3
#0  other  b  d  f     7

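You can use value_counts to get the counts, then groupby.head to take the top 3 per group. Next, filter out the top 3 values and use groupby.sum to get the total for "Other". Finally, concatenate that back onto top3 (pd.concat is used here because Series.append was removed in pandas 2.0):

import pandas as pd

counts = df.value_counts(['W','X','Y','Z'])
top3 = counts.groupby(level=[0,1,2]).head(3)
# relabel everything outside the top 3 as 'Other' and sum it per group
other = (counts[~counts.index.isin(top3.index)].reset_index(level='Z')
         .assign(Z='Other').set_index('Z', append=True).squeeze(axis=1)
         .groupby(level=[0,1,2,3]).sum())
out = pd.concat([top3, other]).reset_index(name='COUNT')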

Output:

   W  X  Y      Z  COUNT
0  b  d  f      h     12
1  b  d  f      i      7
2  b  d  f      j      3
3  b  d  f  Other      7
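
The same idea can also be sketched as a reusable function that relabels everything past the top n before summing, so no separate "Other" series has to be built and joined. This is a variation, not the code from either answer: `top_n_with_other`, its parameters, and the compact `sample` frame (only the non-blank b/d/f rows of the input) are hypothetical names introduced for illustration.

```python
import pandas as pd

def top_n_with_other(df, keys, col, n=3, other_label="Other"):
    """Count values of `col` per `keys` group; fold all but the top n into one row."""
    # value_counts sorts by count (descending) and drops rows with missing values
    counts = df.value_counts(keys + [col]).reset_index(name="COUNT")
    # position of each row within its group: 0 is the most frequent value
    rank = counts.groupby(keys).cumcount()
    # keep the label for the top n, replace the rest so they collapse together
    counts[col] = counts[col].where(rank < n, other_label)
    # sort=False keeps the descending order produced by value_counts
    return counts.groupby(keys + [col], sort=False, as_index=False)["COUNT"].sum()

# the non-blank rows of the (b, d, f) group from the question
sample = pd.DataFrame({
    "W": "b", "X": "d", "Y": "f",
    "Z": list("h" * 12 + "i" * 7 + "j" * 3 + "kkllmmn"),
})
result = top_n_with_other(sample, ["W", "X", "Y"], "Z")
print(result)
```

For the sample rows this should give h, i, j with counts 12, 7, 3, followed by a single Other row with 7. Ties at the cut-off are resolved by whatever order value_counts returns, so a group with several values tied for third place may need an explicit tie-break.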