Tidying data: how to create a column distributing values of specific rows into groups in the DataFrame

This is my first question here. I would like to know a smart way to do the following.

I have a large dataset that looks like this:

identifier  name       group     period  product   gross_sales  net_sales  expense
1           nameone    groupone  q1      baloons   20000        10000      0
1           nameone    groupone  q1      cartoons  2000         10000      0
1           nameone    groupone  q2      cartoons  20000        10000      0
2           nametwo    groupone  q1      baloons   1            1000       0
3           namethree  grouptwo  q4      cartoons  0            0          0
1           nameone    groupone  q1      expense   0            -1000      1000

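I would like to distribute each expense row across its sales entries ([volume > 0 and (product != expense)]), proportionally per product using gross_sales. The DataFrame would look like this at the end: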

identifier  name       group     period  product   gross_sales  net_sales  expense
1           nameone    groupone  q1      baloons   20000        9500       500
1           nameone    groupone  q1      cartoons  20000        9500       500
1           nameone    groupone  q2      cartoons  20000        10000      0
2           nametwo    groupone  q1      baloons   20000        1000       0
3           namethree  grouptwo  q4      cartoons  0            0          0

Thanks! :D

A solution previously proposed by @Andrej Kesely pointed me to:

## Since I have only one expense row per identifier per period or none
m = df["product"] == "expense"
expenses = df[m].groupby(["identifier", "period"])["expense"].first().agg(dict)

df["expense"] = (
    df[~m]
    .groupby(["identifier", "period"])["gross_sales"]
    .transform(lambda x: expenses.get(x.name, np.nan) / len(x))
)

I got it working, but it splits the expense equally among the products, and I need it distributed proportionally.

Then I tried:

df["expense"] = (
    df[~m]
    .groupby(["identifier", "period"])["gross_sales"]
    .transform(lambda x: expenses.get(x.name, np.nan) / sum(x) if sum(x) > 0 else 0)
)

It runs, but the result is wrong: the expenses no longer add up to the amount they had before the transformation.

Thanks!!
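The mismatch in the second attempt above seems to come from the lambda returning a single number per group (the expense divided by the group's total gross_sales), which transform then repeats on every row of that group. Nothing multiplies it by each row's own gross_sales. A minimal sketch of the arithmetic for the (identifier 1, q1) group, using plain Python lists (the variable names are only for illustration):

```python
# Values for identifier 1, period q1, taken from the sample data.
total_expense = 1000
gross_sales = [20000, 2000]
group_total = sum(gross_sales)

# What the second attempt computes: the same ratio repeated on every row.
attempt = [total_expense / group_total for _ in gross_sales]

# Proportional allocation: each row's share of the group total times the expense.
proportional = [gs / group_total * total_expense for gs in gross_sales]

print(sum(attempt))       # far below the original 1000
print(sum(proportional))  # approximately the original 1000
```

The second list is what "proportional" usually means here: the shares gs / group_total add up to 1, so the allocated amounts add back up to the original expense.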

If there is only one expense row or none, you can try:

m = df["product"] == "expense"
expenses = df[m].groupby(["name", "group"])["expense"].first().agg(dict)

df["expense"] = (
    df[~m]
    .groupby(["name", "group"])["volume"]
    .transform(lambda x: expenses.get(x.name, np.nan) / len(x))
)
print(df[~m])

Prints:

   identifier       name     group period   product  volume  net_sales  expense
0           1    nameone  groupone     q1   baloons      10      10000    500.0
1           1    nameone  groupone     q1  cartoons       1      10000    500.0
2           2    nametwo  groupone     q1   baloons       1       1000      NaN
3           3  namethree  grouptwo     q4  cartoons       0          0      NaN

EDIT: To distribute the expense proportionally, you can try:

m = df["product"] == "expense"
expenses = df[m].groupby(["identifier", "period"])["expense"].first().agg(dict)

df["expense"] = (
    df[~m]
    .groupby(["identifier", "period"])["gross_sales"]
    .transform(
        lambda x: [(gs / x.sum()) * expenses.get(x.name, np.nan) for gs in x]
    )
)
print(df[~m])

Prints:

   identifier       name     group period   product  gross_sales  net_sales     expense
0           1    nameone  groupone     q1   baloons        20000      10000  909.090909
1           1    nameone  groupone     q1  cartoons         2000      10000   90.909091
2           1    nameone  groupone     q2  cartoons        20000      10000         NaN
3           2    nametwo  groupone     q1   baloons            1       1000         NaN
4           3  namethree  grouptwo     q4  cartoons            0          0         NaN
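For reference, here is a self-contained sketch of the same proportional allocation written with a merge and a grouped sum instead of a dict lookup. It rebuilds the sample data from the question so the totals can be checked. Two things are assumptions on my part and not from the posts above: expense rows are summed per (identifier, period) rather than taking the first one, and rows with no expense to allocate get 0 instead of NaN.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    {
        "identifier": [1, 1, 1, 2, 3, 1],
        "name": ["nameone", "nameone", "nameone", "nametwo", "namethree", "nameone"],
        "group": ["groupone", "groupone", "groupone", "groupone", "grouptwo", "groupone"],
        "period": ["q1", "q1", "q2", "q1", "q4", "q1"],
        "product": ["baloons", "cartoons", "cartoons", "baloons", "cartoons", "expense"],
        "gross_sales": [20000, 2000, 20000, 1, 0, 0],
        "net_sales": [10000, 10000, 10000, 1000, 0, -1000],
        "expense": [0, 0, 0, 0, 0, 1000],
    }
)

keys = ["identifier", "period"]
is_expense = df["product"] == "expense"

# Total expense per (identifier, period), from the expense rows only.
# Assumption: summing also covers the case of several expense rows per group.
total_expense = (
    df[is_expense].groupby(keys)["expense"].sum().rename("total_expense").reset_index()
)

# Attach the group's total expense to every sales row (NaN where there is none).
sales = df[~is_expense].merge(total_expense, on=keys, how="left")

# Each row's share of its group's gross_sales; groups with zero sales get share 0.
group_gross = sales.groupby(keys)["gross_sales"].transform("sum")
share = (sales["gross_sales"] / group_gross).where(group_gross > 0, 0.0)

# Assumption: groups without an expense row get 0 instead of NaN.
sales["expense"] = share * sales["total_expense"].fillna(0)
sales = sales.drop(columns="total_expense")

print(sales)
```

One caveat with this sketch: if a group has an expense row but all of its sales rows have gross_sales of 0, the expense is not allocated anywhere, so it is worth comparing sales["expense"].sum() against the original expense total on the real data.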