Tidying data: how to create a column distributing values of specific rows into groups in the DataFrame
This is my first question, and I'd like to know a smart way to do the following:
I have a large dataset that looks like this:
identifier | name | group | period | product | gross_sales | net_sales | expense |
---|---|---|---|---|---|---|---|
1 | nameone | groupone | q1 | baloons | 20000 | 10000 | 0 |
1 | nameone | groupone | q1 | cartoons | 2000 | 10000 | 0 |
1 | nameone | groupone | q2 | cartoons | 20000 | 10000 | 0 |
2 | nametwo | groupone | q1 | baloons | 1 | 1000 | 0 |
3 | namethree | grouptwo | q4 | cartoons | 0 | 0 | 0 |
1 | nameone | groupone | q1 | expense | 0 | -1000 | 1000 |
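
For anyone who wants to reproduce this, the sample above can be built as a DataFrame roughly like this. It is only a sketch: the values are copied from the table, and the variable names `df` and `total_expense` are my own choice.

```python
import pandas as pd

# Sample rows copied from the table above; the last row is the expense entry.
df = pd.DataFrame(
    {
        "identifier": [1, 1, 1, 2, 3, 1],
        "name": ["nameone", "nameone", "nameone", "nametwo", "namethree", "nameone"],
        "group": ["groupone", "groupone", "groupone", "groupone", "grouptwo", "groupone"],
        "period": ["q1", "q1", "q2", "q1", "q4", "q1"],
        "product": ["baloons", "cartoons", "cartoons", "baloons", "cartoons", "expense"],
        "gross_sales": [20000, 2000, 20000, 1, 0, 0],
        "net_sales": [10000, 10000, 10000, 1000, 0, -1000],
        "expense": [0, 0, 0, 0, 0, 1000],
    }
)

# Total expense to be distributed; handy later as a sanity check,
# since the distributed amounts should add up to this figure.
total_expense = df.loc[df["product"] == "expense", "expense"].sum()

print(df)
print(total_expense)
```
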
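I would like to distribute each expense row across its sales entries ([volume > 0 and (product != expense )]), proportionally by product, using gross_sales. The DF would end up looking like this: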
identifier | name | group | period | product | gross_sales | net_sales | expense |
---|---|---|---|---|---|---|---|
1 | nameone | groupone | q1 | baloons | 20000 | 9500 | 500 |
1 | nameone | groupone | q1 | cartoons | 20000 | 9500 | 500 |
1 | nameone | groupone | q2 | cartoons | 20000 | 10000 | 0 |
2 | nametwo | groupone | q1 | baloons | 20000 | 1000 | 0 |
3 | namethree | grouptwo | q4 | cartoons | 0 | 0 | 0 |
Thanks! :D
@Andrej Kesely previously pointed me to this solution:

    # Since I have only one expense row per identifier per period, or none
    m = df["product"] == "expense"
    expenses = df[m].groupby(["identifier", "period"])["expense"].first().agg(dict)
    df["expense"] = (
        df[~m]
        .groupby(["identifier", "period"])["gross_sales"]
        .transform(lambda x: expenses.get(x.name, np.nan) / len(x))
    )

That worked, but it splits the expense equally among the products, and I need it distributed proportionally.
Then I tried:

    df["expense"] = (
        df[~m]
        .groupby(["identifier", "period"])["gross_sales"]
        .transform(lambda x: expenses.get(x.name, np.nan) / sum(x) if (sum(x) > 0) else 0)
    )

It runs, but the result is wrong: the sum of all expenses no longer matches the amount from before the transformation.
Thanks!!
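
A possible explanation for the mismatch, sketched on the (identifier 1, q1) rows only: as far as I can tell, the lambda in the second attempt returns a single number per group (the expense divided by the group's gross sales), and `transform` copies that number to every row. That number is a rate, not an amount, so it still has to be multiplied by each row's own `gross_sales`. The names `sales`, `rate`, `attempt` and `share` below are made up for the illustration.

```python
import pandas as pd

# Only the sales rows of (identifier=1, period=q1) from the sample,
# together with that group's single expense amount.
sales = pd.DataFrame(
    {"product": ["baloons", "cartoons"], "gross_sales": [20000, 2000]}
)
expense = 1000

group_total = sales["gross_sales"].sum()

# One scalar for the whole group: expense per unit of gross sales.
rate = expense / group_total

# Copying the rate to every row (what a scalar-returning transform does).
sales["attempt"] = rate

# Multiplying the rate by each row's own gross_sales gives the actual share.
sales["share"] = sales["gross_sales"] * rate

print(sales)
print(sales["attempt"].sum())  # a small fraction, nowhere near the expense
print(sales["share"].sum())    # adds back up to the expense, up to float rounding
```
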
If there is only one expense row or none, you can try:

    import numpy as np

    m = df["product"] == "expense"
    expenses = df[m].groupby(["name", "group"])["expense"].first().agg(dict)
    df["expense"] = (
        df[~m]
        .groupby(["name", "group"])["volume"]
        .transform(lambda x: expenses.get(x.name, np.nan) / len(x))
    )
    print(df[~m])

Prints:

       identifier       name     group period   product  volume  net_sales  expense
    0           1    nameone  groupone     q1   baloons      10      10000    500.0
    1           1    nameone  groupone     q1  cartoons       1      10000    500.0
    2           2    nametwo  groupone     q1   baloons       1       1000      NaN
    3           3  namethree  grouptwo     q4  cartoons       0          0      NaN

EDIT: To distribute the expense proportionally, you can try:

    m = df["product"] == "expense"
    expenses = df[m].groupby(["identifier", "period"])["expense"].first().agg(dict)
    df["expense"] = (
        df[~m]
        .groupby(["identifier", "period"])["gross_sales"]
        .transform(
            lambda x: [(gs / x.sum()) * expenses.get(x.name, np.nan) for gs in x]
        )
    )
    print(df[~m])

Prints:

       identifier       name     group period   product  gross_sales  net_sales     expense
    0           1    nameone  groupone     q1   baloons        20000      10000  909.090909
    1           1    nameone  groupone     q1  cartoons         2000      10000   90.909091
    2           1    nameone  groupone     q2  cartoons        20000      10000         NaN
    3           2    nametwo  groupone     q1   baloons            1       1000         NaN
    4           3  namethree  grouptwo     q4  cartoons            0          0         NaN

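
If the `NaN` values should be `0`, as in the desired table, the same proportional split can also be written without a lambda. This is only a sketch under a few assumptions of mine: the helper columns `_exp` and `_gross` are invented names, groups whose gross sales total zero get an expense of `0`, and `net_sales` is reduced by the allocated expense (inferred from the desired output, not stated in the question).

```python
import pandas as pd

df = pd.DataFrame(
    {
        "identifier": [1, 1, 1, 2, 3, 1],
        "name": ["nameone", "nameone", "nameone", "nametwo", "namethree", "nameone"],
        "group": ["groupone", "groupone", "groupone", "groupone", "grouptwo", "groupone"],
        "period": ["q1", "q1", "q2", "q1", "q4", "q1"],
        "product": ["baloons", "cartoons", "cartoons", "baloons", "cartoons", "expense"],
        "gross_sales": [20000, 2000, 20000, 1, 0, 0],
        "net_sales": [10000, 10000, 10000, 1000, 0, -1000],
        "expense": [0, 0, 0, 0, 0, 1000],
    }
)

keys = ["identifier", "period"]
is_expense = df["product"] == "expense"

# Helper columns (hypothetical names): expense counted only on expense rows,
# gross sales counted only on sales rows.
work = df.assign(
    _exp=df["expense"].where(is_expense, 0),
    _gross=df["gross_sales"].where(~is_expense, 0),
)
grouped = work.groupby(keys)
group_expense = grouped["_exp"].transform("sum")   # expense of the row's group
group_gross = grouped["_gross"].transform("sum")   # gross sales of the row's group

# Each row's share of its group's gross sales; 0 when the group sold nothing.
weight = (work["_gross"] / group_gross).where(group_gross > 0, 0.0)
allocated = weight * group_expense

# Keep only the sales rows and write the allocated amounts back.
out = df.loc[~is_expense].copy()
out["expense"] = allocated[~is_expense]
out["net_sales"] = out["net_sales"] - out["expense"]  # assumption, see above

print(out)
print(out["expense"].sum())
```

Note that an expense belonging to a group with zero gross sales would be silently dropped by this version, so it is worth comparing the final sum against the original total.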