How to find the sum of the 2nd quartile based on a condition in R

Question

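In this example, my data represent sales and their distance (Dist) from a given store, One or Two. What I want to do is define each store's catchment area based on sales density. A catchment area is defined as the radius that contains 50% of the store's sales. Starting with the orders that have the smallest distance (Dist) from the store, I want to calculate the radius that contains 50% of that store's sales.

I have the following df, which I calculated in a previous model.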

df <- data.frame(ID = c(1,2,3,4,5,6,7,8),
                 Store = c('One','One','One','One','Two','Two','Two','Two'),
                 Dist = c(1,5,7,23,1,9,9,23),
                 Sales = c(10,8,4,1,11,9,4,2))

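Now I want to find the smallest distance Dist at which the cumulative Sales reach at least 50% of the store's total. So my output would look like this: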

Output <- data.frame(Store = c('One','Two'),
                 Dist = c(5,9),
                 Sales = c(18,20))

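My actual df has many observations and I cannot hit 50% exactly, so I need to round to the nearest observation.

Any suggestions?

Note: I apologize in advance for the poor title. I tried to think of a better way to phrase the problem, and suggestions are welcome...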

Answer 1

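I think this is easier if you rearrange the data into a suitable format. My logic is to first take the cumsum by group, then merge each group's total into the data, and finally compute the percentage. Once you have that, you can subset in whatever way you like to get the first observation from each group.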

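# Sort so the cumulative sum runs from the nearest order outwards within each store
df <- df[order(df$Store, df$Dist), ]
# Running total of sales within each store
df$cums <- unlist(lapply(split(df$Sales, df$Store), cumsum), use.names = FALSE)
# Total sales per store, merged back onto every row
zz <- aggregate(df$Sales, by = list(df$Store), sum)
names(zz) <- c('Store', 'TotSale')
df <- merge(df, zz)
# Cumulative share of the store's sales
df$perc <- df$cums / df$TotSale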

Subsetting the data:

merge(aggregate(perc ~ Store,data=subset(df,perc>=0.5), min),df)
 Store      perc ID Dist Sales cums TotSale
1   One 0.7826087  2    5     8   18      23
2   Two 0.7692308  6    9     9   20      26
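
As a hedged alternative sketch of the same logic, `ave()` can compute the within-store running sum and the store total without the intermediate merge, which also avoids joining on the floating-point `perc` values. The helper name `catchment` and its threshold argument `p` are made up for illustration; the sketch assumes the original `df` columns from the question.

```r
# Example data copied from the question
df <- data.frame(ID = c(1,2,3,4,5,6,7,8),
                 Store = c('One','One','One','One','Two','Two','Two','Two'),
                 Dist = c(1,5,7,23,1,9,9,23),
                 Sales = c(10,8,4,1,11,9,4,2))

# Hypothetical helper: first row per store whose cumulative share reaches p
catchment <- function(d, p = 0.5) {
  d <- d[order(d$Store, d$Dist), ]                      # nearest orders first
  d$cums <- ave(d$Sales, d$Store, FUN = cumsum)         # running sales per store
  d$perc <- d$cums / ave(d$Sales, d$Store, FUN = sum)   # running share of store total
  hit <- d[d$perc >= p, ]                               # rows at or past the threshold
  hit <- hit[!duplicated(hit$Store), ]                  # keep the closest such row
  data.frame(Store = hit$Store, Dist = hit$Dist, Sales = hit$cums)
}

res <- catchment(df)
print(res)
```

Note that store Two has two orders at Dist 9. This sketch, like the code above, stops at the first of them, so the reported Sales is the running total up to that row rather than the total of all orders within that radius.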

Answer 2

Here is one approach with data.table:

library(data.table)
setDT(df)

df[order(Store, Dist), 
   .(Dist, Sales = cumsum(Sales), Pct = cumsum(Sales) / sum(Sales)),
   by = "Store"][Pct >= 0.5, .SD[1,], by = "Store"]
#    Store Dist Sales       Pct
# 1:   One    5    18 0.7826087
# 2:   Two    9    20 0.7692308

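setDT(df) converts df to a data.table by reference.
The .(...) expression selects Dist and computes the cumulative sales and the cumulative share of sales, by Store.
Pct >= 0.5 subsets to the rows where cumulative sales have reached the threshold, and .SD[1,] takes only the top row (i.e. the smallest Dist), by Store.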
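
If the 50% cutoff might change, the same chain can be wrapped in a small function. This is only a sketch: the name `catchment_dt` and the parameter `p` are hypothetical and not part of data.table, and the example rebuilds the question's data so that it runs on its own.

```r
library(data.table)

# Example data copied from the question
dt <- data.table(ID = 1:8,
                 Store = rep(c("One", "Two"), each = 4),
                 Dist = c(1, 5, 7, 23, 1, 9, 9, 23),
                 Sales = c(10, 8, 4, 1, 11, 9, 4, 2))

# Hypothetical helper: p is the share of a store's sales the radius must cover
catchment_dt <- function(d, p = 0.5) {
  d[order(Store, Dist),                                   # nearest orders first
    .(Dist, Sales = cumsum(Sales), Pct = cumsum(Sales) / sum(Sales)),
    by = Store][Pct >= p, .SD[1], by = Store]             # first row reaching p
}

catchment_dt(dt)           # 50% catchment, same result as above
catchment_dt(dt, p = 0.8)  # wider catchment covering 80% of sales
```

With p = 0.8 the radius for store One grows to Dist 7, while store Two stays at Dist 9 because the second order at that distance pushes its cumulative share past 80%.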

optimization

r

mathematical-optimization