Make a range of IDs based on sum of values in R

Question

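I don't know R very well, so I'm not sure whether this is a simple question. I want to create ranges of IDs based on the sum of values that makes up 60% (or approximately 60%) of the total sum. Here is the data frame: DF

My idea is to first sort DF by ID, then find the range of IDs whose values sum to 60% of the total and group them. The remaining IDs would be grouped by 10%, 10%, 10%, 10% (or the split could be arbitrary, e.g. 10%, 10%, 20% or 5%, 15%, 10%, 10%). The resulting data frame would look like this: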

ID     Val
3-24   35           # (11+6+8+1+3+2) ~ 62% of the total sum of `Val` column
46-59  9            # (1+2+6) = 18% of the total sum of `Val` column
98     7            # (2+1+4) =14% of the total sum of `Val` column

I could try this:

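DF = DF[order(DF$ID), ]                    # sort by ID
perce = round(sum(DF$Val) * 60 / 100)      # 60% of the total of Val
for (i in 1:dim(DF)[1]) {
  if (sum(DF$Val[1:i]) >= perce) {         # running sum has reached 60%
    ID = DF$ID[1:i]                        # IDs seen so far
    # ...
    # put those IDs in a range that constitutes 60%
    break
  }
}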

I don't know whether this is possible.

Thanks, Domnick

Answer 1

First, we sort the data and get the sum of Val for each ID group.
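
As a small illustration of this first step, here is a sketch that assumes the example data used in this answer (columns `ID` and `Val`) and the dplyr package. The name `per_id` is only chosen for this illustration.

```r
library(dplyr)

# Same rows as the example data in this answer
df <- data.frame(
  ID  = c(98, 98, 98, 3, 3, 3, 3, 24, 24, 46, 46, 59),
  Val = c(2, 1, 4, 11, 6, 8, 1, 3, 2, 1, 2, 6)
)

# One row per ID, holding the sum of Val, ordered by ID
per_id <- df %>%
  group_by(ID) %>%
  summarise(Val = sum(Val)) %>%
  arrange(ID)

per_id$ID
# 3 24 46 59 98
per_id$Val
# 26 5 3 6 7
```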

Then we can use cumsum(Val) to get the running total. We need to lag it so that it represents "the sum of all ID groups' values before this row".
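
To see what the lag changes, here is a minimal sketch. It assumes the per-ID sums of the example data are 26, 5, 3, 6, 7 (for IDs 3, 24, 46, 59, 98) and uses `dplyr::lag`; the variable names are only for illustration.

```r
library(dplyr)

# Per-ID sums of Val for IDs 3, 24, 46, 59, 98 (assumed from the example data)
val <- c(26, 5, 3, 6, 7)

# Plain cumulative sum: each entry includes the current row
cumsum(val)
# 26 31 34 40 47

# Lagged cumulative sum: each entry is the total of all rows before it
running_total <- dplyr::lag(cumsum(val), default = 0)
running_total
# 0 26 31 34 40
```

With the lagged version, an ID is assigned according to where it starts, not where it ends. For this data, ID 24 starts at 26, which is still below 60% of 47 (28.2), so it joins the first group and that group slightly overshoots 60% instead of stopping short of it.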

Now we can use cut to assign each running total to one of the interval groups (-∞, 0.6 * total], (0.6 * total, 0.7 * total], (0.7 * total, 0.8 * total], and (0.8 * total, ∞).
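
A minimal sketch of the `cut` step on its own, assuming a total of 47 and the lagged running totals 0, 26, 31, 34, 40 from the example data. `cut` uses right-closed intervals by default, which matches the intervals listed above.

```r
total <- 47
# Break points at 60%, 70% and 80% of the total: 28.2, 32.9, 37.6
breaks_values <- cumsum(c(0.6, 0.1, 0.1)) * total

# Lagged running totals for IDs 3, 24, 46, 59, 98 (assumed from the example data)
running_total <- c(0, 26, 31, 34, 40)

# Four intervals: up to 60%, 60-70%, 70-80%, above 80%
group <- cut(running_total, c(-Inf, breaks_values, Inf))

# Interval index for each ID
as.integer(group)
# 1 1 2 3 4
```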

Then we can group_by this interval and get the sum of Val.

library('tidyverse')

df <- tribble(
  ~ID, ~Val,
   98,    2,
   98,    1,
   98,    4,
    3,    11,
    3,    6,
    3,    8,
    3,    1,
   24,    3,
   24,    2,
   46,    1,
   46,    2,
   59,    6
)

breaks_proportions <- c(0.6, 0.1, 0.1)
breaks_values <- cumsum(breaks_proportions) * sum(df$Val)

df %>%
  arrange(ID) %>%
  group_by(ID) %>%
  summarise(Val = sum(Val)) %>%
  mutate(
    running_total = lag(cumsum(Val), default = 0),
    group = cut(
      running_total,
      c(-Inf, breaks_values, Inf))) %>%
  group_by(group) %>%
  summarise(
    ID = stringr::str_c(min(ID), '-', max(ID)),
    Val = sum(Val)) %>%
  select(ID, Val)
# # A tibble: 4 x 2
#      ID   Val
#   <chr> <dbl>
# 1  3-24    31
# 2 46-46     3
# 3 59-59     6
# 4 98-98     7
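
If you want to experiment with other splits (the question mentions 10%, 10%, 20% and similar), the same steps can be wrapped in a function. The following is only a base-R sketch of the idea, not part of the pipeline above: the function name `group_ids_by_share` and its `proportions` argument are made up for this illustration, and it assumes a data frame with columns `ID` and `Val`.

```r
# Hypothetical helper: group IDs by cumulative share of Val
group_ids_by_share <- function(df, proportions = c(0.6, 0.1, 0.1)) {
  # Sum of Val per ID; aggregate() returns the rows ordered by ID
  per_id <- aggregate(Val ~ ID, data = df, FUN = sum)
  # Total of all rows before the current one (lagged cumulative sum)
  running_total <- c(0, head(cumsum(per_id$Val), -1))
  # Break points expressed in units of Val
  breaks <- c(-Inf, cumsum(proportions) * sum(per_id$Val), Inf)
  # Assign each ID to an interval; drop intervals that received no ID
  group <- droplevels(cut(running_total, breaks))
  data.frame(
    ID = as.vector(tapply(per_id$ID, group,
                          function(x) paste(min(x), max(x), sep = "-"))),
    Val = as.vector(tapply(per_id$Val, group, sum)),
    stringsAsFactors = FALSE
  )
}

df <- data.frame(
  ID  = c(98, 98, 98, 3, 3, 3, 3, 24, 24, 46, 46, 59),
  Val = c(2, 1, 4, 11, 6, 8, 1, 3, 2, 1, 2, 6)
)

res <- group_ids_by_share(df, proportions = c(0.6, 0.2))
res$ID
# "3-24"  "46-59" "98-98"
res$Val
# 31 9 7
```

With proportions of 60% and 20%, this gives the ID ranges 3-24, 46-59 and 98 for the example data. Calling it with the default `c(0.6, 0.1, 0.1)` should give the same four groups as the tidyverse pipeline.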
