Proportionally fill new variable based on values in previous column?

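I want to create a new variable using information from elsewhere in the data frame. That sounds simple, but I want to assign the levels of the new variable proportionally.

I have a data frame: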

    dd <- read.table(text = "
    group     piece      answer
    group1     A          noise
    group1     A          silence
    group1     A          silence
    group1     B          silence
    group1     B          loud_noise
    group1     B          noise
    group1     B          loud_noise
    group1     B          noise
    group2     C          silence
    group2     C          silence", header = TRUE)

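I want to create a new variable, 'majority_agreement', with two levels: good and bad. 'good' means that a majority (>55%) of the answers for a piece agree. 'bad' means that no answer reached that majority for the piece.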

    group     piece      answer       majority_agreement
    group1     A          noise       good 
    group1     A          silence     good
    group1     A          silence     good
    group1     B          silence     bad
    group1     B          loud_noise  bad
    group1     B          noise       bad
    group1     B          loud_noise  bad
    group1     B          noise       bad
    group2     C          silence     good
    group2     C          silence     good

I can do this in a binary way (full agreement or not):

    library(dplyr)

    newdf <- dd %>%
      group_by(group) %>%
      # 'good' only when every answer in the group is identical, otherwise 'bad'
      mutate(majority_agreement = ifelse(n_distinct(answer) <= 1, 'good', 'bad')) %>%
      as.data.frame()

How can I do this proportionally?

    library(dplyr)
    newdf <- dd %>%
      count(group, piece, answer) %>%   # How many of each answer for each group & piece
      group_by(group, piece) %>%
      mutate(share = n / sum(n)) %>%  # What share have this answer?
      summarize(max_share = max(share)) %>%  # What's the largest share among them?
      mutate(majority_agreement = if_else(max_share > 0.55, "good", "bad")) %>%
      ungroup() %>%
      right_join(dd, by = c("group", "piece"))  # Add the conclusion back to the original data

This seems to do what you want using dplyr:

    library(dplyr)
    dd %>%
      group_by(piece) %>%
      mutate(majority_agreement = if_else(max(table(answer) / n()) > .55, "good", "bad"))

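Within each "piece", we use table() to count how often each distinct response occurs and divide by n() to get the proportion of each response. We then check whether the largest proportion is greater than 0.55. If it is, we assign the label "good"; otherwise we assign the label "bad".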
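
To make the per-piece calculation concrete, here is a minimal base R sketch that works through piece B from the example data by hand. The names `answer_b`, `shares`, `max_share` and `label_b` are made up for illustration, and `length()` stands in for `n()` because the code runs outside a dplyr pipeline.

```r
# Answers recorded for piece B in the example data
answer_b <- c("silence", "loud_noise", "noise", "loud_noise", "noise")

# Share of each distinct answer within the piece
# (length() plays the role of n() outside of dplyr)
shares <- table(answer_b) / length(answer_b)

# Largest share, compared against the 55% threshold
max_share <- max(shares)
label_b <- if (max_share > 0.55) "good" else "bad"

print(shares)
print(label_b)
```

The largest share for piece B is 2 out of 5 answers (40%), which is below the threshold, so the piece is labelled "bad", matching the desired output in the question.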