group_by not functioning correctly when using filter and str_detect

group_by does not seem to work properly when I use filter. Here is a simplified reproduction of my data:

data = my_data_raw_quest 

user_id     question          dv
1            Allergies?        na     
1            food choice       left
2            Allergies?        yes, I hate gluten  
2            food choice       left        
3            Allergies?        allergic to soy 
3            food choice       left                   
4            Allergies?        na
4            food choice       left             
5            Allergies?        na
5            food choice       left            
6            Allergies?        Soy 
6            food choice       right          
7            Allergies?        na
7            food choice       right

When I run this code, group_by does not seem to have any effect.

my_data_raw_quest_2 <- 
  my_data_raw_quest %>%
  dplyr::group_by(user_id) %>%
  filter(add = TRUE, is.na(dv)|
    !str_detect(dv, "(G|g)luten"))

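This produces the dataset below. What I want instead is for every response from any user who mentioned "gluten" to be removed. Note that row 3 was removed, but row 4 was not.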

user_id     question           dv
1            Allergies?        na     
1            food choice       left
2            food choice       left        
3            Allergies?        allergic to soy 
3            food choice       left                   
4            Allergies?        na
4            food choice       left             
5            Allergies?        na
5            food choice       left            
6            Allergies?        Soy 
6            food choice       right          
7            Allergies?        na
7            food choice       right

This is the table I want to end up with:

user_id     question           dv
1            Allergies?        na     
1            food choice       left
3            Allergies?        allergic to soy 
3            food choice       left                   
4            Allergies?        na
4            food choice       left             
5            Allergies?        na
5            food choice       left            
6            Allergies?        Soy 
6            food choice       right          
7            Allergies?        na
7            food choice       right

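The code below is a better representation of the data I actually have; my goal is to keep all columns. Note, however, that it does not contain an example of an answer mentioning gluten.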

structure(list(session_id = c(53039, 53039, 53039, 53039, 53039, 
53039, 53047, 53047, 53047, 53047, 53047, 53047, 53050, 53050, 
53050, 53050, 53050, 53050, 53052, 53052, 53052, 53052, 53052, 
53052, 53054, 53054, 53054, 53054, 53054, 53054, 53055, 53055, 
53055, 53055, 53055, 53055, 53056, 53056, 53056, 53056), project_id = c(495, 
495, 495, 495, 495, 495, 495, 495, 495, 495, 495, 495, 495, 495, 
495, 495, 495, 495, 495, 495, 495, 495, 495, 495, 495, 495, 495, 
495, 495, 495, 495, 495, 495, 495, 495, 495, 495, 495, 495, 495
), quest_name = c("Sociodemographic", "Sociodemographic", "Sociodemographic", 
"Sociodemographic", "Sociodemographic", "Sociodemographic", "Sociodemographic", 
"Sociodemographic", "Sociodemographic", "Sociodemographic", "Sociodemographic", 
"Sociodemographic", "Sociodemographic", "Sociodemographic", "Sociodemographic", 
"Sociodemographic", "Sociodemographic", "Sociodemographic", "Sociodemographic", 
"Sociodemographic", "Sociodemographic", "Sociodemographic", "Sociodemographic", 
"Sociodemographic", "Sociodemographic", "Sociodemographic", "Sociodemographic", 
"Sociodemographic", "Sociodemographic", "Sociodemographic", "Sociodemographic", 
"Sociodemographic", "Sociodemographic", "Sociodemographic", "Sociodemographic", 
"Sociodemographic", "Sociodemographic", "Sociodemographic", "Sociodemographic", 
"Sociodemographic"), quest_id = c(2189, 2189, 2189, 2189, 2189, 
2189, 2189, 2189, 2189, 2189, 2189, 2189, 2189, 2189, 2189, 2189, 
2189, 2189, 2189, 2189, 2189, 2189, 2189, 2189, 2189, 2189, 2189, 
2189, 2189, 2189, 2189, 2189, 2189, 2189, 2189, 2189, 2189, 2189, 
2189, 2189), user_id = c(46942, 46942, 46942, 46942, 46942, 46942, 
46946, 46946, 46946, 46946, 46946, 46946, 46947, 46947, 46947, 
46947, 46947, 46947, 46949, 46949, 46949, 46949, 46949, 46949, 
46950, 46950, 46950, 46950, 46950, 46950, 46951, 46951, 46951, 
46951, 46951, 46951, 46952, 46952, 46952, 46952), user_sex = c("male", 
"male", "male", "male", "male", "male", "male", "male", "male", 
"male", "male", "male", "male", "male", "male", "male", "male", 
"male", "male", "male", "male", "male", "male", "male", "male", 
"male", "male", "male", "male", "male", "male", "male", "male", 
"male", "male", "male", "male", "male", "male", "male"), user_status = c("test", 
"test", "test", "test", "test", "test", "guest", "guest", "guest", 
"guest", "guest", "guest", "guest", "guest", "guest", "guest", 
"guest", "guest", "guest", "guest", "guest", "guest", "guest", 
"guest", "registered", "registered", "registered", "registered", 
"registered", "registered", "guest", "guest", "guest", "guest", 
"guest", "guest", "guest", "guest", "guest", "guest"), user_age = c(23, 
23, 23, 23, 23, 23, 21, 21, 21, 21, 21, 21, 22, 22, 22, 22, 22, 
22, 21, 21, 21, 21, 21, 21, 58.9, 58.9, 58.9, 58.9, 58.9, 58.9, 
21, 21, 21, 21, 21, 21, 22, 22, 22, 22), q_name = c("vegan", 
"religious", "religious influence", "food allergies", "Other", 
"Education", "vegan", "religious", "religious influence", "food allergies", 
"Other", "Education", "vegan", "religious", "religious influence", 
"food allergies", "Other", "Education", "vegan", "religious", 
"religious influence", "food allergies", "Other", "Education", 
"vegan", "religious", "religious influence", "food allergies", 
"Other", "Education", "vegan", "religious", "religious influence", 
"food allergies", "Other", "Education", "vegan", "religious", 
"religious influence", "food allergies"), q_id = c(92827394, 
92827395, 92827396, 92827397, 92831398, 92832133, 92827394, 92827395, 
92827396, 92827397, 92831398, 92832133, 92827394, 92827395, 92827396, 
92827397, 92831398, 92832133, 92827394, 92827395, 92827396, 92827397, 
92831398, 92832133, 92827394, 92827395, 92827396, 92827397, 92831398, 
92832133, 92827394, 92827395, 92827396, 92827397, 92831398, 92832133, 
92827394, 92827395, 92827396, 92827397), order = c(1, 2, 3, 4, 
5, 6, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 
2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4), dv = c("1", "Athetiesm", 
"none", "none", "none", "Undergraduate", "5", "Nope, Atheist", 
"None", "No, i am normal", "Money", "Undergraduate", "3", "Nope, Atheist", 
"None", "No", "No", "Postgraduate (masters)", "1", "Christianity", 
"None", "No", "Cost", "Postgraduate (masters)", "2", "Nope, Atheist", 
"none", "none", "less processed, cook fresh", "Undergraduate", 
"1", "Christianity", "None", "No", "Cost", "Postgraduate (masters)", 
"6", "Nope, Atheist", NA, NA), starttime = structure(c(1607440590, 
1607440590, 1607440590, 1607440590, 1607440590, 1607440590, 1607441663, 
1607441663, 1607441663, 1607441663, 1607441663, 1607441663, 1607441637, 
1607441637, 1607441637, 1607441637, 1607441637, 1607441637, 1607442744, 
1607442744, 1607442744, 1607442744, 1607442744, 1607442744, 1607442919, 
1607442919, 1607442919, 1607442919, 1607442919, 1607442919, 1607442998, 
1607442998, 1607442998, 1607442998, 1607442998, 1607442998, 1607443123, 
1607443123, 1607443123, 1607443123), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), endtime = structure(c(1607440643, 1607440643, 
1607440643, 1607440643, 1607440643, 1607440643, 1607441839, 1607441839, 
1607441839, 1607441839, 1607441839, 1607441839, 1607441714, 1607441714, 
1607441714, 1607441714, 1607441714, 1607441714, 1607442819, 1607442819, 
1607442819, 1607442819, 1607442819, 1607442819, 1607443041, 1607443041, 
1607443041, 1607443041, 1607443041, 1607443041, 1607443020, 1607443020, 
1607443020, 1607443020, 1607443020, 1607443020, 1607443148, 1607443148, 
1607443148, 1607443148), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA, 
40L), class = "data.frame")

Try the following:

library(dplyr)
library(stringr)
library(tidyr)

my_data_raw_quest %>% 
  pivot_wider(names_from = question, values_from = dv) %>% # Reshape to wide: one row per user
  # Keep a user only if neither answer mentions gluten, so both conditions must hold (&, not |)
  filter(str_detect(`Allergies?`, "(G|g)luten", negate = TRUE) & str_detect(`food choice`, "(G|g)luten", negate = TRUE)) %>%
  pivot_longer(c(`Allergies?`, `food choice`), names_to = "question", values_to = "dv") # Reshape back to original shape

By the way, I am not sure what you were hoping to achieve with filter(add = TRUE, ...). add is not a valid argument of filter().
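As background on why the grouping appears to be ignored: filter() evaluates its condition once per row, and group_by() only changes the outcome when the condition contains something computed per group. The sketch below is not part of the answer's approaches; it is a minimal alternative that wraps the match in any(), using a cut-down copy of the question's toy data in which the "na" entries are assumed to be real missing values.

```r
library(dplyr)
library(stringr)

# Toy data modelled on the question's example (first three users only);
# "na" is assumed here to mean a true NA
my_data_raw_quest <- tibble(
  user_id  = c(1, 1, 2, 2, 3, 3),
  question = rep(c("Allergies?", "food choice"), times = 3),
  dv       = c(NA, "left", "yes, I hate gluten", "left", "allergic to soy", "left")
)

my_data_raw_quest_2 <- my_data_raw_quest %>%
  group_by(user_id) %>%
  # any() collapses the row-wise matches into one value per user;
  # na.rm = TRUE makes a missing answer count as "no match"
  filter(!any(str_detect(dv, "(G|g)luten"), na.rm = TRUE)) %>%
  ungroup()

my_data_raw_quest_2
```

In this sketch both rows of user 2 are dropped together, because the condition is a single TRUE or FALSE per user rather than one value per row.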

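Edit 2: updated to use a semi-join

After some thought, and based on your sample data, I think the following should work. Note that I tested it on the word "none", because I could not find "gluten" anywhere in the sample you provided, so change the pattern to suit your needs.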

library(dplyr)
library(purrr)
library(stringr)

my_data_raw_quest %>% 
  group_by(user_id) %>% 
  # Select user_id and dv, then nest dv into a list column (one row per user)
  select(user_id, dv) %>% 
  summarise(dv = list(dv)) %>% 
  # Find occurrences of "none" in each user's answers
  mutate(none = map(dv, str_detect, pattern = "none"),
         # TRUE for users without a match; na.rm = TRUE ignores missing answers
         none = !map_lgl(none, any, na.rm = TRUE)) %>% 
  filter(none) %>% 
  # semi-join on user_id to return the selected users only, with all columns
  semi_join(y = ., x = my_data_raw_quest, by = "user_id")
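For readers unfamiliar with the last step, here is a minimal sketch of what semi_join() does, on made-up data. The names responses and keep are hypothetical and are not taken from the question.

```r
library(dplyr)

# Hypothetical long-format answers: two rows per user
responses <- tibble(
  user_id = c(1, 1, 2, 2),
  dv      = c("none", "left", "soy", "right")
)

# Hypothetical table of users to retain
keep <- tibble(user_id = 2)

# semi_join() returns the rows of x that have a match in y,
# and only the columns of x
kept <- semi_join(responses, keep, by = "user_id")
kept
```

This is why the semi-join approach keeps every column of the original data: the per-user summary is used only to decide which user_id values survive.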

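Edit 1: nest/unnest

One approach is to nest the columns after grouping by user ID, map the detection step over the nested values, filter, and then unnest again: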

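library(dplyr)
library(purrr)
library(stringr)
library(tidyr)

my_data_raw_quest %>% 
  group_by(user_id) %>% 
  # Nest question and dv into list columns (one row per user)
  summarise(question = list(question), dv = list(dv)) %>% 
  # Find occurrences of gluten in each user's answers
  mutate(gluten = map(dv, str_detect, pattern = "(G|g)luten"),
         # TRUE for users without a match; na.rm = TRUE ignores missing answers
         gluten = !map_lgl(gluten, any, na.rm = TRUE)) %>% 
  filter(gluten) %>% 
  select(-gluten) %>% 
  # Unnest back to one row per answer
  unnest(c(question, dv))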
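A detail worth checking when dv contains missing values, as the last rows of the dput data do: str_detect() returns NA for a missing answer, and any() over a vector of FALSE and NA is NA unless na.rm = TRUE is passed. A minimal sketch of that behaviour, using one user's answers from the sample data:

```r
library(stringr)

# dv values of the last user in the dput sample
answers <- c("6", "Nope, Atheist", NA, NA)

matches <- str_detect(answers, "(G|g)luten")
matches                      # FALSE FALSE NA NA

any(matches)                 # NA
any(matches, na.rm = TRUE)   # FALSE
```

Since filter() drops rows whose condition evaluates to NA, whether na.rm = TRUE is used decides whether such a user is kept or silently removed.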