Creating Groups with Dplyr's "group_by" then Using Stringr to Find Differences Between Groups
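Using the example below, I'd like to group the data frame by CaseWorker, then by Client, and then determine for each client group whether the list of tasks in "Task" is the same as the list of tasks in "Task2".
I'd be happy with a simple TRUE or FALSE, or, even better, having each task that is in "Task2" but not in "Task" extracted and displayed in a new column or data frame.
So basically I need to make sure that "Task" and "Task2" contain the same entries for each client.
If possible I'd like to stick with Dplyr and Stringr, or at least stay within the Tidyverse. I'm thinking there is some way to do this elegantly with "group_by" and "str_detect" or some other Stringr function.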
CaseWorker<-c("John","John","John","John","John","John","Melanie","Melanie","Melanie","Melanie","Melanie","Melanie")
Client<-c("Chris","Chris","Chris","Tom","Tom","Tom","Valerie","Valerie","Valerie","Tim","Tim","Tim")
Task<-c("Feed cat","Make dinner","Iron shirt","Make dinner","Do homework","Make lunch","Make dinner","Feed cat","Buy groceries","Do homework","Iron shirt","Make lunch")
Task2<-c("Feed cat","Make dinner","Iron shirt","Make dinner","Do homework","Feed cat","Make dinner","Feed cat","Iron shirt","Do homework","Iron shirt","Make lunch")
Df<-data.frame(CaseWorker,Client,Task,Task2)
You can do this simply with dplyr and %in%:
library(dplyr)

Df %>%
  group_by(CaseWorker, Client) %>%
  mutate(Check = Task %in% Task2)
This relies on an exact, case-sensitive match. If that is a concern, you can do the following:
Df %>%
  group_by(CaseWorker, Client) %>%
  rowwise() %>%
  mutate(Check = grepl(Task, Task2, ignore.case = TRUE))
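Note that you have to call rowwise() before mutate() here, because grepl() accepts only a single pattern rather than a vector of patterns. With rowwise(), each row's Task is compared only against that same row's Task2, and Task is interpreted as a regular expression.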
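If the comparison should stay group-wise (each Task checked against all Task2 values for the same client) while ignoring case, one option is to lower-case both columns before applying %in%. This is a minimal sketch rather than the answerer's code; the name `checked` is only illustrative, and the data frame is rebuilt from the question so the block runs on its own.

```r
library(dplyr)

# Example data from the question
Df <- data.frame(
  CaseWorker = rep(c("John", "Melanie"), each = 6),
  Client = rep(c("Chris", "Tom", "Valerie", "Tim"), each = 3),
  Task = c("Feed cat", "Make dinner", "Iron shirt",
           "Make dinner", "Do homework", "Make lunch",
           "Make dinner", "Feed cat", "Buy groceries",
           "Do homework", "Iron shirt", "Make lunch"),
  Task2 = c("Feed cat", "Make dinner", "Iron shirt",
            "Make dinner", "Do homework", "Feed cat",
            "Make dinner", "Feed cat", "Iron shirt",
            "Do homework", "Iron shirt", "Make lunch"),
  stringsAsFactors = FALSE
)

# Within each CaseWorker/Client group, flag whether each Task appears
# anywhere in that group's Task2 values, ignoring case
checked <- Df %>%
  group_by(CaseWorker, Client) %>%
  mutate(Check = tolower(Task) %in% tolower(Task2)) %>%
  ungroup()

checked
```

Because no regular expressions are involved, task names containing characters such as "." or "(" are compared literally.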
See if this is what you want.
First, check whether Task matches Task2 on each row. If it does not, return Task2 as a new variable. I stored the result in a new data frame, df2.
df2 <- Df %>%
  mutate(match = Task == Task2,
         non_match = ifelse(!match, Task2, ""))
df2
#    CaseWorker Client  Task          Task2       match non_match
# 1  John       Chris   Feed cat      Feed cat    TRUE
# 2  John       Chris   Make dinner   Make dinner TRUE
# 3  John       Chris   Iron shirt    Iron shirt  TRUE
# 4  John       Tom     Make dinner   Make dinner TRUE
# 5  John       Tom     Do homework   Do homework TRUE
# 6  John       Tom     Make lunch    Feed cat    FALSE Feed cat
# 7  Melanie    Valerie Make dinner   Make dinner TRUE
# 8  Melanie    Valerie Feed cat      Feed cat    TRUE
# 9  Melanie    Valerie Buy groceries Iron shirt  FALSE Iron shirt
# 10 Melanie    Tim     Do homework   Do homework TRUE
# 11 Melanie    Tim     Iron shirt    Iron shirt  TRUE
# 12 Melanie    Tim     Make lunch    Make lunch  TRUE
Then summarise the results to see whether every entry matches for each CaseWorker/Client pair.
df2 %>%
  group_by(CaseWorker, Client) %>%
  summarise(n = n(),
            matches = sum(match),
            all_match = n == matches)
#   CaseWorker Client      n matches all_match
#   <chr>      <chr>   <int>   <int> <lgl>
# 1 John       Chris       3       3 TRUE
# 2 John       Tom         3       2 FALSE
# 3 Melanie    Tim         3       3 TRUE
# 4 Melanie    Valerie     3       2 FALSE
If you need the all_match variable in the original dataset, you can of course merge it back into your data frame.
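One way to do that merge, sketched here with dplyr's left_join(), is to join the per-group summary back onto the original rows by CaseWorker and Client. The names `group_summary` and `Df_flagged` are made up for this illustration, and the `.groups` argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Example data from the question
Df <- data.frame(
  CaseWorker = rep(c("John", "Melanie"), each = 6),
  Client = rep(c("Chris", "Tom", "Valerie", "Tim"), each = 3),
  Task = c("Feed cat", "Make dinner", "Iron shirt",
           "Make dinner", "Do homework", "Make lunch",
           "Make dinner", "Feed cat", "Buy groceries",
           "Do homework", "Iron shirt", "Make lunch"),
  Task2 = c("Feed cat", "Make dinner", "Iron shirt",
            "Make dinner", "Do homework", "Feed cat",
            "Make dinner", "Feed cat", "Iron shirt",
            "Do homework", "Iron shirt", "Make lunch"),
  stringsAsFactors = FALSE
)

# One row per CaseWorker/Client pair: TRUE only if every row matches
group_summary <- Df %>%
  group_by(CaseWorker, Client) %>%
  summarise(all_match = all(Task == Task2), .groups = "drop")

# Attach the group-level flag to every original row
Df_flagged <- left_join(Df, group_summary, by = c("CaseWorker", "Client"))

Df_flagged
```

left_join() keeps the rows of Df in their original order, so the flag simply appears as an extra column.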
If you want to use the stringr package, the following should also work for you.
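library(dplyr)
library(stringr)

# as.character() guards against factor columns; str_detect() compares
# row by row and treats Task2 as a regular expression
Df %>%
  group_by(CaseWorker, Client) %>%
  mutate(Check = str_detect(as.character(Task), as.character(Task2)))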
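For the group-level question itself (do Task and Task2 hold the same entries per client, and which Task2 entries are missing from Task?), set operations inside summarise() may be more direct than pattern matching. The following is only a sketch under that reading of the question: it treats each column as a set, so duplicates and order are ignored, and the names `by_client`, `same_tasks` and `only_in_task2` are invented for the example.

```r
library(dplyr)

# Example data from the question
Df <- data.frame(
  CaseWorker = rep(c("John", "Melanie"), each = 6),
  Client = rep(c("Chris", "Tom", "Valerie", "Tim"), each = 3),
  Task = c("Feed cat", "Make dinner", "Iron shirt",
           "Make dinner", "Do homework", "Make lunch",
           "Make dinner", "Feed cat", "Buy groceries",
           "Do homework", "Iron shirt", "Make lunch"),
  Task2 = c("Feed cat", "Make dinner", "Iron shirt",
            "Make dinner", "Do homework", "Feed cat",
            "Make dinner", "Feed cat", "Iron shirt",
            "Do homework", "Iron shirt", "Make lunch"),
  stringsAsFactors = FALSE
)

by_client <- Df %>%
  group_by(CaseWorker, Client) %>%
  summarise(
    # TRUE when both columns contain the same set of tasks
    same_tasks = setequal(Task, Task2),
    # Tasks present in Task2 but absent from Task, as one string
    only_in_task2 = paste(setdiff(Task2, Task), collapse = ", "),
    .groups = "drop"
  )

by_client
```

For the example data this flags Tom and Valerie, with "Feed cat" and "Iron shirt" respectively as the Task2 entries that do not appear in Task.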
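It may just be that I'm misunderstanding the question, but I think you may be overcomplicating this if all you want is the records where Task does not match Task2.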
> Df[which(Df$Task != Df$Task2),]
=== ========== ======= ============= ==========
\ CaseWorker Client Task Task2
=== ========== ======= ============= ==========
6 John Tom Make lunch Feed cat
9 Melanie Valerie Buy groceries Iron shirt
=== ========== ======= ============= ==========
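Since the question asks to stay within the Tidyverse, the same row-wise subset can be written with dplyr's filter(). This is an equivalent sketch, not part of the original answer; the name `mismatches` is illustrative.

```r
library(dplyr)

# Example data from the question
Df <- data.frame(
  CaseWorker = rep(c("John", "Melanie"), each = 6),
  Client = rep(c("Chris", "Tom", "Valerie", "Tim"), each = 3),
  Task = c("Feed cat", "Make dinner", "Iron shirt",
           "Make dinner", "Do homework", "Make lunch",
           "Make dinner", "Feed cat", "Buy groceries",
           "Do homework", "Iron shirt", "Make lunch"),
  Task2 = c("Feed cat", "Make dinner", "Iron shirt",
            "Make dinner", "Do homework", "Feed cat",
            "Make dinner", "Feed cat", "Iron shirt",
            "Do homework", "Iron shirt", "Make lunch"),
  stringsAsFactors = FALSE
)

# Keep only the rows where the two task columns differ
mismatches <- Df %>%
  filter(Task != Task2)

mismatches
```

Unlike the base R subset, filter() does not keep the original row numbers (6 and 9).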