Separate different combinations of names into first and last using dplyr, tidyr, and regex
Sample data frame:
name <- c("Smith John Michael","Smith, John Michael","Smith John, Michael","Smith-John Michael","Smith-John, Michael")
df <- data.frame(name)
df
name
1 Smith John Michael
2 Smith, John Michael
3 Smith John, Michael
4 Smith-John Michael
5 Smith-John, Michael
I need to achieve the following desired output:
                 name first.name  last.name
1  Smith John Michael       John      Smith
2 Smith, John Michael       John      Smith
3 Smith John, Michael    Michael Smith John
4  Smith-John Michael    Michael Smith-John
5 Smith-John, Michael    Michael Smith-John
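The rules are: if the string contains a comma, everything before the comma is the last name, and the first word after the comma is the first name. If the string contains no comma, the first word is the last name and the second word is the first name. A hyphenated word counts as one word. I would prefer to do this with dplyr and regex, but I will take any solution. Thanks for the help.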
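You can use strsplit, switching the split string between "," and " " depending on whether name contains a comma, to get the result you want. Here we define two functions to make the presentation clearer. You could also inline the code instead of using functions.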
get.last.name <- function(name) {
lapply(ifelse(grepl(",",name),strsplit(name,","),strsplit(name," ")),`[[`,1)
}
The result of strsplit is a list. lapply(..., `[[`, 1) iterates over this list and extracts the first element of each list element, which is the last name.
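To make the list structure concrete, here is a small base R sketch on two of the sample names. It uses vapply instead of lapply, which is a substitution on my part so that the result comes back as a plain character vector; the variable names parts and first_piece are only illustrative.

```r
# Two of the sample names, split on a comma
parts <- strsplit(c("Smith, John Michael", "Smith-John Michael"), ",")

# strsplit() returns a list with one character vector per input string
str(parts)

# `[[` with index 1 pulls the first piece out of each list element;
# vapply() (used here instead of lapply()) returns a character vector
first_piece <- vapply(parts, `[[`, character(1), 1)
first_piece
```

The second name has no comma, so splitting on "," leaves it in one piece. That is why the functions above switch to splitting on " " for such names.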
get.first.name <- function(name) {
d <- lapply(ifelse(grepl(",",name),strsplit(name,","),strsplit(name," ")),`[[`,2)
lapply(strsplit(gsub("^ ","",d), " "),`[[`,1)
}
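This function is similar, except that we extract the second element of each list element returned by strsplit, which contains the first name. We then use gsub to remove a leading space, split again on " ", and extract the first element of each list element returned by strsplit as the first name.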
Putting it together with dplyr:
library(dplyr)
res <- df %>% mutate(first.name=get.first.name(name),
last.name=get.last.name(name))
The result is as expected:
print(res)
##                  name first.name  last.name
## 1  Smith John Michael       John      Smith
## 2 Smith, John Michael       John      Smith
## 3 Smith John, Michael    Michael Smith John
## 4  Smith-John Michael    Michael Smith-John
## 5 Smith-John, Michael    Michael Smith-John
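One thing to be aware of: because get.first.name and get.last.name return the result of lapply, the two new columns are list columns rather than character columns. If plain character columns are preferred, a regex-only variant along the following lines might work. This is a sketch, not the answer's method; the helper names last_name_of and first_name_of are hypothetical, and the input column is assumed to be character (hence stringsAsFactors = FALSE).

```r
library(dplyr)

df <- data.frame(
  name = c("Smith John Michael", "Smith, John Michael", "Smith John, Michael",
           "Smith-John Michael", "Smith-John, Michael"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: last name is the text before the first comma,
# or the first space-delimited word when there is no comma
last_name_of <- function(x) {
  ifelse(grepl(",", x, fixed = TRUE),
         sub(",.*$", "", x),
         sub(" .*$", "", x))
}

# Hypothetical helper: first name is the first word after the comma,
# or the second word when there is no comma
first_name_of <- function(x) {
  rest <- ifelse(grepl(",", x, fixed = TRUE),
                 sub("^[^,]*,\\s*", "", x),  # drop everything up to the comma
                 sub("^\\S+\\s+", "", x))    # drop the first word
  sub("\\s.*$", "", rest)                    # keep only the first remaining word
}

res <- df %>%
  mutate(first.name = first_name_of(name),
         last.name  = last_name_of(name))
```

Both helpers are vectorized, so they can be called directly inside mutate without lapply.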
Data:
df <- structure(list(name = c("Smith John Michael", "Smith, John Michael",
"Smith John, Michael", "Smith-John Michael", "Smith-John, Michael"
)), .Names = "name", row.names = c(NA, -5L), class = "data.frame")
## name
##1 Smith John Michael
##2 Smith, John Michael
##3 Smith John, Michael
##4 Smith-John Michael
##5 Smith-John, Michael
I'm not sure whether this is better than aichao's answer, but I gave it a try anyway. It gives the correct output.
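library(dplyr)
library(tidyr)
# Rows with a comma: the last name is everything before the comma
df1 <- df %>%
  filter(grepl(",", name)) %>%
  separate(name, c("last.name", "first.middle.name"), sep = ",", remove = FALSE) %>%
  mutate(first.middle.name = trimws(first.middle.name)) %>%
  separate(first.middle.name, c("first.name", "middle.name"), sep = " ", fill = "right") %>%
  select(-middle.name)
# Rows without a comma: the first word is the last name, the second the first name
df2 <- df %>%
  filter(!grepl(",", name)) %>%
  separate(name, c("last.name", "first.name"), sep = " ", remove = FALSE, extra = "drop")
df <- rbind(df1, df2)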
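Note that filtering into two data frames and binding them back together puts the comma rows first, so the original row order is not kept. A possible single-pipeline variant that keeps the order is sketched below. It is my own assumption rather than part of the answer: for names without a comma it turns the first space into a comma, so that every row can be split the same way. The temporary columns tmp and rest and the result name out are hypothetical.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  name = c("Smith John Michael", "Smith, John Michael", "Smith John, Michael",
           "Smith-John Michael", "Smith-John, Michael"),
  stringsAsFactors = FALSE
)

out <- df %>%
  # No comma: replace the first space with a comma so all rows share one format
  mutate(tmp = if_else(grepl(",", name), name, sub(" ", ",", name))) %>%
  # Split once on the comma (plus any following whitespace)
  separate(tmp, c("last.name", "rest"), sep = ",\\s*", extra = "merge") %>%
  # The first name is the first word of what follows the comma
  mutate(first.name = sub("\\s.*$", "", rest)) %>%
  select(name, first.name, last.name)
```

Because no rows are filtered out and re-bound, out has the same row order as df.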