Match a sentence with a sentence in R?
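I have two data frames, occupation and data. I want to match each Occupation in data against the sentences in occupation, and assign the corresponding class by adding a column to the occupation data frame.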
occupation <- c("I am Civil Engineer human being", "Graphic Designer too late", "Architect by profession", "Sales Manager Bank", "Love my profession of Professor", "NA")
occupation <- data.frame(occupation)
data <- data.frame(class = c("Engineers","Designer","Artist","Designer","Poetry","Banker and Prof"), Occupation = c("Civil Engineer", "Graphic Designer", "Painter","Poetry","Architect(prof)", "Sales Manager Bank"))
I want output like this:
occupation class
I am Civil Engineer human being Engineers
Painter Architect Poetry Artists
Graphic Designer too late Designers
Architect by Painter profession Architect
Sales Manager Bank Banker and Prof
Love my profession of Professor NA
NA NA
I tried this, but nothing happens:
occupation$value <- sapply(data$occupation, grepl, x = occupation)
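One likely reason the attempt above does nothing: the lookup table's column is named `Occupation` with a capital O, so `data$occupation` returns `NULL` and `sapply()` has nothing to iterate over. The sketch below illustrates this on a cut-down copy of the question's `data`; the variables `missing_col`, `res`, and `ok` are made up for the illustration.

```r
# Toy copy of the lookup table from the question (note the capital "O")
data <- data.frame(
  class      = c("Engineers", "Designer"),
  Occupation = c("Civil Engineer", "Graphic Designer"),
  stringsAsFactors = FALSE
)

# `$` is case-sensitive, so this finds no column and returns NULL
missing_col <- data$occupation
print(is.null(missing_col))

# sapply() over NULL has nothing to iterate on and returns an empty list
res <- sapply(missing_col, grepl, x = "I am Civil Engineer human being")
print(length(res))

# With the correct column name there is one result per lookup phrase;
# fixed = TRUE treats each phrase as literal text rather than a regex
ok <- sapply(data$Occupation, grepl, x = "I am Civil Engineer human being",
             fixed = TRUE)
print(ok)
```

This only checks one sentence against every phrase; looping over all sentences still has to be added on top.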
agrep gets very close. I couldn't get it to work for Architect(prof), but if you remove the parenthetical part it works:
data$Occupation <- sub("\\(.*", "", data$Occupation)
data
class Occupation
1 Engineers Civil Engineer
2 Designer Graphic Designer
3 Designer Architect
4 Banker and Prof Sales Manager Bank
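In R source the backslash itself has to be escaped, so a literal opening parenthesis in a regex is written `\\(`. A minimal sketch of the clean-up step on a made-up vector `occ` (the `trimws()` call is an extra assumption, added in case a space precedes the parenthesis):

```r
# Hypothetical lookup phrases, one of them carrying a parenthetical suffix
occ <- c("Civil Engineer", "Architect(prof)", "Sales Manager Bank")

# "\\(" in R source is the regex \( , a literal opening parenthesis;
# ".*" then consumes everything up to the end of the string
stripped <- sub("\\(.*", "", occ)

# Drop any whitespace left behind, e.g. for "Architect (prof)"
stripped <- trimws(stripped)
print(stripped)
```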
occ.class <- data$class[unlist(sapply(data$Occupation, function(x) agrep(x, occupation$occupation)))]
occ.class
[1] Engineers Designer Designer Banker and Prof
Levels: Banker and Prof Designer Engineers
If you want the third one to show Architect, change its class accordingly in your data data.frame.
As for the edit:
occ.class <- unlist(sapply(data$Occupation, function(x) agrep(x, occupation$occupation)))
# ifelse() with a length-one test would return only the first class, so use if/else
if (length(occ.class)) data$class[occ.class] else NA
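To get one class per sentence, with `NA` where nothing matches, the lookup can also be run row by row. This is only a sketch under assumptions: the helper `class_for` and the toy vectors `sentences` and `lookup` are invented here, and taking the first matching phrase is an arbitrary tie-break.

```r
# Toy data modelled on the question (names are illustrative only)
sentences <- c("I am Civil Engineer human being",
               "Graphic Designer too late",
               "Love my profession of Professor")
lookup <- data.frame(
  class      = c("Engineers", "Designer"),
  Occupation = c("Civil Engineer", "Graphic Designer"),
  stringsAsFactors = FALSE
)

# Return the class of the first phrase that approximately occurs in
# `sentence`, or NA when no phrase matches
class_for <- function(sentence, lookup, max.distance = 0.1) {
  hit <- vapply(lookup$Occupation,
                function(p) agrepl(p, sentence, max.distance = max.distance),
                logical(1))
  if (any(hit)) lookup$class[which(hit)[1]] else NA_character_
}

result <- data.frame(
  occupation = sentences,
  class = vapply(sentences, class_for, character(1), lookup = lookup,
                 USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
print(result)
```

Because the result is built per sentence, the output always has as many rows as the input, which the index-based version does not guarantee.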
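I don't know how complex your data is, but this works well for simple strings. The agrep function lets you set a tolerance parameter (max.distance), so you can match strings that are not identical: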
occupation <- data.frame(occupation = c("I am Civil Engineer human being", "Graphic Designer too late", "Architect by profession", "Sales Manager Bank"),
                         stringsAsFactors = FALSE)
data <- data.frame(class = c("Engineers","Designer","Architect","Banker and Prof"),
                   occupation = c("Civil Engineer", "Graphic Designer", "Architect(prof)", "Sales Manager Bank"),
                   stringsAsFactors = FALSE)
occupation$value <- sapply(occupation$occupation, function(x) {
  # agrepl() returns TRUE/FALSE, so a non-match cannot break the comparison
  match.class <- sapply(data$class, function(y) agrepl(y, x, max.distance = 0.2))
  # collapse all matching classes into one string ("" when nothing matches)
  paste(data$class[match.class], collapse = ", ")
})
If you raise max.distance you can also detect the last string, but the earlier strings will then pick up extra matches too.
occupation value
1 I am Civil Engineer human being Civil Engineer
2 Graphic Designer too late Graphic Designer
3 Architect by profession Architect(prof)
4 Sales Manager Bank
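The trade-off can be probed directly by trying several tolerances on the one pair that fails. A small sketch, assuming unit edit costs (the default); the particular values in `distances` are arbitrary choices for illustration:

```r
sentence <- "Sales Manager Bank"
pattern  <- "Banker and Prof"

# Candidate tolerances, each a fraction of the pattern length
distances <- c(0.1, 0.2, 0.5, 0.8)

# For each tolerance, does the pattern approximately occur in the sentence?
hits <- vapply(distances,
               function(d) agrepl(pattern, sentence, max.distance = d),
               logical(1))
print(data.frame(max.distance = distances, matched = hits))
```

A tolerance loose enough to accept this pair allows most of the pattern to be edited away, which is why the other rows start matching several classes as well.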
The second option matches word by word, but in the case of 'I am Civil Engineer human being' the words 'I' and 'am' match everything.
occupation$value <- sapply(occupation$occupation, function(x) {
  match.class <- sapply(data$class, function(y) {
    any(sapply(strsplit(x, ' ')[[1]], function(z)
      any(agrep(z, y, max.distance = 0.2))))
  })
  data$class[which(match.class)]
})
The result looks like this:
occupation value
1 I am Civil Engineer human being Civil Engineer, Graphic Designer, Architect(prof), Sales Manager Bank
2 Graphic Designer too late Graphic Designer
3 Architect by profession Architect(prof)
4 Sales Manager Bank Sales Manager Bank
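Very short words match everything presumably because the tolerance is rounded up to a whole number of edits: 0.2 of a one- or two-letter pattern still allows one edit, which leaves almost nothing to match. One possible workaround, sketched here as an assumption and not as part of the answer above, is to drop short words before matching; the cutoff of 4 characters and the names `long_words` and `matched` are hypothetical.

```r
sentence <- "I am Civil Engineer human being"
classes  <- c("Engineers", "Designer", "Architect", "Banker and Prof")

# Split into words and keep only those long enough to be informative
words      <- strsplit(sentence, " ", fixed = TRUE)[[1]]
long_words <- words[nchar(words) >= 4]   # hypothetical cutoff

# A class counts as matched if any remaining word approximately occurs in it
matched <- vapply(classes, function(cl) {
  any(vapply(long_words,
             function(w) agrepl(w, cl, max.distance = 0.2),
             logical(1)))
}, logical(1))

print(classes[matched])
```

The cutoff is a blunt tool: it also discards genuinely short job titles, so it would need tuning against real data.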