R character match and rank
I have a character vector
var1 <- c("pine tree", "dense forest", "red fruits", "green fruits",
          "clean water", "pine")
and a list
var2 <- list(c("tall tree", "fruits", "star"),
             c("tree tall", "pine tree", "tree pine", "black forest", "water"),
             c("apple", "orange", "grapes"))
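I want to match the words in var1 against the elements of var2 and rank the var2 elements by the number of matches. For example, the desired output here is:
"tree tall" "pine tree" "tree pine" "black forest" "water"
var2[2] is rank 1 (4 phrases in var1, namely pine tree, dense forest, pine and clean water, match var2[2])
"tall tree" "fruits" "star"
var2[1] is rank 2 (3 phrases in var1, namely pine tree, red fruits and green fruits, match var2[1])
"apple" "orange" "grapes"
var2[3] is rank 3 (no match with var1)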
I tried
indx1 <- sapply(var2, function(x) sum(grepl(var1, x)))
but did not get the desired output.
How can I solve this? A code snippet would be appreciated.
Thanks.
Edit:
The new data is as follows:
var11 <- c("nature" , "environmental", "ringing", "valley" , "status" , "climate" ,
"forge" , "environmental" , "common" ,
"birdwatch", "big" , "link" ,
"day" , "pintail" , "morning" ,
"big garden" , "birdwatch deadline", "deadline february" ,
"mu condition" , "garden birdwatch" , "status" ,
"chorus walk" , "dawn choru" , "walk sunday",
"climate lobby" , "lobby parliament" , "u status" ,
"sandwell valley" , "my status of" , "environmental lake")
var22 <- list(c("environmental condition"), c("condition", "status"), c("water", "ocean water"))
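On why the attempt in the question does not work: grepl() takes a single pattern, so when a vector such as var1 is passed as the pattern, R warns and uses only its first element. A minimal sketch of that behaviour with the question's data (the names x and res are only illustrative):

```r
var1 <- c("pine tree", "dense forest", "red fruits", "green fruits",
          "clean water", "pine")
x <- c("tall tree", "fruits", "star")   # first element of var2

# Only var1[1] ("pine tree") is used as the pattern; R emits a warning.
res <- suppressWarnings(grepl(var1, x))

# Same result as matching the first phrase alone.
identical(res, grepl(var1[1], x))
```

So the remaining phrases in var1 never take part in the match, which is why the counts come out wrong.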
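We can loop through 'var2' (sapply(var2, ...)), split each string at white space (strsplit(x, ' ')), and use the resulting words as a grepl pattern against 'var1'. We then check whether there is any match, sum the logical vector, and rank the result. This can be used to reorder the 'var2' elements.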
indx <- rank(-sapply(var2, function(x) sum(sapply(strsplit(x, ' '),
function(y) any(grepl(paste(y,collapse='|'), var1))))),
ties.method='first')
indx
#[1] 2 1 3
var2[indx]
#[[1]]
#[1] "tree tall" "pine tree" "tree pine" "black forest" "water"
#[[2]]
#[1] "tall tree" "fruits" "star"
#[[3]]
#[1] "apple" "orange" "grapes"
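The one-liner above can be unpacked into named steps, which makes the intermediate scores visible. This is a sketch rather than a replacement for the answer: the helper name count_matches is made up for illustration, and it swaps rank() for order(), because order() returns the positions to index with, whereas rank() returns each element's rank (the two coincide for this data but not in general).

```r
var1 <- c("pine tree", "dense forest", "red fruits", "green fruits",
          "clean water", "pine")
var2 <- list(c("tall tree", "fruits", "star"),
             c("tree tall", "pine tree", "tree pine", "black forest", "water"),
             c("apple", "orange", "grapes"))

# Hypothetical helper: number of phrases in `candidates` that share
# at least one word with any entry of `targets`.
count_matches <- function(candidates, targets) {
  word_sets <- strsplit(candidates, " ", fixed = TRUE)
  hits <- vapply(word_sets, function(words) {
    any(grepl(paste(words, collapse = "|"), targets))
  }, logical(1))
  sum(hits)
}

scores <- vapply(var2, count_matches, integer(1), targets = var1)
scores
# [1] 2 5 0

ord <- order(-scores)   # positions of var2 from highest to lowest score
ord
# [1] 2 1 3
var2[ord]

# Where rank() and order() disagree (scores made up for illustration):
s <- c(0, 5, 2)
rank(-s)    # 3 1 2: indexing with this would put the score-2 element first
order(-s)   # 2 3 1: positions of the elements, best first
```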
Update
If we also need to count duplicates, try
indx <- rank(-sapply(var22, function(x) sum(sapply(strsplit(x, ' '),
function(y) sum(sapply(strsplit(var11, ' '),
function(z) any(grepl(paste(y, collapse="|"), z))))))),
ties.method='random')
indx
#[1] 1 2
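The difference from the first version is the inner sum(): instead of asking whether a 'var22' phrase matches anything at all, it counts how many entries of 'var11' match, so a repeated entry such as "status" contributes more than once. A small sketch with made-up data (the vector entries is an assumption for illustration, not the question's full var11):

```r
entries <- c("status", "u status", "valley", "status")  # illustrative; "status" repeated
pattern <- "status"

any(grepl(pattern, entries))   # first version: a match is counted once
# [1] TRUE
sum(grepl(pattern, entries))   # update: every matching entry counts
# [1] 3
```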
Update 2
If we need to filter out the elements of 'var2' that have no match in 'var1':
pat <- paste(unique(unlist(strsplit(var1, ' '))), collapse="|")
Filter(function(x) any(grepl(pat, x)), var2[indx])
#[[1]]
#[1] "tree tall" "pine tree" "tree pine" "black forest" "water"
#[[2]]
#[1] "tall tree" "fruits" "star"
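One caveat with the pattern above: grepl() matches substrings, so a word such as "pine" would also match "pineapple". If whole-word matching is wanted, the alternation can be wrapped in word boundaries. A sketch, assuming the words contain no regex metacharacters (pat_words is a hypothetical name):

```r
var1 <- c("pine tree", "dense forest", "red fruits", "green fruits",
          "clean water", "pine")
words <- unique(unlist(strsplit(var1, " ", fixed = TRUE)))

pat       <- paste(words, collapse = "|")    # substring matching, as in the answer
pat_words <- paste0("\\b(", pat, ")\\b")     # whole words only

grepl(pat, "pineapple")        # TRUE: "pine" is a substring
grepl(pat_words, "pineapple")  # FALSE: no whole-word match
grepl(pat_words, "pine tree")  # TRUE
```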
The following code works:
idx <- rank(-sapply(var2,
function(x) sum(unlist(sapply(strsplit(var1,split=' '),
function(y) any(unlist(sapply(y,
function(z) grepl(z,x))>0))>0)))),
ties.method='random')
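This version counts from the other direction: for each phrase in var1 it checks whether any of its words occurs in the var2 element, which reproduces the counts described in the question (4 for var2[2], 3 for var2[1]). A more readable sketch of the same idea, where the helper name count_matching_phrases is an assumption and order() is used to obtain the indexing positions:

```r
var1 <- c("pine tree", "dense forest", "red fruits", "green fruits",
          "clean water", "pine")
var2 <- list(c("tall tree", "fruits", "star"),
             c("tree tall", "pine tree", "tree pine", "black forest", "water"),
             c("apple", "orange", "grapes"))

# Hypothetical helper: number of phrases in `phrases` that have at least
# one word occurring somewhere in `candidates`.
count_matching_phrases <- function(candidates, phrases) {
  words_per_phrase <- strsplit(phrases, " ", fixed = TRUE)
  hits <- vapply(words_per_phrase, function(words) {
    any(vapply(words, function(w) any(grepl(w, candidates)), logical(1)))
  }, logical(1))
  sum(hits)
}

scores <- vapply(var2, count_matching_phrases, integer(1), phrases = var1)
scores
# [1] 3 4 0

var2[order(-scores)]   # var2[[2]] first, then var2[[1]], then var2[[3]]
```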