Calculate Jaccard similarity between each word in 2 vectors
I need to calculate the Jaccard similarity between every word in 2 vectors, word by word, and extract the most similar words.
Here is my bad, slow code:
library(stringdist)

txt1 <- c('The quick brown fox jumps over the lazy dog')
txt2 <- c('Te quick foks jump ovar lazey dogg')
words <- strsplit(as.character(txt1), " ")
words.p <- strsplit(as.character(txt2), " ")
r <- length(words[[1]])
c <- length(words.p[[1]])
m <- matrix(nrow=r, ncol=c)
# m[i, j] is the distance between word i of txt1 and word j of txt2
for (i in 1:r){
  for (j in 1:c){
    m[i,j] = stringdist(tolower(words.p[[1]][j]), tolower(words[[1]][i]), method='jaccard', q=2)
  }
}
# row index (position in txt1) of every cell holding the smallest distance
ind <- which(m == min(m), arr.ind=TRUE)[, "row"]
words[[1]][ind]
Please help me improve and tidy up this code so it works on a large data frame.
Preparation (adding tolower here):
txt1 <- c('The quick brown fox jumps over the lazy dog')
txt2 <- c('Te quick foks jump ovar lazey dogg')
words <- unlist(strsplit(tolower(as.character(txt1)), " "))
words.p <- unlist(strsplit(tolower(as.character(txt2)), " "))
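Get the distances from each word to every word in words.p (no q is passed here, so stringdist uses its default q = 1 rather than the q = 2 from the question):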
dists <- sapply(words, Map, f=stringdist, list(words.p), method="jaccard")
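To make the role of q concrete, here is a minimal base-R sketch of the Jaccard distance on sets of q-grams, 1 - |A ∩ B| / |A ∪ B|. The helper names qgrams and jaccard_dist are hypothetical and are not part of stringdist; the sketch assumes every string has at least q characters and is meant only as an illustration, not as a replacement for the package.

```r
# Hypothetical helper: the set of distinct q-grams of a single string.
qgrams <- function(s, q) {
  stopifnot(nchar(s) >= q)  # assumption: strings shorter than q are not handled
  starts <- seq_len(nchar(s) - q + 1)
  unique(substring(s, starts, starts + q - 1))
}

# Hypothetical helper: Jaccard distance between the q-gram sets of two strings.
jaccard_dist <- function(a, b, q = 1) {
  A <- qgrams(a, q)
  B <- qgrams(b, q)
  1 - length(intersect(A, B)) / length(union(A, B))
}

jaccard_dist("jumps", "jump", q = 2)  # 3 shared bigrams out of 4 distinct ones
jaccard_dist("dog", "dogg", q = 1)    # identical character sets
```

With q = 1 the comparison is on character sets, so words such as "dog" and "dogg" are indistinguishable; a larger q is stricter about character order.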
For each word in words, find the closest word in words.p:
matches <- words.p[sapply(dists, which.min)]
cbind(words, matches)
     words   matches
[1,] "the"   "te"
[2,] "quick" "quick"
[3,] "brown" "ovar"
[4,] "fox"   "foks"
[5,] "jumps" "jump"
[6,] "over"  "ovar"
[7,] "the"   "te"
[8,] "lazy"  "lazey"
[9,] "dog"   "dogg"
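As an alternative sketch, the same distances can be computed in a single call with stringdistmatrix from the stringdist package, which returns a matrix with one row per element of its first argument and one column per element of its second. The names d and nearest are illustrative only, and q = 2 is taken from the question, not from the answer above (which relies on the default q = 1), so the matches may differ from the table shown.

```r
library(stringdist)

words   <- c("the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog")
words.p <- c("te", "quick", "foks", "jump", "ovar", "lazey", "dogg")

# Rows correspond to `words`, columns to `words.p`.
d <- stringdistmatrix(words, words.p, method = "jaccard", q = 2)

# For each row, pick the column with the smallest distance (first one on ties).
nearest <- words.p[apply(d, 1, which.min)]

data.frame(word = words, match = nearest, dist = apply(d, 1, min))
```

Because the whole matrix is built inside the package, this avoids the explicit double loop, although the matrix still has length(words) * length(words.p) cells and may need to be computed in chunks for very large inputs.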
Edit:
To get the best-matching word pair, you first need to select the minimum distance from each word in words to all the words in words.p:
mindists <- sapply(dists, min)
This gives the best distance for each word. Then you select the word in words with the smallest of those distances:
words[which.min(mindists)]
Or in one line:
words[which.min(sapply(dists, min))]
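To recover both words of the best-matching pair and not just the word from words, one option is to locate the minimum of a distance matrix with which(..., arr.ind = TRUE). This is a sketch under the same assumptions as the answer (default q = 1); d and best are hypothetical names, and when several pairs tie for the minimum, all of them are returned.

```r
library(stringdist)

words   <- c("the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog")
words.p <- c("te", "quick", "foks", "jump", "ovar", "lazey", "dogg")

# Rows correspond to `words`, columns to `words.p`.
d <- stringdistmatrix(words, words.p, method = "jaccard")

# Row/column positions of every cell equal to the overall minimum.
best <- which(d == min(d), arr.ind = TRUE)

# One row per best pair: word from `words`, word from `words.p`.
cbind(word = words[best[, "row"]], match = words.p[best[, "col"]])
```

Unlike which.min, which stops at the first minimum, this keeps every tied pair, which matters with q = 1 because words that share the same character set have distance 0.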