How can I calculate cosine similarity between two string vectors?
I have 2 vectors of dimension 6 and I would like to get a number between 0 and 1.
a=c("HDa","2Pb","2","BxU","BuQ","Bve")
b=c("HCK","2Pb","2","09","F","G")
Could someone explain what I should do?
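You first need a dictionary of all possible terms. Then convert each of your vectors into a binary vector that has a 1 at the position of every term it contains and a 0 elsewhere. If you name the new vectors a2 and b2, you can compute a cosine-like score with cor(a2, b2). Note that cor returns the Pearson correlation, which is the cosine of the mean-centered vectors and therefore lies between -1 and 1. You can map it to [0, 1] like this: 0.5*cor(a2, b2) + 0.5. The plain (uncentered) cosine of two 0/1 vectors is already between 0 and 1 and needs no rescaling.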
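The dictionary-and-binary-vector steps described above can be sketched as follows. This is a minimal illustration in plain Python rather than R, and the helper names (binary_vectors, cosine, pearson) are made up for this example. It computes both the plain cosine and the centered version that cor would give, so the two can be compared on the vectors from the question.

```python
import math

def binary_vectors(a, b):
    """Encode two term lists as 0/1 vectors over their combined vocabulary."""
    vocab = sorted(set(a) | set(b))
    set_a, set_b = set(a), set(b)
    a2 = [1 if term in set_a else 0 for term in vocab]
    b2 = [1 if term in set_b else 0 for term in vocab]
    return vocab, a2, b2

def cosine(u, v):
    """Plain cosine: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_u * norm_v)

def pearson(u, v):
    """Pearson correlation, computed as the cosine of the mean-centered vectors."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return cosine([x - mean_u for x in u], [y - mean_v for y in v])

a = ["HDa", "2Pb", "2", "BxU", "BuQ", "Bve"]
b = ["HCK", "2Pb", "2", "09", "F", "G"]

vocab, a2, b2 = binary_vectors(a, b)
print(cosine(a2, b2))               # about 0.333: 2 shared terms, 6 terms each
print(0.5 * pearson(a2, b2) + 0.5)  # about 0.167: rescaled correlation
```

The two scores differ because centering treats a term that is absent from both vectors as agreement, while the plain cosine only counts shared terms. Which one fits depends on what "similar" should mean for your data.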
Using the lsa package and that package's manual:
# create some files
library('lsa')
td = tempfile()
dir.create(td)
write( c("HDa","2Pb","2","BxU","BuQ","Bve"), file=paste(td, "D1", sep="/"))
write( c("HCK","2Pb","2","09","F","G"), file=paste(td, "D2", sep="/"))
# read files into a document-term matrix
myMatrix = textmatrix(td, minWordLength=1)
EDIT: To show what the myMatrix object looks like:
myMatrix
#myMatrix
#       docs
# terms D1 D2
#   2    1  1
#   2pb  1  1
#   buq  1  0
#   bve  1  0
#   bxu  1  0
#   hda  1  0
#   09   0  1
#   f    0  1
#   g    0  1
#   hck  0  1
# Calculate cosine similarity
res <- lsa::cosine(myMatrix[,1], myMatrix[,2])
res
#0.3333
library(tm)  # provides VCorpus, VectorSource, DocumentTermMatrix, weightTf
CSString_vector <- c("Hi Hello", "Hello")
corp <- VCorpus(VectorSource(CSString_vector))
# keep one-character terms and weight by raw term frequency
controlForMatrix <- list(removePunctuation = TRUE, wordLengths = c(1, Inf), weighting = weightTf)
dtm <- DocumentTermMatrix(corp, control = controlForMatrix)
matrix_of_vector <- as.matrix(dtm)
# rows are documents, columns are terms
res <- lsa::cosine(matrix_of_vector[1, ], matrix_of_vector[2, ])
This may be the better option for larger data sets.
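To make the term-frequency idea behind a document-term matrix concrete, here is a small sketch in plain Python. It assumes lowercasing and whitespace tokenization only (no punctuation removal), and the function names are hypothetical. It is an illustration of the technique, not a reimplementation of tm or lsa.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count lowercased, whitespace-separated tokens (a simplifying assumption)."""
    return Counter(text.lower().split())

def cosine_tf(text_a, text_b):
    """Cosine similarity between the term-frequency vectors of two strings."""
    tf_a = term_frequencies(text_a)
    tf_b = term_frequencies(text_b)
    # A Counter returns 0 for missing terms, so only shared terms contribute.
    dot = sum(count * tf_b[term] for term, count in tf_a.items())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an empty document as dissimilar to everything
    return dot / (norm_a * norm_b)

score = cosine_tf("Hi Hello", "Hello")
print(round(score, 4))  # 0.7071, i.e. 1 / sqrt(2)
```

Because term counts are never negative, this score always falls between 0 and 1.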
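More advanced embeddings might give you better results. Please check the following code. It uses the Universal Sentence Encoder model, which generates sentence embeddings with a transformer-based architecture.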
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(module_url)
print("module %s loaded" % module_url)

def embed(text):
    # The model takes a list of strings and returns one embedding per string.
    return model([text])

paragraph = [
    "Universal Sentence Encoder embeddings also support short paragraphs. ",
    "Universal Sentence Encoder support paragraphs"]

# Inner product of the two sentence embeddings, used as the similarity score.
print(np.inner(embed(paragraph[0]), embed(paragraph[1])))
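One caveat about the snippet above: an inner product equals the cosine only when both vectors have unit length. If you want a true cosine regardless of how the embeddings are scaled, you can normalize explicitly. The sketch below uses plain Python and hypothetical 3-dimensional vectors as stand-ins for real sentence embeddings. The function names are made up for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_u * norm_v)

def to_unit_interval(cos):
    """Map a cosine in [-1, 1] linearly onto [0, 1]."""
    return 0.5 * cos + 0.5

# Hypothetical toy vectors, not real model output.
emb_a = [3.0, 4.0, 0.0]
emb_b = [4.0, 3.0, 0.0]

inner = sum(x * y for x, y in zip(emb_a, emb_b))
cos = cosine_similarity(emb_a, emb_b)
print(inner)                  # 24.0: raw inner product, depends on vector length
print(cos)                    # 0.96: same direction measure, length removed
print(to_unit_interval(cos))  # about 0.98
```

Dense embeddings can have negative components, so their cosine can be negative. The linear rescaling is one simple way to get the number between 0 and 1 that the question asks for.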