cosine_sim between a text and a single column in a dataset

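I have a dataset that I need to lemmatize, which I have done below. I then need to find the similarity between one column, "text", and the phrase "vaccine is deadly", but I am not sure how to use the cosine similarity function correctly. I tried putting the text into a variable and passing it in, but it does not work.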

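import re
import pandas as pd
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texttweet2 = pd.read_csv("../input/pfizer-vaccine-tweets/vaccination_tweets.csv")

wordnet_lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

def tokenize(str_input):
    # Replace non-letters with spaces, lowercase, and split into words
    words = re.sub(r"(?u)[^A-Za-z]", " ", str_input).lower().split(" ")
    # Stem, then lemmatize, keeping only words longer than two characters
    words = [stemmer.stem(word) for word in words if len(word) > 2]
    words = [wordnet_lemmatizer.lemmatize(word) for word in words if len(word) > 2]
    return words

vectorizer = TfidfVectorizer(tokenizer=tokenize, stop_words='english')
vectors = vectorizer.fit_transform(texttweet2['text'])
feature_names = vectorizer.get_feature_names()  # get_feature_names_out() on scikit-learn >= 1.0
texttweet_tfidf = pd.DataFrame(vectors.toarray(), columns=feature_names)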

I tried to compute the cosine similarity:

x = "vaccine is deadly"

cosine_sim = cosine_similarity(x, texttweet_tfidf)

But I get this error: could not convert string to float: 'vaccine is deadly'

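cosine_similarity takes numeric vectors, not strings. You need to use the vectorizer to convert the string into a vector first. Call transform rather than fit_transform, so the string is mapped onto the vocabulary already learned from the tweets.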
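To see why a raw string cannot work, it may help to recall what cosine similarity computes: the dot product of two numeric vectors divided by the product of their lengths. Below is a minimal pure-Python sketch of that formula. The helper name cosine and the toy vectors are illustrative assumptions, not part of scikit-learn or of the question's data.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used here: similarity with a zero vector is 0
    return dot / (norm_u * norm_v)

# Toy term-weight vectors over a hypothetical 3-word vocabulary
query = [1.0, 0.0, 1.0]
tweet_a = [1.0, 0.0, 1.0]   # same direction as the query
tweet_b = [0.0, 1.0, 0.0]   # shares no terms with the query

sim_a = cosine(query, tweet_a)   # close to 1: identical direction
sim_b = cosine(query, tweet_b)   # 0: no overlap
```

Every step is arithmetic on numbers, so the text has to be turned into a vector first. Applied to the question's code, that means vectorizing the query string: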

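x = "vaccine is deadly"
# Map the query onto the vocabulary learned from the tweets
x_vector = vectorizer.transform([x])
# Build the frame from the query vector, not from the tweet matrix
x_tfidf = pd.DataFrame(x_vector.toarray(), columns=feature_names)

# One row of scores: the query against every tweet
cosine_sim = cosine_similarity(x_tfidf, texttweet_tfidf)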
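For reference, here is a self-contained sketch of the same idea that runs without the Kaggle CSV or NLTK. The three tweets are made-up stand-ins for texttweet2['text'], and the default tokenizer replaces the custom tokenize function, so treat it as an illustration under those assumptions rather than a drop-in replacement. It also shows that cosine_similarity accepts the sparse matrices directly, so converting to a DataFrame is optional.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in for texttweet2['text']
tweets = [
    "the vaccine is safe and effective",
    "got my first dose today",
    "some say the vaccine is deadly",
]

vectorizer = TfidfVectorizer(stop_words="english")
tweet_vectors = vectorizer.fit_transform(tweets)   # one row per tweet

# transform (not fit_transform) so the query uses the same vocabulary
query_vector = vectorizer.transform(["vaccine is deadly"])   # a single row

# Sparse inputs are accepted directly; the result has shape (1, n_tweets)
scores = cosine_similarity(query_vector, tweet_vectors).ravel()

# Index of the tweet most similar to the query
best = int(np.argmax(scores))
print(tweets[best], scores[best])
```

With the real data, the flattened scores array has one entry per row of the dataset, so it could be stored as a new column next to "text" and sorted to find the closest tweets.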