Predicting new content for text-clustering using sklearn

I am trying to understand how to create text clusters using sklearn. I have 800 texts (600 for training and 200 for testing), which look like this:

Texts  # column name

  1 Donald Trump, Donald Trump news, Trump bleach, Trump injected bleach, bleach coronavirus.
  2 Thank you Janey.......laughing so much at this........you have saved my sanity in these mad times. Only bleach Trump is using is on his heed 
  3 His more uncharitable critics said Trump had suggested that Americans drink bleach. Trump responded that he was being sarcastic.
  4 Outcry after Trump suggests injecting disinfectant as treatment.
  5 Trump Suggested 'Injecting' Disinfectant to Cure Coronavirus?
  6 The study also showed that bleach and isopropyl alcohol killed the virus in saliva or respiratory fluids in a matter of minutes.

I want to create clusters out of these. To transform the corpus into vector space I used tf-idf, and I cluster the documents with the k-means algorithm. However, I cannot tell whether the results are what I should expect, because unfortunately the output is not 'graphical' (I tried using CountVectorizer to get a frequency matrix, but I am probably using it the wrong way). What I would expect from tf-idf is that when I run the test dataset:

test_dataset = ["'Please don't inject bleach': Trump's wild coronavirus claims prompt disbelief.", "Donald Trump has won the shock and ire of the scientific and medical communities after suggesting bogus treatments for Covid-19", "Bleach manufacturers have warned people not to inject themselves with disinfectant after Trump falsely suggested it might cure the coronavirus."]

(the test dataset comes from the df["0"]['Names'] column) I would like to see which cluster (created by k-means) each text belongs to. Please see the code I am currently using below:

import re
import nltk
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.cluster import KMeans

stop_words = stopwords.words('english')

lemmatizer = WordNetLemmatizer()

def preprocessing(line):
    # lower-case, strip non-letters, tokenize, drop stopwords, lemmatize
    line = re.sub(r"[^a-zA-Z]", " ", line.lower())
    words = word_tokenize(line)
    return [lemmatizer.lemmatize(w) for w in words if w not in stop_words]

tfidf_vectorizer = TfidfVectorizer(tokenizer=preprocessing)
vec = CountVectorizer()

# tf-idf matrix for clustering, raw-count matrix for inspection
tfidf = tfidf_vectorizer.fit_transform(df["0"]['Names'])
matrix = vec.fit_transform(df["0"]['Names'])

kmeans = KMeans(n_clusters=2).fit(tfidf)
pd.DataFrame(matrix.toarray(), columns=vec.get_feature_names())

where df["0"]['Names'] is the column 'Names' of the 0th dataframe. A visual example, even with a different dataset as long as the dataframe structure is much the same (just for better understanding), would also be nice.

Thank you very much for all the help. Thanks.

Take your test_data and add four more sentences to make a corpus:

train_data = ["'Please don't inject bleach': Trump's wild coronavirus claims prompt disbelief.",
              "Donald Trump has won the shock and ire of the scientific and medical communities after suggesting bogus treatments for Covid-19",
              "Bleach manufacturers have warned people not to inject themselves with disinfectant after Trump falsely suggested it might cure the coronavirus.",
              "find the most representative document for each topic",
              "topic distribution across documents",
              "to help with understanding the topic",
              "one of the practical application of topic modeling is to determine"]

Create a dataframe from the above dataset:

df = pd.DataFrame(train_data, columns=['text'])

Now you can use CountVectorizer or TfidfVectorizer to vectorize the text; I am using TfidfVectorizer here (with the preprocessing tokenizer defined in the question):

vect = TfidfVectorizer(tokenizer=preprocessing)
vectorized_text = vect.fit_transform(df['text'])

kmeans = KMeans(n_clusters=2).fit(vectorized_text)
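By the way, if you are not sure how many clusters to use, one quick sanity check (just a sketch reusing the vectorized_text from above, not part of the original answer) is to compare silhouette scores for a few values of k:

from sklearn.metrics import silhouette_score

# try a few cluster counts; a higher silhouette (range -1 to 1)
# means tighter, better-separated clusters
for k in range(2, 6):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(vectorized_text)
    print(k, silhouette_score(vectorized_text, labels))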

# now predict the cluster for each document in the corpus
df['predicted cluster'] = kmeans.predict(vectorized_text)
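To check whether the clusters make sense (the "are the results expected" part of the question), you can print the highest-weighted terms of each k-means centroid. This is just a sketch reusing the vect and kmeans objects above; on sklearn older than 1.0, use get_feature_names() instead of get_feature_names_out():

terms = vect.get_feature_names_out()
for i, centroid in enumerate(kmeans.cluster_centers_):
    # indices of the five largest centroid weights = most characteristic terms
    top = centroid.argsort()[::-1][:5]
    print(f"cluster {i}:", ", ".join(terms[j] for j in top))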

Now, when you want to predict on test data or new data:

new_sent = 'coronavirus has created lot of problem in the world'
kmeans.predict(vect.transform([new_sent]))  # use transform only here, not fit_transform

# output
array([1])
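The question also asked for a visual example. One way to get a rough picture (a sketch, not part of the original answer) is to project the tf-idf matrix down to two dimensions with TruncatedSVD, which works directly on sparse matrices, and colour each document by its predicted cluster:

import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD

# project the sparse tf-idf matrix to 2-D and scatter-plot it,
# coloured by the cluster labels computed above
svd = TruncatedSVD(n_components=2, random_state=42)
coords = svd.fit_transform(vectorized_text)

plt.scatter(coords[:, 0], coords[:, 1], c=df['predicted cluster'])
plt.title('k-means clusters in 2-D SVD space')
plt.show()

With only a handful of documents the plot is sparse, but on the full 800-text dataset it gives a quick sense of whether the clusters separate.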