tf-idf for text cluster-analysis
I would like to group the short texts contained in the df['Texts'] column of a dataframe.
Examples of the sentences to analyse are shown below:
Texts
1 Donald Trump, Donald Trump news, Trump bleach, Trump injected bleach, bleach coronavirus.
2 Thank you Janey.......laughing so much at this........you have saved my sanity in these mad times. Only bleach Trump is using is on his heed
3 His more uncharitable critics said Trump had suggested that Americans drink bleach. Trump responded that he was being sarcastic.
4 Outcry after Trump suggests injecting disinfectant as treatment.
5 Trump Suggested 'Injecting' Disinfectant to Cure Coronavirus?
6 The study also showed that bleach and isopropyl alcohol killed the virus in saliva or respiratory fluids in a matter of minutes.
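Since I know that TF-IDF is useful for clustering, I have been using the following lines of code (taken from some previous questions in the community):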
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import re
import string
def preprocessing(line):
    line = line.lower()
    line = re.sub(r"[{}]".format(string.punctuation), " ", line)
    return line
tfidf_vectorizer = TfidfVectorizer(preprocessor=preprocessing)
tfidf = tfidf_vectorizer.fit_transform(all_text)
kmeans = KMeans(n_clusters=2).fit(tfidf) # the number of clusters could be manually changed
However, since my texts come from a dataframe column, I do not know how to apply the code above.
Could you help me?
def preprocessing(line):
    line = line.lower()
    line = re.sub(r"[{}]".format(string.punctuation), " ", line)
    return line
tfidf_vectorizer = TfidfVectorizer(preprocessor=preprocessing)
tfidf = tfidf_vectorizer.fit_transform(df['Texts'])
kmeans = KMeans(n_clusters=2).fit(tfidf)
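You only need to replace all_text with your dataframe column, df['Texts']. It would be better to build a pipeline first and then apply the vectorizer and KMeans together.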
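As a rough sketch of the pipeline idea, something like the following might work. It assumes scikit-learn's Pipeline class and reuses the preprocessing function from the question; the step names "tfidf" and "kmeans", the "cluster" column, and the n_init and random_state values are illustrative choices of mine rather than anything from the original code.

```python
import re
import string

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline


def preprocessing(line):
    # Lowercase the text and replace punctuation with spaces.
    line = line.lower()
    line = re.sub(r"[{}]".format(string.punctuation), " ", line)
    return line


# Sample rows copied from the question; a real df would be loaded elsewhere.
df = pd.DataFrame({"Texts": [
    "Donald Trump, Donald Trump news, Trump bleach, Trump injected bleach, bleach coronavirus.",
    "Thank you Janey.......laughing so much at this........you have saved my sanity in these mad times. Only bleach Trump is using is on his heed",
    "His more uncharitable critics said Trump had suggested that Americans drink bleach. Trump responded that he was being sarcastic.",
    "Outcry after Trump suggests injecting disinfectant as treatment.",
    "Trump Suggested 'Injecting' Disinfectant to Cure Coronavirus?",
    "The study also showed that bleach and isopropyl alcohol killed the virus in saliva or respiratory fluids in a matter of minutes.",
]})

# Chain the vectorizer and the clusterer so they are fitted in one call.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(preprocessor=preprocessing)),
    # random_state is fixed only to make the sketch repeatable (assumption).
    ("kmeans", KMeans(n_clusters=2, n_init=10, random_state=0)),
])

# fit_predict returns one cluster label per row of df['Texts'].
df["cluster"] = pipeline.fit_predict(df["Texts"])
print(df[["Texts", "cluster"]])
```

Storing the labels in a new column keeps each text next to its cluster, which makes it easier to inspect the groups by eye and to try other values of n_clusters.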
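Also, to get more precise results, doing more preprocessing on the text is never a bad idea. That said, I don't think lowercasing the text is always a good idea, because you lose a useful feature of the writing style (for example, if you want to identify the author or assign authors to a group). For getting the sentiment of the sentences, though, lowercasing is better.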
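To illustrate the point about lowercasing, here is a small sketch that compares the vocabulary TfidfVectorizer learns with and without its lowercase option, using two of the sample sentences. It uses the vectorizer's default settings instead of the custom preprocessing function, because a custom preprocessor replaces the built-in lowercasing step; to keep case with that function you would drop its line.lower() call. The variable names are mine.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two of the sample sentences from the question.
texts = [
    "Trump Suggested 'Injecting' Disinfectant to Cure Coronavirus?",
    "Outcry after Trump suggests injecting disinfectant as treatment.",
]

# lowercase=False keeps "Injecting" and "injecting" as separate features.
cased = TfidfVectorizer(lowercase=False).fit(texts)
# lowercase=True (the default) merges them into one feature.
lowered = TfidfVectorizer(lowercase=True).fit(texts)

cased_vocab = set(cased.vocabulary_)
lowered_vocab = set(lowered.vocabulary_)

print("case-sensitive features:", len(cased_vocab))
print("lowercased features:", len(lowered_vocab))
print("capitalised tokens kept only without lowercasing:",
      sorted(cased_vocab - lowered_vocab))
```

Keeping case gives more features, which preserves stylistic signal but also spreads the same word over several columns, so on a small corpus it may make clusters harder to separate.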