Text Analytics, DocumentTermMatrix in R translated into Python
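I have the following code in R and am looking for the equivalent in Python. I want to take the words from a set of texts, clean them (remove punctuation, convert to lowercase, strip whitespace, etc.), and turn them into variables in matrix form that can be used in a predictive model.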
library(tm)

text <- c("amazing flight",
          "got there early",
          "great prices on flights??")
mydata_1 <- data.frame(text)

# Build a corpus, then lowercase, strip punctuation,
# drop English stop words, and collapse extra whitespace
corpus <- Corpus(DataframeSource(mydata_1))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# One row per document, one column per remaining term
dtm_1 <- DocumentTermMatrix(corpus)
final_output <- as.matrix(dtm_1)
The output looks like this, where the words "amazing", "early", etc. are now binary input variables that I can use in a model:
Docs amazing early flight flights got great prices
   1       1     0      1       0   0     0      0
   2       0     1      0       0   1     0      0
   3       0     0      0       1   0     1      1
How can this be done in Python?
I found the answer. The Python equivalent of DocumentTermMatrix is called CountVectorizer, from scikit-learn.
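import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

text = ["amazing flight", "got there early", "great prices on flights??"]

# By default CountVectorizer lowercases the text and ignores punctuation
vectorizer = CountVectorizer()
# Sparse document-term matrix: one row per document, one column per term
X = vectorizer.fit_transform(text)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
Y = vectorizer.get_feature_names_out()
final_output = pd.DataFrame(X.toarray(), columns=Y)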
The result is as follows:
   amazing  early  flight  flights  got  great  on  prices  there
0        1      0       1        0    0      0   0       0      0
1        0      1       0        0    1      0   0       0      1
2        0      0       0        1    0      1   1       1      0