Using CountVectorizer to build a vocabulary and remove extra stop words
I have a list of sentences in a pandas column:

sentence
I am writing on stackoverflow because I cannot find a solution to my problem.
I am writing on stackoverflow.
I need to show some code.
Please see the code below
I would like to run some text mining and analysis over them, such as getting word frequencies. To do that, I am using this approach:
from sklearn.feature_extraction.text import CountVectorizer
# list of text documents
text = ["I am writing on Whosebug because I cannot find a solution to my problem."]
vectorizer = CountVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
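
After fitting, you can inspect the learned vocabulary and encode documents as count vectors; a minimal sketch (get_feature_names_out needs scikit-learn >= 1.0, older releases use get_feature_names instead):

print(vectorizer.get_feature_names_out())  # tokens in vocabulary order
counts = vectorizer.transform(text)        # sparse document-term matrix
print(counts.toarray())                    # dense counts, one row per document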
How can I apply this to my column so that the extra stop words are removed when the vocabulary is built?
You can use the stop_words parameter of CountVectorizer, which will take care of removing the stop words:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

# if needed, fetch the corpus first: import nltk; nltk.download("stopwords")
text = ["I am writing on stackoverflow because I cannot find a solution to my problem."]
stop_words = stopwords.words("english")  # you may add or define your stop words here
vectorizer = CountVectorizer(stop_words=stop_words)
vectorizer.fit_transform(text)
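
The stop word list is a plain Python list, so you can extend it with your own words. For example, "cannot" is not in NLTK's default English list (it survives in the vocabulary shown below); a sketch that drops it as well:

stop_words = stopwords.words("english") + ["cannot"]  # append domain-specific words
vectorizer = CountVectorizer(stop_words=stop_words)
vectorizer.fit_transform(text)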
If you want to do all of the preprocessing inside a pandas DataFrame:
import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

text = ["I am writing on stackoverflow because I cannot find a solution to my problem.", "I am writing on stackoverflow."]
df = pd.DataFrame({"text": text})

stop_words = stopwords.words("english")  # you may add or define your stop words here
vectorizer = CountVectorizer(stop_words=stop_words)
df["counts"] = vectorizer.fit_transform(df["text"]).todense().tolist()
df
   text                                               counts
0  I am writing on stackoverflow because I canno...  [1, 1, 1, 1, 1, 1]
1  I am writing on stackoverflow.                    [0, 0, 0, 0, 1, 1]
In both cases, the stop words are removed from your vocabulary:
print(vectorizer.vocabulary_)
{'writing': 5, 'stackoverflow': 4, 'cannot': 0, 'find': 1, 'solution': 3, 'problem': 2}
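
Since the original goal was word frequencies, you can also aggregate the counts over the whole column; a minimal sketch, reusing the df and the fitted vectorizer from above:

counts = vectorizer.fit_transform(df["text"])  # sparse document-term matrix
totals = counts.sum(axis=0).tolist()[0]        # total count of each vocabulary word
print(dict(zip(vectorizer.get_feature_names_out(), totals)))
# {'cannot': 1, 'find': 1, 'problem': 1, 'solution': 1, 'stackoverflow': 2, 'writing': 2}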