Remove stopwords from most common words from set of sentences in Python
I have 5 sentences in an np.array and I want to find the n most frequently occurring words. For example, if n=5, I want the 5 most common words. I have an example below:
0 rt my mother be on school amp race
1 rt i am a red hair down and its a great
2 rt my for your every day and my chocolate
3 rt i am that red human being a man
4 rt my mother be on school and wear
Here is the code I use to get the n most common words.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

A = np.array(["rt my mother be on school amp race",
              "rt i am a red hair down and its a great",
              "rt my for your every day and my chocolate",
              "rt i am that red human being a man",
              "rt my mother be on school and wear"])
n = 5
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(A)
vocabulary = vectorizer.get_feature_names()
ind = np.argsort(X.toarray().sum(axis=0))[-n:]
top_n_words = [vocabulary[a] for a in ind]
print(top_n_words)
The result is as follows:
['school', 'am', 'and', 'my', 'rt']
However, I want to ignore stopwords such as 'and', 'am' and 'my' among these most common words. How can I do that?
You just need to pass the argument stop_words='english' to CountVectorizer():
vectorizer = CountVectorizer(stop_words='english')
You should now get:
['wear', 'mother', 'red', 'school', 'rt']
See the documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
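If the built-in English list doesn't match your data, CountVectorizer also accepts an explicit list of stop words. A minimal sketch, assuming an illustrative custom word list (not part of the original answer):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

A = np.array(["rt my mother be on school amp race",
              "rt i am a red hair down and its a great",
              "rt my for your every day and my chocolate",
              "rt i am that red human being a man",
              "rt my mother be on school and wear"])

# An illustrative custom stop-word list: any list of strings works here.
custom_stops = ["rt", "and", "am", "my", "on", "be"]
vectorizer = CountVectorizer(stop_words=custom_stops)
X = vectorizer.fit_transform(A)

n = 5
counts = X.toarray().sum(axis=0)
# get_feature_names_out() replaced get_feature_names() in scikit-learn 1.0.
vocabulary = (vectorizer.get_feature_names_out()
              if hasattr(vectorizer, "get_feature_names_out")
              else vectorizer.get_feature_names())
ind = np.argsort(counts)[-n:]
top_n_words = [vocabulary[a] for a in ind]
print(top_n_words)
```

Unlike stop_words='english', this also lets you drop corpus-specific tokens such as 'rt', which still appears in the output above because it is not in the built-in English list.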
import numpy as np
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

stop_words = set(stopwords.words('english'))

A = np.array(["rt my mother be on school amp race",
              "rt i am a red hair down and its a great",
              "rt my for your every day and my chocolate",
              "rt i am that red human being a man",
              "rt my mother be on school and wear"])

# Remove NLTK stopwords from each sentence before vectorizing.
data = []
for sentence in A:
    kept = [w for w in sentence.split() if w not in stop_words]
    data.append(" ".join(kept))

vect = CountVectorizer()
x = vect.fit_transform(data)
# Note: get_feature_names() was removed in scikit-learn 1.2;
# use get_feature_names_out() there instead.
keyword = vect.get_feature_names()

# Total occurrences of each word across all sentences.
totals = x.toarray().sum(axis=0)

# Indices of words sorted by count, descending; print the top 5.
order = np.argsort(totals)[::-1]
print([keyword[i] for i in order[:5]])
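For comparison, the same top-n count can be done with the standard library alone. A minimal sketch using collections.Counter, with a small hard-coded set standing in for NLTK's stopwords.words('english'):

```python
from collections import Counter

sentences = ["rt my mother be on school amp race",
             "rt i am a red hair down and its a great",
             "rt my for your every day and my chocolate",
             "rt i am that red human being a man",
             "rt my mother be on school and wear"]

# Small hard-coded stand-in for nltk's stopwords.words('english').
stop_words = {"my", "be", "on", "i", "am", "a", "and", "its",
              "for", "your", "that"}

# Count every token that survives the stop-word filter.
counts = Counter(w for s in sentences for w in s.split()
                 if w not in stop_words)
top_n = [w for w, _ in counts.most_common(5)]
print(top_n)
```

This avoids the scikit-learn dependency entirely, at the cost of CountVectorizer's built-in tokenization (lowercasing, minimum token length, and so on).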