Removing stopwords and tokenization in Python
I have the following input data. I want to remove the stopwords from it and tokenize it:
input = [['Hi i am going to college', 'We will meet next time possible'],
         ['My college name is jntu', 'I am into machine learning specialization'],
         ['Machine learnin is my favorite subject', 'Here i am using python for implementation']]
I tried the following code, but it does not give the result I want:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(input)
filtered_sentence = [w for w in word_tokens if not w in stop_words]
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)
#print(word_tokens)
print(filtered_sentence)
The expected output is:
Output = [['Hi', 'going', 'college', 'meet','next', 'time', 'possible'],
['college', 'name','jntu', 'machine', 'learning', 'specialization'],
['Machine', 'learnin', 'favorite', 'subject' ,'using', 'python', 'implementation']]
Start as before:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
stop_words = set(stopwords.words('english'))
input_ = [['Hi i am going to college', 'We will meet next time possible'],
['My college name is jntu', 'I am into machine learning specialization'],
['Machine learnin is my favorite subject' ,'Here i am using python for implementation']]
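I think it is better to name your input input_, because input is already the name of a built-in function in Python.
I would start by flattening your input. We want a flat list of sentences rather than a nested list of lists: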
input_flatten = [sentence for sublist in input_ for sentence in sublist]
print(input_flatten)
>>>['Hi i am going to college',
'We will meet next time possible',
'My college name is jntu',
'I am into machine learning specialization',
'Machine learnin is my favorite subject',
'Here i am using python for implementation']
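The same one-level flattening can also be written with itertools.chain.from_iterable from the standard library. This is only a sketch of an equivalent alternative, not part of the original answer; input_ is redefined here so the snippet runs by itself.

```python
from itertools import chain

input_ = [['Hi i am going to college', 'We will meet next time possible'],
          ['My college name is jntu', 'I am into machine learning specialization'],
          ['Machine learnin is my favorite subject', 'Here i am using python for implementation']]

# chain.from_iterable walks each sublist in order and yields its sentences,
# so wrapping it in list() gives one flat list of six sentences.
input_flatten = list(chain.from_iterable(input_))
print(input_flatten)
```

Both forms only remove one level of nesting, which is all this input needs.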
Then you can loop over each sentence and remove the stopwords like this:
sentences_without_stopwords = []
for sentence in input_flatten:
    sentence_tokenized = word_tokenize(sentence)
    stop_words_removed = [word for word in sentence_tokenized if word not in stop_words]
    sentences_without_stopwords.append(stop_words_removed)
print(sentences_without_stopwords)
>>>[['Hi', 'going', 'college'],
['We', 'meet', 'next', 'time', 'possible'],
['My', 'college', 'name', 'jntu'],
['I', 'machine', 'learning', 'specialization'],
['Machine', 'learnin', 'favorite', 'subject'],
['Here', 'using', 'python', 'implementation']]
I believe this will help you.
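from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
op = []
for item in input_:  # input_ is the nested list of sentences defined above
    # Join the two strings of each item, lowercase them, then tokenize
    word_tokens = word_tokenize(' '.join(item).lower())
    filtered_sentence = [w for w in word_tokens if w not in stop_words]
    op.append(filtered_sentence)
print(op)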
Each item in the list contains two strings, so join them into a single string, tokenize it, and then remove the stopwords.
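Lowercasing the joined string also lowercases the output, while the expected output in the question keeps words such as 'Hi' and 'Machine' capitalized. One possible variation is to lowercase each token only for the stopword comparison. The sketch below illustrates that idea under loud assumptions: STOP_WORDS is a tiny hand-written stand-in (not the NLTK list) and str.split replaces word_tokenize, so the snippet runs without downloading any NLTK data. The function name remove_stopwords_grouped is hypothetical.

```python
# Hypothetical stand-in for set(stopwords.words('english')); NOT the full NLTK list.
STOP_WORDS = {"i", "am", "to", "we", "will", "my", "is", "into", "here", "for"}


def remove_stopwords_grouped(groups, stop_words=STOP_WORDS, tokenize=str.split):
    """Join the strings of each group, tokenize, and drop stopwords.

    The comparison is case-insensitive, but surviving tokens keep their
    original casing. `tokenize` defaults to str.split as a stand-in; with
    NLTK installed, word_tokenize could be passed instead.
    """
    result = []
    for group in groups:
        tokens = tokenize(" ".join(group))
        result.append([t for t in tokens if t.lower() not in stop_words])
    return result


input_ = [['Hi i am going to college', 'We will meet next time possible'],
          ['My college name is jntu', 'I am into machine learning specialization'],
          ['Machine learnin is my favorite subject', 'Here i am using python for implementation']]

output = remove_stopwords_grouped(input_)
for row in output:
    print(row)
```

With this stand-in stopword set, the three rows match the expected output listed in the question. With the real NLTK list and tokenizer, the call would look like remove_stopwords_grouped(input_, set(stopwords.words('english')), word_tokenize), and the result depends on what that list contains.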