Removing stop-words and selecting only names in pandas
I am trying to extract the top words by date as follows:
df.set_index('Publishing_Date').Quotes.str.lower().str.extractall(r'(\w+)')[0].groupby('Publishing_Date').value_counts().groupby('Publishing_Date')
from the following dataframe:
import pandas as pd

# initialize
data = [
    ['20/05', "So many books, so little time."],
    ['20/05', "The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid."],
    ['19/05', "Don't be pushed around by the fears in your mind. Be led by the dreams in your heart."],
    ['19/05', "Be the reason someone smiles. Be the reason someone feels loved and believes in the goodness in people."],
    ['19/05', "Do what is right, not what is easy nor what is popular."],
]

# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Publishing_Date', 'Quotes'])
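As you can see, there are many stop-words ("the", "an", "a", "be", ...) that I would like to remove in order to get a better selection. My aim is to find key words, i.e. patterns, that quotes share by date, so I am more interested in names than in verbs.
Any idea how I could remove the stop-words and keep only the names?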
Edit
Expected output (based on the results from Vaibhav Khandelwal's answer below):
Publishing_Date    Quotes    Nouns
20/05              ....      books, time, person, gentleman, lady, novel
19/05              ....      fears, mind, dreams, heart, reason, smiles
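I only need to extract the nouns ("reason" should be more frequent, so the list would be ordered by frequency).
I think nltk.pos_tag should be useful here, keeping the words whose tag is in ('NN').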
This is how you can remove stop-words from your text:
import nltk
from nltk.corpus import stopwords

# Requires the NLTK stop-word corpus: nltk.download('stopwords')
def remove_stopwords(text):
    stop_words = stopwords.words('english')
    fresh_text = []
    # Keep the lowercased, whitespace-separated words that are not stop-words
    for i in text.lower().split():
        if i not in stop_words:
            fresh_text.append(i)
    return ' '.join(fresh_text)

df['text'] = df['Quotes'].apply(remove_stopwords)
Note: if you want to remove additional words, you can append them explicitly to the stop-word list.
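As a rough illustration of that note, the sketch below adds caller-supplied words to the stop-word list before filtering. It is not the answer's code: BASE_STOP_WORDS is a tiny hypothetical stand-in so the snippet runs without NLTK data (in practice it would be stopwords.words('english')), the extra_stop_words parameter is made up for this example, and the regex tokenizer is an assumption that also strips punctuation.

```python
import re

# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
BASE_STOP_WORDS = {"the", "a", "an", "be", "in", "so", "by", "your"}


def remove_stopwords(text, extra_stop_words=()):
    """Drop base stop-words plus any extra words supplied by the caller."""
    # A set gives fast membership tests; extra words are lowercased to match the tokens.
    stop_words = set(BASE_STOP_WORDS) | {w.lower() for w in extra_stop_words}
    # \w+ keeps runs of word characters, so trailing commas and periods are dropped.
    tokens = re.findall(r"\w+", text.lower())
    return " ".join(t for t in tokens if t not in stop_words)


print(remove_stopwords("So many books, so little time.",
                       extra_stop_words=["many", "little"]))
# prints: books time
```

With the real NLTK list, the same idea would be to build the set once outside the function, so it is not rebuilt for every row passed through apply.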
For the other half of your question, you can add another function to extract the nouns:
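# Requires NLTK's tokenizer and POS-tagger data (resource names vary by NLTK version)
def extract_noun(text):
    token = nltk.tokenize.word_tokenize(text)
    result = []
    # pos_tag yields (word, tag) pairs; noun tags start with 'NN' (NN, NNS, NNP, NNPS)
    for i in nltk.pos_tag(token):
        if i[1].startswith('NN'):
            result.append(i[0])
    return ', '.join(result)

df['NOUN'] = df['text'].apply(extract_noun)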
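The final output is the original dataframe with two added columns: text (the quote without stop-words) and NOUN (the comma-separated nouns found in it).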
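The question also asks for the nouns per date, ordered by frequency. One possible way to get there from a NOUN column is sketched below. The small frame is a hypothetical stand-in with hand-written noun strings (the real tagger output may differ), and rank_nouns is a helper name made up for this example.

```python
from collections import Counter

import pandas as pd

# Hypothetical stand-in for the frame built above: one row per quote,
# with NOUN holding comma-separated nouns (hand-written, not real tagger output).
df = pd.DataFrame({
    "Publishing_Date": ["20/05", "20/05", "19/05", "19/05"],
    "NOUN": [
        "books, time",
        "person, gentleman, lady, novel",
        "fears, mind, dreams, heart",
        "reason, smiles, reason, goodness, people",
    ],
})


def rank_nouns(noun_strings):
    """Count the nouns across all quotes of one date and list them, most frequent first."""
    counts = Counter()
    for s in noun_strings:
        counts.update(n for n in s.split(", ") if n)
    # most_common sorts by count (descending); ties keep first-seen order.
    return ", ".join(noun for noun, _ in counts.most_common())


# sort=False keeps the dates in their order of first appearance.
nouns_by_date = (
    df.groupby("Publishing_Date", sort=False)["NOUN"]
      .agg(rank_nouns)
      .reset_index(name="Nouns")
)
print(nouns_by_date)
```

Here "reason" moves to the front of the 19/05 list because it occurs twice, while nouns that occur once keep the order in which they first appeared.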