How to remove punctuation and irrelevant words with stopwords (Text Mining)
The libraries I am using are:
import pandas as pd
import string
from nltk.corpus import stopwords
import nltk
I have the following DataFrame:
df = pd.DataFrame({'Send': ['Golgi body, membrane-bound organelle of eukaryotic cells (cells with clearly defined nuclei).',
                            'The Golgi apparatus is responsible for transporting, modifying, and packaging proteins',
                            'Non-foliated metamorphic rocks do not have a platy or sheet-like structure.',
                            'The process of metamorphism does not melt the rocks.'],
                   'Class': ['biology', 'biology', 'geography', 'geography']})
print(df)
Send Class
Golgi body, membrane-bound organelle of eukary... biology
The Golgi apparatus is responsible for transpo... biology
Non-foliated metamorphic rocks do not have a p... geography
The process of metamorphism does not melt the ... geography
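I want to write a function that cleans the data in the 'Send' column. I would like to:
- remove punctuation;
- remove stopwords;
- return a new DataFrame whose 'Send' column contains the "clean words".
I tried to develop the following function: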
def Text_Process(mess):
    nopunc = [char for char in mess if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    return [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]
However, the return value is not quite what I want. When I run:
Text_Process(df['Send'])
the output is:
['Golgi', 'body,', 'membrane-bound', 'organelle', 'eukaryotic', 'cells', '(cells', 'clearly',
'defined', 'nuclei).The', 'Golgi', 'apparatus', 'responsible', 'transporting,',
'modifying,', 'packaging', 'proteinsNon-foliated', 'metamorphic', 'rocks',
'platy', 'sheet-like', 'structure.The', 'process', 'metamorphism',
'melt', 'rocks.']
I would like the output to be a DataFrame with the modified 'Send' column:
df = pd.DataFrame({'Send': ['Golgi membrane bound organelle eukaryotic cells cells clearly defined nuclei',
                            'Golgi apparatus responsible transporting modifying packaging proteins',
                            'Non foliated metamorphic rocks platy sheet like structure',
                            'process metamorphism melt rocks'],
                   'Class': ['biology', 'biology', 'geography', 'geography']})
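In other words, I want the DataFrame's 'Send' column to be clean (no punctuation and no irrelevant words).
Thank you.
Here is a script that cleans the column. Note that you may want to add more words to the stopword set to suit your requirements.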
import pandas as pd
import string
import re
from nltk.corpus import stopwords
df = pd.DataFrame(
{'Send': ['Golgi body, membrane-bound organelle of eukaryotic cells (cells with clearly defined nuclei).',
'The Golgi apparatus is responsible for transporting, modifying, and packaging proteins',
'Non-foliated metamorphic rocks do not have a platy or sheet-like structure.',
'The process of metamorphism does not melt the rocks.'],
'Class': ['biology', 'biology', 'geography', 'geography']})
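table = str.maketrans('', '', string.punctuation)
# Build the stopword set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))

def text_process(mess):
    # Split on runs of non-word characters (spaces, commas, hyphens, parentheses, ...)
    words = re.split(r'\W+', mess)
    # Strip any punctuation that \W does not match, such as underscores
    nopunc = [w.translate(table) for w in words]
    # Drop empty tokens and stopwords, then rejoin into a single string
    nostop = ' '.join([word for word in nopunc if word and word.lower() not in stop_words])
    return nostop

df['Send'] = df['Send'].apply(text_process)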
print(df)
Output:
Send Class
0 Golgi body membrane bound organelle eukaryotic cells cells clearly defined nuclei biology
1 Golgi apparatus responsible transporting modifying packaging proteins biology
2 Non foliated metamorphic rocks platy sheet like structure geography
3 process metamorphism melt rocks geography
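On adding more words to the stopword set: the first row above still contains "body", while the desired output in the question omits it, so extra domain-specific words may be what is needed. The following is only a sketch of that idea. `BASE_STOP_WORDS`, `EXTRA_STOP_WORDS`, and `clean_text` are hypothetical names, and the small hand-written base set stands in for `set(stopwords.words('english'))` so that the snippet runs without the NLTK corpus.

```python
import re

# Hypothetical stand-in for set(stopwords.words('english')); illustration only.
BASE_STOP_WORDS = {'the', 'of', 'is', 'for', 'and', 'do', 'not', 'have',
                   'a', 'or', 'does', 'with'}

# Assumed domain-specific additions; choose whatever is irrelevant in your data.
EXTRA_STOP_WORDS = {'body'}

# Set union keeps the base set untouched and makes lookups fast.
stop_words = BASE_STOP_WORDS | EXTRA_STOP_WORDS


def clean_text(text, stop_words):
    """Split on non-word characters, then drop empty tokens and stopwords."""
    tokens = re.split(r'\W+', text)
    return ' '.join(t for t in tokens if t and t.lower() not in stop_words)


sentence = ('Golgi body, membrane-bound organelle of eukaryotic cells '
            '(cells with clearly defined nuclei).')
print(clean_text(sentence, stop_words))
```

With the real NLTK set, the same union would be `set(stopwords.words('english')) | EXTRA_STOP_WORDS`, and the function could be applied to the column with `df['Send'].apply(lambda s: clean_text(s, stop_words))`.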