How to remove punctuation and irrelevant words with stopwords (Text Mining)

Question

The libraries I am using are:

      import pandas as pd
      import string
      from nltk.corpus import stopwords
      import nltk

I have the following DataFrame:

     df = pd.DataFrame({'Send': ['Golgi body, membrane-bound organelle of eukaryotic cells '
                                 '(cells with clearly defined nuclei).',
                                 'The Golgi apparatus is responsible for transporting, modifying, and '
                                 'packaging proteins',
                                 'Non-foliated metamorphic rocks do not have a platy or sheet-like '
                                 'structure.',
                                 'The process of metamorphism does not melt the rocks.'],
                        'Class': ['biology', 'biology', 'geography', 'geography']})

     print(df)

                              Send                           Class
         Golgi body, membrane-bound organelle of eukary...  biology
         The Golgi apparatus is responsible for transpo...  biology
         Non-foliated metamorphic rocks do not have a p...  geography
         The process of metamorphism does not melt the ...  geography

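I want to write a function that cleans the data in the 'Send' column. I would like to:

remove the punctuation;
remove the stopwords;
return a new DataFrame whose 'Send' column contains the "clean words".

I tried to develop the following function: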

      def Text_Process(mess): 
           nopunc = [char for char in mess if char not in string.punctuation]
           nopunc = ''.join(nopunc)  
           return [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]

However, the return value is not quite what I want. When I run:

        Text_Process(df['Send'])

the output is:

       ['Golgi', 'body,', 'membrane-bound', 'organelle', 'eukaryotic', 'cells', '(cells', 'clearly',
        'defined', 'nuclei).The', 'Golgi', 'apparatus', 'responsible',  'transporting,', 
        'modifying,', 'packaging', 'proteinsNon-foliated', 'metamorphic', 'rocks',
        'platy', 'sheet-like', 'structure.The', 'process', 'metamorphism',
        'melt', 'rocks.']

I would like the output to be the DataFrame with a modified 'Send' column:

       df = pd.DataFrame({'Send': ['Golgi membrane bound organelle eukaryotic cells cells '
                                   'clearly defined nuclei',
                                   'Golgi apparatus responsible transporting modifying '
                                   'packaging proteins',
                                   'Non foliated metamorphic rocks platy sheet like '
                                   'structure',
                                   'process metamorphism melt rocks'],
                          'Class': ['biology', 'biology', 'geography', 'geography']})

That is, I want the DataFrame's 'Send' column to be clean (no punctuation and no irrelevant words).

Thank you.

Answer 1

Here is a script that cleans the column. Note that you may want to add more words to the stopword set to suit your requirements.

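    import pandas as pd
    import string
    import re
    from nltk.corpus import stopwords  # needs a one-time nltk.download('stopwords')

    df = pd.DataFrame(
        {'Send': ['Golgi body, membrane-bound organelle of eukaryotic cells (cells with clearly defined nuclei).',
                  'The Golgi apparatus is responsible for transporting, modifying, and packaging proteins',
                  'Non-foliated metamorphic rocks do not have a platy or sheet-like structure.',
                  'The process of metamorphism does not melt the rocks.'],
         'Class': ['biology', 'biology', 'geography', 'geography']})

    # Translation table that deletes every character in string.punctuation.
    table = str.maketrans('', '', string.punctuation)

    # Build the stopword set once instead of reloading the list for every word.
    stop_words = set(stopwords.words('english'))

    def text_process(mess):
        # Splitting on runs of non-word characters also breaks up hyphenated words.
        words = re.split(r'\W+', mess)
        nopunc = [w.translate(table) for w in words]
        # Skip the empty strings re.split leaves at the edges, then drop stopwords.
        return ' '.join(word for word in nopunc if word and word.lower() not in stop_words)

    # Apply the cleaner to each row of the column, one string at a time.
    df['Send'] = df['Send'].apply(text_process)

    print(df)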

Output:

                                                                                 Send      Class
0  Golgi body membrane bound organelle eukaryotic cells cells clearly defined nuclei     biology
1               Golgi apparatus responsible transporting modifying packaging proteins    biology
2                          Non foliated metamorphic rocks platy sheet like structure   geography
3                                                    process metamorphism melt rocks   geography
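As for why the function in the question returned one flat list with words glued together (for example 'nuclei).The'): it was handed the whole column, so the loop visited entire sentences rather than characters. Here is a minimal sketch of that behaviour, using a plain list as a stand-in for `df['Send']` (iterating a pandas Series likewise yields one value per row). The two sample rows and the helper name `strip_punctuation` are only illustrative.

```python
import string

# Stand-in for df['Send']: two whole sentences, one per row (illustrative data).
rows = ["Golgi body, membrane-bound organelle.", "The process of metamorphism."]

# What the question's function effectively did: each "char" is an entire row.
# A whole sentence is never a substring of string.punctuation, so nothing is removed.
kept = [item for item in rows if item not in string.punctuation]
joined = "".join(kept)  # rows are concatenated with no separator

# Filtering characters one row at a time behaves as intended.
def strip_punctuation(text):
    return "".join(ch for ch in text if ch not in string.punctuation)

per_row = [strip_punctuation(r) for r in rows]

print(joined)
print(per_row)
```

With a DataFrame, the per-row version corresponds to `df['Send'].apply(strip_punctuation)`. Note that deleting the hyphen turns 'membrane-bound' into 'membranebound', which is one reason the script above splits on `\W+` instead.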

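On adding more words to the stopword set: one way to do it is to take the union of the base set with your own words. The sketch below runs without the NLTK corpus, so `BASE_STOPWORDS` is a small hand-picked stand-in for `set(stopwords.words('english'))`, and `EXTRA_STOPWORDS` and `clean_text` are hypothetical names chosen for illustration.

```python
import re

# Assumption: a tiny stand-in for set(stopwords.words('english')), not NLTK's real list.
BASE_STOPWORDS = {"the", "of", "is", "for", "and", "with",
                  "do", "does", "not", "have", "a", "or"}

# Hypothetical domain-specific words you might also consider irrelevant.
EXTRA_STOPWORDS = {"body", "cells"}

def clean_text(text, stop_words):
    # Split on runs of non-word characters, which removes punctuation and hyphens.
    tokens = re.split(r"\W+", text)
    # Drop empty tokens and anything whose lowercase form is a stopword.
    return " ".join(t for t in tokens if t and t.lower() not in stop_words)

stop_words = BASE_STOPWORDS | EXTRA_STOPWORDS

sentence = ("The Golgi apparatus is responsible for transporting, "
            "modifying, and packaging proteins")
cleaned = clean_text(sentence, stop_words)
print(cleaned)
```

With NLTK installed, the same idea would be `stop_words = set(stopwords.words('english')) | EXTRA_STOPWORDS`, and the column could be cleaned with `df['Send'].apply(lambda s: clean_text(s, stop_words))`.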