
How to remove punctuation and irrelevant words with stopwords (Text Mining)


      import pandas as pd
      import string
      from nltk.corpus import stopwords
      import nltk


      df = pd.DataFrame({'Send': ['Golgi body, membrane-bound organelle of eukaryotic cells (cells '
                                  'with clearly defined nuclei).',
                                  'The Golgi apparatus is responsible for transporting, modifying, and '
                                  'packaging proteins',
                                  'Non-foliated metamorphic rocks do not have a platy or sheet-like '
                                  'structure.',
                                  'The process of metamorphism does not melt the rocks.'],
                         'Class': ['biology', 'biology', 'geography', 'geography']})


                              Send                           Class
         Golgi body, membrane-bound organelle of eukary...  biology
         The Golgi apparatus is responsible for transpo...  biology
         Non-foliated metamorphic rocks do not have a p...  geography
         The process of metamorphism does not melt the ...  geography

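I want to write a function that cleans the data in the 'Send' column. It should:

  1. Remove punctuation;
  2. Remove stop words ('stopwords');
  3. Return a new dataframe in which the 'Send' column contains the "clean words".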


      def Text_Process(mess): 
           nopunc = [char for char in mess if char not in string.punctuation]
           nopunc = ''.join(nopunc)  
           return [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]

However, what the function returns is not quite what I want. When I run it, I get:



       ['Golgi', 'body,', 'membrane-bound', 'organelle', 'eukaryotic', 'cells', '(cells', 'clearly',
        'defined', 'nuclei).The', 'Golgi', 'apparatus', 'responsible',  'transporting,', 
        'modifying,', 'packaging', 'proteinsNon-foliated', 'metamorphic', 'rocks',
        'platy', 'sheet-like', 'structure.The', 'process', 'metamorphism',
        'melt', 'rocks.']

I would like the output to be the dataframe with a modified 'Send' column:

       df = pd.DataFrame({'Send': ['Golgi membrane bound organelle eukaryotic cells cells '
                                   'clearly defined nuclei',
                                   'Golgi apparatus responsible transporting modifying '
                                   'packaging proteins',
                                   'Non foliated metamorphic rocks platy sheet like '
                                   'structure',
                                   'process metamorphism melt rocks'],
                          'Class': ['biology', 'biology', 'geography', 'geography']})

That is, the dataframe's 'Send' column should be clean (no punctuation and no irrelevant words).
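
The flattened list in the question looks like what happens when the function receives the whole column rather than a single string. The call itself is not shown, so that is an assumption. The sketch below uses a plain list of shortened sentences as a stand-in for the column. Each element is then a whole sentence, the `not in string.punctuation` test becomes a substring check that never matches, and `''.join` glues the sentences together with no separator.

```python
import string

# Stand-in for the 'Send' column (assumption: the function was called on the
# whole column). Iterating a pandas Series also yields whole cells, not characters.
column = ['Golgi body, membrane-bound organelle.',
          'The Golgi apparatus is responsible for packaging proteins']

# Each `cell` is a full sentence, so this is a substring test against the
# punctuation string and nothing gets filtered out.
kept = [cell for cell in column if cell not in string.punctuation]

# Joining with '' concatenates the sentences without any space between them.
joined = ''.join(kept)
tokens = joined.split()
print(tokens)
```

The printed tokens still carry their punctuation, and the end of one sentence is fused with the start of the next, which matches the pattern in the output above. Applying the function to one cell at a time avoids both problems.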



import pandas as pd
import string
import re
from nltk.corpus import stopwords

df = pd.DataFrame(
    {'Send': ['Golgi body, membrane-bound organelle of eukaryotic cells (cells with clearly defined nuclei).',
              'The Golgi apparatus is responsible for transporting, modifying, and packaging proteins',
              'Non-foliated metamorphic rocks do not have a platy or sheet-like structure.',
              'The process of metamorphism does not melt the rocks.'],
     'Class': ['biology', 'biology', 'geography', 'geography']})

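# Build these once rather than for every word.
# The stop word list needs a one-time nltk.download('stopwords').
table = str.maketrans('', '', string.punctuation)
stop_words = set(stopwords.words('english'))

def text_process(mess):
    # Split on runs of non-word characters, so 'membrane-bound' becomes two words
    words = re.split(r'\W+', mess)
    # \W does not match underscores, so strip any punctuation that is left
    nopunc = [w.translate(table) for w in words]
    # Drop empty strings and stop words, then rejoin into a single string
    return ' '.join(w for w in nopunc if w and w.lower() not in stop_words)

df['Send'] = df['Send'].apply(text_process)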



                                                                                 Send      Class
0  Golgi body membrane bound organelle eukaryotic cells cells clearly defined nuclei     biology
1               Golgi apparatus responsible transporting modifying packaging proteins    biology
2                          Non foliated metamorphic rocks platy sheet like structure   geography
3                                                    process metamorphism melt rocks   geography
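
To check the cleaning logic without pandas or the NLTK corpus, the same idea can be sketched with the standard library alone. The name `clean_text` and the tiny `STOP_WORDS` set are hypothetical, chosen only to cover the sample sentences. A real run would use the full NLTK list instead.

```python
import re

# Hypothetical, deliberately small stop word set so the sketch runs without NLTK.
# In practice this would be set(stopwords.words('english')).
STOP_WORDS = {'a', 'and', 'do', 'does', 'for', 'have', 'is', 'not', 'of', 'or', 'the', 'with'}

def clean_text(text, stop_words=STOP_WORDS):
    """Return text with punctuation and stop words removed."""
    # Replace every run of non-alphanumeric characters with a space,
    # which also splits hyphenated words such as 'sheet-like' in two.
    words = re.sub(r'[^A-Za-z0-9]+', ' ', text).split()
    # Compare in lower case so 'The' is treated like 'the'.
    return ' '.join(w for w in words if w.lower() not in stop_words)

sentences = [
    'The Golgi apparatus is responsible for transporting, modifying, and packaging proteins',
    'The process of metamorphism does not melt the rocks.',
]
cleaned = [clean_text(s) for s in sentences]
print(cleaned)
```

With a dataframe, the same function could be applied per cell with `df['Send'] = df['Send'].apply(clean_text)`.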