How to perform stemming and drop columns in a pandas DataFrame in Python?

Question

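Below is a subset of my dataset. I am trying to clean the dataset using the Porter stemmer from the nltk package. I want to drop columns whose names share a stem: for example, "abandon", "abandoned", and "abandoning" should all collapse into a single "abandon" column. Below is the code I have so far, which shows the words/columns being stemmed, but I am not sure how to drop those columns. I have already tokenized the corpus and removed punctuation from it.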

Note: I am new to Python and text mining.

Subset of the dataset

{
   'aaaahhhs':{
      0:0,
      1:0,
      2:0,
      3:0,
      4:0,
      5:0
   },
   'aahs':{
      0:0,
      1:0,
      2:0,
      3:0,
      4:0,
      5:0
   },
   'aamir':{
      0:0,
      1:0,
      2:0,
      3:0,
      4:0,
      5:0
   },
   'aardman':{
      0:0,
      1:0,
      2:0,
      3:0,
      4:0,
      5:0
   },
   'aaron':{
      0:0,
      1:0,
      2:0,
      3:0,
      4:0,
      5:0
   },
   'abandon':{
      0:0,
      1:0,
      2:0,
      3:0,
      4:0,
      5:0
   },
   'abandoned':{
      0:0,
      1:0,
      2:0,
      3:0,
      4:0,
      5:0
   },
   'abandoning':{
      0:0,
      1:0,
      2:0,
      3:0,
      4:0,
      5:0
   },
   'abandonment':{
      0:0,
      1:0,
      2:0,
      3:0,
      4:0,
      5:0
   },
   'abandons':{
      0:0,
      1:0,
      2:0,
      3:0,
      4:0,
      5:0
   }
}

Code so far:

from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Print the Porter stem of every column name in clean_df.
for w in clean_df.columns:
    print(ps.stem(w))

Answer 1

I think something like this does what you want:

import collections

# Build the mapping from each stem to the column names that reduce to it
# (ps is the PorterStemmer instance created in the question):
stems = collections.defaultdict(list)
for column_name in clean_df.columns:
    stems[ps.stem(column_name)].append(column_name)

# For each stem, keep the column name that comes first in lexicographical order:
new_columns = [min(columns) for columns in stems.values()]

# Create the new `DataFrame` containing only the selected columns:
new_df = clean_df[new_columns]
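
For anyone who wants to try this end to end, here is a minimal self-contained sketch. The small `clean_df` below is a hypothetical stand-in with made-up counts, not the asker's data. Option 1 repeats the approach above. Option 2 is an alternative that is not part of the answer: it assumes the frame is a term-count matrix and sums the columns that share a stem, so the counts of the dropped columns are not lost.

```python
import collections

import pandas as pd
from nltk.stem import PorterStemmer

# Hypothetical stand-in for clean_df: a tiny term-count matrix with made-up counts.
clean_df = pd.DataFrame(
    {
        "aaron": [1, 0, 0],
        "abandon": [0, 1, 0],
        "abandoned": [2, 0, 0],
        "abandoning": [0, 0, 1],
        "abandons": [0, 1, 1],
    }
)

ps = PorterStemmer()

# Map each stem to the column names that reduce to it.
stems = collections.defaultdict(list)
for column_name in clean_df.columns:
    stems[ps.stem(column_name)].append(column_name)

# Option 1 (the approach above): keep one representative column per stem,
# namely the name that comes first in lexicographical order.
kept_df = clean_df[[min(columns) for columns in stems.values()]]

# Option 2 (an assumption about what is wanted, not from the answer):
# add up all columns that share a stem and name the result after the stem.
summed_df = pd.DataFrame(
    {stem: clean_df[columns].sum(axis=1) for stem, columns in stems.items()}
)

print(kept_df.columns.tolist())
print(summed_df)
```

Which option fits depends on the data: keeping one column discards the counts stored under the other word forms, while summing preserves them but renames the columns to the stems, which are not always real words.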

python

stemming

porter-stemmer

text-mining

pandas