How to combine data columns with similar column names in Pandas

I have a dataset with many similar column names (basically misspelled words), for example:

apple    grapes    apples    bana    apyles    grayes    graph    banana

Here I want to combine the columns 'apple, apples, apyles', then 'grapes, grayes, graph', then 'bana, banana'. How can I do this?

*Edit, in response to a comment:

Q. What do you mean by "combine"? Can you include sample input and output?

A.

Input

apple    grapes    apples    bana    apyles    grayes    graph    banana
  1         2         3        4        5         6        7         8

Output

apple    grape    banana
  9       15         12 

With fuzzywuzzy you can try the following. Note that the best fuzz.ratio cutoff I could find to make this work was 70:

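import pandas as pd
from fuzzywuzzy import fuzz

# Sample data from the question (one row, eight misspelled columns)
df = pd.DataFrame([[1, 2, 3, 4, 5, 6, 7, 8]],
                  columns=['apple', 'grapes', 'apples', 'bana',
                           'apyles', 'grayes', 'graph', 'banana'])

correct = ['apple', 'grapes', 'banana']
# For each correct name, collect the columns whose fuzz.ratio is above 70.
# Iterating over `correct` keeps l[i] aligned with correct[i].
l = []
for col in correct:
    l.append([c for c in df.columns if fuzz.ratio(col, c) > 70])
# Transpose so the column names become an ordinary 'index' column
df = df.T.reset_index()
for i in range(len(correct)):
    for j in l[i]:
        df['index'] = df['index'].replace(j, correct[i])
# Sum the rows that now share a name, then transpose back
df = df.groupby('index').sum().T
df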
Out[1]: 
index  apple  banana  grapes
0          9      12      15
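
To see why the cutoff is so sensitive, it can help to print the scores themselves. The sketch below is not fuzzywuzzy: it uses the standard library's difflib.SequenceMatcher as a stand-in for fuzz.ratio, and the helper name `ratio` is made up for illustration. Depending on which backend fuzzywuzzy uses, its numbers may differ slightly from these.

```python
from difflib import SequenceMatcher

def ratio(a, b):
    """Similarity of two strings on a 0-100 scale (assumed stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

columns = ['apple', 'grapes', 'apples', 'bana', 'apyles', 'grayes', 'graph', 'banana']
correct = ['apple', 'grapes', 'banana']

# Print every score so the gap between intended and unintended matches is visible
for name in correct:
    scores = {c: ratio(name, c) for c in columns}
    print(name, scores)
```

With this stand-in, the weakest intended matches ('apple'/'apyles' and 'grapes'/'graph') score 73, while the strongest unintended ones ('grapes'/'apples' and 'grapes'/'apyles') score 67, so the usable window for a cutoff is narrow.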

You don't need a cutoff for the fuzzy score. Just take the highest-scoring match.

import pandas as pd
from fuzzywuzzy import fuzz

df = pd.DataFrame({
    'fruit': ['apple', 'grapes', 'apples', 'bana', 'apyles', 'grayes', 'graph', 'banana'],
    'count': [1, 2, 3, 4, 5, 6, 7, 8],
})

choices = ['apples', 'grapes', 'bananas']

# Map each name to the choice with the highest fuzz.ratio score
transl = {name: max((fuzz.ratio(choice, name), choice) for choice in choices)[1]
          for name in df['fruit']}

# Replace each name with its best match, then sum the counts per name
df = df.replace({'fruit': transl}).groupby(['fruit'])['count'].sum()

print(df)

Output:

fruit
apples      9
bananas    12
grapes     15
Name: count, dtype: int64
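
The answer above works on a long table (one row per fruit), whereas the question has the names as columns. A minimal sketch of the same "take the best match" idea applied to column names follows. It swaps fuzzywuzzy for the standard library's difflib.get_close_matches, and the helper `best_match_map` and the plain dict `values` are hypothetical names used only for illustration.

```python
from difflib import get_close_matches

def best_match_map(names, choices):
    """Map every name to its closest entry in choices (hypothetical helper)."""
    # cutoff=0.0 keeps every candidate; n=1 returns only the best one
    return {name: get_close_matches(name, choices, n=1, cutoff=0.0)[0]
            for name in names}

# Column names and values from the question's sample input
values = {'apple': 1, 'grapes': 2, 'apples': 3, 'bana': 4,
          'apyles': 5, 'grayes': 6, 'graph': 7, 'banana': 8}
choices = ['apples', 'grapes', 'bananas']

mapping = best_match_map(values, choices)

# Add up the values of all columns that map to the same canonical name
totals = {}
for name, value in values.items():
    totals[mapping[name]] = totals.get(mapping[name], 0) + value

print(totals)
```

With a wide DataFrame, the same `mapping` could be passed to `df.rename(columns=mapping)`, after which `df.T.groupby(level=0).sum().T` adds up the columns that now share a name. Because there is no cutoff, every column is assigned to some choice, including columns that resemble none of them, so it is worth checking `mapping` by eye before summing.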