How to combine data columns with similar column names in Pandas
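I have a dataset with many similar column names (basically misspelled words), for example:
apple grapes apples bana apyles grayes graph banana
Here I want to combine the columns 'apple, apples, apyles', then 'grapes, grayes, graph', then 'bana, banana'. How can I do that?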
*Edit in response to a comment:
Q. When you say "combine", what do you mean? Can you include sample input and output?
Answer
Input
apple  grapes  apples  bana  apyles  grayes  graph  banana
1      2       3       4     5       6       7      8
Output
apple  grape  banana
9      15     12
Using fuzzywuzzy
You can try the following. Note that the best fuzz.ratio cutoff I could find to make this work was 70:
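import pandas as pd
from fuzzywuzzy import fuzz

# Sample input from the question: one row, eight (partly misspelled) columns
df = pd.DataFrame([[1, 2, 3, 4, 5, 6, 7, 8]],
                  columns=['apple', 'grapes', 'apples', 'bana', 'apyles', 'grayes', 'graph', 'banana'])

l = []
correct = ['apple', 'grapes', 'banana']
# For each correctly spelled name, collect the columns scoring above 70
for col in correct:
    l.append([c for c in df.columns if fuzz.ratio(col, c) > 70])

# Move the column names into an 'index' column so they can be replaced
df = df.T.reset_index()
for i in range(len(correct)):
    for j in l[i]:
        df['index'] = df['index'].replace(j, correct[i])
# Sum the rows that now share a name, then transpose back
df = df.groupby('index').sum().T
df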
Out[1]:
index apple banana grapes
0 9 12 15
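The grouping step can also be tried without fuzzywuzzy. The following is a minimal sketch that swaps in `difflib.SequenceMatcher` from the standard library as the similarity measure; the helper names `similarity` and `group_by_threshold` are hypothetical, and the scores are only loosely comparable to `fuzz.ratio`, so the cutoff of 70 may need adjusting on other data.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Similarity on a 0-100 scale (assumption: stands in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def group_by_threshold(columns, correct, threshold=70):
    """Map each correctly spelled name to the columns scoring above the threshold."""
    return {
        name: [c for c in columns if similarity(name, c) > threshold]
        for name in correct
    }


columns = ['apple', 'grapes', 'apples', 'bana', 'apyles', 'grayes', 'graph', 'banana']
groups = group_by_threshold(columns, ['apple', 'grapes', 'banana'])
print(groups)
```

Printing the scores for a few pairs first (for example `similarity('grapes', 'apples')`) is a quick way to see how much room there is between true matches and near misses before settling on a cutoff.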
You don't need a cutoff for the fuzzy score. Just take the highest-scoring match.
import pandas as pd
from fuzzywuzzy import fuzz

df = pd.DataFrame({'fruit': ['apple', 'grapes', 'apples', 'bana', 'apyles', 'grayes', 'graph', 'banana'],
                   'count': [1, 2, 3, 4, 5, 6, 7, 8]})
choices = ['apples', 'grapes', 'bananas']
# Map every observed name to the choice with the highest fuzz.ratio score
transl = {el2: max([(fuzz.ratio(el1, el2), el1) for el1 in choices])[1] for el2 in df['fruit']}
# Replace each name with its best match, then sum the counts per name
df = df.replace({'fruit': transl}).groupby(['fruit'])['count'].sum()
print(df)
Output:
fruit
apples 9
bananas 12
grapes 15
Name: count, dtype: int64
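The answer above works on long-format data, while the question has the names as columns. As a rough sketch of the same best-match idea applied to a single wide row, the example below uses a plain dict in place of a DataFrame row and `difflib.SequenceMatcher` in place of `fuzz.ratio`; both substitutions, and the helper name `best_match`, are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def best_match(name, choices):
    """Return the choice most similar to `name` (no cutoff, highest score wins)."""
    return max(choices, key=lambda c: SequenceMatcher(None, name, c).ratio())


# One wide row from the question, keyed by the (partly misspelled) column names
row = {'apple': 1, 'grapes': 2, 'apples': 3, 'bana': 4,
       'apyles': 5, 'grayes': 6, 'graph': 7, 'banana': 8}
choices = ['apples', 'grapes', 'bananas']

# Add each value to the total of its best-matching name
totals = {}
for col, value in row.items():
    target = best_match(col, choices)
    totals[target] = totals.get(target, 0) + value
print(totals)
```

Because every name is assigned to some choice, a column that matches nothing well still ends up in a group, so this variant is best suited to data where all columns are known to be misspellings of the listed choices.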