Group unique values by the most frequently occurring unique value in Python
Below is my df:
name
A S BITO
A S KIGEL
A S NATURENERGI
A S NATURENERGIE
A S NATURENERGIE
A S P BU SERVICE POWER P
A S P BU SERVICE POWER P
A S P BU SERVICE POWER PETER GMBH
A S P GMBH
A RESE LAND
A RITTER WITH SA
A RITTER WITH SA
A RITTER WITH SA
A RITTER SA CO
A RITTER SA CO
A RITTER SA CO
A RITTER SA CO
A RITTER WITH MASCHINE
A RITTER WITH MASCHINE SA CO
A RITTER WITH MASCHINE SA CO
The goal is to replace each name with the variant of it that occurs most often.
Below is the list of unique values:
name                                 occurrences
A S BITO                             1
A S KIGEL                            1
A S NATURENERGI                      1
A S NATURENERGIE                     2
A S P BU SERVICE POWER P             2
A S P BU SERVICE POWER PETER GMBH    1
A S P GMBH                           1
A RESE LAND                          1
A RITTER WITH SA                     3
A RITTER SA CO                       4
A RITTER WITH MASCHINE               1
A RITTER WITH MASCHINE SA CO         2
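For reference, a table like the one above can be built with `value_counts`. This is only a minimal sketch: it assumes the data sits in a DataFrame called `df` with a single `name` column, as in the question, and the variable names `names` and `occurrences` are made up for illustration.

```python
import pandas as pd

# The 20 sample rows from the question, written compactly.
names = (
    ["A S BITO", "A S KIGEL", "A S NATURENERGI"]
    + ["A S NATURENERGIE"] * 2
    + ["A S P BU SERVICE POWER P"] * 2
    + ["A S P BU SERVICE POWER PETER GMBH", "A S P GMBH", "A RESE LAND"]
    + ["A RITTER WITH SA"] * 3
    + ["A RITTER SA CO"] * 4
    + ["A RITTER WITH MASCHINE"]
    + ["A RITTER WITH MASCHINE SA CO"] * 2
)
df = pd.DataFrame({"name": names})

# Count how often each distinct name occurs and turn the result into a table.
occurrences = (
    df["name"]
    .value_counts()
    .rename_axis("name")
    .reset_index(name="occurrences")
)
print(occurrences)
```

By default `value_counts` sorts by frequency, so the most common spelling ends up at the top of the table.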
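As you can see in the df, some of the names could be grouped together.
However, because of spelling differences, they are not.
The desired output looks like this: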
name
A S BITO
A S KIGEL
A S NATURENERGIE
A S NATURENERGIE
A S NATURENERGIE
A S P BU SERVICE POWER P
A S P BU SERVICE POWER P
A S P BU SERVICE POWER P
A S P GMBH
A RESE LAND
A RITTER SA CO
A RITTER SA CO
A RITTER SA CO
A RITTER SA CO
A RITTER SA CO
A RITTER SA CO
A RITTER SA CO
A RITTER SA CO
A RITTER SA CO
A RITTER SA CO
Below is the code I have tried:
df['name'] = df['name'].replace('A S NATURENERGI', 'A S NATURENERGIE')
df['name'] = df['name'].replace('A S P BU SERVICE POWER PETER GMBH', 'A S P BU SERVICE POWER P')
df['name'] = df['name'].replace('A RITTER WITH SA', 'A RITTER SA CO')
df['name'] = df['name'].replace('A RITTER WITH MASCHINE', 'A RITTER SA CO')
df['name'] = df['name'].replace('A RITTER WITH MASCHINE SA CO', 'A RITTER SA CO')
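However, this is probably not the best way to handle the problem.
So I am thinking about using difflib and computing a match score.
The next step would be to replace each name with its highest-scoring match.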
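from functools import partial
import difflib
# For every name, look up its single closest match among all names in the column
f = partial(difflib.get_close_matches, possibilities=df['name'].tolist(), n=1)
matches = df['name'].map(f).str[0].fillna('')
# Similarity ratio between each name and the match found for it
scores = [difflib.SequenceMatcher(None, x, y).ratio() for x, y in zip(matches, df['name'])]
df_diff = df.assign(best=matches, score=scores)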
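The drawback of this approach is that every name simply retrieves itself as its best match...
So if anyone has any ideas, I would really appreciate it!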
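On the drawback mentioned above: each name finds itself because it is part of the `possibilities` list. One possible workaround, sketched here under assumptions, is to drop the name from the candidates before calling `difflib.get_close_matches`. The helper name `best_other_match` and the default `cutoff=0.6` are illustrative choices, not something taken from the question.

```python
import difflib

def best_other_match(name, candidates, cutoff=0.6):
    """Return the closest candidate that is not `name` itself, or None."""
    # Remove exact copies of the name so it cannot match itself.
    others = [c for c in candidates if c != name]
    hits = difflib.get_close_matches(name, others, n=1, cutoff=cutoff)
    return hits[0] if hits else None

unique_names = ["A S BITO", "A S NATURENERGI", "A S NATURENERGIE"]
print(best_other_match("A S NATURENERGI", unique_names))
```

Passing the unique names rather than the whole column keeps the candidate list short and avoids comparing against repeated rows.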
I created a custom function that builds a chain of mappings over a pandas Series:
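import difflib

def similarity_replace(series):
    reverse_map = {}  # space-free key -> original name
    diz_map = {}      # original name -> space-free key
    # Series.items() replaces iteritems(), which was removed in pandas 2.0
    for i, s in series.items():
        diz_map[s] = s.replace(" ", "")
        reverse_map[s.replace(" ", "")] = s

    best_match = {}
    uni = list(set(diz_map.values()))
    for w in uni:
        # Among the up to 3 closest keys (including w itself), keep the shortest one
        best_match[w] = sorted(difflib.get_close_matches(w, uni, n=3, cutoff=0.6), key=len)[0]

    return series.map(diz_map).map(best_match).map(reverse_map)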
Here is an example:
import pandas as pd
name = pd.Series(['A S BITO',
'A S KIGEL',
'A S NATURENERGI',
'A S NATURENERGIE',
'A S NATURENERGIE',
'A S P BU SERVICE POWER P',
'A S P BU SERVICE POWER P',
'A S P BU SERVICE POWER PETER GMBH',
'A S P GMBH',
'A RESE LAND',
'A RITTER WITH SA',
'A RITTER WITH SA',
'A RITTER WITH SA',
'A RITTER SA CO',
'A RITTER SA CO',
'A RITTER SA CO',
'A RITTER SA CO',
'A RITTER WITH MASCHINE',
'A RITTER WITH MASCHINE SA CO',
'A RITTER WITH MASCHINE SA CO'])
similarity_replace(similarity_replace(name))
Output:
0 A S BITO
1 A S KIGEL
2 A S NATURENERGI
3 A S NATURENERGI
4 A S NATURENERGI
5 A S P BU SERVICE POWER P
6 A S P BU SERVICE POWER P
7 A S P BU SERVICE POWER P
8 A S P GMBH
9 A RESE LAND
10 A RITTER SA CO
11 A RITTER SA CO
12 A RITTER SA CO
13 A RITTER SA CO
14 A RITTER SA CO
15 A RITTER SA CO
16 A RITTER SA CO
17 A RITTER SA CO
18 A RITTER SA CO
19 A RITTER SA CO
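Note that the function above keeps the shortest close match, which is why rows 2 to 4 come out as A S NATURENERGI rather than the more frequent A S NATURENERGIE that the question asks for. A possible variant, sketched here as an assumption and not as part of the original answer, picks the most frequent name among the close matches instead. The function name `replace_with_most_frequent` and its parameters are hypothetical.

```python
import difflib
import pandas as pd

def replace_with_most_frequent(series, cutoff=0.6, n=3):
    """Map each name to the most frequent of its close matches."""
    counts = series.value_counts().to_dict()
    uniques = list(counts)
    mapping = {}
    for name in uniques:
        # Close matches are ordered by similarity; the name itself comes first.
        close = difflib.get_close_matches(name, uniques, n=n, cutoff=cutoff) or [name]
        # max() keeps the first of several equally frequent candidates,
        # so ties go to the most similar one.
        mapping[name] = max(close, key=lambda c: counts[c])
    return series.map(mapping)

sample = pd.Series(
    ["A S NATURENERGI", "A S NATURENERGIE", "A S NATURENERGIE", "A S BITO"]
)
result = replace_with_most_frequent(sample)
print(result.tolist())
# ['A S NATURENERGIE', 'A S NATURENERGIE', 'A S NATURENERGIE', 'A S BITO']
```

Close-match groups are not transitive, so on the full data a single pass may not merge every variant; as with the nested call above, the function may need to be applied more than once, and `cutoff` and `n` would have to be tuned to the data.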