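Finding and replacing values in a list based on fuzzy matching
I am trying to loop over the values of a column in pandas and change all similar values so that they are harmonized. I first extract the column as a list and want to iterate over each row, replace a value with the similar value whenever one is found, and then put the list back into the dataframe in place of the column. For example, a column like this: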
Cool
Awesome
cool
CoOl
Awesum
Awesome
Mathss
Math
Maths
Mathss
would become:
CoOl
Awesome
coOol
CoOl
Awesome
Awesome
Mathss
Mathss
Mathss
Mathss
The code is as follows:
def matchbrands():
    conn = sqlite3.connect('/Users/XXX/db.sqlite3')
    c = conn.cursor()
    matchbrands_df = pd.read_sql_query("SELECT * from removeduplicates", conn)
    brands = [x for x in matchbrands_df['brand']]
    i=1
    for x in brands:
        if fuzz.token_sort_ratio(x, brands[i]) > 85:
            x = brands[i]
        else:
            i += 1
    n = matchbrands_df.columns[7]
    matchbrands_df.drop(n, axis=1, inplace=True)
    matchbrands_df[n] = brands
    matchbrands_df.to_csv('/Users/XXX/matchedbrands.csv')
    matchbrands_df.to_sql('removeduplicates', conn, if_exists="replace")
However, this does not change the column at all, and I am not sure why. Any help would be appreciated.
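Your code doesn't make sense.
First: assigning with x = ... cannot change a value in the list brands. You need brands[index] = ...
Second: it needs nested for loops to compare x with all the other words in brands: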
for index, word in enumerate(brands):
    for other in brands[index+1:]:
        #print(word, other, fuzz.token_sort_ratio(word, other))
        if fuzz.token_sort_ratio(word, other) > 85:
            brands[index] = other
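To illustrate the first point, here is a small sketch (the list contents and the name `unchanged` are only for illustration, not taken from the question's database) showing that rebinding the loop variable leaves the list untouched, while assigning through the index changes it:

```python
brands = ['Cool', 'cool', 'CoOl']

# Rebinding the loop variable only changes the local name x, not the list.
for x in brands:
    x = x.lower()
unchanged = list(brands)

# Assigning through the index writes the new value into the list itself.
for index, x in enumerate(brands):
    brands[index] = x.lower()

print(unchanged)
print(brands)
```

The first print still shows the original mixed-case values; only the second loop actually modifies `brands`.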
Minimal working code:
import pandas as pd
import fuzzywuzzy.fuzz as fuzz
data = {'brands':
'''Cool
Awesome
cool
CoOl
Awesum
Awesome
Mathss
Math
Maths
Mathss'''.split('\n')
} # rows
df = pd.DataFrame(data)
print('--- before ---')
print(df)
brands = df['brands'].to_list()
print('--- changes ---')
for index, word in enumerate(brands):
    #for other_index, other_word in enumerate(brands):
    for other_index, other_word in enumerate(brands[index+1:], index+1):
        #if word != other_word:
        result = fuzz.token_sort_ratio(word, other_word)
        if result > 85:
            print(f'OK | {result:3} | {index:2} {word:7} -> {other_index:2} {other_word}')
        elif result > 50:
            print(f'   | {result:3} | {index:2} {word:7} -> {other_index:2} {other_word}')
        if result > 85:
            brands[index] = other_word
            #break
            #word = other_word
df['brands'] = brands
print('--- after ---')
print(df)
Result:
--- before ---
    brands
0     Cool
1  Awesome
2     cool
3     CoOl
4   Awesum
5  Awesome
6   Mathss
7     Math
8    Maths
9   Mathss
--- changes ---
OK | 100 |  0 Cool    ->  2 cool
OK | 100 |  0 Cool    ->  3 CoOl
   |  77 |  1 Awesome ->  4 Awesum
OK | 100 |  1 Awesome ->  5 Awesome
OK | 100 |  2 cool    ->  3 CoOl
   |  77 |  4 Awesum  ->  5 Awesome
   |  80 |  6 Mathss  ->  7 Math
OK |  91 |  6 Mathss  ->  8 Maths
OK | 100 |  6 Mathss  ->  9 Mathss
OK |  89 |  7 Math    ->  8 Maths
   |  80 |  7 Math    ->  9 Mathss
OK |  91 |  8 Maths   ->  9 Mathss
--- after ---
    brands
0     CoOl
1  Awesome
2     CoOl
3     CoOl
4   Awesum
5  Awesome
6   Mathss
7    Maths
8   Mathss
9   Mathss
It doesn't change Awesum to Awesome because that pair scores 77.
It doesn't change Math to Mathss because that pair scores 80, but it does score 89 for Maths, so Math becomes Maths.
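If you use word = other_word inside the for loop, then it can convert Math to Maths (89) and next Maths to Mathss (91). But this way a word may be changed many times and finally end up as a word that would have scored much lower than 85 against the original. You can also get the expected result by using 75 instead of 85.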
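The drift described above can be reproduced on the sample words. This is only a sketch and not the answer's code: `score` is a hypothetical stand-in built on the standard library's `difflib` so that it runs without fuzzywuzzy, and it is only assumed to approximate `fuzz.token_sort_ratio` for single-word strings; `chain` is likewise an illustrative name.

```python
from difflib import SequenceMatcher


def score(a, b):
    """Assumed stand-in for fuzz.token_sort_ratio on single words (0-100)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def chain(word, later_words, threshold=85):
    """Mimic `word = other_word`: every match becomes the new comparison base."""
    start = word
    for other in later_words:
        if score(word, other) > threshold:
            word = other  # later comparisons now use the replacement
    # Return the final word and how well it matches the word we started from.
    return word, score(start, word)


final, direct = chain('Math', ['Maths', 'Mathss'])
print(final, direct)
```

With these inputs the chain ends at Mathss, although the direct score between Math and Mathss is below the 85 threshold, which is exactly the risk of chaining replacements.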
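But this method takes the last word that scores >85, not the word with the highest score, so there may be a better match that it never uses. With break it takes the first word that scores >85. Maybe it should collect all words that score >85 and choose the one with the highest value. It would also have to skip words that are identical but sit in different rows. But all of this can create strange situations.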
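The "choose the highest score" idea could be sketched as below. This is an illustration under assumptions, not the answer's method: `score` is a hypothetical `difflib`-based stand-in that is only assumed to approximate `fuzz.token_sort_ratio` for single words, `best_match` is an invented helper name, and like the loops above it only looks at later rows.

```python
from difflib import SequenceMatcher


def score(a, b):
    """Assumed stand-in for fuzz.token_sort_ratio on single words (0-100)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def best_match(word, candidates, threshold=85):
    """Return the candidate with the highest score above threshold, else word."""
    best_word, best_score = word, threshold
    for other in candidates:
        if other == word:
            continue  # skip identical words that sit in other rows
        current = score(word, other)
        if current > best_score:  # strict: the first of equal scores wins
            best_word, best_score = other, current
    return best_word


original = ['Cool', 'Awesome', 'cool', 'CoOl', 'Awesum',
            'Awesome', 'Mathss', 'Math', 'Maths', 'Mathss']

# Compare each row only with the rows after it, always against the original values.
result = [best_match(word, original[index + 1:])
          for index, word in enumerate(original)]
print(result)
```

On the sample column row 6 moves from Mathss to Maths while row 8 moves from Maths to Mathss, which is one of the strange situations this kind of rule can produce.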
I kept ideas for other modifications as comments in the code.
EDIT:
The same with >75 and colors.
import pandas as pd
import fuzzywuzzy.fuzz as fuzz
from colorama import Fore as FG, Back as BG, Style as ST
data = {'brands':
'''Cool
Awesome
cool
CoOl
Awesum
Awesome
Mathss
Math
Maths
Mathss'''.split('\n')
} # rows
df = pd.DataFrame(data)
print('--- before ---')
print(df)
brands = df['brands'].to_list()
print('--- changes ---')
for index, word in enumerate(brands):
    print('-', index, '-')
    #for other_index, other_word in enumerate(brands):
    for other_index, other_word in enumerate(brands[index+1:], index+1):
        #if word != other_word:
        result = fuzz.token_sort_ratio(word, other_word)
        if result > 85:
            color = ST.BRIGHT + FG.GREEN
            info = 'OK'
        elif result > 75:
            color = ST.BRIGHT + FG.YELLOW
            info = ' ?'
        elif result > 50:
            color = ST.BRIGHT + FG.WHITE
            info = '  '
        else:
            color = ST.BRIGHT + FG.RED
            info = ' -'
        print(f'{color}{info} | {result:3} | {index:2} {word:7} -> {other_index:2} {other_word}{ST.RESET_ALL}')
        if result > 75:
            brands[index] = other_word
            #break
            #word = other_word
df['brands'] = brands
print('--- after ---')
print(df)