How can I replace two for loops over a list and a DataFrame in Python?
I have a DataFrame and a list of strings:
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
df = pd.DataFrame({'Name': ['PARIS', 'NEW YORK', 'MADRI', 'PARI', 'P ARIS', 'NOW YORK',
'PORTUGAL', 'PORTUGLA'],
'Column_two': [1,2,3,4,5,6,7,8]
})
print(df)
# Output:
Name Column_two
PARIS 1
NEW YORK 2
MADRI 3
PARI 4
P ARIS 5
NOW YORK 6
PORTUGAL 7
PORTUGLA 8
list_string_correct = ['PARIS', 'NEW YORK', 'PORTUGAL']
I am using the fuzzywuzzy Python library. The method below returns a number that indicates how similar the two compared strings are:
Example:
fuzz.partial_ratio("巴西", "巴西")
# Output:
88
I want to iterate over the 'Name' column of the DataFrame and compare each string with the entries of list_string_correct. If they are similar, I want to replace the value with the correct name (the string from the list). So I wrote the following code:
for i in range(0, len(df)):
    for j in range(0, len(list_string_correct)):
        var_string = list_string_correct[j]
        # Similarity score from 0 to 100
        result = fuzz.partial_ratio(var_string, df['Name'].iloc[i])
        if result >= 80:  # Condition
            # Single .loc call avoids chained assignment
            df.loc[i, 'Name'] = var_string
The code works. The output is as desired:
print(df)
# Output:
Name Column_two
PARIS 1
NEW YORK 2
MADRI 3
PARIS 4
PARIS 5
NEW YORK 6
PORTUGAL 7
PORTUGAL 8
However, this needs two for loops. Is there a way to replace the for loops and keep the same output?
To install the libraries, use:
pip install fuzzywuzzy
pip install python-Levenshtein
Try process.extractOne from the thefuzz package (the successor of fuzzywuzzy, same author, same API):
# from fuzzywuzzy import process
from thefuzz import process
THRESHOLD = 80
df['Name'] = (
    df['Name']
    .apply(lambda x: process.extractOne(x, list_string_correct,
                                        score_cutoff=THRESHOLD))
    .str[0]
    .fillna(df['Name'])
)
Output:
>>> df
Name Column_two
0 PARIS 1
1 NEW YORK 2
2 MADRI 3
3 PARIS 4
4 PARIS 5
5 NEW YORK 6
6 PORTUGAL 7
7 PORTUGAL 8
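To make it clearer why the .str[0] and .fillna(df['Name']) steps are there: process.extractOne returns a (match, score) tuple for the best-scoring choice, or None when nothing reaches score_cutoff. Here is a rough, dependency-free sketch of that "best match above a cutoff" logic. Note that similarity and best_match are hypothetical helpers, and the scorer is a stand-in based on the standard library's difflib, not the scorer thefuzz actually uses, so scores will differ from the real library.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in scorer (assumption, not thefuzz's): similarity on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def best_match(query, choices, score_cutoff=0):
    """Hypothetical helper: return (choice, score) for the best-scoring
    choice, or None if even the best score is below score_cutoff."""
    scored = [(choice, similarity(query, choice)) for choice in choices]
    choice, score = max(scored, key=lambda pair: pair[1])
    return (choice, score) if score >= score_cutoff else None


names = ['PARIS', 'NEW YORK', 'MADRI', 'PARI', 'P ARIS', 'NOW YORK',
         'PORTUGAL', 'PORTUGLA']
list_string_correct = ['PARIS', 'NEW YORK', 'PORTUGAL']

corrected = []
for name in names:
    match = best_match(name, list_string_correct, score_cutoff=80)
    # match[0] plays the role of .str[0]; the fallback to the original
    # name plays the role of .fillna(df['Name']).
    corrected.append(match[0] if match is not None else name)

print(corrected)
# ['PARIS', 'NEW YORK', 'MADRI', 'PARIS', 'PARIS', 'NEW YORK', 'PORTUGAL', 'PORTUGAL']
```

'MADRI' has no candidate at or above the cutoff, so best_match returns None and the original value is kept, which is exactly the case fillna handles in the pandas version.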
If for some reason you need to use the fuzzywuzzy package (rather than thefuzz, as @Corralien recommends), you can use a single loop instead:
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
df = pd.DataFrame({'Name': ['PARIS', 'NEW YORK', 'MADRI', 'PARI', 'P ARIS', 'NOW YORK',
'PORTUGAL', 'PORTUGLA'],
'Column_two': [1,2,3,4,5,6,7,8]
})
list_string_correct = ['PARIS', 'NEW YORK', 'PORTUGAL']
for correct_name in list_string_correct:
    df['Name'] = df['Name'].apply(lambda x: correct_name if fuzz.partial_ratio(correct_name, x) >= 80 else x)
Name Column_two
0 PARIS 1
1 NEW YORK 2
2 MADRI 3
3 PARIS 4
4 PARIS 5
5 NEW YORK 6
6 PORTUGAL 7
7 PORTUGAL 8