How can I replace two for loops over a list and a DataFrame in Python?

Question

I have a DataFrame and a list of strings:

      import pandas as pd
      from fuzzywuzzy import fuzz
      from fuzzywuzzy import process

      df = pd.DataFrame({'Name': ['PARIS', 'NEW YORK', 'MADRI', 'PARI', 'P ARIS', 'NOW YORK',
                                  'PORTUGAL', 'PORTUGLA'],                   
                         'Column_two': [1,2,3,4,5,6,7,8]                 
                         })

      print(df)

      # Output:

             Name  Column_two
      0     PARIS           1
      1  NEW YORK           2
      2     MADRI           3
      3      PARI           4
      4    P ARIS           5
      5  NOW YORK           6
      6  PORTUGAL           7
      7  PORTUGLA           8

      list_string_correct = ['PARIS', 'NEW YORK', 'PORTUGAL']

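I am using the fuzzywuzzy Python library. Its fuzz.partial_ratio function returns a number from 0 to 100 that indicates how similar the two compared strings are. Example: fuzz.partial_ratio("巴西", "巴西")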

     # Output:
     88
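To make the score more concrete, here is a rough sketch of the idea behind a partial ratio, written with the standard library's difflib only: slide the shorter string across the longer one and keep the best similarity. The helper name partial_ratio_sketch is hypothetical, and this is not fuzzywuzzy's actual implementation, so its scores can differ from fuzz.partial_ratio.

```python
from difflib import SequenceMatcher


def partial_ratio_sketch(a: str, b: str) -> int:
    """Best similarity (0-100) of the shorter string against any
    same-length window of the longer string. Illustrative only."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        # An empty string only "matches" another empty string.
        return 100 if not longer else 0

    best = 0.0
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return round(100 * best)


print(partial_ratio_sketch("PARI", "PARIS"))   # 100: "PARI" is a full substring
print(partial_ratio_sketch("MADRI", "PARIS"))  # well below the 80 threshold
```

With a threshold of 80, 'PARI' would be treated as a match for 'PARIS' while 'MADRI' would not.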

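I want to iterate over the 'Name' column of the DataFrame and compare each string with the entries in list_string_correct. If they are similar, I want to replace the value with the correct name (the string from the list). So I wrote the following code: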

      for i in range(len(df)):
          for j in range(len(list_string_correct)):

              var_string = list_string_correct[j]

              # Similarity score from 0 to 100
              result = fuzz.partial_ratio(var_string, df['Name'].iloc[i])

              if result >= 80:  # Condition
                  # Assign via df.loc[row, column] to avoid chained assignment
                  df.loc[i, 'Name'] = var_string

The code works, and the output is what I want:

     print(df)

     # Output:

            Name  Column_two
     0     PARIS           1
     1  NEW YORK           2
     2     MADRI           3
     3     PARIS           4
     4     PARIS           5
     5  NEW YORK           6
     6  PORTUGAL           7
     7  PORTUGAL           8

However, I had to use two for loops. Is there a way to replace the for loops and keep the same output?

To install the libraries, use:

      pip install fuzzywuzzy
      pip install python-Levenshtein

Answer 1

Try process.extractOne from the thefuzz package (the successor of fuzzywuzzy, same author, same API):

# from fuzzywuzzy import process
from thefuzz import process

THRESHOLD = 80

df['Name'] = \
    df['Name'].apply(lambda x: process.extractOne(x, list_string_correct,
                                   score_cutoff=THRESHOLD)).str[0].fillna(df['Name'])

Output:

>>> df
       Name  Column_two
0     PARIS           1
1  NEW YORK           2
2     MADRI           3
3     PARIS           4
4     PARIS           5
5  NEW YORK           6
6  PORTUGAL           7
7  PORTUGAL           8
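The .str[0].fillna(df['Name']) tail is needed because process.extractOne returns a (match, score) tuple, or None when no choice reaches score_cutoff. The sketch below imitates that contract in plain Python so the fallback logic is easy to see. extract_one_sketch is a hypothetical stand-in built on difflib's plain ratio, not the scorer that thefuzz uses by default, so individual scores may differ from the real library.

```python
from difflib import SequenceMatcher


def extract_one_sketch(query, choices, score_cutoff=0):
    """Return (best_choice, score) or None if nothing reaches score_cutoff.
    Hypothetical stand-in that mimics the return shape of extractOne."""
    best = None
    for choice in choices:
        score = round(100 * SequenceMatcher(None, query, choice).ratio())
        if score >= score_cutoff and (best is None or score > best[1]):
            best = (choice, score)
    return best


list_string_correct = ['PARIS', 'NEW YORK', 'PORTUGAL']
names = ['PARIS', 'NEW YORK', 'MADRI', 'PARI', 'P ARIS', 'NOW YORK',
         'PORTUGAL', 'PORTUGLA']

cleaned = []
for name in names:
    hit = extract_one_sketch(name, list_string_correct, score_cutoff=80)
    # hit[0] plays the role of .str[0]; the else branch plays the role of fillna
    cleaned.append(hit[0] if hit is not None else name)

print(cleaned)
# ['PARIS', 'NEW YORK', 'MADRI', 'PARIS', 'PARIS', 'NEW YORK', 'PORTUGAL', 'PORTUGAL']
```

'MADRI' gets no match above the cutoff, so the helper returns None and the original value is kept, which is the case fillna covers in the pandas one-liner.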

Answer 2

If for some reason you need to stay with the fuzzywuzzy package (rather than thefuzz, as recommended by @Corralien), you can use a single loop instead:

import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

df = pd.DataFrame({'Name': ['PARIS', 'NEW YORK', 'MADRI', 'PARI', 'P ARIS', 'NOW YORK',
                            'PORTUGAL', 'PORTUGLA'],                   
                    'Column_two': [1,2,3,4,5,6,7,8]                 
                    })

list_string_correct = ['PARIS', 'NEW YORK', 'PORTUGAL']


for correct_name in list_string_correct:
    df['Name'] = df['Name'].apply(lambda x: correct_name if fuzz.partial_ratio(correct_name, x) >= 80 else x)

Output:

       Name  Column_two
0     PARIS           1
1  NEW YORK           2
2     MADRI           3
3     PARIS           4
4     PARIS           5
5  NEW YORK           6
6  PORTUGAL           7
7  PORTUGAL           8
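If the column contains many repeated names, one option is to score each distinct name once and reuse the result. The sketch below builds such a lookup with difflib.get_close_matches from the standard library. Note that this swaps in a different similarity measure than fuzz.partial_ratio, so the cutoff of 0.8 is an assumption that happens to reproduce the output above for this sample data and may need tuning elsewhere.

```python
from difflib import get_close_matches

list_string_correct = ['PARIS', 'NEW YORK', 'PORTUGAL']
names = ['PARIS', 'NEW YORK', 'MADRI', 'PARI', 'P ARIS', 'NOW YORK',
         'PORTUGAL', 'PORTUGLA']

# Score every distinct name once; dict.fromkeys keeps first-seen order.
mapping = {}
for name in dict.fromkeys(names):
    hits = get_close_matches(name, list_string_correct, n=1, cutoff=0.8)
    # Keep the original name when nothing is similar enough.
    mapping[name] = hits[0] if hits else name

cleaned = [mapping[name] for name in names]
print(cleaned)
# ['PARIS', 'NEW YORK', 'MADRI', 'PARIS', 'PARIS', 'NEW YORK', 'PORTUGAL', 'PORTUGAL']
```

With pandas, the same dictionary can be applied to the column in one step, for example df['Name'] = df['Name'].map(mapping), where the distinct names would come from df['Name'].unique().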
