How to compare row by row in a dataframe

I have a dataframe with a name and a URL ID for that name. For example:

Abc           123
Abc.com       123
Def           345
Pqr           123
PQR.com       123

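Because of a data extraction error, different names sometimes share the same ID. I want to clean the table so that if the names are different and the ID is the same, the records are left unchanged. If the names are similar and the ID is also the same, the names should be changed to a single one. To be clear,

the expected output should be: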

Abc.com     123
Abc.com     123
Def         345
PQR.com     123
PQR.com     123

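That is, the last one is a data entry error, and both rows carry the same name (the first word of the string is the same), so both are changed to one name based on the ID. The first and second records, however, share an ID with the last records but have names that do not match and are completely different.

I can't work out how to achieve this.

Requesting some guidance here. Thanks in advance.

Note: the dataset has almost 16 million such records.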

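The idea is to use the fuzzy matching library fuzzywuzzy to compute a ratio for every combination of Name values that share an ID. The combinations come from joining the DataFrame to itself on ID with DataFrame.merge, and rows with the same name in both columns are removed with DataFrame.query. A new column with the name lengths is also added with Series.str.len:
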
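import pandas as pd
from fuzzywuzzy import fuzz

# Sample data, reconstructed from the outputs shown in this answer
df = pd.DataFrame({'Name': ['Abc', 'BCD', 'Def', 'Pqr', 'PQR.com'],
                   'ID': [123, 123, 345, 789, 789]})

# Pair every name with every other name that has the same ID
df1 = df.merge(df, on='ID').query('Name_x != Name_y')
df1['ratio'] = df1.apply(lambda x:  fuzz.ratio(x['Name_x'], x['Name_y']), axis=1)
df1['len'] = df1['Name_x'].str.len()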
print (df1)
    Name_x   ID   Name_y  ratio  len
1      Abc  123      BCD      0    3
2      BCD  123      Abc      0    3
6      Pqr  789  PQR.com     20    3
7  PQR.com  789      Pqr     20    7
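
To get a feel for the numbers in the ratio column, here is a minimal sketch that uses difflib.SequenceMatcher from the standard library as a stand-in for fuzz.ratio. The helper name simple_ratio is hypothetical, and this assumes fuzzywuzzy's pure-Python fallback, which is built on SequenceMatcher; with python-Levenshtein installed the scores can differ slightly.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Similarity on a 0-100 scale (hypothetical stand-in for fuzz.ratio)."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


# The last pair is the lowercased version of the second one
pairs = [('Abc', 'BCD'), ('Pqr', 'PQR.com'), ('pqr', 'pqr.com')]
scores = {pair: simple_ratio(*pair) for pair in pairs}

for pair, score in scores.items():
    print(pair, score)
```

The comparison is case-sensitive, which is why Pqr against PQR.com scores so low. Lowercasing both names before comparing raises the score, so the threshold used in the next step should be tuned with that in mind.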

Then filter the rows by a threshold with boolean indexing. Next you need to choose which value to keep; one possible solution is to take the longer text. So use DataFrameGroupBy.idxmax with DataFrame.loc, and then DataFrame.set_index to get a Series:

N = 15    
df2 = df1[df1['ratio'].gt(N)]
s = df2.loc[df2.groupby('ID')['len'].idxmax()].set_index('ID')['Name_x']
print (s)
ID
789    PQR.com
Name: Name_x, dtype: object
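
If the idxmax step looks opaque, it can be split in two. This is a small sketch on a hand-built copy of the filtered frame; the intermediate name best_rows is only for illustration.

```python
import pandas as pd

# Hand-built copy of the rows that pass the threshold, with the same index labels
df2 = pd.DataFrame(
    {
        'Name_x': ['Pqr', 'PQR.com'],
        'ID': [789, 789],
        'Name_y': ['PQR.com', 'Pqr'],
        'ratio': [20, 20],
        'len': [3, 7],
    },
    index=[6, 7],
)

# Step 1: for each ID, the index label of the row with the longest Name_x
best_rows = df2.groupby('ID')['len'].idxmax()

# Step 2: select those rows and turn them into an ID -> Name lookup Series
s = df2.loc[best_rows].set_index('ID')['Name_x']
print(s)
```

Note that idxmax returns the first label when lengths tie, so with equally long names the winner depends on row order.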

Finally, use Series.map by ID and replace the non-matched values with the original ones using Series.fillna:

df['Name'] = df['ID'].map(s).fillna(df['Name'])
print (df)
      Name   ID
0      Abc  123
1      BCD  123
2      Def  345
3  PQR.com  789
4  PQR.com  789
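
Since the question mentions about 16 million records, it may help to shrink the input before the self-merge, because the merge produces every pairing of names within an ID. A minimal sketch, assuming that only IDs with at least two distinct names need comparing (the names mask and candidates are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Abc', 'BCD', 'Def', 'Pqr', 'PQR.com'],
    'ID': [123, 123, 345, 789, 789],
})

# True for rows whose ID appears with more than one distinct name
mask = df.groupby('ID')['Name'].transform('nunique').gt(1)

# Repeated (Name, ID) rows add nothing to the comparison, so drop them too
candidates = df[mask].drop_duplicates(['Name', 'ID'])

# Same self-merge as above, but on the reduced frame
pairs = candidates.merge(candidates, on='ID').query('Name_x != Name_y')
print(pairs)
```

The resulting lookup can still be mapped back onto the full df by ID, exactly as in the step above.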

EDIT: If there are more valid strings per ID, it is more complicated:

print (df)
               Name          ID
0      Air Ordnance  1578013421
1  Air-Ordnance.com  1578013421
2          Garreett  1578013421
3           Garrett  1578013421

First get fuzz.ratio as in the previous solution:

from fuzzywuzzy import fuzz

df1 = df.merge(df, on='ID').query('Name_x != Name_y')
df1['ratio'] = df1.apply(lambda x:  fuzz.ratio(x['Name_x'], x['Name_y']), axis=1)
print (df1)
              Name_x          ID            Name_y  ratio
1       Air Ordnance  1578013421  Air-Ordnance.com     79
2       Air Ordnance  1578013421          Garreett     30
3       Air Ordnance  1578013421           Garrett     32
4   Air-Ordnance.com  1578013421      Air Ordnance     79
6   Air-Ordnance.com  1578013421          Garreett     25
7   Air-Ordnance.com  1578013421           Garrett     26
8           Garreett  1578013421      Air Ordnance     30
9           Garreett  1578013421  Air-Ordnance.com     25
11          Garreett  1578013421           Garrett     93
12           Garrett  1578013421      Air Ordnance     32
13           Garrett  1578013421  Air-Ordnance.com     26
14           Garrett  1578013421          Garreett     93

Then filter by the threshold:

N = 50    
df2 = df1[df1['ratio'].gt(N)]
print (df2)

              Name_x          ID            Name_y  ratio
1       Air Ordnance  1578013421  Air-Ordnance.com     79
4   Air-Ordnance.com  1578013421      Air Ordnance     79
11          Garreett  1578013421           Garrett     93
14           Garrett  1578013421          Garreett     93

But to be more precise, you need to specify which strings are valid in a list L and then filter by that list:

L = ['Air-Ordnance.com','Garrett']
df2 = df2.loc[df2['Name_x'].isin(L),['Name_x','Name_y','ID']].rename(columns={'Name_y':'Name'})
print (df2)
              Name_x          Name          ID
4   Air-Ordnance.com  Air Ordnance  1578013421
14           Garrett      Garreett  1578013421

Finally, merge with a left join to the original and replace the missing values:

df = df.merge(df2, on=['Name','ID'], how='left')
df['Name'] = df.pop('Name_x').fillna(df['Name'])
print (df)
               Name          ID
0  Air-Ordnance.com  1578013421
1  Air-Ordnance.com  1578013421
2           Garrett  1578013421
3           Garrett  1578013421
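
As an alternative to pairwise fuzzy scores, the rule stated in the question (the first word of the string is the same) can be applied directly. This is not the fuzzywuzzy approach above, only a rough sketch of a different technique: the helper first_word_key and the choice to keep the longest name per group are assumptions.

```python
import re

import pandas as pd


def first_word_key(name: str) -> str:
    """Lowercased leading alphanumeric run of the name (hypothetical helper)."""
    match = re.match(r'[A-Za-z0-9]+', name.strip())
    return match.group(0).lower() if match else name.lower()


df = pd.DataFrame({
    'Name': ['Abc', 'BCD', 'Def', 'Pqr', 'PQR.com'],
    'ID': [123, 123, 345, 789, 789],
})

keyed = df.assign(key=df['Name'].map(first_word_key),
                  length=df['Name'].str.len())

# For each (ID, key) group, keep the row holding the longest name
best = keyed.loc[keyed.groupby(['ID', 'key'])['length'].idxmax(),
                 ['ID', 'key', 'Name']]

# A left merge keeps the original row order, so the result lines up with df
canonical = keyed[['ID', 'key']].merge(best, on=['ID', 'key'], how='left')['Name']
df['Name'] = canonical.to_numpy()
print(df)
```

This avoids the self-merge entirely, which matters at 16 million rows, but it only catches names that agree on the first word. A typo such as Garreett against Garrett would still need the fuzzy comparison.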