How to compare row by row in a dataframe
I have a dataframe with a name and a URL ID for that name. For example:
Abc 123
Abc.com 123
Def 345
Pqr 123
PQR.com 123
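Here, because of data extraction errors, different names sometimes share the same ID. I want to clean the table so that if the names are different but the ID is the same, the records stay unchanged. If the names are similar and the ID is also the same, the names should be changed to a single one. To be clear, the expected output should be: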
Abc.com 123
Abc.com 123
Def 345
PQR.com 123
PQR.com 123
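That is, the last one is a data entry error, and both are the same name (the first word of the string is the same), so both were changed to one name based on the ID.
The first and second records, however, have the same ID as the last records, but their names do not match those and are completely different.
I can't figure out how to achieve this.
Requesting some guidance here. Thanks in advance.
Note: the dataset has almost 16 million such records.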
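The idea is to use the fuzzy matching library fuzzywuzzy to get the ratio for all combinations of Name, built by joining the DataFrame to itself on ID with DataFrame.merge. Rows with the same name in both columns are removed with DataFrame.query, and a new column with the length of each name is added with Series.str.len: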
from fuzzywuzzy import fuzz
df1 = df.merge(df, on='ID').query('Name_x != Name_y')
df1['ratio'] = df1.apply(lambda x: fuzz.ratio(x['Name_x'], x['Name_y']), axis=1)
df1['len'] = df1['Name_x'].str.len()
print (df1)
Name_x ID Name_y ratio len
1 Abc 123 BCD 0 3
2 BCD 123 Abc 0 3
6 Pqr 789 PQR.com 20 3
7 PQR.com 789 Pqr 20 7
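To see where the ratio values 0 and 20 come from, here is a minimal sketch that approximates fuzz.ratio with the standard library. It assumes the pure-Python code path of fuzzywuzzy, which is built on difflib.SequenceMatcher; with python-Levenshtein installed the scores can differ slightly. simple_ratio is a hypothetical helper name, not part of fuzzywuzzy.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Similarity on a 0-100 scale: 2 * matched characters / total length."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


# The comparison is case-sensitive, so "Abc" and "BCD" share no characters.
print(simple_ratio("Abc", "BCD"))
# Only "P" matches: 2 * 1 / (3 + 7) = 0.2
print(simple_ratio("Pqr", "PQR.com"))
# After lowercasing, "pqr" matches fully: 2 * 3 / (3 + 7) = 0.6
print(simple_ratio("pqr", "pqr.com"))
```

Because of the case sensitivity, lowercasing both names before scoring may be worth considering when choosing the threshold N.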
Then filter rows by a threshold with boolean indexing. Next it is necessary to choose which value to keep; one possible solution is to take the longer text. So DataFrameGroupBy.idxmax is used with DataFrame.loc, and then DataFrame.set_index to get a Series:
N = 15
df2 = df1[df1['ratio'].gt(N)]
s = df2.loc[df2.groupby('ID')['len'].idxmax()].set_index('ID')['Name_x']
print (s)
ID
789 PQR.com
Name: Name_x, dtype: object
Finally, use Series.map by ID and replace non-matched values with the originals using Series.fillna:
df['Name'] = df['ID'].map(s).fillna(df['Name'])
print (df)
Name ID
0 Abc 123
1 BCD 123
2 Def 345
3 PQR.com 789
4 PQR.com 789
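The three steps above can be wrapped into one function. This is only a sketch: unify_names and its threshold parameter are hypothetical names, and difflib stands in for fuzzywuzzy so that the block runs without extra packages (for this sample it gives the same scores as shown above).

```python
from difflib import SequenceMatcher

import pandas as pd


def ratio(a, b):
    # Stand-in for fuzz.ratio (assumption: pure-Python fuzzywuzzy behaviour).
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def unify_names(df, threshold=15):
    """Return a copy of df where similar names sharing an ID get the longest name."""
    # All pairs of different names within the same ID.
    pairs = df.merge(df, on="ID").query("Name_x != Name_y")
    if pairs.empty:
        return df.copy()
    pairs = pairs.assign(
        ratio=[ratio(a, b) for a, b in zip(pairs["Name_x"], pairs["Name_y"])],
        len=pairs["Name_x"].str.len(),
    )
    # Keep only pairs above the threshold, then the longest name per ID.
    similar = pairs[pairs["ratio"].gt(threshold)]
    best = similar.loc[similar.groupby("ID")["len"].idxmax()].set_index("ID")["Name_x"]
    out = df.copy()
    out["Name"] = out["ID"].map(best).fillna(out["Name"])
    return out


df = pd.DataFrame({
    "Name": ["Abc", "BCD", "Def", "Pqr", "PQR.com"],
    "ID": [123, 123, 345, 789, 789],
})
out = unify_names(df, threshold=15)
print(out)
```

The function returns a new DataFrame instead of overwriting df, which makes it easier to compare results for different thresholds.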
EDIT: If there are more valid strings per ID, it is more complicated:
print (df)
               Name          ID
0      Air Ordnance  1578013421
1  Air-Ordnance.com  1578013421
2          Garreett  1578013421
3           Garrett  1578013421
First get fuzz.ratio as in the previous solution:
from fuzzywuzzy import fuzz
df1 = df.merge(df, on='ID').query('Name_x != Name_y')
df1['ratio'] = df1.apply(lambda x: fuzz.ratio(x['Name_x'], x['Name_y']), axis=1)
print (df1)
              Name_x          ID            Name_y  ratio
 1      Air Ordnance  1578013421  Air-Ordnance.com     79
 2      Air Ordnance  1578013421          Garreett     30
 3      Air Ordnance  1578013421           Garrett     32
 4  Air-Ordnance.com  1578013421      Air Ordnance     79
 6  Air-Ordnance.com  1578013421          Garreett     25
 7  Air-Ordnance.com  1578013421           Garrett     26
 8          Garreett  1578013421      Air Ordnance     30
 9          Garreett  1578013421  Air-Ordnance.com     25
11          Garreett  1578013421           Garrett     93
12           Garrett  1578013421      Air Ordnance     32
13           Garrett  1578013421  Air-Ordnance.com     26
14           Garrett  1578013421          Garreett     93
Then filter by the threshold:
N = 50
df2 = df1[df1['ratio'].gt(N)]
print (df2)
              Name_x          ID            Name_y  ratio
 1      Air Ordnance  1578013421  Air-Ordnance.com     79
 4  Air-Ordnance.com  1578013421      Air Ordnance     79
11          Garreett  1578013421           Garrett     93
14           Garrett  1578013421          Garreett     93
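The self-merge produces every pair twice (rows 1 and 4, rows 11 and 14), so half of the ratio calls are redundant, which may matter with around 16 million records. A possible sketch that scores each unordered pair only once per ID, using just the standard library. The names records, names_by_id and similar_pairs are illustrative, and difflib again stands in for fuzzywuzzy under the assumption that the score can be treated as symmetric.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

records = [
    ("Air Ordnance", 1578013421),
    ("Air-Ordnance.com", 1578013421),
    ("Garreett", 1578013421),
    ("Garrett", 1578013421),
]

# Collect the distinct names seen for each ID.
names_by_id = defaultdict(set)
for name, id_ in records:
    names_by_id[id_].add(name)


def similar_pairs(names_by_id, threshold):
    """Yield (ID, name_a, name_b, score) once per unordered pair above the threshold."""
    for id_, names in names_by_id.items():
        for a, b in combinations(sorted(names), 2):
            score = int(round(100 * SequenceMatcher(None, a, b).ratio()))
            if score > threshold:
                yield id_, a, b, score


pairs = list(similar_pairs(names_by_id, threshold=50))
for pair in pairs:
    print(pair)
```

Working on distinct names per ID also avoids rescoring names that are repeated many times in the data.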
But for more precision it is necessary to specify which strings are valid in a list L, and then filter by this list:
L = ['Air-Ordnance.com','Garrett']
df2 = df2.loc[df2['Name_x'].isin(L),['Name_x','Name_y','ID']].rename(columns={'Name_y':'Name'})
print (df2)
              Name_x          Name          ID
 4  Air-Ordnance.com  Air Ordnance  1578013421
14           Garrett      Garreett  1578013421
Finally, merge with a left join to the original and replace missing values:
df = df.merge(df2, on=['Name','ID'], how='left')
df['Name'] = df.pop('Name_x').fillna(df['Name'])
print (df)
               Name          ID
0  Air-Ordnance.com  1578013421
1  Air-Ordnance.com  1578013421
2           Garrett  1578013421
3           Garrett  1578013421
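Maintaining the list L by hand may not scale to many IDs. One possible alternative, sketched below under the assumption that the correct spelling is also the most frequent one within an ID, groups similar names into clusters and maps every member to the most common spelling. canonical_map, its threshold default and the sample with repeated names are hypothetical and not part of the original data; difflib stands in for fuzzywuzzy.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations


def ratio(a, b):
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def canonical_map(names, threshold=50):
    """Map each name (all from one ID) to the most frequent spelling in its cluster."""
    counts = Counter(names)
    parent = {n: n for n in counts}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    # Link names whose similarity exceeds the threshold.
    for a, b in combinations(sorted(counts), 2):
        if ratio(a, b) > threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for n in counts:
        clusters.setdefault(find(n), []).append(n)

    mapping = {}
    for members in clusters.values():
        # Most frequent spelling wins; ties go to the longer, then the later name.
        best = max(members, key=lambda n: (counts[n], len(n), n))
        for n in members:
            mapping[n] = best
    return mapping


# Hypothetical sample: the valid spellings occur more often than the variants.
names = ["Air Ordnance", "Air-Ordnance.com", "Air-Ordnance.com",
         "Garreett", "Garrett", "Garrett", "Garrett"]
mapping = canonical_map(names, threshold=50)
print(mapping)
```

If the frequency assumption does not hold for the real data, the explicit list L remains the safer choice.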