How to merge/add columns to dataframes in pandas when the joining column has slight spelling differences?
So I have a dataframe like this:
Rank State/Union territory NSDP Per Capita (Nominal)(2019–20)[1][2] state_id
0 1 Goa 466585.0 30.0
1 2 Sikkim 425656.0 11.0
2 3 Delhi 376143.0 NaN
3 4 Chandigarh NaN 4.0
4 5 Haryana 247207.0 6.0
5 6 Telangana 225756.0 0.0
6 7 Karnataka 223246.0 29.0
7 8 Kerala 221904.0 32.0
8 9 Puducherry 220949.0 34.0
9 10 Andaman and Nicobar Islands 219842.0 NaN
10 11 Tamil Nadu 218599.0 33.0
11 12 Gujarat 216329.0 24.0
12 13 Mizoram 204018.0 15.0
13 14 Uttarakhand 202895.0 5.0
14 15 Maharashtra 202130.0 27.0
15 16 Himachal Pradesh 190255.0 2.0
16 17 Andhra Pradesh 168480.0 28.0
17 18 Arunachal Pradesh 164615.0 NaN
18 19 Punjab 161083.0 3.0
20 20 Nagaland 130282.0 13.0
21 21 Tripura 125630.0 16.0
22 22 Rajasthan 115492.0 8.0
23 23 West Bengal 115348.0 19.0
24 24 Odisha 98896.0 21.0
25 25 Chhattisgarh 105281.0 22.0
26 26 Jammu and Kashmir 102882.0 NaN
27 27 Madhya Pradesh 103288.0 23.0
28 28 Meghalaya 92174.0 17.0
29 29 Assam 90758.0 18.0
30 30 Manipur 84746.0 14.0
31 31 Jharkhand 79873.0 20.0
32 32 Uttar Pradesh 65704.0 9.0
33 33 Bihar 46664.0 10.0
And I have a separate dictionary:
{'Telangana': 0, 'Andaman & Nicobar Island': 35, 'Andhra Pradesh': 28, 'Arunanchal Pradesh': 12, 'Assam': 18, 'Bihar': 10, 'Chhattisgarh': 22, 'Daman & Diu': 25, 'Goa': 30, 'Gujarat': 24, 'Haryana': 6, 'Himachal Pradesh': 2, 'Jammu & Kashmir': 1, 'Jharkhand': 20, 'Karnataka': 29, 'Kerala': 32, 'Lakshadweep': 31, 'Madhya Pradesh': 23, 'Maharashtra': 27, 'Manipur': 14, 'Chandigarh': 4, 'Puducherry': 34, 'Punjab': 3, 'Rajasthan': 8, 'Sikkim': 11, 'Tamil Nadu': 33, 'Tripura': 16, 'Uttar Pradesh': 9, 'Uttarakhand': 5, 'West Bengal': 19, 'Odisha': 21, 'Dadara & Nagar Havelli': 26, 'Meghalaya': 17, 'Mizoram': 15, 'Nagaland': 13, 'NCT of Delhi': 7}
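As you may have noticed, the problem is that Andaman and Nicobar Islands exists in both, but is spelled differently: in the dictionary it is 'Andaman & Nicobar Island'.
This leaves the last column as NaN:
9 10 Andaman and Nicobar Islands 219842.0 NaN
How can I use the difflib library to handle this?
I tried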
df_19_20['State/Union territory'] = df_19_20['State/Union territory'].apply(get_close_matches(df_19_20['State/Union territory'], id_d.keys()))
and
df_19_20['State/Union territory'] = get_close_matches(df_19_20['State/Union territory'], id_d.keys())
Is there something I am missing? How do I process the column to get the best match for each value?
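The problem is in how apply is being used. apply needs to be given a function, which it then calls with the value from each row it iterates over. In both attempts, get_close_matches is instead called once with the whole column as its first argument, when it expects a single string. You also need to clean up the return value of get_close_matches: it returns a list, so you need to take the first element.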
df_19_20['State/Union territory'].apply(lambda x: get_close_matches(x, id_d.keys())[0])
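That should work, as long as every name has at least one match above the cutoff. If get_close_matches returns an empty list for some row, the [0] raises an IndexError.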
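A minimal sketch of a more defensive variant, assuming a small subset of the data from the question. The helper name closest_key and the matched_key column are hypothetical and only there for illustration. It keeps the original names, stores the best dictionary key in a separate column, and returns None instead of failing when nothing clears the cutoff.

```python
from difflib import get_close_matches

import pandas as pd

# Small subset of the dictionary from the question.
id_d = {
    "Goa": 30,
    "Andaman & Nicobar Island": 35,
    "Arunanchal Pradesh": 12,
    "Jammu & Kashmir": 1,
    "NCT of Delhi": 7,
}

# Small subset of the dataframe from the question.
df_19_20 = pd.DataFrame(
    {
        "State/Union territory": [
            "Goa",
            "Andaman and Nicobar Islands",
            "Arunachal Pradesh",
            "Jammu and Kashmir",
            "Delhi",
        ]
    }
)


def closest_key(name, candidates, cutoff=0.6):
    """Return the best fuzzy match for name, or None if nothing clears cutoff."""
    matches = get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


keys = list(id_d)

# Keep the original spelling and store the matched dictionary key separately.
df_19_20["matched_key"] = df_19_20["State/Union territory"].apply(
    lambda name: closest_key(name, keys)
)

# Look up the id through the matched key; unmatched rows stay NaN.
df_19_20["state_id"] = df_19_20["matched_key"].map(id_d)

print(df_19_20)
```

Rows that still come out as NaN are worth checking by hand. A short name such as Delhi may not be similar enough to 'NCT of Delhi' at the default cutoff of 0.6, so either lower the cutoff for those rows or add a manual mapping for them.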