Map two DataFrames based on a column and create a new column, including partial matches
I have two dataframes.
One holds codes and values, and it needs to be mapped onto the other dataframe:
B = pd.DataFrame({'Code': ['a', 'b', 'c', 'a', 'e','b','b','c'],
'Value': ["House with indoor pool", "House with Gray_C_Door", "Big Chandelier",
"Window Glass", "Frame Window",'High Column','Wood Raling', 'Window Glass trim']})
The other dataframe contains a lot of data with values, and it needs a new column built from the 'Code' column of dataframe B.
A = pd.DataFrame({'Test': [2,34,12,45,np.nan,34,56,23,56,87,23,67,89,123,np.nan],
'Name': [ "House with indoor pool","House with Gray_C_Door",'House with indoor pool and Porch',"Wood Raling",
'Window Glass Tinted',"Windows Glass_with",'Big Chandelier', "Frame Window",np.nan,"Window glass","House with indoor pool",'High column with',
"Window Glass trim",'Frame Window',"glass Window"],
'Value': ["50", "100", "70", "20", "15",'75','50',"10", "10", "34", "5", "56",'12','83',np.nan]})
A.loc[:,'NewName'] = A['Name']
So I used the code below to replace A['NewName']:
A['NewName']= A['NewName'].replace(B.set_index('Value')['Code'])
Test Name Value NewName
0 2.0000 House with indoor pool 50 a
1 34.0000 House with Gray_C_Door 100 b
2 12.0000 House with indoor pool and Porch 70 House with indoor pool and Porch
3 45.0000 Wood Raling 20 b
4 NaN Window Glass Tinted 15 Window Glass Tinted
5 34.0000 Windows Glass_with 75 Windows Glass_with
6 56.0000 Big Chandelier 50 c
7 23.0000 Frame Window 10 e
8 56.0000 NaN 10 NaN
9 87.0000 Window glass 34 Window glass
10 23.0000 House with indoor pool 5 a
11 67.0000 High column with 56 High column with
12 89.0000 Window Glass trim 12 c
13 123.000 Frame Window 83 e
14 NaN glass Window NaN glass Window
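However, some of the A['NewName'] values do not match B['Value'] exactly, so this does not give the expected result.
Is there a way to match the values and assign the correct code when a B['Value'] only partially matches A['NewName']?
For example, when A['NewName'] is 'House with indoor pool and Porch', I want to match it with B['Value'] = 'House with indoor pool' and replace it with the correct B['Code'] = 'a'. I cannot add these variants to the Value column of dataframe B, because the text after 'House with indoor pool' can vary in many ways (e.g. 'House with indoor pool_with big Glass Door', 'House with indoor pool and high Railing', etc.).
Can this be done with the map/replace functions or with any other method?
Thanks in advance!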
You can use re (regular expressions) in a function of your own and apply that function to A['Name'] (by the way, initializing 'NewName' is not needed here):
import re
import pandas as pd
import numpy as np
B = pd.DataFrame({'Code': ['a', 'b', 'c', 'a', 'e','b','b','c'],
'Value': ["House with indoor pool", "House with Gray_C_Door", "Big Chandelier",
"Window Glass", "Frame Window",'High Column','Wood Raling', 'Window Glass trim']})
A = pd.DataFrame({'Test': [2,34,12,45,np.nan,34,56,23,56,87,23,67,89,123,np.nan],
'Name': [ "House with indoor pool","House with Gray_C_Door",'House with indoor pool and Porch',"Wood Raling",
'Window Glass Tinted',"Windows Glass_with",'Big Chandelier', "Frame Window",np.nan,"Window glass","House with indoor pool",'High column with',
"Window Glass trim",'Frame Window',"glass Window"],
'Value': ["50", "100", "70", "20", "15",'75','50',"10", "10", "34", "5", "56",'12','83',np.nan]})
B_series = B.set_index('Value')
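def get_code_for_matching_value(val):
    # Only strings can be matched; NaN and other types are returned unchanged.
    if isinstance(val, str):
        if val in B_series.index:
            # Exact match on B['Value'].
            return B_series.at[val, 'Code']
        else:
            # Partial match: the first B value that matches the start of val wins.
            for i in B_series.index:
                if re.match(i, val):
                    return B_series.at[i, 'Code']
    return val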
A['NewName']= A['Name'].apply(get_code_for_matching_value)
print(A)
Output:
Test Name Value NewName
0 2.0 House with indoor pool 50 a
1 34.0 House with Gray_C_Door 100 b
2 12.0 House with indoor pool and Porch 70 a
3 45.0 Wood Raling 20 b
4 NaN Window Glass Tinted 15 a
5 34.0 Windows Glass_with 75 Windows Glass_with
6 56.0 Big Chandelier 50 c
7 23.0 Frame Window 10 e
8 56.0 NaN 10 NaN
9 87.0 Window glass 34 Window glass
10 23.0 House with indoor pool 5 a
11 67.0 High column with 56 High column with
12 89.0 Window Glass trim 12 c
13 123.0 Frame Window 83 e
14 NaN glass Window NaN glass Window
Note: you can improve the matching by making it case-insensitive, for example: if re.match(i.lower(), val.lower()):
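As a rough sketch of that case-insensitive idea (not part of the answer above), the loop can pass re.IGNORECASE instead of lowering both strings. Two further choices here are my own assumptions: re.escape, so that a B value containing regex characters is treated literally, and trying the longest values first, so that 'Window Glass trim' is tested before 'Window Glass'. The names lookup and code_for_prefix are hypothetical.

```python
import re

import numpy as np
import pandas as pd

B = pd.DataFrame({
    'Code': ['a', 'b', 'c', 'a', 'e', 'b', 'b', 'c'],
    'Value': ["House with indoor pool", "House with Gray_C_Door", "Big Chandelier",
              "Window Glass", "Frame Window", 'High Column', 'Wood Raling',
              'Window Glass trim'],
})

# Longest values first (an assumption), so the most specific prefix is tried first.
lookup = sorted(zip(B['Value'], B['Code']), key=lambda pair: len(pair[0]), reverse=True)


def code_for_prefix(val):
    """Return the code of the first B value that starts val, ignoring case."""
    if not isinstance(val, str):
        return val  # NaN and other non-strings pass through unchanged
    for value, code in lookup:
        # re.escape makes the B value a literal pattern rather than a regex.
        if re.match(re.escape(value), val, flags=re.IGNORECASE):
            return code
    return val  # no prefix matched: keep the original name


names = pd.Series(["Window glass", "High column with",
                   "Window Glass trim and sill", "glass Window", np.nan])
result = names.apply(code_for_prefix)
print(result.tolist())
```

This still only matches from the start of the name, so 'glass Window' is left as it is.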
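Here is a way to do it using difflib. For every Name without an exact match, we effectively compute the outer product of comparisons between Name and Value, so that we can find the best fit (according to difflib) before making a choice.
To flag truly bad matches, we can set a threshold on the difflib match ratio, below which we return NaN. I chose 0.5 in the code below, which catches the sample input Name "I don't match anything".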
import pandas as pd
import numpy as np
B = pd.DataFrame({'Code': ['a', 'b', 'c', 'a', 'e','b','b','c'],
'Value': ["House with indoor pool", "House with Gray_C_Door", "Big Chandelier",
"Window Glass", "Frame Window",'High Column','Wood Raling', 'Window Glass trim']})
A = pd.DataFrame({'Test': [2,34,12,45,np.nan,34,56,23,56,87,23,67,89,123,np.nan,333],
'Name': [ "House with indoor pool","House with Gray_C_Door",'House with indoor pool and Porch',"Wood Raling",
'Window Glass Tinted',"Windows Glass_with",'Big Chandelier', "Frame Window",np.nan,"Window glass","House with indoor pool",'High column with',
"Window Glass trim",'Frame Window',"glass Window","I don't match anything"],
'Value': ["50", "100", "70", "20", "15",'75','50',"10", "10", "34", "5", "56",'12','83',np.nan,'999']})
from difflib import SequenceMatcher
bValues = B['Value'].to_list()
bValuesDict = {value : idx for idx, value in enumerate(bValues)}
columnCode = B.columns.to_list().index('Code')
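def getCode(x):
    name = x['Name']
    # Non-string names (e.g. NaN) cannot be matched.
    if not isinstance(name, str):
        return np.nan
    if name in bValuesDict:
        # Exact match on B['Value'].
        idx = bValuesDict[name]
    else:
        # Similarity ratio of this name against every B value.
        bMatches = [SequenceMatcher(None, name, value).ratio() for value in bValues]
        maxRatio = max(bMatches)
        # Reject matches below the 0.5 threshold.
        if maxRatio < 0.5:
            return np.nan
        idx = bMatches.index(maxRatio)
    code = B.iloc[idx, columnCode]
    return code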
A['NewName'] = A.apply(getCode, axis=1)
print(A)
Output:
Test Name Value NewName
0 2.0 House with indoor pool 50 a
1 34.0 House with Gray_C_Door 100 b
2 12.0 House with indoor pool and Porch 70 a
3 45.0 Wood Raling 20 b
4 NaN Window Glass Tinted 15 c
5 34.0 Windows Glass_with 75 a
6 56.0 Big Chandelier 50 c
7 23.0 Frame Window 10 e
8 56.0 NaN 10 NaN
9 87.0 Window glass 34 a
10 23.0 House with indoor pool 5 a
11 67.0 High column with 56 b
12 89.0 Window Glass trim 12 c
13 123.0 Frame Window 83 e
14 NaN glass Window NaN e
15 333.0 I don't match anything 999 NaN
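For comparison, the same "best ratio above a threshold" idea can be sketched with difflib.get_close_matches, which ranks candidates by SequenceMatcher ratio and takes a cutoff argument. This is a variation rather than the code above: lower-casing both sides is my own assumption, and the names code_by_value and closest_code are hypothetical.

```python
from difflib import get_close_matches

import numpy as np
import pandas as pd

B = pd.DataFrame({
    'Code': ['a', 'b', 'c', 'a', 'e', 'b', 'b', 'c'],
    'Value': ["House with indoor pool", "House with Gray_C_Door", "Big Chandelier",
              "Window Glass", "Frame Window", 'High Column', 'Wood Raling',
              'Window Glass trim'],
})

# Lower-cased value -> code, so the comparison ignores case (an assumption).
code_by_value = {value.lower(): code for value, code in zip(B['Value'], B['Code'])}


def closest_code(name, cutoff=0.5):
    """Return the code of the most similar B value, or NaN below the cutoff."""
    if not isinstance(name, str):
        return np.nan
    # n=1 keeps only the best candidate whose ratio is at least the cutoff.
    matches = get_close_matches(name.lower(), list(code_by_value), n=1, cutoff=cutoff)
    return code_by_value[matches[0]] if matches else np.nan


names = pd.Series(["House with indoor pool and Porch", "Window glass",
                   "I don't match anything", np.nan])
codes = names.apply(closest_code)
print(codes.tolist())
```

Because the comparison is case-insensitive, results for names that differ from B only in case may differ from the case-sensitive output shown above.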