Map two dataframes based on a column and create a new column, also handling partial matches

I have two dataframes.

One has codes and values and needs to be mapped onto the other dataframe:

import numpy as np
import pandas as pd

B = pd.DataFrame({'Code': ['a', 'b', 'c', 'a', 'e','b','b','c'],
                  'Value': ["House with indoor pool", "House with Gray_C_Door", "Big Chandelier",
                            "Window Glass", "Frame Window",'High Column','Wood Raling', 'Window Glass trim']})

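The other dataframe contains a lot of data with values, and a new column needs to be created in it based on the 'Code' column of dataframe B.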

A = pd.DataFrame({'Test': [2,34,12,45,np.nan,34,56,23,56,87,23,67,89,123,np.nan],
                  'Name': [ "House with indoor pool","House with Gray_C_Door",'House with indoor pool and Porch',"Wood Raling",
                           'Window Glass Tinted',"Windows Glass_with",'Big Chandelier', "Frame Window",np.nan,"Window glass","House with indoor pool",'High column with',
                           "Window Glass trim",'Frame Window',"glass Window"],
                 'Value': ["50", "100", "70", "20", "15",'75','50',"10", "10", "34", "5", "56",'12','83',np.nan]})
A.loc[:,'NewName'] = A['Name']

So I used the code below to replace the values in A['NewName'].

A['NewName']= A['NewName'].replace(B.set_index('Value')['Code'])

    Test    Name                                Value   NewName
0   2.0000  House with indoor pool              50      a
1   34.0000 House with Gray_C_Door              100     b
2   12.0000 House with indoor pool and Porch    70      House with indoor pool and Porch
3   45.0000 Wood Raling                         20      b
4   NaN     Window Glass Tinted                 15      Window Glass Tinted
5   34.0000 Windows Glass_with                  75      Windows Glass_with
6   56.0000 Big Chandelier                      50      c
7   23.0000 Frame Window                        10      e
8   56.0000 NaN                                 10      NaN
9   87.0000 Window glass                        34      Window glass
10  23.0000 House with indoor pool              5       a
11  67.0000 High column with                    56      High column with
12  89.0000 Window Glass trim                   12      c
13  123.000 Frame Window                        83      e
14  NaN     glass Window                        NaN     glass Window

However, some A['NewName'] values do not match B['Value'] exactly, so this does not give the expected result.
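A minimal sketch of why this happens, using two illustrative names and a one-entry mapping rather than the full data: when `replace` is given a dict (or a Series), it substitutes only values that equal a key in full, so a longer name that merely starts with a key is left as it is.

```python
import pandas as pd

# Two sample names: one equal to a key, one that only starts with it.
names = pd.Series(["House with indoor pool", "House with indoor pool and Porch"])
mapping = {"House with indoor pool": "a"}

# With a dict, replace() substitutes only values equal to a key in full.
replaced = names.replace(mapping)
print(replaced.tolist())
```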

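Is there a way to match these values and assign the correct code when B['Value'] only partially matches A['NewName']? For example, when A['NewName'] is 'House with indoor pool and Porch', I want to match it to B['Value'] = 'House with indoor pool' and replace it with the correct B['Code'] = 'a'. I can't add it to the Value column of dataframe B, because the text after 'House with indoor pool' can vary in many ways (e.g. 'House with indoor pool_with big glass door', 'House with indoor pool and high railing', etc.).

Can this be done with the map/replace functions or any other method?

Thanks in advance!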

You can use re (regular expressions) in your own function and apply that function to A['Name'] (by the way, initializing 'NewName' is unnecessary here):

import re
import pandas as pd
import numpy as np

B = pd.DataFrame({'Code': ['a', 'b', 'c', 'a', 'e','b','b','c'],
                  'Value': ["House with indoor pool", "House with Gray_C_Door", "Big Chandelier",
                            "Window Glass", "Frame Window",'High Column','Wood Raling', 'Window Glass trim']})

A = pd.DataFrame({'Test': [2,34,12,45,np.nan,34,56,23,56,87,23,67,89,123,np.nan],
                  'Name': [ "House with indoor pool","House with Gray_C_Door",'House with indoor pool and Porch',"Wood Raling",
                           'Window Glass Tinted',"Windows Glass_with",'Big Chandelier', "Frame Window",np.nan,"Window glass","House with indoor pool",'High column with',
                           "Window Glass trim",'Frame Window',"glass Window"],
                 'Value': ["50", "100", "70", "20", "15",'75','50',"10", "10", "34", "5", "56",'12','83',np.nan]})

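# Index B by 'Value' so a code can be looked up by its value text.
B_series = B.set_index('Value')

def get_code_for_matching_value(val):
    # Non-string entries (e.g. NaN) are returned unchanged.
    if isinstance(val, str):
        if val in B_series.index:
            # Exact match.
            return B_series.at[val, 'Code']
        else:
            # Partial match: the first B value that val starts with.
            for i in B_series.index:
                # re.escape treats the value as literal text, not a pattern.
                if re.match(re.escape(i), val):
                    return B_series.at[i, 'Code']
    return val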
        
A['NewName']= A['Name'].apply(get_code_for_matching_value)
print(A)

Output:

     Test                              Name Value             NewName
0     2.0            House with indoor pool    50                   a
1    34.0            House with Gray_C_Door   100                   b
2    12.0  House with indoor pool and Porch    70                   a
3    45.0                       Wood Raling    20                   b
4     NaN               Window Glass Tinted    15                   a
5    34.0                Windows Glass_with    75  Windows Glass_with
6    56.0                    Big Chandelier    50                   c
7    23.0                      Frame Window    10                   e
8    56.0                               NaN    10                 NaN
9    87.0                      Window glass    34        Window glass
10   23.0            House with indoor pool     5                   a
11   67.0                  High column with    56    High column with
12   89.0                 Window Glass trim    12                   c
13  123.0                      Frame Window    83                   e
14    NaN                      glass Window   NaN        glass Window

Note: you can improve the matching by making it case-insensitive, for example: if re.match(i.lower(), val.lower()):
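As a rough sketch of that case-insensitive idea, the version below uses only the standard library and a plain dict copied from dataframe B. The names `value_to_code`, `patterns` and `lookup_code` are made up for illustration, and trying longer values first is an assumption about the intended behaviour, not something the question specifies.

```python
import re

# Value -> Code pairs copied from dataframe B above.
value_to_code = {
    "House with indoor pool": "a",
    "House with Gray_C_Door": "b",
    "Big Chandelier": "c",
    "Window Glass": "a",
    "Frame Window": "e",
    "High Column": "b",
    "Wood Raling": "b",
    "Window Glass trim": "c",
}

# Try longer values first so that "Window Glass trim ..." is not
# captured by the shorter "Window Glass" (an assumption about intent).
patterns = [
    (re.compile(re.escape(value), re.IGNORECASE), code)
    for value, code in sorted(
        value_to_code.items(), key=lambda kv: len(kv[0]), reverse=True
    )
]

def lookup_code(name):
    """Return the code of the first value that prefixes name, else name unchanged."""
    if not isinstance(name, str):
        return name  # leave NaN and other non-strings alone
    for pattern, code in patterns:
        if pattern.match(name):  # match() anchors at the start of the string
            return code
    return name
```

It could then be applied as A['NewName'] = A['Name'].apply(lookup_code). Names that merely contain the same words in another order, such as "glass Window", are still left unchanged by this prefix test.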

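Here is an approach using difflib. For every Name that has no exact match, we effectively compare it against every Value (an outer product of comparisons) and pick the best fit according to difflib.

To flag really poor matches, we can set a threshold on the difflib match ratio, below which we return NaN. I chose 0.5 in the code below, which catches the sample input Name "I don't match anything".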

import pandas as pd
import numpy as np
B = pd.DataFrame({'Code': ['a', 'b', 'c', 'a', 'e','b','b','c'],
                  'Value': ["House with indoor pool", "House with Gray_C_Door", "Big Chandelier",
                            "Window Glass", "Frame Window",'High Column','Wood Raling', 'Window Glass trim']})
A = pd.DataFrame({'Test': [2,34,12,45,np.nan,34,56,23,56,87,23,67,89,123,np.nan,333],
                  'Name': [ "House with indoor pool","House with Gray_C_Door",'House with indoor pool and Porch',"Wood Raling",
                           'Window Glass Tinted',"Windows Glass_with",'Big Chandelier', "Frame Window",np.nan,"Window glass","House with indoor pool",'High column with',
                           "Window Glass trim",'Frame Window',"glass Window","I don't match anything"],
                 'Value': ["50", "100", "70", "20", "15",'75','50',"10", "10", "34", "5", "56",'12','83',np.nan,'999']})

from difflib import SequenceMatcher
bValues = B['Value'].to_list()
bValuesDict = {value : idx for idx, value in enumerate(bValues)}
columnCode = B.columns.to_list().index('Code')
def getCode(x):
    name = x['Name']
    if not isinstance(name, str):
        return np.nan
    if name in bValuesDict: 
        idx = bValuesDict[name]
    else:
        bMatches = [SequenceMatcher(None, name, value).ratio() for value in bValues]
        maxRatio = max(bMatches)
        if maxRatio < 0.5:
            return np.nan
        idx = bMatches.index(maxRatio)
    code = B.iloc[idx, columnCode]
    return code
A['NewName'] = A.apply(getCode, axis=1)

print(A)

Output:

     Test                              Name Value NewName
0     2.0            House with indoor pool    50       a
1    34.0            House with Gray_C_Door   100       b
2    12.0  House with indoor pool and Porch    70       a
3    45.0                       Wood Raling    20       b
4     NaN               Window Glass Tinted    15       c
5    34.0                Windows Glass_with    75       a
6    56.0                    Big Chandelier    50       c
7    23.0                      Frame Window    10       e
8    56.0                               NaN    10     NaN
9    87.0                      Window glass    34       a
10   23.0            House with indoor pool     5       a
11   67.0                  High column with    56       b
12   89.0                 Window Glass trim    12       c
13  123.0                      Frame Window    83       e
14    NaN                      glass Window   NaN       e
15  333.0            I don't match anything   999     NaN
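
A shorter variant of the same idea could lean on difflib.get_close_matches, which wraps the ratio comparison and the cutoff in one call. This is only a sketch: `value_to_code` and `closest_code` are illustrative names, the 0.5 cutoff is carried over from above, and borderline names may come out differently from the table because get_close_matches passes the two strings to SequenceMatcher in the opposite order.

```python
from difflib import get_close_matches

# Value -> Code pairs copied from dataframe B above.
value_to_code = {
    "House with indoor pool": "a",
    "House with Gray_C_Door": "b",
    "Big Chandelier": "c",
    "Window Glass": "a",
    "Frame Window": "e",
    "High Column": "b",
    "Wood Raling": "b",
    "Window Glass trim": "c",
}

def closest_code(name, cutoff=0.5):
    """Return the code of the most similar value, or None if nothing reaches cutoff."""
    if not isinstance(name, str):
        return None  # NaN and other non-strings have no match
    # n=1 keeps only the best-scoring candidate at or above the cutoff.
    matches = get_close_matches(name, list(value_to_code), n=1, cutoff=cutoff)
    return value_to_code[matches[0]] if matches else None
```

It could be applied with A['NewName'] = A['Name'].apply(closest_code). Like SequenceMatcher itself, this comparison is case-sensitive, so lowercasing both sides first may be worth considering.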