Python dynamic fuzzy logic join

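I am trying to create a dynamic fuzzy logic join between 2 tables. By dynamic I mean that parameters specify which variables the two tables are joined on. The code below is a modified version of the static code from this question: Python Pandas fuzzy merge/match with duplicates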

I put together the dynamic code below:

import pandas as pd
import datetime
from fuzzywuzzy import fuzz
import difflib 

donors = pd.DataFrame({"name": pd.Series(["John Doe","John Doe","Tom Smith","Jane Doe","Jane Doe","Kat test"]), "Email": pd.Series(['a@a.ca','a@a.ca','b@b.ca','c@c.ca','something@a.ca','d@d.ca']),"Date": (["27/03/2013  10:00:00 AM","1/03/2013  10:39:00 AM","2/03/2013  10:39:00 AM","3/03/2013  10:39:00 AM","4/03/2013  10:39:00 AM","27/03/2013  10:39:00 AM"])})
fundraisers = pd.DataFrame({"name": pd.Series(["John Doe","John Doe","Kathy test","Tes Ester", "Jane Doe"]),"Email": pd.Series(['a@a.ca','a@a.ca','d@d.ca','asdf@asdf.ca','something@a.ca']),"Date": pd.Series(["2/03/2013  10:39:00 AM","27/03/2013  11:39:00 AM","3/03/2013  10:39:00 AM","4/03/2013  10:40:00 AM","27/03/2013  10:39:00 AM"])})
donors["Date"] = pd.to_datetime(donors["Date"], dayfirst=True)
fundraisers["Date"] = pd.to_datetime(fundraisers["Date"], dayfirst=True)
donors["code"] = donors.apply(lambda row: str(row['name'])+' '+str(row['Email']), axis=1)
idx = donors.groupby('code')["Date"].transform(min) == donors['Date']
donors = donors[idx].reset_index().drop('index',1)

def get_donors_v1(fund_var,don_var, don_tab,row=None):
    d = don_tab.apply(lambda x: fuzz.ratio(x["%s" % don_var], 'row["%s" %fund_var]') * 2, axis=1)
    d = d[d >= 75]
    if len(d) == 0:
        v = ['']*3
    else:
        v = don_tab.ix[d.idxmax(), ["%s"% don_var ,'Email','Date']].values
    return pd.Series(v, index=['donor name', 'donor email', 'donor date'])

trial=pd.concat((fundraisers, fundraisers.apply(get_donors_v1(fund_var="name",don_var="name",don_tab=donors), axis=1)), axis=1)

I get the following error:

TypeError: get_donors_v1() takes exactly 4 arguments (3 given)

Should I change the function signature to:

get_donors_v1(row=None,fund_var,don_var, don_tab)

Then I get the following error:

TypeError: ("'NoneType' object has no attribute '__getitem__'", u'occurred at index 0')

Please help.

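In your code sample, you give the 'row' parameter of get_donors_v1() a default value of None. On the next line you use row as a mapping (row["%s" % fund_var]) without first testing that the object exists, i.e. that it is not None.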
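The failure can be reproduced without pandas. The sketch below uses two hypothetical helpers, lookup and lookup_checked, which are not part of the original code. They only illustrate what happens when a parameter defaulting to None is subscripted, and how a guard avoids it.

```python
def lookup(fund_var, row=None):
    # row defaults to None; subscripting it fails before any matching happens
    return row["%s" % fund_var]


try:
    lookup("name")  # row is never supplied, so it stays None
except TypeError as exc:
    error = str(exc)  # the exact wording differs between Python versions


def lookup_checked(fund_var, row=None):
    # Guard against a missing row instead of indexing None
    if row is None:
        return None
    return row[fund_var]


found = lookup_checked("name", row={"name": "John Doe"})
missing = lookup_checked("name")
```

A guard like this removes the exception, but it does not make the join work: the row still has to reach the function somehow.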

Indexing an object, as in row["%s" % fund_var], calls its __getitem__ method, which None does not have.
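One possible way to restructure the call, as a sketch rather than a definitive fix: pass the function itself to DataFrame.apply and let apply supply each row as the first positional argument, forwarding the remaining parameters as keyword arguments. The example makes several assumptions. The name get_donors_v2, the threshold parameter and the trimmed sample data are illustrative. difflib.SequenceMatcher stands in for fuzz.ratio so that the block runs without fuzzywuzzy. .loc replaces .ix, which recent pandas versions no longer provide. The Date column and the "* 2" scaling from the question are left out for brevity.

```python
import difflib

import pandas as pd

donors = pd.DataFrame({
    "name": ["John Doe", "Tom Smith", "Jane Doe", "Kat test"],
    "Email": ["a@a.ca", "b@b.ca", "c@c.ca", "d@d.ca"],
})
fundraisers = pd.DataFrame({
    "name": ["John Doe", "Kathy test", "Tes Ester"],
    "Email": ["a@a.ca", "d@d.ca", "asdf@asdf.ca"],
})


def ratio(a, b):
    # Stand-in for fuzz.ratio (assumption): similarity score from 0 to 100
    return 100 * difflib.SequenceMatcher(None, str(a), str(b)).ratio()


def get_donors_v2(row, fund_var, don_var, don_tab, threshold=75):
    # row comes first because DataFrame.apply passes it positionally
    scores = don_tab[don_var].apply(lambda value: ratio(value, row[fund_var]))
    scores = scores[scores >= threshold]
    if scores.empty:
        values = ["", ""]
    else:
        # Keep the best-scoring donor for this fundraiser row
        values = don_tab.loc[scores.idxmax(), [don_var, "Email"]].tolist()
    return pd.Series(values, index=["donor name", "donor email"])


# Pass the function object, not the result of calling it;
# the extra keyword arguments are forwarded to get_donors_v2 for every row.
matches = fundraisers.apply(
    get_donors_v2, axis=1, fund_var="name", don_var="name", don_tab=donors
)
trial = pd.concat((fundraisers, matches), axis=1)
```

Written this way, row is never None inside the function, and the join columns remain configurable through fund_var and don_var.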