How to apply a complex lambda function to a Pandas DataFrame with a long list of elements per row

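I have a pandas DataFrame where each row of every column contains a long string (see the variable 'dframe'). In a separate list I store all the keywords, which I have to compare with every word of every string in the DataFrame. If a keyword is found, I have to record it as a success and mark the sentence in which it was found. I am using a complex for loop with a few 'if' statements, which gives me the correct output but is not very efficient. On my whole set, where I have 130 keywords and thousands of rows to iterate over, it takes nearly 4 hours to run.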

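I would like to apply some lambda function to optimise this, and that is what I am struggling with. Below I present the idea of my data set and my current code.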

import pandas as pd
from fuzzywuzzy import fuzz


dframe = pd.DataFrame({ 'Email' : ['this is a first very long e-mail about fraud and money',
                           'this is a second e-mail about money',
                           'this would be a next message where people talk about secret information',
                           'this is a sentence where someone misspelled word frad',
                           'this sentence has no keyword']})

keywords = ['fraud','money','secret']


keyword_set = set(keywords)

dframe['Flag'] = False
# one list of matched keywords per row
dframe['part_word'] = [[] for _ in range(len(dframe))]
output = []


for k in range(0, len(keywords)):
    count_ = 0
    dframe['Flag'] = False
    for j in range(0, len(dframe['Email'])):
        # keywords already found in this row (the same list object as in the cell)
        row_list = dframe.at[j, 'part_word']
        print(str(k) + '  /  ' + str(len(keywords)) + '  ||  ' +  str(j) + '  /  ' + str(len(dframe['Email'])))
        for i in dframe.at[j, 'Email'].split():
            fuz_part = fuzz.partial_ratio(keywords[k].lower(),i.lower())
            fuz_set = fuzz.token_set_ratio(keywords[k],i)

            if ((fuz_part > 90) | (fuz_set > 85)) & (len(i) > 3):
                if keywords[k] not in row_list:
                    row_list.append(keywords[k])
                    print(keywords[k] + '  found as :  ' + i)
                # .at avoids chained assignment such as dframe['Flag'][j] = True
                dframe.at[j, 'Flag'] = True


    count_ = dframe['Flag'].values.sum()
    if count_ > 0:

        y = keywords[k] + ' ' + str(count_)
        output.append(y)
    else:
        y = keywords[k] + ' ' + '0'
        output.append(y)          

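Maybe someone with experience in lambda functions could give me a hint on how to apply one to my DataFrame to perform a similar operation? It would somehow have to apply fuzzy matching inside the lambda after splitting the whole sentence in each row into separate words, and pick the word with the highest match value, on the condition that the value is greater than 85 or 90. This is where I am confused. Thanks in advance for any help.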

I don't have a lambda function for you, but here is a function that you can apply to dframe.Email:

import pandas as pd
from fuzzywuzzy import fuzz

First, create the same example DataFrame as yours:

dframe = pd.DataFrame({ 'Email' : ['this is a first very long e-mail about fraud and money',
                       'this is a second e-mail about money',
                       'this would be a next message where people talk about secret information',
                       'this is a sentence where someone misspelled word frad',
                       'this sentence has no keyword']})

keywords = ['fraud','money','secret']

This is the function to apply:

def fct(sntnc, kwds):
    mtch = []
    for kwd in kwds:
        fuz_part = [fuzz.partial_ratio(kwd.lower(), w.lower()) > 90 for w in sntnc.split()]
        fuz_set = [fuzz.token_set_ratio(kwd, w) > 85 for w in sntnc.split()]
        bL = [len(w) > 3 for w in sntnc.split()]
        mtch.append(any([(p | s) & l for p, s, l in zip(fuz_part, fuz_set, bL)]))
    return mtch

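For each keyword, it computes fuz_part > 90 for every word in the sentence, and does the same for fuz_set > 85 and for wordlength > 3. Finally, for each keyword, it stores in the list whether any word of the sentence satisfies ((fuz_part > 90) | (fuz_set > 85)) & (wordlength > 3).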
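The rule described above ("any sufficiently long word in the sentence matches the keyword") can also be sketched with a single split per sentence and a short-circuiting `any`. This is only an illustration under assumptions: so that it runs without fuzzywuzzy, it uses a hypothetical stand-in scorer `score` built on the standard library's `difflib.SequenceMatcher`. Its scores are not the same as `fuzz.partial_ratio` or `fuzz.token_set_ratio`, and the threshold of 75 is an assumed value picked for this example only. The names `score` and `match_keywords` are mine, not part of any library.

```python
from difflib import SequenceMatcher


def score(keyword, word):
    """Stand-in similarity on a 0-100 scale (NOT fuzzywuzzy's scorer)."""
    return 100 * SequenceMatcher(None, keyword.lower(), word.lower()).ratio()


def match_keywords(sentence, keywords, threshold=75, min_len=4):
    """Return one boolean per keyword: does any long-enough word match it?"""
    # Split once and drop short words up front (len(w) > 3 in the original rule).
    words = [w for w in sentence.split() if len(w) >= min_len]
    # any() stops at the first matching word instead of scoring all of them.
    return [any(score(kwd, w) > threshold for w in words) for kwd in keywords]


keywords = ['fraud', 'money', 'secret']
flags = match_keywords('someone misspelled word frad', keywords)
```

To use the real scorers, the body of `score` would be replaced by the `partial_ratio` / `token_set_ratio` test from `fct`, with the thresholds 90 and 85. The steps below keep using `fct`.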

This is how it is applied and how the result is created:

s = dframe.Email.apply(fct, kwds=keywords)
s = s.apply(pd.Series).set_axis(keywords, axis=1)
dframe = pd.concat([dframe, s], axis=1)

Result:

result = dframe.drop(columns='Email')
#    fraud  money  secret
# 0   True   True   False
# 1  False   True   False
# 2  False  False    True
# 3   True  False   False
# 4  False  False   False

result.sum()
# fraud     2
# money     2
# secret    1
# dtype: int64
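Since the question is mainly about running time, one thing that may be worth trying is to avoid scoring the same (keyword, word) pair more than once: e-mails tend to share most of their vocabulary, so the same words get compared against the same 130 keywords again and again. Below is a minimal sketch of that idea using `functools.lru_cache`. The names `is_match` and `flag_emails` are hypothetical, and the scorer is again a `difflib` stand-in with an assumed threshold so that the example runs with the standard library alone; I have not measured how much this saves on real data.

```python
from difflib import SequenceMatcher
from functools import lru_cache


@lru_cache(maxsize=None)
def is_match(keyword, word):
    """Hypothetical predicate, cached so each (keyword, word) pair is scored once.

    Stand-in scorer: swap the body for the fuzz.partial_ratio /
    fuzz.token_set_ratio test used in fct when fuzzywuzzy is available.
    """
    if len(word) <= 3:
        return False
    return SequenceMatcher(None, keyword, word).ratio() > 0.75  # assumed threshold


def flag_emails(emails, keywords):
    """Return one row of booleans (one per keyword) for every e-mail."""
    keywords = [k.lower() for k in keywords]
    rows = []
    for email in emails:
        words = set(email.lower().split())  # unique words of this e-mail
        rows.append([any(is_match(k, w) for w in words) for k in keywords])
    return rows


emails = ['this is a second e-mail about money',
          'another e-mail about money and fraud']
rows = flag_emails(emails, ['fraud', 'money'])
```

With a DataFrame, the rows could be turned into the same kind of table as above with something like `pd.DataFrame(flag_emails(dframe.Email, keywords), columns=keywords, index=dframe.index)`.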