Match word frequency and assign the max-scoring category and subcategory from another DataFrame in pandas

Question

Input:

import pandas as pd

df = pd.DataFrame([[121,'Customer Comments xxxx ttttt','loan, mortgage, payment, refinance, rate, new, time, credit, pay, current'],
[34,'Customer Comments xxxx','loan, mortgage, payment, refinance, rate, new, time, credit, pay, services'],
[356,'Customer Comments xxxx','loss, make, payment, refinance, rate, new, time, credit, pay, current'],
[908,'Customer Comments aaaaa','portal, improve, online, top, covid, web, deal, competitive, take, lost'],
[4356,'Customer Comments aaassds','portal, improve, website, top, covid, web, deal, competitive, take, care'],
[3333,'Customer Comments xxxx','communication, make, sure, process, rate, company, timely, interest, customer, know'],
[33456,'Customer Comments xxxx','communication, make, sure, process, rate, company, timely, interest, customer, lot']]
          , columns=['Loan Number','Commetns','Topic_Keywords'])


df2 = pd.DataFrame([[0,'loan, mortgage, payment, refinance, rate, new, time, credit, pay, current','Servicing','Refinance'],
[5,'closing, survey, time, notary, company, date, title, day, close, cost','Origination','Loan closing'],
[9,'service, customer, keep, good, work, excellent, great, continue, job, company','Servicing','good service'],
[6,'loan, phone, call, process, person, email, contact, time, processor, communication','Servicing','phone call process'],
[4, 'loan, helpful, processor, officer, professional, staff, knowledgeable, hire, work, process','Servicing','Staff/Agent behaviour'],
[3, 'process, easy, nothing, refinance, entire, whole, experience, time, everything, start','Origination','OnBoarding'],
[8, 'great, experience, everything, job, overall, company, nothing, work, mortgage, everyone','Servicing','good service'],
[1, 'portal, improve, online, top, covid, web, deal, competitive, take, care','Servicing','websites'],
[2, 'communication, make, sure, process, rate, company, timely, interest, customer, know',  'Origination','OnBoarding'],
[7, 'process, anything, website, app, change, think, easy, thing, use, mobile', 'Servicing','websites']]
,columns=['Dominant_Topic','Topic_Keywords','Cate','SubCategory'])

Expected output:

outdf=pd.DataFrame([[121,'Customer Comments xxxx ttttt','loan, mortgage, payment, refinance, rate, new, time, credit, pay, current','Servicing','Refinance',10,100],
[34,'Customer Comments xxxx','loan, mortgage, payment, refinance, rate, new, time, credit, pay, services','Servicing','Refinance',9,90],
[356,'Customer Comments xxxx','loss, make, payment, refinance, rate, new, time, credit, pay, current','Servicing','Refinance',8,80],
[908,'Customer Comments aaaaa','portal, improve, online, top, covid, web, deal, competitive, take, lost','Servicing','websites',9,90],
[4356,'Customer Comments aaassds','portal, improve, website, top, covid, web, deal, competitive, take, care','Servicing','websites',9,90],
[3333,'Customer Comments xxxx','communication, make, sure, process, rate, company, timely, interest, customer, know','Origination','OnBoarding',10,100],
[33456,'Customer Comments xxxx','communication, make, sure, process, rate, company, timely, interest, customer, lot','Origination','OnBoarding',9,90]],
columns=['Loan Number','Commetns','Topic_Keywords','Category','subCategory','String_match','match_score'])

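I ran topic modeling and obtained a topic for each comment. I want to assign the category and subcategory from another DataFrame (df2) using the Topic_Keywords column: for each row in df, count the keywords it shares with each row of df2, then iterate over the rows and pick the one with the highest word-match score.

The result should be the category and subcategory of the row with the maximum score.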
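
To make the scoring concrete, here is a minimal sketch of how the match count and score could be computed for a single pair of keyword strings. The helper name `match_count` is hypothetical, and the score assumes every topic has exactly 10 keywords, as in the sample data.

```python
def match_count(keywords_a, keywords_b):
    """Count the distinct keywords shared by two comma-separated strings."""
    # Strip whitespace so "loan" and " loan" are treated as the same keyword.
    set_a = {w.strip() for w in keywords_a.split(",")}
    set_b = {w.strip() for w in keywords_b.split(",")}
    return len(set_a & set_b)


# Loan 356 from df compared against Dominant_Topic 0 from df2.
row = "loss, make, payment, refinance, rate, new, time, credit, pay, current"
topic = "loan, mortgage, payment, refinance, rate, new, time, credit, pay, current"

count = match_count(row, topic)
score = count * 10  # assumption: 10 keywords per topic, so 1 match = 10 points
print(count, score)  # 8 80
```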

Let me know if you have any questions.

Answer 1

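import pandas as pd
from pandas import json_normalize

# Turn each topic's keyword string into a set of clean tokens.
# Splitting on "," alone leaves leading spaces (" mortgage"), so strip each token.
words_series = df2["Topic_Keywords"].apply(lambda s: {w.strip() for w in s.split(",")})


def find_max(words):
    # Count how many keywords this row shares with each topic in df2.
    words = {w.strip() for w in words.split(",")}
    matched = words_series.apply(lambda x: len(x & words))
    max_len = matched.max()
    # argmax returns a position, so use iloc; on ties the first topic in df2 wins.
    max_index = matched.argmax()
    d = df2.iloc[max_index].to_dict()

    d.pop("Topic_Keywords")

    return {
        **d,
        "string_match": max_len
    }


df["result"] = df["Topic_Keywords"].apply(find_max)
# Expand the result dicts into columns and attach them to the original rows.
out_df = df.join(json_normalize(df["result"].tolist())).drop("result", axis=1)

# Each topic has 10 keywords, so the match count times 10 is a percentage.
out_df = out_df.assign(match_score=out_df["string_match"] * 10)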
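
As an alternative to applying a function row by row, the same matching could be sketched with `explode` and `merge`. This is not the approach above but a swapped-in variant under stated assumptions: the two frames are cut down to two rows each for brevity, `explode_keywords`, `row_id` and `topic_id` are hypothetical names, both frames use a default integer index, and the score again assumes 10 keywords per topic. Rows sharing no keyword with any topic would end up with NaN in the added columns.

```python
import pandas as pd

# Reduced copies of the question's frames (assumption: same column names).
df = pd.DataFrame(
    [[121, "loan, mortgage, payment, refinance, rate, new, time, credit, pay, current"],
     [908, "portal, improve, online, top, covid, web, deal, competitive, take, lost"]],
    columns=["Loan Number", "Topic_Keywords"],
)
df2 = pd.DataFrame(
    [["loan, mortgage, payment, refinance, rate, new, time, credit, pay, current",
      "Servicing", "Refinance"],
     ["portal, improve, online, top, covid, web, deal, competitive, take, care",
      "Servicing", "websites"]],
    columns=["Topic_Keywords", "Cate", "SubCategory"],
)


def explode_keywords(frame, id_name):
    """Return one row per (index value, keyword), with duplicates removed."""
    words = frame["Topic_Keywords"].str.split(",").explode().str.strip()
    return words.rename("word").rename_axis(id_name).reset_index().drop_duplicates()


left = explode_keywords(df, "row_id")
right = explode_keywords(df2, "topic_id")

# Join on the keyword, then count shared keywords per (row, topic) pair.
counts = (left.merge(right, on="word")
              .groupby(["row_id", "topic_id"]).size()
              .rename("string_match").reset_index())

# Keep the best topic per row; ties go to the lowest topic_id.
best = (counts.sort_values(["row_id", "string_match", "topic_id"],
                           ascending=[True, False, True])
              .drop_duplicates("row_id"))

labels = df2[["Cate", "SubCategory"]].rename_axis("topic_id").reset_index()
out = (df.rename_axis("row_id").reset_index()
         .merge(best, on="row_id", how="left")
         .merge(labels, on="topic_id", how="left")
         .drop(columns=["row_id", "topic_id"]))
out["match_score"] = out["string_match"] * 10
print(out[["Loan Number", "Cate", "SubCategory", "string_match", "match_score"]])
```

This avoids a Python-level loop over df2 for every row of df, which may matter when both frames are large. For a handful of topics the `apply` version is likely simpler to read.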

python-3.x

pandas

anaconda3