Match word frequency and assign the category and subcategory with the highest score from another DataFrame in pandas
Input:
df = pd.DataFrame([[121,'Customer Comments xxxx ttttt','loan, mortgage, payment, refinance, rate, new, time, credit, pay, current'],
[34,'Customer Comments xxxx','loan, mortgage, payment, refinance, rate, new, time, credit, pay, services'],
[356,'Customer Comments xxxx','loss, make, payment, refinance, rate, new, time, credit, pay, current'],
[908,'Customer Comments aaaaa','portal, improve, online, top, covid, web, deal, competitive, take, lost'],
[4356,'Customer Comments aaassds','portal, improve, website, top, covid, web, deal, competitive, take, care'],
[3333,'Customer Comments xxxx','communication, make, sure, process, rate, company, timely, interest, customer, know'],
[33456,'Customer Comments xxxx','communication, make, sure, process, rate, company, timely, interest, customer, lot']]
, columns=['Loan Number','Commetns','Topic_Keywords'])
df2=pd.DataFrame([[0,'loan, mortgage, payment, refinance, rate, new, time, credit, pay, current','Servicing','Refinance'],
[5,'closing, survey, time, notary, company, date, title, day, close, cost','Origination','Loan closing'],
[9,'service, customer, keep, good, work, excellent, great, continue, job, company','Servicing','good service'],
[6,'loan, phone, call, process, person, email, contact, time, processor, communication','Servicing','phone call process'],
[4, 'loan, helpful, processor, officer, professional, staff, knowledgeable, hire, work, process','Servicing','Staff/Agent behaviour'],
[3, 'process, easy, nothing, refinance, entire, whole, experience, time, everything, start','Origination','OnBoarding'],
[8, 'great, experience, everything, job, overall, company, nothing, work, mortgage, everyone','Servicing','good service'],
[1, 'portal, improve, online, top, covid, web, deal, competitive, take, care','Servicing','websites'],
[2, 'communication, make, sure, process, rate, company, timely, interest, customer, know', 'Origination','OnBoarding'],
[7, 'process, anything, website, app, change, think, easy, thing, use, mobile', 'Servicing','websites']]
,columns=['Dominant_Topic','Topic_Keywords','Cate','SubCategory'])
Output:
outdf=pd.DataFrame([[121,'Customer Comments xxxx ttttt','loan, mortgage, payment, refinance, rate, new, time, credit, pay, current','Servicing','Refinance',10,100],
[34,'Customer Comments xxxx','loan, mortgage, payment, refinance, rate, new, time, credit, pay, services','Servicing','Refinance',9,90],
[356,'Customer Comments xxxx','loss, make, payment, refinance, rate, new, time, credit, pay, current','Servicing','Refinance',8,80],
[908,'Customer Comments aaaaa','portal, improve, online, top, covid, web, deal, competitive, take, lost','Servicing','websites',9,90],
[4356,'Customer Comments aaassds','portal, improve, website, top, covid, web, deal, competitive, take, care','Servicing','websites',9,90],
[3333,'Customer Comments xxxx','communication, make, sure, process, rate, company, timely, interest, customer, know','Origination','OnBoarding',10,100],
[33456,'Customer Comments xxxx','communication, make, sure, process, rate, company, timely, interest, customer, lot','Origination','OnBoarding',9,90]],
columns=['Loan Number','Commetns','Topic_Keywords','Category','subCategory','String_match','match_score'])
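I ran topic modeling and obtained a topic for each comment. I want to assign a category and subcategory from another DataFrame (using the Topic_Keywords column): for each comment, count the keywords it shares with every row of the second DataFrame, then take the category and subcategory of the row with the highest match count.
Please let me know if anything is unclear.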
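To make the scoring rule concrete, here is a minimal sketch of the match count for a single pair of keyword strings. The helper name `keyword_overlap` is hypothetical, and the sketch assumes keywords are comma-separated and should be compared after stripping surrounding whitespace.

```python
def keyword_overlap(a: str, b: str) -> int:
    """Count the keywords shared by two comma-separated keyword strings."""
    def to_set(s: str) -> set:
        # Strip the space that follows each comma so "pay" and " pay" compare equal
        return {w.strip() for w in s.split(",") if w.strip()}
    return len(to_set(a) & to_set(b))

# Keywords of loan 356 and of topic 0, copied from the input data
comment = "loss, make, payment, refinance, rate, new, time, credit, pay, current"
topic = "loan, mortgage, payment, refinance, rate, new, time, credit, pay, current"

score = keyword_overlap(comment, topic)
print(score, score * 10)  # 8 80
```

Multiplying by 10 only gives a percentage because every topic here has exactly ten keywords.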
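import pandas as pd
from pandas import json_normalize

# Turn each reference topic into a set of keywords, stripping the space after each comma
words_series = df2["Topic_Keywords"].str.split(",").apply(lambda ws: {w.strip() for w in ws})

def find_max(words):
    # Keywords of the comment being classified
    words = {w.strip() for w in words.split(",")}
    # Number of keywords shared with each row of df2
    matched = words_series.apply(lambda ref: len(ref & words))
    max_len = matched.max()
    # argmax returns a position (the first row wins on ties), so look it up with iloc
    d = df2.iloc[matched.argmax()].to_dict()
    d.pop("Topic_Keywords")
    return {
        **d,
        "string_match": max_len
    }

df["result"] = df["Topic_Keywords"].apply(find_max)
out_df = df.join(json_normalize(df["result"])).drop("result", axis=1)
# Every topic has 10 keywords, so matches * 10 is a percentage
out_df = out_df.assign(match_score=out_df["string_match"] * 10)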
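As an alternative to the row-wise `apply`, the same overlap counts can be computed with `explode` and `merge`. This is only a sketch of a different technique, not the method used above. The frames `comments` and `topics`, the helper `explode_keywords`, and the `topic_id` column are hypothetical names, and the data is a shortened stand-in for `df` and `df2`.

```python
import pandas as pd

# Small hypothetical frames with the same column layout as df and df2
comments = pd.DataFrame({
    "Loan Number": [121, 356],
    "Topic_Keywords": ["loan, mortgage, payment, refinance",
                       "portal, improve, payment, web"],
})
topics = pd.DataFrame({
    "Topic_Keywords": ["loan, mortgage, payment, rate",
                       "portal, improve, online, web"],
    "Cate": ["Servicing", "Servicing"],
    "SubCategory": ["Refinance", "websites"],
})
topics["topic_id"] = range(len(topics))  # explicit key for each topic row

def explode_keywords(frame, id_col):
    """Return one row per (id, keyword) pair, with whitespace stripped."""
    out = frame[[id_col, "Topic_Keywords"]].copy()
    out["keyword"] = out["Topic_Keywords"].str.split(",")
    out = out.explode("keyword")
    out["keyword"] = out["keyword"].str.strip()
    return out[[id_col, "keyword"]].drop_duplicates()

left = explode_keywords(comments, "Loan Number")
right = explode_keywords(topics, "topic_id")

# Each shared keyword yields one merged row, so the group size is the overlap
pairs = left.merge(right, on="keyword")
counts = (pairs.groupby(["Loan Number", "topic_id"]).size()
          .reset_index(name="string_match"))

# Keep the best topic per comment; a stable sort lets the lowest topic_id win ties
best = (counts.sort_values("string_match", ascending=False, kind="mergesort")
        .drop_duplicates("Loan Number"))

result = (comments.merge(best, on="Loan Number", how="left")
          .merge(topics[["topic_id", "Cate", "SubCategory"]],
                 on="topic_id", how="left"))
print(result[["Loan Number", "Cate", "SubCategory", "string_match"]])
```

A comment that shares no keyword with any topic ends up with missing values in the category columns, which may or may not be the behaviour you want.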