How to split the input by comparing two dataframes in pandas
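I am trying to take the input and the keywords from two tables in a database. I read both tables with pandas, use their respective columns to split the data, and then write the output back to the same table in the database.
My input: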
Original_Input
LARIDENT SRL
MIZUHO Corporation Gosen Factory
ZIMMER MANUFACTURING BV
GALT MEDICAL CORP
MIZUHO AMERICA INC
AVENT S de RL de CV
LUV N CARE LTD
STERIS ISOMEDIX PUERTO RICO INC
MEDISTIM INC
Cadence Science Inc
TECHNOLOGIES SA
AMG Mdicale Co Inc
My keyword table:
**Name_Extension** | **Company_Type** | **Priority**
co llc | Company LLC | 2
Pvt ltd | Private Limited | 8
Corp | Corporation | 4
CO Ltd | Company Limited | 3
inc | Incorporated | 5
CO | Company | 1
ltd | Limited | 7
llc | LLC | 6
Corporation | Corporation | 4
& Co | Company | 1
Company Limited | Company Limited | 3
Limited | Limited | 7
Co inc | Company Incorporated | 9
AB | AB | 10
SA | SA | 11
S A | SA | 11
GmbH | GmbH | 12
Sdn Bhd | Sdn Bhd | 13
llp | LLP | 14
co llp | LLP | 14
SA DE CV | SA DE CV | 19
Company | Company | 1
Coinc | Company Incorporated | 9
Coltd | Company Limited | 3
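So, if an input (in table 1) contains any of the extensions (in table 2), it has to be split into the Core_Input and Type_input columns, where Core_Input holds the company name and Type_input holds the company type (from column 2 of table 2). The extensions have to be checked in priority order.
My output would be: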
Core_Input | Type_input
NULL | NULL
NULL | NULL
NULL | NULL
GALT MEDICAL | Corporation
MIZUHO AMERICA | Incorporated
NULL | NULL
LUV N CARE | Limited
STERIS ISOMEDIX PUERTO RICO | Incorporated
MEDISTIM | Incorporated
Cadence Science | Incorporated
My code:
k1=[]
k2=[]
df1=pd.read_sql('select * from [dbo].[company_Extension]',engine)
for inp1 in df1['Name_Extension']:
    k1.append(inp1.strip())
for inp2 in df1['Company_Type']:
    k2.append(inp2.strip())
p=1
p1=max(df1['Priority'])
for k1 in df1['Name_Extension']:
    for k2 in df1['Company_Type']:
        #for pr in df1['Priority']:
        for i in df['Cleansed_Input']:
            while p<=p1:
                if re.search(r'[^>]*?\s'+str(k1).strip(),str(i).strip(),re.I) and (p == (pr for pr in df1['Priority'])):
                    splits = i.str.split(str(k1),re.I)
                    df['Core_Input'] = splits[0] #df['Cleansed_Input'].str.replace(str(k1),'',re.I)
                    df['Type_input'] = str(k2)
                p=p+1
data.to_sql('Testtable', con=engine, if_exists='replace',index= False)
Any help is appreciated.
Edit:
df=pd.read_sql('select * from [dbo].[TempCompanyName]',engine)
df1=pd.read_sql('select * from [dbo].[company_Extension]',engine)
ext_list = df1['Name_Extension']
type_list =df1['Company_Type']
for i, j in df.iterrows():
    comp_name = df['Original_Input'][i]
    for idx, ex in enumerate(ext_list):
        if re.search(rf'\b{ex}\b', comp_name,re.IGNORECASE):
            df['Core_Input'] = type_list[idx]
            df['Type_input'].iloc[i] = comp_type
print(df)
df.to_sql('TempCompanyName', con=engine, if_exists='replace',index= False)
Edit:
ext_list = df1['Name_Extension']
type_list =df1['Company_Type']
for i, j in enumerate(df['Cleansed_Input']):
    comp_name = df['Cleansed_Input'][i]
    for idx, ex in enumerate(ext_list):
        comp_name.replace('.,','')
        if re.search(rf'(\b{ex}\b)', comp_name, re.I):
            comp_type = type_list[idx]
            df['Type_input'].iloc[i]= comp_type
            # Delete the extension name from company name
            updated_comp_name = re.sub(rf'(\b{str(ex).upper()}\b)','',str(comp_name).upper())
            # The regex above leaves a space after the removed word; with the space before the next word that becomes 2 spaces
            updated_comp_name = str(updated_comp_name).replace('  ',' ')
            # Update the company name
            df['Core_Input'].iloc[i] = updated_comp_name
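Hi, I hope the lines below help you get to a solution. For some reason I did not use SQL; I took your data in two separate Excel files instead. You need to add a Type column to the input table before running the code.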
import re
import pandas as pd

input_df = pd.read_excel('input.xlsx', sheet_name='Sheet1')
exts_df = pd.read_excel('exts.xlsx', sheet_name='Sheet1')
# Check that the correct data is loaded
print(input_df.head())
ext_list = exts_df['Name_Extension']
type_list = exts_df['Company_Type']
for i, row in input_df.iterrows():
    comp_name = str(row['Company Names'])
    for idx, ex in enumerate(ext_list):
        # re.escape keeps characters such as "&" in an extension literal
        pattern = rf'\b{re.escape(str(ex).strip())}\b'
        if re.search(pattern, comp_name, re.IGNORECASE):
            # .loc avoids the chained assignment of ['Type'].iloc[i]
            input_df.loc[i, 'Type'] = type_list[idx]
            # Delete the extension name from the upper-cased company name
            updated_comp_name = re.sub(pattern, '', comp_name.upper(), flags=re.IGNORECASE)
            # Removing the word leaves two adjacent spaces; collapse them into one
            updated_comp_name = updated_comp_name.replace('  ', ' ')
            # Update the company name
            input_df.loc[i, 'Company Names'] = updated_comp_name
print(input_df)
input_df.to_excel('output.xlsx', index=False)
Output after removing the extension from the input company-name column and mapping Company_Type ...