How to compare one value against multiple values in Python with the Pandas library?
Currently, I have the following scenario.
Excel Data Frame =
sector
------
AXYZ
BYYT
IOPL
XFTR
MANU

SQL Data Frame =
sector  hour  value_cs  value_ps
------  ----  --------  --------
AXYZ    0     78.90     87.10
RACH    0     87.12     13.90
IOPL    0     93.10     13.87
AXYZ    1     27.90     12.87
IOPL    1     23.09     90.09
FRES    2     34.09     12.34
YYYT    2     12.43     32.98
REWT    3     98.09     99.99
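I have an Excel file and a set of SQL results. I want to compare each value of the sector column in the Excel file against all values of the sector column in the SQL results, so that whenever the values match, the hour, value_cs and value_ps columns from the SQL results are added to a new data frame. Note: the SQL results and the Excel file do not have the same number of rows.
Desired result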
New data frame 1 for value cs
sector  0      1      2      3      4      5      6      ...  23
------  -----  -----  -----  -----  -----  -----  -----  ---  -----
AXYZ    78.90  27.90  78.89  54.90  98.23  85.0   45.90  ...  68.23
BYYT    18.94  67.10  65.69  76.32  76.56  56.03  56.23  ...  87.65
IOPL    93.10  23.09  34.29  97.34  34.34  14.54  34.91  ...  23.21
...
New data frame 2 for value ps
sector  0      1      2      3      4      5      6      ...  23
------  -----  -----  -----  -----  -----  -----  -----  ---  -----
AXYZ    87.10  12.87  49.89  84.90  76.23  15.01  12.90  ...  68.23
BYYT    28.43  27.11  54.69  57.12  19.56  45.12  45.23  ...  47.15
IOPL    13.87  90.09  24.19  47.34  18.34  21.54  67.11  ...  13.61
...
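The approach I took was to load both the SQL results and the Excel data into data frames, but I don't know how to do the comparison using only Pandas, without a for loop (a for loop would take too long to run).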
import pandas as pd
import pypyodbc
from datetime import date

def get_and_compare():
    start_date = date.today()
    retrieve_values = "[DEV].[CS].[QA_Export] @start_date='{start_date:%Y-%m-%d}'".format(start_date=start_date)
    # Connect to the database
    db_connection = pypyodbc.connect(driver="{SQL Server}", server="xxx.xxx.xxx.xxx", uid="xxx",
                                     pwd="xxx", Trusted_Connection="No")
    # Get the SQL result into a data frame
    data_frame_sql = pd.read_sql(retrieve_values, db_connection)
    # Declare the new data frames
    new_df_one = pd.DataFrame(columns=['sector', 'value cs', 'hour 0', 'hour 1', 'hour 2', 'hour 3', 'hour 4',
                                       'hour 5', 'hour 6', 'hour 7', 'hour 8', 'hour 9', 'hour 10', 'hour 11',
                                       'hour 12', 'hour 13', 'hour 14', 'hour 15', 'hour 16', 'hour 17', 'hour 18',
                                       'hour 19', 'hour 20', 'hour 21', 'hour 22', 'hour 23'])
    new_df_two = pd.DataFrame(columns=['sector', 'value ps', 'hour 0', 'hour 1', 'hour 2', 'hour 3', 'hour 4',
                                       'hour 5', 'hour 6', 'hour 7', 'hour 8', 'hour 9', 'hour 10', 'hour 11',
                                       'hour 12', 'hour 13', 'hour 14', 'hour 15', 'hour 16', 'hour 17', 'hour 18',
                                       'hour 19', 'hour 20', 'hour 21', 'hour 22', 'hour 23'])
    # Read the Excel file (raw string, so "\U" and "\t" are not parsed as escape sequences)
    current_wb = pd.ExcelFile(r"C:\U\dev\testing\Main values to compare.xlsx")
    # Get the specific sheet to compare
    working_values = current_wb.parse("Main values")
    # Get the sector column from Excel
    sector_from_excel = working_values['sector']
    # Comparison to perform
    # .... unknown part
Any suggestions or comments that help me complete this part of the code would be much appreciated.
Try this:
import pandas as pd
import pypyodbc
from datetime import date

def get_and_compare():
    start_date = date.today()
    retrieve_values = "[DEV].[CS].[QA_Export] @start_date='{start_date:%Y-%m-%d}'".format(start_date=start_date)
    # Connect to the database
    db_connection = pypyodbc.connect(driver="{SQL Server}", server="xxx.xxx.xxx.xxx", uid="xxx",
                                     pwd="xxx", Trusted_Connection="No")
    # Get the SQL result into a DataFrame
    data_frame_sql = pd.read_sql(retrieve_values, db_connection)
    # Read the Excel file (raw string, so "\U" and "\t" are not parsed as escape sequences)
    current_wb = pd.ExcelFile(r"C:\U\dev\testing\Main values to compare.xlsx")
    # Get the specific sheet to compare
    working_values = current_wb.parse("Main values")
    # Get the sector column from Excel as a one-column DataFrame;
    # repeated sectors are dropped so the merge does not duplicate SQL rows
    sector_from_excel = working_values[["sector"]].drop_duplicates()
    # perform inner join between DataFrames
    # note: this requires that "sector" is a column (and not an index)
    # in both DataFrames, and that it is also named "sector" in each
    merged_df = data_frame_sql.merge(sector_from_excel, how="inner", on="sector")
    # use "pivot" to reshape data from long to wide
    # (pivot raises a ValueError if a sector/hour pair occurs more than once)
    # first with value_cs
    cs_value_df = merged_df.pivot(index="sector", columns="hour", values="value_cs")
    # and then with value_ps
    ps_value_df = merged_df.pivot(index="sector", columns="hour", values="value_ps")
    # I'd suggest returning both DataFrames in a single object;
    # in this case I'm using a dict
    return {"value cs": cs_value_df, "value ps": ps_value_df}
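For what it's worth, I'd suggest splitting this function into several functions: one to build your SQL query, one to read your Excel file, and one to perform the Pandas operations. Packing this many operations into a single function is not a good habit, and it becomes tedious to debug when something goes wrong.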
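One possible way to split it up is sketched below. The function names build_query, read_sectors and pivot_matching are hypothetical, and the database call is left out on purpose, so treat this as an outline under those assumptions rather than a drop-in replacement.

```python
from datetime import date

import pandas as pd


def build_query(start_date):
    """Build the stored-procedure call from the question for a given date."""
    return "[DEV].[CS].[QA_Export] @start_date='{:%Y-%m-%d}'".format(start_date)


def read_sectors(excel_path, sheet_name="Main values"):
    """Read the sheet and return its unique sectors as a one-column DataFrame."""
    sheet = pd.read_excel(excel_path, sheet_name=sheet_name)
    return sheet[["sector"]].drop_duplicates()


def pivot_matching(sql_df, sectors_df):
    """Keep SQL rows with a matching sector and pivot each value column by hour."""
    merged = sql_df.merge(sectors_df, how="inner", on="sector")
    return {
        "value cs": merged.pivot(index="sector", columns="hour", values="value_cs"),
        "value ps": merged.pivot(index="sector", columns="hour", values="value_ps"),
    }
```

Each piece can then be tested on its own. For example, pivot_matching can be called with small hand-built DataFrames, with no database connection or Excel file involved.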