Faster way to run COUNTIFS in Python
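I previously asked how to do a COUNTIFS in Python across multiple data frames, the way you can across separate worksheets in Excel. Someone gave me a very creative answer:
Thanks @AlexG. I tried it and it works very well: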
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# import the data
students = pd.read_csv("Student Detail stump.csv")
exams = pd.read_csv("Exam Detail stump.csv")

# get data parameters
student_info = students[['Student Number', 'Enrollment Date', 'Detail Date']].values

# prepare an empty list to hold the results
N_exams_passed = []

# count records in data set according to parameters
for s_id, s_enroll, s_qual in student_info:
    N_exams_passed.append(len(exams[(exams['Student Number'] == s_id) &
                                    (exams['Exam Grade Date'] >= s_enroll) &
                                    (exams['Exam Grade Date'] <= s_qual) &
                                    (exams['Exam Grade'] >= 70)]))

# add the results to the original data set
students['Exams Passed'] = N_exams_passed
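However, it only works on small data sets. When I run it on 100,000 rows, it does not finish even overnight, and it does not seem very pythonic.
In SQL, a correlated subquery like the following does this in a few seconds: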
SELECT
    s.*,
    (SELECT COUNT(e.[Exam Grade])
     FROM
         exams AS e
     WHERE
         e.[Exam Grade] >= 65
         AND e.[Student Number] = s.[Student Number]
         AND e.[Exam Grade Date] >= s.[Enrollment Date]
         AND e.[Exam Grade Date] <= s.[Detail Date])
    AS ExamsPassed
FROM
    students AS s;
How can I reproduce a correlated subquery like this in pandas, or in some other pythonic way?
Here are the data frames:
#Students
Student Number  Enroll Date  Detail Date
1               1/1/2016     2/1/2016
1               1/1/2016     3/1/2016
2               2/1/2016     3/1/2016
3               3/1/2016     4/1/2016

#Exams
Student Number  Exam Date  Exam Grade
1               1/1/2016   50
1               1/15/2016  80
1               1/28/2016  90
1               2/5/2016   100
1               3/5/2016   80
1               4/5/2016   40
2               2/2/2016   85
2               2/3/2016   10
2               2/4/2016   100
The final data frame should look like this, with the count in 'Passed Exams' at the end:
#FinalResult
Student Number  Enroll Date  Detail Date  Passed Exams
1               1/1/2016     2/1/2016     2
1               1/1/2016     3/1/2016     3
2               2/1/2016     3/1/2016     2
3               3/1/2016     4/1/2016     0
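If I understand the structure of your data frames correctly, I suggest merging the two data frames and then using numpy.where to do the counting on the merged data.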
import numpy as np

# pair every exam with each row of its student
exams = exams.merge(students, on='Student Number', how='left')
# flag each pair: 1 if the exam was passed inside the date window, else 0
exams['Passed'] = np.where(
    (exams['Exam Grade Date'] >= exams['Enrollment Date']) &
    (exams['Exam Grade Date'] <= exams['Detail Date']) &
    (exams['Exam Grade'] >= 70),
    1, 0)
# sum the flags per student row and attach the totals to students
students = students.merge(
    exams.groupby(['Student Number', 'Detail Date'])['Passed'].sum().reset_index(),
    on=['Student Number', 'Detail Date'],
    how='left')
# students with no exams get NaN from the left merge; count them as 0
students['Passed'] = students['Passed'].fillna(0).astype('int')
Note: you need to make sure the date columns are actually stored as datetimes (you can use pandas.to_datetime to do this).
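As a rough illustration of that conversion, here is a minimal sketch. The two rows are made up for the example, the column names are taken from the question's code, and the format string assumes month/day/year text as in the sample data:

```python
import pandas as pd

# Hypothetical rows; column names follow the question's code.
students = pd.DataFrame({
    "Student Number": [1, 2],
    "Enrollment Date": ["1/1/2016", "2/1/2016"],
    "Detail Date": ["2/1/2016", "3/1/2016"],
})

# Assumption: the dates are month/day/year strings, as in the sample data.
for col in ["Enrollment Date", "Detail Date"]:
    students[col] = pd.to_datetime(students[col], format="%m/%d/%Y")

print(students.dtypes)
```

If the columns are left as strings, the >= and <= comparisons compare text rather than dates, so the date window would be evaluated incorrectly.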
numpy.where creates a new array whose values are one thing (1 in the example above) where the condition you specify is met, and another (0) where it is not.
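A tiny standalone example of that behaviour, using made-up grades and the same pass mark of 70:

```python
import numpy as np
import pandas as pd

grades = pd.Series([50, 80, 90, 40])  # illustrative values only

# 1 where the condition holds, 0 everywhere else
passed = np.where(grades >= 70, 1, 0)
print(passed)  # [0 1 1 0]
```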
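The line exams.groupby(['Student Number', 'Detail Date'])['Passed'].sum() produces a Series whose index is Student Number and Detail Date, and whose values are the sum of Passed for each Student Number and Detail Date combination. reset_index() turns it into a data frame so it can be merged.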
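To check the approach end to end, here is a self-contained sketch that runs it on the sample data from the question. The helper name count_passed_exams and its pass_mark parameter are hypothetical, introduced only for this illustration, and the column names are the ones used in the question's code ('Enrollment Date', 'Exam Grade Date') rather than the abbreviated headers in the sample tables:

```python
import numpy as np
import pandas as pd


def count_passed_exams(students, exams, pass_mark=70):
    """Count, for each row of students, the passing exams in its date window.

    Hypothetical helper: the name and pass_mark are illustrative only.
    """
    # One row per (exam, student row) pair sharing a Student Number.
    merged = exams.merge(students, on="Student Number", how="left")
    merged["Passed"] = np.where(
        (merged["Exam Grade Date"] >= merged["Enrollment Date"])
        & (merged["Exam Grade Date"] <= merged["Detail Date"])
        & (merged["Exam Grade"] >= pass_mark),
        1, 0)
    counts = (merged.groupby(["Student Number", "Detail Date"])["Passed"]
              .sum().reset_index())
    result = students.merge(counts, on=["Student Number", "Detail Date"],
                            how="left")
    # Students with no exams at all come back as NaN; count them as 0.
    result["Passed"] = result["Passed"].fillna(0).astype(int)
    return result


fmt = "%m/%d/%Y"  # assumption: month/day/year, as in the sample data
students = pd.DataFrame({
    "Student Number": [1, 1, 2, 3],
    "Enrollment Date": pd.to_datetime(
        ["1/1/2016", "1/1/2016", "2/1/2016", "3/1/2016"], format=fmt),
    "Detail Date": pd.to_datetime(
        ["2/1/2016", "3/1/2016", "3/1/2016", "4/1/2016"], format=fmt),
})
exams = pd.DataFrame({
    "Student Number": [1, 1, 1, 1, 1, 1, 2, 2, 2],
    "Exam Grade Date": pd.to_datetime(
        ["1/1/2016", "1/15/2016", "1/28/2016", "2/5/2016", "3/5/2016",
         "4/5/2016", "2/2/2016", "2/3/2016", "2/4/2016"], format=fmt),
    "Exam Grade": [50, 80, 90, 100, 80, 40, 85, 10, 100],
})

result = count_passed_exams(students, exams)
print(result["Passed"].tolist())  # [2, 3, 2, 0]
```

Two things to keep in mind with this sketch: it assumes each (Student Number, Detail Date) pair appears only once in students, and the intermediate merge holds one row per exam for every row of the matching student, so memory use grows when a student has many rows.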