Unable to calculate the frequency of unique values in a column

Question

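I am working on a project that requires me to count how many times a student was present and absent in each subject of a class, and to calculate his attendance percentage. I have his attendance record as follows: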

    Attend  Date    Subject
96  Present 09-04-2020  AM-II
69  Present 16-04-2020  AM-II
61  Present 20-04-2020  AM-II
49  Present 22-04-2020  AM-II
45  Present 23-04-2020  AM-II
... ... ... ...
14  Present 12-04-2020  LMS
13  Absent  18-04-2020  LMS
11  Absent  19-04-2020  LMS
10  Present 25-04-2020  LMS
9   Present 26-04-2020  LMS

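I am using Python's pandas library to count, for each unique subject, how many times "Present" occurs and how many times "Absent" occurs, but I can't get it to work. This is what I am doing: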

import pandas as pd

data = pd.read_csv("data1.csv")

# sort the data frame by Subject and then by Date
data.sort_values(["Subject", "Date"], axis=0, 
                 ascending=True, inplace=True) 
p = 0
a = 0
total = 0
attpercent = {}
data.set_index(["Subject"], inplace = True, 
                            append = True, drop = False)
temp = ""
data = data.infer_objects()
for Subject, Attend in data.iterrows():
    if(temp == ""):
        temp = Subject
        if Attend == "Present":
                p = p + 1
        else:
            a = a + 1
    else:
        if(temp == Subject):
            if Attend == "Present":
                p = p + 1
            else:
                a = a + 1
        else:
            total = a + p
            attpercent[temp] = (p * 100) / total
            a = 0
            p = 0
            temp = Subject 
            if Attend == "Present":
                p = p + 1
            else:
                a = a + 1
                
print(attpercent)

It shows this error:

 TypeError                                 Traceback (most recent call last)
<ipython-input-65-9d7243427e5f> in <module>
     18 data = data.infer_objects()
     19 for Subject, Attend in data.iterrows():
---> 20     Attend = str(Attend)
     21     if(temp == ""):
     22         temp = Subject

TypeError: 'Series' object is not callable

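I am using pandas for the first time, so I don't know much about it. I tried converting the column types with infer_objects and astype(), but I still get the same error. Please help.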

Answer 1

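You should try to avoid loops and iteration where you can, and get familiar with pandas methods such as .groupby, .pivot_table, and .unstack. For this particular problem, you can use .groupby and .size, then use .unstack to move rows into columns, which puts the data in a good shape for calculating the attendance rate.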

df = df.groupby(['Subject','Attend']).size().reset_index() \
       .set_index(['Subject', 'Attend']) \
       .unstack(1).fillna(0).astype(int)
df.columns = df.columns.droplevel(0)
df['Attendance'] = df['Present'] / ( df['Present'] + df['Absent'])
df

Output:

Attend  Absent  Present Attendance
Subject         
AM-II   0       5       1.0
LMS     2       3       0.6

A more detailed explanation:

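After .groupby on the relevant columns and .size to count the occurrences, I add .set_index(['Subject', 'Attend']) to put those two columns in the index in preparation for the next step. Next, I move Attend into the headers to get the dataset into a nice matrix format, like an Excel pivot table. With .unstack(1), I take the second index column I just set (remember that Python counts from 0, so 1 refers to the second index column) and turn its values into the headers, which reshapes the data frame from rows to columns in a very convenient way. If I did .unstack(0) instead, it would move Subject into the headers, which would not lay the data out the way we want.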
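To make the effect of the level number concrete, here is a minimal sketch on a tiny hand-built Series. The counts and the variable names (sizes, by_attend, by_subject) are made up for illustration and are not taken from the question's file.

```python
import pandas as pd

# Hypothetical counts per (Subject, Attend) pair, shaped like the result of .size().
sizes = pd.Series(
    [5, 2, 3],
    index=pd.MultiIndex.from_tuples(
        [("AM-II", "Present"), ("LMS", "Absent"), ("LMS", "Present")],
        names=["Subject", "Attend"],
    ),
)

# Level 1 is Attend: its values become the column headers, one row per Subject.
by_attend = sizes.unstack(1)

# Level 0 is Subject: its values become the column headers, one row per Attend.
by_subject = sizes.unstack(0)

print(by_attend)
print(by_subject)
```

Because there is no ("AM-II", "Absent") entry in this sample, that cell comes out as NaN in by_attend, which is why the answer's chain follows .unstack with .fillna(0).astype(int).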

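Finally, df.columns = df.columns.droplevel(0) removes one level from the column MultiIndex so that it looks cleaner, and the Attendance calculation is then straightforward: divide the number of Present by the total (Present + Absent) to get the attendance rate for each subject.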
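The extra column level comes from the unnamed count column (labelled 0) that .reset_index() creates. The sketch below, on a small made-up sample, records the number of column levels before dropping it; the names records, wide, and levels_before are illustrative assumptions.

```python
import pandas as pd

# Made-up sample with only the two columns the chain needs.
records = pd.DataFrame({
    "Subject": ["AM-II", "AM-II", "LMS", "LMS", "LMS"],
    "Attend": ["Present", "Present", "Present", "Absent", "Absent"],
})

wide = (
    records.groupby(["Subject", "Attend"]).size().reset_index()
    .set_index(["Subject", "Attend"])
    .unstack(1).fillna(0).astype(int)
)

# The columns are now pairs such as (0, 'Absent') and (0, 'Present').
levels_before = wide.columns.nlevels

# Drop the outer level (the 0) so only 'Absent' and 'Present' remain.
wide.columns = wide.columns.droplevel(0)

# Attendance rate = Present / (Present + Absent), computed per subject.
wide["Attendance"] = wide["Present"] / (wide["Present"] + wide["Absent"])
print(wide)
```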

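Suppose the full data includes another column for the student. Building on the first example, you could probably work out how to handle that from here, but this is what you can do.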

Input:

    Attend  Date       Subject  Student
96  Present 09-04-2020  AM-II   Kathy
69  Present 16-04-2020  AM-II   John
61  Present 20-04-2020  AM-II   John
49  Present 22-04-2020  AM-II   John
45  Present 23-04-2020  AM-II   Kathy
14  Present 12-04-2020  LMS     Kathy
13  Absent  18-04-2020  LMS     Kathy
11  Absent  19-04-2020  LMS     John
10  Present 25-04-2020  LMS     Kathy
9   Present 26-04-2020  LMS     John

Code:

df = df.groupby(['Student','Subject','Attend']).size().reset_index() \
       .set_index(['Student','Subject', 'Attend']) \
       .unstack(2).fillna(0).astype(int)
df.columns = df.columns.droplevel(0)
df['Attendance'] = df['Present'] / ( df['Present'] + df['Absent'])
df

        Attend  Absent  Present Attendance
Student Subject         
John    AM-II   0       3       1.000000
        LMS     1       1       0.500000
Kathy   AM-II   0       2       1.000000
        LMS     1       2       0.666667

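The code is almost the same. You just include the extra Student column in .groupby and .set_index(), and increase .unstack from 1 to 2, because Attend is now the third index column specified by .set_index(). The df.columns.droplevel(0) line stays unchanged, because it acts on the column levels rather than on the row index, which now holds two columns (Student and Subject).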
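Since the two versions differ only in the grouping keys, the pattern can be wrapped in a small helper. This is a sketch under assumptions, not part of the original answer: attendance_table is a hypothetical function name, and it unstacks Attend by name instead of by position so the level number does not need to change when keys are added.

```python
import pandas as pd

def attendance_table(records, keys):
    """Count Present/Absent per group of `keys` and add an Attendance rate."""
    counts = (
        records.groupby(keys + ["Attend"]).size()
        .unstack("Attend", fill_value=0)  # unstack by level name, not number
    )
    # Make sure both columns exist even if one status never occurs.
    counts = counts.reindex(columns=["Absent", "Present"], fill_value=0)
    counts["Attendance"] = counts["Present"] / (counts["Present"] + counts["Absent"])
    return counts

# The ten rows from the input table above.
records = pd.DataFrame({
    "Attend": ["Present"] * 6 + ["Absent", "Absent", "Present", "Present"],
    "Date": ["09-04-2020", "16-04-2020", "20-04-2020", "22-04-2020", "23-04-2020",
             "12-04-2020", "18-04-2020", "19-04-2020", "25-04-2020", "26-04-2020"],
    "Subject": ["AM-II"] * 5 + ["LMS"] * 5,
    "Student": ["Kathy", "John", "John", "John", "Kathy",
                "Kathy", "Kathy", "John", "Kathy", "John"],
})

per_subject = attendance_table(records, ["Subject"])
per_student = attendance_table(records, ["Student", "Subject"])
print(per_subject)
print(per_student)
```

Passing fill_value=0 to .unstack replaces the separate .fillna(0).astype(int) step, and the reindex guards against a dataset in which nobody was ever absent.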

Finally, if you want a clean dataset without a MultiIndex, just do df = df.reset_index() as the last step:

Attend  Student Subject Absent  Present Attendance
0       John    AM-II   0       3       1.000000
1       John    LMS     1       1       0.500000
2       Kathy   AM-II   0       2       1.000000
3       Kathy   LMS     1       2       0.666667
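The "Attend" label in the top-left corner of that output is the leftover name of the column index. A minimal sketch of the flattening step, on a hand-built frame matching the counts above (summary and flat are illustrative names), shows how to clear it as well if you prefer:

```python
import pandas as pd

# Hand-built frame mirroring the per-student result, indexed by Student and Subject.
summary = pd.DataFrame(
    {"Absent": [0, 1, 0, 1], "Present": [3, 1, 2, 2]},
    index=pd.MultiIndex.from_tuples(
        [("John", "AM-II"), ("John", "LMS"), ("Kathy", "AM-II"), ("Kathy", "LMS")],
        names=["Student", "Subject"],
    ),
)
summary.columns.name = "Attend"

# Move Student and Subject from the index back into ordinary columns.
flat = summary.reset_index()

# Optional: clear the leftover column-index name so the header row is plain.
flat.columns.name = None
print(flat)
```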
