Count topics in each cluster and divide by the total number of topics for each student in Python

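I have two data frames (df1, df2): df1 contains the student names and each student's topic preferences, and df2 contains the topics and the cluster each topic belongs to.

Here is a sample input: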

    import pandas as pd
    import numpy as np
    
    df1 = pd.DataFrame({'Name':['student 1', 'student 2', 'student 3', 'student 4'],
            'topics':['algebra; atom; geometry; evolution; food safety',
                      'chemical reaction; linear algebra; Probability; quantum',
                      'botany; electricity; mechanics',
                      'Statistics; botany; number theory; atom; evolution; Probability']})

    df2 = pd.DataFrame({'topics':['algebra', 'Probability', 'geometry', 'atom', 'chemical reaction',
                                  'evolution', 'botany', 'quantum'],
                        'cluster':[0, 0, 0, 1, 1, 2, 2, 3]
                       })

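I want to represent each student as a k-dimensional vector (here k = 4): for the student's topics that appear in df2, count the topics in each cluster and divide by that student's total number of topics. For example, student 1 has two topics in cluster 0 (algebra and geometry); dividing by 5, student 1's total number of topics, gives 0.4, and so on.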
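The arithmetic for student 1 can be checked by hand. This is only a sketch in plain Python: the names `student_topics`, `topic_to_cluster` and `vector` are illustrative, and the data is copied from the sample frames above.

```python
# Hand-check for student 1 (illustrative names, data copied from df1/df2).
student_topics = ['algebra', 'atom', 'geometry', 'evolution', 'food safety']
topic_to_cluster = {
    'algebra': 0, 'Probability': 0, 'geometry': 0,
    'atom': 1, 'chemical reaction': 1,
    'evolution': 2, 'botany': 2,
    'quantum': 3,
}
k = 4

# Count how many of the student's topics fall in each cluster.
counts = [0] * k
for topic in student_topics:
    cluster = topic_to_cluster.get(topic)  # None for topics missing from df2
    if cluster is not None:
        counts[cluster] += 1

# Divide by ALL of the student's topics, including unmatched ones.
vector = [c / len(student_topics) for c in counts]
print(vector)  # [0.4, 0.2, 0.2, 0.0]
```

Note that 'food safety' is not in df2, so it adds to the denominator but not to any cluster count.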

The result should look like this:

There may be a better way to achieve what you want, but this should work:

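out = (
    # One row per (student, topic)
    df1.assign(topics=df1['topics'].str.split('; ')).explode('topics')
       # Attach each topic's cluster; topics missing from df2 get NaN
       .merge(df2, how='left', on='topics')
       # size: all topics per student (unmatched ones included)
       # count: matched topics per (student, cluster)
       .assign(size=lambda x: x.groupby('Name')['cluster'].transform('size'),
               count=lambda x: x.groupby(['Name', 'cluster'])['cluster'].transform('count'),
               ratio=lambda x: x['count'] / x['size'])
       # Keep matched topics only, one row per (student, cluster)
       .query('cluster.notna()').astype({'cluster': int})
       .drop_duplicates(['Name', 'cluster'])
       # Students as rows, clusters as columns, plus 'All' margins
       .pivot_table('ratio', 'Name', 'cluster', aggfunc='sum', fill_value=0, margins=True)
       .rename_axis(index=None, columns=None)
)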

Output:

>>> out
                  0         1         2     3       All
student 1  0.400000  0.200000  0.200000  0.00  0.800000
student 2  0.250000  0.250000  0.000000  0.25  0.750000
student 3  0.000000  0.000000  0.333333  0.00  0.333333
student 4  0.166667  0.166667  0.333333  0.00  0.666667
All        0.816667  0.616667  0.866667  0.25  2.550000
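One way to read the `All` column: for each student it is the share of their topics that appear anywhere in df2. The sketch below cross-checks that reading against the sample data; `known`, `topic_lists` and `covered` are illustrative names, not part of the answer above.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['student 1', 'student 2', 'student 3', 'student 4'],
                    'topics': ['algebra; atom; geometry; evolution; food safety',
                               'chemical reaction; linear algebra; Probability; quantum',
                               'botany; electricity; mechanics',
                               'Statistics; botany; number theory; atom; evolution; Probability']})
df2 = pd.DataFrame({'topics': ['algebra', 'Probability', 'geometry', 'atom', 'chemical reaction',
                               'evolution', 'botany', 'quantum'],
                    'cluster': [0, 0, 0, 1, 1, 2, 2, 3]})

# Every topic that has a cluster
known = set(df2['topics'])

# Per student: matched topics / all topics
topic_lists = df1['topics'].str.split('; ')
covered = pd.Series(
    [sum(t in known for t in topics) / len(topics) for topics in topic_lists],
    index=df1['Name'].tolist(),
)
print(covered)
```

For the sample data this gives 4/5, 3/4, 1/3 and 4/6, which matches the `All` column shown above.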

Here is an answer that delegates most of the work to a function named cluster(), which is called on the student data frame via apply():

import pandas as pd
import numpy as np
    
df1 = pd.DataFrame({'Name':['student 1', 'student 2', 'student 3', 'student 4'],
            'topics':['algebra; atom; geometry; evolution; food safety',
                      'chemical reaction; linear algebra; Probability; quantum',
                      'botany; electricity; mechanics',
                      'Statistics; botany; number theory; atom; evolution; Probability']})

df2 = pd.DataFrame({'topics':['algebra', 'Probability', 'geometry', 'atom', 'chemical reaction',
                              'evolution', 'botany', 'quantum'],    
                    'cluster':[0, 0, 0, 1, 1, 2, 2, 3]
                   })
print(df1)
print('\n', df2)

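# Cluster labels in order of first appearance, plus an 'All' column for the row total
clusterNames = df2['cluster'].unique().tolist() + ['All']
# One set of topics per cluster, in the same order as clusterNames
clusters = [set(df2[df2['cluster'] == c]['topics']) for c in clusterNames[:-1]]

def cluster(x):
    # Split the student's topic string into a set of stripped topic names
    studentTopics = set(top.strip() for top in x['topics'].split(';'))
    # Share of the student's topics that fall in each cluster
    clusterPct = [len(clusterTopics & studentTopics) / len(studentTopics) for clusterTopics in clusters]
    clusterPct += [sum(clusterPct)]
    return clusterPct

df1[clusterNames] = df1.apply(cluster, axis=1).tolist()
# Drop the raw topics column and append an 'All' row holding the column sums
allRow = pd.DataFrame([['All'] + df1[clusterNames].sum(axis=0).tolist()], columns=['Name'] + clusterNames)
df1 = pd.concat([df1.drop('topics', axis=1), allRow])
print('\n', df1)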

Output:

        Name                                             topics
0  student 1    algebra; atom; geometry; evolution; food safety
1  student 2  chemical reaction; linear algebra; Probability...
2  student 3                     botany; electricity; mechanics
3  student 4  Statistics; botany; number theory; atom; evolu...

               topics  cluster
0            algebra        0
1        Probability        0
2           geometry        0
3               atom        1
4  chemical reaction        1
5          evolution        2
6             botany        2
7            quantum        3

         Name         0         1         2     3       All
0  student 1  0.400000  0.200000  0.200000  0.00  0.800000
1  student 2  0.250000  0.250000  0.000000  0.25  0.750000
2  student 3  0.000000  0.000000  0.333333  0.00  0.333333
3  student 4  0.166667  0.166667  0.333333  0.00  0.666667
0        All  0.816667  0.616667  0.866667  0.25  2.550000

Explode the df1 topics into one row per student per topic:

df1 = df1.assign(topics=df1['topics'].str.split(';')).explode('topics')  # one row per (student, topic)

Map each topic in df1 to its cluster from df2:

df1 = df1.assign(topics=df1['topics'].str.strip().map(dict(zip(df2['topics'].str.strip(), df2['cluster'].astype(str)))).fillna('5'))

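Compute the total for each row. Note that I filled the NaN values with '5' so that topics with no mapping are still counted in the total. Then compute the ratio for each row: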

new = pd.crosstab(df1.Name, df1.topics).apply(lambda x: x / x.sum(), axis=1).drop(columns=['5'])
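To see why the '5' placeholder matters, here is a small sketch using student 3 only. The frame `mapped` is a hypothetical stand-in for the exploded and mapped df1 before `fillna`; it assumes that `pd.crosstab` leaves rows with a missing column value out of the counts.

```python
import pandas as pd

# Student 3 after mapping: botany -> cluster '2'; electricity and mechanics are not in df2.
mapped = pd.DataFrame({
    'Name': ['student 3'] * 3,
    'topics': ['2', None, None],
})

# Without a placeholder the unmapped rows are not counted, so the denominator is 1.
without = pd.crosstab(mapped.Name, mapped.topics).apply(lambda r: r / r.sum(), axis=1)

# With the placeholder the unmapped topics still count towards the row total of 3.
filled = mapped.assign(topics=mapped['topics'].fillna('5'))
with_placeholder = pd.crosstab(filled.Name, filled.topics).apply(lambda r: r / r.sum(), axis=1)

print(without.loc['student 3', '2'])           # ratio inflated to 1.0
print(with_placeholder.loc['student 3', '2'])  # 1/3, as in the expected output
```

Any label that is not a real cluster works as the placeholder; '5' is only safe here because the clusters are 0 to 3.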

# Sum across each row, then down each column
new['All'] = new.sum(axis=1)
new.loc['All'] = new.sum(axis=0)
print(new)

Output:

topics            0         1         2     3       All
Name                                                   
student 1  0.400000  0.200000  0.200000  0.00  0.800000
student 2  0.250000  0.250000  0.000000  0.25  0.750000
student 3  0.000000  0.000000  0.333333  0.00  0.333333
student 4  0.166667  0.166667  0.333333  0.00  0.666667
All        0.816667  0.616667  0.866667  0.25  2.550000