Count topics in each cluster and divide by the total number of topics for each student in Python
I have two dataframes (df1, df2): df1 contains student names and each student's topic preferences, and df2 maps topics to clusters.
Here is a sample input:
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'Name': ['student 1', 'student 2', 'student 3', 'student 4'],
                    'topics': ['algebra; atom; geometry; evolution; food safety',
                               'chemical reaction; linear algebra; Probability; quantum',
                               'botany; electricity; mechanics',
                               'Statistics; botany; number theory; atom; evolution; Probability']})
df2 = pd.DataFrame({'topics': ['algebra', 'Probability', 'geometry', 'atom', 'chemical reaction',
                               'evolution', 'botany', 'quantum'],
                    'cluster': [0, 0, 0, 1, 1, 2, 2, 3]})
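I want to represent each student as a k-dimensional vector (here k=4): for each cluster, count the student's topics that appear in df2 under that cluster and divide by the student's total number of topics. For example, student 1 has two topics in cluster 0 (algebra and geometry); dividing by 5 (student 1's total number of topics) gives 0.4, and so on.
The result should look like this: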
There may be a better way to achieve what you want, but this should work:
out = (
    df1.assign(topics=df1['topics'].str.split('; ')).explode('topics')
       .merge(df2, how='left', on='topics')
       .assign(size=lambda x: x.groupby('Name')['cluster'].transform('size'),
               count=lambda x: x.groupby(['Name', 'cluster'])['cluster'].transform('count'),
               ratio=lambda x: x['count'] / x['size'])
       .query('cluster.notna()').astype({'cluster': int})
       .drop_duplicates(['Name', 'cluster'])
       .pivot_table('ratio', 'Name', 'cluster', aggfunc='sum', fill_value=0, margins=True)
       .rename_axis(index=None, columns=None)
)
Output:
>>> out
                  0         1         2     3       All
student 1  0.400000  0.200000  0.200000  0.00  0.800000
student 2  0.250000  0.250000  0.000000  0.25  0.750000
student 3  0.000000  0.000000  0.333333  0.00  0.333333
student 4  0.166667  0.166667  0.333333  0.00  0.666667
All        0.816667  0.616667  0.866667  0.25  2.550000
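To see where these ratios come from, it can help to look at the long-format table before the pivot. The following is a minimal sketch for student 1 only; the name `long_df` is introduced here for illustration and is not part of the answer above. It relies on `transform('size')` counting every row of the student, including topics that have no cluster (NaN after the left merge), while `transform('count')` counts only non-null cluster values.

```python
import pandas as pd

# Student 1 only, to keep the intermediate table small.
df1 = pd.DataFrame({'Name': ['student 1'],
                    'topics': ['algebra; atom; geometry; evolution; food safety']})
df2 = pd.DataFrame({'topics': ['algebra', 'Probability', 'geometry', 'atom',
                               'chemical reaction', 'evolution', 'botany', 'quantum'],
                    'cluster': [0, 0, 0, 1, 1, 2, 2, 3]})

# One row per (student, topic); unmatched topics get cluster = NaN.
long_df = (
    df1.assign(topics=df1['topics'].str.split('; ')).explode('topics')
       .merge(df2, how='left', on='topics')
)

# 'size' includes the NaN row, so it is the student's total number of topics.
long_df['size'] = long_df.groupby('Name')['cluster'].transform('size')
# 'count' is the number of the student's topics in the same cluster.
long_df['count'] = long_df.groupby(['Name', 'cluster'])['cluster'].transform('count')
long_df['ratio'] = long_df['count'] / long_df['size']

print(long_df)
```

In this table the 'food safety' row has no cluster, yet it still contributes to `size`, which is why the denominator for student 1 is 5 rather than 4.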
Here is an answer that delegates most of the work to a function named cluster(), which is called on the student dataframe via apply():
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'Name': ['student 1', 'student 2', 'student 3', 'student 4'],
                    'topics': ['algebra; atom; geometry; evolution; food safety',
                               'chemical reaction; linear algebra; Probability; quantum',
                               'botany; electricity; mechanics',
                               'Statistics; botany; number theory; atom; evolution; Probability']})
df2 = pd.DataFrame({'topics': ['algebra', 'Probability', 'geometry', 'atom', 'chemical reaction',
                               'evolution', 'botany', 'quantum'],
                    'cluster': [0, 0, 0, 1, 1, 2, 2, 3]})
print(df1)
print('\n', df2)

clusterNames = df2['cluster'].unique().tolist() + ['All']
clusters = [set(df2[df2['cluster'] == c]['topics']) for c in clusterNames[:-1]]

def cluster(x):
    studentTopics = set(top.strip() for top in x['topics'].split(';'))
    clusterPct = [len(clusterTopics & studentTopics) / len(studentTopics) for clusterTopics in clusters]
    clusterPct += [sum(clusterPct)]
    return clusterPct

df1[clusterNames] = df1.apply(cluster, axis=1).tolist()
df1 = pd.concat([df1.drop('topics', axis=1),
                 pd.DataFrame([['All'] + df1[clusterNames].sum(axis=0).tolist()],
                              columns=['Name'] + clusterNames)])
print('\n', df1)
Output:
        Name                                             topics
0  student 1    algebra; atom; geometry; evolution; food safety
1  student 2  chemical reaction; linear algebra; Probability...
2  student 3                     botany; electricity; mechanics
3  student 4  Statistics; botany; number theory; atom; evolu...

               topics  cluster
0            algebra        0
1        Probability        0
2           geometry        0
3               atom        1
4  chemical reaction        1
5          evolution        2
6             botany        2
7            quantum        3

         Name         0         1         2     3       All
0  student 1  0.400000  0.200000  0.200000  0.00  0.800000
1  student 2  0.250000  0.250000  0.000000  0.25  0.750000
2  student 3  0.000000  0.000000  0.333333  0.00  0.333333
3  student 4  0.166667  0.166667  0.333333  0.00  0.666667
0        All  0.816667  0.616667  0.866667  0.25  2.550000
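The per-student arithmetic inside cluster() can be checked in isolation with plain Python sets. This is only a sketch of the set-intersection idea for student 1; the names `cluster_topics`, `student_topics` and `vector` are made up for this illustration.

```python
# Topics per cluster, copied from df2 in the question.
cluster_topics = {
    0: {'algebra', 'Probability', 'geometry'},
    1: {'atom', 'chemical reaction'},
    2: {'evolution', 'botany'},
    3: {'quantum'},
}

# Student 1's topics, split on ';' and stripped of surrounding spaces.
raw = 'algebra; atom; geometry; evolution; food safety'
student_topics = {t.strip() for t in raw.split(';')}

# Share of the student's topics that fall into each cluster.
vector = [len(topics & student_topics) / len(student_topics)
          for topics in cluster_topics.values()]

print(vector)  # [0.4, 0.2, 0.2, 0.0]
```

Because the topics are held in a set, a topic that a student lists twice would be counted once in both the numerator and the denominator.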
Explode the topics in df1 so that each student's topic gets its own row:
df1 = df1.assign(topics=df1['topics'].str.split(';')).explode('topics')  # Explode each topic into a row
Map the df2 topics to their clusters in df1:
df1 = df1.assign(topics=df1['topics'].str.strip().map(dict(zip(df2['topics'].str.strip(), df2['cluster'].astype(str)))).fillna('5'))
Count the totals per row. Note that I filled NaN with '5' so that topics without a mapping are still counted in the denominator. Then compute the ratio for each row:
new = pd.crosstab(df1.Name, df1.topics).apply(lambda x: x / x.sum(), axis=1).drop(columns=['5'])
# Sum by row and by column
new['All'] = new.sum(1)
new.loc['All'] = new.sum(0)
print(new)
Output:
topics            0         1         2     3       All
Name
student 1  0.400000  0.200000  0.200000  0.00  0.800000
student 2  0.250000  0.250000  0.000000  0.25  0.750000
student 3  0.000000  0.000000  0.333333  0.00  0.333333
student 4  0.166667  0.166667  0.333333  0.00  0.666667
All        0.816667  0.616667  0.866667  0.25  2.550000
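As a variation on the crosstab approach, the row-wise division can probably be left to `pd.crosstab` itself via its `normalize='index'` option. The sketch below is not the answer's code: it assumes -1 is never a real cluster label and uses it as the placeholder for unmapped topics, and the names `long_df`, `mapping` and `ratios` are introduced here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['student 1', 'student 2', 'student 3', 'student 4'],
                    'topics': ['algebra; atom; geometry; evolution; food safety',
                               'chemical reaction; linear algebra; Probability; quantum',
                               'botany; electricity; mechanics',
                               'Statistics; botany; number theory; atom; evolution; Probability']})
df2 = pd.DataFrame({'topics': ['algebra', 'Probability', 'geometry', 'atom',
                               'chemical reaction', 'evolution', 'botany', 'quantum'],
                    'cluster': [0, 0, 0, 1, 1, 2, 2, 3]})

# One row per (student, topic), with a fresh unique index.
long_df = (df1.assign(topics=df1['topics'].str.split(';'))
              .explode('topics')
              .reset_index(drop=True))
long_df['topics'] = long_df['topics'].str.strip()

# Look up each topic's cluster; unmapped topics become -1 (assumed unused).
mapping = df2.set_index('topics')['cluster']
long_df['cluster'] = long_df['topics'].map(mapping).fillna(-1).astype(int)

# normalize='index' divides each row by its total, unmapped topics included.
ratios = pd.crosstab(long_df['Name'], long_df['cluster'], normalize='index')
ratios = ratios.drop(columns=-1)

print(ratios)
```

The placeholder column has to be dropped only after normalizing, otherwise the unmapped topics would no longer count toward each student's total.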