Summary report of counts within subclusters
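I have a dataset like the one below.
The requirement is to get, for each cluster, a count of:
IP only - Parent and Child are both IP in a subcluster within the cluster
P&T only - Parent and Child are both P&T in a subcluster within the cluster
IP->P&T - the Parent is IP and the Child is P&T in a subcluster within the cluster
P&T->IP - the Parent is P&T and the Child is IP in a subcluster within the cluster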
Final_cluster Relation Subcluster Category
5 Parent 1 IP
5 Child 1 IP
5 Child 1 IP
5 Child 4 IP
5 Parent 4 P&T
5 Parent 5 IP
5 Child 5 P&T
5 Child 5 P&T
5 Child 5 P&T
5 Child 5 P&T
7 Parent 1 P&T
7 Child 1 P&T
7 Parent 2 IP
7 Child 2 IP
7 Parent 3 P&T
7 Child 3 P&T
7 Child 7 IP
7 Child 7 P&T
7 Parent 7 P&T
So the final result should look like this:
Cluster IP->IP P&T->P&T IP->P&T P&T->IP
5 1 1 2
7 1 2 1
I was able to create the counts for the single-category cases using the following sqldf queries:
library(sqldf)

# Subclusters that contain exactly one distinct category
single_cat <- sqldf("SELECT Final_cluster, Subcluster, Category, COUNT(DISTINCT Category) AS count_single
FROM final_output_csv
GROUP BY Final_cluster, Subcluster
HAVING COUNT(DISTINCT Category) = 1")

# Number of single-category subclusters per cluster and category
single_cat_final <- sqldf("SELECT Final_cluster, Category, COUNT(count_single) AS total_count
FROM single_cat
GROUP BY Final_cluster, Category")
I was able to solve the problem with sqldf in multiple steps.
If anyone can post a better approach, please share it.
library(sqldf)

# Subclusters with a single distinct category (IP only or P&T only)
single_cat <- sqldf("SELECT Final_cluster, New_Subcol5, Category, COUNT(DISTINCT Category) AS count_single
FROM final_output_csv
GROUP BY Final_cluster, New_Subcol5
HAVING COUNT(DISTINCT Category) = 1")
single_cat_final <- sqldf("SELECT Final_cluster, Category, COUNT(count_single) AS total_count
FROM single_cat
GROUP BY Final_cluster, Category")
IP_only <- sqldf("SELECT Final_cluster, Category, total_count FROM single_cat_final WHERE Category = 'IP'")
PT_only <- sqldf("SELECT Final_cluster, Category, total_count FROM single_cat_final WHERE Category = 'P&T'")

# Subclusters that mix both categories
info1 <- sqldf("SELECT Final_cluster AS A, New_Subcol5 AS B FROM final_output_csv GROUP BY Final_cluster, New_Subcol5 HAVING COUNT(DISTINCT Category) = 2")

# IP->P&T: mixed subclusters whose Parent is IP
subset_IP <- sqldf("SELECT Final_cluster, New_Subcol5, Relation, Category
FROM final_output_csv, info1
WHERE final_output_csv.Final_cluster = info1.A
AND final_output_csv.New_Subcol5 = info1.B
AND final_output_csv.Relation = 'Parent' AND Category = 'IP'")
IP_PT <- sqldf("SELECT Final_cluster, COUNT(New_Subcol5) AS total_count_IP_PT FROM subset_IP GROUP BY Final_cluster")

# P&T->IP: mixed subclusters whose Parent is P&T
subset_PT <- sqldf("SELECT Final_cluster, New_Subcol5, Relation, Category
FROM final_output_csv, info1
WHERE final_output_csv.Final_cluster = info1.A
AND final_output_csv.New_Subcol5 = info1.B
AND final_output_csv.Relation = 'Parent' AND Category = 'P&T'")
PT_IP <- sqldf("SELECT Final_cluster, COUNT(New_Subcol5) AS total_count_PT_IP FROM subset_PT GROUP BY Final_cluster")

# Combine the four counts into one table, keeping clusters that lack some types
final_cat <- merge(merge(merge(IP_only, PT_only, by = 'Final_cluster', all = TRUE), IP_PT, by = 'Final_cluster', all = TRUE), PT_IP, by = 'Final_cluster', all = TRUE)
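For comparison, the same classification can probably be done in one pass without sqldf. The following is only a minimal sketch in base R, not a drop-in replacement for the steps above. It assumes the subcluster column is named Subcluster as in the printed dataset (the queries above call it New_Subcol5), that only the two categories IP and P&T occur, and that a mixed subcluster is labelled by its Parent's category, which mirrors the Relation = 'Parent' filter used above. The names classify_subcluster, labels and type_levels are made up for illustration.

```r
# Sample rows copied from the question; column names follow the printed header.
final_output_csv <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Final_cluster Relation Subcluster Category
5 Parent 1 IP
5 Child 1 IP
5 Child 1 IP
5 Child 4 IP
5 Parent 4 P&T
5 Parent 5 IP
5 Child 5 P&T
5 Child 5 P&T
5 Child 5 P&T
5 Child 5 P&T
7 Parent 1 P&T
7 Child 1 P&T
7 Parent 2 IP
7 Child 2 IP
7 Parent 3 P&T
7 Child 3 P&T
7 Child 7 IP
7 Child 7 P&T
7 Parent 7 P&T
")

# Hypothetical helper: label one subcluster as "<parent category>-><child category>".
classify_subcluster <- function(relation, category) {
  cats <- unique(category)
  if (length(cats) == 1) {
    return(paste0(cats, "->", cats))   # only one category in the whole subcluster
  }
  parent_cat <- unique(category[relation == "Parent"])
  if (length(parent_cat) != 1) {
    return(NA_character_)              # no parent, or parents disagree: left unclassified
  }
  # Assumption: two categories only, so the "other" one is the child side.
  paste0(parent_cat, "->", setdiff(cats, parent_cat)[1])
}

# One data frame per (cluster, subcluster) pair.
groups <- split(final_output_csv,
                list(final_output_csv$Final_cluster, final_output_csv$Subcluster),
                drop = TRUE)

# One row per subcluster with its label.
labels <- do.call(rbind, lapply(groups, function(g) {
  data.frame(Final_cluster = g$Final_cluster[1],
             Subcluster    = g$Subcluster[1],
             type          = classify_subcluster(g$Relation, g$Category),
             stringsAsFactors = FALSE)
}))

# Cross-tabulate clusters against the four types; absent types show as 0.
type_levels <- c("IP->IP", "P&T->P&T", "IP->P&T", "P&T->IP")
final_cat <- table(Cluster = labels$Final_cluster,
                   Type    = factor(labels$type, levels = type_levels))
print(final_cat)
```

Because each subcluster gets exactly one label, every subcluster is counted once, and types that never occur in a cluster appear as 0 instead of NA. If the real data has more than two categories or several parents per subcluster, classify_subcluster would need to be adapted.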