In SQL, how do I group by each one of a long list of columns and get counts, all assembled into one table?
I performed stratified sampling on a multi-label dataset before training a classifier, and now I want to check how balanced it is. The columns in the dataset are:
|_Body|label_0|label_1|label_10|label_100|label_101|label_102|label_103|label_104|label_11|label_12|label_13|label_14|label_15|label_16|label_17|label_18|label_19|label_2|label_20|label_21|label_22|label_23|label_24|label_25|label_26|label_27|label_28|label_29|label_3|label_30|label_31|label_32|label_33|label_34|label_35|label_36|label_37|label_38|label_39|label_4|label_40|label_41|label_42|label_43|label_44|label_45|label_46|label_47|label_48|label_49|label_5|label_50|label_51|label_52|label_53|label_54|label_55|label_56|label_57|label_58|label_59|label_6|label_60|label_61|label_62|label_63|label_64|label_65|label_66|label_67|label_68|label_69|label_7|label_70|label_71|label_72|label_73|label_74|label_75|label_76|label_77|label_78|label_79|label_8|label_80|label_81|label_82|label_83|label_84|label_85|label_86|label_87|label_88|label_89|label_9|label_90|label_91|label_92|label_93|label_94|label_95|label_96|label_97|label_98|label_99|
I want to group by each label_* column once and build a resulting dictionary of positive/negative counts. Currently I'm doing this in PySpark SQL as follows:
# Evaluate how skewed the sample is after balancing it by resampling
stratified_sample = spark.read.json('s3://Whosebug-events/1901/Sample.Stratified.{}.*.jsonl'.format(limit))
stratified_sample.registerTempTable('stratified_sample')
label_counts = {}
for i in range(0, 100):
    count_df = spark.sql('SELECT label_{}, COUNT(*) as total FROM stratified_sample GROUP BY label_{}'.format(i, i))
    rows = count_df.rdd.take(2)
    neg_count = getattr(rows[0], 'total')
    pos_count = getattr(rows[1], 'total')
    label_counts[i] = [neg_count, pos_count]
The output is:
{0: [1034673, 14491],
1: [1023250, 25914],
2: [1030462, 18702],
3: [1035645, 13519],
4: [1037445, 11719],
5: [1010664, 38500],
6: [1031699, 17465],
...}
It feels like this should be possible in a single SQL statement, but I don't know how to do it, and I can't find an existing solution. Obviously I don't want to write out all the column names, and generating the SQL seems even worse than this solution.
Can this be done in SQL? Thanks!
You can indeed do this in a single statement, but I'm not sure the performance will be good.
from pyspark.sql import functions as F
from functools import reduce
dataframes_list = [
    stratified_sample.groupBy(
        "label_{}".format(i)
    ).count().select(
        F.lit("label_{}".format(i)).alias("col"),
        "count"
    )
    for i in range(0, 100)
]
count_df = reduce(
    lambda a, b: a.union(b),
    dataframes_list
)
This creates a dataframe with 2 columns: col, which holds the name of the column being counted, and count, which holds the count value.
Converting it into a dictionary I leave to you.
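For what that last step could look like: as written above, count_df keeps only the column name and the count, so the positive and negative rows are not distinguishable. A minimal sketch that also keeps the grouped value and then rebuilds the question's {index: [neg_count, pos_count]} dict, assuming the label columns only hold 0 and 1 (the idx and value aliases are my own, not part of the answer above):

from functools import reduce
from pyspark.sql import functions as F

# Same union as above, but keeping the grouped value so that positive
# and negative rows can be told apart. Assumes each label_{i} is 0/1.
dataframes_list = [
    stratified_sample.groupBy(
        "label_{}".format(i)
    ).count().select(
        F.lit(i).alias("idx"),                       # which label column this row describes
        F.col("label_{}".format(i)).alias("value"),  # the grouped value itself (0 or 1)
        "count"
    )
    for i in range(0, 100)
]
count_df = reduce(lambda a, b: a.union(b), dataframes_list)

# Rebuild the question's {index: [neg_count, pos_count]} shape.
label_counts = {}
for row in count_df.collect():
    pair = label_counts.setdefault(row["idx"], [0, 0])
    pair[int(row["value"])] = row["count"]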
You can generate the SQL without any GROUP BY.
Something like:
SELECT COUNT(*) AS total, SUM(label_k) AS positive_k, ... FROM table
Then use the result to build your dictionary: {k : [total - positive_k, positive_k]}
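A minimal sketch of that approach, assuming the label columns hold 0/1 (so SUM(label_k) is the positive count) and reusing the stratified_sample temp table and 100 label columns from the question:

# One wide aggregate: the total row count plus one SUM per label column.
sum_exprs = ', '.join('SUM(label_{i}) AS positive_{i}'.format(i=i) for i in range(0, 100))
sql = 'SELECT COUNT(*) AS total, {} FROM stratified_sample'.format(sum_exprs)

row = spark.sql(sql).first()
# Negatives are everything that is not positive: total - positive_k.
label_counts = {
    i: [row['total'] - row['positive_{}'.format(i)], row['positive_{}'.format(i)]]
    for i in range(0, 100)
}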
Here is a single-SQL solution for getting all the positive and negative counts:
# Build one SELECT with a pos/neg count pair for every label column.
# Labels are assumed to be 0/1, so a negative example has label = 0.
sql = 'select '
for i in range(0, 100):
    sql = sql + ' sum(CASE WHEN label_{} > 0 THEN 1 ELSE 0 END) as label{}_pos_count, '.format(i, i)
    sql = sql + ' sum(CASE WHEN label_{} = 0 THEN 1 ELSE 0 END) as label{}_neg_count'.format(i, i)
    if i < 99:
        sql = sql + ', '
sql = sql + ' from stratified_sample '
df = spark.sql(sql)
rows = df.rdd.take(1)
# The result row holds (pos, neg) pairs, two columns per label;
# swap the indices below if you want the question's [neg, pos] order.
label_counts = {}
for i in range(0, 100):
    label_counts[i] = [rows[0][2 * i], rows[0][2 * i + 1]]
print(label_counts)