Divide column values by total rows in Impala
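Since Impala does not allow SET operations or subqueries in the select list, I am having a hard time figuring out how to divide column values by the total number of rows returned. My end goal is to calculate the minor allele frequency at each chr:start position.
My data is structured as follows: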
| chr | start | stop | ref | allele1seq | allele2seq | sample_id |
| 6 | 66720709 | 66720710 | A | A | T | 101-46-3 |
| 7 | 66720809 | 66720810 | GG | GA | GG | 101-46-3 |
I would like to run something similar to the following query:
WITH vars as
(SELECT cgi.chr, cgi.start, concat(cgi.chr, ':', CAST(cgi.start AS STRING)) as pos, cgi.ref, cgi.allele1seq, cgi.allele2seq,
CASE
WHEN (cgi.allele1seq = cgi.ref AND cgi.allele2seq <> cgi.ref) THEN '1'
WHEN (cgi.allele1seq <> cgi.ref AND cgi.allele2seq = cgi.ref) THEN '1'
WHEN (cgi.allele1seq = cgi.ref AND cgi.allele2seq = cgi.ref) THEN '2'
ELSE '0' END as ma_count
FROM comgen_variants as cgi)
SELECT vars.*, (CAST(vars.ma_count as INT)/
((SELECT COUNT(DISTINCT cgi.sample_id) from comgen_variants as cgi) * 2)) as maf
FROM vars
My desired output would be:
| chr | start | ref | allele1seq | allele2seq | ma_count | maf |
| 6 | 66720709 | A | A | T | 1 | .05 |
| 7 | 66720809 | GG | GG | GG | 0 | 0 |
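Besides figuring out how to divide by the row count, I also need to group the results by chr and pos and then count how many times each alternate allele occurs (where allele1seq and allele2seq are not equal to ref), rather than simply computing a value per row as above; I have not gotten that far yet because of the counting problem.
Thanks in advance for your help.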
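It looks like you can compute the total number of distinct sample_ids * 2 up front and then use it in the subsequent query, since that value does not change from row to row. If the value did depend on the row, you might want to look at the analytic/window functions available to Impala.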
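For the case where the divisor does vary by row, a minimal sketch using an analytic function might look like the following. It assumes, purely for illustration, that the table holds one row per sample per locus, so that twice the row count at each chr:start position is the number of alleles observed there. The alias maf_per_locus is hypothetical and not from the original post.

```sql
-- Sketch only: per-locus denominator computed with a window function.
-- Assumption: one row per sample per (chr, start), so COUNT(*) * 2
-- is the number of alleles observed at that locus.
SELECT cgi.chr, cgi.start, cgi.ref, cgi.allele1seq, cgi.allele2seq,
  (CASE
     WHEN (cgi.allele1seq = cgi.ref AND cgi.allele2seq <> cgi.ref) THEN 1
     WHEN (cgi.allele1seq <> cgi.ref AND cgi.allele2seq = cgi.ref) THEN 1
     WHEN (cgi.allele1seq = cgi.ref AND cgi.allele2seq = cgi.ref) THEN 2
     ELSE 0 END)
  / (COUNT(*) OVER (PARTITION BY cgi.chr, cgi.start) * 2) AS maf_per_locus
FROM comgen_variants AS cgi;
```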
However, since it looks like you don't need that, you can do something like the following:
WITH total AS
(SELECT COUNT(DISTINCT sample_id) * 2 AS total FROM comgen_variants)
SELECT cgi.*,
(CASE
WHEN (cgi.allele1seq = cgi.ref AND cgi.allele2seq <> cgi.ref) THEN 1
WHEN (cgi.allele1seq <> cgi.ref AND cgi.allele2seq = cgi.ref) THEN 1
WHEN (cgi.allele1seq = cgi.ref AND cgi.allele2seq = cgi.ref) THEN 2
ELSE 0 END) / total.total AS maf
FROM comgen_variants AS cgi, total;
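That said, I'm not sure this is actually the minor allele frequency; it seems you would want the frequency of the second most common allele at each locus?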
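The grouping step mentioned in the question (counting non-reference alleles per chr:start position) could be sketched as below. This assumes that an alternate allele is any allele1seq or allele2seq value that differs from ref, and that the denominator is still twice the number of distinct sample_ids. The names per_locus, alt_count, and alt_freq are illustrative and not from the original post. The sketch does not pick the second most common allele, so it gives an alternate allele frequency rather than a strict minor allele frequency.

```sql
-- Sketch only: alternate-allele frequency per locus (chr, start).
WITH total AS
  (SELECT COUNT(DISTINCT sample_id) * 2 AS total FROM comgen_variants),
per_locus AS
  (SELECT cgi.chr, cgi.start, cgi.ref,
          -- each row contributes 0, 1, or 2 non-reference alleles
          SUM(CASE WHEN cgi.allele1seq <> cgi.ref THEN 1 ELSE 0 END
            + CASE WHEN cgi.allele2seq <> cgi.ref THEN 1 ELSE 0 END) AS alt_count
   FROM comgen_variants AS cgi
   GROUP BY cgi.chr, cgi.start, cgi.ref)
-- total has exactly one row, so the join adds the same divisor to every locus
SELECT per_locus.*, per_locus.alt_count / total.total AS alt_freq
FROM per_locus, total;
```

Samples with no row at a given locus are implicitly treated as carrying two reference alleles, because the divisor counts every sample in the table.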