BigQuery with LIMIT 10 or fewer extracted values returns correct results; changing the limit or adding an extraction returns null
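The following is a question about genomic data:
I am running the following query in BigQuery against the PGP data: http://googlegenomics.readthedocs.io/en/latest/use_cases/discover_public_data/pgp_public_data.html
(For simplicity, a single sample ID is used: hu089792)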
SELECT
sample_id,
allele1Gene,
NTH(2,SPLIT(s.allele1XRef,':')) AS rsID,
NTH(1,SPLIT(allele1Gene,';')) AS input,
NTH(3,SPLIT((NTH(1,SPLIT(allele1Gene,';'))),':')) AS gene1,
NTH(2,SPLIT(allele1Gene,';')) AS input2,
NTH(3,SPLIT((NTH(2,SPLIT(allele1Gene,';'))),':')) AS gene2
FROM
[speedy-emissary-167213:pgp_orielresearch.pgp_variants_gene_dbsnp_hu089792] AS s
LIMIT
10
**The result is as expected:**

| Row | sample_id | allele1Gene | rsID | input | gene1 | input2 | gene2 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | hu089792 | 10645:NM_006549.3:CAMKK2:INTRON:UNKNOWN-INC;10645:NM_153499.2:CAMKK2:INTRON:UNKNOWN-INC;10645:NM_153500.1:CAMKK2:INTRON:UNKNOWN-INC;10645:NM_172214.2:CAMKK2:INTRON:UNKNOWN-INC;10645:NM_172215.2:CAMKK2:INTRON:UNKNOWN-INC;10645:NM_172216.1:CAMKK2:INTRON:UNKNOWN-INC;10645:NM_172226.2:CAMKK2:INTRON:UNKNOWN-INC | rs3794207 | 10645:NM_006549.3:CAMKK2:INTRON:UNKNOWN-INC | CAMKK2 | 10645:NM_153499.2:CAMKK2:INTRON:UNKNOWN-INC | CAMKK2 |
| 2 | hu089792 | 387357:NM_001010923.2:THEMIS:INTRON:UNKNOWN-INC;387357:NM_001164685.1:THEMIS:INTRON:UNKNOWN-INC;387357:NM_001164687.1:THEMIS:INTRON:UNKNOWN-INC | rs683202 | 387357:NM_001010923.2:THEMIS:INTRON:UNKNOWN-INC | THEMIS | 387357:NM_001164685.1:THEMIS:INTRON:UNKNOWN-INC | THEMIS |
| 3 | hu089792 | 10207:NM_176877.2:INADL:INTRON:UNKNOWN-INC | rs2666491 | 10207:NM_176877.2:INADL:INTRON:UNKNOWN-INC | INADL | null | null |
**When I raise the limit (to 1000 in the query below) or add another gene extraction, I get null results:**
**The query with the changed limit is:**
SELECT
sample_id,
allele1Gene,
NTH(2,SPLIT(s.allele1XRef,':')) AS rsID,
NTH(1,SPLIT(allele1Gene,';')) AS input,
NTH(3,SPLIT((NTH(1,SPLIT(allele1Gene,';'))),':')) AS gene1,
NTH(2,SPLIT(allele1Gene,';')) AS input2,
NTH(3,SPLIT((NTH(2,SPLIT(allele1Gene,';'))),':')) AS gene2
FROM
[speedy-emissary-167213:pgp_orielresearch.pgp_variants_gene_dbsnp_hu089792] AS s
LIMIT
1000
**The result is:**
| Row | sample_id | allele1Gene | rsID | input | gene1 | input2 | gene2 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | hu089792 | null | rs6078843 | null | null | null | null |
| 2 | hu089792 | null | rs79092469 | null | null | null | null |
| 3 | hu089792 | null | rs56216546 | null | null | null | null |
| 4 | hu089792 | null | rs9576011 | null | null | null | null |
**The other query (adding a third extraction):**
SELECT
sample_id,
allele1Gene,
NTH(2,SPLIT(s.allele1XRef,':')) AS rsID,
NTH(1,SPLIT(allele1Gene,';')) AS input,
NTH(3,SPLIT((NTH(1,SPLIT(allele1Gene,';'))),':')) AS gene1,
NTH(2,SPLIT(allele1Gene,';')) AS input2,
NTH(3,SPLIT((NTH(2,SPLIT(allele1Gene,';'))),':')) AS gene2,
NTH(3,SPLIT(allele1Gene,';')) AS input3,
NTH(3,SPLIT((NTH(3,SPLIT(allele1Gene,';'))),':')) AS gene3
FROM
[speedy-emissary-167213:pgp_orielresearch.pgp_variants_gene_dbsnp_hu089792] AS s
LIMIT 10
**Returns:**

| Row | sample_id | allele1Gene | rsID | input | gene1 | input2 | gene2 | input3 | gene3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | hu089792 | null | rs6551009 | null | null | null | null | null | null |
| 2 | hu089792 | null | rs2050586 | null | null | null | null | null | null |
| 3 | hu089792 | null | rs7151797 | null | null | null | null | null | null |
**Any idea why?**
Any help is greatly appreciated.
Best,
eilalan
**The original table includes 3 columns that are extracted from [google.com:biggene:pgp.cgi_variants]; see below:**

| Row | sample_id | allele1XRef | allele1Gene |
| --- | --- | --- | --- |
| 1 | hu089792 | dbsnp.107:rs3794207 | 10645:NM_006549.3:CAMKK2:INTRON:UNKNOWN-INC;10645:NM_153499.2:CAMKK2:INTRON:UNKNOWN-INC;10645:NM_153500.1:CAMKK2:INTRON:UNKNOWN-INC;10645:NM_172214.2:CAMKK2:INTRON:UNKNOWN-INC;10645:NM_172215.2:CAMKK2:INTRON:UNKNOWN-INC;10645:NM_172216.1:CAMKK2:INTRON:UNKNOWN-INC;10645:NM_172226.2:CAMKK2:INTRON:UNKNOWN-INC |
| 2 | hu089792 | dbsnp.83:rs683202 | 387357:NM_001010923.2:THEMIS:INTRON:UNKNOWN-INC;387357:NM_001164685.1:THEMIS:INTRON:UNKNOWN-INC;387357:NM_001164687.1:THEMIS:INTRON:UNKNOWN-INC |
| 3 | hu089792 | dbsnp.100:rs2666491 | 10207:NM_176877.2:INADL:INTRON:UNKNOWN-INC |
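BigQuery does not guarantee the order of output rows (unless you add an explicit ORDER BY).
So when you change the LIMIT, you will most likely get different rows in the output, and for those rows the respective extractions produce NULL. In the outputs above, allele1Gene itself is null for those rows, so every value derived from it is null as well.
For testing, I recommend adding a specific ORDER BY so that you get consistent rows in the output and therefore compare oranges to oranges, not oranges to apples.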
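To make the point concrete, here is a minimal Python sketch that imitates the NTH/SPLIT extraction outside BigQuery. It is only an illustration under stated assumptions: `nth_split`, `extract`, and `top_n` are hypothetical helper names, the first two rows are taken from the table in the question, and the third row (with a null allele1Gene) is invented to stand in for the rows that came back as null.

```python
from typing import Optional


def nth_split(value: Optional[str], sep: str, n: int) -> Optional[str]:
    """Imitate NTH(n, SPLIT(value, sep)): 1-based, None if there is no n-th part."""
    if value is None:
        return None
    parts = value.split(sep)
    return parts[n - 1] if len(parts) >= n else None


def extract(row: dict) -> dict:
    """Apply the same extractions as the query in the question."""
    gene = row["allele1Gene"]
    input1 = nth_split(gene, ";", 1)
    input2 = nth_split(gene, ";", 2)
    return {
        "rsID": nth_split(row["allele1XRef"], ":", 2),
        "gene1": nth_split(input1, ":", 3),
        "gene2": nth_split(input2, ":", 3),
    }


def top_n(rows: list, n: int) -> list:
    """Sort by rsID before cutting to n rows (the ORDER BY ... LIMIT n idea)."""
    extracted = [extract(r) for r in rows]
    return sorted(extracted, key=lambda r: r["rsID"])[:n]


rows = [
    # Hypothetical row: allele1Gene is null, as in the unexpected outputs.
    {"allele1XRef": "dbsnp.x:rs_example", "allele1Gene": None},
    # The next two rows are copied (shortened to two entries) from the question.
    {"allele1XRef": "dbsnp.83:rs683202",
     "allele1Gene": "387357:NM_001010923.2:THEMIS:INTRON:UNKNOWN-INC;"
                    "387357:NM_001164685.1:THEMIS:INTRON:UNKNOWN-INC"},
    {"allele1XRef": "dbsnp.100:rs2666491",
     "allele1Gene": "10207:NM_176877.2:INADL:INTRON:UNKNOWN-INC"},
]

for result in top_n(rows, 3):
    print(result)
```

The row with a null allele1Gene yields null for every derived column, and the row with a single gene entry yields null only for gene2, which matches the pattern in the question. Because the rows are sorted before being cut, two runs with different limits can be compared row by row.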