Pivoting the SNP table: converting a CSV file to JSON using Bash
I am working with GWAS data and need some help.
My data looks like this:
IID,rs098083,kgp794789,rs09848309,kgp8300747,.....
63,CC,AG,GA,AA,.....
54,AT,CT,TT,AG,.....
12,TT,GA,AG,AA,.....
.
.
.
As shown above, I have 512 rows and 2 million columns in total.
Desired output:
SNP,Genotyping
rs098083,{
"CC" : [ 1, 63, 6, 18, 33, ...],
"CT" : [ 2, 54, 6, 7, 8, ...],
"TT" : [ 4, 9, 12, 13, ...],
"AA" : [86, 124, 4, 19, ...],
"AT" : [8, 98, 34, 74, ....],
.
.
.
}
kgp794789,{
"CC" : [ 1, 63, 6, 18, 33, ...],
"CT" : [ 2, 5, 6, 7, 8, ...],
"TT" : [ 4, 9, 12, 13, ...],
"AA" : [86, 124, 4, 19, ...],
"AT" : [8, 98, 34, 74, ....],
.
.
.
}
rs09848309,{
"CC" : [ 1, 63, 6, 18, 3, ...],
"CT" : [ 2, 5, 6, 7, 8, ...],
"TT" : [ 4, 9, 24, 13, ...],
"AA" : [86, 134, 4, 19, ...],
"AT" : [8, 48, 34, 44, ....],
.
.
.
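As shown above, after pivoting I should have a JSON file with 2 million rows and 2 columns. The SNP column of each row holds the SNP ID. The Genotyping column holds a JSON blob: a set of key-value pairs in which each key is a particular genotype (e.g. CC, CT, TT, ....) and each value is the list of IIDs whose genotype matches that key.
The output format is "CSV with embedded JSON".
Here is one approach using stedolan/jq: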
jq -Rrn '
[ inputs / "," ] | transpose | .[0][1:] as $h | .[1:][]
| .[1:] |= [reduce ([.,$h] | transpose[]) as $t ({}; .[$t[0]] += [$t[1]]) | @text]
| join(", ")
'
rs098083, {"CC":["63"],"AT":["54"],"TT":["12"]}
kgp794789, {"AG":["63"],"CT":["54"],"GA":["12"]}
rs09848309, {"GA":["63"],"TT":["54"],"AG":["12"]}
kgp8300747, {"AA":["63","12"],"AG":["54"]}
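The command above is shown without an input; jq accepts the CSV either on standard input or as a file argument. A minimal sketch of a complete run, assuming the three sample rows from the question are saved under the hypothetical name snps.csv:

```shell
# Build a small sample file from the question's first three rows
# (snps.csv is a hypothetical file name, used only for illustration).
cat > snps.csv <<'EOF'
IID,rs098083,kgp794789,rs09848309,kgp8300747
63,CC,AG,GA,AA
54,AT,CT,TT,AG
12,TT,GA,AG,AA
EOF

# Same filter as above, with the input file passed as an argument.
# -R reads raw lines, -n lets `inputs` consume every line, -r prints raw strings.
result=$(jq -Rrn '
  [ inputs / "," ] | transpose | .[0][1:] as $h | .[1:][]
  | .[1:] |= [reduce ([.,$h] | transpose[]) as $t ({}; .[$t[0]] += [$t[1]]) | @text]
  | join(", ")
' snps.csv)

printf '%s\n' "$result"
```

Note that the filter collects all input lines into one array before transposing, so the whole table is held in memory at once.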
If the IDs should be encoded as JSON numbers, add tonumber:
jq -Rrn '
[ inputs / "," ] | transpose | (.[0][1:] | map(tonumber)) as $h | .[1:][]
| .[1:] |= [reduce ([.,$h] | transpose[]) as $t ({}; .[$t[0]] += [$t[1]]) | @text]
| join(", ")
'
rs098083, {"CC":[63],"AT":[54],"TT":[12]}
kgp794789, {"AG":[63],"CT":[54],"GA":[12]}
rs09848309, {"GA":[63],"TT":[54],"AG":[12]}
kgp8300747, {"AA":[63,12],"AG":[54]}
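The lines above join the SNP ID and the JSON text with a plain ", ", so the embedded commas and double quotes are not escaped. If the file is meant for a strict CSV parser, one possible variation (a sketch, not part of the original answer) is to let jq's @csv formatter quote both fields. The file name snps.csv is again a hypothetical stand-in for the real data:

```shell
# Sample input (hypothetical file name), taken from the question's first rows.
cat > snps.csv <<'EOF'
IID,rs098083,kgp794789,rs09848309,kgp8300747
63,CC,AG,GA,AA
54,AT,CT,TT,AG
12,TT,GA,AG,AA
EOF

# Build a two-element array [SNP id, JSON text] per SNP column and format it
# with @csv, which wraps strings in double quotes and doubles embedded quotes.
quoted=$(jq -Rrn '
  [ inputs / "," ] | transpose | (.[0][1:] | map(tonumber)) as $h | .[1:][]
  | [ .[0],
      (reduce ([.[1:], $h] | transpose[]) as $t ({}; .[$t[0]] += [$t[1]]) | tojson) ]
  | @csv
' snps.csv)

printf '%s\n' "$quoted"
```

Whether this quoting is wanted depends on what reads the file afterwards; the unquoted form above is easier to read by eye.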
If your ultimate goal is a JSON representation anyway, you can skip formatting raw output; something like this might do:
jq -Rn '
[ inputs / "," ] | transpose | .[0][1:] as $h | reduce .[1:][] as $t (
{}; .[$t[0]] = reduce ([$t[1:],$h] | transpose[]) as $i (
{}; .[$i[0]] += [$i[1]]
)
)
'
{
"rs098083": { "CC": ["63"], "AT": ["54"], "TT": ["12"] },
"kgp794789": { "AG": ["63"], "CT": ["54"], "GA": ["12"] },
"rs09848309": { "GA": ["63"], "TT": ["54"], "AG": ["12"] },
"kgp8300747": { "AA": ["63", "12"], "AG": ["54"] }
}
Demo (output manually formatted for easier comparison with the previous solutions)
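With the single-object form, looking up the IIDs for a given SNP and genotype becomes a plain path expression. A small sketch, assuming the sample rows are in snps.csv and the result is written to snps.json (both file names are hypothetical):

```shell
# Sample input (hypothetical file name), taken from the question's first rows.
cat > snps.csv <<'EOF'
IID,rs098083,kgp794789,rs09848309,kgp8300747
63,CC,AG,GA,AA
54,AT,CT,TT,AG
12,TT,GA,AG,AA
EOF

# Same filter as above: one object keyed by SNP id, then by genotype.
jq -Rn '
  [ inputs / "," ] | transpose | .[0][1:] as $h | reduce .[1:][] as $t (
    {}; .[$t[0]] = reduce ([$t[1:],$h] | transpose[]) as $i (
      {}; .[$i[0]] += [$i[1]]
    )
  )
' snps.csv > snps.json

# IIDs whose genotype at kgp8300747 is AA (-c prints compact JSON).
aa_ids=$(jq -c '.kgp8300747.AA' snps.json)
printf '%s\n' "$aa_ids"   # prints ["63","12"]
```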