Xarray: Make two DataArrays in the same Dataset use the same coordinate system
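I have an ArviZ InferenceData posterior trace, which is an xarray Dataset.
In it, the posterior traces for two of my random variables, a_mu_org and b_mu_org, are DataArrays. Their dimensions are:
a_mu_org: (chain, draw, a_mu_org_dim_0), of lengths (1, 2000, 15) respectively.
b_mu_org: (chain, draw, b_mu_org_dim_0), of lengths (1, 2000, 15) respectively.
Semantically, a_mu_org and b_mu_org should both be indexed by a single categorical coordinate over the 15 organisms, rather than by separate indexes.
To make this clearer, here is the full string repr of the dataset: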
<xarray.Dataset>
Dimensions: (L_dim_0: 34281, a_dim_0: 456260, a_prot_shift_dim_0: 34281, b_dim_0: 456260, b_mu_org_dim_0: 15, b_prot_shift_dim_0: 34281, chain: 1, draw: 2000, organism: 15, sigma_dim_0: 34281, t50_org_dim_0: 15, t50_prot_dim_0: 39957)
Coordinates:
* chain (chain) int64 0
* draw (draw) int64 0 1 2 3 4 5 ... 1995 1996 1997 1998 1999
* a_prot_shift_dim_0 (a_prot_shift_dim_0) object 'A0A023PXQ4_YMR173W-A' ... 'Z4YNA9_AB124611'
* b_prot_shift_dim_0 (b_prot_shift_dim_0) object 'A0A023PXQ4_YMR173W-A' ... 'Z4YNA9_AB124611'
* L_dim_0 (L_dim_0) object 'A0A023PXQ4_YMR173W-A' ... 'Z4YNA9_AB124611'
a_mu_org_dim_0 (organism) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
* a_dim_0 (a_dim_0) object 'ytzI' 'mtlF' ... 'atpG2' 'atpB2'
* b_mu_org_dim_0 (b_mu_org_dim_0) int64 0 1 2 3 4 5 ... 9 10 11 12 13 14
* b_dim_0 (b_dim_0) object 'ytzI' 'mtlF' ... 'atpG2' 'atpB2'
* t50_prot_dim_0 (t50_prot_dim_0) <U65 'Bacillus subtilis_168_lysate_R1-C0H3Q1_ytzI' ... 'Oleispira antarctica_RB-8_lysate_R1-R4YVF0_atpB2'
* t50_org_dim_0 (t50_org_dim_0) <U43 'Arabidopsis thaliana seedling lysate' ... 'Thermus thermophilus HB27 lysate'
* sigma_dim_0 (sigma_dim_0) object 'A0A023PXQ4_YMR173W-A' ... 'Z4YNA9_AB124611'
Dimensions without coordinates: organism
Data variables:
a_org_pop (chain, draw) float32 519.3236 518.8292 ... 517.84784
a_prot_shift (chain, draw, a_prot_shift_dim_0) float32 ...
b_org_pop (chain, draw) float32 11.509291 11.445394 ... 11.929538
b_prot_shift (chain, draw, b_prot_shift_dim_0) float32 ...
L_pop (chain, draw) float32 3.445896 3.4300675 ... 3.3917112
L (chain, draw, L_dim_0) float32 ...
a_mu_org (chain, draw, organism) float32 430.56827 ... 813.2518
a (chain, draw, a_dim_0) float32 ...
b_mu_org (chain, draw, b_mu_org_dim_0) float32 9.997488 ... 8.389757
b (chain, draw, b_dim_0) float32 ...
t50_prot (chain, draw, t50_prot_dim_0) float32 39.249863 ... 52.19809
t50_org (chain, draw, t50_org_dim_0) float32 43.067646 ... 96.93388
sigma (chain, draw, sigma_dim_0) float32 ...
Attributes:
created_at: 2020-04-23T08:54:58.300091
arviz_version: 0.7.0
inference_library: pymc3
inference_library_version: 3.8
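I would like a_mu_org and b_mu_org to have the dimensions (chain, draw, organism) instead of their separate a_mu_org_dim_0 and b_mu_org_dim_0. Things I have already tried include:
- Adding a coordinate called organism and then calling trace.posterior.swap_dims({"a_mu_org_dim_0": "organism"}), but I get an error saying "replacement dimension 'organism' is not a 1D variable along the old dimension 'a_mu_org_dim_0'".
- Renaming the dimension a_mu_org_dim_0 to organism, but then I cannot also swap b_mu_org_dim_0 over to the new organism.
Is what I am trying to accomplish feasible?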
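For context, the swap_dims error quoted above can be reproduced on a small toy dataset. The sketch below is only an illustration under assumptions: the dataset ds, its shapes, and the labels in organism_names are made up and are not the real posterior. It suggests that swap_dims accepts an existing variable as the replacement only when that variable is 1-D along the dimension being replaced, which would explain why a second variable cannot be swapped onto the same name afterwards.

```python
import numpy as np
import xarray as xr

# Toy stand-in for the posterior (assumed shapes, random values).
rng = np.random.default_rng(0)
ds = xr.Dataset(
    {
        "a": (("chain", "draw", "a_dim_0"), rng.normal(size=(4, 100, 15))),
        "b": (("chain", "draw", "b_dim_0"), rng.normal(size=(4, 100, 15))),
    },
    coords={
        "chain": np.arange(4),
        "draw": np.arange(100),
        "a_dim_0": np.arange(15),
        "b_dim_0": np.arange(15),
    },
)
organism_names = [f"o{i}" for i in range(15)]  # hypothetical labels

# Case 1: "organism" is added as a coordinate along its own new dimension.
# It is not defined along a_dim_0, so swap_dims rejects it.
own_dim = ds.assign_coords(organism=organism_names)
swap_error = None
try:
    own_dim.swap_dims({"a_dim_0": "organism"})
except ValueError as err:
    swap_error = str(err)

# Case 2: "organism" is defined along a_dim_0, so the swap is accepted.
along_a = ds.assign_coords(organism=("a_dim_0", organism_names))
swapped = along_a.swap_dims({"a_dim_0": "organism"})

# After the swap, "organism" lives along the organism dimension, not b_dim_0,
# so swapping b_dim_0 onto it fails for the same reason as in case 1.
second_error = None
try:
    swapped.swap_dims({"b_dim_0": "organism"})
except ValueError as err:
    second_error = str(err)
```

In this sketch only a ends up on the organism dimension; b still needs a different route, such as dropping the old index coordinates and renaming.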
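I am not sure whether my solution is good practice; it feels a bit too hacky. Also, the terminology is tricky: I will try to stick to xarray terminology but may not succeed. The trick is to remove the coordinates so that a_dim_0 and b_dim_0 become bare dimensions (dimensions without coordinates). After that, they can be renamed to the same name and given a new coordinate. Here is an example.
Starting from the following dataset, called ds: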
<xarray.Dataset>
Dimensions: (a_dim_0: 15, b_dim_0: 15, chain: 4, draw: 100)
Coordinates:
* chain (chain) int64 0 1 2 3
* draw (draw) int64 0 1 2 3 4 5 6 7 8 9 ... 90 91 92 93 94 95 96 97 98 99
* a_dim_0 (a_dim_0) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
* b_dim_0 (b_dim_0) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Data variables:
a (chain, draw, a_dim_0) float64 0.8152 1.189 ... 1.32 -0.2023
b (chain, draw, b_dim_0) float64 0.6447 -0.8059 ... -0.06435 -0.8666
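The following three chained calls do the job (the position of assign_coords does not seem to affect the output, which makes sense; the key is to drop the coordinates before renaming):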
organism_names = [f"o{i}" for i in range(15)]
ds.reset_index(["a_dim_0", "b_dim_0"], drop=True) \
.assign_coords(organism=organism_names) \
.rename({"a_dim_0": "organism", "b_dim_0": "organism"})
Output:
<xarray.Dataset>
Dimensions: (chain: 4, draw: 100, organism: 15)
Coordinates:
* chain (chain) int64 0 1 2 3
* draw (draw) int64 0 1 2 3 4 5 6 7 8 9 ... 90 91 92 93 94 95 96 97 98 99
* organism (organism) <U3 'o0' 'o1' 'o2' 'o3' ... 'o11' 'o12' 'o13' 'o14'
Data variables:
a (chain, draw, organism) float64 0.8152 1.189 ... 1.32 -0.2023
b (chain, draw, organism) float64 0.6447 -0.8059 ... -0.8666
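As a cross-check, the same result can probably also be reached variable by variable. The sketch below is an alternative under assumptions, not the method from the answer above: it rebuilds a toy ds with random values, and the mapping dim_of and the labels in organism_names are hypothetical names introduced for illustration. Each DataArray has its own dimension renamed to organism, the arrays are reassembled into a Dataset, and the shared labels are assigned last.

```python
import numpy as np
import xarray as xr

# Toy dataset with the same layout as ds above (random values, assumed shapes).
rng = np.random.default_rng(0)
ds = xr.Dataset(
    {
        "a": (("chain", "draw", "a_dim_0"), rng.normal(size=(4, 100, 15))),
        "b": (("chain", "draw", "b_dim_0"), rng.normal(size=(4, 100, 15))),
    },
    coords={
        "chain": np.arange(4),
        "draw": np.arange(100),
        "a_dim_0": np.arange(15),
        "b_dim_0": np.arange(15),
    },
)
organism_names = [f"o{i}" for i in range(15)]  # hypothetical labels

# Hypothetical mapping: data variable -> the dimension that should become "organism".
dim_of = {"a": "a_dim_0", "b": "b_dim_0"}

# Renaming on each DataArray renames its dimension and index coordinate together.
# Both arrays then carry an identical integer "organism" index (0..14), so they
# line up when combined into one Dataset.
unified = xr.Dataset(
    {name: ds[name].rename({dim: "organism"}) for name, dim in dim_of.items()}
)

# Replace the integer index with the shared categorical labels.
unified = unified.assign_coords(organism=organism_names)

# Label-based selection now works the same way for both variables.
one_organism = unified.sel(organism="o3")
```

Here only the listed variables are carried over; any other variables in a larger dataset would have to be added back separately.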