Best practice to xarray Multiindex concat
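I have a set of 1000 (2D) pd.DataFrame objects (say, index: time, columns: run_id), each with 3 properties (say temperature, pressure, location). Ideally, I would like to have everything in an xr.DataArray with 5 dimensions (or an xr.Dataset with 4 dimensions and the last dimension as separate data variables).
I created a DataArray with two dims and 2+3 coordinates, but xr.concat does not seem to work over multiple dimensions. (I followed the approach mentioned here.)
Example: I build the DataArrays from the individual dataframes and the list of properties.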
import numpy as np
import pandas as pd
import xarray as xr

# Mock data: 500 dataframes, each 1000 hourly time steps x 8 runs
data = {}
for i in np.arange(500):
    data[i] = pd.DataFrame(np.random.randn(1000, 8),
                           index=pd.date_range(start='2013-01-01', periods=1000, freq='h'),
                           columns=list('ABCDEFGH'))
# One row of properties per dataframe (random, so combinations may repeat)
df_catalogue = pd.DataFrame(np.random.choice(10, (500, 3)), columns=['temp', 'pre', 'zon'])

# Build DataArrays, adding the properties as scalar coords
res_da = []
for i, v in df_catalogue.iterrows():
    i_df = data[i]  # data is a dictionary of properly indexed dataframes
    da = xr.DataArray(i_df.values,
                      coords={'time': i_df.index.values, 'runs': i_df.columns.values,
                              'temp': v['temp'], 'pre': v['pre'], 'zon': v['zon']},
                      dims=['time', 'runs'])
    res_da.append(da)
But when I try all_da = xr.concat(res_da, dim=['temp','pre','zon']) I get weird results. What is the best way to achieve something like this:
<xarray.DataArray (time: 8000, runs: 50, temp:8, pre:10, zon: 5)>
array([[[ 4545.453613, 4545.453613, ..., 4545.453613, 4545.453613],
[ 4545.453613, 4545.453613, ..., 4545.453613, 4545.453613],
...,
[ 4177.425781, 4177.425781, ..., 4177.425781, 4177.425781]]], dtype=float32)
Coordinates:
* runs (runs) object 'A' 'B' ...
* time (time) datetime64[ns] 2013-12-31T23:00:00 2014-01-01 ...
* zon (zon) 'zon1', 'zon2', 'zon3', ......
* temp (temp) 'XX' 'YY', 'ZZ' .....
* pre (pre) 'AAA', 'BBB', 'CCC' ....
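xarray.concat only supports concatenating along a single dimension. But we can work around this by concatenating, setting a MultiIndex and then unstacking.
I'm changing your setup code, because this only works if each combination of the new coordinates you are building (['temp','pre','zon']) is unique: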
import numpy as np
import pandas as pd
import xarray as xr

# Mock data: 500 dataframes, each 1000 hourly time steps x 8 runs
data = {}
for i in np.arange(500):
    data[i] = pd.DataFrame(np.random.randn(1000, 8),
                           index=pd.date_range(start='2013-01-01', periods=1000, freq='h'),
                           columns=list('ABCDEFGH'))

# 20 * 5 * 5 = 500 unique (temp, pre, zon) combinations, one per dataframe
cat_data = [(x, y, z)
            for x in range(20)
            for y in ['a', 'b', 'c', 'd', 'e']
            for z in ['A', 'B', 'C', 'D', 'E']]
df_catalogue = pd.DataFrame(cat_data, columns=['temp', 'pre', 'zon'])

# Build DataArrays, adding the properties as scalar coords
res_da = []
for i, v in df_catalogue.iterrows():
    i_df = data[i]  # data is a dictionary of properly indexed dataframes
    da = xr.DataArray(i_df.values,
                      coords={'time': i_df.index.values, 'runs': i_df.columns.values,
                              'temp': v['temp'], 'pre': v['pre'], 'zon': v['zon']},
                      dims=['time', 'runs'])
    res_da.append(da)
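Since the unstacking step depends on every (temp, pre, zon) combination being unique, it may be worth checking a catalogue before concatenating. The following is only a minimal sketch: the small `catalogue` frame and the names `dupes` and `has_duplicates` are made up for illustration, and it relies on pandas' `DataFrame.duplicated`.

```python
import pandas as pd

# Hypothetical small catalogue; the first and last rows share one combination.
catalogue = pd.DataFrame(
    [(0, 'a', 'A'), (0, 'a', 'B'), (1, 'b', 'A'), (0, 'a', 'A')],
    columns=['temp', 'pre', 'zon'],
)

# keep=False marks every row that belongs to a repeated combination,
# not only the second and later occurrences.
dupes = catalogue[catalogue.duplicated(['temp', 'pre', 'zon'], keep=False)]
has_duplicates = not dupes.empty

print(dupes)
```

For a catalogue built from a full product of values, as in the setup code here, `dupes` should come back empty.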
Then we can simply write:
xr.concat(res_da, dim='prop').set_index(prop=['temp', 'pre', 'zon']).unstack('prop')
This produces the 5-dimensional array you want:
<xarray.DataArray (time: 1000, runs: 8, temp: 20, pre: 5, zon: 5)>
array([[[[[-0.690557, ..., -1.526415],
...,
[ 0.737887, ..., 1.585335]],
...,
[[ 0.99557 , ..., 0.256517],
...,
[ 0.179632, ..., -1.236502]]],
...,
[[[ 0.234426, ..., -0.149901],
...,
[ 1.492255, ..., -0.380909]],
...,
[[-0.36111 , ..., -0.451571],
...,
[ 0.10457 , ..., 0.722738]]]]])
Coordinates:
* time (time) datetime64[ns] 2013-01-01 2013-01-01T01:00:00 ...
* runs (runs) object 'A' 'B' 'C' 'D' 'E' 'F' 'G' 'H'
* temp (temp) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
* pre (pre) object 'a' 'b' 'c' 'd' 'e'
* zon (zon) object 'A' 'B' 'C' 'D' 'E'
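To make the concat, set_index, unstack chain easier to verify by eye, here is a scaled-down sketch with only two property coordinates and constant-valued arrays. It is an illustration under a few assumptions, not the exact code above: the names `combos`, `stacked`, `result` and `ds` are made up, and the property coordinates are attached explicitly with `assign_coords` after concatenating instead of being carried as scalar coordinates. The last line shows one possible way to get the Dataset variant mentioned in the question, via `DataArray.to_dataset(dim=...)`.

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range('2013-01-01', periods=3, freq='h')
runs = list('AB')

# Every (temp, pre) combination appears exactly once.
combos = [(t, p) for t in (0, 1) for p in ('a', 'b', 'c')]

# One 2D array per combination, filled with its position k in `combos`.
arrays = [
    xr.DataArray(
        np.full((len(times), len(runs)), float(k)),
        coords={'time': times, 'runs': runs},
        dims=['time', 'runs'],
    )
    for k in range(len(combos))
]

# 1) Concatenate along a single new dimension.
stacked = xr.concat(arrays, dim='prop')

# 2) Attach the properties as coordinates along that dimension.
stacked = stacked.assign_coords(
    temp=('prop', [t for t, _ in combos]),
    pre=('prop', [p for _, p in combos]),
)

# 3) Turn them into a MultiIndex and unstack it into separate dimensions.
result = stacked.set_index(prop=['temp', 'pre']).unstack('prop')

# Optional: split the last dimension into data variables of a Dataset.
ds = result.to_dataset(dim='pre')
```

Because each array holds its own position in `combos`, selecting for example `result.sel(temp=1, pre='b')` makes it easy to confirm that the data ended up in the expected cell.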