Summation of array based on mask with different values
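To make the question easier to understand, I drew some figures step by step below.

A 3D array named `data`, which holds the values I want to sum based on `feature` and `mask`.

A 3D array named `mask` (same shape as `data`), which is used to subset `data`.

The colors show the relationship between `data`, `feature`, and `mask`. I explain it below.

I have a 1D DataArray named `feature` whose values are a subset of the values in `mask`. All values of `feature` are unique, but its `time` dimension contains some repeated values.

Steps:

1. Loop over `feature` along its `time` coordinate.
2. For each `feature`, create a temporary mask from `mask`: 1 where both the time and the value equal the selected feature, 0 elsewhere.
3. Mask `data` with the temporary mask, sum the masked data, and save the result in a new array `data_mask` that has the same shape as `feature`.

For the example below, the expected result is 3, 6, and 9 for the features 1, 2, and 4.

I have written the code using a for loop: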
import xarray as xr
import pandas as pd
import numpy as np

# create feature example (the time coordinate contains a repeated value)
t_feature = pd.to_datetime(['2019-07-25 00:00', '2019-07-25 00:00', '2019-07-25 01:00'])
feature = xr.DataArray(np.array([1, 2, 4]),
                       coords=[t_feature],
                       dims=['time'])

# create mask example
t = pd.to_datetime(['2019-07-25 00:00', '2019-07-25 01:00'])
mask_t1 = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
mask_t2 = mask_t1 * 2
mask = np.stack((mask_t1, mask_t2))
mask = xr.DataArray(mask, coords=[t, range(3), range(3)], dims=['time', 'x', 'y'])

# create data example
data = np.ones(mask.shape)
data[0, 1, :] *= 2
data[1, ...] *= 3
data = xr.DataArray(data, coords=[t, range(3), range(3)], dims=['time', 'x', 'y'])

# one output value per feature; float dtype so the sums are stored unchanged
data_mask = feature.astype(float)
for index, f in enumerate(feature):
    timestamp = f.time
    # boolean mask: True where the mask at this time equals the feature value
    # (a plain comparison also works when the feature value is 0)
    pair_mask = mask.sel(time=timestamp) == f
    data_mask[dict(time=index)] = data.sel(time=timestamp).where(pair_mask).sum()
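However, it is too slow for large datasets. I would appreciate any better suggestions!

Update

Following Oxbowerce's suggestion, I came up with three methods and tested their speed.

Conclusions

- The `xarray` method is the fastest, but it causes a memory error.
- The `pandas` method also causes a memory error and is slower than the `xarray` method.
- The `for loop` is the slowest, but it has no memory problem because the data is loaded.

Details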
import xarray as xr
import pandas as pd
import numpy as np

len_t = int(1e3)

# create feature example
t = pd.date_range(start='1/1/2018', periods=len_t, freq='S')
feature = xr.DataArray(np.random.randint(len_t // 2, size=len_t),
                       # range(len_t),
                       coords=[t],
                       dims=['time'])

# create mask example
mask = xr.DataArray(np.random.randint(len_t // 2, size=(len_t, 50, 50)),
                    coords=[t, range(50), range(50)],
                    dims=['time', 'x', 'y'])

# create data example
data = mask.copy()
data_mask = feature.astype(float)

# --- method 1: for loop --- #
for index, f in enumerate(feature):
    timestamp = f.time
    # boolean mask: True where the mask at this time equals the feature value
    pair_mask = mask.sel(time=timestamp) == f
    data_mask[dict(time=index)] = data.sel(time=timestamp).where(pair_mask).sum()

# --- method 2: pandas --- #
# convert xarrays to pandas dataframes
data_df = data.to_dataframe(name="data_value").reset_index()
feature_df = feature.to_dataframe(name="feature_value")
mask_df = mask.to_dataframe(name="mask_value").reset_index()
result = (
    data_df
    # add mask values to data
    .merge(mask_df, how="left", on=["time", "x", "y"])
    # add feature values to data, using inner join to only leave rows present in feature array
    .merge(feature_df, how="inner", left_on=["time", "mask_value"], right_on=["time", "feature_value"])
    # group rows and add up the values
    .groupby("feature_value")
    .sum()["data_value"]
)

# --- method 3: xarray --- #
feature_time = feature.time
merge_ds = xr.merge([data.rename('data'), mask.rename('mask')], join="left").sel(time=feature_time)
result = merge_ds['data'].where(merge_ds['mask'] == feature, drop=True).sum(dim=['x', 'y'])
Here are the execution times:

for loop: 5.24 s ± 30.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
pandas method: 1.48 s ± 27.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
xarray method: 74.3 ms ± 2.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
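Since the loop is slow and both vectorized methods ran out of memory, one possible middle ground is to vectorize over a block of features at a time with plain NumPy. The sketch below is not one of the three methods tested above. The helper name `masked_sum_chunked` and its `chunk_size` parameter are hypothetical, and it assumes that the `time` coordinate of `mask` is unique and contains every time found in `feature`.

```python
import numpy as np
import pandas as pd
import xarray as xr


def masked_sum_chunked(data, mask, feature, chunk_size=100):
    """Sum data where mask equals each feature value at that feature's time.

    Hypothetical helper: handles `chunk_size` features per step so the
    temporary (chunk, x, y) arrays stay small.
    """
    # Position of every feature time along the time axis of mask/data.
    # Assumption: mask times are unique and cover all feature times.
    time_pos = mask.indexes["time"].get_indexer(feature.indexes["time"])
    if (time_pos < 0).any():
        raise ValueError("feature contains times that are missing from mask")

    feature_values = feature.values
    mask_values = mask.values
    data_values = data.values
    out = np.empty(feature_values.shape, dtype=float)

    for start in range(0, len(feature_values), chunk_size):
        stop = start + chunk_size
        pos = time_pos[start:stop]
        # Boolean array (chunk, x, y): True where mask matches the feature value.
        hit = mask_values[pos] == feature_values[start:stop, None, None]
        # Zero out non-matching cells, then sum over x and y.
        out[start:stop] = np.where(hit, data_values[pos], 0).sum(axis=(1, 2))

    return xr.DataArray(out, coords=[feature.indexes["time"]], dims=["time"])


# Small example from the question
t_feature = pd.to_datetime(['2019-07-25 00:00', '2019-07-25 00:00', '2019-07-25 01:00'])
feature = xr.DataArray(np.array([1, 2, 4]), coords=[t_feature], dims=['time'])

t = pd.to_datetime(['2019-07-25 00:00', '2019-07-25 01:00'])
mask_t1 = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
mask = xr.DataArray(np.stack((mask_t1, mask_t1 * 2)),
                    coords=[t, range(3), range(3)], dims=['time', 'x', 'y'])

data = np.ones(mask.shape)
data[0, 1, :] *= 2
data[1, ...] *= 3
data = xr.DataArray(data, coords=[t, range(3), range(3)], dims=['time', 'x', 'y'])

result = masked_sum_chunked(data, mask, feature, chunk_size=2)
print(result.values)  # [3. 6. 9.]
```

The peak temporary memory is roughly `chunk_size * nx * ny` elements, so `chunk_size` trades speed against memory. Note that `.values` loads the full arrays, so this sketch does not help if `data` and `mask` themselves do not fit in memory.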
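Based on the example you gave, I would first convert all xarray objects to pandas DataFrames and then combine the data using joins. I keep only the rows where the mask value matches a value in the feature array at the same time, and then sum those values. It looks like this: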
import xarray as xr
import pandas as pd
import numpy as np

# create feature example
t_feature = pd.to_datetime(['2019-07-25 00:00', '2019-07-25 00:00', '2019-07-25 01:00'])
feature = xr.DataArray(np.array([1, 2, 4]),
                       coords=[t_feature],
                       dims=['time'])

# create mask example
t = pd.to_datetime(['2019-07-25 00:00', '2019-07-25 01:00'])
mask_t1 = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
mask_t2 = mask_t1 * 2
mask = np.stack((mask_t1, mask_t2))
mask = xr.DataArray(mask, coords=[t, range(3), range(3)], dims=['time', 'x', 'y'])
print(mask)

# create data example
data = np.ones(mask.shape)
data[0, 1, :] *= 2
data[1, ...] *= 3
data = xr.DataArray(data, coords=[t, range(3), range(3)], dims=['time', 'x', 'y'])

# convert xarrays to pandas dataframes
data_df = data.to_dataframe(name="data_value").reset_index()
feature_df = feature.to_dataframe(name="feature_value")
mask_df = mask.to_dataframe(name="mask_value").reset_index()
result = (
    data_df
    # add mask values to data
    .merge(mask_df, how="left", on=["time", "x", "y"])
    # add feature values to data, using inner join to only leave rows present in feature array
    .merge(feature_df, how="inner", left_on=["time", "mask_value"], right_on=["time", "feature_value"])
    # group rows and add up the values
    .groupby("feature_value")
    .sum()["data_value"]
)
The result is as follows:

feature_value | data_value |
---|---|
1 | 3 |
2 | 6 |
4 | 9 |
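Grouping on `feature_value` alone relies on the feature values being unique, as stated in the question. If the same value can occur at more than one time step (as it can in randomly generated benchmark data), the sums of different time steps would be added together. A minimal sketch of a variant that groups on both `time` and `feature_value` follows. The example data is made up for illustration, and `feature_df` is built directly with pandas here, which is a choice of this sketch.

```python
import numpy as np
import pandas as pd
import xarray as xr

t = pd.to_datetime(['2019-07-25 00:00', '2019-07-25 01:00'])

# Illustrative feature: the value 2 occurs at both time steps.
feature = xr.DataArray(np.array([1, 2, 2]), coords=[t[[0, 0, 1]]], dims=['time'])

mask_t1 = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
mask = xr.DataArray(np.stack((mask_t1, mask_t1 * 2)),
                    coords=[t, range(3), range(3)], dims=['time', 'x', 'y'])

data = np.ones(mask.shape)
data[1, ...] *= 3
data = xr.DataArray(data, coords=[t, range(3), range(3)], dims=['time', 'x', 'y'])

data_df = data.to_dataframe(name="data_value").reset_index()
mask_df = mask.to_dataframe(name="mask_value").reset_index()
feature_df = pd.DataFrame({"time": feature["time"].values,
                           "feature_value": feature.values})

matched = (
    data_df
    .merge(mask_df, on=["time", "x", "y"])
    # keep only cells whose mask value equals a feature value at the same time
    .merge(feature_df, left_on=["time", "mask_value"], right_on=["time", "feature_value"])
)

# One sum per (time, feature_value) pair
result = matched.groupby(["time", "feature_value"])["data_value"].sum()

# For comparison: grouping on the value alone merges both time steps of value 2
by_value_only = matched.groupby("feature_value")["data_value"].sum()

print(result)
print(by_value_only)
```

In this example the pair-wise grouping yields three sums (3, 3, and 9), while grouping on the value alone collapses the two entries for value 2 into a single sum of 12.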