How can I efficiently "stretch" present values in an array over absent ones
where 'absent' can mean either nan or np.masked, whichever is easiest to implement?
For example:
>>> from numpy import nan
>>> do_it([1, nan, nan, 2, nan, 3, nan, nan, 4, 3, nan, 2, nan])
array([1, 1, 1, 2, 2, 3, 3, 3, 4, 3, 3, 2, 2])
# each nan is replaced with the first non-nan value before it
>>> do_it([nan, nan, 2, nan])
array([nan, nan, 2, 2])
# don't care too much about the outcome here, but this seems sensible
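I know how to do this with a for loop:
import numpy as np

def do_it(a):
    res = []
    last_val = np.nan
    for item in a:
        # remember the most recent non-nan value and emit it for every slot
        if not np.isnan(item):
            last_val = item
        res.append(last_val)
    return np.asarray(res)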
Is there a faster, vectorized way?
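Taking a cumsum over an array of flags provides a good way to determine which numbers to write over the NaNs:
def do_it(x):
    x = np.asarray(x)
    is_valid = ~np.isnan(x)
    # treat the first slot as valid so that leading NaNs stay NaN
    is_valid[0] = True
    valid_elems = x[is_valid]
    # index of the most recent valid element for every position
    replacement_indices = is_valid.cumsum() - 1
    return valid_elems[replacement_indices]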
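To make the cumsum trick above easier to follow, here is a small trace of the intermediate arrays on a made-up input. The sample array `x` is an illustrative assumption, not data from the thread.

```python
import numpy as np

# Hypothetical sample input with a leading NaN and two gaps.
x = np.array([np.nan, 1.0, np.nan, np.nan, 2.0, np.nan])

is_valid = ~np.isnan(x)       # [F, T, F, F, T, F]
is_valid[0] = True            # [T, T, F, F, T, F]: the leading NaN counts as a "value"
valid_elems = x[is_valid]     # [nan, 1., 2.]

# The running count of valid slots, minus one, is the position in
# valid_elems of the most recent valid element at each index.
replacement_indices = is_valid.cumsum() - 1   # [0, 1, 1, 1, 2, 2]

filled = valid_elems[replacement_indices]     # [nan, 1., 1., 1., 2., 2.]
print(filled)
```

Because every NaN shares its cumsum count with the valid element before it, a single fancy-indexing step fills all the gaps at once.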
Assuming there are no zeros in your data (so that numpy.nan_to_num can be used):
b = numpy.maximum.accumulate(numpy.nan_to_num(a))
>>> array([ 1., 1., 1., 2., 2., 3., 3., 3., 4., 4.])
mask = numpy.isnan(a)
a[mask] = b[mask]
>>> array([ 1., 1., 1., 2., 2., 3., 3., 3., 4., 3.])
Edit: as Eric pointed out, a better solution is to replace the NaNs with -inf:
mask = numpy.isnan(a)
a[mask] = -numpy.inf
b = numpy.maximum.accumulate(a)
a[mask] = b[mask]
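As a quick sanity check, the running-maximum steps above can be wrapped in a function and tried on two small inputs. The wrapper name `fill_with_running_max` and both inputs are hypothetical. The check suggests the approach carries the previous value forward only while the non-NaN values are non-decreasing. When a NaN follows a drop, it receives the largest earlier value instead.

```python
import numpy as np

def fill_with_running_max(values):
    """Hypothetical wrapper around the -inf / maximum.accumulate steps."""
    a = np.array(values, dtype=float)   # copy, so the caller's data is untouched
    mask = np.isnan(a)
    a[mask] = -np.inf                   # NaNs can never win the running maximum
    b = np.maximum.accumulate(a)
    a[mask] = b[mask]                   # write the running maximum into the gaps
    return a

nan = np.nan
print(fill_with_running_max([1, nan, nan, 2, nan, 3]))  # [1. 1. 1. 2. 2. 3.]
print(fill_with_running_max([4, 3, nan]))               # [4. 3. 4.], not [4. 3. 3.]
```

For data that can decrease, an index-based accumulate or the cumsum approach avoids this, since both track positions, not magnitudes.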
Using @Benjamin's deleted solution, everything works nicely if you operate on indices instead of values:
def do_it(data, valid=None, axis=0):
    # normalize the inputs to match the question examples
    data = np.asarray(data)
    if valid is None:
        valid = ~np.isnan(data)
    # flat array of the data values
    data_flat = data.ravel()
    # array of indices such that data_flat[indices] == data
    indices = np.arange(data.size).reshape(data.shape)
    # thanks to benjamin here
    stretched_indices = np.maximum.accumulate(valid*indices, axis=axis)
    return data_flat[stretched_indices]
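Since the question also allows np.masked as the "absent" marker, the `valid` argument can be fed from a masked array's mask. The following is a minimal sketch under that assumption. The sample masked array `m` is made up for illustration, and `do_it` is repeated so the snippet runs on its own.

```python
import numpy as np

def do_it(data, valid=None, axis=0):
    data = np.asarray(data)
    if valid is None:
        valid = ~np.isnan(data)
    data_flat = data.ravel()
    indices = np.arange(data.size).reshape(data.shape)
    # invalid slots contribute index 0, so the running maximum is the
    # index of the most recent valid slot
    stretched_indices = np.maximum.accumulate(valid * indices, axis=axis)
    return data_flat[stretched_indices]

# Hypothetical masked input: the zeros are placeholders hidden by the mask.
m = np.ma.masked_array([1, 0, 0, 2, 0, 3], mask=[0, 1, 1, 0, 1, 0])

# Pass the raw data plus the inverted mask as the "valid" flags.
filled = do_it(m.data, valid=~np.ma.getmaskarray(m))
print(filled)   # [1 1 1 2 2 3]
```

Because only indices are accumulated, this also works for integer data, where NaN is not available as a marker.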
Comparing the run times of the solutions:
>>> import numpy as np
>>> data = np.random.rand(10000)
>>> %timeit do_it_question(data)
10000 loops, best of 3: 17.3 ms per loop
>>> %timeit do_it_mine(data)
10000 loops, best of 3: 179 µs per loop
>>> %timeit do_it_user(data)
10000 loops, best of 3: 182 µs per loop
# with lots of nans
>>> data[data > 0.25] = np.nan
>>> %timeit do_it_question(data)
10000 loops, best of 3: 18.9 ms per loop
>>> %timeit do_it_mine(data)
10000 loops, best of 3: 177 µs per loop
>>> %timeit do_it_user(data)
10000 loops, best of 3: 231 µs per loop
So both this and @user2357112's solution blow the solution in the question out of the water, but this one has a slight edge over @user2357112's when there are lots of nans.
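For anyone wanting to rerun the comparison outside IPython, here is a rough sketch using the standard `timeit` module. It assumes that `do_it_question` is the loop from the question, `do_it_user` is the cumsum version, and `do_it_mine` is the index-based version (simplified here to 1-D). The thread does not spell out that mapping, and absolute timings will vary by machine.

```python
import timeit
import numpy as np

def do_it_question(a):
    res, last_val = [], np.nan
    for item in a:
        if not np.isnan(item):
            last_val = item
        res.append(last_val)
    return np.asarray(res)

def do_it_user(x):
    x = np.asarray(x)
    is_valid = ~np.isnan(x)
    is_valid[0] = True
    return x[is_valid][is_valid.cumsum() - 1]

def do_it_mine(data):
    # 1-D simplification of the index-based version
    data = np.asarray(data)
    valid = ~np.isnan(data)
    return data[np.maximum.accumulate(valid * np.arange(data.size))]

rng = np.random.default_rng(0)
data = rng.random(10_000)
data[data > 0.25] = np.nan          # lots of NaNs, as in the second benchmark

expected = do_it_question(data)
for fn in (do_it_user, do_it_mine):
    # assert_array_equal treats NaNs in matching positions as equal
    np.testing.assert_array_equal(fn(data), expected)
    per_call = min(timeit.repeat(lambda: fn(data), number=100, repeat=3)) / 100
    print(f"{fn.__name__}: {per_call * 1e6:.0f} µs per call")
```

Checking the outputs against the loop before timing guards against benchmarking a fast but incorrect variant.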