Get right edges of pandas interval Categories efficiently
How can one efficiently get the right edges of pandas interval categories? In the example below, how can z be created efficiently?
import pandas as pd, numpy as np
bins = pd.interval_range(start=0, end=4, freq=1, closed='left')
x = pd.Series(np.linspace(0.0,3.8,num=20))
y = pd.cut(x, bins)
# How can one create z efficiently?
z = pd.Series(y.iat[n].right for n in range(len(y)))
Thanks for your help!
For a high-performance approach, you can use np.digitize:
np.digitize(x, range(0,4))
# array([1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4])
As a pd.Series:
pd.Series(np.digitize(x, range(0,4)), index=x.index)
0 1
1 1
2 1
3 1
4 1
5 1
6 2
7 2
8 2
9 2
10 2
11 3
...
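Note that np.digitize returns bin indices, not edges. In the example above the indices happen to equal the right edges only because the bin edges are 0, 1, 2, 3 with unit width. For arbitrary edges, a minimal sketch (the function name right_edges_from_digitize and the sample edges are hypothetical, and all values are assumed to fall inside [edges[0], edges[-1])) is to index back into the edge array:

```python
import numpy as np

def right_edges_from_digitize(values, edges):
    """Right edge of the left-closed bin [edges[i-1], edges[i]) containing each value.

    Assumes edges is sorted ascending and every value lies in [edges[0], edges[-1]).
    """
    edges = np.asarray(edges)
    # With the default right=False, digitize returns i such that edges[i-1] <= v < edges[i]
    idx = np.digitize(values, edges)
    return edges[idx]

# Hypothetical non-uniform edges, for illustration only
sample_edges = [0.0, 0.5, 2.0, 4.0]
sample_values = np.array([0.0, 0.4, 0.5, 1.9, 2.0, 3.9])
sample_right = right_edges_from_digitize(sample_values, sample_edges)
```

Values at or beyond the last edge would produce an out-of-range index, so they need to be filtered or clipped first.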
Timings on a larger input:
bins = pd.interval_range(start=0, end=400, freq=1, closed='left')
x = pd.Series(np.linspace(0.0,380,num=20_000))
%timeit pd.Series(np.digitize(x, range(0,400)))
# 567 µs ± 28.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
def op(x):
    y = pd.cut(x, bins)
    # Original approach: read .right from each Interval, one element at a time
    return pd.Series(y.iat[n].right for n in range(len(y)))
%timeit op(x)
# 682 ms ± 49.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Both give the same result:
np.allclose(op(x), pd.Series(np.digitize(x, range(0,400))))
# True
So for a larger input with 20,000 rows, we get a speedup of roughly 1200x (682 ms vs. 567 µs).
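If you want to keep pd.cut and the original IntervalIndex bins, one alternative (a sketch, not part of the timings above; the function name right_edges_from_cut is hypothetical) is to look up the right edges once per category and map them through the categorical codes, instead of touching each Interval object:

```python
import numpy as np
import pandas as pd

def right_edges_from_cut(values: pd.Series, bins: pd.IntervalIndex) -> pd.Series:
    """Right edge of the interval each value falls into; NaN for values outside all bins."""
    y = pd.cut(values, bins)
    codes = y.cat.codes.to_numpy()                # -1 marks values that matched no bin
    right = y.cat.categories.right.to_numpy()     # one right edge per category
    # right[codes] wraps around for -1, so mask those positions with NaN
    out = np.where(codes >= 0, right[codes], np.nan)
    return pd.Series(out, index=values.index)

bins = pd.interval_range(start=0, end=4, freq=1, closed='left')
x = pd.Series(np.linspace(0.0, 3.8, num=20))
z = right_edges_from_cut(x, bins)
```

This works for bins of any width, because the edges come from the categories themselves rather than from the bin index. The result is float because NaN is used for unmatched values.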