pandas - replace values with the percentage of observations that are lower
I have a df like this:
>>> import pandas as pd
>>> a = [1, 2, 3, 4, 5, 6, 7, 8]
>>> df = pd.DataFrame({'a': a})
>>> df
a
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
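I want to replace each value with the share of observations that are lower than it (a percentage, expressed as a fraction of 1). Something like this: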
>>> df
a how_many_percent_of_observations_are_less_than_value_from_a
0 1 0 (no observations that are lower, 0/8)
1 2 .125 (one observation is lower, 1/8)
2 3 .25 (two observations are lower, 2/8)
3 4
4 5
5 6
6 7
7 8 .875 (7 observations are lower, 7/8)
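If column a is not too long, you can do the comparison with numpy broadcasting, then count the number of True values in each column of the result and divide by the length of the array. Note that the intermediate boolean array has shape (n, n), so memory use grows quadratically with the number of rows: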
a = df.a.to_numpy()
print (a[:, None] < a)
[[False True True True True True True True]
[False False True True True True True True]
[False False False True True True True True]
[False False False False True True True True]
[False False False False False True True True]
[False False False False False False True True]
[False False False False False False False True]
[False False False False False False False False]]
df['new'] = (a[:, None] < a).sum(axis=0) / len(a)
print (df)
a new
0 1 0.000
1 2 0.125
2 3 0.250
3 4 0.375
4 5 0.500
5 6 0.625
6 7 0.750
7 8 0.875
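The example above has no repeated values. As a minimal sketch of how the same broadcasting comparison behaves with ties, the snippet below uses a small made-up column (the data is an assumption, not from the question). Because the comparison is a strict `<`, equal values do not count as "lower", so tied values receive the same result.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a repeated value (not from the question)
df = pd.DataFrame({'a': [1, 2, 2, 3]})
a = df['a'].to_numpy()

# less[i, j] is True when a[i] < a[j]; summing over axis 0 counts,
# for each a[j], how many observations are strictly smaller
less = a[:, None] < a
df['new'] = less.sum(axis=0) / len(a)

print(df['new'].tolist())  # [0.0, 0.25, 0.25, 0.75]
```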
Using rank:
import pandas as pd

a = [1, 2, 3, 4, 5, 6, 7, 8]
df = pd.DataFrame({'a': a})
# method='min' gives tied values the lowest rank of their group
ranks = df['a'].rank(method='min')
maxi = ranks.size
df['b'] = (ranks - 1) / maxi
Output:
>>> df
a b
0 1 0.000
1 2 0.125
2 3 0.250
3 4 0.375
4 5 0.500
5 6 0.625
6 7 0.750
7 8 0.875
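The choice of `method='min'` only matters when the column contains ties. The following sketch, on made-up data with one repeated value, contrasts it with the default `method='average'` to show why the default would not give "fraction of observations strictly lower".

```python
import pandas as pd

# Hypothetical data with a repeated value (not from the question)
s = pd.Series([1, 2, 2, 3])

# method='min': tied values share the lowest rank of their group,
# so rank - 1 equals the number of strictly smaller observations
by_min = (s.rank(method='min') - 1) / s.size

# Default method='average': tied values share the mean of their ranks,
# which overstates the count of strictly smaller observations
by_avg = (s.rank() - 1) / s.size

print(by_min.tolist())  # [0.0, 0.25, 0.25, 0.75]
print(by_avg.tolist())  # [0.0, 0.375, 0.375, 0.75]
```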
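You can use np.searchsorted together with ndarray.argsort here.
import numpy as np

a = df.a.to_numpy()
idx = a.argsort()
# position of each value in the sorted array = number of smaller values
df['new'] = np.searchsorted(a[idx], a) / len(df)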
df
a new
0 1 0.000
1 2 0.125
2 3 0.250
3 4 0.375
4 5 0.500
5 6 0.625
6 7 0.750
7 8 0.875
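To illustrate why `np.searchsorted` yields the count of strictly smaller values, here is a small sketch wrapped in a helper. The function name `fraction_below` and the sample data are assumptions made for illustration. It relies on the default `side='left'`, which returns the first position at which a value could be inserted into the sorted array, so ties are not counted.

```python
import numpy as np

def fraction_below(values):
    """Fraction of observations strictly smaller than each value.

    Hypothetical helper, not part of numpy or pandas.
    """
    values = np.asarray(values)
    sorted_values = np.sort(values)
    # side='left' (the default) gives the index of the first element
    # that is >= the value, i.e. the number of strictly smaller elements
    counts = np.searchsorted(sorted_values, values, side='left')
    return counts / len(values)

# Unsorted input with a repeated value
result = fraction_below([3, 1, 2, 2])
print(result.tolist())  # [0.75, 0.0, 0.25, 0.25]
```

Using `np.sort(values)` here is equivalent to indexing with `values.argsort()` as in the answer above; either produces the sorted array that `searchsorted` needs.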
Timing analysis:
Benchmark setup:
a = np.array([1, 2, 3, 4, 5, 6, 7, 8])
a = a.repeat(1_000_000)
np.random.shuffle(a)
a = a[:1_000_000]
df = pd.DataFrame({'a': a})
Results:
In [69]: %%timeit
...: a = df.a.to_numpy()
...: (a[:, None] < a).sum(axis=0) / len(a)
...:
...:
MemoryError: Unable to allocate 931. GiB for an array with shape (1000000, 1000000) and data type bool
In [70]: %%timeit
...: a = df.a.to_numpy()
...: idx = a.argsort()
...: np.searchsorted(a[idx], a) / len(df)
...:
...:
96 ms ± 1.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [71]: %%timeit
...: ranks = df['a'].rank()
...: maxi = ranks.max()
...: (ranks-1)/maxi
...:
...:
86 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
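The MemoryError above follows directly from the size of the broadcast result: one boolean byte per pair of rows. The sketch below reproduces that arithmetic and then shows one possible way to keep the broadcasting idea while bounding memory by processing the columns in chunks. The function name `fraction_below_chunked` and the chunk size are assumptions, and this variant was not part of the benchmark.

```python
import numpy as np

# Size of the full (n, n) boolean comparison for one million rows
n = 1_000_000
gib = n * n * np.dtype(bool).itemsize / 2**30
print(f"{gib:.0f} GiB")  # 931 GiB

def fraction_below_chunked(a, chunk=1_000):
    """Broadcasting comparison done chunk columns at a time.

    Hypothetical helper: peak memory is about len(a) * chunk bytes
    instead of len(a) ** 2 bytes.
    """
    a = np.asarray(a)
    out = np.empty(len(a))
    for start in range(0, len(a), chunk):
        block = a[start:start + chunk]
        # (n, chunk) boolean array: a[i] < block[j]
        out[start:start + chunk] = (a[:, None] < block).sum(axis=0)
    return out / len(a)

small = np.array([3, 1, 2, 2, 5])
chunked = fraction_below_chunked(small, chunk=2)
full = (small[:, None] < small).sum(axis=0) / len(small)
print(np.array_equal(chunked, full))  # True
```

Chunking only caps memory; the total work is still quadratic in the number of rows, so the sort-based approaches remain the better fit for large inputs.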
For small data, benchmark setup:
a = a[:10_000]
df = pd.DataFrame({'a': a})
Results:
In [73]: %%timeit
...: ranks = df['a'].rank()
...: maxi = ranks.max()
...: (ranks-1)/maxi
...:
...:
1.29 ms ± 205 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [74]: %%timeit
...: a = df.a.to_numpy()
...: idx = a.argsort()
...: np.searchsorted(a[idx], a) / len(df)
...:
...:
684 µs ± 19.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [75]: %%timeit
...: a = df.a.to_numpy()
...: (a[:, None] < a).sum(axis=0) / len(a)
...:
...:
122 ms ± 2.37 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Equality check:
ranks = df['a'].rank()
maxi = ranks.max()
ris = ((ranks-1)/maxi).to_numpy()
jez = (a[:, None] < a).sum(axis=0) / len(a)
idx = a.argsort()
ch3 = np.searchsorted(a[idx], a) / len(df)
(jez == ch3).all()
# True
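(jez == ris).all()
# False
# Note: this rank variant uses the default method='average' and divides by
# ranks.max(), unlike the rank answer above (method='min', divided by
# ranks.size), so it gives different results when the data contains ties.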
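As a cross-check, the sketch below repeats the comparison with the rank variant written as in the rank answer (`method='min'`, divided by the number of rows). The seeded random data is an assumption chosen to contain many ties, similar in spirit to the benchmark data. Under these assumptions all three approaches should agree, while the default-rank variant used in the timing cells should not.

```python
import numpy as np
import pandas as pd

# Hypothetical test data: 1000 integers between 1 and 8, so ties are common
rng = np.random.default_rng(0)
a = rng.integers(1, 9, size=1000)
df = pd.DataFrame({'a': a})

# Broadcasting, searchsorted, and rank(method='min') variants
jez = (a[:, None] < a).sum(axis=0) / len(a)
ch3 = np.searchsorted(np.sort(a), a) / len(a)
ris_min = ((df['a'].rank(method='min') - 1) / len(df)).to_numpy()

# Rank variant as written in the timing cells (default method='average')
ranks = df['a'].rank()
ris_avg = ((ranks - 1) / ranks.max()).to_numpy()

print(np.allclose(jez, ch3))      # True
print(np.allclose(jez, ris_min))  # True
print(np.allclose(jez, ris_avg))  # False
```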