Pandas: Conditional column creation
I am trying to create a column C based on the values in columns A and B, with the following condition:
if A < 5000: C = A * B
else: C = A
The following gives a syntax error:
df['C'] = df.apply(lambda x (x['A'] * x['B)'] if x['A'] < 5000 else x = x['A']),axis=1)
How far off am I?
Use the vectorized numpy.where:
df['C'] = np.where(df['A'] < 5000, df['A'] * df['B'], df['A'])
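To make the behaviour of the one-liner above concrete, here is a minimal sketch on a tiny DataFrame. The sample values are made up purely for illustration; only the column names A, B and C and the 5000 threshold come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one row below, one at, and one above the threshold.
df = pd.DataFrame({"A": [1000, 5000, 7000], "B": [2, 3, 4]})

# np.where(condition, value_if_true, value_if_false) is evaluated element-wise
# over whole columns, so no Python-level loop over rows is needed.
df["C"] = np.where(df["A"] < 5000, df["A"] * df["B"], df["A"])

print(df["C"].tolist())  # [2000, 5000, 7000]
```

Note that the comparison is strict, so a row with A exactly equal to 5000 falls into the else branch and keeps A unchanged.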
Performance:
np.random.seed(2019)
N = 1000
data = np.asarray([np.random.rand(N).tolist(), list(range(N))]).T
df = pd.DataFrame(data, columns=['A', 'B'])
In [56]: %timeit df['C'] = np.where(df['A'] < 5000, df['A'] * df['B'], df['A'])
536 µs ± 47.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [57]: %timeit df['C'] = df.apply(lambda x: x.A * x.B if x.A > 0.5 else x.A, 1)
30.9 ms ± 597 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
N = 100000
data = np.asarray([np.random.rand(N).tolist(), list(range(N))]).T
df = pd.DataFrame(data, columns=['A', 'B'])
In [59]: %timeit df['C'] = np.where(df['A'] < 5000, df['A'] * df['B'], df['A'])
1.29 ms ± 23.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [60]: %timeit df['C'] = df.apply(lambda x: x.A * x.B if x.A > 0.5 else x.A, 1)
3.32 s ± 374 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I think you want something like this:
df['C'] = df.apply(lambda x: x.A * x.B if x.A > 0.5 else x.A, 1)
Full example:
import pandas as pd
import numpy as np
N = 10
data = np.asarray([np.random.rand(N).tolist(), list(range(N))]).T
df = pd.DataFrame(data, columns=['A', 'B'])
df['C'] = df.apply(lambda x: x.A * x.B if x.A > 0.5 else x.A, 1)
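The example above uses a 0.5 threshold on random data. As a rough sketch, the same apply pattern with the condition from the question (A < 5000) might look like the following. Compared with the attempt in the question, it adds the colon after `lambda x`, moves the misplaced quote in `x['B)']`, and drops the `x =` assignment, which is not allowed inside a conditional expression. The sample values are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical sample data, not taken from the question.
df = pd.DataFrame({"A": [1000, 5000, 7000], "B": [2, 3, 4]})

# A lambda needs a colon after its argument list, and the conditional
# expression "value_if_true if condition else value_if_false" yields a value;
# it cannot contain an assignment such as "x = ...".
df["C"] = df.apply(
    lambda x: x["A"] * x["B"] if x["A"] < 5000 else x["A"],
    axis=1,  # call the lambda once per row
)

print(df["C"].tolist())  # [2000, 5000, 7000]
```

This calls a Python function once per row, which is why it is much slower than the vectorized version on large frames.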
I'm sure the solutions provided earlier are better, but I solved it a third way. The dataset is small, so this is fine for now.
multiply = df['A'] * df['B']
df['C'] = multiply.where(df['A'] < 5000, other=df['A'])
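For completeness, a self-contained sketch of this third approach is shown below. `Series.where` keeps the original value wherever the condition is True and substitutes `other` wherever it is False. The sample values and the variable name `product` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data, not taken from the question.
df = pd.DataFrame({"A": [1000, 5000, 7000], "B": [2, 3, 4]})

# Compute A * B for every row first ...
product = df["A"] * df["B"]

# ... then keep the product only where A < 5000, falling back to A elsewhere.
df["C"] = product.where(df["A"] < 5000, other=df["A"])

print(df["C"].tolist())  # [2000, 5000, 7000]
```

This is also vectorized, so it should scale similarly to the numpy.where version, although no timing for it is given here.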