Create new column based on existing columns
I have a Pandas DataFrame that looks like this.
Deviated_price  standard_price
744,600         789,276
693,600         789,276
693,600         735,216
                735,216
744,600         735,216
                735,216
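I want to create a new column named Net_standard_price. Its value depends on the Deviated_price and standard_price columns.
If Deviated_price is not empty, Net_standard_price should be empty.
If Deviated_price is empty, Net_standard_price should contain the standard_price value.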
Net_standard_price should look like this.
Deviated_price  standard_price  Net_standard_price
                789,276         789,276
693,600         789,276
693,600         735,216
                735,216         735,216
744,600         735,216
                735,216         735,216
I tried the following code using np.where, but Net_standard_price comes out empty for every record.
df['Net_standard_price'] = np.where(df['Deviated_price'] != '',
'', df['standard_price'])
What is the most efficient way to do this?
Moving the work into the NumPy domain brings some performance gain, roughly 30% in the timings below:
import pandas as pd
import numpy as np
from timeit import Timer

def make_df():
    # 10,000 rows of random prices stored as strings.
    random_state = np.random.RandomState()
    df = pd.DataFrame(random_state.random((10000, 2)), columns=['Deviated_price', 'standard_price'], dtype=str)
    # Blank out roughly half of Deviated_price; use .loc rather than chained indexing.
    mask = random_state.randint(0, 2, len(df)).astype(bool)
    df.loc[mask, 'Deviated_price'] = None
    return df

def test1(df):
    # The attempt from the question: missing values never equal '',
    # so the condition is True on every row.
    df['Net_standard_price'] = np.where(df['Deviated_price'] != '',
                                        '', df['standard_price'])

def test2(df):
    # Keep standard_price only where Deviated_price is missing.
    df['Net_standard_price'] = np.where(df['Deviated_price'].isna(), df['standard_price'], None)

def test3(df):
    # Same rule as test2, applied to the underlying NumPy arrays.
    standard = df['standard_price'].values
    deviated = df['Deviated_price'].values
    net_standard_price = standard.copy()
    # Clear the rows where a deviated price is present.
    net_standard_price[~pd.isna(deviated)] = ''
    df['Net_standard_price'] = net_standard_price

timing = Timer(setup='df = make_df()', stmt='test1(df)', globals=globals()).timeit(500)
print('test1: ', timing)
timing = Timer(setup='df = make_df()', stmt='test2(df)', globals=globals()).timeit(500)
print('test2: ', timing)
timing = Timer(setup='df = make_df()', stmt='test3(df)', globals=globals()).timeit(500)
print('test3: ', timing)
test1: 0.42146812000000006
test2: 0.417552648
test3: 0.2913768969999999
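For reference, here is a minimal sketch of the same rule applied to the sample data from the question. It assumes the blank cells in Deviated_price are stored as NaN rather than as empty strings, which would explain why the `!= ''` comparison selects every row; the variable names `always_true` and `missing` are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data modeled on the question; blank cells are assumed to be NaN.
df = pd.DataFrame({
    'Deviated_price': ['744,600', '693,600', '693,600', np.nan, '744,600', np.nan],
    'standard_price': ['789,276', '789,276', '735,216', '735,216', '735,216', '735,216'],
})

# NaN never compares equal to '', so this condition holds on every row
# and np.where would pick '' everywhere.
always_true = (df['Deviated_price'] != '').all()

# Test for missing values instead: keep standard_price where
# Deviated_price is missing, otherwise leave the cell empty.
missing = df['Deviated_price'].isna()
df['Net_standard_price'] = np.where(missing, df['standard_price'], '')

print(always_true)
print(df)
```

If the blanks really are empty strings in the actual data, comparing with `== ''` (or replacing `''` with NaN first) would be the equivalent check.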