How to get the closest previous value for each row?
I have a DataFrame with many stocks (only GM and F in this example), along with their sales growth and stock growth.
import pandas as pd
import numpy as np
d = {'year': [2000, 2000, 2001, 2001, 2002, 2002, 2003, 2003, 2004, 2004], 'Stock': ["GM", "F", "GM", "F", "GM", "F", "GM", "F", "GM", "F"], 'US Sales Growth': [.3, .3, .1, .1, .6, .6, .12, .12, .7, .7], 'Stock Growth': [.1, .2, .3, .4, .14, .16, .2, .1,.15,.16]}
df = pd.DataFrame(data=d)
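My goal is to add a column named "closets_sales_growth_corresponding_stock_growth" that compares each row's current sales growth with the past sales growth values of the same stock, and stores the stock growth that belongs to the closest past sales growth.
It should look like this:
Not pretty, but it works :)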
import pandas as pd
import numpy as np
d = {'year': [2000, 2000, 2001, 2001, 2002, 2002, 2003, 2003, 2004, 2004], 'Stock': ["GM", "F", "GM", "F", "GM", "F", "GM", "F", "GM", "F"], 'US Sales Growth': [.3, .3, .1, .1, .6, .6, .12, .12, .7, .7], 'Stock Growth': [.1, .2, .3, .4, .14, .16, .2, .1,.15,.16]}
df = pd.DataFrame(data=d)
close_s_g_s = []
for i in df.index:
    stock = df['Stock'][i]
    cur_s_g = df['US Sales Growth'][i]
    # Defaults for rows that have no earlier row of the same stock
    stock_growth = np.nan
    min_s_g_dif = np.inf
    # Scan every earlier row and keep the one with the closest sales growth
    for i_less in range(0, i):
        if df['Stock'][i_less] == stock:
            s_g_dif = abs(df['US Sales Growth'][i_less] - cur_s_g)
            if s_g_dif < min_s_g_dif:
                min_s_g_dif = s_g_dif
                stock_growth = df["Stock Growth"][i_less]
    close_s_g_s.append(stock_growth)
new_col = "closets_sales_growth_corresponding_stock_growth"
df[new_col] = close_s_g_s
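The nested loop above rescans every earlier row for each row. As a rough sketch of the same idea (not the answerer's code), the version below keeps a per-stock history so each row is only compared with earlier rows of its own stock. The function name `closest_past_stock_growth` and the output column name `closest_past` are made up for this illustration, and the sketch assumes the rows are already in chronological order.

```python
import numpy as np
import pandas as pd

d = {'year': [2000, 2000, 2001, 2001, 2002, 2002, 2003, 2003, 2004, 2004],
     'Stock': ["GM", "F", "GM", "F", "GM", "F", "GM", "F", "GM", "F"],
     'US Sales Growth': [.3, .3, .1, .1, .6, .6, .12, .12, .7, .7],
     'Stock Growth': [.1, .2, .3, .4, .14, .16, .2, .1, .15, .16]}
df = pd.DataFrame(data=d)


def closest_past_stock_growth(frame):
    """Hypothetical helper: assumes rows are ordered from oldest to newest."""
    history = {}  # stock -> list of (sales growth, stock growth) seen so far
    out = []
    for stock, sales, growth in zip(frame['Stock'],
                                    frame['US Sales Growth'],
                                    frame['Stock Growth']):
        past = history.setdefault(stock, [])
        if past:
            # min() keeps the earliest entry when two distances tie
            closest = min(past, key=lambda p: abs(p[0] - sales))
            out.append(closest[1])
        else:
            out.append(np.nan)  # no earlier row for this stock
        past.append((sales, growth))
    return out


df['closest_past'] = closest_past_stock_growth(df)
```

This is still quadratic within each stock, but it avoids indexing into the DataFrame inside the inner loop.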
I would create a function that returns the stock growth for a single row. It can then be applied to every row:
import pandas as pd
import numpy as np
# Create dataframe
d = {'year': [2000, 2000, 2001, 2001, 2002, 2002, 2003, 2003, 2004, 2004], 'Stock': ["GM", "F", "GM", "F", "GM", "F", "GM", "F", "GM", "F"], 'US Sales Growth': [.3, .3, .1, .1, .6, .6, .12, .12, .7, .7], 'Stock Growth': [.1, .2, .3, .4, .14, .16, .2, .1,.15,.16]}
df = pd.DataFrame(data=d)
# Define function to find nearest value
def find_nearest_value(df, year, stock, sales_growth):
    # Filter df to only include rows of same stock and earlier year
    filtered_df = df[(df['year'] < year) & (df['Stock'] == stock)]
    # Return nan if we do not find any previous value
    if len(filtered_df) == 0:
        return np.nan
    # Order the filtered rows by how close they are to current sales growth
    ordered = (filtered_df['US Sales Growth'] - sales_growth).abs().argsort()
    # argsort returns positions, so take the first one and look it up with iloc
    stock_growth = filtered_df['Stock Growth'].iloc[ordered.iloc[0]]
    return stock_growth
# Apply function on each row
df['closets_sales_growth_corresponding_stock_growth'] = df.apply(lambda x: find_nearest_value(df, x['year'], x['Stock'], x['US Sales Growth']), axis=1)
df
Output:
   year  Stock  US Sales Growth  Stock Growth  closets_sales_growth_corresponding_stock_growth
0  2000     GM             0.30          0.10                                              NaN
1  2000      F             0.30          0.20                                              NaN
2  2001     GM             0.10          0.30                                             0.10
3  2001      F             0.10          0.40                                             0.20
4  2002     GM             0.60          0.14                                             0.10
5  2002      F             0.60          0.16                                             0.20
6  2003     GM             0.12          0.20                                             0.30
7  2003      F             0.12          0.10                                             0.40
8  2004     GM             0.70          0.15                                             0.14
9  2004      F             0.70          0.16                                             0.16
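Here is one way to do it using groupby. Basically, just groupby "Stock" and apply a function to each group that finds, for each row, the stock growth belonging to the closest past US sales growth.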
def get_new_col(g):
    # the first row of each group has no past value
    out = [np.nan]
    for idx in g.index[1:]:
        # get the index of the previous sales growth closest in absolute value to the current one
        # (the label slice :idx-1 relies on a sorted integer index)
        closest_val_idx = (g.loc[idx, 'US Sales Growth'] - g.loc[:idx-1, 'US Sales Growth']).abs().idxmin()
        # index the stock growth with the index found above
        out.append(g.loc[closest_val_idx, 'Stock Growth'])
    return pd.Series(out, index=g.index)

df['growth_corresponding_stock_growth'] = df.groupby('Stock').apply(get_new_col).droplevel(0)
Output:
   year  Stock  US Sales Growth  Stock Growth  growth_corresponding_stock_growth
0  2000     GM             0.30          0.10                                NaN
1  2000      F             0.30          0.20                                NaN
2  2001     GM             0.10          0.30                               0.10
3  2001      F             0.10          0.40                               0.20
4  2002     GM             0.60          0.14                               0.10
5  2002      F             0.60          0.16                               0.20
6  2003     GM             0.12          0.20                               0.30
7  2003      F             0.12          0.10                               0.40
8  2004     GM             0.70          0.15                               0.14
9  2004      F             0.70          0.16                               0.16
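If the per-row Python loop inside each group becomes a bottleneck, the search within a group can probably be expressed with NumPy broadcasting instead. The following is only a sketch under a few assumptions: rows within each stock are already in chronological order, each group is small enough that an n-by-n distance matrix fits in memory, and the names `closest_past_vectorized` and `closest_past` are invented for this example. It iterates over the groups explicitly rather than using groupby.apply, which sidesteps the droplevel step.

```python
import numpy as np
import pandas as pd

d = {'year': [2000, 2000, 2001, 2001, 2002, 2002, 2003, 2003, 2004, 2004],
     'Stock': ["GM", "F", "GM", "F", "GM", "F", "GM", "F", "GM", "F"],
     'US Sales Growth': [.3, .3, .1, .1, .6, .6, .12, .12, .7, .7],
     'Stock Growth': [.1, .2, .3, .4, .14, .16, .2, .1, .15, .16]}
df = pd.DataFrame(data=d)


def closest_past_vectorized(g):
    """Hypothetical helper: assumes g is one stock, ordered oldest to newest."""
    sales = g['US Sales Growth'].to_numpy()
    growth = g['Stock Growth'].to_numpy()
    # dist[i, j] = |sales[i] - sales[j]| for every pair of rows in the group
    dist = np.abs(sales[:, None] - sales[None, :])
    # Rule out j >= i so that only strictly earlier rows can be chosen
    dist[np.triu_indices(len(g))] = np.inf
    # argmin picks the earliest row when two distances tie
    out = growth[dist.argmin(axis=1)].astype(float)
    out[0] = np.nan  # the first row has no earlier row to match
    return pd.Series(out, index=g.index)


# Build one Series per stock; assignment aligns on the original row index
parts = [closest_past_vectorized(g) for _, g in df.groupby('Stock')]
df['closest_past'] = pd.concat(parts)
```

The trade-off is memory: the distance matrix grows quadratically with the number of rows per stock, so for long histories the loop-based versions may be the safer choice.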