How to predict on a grouped DataFrame, using a dictionary of models, and return to original test DataFrame?
I created a dictionary of regression models, keyed by the values of group in the training dataset d:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
d = pd.DataFrame({
"group":["cat","fish","horse","cat","fish","horse","cat","horse"],
"x":[1,4,7,2,5,8,3,9],
"y":[10,20,14,12,12,3,12,2],
"z":[3,5,3,5,9,1,2,3]
})
features, models =['x','z'],{}
for animal in ['horse','cat','fish']:
    # Fit one pipeline per group, using only that group's rows.
    models[animal] = Pipeline([("estimator", LinearRegression(fit_intercept=True))])
    x, y = d.loc[d.group == animal, features], d.loc[d.group == animal, "y"]
    models[animal].fit(x, y)
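I also have a test dataset, test_d, which has rows for some, but not all, of the groups that have a model. It also contains a group (dog) for which no model exists.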
test_d = pd.DataFrame({
"group":["dog","fish","horse","dog","fish","horse","dog","horse"],
"x":[1,2,3,4,5,6,7,8],
"z":[3,5,3,5,9,1,2,3]
})
I want to use apply on the grouped test_d, using .name to look up the correct model (if one exists), and return the predictions, using a function f():
def f(g):
    try:
        # g.name holds the group key, e.g. "fish".
        predictions = models[g.name].predict(g[features])
    except KeyError:
        # No model was trained for this group.
        predictions = [None] * len(g)
    return predictions
The function "works" in the sense that it returns the correct values:
grouping_column = "group"
test_d.groupby(grouping_column, group_keys=False).apply(f)
Output:
group
dog                           [None, None, None]
fish     [20.94117647058824, 12.000000000000004]
horse                          [38.0, 15.0, 8.0]
dtype: object
Question:
How should f() be written so that I can assign the result directly to test_d? I would like to do something like this:
test_d["predictions"] = test_d.groupby(grouping_column, group_keys=False).apply(f)
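But this obviously does not work. The result is indexed by group name rather than by the row labels of test_d, so nothing aligns and every prediction is NaN:
   group  x  z  predictions
0    dog  1  3          NaN
1   fish  2  5          NaN
2  horse  3  3          NaN
3    dog  4  5          NaN
4   fish  5  9          NaN
5  horse  6  1          NaN
6    dog  7  2          NaN
7  horse  8  3          NaN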
Expected output:
   group  x  z  predictions
0    dog  1  3          NaN
1   fish  2  5    20.941176
2  horse  3  3    38.000000
3    dog  4  5          NaN
4   fish  5  9    12.000000
5  horse  6  1    15.000000
6    dog  7  2          NaN
7  horse  8  3     8.000000
Your function f should return a Series that carries the original index:
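def f(g):
    try:
        predictions = models[g.name].predict(g[features])
    except KeyError:
        # No model was trained for this group.
        predictions = [None] * len(g)
    # Reuse the group's row labels so the result can be aligned with test_d.
    return pd.Series(predictions, index=g.index)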
test_d.groupby('group', group_keys=False).apply(f)
Output:
0         None
3         None
6         None
1    20.941176
4         12.0
2         38.0
5         15.0
7          8.0
dtype: object
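The original index matters because pandas aligns on index labels when a Series is assigned to a DataFrame column. Below is a minimal, self-contained sketch of that behaviour. The frame df and the Series by_group and by_row are made-up illustrations, not part of the question's data.

```python
import pandas as pd

# Toy frame with the default integer index 0, 1, 2 (illustrative only).
df = pd.DataFrame({"group": ["dog", "fish", "dog"]})

# Indexed by group label: none of these labels exist in df.index,
# so every row of the new column is filled with NaN.
by_group = pd.Series([[None, None], [1.0]], index=["dog", "fish"])
df["by_group"] = by_group

# Indexed by row label: values are matched by label, not by position.
by_row = pd.Series([10.0, 20.0, 30.0], index=[2, 0, 1])
df["by_row"] = by_row

print(df)
```

The first assignment mirrors the failing attempt in the question, and the second mirrors a Series built with index=g.index.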
That way, when you assign the result, the indices align:
test_d['predictions'] = test_d.groupby('group', group_keys=False).apply(f)
Output:
   group  x  z predictions
0    dog  1  3        None
1   fish  2  5   20.941176
2  horse  3  3        38.0
3    dog  4  5        None
4   fish  5  9        12.0
5  horse  6  1        15.0
6    dog  7  2        None
7  horse  8  3         8.0
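Note that the column above has object dtype because of the None placeholders. If a float column with NaN for the unmodelled groups is preferred, as in the expected output, one possible alternative is to skip apply and fill the column group by group. This is only a sketch under stated assumptions: ConstantModel is a hypothetical stand-in for the fitted pipelines so that the example runs without scikit-learn, and any object with a predict method could be used in its place.

```python
import numpy as np
import pandas as pd


class ConstantModel:
    """Hypothetical stand-in for a fitted estimator (not a scikit-learn class)."""

    def __init__(self, value):
        self.value = value

    def predict(self, X):
        # Return one constant prediction per input row.
        return np.full(len(X), self.value, dtype=float)


features = ["x", "z"]
# Assumed models: only "fish" and "horse" have one, "dog" does not.
models = {"fish": ConstantModel(1.0), "horse": ConstantModel(2.0)}

test_d = pd.DataFrame({
    "group": ["dog", "fish", "horse", "dog", "fish", "horse", "dog", "horse"],
    "x": [1, 2, 3, 4, 5, 6, 7, 8],
    "z": [3, 5, 3, 5, 9, 1, 2, 3],
})

# Start with NaN everywhere so the column keeps a float dtype.
test_d["predictions"] = np.nan
for name, g in test_d.groupby("group"):
    model = models.get(name)
    if model is not None:
        # g.index holds the original row labels of this group.
        test_d.loc[g.index, "predictions"] = model.predict(g[features])

print(test_d)
```

Rows whose group has no entry in models keep their initial NaN, so no exception handling is needed.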