How to predict on a grouped DataFrame, using a dictionary of models, and assign the results back to the original test DataFrame?

I have created a dictionary of regression models, keyed by the values of group in the training dataset, d.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

d = pd.DataFrame({
    "group":["cat","fish","horse","cat","fish","horse","cat","horse"],
    "x":[1,4,7,2,5,8,3,9],
    "y":[10,20,14,12,12,3,12,2],
    "z":[3,5,3,5,9,1,2,3]
})

features, models = ['x', 'z'], {}
for animal in ['horse', 'cat', 'fish']:
    # One pipeline per group, fitted only on that group's rows.
    models[animal] = Pipeline([("estimator", LinearRegression(fit_intercept=True))])
    x, y = d.loc[d.group == animal, features], d.loc[d.group == animal, "y"]
    models[animal].fit(x, y)

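I also have a test dataset, test_d, which has rows for some, but not all, of the groups (i.e. not every group in it has a model, and not every model has rows).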

test_d = pd.DataFrame({
    "group":["dog","fish","horse","dog","fish","horse","dog","horse"],
    "x":[1,2,3,4,5,6,7,8],
    "z":[3,5,3,5,9,1,2,3]
})

I want to use apply on the grouped test_d, using .name to look up the right model (if one exists), and return the predictions, using the function f():

def f(g):
    # g.name is the group key; a group without a trained model raises KeyError.
    try:
        predictions = models[g.name].predict(g[features])
    except KeyError:
        predictions = [None]*len(g)
    return predictions

The function "works" in the sense that it returns the correct values:

grouping_column ="group"
test_d.groupby(grouping_column, group_keys=False).apply(f)

Output:

group
dog                           [None, None, None]
fish     [20.94117647058824, 12.000000000000004]
horse                          [38.0, 15.0, 8.0]
dtype: object

Question:

How should f() be written so that I can assign the result directly to test_d? I would like to do something like this:

test_d["predictions"] = test_d.groupby(grouping_column, group_keys=False).apply(f)

But this obviously does not work; every prediction comes back as NaN:

   group  x  z predictions
0    dog  1  3         NaN
1   fish  2  5         NaN
2  horse  3  3         NaN
3    dog  4  5         NaN
4   fish  5  9         NaN
5  horse  6  1         NaN
6    dog  7  2         NaN
7  horse  8  3         NaN

Expected output:

   group  x  z  predictions
0    dog  1  3          NaN
1   fish  2  5    20.941176
2  horse  3  3    38.000000
3    dog  4  5          NaN
4   fish  5  9    12.000000
5  horse  6  1    15.000000
6    dog  7  2          NaN
7  horse  8  3     8.000000

Your function f should return a Series that carries the group's original index:

def f(g):
    try:
        predictions = models[g.name].predict(g[features])
    except KeyError:
        predictions = [None]*len(g)
    # Keep the original row labels so the result can be aligned with test_d.
    return pd.Series(predictions, index=g.index)

test_d.groupby('group', group_keys=False).apply(f)

Output:

0         None
3         None
6         None
1    20.941176
4         12.0
2         38.0
5         15.0
7          8.0
dtype: object
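
The row labels in this result (0, 3, 6, 1, ...) are what make assignment possible, because pandas matches a Series to a DataFrame by index label rather than by position. A minimal sketch of that behaviour, using a made-up frame and made-up values (df, by_group and by_row are illustrative names, not part of the question):

```python
import pandas as pd

df = pd.DataFrame({"group": ["dog", "fish", "dog"]})  # default index 0, 1, 2

# Indexed by group label: no label matches 0, 1, 2, so every row becomes NaN.
by_group = pd.Series([1.0, 2.0], index=["dog", "fish"])
df["misaligned"] = by_group

# Indexed by the original row labels (in any order): values land on the right rows.
by_row = pd.Series([10.0, 30.0, 20.0], index=[0, 2, 1])
df["aligned"] = by_row
```

The first assignment mirrors the all-NaN column from the question, where the result was indexed by group name; the second mirrors a result that kept g.index.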

So when you assign the result, the indexes align:

test_d['predictions'] = test_d.groupby('group', group_keys=False).apply(f)

Output:

   group  x  z predictions
0    dog  1  3        None
1   fish  2  5   20.941176
2  horse  3  3        38.0
3    dog  4  5        None
4   fish  5  9        12.0
5  horse  6  1        15.0
6    dog  7  2        None
7  horse  8  3         8.0
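
Note that the predictions column above has dtype object, because None is mixed with floats, whereas the expected output in the question shows NaN in a float column. One possible way to get there, sketched on a small stand-in Series rather than on the real column (the values are copied from the output above only for illustration), is pd.to_numeric:

```python
import pandas as pd

# Stand-in for the predictions column: None mixed with floats gives dtype object.
predictions = pd.Series([None, 20.941176, 38.0, None], dtype=object)

# pd.to_numeric converts None to NaN and returns a float64 Series.
as_float = pd.to_numeric(predictions)
```

Applied to the real data this would be test_d['predictions'] = pd.to_numeric(test_d['predictions']). Alternatively, f could fill missing groups with np.nan instead of None, which should avoid the object dtype in the first place.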