PatsyError: numbers besides '0' and '1' are only allowed with ** is not resolved when using Q

I am trying to run an ANOVA test on a dataframe that looks like this:

   code    2020-11-01    2020-11-02   2020-11-03   2020-11-04 ...
0  1       22.5         73.1          12.2          77.5
1  1       23.1         75.4          12.4          78.3
2  2       43.1         72.1          13.4          85.4
3  2       41.6         85.1          34.1          96.5
4  3       97.3         43.2          31.1          55.3
5  3       12.1         44.4          32.2          52.1
...

I want to compute a one-way ANOVA for each column, grouped by code. I used statsmodels and a for loop:

keys = []
tables = []
for variable in df.columns[1:]:
    model = ols('{} ~ code'.format(variable), data=df).fit()
    anova_table = sm.stats.anova_lm(model)

    keys.append(variable)
    tables.append(anova_table)

df_anova = pd.concat(tables, keys=keys, axis=0)

df_anova

The problem is that I keep getting an error message on line 4 (the ols call):

PatsyError: numbers besides '0' and '1' are only allowed with **
    2020-11-01 ~ code
    ^^^^

I tried using Q as suggested here:

...
   model = ols('{Q(x)} ~ code'.format(x=variable), data=df).fit()

KeyError: 'Q(x)'

I also tried placing the Q outside, but got the same error.

My end goal: compute a one-way ANOVA for each day (each column) based on the 'code' column.
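
On the KeyError from the Q attempt: it is raised by Python's str.format, not by patsy. Everything inside the braces is taken as the field name, so '{Q(x)}' looks for a keyword argument literally named 'Q(x)'. A minimal sketch of building the formula with Q kept outside the braces and the column name quoted (the names variable and formula are only for illustration):

```python
variable = "2020-11-01"

# str.format treats the whole text inside {...} as a field name, so
# '{Q(x)}' asks for a keyword argument called 'Q(x)' and fails.
try:
    '{Q(x)} ~ code'.format(x=variable)
except KeyError as err:
    print("KeyError:", err)

# Keep Q(...) outside the braces and put the column name in quotes,
# so that patsy receives Q("2020-11-01") as part of the formula.
formula = 'Q("{}") ~ code'.format(variable)
print(formula)  # Q("2020-11-01") ~ code
```

The resulting string can then be passed to ols(formula, data=df) inside the original loop; the quotes matter because Q expects the column name as a string.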

You can try pivoting it to long format and skip the iteration over columns:

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({"code":[1,1,2,2,3,3],
                   "2020-11-01":[22.5,23.1,43.1,41.6,97.3,12.1],
                  "2020-11-02":[73.1,75.4,72.1,85.1,43.2,44.4]})

df_long = df.melt(id_vars="code")

df_long

    code    variable  value
0      1  2020-11-01   22.5
1      1  2020-11-01   23.1
2      2  2020-11-01   43.1
3      2  2020-11-01   41.6
4      3  2020-11-01   97.3
5      3  2020-11-01   12.1
6      1  2020-11-02   73.1
7      1  2020-11-02   75.4
8      2  2020-11-02   72.1
9      2  2020-11-02   85.1
10     3  2020-11-02   43.2
11     3  2020-11-02   44.4

Then apply your code:

tables = []
keys = df_long.variable.unique()
for D in keys:
    model = ols('value ~ code', data=df_long[df_long.variable == D]).fit()
    anova_table = sm.stats.anova_lm(model)
    tables.append(anova_table)

pd.concat(tables,keys=keys)

Or simply:

def aov_func(x):
    model = ols('value ~ code', data=x).fit()
    return sm.stats.anova_lm(model)
        
df_long.groupby("variable").apply(aov_func)

This gives the following result:

                        df     sum_sq      mean_sq         F    PR(>F)
variable
2020-11-01 code        1.0  1017.6100  1017.610000  1.115768  0.350405
           Residual    4.0  3648.1050   912.026250       NaN       NaN
2020-11-02 code        1.0   927.2025   927.202500  6.194022  0.067573
           Residual    4.0   598.7725   149.693125       NaN       NaN