Logistic Regression Model (binary) crosstab error = shape of passed values issue

Question

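I'm currently trying to run a logistic regression on a dataset. I dummy-encoded my categorical variables and standardized my continuous variables, then filled the null values with -1 (this works for my dataset). I go through these steps without any errors until I try to run my crosstab, which complains about the shape of the values I passed. I get the same error for LogR with and without CV. My code is included below. I left out the encoding because it doesn't seem to be the problem, and the code for LogR without CV because it is essentially the same minus the CV.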

# read in the df w/ encoded variables
allyrs=pd.read_csv("C:/Users/cyrra/OneDrive/Documents/Pythonread/HDS805/CS1W1/modelready_working.csv")

# Find locations of where I need to trim the data down selecting only the encoded variables
allyrs.columns.get_loc("BMI_C__-1.0")
23
allyrs.columns.get_loc("N_BMIR")
152

# Finding the location of the Y col
allyrs.columns.get_loc("CM")
23

#create new X and y for binary LR
y_bi = allyrs[["CM"]]
X_bi = allyrs.iloc[0:1305720, 23:152]

I then checked the length of both variables and checked all the columns in the X set, and everything is there. The shapes are: y_bi = 1305720 rows × 1 column, X_bi = 1305720 rows × 129 columns.

# Create test/train
# Create test/train for bi column
from sklearn.model_selection import train_test_split
Xbi_train, Xbi_test, ybi_train, ybi_test = train_test_split(X_bi, y_bi,
                                                    train_size=0.8,test_size = 0.2)

Checking the sizes of Xbi_train and ybi_train again: Xbi_train = 1044576 rows × 129 columns, ybi_train = 1044576 rows × 1 column.

# LRw/CV for the binary col
from sklearn.linear_model import LogisticRegressionCV
logitbi_cv = LogisticRegressionCV(cv=2, random_state=0).fit(Xbi_train, ybi_train)

# Set predicted (checking to see if its an array)
logitbi_cv.predict(Xbi_train)
array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

# Set predicted to its own variable 
[IN]:pred_logitbi_cv =logitbi_cv.predict(Xbi_train)

# Cross tab LR w/0ut
from sklearn.metrics import confusion_matrix
ct_bi_cv=pd.crosstab(ybi_train, pred_logitbi_cv)

Error:

[OUT]:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in create_block_manager_from_arrays(arrays, names, axes)
   1701         blocks = _form_blocks(arrays, names, axes)
-> 1702         mgr = BlockManager(blocks, axes)
   1703         mgr._consolidate_inplace()

~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in __init__(self, blocks, axes, do_integrity_check)
    142         if do_integrity_check:
--> 143             self._verify_integrity()
    144 

~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in _verify_integrity(self)
    322             if block.shape[1:] != mgr_shape[1:]:
--> 323                 raise construction_error(tot_items, block.shape[1:], self.axes)
    324         if len(self.items) != tot_items:

ValueError: Shape of passed values is (1, 2), indices imply (1044576, 2)

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-121-c669b17c171f> in <module>
      1 # LR W/ CV
      2 # Cross tab LR w/0ut
----> 3 ct_bi_cv=pd.crosstab(ybi_train, pred_logitbi_cv)

~\anaconda3\lib\site-packages\pandas\core\reshape\pivot.py in crosstab(index, columns, values, rownames, colnames, aggfunc, margins, margins_name, dropna, normalize)
    596         **dict(zip(unique_colnames, columns)),
    597     }
--> 598     df = DataFrame(data, index=common_idx)
    599     original_df_cols = df.columns
    600 

~\anaconda3\lib\site-packages\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy)
    527 
    528         elif isinstance(data, dict):
--> 529             mgr = init_dict(data, index, columns, dtype=dtype)
    530         elif isinstance(data, ma.MaskedArray):
    531             import numpy.ma.mrecords as mrecords

~\anaconda3\lib\site-packages\pandas\core\internals\construction.py in init_dict(data, index, columns, dtype)
    285             arr if not is_datetime64tz_dtype(arr) else arr.copy() for arr in arrays
    286         ]
--> 287     return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
    288 
    289 

~\anaconda3\lib\site-packages\pandas\core\internals\construction.py in arrays_to_mgr(arrays, arr_names, index, columns, dtype, verify_integrity)
     93     axes = [columns, index]
     94 
---> 95     return create_block_manager_from_arrays(arrays, arr_names, axes)
     96 
     97 

~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in create_block_manager_from_arrays(arrays, names, axes)
   1704         return mgr
   1705     except ValueError as e:
-> 1706         raise construction_error(len(arrays), arrays[0].shape, axes, e)
   1707 
   1708 

ValueError: Shape of passed values is (1, 2), indices imply (1044576, 2)

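I understand this is saying that the number of rows being passed to crosstab doesn't match, but can someone tell me why this is happening or where I'm going wrong? I'm replicating example code from the book I'm working through, using my own data, exactly as the book does with its provided data.

Thank you so much!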

Answer 1

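The shape of your target variable should be (n,) rather than (n,1), which is what you get when you call y_bi = allyrs[["CM"]]. See the relevant help page. There should have been a warning about this at fit time, but I guess it was somehow missed.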
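To make the shape difference concrete, here is a minimal sketch. The `toy` frame and the all-zero `pred` array are hypothetical stand-ins, not the asker's data. It shows that double brackets return a two-dimensional DataFrame while single brackets return a one-dimensional Series, and that `pd.crosstab` accepts the Series next to a NumPy prediction array.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real allyrs frame
rng = np.random.default_rng(0)
toy = pd.DataFrame({"CM": rng.integers(0, 2, size=10)})

y_frame = toy[["CM"]]   # double brackets -> DataFrame, 2-D
y_series = toy["CM"]    # single brackets -> Series, 1-D

print(type(y_frame).__name__, y_frame.shape)    # DataFrame (10, 1)
print(type(y_series).__name__, y_series.shape)  # Series (10,)

# Stand-in for the array returned by model.predict(...)
pred = np.zeros(10, dtype=int)

# The 1-D Series pairs up row by row with the 1-D prediction array
table = pd.crosstab(y_series, pred)
print(table)

# Passing the (n, 1) DataFrame is what triggered the error in the question;
# the exact exception and message may differ between pandas versions.
try:
    pd.crosstab(y_frame, pred)
except Exception as exc:
    print("crosstab with a DataFrame failed:", type(exc).__name__)
```

Checking `.shape` or `.ndim` on the target before calling `crosstab` is a quick way to catch this kind of mismatch.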

If you use y_bi = allyrs["CM"] instead, for example with some dummy data I set up:

import numpy as np
import pandas as pd

np.random.seed(111)
allyrs = pd.DataFrame(np.random.binomial(1,0.5,(100,4)),columns=['x1','x2','x3','CM'])
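# Note: iloc[:, :4] also selects the CM column itself, so the features contain
# the target (hence the perfectly diagonal table below); use iloc[:, :3] to exclude it.
X_bi = allyrs.iloc[:,:4]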
y_bi = allyrs["CM"]

Then run the train/test split, followed by the fit:

from sklearn.model_selection import train_test_split
Xbi_train, Xbi_test, ybi_train, ybi_test = train_test_split(X_bi, y_bi,
                                                    train_size=0.8,test_size = 0.2)

from sklearn.linear_model import LogisticRegressionCV
logitbi_cv = LogisticRegressionCV(cv=2, random_state=0).fit(Xbi_train, ybi_train)

pred_logitbi_cv =logitbi_cv.predict(Xbi_train)
pd.crosstab(ybi_train, pred_logitbi_cv)

col_0   0   1
CM           
0      39   0
1       0  41
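If the target has already been split as an (n, 1) DataFrame, it may be enough to flatten it just before tabulating rather than rebuilding everything. The sketch below uses small hypothetical stand-ins for `ybi_train` and the predictions. It shows selecting the single column and, as an alternative, `confusion_matrix` from scikit-learn, which the question imports but never calls.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical stand-ins for ybi_train (an (n, 1) DataFrame) and the predictions
ybi_train = pd.DataFrame({"CM": [0, 1, 1, 0, 1, 0]})
pred = np.array([0, 1, 0, 0, 1, 1])

# Option 1: select the single column to get a 1-D Series
# (ybi_train.squeeze("columns") gives the same result for a one-column frame)
y_1d = ybi_train["CM"]

ct = pd.crosstab(y_1d, pred, rownames=["actual"], colnames=["predicted"])
print(ct)

# Option 2: scikit-learn's confusion_matrix (rows = actual, columns = predicted)
cm = confusion_matrix(y_1d, pred)
print(cm)
```

Both tables hold the same counts; `crosstab` returns a labelled DataFrame, while `confusion_matrix` returns a plain NumPy array.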
