What does this error mean with StratifiedShuffleSplit?

Question

I'm completely new to data science in general, and I'm hoping someone can explain why this doesn't work:

I'm using the Advertising dataset from the following URL: "http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv". It has 3 feature columns ("TV", "Radio", "Newspaper") and 1 label column ("sales"). My full dataset is called data.

Next, I try to use sklearn's StratifiedShuffleSplit class to split the data into training and test sets:

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, random_state=0) # can use test_size=0.8
for train_index, test_index in split.split(data.drop("sales", axis=1), data["sales"]): # Generate indices to split data into training and test set.
    strat_train_set = data.loc[train_index]
    strat_test_set = data.loc[test_index]

I get ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

The same code splits the data correctly on another dataset with 14 feature columns and 1 label column. Why doesn't it work here? Thanks.

Answer 1

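I think the problem is that your data_y is a 2D matrix.

But as I see in the sklearn.model_selection.StratifiedShuffleSplit doc, it should be a 1D vector. Try encoding each row of data_y as an integer (it will be interpreted as a class), and then use split.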
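As a rough sketch of the row-encoding idea, assuming a hypothetical two-column label frame named data_y with columns "a" and "b" (neither is from the question), each distinct row can be mapped to one integer class with pandas' groupby(...).ngroup() before stratifying:

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical 2D label frame: 20 rows, 4 distinct (a, b) combinations.
data_y = pd.DataFrame({"a": [0, 0, 1, 1] * 5, "b": [0, 1, 0, 1] * 5})

# Map every distinct row to a single integer so y becomes a 1D class vector.
y_encoded = data_y.groupby(["a", "b"]).ngroup()

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
# split() yields positional indices; X is only used for its length here.
train_index, test_index = next(split.split(data_y, y_encoded))

print(y_encoded.value_counts().sort_index())
print(sorted(y_encoded.iloc[test_index]))
```

This only helps when each row combination occurs at least twice; otherwise the same ValueError appears.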

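Or possibly your y is a regression variable (continuous numerical data). In that case almost every value is unique, so each value is treated as its own class with a single member, which is exactly what the error reports. (Credit to Vivek's link.)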
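If the continuous-target explanation applies, one common workaround is to stratify on a binned copy of the target rather than on the raw values. The sketch below is only an illustration under assumptions: it uses synthetic data as a stand-in for the Advertising file, and the 5 quantile bins and test_size=0.2 are arbitrary choices, not something from the question.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the Advertising data (made-up values, not the real file).
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "TV": rng.uniform(0, 300, 200),
    "Radio": rng.uniform(0, 50, 200),
    "Newspaper": rng.uniform(0, 100, 200),
})
data["sales"] = 0.05 * data["TV"] + 0.2 * data["Radio"] + rng.normal(0, 1, 200)

# Diagnostic: with a continuous target, the rarest "class" has a single member.
print(data["sales"].value_counts().min())

# Bin the target into 5 quantile-based groups and stratify on the bin labels.
sales_bin = pd.qcut(data["sales"], q=5, labels=False)

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_index, test_index = next(split.split(data.drop("sales", axis=1), sales_bin))

# split() returns positional indices, so iloc is the safe accessor.
strat_train_set = data.iloc[train_index]
strat_test_set = data.iloc[test_index]
```

If stratification is not actually needed for a regression target, a plain random split such as sklearn.model_selection.train_test_split without the stratify argument avoids the issue entirely.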

python

pandas

scikit-learn

data-science

sklearn-pandas