进行分层时是否应保留类别比例？

Question

我有 30,000 个按情绪分类的短语。

我要使用朴素贝叶斯。

这是比例（情绪 -> 短语数）。

anger           98
boredom        157
empty          659
enthusiasm     522
fun           1088
happiness     2986
hate          1187
love          2068
neutral       6340
relief        1021
sadness       4828
surprise      1613
worry         7433

所以，我必须将我的数据集拆分成 train/test 来执行我的模型等，对吗？

在进行分层时是否应该保留类别的比例？

我的意思是，如果我选择 30% 作为测试样本，我应该保留每种情绪的 30% 而不是整个数据集的 30% 吗？

我想是的，但我想有一个更有经验的意见。

你会怎么做？这里的任何人都知道更好的方法而不是执行 python 循环，测试哪种情绪，计算 30%，放入字典等？

是否有任何 Pandas 技巧可以按类别特征进行分层，同时保持比例？

Answer 1

Should I keep the proportion of the categories when executing the stratification?

您似乎对术语有点困惑；分层（或stratified sampling）的定义就是保持比例，否则就是简单的随机抽样。

if I pick 30% for the test sample, should I keep 30% of each sentiment instead of 30% of the whole dataset?

他们并不矛盾，不是吗？如果你保留每个类别的 30%，你最终不会得到初始设置的 30% 吗？

Is there any Pandas trick to stratify by a category feature, keeping the proportion?

不知道 pandas，但是 scikit-learn（我猜你接下来会用到）model_selection.train_test_split 包含这样一个 stratify 选项：

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify=y, 
                                                    test_size=0.3)

进行分层时是否应保留类别比例？

Should I keep the proportion of categories when executing an stratification?

machine-learning

nltk

pandas

scikit-learn

naivebayes