Can I sample sets of data within a DataFrame without selecting the same set twice (without replacement)?

Question

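I'm new to Python. I want to sample the sets of data in the DataFrame below by group, without selecting the same group twice. The code I wrote does sample whole groups correctly, but it can select the same group twice.

Please note: the data below is test data. The real data I'm running the code on is much larger, so I can't use indexing.

Data: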

import pandas as pd

d = {'group': ['A','A','A','B','B','B','C','C','C','D','D','D','E','E','E'], 'number': [1,2,3,1,2,3,1,2,3,1,2,3,1,2,3], 'weather': ['hot','hot','hot','cold','cold','cold','hot','hot','hot','cold','cold','cold','hot','hot','hot']}
df = pd.DataFrame(data=d)
df
group   number  weather
A       1       hot
A       2       hot
A       3       hot
B       1       cold
B       2       cold
B       3       cold
C       1       hot
C       2       hot
C       3       hot
D       1       cold
D       2       cold
D       3       cold
E       1       hot
E       2       hot
E       3       hot

My code

df_s=[]
for typ in df.group.sample(3,replace=False):
    df_s.append(df[df['group']==typ])
df_s=pd.concat(df_s)
df_s

Result

group   number  weather
E       1       hot
E       2       hot
E       3       hot
E       1       hot
E       2       hot
E       3       hot
D       1       cold
D       2       cold
D       3       cold

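The result should contain the data of 3 different groups, but as you can see it contains only 2 (E and D), which means the code can select the same group more than once.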

Answer 1

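Calling sample with replace=False only guarantees that no row is picked twice in the sample. However, several rows carry the same group letter (your group column), so sampling rows from df.group can still return the same group more than once.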
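To make the difference between distinct rows and distinct groups concrete, here is a small sketch on a reduced copy of the test data. The `random_state=0` seed is only an assumption added for reproducibility, and `drop_duplicates` is used here as one possible way to get one entry per group.

```python
import pandas as pd

df = pd.DataFrame({
    "group": list("AAABBBCCCDDDEEE"),
    "number": [1, 2, 3] * 5,
})

# Sampling the column draws 3 distinct *rows*; their group labels may coincide.
picked = df["group"].sample(3, replace=False, random_state=0)
print(picked.index.is_unique)   # True: no row is picked twice
print(picked.tolist())          # group labels, which may contain repeats

# Deduplicating first leaves one entry per group, so labels cannot repeat.
unique_groups = df["group"].drop_duplicates()
picked_groups = unique_groups.sample(3, replace=False, random_state=0)
print(picked_groups.is_unique)  # True: 3 different groups
```

Whether the first sample actually contains a repeated label depends on the random draw; the point is only that nothing prevents it.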

As a quick fix for your code, sample from the unique group labels instead:

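import numpy as np

df_s = []
# Draw 3 distinct group labels, then collect every row of each chosen group.
for typ in np.random.choice(df["group"].unique(), 3, replace=False):
    df_s.append(df[df['group'] == typ])
df_s = pd.concat(df_s)
df_s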
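The same idea can probably be written without the loop. The sketch below is one possible variant, not part of the original answer: `sample_groups` is a hypothetical helper name, and the `seed` parameter is an assumption added so the draw can be reproduced. It picks the group labels once and filters with `isin`, which avoids building one sub-frame per group.

```python
import numpy as np
import pandas as pd


def sample_groups(df, n, column="group", seed=None):
    """Hypothetical helper: return all rows of n distinct, randomly chosen groups."""
    rng = np.random.default_rng(seed)
    # One entry per group, so choosing without replacement cannot repeat a group.
    labels = df[column].drop_duplicates().to_numpy()
    chosen = rng.choice(labels, size=n, replace=False)
    # Keep every row whose group label was chosen.
    return df[df[column].isin(chosen)]


df = pd.DataFrame({
    "group": list("AAABBBCCCDDDEEE"),
    "number": [1, 2, 3] * 5,
    "weather": (["hot"] * 3 + ["cold"] * 3) * 2 + ["hot"] * 3,
})

df_s = sample_groups(df, 3, seed=42)
print(df_s["group"].nunique())  # 3
```

Note that the `isin` filter keeps the rows in their original order rather than in the order the groups were drawn, which may or may not matter for your use case.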

