Pandas 数据框不相交组的随机抽样

Question

我需要通过属性 'ids' 将一个数据帧随机分成两个不相交的集合。例如，考虑以下数据框：

df=
Out[470]: 
          0     1     2     3       ids
0      17.0  18.0  16.0  15.0      13.0
1      18.0  16.0  15.0  15.0      13.0
2      16.0  15.0  15.0  16.0      13.0
131    12.0   8.0  21.0  19.0      14.0
132     8.0  21.0  19.0  20.0      14.0
133    21.0  19.0  20.0   9.0      14.0
248     NaN   NaN  12.0  11.0      17.0
249     NaN  12.0  11.0  10.0      17.0
250    12.0  11.0  10.0   NaN      17.0
287     3.0   3.0   1.0   8.0      20.0
288     3.0   1.0   8.0   3.0      20.0
289     1.0   8.0   3.0   3.0      20.0
413    21.0   7.0  16.0  18.0      25.0
414     7.0  16.0  18.0  19.0      25.0
415    16.0  18.0  19.0  18.0      25.0
665    10.0   8.0   8.0   7.0      27.0
666     8.0   8.0   7.0   9.0      27.0
667     8.0   7.0   9.0   8.0      27.0
790     NaN   NaN  15.0   NaN      33.0
791     NaN  15.0   NaN  10.0      33.0
792    15.0   NaN  10.0   NaN      33.0
812     NaN  16.0   NaN  17.0      34.0
813    16.0   NaN  17.0   NaN      34.0
814     NaN  17.0   NaN  13.0      34.0
944     3.0   4.0   3.0  18.0      35.0
945     4.0   3.0  18.0  18.0      35.0
946     3.0  18.0  18.0  11.0      35.0
1059    9.0  10.0   3.0   4.0      56.0
1060   10.0   3.0   4.0   3.0      56.0
1061    3.0   4.0   3.0   3.0      56.0
    ...   ...   ...   ...       ...
10125   NaN   9.0   5.0   5.0  101317.0
10126   9.0   5.0   5.0   5.0  101317.0
10127   5.0   5.0   5.0   7.0  101317.0

我需要得到两个（随机分开一些分数大小）数据帧没有相交值ids。

我知道如何用'non-pandasian'方式解决：

获取 ids
将唯一值随机分成两个不相交的组
select 根据 ids 的值在两组中使用 .isin()

我想知道是否有一种简单而巧妙的方法可以使用一些 pandas 内置函数来完成它，比如 .sample()?

Answer 1

更新：

df1 = df.sample(frac=1).loc[df.ids % 2 == 0]
df2 = df.loc[df.index.difference(df1.index)]

OLD 不正确（它不关心分隔 ID）答案：

您可以先使用 sample(frac=1) 洗牌，然后使用 np.split():

df1, df2 = np.split(df.sample(frac=1), 2)

Answer 2

使用sklearn.model_selection.GroupShuffleSplit进行分割：

from sklearn.model_selection import GroupShuffleSplit

# Initialize the GroupShuffleSplit.
gss = GroupShuffleSplit(n_splits=1, test_size=0.5)

# Get the indexers for the split.
idx1, idx2 = next(gss.split(df, groups=df.ids))

# Get the split DataFrames.
df1, df2 = df.iloc[idx1], df.iloc[idx2]

Pandas 数据框不相交组的随机抽样

random sampling with Pandas data frame disjoint groups

python

disjoint-sets

pandas