Random Sample of a subset of a dataframe in Pandas

Question

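Say I have a DataFrame with 100,000 entries and I want to split it into 100 sections of 1,000 entries each.

How do I take a random sample of, say, 50 entries from just one of the 100 sections? The data set is already ordered, so the first 1,000 rows are the first section, the next 1,000 are the second section, and so on.

Many thanks.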

Answer 1

One solution is to use the choice function from numpy.

Say you want 50 entries out of the first 1,000 rows; you can use:

import numpy as np
chosen_idx = np.random.choice(1000, replace=False, size=50)
df_trimmed = df.iloc[chosen_idx]

This of course does not take your block structure into account. If you want a sample of 50 items from block i, for example, you can do:

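import numpy as np
i = 0  # zero-based index of the block to sample from (example value)
block_start_idx = 1000 * i
# 50 distinct offsets within the 1,000-row block
chosen_idx = np.random.choice(1000, replace=False, size=50)
df_trimmed_from_block_i = df.iloc[block_start_idx + chosen_idx]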
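
The block-offset idea above can be wrapped in a small helper. This is a minimal sketch, not part of the original answer: the function name `sample_block`, its parameters, and the toy frame `df_demo` are hypothetical, and it uses NumPy's `default_rng` generator rather than `np.random.choice`.

```python
import numpy as np
import pandas as pd


def sample_block(df, i, block_size=1000, n=50, seed=None):
    """Return n distinct rows drawn at random from the i-th block of df (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Offsets within the block, drawn without replacement.
    offsets = rng.choice(block_size, size=n, replace=False)
    return df.iloc[block_size * i + offsets]


# Toy data standing in for the sorted 100,000-row frame.
df_demo = pd.DataFrame({"x": np.arange(100_000)})
picked = sample_block(df_demo, i=3, seed=0)  # 50 rows from positions 3000..3999
```

Passing a `seed` makes the draw repeatable; leaving it as `None` gives a fresh sample on each call.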

Answer 2

You can use the sample method*:

In [11]: df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], columns=["A", "B"])

In [12]: df.sample(2)
Out[12]:
   A  B
0  1  2
2  5  6

In [13]: df.sample(2)
Out[13]:
   A  B
3  7  8
0  1  2

*On one of the section DataFrames.

Note: if the sample size is larger than the DataFrame, this raises an error unless you sample with replacement.

In [14]: df.sample(5)
ValueError: Cannot take a larger sample than population when 'replace=False'

In [15]: df.sample(5, replace=True)
Out[15]:
   A  B
0  1  2
1  3  4
2  5  6
3  7  8
1  3  4
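
To apply `sample` to a single section, as the footnote suggests, one option is to slice that section out with `iloc` first. A minimal sketch, assuming the frame is sorted as in the question; the toy frame `df_demo` and the section index `i` are illustrative, and `random_state` is only there to make the draw repeatable.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the sorted 100,000-row frame.
df_demo = pd.DataFrame({"x": np.arange(100_000)})

i = 7  # zero-based section index (illustrative)
section = df_demo.iloc[1000 * i : 1000 * (i + 1)]  # rows 7000..7999
section_sample = section.sample(n=50, random_state=0)  # 50 distinct rows
```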

Answer 3

This is a nice place for recursion.

import numpy as np


def main2():
    rows = 8  # say you have 8 rows; for real data use len(df)
    rands = []
    for _ in range(rows):
        rands.append(fun(rands, rows))
    print(rands)  # a random ordering of the row positions


def fun(rands, rows):
    # Draw a position in [0, rows); redraw recursively if it was already used.
    gen = np.random.randint(0, rows)
    if gen in rands:
        return fun(rands, rows)
    return gen


if __name__ == "__main__":
    main2()

Example output (varies from run to run): [6, 0, 7, 1, 3, 5, 4, 2]
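
If the goal is simply a random ordering of row positions without repeats, NumPy can produce it directly. This is an alternative to the recursive approach above, not a restatement of it; a sketch assuming 8 rows as in the example.

```python
import numpy as np

rows = 8  # same illustrative row count as above
rng = np.random.default_rng()
perm = rng.permutation(rows)  # each of 0..rows-1 exactly once, in random order
print(perm.tolist())
```

Taking the first k entries of `perm` gives k distinct positions that can be passed to `df.iloc`.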

Answer 4

You could add a "section" column to your data, then perform a groupby and sample:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"x": np.arange(1_000 * 100), "section": np.repeat(np.arange(100), 1_000)}
)
# >>> df
#            x  section
# 0          0        0
# 1          1        0
# 2          2        0
# 3          3        0
# 4          4        0
# ...      ...      ...
# 99995  99995       99
# 99996  99996       99
# 99997  99997       99
# 99998  99998       99
# 99999  99999       99
#
# [100000 rows x 2 columns]

sample = df.groupby("section").sample(50)
# >>> sample
#            x  section
# 907      907        0
# 494      494        0
# 775      775        0
# 20        20        0
# 230      230        0
# ...      ...      ...
# 99740  99740       99
# 99272  99272       99
# 99863  99863       99
# 99198  99198       99
# 99555  99555       99
#
# [5000 rows x 2 columns]

Append .query("section == 42") or similar if you are only interested in a specific section.
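
For the single-section case, here is a small sketch of two ways to combine `query` with sampling. It assumes pandas 1.1.0 or later for the grouped version; the names `df_demo`, `only_42`, and `only_42_direct` are illustrative, and `random_state` is only there for repeatability.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(
    {"x": np.arange(1_000 * 100), "section": np.repeat(np.arange(100), 1_000)}
)

# Option 1: sample every section, then keep only section 42.
only_42 = df_demo.groupby("section").sample(50, random_state=0).query("section == 42")

# Option 2: filter first, so only one section is sampled.
only_42_direct = df_demo.query("section == 42").sample(50, random_state=0)
```

Filtering first avoids sampling the 99 sections that would be thrown away.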

Note that this requires pandas 1.1.0 or later; see the docs here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.sample.html

For older versions, see the answer by @msh5678.

Answer 5

Thank you, Jeff, but I received an error:

AttributeError: Cannot access callable attribute 'sample' of 'DataFrameGroupBy' objects, try using the 'apply' method

So I suggest using the command below instead of sample = df.groupby("section").sample(50):

df.groupby('section').apply(lambda grp: grp.sample(50))
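
Another way to get a per-section sample on versions without `DataFrameGroupBy.sample` is to iterate over the groups and concatenate the pieces. This is a minimal sketch, not taken from the answers above; the toy frame `df_demo` is deliberately small (5 sections of 100 rows).

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(
    {"x": np.arange(500), "section": np.repeat(np.arange(5), 100)}
)

# Sample 50 rows from each section, then stitch the pieces back together.
pieces = [grp.sample(50) for _, grp in df_demo.groupby("section")]
sample_demo = pd.concat(pieces)
```

The concatenated result keeps the original row labels as its index.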
