Python - Sampling rows from a data frame without replacement

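I want to sample rows from a pandas data frame without replacement. Here is what I mean: in each iteration of a for loop, I sample a certain number of rows from COMBINED without replacement. I want to make sure that across all 50,000 iterations I never sample the same row twice. My code below tries to solve this sampling problem, but I get an error.

COMBINED, TEMP, MERGED, SAMPLE, SAMPLE_2 and PROBABILITY_GENERATED_POISSON are data frames. lst is a list.

Please see my code below: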

#FOR LOOP TO SAMPLE FROM COMBINED BASED ON NUMBER OF EVENTS PER YEAR
#AVOIDING REPEATED SAMPLING OF SAME EVENTS
for i in range(50000):
    #IF THERE ARE NO EVENTS FOR THAT PARTICULAR YEAR, THERE WILL BE NO EVENT NUMBER AND NO LOSS
    if PROBABILITY_GENERATED_POISSON.iloc[i,:].item == 0:
        lst.append(0)
    #IF THERE ARE MORE THAN 0 EVENTS FOR THAT YEAR, FOLLOW THE BELOW PROCESS 
    else:
        SAMPLE = COMBINED.sample(n = PROBABILITY_GENERATED_POISSON.iloc[i,:], 
                                 replace = False,
                                 weights = LOSS_EVENT_SAMPLE_PROBABILITY,
                                 axis = 0)
        SAMPLE['Sample'] = i
        #CREATE TEMP DATA FRAME WHICH CONSISTS OF ALL ROWS SAMPLED IN PREVIOUS ITERATIONS
        #except FUNCTION IS FOR ERROR HANDLING - IT PREVENTS THE LOOP FROM STOPPING MIDWAY
        try:
            TEMP = pd.DataFrame(lst)
            #PERFORM AN INNER JOIN - SELECTING COMMON ROWS FROM TEMP AND SAMPLE
            MERGED = TEMP.merge(SAMPLE, how = "inner")
            #AVOIDING DUPLICATION WITHIN LIST
            #IF THERE ARE NO COMMON ROWS (nrow(MERGED) == 0), THEN INPUT SAMPLE INTO lst
            if MERGED.shape[0] == 0:
                lst.append(SAMPLE)
            else:
                #IF THERE ARE COMMON ROWS (nrow(MERGED) > 0), THEN SAMPLE AGAIN, BUT AFTER EXCLUDING THE COMMON ROWS FROM 
                #THE COMBINED DATA FRAME. BY EXCLUDING THE COMMON ROWS, WE ENSURE THAT WE ARE NOT SAMPLING ROWS WHICH
                #WERE SAMPLED IN PREVIOUS ITERATIONS.
                COMBINED_2 = COMBINED.subtract(SAMPLE)
                SAMPLE_2 = COMBINED_2.sample(n = PROBABILITY_GENERATED_POISSON.iloc[i,:], 
                                 replace = False,
                                 weights = LOSS_EVENT_SAMPLE_PROBABILITY,
                                 axis = 0)
                SAMPLE_2['Sample'] = i
                lst.append(SAMPLE_2)
        except:
            continue
    
    print(i)

The error I get is attached as an image.

I would appreciate some feedback on my problem.

Thanks.

Here are two ways to solve this:

  1. A solution using the pandas .sample function
n = 50000
COMBINED.sample(n, replace=False)
  2. A solution using a simple algorithm equivalent to .sample()
# use the diamonds dataset to illustrate and test the algorithm
import seaborn as sns
import pandas as pd

df_input = sns.load_dataset('diamonds')

df_temp = df_input.copy()  # pool we sample from; copy so df_input stays intact
n_samples = 1000
samples = []
for _ in range(n_samples):
    sample = df_temp.sample(1)
    df_temp = df_temp.drop(index=sample.index)  # remove the drawn row from the pool
    samples.append(sample)
df = pd.concat(samples)  # DataFrame.append was removed in pandas 2.0

# no index label may appear more than once
assert (df.index.value_counts() > 1).sum() == 0
df
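
For the loop in the question, the same idea can probably be applied in one pass: draw all the rows needed across every iteration with a single `.sample(replace=False)` call, then hand consecutive slices to each iteration. This is only a minimal sketch. It assumes a small toy frame in place of COMBINED and a plain list of per-iteration counts in place of PROBABILITY_GENERATED_POISSON; the names `combined`, `counts`, and `sample_in_chunks` are made up for illustration and do not come from the question.

```python
import pandas as pd

def sample_in_chunks(frame, counts, weights=None, random_state=None):
    """Draw sum(counts) distinct rows once, then split them per iteration."""
    total = int(sum(counts))
    if total > len(frame):
        raise ValueError("not enough rows to sample without replacement")
    # One draw without replacement guarantees no row is picked twice overall.
    drawn = frame.sample(n=total, replace=False, weights=weights,
                         random_state=random_state)
    chunks = []
    start = 0
    for i, n in enumerate(counts):
        chunk = drawn.iloc[start:start + n].copy()
        chunk["Sample"] = i  # label rows with the iteration they belong to
        chunks.append(chunk)
        start += n
    return pd.concat(chunks)

# Hypothetical stand-ins for COMBINED and the per-year event counts.
combined = pd.DataFrame({"event_id": range(100),
                         "loss": [10.0 * k for k in range(100)]})
counts = [0, 3, 1, 0, 5]
result = sample_in_chunks(combined, counts, random_state=0)
```

Iterations with a count of 0 simply contribute no rows, and the up-front length check surfaces the case where the requested total exceeds the number of available rows.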

I fixed the error. PROBABILITY_GENERATED_POISSON needed to be a list.
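
This fits how `DataFrame.sample` treats `n`: it expects a single integer, whereas `PROBABILITY_GENERATED_POISSON.iloc[i,:]` is a whole row. A minimal sketch of the conversion, assuming the counts sit in the first column of a one-column frame; the lowercase names and toy values are illustrative only.

```python
import pandas as pd

# Toy stand-in for PROBABILITY_GENERATED_POISSON: one event count per year.
poisson_counts = pd.DataFrame({"events": [0, 2, 1]})
# Toy stand-in for COMBINED.
combined = pd.DataFrame({"event_id": range(10)})

# Convert the single column to a plain list of Python ints.
counts = poisson_counts.iloc[:, 0].astype(int).tolist()

samples = []
for i, n in enumerate(counts):
    if n == 0:
        continue  # no events in this year, nothing to sample
    # n is now a plain int, which is what .sample expects
    sample = combined.sample(n=n, replace=False, random_state=i).assign(Sample=i)
    samples.append(sample)
```

With a list, the zero check also becomes a direct `n == 0`. In the original loop, `.item == 0` compares the method object itself to 0 because the method is never called, so that branch can never be taken. This sketch only addresses the type of `n`; it does not by itself prevent the same row from being drawn in two different iterations.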