Python - Sampling rows from a data frame without replacement
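I want to sample rows from a pandas data frame without replacement. Here is what I mean: in each iteration of a for loop, I sample a certain number of rows from COMBINED without replacement. I want to make sure that, across all 50,000 iterations, I never sample the same row twice. My code below attempts to solve this sampling problem, but I am getting an error.
COMBINED, TEMP, MERGED, SAMPLE, SAMPLE_2 and PROBABILITY_GENERATED_POISSON are data frames. lst is a list.
Please see my code below: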
#FOR LOOP TO SAMPLE FROM COMBINED BASED ON NUMBER OF EVENTS PER YEAR
#AVOIDING REPEATED SAMPLING OF SAME EVENTS
for i in range(50000):
    #IF THERE ARE NO EVENTS FOR THAT PARTICULAR YEAR, THERE WILL BE NO EVENT NUMBER AND NO LOSS
    if PROBABILITY_GENERATED_POISSON.iloc[i,:].item == 0:
        lst.append(0)
    #IF THERE ARE MORE THAN 0 EVENTS FOR THAT YEAR, FOLLOW THE BELOW PROCESS
    else:
        SAMPLE = COMBINED.sample(n = PROBABILITY_GENERATED_POISSON.iloc[i,:],
                                 replace = False,
                                 weights = LOSS_EVENT_SAMPLE_PROBABILITY,
                                 axis = 0)
        SAMPLE['Sample'] = i
        #CREATE TEMP DATA FRAME WHICH CONSISTS OF ALL ROWS SAMPLED IN PREVIOUS ITERATIONS
        #except FUNCTION IS FOR ERROR HANDLING - IT PREVENTS THE LOOP FROM STOPPING MIDWAY
        try:
            TEMP = pd.DataFrame(lst)
            #PERFORM AN INNER JOIN - SELECTING COMMON ROWS FROM TEMP AND SAMPLE
            MERGED = TEMP.merge(SAMPLE, how = "inner")
            #AVOIDING DUPLICATION WITHIN LIST
            #IF THERE ARE NO COMMON ROWS (nrow(MERGED) == 0), THEN INPUT SAMPLE INTO lst
            if MERGED.shape[0] == 0:
                lst.append(SAMPLE)
            else:
                #IF THERE ARE COMMON ROWS (nrow(MERGED) > 0), THEN SAMPLE AGAIN, BUT AFTER EXCLUDING THE COMMON ROWS FROM
                #THE COMBINED DATA FRAME. BY EXCLUDING THE COMMON ROWS, WE ENSURE THAT WE ARE NOT SAMPLING ROWS WHICH
                #WERE SAMPLED IN PREVIOUS ITERATIONS.
                COMBINED_2 = COMBINED.subtract(SAMPLE)
                SAMPLE_2 = COMBINED_2.sample(n = PROBABILITY_GENERATED_POISSON.iloc[i,:],
                                             replace = False,
                                             weights = LOSS_EVENT_SAMPLE_PROBABILITY,
                                             axis = 0)
                SAMPLE_2['Sample'] = i
                lst.append(SAMPLE_2)
        except:
            continue
    print(i)
The error I get is attached in the image.
I would like some feedback on my problem.
Thanks.
Here are two ways to solve this:
- A solution using the pandas .sample function
    n = 50000
    COMBINED.sample(n, replace=False)
- A solution using the same simple algorithm as .sample()
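    # use the diamonds dataset to illustrate and test the algorithm
    import seaborn as sns
    import pandas as pd
    df_input = sns.load_dataset('diamonds')
    # pool we sample from; copy it so that df_input itself is left unchanged
    df_temp = df_input.copy()
    n_samples = 1000
    samples = []
    for _ in range(n_samples):
        sample = df_temp.sample(1)
        # drop the sampled row from the pool so it cannot be drawn again
        df_temp.drop(index=sample.index, inplace=True)
        samples.append(sample)
    # DataFrame.append was removed in pandas 2.0, so concatenate once at the end
    df = pd.concat(samples)
    # every index label must appear exactly once
    assert df.index.is_unique
    df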
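The first approach above can be extended to the per-iteration setting of the question. The following is only a sketch under assumptions: it supposes that all rows may be drawn up front in a single call to .sample and then labelled by iteration, and that the per-iteration counts are available as a plain list of integers. The function name sample_once_and_split and the toy data are hypothetical and are not part of the original code.

```python
import numpy as np
import pandas as pd

def sample_once_and_split(pool, counts, weights=None, seed=None):
    """Draw sum(counts) distinct rows in one call, then label them by iteration.

    `pool`, `counts`, `weights` and `seed` are illustrative names (assumptions).
    """
    counts = np.asarray(counts, dtype=int)
    total = int(counts.sum())
    if total > len(pool):
        raise ValueError("not enough rows to sample without replacement")
    # A single call with replace=False cannot return the same row twice.
    drawn = pool.sample(n=total, replace=False, weights=weights,
                        random_state=seed).copy()
    # Repeat each iteration number once per event, so the first counts[0]
    # drawn rows belong to iteration 0, the next counts[1] to iteration 1, etc.
    drawn["Sample"] = np.repeat(np.arange(len(counts)), counts)
    return drawn

# Toy data standing in for COMBINED and the Poisson event counts.
combined = pd.DataFrame({"event_id": range(20),
                         "loss": np.linspace(1.0, 20.0, 20)})
event_counts = [2, 0, 3, 1]
result = sample_once_and_split(combined, event_counts, seed=0)
```

Iterations with zero events simply get no rows, which replaces the lst.append(0) branch of the original loop.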
I fixed the error. PROBABILITY_GENERATED_POISSON needed to be a list.
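For illustration, here is one way the loop could look once the counts are held in a plain list. This is a minimal sketch, not the asker's final code: it assumes the pool has a unique index and that the sampling weights live in a column of the pool. The names sample_per_iteration, weight_col and prob are hypothetical. Rows drawn in one iteration are dropped from the pool, so no merge or resampling step is needed.

```python
import numpy as np
import pandas as pd

def sample_per_iteration(pool, counts, weight_col=None, seed=None):
    """Sample counts[i] rows in iteration i, never reusing a row.

    Assumes `pool` has a unique index; `weight_col` is an optional
    column name holding sampling weights (both are assumptions).
    """
    rng = np.random.RandomState(seed)
    remaining = pool.copy()          # shrinking pool of unsampled rows
    pieces = []
    for i, n in enumerate(counts):
        n = int(n)
        if n == 0:
            continue                 # no events in this iteration
        if n > len(remaining):
            raise ValueError(f"iteration {i}: only {len(remaining)} rows left")
        picked = remaining.sample(n=n, replace=False,
                                  weights=weight_col, random_state=rng)
        # Remove the picked rows so later iterations cannot draw them again.
        remaining = remaining.drop(index=picked.index)
        pieces.append(picked.assign(Sample=i))
    if not pieces:
        return pool.iloc[0:0].assign(Sample=pd.Series(dtype="int64"))
    return pd.concat(pieces)

# Toy data standing in for COMBINED and the list of Poisson counts.
combined = pd.DataFrame({"event_id": range(20),
                         "prob": np.linspace(0.1, 2.0, 20)})
poisson_counts = [2, 0, 3, 1]
result = sample_per_iteration(combined, poisson_counts,
                              weight_col="prob", seed=0)
```

Because the weights are read from the shrinking pool, pandas renormalises them over the rows that are still available in each iteration.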