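How to generate synthetic data with random values on a pandas dataframe?
I have a dataframe with 50K rows. I would like to replace 20% of the data with random values (given an interval for the random numbers). The purpose is to generate synthetic outliers to test algorithms. The following dataframe is a small part of the df that I have. The values that should be replaced with random outliers are in the 'value' column.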
import pandas as pd
dict = {'date':["2016-11-10", "2016-11-10", "2016-11-11", "2016-11-11","2016-11-11","2016-11-11","2016-11-11", "2016-11-11" ],
'time': ["22:00:00", "23:00:00", "00:00:00", "01:00:00", "02:00:00", "03:00:00", "04:00:00", "04:00:00"],
'value':[90, 91, 80, 87, 84,94, 91, 94]}
df = pd.DataFrame(dict)
print(df)
         date      time  value
0  2016-11-10  22:00:00     90
1  2016-11-10  23:00:00     91
2  2016-11-11  00:00:00     80
3  2016-11-11  01:00:00     87
4  2016-11-11  02:00:00     84
5  2016-11-11  03:00:00     94
6  2016-11-11  04:00:00     91
7  2016-11-11  04:00:00     94
For example, I would like to give an interval of random values from 1 to 50, and the desired df would look like this:
         date      time  value
0  2016-11-10  22:00:00     90
1  2016-11-10  23:00:00     91
2  2016-11-11  00:00:00     80
3  2016-11-11  01:00:00      4
4  2016-11-11  02:00:00     84
5  2016-11-11  03:00:00     94
6  2016-11-11  04:00:00     32
7  2016-11-11  04:00:00     94
I would appreciate any ideas. Thanks!
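You can do this in a few steps. As mentioned above, you shouldn't use dict as a variable name. I did so below only because I copied your input code.
This code generates a list of indices based on the replacement ratio and the length of the dataframe, then replaces the values at those positions with uniformly distributed random integers between 0 and 20, inclusive (the ratio is set to 0.50 here; use 0.20 to replace 20%):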
In [49]: df = pd.DataFrame(dict)
    ...: import random
    ...: replacement_ratio = 0.50
    ...: replacement_count = int(replacement_ratio * len(df))
    ...: replacement_idx = random.sample(range(len(df)), replacement_count)

In [50]: replacement_idx
Out[50]: [5, 2, 7, 6]

In [51]: for idx in replacement_idx:
    ...:     df.loc[idx, 'value'] = random.randint(0, 20)
    ...:

In [52]: df
Out[52]:
         date      time  value
0  2016-11-10  22:00:00     90
1  2016-11-10  23:00:00     91
2  2016-11-11  00:00:00      4
3  2016-11-11  01:00:00     87
4  2016-11-11  02:00:00     84
5  2016-11-11  03:00:00      4
6  2016-11-11  04:00:00     17
7  2016-11-11  04:00:00      8
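The same idea (pick an exact number of positions, then overwrite them) can be written without a Python-level loop. The following is only a sketch under a few assumptions: the helper name `replace_exact_fraction` and its parameters are hypothetical and not part of the answer above, and it uses NumPy's `Generator` API instead of the `random` module.

```python
import numpy as np
import pandas as pd

def replace_exact_fraction(df, col, frac, low, high, seed=None):
    """Return a copy of df in which exactly int(frac * len(df)) values of
    `col` are replaced by uniform random integers in [low, high], inclusive."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    n_replace = int(frac * len(out))
    # Draw row positions without replacement so no row is picked twice
    positions = rng.choice(len(out), size=n_replace, replace=False)
    # endpoint=True makes `high` inclusive, like random.randint
    new_values = rng.integers(low, high, size=n_replace, endpoint=True)
    out.iloc[positions, out.columns.get_loc(col)] = new_values
    return out

# Hypothetical demo data: 50 "normal" values, all >= 100
df_demo = pd.DataFrame({"value": np.arange(100, 150)})
df_out = replace_exact_fraction(df_demo, "value", 0.2, 1, 50, seed=0)
n_changed = int((df_out["value"] != df_demo["value"]).sum())
```

Because the demo values are all at least 100 and the replacements are at most 50, `n_changed` counts exactly the replaced rows. Working on a copy keeps the original dataframe available for comparison.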
This might work.
import numpy as np

outliers = []

def get_outlier(x):
    # collect the values whose absolute z-score exceeds the threshold
    num = 3
    mean_ = np.mean(x)
    std_ = np.std(x)
    for y in x:
        z_score = (y - mean_) / std_
        if np.abs(z_score) > num:
            outliers.append(y)
    return outliers

detect_outliers = get_outlier(df['value'])

# IQR fences: values below lb or above ub count as outliers
q1, q3 = np.percentile(df['value'], [25, 75])
iqr = q3 - q1
lb = q1 - (1.5 * iqr)
ub = q3 + (1.5 * iqr)

# replace every value outside the fences with a random integer in [1, 50)
for i in df.index:
    if (df.loc[i, 'value'] < lb) or (df.loc[i, 'value'] > ub):
        df.loc[i, 'value'] = np.random.randint(1, 50)
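The IQR check above can also be expressed with a boolean mask instead of a row-by-row loop. This is a minimal sketch, not the answer's own code: the function name `replace_outside_iqr` and the parameter `k` (the conventional 1.5 multiplier) are assumptions made for illustration. Note that this approach only touches values that already lie outside the fences, so on data without outliers, such as the eight sample rows in the question, it changes nothing.

```python
import numpy as np
import pandas as pd

def replace_outside_iqr(values, low, high, k=1.5, seed=None):
    """Replace values outside [q1 - k*iqr, q3 + k*iqr] with random
    integers in [low, high). Returns the new Series and the mask."""
    rng = np.random.default_rng(seed)
    s = pd.Series(values).copy()
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    lb, ub = q1 - k * iqr, q3 + k * iqr
    mask = (s < lb) | (s > ub)
    if mask.any():
        s[mask] = rng.integers(low, high, size=int(mask.sum()))
    return s, mask

# Hypothetical input: the question's values with one obvious outlier (300)
cleaned, mask = replace_outside_iqr([90, 91, 80, 87, 84, 94, 91, 300], 1, 50, seed=0)

# The question's original values contain nothing outside the fences
_, mask_original = replace_outside_iqr([90, 91, 80, 87, 84, 94, 91, 94], 1, 50)
```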
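Here is a numpy example that should be fast. The example with both high and low replacement values assumes you want to replace evenly with high and low values (50-50); if that is not the case, you can change p in mask_high = np.random.choice([0,1], p=[.5, .5], size=rand.shape).astype(bool) to whatever you want.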
import numpy as np
import pandas as pd

d = {'date':["2016-11-10", "2016-11-10", "2016-11-11", "2016-11-11","2016-11-11","2016-11-11","2016-11-11", "2016-11-11" ],
     'time': ["22:00:00", "23:00:00", "00:00:00", "01:00:00", "02:00:00", "03:00:00", "04:00:00", "04:00:00"],
     'value':[90, 91, 80, 87, 84,94, 91, 94]}
df = pd.DataFrame(d)

# create a function
def myFunc(df, replace_pct, start_range, stop_range, replace_col):
    # copy the column you want to replace into a writable array
    val = df[replace_col].to_numpy(copy=True)
    # create a boolean mask; each row is selected with probability replace_pct,
    # so the replaced share is only approximately replace_pct
    mask = np.random.choice([0, 1], p=[1 - replace_pct, replace_pct], size=val.shape).astype(bool)
    # create random ints in [start_range, stop_range), one per selected row
    rand = np.random.randint(start_range, stop_range, size=mask.sum())
    # replace values in the array
    val[mask] = rand
    # update column
    df[replace_col] = val
    return df

myFunc(df, .2, 1, 50, 'value')
         date      time  value
0  2016-11-10  22:00:00     90
1  2016-11-10  23:00:00     91
2  2016-11-11  00:00:00     80
3  2016-11-11  01:00:00     87
4  2016-11-11  02:00:00     46
5  2016-11-11  03:00:00     94
6  2016-11-11  04:00:00     91
7  2016-11-11  04:00:00     94
Timing
%%timeit
myFunc(df, .2, 1, 50, 'value')
397 µs ± 27.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Example with high and low replacement values
# create a function
def myFunc2(df, replace_pct, start_range_low, stop_range_low,
            start_range_high, stop_range_high, replace_col):
    # copy the column you want to replace into a writable array
    val = df[replace_col].to_numpy(copy=True)
    # create a boolean mask; each row is selected with probability replace_pct
    mask = np.random.choice([0, 1], p=[1 - replace_pct, replace_pct], size=val.shape).astype(bool)
    # create random ints in the low range, one per selected row
    rand = np.random.randint(start_range_low, stop_range_low, size=mask.sum())
    # create a mask for the higher range (50-50 split between low and high)
    mask_high = np.random.choice([0, 1], p=[.5, .5], size=rand.shape).astype(bool)
    # create random ints in the high range
    rand_high = np.random.randint(start_range_high, stop_range_high, size=mask_high.sum())
    # replace values in the rand array
    rand[mask_high] = rand_high
    # replace values in the array
    val[mask] = rand
    # update column
    df[replace_col] = val
    return df

myFunc2(df, .2, 1, 50, 200, 300, 'value')
         date      time  value
0  2016-11-10  22:00:00     90
1  2016-11-10  23:00:00    216
2  2016-11-11  00:00:00     80
3  2016-11-11  01:00:00     49
4  2016-11-11  02:00:00     84
5  2016-11-11  03:00:00     94
6  2016-11-11  04:00:00    270
7  2016-11-11  04:00:00     94
Timing
%%timeit
myFunc2(df, .2, 1, 50, 200, 300, 'value')
493 µs ± 41.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
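The two-range replacement can also be sketched by drawing one candidate from each range for every selected row and choosing between them with `np.where`. The names below (`inject_two_sided`, `p_high`, and the returned mask) are hypothetical and introduced only for illustration; the function works on a copy and returns the mask so the injected rows can be identified afterwards.

```python
import numpy as np
import pandas as pd

def inject_two_sided(df, col, frac, low_range, high_range, p_high=0.5, seed=None):
    """Replace roughly `frac` of `col` with integers drawn from
    low_range or high_range (each a (start, stop) pair, stop exclusive)."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    # Each row is selected independently with probability `frac`
    mask = rng.random(len(out)) < frac
    k = int(mask.sum())
    low_vals = rng.integers(*low_range, size=k)
    high_vals = rng.integers(*high_range, size=k)
    # For every selected row, pick the high candidate with probability p_high
    use_high = rng.random(k) < p_high
    out.loc[mask, col] = np.where(use_high, high_vals, low_vals)
    return out, mask

# Hypothetical demo data: 1000 rows with a constant "normal" value of 100
base = pd.DataFrame({"value": np.full(1000, 100)})
out, mask = inject_two_sided(base, "value", 0.2, (1, 50), (200, 300), seed=0)
changed = out.loc[mask, "value"]
```

As with the mask-based functions above, the share of replaced rows is only approximately `frac`, because each row is selected independently.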
Another attempt, using DataFrame.sample().
import numpy as np
import pandas as pd
d = {'date':["2016-11-10", "2016-11-10", "2016-11-11", "2016-11-11","2016-11-11","2016-11-11","2016-11-11", "2016-11-11" ],
'time': ["22:00:00", "23:00:00", "00:00:00", "01:00:00", "02:00:00", "03:00:00", "04:00:00", "04:00:00"],
'value':[90, 91, 80, 87, 84,94, 91, 94]}
df = pd.DataFrame(d)
random_rows = df.sample(frac=.2) # 20% random rows from `df`
# we are replacing these 20% random rows with values from 1..50 and 200..300 (in ~1:1 ratio)
random_values = np.random.choice(
    np.concatenate([np.random.randint(1, 50, size=len(random_rows) // 2 + 1),
                    np.random.randint(200, 300, size=len(random_rows) // 2 + 1)]),
    size=len(random_rows))
df.loc[random_rows.index, 'value'] = random_values
print(df)
This prints (for example):
         date      time  value
0  2016-11-10  22:00:00     31 <-- 31
1  2016-11-10  23:00:00     91
2  2016-11-11  00:00:00     80
3  2016-11-11  01:00:00     87
4  2016-11-11  02:00:00     84
5  2016-11-11  03:00:00    236 <-- 236
6  2016-11-11  04:00:00     91
7  2016-11-11  04:00:00     94
A similar answer using sample:
Example df:
import numpy as np
import pandas as pd

# "H" means hourly; newer pandas versions prefer the lowercase alias "h"
df = pd.DataFrame({"time_col": pd.date_range("2018-01-01", "2019-01-01", freq="H")})
df["date"], df["time"] = df["time_col"].dt.date, df["time_col"].dt.hour
df["value"] = np.random.randint(100, 150, df.shape[0])

# the seed fixes only which rows are selected; the values from np.random are not seeded here
seed = 11  # deterministic behavior, that's what heroes do
rnd_rows_idx = df.sample(frac=0.2, random_state=seed).index  # grabbing indexes
original_rows = df.loc[rnd_rows_idx, "value"]  # keeping a trace of original values

### Replacing the values selected at random ###
df.loc[rnd_rows_idx, "value"] = np.random.randint(1, 50, rnd_rows_idx.shape[0])
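Since the stated goal is to test an algorithm, it may help to store which rows were altered as a ground-truth label next to the data. The sketch below assumes a hypothetical column name `is_outlier` and a deliberately trivial threshold "detector"; neither comes from the answers above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(11)

# Hypothetical "normal" data: 1000 integers in [100, 150)
df = pd.DataFrame({"value": rng.integers(100, 150, size=1000)})

# Select 20% of the rows and remember them in a label column
idx = df.sample(frac=0.2, random_state=11).index
df["is_outlier"] = False
df.loc[idx, "is_outlier"] = True

# Overwrite the selected rows with synthetic outliers in [1, 50)
df.loc[idx, "value"] = rng.integers(1, 50, size=len(idx))

# Toy detector (an assumption for illustration): flag everything below 100
predicted = df["value"] < 100
true_positives = int((predicted & df["is_outlier"]).sum())
recall = true_positives / int(df["is_outlier"].sum())
precision = true_positives / int(predicted.sum())
```

With the label column in place, any detection algorithm can be scored the same way by swapping in its own `predicted` boolean Series.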