Loop function for missing data

I want to replace NaN values using np.random.normal(mu, s, n) and a list comprehension, but I can't get it to work.

df_column_values = ["NaN","1","NaN","2","NaN","3","94","4","168","5","NaN"]

n, mu, sigma = 700, 155, 118
array = np.random.normal(mu, sigma, n)
for i in array:
    if i > 0 and i < 400:    
        data['Insulin'].replace(0,(i), inplace=True)  

This code runs, but every NaN value ends up with the same replacement. How can I improve it?

The original data comes from Kaggle.
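The identical values come from how replace() works: it swaps every matching entry in one call, so the first in-range draw fills all of the zeros and later draws find nothing left to change. A minimal reproduction, assuming zeros mark the missing entries as in the loop above; the Series `s` and the seed are hypothetical and only there to make the sketch repeatable:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # assumed seed, only for reproducibility
s = pd.Series([0.0, 1.0, 0.0, 2.0, 0.0])  # hypothetical column; 0 marks missing

for i in rng.normal(155, 118, 700):
    if 0 < i < 400:
        # replace() swaps every remaining 0 at once, so after the first
        # in-range draw there are no zeros left for later draws to change.
        s = s.replace(0, i)

# The three formerly missing positions now all hold the same number.
filled = s.iloc[[0, 2, 4]]
print(filled.nunique())
```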

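It looks like you want to replace the missing values with normally distributed random values in the range (0, 400). For that you need a truncated normal distribution.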

Then create a vector of random variates with the same length as the data you may need to replace.

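import numpy as np
import pandas as pd
import scipy.stats as stats

data = pd.DataFrame({'Insulin': ["NaN","1","NaN","2","NaN","3",
                                 "94","4","168","5","NaN"]})

lower, upper = 0, 400
mu, sigma = 155, 118
# truncnorm takes its bounds in standard-deviation units relative to loc
X = stats.truncnorm(
    (lower - mu) / sigma,
    (upper - mu) / sigma,
    loc=mu, scale=sigma)

# Draw one value per row, then keep the draw only where the entry is the string "NaN"
data['Insulin'] = np.where(
    data['Insulin'] == "NaN",
    X.rvs(len(data)),
    data['Insulin'])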

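# If the missing entries are real NaN values rather than the string "NaN", test with isna() instead
data['Insulin'] = np.where(
    data['Insulin'].isna(),
    X.rvs(len(data)),
    data['Insulin'])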

print(data)
       Insulin
0    59.069239
1            1
2   113.143013
3            2
4    63.488282
5            3
6           94
7            4
8          168
9            5
10  109.272469
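Note that the column above stays a mix of strings and floats. A possible variant, offered as a sketch and not as the only way to do it, converts the column to numbers first and draws exactly one value per missing entry. It assumes the missing entries are the string "NaN" as in the sample; the name `dist` and the `random_state=0` seed are illustrative choices that are not part of the original answer.

```python
import numpy as np
import pandas as pd
from scipy import stats

data = pd.DataFrame({"Insulin": ["NaN", "1", "NaN", "2", "NaN", "3",
                                 "94", "4", "168", "5", "NaN"]})

# Convert to numbers; the string "NaN" parses to a real missing value.
data["Insulin"] = pd.to_numeric(data["Insulin"], errors="coerce")

lower, upper = 0, 400
mu, sigma = 155, 118
dist = stats.truncnorm((lower - mu) / sigma, (upper - mu) / sigma,
                       loc=mu, scale=sigma)

missing = data["Insulin"].isna()
# One independent draw per missing entry; the seed is only for reproducibility.
data.loc[missing, "Insulin"] = dist.rvs(size=int(missing.sum()), random_state=0)

print(data)
```

Because the draws are assigned through the boolean mask, the existing measurements are left untouched and the column ends up as a plain float column.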