CDF x value at 50% and the mean don't show the same number
I have a dataframe, and I created a CDF of its days column:
...
# create DF from SQL
df = pd.read_sql_query(query, conn)
days = df['days'].dropna()

# create CDF definition
def ecdf(data):
    n = len(data)
    x = np.sort(data)
    y = np.arange(1.0, n + 1) / n
    return x, y

# unpack x and y
x, y = ecdf(days)
sns.set()

# plot CDF
ax = plt.plot(x, y, marker='.', linestyle='none')

# overlay quartiles
percentiles = np.array([25, 50, 75])
x_p = np.percentile(days, percentiles)
y_p = percentiles / 100.0
ax = plt.plot(x_p, y_p, marker='D', color='red', linestyle='none')  # overlay percentiles

# get current axes and annotate the quartile points
ax = plt.gca()
for x, y in zip(x_p, y_p):
    ax.annotate('%s' % x, xy=(x, y), xytext=(15, 0), textcoords='offset points')
At the 50% mark, the overlaid point on the CDF shows a mean of 120, but print(np.mean(df['days_to_engaged'])) gives me 154.
Why is there a discrepancy?
Output of print(df['days'].dropna()):
389
350
130
344
392
92
51
28
309
357
64
380
332
109
284
105
50
66
156
116
75
315
155
34
155
241
320
50
97
41
274
99
133
95
306
62
187
56
110
338
102
285
386
231
238
145
216
148
105
368
176
155
106
107
36
16
28
6
322
95
122
82
64
35
72
214
192
91
117
277
101
159
96
325
79
154
314
142
147
138
48
50
178
146
224
282
141
75
151
93
135
82
125
111
49
113
165
19
118
105
92
133
77
54
72
34
You are comparing the median with the mean. It boils down to this:
a = np.array([1, 1, 2, 4])
The 50% point of the ecdf is just the second element (1), whereas the mean is (4 + 2 + 1 + 1) / 4 == 2.
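To make the toy example concrete, here is a minimal sketch that runs the question's ecdf on the four-element array and compares the value at the 50% mark with the median and the mean. The name x_at_half and the use of np.searchsorted to read the value off the curve are my own choices for illustration, not something from the original code.

```python
import numpy as np

def ecdf(data):
    """Sorted values x and cumulative fractions y, as defined in the question."""
    n = len(data)
    x = np.sort(data)
    y = np.arange(1.0, n + 1) / n
    return x, y

a = np.array([1, 1, 2, 4])
x, y = ecdf(a)

# First sorted value whose cumulative fraction reaches 50%
# (x_at_half is a hypothetical name used only for this sketch).
x_at_half = x[np.searchsorted(y, 0.5)]

median = np.median(a)        # same value as np.percentile(a, 50)
mean = np.mean(a)

print(x_at_half, median, mean)
```

Here the ECDF reaches 50% at the value 1, np.percentile interpolates between the two middle values by default and reports 1.5, and the mean is 2.0. Neither 50% figure is the mean: for right-skewed data like the days column, a handful of large values pull the mean above the median, which would be consistent with seeing 120 on the plot and 154 from np.mean.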