CDF x value at 50% and mean don't show the same number

Question

I have a DataFrame, and I created a CDF of its days column:

...
#create DF from SQL
df = pd.read_sql_query(query, conn)

days = df['days'].dropna()

#create CDF definition
def ecdf(data):
    n = len(data)
    x = np.sort(data)
    y = np.arange(1.0, n+1) / n
    return x, y

#unpack x and y
x, y = ecdf(days)
sns.set()

#plot CDF
plt.plot(x, y, marker='.', linestyle='none')

#Overlay quartiles
percentiles = np.array([25, 50, 75])
x_p = np.percentile(days, percentiles)
y_p = percentiles / 100.0
plt.plot(x_p, y_p, marker='D', color='red', linestyle='none')  # Overlay percentiles

#get current axes and annotate each quartile point with its x value
ax = plt.gca()
for xq, yq in zip(x_p, y_p):  # separate names so the ECDF arrays x, y are not overwritten
    ax.annotate('%s' % xq, xy=(xq, yq), xytext=(15, 0), textcoords='offset points')

At the 50% mark, the data point in the CDF overlay shows 120 as the mean, but print(np.mean(df['days_to_engaged'])) gives me 154.

Why is there a discrepancy?

print(df['days'].dropna()):

Answer 1

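You are comparing the median with the mean. The 50% point of a CDF is the median, not the mean, and the two only coincide when the distribution is symmetric. It boils down to the following: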

a = np.array([1, 1, 2, 4])

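For this array, the ECDF reaches 50% at the second sorted element (1), whereas the mean is (4 + 2 + 1 + 1) / 4 == 2. The single large value 4 pulls the mean up without moving the 50% point.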
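
To make the comparison concrete, here is a minimal sketch that reuses the ecdf function from the question on the array above. The name x_at_half and the array fake_days are introduced only for illustration; fake_days holds made-up values chosen so that the median is 120 and the mean is 154, and it is not the asker's actual data. Note that np.percentile interpolates between neighbouring values by default, so it can differ slightly from the raw ECDF point.

```python
import numpy as np

def ecdf(data):
    """Return the sorted values and their cumulative proportions."""
    n = len(data)
    x = np.sort(data)
    y = np.arange(1.0, n + 1) / n
    return x, y

a = np.array([1, 1, 2, 4])
x, y = ecdf(a)

# First sorted value whose cumulative proportion reaches 0.5
x_at_half = x[np.searchsorted(y, 0.5)]

median = np.percentile(a, 50)  # default method interpolates between 1 and 2
mean = np.mean(a)

print(x_at_half, median, mean)

# Hypothetical right-skewed sample (illustrative values only, not real data):
# one large value raises the mean well above the median.
fake_days = np.array([30, 60, 120, 150, 410])
print(np.median(fake_days), np.mean(fake_days))
```

If the mean should also appear on the plot, it would have to be computed separately with np.mean and drawn as its own marker or line, since none of the percentile points corresponds to it in general.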

python

numpy

python-2.7

pandas

cdf