CDF in Python not displaying correctly
Good morning,
In Python, I have a dictionary (called packet_size_dist) with the following values:
34 => 0.00909909009099
42 => 0.02299770023
54 => 0.578742125787
58 => 0.211278872113
62 => 0.00529947005299
66 => 0.031796820318
70 => 0.0530946905309
74 => 0.0876912308769
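Note that the values sum to 1.
I'm trying to generate a CDF, which I managed to do, but it looks wrong, and I wonder whether I'm generating it incorrectly. The code in question is: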
sorted_p = sorted(packet_size_dist.items(), key=operator.itemgetter(0))
yvals = np.arange(len(sorted_p))/float(len(sorted_p))
plt.plot(sorted_p, yvals)
plt.show()
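But the resulting graph looks like this:
This doesn't seem to match the values in the dictionary. Any ideas? I also see a faint green line on the left of the graph, and I don't know what it is. For example, the graph shows packets of size 70 occurring about 78% of the time, whereas in my dictionary they occur 5% of the time.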
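This is not a direct answer to your question. However, I thought I should point out that your data come from a discrete random variable (not a continuous one), so in some contexts representing them with a series of connected line segments can be misleading. The representation in "cumulative distribution function" may be a bit of overkill. I offer the following simplification.
An 'x' marks a truncation. A dot marks the closed end of a closed-open interval.
Here is the code. I didn't think to use np.cumsum!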
import numpy as np
import pylab as pl
from matplotlib import collections as mc
p = [0.00909909009099,0.02299770023,0.578742125787,0.211278872113,0.00529947005299,0.031796820318,0.0530946905309,0.0876912308769]
cumSums = [0] + [sum(p[:i]) for i in range(1,len(p)+1)]
counts = [30,34,42,54,58,62,66,70,74,80]
lines =[[(counts[i],cumSums[i]),(counts[i+1],cumSums[i])] for i in range(-1+len(counts))]
lc = mc.LineCollection(lines, linewidths=2)
fig, ax = pl.subplots()
ax.add_collection(lc)
pl.plot([30, 80],[0, 1],'bx')
pl.plot(counts[1:-1], cumSums[1:], 'bo')
ax.autoscale()
ax.margins(0.1)
pl.show()
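As a side note on the np.cumsum remark and the closed-open intervals above, here is a minimal sketch of evaluating the step CDF at an arbitrary packet size. The helper name cdf and the use of np.searchsorted are my own choices for illustration, not part of the answer above; the sizes and probabilities are copied from the question.

```python
import numpy as np

# Packet sizes and probabilities copied from the question.
sizes = np.array([34, 42, 54, 58, 62, 66, 70, 74])
p = np.array([0.00909909009099, 0.02299770023, 0.578742125787,
              0.211278872113, 0.00529947005299, 0.031796820318,
              0.0530946905309, 0.0876912308769])

# Running total of the probabilities (same values as the list
# comprehension with sum(p[:i]), without the leading 0).
cum_sums = np.cumsum(p)

def cdf(x):
    """Return P(size <= x). Hypothetical helper, for illustration only."""
    # side="right" counts how many sizes are <= x, so the jump at each
    # size belongs to the interval that starts there (closed on the left,
    # open on the right).
    k = np.searchsorted(sizes, x, side="right")
    return 0.0 if k == 0 else float(cum_sums[k - 1])

print(cdf(33))    # below the smallest size
print(cdf(54))    # exactly at a jump
print(cdf(57.9))  # any x in [54, 58) gives the same value as cdf(54)
```

The value stays flat between two sizes and jumps exactly at each size, which is what the horizontal segments and dots in the plot are meant to show.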
This is more like the plot you wanted. (Corrected, I hope.)
The code:
import numpy as np
import pylab as pl
from matplotlib import collections as mc
from sys import exit
p = [0.00909909009099,0.02299770023,0.578742125787,0.211278872113,0.00529947005299,0.031796820318,0.0530946905309,0.0876912308769]
cumSums = [sum(p[:i]) for i in range(1,len(p)+1)]
counts = [34,42,54,58,62,66,70,74]
lines = [[(counts[i],cumSums[i]),(counts[i+1],cumSums[i+1])] for i in range(-1+len(p))]
lc = mc.LineCollection(lines, linewidths=2)
fig, ax = pl.subplots()
ax.add_collection(lc)
ax.autoscale()
ax.margins(0.1)
pl.show()
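Using numpy makes everything easier. First, convert your dictionary into a two-column numpy array. Then sort it by its first column. Finally, compute the cumulative sum of the second column and plot it against the first column.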
dic = { 34 : 0.00909909009099,
42 : 0.02299770023,
54 : 0.578742125787,
58 : 0.211278872113,
62 : 0.00529947005299,
66 : 0.031796820318,
70 : 0.0530946905309,
74 : 0.0876912308769 }
import numpy as np
import matplotlib.pyplot as plt
data = np.array([[k,v] for k,v in dic.items()]) # dic.items() works on Python 2 and 3; iteritems() is Python 2 only
data = data[data[:,0].argsort()]
cdf = np.cumsum(data[:,1])
plt.plot(data[:,0], cdf)
plt.show()
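For what it's worth, here is a small sketch of why the snippet in the question gives odd results, under my reading of that code. It only inspects the arrays and does not plot anything; the variable names xy and idx_70 are mine.

```python
import numpy as np

# Dictionary copied from the question.
packet_size_dist = {34: 0.00909909009099, 42: 0.02299770023,
                    54: 0.578742125787, 58: 0.211278872113,
                    62: 0.00529947005299, 66: 0.031796820318,
                    70: 0.0530946905309, 74: 0.0876912308769}

# Same two steps as in the question.
sorted_p = sorted(packet_size_dist.items())
yvals = np.arange(len(sorted_p)) / float(len(sorted_p))

# sorted_p is a list of (size, probability) pairs, so as an array it has
# two columns. Given a 2D x, plt.plot draws one line per column: sizes
# against yvals, and probabilities against yvals. The second line sits
# near x = 0, which is probably the faint line on the left.
xy = np.array(sorted_p)
print(xy.shape)

# yvals is just rank / n; the probabilities are never used.
idx_70 = [size for size, _ in sorted_p].index(70)
print(yvals[idx_70])

# The cumulative sum of the probabilities is what a CDF should show.
cdf = np.cumsum(xy[:, 1])
print(cdf[idx_70])
```

Size 70 is the seventh of eight keys, so yvals gives 6/8 = 0.75 there, which is close to the roughly 78% read off the graph. The cumulative probability at 70 is about 0.912, and 0.053 is only the height of the jump at that point.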