Python: compute the histogram of a set of data

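The Python function below is meant to compute a histogram of the data using equal-sized bins. I expect the correct result

[1, 6, 4, 6]

but when I run the code, I get

[7, 12, 17, 17]

which is incorrect. Does anyone know how to fix it?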

# Computes the histogram of a set of data
def histogram(data, num_bins):

    # Find what range the data spans, and use it to calculate the bin size.
    span = max(data) - min(data)
    bin_size = span / num_bins

    # Calculate the thresholds for each bin.
    thresholds = [0] * num_bins
    for i in range(num_bins):
        thresholds[i] += bin_size * (i+1)

    # Compute the histogram
    counts = [0] * num_bins
    for datum in data:
        # Increment the count of the bin that the datum falls in
        for bin_index, threshold in enumerate(thresholds):
            if datum <= threshold:
                counts[bin_index] += 1
    return counts

# Some random data
data = [-3.2, 0, 1, 1.5, 1.6, 1.9, 5, 6, 9, 1, 4, 5, 8, 9, 5, 6.7, 9]
print("Correct result:\t" + str([1, 6, 4, 6]))
print("Your result:\t" + str(histogram(data, num_bins=4)))

If you just need a histogram, use numpy:

import numpy as np
np.histogram([-3.2, 0, 1, 1.5, 1.6, 1.9, 5, 6, 9, 1, 4, 5, 8, 9, 5, 6.7, 9],4)
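For reference, np.histogram returns a pair (counts, bin edges) rather than the counts alone. A minimal sketch of unpacking it for the data in the question, assuming numpy is installed:

```python
import numpy as np

data = [-3.2, 0, 1, 1.5, 1.6, 1.9, 5, 6, 9, 1, 4, 5, 8, 9, 5, 6.7, 9]

# np.histogram returns the per-bin counts and the num_bins + 1 bin edges.
counts, edges = np.histogram(data, bins=4)

print(counts.tolist())  # [1, 6, 4, 6]
print(edges)            # 5 evenly spaced edges from min(data) to max(data)
```

Note that numpy treats every bin as half-open on the right except the last one, which also includes max(data). That matches the expected result here because no value falls exactly on an interior edge.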

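That said, your code has two logic errors:

(1) The thresholds are calculated relative to 0 instead of min(data).

(2) There is no break once the matching bin is found, so each datum is counted in every bin whose threshold lies above it.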

def histogram(data, num_bins):
  span = max(data) - min(data)
  bin_size = float(span) / num_bins
  thresholds = [0] * num_bins

  for i in range(num_bins):
    # Changed: offset the thresholds by min(data)
    thresholds[i] = min(data) + bin_size * (i+1)

  counts = [0] * num_bins
  for datum in data:
    for bin_index, threshold in enumerate(thresholds):
      if datum <= threshold:
        counts[bin_index] += 1
        # Added: stop at the first bin that matches
        break
  return counts

data = [-3.2, 0, 1, 1.5, 1.6, 1.9, 5, 6, 9, 1, 4, 5, 8, 9, 5, 6.7, 9]
print("Correct result:\t" + str([1, 6, 4, 6]))
print("Your result:\t" + str(histogram(data, num_bins=4)))
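To see where [7, 12, 17, 17] comes from, it may help to reproduce the two errors in isolation. The sketch below is only an illustration (the names zero_based and cumulative are made up for it): thresholds measured from 0, combined with counting a datum in every bin whose threshold it does not exceed, give cumulative counts against the wrong edges.

```python
data = [-3.2, 0, 1, 1.5, 1.6, 1.9, 5, 6, 9, 1, 4, 5, 8, 9, 5, 6.7, 9]
num_bins = 4
bin_size = (max(data) - min(data)) / num_bins

# Error 1: thresholds start from 0 rather than from min(data).
zero_based = [bin_size * (i + 1) for i in range(num_bins)]

# Error 2: without a break, bin i counts every datum <= threshold i,
# which is a cumulative count rather than a per-bin count.
cumulative = [sum(1 for d in data if d <= t) for t in zero_based]

print(cumulative)  # [7, 12, 17, 17]
```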

Check the threshold definition and the if statement. This works:

def histogram(data, num_bins):

    # Find what range the data spans, and use it to calculate the bin size.
    lowest = min(data)
    span = max(data) - lowest
    bin_size = span / float(num_bins)

    # Calculate the num_bins + 1 bin edges, starting from min(data).
    thresholds = [lowest + bin_size * i for i in range(num_bins + 1)]
    thresholds[-1] = max(data)  # avoid float rounding at the top edge

    print(thresholds)
    # Compute the histogram
    counts = [0 for i in range(num_bins)]
    for datum in data:
        # Increment the count of the bin that the datum falls in.
        # Bins are (lower, upper]; the first bin also includes min(data).
        for bin_index in range(num_bins):
            lower, upper = thresholds[bin_index], thresholds[bin_index + 1]
            if lower < datum <= upper or (bin_index == 0 and datum == lower):
                counts[bin_index] += 1
    return counts

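First, if you just want a histogram of your data, numpy provides that. But you asked how to do it yourself. Your code suggests that you lost track of what you were trying to do, so break the function into smaller functions. For example, to compute the thresholds, write a function thresholds(xmin, xmax, nbins), or better yet use numpy.linspace. Doing so would draw your attention to the fact that you were incrementing relative to 0 (rather than min(data)) and, if you are lucky, might remind you not to rely on exact floating-point accumulation. So you might end up with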

def thresholds(xmin, xmax, nbins):
    span = (xmax - xmin) / float(nbins)
    thresholds = [xmin + (i+1)*span for i in range(nbins)]
    thresholds[-1] = xmax
    return thresholds
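The numpy.linspace variant mentioned above could look roughly like this. It is a sketch, not a drop-in replacement: thresholds_np is a made-up name, and it returns a numpy array rather than a list.

```python
import numpy as np

def thresholds_np(xmin, xmax, nbins):
    # nbins + 1 evenly spaced edges from xmin to xmax (both included);
    # dropping the first edge leaves the upper bound of each bin.
    edges = np.linspace(xmin, xmax, nbins + 1)
    return edges[1:]

upper_bounds = thresholds_np(-3.2, 9, 4)
print(upper_bounds)
```

With the default endpoint=True, linspace includes xmax as the last sample, so the manual thresholds[-1] = xmax fix-up is not needed here.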

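Next, you need the bin counts. Again, you could just use numpy.digitize. Compared with your code, the important thing is not to increment more than one bin per datum. You might end up with something like
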
def counts(data, bounds):
    counts = [0] * len(bounds)
    for datum in data:
        bin = min(i for i,bound in enumerate(bounds) if bound >= datum)
        counts[bin] += 1
    return counts
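For comparison, here is a rough sketch of the numpy.digitize route. It assumes the upper bin bounds come from numpy.linspace and uses numpy.bincount to tally the indices; neither choice is required by the answer.

```python
import numpy as np

data = [-3.2, 0, 1, 1.5, 1.6, 1.9, 5, 6, 9, 1, 4, 5, 8, 9, 5, 6.7, 9]
num_bins = 4

# Upper bound of each bin; the last one is max(data).
bounds = np.linspace(min(data), max(data), num_bins + 1)[1:]

# right=True assigns x to index i where bounds[i-1] < x <= bounds[i],
# and to index 0 when x <= bounds[0]: exactly one bin per datum.
bin_indices = np.digitize(data, bounds, right=True)

# Tally how many data points landed in each bin.
bin_counts = np.bincount(bin_indices, minlength=num_bins)
print(bin_counts.tolist())  # [1, 6, 4, 6]
```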

Now you are ready to go:

def histogram02(data, num_bins):
    xmin = min(data)
    xmax = max(data)
    th = thresholds(xmin, xmax, num_bins)
    return counts(data, th)
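Putting the three pieces together, a quick self-contained check against the data from the question might look like this. The helpers are repeated so the snippet runs on its own, with local variables renamed (bin_width, bounds, result) to avoid shadowing the function names and the built-in bin.

```python
def thresholds(xmin, xmax, nbins):
    bin_width = (xmax - xmin) / float(nbins)
    bounds = [xmin + (i + 1) * bin_width for i in range(nbins)]
    bounds[-1] = xmax  # do not rely on float accumulation reaching xmax exactly
    return bounds

def counts(data, bounds):
    result = [0] * len(bounds)
    for datum in data:
        # The first bound that is >= datum identifies the single bin to increment.
        index = min(i for i, bound in enumerate(bounds) if bound >= datum)
        result[index] += 1
    return result

def histogram02(data, num_bins):
    return counts(data, thresholds(min(data), max(data), num_bins))

data = [-3.2, 0, 1, 1.5, 1.6, 1.9, 5, 6, 9, 1, 4, 5, 8, 9, 5, 6.7, 9]
print(histogram02(data, num_bins=4))  # [1, 6, 4, 6]
```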