Pythonic way of binning data without pandas/numpy
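I'm looking for a way to bin a dataset of a few hundred entries into 20 bins, but without using big modules like pandas (cut) and numpy (digitize).
Can anyone think of a better solution than 18 elifs?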
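All you need to do is figure out which bin each element falls in. If the bins are uniform in size, this is fairly trivial. From your array, you can find minval and maxval. Then binwidth = (maxval - minval) / nbins. For an element elem of the array, with the known minimum minval and bin width binwidth, the element falls in bin number int((elem - minval) / binwidth). This leaves the edge case of elem == maxval. In that case the bin number equals nbins (the (nbins + 1)-th bin, since Python is zero-based), so we have to decrement the bin number for this case.
So we can write a function that does this: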
import random

def splitIntoBins(arr, nbins, minval=None, maxval=None):
    minval = min(arr) if minval is None else minval  # Use minval if specified, otherwise min of data
    maxval = max(arr) if maxval is None else maxval  # Same for maxval
    binwidth = (maxval - minval) / nbins  # Bin width
    allbins = [[] for _ in range(nbins)]  # Pre-make a list-of-lists to hold values
    for elem in arr:
        binnum = int((elem - minval) // binwidth)  # Find which bin this element belongs in
        binindex = min(nbins - 1, binnum)  # To handle the case of elem == maxval
        allbins[binindex].append(elem)  # Add this element to the bin
    return allbins

# Make 1000 random numbers between 0 and 1
x = [random.random() for _ in range(1000)]
# Split into 10 bins from 0 to 1, i.e. a bin every 0.1
b = splitIntoBins(x, 10, 0, 1)
# Get min, max, count for each bin
counts = [(min(v), max(v), len(v)) for v in b]
print(counts)
This gives
[(0.00017731201786974626, 0.09983758434153, 101),
(0.10111204267013452, 0.19959594179848794, 97),
(0.20089309189822557, 0.2990120768922335, 100),
(0.3013915797055913, 0.39922131591077614, 90),
(0.4009006835799309, 0.49969892298935836, 83),
(0.501675740585966, 0.5999729295882031, 119),
(0.6010149249108184, 0.7000366124696699, 120),
(0.7008002068562794, 0.7970568220766774, 91),
(0.8018697850229161, 0.8990963218226316, 99),
(0.9000732426223624, 0.9967964437788829, 100)]
This looks like what we expected.
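To make the elem == maxval edge case concrete, here is a small sketch that computes only the bin index, using 20 bins over the range 0 to 100 as in the question. The helper name bin_index and the sample values are made up for illustration; the arithmetic is the same as in splitIntoBins.

```python
def bin_index(elem, minval, maxval, nbins):
    """Hypothetical helper: index of the uniform-width bin that holds elem."""
    binwidth = (maxval - minval) / nbins       # 5.0 for the range 0..100 and 20 bins
    binnum = int((elem - minval) // binwidth)  # floor of the distance from minval, in bin widths
    return min(nbins - 1, binnum)              # clamp elem == maxval into the last bin

print(bin_index(0, 0, 100, 20))    # 0  (first bin)
print(bin_index(37, 0, 100, 20))   # 7  (37 // 5.0 is 7.0)
print(bin_index(100, 0, 100, 20))  # 19 (raw value is 20, clamped to the last bin)
```

Note that this sketch, like the function above, assumes maxval > minval; if every value is identical, binwidth is zero and the division raises ZeroDivisionError.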
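For non-uniform bins, it's no longer an arithmetic calculation. In this case, an element elem belongs in the bin whose lower bound is less than or equal to elem and whose upper bound is greater than or equal to elem. Here bins is the list of bin edges in ascending order: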
def splitIntoBins2(arr, bins):
    binends = bins[1:]
    binstarts = bins[:-1]
    allbins = [[] for _ in binends]  # Pre-make a list-of-lists to hold values
    for elem in arr:
        for i, (lower_bound, upper_bound) in enumerate(zip(binstarts, binends)):
            if lower_bound <= elem <= upper_bound:
                allbins[i].append(elem)  # Add this element to the bin
                break  # Stop at the first match so an edge value is binned only once
    return allbins
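If there are many bins, the inner loop over every bin can be swapped for a binary search over the edges. The following is only a sketch of that idea using the standard library's bisect module, not part of the answer above. The name split_into_bins_bisect is made up for illustration, and the sketch assumes the edges are sorted in ascending order. It uses bisect_left so that a value sitting exactly on an edge lands in the lower bin, as it does in the loop version, and values outside the overall range are skipped.

```python
from bisect import bisect_left

def split_into_bins_bisect(arr, edges):
    """Hypothetical variant of splitIntoBins2; edges must be sorted ascending."""
    allbins = [[] for _ in range(len(edges) - 1)]
    for elem in arr:
        if elem < edges[0] or elem > edges[-1]:
            continue  # Outside every bin: skip, as the loop version does
        # bisect_left gives the index of the first edge >= elem, so the bin
        # is one position to the left; max() handles elem == edges[0]
        idx = max(bisect_left(edges, elem) - 1, 0)
        allbins[idx].append(elem)
    return allbins

data = [0, 0.5, 1, 3, 5, 7, 10, 11, -1]
print(split_into_bins_bisect(data, [0, 1, 5, 10]))
# [[0, 0.5, 1], [3, 5], [7, 10]]
```

Each lookup then costs on the order of log(number of bins) comparisons instead of a scan over all bins, which hardly matters for 20 bins but keeps the code free of nested loops.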