Smoothing values by bin boundaries: where do you put a value that sits exactly halfway between the lower and upper boundary?

In response to @j.jerrod.taylor's answer, let me rephrase my question to clear up any misunderstanding.

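I am new to data mining and am learning how to handle noisy data by smoothing it with equal-width (distance) binning and "bin boundaries". Assume the data set 1, 2, 2, 3, 5, 6, 6, 7, 7, 8, 9. I want to perform: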

1. Distance binning with 3 bins, and
2. Smoothing by bin boundaries, based on the values binned in #1.

This is based on the definition in (Han, Kamber, Pei, 2012, Data Mining: Concepts and Techniques, Section 3.2.2 Noisy Data):

In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value.

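Bin width = (max - min) / k = (9 - 1) / 3 ≈ 2.7
Bin intervals = [1, 3.7), [3.7, 6.4), [6.4, 9.1]
Original Bin 1: 1, 2, 2, 3 | bin boundaries: (1, 3) | smoothed by bin boundaries: 1, 1, 1, 3
Original Bin 2: 5, 6, 6 | bin boundaries: (5, 6) | smoothed by bin boundaries: 5, 6, 6
Original Bin 3: 7, 7, 8, 9 | bin boundaries: (7, 9) | smoothed by bin boundaries: 7, 7, 8, 9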
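
The equal-width step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `equal_width_bins` is made up for this example, intervals are treated as half-open, and the maximum value is clamped into the last bin.

```python
def equal_width_bins(values, k):
    """Split values into k equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in values:
        if width == 0:
            # All values identical: everything goes into the first bin.
            i = 0
        else:
            # Index of the half-open interval [lo + i*width, lo + (i+1)*width);
            # the maximum value is clamped into the last bin.
            i = min(int((v - lo) / width), k - 1)
        bins[i].append(v)
    return width, bins


data = [1, 2, 2, 3, 5, 6, 6, 7, 7, 8, 9]
width, bins = equal_width_bins(data, 3)
print(round(width, 2))  # 2.67
print(bins)             # [[1, 2, 2, 3], [5, 6, 6], [7, 7, 8, 9]]
```

The unrounded width is about 2.67 rather than 2.7, but for this data set both give the same three bins.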

Question: when smoothing by bin boundaries, where does 8 go in Bin 3, given that it is 1 above 7 and 1 below 9?

If this is a problem, then you calculated your bin width incorrectly. Creating a histogram, for example, is one instance of data binning.

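You can read this response on Cross Validated. In general, though, if you are trying to bin integers, your boundaries will be doubles (floating-point values).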

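For example, if you want everything between 2 and 6 to be in one bin, your actual boundaries will be 1.5 and 6.5. Since all of your data are integers, no value can land exactly on a boundary, so nothing can go unclassified.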
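
A small sketch of that idea, using only the standard library. The edges 1.5 and 6.5 come from the example above; the helper name `bin_index` and the sample values are assumptions for illustration.

```python
from bisect import bisect_right

# Half-integer edges: integers 2..6 fall strictly between 1.5 and 6.5.
edges = [1.5, 6.5]


def bin_index(value, edges):
    """Return 0 below the first edge, 1 between the edges, 2 above the last."""
    return bisect_right(edges, value)


labels = [bin_index(v, edges) for v in [1, 2, 6, 7]]
print(labels)  # [0, 1, 1, 2]
```

Because no integer can equal 1.5 or 6.5, every value falls strictly inside one interval.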

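Edit: I have the same book too, though apparently a different edition, because the section on data discretization is in chapter 2 rather than chapter 3 as you indicated. Based on your question, it seems you haven't really understood the concept yet.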

Here is an excerpt from page 88 of chapter 2, on data preprocessing. I am using the second edition of the text.

For example, attribute values can be discretized by applying equal-width or equal-frequency binning, and then replacing each bin value by the bin mean or median, as in smoothing by bin means or smoothing by bin medians, respectively.

8 doesn't belong anywhere other than in bin 3. This gives you two options. You can either take the mean or median of all of the numbers that fall in bin 3, or you can use bin 3 as a category.

Following your example, we can take the mean of the 4 numbers in bin 3, which gives 7.75. We would then use 7.75 for each of the four numbers in that bin instead of 7, 7, 8, and 9.

The second option is to use the bin number. For example, everything in bin 3 gets the category label 3, everything in bin 2 gets the label 2, and so on.
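
Both options can be sketched as follows, assuming the three bins from the question are already available as lists. The variable names are illustrative only.

```python
from statistics import mean

# Bins taken from the question's equal-width binning.
bins = [[1, 2, 2, 3], [5, 6, 6], [7, 7, 8, 9]]

# Option 1: replace every value in a bin by that bin's mean.
smoothed_by_mean = [[mean(b)] * len(b) for b in bins]

# Option 2: replace every value by its bin number (1-based category label).
labels = [[i] * len(b) for i, b in enumerate(bins, start=1)]

print(smoothed_by_mean[2])  # [7.75, 7.75, 7.75, 7.75]
print(labels[2])            # [3, 3, 3, 3]
```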

Update with the correct answer:

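My class finally covered this topic, and the answer to my own question is that 8 can be assigned to either 7 or 9. This situation is described as "tie-breaking": the value is equidistant from both boundaries. Either choice is acceptable, as long as all such values are always assigned to the same boundary.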

Here is a real example from an analysis paper hosted by the NIH that explains the use of "tie breaking" when equidistant values are encountered: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3807594/
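
A minimal sketch of smoothing by bin boundaries with a consistent tie-breaking rule might look like this. The function name `smooth_by_boundaries` and its `tie` parameter are hypothetical, introduced only to make the rule explicit; the book's definition does not prescribe which boundary wins.

```python
def smooth_by_boundaries(bin_values, tie="lower"):
    """Replace each value with the closest of the bin's min and max.

    `tie` is a hypothetical parameter naming which boundary wins when a
    value is equidistant from both ("lower" or "upper").
    """
    lo, hi = min(bin_values), max(bin_values)
    out = []
    for v in bin_values:
        d_lo, d_hi = v - lo, hi - v
        if d_lo < d_hi or (d_lo == d_hi and tie == "lower"):
            out.append(lo)
        else:
            out.append(hi)
    return out


bins = [[1, 2, 2, 3], [5, 6, 6], [7, 7, 8, 9]]
print([smooth_by_boundaries(b, tie="lower") for b in bins])
# [[1, 1, 1, 3], [5, 6, 6], [7, 7, 7, 9]]
print([smooth_by_boundaries(b, tie="upper") for b in bins])
# [[1, 3, 3, 3], [5, 6, 6], [7, 7, 9, 9]]
```

Note that the 2s in bin 1 are ties as well (1 away from both 1 and 3). Breaking ties toward the lower boundary reproduces the 1, 1, 1, 3 from the question and sends 8 to 7; breaking them toward the upper boundary sends 8 to 9.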

statistics

data-mining

binning