Best data analysis method to predict where a certain value will fit in a dataset

Question

I'm teaching myself predictive data analysis using a very small dataset, and I'm trying to solve the problem below with Weka and Orange.

First, I use this CSV file to train the system:

gender,weight
M,82
F,71
M,90
F,76
M,88
F,56
M,100
F,63
M,84
F,79
M,92
F,66

Notice that all the F values are below 80 and all the M values are above 80.

Then I have this data file:

weight, gender
70,,
100,,
69,,
76,,
99,,

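Note that the 'gender' values are missing.

I want to design a system that reads the data file and, based on some data analysis, puts M or F into the gender field.

I looked into linear regression, but that deals with the relationship between two moving values (as X increases, Y increases too).

Then I looked at K-Clustering, but all that did was show me two clusters: M > 80 and F < 80.

Is there a system I can use to try to apply some prediction to my dataset?

Many thanks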

Answer 1

This looks like something a decision tree can do easily. I looked up a Weka tutorial for you, since I've never used Weka myself, but the concept is the same.
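
To make the concept concrete, here is a minimal sketch of a one-split decision tree (a "decision stump") in plain Python, fitted to the training data from the question. The names `fit_stump` and `predict` are made up for illustration, and the rule direction (lighter means F, heavier means M) is hard-coded as a simplifying assumption. Real decision tree learners choose splits with an impurity measure such as information gain, not a raw error count.

```python
# Training data from the question: (weight, gender) pairs.
train = [(82, "M"), (71, "F"), (90, "M"), (76, "F"), (88, "M"), (56, "F"),
         (100, "M"), (63, "F"), (84, "M"), (79, "F"), (92, "M"), (66, "F")]


def fit_stump(rows):
    """Find the weight threshold that misclassifies the fewest training rows.

    Assumed rule shape: weight <= threshold -> "F", otherwise -> "M".
    """
    weights = sorted({w for w, _ in rows})
    # Candidate thresholds are midpoints between consecutive distinct weights.
    candidates = [(a + b) / 2 for a, b in zip(weights, weights[1:])]
    best_threshold, best_errors = None, len(rows) + 1
    for t in candidates:
        # A row is an error when its label disagrees with the rule at t.
        errors = sum((g == "M") != (w > t) for w, g in rows)
        if errors < best_errors:
            best_threshold, best_errors = t, errors
    return best_threshold, best_errors


def predict(weight, threshold):
    """Apply the fitted rule to a single weight."""
    return "M" if weight > threshold else "F"


threshold, errors = fit_stump(train)
predictions = [predict(w, threshold) for w in (70, 100, 69, 76, 99)]
print(threshold, errors, predictions)
```

On this data the only threshold with zero training errors lies between 79 and 82, which matches the "below 80 / above 80" pattern the question points out.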

Answer 2

Following on from Ilyas's answer, Python's scikit-learn documentation is worth a look. I'd suggest checking out the classification entries in its supervised learning section.
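
As a rough sketch of what that could look like, the snippet below trains scikit-learn's `DecisionTreeClassifier` on the question's training data and predicts the missing genders. It assumes scikit-learn is installed; the variable names are just illustrative, and any of the other classifiers in the supervised learning docs could be swapped in.

```python
from sklearn.tree import DecisionTreeClassifier

# Training data from the question: one feature (weight) per sample.
X_train = [[82], [71], [90], [76], [88], [56],
           [100], [63], [84], [79], [92], [66]]
y_train = ["M", "F", "M", "F", "M", "F",
           "M", "F", "M", "F", "M", "F"]

# random_state is fixed only to keep the run reproducible.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# Weights from the data file whose gender is missing.
X_new = [[70], [100], [69], [76], [99]]
predicted = clf.predict(X_new)

for (weight,), gender in zip(X_new, predicted):
    print(weight, gender)
```

With only twelve training rows this is a toy model, so treat the predictions as an illustration of the workflow and not as a reliable classifier.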

Answer 3

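As Ilyas Moutawwakil suggested, with Weka you can do the following:

First, convert your data to ARFF (the ARFF format declares the categorical values in its header, which avoids problems with CSV files and potentially missing or additional values):

Your training data: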

@relation train

@attribute weight numeric
@attribute gender {F,M}

@data
82,M
71,F
90,M
76,F
88,M
56,F
100,M
63,F
84,M
79,F
92,M
66,F

The data you want to predict:

@relation predict

@attribute weight numeric
@attribute gender {F,M}

@data
70,?
100,?
69,?
76,?
99,?
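
If you would rather not write the ARFF files by hand, a conversion along these lines should work. This is only a sketch using Python's standard library: `csv_to_arff` is a hypothetical helper, it assumes the CSV has `weight` and `gender` columns in either order, and it hard-codes the `{F,M}` class values shown above. Blank gender cells are written as `?`, the ARFF marker for a missing value.

```python
import csv
import io


def csv_to_arff(csv_text, relation):
    """Convert weight/gender CSV text to ARFF text; blank genders become '?'."""
    # skipinitialspace tolerates a header such as "weight, gender".
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    lines = [
        f"@relation {relation}",
        "",
        "@attribute weight numeric",
        "@attribute gender {F,M}",
        "",
        "@data",
    ]
    for row in reader:
        gender = (row.get("gender") or "").strip() or "?"
        lines.append(f"{row['weight'].strip()},{gender}")
    return "\n".join(lines) + "\n"


# Shortened copies of the two files from the question.
train_csv = "gender,weight\nM,82\nF,71\n"
predict_csv = "weight, gender\n70,,\n100,,\n"

print(csv_to_arff(train_csv, "train"))
print(csv_to_arff(predict_csv, "predict"))
```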

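Then you can use a decision tree algorithm such as J48 to train on your training data and generate predictions for your other dataset (adjusting the paths to weka.jar and your datasets, of course):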

java -cp weka.jar weka.classifiers.trees.J48 -t train.arff -T predict.arff -p 1

Note: -p 1 adds the first attribute (weight) to the output.

If you want the predictions written to a CSV file (predictions.csv), you can do this:

java -cp weka.jar weka.classifiers.trees.J48 -t train.arff -T predict.arff -classifications "weka.classifiers.evaluation.output.prediction.CSV -p 1 -file predictions.csv -suppress"

The predictions.csv file will look like this:

inst#,actual,predicted,error,prediction,weight
1,1:?,1:F,,1,70
2,1:?,2:M,,1,100
3,1:?,1:F,,1,69
4,1:?,1:F,,1,76
5,1:?,2:M,,1,99
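
To get from that output back to plain weight/gender pairs, you could post-process the file with something like the sketch below. It uses only Python's standard library and assumes the column layout shown above; `read_predictions` is a hypothetical helper, and the CSV text is embedded as a string here just to keep the example self-contained.

```python
import csv
import io

# Contents of predictions.csv as shown above.
predictions_csv = """inst#,actual,predicted,error,prediction,weight
1,1:?,1:F,,1,70
2,1:?,2:M,,1,100
3,1:?,1:F,,1,69
4,1:?,1:F,,1,76
5,1:?,2:M,,1,99
"""


def read_predictions(csv_text):
    """Return (weight, gender) pairs parsed from the prediction output."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # "1:F" -> "F": drop the class index that precedes the colon.
        label = row["predicted"].split(":", 1)[1]
        rows.append((int(row["weight"]), label))
    return rows


filled = read_predictions(predictions_csv)
for weight, gender in filled:
    print(f"{weight},{gender}")
```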
