Testing the normality and correlation of the feature and label values

Question

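I have a dataset stored in a 2-D NumPy array. I want to test each feature (each column of the array) for normality and for correlation, and then plot the results.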

I know that in R this can be done easily by running the following commands:

shapiro.test(Class$Feature)
ggqqplot(Class$Feature, ylab = "Feature")

In R, the correlation test can be run just as easily with:

res <- cor.test(Class$Feature, Class$class, method = "pearson")

How can I perform these steps in Python?

I tried SciPy's normaltest on a multi-column dataset, but it does not work.

from scipy import stats
df = pd.DataFrame(data)
k2, p = stats.normaltest(df[:,1], df[:,5]) #Testing Feature 1 agains Feature 5
print (p)

Answer 1

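After a lot of searching, I found that working on the NumPy array directly may not be the most convenient approach to this problem. So I loaded the dataset into a pandas DataFrame and then used the following code: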

import matplotlib.pyplot as plt
from scipy import stats

def test_normality(data_frame, features):
    for feature in features:
        # Shapiro-Wilk test: a small p-value is evidence against normality
        print("Test Result: " + str(stats.shapiro(data_frame[feature])))
        # Q-Q plot of the feature against a normal distribution
        stats.probplot(data_frame[feature], dist="norm", plot=plt)
        plt.show()

test_normality(data_frame, ["feature1", "feature2", "feature3"])
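
Since the question starts from a 2-D NumPy array, the same tests can probably also be run column by column without converting to a DataFrame. The following is only a minimal sketch under assumptions: the data is synthetic and the helper name normality_by_column is hypothetical. Note that the second positional argument of scipy.stats.normaltest is axis, not a second sample, which is one reason the call in the question fails (the other being that df[:, 1] is NumPy-style indexing and does not work on a DataFrame).

```python
import numpy as np
from scipy import stats

def normality_by_column(data):
    """Run two normality tests on every column of a 2-D array.

    Hypothetical helper: returns a list of
    (column index, Shapiro-Wilk p-value, D'Agostino-Pearson p-value).
    """
    results = []
    for j in range(data.shape[1]):
        column = data[:, j]  # NumPy slicing selects one feature
        _, shapiro_p = stats.shapiro(column)
        _, normaltest_p = stats.normaltest(column)
        results.append((j, shapiro_p, normaltest_p))
    return results

# Synthetic data (an assumption for illustration):
# column 0 is drawn from a normal distribution, column 1 from an exponential one.
rng = np.random.default_rng(0)
data = np.column_stack([
    rng.normal(size=200),
    rng.exponential(size=200),
])

results = normality_by_column(data)
for j, shapiro_p, normaltest_p in results:
    print("column %d: Shapiro p=%.3g, normaltest p=%.3g" % (j, shapiro_p, normaltest_p))
```

A small p-value is evidence against normality, so the exponential column should be rejected clearly. For the Q-Q plot, stats.probplot(data[:, j], dist="norm", plot=plt) accepts a NumPy column in the same way as a DataFrame column.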

For the correlation test, I used the following code:

from scipy.stats import pearsonr

def correlation_test(data_frame, features, feature_for_test):
    for feature in features:
        # pearsonr returns the correlation coefficient and the two-sided p-value
        cor, p_value = pearsonr(data_frame[feature], data_frame[feature_for_test])
        print("Pearson Correlation Test Result: %.3f (p-value: %.3g)" % (cor, p_value))

correlation_test(data_frame, ["feature1", "feature2", "feature3"], "feature_for_test")
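
For readers who want to stay with a NumPy array, as in the question, a rough equivalent of R's cor.test for each feature column against a label vector might look like the sketch below. The data is synthetic and the helper name correlation_with_label is hypothetical; only scipy.stats.pearsonr is a real library call.

```python
import numpy as np
from scipy.stats import pearsonr

def correlation_with_label(features, label):
    """Pearson correlation of each feature column with the label.

    Hypothetical helper: returns a list of
    (column index, correlation coefficient, two-sided p-value).
    """
    results = []
    for j in range(features.shape[1]):
        r, p_value = pearsonr(features[:, j], label)
        results.append((j, r, p_value))
    return results

# Synthetic data (an assumption for illustration):
# the label depends on feature 0 only, plus some noise.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 3))
label = 2.0 * features[:, 0] + rng.normal(scale=0.5, size=200)

results = correlation_with_label(features, label)
for j, r, p_value in results:
    print("feature %d: r=%.3f, p-value=%.3g" % (j, r, p_value))
```

Keeping the p-value alongside the coefficient mirrors what cor.test reports in R. Pearson's test assumes roughly normal data, so for features that fail the normality test a rank-based alternative such as scipy.stats.spearmanr may be a better fit.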

python

numpy

r

data-visualization

scipy