Correlation between dichotomous variable and continuous variable

Question

I have two dataframes. One (Lots) has the following structure:

Lot Group	Lot Number	Booking Stage	Date
1	216000.00	HPRESM	2020-08-28
2	890000.01	PART	2013-04-17

The other one (Measurement) looks like this:

Mid	Date	Measurement 1	Measurement 2
1901827	2020-08-28	44.5	23.22
2981632	2013-04-17	49.0	34.5

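The Date column in each dataframe contains unique dates, and the dates are identical across the two dataframes, which also have the same length.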

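What I want to do is compute the correlation between the measurement columns, which are continuous variables, and Lot Group, which is either 1 (good lot) or 2 (bad lot), i.e. a dichotomous variable. The measurement variables contain a lot of NaNs, more than 50%. My problem is that I tried to compute the point-biserial correlation, because I read that it is used for the correlation between these two types of variables, but I get nan for the statistic and 1 for the p-value.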

import scipy.stats as ss

# Numeric measurement columns only (drops text and datetime columns)
columns = measurement.select_dtypes(exclude=["object", "datetime"]).columns
for col in columns:
    stat, p = ss.pointbiserialr(lots["LosGruppe"], measurement[col])
    print(f"Variable: {col}, Correlation: {stat}, P-Value: {p}")

Output:
Variable: Mes 1, Correlation: nan, P-Value: 1.0
Variable: Mes 2, Correlation: nan, P-Value: 1.0
Variable: Mes 3, Correlation: nan, P-Value: 1.0
Variable: Mes 4, Correlation: nan, P-Value: 1.0
Variable: Mes 5, Correlation: nan, P-Value: 1.0

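What would you suggest as a solution to this problem, or as its cause? And what is an appropriate measure of association between these variables?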

Answer 1

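Point-biserial correlation is a good approach here, but the problem you are running into is the missing values. You need to remove them first with dropna().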
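As a minimal sketch of that idea (not tested against your data): the frames below are made-up stand-ins for yours, and the helper name pointbiserial_dropna is hypothetical. It merges the two frames on Date, then drops missing rows pairwise for each measurement column, so that a NaN in one column does not discard rows that are still usable for another column.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up stand-ins for the two dataframes in the question.
dates = pd.date_range("2020-01-01", periods=8)
lots = pd.DataFrame({"Date": dates, "Lot Group": [1, 1, 1, 1, 2, 2, 2, 2]})
measurement = pd.DataFrame({
    "Date": dates,
    "Measurement 1": [1.0, np.nan, 2.0, np.nan, 5.0, 6.0, np.nan, 7.0],
    "Measurement 2": [np.nan] * 8,  # a column with no usable values at all
})


def pointbiserial_dropna(group, values):
    """Point-biserial correlation on the rows where both inputs are present."""
    pair = pd.concat([group, values], axis=1, keys=["group", "value"]).dropna()
    # The correlation is undefined for fewer than two rows or a constant input.
    if len(pair) < 2 or pair["group"].nunique() < 2 or pair["value"].nunique() < 2:
        return np.nan, np.nan
    stat, p = stats.pointbiserialr(pair["group"], pair["value"])
    return float(stat), float(p)


# Align both frames on Date so each lot group is paired with its own measurement.
merged = lots.merge(measurement, on="Date")

results = {}
for col in ["Measurement 1", "Measurement 2"]:
    results[col] = pointbiserial_dropna(merged["Lot Group"], merged[col])
    stat, p = results[col]
    print(f"Variable: {col}, Correlation: {stat}, P-Value: {p}")
```

With more than 50% NaN, it is worth checking how many rows survive for each column before trusting the p-value; a column that ends up with only one lot group after dropping cannot be correlated at all, which is why the helper returns nan in that case.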

python

statistics

correlation

pandas