How to convert str variables into distinct categories in a dataframe?

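I'm trying to transform some data so that I can analyse it, but since I'm not very experienced I keep running into problems. I've already received some good advice from the community, but I'm stumped once again.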

I downloaded a data file from https://www.kaggle.com/datasets/majunbajun/himalayan-climbing-expeditions.

@LancelotduLac very kindly solved the first part of the problem for me, showing me how to convert the various termination reasons into a binary variable:

from pandas import read_csv

RE = '^Success.*$'
NRE = '^((?!Success).)*$'
TR = 'termination_reason'
BD = 'basecamp_date'
SE = 'season'

# Raw string so the backslashes in the Windows path are not read as escape sequences
data = read_csv(r'C:\Users\joepf\OneDrive\Desktop\Data analytics course\Programming1\CA2\data\expeditions.csv')

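# Copy the selected columns so the assignments below modify an independent frame
exp_win_v_fail = data[[TR, BD, SE]].copy()

# Replace non-success reasons with 0, then success reasons with 1
for v, re_ in enumerate((NRE, RE)):
    exp_win_v_fail[TR] = exp_win_v_fail[TR].replace(to_replace=re_, value=v, regex=True)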
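
As a rough illustration of what the loop above does, here is a minimal sketch on a small made-up frame. The rows and the `success` column name are hypothetical, not taken from the Kaggle file. It swaps the two regex passes for `str.startswith`, which is an alternative technique rather than the approach @LancelotduLac suggested, and it assumes a successful expedition is exactly one whose termination reason begins with "Success".

```python
import pandas as pd

# Hypothetical sample rows; the real values come from expeditions.csv
sample = pd.DataFrame({
    'termination_reason': [
        'Success (main peak)',
        'Bad weather (storms, high winds)',
        'Success (subpeak)',
        'Accident (death or serious injury)',
    ],
    'season': ['Spring', 'Autumn', 'Winter', 'Spring'],
})

# True where the reason starts with "Success", then cast True/False to 1/0.
# na=False treats a missing reason as "not a success" (an assumption).
sample['success'] = (
    sample['termination_reason'].str.startswith('Success', na=False).astype(int)
)

print(sample[['termination_reason', 'success']])
```

Writing the result to a new column keeps the original text around, which can help when checking that the flag matches the reasons.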

I then tried to convert season into a categorical variable so that I could run an ANOVA, but it didn't work as I'd hoped:

# Turn the season column into a categorical
exp_win_v_fail['season'] = exp_win_v_fail['season'].astype('category')
exp_win_v_fail['season'].dtypes


from scipy.stats import f_oneway

# One-way ANOVA
f_value, p_value = f_oneway(exp_win_v_fail[SE], exp_win_v_fail[TR])
print("F-score: " + str(f_value))
print("p value: " + str(p_value))

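I assumed that if I converted the seasons to a categorical variable I wouldn't need to convert them from str, but the console then threw this error message, which makes me doubt that assumption: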

 File "C:\Users\joepf\anaconda3\lib\site-packages\numpy\core\_asarray.py", line 102, in asarray
    return array(a, dtype, copy=False, order=order)

ValueError: could not convert string to float: 'Spring'
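
The behaviour behind this error can probably be reproduced on a tiny made-up series (the labels below are hypothetical stand-ins for the real column). The sketch suggests that a categorical column stores integer codes internally but still hands its string labels to NumPy, so a conversion to float fails in the same way.

```python
import numpy as np
import pandas as pd

# Hypothetical season labels standing in for the real column
seasons = pd.Series(['Spring', 'Autumn', 'Spring', 'Winter']).astype('category')

# The categorical dtype keeps one integer code per row internally...
codes = seasons.cat.codes.tolist()

# ...but converting to a NumPy array gives back the string labels
as_array = np.asarray(seasons)

# Asking for floats therefore fails, as it does inside f_oneway
try:
    np.asarray(seasons, dtype=float)
    converted = True
except ValueError:
    converted = False

print(codes, list(as_array), converted)
```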

Any advice would be much appreciated.

I figured out how to get this to run by changing the seasons to integers:

# Convert seasons from strings to ints
# Drop rows with an unknown season, copying so later assignments don't act on a slice
exp_win_v_fail = exp_win_v_fail[exp_win_v_fail['season'] != 'Unknown'].copy()
exp_win_v_fail['season'] = exp_win_v_fail['season'].replace({'Spring': 1, 'Summer': 2, 'Autumn': 3, 'Winter': 4})

# Turn the season column into a categorical
exp_win_v_fail['season'] = exp_win_v_fail['season'].astype('category')
exp_win_v_fail['season'].dtypes
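
With a numeric season column the `f_oneway` call no longer raises. One caveat: `f_oneway` treats each argument as a separate group of measurements, so passing the season column and the outcome column compares those two columns with each other. Below is a minimal sketch of the more usual setup, assuming the goal is to compare the success outcome across seasons. The data is made up, and the `success` column is a hypothetical stand-in for the 0/1 `termination_reason` column.

```python
import pandas as pd
from scipy.stats import f_oneway

# Hypothetical 0/1 outcomes per season; real data would come from exp_win_v_fail
df = pd.DataFrame({
    'season': ['Spring'] * 4 + ['Autumn'] * 4 + ['Winter'] * 4,
    'success': [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0],
})

# Build one array of outcomes per season; the labels can stay as strings
groups = [g['success'].to_numpy() for _, g in df.groupby('season')]

# One-way ANOVA across the per-season groups
f_value, p_value = f_oneway(*groups)
print("F-score: " + str(f_value))
print("p value: " + str(p_value))
```

Grouped this way, the season column only serves as a label, so it would not need to be converted to integers at all.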