How to convert str variables into distinct categories in a dataframe?
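I'm trying to transform some data so that I can analyse it, but since I'm not very experienced I keep running into problems. I've already had some great advice from the community, but I'm stumped once again.
I downloaded a data file from https://www.kaggle.com/datasets/majunbajun/himalayan-climbing-expeditions.
@LancelotduLac very kindly solved the first part of the problem for me, showing me how to convert the various termination reasons into a binary variable: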
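from pandas import read_csv

RE = '^Success.*$'
NRE = '^((?!Success).)*$'
TR = 'termination_reason'
BD = 'basecamp_date'
SE = 'season'
# Raw string so the backslashes in the Windows path are not read as escapes
data = read_csv(r'C:\Users\joepf\OneDrive\Desktop\Data analytics course\Programming1\CA2\data\expeditions.csv')
# Work on a copy so the assignments below do not write to a view of data
exp_win_v_fail = data[[TR, BD, SE]].copy()
# Reasons that do not start with "Success" become 0, the rest become 1
for v, re_ in enumerate((NRE, RE)):
    exp_win_v_fail[TR] = exp_win_v_fail[TR].replace(to_replace=re_, value=v, regex=True)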
I then tried to convert the season into a categorical variable so that I could run an ANOVA, but it didn't work as I'd hoped:
# Turn the season column into a categorical
exp_win_v_fail['season'] = exp_win_v_fail['season'].astype('category')
exp_win_v_fail['season'].dtypes
from scipy.stats import f_oneway
# One-way ANOVA
f_value, p_value = f_oneway(exp_win_v_fail[SE], exp_win_v_fail[TR])
print("F-score: " + str(f_value))
print("p value: " + str(p_value))
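I assumed that if I turned the seasons into a categorical variable I wouldn't need to convert them from str, but the console then threw this error message, which made me second-guess that assumption: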
File "C:\Users\joepf\anaconda3\lib\site-packages\numpy\core\_asarray.py", line 102, in asarray
return array(a, dtype, copy=False, order=order)
ValueError: could not convert string to float: 'Spring'
Any advice would be much appreciated.
I figured out how to get this to run by changing the seasons to integers:
#convert seasons from strings to ints
exp_win_v_fail['season'] = exp_win_v_fail['season'].replace('Spring', 1)
exp_win_v_fail['season'] = exp_win_v_fail['season'].replace('Summer', 2)
exp_win_v_fail['season'] = exp_win_v_fail['season'].replace('Autumn', 3)
exp_win_v_fail['season'] = exp_win_v_fail['season'].replace('Winter', 4)
# Drop rows with an unknown season; .copy() avoids assigning to a slice afterwards
exp_win_v_fail = exp_win_v_fail[exp_win_v_fail['season'] != 'Unknown'].copy()
# Turn the season column into a categorical
exp_win_v_fail['season'] = exp_win_v_fail['season'].astype('category')
exp_win_v_fail['season'].dtypes
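One thing worth adding, as far as I understand scipy: f_oneway expects one array of measurements per group, so passing the season column and the outcome column as two samples compares the mean season number against the mean outcome rather than comparing the outcome across seasons. Below is a minimal sketch of the grouped version. The small DataFrame is made-up toy data standing in for expeditions.csv, and the names df, season_codes and groups are only for illustration.

```python
import pandas as pd
from scipy.stats import f_oneway

# Hypothetical toy data standing in for the expeditions file:
# 'season' is a string label, 'termination_reason' is already coded 0/1.
df = pd.DataFrame({
    "season": ["Spring", "Spring", "Spring",
               "Autumn", "Autumn", "Autumn",
               "Winter", "Winter", "Winter", "Unknown"],
    "termination_reason": [1, 1, 0, 1, 0, 0, 0, 0, 1, 0],
})

# Drop rows with an unknown season first, then make the column categorical.
df = df[df["season"] != "Unknown"].copy()
df["season"] = df["season"].astype("category")

# If some other routine needs numbers, the category codes are available
# without a chain of replace() calls (codes follow the sorted category order).
season_codes = df["season"].cat.codes

# Build one array of outcomes per season and pass them as separate samples.
groups = [
    g["termination_reason"].to_numpy()
    for _, g in df.groupby("season", observed=True)
]
f_value, p_value = f_oneway(*groups)

print("F-score: " + str(f_value))
print("p value: " + str(p_value))
```

With this layout the season labels never have to be numeric, since they are only used to split the rows into groups.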