The one-way ANOVA function I'm using keeps spitting out F values that don't make sense
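I'm doing a project for university and it's driving me up the wall.
I downloaded a data file from https://www.kaggle.com/datasets/majunbajun/himalayan-climbing-expeditions
I'm trying to use an ANOVA to see whether there is a statistically significant difference in the time taken to summit between the seasons.
The F value I'm getting back doesn't seem to make any sense. Any suggestions?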
#import pandas
import pandas as pd
#import expeditions as csv file
exp = pd.read_csv(r'C:\filepath\expeditions.csv')
#extract only the data relating to everest
exp = exp[exp['peak_name'] == 'Everest']
#create a subset of the data containing only the columns of interest
exp_peaks = exp[['peak_name', 'member_deaths', 'termination_reason', 'hired_staff_deaths', 'year', 'season', 'basecamp_date', 'highpoint_date']]
#extract successful attempts
exp_peaks = exp_peaks[(exp_peaks['termination_reason'] == 'Success (main peak)')]
#drop missing values from basecamp_date & highpoint_date
exp_peaks = exp_peaks.dropna(subset=['basecamp_date', 'highpoint_date'])
#convert basecamp date to datetime
exp_peaks['basecamp_date'] = pd.to_datetime(exp_peaks['basecamp_date'])
#convert highpoint date to datetime
exp_peaks['highpoint_date'] = pd.to_datetime(exp_peaks['highpoint_date'])
from datetime import datetime
exp_peaks['time_taken'] = exp_peaks['highpoint_date'] - exp_peaks['basecamp_date']
#convert seasons from strings to ints
exp_peaks['season'] = exp_peaks['season'].replace('Spring', 1)
exp_peaks['season'] = exp_peaks['season'].replace('Autumn', 3)
exp_peaks['season'] = exp_peaks['season'].replace('Winter', 4)
#remove summer and unknown
exp_peaks = exp_peaks[(exp_peaks['season'] != 'Summer')]
exp_peaks = exp_peaks[(exp_peaks['season'] != 'Unknown')]
#subset the data according to the season
exp_peaks_spring = exp_peaks[exp_peaks['season'] == 1]
exp_peaks_autumn = exp_peaks[exp_peaks['season'] == 3]
exp_peaks_winter = exp_peaks[exp_peaks['season'] == 4]
#calculate the average time taken in spring
exp_peaks_spring_duration = exp_peaks_spring['time_taken']
mean_exp_peaks_spring_duration = exp_peaks_spring_duration.mean()
#calculate the average time taken in autumn
exp_peaks_autumn_duration = exp_peaks_autumn['time_taken']
mean_exp_peaks_autumn_duration = exp_peaks_autumn_duration.mean()
#calculate the average time taken in winter
exp_peaks_winter_duration = exp_peaks_winter['time_taken']
mean_exp_peaks_winter_duration = exp_peaks_winter_duration.mean()
# Turn the season column into a categorical
exp_peaks['season'] = exp_peaks['season'].astype('category')
exp_peaks['season'].dtypes
from scipy.stats import f_oneway
# One-way ANOVA
f_value, p_value = f_oneway(exp_peaks['season'], exp_peaks['time_taken'])
print("F-score: " + str(f_value))
print("p value: " + str(p_value))
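It looks like f_oneway expects the separate samples of continuous data as its arguments, rather than taking a categorical variable as one of them. In your call, the season codes are treated as one sample and the durations as the other, which is why the F value is meaningless. You can get one sample per season with groupby: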
f_oneway(*(group for _, group in exp_peaks.groupby("season")["time_taken"]))
Or equivalently, since you have already created a Series for each season:
f_oneway(exp_peaks_spring_duration, exp_peaks_autumn_duration, exp_peaks_winter_duration)
I would have expected a simpler way to run an ANOVA for such a common case, but I couldn't find one.
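For reference, here is a minimal, self-contained sketch of the groupby approach on made-up data. The `demo` frame and its values are purely illustrative stand-ins for `exp_peaks`, not the Kaggle data. I'm also assuming the timedelta column should be converted to a plain number of days before it goes into f_oneway, since it works on numeric arrays; the `days` column name is my own.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

# Synthetic stand-in for exp_peaks (values are invented for illustration only).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "season": ["Spring"] * 30 + ["Autumn"] * 30 + ["Winter"] * 30,
    "time_taken": pd.to_timedelta(
        np.concatenate([
            rng.integers(30, 40, 30),  # spring durations in days
            rng.integers(40, 50, 30),  # autumn durations in days
            rng.integers(50, 60, 30),  # winter durations in days
        ]),
        unit="D",
    ),
})

# Convert the timedelta column to a float number of days (assumed to be needed,
# because f_oneway operates on numeric samples).
demo["days"] = demo["time_taken"].dt.total_seconds() / 86400

# One numeric sample per season: this is what f_oneway takes as arguments.
samples = [group.to_numpy() for _, group in demo.groupby("season")["days"]]

f_value, p_value = f_oneway(*samples)
print("F-score: " + str(f_value))
print("p value: " + str(p_value))
```

With the real data, the same two steps (convert `time_taken` to days, then pass one group per season) should apply, and the season column can stay as strings because it is only used for grouping.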