T Test on Multiple Columns in Dataframe

Question

The dataframe looks something like this:

decade     rain     snow
1910       0.2      0.2
1910       0.3      0.4
2000       0.4      0.5
2010       0.1      0.1

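I would like some help with a Python function that runs t-tests comparing every combination of decades for a given column. The function works well, except that it does not accept an input column such as rain or snow.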

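import pandas as pd
from itertools import combinations
from scipy import stats as st

# `data` is the dataframe shown above.
def ttest_run(c1, c2):
    # The column ('rain') is hard-coded here, which is the limitation in question.
    cat1 = data[data['decade'] == c1]['rain']
    cat2 = data[data['decade'] == c2]['rain']
    results = st.ttest_ind(cat1, cat2, nan_policy='omit')
    df = pd.DataFrame({'dec1': c1,
                       'dec2': c2,
                       'tstat': results.statistic,
                       'pvalue': results.pvalue},
                       index = [0])
    return df

# One single-row dataframe per pair of decades
df_list = [ttest_run(i, j) for i, j in combinations(data['decade'].unique().tolist(), 2)]

final_df = pd.concat(df_list, ignore_index = True)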

Answer 1

I think you want something like this:

import pandas as pd
from itertools import combinations
from scipy import stats as st


d = {'decade': ['1910', '1910', '2000', '2010', '1990', '1990', '1990', '1990'], 
     'rain': [0.2, 0.3, 0.3, 0.1, 0.1, 0.2, 0.3, 0.4], 
     'snow': [0.2, 0.4, 0.5, 0.1, 0.1, 0.2, 0.3, 0.4]}
df = pd.DataFrame(data=d)


def all_pairwise(df, compare_col = 'decade'):
    decade_pairs = [(i,j) for i, j in combinations(df[compare_col].unique().tolist(), 2)]
    # or add a list of colnames to function signature
    cols = list(df.columns)
    cols.remove(compare_col)
    list_of_dfs = []
    for pair in decade_pairs:
        for col in cols:
            c1 = df[df[compare_col] == pair[0]][col]
            c2 = df[df[compare_col] == pair[1]][col]
            results = st.ttest_ind(c1, c2, nan_policy='omit')
            tmp = pd.DataFrame({'dec1': pair[0],
                                'dec2': pair[1],
                                'tstat': results.statistic,
                                'pvalue': results.pvalue}, index = [col])
            list_of_dfs.append(tmp)
    df_stats = pd.concat(list_of_dfs)
    return df_stats

df_stats = all_pairwise(df)
df_stats

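Now, if you run that code, you will get runtime warnings about division by zero, because some decades have too few data points to compute the t-statistic. This is what produces the NaNs in the output: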

>>> df_stats
      dec1  dec2     tstat    pvalue
rain  1910  2000       NaN       NaN
snow  1910  2000       NaN       NaN
rain  1910  2010       NaN       NaN
snow  1910  2010       NaN       NaN
rain  1910  1990  0.000000  1.000000
snow  1910  1990  0.436436  0.685044
rain  2000  2010       NaN       NaN
...
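To see where the NaNs come from, here is a minimal sketch that assumes the cause is the sample variance of a one-record decade (it uses NumPy directly rather than going through `ttest_ind`). With a single observation there are zero degrees of freedom, so the variance is undefined and the NaN propagates into the statistic.

```python
import warnings

import numpy as np

single = np.array([0.3])     # a decade with one record, like '2000' in the sample data
pair = np.array([0.2, 0.3])  # a decade with two records, like '1910'

with warnings.catch_warnings():
    # Silence the "Degrees of freedom <= 0" warning for this demonstration only.
    warnings.simplefilter("ignore", RuntimeWarning)
    var_single = np.var(single, ddof=1)

var_pair = np.var(pair, ddof=1)

print(var_single)  # nan: one observation leaves no degrees of freedom
print(var_pair)    # a finite value, since two observations are enough
```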

If you do not want all columns but only a specified subset, change the function signature/definition line to:

def all_pairwise(df, cols, compare_col = 'decade'):

where cols should be an iterable of string column names (a list works fine). You will need to remove these two lines:

    cols = list(df.columns)
    cols.remove(compare_col)

from the function body; everything else works unchanged.
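Put together, the modified function might look like the sketch below. It follows the same logic as the version above; the only liberties taken are collecting plain dict records instead of one-row dataframes and converting the results with `float`, which are stylistic choices rather than part of the original answer. The name `rain_only` is just for illustration.

```python
import pandas as pd
from itertools import combinations
from scipy import stats as st

d = {'decade': ['1910', '1910', '2000', '2010', '1990', '1990', '1990', '1990'],
     'rain': [0.2, 0.3, 0.3, 0.1, 0.1, 0.2, 0.3, 0.4],
     'snow': [0.2, 0.4, 0.5, 0.1, 0.1, 0.2, 0.3, 0.4]}
df = pd.DataFrame(data=d)


def all_pairwise(df, cols, compare_col='decade'):
    """Run a two-sample t-test for every pair of groups and every column in cols."""
    records = []
    labels = []
    for g1, g2 in combinations(df[compare_col].unique().tolist(), 2):
        for col in cols:
            c1 = df.loc[df[compare_col] == g1, col]
            c2 = df.loc[df[compare_col] == g2, col]
            results = st.ttest_ind(c1, c2, nan_policy='omit')
            records.append({'dec1': g1,
                            'dec2': g2,
                            'tstat': float(results.statistic),
                            'pvalue': float(results.pvalue)})
            labels.append(col)  # the tested column becomes the row label
    return pd.DataFrame(records, index=labels)


# Pairs that include a one-record decade still come back as NaN.
rain_only = all_pairwise(df, cols=['rain'])
```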

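You will keep getting these runtime warnings unless you filter out the decades with too few records before passing the dataframe to the function.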
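One way to do that filtering is sketched here with `groupby(...).filter(...)`. The threshold name `min_records` is hypothetical, and the value 2 is an assumption: it is the smallest group size for which a sample variance is defined, not a recommendation for meaningful test results.

```python
import pandas as pd

d = {'decade': ['1910', '1910', '2000', '2010', '1990', '1990', '1990', '1990'],
     'rain': [0.2, 0.3, 0.3, 0.1, 0.1, 0.2, 0.3, 0.4],
     'snow': [0.2, 0.4, 0.5, 0.1, 0.1, 0.2, 0.3, 0.4]}
df = pd.DataFrame(data=d)

min_records = 2  # assumed minimum number of rows a decade needs to be kept

# Keep only the rows whose decade has at least min_records rows.
filtered = df.groupby('decade').filter(lambda g: len(g) >= min_records)

print(filtered['decade'].unique())  # only '1910' and '1990' remain
```

The filtered dataframe can then be passed to `all_pairwise` in place of `df`.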

Here is an example call to the version that accepts a list of columns as an argument, showing the runtime warnings.

>>> all_pairwise(df, cols=['rain'])
/usr/local/lib/python3.8/site-packages/numpy/core/fromnumeric.py:3723: RuntimeWarning: Degrees of freedom <= 0 for slice
  return _methods._var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
/usr/local/lib/python3.8/site-packages/numpy/core/_methods.py:254: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
      dec1  dec2  tstat  pvalue
rain  1910  2000    NaN     NaN
rain  1910  2010    NaN     NaN
rain  1910  1990    0.0     1.0
rain  2000  2010    NaN     NaN
rain  2000  1990    NaN     NaN
rain  2010  1990    NaN     NaN
>>>
