Descriptive statistics of multi-level categorical data in Python

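Below is an example df with three columns, each holding multi-level categorical data. I would like to compute some descriptive statistics across the three columns for each level, for example the number of people per age group in each location and status, including counts, proportions and standard deviations (I suppose this should really be a confidence interval here). But I'm not sure how to do this elegantly. Any suggestions are much appreciated, thank you.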

import random
from datetime import date

import pandas as pd

birth_year = pd.DataFrame([random.randint(1900, 2000) for x in range(50)], columns=['year'])

def age(df,col):
    today = date.today()
    age = today.year - df[col]
    bins = [18,30,40,50,60,70,120]
    labs = ['-30','30-39','40-49','50-59','60-69','70+']
    group = pd.cut(age, bins, labels = labs)
    return(group)

birth_year.loc[:,'age_bin'] = age(birth_year,'year')


def Rand(start, end, num):
    # Return a list of `num` random integers between start and end (inclusive).
    # Defined before its first use below.
    out = []
    for x in range(num):
        out.append(random.randint(start, end))
    return out


location = pd.DataFrame(Rand(1, 6, 50), columns=['location'])

def label_loc(row):
    # Map the numeric location code to its name.
    if row['location'] == 1:
        return 'england'
    if row['location'] == 2:
        return 'ireland'
    if row['location'] == 3:
        return 'scotland'
    if row['location'] == 4:
        return 'wales'
    if row['location'] == 5:
        return 'jersey'
    if row['location'] == 6:
        return 'guernsey'
    return 'Other'

location = location.apply(lambda row: label_loc(row), axis=1)


status = pd.DataFrame(Rand(1, 6, 50), columns=['status'])

def label_stat (row):
    if row['status'] == 1 :
        return 'married'
    if row['status'] == 2 :
        return 'divorced'
    if row['status'] == 3:
        return 'single'
    if row['status']  == 4:
        return 'window'
    return 'Other'

status = status.apply(lambda row: label_stat(row), axis=1)


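# Column names chosen to match their contents (the original labels 'year', 'gender', 'ethnicity' did not).
df = pd.DataFrame({'year': birth_year['year'], 'age_bin': birth_year['age_bin'], 'status': status, 'location': location})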

(See this gist for a slightly rewritten example setup.)

Let's take your example:

the number of people per age group in each location and status

If you had a continuous variable, such as year, you could simply tell groupby().agg() which summary statistics you want:

print(df.groupby(['location', 'status'])['year'].agg(['mean', 'std']))

                          mean        std
location status                          
england  Other     1961.000000  16.792856
         divorced  1934.666667  30.270998
         married   1917.000000        NaN
         single    1907.000000        NaN
         window    1962.600000  34.011763
ireland  Other     1982.000000        NaN
         divorced  1949.750000  37.303932
         married   1991.000000        NaN
         single    1986.500000   2.121320
         window    1965.500000   3.535534
jersey   Other     1939.800000  26.204961
         divorced  1984.000000        NaN
         married   1986.000000        NaN
         single    1942.500000  54.447222
scotland Other     1942.666667  12.701706
         divorced  1946.000000  49.497475
         married   1914.000000        NaN
         single    1968.000000        NaN
         window    1933.500000  24.748737
wales    Other     1950.666667  39.526363
         divorced  1978.000000        NaN
         married   1959.000000  52.325902
         single    1929.000000        NaN
         window    1990.000000        NaN
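
The NaN entries in the std column come from groups with a single member, because pandas computes the sample standard deviation (ddof=1) by default. A minimal sketch of this, using a tiny made-up data frame (the values are hypothetical, only the column names follow the example) and named aggregation to also report the group size:

```python
import pandas as pd

# Hypothetical sample: two people in (england, single), one in (wales, single).
sample = pd.DataFrame({
    'location': ['england', 'england', 'wales'],
    'status': ['single', 'single', 'single'],
    'year': [1950, 1960, 1990],
})

# Named aggregation: each keyword becomes an output column.
stats = sample.groupby(['location', 'status'])['year'].agg(
    n='count', mean='mean', std='std'
)
print(stats)
```

Reporting n next to the mean makes it easier to see which rows rest on a single observation.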

For categorical values, you can count them with value_counts(), which adds an extra index level (that you can unstack):

grouped_age_bin = df.groupby(['location', 'status'])['age_bin']
counts = grouped_age_bin.value_counts().unstack('age_bin')
print(counts)

age_bin            -30  30-39  40-49  50-59  60-69  70+
location status                                        
england  Other       0      1      0      1      0    2
         divorced    0      0      0      1      0    2
         married     0      0      0      0      0    1
         single      0      0      0      0      0    1
         window      0      1      2      1      0    1
ireland  Other       0      1      0      0      0    0
         divorced    1      0      0      0      1    2
         married     1      0      0      0      0    0
         single      0      2      0      0      0    0
         window      0      0      0      2      0    0
jersey   Other       0      0      1      0      1    3
         divorced    0      1      0      0      0    0
         married     0      1      0      0      0    0
         single      0      1      0      0      0    1
scotland Other       0      0      0      0      0    3
         divorced    0      1      0      0      0    1
         married     0      0      0      0      0    1
         single      0      0      0      1      0    0
         window      0      0      0      0      1    1
wales    Other       0      1      0      1      0    1
         divorced    0      0      1      0      0    0
         married     1      0      0      0      0    1
         single      0      0      0      0      0    1
         window      0      1      0      0      0    0
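
As an alternative sketch for the same kind of count table, pd.crosstab can build it in one call. The small data frame here is made up for illustration (it is not the data from the question); only the column names location, status and age_bin follow the example.

```python
import pandas as pd

# Hypothetical sample data; only the column names match the example.
sample = pd.DataFrame({
    'location': ['england', 'england', 'england', 'wales', 'wales'],
    'status': ['single', 'single', 'married', 'single', 'single'],
    'age_bin': ['30-39', '70+', '70+', '30-39', '30-39'],
})

# Rows: every (location, status) pair that occurs; columns: age groups.
counts_ct = pd.crosstab([sample['location'], sample['status']], sample['age_bin'])
print(counts_ct)
```

Unlike the groupby version, this does not keep a grouped object around, so the group sizes would have to be computed separately (for example with counts_ct.sum(axis=1)).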

If you want the averages per category (i.e. the proportions), you can divide by the group size, which is grouped_age_bin.size():

print(counts.div(grouped_age_bin.size(), axis='index'))

age_bin                 -30  30-39     40-49     50-59  60-69       70+
location status                                                        
england  Other     0.000000    0.0  0.000000  0.000000   0.00  1.000000
         married   0.500000    0.0  0.000000  0.000000   0.00  0.500000
         single    0.000000    0.0  0.000000  0.000000   0.00  1.000000
         window    0.250000    0.0  0.000000  0.000000   0.25  0.500000
ireland  Other     0.000000    0.0  0.000000  0.000000   0.00  1.000000
         married   0.000000    0.0  0.000000  0.000000   0.00  1.000000
         single    0.000000    0.0  0.000000  0.000000   1.00  0.000000
         window    0.000000    0.0  0.333333  0.333333   0.00  0.333333
jersey   Other     0.000000    0.0  1.000000  0.000000   0.00  0.000000
         divorced  0.000000    0.0  1.000000  0.000000   0.00  0.000000
         married   0.000000    0.0  0.000000  0.000000   0.00  1.000000
         single    0.000000    0.0  0.200000  0.400000   0.20  0.200000
         window    0.000000    0.5  0.000000  0.000000   0.00  0.500000
scotland divorced  0.333333    0.0  0.000000  0.000000   0.00  0.666667
         married   0.000000    0.0  0.333333  0.333333   0.00  0.333333
         single    0.000000    0.5  0.000000  0.000000   0.00  0.500000
         window    0.000000    0.0  0.500000  0.000000   0.00  0.500000
wales    Other     0.000000    0.5  0.000000  0.000000   0.00  0.500000
         divorced  0.000000    0.0  0.000000  0.000000   0.00  1.000000
         married   0.500000    0.0  0.000000  0.000000   0.00  0.500000
         single    0.000000    0.0  0.000000  0.000000   0.00  1.000000
         window    0.500000    0.0  0.500000  0.000000   0.00  0.000000
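
Since the question also asks for a measure of spread, here is a hedged sketch of the binomial standard error of each proportion, sqrt(p * (1 - p) / n), computed from the same two ingredients (counts and group sizes). The sample data is hypothetical, and treating each cell as a simple binomial proportion is an assumption, not something stated in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names match the example.
sample = pd.DataFrame({
    'location': ['england', 'england', 'england', 'wales', 'wales'],
    'status': ['single', 'single', 'married', 'single', 'single'],
    'age_bin': ['30-39', '70+', '70+', '30-39', '30-39'],
})

grouped = sample.groupby(['location', 'status'])['age_bin']
counts_s = grouped.value_counts().unstack(fill_value=0)  # counts per age group
sizes = grouped.size()                                   # n per (location, status)
props = counts_s.div(sizes, axis='index')                # proportions per row

# Standard error of a proportion: sqrt(p * (1 - p) / n), row-wise n.
std_err = np.sqrt(props * (1 - props)).div(np.sqrt(sizes), axis='index')
print(std_err)
```

With groups this small the standard errors are large or degenerate (0 when a proportion is exactly 0 or 1), so they should be read with caution.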

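Now that you have the group sizes and the counts, you can compute confidence intervals. Or you can do some simple string aggregation. To have both the group size and the count at hand, I would use pd.DataFrame.transform + pd.Series.combine, so that you only need to write a single lambda that receives the count in the category and the group total: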

print(counts.transform(pd.Series.combine, 'index', grouped_age_bin.size(), lambda num, tot: f'{100 * num / tot:.1f}% (n={num})'))

age_bin                    -30        30-39        40-49         50-59         60-69           70+
location status                                                                                   
england  Other      0.0% (n=0)   0.0% (n=0)  50.0% (n=1)    0.0% (n=0)    0.0% (n=0)   50.0% (n=1)
         divorced   0.0% (n=0)  50.0% (n=1)   0.0% (n=0)    0.0% (n=0)    0.0% (n=0)   50.0% (n=1)
         married   33.3% (n=1)   0.0% (n=0)  33.3% (n=1)    0.0% (n=0)    0.0% (n=0)   33.3% (n=1)
         single     0.0% (n=0)   0.0% (n=0)   0.0% (n=0)    0.0% (n=0)    0.0% (n=0)  100.0% (n=1)
         window     0.0% (n=0)   0.0% (n=0)   0.0% (n=0)    0.0% (n=0)    0.0% (n=0)  100.0% (n=2)
ireland  Other      0.0% (n=0)   0.0% (n=0)   0.0% (n=0)    0.0% (n=0)    0.0% (n=0)  100.0% (n=2)
         divorced   0.0% (n=0)   0.0% (n=0)   0.0% (n=0)   50.0% (n=1)    0.0% (n=0)   50.0% (n=1)
         married    0.0% (n=0)   0.0% (n=0)   0.0% (n=0)    0.0% (n=0)  100.0% (n=2)    0.0% (n=0)
         single     0.0% (n=0)   0.0% (n=0)   0.0% (n=0)    0.0% (n=0)    0.0% (n=0)  100.0% (n=1)
         window    33.3% (n=1)   0.0% (n=0)   0.0% (n=0)    0.0% (n=0)    0.0% (n=0)   66.7% (n=2)
jersey   Other      0.0% (n=0)   0.0% (n=0)   0.0% (n=0)    0.0% (n=0)  100.0% (n=1)    0.0% (n=0)
         married    0.0% (n=0)   0.0% (n=0)   0.0% (n=0)    0.0% (n=0)    0.0% (n=0)  100.0% (n=1)
         single     0.0% (n=0)   0.0% (n=0)   0.0% (n=0)  100.0% (n=1)    0.0% (n=0)    0.0% (n=0)
scotland Other      0.0% (n=0)   0.0% (n=0)  50.0% (n=1)    0.0% (n=0)    0.0% (n=0)   50.0% (n=1)
         divorced   0.0% (n=0)   0.0% (n=0)   0.0% (n=0)    0.0% (n=0)    0.0% (n=0)  100.0% (n=3)
         married    0.0% (n=0)   0.0% (n=0)   0.0% (n=0)    0.0% (n=0)    0.0% (n=0)  100.0% (n=2)
         single     0.0% (n=0)   0.0% (n=0)   0.0% (n=0)    0.0% (n=0)    0.0% (n=0)  100.0% (n=3)
         window    25.0% (n=1)   0.0% (n=0)   0.0% (n=0)   25.0% (n=1)    0.0% (n=0)   50.0% (n=2)
wales    Other      0.0% (n=0)   0.0% (n=0)   0.0% (n=0)  100.0% (n=1)    0.0% (n=0)    0.0% (n=0)
         divorced  16.7% (n=1)   0.0% (n=0)  33.3% (n=2)    0.0% (n=0)    0.0% (n=0)   50.0% (n=3)
         married    0.0% (n=0)  33.3% (n=1)   0.0% (n=0)    0.0% (n=0)    0.0% (n=0)   66.7% (n=2)
         single     0.0% (n=0)   0.0% (n=0)  33.3% (n=1)   33.3% (n=1)    0.0% (n=0)   33.3% (n=1)
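
The confidence intervals mentioned above could be sketched along the following lines. The answer does not say which interval to use; this example assumes the Wilson score interval at roughly 95% (z = 1.96), and wilson_interval is a hypothetical helper name, not a pandas function.

```python
import math


def wilson_interval(num, tot, z=1.96):
    """Approximate Wilson score interval for the proportion num / tot."""
    if tot == 0:
        # No observations in the group: the interval is undefined.
        return (float('nan'), float('nan'))
    p = num / tot
    denom = 1 + z ** 2 / tot
    centre = (p + z ** 2 / (2 * tot)) / denom
    half = z * math.sqrt(p * (1 - p) / tot + z ** 2 / (4 * tot ** 2)) / denom
    return (centre - half, centre + half)


# Example: 2 people in one age group out of 4 in the (location, status) group.
lo, hi = wilson_interval(2, 4)
print(f'{lo:.3f} to {hi:.3f}')
```

Because it takes the same two arguments (count in the category, group total), such a function could be passed in place of the formatting lambda in the transform + combine call. With group sizes as small as in this example, the resulting intervals will be very wide.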