Descriptive statistics of multi-level categorical data in Python
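Below is an example df with three columns, each holding multi-level categorical data. I would like to compute some descriptive statistics for every level of each column across the three columns, e.g. the number of people per age group in each location and status, including counts, proportions and standard deviations (I suppose that should really be a confidence interval here). But I'm not sure how to do this in an elegant way. Any suggestions are much appreciated.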
import random
from datetime import date

import pandas as pd

birth_year = pd.DataFrame([random.randint(1900, 2000) for x in range(50)], columns=['year'])

def age(df, col):
    # Age in whole years, approximated from the birth year alone.
    today = date.today()
    age = today.year - df[col]
    # pd.cut bins are right-inclusive by default: (18, 30], (30, 40], ...
    # Ages outside (18, 120] become NaN.
    bins = [18, 30, 40, 50, 60, 70, 120]
    labs = ['-30', '30-39', '40-49', '50-59', '60-69', '70+']
    group = pd.cut(age, bins, labels=labs)
    return group

birth_year.loc[:, 'age_bin'] = age(birth_year, 'year')

def Rand(start, end, num):
    # num random integers between start and end (both inclusive).
    out = []
    for x in range(num):
        out.append(random.randint(start, end))
    return out

location = pd.DataFrame(Rand(1, 6, 50), columns=['location'])

def label_loc(row):
    if row['location'] == 1:
        return 'england'
    if row['location'] == 2:
        return 'ireland'
    if row['location'] == 3:
        return 'scotland'
    if row['location'] == 4:
        return 'wales'
    if row['location'] == 5:
        return 'jersey'
    if row['location'] == 6:
        return 'gurnsey'
    return 'Other'

location = location.apply(lambda row: label_loc(row), axis=1)

status = pd.DataFrame(Rand(1, 6, 50), columns=['status'])

def label_stat(row):
    if row['status'] == 1:
        return 'married'
    if row['status'] == 2:
        return 'divorced'
    if row['status'] == 3:
        return 'single'
    if row['status'] == 4:
        return 'window'
    # Codes 5 and 6 fall through to 'Other'.
    return 'Other'

status = status.apply(lambda row: label_stat(row), axis=1)

# Column names match the contents: age group, marital status, location.
df = pd.DataFrame({'age_bin': birth_year['age_bin'], 'status': status, 'location': location})
(See this gist for a slightly rewritten version of the example setup.)
Let's take your example:
the number of people per age group in each location and status
If you have a continuous variable, e.g. year, you can simply tell groupby().agg() which summary statistics you want:
print(df.groupby(['location', 'status'])['year'].agg(['mean', 'std']))
                          mean        std
location status
england  Other     1961.000000  16.792856
         divorced  1934.666667  30.270998
         married   1917.000000        NaN
         single    1907.000000        NaN
         window    1962.600000  34.011763
ireland  Other     1982.000000        NaN
         divorced  1949.750000  37.303932
         married   1991.000000        NaN
         single    1986.500000   2.121320
         window    1965.500000   3.535534
jersey   Other     1939.800000  26.204961
         divorced  1984.000000        NaN
         married   1986.000000        NaN
         single    1942.500000  54.447222
scotland Other     1942.666667  12.701706
         divorced  1946.000000  49.497475
         married   1914.000000        NaN
         single    1968.000000        NaN
         window    1933.500000  24.748737
wales    Other     1950.666667  39.526363
         divorced  1978.000000        NaN
         married   1959.000000  52.325902
         single    1929.000000        NaN
         window    1990.000000        NaN
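The NaN entries in the std column belong to groups that contain a single row: pandas computes the sample standard deviation (ddof=1) by default, which is undefined for one observation. A minimal sketch on a tiny hand-made frame (the toy data is hypothetical, not the data from the question) that adds count to the aggregation so this becomes visible:

```python
import pandas as pd

# Hypothetical toy data, only to illustrate the behaviour.
toy = pd.DataFrame({
    'location': ['england', 'england', 'england', 'wales'],
    'status': ['single', 'single', 'married', 'single'],
    'year': [1950, 1970, 1980, 1990],
})

# 'count' shows the group size next to mean and std;
# groups with count == 1 get NaN as their std.
summary = toy.groupby(['location', 'status'])['year'].agg(['count', 'mean', 'std'])
print(summary)
```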
For categorical values, you can count them with value_counts(), which adds an extra index level (that you can unstack):
grouped_age_bin = df.groupby(['location', 'status'])['age_bin']
counts = grouped_age_bin.value_counts().unstack('age_bin')
print(counts)
age_bin            -30  30-39  40-49  50-59  60-69  70+
location status
england  Other       0      1      0      1      0    2
         divorced    0      0      0      1      0    2
         married     0      0      0      0      0    1
         single      0      0      0      0      0    1
         window      0      1      2      1      0    1
ireland  Other       0      1      0      0      0    0
         divorced    1      0      0      0      1    2
         married     1      0      0      0      0    0
         single      0      2      0      0      0    0
         window      0      0      0      2      0    0
jersey   Other       0      0      1      0      1    3
         divorced    0      1      0      0      0    0
         married     0      1      0      0      0    0
         single      0      1      0      0      0    1
scotland Other       0      0      0      0      0    3
         divorced    0      1      0      0      0    1
         married     0      0      0      0      0    1
         single      0      0      0      1      0    0
         window      0      0      0      0      1    1
wales    Other       0      1      0      1      0    1
         divorced    0      0      1      0      0    0
         married     1      0      0      0      0    1
         single      0      0      0      0      0    1
         window      0      1      0      0      0    0
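The same counting step can be tried on a small frame where the result is easy to check by hand. This is only a sketch: the toy data is hypothetical, and age_bin is a plain string column here. With a plain string column, combinations that never occur are simply absent after value_counts(), so unstack() would leave NaN in those cells; passing fill_value=0 turns them into zeros.

```python
import pandas as pd

# Hypothetical toy data: five people, two locations, two statuses.
toy = pd.DataFrame({
    'location': ['england', 'england', 'england', 'wales', 'wales'],
    'status': ['single', 'single', 'married', 'single', 'single'],
    'age_bin': ['30-39', '70+', '70+', '30-39', '30-39'],
})

grouped = toy.groupby(['location', 'status'])['age_bin']

# value_counts() yields one row per (location, status, age_bin);
# unstack() moves the age_bin level into the columns.
counts = grouped.value_counts().unstack('age_bin', fill_value=0)
print(counts)
```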
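If you want the proportion in each category rather than the raw count, you can divide by the group size, i.e. grouped_age_bin.size() (the example data is unseeded, so the output below comes from a different random draw than the counts above):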
print(counts.div(grouped_age_bin.size(), axis='index'))
age_bin                 -30  30-39     40-49     50-59  60-69       70+
location status
england  Other     0.000000    0.0  0.000000  0.000000   0.00  1.000000
         married   0.500000    0.0  0.000000  0.000000   0.00  0.500000
         single    0.000000    0.0  0.000000  0.000000   0.00  1.000000
         window    0.250000    0.0  0.000000  0.000000   0.25  0.500000
ireland  Other     0.000000    0.0  0.000000  0.000000   0.00  1.000000
         married   0.000000    0.0  0.000000  0.000000   0.00  1.000000
         single    0.000000    0.0  0.000000  0.000000   1.00  0.000000
         window    0.000000    0.0  0.333333  0.333333   0.00  0.333333
jersey   Other     0.000000    0.0  1.000000  0.000000   0.00  0.000000
         divorced  0.000000    0.0  1.000000  0.000000   0.00  0.000000
         married   0.000000    0.0  0.000000  0.000000   0.00  1.000000
         single    0.000000    0.0  0.200000  0.400000   0.20  0.200000
         window    0.000000    0.5  0.000000  0.000000   0.00  0.500000
scotland divorced  0.333333    0.0  0.000000  0.000000   0.00  0.666667
         married   0.000000    0.0  0.333333  0.333333   0.00  0.333333
         single    0.000000    0.5  0.000000  0.000000   0.00  0.500000
         window    0.000000    0.0  0.500000  0.000000   0.00  0.500000
wales    Other     0.000000    0.5  0.000000  0.000000   0.00  0.500000
         divorced  0.000000    0.0  0.000000  0.000000   0.00  1.000000
         married   0.500000    0.0  0.000000  0.000000   0.00  0.500000
         single    0.000000    0.0  0.000000  0.000000   0.00  1.000000
         window    0.500000    0.0  0.500000  0.000000   0.00  0.000000
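A quick sanity check for this step is that every row of the proportion table sums to 1. The sketch below runs the division on hypothetical toy data and, as a possible alternative that is not part of the approach above, builds the same table in one call with pd.crosstab and normalize='index'.

```python
import pandas as pd

# Hypothetical toy data: five people, two locations, two statuses.
toy = pd.DataFrame({
    'location': ['england', 'england', 'england', 'wales', 'wales'],
    'status': ['single', 'single', 'married', 'single', 'single'],
    'age_bin': ['30-39', '70+', '70+', '30-39', '30-39'],
})

grouped = toy.groupby(['location', 'status'])['age_bin']
counts = grouped.value_counts().unstack('age_bin', fill_value=0)

# Divide every row by the size of its (location, status) group.
props = counts.div(grouped.size(), axis='index')
print(props)

# Alternative: row-normalised cross-tabulation in a single call.
alt = pd.crosstab([toy['location'], toy['status']], toy['age_bin'], normalize='index')
print(alt)
```

Dividing explicitly keeps the raw counts around for later steps, while crosstab is shorter when only the proportions are needed.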
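Now that you have both the number in each category and the group total, you can compute confidence intervals. Or you can do a simple string aggregation. To have both the count and the total available at once, I'd use pd.DataFrame.transform + pd.Series.combine, so that you only need to write a single lambda that takes the number in the category and the total: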
print(counts.transform(pd.Series.combine, 'index', grouped_age_bin.size(), lambda num, tot: f'{100 * num / tot:.1f}% (n={num})'))
age_bin                     -30         30-39         40-49         50-59         60-69           70+
location status
england  Other       0.0% (n=0)    0.0% (n=0)   50.0% (n=1)    0.0% (n=0)    0.0% (n=0)   50.0% (n=1)
         divorced    0.0% (n=0)   50.0% (n=1)    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)   50.0% (n=1)
         married    33.3% (n=1)    0.0% (n=0)   33.3% (n=1)    0.0% (n=0)    0.0% (n=0)   33.3% (n=1)
         single      0.0% (n=0)    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)  100.0% (n=1)
         window      0.0% (n=0)    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)  100.0% (n=2)
ireland  Other       0.0% (n=0)    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)  100.0% (n=2)
         divorced    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)   50.0% (n=1)    0.0% (n=0)   50.0% (n=1)
         married     0.0% (n=0)    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)  100.0% (n=2)    0.0% (n=0)
         single      0.0% (n=0)    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)  100.0% (n=1)
         window     33.3% (n=1)    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)   66.7% (n=2)
jersey   Other       0.0% (n=0)    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)  100.0% (n=1)    0.0% (n=0)
         married     0.0% (n=0)    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)  100.0% (n=1)
         single      0.0% (n=0)    0.0% (n=0)    0.0% (n=0)  100.0% (n=1)    0.0% (n=0)    0.0% (n=0)
scotland Other       0.0% (n=0)    0.0% (n=0)   50.0% (n=1)    0.0% (n=0)    0.0% (n=0)   50.0% (n=1)
         divorced    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)  100.0% (n=3)
         married     0.0% (n=0)    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)  100.0% (n=2)
         single      0.0% (n=0)    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)  100.0% (n=3)
         window     25.0% (n=1)    0.0% (n=0)    0.0% (n=0)   25.0% (n=1)    0.0% (n=0)   50.0% (n=2)
wales    Other       0.0% (n=0)    0.0% (n=0)    0.0% (n=0)  100.0% (n=1)    0.0% (n=0)    0.0% (n=0)
         divorced   16.7% (n=1)    0.0% (n=0)   33.3% (n=2)    0.0% (n=0)    0.0% (n=0)   50.0% (n=3)
         married     0.0% (n=0)   33.3% (n=1)    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)   66.7% (n=2)
         single      0.0% (n=0)    0.0% (n=0)   33.3% (n=1)   33.3% (n=1)    0.0% (n=0)   33.3% (n=1)
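For the confidence interval itself, one possible choice is the Wilson score interval for a binomial proportion, which tends to behave better than the plain normal approximation when groups are as small as they are here. The sketch below is an assumption about what is wanted rather than the only option: wilson_interval is a hypothetical helper written for this example, the toy data is made up, z=1.96 targets roughly 95% coverage, and each cell is treated as its own binomial proportion (the intervals are not simultaneous across age groups).

```python
import numpy as np
import pandas as pd


def wilson_interval(num, tot, z=1.96):
    """Wilson score interval for a binomial proportion num / tot.

    Hypothetical helper for this example; z=1.96 gives roughly 95% coverage.
    """
    num = np.asarray(num, dtype=float)
    tot = np.asarray(tot, dtype=float)
    p = num / tot
    denom = 1 + z**2 / tot
    centre = (p + z**2 / (2 * tot)) / denom
    half = z * np.sqrt(p * (1 - p) / tot + z**2 / (4 * tot**2)) / denom
    return centre - half, centre + half


# Hypothetical toy data standing in for df.
toy = pd.DataFrame({
    'location': ['england', 'england', 'england', 'wales', 'wales'],
    'status': ['single', 'single', 'married', 'single', 'single'],
    'age_bin': ['30-39', '70+', '70+', '30-39', '30-39'],
})

grouped = toy.groupby(['location', 'status'])['age_bin']
counts = grouped.value_counts().unstack('age_bin', fill_value=0)
totals = grouped.size().reindex(counts.index)

# Broadcast the per-row totals across the age_bin columns.
low, high = wilson_interval(counts.to_numpy(), totals.to_numpy()[:, None])
ci_low = pd.DataFrame(low, index=counts.index, columns=counts.columns)
ci_high = pd.DataFrame(high, index=counts.index, columns=counts.columns)

print(ci_low.round(3))
print(ci_high.round(3))
```

With only one or two people per group the intervals come out very wide, which is a fair reflection of how little such small cells say about the underlying proportion.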