Python: Create a data frame from an existing data frame

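I have the following data frame with the columns 'total_bill', 'tip', 'sex', 'smoker', 'day', 'time', and 'size'. The values in 'smoker' are either 'Yes' or 'No'. The values in 'time' are either 'Lunch' or 'Dinner'.

Given data frame: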

   total_bill   tip     sex smoker     day    time  size
0       16.99  1.01  Female     No  Sunday  Dinner     2
1       10.34  1.66    Male     No  Sunday  Dinner     3
2       21.01  3.50    Male     No  Sunday  Dinner     3
3       23.68  3.31    Male     No  Sunday  Dinner     2
4       24.59  3.61  Female     No  Sunday  Dinner     4

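I need to create the following:

  1. A data frame of the size (count) for Dinner and Lunch
  2. A data frame of the number of smokers by time (Lunch: number of smokers, Dinner: number of smokers)
  3. A merge of the two data frames above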

Expected output ("?" stands for the derived values):

        Records  Smokers
Lunch         ?        ?
Dinner        ?        ?

My code:

# Number of Time (Lunch/Dinner) records
df1 = tips_df.groupby('time')['size'].sum()

def myfunction(x):
    if x == 'Yes':
        return 1
    else:
        return 0

#Number of Smoker records by Time (Lunch/Dinner)
mydata['Smoker_numerical'] = tips_df['smoker'].apply(lambda x: myfunction(x))
mydata = mydata.astype({'Smoker_numerical': 'int32'})
mydata2 = mydata.groupby('time')['Smoker_numerical'].sum()

result = concat([df1, mydata2], axis=1)
result

My code for determining the number of smokers by time raises the following error.

KeyError                                  Traceback (most recent call last)
/var/folders/wv/42dn23fd1cb0czpvqdnb6zw00000gn/T/ipykernel_13833/1039123812.py in <module>
      8 
      9 mydata['Smoker_numerical'] = tips_df['smoker'].apply(lambda x: myfunction(x))
---> 10 mydata = mydata.astype({'Smoker_numerical': 'int32'})
     11 mydata2 = mydata.groupby('time')['Smoker_numerical'].sum()
     12 

~/opt/miniconda3/lib/python3.9/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors)
   5876             if self.ndim == 1:  # i.e. Series
   5877                 if len(dtype) > 1 or self.name not in dtype:
-> 5878                     raise KeyError(
   5879                         "Only the Series name can be used for "
   5880                         "the key in Series dtype mappings."

KeyError: 'Only the Series name can be used for the key in Series dtype mappings.'

Is there a way to fix this, or is there another way to determine the number of smokers by time? Thanks.

IIUC, use agg:

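# Turn 'smoker' into a boolean so that summing it counts the 'Yes' rows,
# then aggregate both columns per 'time' in a single pass.
result = (
    df.assign(smoker=df['smoker'] == 'Yes')
      .groupby('time', as_index=False)
      .agg(Records=('size', 'sum'), Smokers=('smoker', 'sum'))
)
print(result)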

# Output
     time  Records  Smokers
0  Dinner       14        0
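
As for the error itself: the traceback goes through the `self.ndim == 1` branch of `astype`, which suggests that `mydata` was a Series rather than a DataFrame when the dict-style `astype` call ran. How `mydata` was created is not shown in the question, so the sketch below assumes it was a single selected column. It uses only the five sample rows from the question, reproduces the KeyError under that assumption, and then builds the same Records/Smokers table from a DataFrame copy with `pd.concat`. The names `raised`, `records`, and `smokers` are introduced here for illustration.

```python
import pandas as pd

# The five sample rows shown in the question.
tips_df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
    "sex": ["Female", "Male", "Male", "Male", "Female"],
    "smoker": ["No", "No", "No", "No", "No"],
    "day": ["Sunday"] * 5,
    "time": ["Dinner"] * 5,
    "size": [2, 3, 3, 2, 4],
})

# Assumption: mydata was a Series. A Series only accepts a dtype mapping
# keyed by its own name, so a different key raises KeyError.
raised = False
try:
    tips_df["smoker"].astype({"Smoker_numerical": "int32"})
except KeyError:
    raised = True

# Starting from a DataFrame copy avoids the problem; no helper function needed.
mydata = tips_df.copy()
mydata["Smoker_numerical"] = mydata["smoker"].eq("Yes").astype("int32")

# Per-time totals, renamed to match the expected output columns.
records = mydata.groupby("time")["size"].sum().rename("Records")
smokers = mydata.groupby("time")["Smoker_numerical"].sum().rename("Smokers")

# Align the two Series on their shared 'time' index.
result = pd.concat([records, smokers], axis=1)
print(result)
```

With only Dinner rows in the sample, the result has a single row. On the full data, Lunch would appear as a second index entry.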

Update

How do I get the smokers by time?

>>> df.groupby('smoker', as_index=False)['time'].count() 
  smoker  time
0     No     5
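
If the goal is the number of smokers within each time slot, rather than the number of rows per smoker value, one option is to group a boolean mask by 'time', or to cross-tabulate the two columns. This is a minimal sketch. The rows below are made up for illustration and are not taken from the question's data, and the names `smokers_by_time` and `table` are hypothetical.

```python
import pandas as pd

# Hypothetical rows for illustration only; not the question's data.
df = pd.DataFrame({
    "time": ["Lunch", "Lunch", "Dinner", "Dinner", "Dinner"],
    "smoker": ["Yes", "No", "Yes", "Yes", "No"],
})

# Boolean mask of smokers, summed within each time slot.
smokers_by_time = df["smoker"].eq("Yes").groupby(df["time"]).sum()
print(smokers_by_time)

# Cross-tabulation: one row per time, one column per smoker value.
table = pd.crosstab(df["time"], df["smoker"])
print(table)
```

The grouped mask gives just the smoker counts, while `pd.crosstab` shows the 'Yes' and 'No' counts side by side.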