Python: Create a data frame from an existing data frame
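I have the following data frame with the columns "total_bill", "tip", "sex", "smoker", "day", "time" and "size". The row values for "smoker" can be "Yes" or "No". The row values for "time" can be "Lunch" or "Dinner".
Given data frame: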
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sunday Dinner 2
1 10.34 1.66 Male No Sunday Dinner 3
2 21.01 3.50 Male No Sunday Dinner 3
3 23.68 3.31 Male No Sunday Dinner 2
4 24.59 3.61 Female No Sunday Dinner 4
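I need to create the following:
- A data frame with the total size (count) for Dinner and Lunch
- A data frame with the number of smokers by time (Lunch: number of smokers, Dinner: number of smokers)
- A merge of the two data frames above
Expected output, where "?" stands for the derived values: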
| | Records | Smokers |
|---|---|---|
| Lunch | ? | ? |
| Dinner | ? | ? |
My code:
# Number of Time (Lunch/Dinner) records
df1 = tips_df.groupby('time')['size'].sum()
def myfunction(x):
    if x == 'Yes':
        return 1
    else:
        return 0
#Number of Smoker records by Time (Lunch/Dinner)
mydata['Smoker_numerical'] = tips_df['smoker'].apply(lambda x: myfunction(x))
mydata = mydata.astype({'Smoker_numerical': 'int32'})
mydata2 = mydata.groupby('time')['Smoker_numerical'].sum()
result = concat([df1, mydata2], axis=1)
result
My code for determining the number of smokers by time produces the following error message.
KeyError Traceback (most recent call last)
/var/folders/wv/42dn23fd1cb0czpvqdnb6zw00000gn/T/ipykernel_13833/1039123812.py in <module>
8
9 mydata['Smoker_numerical'] = tips_df['smoker'].apply(lambda x: myfunction(x))
---> 10 mydata = mydata.astype({'Smoker_numerical': 'int32'})
11 mydata2 = mydata.groupby('time')['Smoker_numerical'].sum()
12
~/opt/miniconda3/lib/python3.9/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors)
5876 if self.ndim == 1: # i.e. Series
5877 if len(dtype) > 1 or self.name not in dtype:
-> 5878 raise KeyError(
5879 "Only the Series name can be used for "
5880 "the key in Series dtype mappings."
KeyError: 'Only the Series name can be used for the key in Series dtype mappings.'
Is there a way to fix this, or is there another way to determine the number of smokers by time? Thanks.
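The traceback takes the `self.ndim == 1` branch, which suggests `mydata` is a Series and not a DataFrame. The question does not show how `mydata` was created, so the sketch below assumes a Series as a stand-in to reproduce the error, then shows the same conversion on a DataFrame. The sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for `mydata`: a Series, not a DataFrame.
mydata = pd.Series(["No", "Yes", "No"], name="smoker")

try:
    # A dict passed to Series.astype may only use the Series' own name as its key.
    mydata.astype({"Smoker_numerical": "int32"})
    error_message = None
except KeyError as exc:
    error_message = str(exc)

print(error_message)

# On a DataFrame the new column can be created and typed directly.
tips_df = pd.DataFrame({
    "smoker": ["No", "Yes", "No"],
    "time": ["Dinner", "Lunch", "Lunch"],
})
tips_df["Smoker_numerical"] = (tips_df["smoker"] == "Yes").astype("int32")
smokers_by_time = tips_df.groupby("time")["Smoker_numerical"].sum()
print(smokers_by_time)
```

If that assumption holds, keeping everything on `tips_df` avoids the problem, and the helper function plus `apply` is not needed.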
IIUC, use agg:
result = df.assign(smoker=df['smoker'] == 'Yes').groupby('time', as_index=False) \
.agg(Records=('size', 'sum'), Smokers=('smoker', 'sum'))
print(result)
# Output
time Records Smokers
0 Dinner 14 0
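The five rows shown in the question are all Dinner and non-smoker, so the output above has a single row and zero smokers. As a rough sketch of what the same named aggregation gives when both meal times are present, here it is on a small made-up sample (the rows below are hypothetical, not taken from the question):

```python
import pandas as pd

# Hypothetical sample rows covering both meal times and both smoker values.
df = pd.DataFrame({
    "smoker": ["No", "Yes", "Yes", "No", "Yes"],
    "time": ["Dinner", "Dinner", "Lunch", "Lunch", "Lunch"],
    "size": [2, 3, 4, 1, 2],
})

result = (
    df.assign(smoker=df["smoker"] == "Yes")  # booleans sum as 1/0
      .groupby("time")
      .agg(Records=("size", "sum"), Smokers=("smoker", "sum"))
)
print(result)
```

Leaving out `as_index=False` keeps `time` as the index, which matches the layout of the expected output in the question.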
Update
How do I get the smokers by time?
>>> df.groupby('smoker', as_index=False)['time'].count()
smoker time
0 No 5
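Grouping by `smoker` counts rows per smoker value across all times. If the goal is a count per time, one possible alternative is `pd.crosstab`, sketched here on made-up rows (hypothetical data, not from the question):

```python
import pandas as pd

# Hypothetical sample rows covering both meal times and both smoker values.
df = pd.DataFrame({
    "smoker": ["No", "Yes", "Yes", "No", "Yes"],
    "time": ["Dinner", "Dinner", "Lunch", "Lunch", "Lunch"],
})

# Rows are the time values, columns are the smoker values, cells are row counts.
counts = pd.crosstab(df["time"], df["smoker"])
print(counts)

# The "Yes" column alone is the number of smokers per time.
smokers = counts["Yes"]
print(smokers)
```

Unlike filtering on `smoker == 'Yes'` before grouping, this keeps a time in the result even when it has no smokers, as long as "Yes" occurs somewhere in the data.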