How to fill default values based on column names without a TypeError - Pandas
I have a dataframe like the one shown below
import pandas as pd
import numpy as np

obs = pd.DataFrame({'person_id': [1, 2, 3], 'obs_date': ['12/31/2007', '11/25/2009', np.nan],
                    'hero_id': [2, 4, np.nan], 'date': ['12/31/2017', np.nan, '10/06/2015'],
                    'heroine_id': [1, np.nan, 5], 'datetime': ['12/31/2027', '11/25/2029', np.nan],
                    'bud_source_value': [1250000, 250000, np.nan],
                    'prod__source_value': [10000, 20000, np.nan]})
The objective is to fill 3 kinds of default values based on the column names:
Columns ending with id - fillna with 0
Columns ending with value - fillna with np.nan (i.e. leave them as is)
Columns containing date (pandas datetime format) - fillna with 12/31/2000
When I try the approach below
col = obs.columns.str
c1 = col.endswith('id')
c2 = col.contains('value')
c3 = col.endswith('date')
c4 = col.endswith('datetime')
filled_values = np.select([c1,c2,c3,c4], [obs.fillna(0), obs.fillna(np.nan), obs.fillna("1900-01-01"), obs.fillna("1900-01-01 00:00:00")])
obs= pd.DataFrame(filled_values,columns=cols)
I get an error message like the one below
TypeError: <U10 cannot be converted to an IntegerDtype
In addition, in one of my functions above, I use a piece of code like the one below
testdf['hero_id'] = testdf['hero_id'].astype('float').astype('Int64')
I am not sure whether the above line is causing the error, and I am not sure how to reproduce/debug it. Please help me avoid this error.
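A minimal sketch of what is probably going on (my reading of the error, not confirmed in the original post): np.select promotes all of its choices and the default to one common dtype, so mixing numeric fills with date-string fills turns the whole result into a fixed-width numpy string array (a <U... dtype), and casting such string data to the nullable Int64 dtype raises exactly this kind of TypeError:

import numpy as np
import pandas as pd

# np.select promotes the choices and the default to one common dtype;
# a float choice combined with a date-string default becomes a '<U..' string array
filled = np.select([np.array([True, True, True])],
                   [np.array([1.0, 2.0, np.nan])],
                   default='1900-01-01')
print(filled.dtype)  # e.g. <U32, so every value is now text

# passing numpy string data to the nullable Int64 dtype reproduces the
# reported message (the exact wording depends on the pandas version)
try:
    pd.array(np.array(['1', '2'], dtype='<U10'), dtype='Int64')
except TypeError as exc:
    print(exc)  # e.g. <U10 cannot be converted to an IntegerDtype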
This may not be exactly what you asked for, but hopefully it helps:
import pandas as pd
import numpy as np

obs = pd.DataFrame({'person_id': [1, 2, 3], 'obs_date': ['12/31/2007', '11/25/2009', np.nan],
                    'hero_id': [2, 4, np.nan], 'date': ['12/31/2017', np.nan, '10/06/2015'],
                    'heroine_id': [1, np.nan, 5], 'datetime': ['12/31/2027', '11/25/2029', np.nan],
                    'bud_source_value': [1250000, 250000, np.nan],
                    'prod__source_value': [10000, 20000, np.nan]})

for column_name in obs.columns:
    if column_name.endswith("id"):
        obs[column_name] = obs[column_name].fillna(0)
    elif "value" in column_name:
        obs[column_name] = obs[column_name].fillna(np.nan)
    elif column_name.endswith("date"):
        obs[column_name] = obs[column_name].fillna("1900-01-01")
    elif column_name.endswith("datetime"):
        obs[column_name] = obs[column_name].fillna("1900-01-01 00:00:00")

print(obs)
Output
person_id obs_date hero_id date heroine_id datetime bud_source_value prod__source_value
0 1 12/31/2007 2.0 12/31/2017 1.0 12/31/2027 1250000.0 10000.0
1 2 11/25/2009 4.0 1900-01-01 0.0 11/25/2029 250000.0 20000.0
2 3 1900-01-01 0.0 10/06/2015 5.0 1900-01-01 00:00:00 NaN NaN
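If you also need the id columns back as nullable integers afterwards (the astype('Int64') step mentioned in the question), a possible follow-up - this is my sketch, not part of the original answer - could be:

id_cols = [c for c in obs.columns if c.endswith('id')]        # person_id, hero_id, heroine_id
obs[id_cols] = obs[id_cols].astype('float').astype('Int64')   # safe now that the NaNs are filled with 0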
Filter using the column names and the corresponding dtypes, then fillna.
Coerce the date columns to datetime:
obs['obs_date']=pd.to_datetime(obs['obs_date'])
obs['date']=pd.to_datetime(obs['date'])
obs['datetime']=pd.to_datetime(obs['datetime'])
Filter using the column names and the dtypes. That gives you separate dataframes. Join those dataframes and use them to update the obs dataframe with the combine_first method.
Code below
df = obs.combine_first(
    obs.filter(regex='_id$', axis=1).fillna('0')              # columns ending with _id, fillna with '0'
       .join(obs.select_dtypes(include=['datetime64[ns]'])
                .fillna('12/31/2000'))                        # join with the datetime64 columns, fillna with 12/31/2000
)
df = df[list(obs.columns)]  # rearrange the columns back to the original order
Output
person_id obs_date hero_id date heroine_id \
0 1 2007-12-31 00:00:00 2.0 2017-12-31 00:00:00 1.0
1 2 2009-11-25 00:00:00 4.0 12/31/2000 0
2 3 12/31/2000 0 2015-10-06 00:00:00 5.0
datetime bud_source_value prod__source_value
0 2027-12-31 00:00:00 1250000.0 10000.0
1 2029-11-25 00:00:00 250000.0 20000.0
2 12/31/2000 NaN NaN
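One caveat (my observation, not part of the original answer): filling a datetime64 column with the string '12/31/2000' turns that column into an object column mixing Timestamps and strings, which is why the filled cells print in a different format above. A pd.Timestamp fill value keeps the datetime dtype, assuming that is what you want:

date_cols = obs.select_dtypes(include=['datetime64[ns]']).columns
obs[date_cols] = obs[date_cols].fillna(pd.Timestamp('2000-12-31'))  # keeps the datetime64[ns] dtype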
The idea is to create a dictionary of column names with their replacement values and pass it to DataFrame.fillna:
c1 = dict.fromkeys(obs.filter(regex='id$').columns, 0)
c2 = dict.fromkeys(obs.filter(like='value').columns, np.nan)
c3 = dict.fromkeys(obs.filter(regex='date$').columns, "1900-01-01")
c4 = dict.fromkeys(obs.filter(regex='datetime$').columns, "1900-01-01 00:00:00")
d = {**c1, **c2, **c3, **c4}
print (d)
{'person_id': 0, 'hero_id': 0, 'heroine_id': 0,
'bud_source_value': nan, 'prod__source_value': nan,
'obs_date': '1900-01-01', 'date': '1900-01-01',
'datetime': '1900-01-01 00:00:00'}
obs = obs.fillna(d)
print (obs)
person_id obs_date hero_id date heroine_id \
0 1 12/31/2007 2.0 12/31/2017 1.0
1 2 11/25/2009 4.0 1900-01-01 0.0
2 3 1900-01-01 0.0 10/06/2015 5.0
datetime bud_source_value prod__source_value
0 12/31/2027 1250000.0 10000.0
1 11/25/2029 250000.0 20000.0
2 1900-01-01 00:00:00 NaN NaN
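A variant worth considering (my sketch, not part of the original answer): starting from the original obs, convert the date-like columns to datetime first and use pd.Timestamp fill values in the dictionary, so those columns keep the datetime64 dtype instead of mixing strings and dates:

date_like = obs.filter(regex='date').columns            # obs_date, date, datetime
obs[date_like] = obs[date_like].apply(pd.to_datetime)
d = {**dict.fromkeys(obs.filter(regex='id$').columns, 0),
     **dict.fromkeys(date_like, pd.Timestamp('1900-01-01'))}
obs = obs.fillna(d)   # the value columns are left out on purpose, so they stay NaN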