pandas evaluating strings as numeric
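Suppose df is:
import numpy as np
import pandas as pd
data = {'duration': ['1week 3day 2hour 4min 23', '2hour 4min 23sec', '2hour 4min', np.nan, '', '23sec']}
df = pd.DataFrame(data)
I am trying to compute each duration as a total number of seconds. I replaced the unit names with arithmetic, like this: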
df['duration'] = df['duration'].str.replace('week', '*604800+') \
    .str.replace('day', '*86400+') \
    .str.replace('hour', '*3600+') \
    .str.replace('min', '*60+') \
    .str.replace('sec', '') \
    .str.replace(' ', '')
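But I cannot run any eval function on the result (pd.eval, apply with eval, plain eval, etc.). Some cells end with a "+" sign, and others fail because they are empty strings or NaN. Any help?
PS: This is not a duplicate question.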
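To make the failure concrete, the same substitutions can be reproduced on plain Python strings. This is only an illustrative sketch: `to_expression` is a hypothetical helper, not part of pandas, and it assumes a day is meant to be 86400 seconds.

```python
def to_expression(s):
    """Apply the question's unit substitutions to one plain string."""
    for unit, repl in [('week', '*604800+'), ('day', '*86400+'),
                       ('hour', '*3600+'), ('min', '*60+'),
                       ('sec', ''), (' ', '')]:
        s = s.replace(unit, repl)
    return s

samples = ['1week 3day 2hour 4min 23', '2hour 4min 23sec', '2hour 4min', '', '23sec']
expressions = [to_expression(s) for s in samples]
for e in expressions:
    print(repr(e))
```

The third sample becomes '2*3600+4*60+', which ends in a dangling "+", and the fourth stays an empty string. Neither is a valid arithmetic expression, so any eval-style call raises on those rows, and the NaN row is not a string at all.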
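You can use a regex with a custom function to replace weeks with 7 days and to append a seconds unit to bare numbers (you can add other units the same way). Then convert with to_timedelta: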
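def change_units(m):
    # Map each matched unit to (multiplier, unit name that to_timedelta accepts).
    d = {'week': (7, 'days'), '': (1, 's')}
    _, i, period = m.groups()
    factor, txt = d[period]
    return f'{factor*int(i)}{txt}'

# \d+ captures multi-digit counts; the callback is change_units.
df['delta'] = pd.to_timedelta(df['duration'].str.replace(r'((\d+)\s*(week|)\b)',
                                                         change_units, regex=True))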
Output:
                   duration             delta
0  1week 3day 2hour 4min 23  10 days 02:04:23
1          2hour 4min 23sec   0 days 02:04:23
2                2hour 4min   0 days 02:04:00
3                       NaN               NaT
4                                         NaT
5                     23sec   0 days 00:00:23
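To inspect what the substitution hands to to_timedelta, the same callback can be tried on plain strings with the standard `re` module, which is assumed here to behave like the regex replacement in pandas. The names `pattern`, `rewritten` and `single_digit` are illustrative only, and the digit group is written as `\d+` so that multi-digit counts are captured whole.

```python
import re

def change_units(m):
    # Same callback as in the answer: weeks become days, bare numbers get 's'.
    d = {'week': (7, 'days'), '': (1, 's')}
    _, i, period = m.groups()
    factor, txt = d[period]
    return f'{factor*int(i)}{txt}'

pattern = r'((\d+)\s*(week|)\b)'
rewritten = re.sub(pattern, change_units, '1week 3day 2hour 4min 23')
print(rewritten)      # 7days 3day 2hour 4min 23s

# With a single \d, only the last digit of a multi-digit count is converted.
single_digit = re.sub(r'((\d)\s*(week|)\b)', change_units, '12week')
print(single_digit)   # 114days
```

Only "1week" and the trailing bare "23" match, because the word boundary `\b` does not occur between a digit and a following letter such as the "d" in "3day".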
You can then take advantage of the Timedelta values, for example by converting them with total_seconds:
pd.to_timedelta(df['duration'].str.replace(r'((\d+)\s*(week|)\b)',
                                           change_units, regex=True)
                ).dt.total_seconds()
Output:
0    871463.0
1      7463.0
2      7440.0
3         NaN
4         NaN
5        23.0
Name: duration, dtype: float64
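As a quick sanity check on the first row, the arithmetic can be reproduced with the standard library alone. This sketch uses `datetime.timedelta` rather than pandas, and the variable names are illustrative.

```python
from datetime import timedelta

# 1 week + 3 days = 10 days, plus 2 hours, 4 minutes and 23 seconds.
delta = timedelta(days=10, hours=2, minutes=4, seconds=23)
by_hand = 10 * 86400 + 2 * 3600 + 4 * 60 + 23

print(delta.total_seconds())  # 871463.0
print(by_hand)                # 871463
```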
I have a different approach.
I wrote a function that converts the strings to seconds:
def convert_all(s):
    if not isinstance(s, str):
        # E.g. for np.nan
        return s
    return sum(convert_part(part) for part in s.split())

def convert_part(part):
    """Convert an individual segment into seconds.
    >>> convert_part('1day')
    86400.0
    """
    if part.isnumeric():
        # A bare number is already a count of seconds.
        return float(part)
    in_seconds = {'week': 7*24*60*60, 'day': 24*60*60, 'hour': 60*60, 'min': 60, 'sec': 1}
    for k, v in in_seconds.items():
        if part.endswith(k):
            # Slice the unit off the end; str.strip(k) would strip a set of characters instead.
            return float(part[:-len(k)])*v
    # No unit matched. Handle the error here - just printing for now.
    print(part)
    return 0.0
Then you can use Series.apply:
df['duration_sec'] = df['duration'].apply(convert_all)
Output:
                   duration  duration_sec
0  1week 3day 2hour 4min 23      871463.0
1          2hour 4min 23sec        7463.0
2                2hour 4min        7440.0
3                       NaN           NaN
4                                     0.0
5                     23sec          23.0
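If silently returning 0.0 for an unknown segment is not wanted, a stricter variant could validate each segment with a regex and raise instead. This is a sketch under that assumption, not the answer's method: `UNIT_SECONDS`, `TOKEN` and `to_seconds` are hypothetical names.

```python
import re

# Seconds per unit; the empty unit stands for a bare number of seconds.
UNIT_SECONDS = {'week': 604800, 'day': 86400, 'hour': 3600, 'min': 60, 'sec': 1, '': 1}
TOKEN = re.compile(r'(\d+)(week|day|hour|min|sec|)')

def to_seconds(s):
    """Sum the segments of a duration string, raising on unknown segments."""
    if not isinstance(s, str):
        return s  # pass NaN (or any other non-string) through unchanged
    total = 0.0
    for part in s.split():
        m = TOKEN.fullmatch(part)
        if m is None:
            raise ValueError(f'unrecognised duration segment: {part!r}')
        total += int(m.group(1)) * UNIT_SECONDS[m.group(2)]
    return total

print(to_seconds('1week 3day 2hour 4min 23'))  # 871463.0
```

It would be applied the same way, for example with df['duration'].apply(to_seconds). An empty string still yields 0.0 because it has no segments to add.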