Pandas calculate new column by splitting off optional leading non-string characters

Question

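I'm new to Pandas and am trying to add two new columns whose values are calculated from the existing 'Result' column.

The existing column contains numbers with an optional qualifier ("<", ">", "<>").

Some example values in 'Result' might be: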

0.5
12.67
3
<1
4.5
>10.0

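I would like a new 'Result_Q' column containing the non-numeric qualifier if present, otherwise NULL (None), and a new 'Result_Value' column containing the numeric part.

My first failed attempt was: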

df['Result_Q'] = df.Result.str[0] if not df.Result.str[0].isdigit() else None

This produces the error AttributeError: 'Series' object has no attribute 'isdigit'

(After researching this error I tried some other variations, which produced ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().)

Answer 1

Use Series.str.isdigit with numpy.where:

import numpy as np

df['Result_Q'] = np.where(df.Result.str[0].str.isdigit(), None, df.Result.str[0])

Alternatively, use Series.mask:

df['Result_Q'] = df.Result.str[0].mask(df.Result.str[0].str.isdigit(), None)

print (df)
  Result Result_Q
0    0.5     None
1  12.67     None
2      3     None
3     <1        <
4    4.5     None
5  >10.0        >

Or use Series.str.extract, changing NaN to None:

df['Result_Q'] = df.Result.str[0].str.extract(r'(\D)', expand=False).mask(lambda x: x.isna(), None)
print (df)
  Result Result_Q
0    0.5     None
1  12.67     None
2      3     None
3     <1        <
4    4.5     None
5  >10.0        >
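
The question also asks for a 'Result_Value' column. As a rough sketch (not part of the original answer), a single str.extract call with two named groups can fill both columns. It assumes the qualifier is one of '<', '>' or '<>' and that the remainder is a plain non-negative decimal; the regex and the extra sample row '<>2' are illustrative assumptions.

```python
import pandas as pd

# Sample data from the question, plus one hypothetical '<>' row.
df = pd.DataFrame({'Result': ['0.5', '12.67', '3', '<1', '4.5', '>10.0', '<>2']})

# Group 1: optional leading qualifier. Group 2: the numeric part.
pattern = r'^(?P<Result_Q><>|[<>])?(?P<Result_Value>\d+(?:\.\d+)?)$'
parts = df['Result'].str.extract(pattern)

# A group that did not match comes back as a missing value;
# on an object column it can be swapped for None.
qualifier = parts['Result_Q'].astype(object)
df['Result_Q'] = qualifier.where(qualifier.notna(), None)

# Convert the numeric text to floats.
df['Result_Value'] = pd.to_numeric(parts['Result_Value'])
print(df)
```

Rows that do not fit the assumed pattern at all would get a missing value in both new columns, so it may be worth checking for those before relying on the result.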

Answer 2

You can use Series.apply to create the new column:

import pandas as pd
df = pd.DataFrame({'result': ['0.5', '12.67', '<1', '4.5', '>10.0']})
df['Result_Q'] = df['result'].apply(lambda x: x[0] if not x[0].isdigit() else None)
print(df)


  result Result_Q
0    0.5     None
1  12.67     None
2     <1        <
3    4.5     None
4  >10.0        >
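
The lambda above only looks at the first character, so a two-character qualifier such as "<>" would be cut short. One possible variation, assuming every entry is a string, is to pass a small helper to apply; the name leading_qualifier and the '<>2' row are hypothetical, not from the original answer.

```python
import pandas as pd

def leading_qualifier(text):
    """Return the characters before the first digit or '.', or None if there are none."""
    prefix = ''
    for ch in text:
        if ch.isdigit() or ch == '.':
            break
        prefix += ch
    return prefix or None

df = pd.DataFrame({'Result': ['0.5', '12.67', '3', '<1', '4.5', '>10.0', '<>2']})
df['Result_Q'] = df['Result'].apply(leading_qualifier)
print(df)
```

apply calls the helper once per row in Python, so on large frames the vectorized .str methods are likely to be faster.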

Answer 3

Or try:

df['Result_Q'] = df['Result'].str.replace(r'\d+', '', regex=True).str.strip('.').replace('', np.nan)
print(df)

Output:

  Result Result_Q
0    0.5      NaN
1  12.67      NaN
2      3      NaN
3     <1        <
4    4.5      NaN
5  >10.0        >
