Python 3 Pandas: Select DataFrame rows using startswith with multiple (OR) conditions
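I'm looking for the correct syntax for str.startswith when I want to test multiple conditions.
The working code I have only returns offices that start with the letter "N":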
new_df = df[df['Office'].str.startswith("N", na=False)]
I'm looking for code that returns offices that start with the letter "N", "M", "V", or "R". The following doesn't seem to work:
new_df = df[df['Office'].str.startswith("N|M|V|R", na=False)]
What am I missing? Thanks!
Try this:
df[df['Office'].str.contains("^(?:N|M|V|R)")]
Or:
df[df['Office'].str.contains("^[NMVR]+")]
Demo:
In [91]: df
Out[91]:
Office
0 No-No
1 AAAA
2 MicroHard
3 Valley
4 vvvvv
5 zzzzzzzzzz
6 Risk is fun
In [92]: df[df['Office'].str.contains("^(?:N|M|V|R)")]
Out[92]:
Office
0 No-No
2 MicroHard
3 Valley
6 Risk is fun
In [93]: df[df['Office'].str.contains("^[NMVR]+")]
Out[93]:
Office
0 No-No
2 MicroHard
3 Valley
6 Risk is fun
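As a minimal sketch of why the original attempt returned nothing: str.startswith compares its argument literally rather than as a regular expression, whereas str.contains interprets the pattern as a regex by default. The sample data below is made up for illustration (including the odd "N|M|V|R Ltd" entry and a missing value), and na=False is added so that NaN becomes False in the mask.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, including a missing value and a string
# that literally begins with "N|M|V|R".
df = pd.DataFrame(
    {"Office": ["No-No", "AAAA", "MicroHard", np.nan, "Valley", "N|M|V|R Ltd"]}
)

# startswith treats the pattern as a plain string, so only the value
# that really begins with the characters "N|M|V|R" matches.
literal = df["Office"].str.startswith("N|M|V|R", na=False)

# contains treats the pattern as a regex; "^" anchors it to the start.
# na=False maps NaN to False so the result can be used as a boolean mask.
regex = df["Office"].str.contains("^[NMVR]", na=False)

print(df[literal])
print(df[regex])
```

Note that the trailing "+" in "^[NMVR]+" does not change which rows match; a single leading character is enough to satisfy the pattern.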
The startswith method accepts a string or a tuple as its first argument:
# Option 1
new_df = df[df['Office'].str.startswith(('N','M','V','R'), na=False)]
Example:
import numpy as np
import pandas as pd
df = pd.DataFrame(data=[np.nan, 'Austria', 'Norway', 'Madagascar', 'Romania', 'Spain', 'Uruguay', 'Yemen'], columns=['Office'])
print(df)
print(df.Office.str.startswith(('N','M','V','R'), na=False))
Output:
Office
0 NaN
1 Austria
2 Norway
3 Madagascar
4 Romania
5 Spain
6 Uruguay
7 Yemen
0 False
1 False
2 True
3 True
4 True
5 False
6 False
7 False
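Using that boolean Series as a mask gives the filtered rows. The following is a minimal sketch built on the same sample data; the names prefixes, mask and new_df are only illustrative. Whether a tuple is accepted depends on the pandas version, so it is worth checking the documentation for the release you use.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=[np.nan, "Austria", "Norway", "Madagascar", "Romania",
          "Spain", "Uruguay", "Yemen"],
    columns=["Office"],
)

# A tuple of prefixes behaves like a logical OR over the prefixes.
prefixes = ("N", "M", "V", "R")

# na=False turns the missing value into False instead of NaN,
# so the mask is safe to use for boolean indexing.
mask = df["Office"].str.startswith(prefixes, na=False)

new_df = df[mask]
print(new_df)
```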
The other options, pointed out by @MaxU, are:
# Option 2
df[df['Office'].str.contains("^(?:N|M|V|R)")]
# Option 3
df[df['Office'].str.contains("^[NMVR]+")]
Performance (non-exhaustive test):
from datetime import datetime
n = 100000
# Option 1: startswith with a tuple of prefixes
start_time = datetime.now()
for i in range(n):
    df['Office'].str.startswith(('N','M','V','R'), na=False)
print("Option 1: ", datetime.now() - start_time)
# Option 2: regex with a non-capturing alternation group
start_time = datetime.now()
for i in range(n):
    df['Office'].str.contains("^(?:N|M|V|R)", na=False)
print("Option 2: ", datetime.now() - start_time)
# Option 3: regex with a character class
start_time = datetime.now()
for i in range(n):
    df['Office'].str.contains("^[NMVR]+", na=False)
print("Option 3: ", datetime.now() - start_time)
Results:
Option 1: 0:00:22.952533
Option 2: 0:00:23.502708
Option 3: 0:00:23.733182
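Final choice: the time differences are small, but since its syntax is simpler and it was marginally faster in this test, I would choose Option 1.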
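For a somewhat more repeatable comparison, the standard library's timeit module can replace the manual datetime arithmetic. This is only a sketch: the run count of 1000 is arbitrary, the candidates and timings dictionaries are illustrative names, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=[np.nan, "Austria", "Norway", "Madagascar", "Romania",
          "Spain", "Uruguay", "Yemen"],
    columns=["Office"],
)

# The three options from the answers, wrapped as zero-argument callables.
candidates = {
    "Option 1": lambda: df["Office"].str.startswith(("N", "M", "V", "R"), na=False),
    "Option 2": lambda: df["Office"].str.contains("^(?:N|M|V|R)", na=False),
    "Option 3": lambda: df["Office"].str.contains("^[NMVR]+", na=False),
}

# Check that all options produce the same mask before comparing speed.
expected = candidates["Option 1"]().tolist()
for name, func in candidates.items():
    assert func().tolist() == expected, name

# Time each option over the same number of runs.
timings = {name: timeit.timeit(func, number=1000) for name, func in candidates.items()}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.3f} s for 1000 runs")
```

With a frame this small, most of the time goes to per-call overhead rather than the matching itself, so a larger, more representative column would be needed before drawing firm conclusions.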