Create new column using str.contains and based on if-else condition
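I have a list of names, joined into a single string 'pattern', that I want to match against the strings in the column 'url_text'. If there is a match (True), the matching name should be written to a new column 'pol_name_block'; if there is none (False), the row should be left empty.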
pattern = '|'.join(pol_names_list)
print(pattern)
'Jon Kyl|Doug Jones|Tim Kaine|Lindsey Graham|Cory Booker|Kamala Harris|Orrin Hatch|Bernie Sanders|Thom Tillis|Jerry Moran|Shelly Moore Capito|Maggie Hassan|Tom Carper|Martin Heinrich|Steve Daines|Pat Toomey|Todd Young|Bill Nelson|John Barrasso|Chris Murphy|Mike Rounds|Mike Crapo|John Thune|John. McCain|Susan Collins|Patty Murray|Dianne Feinstein|Claire McCaskill|Lamar Alexander|Jack Reed|Chuck Grassley|Catherine Masto|Pat Roberts|Ben Cardin|Dean Heller|Ron Wyden|Dick Durbin|Jeanne Shaheen|Tammy Duckworth|Sheldon Whitehouse|Tom Cotton|Sherrod Brown|Bob Corker|Tom Udall|Mitch McConnell|James Lankford|Ted Cruz|Mike Enzi|Gary Peters|Jeff Flake|Johnny Isakson|Jim Inhofe|Lindsey Graham|Marco Rubio|Angus King|Kirsten Gillibrand|Bob Casey|Chris Van Hollen|Thad Cochran|Richard Burr|Rob Portman|Jon Tester|Bob Menendez|John Boozman|Mazie Hirono|Joe Manchin|Deb Fischer|Michael Bennet|Debbie Stabenow|Ben Sasse|Brian Schatz|Jim Risch|Mike Lee|Elizabeth Warren|Richard Blumenthal|David Perdue|Al Franken|Bill Cassidy|Cory Gardner|Lisa Murkowski|Maria Cantwell|Tammy Baldwin|Joe Donnelly|Roger Wicker|Amy Klobuchar|Joel Heitkamp|Joni Ernst|Chris Coons|Mark Warner|John Cornyn|Ron Johnson|Patrick Leahy|Chuck Schumer|John Kennedy|Jeff Merkley|Roy Blunt|Richard Shelby|John Hoeven|Rand Paul|Dan Sullivan|Tim Scott|Ed Markey'
I am using df['url_text'].str.contains(pattern), which returns True if a name from 'pattern' appears in a row of the column 'url_text' and False otherwise. Building on that, I tried the following code:
df['pol_name_block'] = df.apply(
lambda row: pol_names_list if df['url_text'].str.contains(pattern) in row['url_text'] else ' ',
axis=1
)
I get the error:
TypeError: 'in <string>' requires string as left operand, not Series
Starting from this toy dataframe:
>>> import pandas as pd
>>> from io import StringIO
>>> df = pd.read_csv(StringIO("""
... id,url_text
... 1,Tim Kaine
... 2,Tim Kain
... 3,Tim
... 4,Lindsey Graham.com
... """), sep=',')
>>> df
id url_text
0 1 Tim Kaine
1 2 Tim Kain
2 3 Tim
3 4 Lindsey Graham.com
From pol_names_list, we build patterns by joining the names with | and wrapping the result in a capture group:
patterns = '(%s)' % '|'.join(pol_names_list)
Then we can use the extract method to fill the pol_name_block column and get the expected result:
df['pol_name_block'] = df['url_text'].str.extract(patterns)
Output:
id url_text pol_name_block
0 1 Tim Kaine Tim Kaine
1 2 Tim Kain NaN
2 3 Tim NaN
3 4 Lindsey Graham.com Lindsey Graham
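Two details of the question are not covered by the output above: rows without a match come back as NaN rather than empty, and a name such as 'John. McCain' contains a dot, which a regex treats as "any character". A minimal sketch of one way to handle both, assuming a shortened, hypothetical pol_names_list and the toy dataframe from this answer:

```python
import re

import pandas as pd

# Hypothetical short list standing in for the full pol_names_list.
pol_names_list = ["Tim Kaine", "Lindsey Graham", "John. McCain"]

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "url_text": ["Tim Kaine", "Tim Kain", "Tim", "Lindsey Graham.com"],
})

# re.escape makes regex metacharacters such as "." match literally.
patterns = "(" + "|".join(re.escape(name) for name in pol_names_list) + ")"

# expand=False returns a Series for a single capture group;
# fillna("") leaves rows without a match blank instead of NaN.
df["pol_name_block"] = (
    df["url_text"].str.extract(patterns, expand=False).fillna("")
)
print(df)
```

Whether to escape depends on whether the entries in the list are meant as literal names or as regex fragments; the sketch assumes literal names.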
Change your pattern so that it is wrapped in a capture group () and use extract:
pattern = fr"({'|'.join(pol_names_list)})"
df['pol_name_block'] = df['url_text'].str.extract(pattern)
print(df)
# Output <- with the sample of @tlentali
id url_text pol_name_block
0 1 Tim Kaine Tim Kaine
1 2 Tim Kain NaN
2 3 Tim NaN
3 4 Lindsey Graham Lindsey Graham
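Important: extract returns only the first match, even if a row contains several. If you want all matches, you have to use findall or extractall (only the output format changes).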
# New sample, same pattern
>>> df
id url_text
0 1 Tim Kaine and Lindsey Graham
1 2 Tim Kain
2 3 Tim
3 4 Lindsey Graham
# findall
>>> df['url_text'].str.findall(pattern)
0 [Tim Kaine, Lindsey Graham]
1 []
2 []
3 [Lindsey Graham]
Name: url_text, dtype: object
# extractall
>>> df['url_text'].str.extractall(pattern)
0
match
0 0 Tim Kaine
1 Lindsey Graham
3 0 Lindsey Graham
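If all matches should end up in a single string column rather than in lists, the findall result can be joined per row. A minimal sketch, assuming a shortened, hypothetical pol_names_list and the new sample above; the ", " separator is an arbitrary choice:

```python
import pandas as pd

# Hypothetical short list standing in for the full pol_names_list.
pol_names_list = ["Tim Kaine", "Lindsey Graham"]

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "url_text": [
        "Tim Kaine and Lindsey Graham",
        "Tim Kain",
        "Tim",
        "Lindsey Graham",
    ],
})

pattern = "(" + "|".join(pol_names_list) + ")"

# findall gives one list of matches per row; str.join concatenates each
# list into a single string, so an empty list becomes an empty string.
df["pol_name_block"] = df["url_text"].str.findall(pattern).str.join(", ")
print(df)
```

Rows without any match end up as empty strings here, which also matches the "leave the row empty" requirement from the question.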