Removing URLs from a data-frame column with target="_blank" tags
I want to remove URLs from a column in a data frame. The column I am interested in is called comment, and a sample entry in comment is:
|comment|
|:---|
|"""Drone Strikes Up 432 Percent Under. Donald Trump"" by Joe Wolverton, II, J.D. <a href=""https://www.thenewamerican.com/usnews/foreign-policy/item/25604-drone-strikes-up-432-percent-under-donald-trump"" title=""https://www.thenewamerican.com/usnews/foreign-policy/item/25604-drone-strikes-up-432-percent-under-donald-trump"" target=""_blank"">https://www.thenewamerican.com/usnews/foreign-policy/item/25604-drone-st...</a><br/>""Trump is weighing major escalation in Yemen's devastating war<br/>The war has already killed at least 10,000, displaced 3 million, and. left millions more at risk of famine."" <br/>"|
The entry above shows the problem I am trying to solve. I want to remove this completely:
<a href=""https://www.thenewamerican.com/usnews/foreign-policy/item/25604-drone-strikes-up-432-percent-under-donald-trump"" title=""https://www.thenewamerican.com/usnews/foreign-policy/item/25604-drone-strikes-up-432-percent-under-donald-trump"" target=""_blank"">https://www.thenewamerican.com/usnews/foreign-policy/item/25604-drone-st...</a>
I tried:
df['comment'] = df['comment'].replace(r'https\S+', ' ', regex=True).replace(r'www\S+', ' ', regex=True).replace(r'http\S+', ' ', regex=True)
However, I am left with this:
href title targetblank com
You can do the replacement with a regular expression, using re.sub.
For example:
import re
s = """Drone Strikes Up 432 Percent Under. Donald Trump"" by Joe Wolverton, II, J.D. <a href=""https://www.thenewamerican.com/usnews/foreign-policy/item/25604-drone-strikes-up-432-percent-under-donald-trump"" title=""https://www.thenewamerican.com/usnews/foreign-policy/item/25604-drone-strikes-up-432-percent-under-donald-trump"" target=""_blank"">https://www.thenewamerican.com/usnews/foreign-policy/item/25604-drone-st...</a><br/>""Trump is weighing major escalation in Yemen's devastating war<br/>The war has already killed at least 10,000, displaced 3 million, and. left millions more at risk of famine."" <br/>"""
# Raw string so that \s reaches the regex engine unchanged; "/" needs no escaping in Python.
print(re.sub(r'<a\s[^>]*>.*?</a>', '', s))
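To make it clearer what the pattern is doing, here is a minimal sketch that spells out a variant of it with `re.VERBOSE`. The name `ANCHOR` and the sample string are made up for illustration and are not from the question's data.

```python
import re

# A variant of the pattern above, written with re.VERBOSE so each part can be commented.
ANCHOR = re.compile(
    r"""
    <a\s        # start of an anchor tag: "<a" followed by whitespace
    [^>]*>      # the attributes, up to the end of the opening tag
    .*?         # the link text, matched lazily so it stops at the first closing tag
    </a>        # the closing tag
    """,
    re.VERBOSE,
)

# Hypothetical short sample, not taken from the question.
sample = 'See <a href="https://example.com" target="_blank">https://example.com</a><br/>for details.'
print(ANCHOR.sub("", sample))
```

Because `.*?` is lazy, two separate links in one comment should be removed one at a time rather than swallowing the text between them.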
In your case, you can use .apply to achieve this:
import re
import pandas as pd
# Triple quotes keep the doubled quotes ("") from the raw data inside the string.
df = pd.DataFrame({'comment': ["""Drone Strikes Up 432 Percent Under. Donald Trump"" by Joe Wolverton, II, J.D. <a href=""https://www.thenewamerican.com/usnews/foreign-policy/item/25604-drone-strikes-up-432-percent-under-donald-trump"" title=""https://www.thenewamerican.com/usnews/foreign-policy/item/25604-drone-strikes-up-432-percent-under-donald-trump"" target=""_blank"">https://www.thenewamerican.com/usnews/foreign-policy/item/25604-drone-st...</a><br/>""Trump is weighing major escalation in Yemen's devastating war<br/>The war has already killed at least 10,000, displaced 3 million, and. left millions more at risk of famine."" <br/>"""]})
# Remove every <a ...>...</a> element from each comment.
df['comment'] = df['comment'].apply(lambda x: re.sub(r'<a\s[^>]*>.*?</a>', '', x))
print(df)
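One thing to watch with `.apply` is that `re.sub` raises a `TypeError` on non-string values, which can happen if some comments are missing. A minimal sketch of a guard, assuming missing comments show up as `None` or `NaN`; `strip_anchors` is a hypothetical helper name, and a plain list stands in for the column so the snippet runs without pandas:

```python
import re

ANCHOR = re.compile(r"<a\s[^>]*>.*?</a>")

def strip_anchors(value):
    """Remove <a ...>...</a> elements from a string; return other values unchanged."""
    if not isinstance(value, str):
        return value  # e.g. None or float('nan') for a missing comment
    return ANCHOR.sub("", value)

# Hypothetical rows standing in for a column with a missing value.
rows = ['Read <a href="https://example.com">this</a> now', None, "no link here"]
cleaned = [strip_anchors(r) for r in rows]
print(cleaned)
```

The same function could then be passed to the column, as in `df['comment'].apply(strip_anchors)`.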
Try:
df['comment'] = df['comment'].str.replace(r'<a\s[^>]*>.*?</a>', '', regex=True)
Output:
>>> df.loc[0, 'comment']
'Drone Strikes Up 432 Percent Under. Donald Trump"" by Joe Wolverton, II, J.D. <br/>""Trump is weighing major escalation in Yemen\'s devastating war<br/>The war has already killed at least 10,000, displaced 3 million, and. left millions more at risk of famine."" <br/>'
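If some comments contain anchors that span several lines or use uppercase tags, the plain pattern may miss them. Here is a small sketch of a variant compiled with `re.DOTALL` and `re.IGNORECASE`; the sample string is invented for illustration, and whether your data needs these flags is an assumption to check.

```python
import re

# re.DOTALL lets "." cross newlines; re.IGNORECASE also matches <A ...>...</A>.
ANCHOR = re.compile(r"<a\s[^>]*>.*?</a>", re.DOTALL | re.IGNORECASE)

# Hypothetical sample with an uppercase tag broken across lines.
sample = 'Before <A HREF="https://example.com"\n   TARGET="_blank">link\ntext</A> after'
print(ANCHOR.sub("", sample))
```

As far as I know, a compiled pattern like this can also be passed straight to pandas, for example `df['comment'].str.replace(ANCHOR, '', regex=True)`.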