Python web scraping PubMed abstract - "Abstract" is merged with the first word (e.g., "AbstractINTRODUCTION:")

Question

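I am scraping abstracts from Pubmed.gov. I can get the text I need, but the word "Abstract" is being joined to the first word of the abstract. Here is a sample abstract: https://www.ncbi.nlm.nih.gov/pubmed/30470520

For example, the first word becomes "AbstractBACKGROUND:"

The problem is that the text may begin with "AbstractBACKGROUND", "AbstractINTRODUCTION", or some other word that I can't predict. It does, however, always start with "Abstract". If the following word were predictable, I would simply run a replace command and strip out the "Abstract" part.

I would prefer either to remove the word "Abstract" or to have a line break between "Abstract" and the first word, like this:

Abstract

INTRODUCTION:

I know the replace command below won't work, but I wanted to show that, as a n00b, I at least gave it a try. Thanks for any help getting this to work! Here is my code: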

import requests
import re
from bs4 import BeautifulSoup
from urllib.request import urlopen

listofa_urls = ['https://www.ncbi.nlm.nih.gov/pubmed/30470520', 
'https://www.ncbi.nlm.nih.gov/pubmed/31063262']

for l in listofa_urls:
   response = requests.get(l)
   soup = BeautifulSoup(response.content, 'html.parser')
   x = soup.find(class_='abstr').get_text()
   x = x.replace('abstract','abstract: ')
   print(x)

Answer 1

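Use re.sub with the re.I (ignore case) flag. str.replace is case-sensitive, so replacing 'abstract' never matches the capitalized "Abstract" in the scraped text.

For example: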

import re

import requests
from bs4 import BeautifulSoup

listofa_urls = ['https://www.ncbi.nlm.nih.gov/pubmed/30470520',
                'https://www.ncbi.nlm.nih.gov/pubmed/31063262']

for url in listofa_urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    x = soup.find(class_='abstr').get_text()
    print(x.encode("utf-8"))  # text as scraped, label still attached
    # Strip only the leading "Abstract" label, ignoring case; count=1 and the
    # ^ anchor leave any later use of the word in the abstract untouched
    x = re.sub(r"^\s*abstract", "", x, count=1, flags=re.I)
    print(x.encode("utf-8"))

Output:

b'AbstractBACKGROUND: The amount of insulin needed to...
b'BACKGROUND: The amount of insulin needed to ....

b'AbstractCirrhosis is morbid and increasingly prevalent - ...
b'Cirrhosis is morbid and increasingly prevalent -...
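
The question also asks for the option of keeping "Abstract" on its own line instead of removing it. A minimal sketch of both variants follows, assuming the scraped text always begins with the label, as in the output above. The helper name split_abstract_label and its keep_label parameter are hypothetical and not part of the original answer, and the sample string is shortened for illustration.

```python
import re

# Hypothetical helper, not from the original answer. Assumes the scraped text
# starts with the heading word "Abstract" (optionally preceded by whitespace).
_LABEL = re.compile(r"^\s*Abstract\s*", flags=re.IGNORECASE)


def split_abstract_label(text, keep_label=False):
    """Remove the leading 'Abstract' label, or move it onto its own line."""
    replacement = "Abstract\n" if keep_label else ""
    # The ^ anchor and count=1 leave later occurrences of the word untouched.
    return _LABEL.sub(replacement, text, count=1)


# Illustrative input modeled on the scraped text shown above.
sample = "AbstractBACKGROUND: The amount of insulin needed to ..."
print(split_abstract_label(sample))                   # label removed
print(split_abstract_label(sample, keep_label=True))  # label on its own line
```

Because the pattern is anchored, text that does not start with "Abstract" is returned unchanged, so the helper should be safe to apply to every scraped abstract.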

python

regex

text

web-scraping

pubmed