Python web scraping PubMed abstract: "Abstract" is merged with the first word (e.g., "AbstractINTRODUCTION:")

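I am scraping abstracts from Pubmed.gov. I can get the text I need, but the word "Abstract" ends up fused with the first word of the abstract. Here is a sample abstract: https://www.ncbi.nlm.nih.gov/pubmed/30470520

For example, the first word becomes "AbstractBACKGROUND:".

The problem is that the fused word can be "AbstractBACKGROUND", "AbstractINTRODUCTION", or some other word I can't predict, although it always starts with "Abstract". If the fused word were always the same, I would just run a replace command and remove the "Abstract" part.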

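I would like to either strip the word "Abstract" or have a line break between "Abstract" and the first word, like this:

Abstract

INTRODUCTION:

I know my replace command doesn't work, but I wanted to show that, as a n00b, I at least tried. Any help getting this to work is appreciated! My code is below: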

import requests
import re
from bs4 import BeautifulSoup
from urllib.request import urlopen

listofa_urls = ['https://www.ncbi.nlm.nih.gov/pubmed/30470520', 
'https://www.ncbi.nlm.nih.gov/pubmed/31063262']

for l in listofa_urls:
   response = requests.get(l)
   soup = BeautifulSoup(response.content, 'html.parser')
   x = soup.find(class_='abstr').get_text()
   x = x.replace('abstract','abstract: ')
   print(x)

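Use re.sub with the re.I (ignore case) flag. str.replace is case-sensitive, so replace('abstract', ...) never matches the capitalized "Abstract" at the start of the text.

For example: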

import re

import requests
from bs4 import BeautifulSoup

listofa_urls = ['https://www.ncbi.nlm.nih.gov/pubmed/30470520',
                'https://www.ncbi.nlm.nih.gov/pubmed/31063262']

for url in listofa_urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    x = soup.find(class_='abstr').get_text()
    print(x.encode("utf-8"))  # before: starts with "Abstract..."
    # Remove only the leading "Abstract"; re.I makes the match case-insensitive,
    # and the ^ anchor leaves any later "abstract" in the body untouched.
    x = re.sub(r"^\s*abstract", "", x, count=1, flags=re.I)
    print(x.encode("utf-8"))  # after: the heading is gone

Output:

b'AbstractBACKGROUND: The amount of insulin needed to...
b'BACKGROUND: The amount of insulin needed to ....

b'AbstractCirrhosis is morbid and increasingly prevalent - ...
b'Cirrhosis is morbid and increasingly prevalent -...
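
The question also asked for the option of a line break between "Abstract" and the first word instead of removing the heading. Here is a minimal sketch of both variants that works on plain strings, so it can be tried without hitting the network. The helper name split_abstract_label and its separator parameter are made up for illustration, and the sample strings are the truncated ones from the output above. It assumes the scraped text always begins with the heading; an abstract whose body itself starts with a word like "Abstraction" would be split as well.

```python
import re

# Matches the word "Abstract" only at the very start of the text,
# ignoring leading whitespace and letter case.
_LEADING_LABEL = re.compile(r"^\s*abstract", flags=re.IGNORECASE)


def split_abstract_label(text, separator="\n"):
    """Hypothetical helper: put `separator` between a leading "Abstract" and the body.

    Pass separator=None to drop the heading entirely.
    """
    match = _LEADING_LABEL.match(text)
    if match is None:
        return text  # no leading heading, leave the text untouched
    body = text[match.end():].lstrip()
    if separator is None:
        return body
    return "Abstract" + separator + body


samples = [
    "AbstractBACKGROUND: The amount of insulin needed to",
    "AbstractCirrhosis is morbid and increasingly prevalent",
]
for sample in samples:
    print(split_abstract_label(sample))                  # heading on its own line
    print(split_abstract_label(sample, separator=None))  # heading removed
```

In the scraping loop, x = split_abstract_label(x) would take the place of the re.sub line.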