BeautifulSoup: scraping .text and splitting it automatically

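BeautifulSoup newbie here. I need to scrape the contents of the publications tab from a URL (listed in the code example below). I need to scrape the publications and split each one into 'authors', 'title' and 'journal', so that I can then convert the result into a pandas DataFrame. I tried scraping the content with the following code: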

import pandas as pd
import requests
from bs4 import BeautifulSoup

page_url = ''

def get_soup(url):
    # Fetch the JavaScript-rendered page through a local rendering service
    r = requests.get('http://localhost:8050/render.html', params={'url': url, 'wait': 3})
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

url_soup = get_soup(page_url)

publis = url_soup.find_all('div', {'class': 'tab-content pad'})

for item in publis:
    publi = {
        'items': item.find('div', {'id': 'tabs-publications'}).text.split('\n\n')
    }

publication_df = pd.DataFrame(publi)
# Split "authors: rest" on the first colon, then "title]journal" on the closing bracket
new_publication_df = publication_df['items'].str.split(':', n=1, expand=True)
new_publication_df[[1, 2]] = new_publication_df[1].str.split(']', expand=True)

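This returns what I need, but it is quite sensitive to typos in the source text (for example, one publication uses 'J' instead of ']'). Is there a way to have BeautifulSoup split the text into three columns automatically?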


Select all <em> elements, which hold the title. For each one, its previous_sibling holds the authors and its next_sibling holds the journal.

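import requests
from bs4 import BeautifulSoup

r = requests.get('')
soup = BeautifulSoup(r.text, 'html.parser')
data = []
for e in soup.select('#tabs-publications em'):
    # <em> is the title; the text nodes on either side are authors and journal.
    # The trimming below is an assumption, since the loop body was lost.
    data.append({
        'author': str(e.previous_sibling).strip().rstrip(':'),
        'title': e.text.strip(),
        'journal': str(e.next_sibling).strip(),
    })

print(data)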

[{'author': 'Hantsoo Liisa, Kornfield Sara, Anguera Montserrat C, Epperson C Neill',
  'title': 'Inflammation: A Proposed Intermediary Between\xa0Maternal Stress and Offspring Neuropsychiatric Risk. [PMID30314641]',
  'journal': 'Biological psychiatry 85(2): 97-106, Jan 2019.'},
 {'author': 'Sierra Isabel, Anguera Montserrat C',
  'title': 'Enjoy the silence: X-chromosome inactivation diversity in somatic cells.[PMID31108425]',
  'journal': 'Current opinion in genetics & development 55: 26-31, May 2019.'},
 {'author': 'Syrett Camille M, Anguera Montserrat C',
  'title': 'When the balance is broken: X-linked gene dosage from two X chromosomes and female-biased autoimmunity. [PMID31125996]',
  'journal': 'Journal of leukocyte biology May 2019.'},
 {'author': 'Kotzin Jonathan J, Iseka Fany, Wright Jasmine, Basavappa Megha G, Clark Megan L, Ali Mohammed-Alkhatim, Abdel-Hakeem Mohamed S, Robertson Tanner F, Mowel Walter K, Joannas Leonel, Neal Vanessa D, Spencer Sean P, Syrett Camille M, Anguera Montserrat C, Williams Adam, Wherry E John, Henao-Mejia Jorge',
  'title': 'The long noncoding RNA regulates CD8 T cells in response to viral infection.[PMID31138702]',
  'journal': 'Proceedings of the National Academy of Sciences of the United States of America 116(24): 11916-11925, Jun 2019.'},
 {'author': 'Syrett Camille M, Paneru Bam, Sandoval-Heglund Donavon, Wang Jianle, Banerjee Sarmistha, Sindhava Vishal, Behrens Edward M, Atchison Michael, Anguera Montserrat C',
  'title': 'Altered X-chromosome inactivation in T cells may promote sex-biased autoimmune diseases. [PMID30944248',
  'journal': 'JCI insight 4(7), Apr 2019.'},...]
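
To try the sibling approach without hitting the network, here is a minimal sketch that runs on an inline snippet. The HTML is made up and only assumed to resemble the real page (authors as a text node, title in <em>, journal as the following text node). The sibling_text helper is hypothetical; it guards against an <em> that has no text node next to it.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup, assumed to mimic the publications tab.
html = """
<div id="tabs-publications">
  <p>Sierra Isabel, Anguera Montserrat C: <em>Enjoy the silence: X-chromosome inactivation diversity in somatic cells.</em> Current opinion in genetics &amp; development 55: 26-31, May 2019.</p>
  <p><em>A title with no neighbouring text</em></p>
</div>
"""


def sibling_text(node):
    """Return the stripped text of a sibling, or '' if it is missing or a tag."""
    if isinstance(node, NavigableString):
        return str(node).strip()
    return ""


soup = BeautifulSoup(html, "html.parser")
data = []
for em in soup.select("#tabs-publications em"):
    data.append({
        # Drop the colon that separates the authors from the title
        "author": sibling_text(em.previous_sibling).rstrip(":").strip(),
        "title": em.get_text(strip=True),
        "journal": sibling_text(em.next_sibling),
    })

print(data)
```

Because the split follows the markup rather than punctuation, a missing ']' or an extra ':' in a title does not shift the columns.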
Passing the resulting list of dictionaries to pd.DataFrame gives a table with the columns author, title and journal.
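
As a rough sketch of that last step, the snippet below builds the DataFrame from two of the records shown in the output above. The extra pmid column is not part of the original answer; it is an optional illustration, and its regular expression only assumes that the ID appears as "PMID" followed by digits, so it still works when the closing bracket is missing.

```python
import pandas as pd

# Two records copied from the output above.
data = [
    {"author": "Sierra Isabel, Anguera Montserrat C",
     "title": "Enjoy the silence: X-chromosome inactivation diversity in somatic cells.[PMID31108425]",
     "journal": "Current opinion in genetics & development 55: 26-31, May 2019."},
    {"author": "Syrett Camille M, Paneru Bam, Sandoval-Heglund Donavon, Wang Jianle, Banerjee Sarmistha, Sindhava Vishal, Behrens Edward M, Atchison Michael, Anguera Montserrat C",
     "title": "Altered X-chromosome inactivation in T cells may promote sex-biased autoimmune diseases. [PMID30944248",
     "journal": "JCI insight 4(7), Apr 2019."},
]

# Fix the column order explicitly so it does not depend on dict ordering
publication_df = pd.DataFrame(data, columns=["author", "title", "journal"])

# Optional (assumption): pull the PubMed ID into its own column
publication_df["pmid"] = publication_df["title"].str.extract(r"PMID(\d+)", expand=False)

print(publication_df)
```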
