BeautifulSoup: scraping .text and splitting it automatically
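BeautifulSoup newbie here. I need to scrape the contents of the Publications tab from a URL (listed in the code sample below). I need to scrape the publications and split them into 'authors', 'title' and 'journal', so that I can then convert them to a pandas DataFrame. I tried scraping the content with the following code: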
```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

page_url = 'https://www.vet.upenn.edu/research/centers-laboratories/research-laboratory/research-laboratory/anguera-laboratory'

def get_soup(url):
    # Fetch the page through the rendering service on localhost:8050, then parse it
    r = requests.get('http://localhost:8050/render.html', params={'url': url, 'wait': 3})
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

url_soup = get_soup(page_url)
publis = url_soup.find_all('div', {'class': 'tab-content pad'})
for item in publis:
    publi = {
        'items': item.find('div', {'id': 'tabs-publications'}).text.split('\n\n')
    }

publication_df = pd.DataFrame(publi)
# Split the authors from the rest at the first ':'
new_publication_df = publication_df['items'].str.split(':', n=1, expand=True)
# Split the title from the journal at ']'
new_publication_df[[1, 2]] = new_publication_df[1].str.split(']', expand=True)
```
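This returns what I need, but it is really prone to typos (for example, one publication uses 'J' instead of ']'). Is there any way to have BeautifulSoup split the text into three columns automatically?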
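To make the fragility concrete, here is a minimal sketch of the same two-step split on two made-up entries (the strings below are assumptions, not taken from the page). The second entry has 'J' where the closing ']' should be, so the second split finds nothing to split on:

```python
import pandas as pd

# Two made-up entries in the "authors: title [PMID...] journal" shape.
# The second one has 'J' typed in place of the closing ']'.
items = pd.Series([
    "Doe J, Roe R: A well-formed title. [PMID00000001] Some journal 1(2): 3-4, Jan 2019.",
    "Doe J: A title with a typo. [PMID00000002J Another journal 5(6), Feb 2019.",
])

# Step 1: authors vs. everything else, split at the first ':'.
authors_rest = items.str.split(":", n=1, expand=True)

# Step 2: title vs. journal, split at ']'.
title_journal = authors_rest[1].str.split("]", expand=True)

# The journal column is missing for the entry with the typo.
print(title_journal[1].isna().tolist())
```

For the second entry the whole remainder stays in the title column and the journal column is empty, which is the kind of silent failure the question describes.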
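I think you should use the available structure instead of splitting strings.
Select all `<em>` elements, which hold the title; their `previous_sibling` holds the authors and their `next_sibling` holds the journal.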
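As a quick offline check of that navigation, here is a minimal sketch on a hand-written fragment. The markup is an assumption about the general shape (authors as plain text, title in `<em>`, journal as plain text), not a copy of the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: authors as text, title inside <em>, journal as text.
html = """
<div id="tabs-publications">
  <p>Doe Jane, Roe Richard: <em>A made-up title. [PMID00000001]</em> Some journal 1(2): 3-4, Jan 2019.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
em = soup.select_one("#tabs-publications em")

# The siblings of <em> are text nodes, which behave like ordinary strings.
authors = em.previous_sibling.strip().rstrip(":")
title = em.get_text(strip=True)
journal = em.next_sibling.strip()

print(authors, title, journal, sep="\n")
```

Because the three parts come from separate nodes, a typo inside the title text no longer affects where the authors or the journal begin.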
Example

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.vet.upenn.edu/research/centers-laboratories/research-laboratory/research-laboratory/anguera-laboratory')
soup = BeautifulSoup(r.text, 'html.parser')

data = []
for e in soup.select('#tabs-publications em'):
    data.append({
        # Node just before <em> holds the authors; [:-1] drops the trailing separator
        'author': e.previous.get_text(strip=True)[:-1],
        'title': e.get_text(strip=True),
        # Node right after <em> holds the journal
        'journal': e.next_sibling.get_text(strip=True)
    })

data
```
Output data
[{'author': 'Hantsoo Liisa, Kornfield Sara, Anguera Montserrat C, Epperson C Neill',
'title': 'Inflammation: A Proposed Intermediary Between\xa0Maternal Stress and Offspring Neuropsychiatric Risk. [PMID30314641]',
'journal': 'Biological psychiatry 85(2): 97-106, Jan 2019.'},
{'author': 'Sierra Isabel, Anguera Montserrat C',
'title': 'Enjoy the silence: X-chromosome inactivation diversity in somatic cells.[PMID31108425]',
'journal': 'Current opinion in genetics & development 55: 26-31, May 2019.'},
{'author': 'Syrett Camille M, Anguera Montserrat C',
'title': 'When the balance is broken: X-linked gene dosage from two X chromosomes and female-biased autoimmunity. [PMID31125996]',
'journal': 'Journal of leukocyte biology May 2019.'},
{'author': 'Kotzin Jonathan J, Iseka Fany, Wright Jasmine, Basavappa Megha G, Clark Megan L, Ali Mohammed-Alkhatim, Abdel-Hakeem Mohamed S, Robertson Tanner F, Mowel Walter K, Joannas Leonel, Neal Vanessa D, Spencer Sean P, Syrett Camille M, Anguera Montserrat C, Williams Adam, Wherry E John, Henao-Mejia Jorge',
'title': 'The long noncoding RNA regulates CD8 T cells in response to viral infection.[PMID31138702]',
'journal': 'Proceedings of the National Academy of Sciences of the United States of America 116(24): 11916-11925, Jun 2019.'},
{'author': 'Syrett Camille M, Paneru Bam, Sandoval-Heglund Donavon, Wang Jianle, Banerjee Sarmistha, Sindhava Vishal, Behrens Edward M, Atchison Michael, Anguera Montserrat C',
'title': 'Altered X-chromosome inactivation in T cells may promote sex-biased autoimmune diseases. [PMID30944248',
'journal': 'JCI insight 4(7), Apr 2019.'},...]
Output DataFrame

```python
pd.DataFrame(data)
```
author | title | journal |
---|---|---|
Hantsoo Liisa, Kornfield Sara, Anguera Montserrat C, Epperson C Neill | Inflammation: A Proposed Intermediary Between Maternal Stress and Offspring Neuropsychiatric Risk. [PMID30314641] | Biological psychiatry 85(2): 97-106, Jan 2019. |
Sierra Isabel, Anguera Montserrat C | Enjoy the silence: X-chromosome inactivation diversity in somatic cells.[PMID31108425] | Current opinion in genetics & development 55: 26-31, May 2019. |
Syrett Camille M, Anguera Montserrat C | When the balance is broken: X-linked gene dosage from two X chromosomes and female-biased autoimmunity. [PMID31125996] | Journal of leukocyte biology May 2019. |
Kotzin Jonathan J, Iseka Fany, Wright Jasmine, Basavappa Megha G, Clark Megan L, Ali Mohammed-Alkhatim, Abdel-Hakeem Mohamed S, Robertson Tanner F, Mowel Walter K, Joannas Leonel, Neal Vanessa D, Spencer Sean P, Syrett Camille M, Anguera Montserrat C, Williams Adam, Wherry E John, Henao-Mejia Jorge | The long noncoding RNA regulates CD8 T cells in response to viral infection.[PMID31138702] | Proceedings of the National Academy of Sciences of the United States of America 116(24): 11916-11925, Jun 2019. |
Syrett Camille M, Paneru Bam, Sandoval-Heglund Donavon, Wang Jianle, Banerjee Sarmistha, Sindhava Vishal, Behrens Edward M, Atchison Michael, Anguera Montserrat C | Altered X-chromosome inactivation in T cells may promote sex-biased autoimmune diseases. [PMID30944248 | JCI insight 4(7), Apr 2019. |
Syrett Camille M, Sindhava Vishal, Sierra Isabel, Dubin Aimee H, Atchison Michael, Anguera Montserrat C | Diversity of Epigenetic Features of the Inactive X-Chromosome in NK Cells, Dendritic Cells, and Macrophages. [PMID30671059] | Frontiers in immunology 9: 3087, 2018. |
Le Coz Carole, Trofa Melissa, Syrett Camille M, Martin Anna, Jyonouchi Harumi, Jyonouchi Soma, Anguera Montserrat C, Romberg Neil | CD40LG duplication-associated autoimmune disease is silenced by nonrandom X-chromosome inactivation.[PMID29499223] | The Journal of allergy and clinical immunology 141(6): 2308-2311.e7, Jun 2018. |
Syrett Camille M, Sierra Isabel, Berry Corbett L, Beiting Daniel, Anguera Montserrat C | Sex-Specific Gene Expression Differences Are Evident in Human Embryonic Stem Cells and During In Vitro Differentiation of Human Placental Progenitor Cells. [PMID29993333] | Stem cells and development 27(19): 1360-1375, Oct 2018. |
Wang Jianle, Anguera Montserrat C | In Vitro Differentiation of Human Pluripotent Stem Cells into Trophoblastic Cells. [PMID28362386] | Journal of visualized experiments : JoVE(121), Mar 2017. |
Syrett Camille M, Sindhava Vishal, Hodawadekar Suchita, Myles Arpita, Liang Guanxiang, Zhang Yue, Nandi Satabdi, Cancro Michael, Atchison Michael, Anguera Montserrat C | Loss of Xist RNA from the inactive X during B cell development is restored in a dynamic YY1-dependent two-step process in activated B cells. [PMID28991910] | PLoS genetics 13(10): e1007050, Oct 2017. |
Penkala Ian, Wang Jianle, Syrett Camille M, Goetzl Laura, López Carolina B, Anguera Montserrat C | lncRHOXF1, a Long Noncoding RNA from the X Chromosome That Suppresses Viral Response Genes during Development of the Early Human Placenta. [PMID27066803] | Molecular and cellular biology 36(12): 1764-75, Jun 2016. |
Wang Jianle, Syrett Camille M, Kramer Marianne C, Basu Arindam, Atchison Michael L, Anguera Montserrat C | Unusual maintenance of X chromosome inactivation predisposes female lymphocytes for increased expression from the inactive X. [PMID27001848] | Proceedings of the National Academy of Sciences of the United States of America 113(14): E2029-38, Apr 2016. |

...
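The output above still carries a non-breaking space (`\xa0`) in one title and a `[PMID...]` tag at the end of each title, one of them without its closing bracket. The sketch below is one possible follow-up, not part of the original answer: it guards against a missing sibling, normalises whitespace, and moves the PMID into its own column. The HTML fragment, the helper names `sibling_text` and `parse_publications`, and the `pmid` column are all hypothetical.

```python
import re

import pandas as pd
from bs4 import BeautifulSoup, NavigableString

# Hypothetical fragment imitating the scraped data, including a non-breaking
# space in a title and a PMID tag that is missing its closing bracket.
HTML = """
<div id="tabs-publications">
  <p>Doe Jane, Roe Richard: <em>A made-up\xa0title. [PMID00000001]</em> Some journal 1(2): 3-4, Jan 2019.</p>
  <p>Roe Richard: <em>Another made-up title. [PMID00000002</em> Other journal 5(6), Feb 2019.</p>
</div>
"""

# Tolerates a missing closing bracket after the digits.
PMID_RE = re.compile(r"\[PMID(\d+)\]?")


def sibling_text(node):
    """Return the whitespace-normalised text of a sibling, or '' if it is missing."""
    if node is None:
        return ""
    text = str(node) if isinstance(node, NavigableString) else node.get_text()
    return " ".join(text.split())  # str.split() also splits on non-breaking spaces


def parse_publications(html):
    """Build one row per <em> inside #tabs-publications."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for em in soup.select("#tabs-publications em"):
        raw_title = " ".join(em.get_text().split())
        match = PMID_RE.search(raw_title)
        rows.append({
            "author": sibling_text(em.previous_sibling).rstrip(":"),
            "title": PMID_RE.sub("", raw_title).strip(),
            "pmid": match.group(1) if match else None,
            "journal": sibling_text(em.next_sibling),
        })
    return pd.DataFrame(rows)


df = parse_publications(HTML)
print(df)
```

Keeping the PMID as a string preserves any leading zeros; whether a separate column is useful depends on what the DataFrame is for.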