Extracting specified data from multiple URL links in Python using bs4

I need help extracting data from a main URL that redirects to a sub-URL, which is the page I actually need to scrape.

Main URL = "https://www.ncbi.nlm.nih.gov/gene/{gene_id}"
Sub URL = "https://www.ncbi.nlm.nih.gov/gene/{unique_gene_id_from_remote_side}"

The user sets the variable gene_id to the gene symbols they need [e.g. APOE, SLC7A11].

[For example, main_url = https://www.ncbi.nlm.nih.gov/gene/?term=APOE. This link leads to a sub-link containing the ID I need, sub_url = https://www.ncbi.nlm.nih.gov/gene/348, and from that page I only need to extract the summary section.]

I can get as far as the second URL, but I cannot extract the href tags or the summary from it.

The code I tried:

import requests
from bs4 import BeautifulSoup

gen_ids = ['APOE','SLC7A11']

for gen in gen_ids:
    url = f"https://www.ncbi.nlm.nih.gov/gene/?term={gen}"
    print(url)
    r = requests.get(url)
    html_doc = r.text
    soup = BeautifulSoup(html_doc, 'lxml')
    x = soup.find('div', class_='panel')

    h = soup.find('h4', class_='ncbi-doc-title')
    h1 = [a['href'] for a in h.find_all('a')]

    print(h)
    print(h1)

You can try it like this.

This prints the summary of every sub-link.

import requests
from bs4 import BeautifulSoup

base_url = 'https://www.ncbi.nlm.nih.gov'
gen_ids = ['APOE', 'SLC7A11']

for gen in gen_ids:
    # Search page for the gene symbol; it lists the matching gene records.
    url = f"{base_url}/gene/?term={gen}"
    print(url)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')

    # Each result row holds the relative link to its gene page in a 'gene-name-id' cell.
    cells = soup.find_all('td', class_='gene-name-id')
    links = [base_url + cell.find('a')['href'] for cell in cells]

    for link in links:
        page = BeautifulSoup(requests.get(link).text, 'lxml')
        summary = page.find('div', class_='rprt-section gene-summary')
        # Skip pages without a summary section instead of failing on None.
        if summary is not None:
            print(list(summary.stripped_strings))

Output:

https://www.ncbi.nlm.nih.gov/gene/?term=APOE
['Summary', 'Go to the top of the page', 'Help', 'Official\n                         Symbol', 'APOE', 'provided by', 'HGNC', 'Official\n                         Full Name', 'apolipoprotein E', 'provided by', 'HGNC', 'Primary source', 'HGNC:HGNC:613', 'See related', 'Ensembl:ENSG00000130203', 'MIM:107741', 'Gene type', 'protein coding', 'RefSeq status', 'REVIEWED', 'Organism', 'Homo sapiens', 'Lineage', 'Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae; Homo', 'Also known as', 'AD2; LPG; APO-E; ApoE4; LDLCQ5', 'Summary', 'The protein encoded by this gene is a major apoprotein of the chylomicron. It binds to a specific liver and peripheral cell receptor, and is essential for the normal catabolism of triglyceride-rich lipoprotein constituents. This gene maps to chromosome 19 in a cluster with the related apolipoprotein C1 and C2 genes. Mutations in this gene result in familial dysbetalipoproteinemia, or type III hyperlipoproteinemia (HLP III), in which increased plasma cholesterol and triglycerides are the consequence of impaired clearance of chylomicron and VLDL remnants. [provided by RefSeq, Jun 2016]', 'Expression', 'Biased expression in liver (RPKM 1021.7), kidney (RPKM 648.1) and 7 other tissues', 'See more', 'Orthologs', 'mouse', 'all', 'NEW', 'Try the new', 'Gene table', 'Try the new', 'Transcript table']


https://www.ncbi.nlm.nih.gov/gene/?term=SLC7A11
['Summary', 'Go to the top of the page', 'Help', 'Official\n                         Symbol', 'SLC7A11', 'provided by', 'HGNC', 'Official\n                         Full Name', 'solute carrier family 7 member 11', 'provided by', 'HGNC', 'Primary source', 'HGNC:HGNC:11059', 'See related', 'Ensembl:ENSG00000151012', 'MIM:607933', 'Gene type', 'protein coding', 'RefSeq status', 'VALIDATED', 'Organism', 'Homo sapiens', 'Lineage', 'Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae; Homo', 'Also known as', 'xCT; CCBR1', 'Summary', 'This gene encodes a member of a heteromeric, sodium-independent, anionic amino acid transport system that is highly specific for cysteine and glutamate. In this system, designated Xc(-), the anionic form of cysteine is transported in exchange for glutamate. This protein has been identified as the predominant mediator of Kaposi sarcoma-associated herpesvirus fusion and entry permissiveness into cells. Also, increased expression of this gene in primary gliomas (compared to normal brain tissue) was associated with increased glutamate secretion via the XCT channels, resulting in neuronal cell death. [provided by RefSeq, Sep 2011]', 'Expression', 'Biased expression in brain (RPKM 12.7), thyroid (RPKM 4.9) and 8 other tissues', 'See more', 'Orthologs', 'mouse', 'all', 'NEW', 'Try the new', 'Gene table', 'Try the new', 'Transcript table']
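
The lists above contain every string in the summary section, while the question only asks for the summary paragraph itself. Here is a minimal sketch of how that paragraph could be picked out of such a list. The helper name `extract_summary` is hypothetical and not part of bs4. It assumes the layout seen in the output: the first string is the section heading 'Summary', and a later 'Summary' label is followed directly by the paragraph.

```python
def extract_summary(strings):
    """Return the string that follows the last 'Summary' label, or None.

    Assumption: index 0 is the section heading (also 'Summary'), so it is
    skipped; only a later 'Summary' label counts as the field label.
    """
    # Walk backwards so the field label is found before the heading.
    for index in range(len(strings) - 2, 0, -1):
        if strings[index].strip() == "Summary":
            return strings[index + 1]
    return None


# Shortened sample in the same shape as the output above.
sample = [
    'Summary', 'Go to the top of the page', 'Help',
    'Official\n                         Symbol', 'APOE',
    'Also known as', 'AD2; LPG; APO-E; ApoE4; LDLCQ5',
    'Summary',
    'The protein encoded by this gene is a major apoprotein of the chylomicron.',
    'Expression', 'See more',
]
print(extract_summary(sample))
```

Inside the loop, this could replace the final print with `print(extract_summary(list(summary.stripped_strings)))`. If NCBI changes the page layout, the assumption about the label order may no longer hold.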