How to get paragraphs in Scrapy (Python)?

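I need to extract paragraph text from some websites, for example this sample page, using Scrapy. The screenshot shows the structure. Below is the code.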

class ArticlesSpider(scrapy.Spider):
    name = 'articles'
    start_urls = ['https://www.globenewswire.com/news-release/2022/05/05/2437159/0/en/ORYZON-Reports-Results-and-Corporate-Update-for-Quarter-Ended-March-31-2022.html']
    
    def parse(self, response):
        
        article = testScrapyItem()
        article['title'] = response.css('h1.article-headline::text').get()
        article['sub_title'] = response.css('h2.article-sub-headline::text').get()
        article['publish_date'] = response.css('time::text').get()
        article['body'] = response.css('div.main-body-container').getall()
        yield article

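I have no problem with title, sub_title and publish_date. For body, however, I cannot extract the text: I get the content with all of its HTML tags. What I want is a list containing all the paragraphs.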

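You can use XPath. The first expression returns the text nodes inside each p element of the article body; the second returns every text node in the article body: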

article['body'] = response.xpath('//*[@class="main-body-container article-body"]//p//text()').getall()

or

article['body'] = response.xpath('//*[@class="main-body-container article-body"]//text()').getall()

Working example:

from scrapy.crawler import CrawlerProcess
import scrapy


class ArticlesSpider(scrapy.Spider):
    name = 'articles'
    start_urls = ['https://www.globenewswire.com/news-release/2022/05/05/2437159/0/en/ORYZON-Reports-Results-and-Corporate-Update-for-Quarter-Ended-March-31-2022.html']

    def parse(self, response):
        # Every text node inside the article body.
        text_nodes = response.xpath('//*[@class="main-body-container article-body"]//text()').getall()
        yield {
            # Alternative: a list with only the text nodes inside <p> elements.
            #'body': response.xpath('//*[@class="main-body-container article-body"]//p//text()').getall()
            # Join the nodes into one string and drop non-breaking spaces.
            'body': ''.join(text_nodes).replace('\xa0', '').strip()
        }


if __name__ == "__main__":
    # CrawlerProcess takes optional settings; the spider class is passed to crawl().
    process = CrawlerProcess()
    process.crawl(ArticlesSpider)
    process.start()

Output:

{'body': "MADRID, Spain and CAMBRIDGE, Mass., May  05, 2022  (GLOBE NEWSWIRE) -- Oryzon Genomics, S.A. (ISIN Code: ES0167733015, ORY), a clinical-stage biopharmaceutical company leveraging epigenetics to develop therapies in diseases with strong unmet medical need, today reported financial results for the first quarter of 2022 and provided an update on recent developments.      Dr Carlos Buesa, Oryzon’s Chief Executive Officer, said: “We continued to make strong 
progress on our clinical pipeline this quarter. In oncology, we obtained approval from the U.S. Food and Drug Administration (FDA) for our Investigational New Drug application (IND) for FRIDA, a Phase Ib trial with iadademstat in combination with gilteritinib in relapsed/refractory acute myeloid leukemia patients harboring a FMS-like tyrosine kinase mutation”.      “In CNS, we also made strong progress. We are actively recruiting patients into the Phase IIb PORTICO 
trial with vafidemstat in Borderline Personality Disorder (BPD) in the USA, Spain, Germany, Bulgaria and Serbia. Vafidemstat’s Phase IIb trial in schizophrenia, called EVOLUTION, has continued to enroll patients. Furthermore, we are working with the most prestigious key opinion leaders in the space to finalize the design of HOPE, the first randomized Phase I/II personalized medicine trial with an LSD1 inhibitor, in Kabuki Syndrome (KS) patients, which we expect to start in the first half of 2022. Oryzon has further strengthened its presence in the US with the appointment of Douglas V. Faller, MD, PhD as Global Chief Medical Officer, and Dr Ana Limón 
as SVP in Clinical Development and Global Medical Affairs, both based in Boston. We finished this first quarter with a solid cash position of $28.0 million, which provides funding for further development of our exciting pipeline until 1H2023.”      First Quarter and Recent Highlights       Iadademstat in oncology:      Oryzon received notification from the U.S. FDA that its IND for iadademstat is approved to initiate FRIDA, a Phase Ib clinical trial in patients with relapsed/refractory (R/R) Acute Myeloid Leukemia (AML) harboring a FMS-like tyrosine kinase mutation (FLT3mut+). FRIDA is an open-label, multicenter study of iadademstat plus gilteritinib for the treatment of patients with R/R AML with FLT3·mutations. The primary objectives are to evaluate the safety and tolerability of iadademstat in combination with gilteritinib in patients with FLT3mut+ R/R AML and to establish the Recommended Phase 2 Dose (RP2D) for this combination. Secondary objectives include evaluation of the treatment efficacy, measured as the rate of complete remission and complete remission with partial hematological recovery (CR/CRh), the Duration of Responses (DoR) and the assessment of Measurable Residual Disease. The study will accrue up to approximately 45 patients and if successful, the Company and FDA have agreed to hold a meeting to discuss the best plan to further develop this combination in this much in need AML population.The Phase II ALICE trial, investigating iadademstat in combination with azacitidine in AML, is fully enrolled, with a total of 36 patients. Preliminary data corresponding to the 36 months of the study were presented at the ASH 2021 congress last December, showing robust signs of clinical efficacy, with ORR of 78%, of which 62% were CR/CRi, as well as a good safety profile for the combination of iadademstat and azacitidine. The duration of observed responses was very encouraging, with 77% of the CR/CRi lasting over 6 months. 
The longest remission at the data cut-off date for ASH-2021 was over 1,000 days, with the patient being transfusion independent and MRD-negative. The company plans to present a new clinical update on ALICE at the EHA-2022 congress and final data at ASH-2022.New trials in combination in solid tumors are under preparation. In small cell lung cancer (SCLC), the STELLAR trial is in preparation. STELLAR is a randomized, multicenter Phase Ib/II study of iadademstat plus a checkpoint inhibitor in first line extensive disease SCLC. The company believes that STELLAR could potentially support an application for accelerated approval. In addition, the company is preparing a Phase Ib/II basket trial of iadademstat in combination with synergistic agents in platinum R/R SCLC and extrapulmonary high grade neuroendocrine tumors (NET). Both trials will be conducted in the US.      Vafidemstat in large multifactorial CNS indications:      Following receipt of 
approval for the Serbian arm of the PORTICO Phase IIb clinical trial with vafidemstat in patients with BPD, the deployment phase of this trial was completed and patients are actively being recruited in Europe and the US. PORTICO is a multicenter, double-blind, randomized, placebo-controlled Phase IIb to evaluate the efficacy and safety of vafidemstat in BPD patients. The trial has two independent primary objectives: reduction of aggression/agitation and overall BPD 
improvement. The study will include 156 patients, with 78 patients in each arm, and has a pre-defined interim analysis to adjust the sample size in case of excessive variability around the endpoints or an unexpectedly high placebo rate. The trial will be conducted in 15-20 sites in Europe and US.The EVOLUTION Phase IIb clinical trial with vafidemstat in patients with schizophrenia has continued to enroll patients. This Phase IIb study aims to evaluate the efficacy of vafidemstat on negative symptoms and cognitive impairment in patients with schizophrenia. This project is partially financed with public funds from the Spanish Ministry of Science and Innovation and will be carried out in various Spanish hospitals.      Vafidemstat in monogenic CNS indications      We are finalizing the preparation of a new precision medicine trial in KS 
patients. This Phase I/II trial, named HOPE, will be a multicenter, multi-arm, randomized, double-blind and placebo-controlled trial to explore the safety and efficacy of vafidemstat in improving several impairments described in KS patients. The trial plans to enroll 50-60 patients and will be carried out in children older than 12 years and in young adults. The company expects to start HOPE in the first half of 2022 in several hospitals and sites in the United States and, possibly, in Europe. Considering the FDA and EMA precedents in rare diseases and CNS disorders, we believe that if the HOPE trial demonstrates relevant clinical improvements, it may potentially serve as the basis for accelerated approval in the EU and the United States.Our precision medicine programs in psychiatric disease continue to progress. We have collaborations in autism with researchers at the Seaver Autism Center for Research and Treatment at Icahn School of Medicine at Mount Sinai Hospital in New York and the Institute of Medical and Molecular Genetics (INGEMM) at Hospital Universitario La Paz of Madrid and in schizophrenia with researchers from Columbia University in New York. The results of the ongoing pilot studies to characterize patients with specific mutations to inform subsequent precision psychiatry clinical trials with vafidemstat are expected to conclude in 2022.      Financial Update: First Quarter 2022 Financial Results      Research and development (R&D) expenses were $4.2 million for the first quarter ended March 31, 2022, compared to $4.3 million for the first quarter ended March 
31, 2021.      General and administrative expenses were $1.3 million for the first quarter ended March 31, 2022, at the same level for the first quarter ended March 31, 2021.      Net losses were $1.7 million for the first quarter ended March 31, 2022, compared to net losses of $2.0 million for the first quarter ended March 31, 2021. The result is as expected, given the biotechnology business model where companies in the development phase typically have a long-term 
maturation period for products, and do not have recurrent income.      Negative net result of 
$1.8 million (-$0.03 per share) for the first quarter ended March 31, 2022, compared to a negative net result of $2.1 million (- $0.04 per share) for the first quarter ended March 31, 2021.      Cash, cash equivalents and marketable securities totaled $28.0 million as of March 31, 
2022.   ORYZON GENOMICS, S.A.BALANCE SHEET DATA (UNAUDITED)1(Amounts in thousands US $)March 31st, 2022March 31st, 2021Cash and cash equivalents28,02845,157Marketable securities00Total Assets103,462111,872Deferred revenue9410Total Stockholders' equity77,29686,896      ORYZON GENOMICS, S.A.STATEMENTS OF OPERATIONS (UNAUDITED)1(US $, amounts in thousands except per share data)Three Months EndedMarch 31st20222021Collaboration Revenue00Operating expenses:Research and Development4,2284,278General and administrative1,3431,302Total operating expenses5,5715,580Loss 
from Operations-5,571-5,580Other income, net3,8263,536Net Loss-1,745-2,044Net Financial & Tax-67-89Net Result-1,812-2,133Loss per share allocable to common stockholders:Basic-0.03-0.04Diluted-0.03-0.04Weighted average Shares outstandingBasic52,761,55452,761,554Diluted52,761,55452,761,5541Spanish GAAPExchange Euro/Dollar (1.1101 for 2022 and 1.1725 in 2021)   About OryzonFounded in 2000 in Barcelona, Spain, Oryzon (ISIN Code: ES0167733015) is a clinical stage biopharmaceutical company considered as the European leader in epigenetics. Oryzon has one of the strongest portfolios in the field, with two LSD1 inhibitors, iadademstat and vafidemstat, in Phase II clinical trials, and other pipeline assets directed against other epigenetic targets. In 
addition, Oryzon has a strong platform for biomarker identification and target validation for 
a variety of malignant and neurological diseases. For more information, visit www.oryzon.com  
    About Iadademstat Iadademstat (ORY-1001) is a small oral molecule, which acts as a highly 
selective inhibitor of the epigenetic enzyme LSD1 and has a powerful differentiating effect in hematologic cancers (see Maes et al., Cancer Cell 2018 Mar 12; 33 (3): 495-511.e12.doi: 10.1016 / j.ccell.2018.02.002.). A FiM Phase I/IIa clinical trial with iadademstat in R/R AML patients demonstrated the safety and good tolerability of the drug and preliminary signs of antileukemic activity, including a CRi (see Salamero et al, J Clin Oncol, 2020, 38(36): 4260-4273. doi: 10.1200/JCO.19.03250). In an ongoing Phase IIa trial in elder 1L-AML patients (ALICE trial), iadademstat has shown encouraging safety and efficacy data in combination with azacitidine (see Salamero et al., ASH 2021 poster). The company has recently obtained approval from the U.S. FDA for its IND for FRIDA, a Phase Ib trial of iadademstat plus gilteritinib in patients with relapsed/refractory AML with FLT3 mutations. Beyond hematological cancers, the inhibition of LSD1 has been proposed as a valid therapeutic approach in some solid tumors such as small cell lung cancer (SCLC), neuroendocrine tumors (NET), medulloblastoma and others. In a Phase IIa 
trial in combination with platinum/etoposide in second line ED-SCLC patients (CLEPSIDRA trial), preliminary activity and safety results have been reported (see Navarro et al., ESMO 2018 poster). New trials in combination in SCLC and NET are under preparation. In total iadademstat has been dosed so far to more than 100 cancer patients in four clinical trials.      About Vafidemstat Vafidemstat (ORY-2001) is an oral, CNS optimized LSD1 inhibitor. The molecule acts on 
several levels: it reduces cognitive impairment, including memory loss and neuroinflammation, 
and at the same time has neuroprotective effects. In animal studies vafidemstat not only restores memory but reduces the exacerbated aggressiveness of SAMP8 mice, a model for accelerated aging and Alzheimer’s disease (AD), to normal levels and also reduces social avoidance and enhances sociability in murine models. In addition, vafidemstat exhibits fast, strong and durable 
efficacy in several preclinical models of multiple sclerosis (MS). Oryzon has performed two Phase IIa clinical trials in aggressiveness in patients with different psychiatric disorders (REIMAGINE) and in aggressive/agitated patients with moderate or severe AD (REIMAGINE-AD), with positive clinical results reported in both. Additional finalized Phase IIa clinical trials with vafidemstat include the ETHERAL trial in patients with Mild to Moderate AD, where a significant reduction of the inflammatory biomarker YKL40 has been observed after 6 and 12 months of treatment, and the pilot, small scale SATEEN trial in Relapse-Remitting and Secondary Progressive MS, where antiinflammatory activity has also been observed. Vafidemstat has also been tested in a Phase II in severe Covid-19 patients (ESCAPE) assessing the capability of the drug to prevent ARDS, one of the most severe complications of the viral infection, where it showed significant anti-inflammatory effects in severe Covid-19 patients. Currently, vafidemstat is in two Phase IIb trials in borderline personality disorder (PORTICO) and in schizophrenia patients (EVOLUTION). The company is also deploying a CNS precision medicine approach with vafidemstat in genetically-defined patient subpopulations of certain CNS disorders and is preparing a clinical trial in Kabuki Syndrome patients that is expected to start in 1H 2022. The company is also exploring the clinical development of vafidemstat in other neurodevelopmental syndromes.    
  FORWARD-LOOKING STATEMENTS This communication contains, or may contain, forward-looking information and statements about Oryzon, including financial projections and estimates and their underlying assumptions, statements regarding plans, objectives and expectations with respect to future operations, capital expenditures, synergies, products and services, and statements regarding future performance. Forward-looking statements are statements that are not historical facts and are generally identified by the words “expects,” “anticipates,” “believes,” “intends,” “estimates” and similar expressions. Although Oryzon believes that the expectations reflected in such forward-looking statements are reasonable, investors and holders of Oryzon shares are cautioned that forward-looking information and statements are subject to various risks and uncertainties, many of which are difficult to predict and generally beyond the control of Oryzon that could cause actual results and developments to differ materially from those expressed in, or implied or projected by, the forward-looking information and statements. These risks and uncertainties include those discussed or identified in the documents sent by Oryzon to the Spanish Comisión Nacional del Mercado de Valores (CNMV), which are accessible to the public. Forward-looking statements are not guarantees of future performance and have not been reviewed by the auditors of Oryzon. You are cautioned not to place undue reliance on the forward-looking 
statements, which speak only as of the date they were made. All subsequent oral or written forward-looking statements attributable to Oryzon or any of its members, directors, officers, employees or any persons acting on its behalf are expressly qualified in their entirety by the cautionary statement above. All forward-looking statements included herein are based on information available to Oryzon on the date hereof. Except as required by applicable law, Oryzon does 
not undertake any obligation to publicly update or revise any forward‐looking statements, whether as a result of new information, future events or otherwise. This press release is not an offer of securities for sale in the United States or any other jurisdiction. Oryzon’s securities may not be offered or sold in the United States absent registration or an exemption from registration. Any public offering of Oryzon’s securities to be made in the United States will be 
made by means of a prospectus that may be obtained from Oryzon or the selling security holder, as applicable, that will contain detailed information about Oryzon and management, as well as financial statements.   IR, USIR & Media, EuropeSpainOryzonAshley R. RobinsonSandya von der WeidPatricia CoboSaikat NandiLifeSci Advisors, LLCLifeSci Advisors, LLC/ Carlos C. UngríaChief 
Business Officer+1 617 430 7577+41 78 680 05 38+34 91 564 07 25+1 917 208 8293arr@lifesciadvisors.comsvonderweid@lifesciadvisors.compcobo@atrevia.comcungria@atrevia.comsnandi@oryzon.com"}
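
In the output above, text from adjacent elements runs together (for example "population.The Phase II") because the text nodes are joined with an empty string. A minimal sketch of one alternative, assuming the input is the list of strings returned by .getall(): normalize the whitespace in each node, drop the empty ones, and join with newlines. The sample list and the clean_text_nodes name are made up for illustration.

```python
def clean_text_nodes(nodes):
    """Normalize whitespace in each text node and drop empty nodes."""
    cleaned = []
    for node in nodes:
        # split() with no arguments treats non-breaking spaces (\xa0)
        # as whitespace, so they become ordinary single spaces here.
        text = ' '.join(node.split())
        if text:
            cleaned.append(text)
    return cleaned


# Hypothetical list, shaped like what .getall() returns for //text().
nodes = ['First sentence.', '\n   ', 'Second\xa0sentence.', '  Third. ']

# One text node per line instead of one undivided string.
body = '\n'.join(clean_text_nodes(nodes))
print(body)
```

Joining with a newline keeps the boundaries between text nodes visible; joining with a single space would be the closer equivalent of the original one-string output.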