Why does this scrape stop after the first iteration?

Question

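My code visits a page where each row may or may not have a drop-down containing more information.

I use a try/except statement to check for this.

The first row works fine, but the second one does not. Why?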

import requests
from bs4 import BeautifulSoup as bs
import re
import pandas as pd

gg=[]
r = requests.get('https://library.iaslc.org/conference-program?product_id=24&author=&category=&date=&session_type=&session=&presentation=&keyword=&available=&cme=&page=2')
soup = bs(r.text, 'lxml')
sessions = soup.select('#accordin > ul > li')

for session in sessions:
    jj=(session.select_one('h4').text)
    print(jj)
    sub_session = session.select('.sub_accordin_presentation')
    try:
        if sub_session:
            kk=([re.sub(r'[\n\s]+', ' ', i.text) for i in sub_session])
            print(kk)
    except:
        kk=' '
    dict={"Title":jj,"Sub":kk}
    gg.append(dict)

df=pd.DataFrame(gg)
df.to_csv('test2.csv')

Answer 1

To get all sessions and their sub-sessions, try:

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

r = requests.get(
    "https://library.iaslc.org/conference-program?product_id=24&author=&category=&date=&session_type=&session=&presentation=&keyword=&available=&cme=&page=2"
)
soup = bs(r.text, "lxml")
sessions = soup.select("#accordin > ul > li")

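gg = []
for session in sessions:
    # Session title, with inner whitespace collapsed to single spaces
    jj = session.h4.get_text(strip=True, separator=" ")
    sub_sessions = session.select(".sub_accordin_presentation")

    if sub_sessions:
        # One output row per sub-session, repeating the session title
        for sub_session in sub_sessions:
            gg.append(
                {
                    "Title": jj,
                    "Sub": sub_session.h4.get_text(strip=True, separator=" "),
                }
            )
    else:
        # No drop-down for this session: still emit a row, with a placeholder
        gg.append(
            {
                "Title": jj,
                "Sub": "None",
            }
        )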


df = pd.DataFrame(gg)
df.to_csv("data.csv", index=False)
print(df)

Prints:

                                                                                                                                                                                                    Title                                                                                                                                                      Sub
0                                                                                            IS05 - Industry Symposium Sponsored by Amgen: Advancing Lung Cancer Treatment with Novel Therapeutic Targets                                                                                                                                                     None
1                                 IS06 - Industry Symposium Sponsored by Jazz Pharmaceuticals: Exploring a Treatment Option for Patients with Previously Treated Metastatic Small Cell Lung Cancer (SCLC)                                                                                                                                                     None
2                                                                                      IS07 - Satellite CME Symposium by Sanofi Genzyme: On the Frontline: Immunotherapeutic Approaches in Advanced NSCLC                                                                                                                                                     None
3                                                                                             PL02A - Plenary 2: Presidential Symposium (Rebroadcast) (Japanese, Mandarin, Spanish Translation Available)                          PL02A.01 - Durvalumab ± Tremelimumab + Chemotherapy as First-line Treatment for mNSCLC: Results from the Phase 3 POSEIDON Study
4                                                                                             PL02A - Plenary 2: Presidential Symposium (Rebroadcast) (Japanese, Mandarin, Spanish Translation Available)                                                                                                                                    PL02A.02 - Discussant
5                                                                                             PL02A - Plenary 2: Presidential Symposium (Rebroadcast) (Japanese, Mandarin, Spanish Translation Available)                              PL02A.03 - Lurbinectedin/doxorubicin versus CAV or Topotecan in Relapsed SCLC Patients: Phase III Randomized ATLANTIS Trial

...

It also writes the same rows to data.csv.
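As for why the original loop stops: judging from the posted code, kk is only assigned inside `if sub_session:`. An empty result from select() is simply falsy and raises nothing, so the except branch never runs, and the line that builds the dict sits outside the try block. If the first session has no sub-sessions (as the first rows of the output above suggest), kk does not exist yet and the script most likely dies with a NameError right after printing the first title. The offline sketch below reproduces that with made-up data; the SESSIONS list and both function names are hypothetical stand-ins, not anything taken from the page.

```python
# Hypothetical stand-in for the parsed page: (title, list of sub-session titles).
# The first entry has no sub-sessions, mirroring the first rows of the output.
SESSIONS = [
    ("IS05 - Industry Symposium", []),
    ("PL02A - Plenary 2", ["PL02A.01 - Talk", "PL02A.02 - Discussant"]),
]


def collect_like_question(sessions):
    """Mirror the question's loop: `sub` is bound only when sub-sessions exist."""
    rows = []
    for title, subs in sessions:
        try:
            if subs:
                sub = list(subs)
        except Exception:
            # Never reached: an empty list is falsy, it does not raise.
            sub = " "
        # This line is outside the try block, so a failure here is not caught.
        rows.append({"Title": title, "Sub": sub})
    return rows


def collect_fixed(sessions):
    """Emit a row for every session, with a placeholder when there are no subs."""
    rows = []
    for title, subs in sessions:
        if subs:
            rows.extend({"Title": title, "Sub": s} for s in subs)
        else:
            rows.append({"Title": title, "Sub": "None"})
    return rows


try:
    collect_like_question(SESSIONS)
    failure = None
except NameError as exc:
    # Inside a function the error is UnboundLocalError, a subclass of NameError;
    # at module level, as in the question, it would be a plain NameError.
    failure = type(exc).__name__

fixed_rows = collect_fixed(SESSIONS)
print("question-style loop failed with:", failure)
print("rows from fixed loop:", len(fixed_rows))
```

The if/else version avoids the problem because every path through the loop body appends a complete row, so nothing depends on a variable left over (or missing) from an earlier iteration.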

Tags: beautifulsoup, request, css-selectors, web-scraping, python-re