Why does this scrape stop after the first iteration?
My code visits a page where each row may or may not have a drop-down containing more information.
I have a try/except statement to check for this.
Row 1 works fine, but row 2 doesn't?
import requests
from bs4 import BeautifulSoup as bs
import re
import pandas as pd

gg=[]
r = requests.get('https://library.iaslc.org/conference-program?product_id=24&author=&category=&date=&session_type=&session=&presentation=&keyword=&available=&cme=&page=2')
soup = bs(r.text, 'lxml')
sessions = soup.select('#accordin > ul > li')

for session in sessions:
    jj=(session.select_one('h4').text)
    print(jj)
    sub_session = session.select('.sub_accordin_presentation')
    try:
        if sub_session:
            kk=([re.sub(r'[\n\s]+', ' ', i.text) for i in sub_session])
            print(kk)
    except:
        kk=' '
    dict={"Title":jj,"Sub":kk}
    gg.append(dict)

df=pd.DataFrame(gg)
df.to_csv('test2.csv')
To get all sections + sub-sections, try:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

r = requests.get(
    "https://library.iaslc.org/conference-program?product_id=24&author=&category=&date=&session_type=&session=&presentation=&keyword=&available=&cme=&page=2"
)
soup = bs(r.text, "lxml")
sessions = soup.select("#accordin > ul > li")

gg = []
for session in sessions:
    jj = session.h4.get_text(strip=True, separator=" ")
    sub_sessions = session.select(".sub_accordin_presentation")
    if sub_sessions:
        # one output row per sub-session
        for sub_session in sub_sessions:
            gg.append(
                {
                    "Title": jj,
                    "Sub": sub_session.h4.get_text(strip=True, separator=" "),
                }
            )
    else:
        # no drop-down for this session: still emit a row
        gg.append(
            {
                "Title": jj,
                "Sub": "None",
            }
        )

df = pd.DataFrame(gg)
df.to_csv("data.csv", index=False)
print(df)
Prints:
     Title    Sub
0    IS05 - Industry Symposium Sponsored by Amgen: Advancing Lung Cancer Treatment with Novel Therapeutic Targets    None
1    IS06 - Industry Symposium Sponsored by Jazz Pharmaceuticals: Exploring a Treatment Option for Patients with Previously Treated Metastatic Small Cell Lung Cancer (SCLC)    None
2    IS07 - Satellite CME Symposium by Sanofi Genzyme: On the Frontline: Immunotherapeutic Approaches in Advanced NSCLC    None
3    PL02A - Plenary 2: Presidential Symposium (Rebroadcast) (Japanese, Mandarin, Spanish Translation Available)    PL02A.01 - Durvalumab ± Tremelimumab + Chemotherapy as First-line Treatment for mNSCLC: Results from the Phase 3 POSEIDON Study
4    PL02A - Plenary 2: Presidential Symposium (Rebroadcast) (Japanese, Mandarin, Spanish Translation Available)    PL02A.02 - Discussant
5    PL02A - Plenary 2: Presidential Symposium (Rebroadcast) (Japanese, Mandarin, Spanish Translation Available)    PL02A.03 - Lurbinectedin/doxorubicin versus CAV or Topotecan in Relapsed SCLC Patients: Phase III Randomized ATLANTIS Trial
...
and creates data.csv
(screenshot from LibreOffice):
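As for why the original loop probably dies on the first row: when a session has no drop-down, the `if sub_session:` test is simply false, so nothing is raised, the `except` branch never runs, and `kk` is never assigned. The next line that reads `kk` then fails with a `NameError`. The sketch below reproduces that pattern without any network access. The function names `collect_buggy` and `collect_fixed` and the sample `rows` are made up for illustration, and it assumes the dictionary is built inside the loop, as the question's code appears to do.

```python
def collect_buggy(rows):
    """Same control flow as the question: try/except wrapped around an if."""
    out = []
    for title, subs in rows:
        try:
            if subs:
                kk = [s.strip() for s in subs]
        except Exception:
            # Never reached for an empty list: a false `if` raises nothing.
            kk = " "
        # `kk` is unbound here when the very first row has no sub-items.
        out.append({"Title": title, "Sub": kk})
    return out


def collect_fixed(rows):
    """Explicit if/else, one output row per sub-item, as in the answer."""
    out = []
    for title, subs in rows:
        if subs:
            for sub in subs:
                out.append({"Title": title, "Sub": sub.strip()})
        else:
            out.append({"Title": title, "Sub": "None"})
    return out


# Hypothetical sample data: the first session has no sub-items.
rows = [("IS05", []), ("PL02A", [" PL02A.01 ", "PL02A.02"])]

if __name__ == "__main__":
    try:
        collect_buggy(rows)
    except NameError as exc:
        print("buggy version failed:", exc)
    print(collect_fixed(rows))
```

An `if`/`else` makes the "no sub-items" case explicit, which is why the answer's version does not need `try`/`except` at all.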