How to scrape only the content of an <ol> based on the preceding <b>?

Question

I want to scrape the journal information from the website https://www.fed.cuhk.edu.hk/cri/faculty/prof-bai-barry/.

However, when I try the following approach, it scrapes everything, including the Selected Research and Development Projects entries and the other ol lists, which I don't want.

for ol in soup.select(".ar-faculty-section-content ol li"):

It returns every ol item inside the element with the class name ".ar-faculty-section-content".

What I expect is to get only the ol content under the <b> Refereed Journal articles </b> tag. How can I do this?

Answer 1

In your example, use CSS selectors to find the element by its text, and the adjacent sibling combinator to select the ol and its li:

for e in soup.select('b:-soup-contains("Refereed Journal articles") + ol li'):
    print(e.text)

Example

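import requests
from bs4 import BeautifulSoup

# Send a browser-like User-Agent, since some sites reject the requests default
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get('https://www.fed.cuhk.edu.hk/cri/faculty/prof-bai-barry/', headers=headers)

# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(r.text, 'html.parser')

# Select the <li> items of the <ol> that directly follows the matching <b>
for e in soup.select('b:-soup-contains("Refereed Journal articles") + ol li'):
    print(e.text)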

Output

Wang,  J., & Bai, B., (2022). Whose goal emphases play a more important role in  ESL/EFL learners’ motivation, self-regulated learning and achievement?:  Teachers’ or parents’. Research Papers in  Education, 1-22. doi:10.1080/02671522.2022.2030395. 
Guo, W. J., Lau, K. L., Wei, J., & Bai, B.  (2021). Academic subject and gender differences in high school students’  self-regulated learning of language and mathematics. Current Psychology, 1-16. doi: 10.1007/s12144-021-02120-9. 
Guo, W. J., Bai, B, & Song, H. (2021). Influences of process-based  instruction on students’ use of self-regulated learning strategies in EFL  writing. System, 1-11. doi: 10.1016/j.system.2021.102578. 
...
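
To see why the adjacent sibling combinator narrows the match, here is a minimal offline sketch. The HTML fragment is made up for illustration and only loosely mirrors the structure described in the question; it is not the real page. It compares the broad selector from the question with the one above.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment (an assumption, not the real page markup)
html = """
<div class="ar-faculty-section-content">
  <b>Refereed Journal articles</b>
  <ol>
    <li>Journal article A</li>
    <li>Journal article B</li>
  </ol>
  <b>Book chapters</b>
  <ol>
    <li>Chapter C</li>
  </ol>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Broad selector from the question: every <li> of every <ol> in the container
all_items = [
    li.get_text(strip=True)
    for li in soup.select(".ar-faculty-section-content ol li")
]

# Narrow selector: only the <ol> that is the next sibling element of the matching <b>
journal_items = [
    li.get_text(strip=True)
    for li in soup.select('b:-soup-contains("Refereed Journal articles") + ol li')
]

print(all_items)
print(journal_items)
```

The `+` combinator only looks at the next sibling element, so whitespace between the `<b>` and the `<ol>` does not matter, but any other element in between would break the match.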

Edit

In fact, not all of the pages have a <b> with the text Refereed Journal articles; on those pages the approach above ends up with an empty list.

If the goal is to get all publications, you can select the <h5> with the text Selected Publications and use the general sibling combinator in combination with nth-of-type():

for e in soup.select('h5:-soup-contains("Selected Publications") ~ ol:nth-of-type(1) li'):
    print(e.text)
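
One way to combine the two selectors is to try the narrow one first and fall back to the broader one when it comes back empty. This is only a sketch: the helper name `publication_items` and the HTML fragment are hypothetical, and the fragment is an assumption about how a page without the <b> label might look.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment without a "Refereed Journal articles" <b> (an assumption)
html = """
<div class="ar-faculty-section-content">
  <h5>Selected Publications</h5>
  <ol>
    <li>Publication A</li>
    <li>Publication B</li>
  </ol>
  <h5>Selected Research and Development Projects</h5>
  <ol>
    <li>Project C</li>
  </ol>
</div>
"""


def publication_items(soup):
    """Hypothetical helper: try the narrow selector, then the broader fallback."""
    items = soup.select('b:-soup-contains("Refereed Journal articles") + ol li')
    if not items:
        # Any later sibling <ol> of the <h5> that is also the first <ol> in its parent
        items = soup.select(
            'h5:-soup-contains("Selected Publications") ~ ol:nth-of-type(1) li'
        )
    return [li.get_text(strip=True) for li in items]


soup = BeautifulSoup(html, "html.parser")
narrow = soup.select('b:-soup-contains("Refereed Journal articles") + ol li')
result = publication_items(soup)

print(narrow)
print(result)
```

Note that nth-of-type(1) counts among all children of the parent, not relative to the <h5>, so this only works when the publications list is the first <ol> in its container.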


Tags: python, beautifulsoup, css-selectors, web-scraping
