How can I scrape only the content of the <ol> that follows a specific <b>?

I want to scrape the journal information from this page: https://www.fed.cuhk.edu.hk/cri/faculty/prof-bai-barry/

However, when I try the following, it scrapes everything, including sections such as Selected ResearchDevelopment Projects and other ol lists that I do not want.

for ol in soup.select(".ar-faculty-section-content ol li"):

It returns the li items of every ol inside the elements with the class ".ar-faculty-section-content".

What I expect is to get only the ol content under the <b> Refereed Journal articles </b> tag. How can I do that?

Use CSS selectors to find the element by its text, then use the adjacent sibling combinator (+) to select the ol that immediately follows it, and its li children:

for e in soup.select('b:-soup-contains("Refereed Journal articles") + ol li'):
    print(e.text)
Example
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get('https://www.fed.cuhk.edu.hk/cri/faculty/prof-bai-barry/', headers=headers)

# Name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(r.text, 'html.parser')

for e in soup.select('b:-soup-contains("Refereed Journal articles") + ol li'):
    print(e.text)
Output
Wang,  J., & Bai, B., (2022). Whose goal emphases play a more important role in  ESL/EFL learners’ motivation, self-regulated learning and achievement?:  Teachers’ or parents’. Research Papers in  Education, 1-22. doi:10.1080/02671522.2022.2030395. 
Guo, W. J., Lau, K. L., Wei, J., & Bai, B.  (2021). Academic subject and gender differences in high school students’  self-regulated learning of language and mathematics. Current Psychology, 1-16. doi: 10.1007/s12144-021-02120-9. 
Guo, W. J., Bai, B, & Song, H. (2021). Influences of process-based  instruction on students’ use of self-regulated learning strategies in EFL  writing. System, 1-11. doi: 10.1016/j.system.2021.102578. 
...
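
To see why the adjacent sibling combinator narrows the result, here is a minimal offline sketch. The HTML snippet is a made-up stand-in for the real page (its headings and list items are assumptions, not the site's actual markup), and `:-soup-contains` needs a reasonably recent soupsieve (2.1 or later).

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the page described in the question.
html = """
<div class="ar-faculty-section-content">
  <b>Refereed Journal articles</b>
  <ol><li>Journal A</li><li>Journal B</li></ol>
  <b>Selected Research Projects</b>
  <ol><li>Project X</li></ol>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# The selector from the question: matches the li of every ol in the section.
all_items = [li.get_text(strip=True)
             for li in soup.select(".ar-faculty-section-content ol li")]

# b + ol: only the ol that is the next sibling element of the matching b.
journal_items = [li.get_text(strip=True)
                 for li in soup.select('b:-soup-contains("Refereed Journal articles") + ol li')]

print(all_items)
print(journal_items)
```

The whitespace between `</b>` and `<ol>` does not matter here, because the combinator only considers element siblings, not text nodes.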

EDIT

In fact, not all of these pages have a <b> with the text Refereed Journal articles, so on those pages the approach above ends up with an empty list.

If the goal is to get all publications, you could instead select the <h5> with the text Selected Publications and use the general sibling combinator (~) combined with :nth-of-type():

for e in soup.select('h5:-soup-contains("Selected Publications") ~ ol:nth-of-type(1) li'):
    print(e.text)
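
One way to handle both page layouts is to try the two selectors in order and keep the first non-empty result. The sketch below is only an illustration: `extract_publications`, `SELECTORS`, and the sample HTML are hypothetical names and data, not part of the original site or of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# The two selectors from this answer, tried in order of preference.
SELECTORS = [
    'b:-soup-contains("Refereed Journal articles") + ol li',
    'h5:-soup-contains("Selected Publications") ~ ol:nth-of-type(1) li',
]


def extract_publications(html):
    """Return the li texts for the first selector that matches anything."""
    soup = BeautifulSoup(html, "html.parser")
    for selector in SELECTORS:
        items = [li.get_text(" ", strip=True) for li in soup.select(selector)]
        if items:
            return items
    return []  # neither layout was found


# Hypothetical page that has no <b>Refereed Journal articles</b> heading.
page_without_b = """
<div class="ar-faculty-section-content">
  <h5>Selected Publications</h5>
  <ol><li>Pub 1</li><li>Pub 2</li></ol>
  <h5>Selected Research Projects</h5>
  <ol><li>Project X</li></ol>
</div>
"""

print(extract_publications(page_without_b))
```

Note that :nth-of-type(1) counts every ol under the same parent, not only those after the h5, so the second selector assumes the publications list is the first ol in its container.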