How to scrape only the content of an <ol> that follows a specific <b>?
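I want to scrape journal information from the website https://www.fed.cuhk.edu.hk/cri/faculty/prof-bai-barry/.
However, when I try the following approach, it scrapes everything, including the ol entries under Selected Research and Development Projects, which I do not want.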
for ol in soup.select(".ar-faculty-section-content ol li"):
It returns every ol inside the elements with the class name ".ar-faculty-section-content".
What I expect is to get only the content of the ol under the <b> Refereed Journal articles </b> tag. How can I do that?
Use CSS selectors, as in the example, to find the element by its text, then use the adjacent sibling combinator to select the ol and its li:
for e in soup.select('b:-soup-contains("Refereed Journal articles") + ol li'):
    print(e.text)
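To see how the adjacent sibling combinator narrows the match without fetching the live page, here is a minimal sketch on inline markup. The HTML in SAMPLE_HTML and the helper name list_items_after_bold are invented for illustration, and the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration; the real page may differ.
SAMPLE_HTML = """
<div class="ar-faculty-section-content">
  <b>Refereed Journal articles</b>
  <ol><li>Journal article A</li><li>Journal article B</li></ol>
  <b>Book chapters</b>
  <ol><li>Chapter C</li></ol>
</div>
"""


def list_items_after_bold(html, label):
    """Return the text of each li in the ol that directly follows a b containing label."""
    soup = BeautifulSoup(html, "html.parser")
    # "+" matches only the ol that is the next sibling element of the matched b
    selector = f'b:-soup-contains("{label}") + ol li'
    return [li.get_text(strip=True) for li in soup.select(selector)]


journal_items = list_items_after_bold(SAMPLE_HTML, "Refereed Journal articles")
print(journal_items)
```

The label is placed into the selector as-is, so a label that itself contains double quotes would need escaping first. A label with no matching b simply yields an empty list.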
Example
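import requests
from bs4 import BeautifulSoup
# Send a browser-like User-Agent header with the request
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get('https://www.fed.cuhk.edu.hk/cri/faculty/prof-bai-barry/', headers=headers)
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(r.text, 'html.parser')
# Print each li of the ol that directly follows the matching b
for e in soup.select('b:-soup-contains("Refereed Journal articles") + ol li'):
    print(e.text)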
Output
Wang, J., & Bai, B., (2022). Whose goal emphases play a more important role in ESL/EFL learners’ motivation, self-regulated learning and achievement?: Teachers’ or parents’. Research Papers in Education, 1-22. doi:10.1080/02671522.2022.2030395.
Guo, W. J., Lau, K. L., Wei, J., & Bai, B. (2021). Academic subject and gender differences in high school students’ self-regulated learning of language and mathematics. Current Psychology, 1-16. doi: 10.1007/s12144-021-02120-9.
Guo, W. J., Bai, B, & Song, H. (2021). Influences of process-based instruction on students’ use of self-regulated learning strategies in EFL writing. System, 1-11. doi: 10.1016/j.system.2021.102578.
...
EDIT
In fact, not all sites have a <b> with the text Refereed Journal articles, so the approach above ends up with an empty list.
If the goal is to get all publications, you can select the <h5> with the text Selected Publications and use the general sibling combinator in combination with nth-of-type():
for e in soup.select('h5:-soup-contains("Selected Publications") ~ ol:nth-of-type(1) li'):
    print(e.text)
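The two selectors from this answer could also be tried in order, so that pages without the <b> label fall back to the <h5> heading. The sketch below assumes a hypothetical page layout (PAGE_WITHOUT_BOLD_LABEL) and a hypothetical helper name (publication_items); neither comes from the site itself.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for a page that lacks the <b> label; invented for illustration.
PAGE_WITHOUT_BOLD_LABEL = """
<div class="ar-faculty-section-content">
  <h5>Selected Publications</h5>
  <p>Journal articles</p>
  <ol><li>Paper A</li><li>Paper B</li></ol>
  <h5>Selected Research and Development Projects</h5>
  <ol><li>Project C</li></ol>
</div>
"""

# Tried in order: the specific selector first, then the broader fallback.
SELECTORS = [
    'b:-soup-contains("Refereed Journal articles") + ol li',
    'h5:-soup-contains("Selected Publications") ~ ol:nth-of-type(1) li',
]


def publication_items(html):
    """Return li texts from the first selector that matches anything, else an empty list."""
    soup = BeautifulSoup(html, "html.parser")
    for selector in SELECTORS:
        items = [li.get_text(strip=True) for li in soup.select(selector)]
        if items:
            return items
    return []


print(publication_items(PAGE_WITHOUT_BOLD_LABEL))
```

Note that ol:nth-of-type(1) means the first ol among all children of the shared parent, so the fallback only matches when no other ol precedes the <h5> inside that parent.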