Steps for requests-html to parse more than one tag/class in Python

Question

Introduction to the problem
Language version: Python 3.8
Operating system: Windows 10
Any other relevant software: Jupyter Notebook and requests-html

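Context: I have been following a video tutorial on parsing websites with requests-html.

Problem statement:

Goal: My goal is to learn more by applying his code to more difficult websites (e.g., Stack Overflow). Using the code below, I successfully isolated the 'div' tag/class. I now intend to sort through everything tagged div on Stack Overflow's newest questions page to find 'question-summary' and somehow isolate the question ID.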

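Expected result:

I want to isolate the question ID, save the HTML page associated with that unique question, and read each question's HTML page for every recently posted question on the first 3 pages (150 questions).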

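Problem: At 17:29 in the video, he states that the tag/class he used for the selector was used only once, and that if it had been used more than once, he would "need to rework it".

I am trying to search for something related to 'id' or 'question-summary-#'. I'm not sure what I'm looking for, but I know there will be more than one. What is the next step?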

Sample result from the current code:

<Element 'div' class=('question-summary',) id='question-summary-64050283'>,

Things I have tried: Current code:

import datetime
import requests
import requests_html
from requests_html import HTML
from importlib import reload
import sys
reload(sys)

now=datetime.datetime.now()
month=now.month
day=now.day
year=now.year
hour=now.hour
minute=now.minute
second=now.second

def url_to_txt(url, filename="world.html", save=False):
    r=requests.get(url)
    if r.status_code == 200:
        html_text=r.text
        if save:
            with open(f"world-{month}-{day}-{year}-{hour}-{minute}-{second}.html", 'w') as f:
                f.write(html_text)
        return html_text
    return ""

url = 'https://stackoverflow.com/questions?tab=newest&page=2'

html_text = url_to_txt(url)

r_html=HTML(html=html_text)
table_class = "div"
r_table = r_html.find(table_class)

print(r_table)

Answer 1

Focusing on getting the question-summary-xxx values from the id attribute, you could try something like this:

from requests_html import HTMLSession

session = HTMLSession()
url = 'https://stackoverflow.com/questions?tab=newest&pagesize=50'
r = session.get(url)
# Select the id attribute of every div whose id starts with "question-summary-"
targets = r.html.xpath('//div[starts-with(@id,"question-summary-")]/@id')
targets

Output:

['question-summary-64248540',
 'question-summary-64248536',
 'question-summary-64248535',
 'question-summary-64248530',
...]

and so on.
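
From there, the numeric question ID can be separated from each string with plain Python. What follows is only a rough sketch: the helper names `extract_question_ids` and `listing_urls` are made up for illustration, the per-question URL pattern is an assumption about how the site addresses individual questions, and combining `page` with `pagesize` in one query string is also an assumption (the question uses `page`, the code above uses `pagesize`). The sample `targets` values are copied from the output above so the snippet runs without any network access.

```python
import re

# Assumed URL pattern for a single question page (not confirmed by the post).
QUESTION_URL = "https://stackoverflow.com/questions/{}"

# Matches strings such as "question-summary-64248540" and captures the digits.
ID_PATTERN = re.compile(r"^question-summary-(\d+)$")


def extract_question_ids(targets):
    """Return the numeric part of each 'question-summary-<digits>' string."""
    ids = []
    for target in targets:
        match = ID_PATTERN.match(target)
        if match:  # skip anything that does not follow the expected pattern
            ids.append(match.group(1))
    return ids


def listing_urls(pages=3, pagesize=50):
    """Build the URLs of the first `pages` newest-question listing pages."""
    # Assumption: `page` and `pagesize` can be combined in one query string.
    base = "https://stackoverflow.com/questions?tab=newest&page={}&pagesize={}"
    return [base.format(page, pagesize) for page in range(1, pages + 1)]


# Sample values copied from the output above.
targets = ["question-summary-64248540", "question-summary-64248536"]

question_ids = extract_question_ids(targets)
question_urls = [QUESTION_URL.format(qid) for qid in question_ids]

print(question_ids)
print(question_urls)
print(listing_urls())
```

In a real run, `targets` would be the list returned by the XPath query for each URL from `listing_urls()`, and each entry in `question_urls` could then be fetched and saved in the same way the question's `url_to_txt` function saves a page.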

Tags: html, python, parsing, html-parsing, python-requests