Best way to extract specific parts from an HTML / JSON page?

Question

I get the following back from a Python requests call:

{"error":{"ErrorMessage":"
<div>
<p>To protect your privacy, this form will not display details such as a clinical or assisted collection. If you believe that the information detailed above is incomplete or incorrect, please tell us here 
    <a href=\"http:\/\/www.southhams.gov.uk\/wastequestion\">www.southhams.gov.uk\/wastequestion<\/a><\/p><\/div>","CodeName":"Success","ErrorStatus":0},"calendar":{"calendar":"
        <div class=\"wsResponse\">To protect your privacy, this form will not display details such as a clinical or assisted collection. If you believe that the information detailed above is incomplete or incorrect, please tell us here 
            <a href=\"http:\/\/www.southhams.gov.uk\/wastequestion\">www.southhams.gov.uk\/wastequestion<\/a><\/div>"},"binCollections":{"tile":[["
                <div class=\'collectionDiv\'>
                    <div class=\'fullwidth\'>
                        <h3>Organic Collection Service (Brown Organic Bin)<\/h3><\/div>
                            <div class=\"collectionImg\">
                                <img src=\"https:\/\/southhams.fccenvironment.co.uk\/library\/images\/brown bin.png\" \/><\/div>\n                    
                                <div class=\'wdshDetWrap\'>Your brown organic bin collection is 
                                    <b>Fortnightly<\/b> on a 
                                        <b>Thursday<\/b>.
                                            <br\/> \n                    Your next scheduled collection is 
                                            <b>Friday, 29 May 2020<\/b>. 
                                                <br\/>
                                                <br\/>
                                                <a href=\"https:\/\/www.southhams.gov.uk\/article\/3427\">Read more about the Organic Collection Service &gt;<\/a><\/div><\/div>"],["
                                                    <div class=\'collectionDiv\'>
                                                        <div class=\'fullwidth\'>
                                                            <h3>Recycling Collection Service (Recycling Sacks)<\/h3><\/div>
                                                                <div class=\"collectionImg\">
                                                                    <img src=\"https:\/\/southhams.fccenvironment.co.uk\/library\/images\/SH_two_rec_sacks.png\" \/><\/div>\n                    
                                                                    <div class=\'wdshDetWrap\'>Your recycling sacks collection is 
                                                                        <b>Fortnightly<\/b> on a 
                                                                            <b>Thursday<\/b>.
                                                                                <br\/> \n                    Your next scheduled collection is 
                                                                                <b>Friday, 29 May 2020<\/b>. 
                                                                                    <br\/>
                                                                                    <br\/>
                                                                                    <a href=\"https:\/\/www.southhams.gov.uk\/article\/3383\">Read more about the Recycling Collection Service &gt;<\/a><\/div><\/div>"],["
                                                                                        <div class=\'collectionDiv\'>
                                                                                            <div class=\'fullwidth\'>
                                                                                                <h3>Refuse Collection Service (Grey Refuse Bin)<\/h3><\/div>
                                                                                                    <div class=\"collectionImg\">
                                                                                                        <img src=\"https:\/\/southhams.fccenvironment.co.uk\/library\/images\/grey bin.png\" \/><\/div>\n                    
                                                                                                        <div class=\'wdshDetWrap\'>Your grey refuse bin collection is 
                                                                                                            <b>Fortnightly<\/b> on a 
                                                                                                                <b>Thursday<\/b>.
                                                                                                                    <br\/> \n                    Your next scheduled collection is 
                                                                                                                    <b>Thursday, 04 June 2020<\/b>. 
                                                                                                                        <br\/>
                                                                                                                        <br\/>
                                                                                                                        <a href=\"https:\/\/www.southhams.gov.uk\/article\/3384\">Read more about the Refuse Collection Service &gt;<\/a><\/div><\/div>"]]}}

I want to extract the following for each of the three collectionDiv blocks:

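Organic Collection Service (Brown Organic Bin) Friday, 29 May 2020

Recycling Collection Service (Recycling Sacks) Friday, 29 May 2020

Refuse Collection Service (Grey Refuse Bin) Thursday, 04 June 2020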

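So far I have tried loading response.content with Python's json module, but I still could not extract the data. I then tried BeautifulSoup with soup.find_all("div", class_="wdshDetWrap"), but that did not give me the exact values either. Would lxml or something similar be simpler?

Thanks for looking.

Request code: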

import requests

url = "https://southhams.fccenvironment.co.uk/mycollections"

response = requests.request("GET", url)

cookiejar = response.cookies
for cookie in cookiejar:
    print(cookie.name,cookie.value)

url = "https://southhams.fccenvironment.co.uk/ajaxprocessor/getcollectiondetails"

payload = 'fcc_session_token={}&uprn=100040282539'.format(cookie.value)
headers = {
  'X-Requested-With': 'XMLHttpRequest',
  'Content-Type': 'application/x-www-form-urlencoded',
  'Cookie': 'fcc_session_cookie={}'.format(cookie.value)
}

response = requests.request("POST", url, headers=headers, data = payload)

print(response.status_code)

Answer 1

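The HTML document returned by this particular site is malformed. I still managed to get the values out, although the approach is inefficient at a scale of roughly 1000 tags.

So it can be improved.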

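# Assumption: soup was built from the raw response text, e.g.
# soup = BeautifulSoup(response.text, 'html.parser'); the answer does not show this step.
headers = soup.find_all('h3')
names = [tag.text[:tag.text.find('<')] for tag in headers]
dates = [tag.find_all('b')[2].text[:tag.find_all('b')[2].text.find('<')] for tag in headers]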

print(names)
print(dates)

#Output
['Organic Collection Service (Brown Organic Bin)', 'Recycling Collection Service (Recycling Sacks)', 'Refuse Collection Service (Grey Refuse Bin)']
['Friday, 29 May 2020', 'Friday, 29 May 2020', 'Thursday, 04 June 2020']
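
One likely reason the markup looks malformed is that the body is JSON, and JSON allows `/` to be written as `\/`, so closing tags arrive as `<\/h3>` until the JSON is decoded. The sketch below illustrates this with a trimmed, hypothetical stand-in for the response (the `raw` string is not the real payload, only the same shape):

```python
import json

# Trimmed, hypothetical stand-in for the real response body (same nesting, one tile).
raw = r'{"binCollections":{"tile":[["<div class=\"collectionDiv\"><h3>Refuse Collection Service (Grey Refuse Bin)<\/h3><b>Thursday, 04 June 2020<\/b><\/div>"]]}}'

# Decoding the JSON turns \" into " and \/ into /.
data = json.loads(raw)
fragment = data["binCollections"]["tile"][0][0]

print("<\\/h3>" in raw)     # the escaped closing tag is present before decoding
print("</h3>" in fragment)  # after decoding it is an ordinary closing tag
print(fragment)
```

After decoding, each tile is a regular HTML fragment that an HTML parser can handle without string slicing.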

Answer 2

The response is JSON, so decode it first and pull the HTML strings out of it. Then parse each HTML string with BeautifulSoup and print the text of the tags you are interested in:

import requests
from bs4 import BeautifulSoup

url = "https://southhams.fccenvironment.co.uk/mycollections"

response = requests.get(url)

cookiejar = response.cookies
for cookie in cookiejar:
    print(cookie.name,cookie.value)

url = "https://southhams.fccenvironment.co.uk/ajaxprocessor/getcollectiondetails"

payload = 'fcc_session_token={}&uprn=100040282539'.format(cookie.value)
headers = {
  'X-Requested-With': 'XMLHttpRequest',
  'Content-Type': 'application/x-www-form-urlencoded',
  'Cookie': 'fcc_session_cookie={}'.format(cookie.value)
}

jsonData = requests.post(url, headers=headers, data = payload).json()


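# Each tile is a one-element list holding an HTML fragment.
data = jsonData['binCollections']['tile']
for each in data:
    soup = BeautifulSoup(each[0], 'html.parser')
    # The <h3> holds the service name; the last <b> holds the next collection date.
    collection = soup.find('div', {'class':'collectionDiv'}).find('h3').text.strip()
    date = soup.find_all('b')[-1].text.strip()

    print(collection, date)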

Output:

Organic Collection Service (Brown Organic Bin) Friday, 29 May 2020
Recycling Collection Service (Recycling Sacks) Friday, 29 May 2020
Refuse Collection Service (Grey Refuse Bin) Thursday, 04 June 2020
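
If the dates are needed as values rather than text, the strings above can be parsed with the standard library. This is a minimal sketch, assuming the date format stays as "Friday, 29 May 2020" and that the process runs under an English (or C) locale, since `%A` and `%B` are locale-dependent; `rows` and `parse_collection_date` are illustrative names, not part of the original answer:

```python
from datetime import datetime

# Name/date pairs copied from the output above.
rows = [
    ("Organic Collection Service (Brown Organic Bin)", "Friday, 29 May 2020"),
    ("Recycling Collection Service (Recycling Sacks)", "Friday, 29 May 2020"),
    ("Refuse Collection Service (Grey Refuse Bin)", "Thursday, 04 June 2020"),
]


def parse_collection_date(text):
    """Parse a string such as 'Friday, 29 May 2020' into a datetime.date."""
    return datetime.strptime(text, "%A, %d %B %Y").date()


# Sort by date (then by name) so the soonest collection comes first.
schedule = sorted((parse_collection_date(d), name) for name, d in rows)
for when, name in schedule:
    print(when.isoformat(), name)
```

Working with `date` objects makes it straightforward to sort collections or compare them with today's date.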
