Best way to extract specific parts from an HTML / JSON page?

I get the following back from a Python requests call:

{"error":{"ErrorMessage":"
<div>
<p>To protect your privacy, this form will not display details such as a clinical or assisted collection. If you believe that the information detailed above is incomplete or incorrect, please tell us here 
    <a href=\"http:\/\/www.southhams.gov.uk\/wastequestion\">www.southhams.gov.uk\/wastequestion<\/a><\/p><\/div>","CodeName":"Success","ErrorStatus":0},"calendar":{"calendar":"
        <div class=\"wsResponse\">To protect your privacy, this form will not display details such as a clinical or assisted collection. If you believe that the information detailed above is incomplete or incorrect, please tell us here 
            <a href=\"http:\/\/www.southhams.gov.uk\/wastequestion\">www.southhams.gov.uk\/wastequestion<\/a><\/div>"},"binCollections":{"tile":[["
                <div class=\'collectionDiv\'>
                    <div class=\'fullwidth\'>
                        <h3>Organic Collection Service (Brown Organic Bin)<\/h3><\/div>
                            <div class=\"collectionImg\">
                                <img src=\"https:\/\/southhams.fccenvironment.co.uk\/library\/images\/brown bin.png\" \/><\/div>\n                    
                                <div class=\'wdshDetWrap\'>Your brown organic bin collection is 
                                    <b>Fortnightly<\/b> on a 
                                        <b>Thursday<\/b>.
                                            <br\/> \n                    Your next scheduled collection is 
                                            <b>Friday, 29 May 2020<\/b>. 
                                                <br\/>
                                                <br\/>
                                                <a href=\"https:\/\/www.southhams.gov.uk\/article\/3427\">Read more about the Organic Collection Service &gt;<\/a><\/div><\/div>"],["
                                                    <div class=\'collectionDiv\'>
                                                        <div class=\'fullwidth\'>
                                                            <h3>Recycling Collection Service (Recycling Sacks)<\/h3><\/div>
                                                                <div class=\"collectionImg\">
                                                                    <img src=\"https:\/\/southhams.fccenvironment.co.uk\/library\/images\/SH_two_rec_sacks.png\" \/><\/div>\n                    
                                                                    <div class=\'wdshDetWrap\'>Your recycling sacks collection is 
                                                                        <b>Fortnightly<\/b> on a 
                                                                            <b>Thursday<\/b>.
                                                                                <br\/> \n                    Your next scheduled collection is 
                                                                                <b>Friday, 29 May 2020<\/b>. 
                                                                                    <br\/>
                                                                                    <br\/>
                                                                                    <a href=\"https:\/\/www.southhams.gov.uk\/article\/3383\">Read more about the Recycling Collection Service &gt;<\/a><\/div><\/div>"],["
                                                                                        <div class=\'collectionDiv\'>
                                                                                            <div class=\'fullwidth\'>
                                                                                                <h3>Refuse Collection Service (Grey Refuse Bin)<\/h3><\/div>
                                                                                                    <div class=\"collectionImg\">
                                                                                                        <img src=\"https:\/\/southhams.fccenvironment.co.uk\/library\/images\/grey bin.png\" \/><\/div>\n                    
                                                                                                        <div class=\'wdshDetWrap\'>Your grey refuse bin collection is 
                                                                                                            <b>Fortnightly<\/b> on a 
                                                                                                                <b>Thursday<\/b>.
                                                                                                                    <br\/> \n                    Your next scheduled collection is 
                                                                                                                    <b>Thursday, 04 June 2020<\/b>. 
                                                                                                                        <br\/>
                                                                                                                        <br\/>
                                                                                                                        <a href=\"https:\/\/www.southhams.gov.uk\/article\/3384\">Read more about the Refuse Collection Service &gt;<\/a><\/div><\/div>"]]}}

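I want to extract the following for each of the three collectionDiv blocks:

Organic Collection Service (Brown Organic Bin) Friday, 29 May 2020

Recycling Collection Service (Recycling Sacks) Friday, 29 May 2020

Refuse Collection Service (Grey Refuse Bin) Thursday, 04 June 2020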

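So far I have tried loading response.content into the Python json handler, but I still could not extract the data. I then tried BeautifulSoup with soup.find_all("div", class_="wdshDetWrap"), but I still cannot pull out the exact values. Would lxml or something similar be simpler?

Thanks for looking.

Request code: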

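import requests

url = "https://southhams.fccenvironment.co.uk/mycollections"

response = requests.request("GET", url)

# Print the cookies set by the first request; the loop variable `cookie`
# (the last cookie seen) is reused in the POST below.
cookiejar = response.cookies
for cookie in cookiejar:
    print(cookie.name, cookie.value)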

url = "https://southhams.fccenvironment.co.uk/ajaxprocessor/getcollectiondetails"

payload = 'fcc_session_token={}&uprn=100040282539'.format(cookie.value)
headers = {
  'X-Requested-With': 'XMLHttpRequest',
  'Content-Type': 'application/x-www-form-urlencoded',
  'Cookie': 'fcc_session_cookie={}'.format(cookie.value)
}

response = requests.request("POST", url, headers=headers, data = payload)

print(response.status_code)

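The HTML document returned by this particular site is not well formed. I still managed to work around it, although the approach would be inefficient at a scale of around 1000 tags.

So it can be improved.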

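# Assumes `soup` is a BeautifulSoup object built from the response text.
# The slicing up to '<' suggests the closing tags were not recognised by the
# parser, so each tag's text runs on and is cut at the first '<'.
headers = soup.find_all('h3')
names = [tag.text[:tag.text.find('<')] for tag in headers]
dates = [tag.find_all('b')[2].text[:tag.find_all('b')[2].text.find('<')] for tag in headers]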

print(names)
print(dates)

#Output
['Organic Collection Service (Brown Organic Bin)', 'Recycling Collection Service (Recycling Sacks)', 'Refuse Collection Service (Grey Refuse Bin)']
['Friday, 29 May 2020', 'Friday, 29 May 2020', 'Thursday, 04 June 2020']
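
One likely reason the markup looks malformed is that the closing tags are still JSON-escaped (`<\/h3>`) in the raw response text. The sketch below uses a shortened, made-up sample in the same shape as the response (the `raw` string is an assumption, not the site's real output) to show that decoding the JSON first turns `\/` back into `/`, which leaves ordinary HTML to parse.

```python
import json

# Hypothetical, shortened sample shaped like the response in the question.
raw = r'{"binCollections":{"tile":[["<div class=\"collectionDiv\"><h3>Organic Collection Service (Brown Organic Bin)<\/h3><\/div>"]]}}'

# In the raw text the closing tags are still JSON-escaped.
print(r"<\/h3>" in raw)

# json.loads decodes the \/ escape sequence to a plain forward slash.
tile_html = json.loads(raw)["binCollections"]["tile"][0][0]
print(tile_html)
```

After decoding, each entry of `tile` is a one-element list holding an HTML fragment with normal closing tags, so no string slicing on `<` should be needed.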

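The response is JSON, so you can decode it directly and then pull out the HTML value. Once you have that, parse the HTML with BeautifulSoup and print the text of the tags you are after: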

import requests
from bs4 import BeautifulSoup

url = "https://southhams.fccenvironment.co.uk/mycollections"

response = requests.get(url)

cookiejar = response.cookies
for cookie in cookiejar:
    print(cookie.name,cookie.value)

url = "https://southhams.fccenvironment.co.uk/ajaxprocessor/getcollectiondetails"

payload = 'fcc_session_token={}&uprn=100040282539'.format(cookie.value)
headers = {
  'X-Requested-With': 'XMLHttpRequest',
  'Content-Type': 'application/x-www-form-urlencoded',
  'Cookie': 'fcc_session_cookie={}'.format(cookie.value)
}

jsonData = requests.post(url, headers=headers, data = payload).json()


data = jsonData['binCollections']['tile']
for each in data:
    soup = BeautifulSoup(each[0], 'html.parser')
    collection = soup.find('div', {'class':'collectionDiv'}).find('h3').text.strip()
    date = soup.find_all('b')[-1].text.strip()

    print (collection, date)

Output:

Organic Collection Service (Brown Organic Bin) Friday, 29 May 2020
Recycling Collection Service (Recycling Sacks) Friday, 29 May 2020
Refuse Collection Service (Grey Refuse Bin) Thursday, 04 June 2020
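
If installing BeautifulSoup is not an option, the same idea can be sketched with only the standard library. This is a minimal illustration, not the approach from the answer above: `TileParser`, `summarise`, and the shortened `sample` payload are hypothetical names and data, and it assumes each tile has one `<h3>` heading and that the last `<b>` element holds the next collection date.

```python
import json
from html.parser import HTMLParser


class TileParser(HTMLParser):
    """Collect the text of every <h3> and <b> element in one tile."""

    def __init__(self):
        super().__init__()
        self.current = None                # tag whose text is being collected
        self.found = {"h3": [], "b": []}

    def handle_starttag(self, tag, attrs):
        if tag in self.found:
            self.current = tag
            self.found[tag].append("")

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        if self.current:
            self.found[self.current][-1] += data


def summarise(tile_html):
    """Return (service name, next collection date) for one tile fragment."""
    parser = TileParser()
    parser.feed(tile_html)
    parser.close()
    name = parser.found["h3"][0].strip()
    date = parser.found["b"][-1].strip()   # assumed: last <b> is the date
    return name, date


# Hypothetical, shortened payload in the same shape as the real response.
sample = json.loads(r'''{"binCollections":{"tile":[["<div class=\"collectionDiv\"><h3>Refuse Collection Service (Grey Refuse Bin)<\/h3><div class=\"wdshDetWrap\">Your grey refuse bin collection is <b>Fortnightly<\/b> on a <b>Thursday<\/b>.<br\/>Your next scheduled collection is <b>Thursday, 04 June 2020<\/b>.<\/div><\/div>"]]}}''')

results = [summarise(tile[0]) for tile in sample["binCollections"]["tile"]]
for name, date in results:
    print(name, date)
```

With the live site, `sample` would be replaced by the decoded JSON from the POST request; the parsing step stays the same.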