Specific HTML parsing with Python 3 and BeautifulSoup
I'm trying to parse the information in the table in the bottom-right corner of the following link, the table labeled Current schedule submissions:
dnedesign.us.to/tables/
I've been able to parse it into this:
{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";s:9:"startTime";s:5:"14:30";s:7:"endTime";s:5:"16:30";}
{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";s:9:"startTime";s:5:"14:30";s:7:"endTime";s:5:"15:30";}
{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";s:9:"startTime";s:5:"16:30";s:7:"endTime";s:5:"18:30";}
{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";s:9:"startTime";s:0:"";s:7:"endTime";s:0:"";}
{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";s:9:"startTime";s:0:"";s:7:"endTime";s:0:"";}
{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";s:9:"startTime";s:5:"12:30";s:7:"endTime";s:5:"16:30";}
{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";s:9:"startTime";s:5:"12:30";s:7:"endTime";s:5:"16:30";}
{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";s:9:"startTime";s:5:"12:30";s:7:"endTime";s:5:"14:30";}
{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:7:"Tuesday";s:9:"startTime";s:5:"14:30";s:7:"endTime";s:5:"16:30";}
Here is the code that performs the parsing to produce the above:
try:
    from urllib.request import urlopen
except ImportError:
    from urllib2 import urlopen
from bs4 import BeautifulSoup

url = 'http://dnedesign.us.to/tables/'
page = urlopen(url)
soup = BeautifulSoup(page, "html.parser")

for rows in soup.find_all('tr'):
    for td in rows.find_all('td'):
        if 'a:' in td.text:
            print(td.text[4:])
I'm trying to parse it into the following instead:
Day:Tuesday Starttime:14:30 Endtime:16:30
Day:Sunday Starttime:12:30 Endtime:14:30
Day:Sunday Starttime:12:30 Endtime:16:30
Day:Sunday Starttime:12:30 Endtime:16:30
....
....
and so on down the table.
I'm using Python 3.6.9 and Httpie 0.9.8 on Linux Mint Cinnamon 19.1. This is my final-year project, so any help would be greatly appreciated, thank you.
Neil M.
You can use a regular expression to parse the well-formed table data, taking care to watch out for the empty strings:
try:
    from urllib.request import urlopen
except ImportError:
    from urllib2 import urlopen
import re
from bs4 import BeautifulSoup

url = 'http://dnedesign.us.to/tables/'
soup = BeautifulSoup(urlopen(url), "html.parser")

data = []
for rows in soup.find_all('tr'):
    for td in rows.find_all('td'):
        if 'a:' in td.text:
            cols = re.findall(r"s:\d+:\"(.*?)\"", td.text)
            data.append({cols[x]: cols[x+1] for x in range(0, len(cols), 2)})

for row in data[::-1]:
    row = {
        k: re.sub(
            r"[a-zA-Z]+", lambda x: x.group().capitalize(), "%s:%s" % (k, v)
        ) for k, v in row.items()
    }
    print(" ".join([row["Day"], row["startTime"], row["endTime"]]))
Output:
Day:Tuesday Starttime:14:30 Endtime:16:30
Day:Sunday Starttime:12:30 Endtime:14:30
Day:Sunday Starttime:12:30 Endtime:16:30
Day:Sunday Starttime:12:30 Endtime:16:30
Day:Sunday Starttime: Endtime:
Day:Sunday Starttime: Endtime:
Day:Sunday Starttime:16:30 Endtime:18:30
Day:Sunday Starttime:14:30 Endtime:15:30
Day:Sunday Starttime:14:30 Endtime:16:30
The second stage builds the strings to your formatting specification, but the intermediate step of building the data list, which holds the key-value pairs of each row's column data, is where the real work happens.
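To see that core step in isolation, here is a minimal sketch of the regex extraction run on a single serialized row copied from your output above (no network access needed):

```python
import re

# One serialized row taken verbatim from the question's output
raw = ('s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";'
       's:9:"startTime";s:5:"14:30";s:7:"endTime";s:5:"16:30";')

# Each field looks like s:<length>:"<value>"; capture every quoted value,
# including empty strings such as s:0:""
cols = re.findall(r's:\d+:"(.*?)"', raw)

# The captures alternate key, value, key, value, ... so pair them up
row = {cols[i]: cols[i + 1] for i in range(0, len(cols), 2)}
print(row)
# {'cfdb7_status': 'unread', 'Day': 'Sunday', 'startTime': '14:30', 'endTime': '16:30'}
```

The non-greedy `(.*?)` is what makes empty values come through as `''` instead of swallowing the next field.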
As for your request to put the items into a class, you can create instances of a Schedule class and populate the relevant fields instead of using dictionaries:
try:
    from urllib.request import urlopen
except ImportError:
    from urllib2 import urlopen
import re
from bs4 import BeautifulSoup

class Schedule:
    def __init__(self, day, start, end):
        self.day = day
        self.start = start
        self.end = end

url = 'http://dnedesign.us.to/tables/'
soup = BeautifulSoup(urlopen(url), "html.parser")

schedules = []
for rows in soup.find_all('tr'):
    for td in rows.find_all('td'):
        if 'a:' in td.text:
            cols = re.findall(r"s:\d+:\"(.*?)\"", td.text)
            data = {cols[x]: cols[x+1] for x in range(0, len(cols), 2)}
            schedules.append(Schedule(data["Day"], data["startTime"], data["endTime"]))

for schedule in schedules:
    print(schedule.day, schedule.start, schedule.end)
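If you also want the class-based version to produce your exact target format (Day:Tuesday Starttime:14:30 Endtime:16:30), one option is to give Schedule a formatting method; this is just a sketch, and the method name formatted is my own choice:

```python
class Schedule:
    def __init__(self, day, start, end):
        self.day = day
        self.start = start
        self.end = end

    def formatted(self):
        # Reproduces the requested output shape:
        # "Day:Sunday Starttime:12:30 Endtime:14:30"
        return "Day:%s Starttime:%s Endtime:%s" % (self.day, self.start, self.end)

s = Schedule("Tuesday", "14:30", "16:30")
print(s.formatted())
# Day:Tuesday Starttime:14:30 Endtime:16:30
```

Then the final loop becomes simply `print(schedule.formatted())`.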