Specific HTML parsing with Python 3 and BeautifulSoup
I'm trying to parse the information in the table in the bottom-right corner of the following link, the table labeled Current schedule submissions:
dnedesign.us.to/tables/
I've been able to parse it into this:
{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";s:9:"startTime";s:5:"14:30";s:7:"endTime";s:5:"16:30";}
{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";s:9:"startTime";s:5:"14:30";s:7:"endTime";s:5:"15:30";}
{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";s:9:"startTime";s:5:"16:30";s:7:"endTime";s:5:"18:30";}
{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";s:9:"startTime";s:0:"";s:7:"endTime";s:0:"";}
{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";s:9:"startTime";s:0:"";s:7:"endTime";s:0:"";}
{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";s:9:"startTime";s:5:"12:30";s:7:"endTime";s:5:"16:30";}
{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";s:9:"startTime";s:5:"12:30";s:7:"endTime";s:5:"16:30";}
{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";s:9:"startTime";s:5:"12:30";s:7:"endTime";s:5:"14:30";}
{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:7:"Tuesday";s:9:"startTime";s:5:"14:30";s:7:"endTime";s:5:"16:30";}
Here is the code that performs the parsing to produce the above:
try:
    from urllib.request import urlopen
except ImportError:
    from urllib2 import urlopen
from bs4 import BeautifulSoup

url = 'http://dnedesign.us.to/tables/'
page = urlopen(url)
soup = BeautifulSoup(page, "html.parser")

for rows in soup.find_all('tr'):
    for td in rows.find_all('td'):
        if 'a:' in td.text:
            print(td.text[4:])
I'm trying to parse it into the following instead:
Day:Tuesday Starttime:14:30 Endtime:16:30
Day:Sunday Starttime:12:30 Endtime:14:30
Day:Sunday Starttime:12:30 Endtime:16:30
Day:Sunday Starttime:12:30 Endtime:16:30
....
....
and so on down the table.
I'm using Python 3.6.9 and Httpie 0.9.8 on Linux Mint Cinnamon 19.1. This is my final-year project, so any help would be greatly appreciated, thank you.
Neil M.
You can use a regular expression to parse the well-formed table data, taking care to watch out for the empty strings:
try:
    from urllib.request import urlopen
except ImportError:
    from urllib2 import urlopen
import re
from bs4 import BeautifulSoup

url = 'http://dnedesign.us.to/tables/'
soup = BeautifulSoup(urlopen(url), "html.parser")

data = []
for rows in soup.find_all('tr'):
    for td in rows.find_all('td'):
        if 'a:' in td.text:
            cols = re.findall(r"s:\d+:\"(.*?)\"", td.text)
            data.append({cols[x]: cols[x+1] for x in range(0, len(cols), 2)})

for row in data[::-1]:
    row = {
        k: re.sub(
            r"[a-zA-Z]+", lambda x: x.group().capitalize(), "%s:%s" % (k, v)
        ) for k, v in row.items()
    }
    print(" ".join([row["Day"], row["startTime"], row["endTime"]]))
Output:
Day:Tuesday Starttime:14:30 Endtime:16:30
Day:Sunday Starttime:12:30 Endtime:14:30
Day:Sunday Starttime:12:30 Endtime:16:30
Day:Sunday Starttime:12:30 Endtime:16:30
Day:Sunday Starttime: Endtime:
Day:Sunday Starttime: Endtime:
Day:Sunday Starttime:16:30 Endtime:18:30
Day:Sunday Starttime:14:30 Endtime:15:30
Day:Sunday Starttime:14:30 Endtime:16:30
The second stage builds the strings to your formatting specification, but the intermediate step of building the data list, which holds the key-value pairs of each row's column data, is where the real work happens.
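To see that core step in isolation, here is a minimal sketch of the regex extraction run on a single serialized row copied from your output above (no network access needed):

```python
import re

# One serialized row taken verbatim from the question's output
raw = ('s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";'
       's:9:"startTime";s:5:"14:30";s:7:"endTime";s:5:"16:30";')

# Each field looks like s:<length>:"<value>"; capture every quoted value,
# including empty strings such as s:0:""
cols = re.findall(r's:\d+:"(.*?)"', raw)

# The captures alternate key, value, key, value, ... so pair them up
row = {cols[i]: cols[i + 1] for i in range(0, len(cols), 2)}
print(row)
# {'cfdb7_status': 'unread', 'Day': 'Sunday', 'startTime': '14:30', 'endTime': '16:30'}
```

The non-greedy `(.*?)` is what makes empty values come through as `''` instead of swallowing the next field.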
As for your request to put the items into a class, you can create instances of a Schedule class and populate the relevant fields instead of using dictionaries:
try:
    from urllib.request import urlopen
except ImportError:
    from urllib2 import urlopen
import re
from bs4 import BeautifulSoup

class Schedule:
    def __init__(self, day, start, end):
        self.day = day
        self.start = start
        self.end = end

url = 'http://dnedesign.us.to/tables/'
soup = BeautifulSoup(urlopen(url), "html.parser")

schedules = []
for rows in soup.find_all('tr'):
    for td in rows.find_all('td'):
        if 'a:' in td.text:
            cols = re.findall(r"s:\d+:\"(.*?)\"", td.text)
            data = {cols[x]: cols[x+1] for x in range(0, len(cols), 2)}
            schedules.append(Schedule(data["Day"], data["startTime"], data["endTime"]))

for schedule in schedules:
    print(schedule.day, schedule.start, schedule.end)
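If you also want the class-based version to produce your exact target format (Day:Tuesday Starttime:14:30 Endtime:16:30), one option is to give Schedule a formatting method; this is just a sketch, and the method name formatted is my own choice:

```python
class Schedule:
    def __init__(self, day, start, end):
        self.day = day
        self.start = start
        self.end = end

    def formatted(self):
        # Reproduces the requested output shape:
        # "Day:Sunday Starttime:12:30 Endtime:14:30"
        return "Day:%s Starttime:%s Endtime:%s" % (self.day, self.start, self.end)

s = Schedule("Tuesday", "14:30", "16:30")
print(s.formatted())
# Day:Tuesday Starttime:14:30 Endtime:16:30
```

Then the final loop becomes simply `print(schedule.formatted())`.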