Beautiful Soup table parsing
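We're working on a university project and want to extract data from the university timetable and use it in our own project. We have a Python script that extracts the data; it runs fine on a local machine, but when we try the same script on Amazon EC2 it fails with an error.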
from bs4 import BeautifulSoup
import requests

# url from timetable.ucc.ie showing 3rd Year semester 1 timetable
url = 'http://timetable.ucc.ie/showtimetable2.asp?filter=%28None%29&identifier=BSCS3&days=1-5&periods=1-20&weeks=5-16&objectclass=programme%2Bof%2Bstudy&style=individual'

# Retrieve the web page at url and convert the data into a soup object
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)

# Retrieve the table containing the timetable from the soup object for parsing
timetable_to_parse = soup.find('table', {'class' : 'grid-border-args'})

i = 0 # i is an index into pre_format_day
pre_format_day = [[],[],[],[],[],[]] # holds un-formatted day information
day = [[],[],[],[],[],[]] # hold formatted day information
day[0] = pre_format_day[0]

# look at each td within the table
for slot in timetable_to_parse.findAll('td'):
    # if slot content is a day of the week, move pointer to next day
    # indicated all td's relating to a day have been looked at
    if slot.get_text() in ( 'Mon', 'Tue' , 'Wed' , 'Thu' , 'Fri'):
        i += 1
    else: # otherwise the td related to a time slot in a day
        try:
            if slot['colspan'] is "4": #test if colspan of td is 4
                # if it is, append to list twice to represent 2 hours
                pre_format_day[i].append(slot.get_text().replace('\n',''))
                pre_format_day[i].append(slot.get_text().replace('\n',''))
        except:
            pass
        # if length of text of td is 1, > 11 or contains ":00"
        if len(slot.get_text()) == 1 or len(slot.get_text()) > 11 or ":00" in\
                slot.get_text():
            # add to pre_format_day
            pre_format_day[i].append(slot.get_text().replace('\n',''))

# go through each day in pre_format_day and insert formatted version in day[]
for i in range(1,6):
    j = 0
    while j < 20:
        if len(pre_format_day[i][j]) > 10: # if there is an event store in day
            day[i].append(pre_format_day[i][j])
        else: # insert space holder into slots with no events
            day[i].append('----- ')
        j += 2

# creates a string containing a html table for output
timetable = '<table><tr>'
timetable += '<th></th>'
for i in range(0, 10):
    timetable += '<th>' + day[0][i] + '</th> '
days = ['', 'Mon', 'Tue' , 'Wed' , 'Thu' , 'Fri']
for i in range(1,6):
    timetable += '</tr><tr><th>' + days[i] + '</th>'
    for j in range(0,10):
        if len(day[i][j]) > 10:
            timetable += '<td class="lecture">' + day[i][j] + '</td>'
        else:
            timetable += '<td></td>'
timetable += '</tr></table>'

# output timetable string
print timetable
The output on the local machine is a table containing the required data.
The output on the EC2 instance is:
Traceback (most recent call last):
  File "parse2.py", line 21, in <module>
    for slot in timetable_to_parse.findAll('td'):
AttributeError: 'NoneType' object has no attribute 'findAll'
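Both machines are running Ubuntu 14.10 and Python 2.7. For some reason I can't figure out, the script doesn't seem to be getting the required page from the URL and extracting the table from it, and beyond that I'm lost.
Any help is much appreciated.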
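Log in to the EC2 instance and step through the script line by line in the Python CLI until you find the problem. For some reason, BeautifulSoup parsing works slightly differently on different systems. I've run into the same problem and don't know the reason behind it. Without knowing the HTML content, it's hard for us to give you specific help.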
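One way to do that line-by-line check is to try each parser in turn on the same HTML and see which ones find the table. This is only a rough sketch: `probe_parsers` and `SAMPLE_HTML` are made-up names for illustration, and the sample markup is a stand-in, not the real timetable page. On the EC2 instance you would pass in `r.text` instead.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Stand-in markup (an assumption); replace with r.text from the real request.
SAMPLE_HTML = (
    "<html><body>"
    "<table class='grid-border-args'><tr><td>Mon</td><td>09:00</td></tr></table>"
    "</body></html>"
)


def probe_parsers(html, parsers=("lxml", "html5lib", "html.parser")):
    """Map each parser name to the number of matching tables it finds.

    A value of None means that parser is not installed on this machine.
    """
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # bs4 raises this when the requested parser library is missing.
            results[name] = None
            continue
        results[name] = len(soup.find_all("table", {"class": "grid-border-args"}))
    return results


if __name__ == "__main__":
    for parser_name, count in probe_parsers(SAMPLE_HTML).items():
        print("%s: %s" % (parser_name, "not installed" if count is None else count))
```

Running this on both machines and comparing the output should show whether they have the same parsers installed and whether each one sees the table.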
The problem was that EC2 was using a different parser from the local machine.
Fixed with:
apt-get install python-lxml
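When `BeautifulSoup(data)` is called without a parser argument, bs4 picks the best parser it finds installed, so two machines can parse the same page differently. A possible way to make that explicit is to name the parser and check for `None` before calling `findAll`. The sketch below is an assumption-laden illustration rather than the original fix: `make_soup`, `find_timetable` and `SAMPLE_HTML` are hypothetical names, and the sample markup stands in for the real page.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Stand-in markup (an assumption); in the real script this would be r.text.
SAMPLE_HTML = (
    "<html><body>"
    "<table class='grid-border-args'><tr><td>Mon</td><td>09:00</td></tr></table>"
    "</body></html>"
)


def make_soup(html, preferred="lxml", fallback="html.parser"):
    """Parse with an explicitly named parser; fall back if it is not installed."""
    try:
        return BeautifulSoup(html, preferred), preferred
    except FeatureNotFound:
        return BeautifulSoup(html, fallback), fallback


def find_timetable(html):
    """Return the timetable <table>, or raise a clear error if it is missing."""
    soup, parser_used = make_soup(html)
    table = soup.find("table", {"class": "grid-border-args"})
    if table is None:
        # Fail here with context instead of an AttributeError on findAll later.
        raise ValueError(
            "no table with class 'grid-border-args' found (parser: %s)" % parser_used
        )
    return table


if __name__ == "__main__":
    table = find_timetable(SAMPLE_HTML)
    print([td.get_text() for td in table.find_all("td")])
```

The fallback keeps the script running where lxml is absent, but the two parsers can still disagree on badly formed HTML, so installing lxml on both machines remains the more reliable option.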