Error getting text from an HTML table using BeautifulSoup and mechanize
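I am trying to get the text from the HTML inside a table tag, but I am not getting all of it. Instead, I only get part of the text and the rest is ignored.
Here are my output and code:
Output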
Public Sector Organization (Recruitment Test)
Test held on: Saturday, 3rd & Sunday 4th, December 2016
>>>
Code
import mechanize
from bs4 import BeautifulSoup
import urllib
from PIL import Image
import os
Roll=60170001
url = "http://nts.org.pk/Test&Products/Results/012017/PubSecOrg_24122016_Result/Search.php"
br = mechanize.Browser()
br.set_handle_robots(False) # ignore robots
br.open(url)
br.select_form(nr=0)
rollnumber=str(Roll)
captcha=11111
cap=str(captcha)
br["RollNo"]=rollnumber
br["captcha"]=cap
res = br.submit()
content = res.read()
soup = BeautifulSoup(content,"html.parser")
rolln=soup('table')[2]
rolln=rolln.text.encode('utf-8')
print rolln
This approach seems to do what you want.
>>> content = open(r"C:\scratch\___National Testing Service___.html").read()
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(content, 'lxml')
>>> tables = soup.findAll('table')
>>> len(tables)
8
>>> tables[2].text
'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nPublic Sector Organization (Recruitment Test)\nTest held on: Saturday, 3rd & Sunday 4th, December 2016\n\n \n (Result)\n\n\n\n\n\n Search Result for the keyword "\n 60170001 \n"\n\n\n\nRoll No\nName\nFather Name\nCNIC\n\nPost\n\n\nKDPH\n\n\nNTS Marks\n\n\n\n60170001\nSARA ISLAM \nNAZAR UL ISLAM \n17301-2406027-4 \n\n Assistant Manager(Electronics Engineering) \n\n\n \n\n\n 63 \n\n\n\n\n\n\n\n\n\n\nCurrent Date / Time: Tuesday 21st, February 2017 , 11:49:59 PM \n\n\n\n\n\xa0\n\n'
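The string returned by `.text` above is mostly newlines and padding. As a rough sketch of one way to tidy it, the cells can be read row by row with `get_text(strip=True)`. The HTML below is a small made-up stand-in for the saved result page (it is not the real markup), and `table_rows` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the saved page; the real page has 8 tables.
SAMPLE_HTML = """
<table><tr><td>Banner</td></tr></table>
<table>
  <tr><th>Roll No</th><th>Name</th><th>NTS Marks</th></tr>
  <tr><td>60170001</td><td>SARA ISLAM </td><td> 63 </td></tr>
</table>
"""


def table_rows(markup, index, parser="html.parser"):
    """Return the cells of the table at `index` as a list of rows.

    Hypothetical helper: each row is a list of cell strings with the
    surrounding whitespace stripped.
    """
    soup = BeautifulSoup(markup, parser)
    table = soup.find_all("table")[index]
    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all(["th", "td"])
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows


rows = table_rows(SAMPLE_HTML, 1)
for row in rows:
    print(row)
```

Note that `find_all("tr")` also descends into nested tables, so on a page that nests tables inside cells the rows of the inner tables would be included as well.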
Assuming mechanize gives you a file in the same form as the one I got by opening the page in Chrome and saving it, you should be fine.
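One difference between the two snippets is the parser: the question uses `"html.parser"`, while the session above uses `'lxml'`. Whether that explains the truncated output is only a guess, but it is cheap to check. The sketch below parses the same markup with each parser that happens to be installed and reports the text length of every table, so a table that comes out shorter under one parser stands out. `table_text_lengths` is a hypothetical helper, and the sample markup is made up.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def table_text_lengths(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Map each parser name to the text length of every <table> it finds.

    Hypothetical diagnostic helper: a parser that is not installed maps
    to None instead of raising.
    """
    report = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            report[name] = None  # this parser is not available here
            continue
        report[name] = [len(t.get_text()) for t in soup.find_all("table")]
    return report


# Made-up, well-formed sample; in practice pass the bytes from res.read().
sample = (
    "<table><tr><td>a</td></tr></table>"
    "<table><tr><td>b</td></tr></table>"
)
report = table_text_lengths(sample)
for parser_name, lengths in report.items():
    print(parser_name, lengths)
```

On well-formed input the parsers should agree. If they disagree on the real page, that would suggest the markup is malformed and is being repaired differently by each parser.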