Python

Question

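How can I use regular expressions in Python to parse HTML that spans multiple lines? I have managed to match patterns that sit on a single line using the code below.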

import re
import urllib

# newschoollist (a list of org codes) and valuetag are defined elsewhere in my script.
pattern = re.compile('>Phone:</td><td>(.+?)</td></tr>')
for orgcode in newschoollist:
    url = "http://profiles.doe.mass.edu/profiles/general.aspx?topNavId=1&orgcode=" + orgcode + "&orgtypecode=6&"
    htmltext = urllib.urlopen(url).read()
    # findall returns every captured phone number on the page
    value = pattern.findall(htmltext)
    print orgcode, valuetag, value

However, when I try to match more complex HTML like this...

<td>Attendance Rate</td> 
<td class='center'>  90.1</td>

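I get an empty result. I believe the problem is my syntax. I have googled regex and read most of the documentation, but I am looking for help with this kind of application. I hope someone can point me in the right direction. Is there a combination like (.+?) that tells the regex to jump to the next line of HTML?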

What I want findall to return is the 90.1 that follows "Attendance Rate".

Thanks!
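
A note on why the second pattern comes back empty: by default . does not match a newline, so (.+?) cannot cross from one <td> line to the next. A minimal sketch of one workaround, assuming the page contains exactly the fragment quoted above (the html string and the attendance pattern name are illustrative, not taken from the live page), is to allow whitespace, including line breaks, between the tags with \s*:

```python
import re

# Illustrative fragment copied from the question; the real page may differ.
html = """<td>Attendance Rate</td>
<td class='center'>  90.1</td>"""

# \s* matches spaces and newlines between the tags, so no re.DOTALL flag is needed.
# [^>]* skips any attributes on the second <td>.
attendance = re.compile(r"<td>Attendance Rate</td>\s*<td[^>]*>\s*(.+?)\s*</td>")

values = attendance.findall(html)
print(values)
```

Passing re.DOTALL so that . itself matches newlines is another option, but it makes (.+?) more likely to run across unrelated cells.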

Answer 1

Use an HTML Parser. Example using BeautifulSoup:

from urllib2 import urlopen
from bs4 import BeautifulSoup

url = 'http://profiles.doe.mass.edu/profiles/general.aspx?topNavId=1&orgcode=00350326'

soup = BeautifulSoup(urlopen(url))
for label in soup.select('div#whiteboxRight table td'):
    value = label.find_next_sibling('td')
    if not value:
        continue

    print label.get_text(strip=True), value.get_text(strip=True)
    print "----"

Prints (profile contact information):

...
----
NCES ID: 250279000331
----
Web Site: http://www.bostonpublicschools.org
----
MA School Type: Public School
----
NCES School Reconstituted: No
...
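
The same sibling lookup can be pointed at the Attendance Rate row from the question. This is only a sketch under assumptions: it targets Python 3, it runs against the two-line fragment quoted in the question rather than the live page, and the helper name value_for is made up for illustration.

```python
from bs4 import BeautifulSoup

# Fragment quoted in the question; the live page layout is an assumption.
html = """<td>Attendance Rate</td>
<td class='center'>  90.1</td>"""


def value_for(soup, label):
    """Hypothetical helper: text of the <td> following the <td> whose text is `label`."""
    # Match the label cell by its stripped text content.
    cell = soup.find("td", string=lambda s: s is not None and s.strip() == label)
    if cell is None:
        return None
    # The value lives in the next <td> at the same level, even on another line.
    sibling = cell.find_next_sibling("td")
    return sibling.get_text(strip=True) if sibling is not None else None


soup = BeautifulSoup(html, "html.parser")
print(value_for(soup, "Attendance Rate"))
```

Because the parser works on the element tree, the line break between the two cells does not matter here.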

Answer 2

I ended up using soup.get_text() and it worked well. Thanks!
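
A minimal sketch of that approach, assuming Python 3 and using the fragment from the question in place of the downloaded page: get_text() drops the tags, after which a regex only has to skip the whitespace between the label and the number. The pattern and variable names are illustrative.

```python
import re
from bs4 import BeautifulSoup

# Fragment quoted in the question, standing in for the real page.
html = """<td>Attendance Rate</td>
<td class='center'>  90.1</td>"""

# get_text() flattens the markup to plain text, keeping the line break.
text = BeautifulSoup(html, "html.parser").get_text()

# \s+ spans the newline and padding between the label and its value.
match = re.search(r"Attendance Rate\s+([\d.]+)", text)
rate = match.group(1) if match else None
print(rate)
```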

Python - regex lookup for multiple lines of HTML

html

regex

parsing

web-scraping