Trying to pull data from a poorly formatted HTML website

Question

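I've recently been trying to pull information from a website. I've succeeded for the most part, but I'm still having some difficulty.

So far I've been using regex to find some of the information (here, the names I want to look at):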

import re
from urllib.request import urlopen  # Python 3

webAddress = 'http://meridian.puzzlepirates.com/yoweb/crew/info.wm?crewid=' + str(crewid)
htmlFile = urlopen(webAddress)
htmlText = htmlFile.read().decode('utf-8', errors='replace')  # bytes -> str (UTF-8 assumed)

regex = 'classic&target=(.+?)">'
pattern = re.compile(regex)
checkMatch = pattern.findall(htmlText)

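Like so. This works well when there is a consistent indicator on that particular line. However, I've now run into a case where my indicator isn't on the line I want.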

 <td width="28" height="28"><a href="/ratings/top_5_0.html"><img 
  src="/yoweb/images/stat-5.png" width="28" height="28" border="0"
  alt="Gunning"></a></td>
<td align="left">
  <font size="-1">
      <i><b>Exalted</b></i>/<b>Master</b>
  </font>

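I specifically want to pull the second-to-last line there. However, that line might not be bold or italicised, and it doesn't always contain the same words, so my indicator has to be "Gunning", since that's the specific area I care about. Unfortunately, it isn't even always on the same line number across different pages, so I can't just look at a fixed line to find it. Any suggestions would be great!

Edit

I've started trying to learn/use Beautiful Soup (thank you for pointing me in that direction).

I wasn't very clear at first, so let me try to clarify.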

Specifically, I'm trying to pull the rankings from a page like this.

 <td width="28" height="28"><a href="/ratings/top_5_0.html"><img 
  src="/yoweb/images/stat-5.png" width="28" height="28" border="0"
  alt="Gunning"></a></td>
<td align="left">
  <font size="-1">
      <i><b>Exalted</b></i>/<b>Master</b>
  </font>

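The HTML for the part I'm specifically looking for is above, and its formatting isn't always the same (for example, the text may be plain, bold, or bold and italic). So I'm not sure what method I could use to reliably pull a specific stat from that information.

I also tried isolating it by font size, but the number of results is inconsistent, so I can't isolate the specific stat I want that way.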

Answer 1

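The markup is definitely not easy to deal with, but you definitely should not be approaching it with regular expressions. Don't use a tool just because you are familiar with it or good at it; use the most suitable tool for the particular situation.

In this case, you need an HTML parser, such as BeautifulSoup.

Assuming you want to extract the names (the ones shown in bold in the main crew members table):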

>>> import requests
>>> from bs4 import BeautifulSoup
>>> url = "http://meridian.puzzlepirates.com/yoweb/crew/info.wm?crewid=5002373"
>>> 
>>> response = requests.get(url)
>>> 
>>> soup = BeautifulSoup(response.content, "html.parser")
>>> table = soup.find('table', width='330')  # relying on width, yeah, does not look reliable
>>> for b in table.find_all('b'):
...     print(b.get_text(strip=True))
... 
Captain
Senior Officer
Fleet Officer
Officer
Pirate
Cabin Person
Jobbing Pirate
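
The same idea should carry over to the rating next to the "Gunning" icon from the question. Below is a minimal sketch, not a tested solution for the live page: it assumes the rating always sits in the first `<td>` that follows the icon's `<img alt="...">` in document order. `get_rating` and `SAMPLE_HTML` are hypothetical names introduced here for illustration, and the sample markup is the snippet from the question wrapped in a table so it parses on its own.

```python
from bs4 import BeautifulSoup

# Markup copied from the question and wrapped in a table (assumption:
# the live page has the same icon-cell / rating-cell ordering).
SAMPLE_HTML = """
<table><tr>
<td width="28" height="28"><a href="/ratings/top_5_0.html"><img
  src="/yoweb/images/stat-5.png" width="28" height="28" border="0"
  alt="Gunning"></a></td>
<td align="left">
  <font size="-1">
      <i><b>Exalted</b></i>/<b>Master</b>
  </font>
</td>
</tr></table>
"""


def get_rating(html, skill):
    """Return the rating text that follows the icon whose alt text is `skill`.

    Returns None if the icon or a following <td> cannot be found.
    """
    soup = BeautifulSoup(html, "html.parser")

    # Locate the icon by its alt attribute instead of by line position.
    img = soup.find("img", alt=skill)
    if img is None:
        return None

    # The next <td> in document order is assumed to hold the rating.
    rating_cell = img.find_next("td")
    if rating_cell is None:
        return None

    # get_text() ignores <b>/<i> wrappers, so bold or italic text
    # is handled the same way as plain text.
    return rating_cell.get_text(strip=True)


print(get_rating(SAMPLE_HTML, "Gunning"))
```

Because the lookup keys on the `alt` text and then reads only the text of the cell, it does not matter which line the markup lands on or whether the rating is bold, italic, or plain. For a real page you would pass `response.content` from the snippet above instead of `SAMPLE_HTML`.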
