Using Python urllib and Beautiful Soup to extract information from an HTML site

Question

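I am trying to extract some information from this website, namely the following line:

Scale (Virgo + GA + Shapley): 29 pc/arcsec = 0.029 kpc/arcsec = 1.72 kpc/arcmin = 0.10 Mpc/degree

However, everything after the : varies depending on the galaxy queried (galname).

I have written code that uses BeautifulSoup and urllib and returns some information, but I am struggling to reduce the data further to just the part I want. How can I get that information?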

import re
from urllib.request import urlopen  # Python 3; in Python 2 this was urllib.urlopen

from bs4 import BeautifulSoup

galname = 'M82'
a = 'http://ned.ipac.caltech.edu/cgi-bin/objsearch?objname='+galname+'&extend'+\
    '=no&hconst=73&omegam=0.27&omegav=0.73&corr_z=1&out_csys=Equatorial&out_equinox=J2000.0&obj'+\
    '_sort=RA+or+Longitude&of=pre_text&zv_breaker=30000.0&list_limit=5&img_stamp=YES'
print(a)

f = urlopen(a)
soup = BeautifulSoup(f, 'html.parser')  # name the parser explicitly

# current attempt at narrowing down the text nodes
soup.find_all(text=re.compile('Virgo')) and soup.find_all(text=re.compile('GA')) and soup.find_all(text=re.compile('Shapley'))

Answer 1

Define a regular expression pattern that helps BeautifulSoup find the appropriate node, then use a capturing group to extract the number:

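import re
pattern = re.compile(r"D \(Virgo \+ GA \+ Shapley\)\s+:\s+([0-9\.]+)")
# soup.find() returns the first text node matching the pattern; group(1) is the captured number
print(pattern.search(soup.find(text=pattern)).group(1))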

This prints 5.92.
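
Note that both soup.find() and pattern.search() return None when nothing matches, so the one-liner above raises an exception for a page that lacks this line. Below is a minimal sketch of a more defensive variant, written for Python 3 and using only the re module. The helper name extract_distance and the sample text are made up for illustration and are not taken from the NED page:

```python
import re

# Same pattern as above: the capturing group holds the number after the colon.
PATTERN = re.compile(r"D \(Virgo \+ GA \+ Shapley\)\s+:\s+([0-9\.]+)")

def extract_distance(text):
    """Return the captured number as a float, or None if the line is absent."""
    match = PATTERN.search(text)
    if match is None:
        return None
    return float(match.group(1))

# Hypothetical sample text shaped like the line the pattern expects.
sample = "D (Virgo + GA + Shapley)   :   5.92 Mpc"
print(extract_distance(sample))
print(extract_distance("no such line here"))
```

With BeautifulSoup in the picture, the text passed in could be soup.get_text() or the node returned by soup.find(text=pattern), after checking that the node is not None.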

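Also, I am usually against using regular expressions to parse HTML. However, since this is a plain text search and the regex is not used to match opening or closing tags or anything else related to the structure HTML provides, you can apply your pattern directly to the page's HTML source without involving an HTML parser at all: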

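# f must be a fresh response: read() returns nothing once BeautifulSoup has consumed it
data = f.read().decode('utf-8', 'replace')  # bytes -> text so the str pattern can be applied
pattern = re.compile(r"D \(Virgo \+ GA \+ Shapley\)\s+:\s+([0-9\.]+)")
print(pattern.search(data).group(1))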
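
The same idea extends to the Scale line quoted in the question. Here is a rough sketch, assuming the page text contains that line in the form shown in the question; the function name parse_scale and the layout of the returned dictionary are made up for illustration:

```python
import re

# Matches the "Scale (Virgo + GA + Shapley)" line and captures what follows the colon.
SCALE_LINE = re.compile(r"Scale \(Virgo \+ GA \+ Shapley\)\s*:\s*(.+)")
# Matches one "value unit" pair such as "29 pc/arcsec".
VALUE_UNIT = re.compile(r"([0-9.]+)\s+(\S+)")

def parse_scale(text):
    """Return {unit: value} for the Scale line, or an empty dict if it is missing."""
    line = SCALE_LINE.search(text)
    if line is None:
        return {}
    return {unit: float(value) for value, unit in VALUE_UNIT.findall(line.group(1))}

# The line quoted in the question, used here as sample input.
sample = ("Scale (Virgo + GA + Shapley): 29 pc/arcsec = 0.029 kpc/arcsec"
          " = 1.72 kpc/arcmin = 0.10 Mpc/degree")
print(parse_scale(sample))
```

Keying the result by unit means the caller does not depend on the order in which the values appear, which may help if the page layout differs between galaxies.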
