Python - BeautifulSoup Webscrape
I'm trying to scrape a list of URLs from this site (http://thedataweb.rm.census.gov/ftp/cps_ftp.html), but I've had zero luck following the tutorials. Here is one example of the code I've tried:
from bs4 import BeautifulSoup
import urllib2
url = "http://thedataweb.rm.census.gov/ftp/cps_ftp.html"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
cpsLinks = soup.findAll(text=
    "http://thedataweb.rm.census.gov/pub/cps/basic/")
print(cpsLinks)
These are the links I'm trying to extract:
http://thedataweb.rm.census.gov/pub/cps/basic/201501-/jan15pub.dat.gz
There are roughly 200 of these links. How can I get all of them?
As I understand it, you want to extract the links that follow a specific pattern. BeautifulSoup allows you to specify a regular expression pattern as an attribute value. Let's use the pattern pub/cps/basic/\d+\-/\w+\.dat\.gz$. It matches pub/cps/basic/ followed by one or more digits (\d+), a hyphen (\-), a slash, one or more word characters (\w+), and then .dat.gz at the end of the string. Note that . has a special meaning in regular expressions and needs to be escaped with a backslash; the - is escaped here as well, although that is only strictly required inside a character class.
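As a quick sanity check, here is a minimal sketch (independent of BeautifulSoup) showing that the pattern accepts one of the target URLs and rejects the index page itself; both sample strings are taken straight from the question:

import re

pattern = re.compile(r'pub/cps/basic/\d+\-/\w+\.dat\.gz$')

# A data-file link from the question matches (prints a match object):
print(pattern.search("http://thedataweb.rm.census.gov/pub/cps/basic/201501-/jan15pub.dat.gz"))

# ...while the index page itself does not (prints None):
print(pattern.search("http://thedataweb.rm.census.gov/ftp/cps_ftp.html"))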
Code:
import re
import urllib2
from bs4 import BeautifulSoup
url = "http://thedataweb.rm.census.gov/ftp/cps_ftp.html"
soup = BeautifulSoup(urllib2.urlopen(url))
links = soup.find_all(href=re.compile(r'pub/cps/basic/\d+\-/\w+\.dat\.gz$'))
for link in links:
    print link.text, link['href']
Prints:
13,232,040 http://thedataweb.rm.census.gov/pub/cps/basic/201501-/jan15pub.dat.gz
13,204,510 http://thedataweb.rm.census.gov/pub/cps/basic/201401-/dec14pub.dat.gz
13,394,607 http://thedataweb.rm.census.gov/pub/cps/basic/201401-/nov14pub.dat.gz
13,409,743 http://thedataweb.rm.census.gov/pub/cps/basic/201401-/oct14pub.dat.gz
13,208,428 http://thedataweb.rm.census.gov/pub/cps/basic/201401-/sep14pub.dat.gz
...
10,866,849 http://thedataweb.rm.census.gov/pub/cps/basic/199801-/jan99pub.dat.gz
3,172,305 http://thedataweb.rm.census.gov/pub/cps/basic/200701-/disability.dat.gz
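For reference, urllib2 no longer exists on Python 3, so a minimal sketch of the same approach there would use urllib.request instead. Passing the parser name "html.parser" explicitly is my assumption, just to avoid the missing-parser warning that newer bs4 versions emit:

import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://thedataweb.rm.census.gov/ftp/cps_ftp.html"
# "html.parser" is the parser bundled with Python; any installed bs4 parser works.
soup = BeautifulSoup(urlopen(url), "html.parser")

# Same regex-based href filter as above, printed as "link text, href".
for link in soup.find_all(href=re.compile(r'pub/cps/basic/\d+\-/\w+\.dat\.gz$')):
    print(link.text, link['href'])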