Python | Web Crawlers | Am I using it right?

Question

So, I'm looking into Python at the moment. I studied it a long time ago but never learned the language in depth, and now I'm picking it up again.

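What I'm looking into right now is web crawlers, but I'm not sure that's the right thing to study for this project. Correct me if I'm wrong, but this is the project I have in mind: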

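I want to write a program that I can simply start up and give a website URL (a specific page or a whole site). It would scan it for embed/iframe code and output the links into a table, for example: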

Page Title - | -# of iFrames Found- | -Embed 1- -/Embed 1- | -Embed 2- -/Embed 2- etc.

Am I looking into the right language and the right area, or should I be studying something else for this?

Thanks in advance for any feedback/support!

Answer 1

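There are several ways to scrape a website. Here is an example using BeautifulSoup.
You can install BeautifulSoup with
pip install beautifulsoup4 on Windows (or any platform with pip)
apt-get install python3-bs4 on Debian/Ubuntu Linux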

You can get started here.

Working code:

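from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'http://www.aflcio.org/Legislation-and-Politics/Legislative-Alerts'
r = urlopen(url).read()  # raw HTML bytes (urlopen raises HTTPError if the server replies with an error status)
soup = BeautifulSoup(r, 'html.parser')  # name the parser explicitly
print(type(soup))
print(soup.prettify()[0:1000])  # first 1000 characters of the re-indented HTML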

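Output (note that here the site answered with a Cloudflare "Access denied" page, not the article itself):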

<class 'bs4.BeautifulSoup'>
<!DOCTYPE html>
<!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en-US"> <![endif]-->
<!--[if IE 7]>    <html class="no-js ie7 oldie" lang="en-US"> <![endif]-->
<!--[if IE 8]>    <html class="no-js ie8 oldie" lang="en-US"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en-US">
 <!--<![endif]-->
 <head>
  <title>
   Access denied | www.aflcio.org used Cloudflare to restrict access
  </title>
  <meta charset="utf-8"/>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="IE=Edge,chrome=1" http-equiv="X-UA-Compatible"/>
  <meta content="noindex, nofollow" name="robots"/>
  <meta content="width=device-width,initial-scale=1,maximum-scale=1" name="viewport"/>
  <link href="/cdn-cgi/styles/cf.errors.css" id="cf_styles-css" media="screen,projection" rel="stylesheet" type="text/css"/>
  <!--[if lt IE 9]><link rel="stylesheet" id='cf_styles-ie-css' href="/cdn-cgi/styles/cf.errors.ie.css" type="text/css" media="screen,projection" /><![endif]--
>>>

You can play around with the output to filter out what you want, such as iframes. More details here.
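To make the filtering step concrete, here is a minimal sketch of pulling iframe/embed links out of a page and arranging them as one table row (title, count, then each link). It is only an illustration under a few assumptions: it uses the standard library's html.parser in place of BeautifulSoup so that it runs without installing anything, the names EmbedFinder, embed_row and SAMPLE_HTML are made up for this example, and the sample page is hypothetical rather than a live download. With BeautifulSoup, the equivalent lookup would be soup.find_all(['iframe', 'embed']).

```python
from html.parser import HTMLParser


class EmbedFinder(HTMLParser):
    """Collects the page title and the src of every <iframe>/<embed> tag."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.sources = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("iframe", "embed"):
            # attrs is a list of (name, value) pairs; tags without src are skipped
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def embed_row(html):
    """Return one table row: [title, number of links found, link 1, link 2, ...]."""
    finder = EmbedFinder()
    finder.feed(html)
    return [finder.title.strip(), len(finder.sources)] + finder.sources


# Hypothetical sample page, used here instead of fetching a real URL
SAMPLE_HTML = """
<html><head><title>Demo page</title></head>
<body>
  <iframe src="https://example.com/player/1"></iframe>
  <embed src="https://example.com/clip.swf">
  <iframe></iframe>
</body></html>
"""

row = embed_row(SAMPLE_HTML)
print(" | ".join(str(cell) for cell in row))
```

To run this against a real page, the HTML string passed to embed_row could come from urlopen(url).read().decode(), keeping in mind that some sites (like the one above) block automated requests.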

python

embed

iframe