Dynamic browsing using Python (Mechanize, BeautifulSoup...)

Question

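I am currently writing a parser in Python to automatically extract some information from a website. I am using mechanize to browse the site. I get the following HTML code: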

<html>
 <head>
  <title>
   XXXXX
  </title>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8; no-cache;" />
  <link rel="shortcut icon" type="image/x-icon" href="/favicon.ico" />
  <link rel="stylesheet" href="/rr/style_other.css" type="text/css" />
 </head>
 <frameset cols="*,370" border="1">
  <frame src="affiche_cp.php?uid=yyyyyyy&amp;type=entree" name="cdrg" />
  <frame src="affiche_bp.php?uid=yyyyyyy&amp;type=entree" name="cdrd" />
 </frameset>
</html>

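I would like to access both frames:

In cdrd I have to fill in some forms and submit them
In cdrg I will get the results of the submission

How can I do this?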

Answer 1

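Personally, I don't use BeautifulSoup to parse HTML. Instead I use PyQuery, which is similar, but I prefer its CSS selector syntax to XPath. I also use Requests to make HTTP requests.

That alone is enough to scrape data and submit requests, so it can do what you want. I understand this may not be the answer you were looking for, but it might be useful to you.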

Scraping the frames with PyQuery

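import requests
import pyquery

# Fetch the frameset page and parse it
response = requests.get('http://example.com')
dom = pyquery.PyQuery(response.text)

# Select both <frame> elements; indexing yields the underlying lxml elements
frames = dom('frame')

frame_one = frames[0]
frame_two = frames[1]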
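
The frame elements only carry relative src values, so each one has to be resolved against the URL of the frameset page before it can be fetched. Below is a minimal sketch of that step. It deliberately uses only the standard library (html.parser and urljoin) rather than PyQuery so it runs without extra packages; the FrameCollector class, the frame_urls function and the http://example.com/ base URL are illustrative assumptions, not part of the original answer.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FrameCollector(HTMLParser):
    """Collect the name and src attribute of every <frame> tag."""

    def __init__(self):
        super().__init__()
        self.frames = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <frame ... /> also end up here.
        if tag == "frame":
            attributes = dict(attrs)
            self.frames[attributes.get("name")] = attributes.get("src")


def frame_urls(html, base_url):
    """Map each frame name to the absolute URL of its document."""
    collector = FrameCollector()
    collector.feed(html)
    return {name: urljoin(base_url, src)
            for name, src in collector.frames.items()}


# The frameset from the question, fed to the parser as a plain string.
FRAMESET_HTML = """
<frameset cols="*,370" border="1">
 <frame src="affiche_cp.php?uid=yyyyyyy&amp;type=entree" name="cdrg" />
 <frame src="affiche_bp.php?uid=yyyyyyy&amp;type=entree" name="cdrd" />
</frameset>
"""

# "http://example.com/" is a placeholder for the real site URL.
urls = frame_urls(FRAMESET_HTML, "http://example.com/")
```

Each URL in urls can then be requested like any ordinary page, for example with requests.get(urls["cdrd"]).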

Making an HTTP request

import requests

response = requests.post('http://example.com/signup', data={
    'username': 'someuser',
    'password': 'secret'
})

response_text = response.text

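data is a dictionary containing the POST data to submit to the form. You should use Chrome's network inspector, Fiddler, or Burp Suite to monitor the requests. While monitoring, submit both forms manually. Then inspect the HTTP requests and recreate them using Requests.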
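
When a form submission is captured in one of these tools, the request body usually appears as a single URL-encoded string. One possible way to turn that string into the dictionary expected by the data argument is sketched below. The form_body_to_dict helper is a hypothetical name, and the captured string simply reuses the example field names from above.

```python
from urllib.parse import parse_qsl


def form_body_to_dict(raw_body):
    """Convert a captured form-encoded body into a dict of POST fields.

    Note: if the form repeats a field name, a plain dict keeps only the
    last value; pass the list of pairs from parse_qsl instead in that case.
    """
    return dict(parse_qsl(raw_body, keep_blank_values=True))


# Body copied from the network monitor (illustrative values only).
captured = "username=someuser&password=secret"
payload = form_body_to_dict(captured)

# payload can now be passed along as: requests.post(url, data=payload)
```

Keeping blank values matters here because some servers expect empty fields to be present in the submission.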

I hope this helps. I work in this field, so feel free to contact me if you need more information.

Answer 2

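The solution to my problem was to load the first frame and fill in the form on that page. Then I load the second frame, where I can read the result associated with the form submitted in the first frame.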
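
A rough sketch of that approach with mechanize might look like the following. It assumes the form lives in the cdrd frame and the result in the cdrg frame, as described in the question. The base URL, the form index and the field names passed in are placeholders, since the real ones are not given here.

```python
from urllib.parse import urljoin

# Frame sources copied from the question's frameset; the base URL is a placeholder.
BASE_URL = "http://example.com/"
FORM_FRAME_SRC = "affiche_bp.php?uid=yyyyyyy&type=entree"    # frame "cdrd"
RESULT_FRAME_SRC = "affiche_cp.php?uid=yyyyyyy&type=entree"  # frame "cdrg"


def frame_url(src, base_url=BASE_URL):
    """Resolve a frame's relative src against the frameset page URL."""
    return urljoin(base_url, src)


def submit_and_read_result(fields, form_index=0):
    """Fill in the form in cdrd, submit it, then return the HTML of cdrg."""
    # Third-party package; imported here so frame_url works without it.
    import mechanize

    # A single Browser instance keeps cookies across all requests.
    browser = mechanize.Browser()
    browser.open(frame_url(FORM_FRAME_SRC))

    # Assumption: the wanted form is the first one on the page.
    browser.select_form(nr=form_index)
    for name, value in fields.items():
        # Plain assignment suits text inputs; list controls expect a list.
        browser[name] = value
    browser.submit()

    # Load the second frame in the same session and read the result.
    return browser.open(frame_url(RESULT_FRAME_SRC)).read()
```

It could be called as submit_and_read_result({"some_field": "some value"}), where some_field stands in for whatever input names the real form uses.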
