Shell script to download a lot of HTML files and store them statically with all CSS

Question

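I have posted many questions (about 290) on a scientific forum, and I would like to get them back by downloading them together with all the associated answers.

The first issue is that I have to be logged in to my personal space to get the list of all my messages. How can I get past this first barrier so that a shell script or a single wget command can retrieve all the URLs and their contents? Can I pass a login and a password to wget so that it logs in and is redirected to the URL that lists all the messages?

Once this first problem is solved, the second issue is that I have to start from 6 different menu pages, each of which contains the titles and links of the questions.

In addition, for some of my questions, the answers and discussion may span multiple pages.

So I wonder whether this global download is achievable, given that I would like to store the pages statically, with the CSS also stored locally on my computer (so that they keep the same formatting in my browser when I consult them).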

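The first menu page is only available once I am logged in to the website, which may also be a problem when downloading with wget.

An example of a URL containing the list of messages, once I am logged in, is: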

https://forums.futura-sciences.com/search.php?searchid=22897684

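The other pages (the main menu has 6 or 7 pages of discussion titles in total) have the form "https://forums.futura-sciences.com/search.php?searchid=22897684&pp=&page=2" (page 2).

https://forums.futura-sciences.com/search.php?searchid=22897684&pp=&page=5 (page 5)

On each of these pages one can see the title and link of every discussion that I would like to download, with its CSS (knowing that each discussion may itself span multiple pages):

For example, the first page of the discussion is "https://forums.futura-sciences.com/archives/804364-demonstration-dilatation-temps.html"

It has a page 2: "https://forums.futura-sciences.com/archives/804364-demonstration-dilatation-temps-2.html"

and a page 3: "https://forums.futura-sciences.com/archives/804364-demonstration-dilatation-temps-3.html"

Naively, I tried to do all this with a single command (taking as an example the URL of my personal space given at the beginning of the post, i.e. "https://forums.futura-sciences.com/search.php?searchid=22897684"):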

wget -r --no-check-certificate --html-extension --convert-links "https://forums.futura-sciences.com/search.php?searchid=22897684"

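Unfortunately, this command downloads all files, possibly including ones I do not want, i.e. files that are not my discussions.

I do not know which approach to use: do I first have to store all the URLs in a file (with all the sub-pages containing all the answers and the whole discussion for each of my initial questions)?

After that, I could maybe run wget -i all_URL_questions.txt. How can I carry out this operation?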
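
A minimal sketch of how such a URL file could be generated, assuming the page patterns shown above hold for every menu page and every discussion. The function names and the page counts passed to them are hypothetical; the real number of pages per discussion would have to be read from the site.

```python
import re

# Search URL taken from the question; the searchid may expire on the real site.
SEARCH_URL = "https://forums.futura-sciences.com/search.php?searchid=22897684"


def menu_page_urls(search_url, page_count):
    """Return the URL of each menu page; page 1 is the bare search URL."""
    urls = [search_url]
    for page in range(2, page_count + 1):
        urls.append(f"{search_url}&pp=&page={page}")
    return urls


def thread_page_urls(first_page_url, page_count):
    """Return the URL of each page of one discussion ('-N.html' pattern)."""
    stem = re.sub(r"\.html$", "", first_page_url)
    urls = [first_page_url]
    for page in range(2, page_count + 1):
        urls.append(f"{stem}-{page}.html")
    return urls


def write_url_list(urls, path="all_URL_questions.txt"):
    """Write one URL per line, the format expected by wget -i."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(urls) + "\n")
```

For instance, write_url_list(menu_page_urls(SEARCH_URL, 6)) would write the six menu page URLs to all_URL_questions.txt, one per line.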

Update

My issue needs a script, so I tried the following with Python:

1)

import urllib, urllib2, cookielib

username = 'USERNAME'
password = 'PASSWORD'

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
login_data = urllib.urlencode({'username' : username, 'password' : password})
opener.open('https://forums.futura-sciences.com/login.php', login_data)
resp = opener.open('https://forums.futura-sciences.com/search.php?do=finduser&userid=253205&contenttype=vBForum_Post&showposts=1')
print resp.read()

But the page printed is not the home page of my personal space.

2)

import requests

# Fill in your details here to be posted to the login form.
payload = { 
    'inUserName': 'USERNAME',
    'inUserPass': 'PASSWORD'
}

# Use 'with' to ensure the session context is closed after use.
with requests.Session() as s:
    p = s.post('https://forums.futura-sciences.com/login.php?do=login', data=payload)
    # print the html returned or something more intelligent to see if it's a successful login page.
    print p.text.encode('utf8')

    # An authorised request.
    r = s.get('https://forums.futura-sciences.com/search.php?do=finduser&userid=253205&contenttype=vBForum_Post&showposts=1')
    print r.text.encode('utf8')

Here too, this does not work.

3)

import requests
import bs4 

site_url = 'https://forums.futura-sciences.com/login.php?do=login'
userid = 'USERNAME'
password = 'PASSWORD'

file_url = 'https://forums.futura-sciences.com/search.php?do=finduser&userid=253205&contenttype=vBForum_Post&showposts=1' 
o_file = 'abc.html'  

# create session
s = requests.Session()
# GET request. This will generate cookie for you
s.get(site_url)
# login to site.
s.post(site_url, data={'vb_login_username': userid, 'vb_login_password': password})
# Next thing will be to visit URL for file you would like to download.
r = s.get(file_url)

# Download file
with open(o_file, 'wb') as output:
    output.write(r.content)
print(f"requests:: File {o_file} downloaded successfully!")

# Close session once all work done
s.close()

Same thing: the content is wrong.

4)

from selenium import webdriver
    
# To prevent download dialog
profile = webdriver.FirefoxProfile()
profile.set_preference('browser.download.folderList', 2) # custom location
profile.set_preference('browser.download.manager.showWhenStarting', False)
profile.set_preference('browser.download.dir', '/tmp')
profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'text/csv')

webdriver.get('https://forums.futura-sciences.com/')
webdriver.find_element_by_id('ID').send_keys('USERNAME')
webdriver.find_element_by_id ('ID').send_keys('PASSWORD')
webdriver.find_element_by_id('submit').click()
browser = webdriver.Firefox()
browser.get('https://forums.futura-sciences.com/search.php?do=finduser&userid=253205&contenttype=vBForum_Post&showposts=1')

I still cannot log in with USERNAME and PASSWORD and get the content of the home page of my personal space.

5)

from selenium import webdriver
from selenium.webdriver.firefox.webdriver import FirefoxProfile
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
import time

def MS_login(username, passwd):  # call this with username and password

    firefox_capabilities = DesiredCapabilities.FIREFOX
    firefox_capabilities['moz:webdriverClick'] = False
    driver = webdriver.Firefox(capabilities=firefox_capabilities)
    fp = webdriver.FirefoxProfile()
    fp.set_preference("browser.download.folderList", 2) # 0 means to download to the desktop, 1 means to download to the default "Downloads" directory, 2 means to use the directory
    fp.set_preference("browser.download.dir","/Users/user/work_archives_futura/")
    driver.get('https://forums.futura-sciences.com/') # change the url to your website
    time.sleep(5) # wait for redirection and rendering
    driver.delete_all_cookies() # clean up the prior login sessions
    driver.find_element_by_xpath("//input[@name='vb_login_username']").send_keys(username)

    elem  = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, "//input[@name='vb_login_password']")))
    elem.send_keys(Keys.TAB)

    driver.find_element_by_xpath("//input[@type='submit']").click()

    print("success !!!!")

    driver.close() # close the browser
    return driver

if __name__ == '__main__':
    MS_login("USERNAME","PASSWORD")

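The window opens fine and the username is filled in, but I cannot fill in the password or click submit.

PS: the main issue may come from the password field having the display:none attribute, so I cannot simulate a TAB into the password field and fill it in once the login has been entered.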

Answer 1

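It seems you already know a lot about scraping with the various methods. All that was missing was the correct field names in the POST request.

I used the Chrome dev tools (F12, then go to the Network tab). With this open, if you log in and quickly stop the browser window from redirecting, you will be able to see the full request to login.php and look at the fields etc.

With that I was able to build this for you. It includes a nice dump function for the responses. To test that my code works, you can use your real password for the positive case and the bad password line for the negative case.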

import requests
import json

s = requests.Session()

def dumpResponseData(r, fileName):
    print(r.status_code)

    print(json.dumps(dict(r.headers), indent=1))
    
    cookieDict = s.cookies.get_dict()
    print(json.dumps(cookieDict, indent=1))
    
    # Write the response body to disk; the file is closed automatically.
    with open(fileName, mode="w", encoding="utf-8") as outfile:
        outfile.write(r.text)

username = "your-username"
password = "your-password"
# password = "bad password"

def step1():
    data = dict()
    data["do"] = "login"
    data["vb_login_md5password"] = ""
    data["vb_login_md5password_utf"] = ""
    data["s"] = ""
    data["securitytoken"] = "guest"
    data["url"] = "/search.php?do=finduser&userid=1077817&contenttype=vBForum_Post&showposts=1"
    data["vb_login_username"] = username
    data["vb_login_password"] = password

    p = s.post('https://forums.futura-sciences.com/login.php?do=login', data=data)

    # Logged In?
    if "vbseo_loggedin" in s.cookies.keys():
        print("Logged In!")
    else:
        print("Login Failed :(")

if __name__ == "__main__":
    step1()
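
The field names used in step1() came from watching the request in the browser. As a cross-check, the inputs of a login form can also be listed from its HTML. The sketch below uses only the standard library; the sample form is hypothetical, assembled from the field names above, and the real page may well contain other fields.

```python
from html.parser import HTMLParser


class InputFieldCollector(HTMLParser):
    """Collect the name and default value of every named <input> element."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name:  # unnamed inputs (e.g. a bare submit button) are not posted
            self.fields[name] = attributes.get("value") or ""


def list_input_fields(html):
    """Return {field name: default value} for the inputs found in html."""
    collector = InputFieldCollector()
    collector.feed(html)
    return collector.fields


# Hypothetical form, NOT copied from the real site.
SAMPLE_FORM = """
<form action="login.php?do=login" method="post">
  <input type="text" name="vb_login_username">
  <input type="password" name="vb_login_password">
  <input type="hidden" name="securitytoken" value="guest">
  <input type="hidden" name="do" value="login">
  <input type="submit" value="Log in">
</form>
"""

for field_name, default in list_input_fields(SAMPLE_FORM).items():
    print(field_name, repr(default))
```

Feeding it the text of the real login page instead of SAMPLE_FORM should show which hidden fields the form expects to be posted alongside the username and password.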

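I don't have any posts in my newly created Futura account, so I can't really test any further for you; I don't want to spam their forum with garbage.

But I would probably start by requesting the post search URL and scraping the links using bs4.

Then you could probably just use wget -r on each link you've scraped.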
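
A possible shape for that scraping step, sketched with the standard library's html.parser rather than bs4 so that it runs without extra packages. The "/archives/" marker is an assumption taken from the discussion URLs in the question, and extract_thread_links is a hypothetical helper name.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the absolute URL of every <a href=...> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page they were found on.
            self.links.append(urljoin(self.base_url, href))


def extract_thread_links(html, base_url, marker="/archives/"):
    """Return the unique links containing marker, in page order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    unique_links = []
    for link in collector.links:
        if marker in link and link not in unique_links:
            unique_links.append(link)
    return unique_links
```

In the session from the login code, something like extract_thread_links(r.text, r.url) on each search results page would give the list of discussions to download.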
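
wget will need the login cookies from the requests session to see the same pages. A minimal sketch of one way to hand them over, assuming the name/value pairs come from s.cookies.get_dict(): write them to a Netscape-format cookie file and pass it to wget with --load-cookies. The helper names, the one-day expiry, and the file names are hypothetical. Unlike wget -r, the command built here only fetches the listed pages and their page requisites.

```python
import http.cookiejar
import time


def save_cookies_for_wget(cookies, path, domain="forums.futura-sciences.com"):
    """Write name/value pairs to a Netscape-format cookie file."""
    jar = http.cookiejar.MozillaCookieJar(path)
    expires = int(time.time()) + 24 * 3600  # assumption: one day is enough
    for name, value in cookies.items():
        jar.set_cookie(http.cookiejar.Cookie(
            version=0, name=name, value=value,
            port=None, port_specified=False,
            domain=domain, domain_specified=True, domain_initial_dot=False,
            path="/", path_specified=True,
            secure=True, expires=expires, discard=False,
            comment=None, comment_url=None, rest={},
        ))
    jar.save(ignore_discard=True, ignore_expires=True)


def wget_command(url_file, cookie_file):
    """Build the wget argument list; run it with subprocess.run(...)."""
    return [
        "wget",
        "--load-cookies", cookie_file,
        "--page-requisites",   # also fetch CSS, images, scripts
        "--convert-links",     # rewrite links for local viewing
        "--adjust-extension",  # save pages with an .html suffix
        "--input-file", url_file,
    ]
```

After a successful login, save_cookies_for_wget(s.cookies.get_dict(), "cookies.txt") followed by subprocess.run(wget_command("all_URL_questions.txt", "cookies.txt")) would be one way to combine the two steps.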

Answer 2

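@Researcher's advice about the requests library is sound. You are not posting all of the request parameters that a browser would send. Overall, I think it will be hard to get requests to pull everything when you consider static content and client-side JavaScript.

Your selenium code from section 4 has a few mistakes in it: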

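# yours
webdriver.find_element_by_id('ID').send_keys('USERNAME')
webdriver.find_element_by_id ('ID').send_keys('PASSWORD')
webdriver.find_element_by_id('submit').click()

# should be (create the browser first and call the find_element_* methods
# on that instance, not on the webdriver module)
# Note: Selenium 4.3+ removed find_element_by_*; there, use find_element(By.ID, ...)
browser = webdriver.Firefox()
browser.get('https://forums.futura-sciences.com/')
browser.find_element_by_id('vb_login_username').send_keys('USERNAME')
browser.find_element_by_id('vb_login_password').send_keys('PASSWORD')
browser.find_element_by_xpath("//input[@type='submit']").click()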

You may need to fiddle with the XPath for the submit button.

Hint: you can debug along the way with screenshots:

# `browser` is the webdriver.Firefox() instance
browser.find_element_by_id('vb_login_username').send_keys('USERNAME')
browser.find_element_by_id('vb_login_password').send_keys('PASSWORD')

browser.get_screenshot_as_file('before_submit.png')
browser.find_element_by_xpath("//input[@type='submit']").click()
browser.get_screenshot_as_file('after_submit.png')
