尝试在 ruby 中使用 open-uri,一些 HTML 内容作为 "Loading..." 进入

Trying to use open-uri in ruby, some HTML contents are coming in as "Loading..."

我正在尝试创建一个程序来比较网页上的特定内容,然后再进行比较,我目前正在努力获取将要更改的信息。但是,如果我检查页面中的元素,会出现会改变的文本,但如果我使用 open-uri 则不会,它以 "Loading..." 的形式出现(见图),有没有办法获得所有 HTML 文字?

Picture here.

这是我目前的代码

contents  = open('https://www.cargurus.com/Cars/l-Used-Mazda-MAZDASPEED6-d841', &:read)

File.open("testing.txt", "w") do |line|
line.puts "\r" + "#{contents}"
end

任何帮助获取正在加载...以更改为实际 HTML 代码的方法都很棒。

谢谢

您的网页仅包含 ajax requestopen-uri returns 服务器端页面,它不等待 ajax 请求

您可以使用下面的代码等待页面加载

#load the libraries 
require 'watir'
browser = Watir::Browser.new
browser.goto "https://www.cargurus.com/Cars/l-Used-Mazda-MAZDASPEED6-d841"
# giving some time for website to load
sleep 2
puts browser.html

注意:您需要 chromedriver 才能使用脚本 http://chromedriver.chromium.org/downloads 如果你不想在浏览器中打开 url 那么你可以使用 headless-WebKit

问题

因此,open uri 只会发出 HTTP 请求并让您访问正文。在这种情况下,正文是 html。 html 具有此数据的占位符,这就是您所看到的。然后 html 表示加载一些 javascript ,这将向服务器发出另一个请求以获取数据,当数据进来时,它将用真实数据替换占位符。因此,要处理此问题,您最终需要从 javascript 发出的请求返回的任何内容。

三个解决方案

从我最不喜欢到我最喜欢的顺序排列。

  1. 您可以尝试计算 JavaScript 以使其在 html 上运行。这会很痛苦,所以我不推荐它,但如果你想走那条路,我认为有一个叫做 "the ruby racer" 的 gem 或其他东西(IIRC,它包装了 v8)。
  2. 您可以启动网络浏览器,让浏览器处理所有的 cray cray,然后在更新后向浏览器请求 html。这就是 Rahul 的解决方案所做的,它是一个非常好的解决方案。这不是我最喜欢的,因为它很重,而且您只能查看 html 中显示的信息。这称为 "scraping",它非常脆弱(某些设计人员在页面周围移动了一些东西,您的脚本就会中断),并且信息采用人工呈现格式,这意味着您通常必须做很多小的解析工作。
  3. 您可以打开浏览器的开发工具,转到网络选项卡,过滤 XHR 请求,然后重新加载页面。其中之一请求获取用于填充占位符的数据。找出它是哪一个,然后您可以自己提出该请求。这在某些方面也可能很脆弱,例如,有时您必须拥有正确的 cookie,并且您经常必须试验浏览器发送的内容以确定您需要多少(通常比发送的少得多,这是适用于您的情况)。 提示:当你这样做时,将请求数据与解析和探索数据分开(即将其保存到文件中,然后在查看数据时从文件中获取数据而不是制作数据)每次都有一个新的请求...这样它就不会改变你,你也不会受到速率限制)

解决方案 #3

所以,我很好奇,于是自己尝试了第 3 种解决方案,效果非常好,请检查一下:

require 'uri'
require 'net/http'

# build a post request to the URL that the page got the data from
uri = URI 'https://www.cargurus.com/Cars/inventorylisting/ajaxFetchSubsetInventoryListing.action?sourceContext=untrackedExternal_true_0'
req = Net::HTTP::Post.new(uri)

# set some headers
req['origin']        = 'https://www.cargurus.com' # for cross origin requests
req['cache-control'] = 'no-cache'                 # no caching, just in case,
req['pragma']        = 'no-cache'                 # we prob don't want stale data

# looks like you can pass it an awful lot of filters to use
req.set_form_data(
  "page"=>"1", "zip"=>"", "address"=>"", "latitude"=>"", "longitude"=>"",
  "distance"=>"100", "selectedEntity"=>"d841", "transmission"=>"ANY",
  "entitySelectingHelper.selectedEntity2"=>"", "minPrice"=>"", "maxPrice"=>"",
  "minMileage"=>"", "maxMileage"=>"", "bodyTypeGroup"=>"", "serviceProvider"=>"",
  "filterBySourcesString"=>"", "filterFeaturedBySourcesString"=>"",
  "displayFeaturedListings"=>"true", "searchSeoPageType"=>"",
  "inventorySearchWidgetType"=>"AUTO", "allYearsForTrimName"=>"false",
  "daysOnMarketMin"=>"", "daysOnMarketMax"=>"", "vehicleDamageCategoriesRaw"=>"",
  "minCo2Emission"=>"", "maxCo2Emission"=>"", "vatOnly"=>"false",
  "minEngineDisplacement"=>"", "maxEngineDisplacement"=>"", "minMpg"=>"",
  "maxMpg"=>"", "minEnginePower"=>"", "maxEnginePower"=>"", "isRecentSearchView"=>"false"
)

# make the request (200 means it worked)
res = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) { |http| http.request req }
res.code # => "200"

# parse the response
require 'json'
json = JSON.parse res.body

# we're on page 1 of 1, and there are 48 results on this page
json['page']                   # => 1
json['listings'].size          # => 48
json['remainingResults']       # => false

# apparently we're looking at some sort of car or smth
json['modelId']                # => "d841"
json['modelName']              # => "Mazda MAZDASPEED6"

# a bunch of places sell this car
json['sellers'].size           # => 47
json['sellers'][0]['location'] # => "Portland OR, 97217"

# the first of our 48 cars seems to be a deal
listing = json['listings'][0]
listing['mainPictureUrl']        # => "https://static.cargurus.com/images/forsale/2018/05/24/02/58/2006_mazda_mazdaspeed6-pic-61663369386257285-152x114.jpeg"
listing['expectedPriceString']   # => ",972"
listing['priceString']           # => ",890"
listing['daysOnMarket']          # => 61
listing['savingsRecommendation'] # => "Good Deal"
listing['carYear']               # => 2006
listing['mileageString']         # => "81,803"

# none of the 48 are salvaged or lemons
json['listings'].count { |l| l['lemon'] }   # => 0
json['listings'].count { |l| l['salvage'] } # => 0

# the savings recommendations seem reasonably distributed
json['listings'].group_by { |l| l["savingsRecommendation"] }.map { |rec, ls| [rec, ls.size] }
# => [["Good Deal", 4],
#     ["Fair Deal", 11],
#     ["No Price Analysis", 23],
#     ["High Price", 8],
#     ["Overpriced", 2]]