How do I get the HTML of the web page I am trying to scrape into a string?

Question

I wrote the following code:

require "http/client"
require "myhtml"

puts "Give me the URL of the page to be scraped."

url = gets

html=<<-HTML
 [Here goes the html of the website to be scraped]
HTML

myhtml = Myhtml::Parser.new(html)

myhtml.nodes(:div).each do |node|
  id = node.attribute_by("id")

  if first_link = node.scope.nodes(:a).first?
    href = first_link.attribute_by("href")
    link_text = first_link.inner_text

    puts "div with id #{id} have link [#{link_text}](#{href})"
  else
    puts "div with id #{id} have no links"
  end
end

How do I get the HTML of the web page I am trying to scrape into a string, so that I can replace

html=<<-HTML
 [Here goes the html of the website to be scraped]
HTML

with something like

response = requests.get(url)

html = BeautifulSoup(response.text, 'html.parser')

from the following Python code:


url = input("What is the address of the web page in question?\n")

response = requests.get(url)

html = BeautifulSoup(response.text, 'html.parser')

or like let html = reqwest::get(url).await?.text().await?; from the following Rust code:

println!("Give me the URL of the page to be scraped."); 
 let mut url = String::new();
 io::stdin().read_line(&mut url).expect("Failed to read line");

 let html = reqwest::get(url).await?.text().await?;

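The documentation of the myhtml shard does not provide enough examples for me to work this out. Can it be done with Crystal's HTTP client from the standard library? When I replace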

html=<<-HTML
 [Here goes the html of the website to be scraped]
HTML

with

response = HTTP::Client.get url

html = response.body

I get the following error:

response = HTTP::Client.get url #no overload matches 'HTTP::Client.get' with type (String | Nil)
                             ^--
Error: no overload matches 'HTTP::Client.get' with type (String | Nil)

Overloads are:
 - HTTP::Client.get(url : String | URI, headers : HTTP::Headers | ::Nil = nil, body : BodyType = nil, tls : TLSContext = nil)
 - HTTP::Client.get(url : String | URI, headers : HTTP::Headers | ::Nil = nil, body : BodyType = nil, tls : TLSContext = nil, &block)
 - HTTP::Client.get(url, headers : HTTP::Headers | ::Nil = nil, tls : TLSContext = nil, *, form : String | IO | Hash)
 - HTTP::Client.get(url, headers : HTTP::Headers | ::Nil = nil, tls : TLSContext = nil, *, form : String | IO | Hash, &block)
Couldn't find overloads for these types:
 - HTTP::Client.get(Nil)

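I can get the text of a web page by hard-coding the address, for example response = HTTP::Client.get "https://github.com/monero-project/monero/releases", but that is not enough, because I want the application to be interactive.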

Answer 1

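You are close; it is the type system that is complaining. HTTP::Client.get expects a String (or, more precisely, String | URI). In your code, however, the url variable can also be nil: gets returns String?, which is shorthand for String | Nil. If you hard-code the URL, the argument can never be nil and is always a String, which is why that HTTP::Client.get call compiles.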

Take a look at the documentation of the gets method:

def gets(chomp = true) : String?

Reads a line from this IO. A line is terminated by the \n character. Returns nil if called at the end of this IO.
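To make the String | Nil return type concrete, here is a small sketch that reads from an in-memory IO instead of standard input. The IO::Memory stand-in and the example.com address are assumptions chosen for illustration; they do not come from the question.

```crystal
# An in-memory IO with exactly one line, standing in for STDIN (assumption).
io = IO::Memory.new("https://example.com\n")

first = io.gets  # the line, with the trailing "\n" removed (chomp = true)
second = io.gets # nil, because the IO is now at its end

# Both variables have the compile-time type String | Nil, so neither can be
# passed directly to a method that only accepts String | URI.
puts first.inspect
puts second.inspect
```

The same thing happens with a bare gets on standard input: it returns nil when input ends, for example when the user presses Ctrl-D.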

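There are several ways to solve this, but the basic idea is that you must make sure url cannot be nil at the point where you make the HTTP call. For example: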

url = gets
if url
  # now url cannot be nil
  response = HTTP::Client.get url
  html = response.body
  puts html
end
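One possible variation on the same idea, as a rough sketch rather than the only way to do it, is to move the nil check into a small helper that also rejects blank input. The names read_url and fetch_html are hypothetical helpers invented for this example; HTTP::Client.get and response.body are the same standard library calls used above.

```crystal
require "http/client"

# Hypothetical helper: reads one line from the given IO and returns it
# stripped of surrounding whitespace, or nil if there was no input or
# the line was blank.
def read_url(input : IO = STDIN) : String?
  if line = input.gets
    # Inside this branch the compiler knows line is a String.
    url = line.strip
    url unless url.empty?
  end
end

# Hypothetical helper: downloads the page and returns its body as a String.
def fetch_html(url : String) : String
  HTTP::Client.get(url).body
end

# Intended usage (left commented out so the sketch does not block on input):
#
#   if url = read_url
#     html = fetch_html(url)
#     myhtml = Myhtml::Parser.new(html)
#   else
#     abort "No URL given."
#   end
```

Because the if url = read_url form assigns and checks in one step, url is a plain String inside the branch, so the call compiles without further checks.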

Further reading: if var


httpclient

web-scraping

crystal-lang