Receiving partial results when parsing HTML with HttpBuilder
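When I parse HTML with HTTPBuilder as shown below, I do not receive the full HTML that I see when I visit the page and inspect it. For example, no <img> tags appear in the generated file.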
@Grab(group='org.codehaus.groovy.modules.http-builder', module='http-builder', version='0.7' )
import groovyx.net.http.HTTPBuilder
import static groovyx.net.http.Method.GET
import static groovyx.net.http.ContentType.JSON
import groovy.json.*
def http = new HTTPBuilder('http://www.google.com')
def html = http.get(uri: 'http://www.imdb.com/title/tt2004420/', contentType: groovyx.net.http.ContentType.TEXT) { resp, reader ->
    // Parse the response body (the original passed an undefined variable `s` here)
    def p = new XmlSlurper(new org.cyberneko.html.parsers.SAXParser()).parseText(reader.text)
    new File("/Users/../Documents/temp.txt") << p
}
I want to count the number of images on the HTML page by parsing it.
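This happens because when you parse the document and write out the parsed result, only the text content is written, without the tags. After running the following script: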
@Grab(group='org.codehaus.groovy.modules.http-builder', module='http-builder', version='0.7' )
import groovyx.net.http.HTTPBuilder
import static groovyx.net.http.Method.GET
import static groovyx.net.http.ContentType.JSON
import groovy.json.*
def http = new HTTPBuilder('http://www.google.com')
def html = http.get(uri: 'http://www.imdb.com/title/tt2004420/', contentType: groovyx.net.http.ContentType.TEXT) { resp, reader ->
    def p = new XmlSlurper(new org.cyberneko.html.parsers.SAXParser()).parseText(reader.text)
    new File("lol") << p
}
the file lol contains, for example, the following line:
IMDbMoreAllTitlesTV EpisodesNamesCompaniesKeywordsCharactersQuotesBiosPlotsMovies,
which (in part) looked like this before parsing:
<div class="quicksearch_dropdown_wrapper">
<select name="s" id="quicksearch" class="quicksearch_dropdown navbarSprite"
onchange="jumpMenu(this); suggestionsearch_dropdown_choice(this);">
<option value="all" >All</option>
<option value="tt" >Titles</option>
<option value="ep" >TV Episodes</option>
<option value="nm" >Names</option>
<option value="co" >Companies</option>
<option value="kw" >Keywords</option>
<option value="ch" >Characters</option>
<option value="qu" >Quotes</option>
<option value="bi" >Bios</option>
<option value="pl" >Plots</option>
</select>
</div>
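The flattening can be reproduced without any network access. The following is a minimal sketch, assuming a shortened, well-formed stand-in for the fragment above (the `fragment` string and the variable names are made up for illustration) and the plain `XmlSlurper` rather than the NekoHTML parser used in the scripts:

```groovy
// On Groovy 2.x, XmlSlurper lives in groovy.util and this import can be dropped.
import groovy.xml.XmlSlurper

// Hypothetical, well-formed stand-in for the <select> fragment above.
def fragment = '<select name="s"><option value="all">All</option><option value="tt">Titles</option></select>'

def root = new XmlSlurper().parseText(fragment)

// Converting the parsed node to a String yields only the concatenated text.
def flattened = root.toString()
println flattened

// The structure is not lost; it is still reachable through the parsed tree.
def optionCount = root.option.size()
def firstValue = root.option[0].@value.text()
println "${optionCount} options, first value: ${firstValue}"
```

So the tags are not dropped by the parser; they are only absent from the string form of the parsed result.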
If you want to see the tags, use the following script:
@Grab(group='org.codehaus.groovy.modules.http-builder', module='http-builder', version='0.7' )
import groovyx.net.http.HTTPBuilder
import static groovyx.net.http.Method.GET
import static groovyx.net.http.ContentType.JSON
import groovy.json.*
def http = new HTTPBuilder('http://www.google.com')
def html = http.get(uri: 'http://www.imdb.com/title/tt2004420/', contentType: groovyx.net.http.ContentType.TEXT) { resp, reader ->
    new File("lol") << reader.text
}
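For the stated goal of counting images, one option is to query the parsed tree instead of writing it to a file. This is a sketch under assumptions, not a tested solution for the IMDb page: the `sample` markup and variable names are invented, and the plain `XmlSlurper` is used so the snippet runs offline. The name comparison is case-insensitive because NekoHTML, as far as I know, reports element names in upper case by default.

```groovy
// On Groovy 2.x, XmlSlurper lives in groovy.util and this import can be dropped.
import groovy.xml.XmlSlurper

// Hypothetical, well-formed sample page with three images at different depths.
def sample = '<html><body><div><img src="a.png"/><p>text<img src="b.png"/></p></div><img src="c.png"/></body></html>'

def page = new XmlSlurper().parseText(sample)

// Walk every element depth-first and keep the ones named img (any case).
def images = page.depthFirst().findAll { it.name().equalsIgnoreCase('img') }

def imageCount = images.size()
def sources = images.collect { it.@src.text() }

println "Found ${imageCount} images: ${sources}"
```

Inside the `http.get` closure, the same `depthFirst().findAll { ... }` query could be applied to the `p` returned by `parseText(reader.text)`.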