Scrape only specific details from a web page

Question

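I am using Jsoup to retrieve details from a web page and write them to a text file. Can I retrieve only part of the page? For example, from the link below I only want the job description.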

http://aldogroup.luceosolutions.com/recruit/stores/advert_details.php?id=3136&_lang=en&partner_id=139

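Job postings sometimes come from different websites, so the HTML markup can differ from one site to the next. I need a way to retrieve only the job description. The code below retrieves everything on the page. How can I get just the job description? Please help.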

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class MainCollector {

    public static void main(String[] args) {
        Document doc;
        try {
            // Fetch and parse the whole page
            doc = Jsoup.connect("http://aldogroup.luceosolutions.com/recruit/stores/advert_details.php?id=3136&_lang=en&partner_id=139").get();
            String title = doc.title();
            // Serialize the body to HTML, re-parse it, and flatten it to plain text
            String body = doc.body().toString();
            Document convertText = Jsoup.parseBodyFragment(body);
            String convertedText = convertText.text();
            System.out.println("Title:" + title);
            // Prints the text of the entire page, not just the job description
            System.out.println("Body:" + convertedText);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Answer 1

You can use this:

Elements e = doc.select(".annonce > p:nth-child(5)");
System.out.println(e.text());

To find the right CSS selector, open your browser's developer tools (press F12) and use the inspector tool.
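
To see how a selector like the one above behaves, here is a minimal sketch that runs it against a small inline HTML fragment instead of the live page. The markup in `SAMPLE_HTML`, the class name `SelectorDemo`, and the helper `extractDescription` are made up for illustration; only the `annonce` class and the `p:nth-child(5)` position come from the answer, and the real page may be laid out differently.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class SelectorDemo {

    // Made-up markup standing in for the real page (assumption).
    // The description is the fifth child of the .annonce container.
    static final String SAMPLE_HTML =
            "<div class=\"annonce\">"
          + "<h1>Store Manager</h1>"
          + "<p>Location: somewhere</p>"
          + "<p>Contract: full time</p>"
          + "<p>Reference: 0000</p>"
          + "<p>Job description: lead the store team.</p>"
          + "</div>";

    // Hypothetical helper: parse the HTML and return the text of the
    // <p> that is the fifth child of an element with class "annonce".
    static String extractDescription(String html) {
        Document doc = Jsoup.parse(html);
        Elements match = doc.select(".annonce > p:nth-child(5)");
        return match.text();
    }

    public static void main(String[] args) {
        System.out.println(extractDescription(SAMPLE_HTML));
    }
}
```

Note that `:nth-child(5)` counts all child elements, not only `<p>` elements, so the `<h1>` in this fragment counts as the first child. If no element matches, `select` returns an empty `Elements` and `text()` yields an empty string rather than throwing.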
You should also add a user agent string to your request, so that your program gets the same page the browser does:

doc = Jsoup.connect("http://aldogroup.luceosolutions.com/recruit/stores/advert_details.php?id=3136&_lang=en&partner_id=139")
                .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0")
                .get();
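
The question also mentions postings from different sites with different markup. No single selector will cover them all, but one possible approach is to keep a list of candidate selectors and use the first one that matches. The sketch below assumes that approach; the class `DescriptionFinder`, the helper `findDescription`, the output file name `description.txt`, and the second and third selectors are hypothetical placeholders that you would replace after inspecting each site.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DescriptionFinder {

    // Candidate selectors, tried in order. Only the first one comes from
    // the answer above; the others are hypothetical examples.
    static final List<String> CANDIDATES = List.of(
            ".annonce > p:nth-child(5)",
            "#job-description",
            ".job-description");

    // Return the text of the first candidate that matches a non-empty element.
    static Optional<String> findDescription(Document doc) {
        for (String css : CANDIDATES) {
            Element hit = doc.selectFirst(css);
            if (hit != null && hit.hasText()) {
                return Optional.of(hit.text());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) throws IOException {
        // The page URL is passed as the first command-line argument.
        Document doc = Jsoup.connect(args[0])
                .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0")
                .get();

        Optional<String> description = findDescription(doc);
        if (description.isPresent()) {
            // Write only the description to a text file (file name is an assumption).
            Files.writeString(Path.of("description.txt"), description.get());
        } else {
            System.err.println("No candidate selector matched this page.");
        }
    }
}
```

This uses `List.of` and `Files.writeString`, so it needs Java 11 or later. When a new site shows up, you would inspect its markup in the browser and add a selector for it to `CANDIDATES`.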

