How can I scrape the HTML data I want in Java?
I'm practicing scraping data from websites, and I'm stuck on the site at https://www.ilan.gov.tr/ilan/kategori/12/iflas-hukuku-davalari?txv=12&currentPage=1. I want to get the Kurum - İlan Numarası - Şehir (Corporation - Notice Number - City) data. I don't think I can scrape the div: when I run the code that uses the selector div.search-results-header row, it doesn't work. I also want to get the first 20 pages of this site. How can I do that? The surrounding code is a complicated mess, so I'm adding images as attachments. If you can tell me at least how to get Kurum, I think I can handle the rest. Thank you.
However, this is the code I'm writing for the project:
public static void main(String[] args) throws Exception {
    File iflasHukuku = new File("/Users/Berkan/Desktop/Iflas Hukuku.txt");
    iflasHukuku.createNewFile();
    FileWriter fileWriter = new FileWriter(iflasHukuku);
    BufferedWriter bufferedWriter = new BufferedWriter(fileWriter);
    final Document document = Jsoup.connect("https://www.ilan.gov.tr/ilan/kategori/12/iflas-hukuku-davalari?txv=12&currentPage=1").get();
    for (Element x : document.select(".search-results-table-container container mb-4 ng-tns-c6-3 ng-star-inserted")) {
        final String kurumAdi = x.select("div.search-results-header row").text(); // this is the selector that returns nothing
        System.out.println(kurumAdi);
    }
}
The page appears to be an Angular application, so you can't simply fetch the HTML content with Jsoup.connect: a browser has to execute the JavaScript before the page is rendered. Instead, use a WebDriver to load the page, take its pageSource, and hand that to Jsoup.
See this:
import io.github.bonigarcia.wdm.WebDriverManager;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class JSoupTest {
    public static void main(String[] args) {
        WebDriverManager.chromedriver().setup(); // downloads the matching driver binary
        ChromeOptions chromeOptions = new ChromeOptions();
        chromeOptions.setHeadless(true);
        WebDriver driver = new ChromeDriver(chromeOptions);
        driver.get("https://www.ilan.gov.tr/ilan/kategori/12/iflas-hukuku-davalari?txv=12&currentPage=1");
        WebDriverWait wait = new WebDriverWait(driver, 30);
        // wait until Angular has rendered the result rows
        wait.until(webDriver -> webDriver.getPageSource().contains("İlan Açıklaması"));
        final Document document = Jsoup.parse(driver.getPageSource());
        for (Element x : document.select(".search-results-row")) {
            System.out.println(x.text());
            // parse it further
        }
        driver.quit(); // release the browser
    }
}
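Since the question ultimately wants the Kurum - İlan Numarası - Şehir triple per row, here is a sketch of what the per-row parsing could look like. Only .search-results-row and .search-results-header appear in the question; the .ilan-no and .sehir selectors are hypothetical placeholders, so inspect one rendered row in the browser's dev tools and substitute the real class names. The helper could sit next to main in the class above:

// Sketch of per-row field extraction. ".ilan-no" and ".sehir" are
// hypothetical placeholders for the real class names on the page.
static void printRows(Document document) {
    for (Element row : document.select(".search-results-row")) {
        String kurum = row.select(".search-results-header").text(); // corporation name
        String ilanNo = row.select(".ilan-no").text();              // placeholder selector
        String sehir = row.select(".sehir").text();                 // placeholder selector
        System.out.println(kurum + " - " + ilanNo + " - " + sehir);
    }
}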
Required dependencies:
<dependency>
    <groupId>io.github.bonigarcia</groupId>
    <artifactId>webdrivermanager</artifactId>
    <version>4.2.2</version>
</dependency>
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-chrome-driver</artifactId>
    <version>3.141.59</version>
</dependency>
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-support</artifactId>
    <version>3.141.59</version>
</dependency>
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>28.2-jre</version>
</dependency>
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>
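For the first 20 pages, you can loop the same setup over the currentPage query parameter. This is a minimal sketch, assuming that parameter paginates the results the way the URL in the question suggests (the JSoupPagedTest class name is just for illustration); it reuses one browser instance rather than launching twenty:

import io.github.bonigarcia.wdm.WebDriverManager;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class JSoupPagedTest {
    public static void main(String[] args) {
        WebDriverManager.chromedriver().setup();
        ChromeOptions chromeOptions = new ChromeOptions();
        chromeOptions.setHeadless(true);
        WebDriver driver = new ChromeDriver(chromeOptions);
        WebDriverWait wait = new WebDriverWait(driver, 30);
        // Assumption: currentPage selects the result page, as the question's URL suggests.
        String baseUrl = "https://www.ilan.gov.tr/ilan/kategori/12/iflas-hukuku-davalari?txv=12&currentPage=";
        try {
            for (int page = 1; page <= 20; page++) {
                driver.get(baseUrl + page);
                // wait until Angular has rendered this page's rows
                wait.until(webDriver -> webDriver.getPageSource().contains("İlan Açıklaması"));
                Document document = Jsoup.parse(driver.getPageSource());
                for (Element row : document.select(".search-results-row")) {
                    System.out.println(row.text()); // parse fields here as shown above
                }
            }
        } finally {
            driver.quit(); // always release the browser
        }
    }
}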