How to parse tabular data from CNBC Markets Page?
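I am writing a program that takes user input to connect to a site, downloads its HTML to a text file, and retrieves data from a table twice a day. I know the code won't work for every page (once I get it working, I will probably hardwire the URL into the code). My current problem is that my jsoup parser is not reading the table data correctly. I'm not sure whether my element selectors are too generic. The table looks like a standard table/tr/td layout, but my rows array ends up with size 0. If anyone can help me debug my parser, and possibly suggest where to look so that it can pull the data silently twice a day, I would really appreciate it! There are no runtime/compile errors; I just need the correct output.
Source site: https://www.cnbc.com/us-markets/
Source of the table (snippet):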
<table class="BasicTable-table"><thead class="BasicTable-tableHeading BasicTable-tableHeadingSortable"><tr><th class="BasicTable-textData"><span>SYMBOL <span class="icon-sort undefined"></span></span></th><th class="BasicTable-numData"><span>PRICE <span class="icon-sort undefined"></span></span></th><th class="BasicTable-numData">
My code:
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Scanner;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class StockScraper {
    public static void main(String[] args) {
        Scanner input = new Scanner(System.in);
        System.out.println("Enter the complete url (including http://) of the site you would like to parse:");
        String url = input.nextLine();
        try {
            Document doc = Jsoup.connect(url).get();
            System.out.printf("Title: %s%n", doc.title());
            System.out.println("Writing html contents to 'html.txt'...");
            // Save html contents to text file; try-with-resources closes the writer
            try (PrintWriter outputfile = new PrintWriter("html.txt")) {
                outputfile.print(doc.outerHtml());
            }
            // Select stock data you want to retrieve
            System.out.println("Enter the name of the stock you want to check");
            String name = input.nextLine();
            // Pull data from CNBC Markets
            Element table = doc.select("table").get(0);
            Elements rows = table.select("tr");
            System.out.println(rows.size());
            // Start at 1 to skip the header row
            for (int i = 1; i < rows.size(); i++) {
                Element row = rows.get(i);
                // Select the cells of this row only, not of every row
                Elements col = row.select("td");
                // Compare the cell text, not the Element object, with the name
                if (col.size() > 1 && col.get(0).text().equals(name)) {
                    System.out.println("I worked!");
                    System.out.println(col.get(1).text());
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
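The problem here is that this site is a dynamic page: the content is loaded after the browser's initial download of the page. Jsoup only parses the HTML it receives and does not execute JavaScript, so it is not enough to scrape a page like this. You have a couple of options:
1) Use a tool that emulates a browser and makes all the necessary API calls. A couple of options are Selenium WebDriver or HtmlUnit.
2) Figure out which API calls on this site you are interested in, then call those APIs directly to get a JSON document that you can parse. You can see the API details by opening the developer tools in your browser and looking at the "Network" tab. For this site, an example is the following, which includes the quote for the DJI: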
https://quote.cnbc.com/quote-html-webservice/quote.htm?noform=1&partnerId=2&fund=1&exthrs=0&output=json&symbolType=issue&symbols=599362|579435|593933|49020635|49031016|5093160|617254|601065&requestMethod=extended
Returns:
ExtendedQuoteResult: {
    xmlns: "http://quote.cnbc.com/services/MultiQuote/2006",
    ExtendedQuote: [{
        QuickQuote: {
            symbol: ".DJI",
            code: "0",
            curmktstatus: "REG_MKT",
            FundamentalData: {
                yrlodate: "2020-03-23",
                yrloprice: "18213.65",
                yrhidate: "2020-02-12",
                yrhiprice: "29568.57"
            },
            mappedSymbol: {
                xsi:nil: "true"
            },
            source: "Exchange",
            cnbcId: "599362",
            prev_prev_closing: "21413.44",
            high: "22783.45",
            low: "21693.63",
            provider: "CNBC Quote Cache",
            streamable: "0",
            last_time: "2020-04-06T17:16:28.000-0400",
            countryCode: "US",
            previous_day_closing: "21052.53",
            altName: "Dow Industrials",
            reg_last_time: "2020-04-06T17:16:28.000-0400",
            last_time_msec: "1586207788000",
            altSymbol: ".DJI",
            change_pct: "7.73",
            providerSymbol: ".DJI",
            assetSubType: "Index",
            comments: "RIC",
            last: "22679.99",
            issue_id: "599362",
            cacheServed: "false",
            responseTime: "Mon Apr 06 19:12:09 EDT 2020",
            change: "1627.46",
            timeZone: "EDT",
            onAirName: "Dow Industrials",
            symbolType: "issue",
            assetType: "INDEX",
            volume: "614200990",
            fullVolume: "614200990",
            realTime: "true",
            name: "Dow Jones Industrial Average",
            quoteDesc: { },
            exchange: "Dow Jones Global Indexes",
            shortName: "DJIA",
            cachedTime: "Mon Apr 06 19:12:09 EDT 2020",
            currencyCode: "USD",
            open: "21693.63"
        }
    }
    ...
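Option 2 could be sketched roughly as follows, using only the JDK (Java 11+ for `java.net.http`). This is a minimal sketch, not a definitive implementation. It assumes the endpoint still responds as shown above and that the raw body is ordinary JSON with double-quoted keys and string values (the listing above is the browser's pretty-printed view). The class name `CnbcQuoteSketch` and the helper `firstValue` are hypothetical names made up for this illustration, and the regex lookup only stands in for a real JSON library such as Jackson or Gson, which would be the sturdier choice.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CnbcQuoteSketch {

    // Endpoint taken from the answer above, reduced to one symbol:
    // 599362 is the cnbcId listed for .DJI. To request several symbols,
    // encode the '|' separator as %7C, because URI.create rejects a raw '|'.
    static final String QUOTE_URL =
        "https://quote.cnbc.com/quote-html-webservice/quote.htm"
        + "?noform=1&partnerId=2&fund=1&exthrs=0&output=json"
        + "&symbolType=issue&symbols=599362&requestMethod=extended";

    // Downloads the raw response body as a string.
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
            .header("Accept", "application/json")
            .GET()
            .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // Hypothetical helper: returns the first string value stored under `key`.
    // Assumption: keys and values are double-quoted and values contain no
    // escaped quotes. A real JSON parser does not need this assumption.
    static Optional<String> firstValue(String json, String key) {
        Pattern p = Pattern.compile(
            "\"" + Pattern.quote(key) + "\"\\s*:\\s*\"([^\"]*)\"");
        Matcher m = p.matcher(json);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) throws Exception {
        String body = fetch(QUOTE_URL);
        System.out.println("symbol: " + firstValue(body, "symbol").orElse("n/a"));
        System.out.println("last:   " + firstValue(body, "last").orElse("n/a"));
    }
}
```

Because the key is matched together with its closing quote, looking up `last` does not accidentally match `last_time` or `last_time_msec`. If the endpoint changes or starts requiring extra headers, `fetch` is the only part that should need adjusting.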