How to scrape train route data from https://rbs.indianrail.gov.in/ShortPath/ShortPath.jsp

I am trying to scrape train route data from https://rbs.indianrail.gov.in/ShortPath/ShortPath.jsp. Given source and destination stations, the page displays a list of intermediate stations in a table. However, it hides some of the intermediate stations behind several buttons, I think to limit the size of the table. Clicking a button adds the hidden data to the table. Using jsoup I can get the initial data in the table, but I don't know how to get the hidden data. On a button click, a JavaScript function requests the data with a POST to https://rbs.indianrail.gov.in/ShortPath/StationXmlServlet, passing "route=inter,index=1,distance=goods,PageName=ShortPath" as parameters, and the response is JSON. As the parameters are not relevant to the displayed table, I cannot make a direct request to https://rbs.indianrail.gov.in/ShortPath/StationXmlServlet.

Getting the list of intermediate stations:
    private void shortestPath(String source, String destination) {

        Document doc;
        try {
            doc = Jsoup.connect(url)
                    .data("srcCode", source.toUpperCase())
                    .data("destCode", destination.toUpperCase())
                    .data("gaugeType", "S")
                    .data("transhipmentFlag", "false")
                    .data("distance", "goods")
                    .post();
            Element table = doc.select("tbody").get(0);
            Elements rows = table.select("tr");
            stationCodeList = new String[rows.size() - 3];
            jsonPath = new JSONObject();
            for (int row = 3; row < rows.size(); row++) {
                JSONObject jsonObject = new JSONObject();
                Elements cols = rows.get(row).select("td");
                String code = cols.get(1).text();
                String name = cols.get(2).text();
                String cum_dist = cols.get(3).text();
                String inter_dist = cols.get(4).text();
                String gauge = cols.get(5).text();
                String carry_cap = cols.get(6).text();
               
                jsonObject.put("Code", code);
                jsonObject.put("Name", name);
                jsonObject.put("Cumulative Distance", cum_dist);
                jsonObject.put("inter Distance", inter_dist);
                jsonObject.put("Gauge Type", gauge);
                jsonObject.put("Carrying Capacity", carry_cap);
                jsonPath.put(code, jsonObject);
                stationCodeList[row - 3] = code;
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        this.destination = new Station(stationCodeList[stationCodeList.length - 1]);
    }

Thanks in advance.

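If you take a look at this answer, you will see how to obtain the exact same request that the browser makes.

Using your example, a minimal, working POST request to StationXmlServlet looks like this in curl: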

curl --request POST 'https://rbs.indianrail.gov.in/ShortPath/StationXmlServlet' \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  -H 'Cookie: JSESSIONID1=0000ob7e89cT3vUAYkBxF6oyW4w:APP2SERV1' \
  --data-raw 'route=inter&index=1&distance=goods&PageName=ShortPath'

"As the parameters are not relevant to the displayed table, I cannot make a direct request to https://rbs.indianrail.gov.in/ShortPath/StationXmlServlet."

I don't think that is true. The index in the request body is the zero-based row index in the master table.

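Solution

It turns out that you only need to follow exactly the same sequence as when using the page in a web browser. That is, you first load the master table, so that when you query the details, the site knows which table you are looking at. The session cookie tracks this state.

First, open the landing page and obtain the cookie: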

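// Requires Java 11+ and imports from java.net (URI) and java.net.http
// (HttpClient, HttpRequest, HttpRequest.BodyPublishers, HttpResponse, HttpResponse.BodyHandlers)
HttpClient client = HttpClient.newHttpClient();

// Load the landing page so that the server starts a session
HttpRequest cookieRequest = HttpRequest.newBuilder()
    .uri(URI.create("https://rbs.indianrail.gov.in/ShortPath/ShortPath.jsp"))
    .GET()
    .build();
HttpResponse<String> cookieResponse =
    client.send(cookieRequest, BodyHandlers.ofString());
// The session cookie arrives in the Set-Cookie response header
String cookie = cookieResponse.headers().firstValue("Set-Cookie").orElseThrow();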
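
A Set-Cookie value usually carries attributes such as Path or HttpOnly after the name=value pair, and a server may send more than one Set-Cookie header. If sending the raw value back does not work, a small helper along the following lines could build a cleaner Cookie header. This is only a sketch: the class and method names are made up for illustration, and the cookie values in main are placeholders, not real session IDs.

```java
import java.util.List;
import java.util.stream.Collectors;

public class CookieHeader {

    // Keeps only the leading "name=value" part of each Set-Cookie value
    // and joins the pairs with "; ", the separator used in a Cookie header.
    public static String fromSetCookie(List<String> setCookieValues) {
        return setCookieValues.stream()
            .map(value -> value.split(";", 2)[0].trim())
            .filter(pair -> !pair.isEmpty())
            .collect(Collectors.joining("; "));
    }

    public static void main(String[] args) {
        // Placeholder value, shaped like the cookie in the curl example
        String header = fromSetCookie(List.of(
            "JSESSIONID1=0000abc:APP2SERV1; Path=/; HttpOnly"));
        System.out.println(header); // prints JSESSIONID1=0000abc:APP2SERV1
    }
}
```

With the snippet above, it could be called as CookieHeader.fromSetCookie(cookieResponse.headers().allValues("Set-Cookie")).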

Next, load the master table with the given form parameters:

HttpRequest masterRequest = HttpRequest.newBuilder()
    .uri(URI.create("https://rbs.indianrail.gov.in/ShortPath/ShortPathServlet"))
    .header("Content-Type", "application/x-www-form-urlencoded")
    .header("Cookie", cookie)
    .POST(BodyPublishers.ofString("srcCode=RGDA&destCode=JSWT&findPath0.x=42&findPath0.y=13&gaugeType=S&distance=goods&PageName=ShortPath"))
    .build();
HttpResponse<String> masterResponse =
    client.send(masterRequest, BodyHandlers.ofString());
String masterTableHTML = masterResponse.body();
// Document masterTablePage = Jsoup.parse(masterTableHTML);
// ...
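
The request body above is written by hand, which is fine for fixed station codes. If the values come from user input, it may be safer to build the application/x-www-form-urlencoded body from a map so that special characters are escaped. The following is a minimal sketch; the FormBody class and its encode method are hypothetical names, and the field values are simply the ones from the request above.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class FormBody {

    // URL-encodes every key and value and joins the pairs with "&".
    // The iteration order of the map decides the order of the fields.
    public static String encode(Map<String, String> fields) {
        return fields.entrySet().stream()
            .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
            .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, matching the hand-written body
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("srcCode", "RGDA");
        fields.put("destCode", "JSWT");
        fields.put("findPath0.x", "42");
        fields.put("findPath0.y", "13");
        fields.put("gaugeType", "S");
        fields.put("distance", "goods");
        fields.put("PageName", "ShortPath");
        System.out.println(encode(fields));
    }
}
```

The resulting string can be passed to BodyPublishers.ofString(...) in place of the literal.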

Finally, you can query the details table for each row of the master table. In the example below, we query the details of the first row.

HttpRequest detailsRequest = HttpRequest.newBuilder()
    .uri(URI.create("https://rbs.indianrail.gov.in/ShortPath/StationXmlServlet"))
    .header("Content-Type", "application/x-www-form-urlencoded")
    .header("Cookie", cookie)
    .POST(BodyPublishers.ofString("route=inter&index=0&distance=goods&PageName=ShortPath"))
    .build();
HttpResponse<String> detailsResponse =
    client.send(detailsRequest, BodyHandlers.ofString());
String jsonResponse = detailsResponse.body();
System.out.println(jsonResponse);
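
To fetch the details for several rows, the same request can be built once per zero-based row index. The sketch below only constructs the requests and does not send them. The DetailsRequests class, its method names, the placeholder cookie, and the row count of 3 are all assumptions for illustration; the real number of rows would have to come from parsing the master table.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.net.http.HttpRequest.BodyPublishers;

public class DetailsRequests {

    static final String DETAILS_URL =
        "https://rbs.indianrail.gov.in/ShortPath/StationXmlServlet";

    // Form body for one master-table row; the index is zero-based.
    public static String detailsBody(int rowIndex) {
        if (rowIndex < 0) {
            throw new IllegalArgumentException("rowIndex must be >= 0");
        }
        return "route=inter&index=" + rowIndex + "&distance=goods&PageName=ShortPath";
    }

    // Builds (but does not send) the POST request for one row.
    public static HttpRequest detailsRequest(String cookie, int rowIndex) {
        return HttpRequest.newBuilder()
            .uri(URI.create(DETAILS_URL))
            .header("Content-Type", "application/x-www-form-urlencoded")
            .header("Cookie", cookie)
            .POST(BodyPublishers.ofString(detailsBody(rowIndex)))
            .build();
    }

    public static void main(String[] args) {
        String cookie = "JSESSIONID1=placeholder"; // assumed value, not a real session
        int rowCount = 3;                          // assumed; take it from the master table
        for (int row = 0; row < rowCount; row++) {
            HttpRequest request = detailsRequest(cookie, row);
            System.out.println(request.method() + " " + request.uri()
                + " " + detailsBody(row));
        }
    }
}
```

Each request would then be sent with client.send(request, BodyHandlers.ofString()), using the same session cookie that was used to load the master table.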