Reading content from HTML using the jsoup library in Java

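My HTML email body looks like the one below.

I want to extract every field in emailBody (Company, Priority, Description, and so on) and then build JSON key-value pairs from them.

I am hoping this can be done with the jsoup library.

HTML content: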

<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
            <meta content="text/html; charset=iso-8859-1">
                <style type="text/css" style="display:none">
                    <!--
p
        {margin-top:0;
        margin-bottom:0}
-->

                </style>
            </head>
            <body dir="ltr">
                <div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
                    <p class="MsoNormal" style="margin:0cm; font-size:11pt; font-family:Calibri,sans-serif">
                        <a>
                            <b>
                                <span lang="EN-US" style="font-size:10.0pt; font-family:&quot;Segoe UI&quot;,sans-serif; color:black">Mandatory fields are Ticket Type, Company, Priority, Short Description.&nbsp;</span>
                            </b>
                        </a>
                    </p>
                    <p class="MsoNormal" style="margin:0cm; font-size:11pt; font-family:Calibri,sans-serif">
                        <span style="">
                            <b>
                                <span lang="EN-US" style="">&nbsp;</span>
                            </b>
                        </span>
                    </p>
                    <span style=""></span>
                    <span style=""></span>
                    <span style=""></span>
                    <span style=""></span>
                    <span style=""></span>
                    <span style=""></span>
                    <span style=""></span>
                    <span style=""></span>
                    <span style=""></span>
                    <span style=""></span>
                    <span style=""></span>
                    <span style=""></span>
                    <table class="MsoNormalTable" align="left" width="666" style="width:499.25pt; border-collapse:collapse; border:none; margin-left:6.75pt; margin-right:6.75pt; margin-bottom:5.75pt">
                        <tbody>
                            <tr style="height:19.85pt">
                                <td width="150" valign="top" style="width:112.25pt; border:solid windowtext 1.0pt; padding:0cm 5.4pt 0cm 5.4pt; height:19.85pt">
                                    <p class="MsoNormal" align="right" style="margin:0cm; font-size:11pt; font-family:Calibri,sans-serif; text-align:right; line-height:105%">
                                        <span style=""></span>
                                        <span style="">
                                            <b>
                                                <span style="color:red">Ticket Type: </span>
                                            </b>
                                        </span>
                                        <span style="">
                                            <span style="">&nbsp;</span>
                                        </span>
                                    </p>
                                </td>
                                <td width="516" valign="top" style="width:387.0pt; border:solid windowtext 1.0pt; border-left:none; padding:0cm 5.4pt 0cm 5.4pt; height:19.85pt">
                                    <p class="MsoNormal" style="margin:0cm; font-size:11pt; font-family:Calibri,sans-serif; line-height:105%">
                                        <span style="">
                                            <span style="">Incident&nbsp;</span>
                                        </span>
                                    </p>
                                </td>
                            </tr>
                            <tr style="height:17.35pt">
                                <td width="150" valign="top" style="width:112.25pt; border:solid windowtext 1.0pt; border-top:none; padding:0cm 5.4pt 0cm 5.4pt; height:17.35pt">
                                    <p class="MsoNormal" align="right" style="margin:0cm; font-size:11pt; font-family:Calibri,sans-serif; text-align:right; line-height:105%">
                                        <span style="">
                                            <b>
                                                <span style="">
                                                    <span style="color:red">Company:</span>&nbsp;
                                                </span>
                                            </b>
                                        </span>
                                    </p>
                                </td>
                                <td width="516" valign="top" style="width:387.0pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0cm 5.4pt 0cm 5.4pt; height:17.35pt">
                                    <p class="MsoNormal" style="margin:0cm; font-size:11pt; font-family:Calibri,sans-serif; line-height:105%">
                                        <span style="">
                                            <span style="">Grupo Bimbo DSDE&nbsp;</span>
                                        </span>
                                    </p>
                                </td>
                            </tr>
                            <tr style="height:17.35pt">
                                <td width="150" valign="top" style="width:112.25pt; border:solid windowtext 1.0pt; border-top:none; padding:0cm 5.4pt 0cm 5.4pt; height:17.35pt">
                                    <p class="MsoNormal" align="right" style="margin:0cm; font-size:11pt; font-family:Calibri,sans-serif; text-align:right; line-height:105%">
                                        <span style="">
                                            <b>
                                                <span style="">Configuration Item:&nbsp;</span>
                                            </b>
                                        </span>
                                    </p>
                                </td>
                                <td width="516" valign="top" style="width:387.0pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0cm 5.4pt 0cm 5.4pt; height:17.35pt">
                                    <span style=""></span>
                                    <p class="MsoNormal" style="margin:0cm; font-size:11pt; font-family:Calibri,sans-serif; line-height:105%">
                                        <span style="">
                                            <span style="">&nbsp;</span>
                                        </span>
                                    </p>
                                </td>
                            </tr>
                            <tr style="height:17.35pt">
                                <td width="150" valign="top" style="width:112.25pt; border:solid windowtext 1.0pt; border-top:none; padding:0cm 5.4pt 0cm 5.4pt; height:17.35pt">
                                    <p class="MsoNormal" align="right" style="margin:0cm; font-size:11pt; font-family:Calibri,sans-serif; text-align:right; line-height:105%">
                                        <span style="">
                                            <b>
                                                <span style="color:red">Priority:&nbsp;</span>
                                            </b>
                                        </span>
                                    </p>
                                </td>
                                <td width="516" valign="top" style="width:387.0pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0cm 5.4pt 0cm 5.4pt; height:17.35pt">
                                    <p class="MsoNormal" style="margin:0cm; font-size:11pt; font-family:Calibri,sans-serif; line-height:105%">
                                        <span style="">
                                            <span style="">P3-Moderate&nbsp;</span>
                                        </span>
                                    </p>
                                </td>
                            </tr>
                            <tr style="height:17.35pt">
                                <td width="150" valign="top" style="width:112.25pt; border:solid windowtext 1.0pt; border-top:none; padding:0cm 5.4pt 0cm 5.4pt; height:17.35pt">
                                    <p class="MsoNormal" align="right" style="margin:0cm; font-size:11pt; font-family:Calibri,sans-serif; text-align:right; line-height:105%">
                                        <span style="">
                                            <b>
                                                <span style="color:red">Short Description:&nbsp;</span>
                                            </b>
                                        </span>
                                    </p>
                                </td>
                                <td width="516" valign="top" style="width:387.0pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0cm 5.4pt 0cm 5.4pt; height:17.35pt">
                                    <span style=""></span>
                                    <p class="MsoNormal" style="margin:0cm; font-size:11pt; font-family:Calibri,sans-serif; line-height:105%">
                                        <span style="">
                                            <span style="">&nbsp;</span>
                                        </span>
                                    </p>
                                </td>
                            </tr>
                            <tr style="height:17.35pt">
                                <td width="150" valign="top" style="width:112.25pt; border:solid windowtext 1.0pt; border-top:none; padding:0cm 5.4pt 0cm 5.4pt; height:17.35pt">
                                    <p class="MsoNormal" align="right" style="margin:0cm; font-size:11pt; font-family:Calibri,sans-serif; text-align:right; line-height:105%">
                                        <span style="">
                                            <b>
                                                <span style="">Description:&nbsp;</span>
                                            </b>
                                        </span>
                                    </p>
                                </td>
                                <td width="516" valign="top" style="width:387.0pt; border-top:none; border-left:none; border-bottom:solid windowtext 1.0pt; border-right:solid windowtext 1.0pt; padding:0cm 5.4pt 0cm 5.4pt; height:17.35pt">
                                    <span style=""></span>
                                    <p class="MsoNormal" style="margin:0cm; font-size:11pt; font-family:Calibri,sans-serif; line-height:105%">
                                        <span style="">
                                            <span style="">&nbsp;</span>
                                        </span>
                                    </p>
                                </td>
                            </tr>
                        </tbody>
                    </table>
                    <p class="MsoNormal" style="margin:0cm; font-size:11pt; font-family:Calibri,sans-serif">
                        <span style=""></span>
                        <span style="">
                            <b>
                                <u>
                                    <span lang="EN-US" style="">&nbsp;</span>
                                </u>
                            </b>
                        </span>
                    </p>
                    <p class="MsoNormal" style="margin:0cm; font-size:11pt; font-family:Calibri,sans-serif">
                        <span style="">
                            <b>
                                <u>
                                    <span lang="EN-US" style="">&nbsp;</span>
                                </u>
                            </b>
                        </span>
                    </p>
                    <p class="MsoNormal" style="margin:0cm; font-size:11pt; font-family:Calibri,sans-serif">
                        <span style="">
                            <b>
                                <u>
                                    <span lang="EN-US" style="">&nbsp;</span>
                                </u>
                            </b>
                        </span>
                    </p>
                    <p class="MsoNormal" style="margin:0cm; font-size:11pt; font-family:Calibri,sans-serif">
                        <span style="">
                            <b>
                                <u>
                                    <span lang="EN-US" style="">&nbsp;</span>
                                </u>
                            </b>
                        </span>
                    </p>
                    <p class="MsoNormal" style="margin:0cm; font-size:11pt; font-family:Calibri,sans-serif">
                        <span style="">
                            <b>
                                <u>
                                    <span lang="EN-US" style="">&nbsp;</span>
                                </u>
                            </b>
                        </span>
                    </p>
                    <p class="MsoNormal" style="margin:0cm; font-size:11pt; font-family:Calibri,sans-serif">
                        <span style="">
                            <b>
                                <u>
                                    <span lang="EN-US" style="">&nbsp;</span>
                                </u>
                            </b>
                        </span>
                    </p>
                    <p class="MsoNormal" style="margin:0cm; font-size:11pt; font-family:Calibri,sans-serif">
                        <span style="">
                            <b>
                                <u>
                                    <span lang="EN-US" style="">&nbsp;</span>
                                </u>
                            </b>
                        </span>
                    </p>
                    <p class="MsoNormal" style="margin:0cm; font-size:11pt; font-family:Calibri,sans-serif">
                        <span style="">
                            <b>
                                <u>
                                    <span lang="EN-US" style="">&nbsp;</span>
                                </u>
                            </b>
                        </span>
                    </p>
                    <p class="MsoNormal" style="margin:0cm; font-size:11pt; font-family:Calibri,sans-serif">
                        <span style="">
                            <b>
                                <u>
                                    <span lang="EN-US" style="">&nbsp;</span>
                                </u>
                            </b>
                        </span>
                    </p>
                    <p class="MsoNormal" style="margin:0cm; font-size:11pt; font-family:Calibri,sans-serif">
                        <span style="">
                            <b>
                                <u>
                                    <span lang="EN-US" style="font-size:13.0pt">NOTE: Request user to contact CMS Operations Team at +91-7337894153/8939984385 for Critical Incidents -&nbsp; P1 with business impact.&nbsp;</span>
                                </u>
                            </b>
                        </span>
                    </p>
                    <p class="MsoNormal" style="margin:0cm; font-size:11pt; font-family:Calibri,sans-serif">
                        <span style="">
                            <b>
                                <span lang="EN-US" style="">&nbsp;</span>
                            </b>
                        </span>
                    </p>
                    <p class="MsoNormal" style="margin:0cm; font-size:11pt; font-family:Calibri,sans-serif">
                        <span style="">
                            <span lang="EN-US" style="font-size:10.5pt; color:black">Thanks,
                                <br>CMS Support
                                </span>
                            </span>
                            <span style="">
                                <b>
                                    <span lang="EN-US" style="">&nbsp;</span>
                                </b>
                            </span>
                        </p>
                        <br>
                        </div>
                    </body>
                </html>

Code:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.logging.Logger;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class FetchHTMLBody {
    private static final Logger logger = Logger.getLogger(FetchHTMLBody.class.getName());

    public static void main(String[] args) {
        String htmlContent = "<html Content mentioned above>";

        Document doc = Jsoup.parse(htmlContent);

        List<String> keys = new ArrayList<>();
        List<Map<String, String>> dataPairs = new ArrayList<>();

        Elements trElements = doc.getElementsByTag("tr");
        //logger.info("trElements::" + trElements);

        for (int i = 0; i < trElements.size(); i++) {
            Element element = trElements.get(i);
            Elements pElements = element.getElementsByTag("td");
            //logger.info("pElements::" + pElements);
            Map<String, String> map = new HashMap<>();
            for (int i1 = 0; i1 < pElements.size(); i1++) {
                Element p = pElements.get(i1);
                if (i == 0) {
                    // Row 0 is treated as a header row: both of its cells become keys.
                    keys.add(p.text());
                } else {
                    // Every later row is then mapped against those two header keys.
                    map.put(keys.get(i1), p.text());
                }
            }
            dataPairs.add(map);
            logger.info("trElements::" + dataPairs);
        }
    }
}

The code above produces the following incorrect output:

[{}, {Ticket Type:  =Company: , Incident =Grupo Bimbo DSDE }, {Ticket Type:  =Configuration Item: , Incident = }, {Ticket Type:  =Priority: , Incident =P3-Moderate }, {Ticket Type:  =Short Description: , Incident = }, {Ticket Type:  =Description: , Incident = }]

The expected output is:

Ticket Type=Incident,
Company=Grupo Bimbo DSDE,
Configuration Item=null,
Priority=P3-Moderate,
Short Description=null,
Description=null

Can anyone help me solve this?

After debugging the Java code, I found a solution that works for my use case.

Solution code:

Document doc = Jsoup.parse(htmlContent);

List<String> keys = new ArrayList<>();
// HashMap does not keep insertion order, so the entries are not printed in row order.
Map<String, String> map = new HashMap<>();

Elements trElements = doc.getElementsByTag("tr");
//logger.info("trElements::" + trElements);

for (int i = 0; i < trElements.size(); i++) {
    Element element = trElements.get(i);
    Elements pElements = element.getElementsByTag("td");
    //logger.info("pElements::" + pElements);

    for (int i1 = 0; i1 < pElements.size(); i1++) {
        Element p = pElements.get(i1);
        if (i1 == 0) {
            // The first cell of each row is the label, so it becomes the key for row i.
            keys.add(p.text());
        } else {
            // The remaining cell holds the value for that row's key.
            map.put(keys.get(i), p.text());
        }
    }
}
logger.info("Map::" + map.entrySet());

Output:

[Configuration Item: = , Ticket Type:  =Incident , Description: = , Priority: =P3-Moderate , Short Description: = , Company: =Grupo Bimbo DSDE ]
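
The output above still differs from the expected one in two ways: the keys keep their trailing colon and spaces, and empty cells come back as whitespace instead of null. Depending on the jsoup version, `&nbsp;` may also survive in `text()` as the character U+00A0, which `String.trim()` does not remove. The following is a minimal sketch of a cleanup step, using only the standard library. The class name `FieldText` and its methods `clean`, `toKey`, and `toValue` are hypothetical helpers introduced for illustration; they are not part of jsoup.

```java
// Hypothetical helper (not part of jsoup): tidies the strings returned by Element.text().
final class FieldText {
    private FieldText() {}

    /** Replaces non-breaking spaces with plain spaces and trims both ends. */
    static String clean(String raw) {
        if (raw == null) {
            return "";
        }
        return raw.replace('\u00A0', ' ').trim();
    }

    /** Turns a label cell into a key by dropping one trailing colon, if present. */
    static String toKey(String rawLabel) {
        String s = clean(rawLabel);
        if (s.endsWith(":")) {
            s = s.substring(0, s.length() - 1).trim();
        }
        return s;
    }

    /** Turns a value cell into a value, mapping blank cells to null. */
    static String toValue(String rawValue) {
        String s = clean(rawValue);
        return s.isEmpty() ? null : s;
    }

    public static void main(String[] args) {
        // Sample inputs mimic the cell text from the table in the question.
        System.out.println(toKey("Ticket Type: \u00A0") + "=" + toValue("Incident\u00A0"));
        System.out.println(toKey("Description:\u00A0") + "=" + toValue("\u00A0"));
    }
}
```

Under these assumptions, the two sample lines print `Ticket Type=Incident` and `Description=null`. In the loop, the calls would wrap the existing `text()` results, for example `map.put(FieldText.toKey(label), FieldText.toValue(value))`. Switching the map to a `LinkedHashMap` would additionally keep the entries in row order.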

A more compact variant: every row in this table has exactly two cells, so the inner loop is not needed. The first cell is the key and the second is the value.

    Document doc = Jsoup.parse(htmlContent);

    List<String> keys = new ArrayList<>();
    List<Map<String, String>> dataPairs = new ArrayList<>();

    Elements trElements = doc.getElementsByTag("tr");
    //logger.info("trElements::" + trElements);

    for (Element trElement : trElements) {
        Elements tdElements = trElement.getElementsByTag("td");
        //logger.info("tdElements::" + tdElements);

        // Skip rows that do not have both a label cell and a value cell.
        if (tdElements.size() < 2) {
            continue;
        }

        Map<String, String> map = new HashMap<>();
        keys.add(tdElements.get(0).text());
        map.put(tdElements.get(0).text(), tdElements.get(1).text());
        dataPairs.add(map);
        logger.info("trElements::" + dataPairs);
    }
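
Since the stated goal is JSON key-value pairs, here is a rough sketch of the final step, assuming the fields have already been collected into a `LinkedHashMap` (which preserves row order). The class `SimpleJson` and its methods `quote` and `toJson` are hypothetical and hand-rolled only to keep the example free of dependencies; in a real project a JSON library such as Jackson or Gson would probably be the better choice.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical minimal serializer for a flat map of string fields.
final class SimpleJson {
    private SimpleJson() {}

    /** Wraps a string in double quotes and escapes the characters JSON requires. */
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Other control characters use the four-digit hex escape.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    /** Serializes the map as one JSON object; null values become JSON null. */
    static String toJson(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (!first) {
                sb.append(',');
            }
            first = false;
            sb.append(quote(e.getKey())).append(':');
            sb.append(e.getValue() == null ? "null" : quote(e.getValue()));
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        // Values taken from the expected output in the question.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("Ticket Type", "Incident");
        fields.put("Company", "Grupo Bimbo DSDE");
        fields.put("Configuration Item", null);
        fields.put("Priority", "P3-Moderate");
        System.out.println(toJson(fields));
    }
}
```

With the sample map above, this prints `{"Ticket Type":"Incident","Company":"Grupo Bimbo DSDE","Configuration Item":null,"Priority":"P3-Moderate"}`.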