将 HTML 标签解析为 XML

Parsing HTML tags into XML

我正在尝试解析嵌入在下面 HTML 文件中的 XML。以下是其中一个标签的详细信息:

           DOM<tr class="iris_table_row">
                <td style=" width:37.50%; text-align:left; " class="ta_10"><span class="ta_10">Tangible assets</span></td>
                <td style=" width:2.50%; text-align:right; " class="ta_10"><span class="ta_10">2</span></td>
                <td style=" width:30.00%; text-align:right; " class="ta_61"><ix:nonFraction contextRef="cfwd_31_03_2014" name="ns5:TangibleFixedAssets" unitRef="GBP" decimals="0" format="ixt2:numdotdecimal" scale="0" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">7,956</ix:nonFraction></td>
                <td style=" width:1.25%; " class="ta_61" />
                <td style=" width:26.25%; text-align:right; " class="ta_60"><ix:nonFraction contextRef="cfwd_31_03_2013" name="ns5:TangibleFixedAssets" unitRef="GBP" decimals="0" format="ixt2:numdotdecimal" scale="0" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">5,402</ix:nonFraction></td>
                <td style=" width:1.25%; " class="ta_60" />
                <td style=" width:1.25%; " class="ta_10" />
            </tr>

我曾尝试在 java 中使用 DOM 解析器来执行此操作,但它无法识别 XML 标签。

下面代码中db.parse(fXmlFile)的值为"null".

File fXmlFile = new File("Prod223_1254_04903825_20140331 copy.xml");

    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    dbf.setValidating(false);
    dbf.setNamespaceAware(true);
    dbf.setIgnoringComments(false);
    dbf.setIgnoringElementContentWhitespace(false);
    dbf.setExpandEntityReferences(false);
    DocumentBuilder db = dbf.newDocumentBuilder();

    System.out.println(db.parse(fXmlFile));

如何将所有标签和信息放入java?理想情况下,我可以将它们加载到一个 bean 中。

这是我尝试解析的文件类型的示例。

 <?xml version="1.0" encoding="utf-8"?><html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL" xmlns:ixt="http://www.xbrl.org/inlineXBRL/transformation/2010-04-20" xmlns:ixt2="http://www.xbrl.org/inlineXBRL/transformation/2011-07-31" xmlns:link="http://www.xbrl.org/2003/linkbase" xmlns:xbrli="http://www.xbrl.org/2003/instance" xmlns:xbrldi="http://xbrl.org/2006/xbrldi" xmlns:xl="http://www.xbrl.org/2003/XLink" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:fn="http://www.w3.org/2005/xpath-functions" xmlns:iris="http://www.iris.co.uk/ixbrl" xmlns:ns0="http://www.xbrl.org/uk/gaap/core-full/2009-09-01" xmlns:ns5="http://www.xbrl.org/uk/gaap/core/2009-09-01" xmlns:ns6="http://www.xbrl.org/uk/reports/direp/2009-09-01" xmlns:ns7="http://www.xbrl.org/uk/cd/business/2009-09-01" xmlns:ns8="http://www.xbrl.org/uk/all/types/2009-09-01" xmlns:ns9="http://xbrl.org/2005/xbrldt" xmlns:ns10="http://www.xbrl.org/uk/all/common/2009-09-01" xmlns:ns11="http://www.xbrl.org/2006/ref" xmlns:ns12="http://www.xbrl.org/uk/cd/countries/2009-09-01" xmlns:ns13="http://www.xbrl.org/uk/all/ref/2009-09-01" xmlns:ns14="http://www.xbrl.org/uk/cd/currencies/2009-09-01" xmlns:ns15="http://www.xbrl.org/uk/cd/exchanges/2009-09-01" xmlns:ns16="http://www.xbrl.org/uk/cd/languages/2009-09-01" xmlns:ns17="http://www.xbrl.org/2004/ref" xmlns:ns18="http://www.xbrl.org/uk/all/gaap-ref/2009-09-01" xmlns:ns19="http://www.xbrl.org/uk/reports/aurep/2009-09-01" xmlns:iso4217="http://www.xbrl.org/2003/iso4217" xmlns:ns20="http://www.govtalk.gov.uk/uk/fr/tax/full-gaap-dpl/2013-10-01" xmlns:ns21="http://www.govtalk.gov.uk/uk/fr/tax/dpl-gaap-main/2013-10-01" xmlns:ns22="http://www.govtalk.gov.uk/uk/fr/tax/dpl-gaap/2013-10-01" xmlns:ns23="http://www.govtalk.gov.uk/uk/fr/tax/dpl-core/2013-10-01">
<head>
    <meta name="PostingEntryNumber" content="4" />
    <meta name="PeriodRecordNumber" content="2341" />
    <meta content="application/xhtml+xml; charset=UTF-8" http-equiv="Content-Type" />
    <meta name="description" content="iXBRL report production" />
    <meta name="Mode" content="CH" />
    <meta http-equiv="X-UA-Compatible" content="IE=8" />

    <title>Shortt Orthopaedics Limited - Limited company - abbreviated - 11.6</title>
    <style type="text/css">
        @media print
        {
            hr { display:none; }
            .portraitpage
            {
                min-height:273mm;
                max-width:170mm;
            }
            .landscapepage
            {
                min-height:170mm;
                max-width:273mm;
            }
        }
        @media screen
        {
            .portraitpage
            {
                max-width:170mm;
                min-height:273mm;
                margin:12mm 20mm 12mm 20mm;
            }
            .landscapepage
            {
                max-width:273mm;
                min-height:170mm;
                margin:12mm 20mm 12mm 20mm;
            }
        }
        body{ margin:0px; font-size:1.3em; }
        td{ padding:0px; }
        div.portraitpage{ page-break-after:always; position:relative; }
        div.landscapepage{ page-break-after:always; position:relative; }
            div.header{ position:relative; }
            div.footer{ left:0px; right:0px; bottom:0px; text-align:center; position:absolute; }
    div.container{ position:relative; }
                    div.maintext{ width:100.00%; position:relative; }
                    div.tagged_blob{ width:100.00%; position:relative; }
                                table.iris_table{ width:100.00%; border-collapse:collapse; }
                table.iris_table_header{ width:100.00%; border-collapse:collapse; }
                table.iris_table_footer{ width:100.00%; border-collapse:collapse; }
        div.hr.iris_hr{ width:100.00%; }
            td.total_single{ border-top:thin solid black; }
            td.total_double{ border-top:double black; }
        .ta_10{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_11{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_12{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_13{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_20{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_21{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_22{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_23{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_30{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_31{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_32{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_33{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_40{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_41{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_42{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_43{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_50{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_51{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_52{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_53{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_60{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_61{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_62{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_63{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_70{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_71{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_72{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_73{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_80{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_81{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_82{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_83{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_90{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_91{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_92{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_93{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_100{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_101{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_102{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_103{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_110{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_111{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_112{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_113{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_120{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_121{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_122{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_123{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_130{ color:rgb(0, 0, 0); font-family:"Courier New"; font-size:13px; font-weight:400; }
        .ta_131{ color:rgb(0, 0, 0); font-family:"Courier New"; font-size:13px; font-weight:700; }
        .ta_132{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Courier New"; font-size:13px; font-weight:700; }
        .ta_133{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Courier New"; font-size:13px; font-weight:400; }
        .ta_140{ color:rgb(0, 0, 0); font-family:"Arial"; font-size:13px; font-weight:400; }
        .ta_141{ color:rgb(0, 0, 0); font-family:"Arial"; font-size:13px; font-weight:400; }
        .ta_142{ color:rgb(0, 0, 0); font-family:"Arial"; font-size:13px; font-weight:400; }
        .ta_143{ color:rgb(0, 0, 0); font-family:"Arial"; font-size:13px; font-weight:400; }
    </style>
</head>
<body xml:lang="en">
    <div style="display:none">
        <ix:header>
            <ix:hidden>
                <ix:nonNumeric contextRef="FY_31_03_2014" name="ns7:NameAuthor" order="1" tupleRef="XBRLDocumentAuthorGrouping_Group45" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL"></ix:nonNumeric>
                <ix:nonNumeric contextRef="FY_31_03_2014" name="ns7:DescriptionOrTitleAuthor" order="2" tupleRef="XBRLDocumentAuthorGrouping_Group45" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL"></ix:nonNumeric>
                <ix:nonNumeric contextRef="FY_31_03_2014" name="ns7:UKCompaniesHouseRegisteredNumber" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">07189486</ix:nonNumeric>
                <ix:nonNumeric contextRef="CountriesHypercube_FY_31_03_2014_Set1" name="ns7:CountryFormationOrIncorporation" format="ixt2:nocontent" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL" />
                <ix:nonNumeric contextRef="CurrenciesHypercube_FY_31_03_2014_Set2" name="ns7:PrincipalCurrencyUsedInBusinessReport" format="ixt2:nocontent" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL" />
                <ix:nonNumeric contextRef="EntityOfficersHypercube_FY_31_03_2014_Set3" name="ns5:NameDirectorSigningAccounts" format="ixt2:nocontent" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL" />
                <ix:nonNumeric contextRef="cfwd_31_03_2014" name="ns7:StartDateForPeriodCoveredByReport" format="ixt2:datedaymonthyear" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">1.4.13</ix:nonNumeric>
                <ix:nonNumeric contextRef="cfwd_31_03_2014" name="ns7:EndDateForPeriodCoveredByReport" format="ixt2:datedaymonthyear" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">31.3.14</ix:nonNumeric>
                <ix:nonNumeric contextRef="cfwd_31_03_2014" name="ns7:BalanceSheetDate" format="ixt2:datedaymonthyear" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">31.3.14</ix:nonNumeric>
                <ix:nonNumeric contextRef="FY_31_03_2014" name="ns7:EntityAccountsType" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">Company accounts</ix:nonNumeric>
                <ix:nonNumeric contextRef="FY_31_03_2014" name="ns7:LegalFormOfEntity" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">Private Limited Company</ix:nonNumeric>
                <ix:nonNumeric contextRef="FY_31_03_2014" name="ns7:DescriptionPeriodCoveredByReport" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">FY</ix:nonNumeric>
                <ix:nonNumeric contextRef="FY_31_03_2014" name="ns7:EntityTrading" format="ixt2:booleantrue" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">true</ix:nonNumeric>

[Whosebug 限制正文]

我认为您需要分两步走。

  • 使用 HTML 解析器获取有问题的嵌入式 XML
  • ...然后在内容上使用 DOM 解析器

HTML 并不总是 XML 兼容(除非你使用的 XHTML 已经变得不那么流行了)。浏览器会让很多东西漏掉,比如缺少标签、单引号和双引号、没有值的属性等,这可能是您的网站无法解析的原因。

有很多可用。

根据文档,DTD validation always takes place,即使您告诉它不要!

您想做的是创建一个新的 DTD,将您的名称空间添加到标准 XHTML DTD; W3 站点 discusses how to acheive this,他们给出的示例是针对 MathML 的:

First, define a content model module that instantiates the MathML DTD and connects it to the content model:

<!-- File: mathml-model.mod -->
<!ENTITY % XHTML1-math
     PUBLIC "-//W3C//DTD MathML 2.0//EN"
            "http://www.w3.org/TR/MathML2/dtd/mathml2.dtd" >
%XHTML1-math;

<!ENTITY % Inlspecial.extra 
     "%a.qname; | %img.qname; | %object.qname; | %map.qname; 
      | %Mathml.Math.qname;" >

Next, define a DTD driver that identifies our new content model module as the content model for the DTD, and hands off processing to the XHTML 1.1 driver (for example):

<!-- File: xhtml-mathml.dtd -->
<!ENTITY % xhtml-model.mod
      SYSTEM "mathml-model.mod" >
<!ENTITY % xhtml11.dtd
     PUBLIC "-//W3C//DTD XHTML 1.1//EN"
            "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd" >
%xhtml11.dtd;