How to parse HTML nodes

Question

My website flow:

1. An authenticated user uploads a docx.
2. I convert this docx to HTML using the OpenXmlPowerTools API.
3. I save the file.
4. I save each node of the HTML page to the database.

Database:

tblNodeCollection

NodeId
NodeType (expected values - <p>, <h1>, <h3>, <table>)
NodeContent (expected value - <p> This is p content </p>)

I have no problems up to step #3, but I am clueless about how to save the node collection to the table.

I googled and found HtmlAgilityPack, but I don't know much about it.

using System.IO;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;
using HtmlAgilityPack;
using OpenXmlPowerTools;

namespace ExportData
{
    public class ExportHandler
    {
        public void GenerateHTML()
        {
            // Copy the docx into an expandable stream so it can be opened for editing.
            byte[] byteArray = File.ReadAllBytes(@"d:\test.docx");
            using (MemoryStream memoryStream = new MemoryStream())
            {
                memoryStream.Write(byteArray, 0, byteArray.Length);
                using (WordprocessingDocument doc =
                    WordprocessingDocument.Open(memoryStream, true))
                {
                    HtmlConverterSettings settings = new HtmlConverterSettings()
                    {
                        PageTitle = "My Page Title"
                    };
                    XElement html = HtmlConverter.ConvertToHtml(doc, settings);

                    File.WriteAllText(@"d:\Test.html", html.ToStringNewLineOnAttributes());
                }
            }

            //now how do I proceed from here
        }
    }
}

Any kind of help/guidance is highly appreciated.

Answer 1

Based on our discussion in the comments, and on the part you seem to be stuck on, I suggest the following:

This question on SO may offer some help with how to convert to HTML.

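Of course, you still face the problem of needing to split out each page (as you mentioned in the comments). You may be able to export each page to HTML individually.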

As for your database structure, I'd suggest something like:

[Document Table]
  - Document ID
  - Document Name
  - Any other data you need per-document

[Node Table]
  - Node ID
  - Document ID (foreign key)
  - Node Content (string)

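Make sure you have sensible indexes on the node table (especially on the Document ID column), because over time you may be searching through thousands, if not millions, of rows.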

It may also be useful to give each node an index property (e.g. a bigint position), so that you can rebuild the document by putting the nodes back in order.
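
To make the position idea concrete, here is a minimal sketch of rebuilding a document from its stored nodes. It assumes an in-memory row type; the names NodeRow, DocumentRebuilder, Rebuild and Position are hypothetical and only mirror the table layout suggested above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory mirror of one [Node Table] row,
// extended with the suggested position column.
public sealed class NodeRow
{
    public long NodeId { get; set; }
    public int DocumentId { get; set; }      // foreign key to [Document Table]
    public long Position { get; set; }       // order of the node within its document
    public string NodeContent { get; set; } = "";
}

public static class DocumentRebuilder
{
    // Concatenates the content of one document's nodes in Position order.
    public static string Rebuild(IEnumerable<NodeRow> rows, int documentId)
    {
        return string.Concat(
            rows.Where(r => r.DocumentId == documentId)
                .OrderBy(r => r.Position)
                .Select(r => r.NodeContent));
    }
}
```

In a real database the same ordering would be an ORDER BY on the position column, filtered by document ID, which is also why an index covering those two columns is likely to pay off.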

Overall, though, my advice would be to try to make your boss see reason and really push back against this silly design decision.

Answer 2

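Below is a simplified process for parsing the HTML and saving it to the database. I hope this helps you and/or gives you an idea of how to solve your problem.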

        // Requires: using System.Data.SqlClient; using HtmlAgilityPack;
        HtmlWeb h = new HtmlWeb();
        // The URL is left blank here; pass the address of the page to parse.
        HtmlAgilityPack.HtmlDocument doc = h.Load("");
        HtmlNodeCollection tableNodes = doc.DocumentNode.SelectNodes("//table");
        HtmlNodeCollection h1Nodes = doc.DocumentNode.SelectNodes("//h1");
        HtmlNodeCollection pNodes = doc.DocumentNode.SelectNodes("//p");
        //get other nodes here

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (pNodes != null)
        {
            //just an example...
            using (SqlConnection conn = new SqlConnection("here goes connection string"))
            {
                conn.Open(); // the connection must be open before ExecuteNonQuery

                foreach (var pNode in pNodes)
                {
                    string id = pNode.Id;
                    string content = pNode.InnerText; // pNode.OuterHtml would keep the <p> tags
                    string tag = pNode.Name;

                    //do other stuff here and then save to database

                    using (SqlCommand cmd = new SqlCommand())
                    {
                        cmd.Connection = conn;
                        cmd.CommandText = "INSERT INTO tblNodeCollection (Tag, Id, Content) VALUES (@tag, @id, @content)";
                        cmd.Parameters.AddWithValue("@tag", tag);
                        cmd.Parameters.AddWithValue("@id", id);
                        cmd.Parameters.AddWithValue("@content", content);

                        cmd.ExecuteNonQuery();
                    }
                }
            }
        }
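
Selecting each tag with its own XPath query loses the order in which the nodes appear in the page. As a rough sketch of an alternative, the snippet below walks the elements directly under <body> in document order and records a position, the tag name and the full markup of each one, which is closer to the NodeType / NodeContent columns described in the question. NodeExtractor and TopLevelNodes are hypothetical names, and the sketch assumes the interesting nodes sit directly under <body>; if the converter wraps them in a container element, start from that element instead.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class NodeExtractor
{
    // Returns (position, tag name, outer HTML) for each element that is a
    // direct child of <body>, in document order. Text and comment nodes
    // between the elements are skipped.
    public static List<Tuple<int, string, string>> TopLevelNodes(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<Tuple<int, string, string>>();
        HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
        if (body == null) return result; // no <body>: nothing to extract

        int position = 0;
        foreach (HtmlNode node in body.ChildNodes)
        {
            if (node.NodeType != HtmlNodeType.Element) continue;
            result.Add(Tuple.Create(position++, node.Name, node.OuterHtml));
        }
        return result;
    }
}
```

Each tuple can then be written with a parameterized INSERT like the one above, storing the position alongside the tag and content so the page can be reassembled later.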

c#

asp.net

asp.net-mvc

openxml