How to check paragraph content when reading a .docx file line by line in C#

After uploading a .docx file, I want to read it line by line.

My file.docx is divided into chapters, with paragraphs under each chapter.

Structure of file.docx:
Chapter 1 - Events
alert or disservices
significant activities

Chapter 2 – Safety
near miss
security checks

Chapter 3 – Training
environment
upkeep

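I tried reading the document with Microsoft.Office.Interop.Word, and I can read the whole document.

Now, depending on the chapter, I need to insert the chapter and paragraph contents into the corresponding database table.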

For example:

Chapter 1 - Events
 - alert or disservices
Lorem ipsum dolor sit amet, consectetur adipiscing elit ….
…. ….
…. ….
- significant activities
Phasellus dui nunc, rutrum vitae dictum eleifend, ullamcorper hendrerit sem ….
…. ….
…. ….

must be inserted into the `events` table:

-- ----------------------------
-- Table structure for events
-- ----------------------------
DROP TABLE IF EXISTS `events`;
CREATE TABLE `events` (
  `sID` int(11) NOT NULL AUTO_INCREMENT,
  `alert_or_disservices` longtext,
  `significant_activities` longtext,
  PRIMARY KEY (`sID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Can you help me?

Thanks in advance for any help or suggestions.

My code is below:

protected void Page_Load(object sender, EventArgs e)
{
    if (!IsPostBack)
    {
        Application word = new Application();
        object miss = Missing.Value;
        object path = @"C:\file.docx";
        object readOnly = true;
        Document docs = word.Documents.Open(ref path, ref miss, ref readOnly, ref miss, ref miss,
                                            ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, 
                                            ref miss, ref miss, ref miss, ref miss, ref miss);

        string totaltext = "";      //the whole document

        for (int i = 0; i < docs.Paragraphs.Count; i++)
        {   
            totaltext += docs.Paragraphs[i + 1].Range.Text.ToString() + "<br />";
        }

        Response.Write(totaltext);
        docs.Close();
        word.Quit();
    }
}

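Update #1

  1. Chapters are recognizable by their headings
  2. "alert or disservices" is just text preceded by a hyphen
  3. Each new paragraph starts with text preceded by a hyphen
  4. There are no hard returns/paragraph marks inside the alert block
  5. I created one table per chapter, with column names identical to the paragraph titles, but a better solution is welcome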

I would like to share my .docx file for you to download, but I don't know how.

I tried WeTransfer, but it was not accepted because it is an untrusted source.

Update #2

protected void Page_Load(object sender, EventArgs e)
{
    if (!IsPostBack)
    {
        var wdApp = new Microsoft.Office.Interop.Word.Application();
        var doc = wdApp.Documents.Open(@"C:\file.docx");

        var ran = doc.Content;
        var fin = ran.Find;
        fin.ClearFormatting();
        fin.MatchWildcards = false;
        fin.Text = "";
        fin.set_Style("Chapter 1 - Events"); //use your heading style here, e.g. Heading 1
        fin.Execute();
        while (fin.Found)
        {
            var chap = ran.Text;

            //cut off "Chapter[space]" from start, clean text from trailing carriage returns and stuff
            chap = chap.Substring(8).TrimEnd('\r', '\n', '\t', ' ');

            //Heading ended by hard return/para mark; get text of following paragraph '-alert or disservice'
            ran = doc.Range(ran.End, ran.End).Paragraphs[1].Range;
            var subhead = ran.Text;

            //clean subheading of leading hyphen and space, trailing stuff
            subhead = subhead.TrimStart(' ', '-').TrimEnd('\r', '\n', '\t', ' ');

            //get text under subheading = contents, clean up
            ran = doc.Range(ran.End, ran.End).Paragraphs[1].Range;
            var contents = ran.Text;
            contents = contents.TrimEnd('\r', '\n', '\t', ' ');

            //write to db
            string constr = ConfigurationManager.ConnectionStrings["cn"].ConnectionString;

            string strSql = @"INSERT INTO Chapters (chapter, subheading, contents) VALUES (@chapter, @subheading, @contents)";

            using (MySqlConnection con = new MySqlConnection(constr))
            {
                //the command needs the connection, otherwise ExecuteNonQuery throws
                using (MySqlCommand cmd = new MySqlCommand(strSql, con))
                {
                    con.Open();
                    cmd.Parameters.AddWithValue("@chapter", chap);
                    cmd.Parameters.AddWithValue("@subheading", subhead);
                    cmd.Parameters.AddWithValue("@contents", contents);
                    cmd.ExecuteNonQuery();
                }
            }
            }

            ran = doc.Range(ran.End, doc.Content.End);
            fin = ran.Find;
            fin.ClearFormatting();
            fin.MatchWildcards = false;
            fin.Text = "";
            fin.set_Style("Chapter 1 - Events"); //use your heading style here, e.g. Heading 1
            fin.Execute();
        }
        doc.Close(false);
        wdApp.Quit();
    }
}

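OK, plan B.

Database: one table per chapter is poor design, so I used a single table for everything.

This is a bit quick and dirty. Normally you would have one table for chapters and one for subsections, with a column in the latter holding the chapter ID. I suggest refining the design once this works.

This is SQLite, but you can easily adapt it to InnoDB: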

CREATE TABLE Chapters (
    sID integer PRIMARY KEY AUTOINCREMENT,
    chapter text NOT NULL,
    subheading1 text NOT NULL,
    contents1 text NULL,
    subheading2 text NOT NULL,
    contents2 text NULL
)

Since we are basically dealing with plain text, let's reduce Interop to a minimum and do the rest with a regular expression:

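using System.Data.SQLite;
using System.Text.RegularExpressions;
using Word = Microsoft.Office.Interop.Word;

var wdApp = new Word.Application();
var doc = wdApp.Documents.Open(@"D:\00_Projekte_temp\Lorem ipsum.docx");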

var txt = doc.Content.Text;


doc.Close(false);
wdApp.Quit();

var rex = new Regex(@"(Chapter[\s\t])(.+?)([\r\n]+?)(\s?\-\s?)(.+?[\r\n]+?)(.+?)([\r\n]+?)(\-\s)(.+?[\r\n]+?)(.+?[\r\n])");
var mCol = rex.Matches(txt);

foreach (Match m in mCol)
{
    // Groups 5, 9 and 10 include the trailing paragraph mark, so trim it off
    var chap = m.Groups[2].Value;
    var subh1 = m.Groups[5].Value.Trim();
    var cont1 = m.Groups[6].Value;
    var subh2 = m.Groups[9].Value.Trim();
    var cont2 = m.Groups[10].Value.Trim();

   //write to db
    var strSql = @"INSERT INTO Chapters (chapter, subheading1, contents1, subheading2, contents2) VALUES ($chap, $sub1, $con1, $sub2, $con2)";
    using (var con = new SQLiteConnection("Data Source=\"D:\\00_Projekte_temp\\wordtest.db\";Version=3"))
    {
        con.Open();
        using (var cmd = new SQLiteCommand(strSql, con))
        {
            cmd.Parameters.AddWithValue("$chap", chap);
            cmd.Parameters.AddWithValue("$sub1", subh1);
            cmd.Parameters.AddWithValue("$con1", cont1);
            cmd.Parameters.AddWithValue("$sub2", subh2);
            cmd.Parameters.AddWithValue("$con2", cont2);
            cmd.ExecuteNonQuery();
        }
        con.Close();
    }
}
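
The regular expression above expects exactly two subheadings per chapter. If that ever changes, a line-by-line pass over the same text may be easier to maintain. The following is only a sketch under the rules from Update #1: chapter headings start with "Chapter ", subheadings start with a hyphen, and everything else is content. The names `Section` and `ChapterParser` are made up for this illustration, and a content line that itself begins with a hyphen would be mistaken for a subheading.

```csharp
using System;
using System.Collections.Generic;

// One row per subheading (hypothetical type, not part of the original code).
public sealed record Section(string Chapter, string Subheading, string Contents);

public static class ChapterParser
{
    // Splits text such as Word's doc.Content.Text (paragraphs end in '\r')
    // into one Section per subheading.
    public static List<Section> Parse(string text)
    {
        var result = new List<Section>();
        string chapter = "", subheading = "";
        var contents = new List<string>();

        // Stores the section collected so far, if a subheading is open.
        void Flush()
        {
            if (subheading.Length > 0)
                result.Add(new Section(chapter, subheading, string.Join("\n", contents)));
            contents.Clear();
        }

        var separators = new[] { '\r', '\n' };
        foreach (var raw in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            var line = raw.Trim();
            if (line.Length == 0) continue;

            if (line.StartsWith("Chapter ", StringComparison.OrdinalIgnoreCase))
            {
                Flush();
                chapter = line.Substring("Chapter ".Length); // e.g. "1 - Events"
                subheading = "";
            }
            else if (line[0] == '-')
            {
                Flush();
                subheading = line.TrimStart('-', ' ');
            }
            else if (subheading.Length > 0)
            {
                contents.Add(line); // text belonging to the open subheading
            }
        }
        Flush();
        return result;
    }
}
```

Calling `ChapterParser.Parse(txt)` with the `txt` read from Word gives a flat list that can be inserted one row at a time, which also fits the normalized chapter/subsection layout mentioned earlier.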

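I would also suggest that in future your authors send you plain text directly, or that you move from Interop to OpenXml, since that makes you independent of Word and therefore lets the code run on a server as well.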
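
For reference, reading the paragraphs without Word installed might look roughly like this. It is a minimal sketch that assumes the DocumentFormat.OpenXml NuGet package (the Open XML SDK) is referenced; the class and method names `DocxReader` and `ReadParagraphs` are invented here. As far as I know, `InnerText` does not preserve manual line breaks inside a paragraph, so check the output against your document.

```csharp
using System.Collections.Generic;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class DocxReader
{
    // Returns the text of every non-empty paragraph in document order.
    public static List<string> ReadParagraphs(string path)
    {
        // Second argument false = open read-only.
        using (var doc = WordprocessingDocument.Open(path, false))
        {
            var body = doc.MainDocumentPart?.Document?.Body;
            if (body == null) return new List<string>();

            return body.Descendants<Paragraph>()
                       .Select(p => p.InnerText.Trim())
                       .Where(t => t.Length > 0)
                       .ToList();
        }
    }
}
```

Joining the result with `"\r"` gives text in roughly the same shape as `doc.Content.Text`, so the regular expression above should be usable on it with little or no change.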