Fastest way to read Word files

Question

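I am using the Microsoft Interop library to read Word files. I have more than 100 Word files, and reading only 150 paragraphs from all of these files with Interop takes a long time.

Is there a faster library, or another way to read them?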

  Application word = new Application();
  Document doc = new Document();

  object fileName = "";
  // Define an object to pass to the API for missing parameters
  object missing = System.Type.Missing;
  doc = word.Documents.Open(ref fileName,
          ref missing, ref missing, ref missing, ref missing,
          ref missing, ref missing, ref missing, ref missing,
          ref missing, ref missing, ref missing, ref missing,
          ref missing, ref missing, ref missing);

  List<string> data = new List<string>();
  // Read only the first 150 paragraphs (fewer if the document is shorter)
  int count = Math.Min(150, doc.Paragraphs.Count);
  for (int i = 0; i < count; i++)
  {
      // Word collections are 1-based
      string temp = doc.Paragraphs[i + 1].Range.Text.Trim();
      if (temp != string.Empty)
          data.Add(temp);
  }

  foreach (var paragraphs in data)
  {
      Console.WriteLine(paragraphs);
  }

  ((_Document)doc).Close();
  ((_Application)word).Quit();

Answer 1

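For plain-text extraction, you can search the Word file for <w:t> elements (a .docx file is a zip archive of XML files). Before relying on this, verify the assumption that the document data is in word/document.xml by opening the file with 7-Zip.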

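// using System.Collections.Generic;
// using System.IO.Compression;
// using System.Text;
// using System.Xml;

/// <summary>
/// Returns the text of every non-empty paragraph element (w:p) in a .docx file.
/// </summary>
public IEnumerable<string> ExtractText(string filename)
{
    // WordprocessingML namespace used by the w: prefix.
    const string W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    // A .docx file is a zip archive of XML parts.
    using var zip = ZipFile.OpenRead(filename);
    // The main document body is expected in word/document.xml.
    using var stream = zip.GetEntry("word/document.xml")?.Open();
    if (stream == null) { yield break; }
    using var reader = XmlReader.Create(stream);
    var paragraph = new StringBuilder();
    var inText = false;
    while (reader.Read())
    {
        switch (reader.NodeType)
        {
            // Start of a <w:t> text run.
            case XmlNodeType.Element when reader.NamespaceURI == W && reader.LocalName == "t":
                inText = !reader.IsEmptyElement;
                break;
            case XmlNodeType.Text:
            case XmlNodeType.Whitespace:
            case XmlNodeType.SignificantWhitespace:
                if (inText) { paragraph.Append(reader.Value); }
                break;
            case XmlNodeType.EndElement when reader.NamespaceURI == W && reader.LocalName == "t":
                inText = false;
                break;
            // End of a <w:p>: emit the runs collected for this paragraph.
            case XmlNodeType.EndElement when reader.NamespaceURI == W && reader.LocalName == "p":
                yield return paragraph.ToString();
                paragraph.Clear();
                break;
        }
    }
}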

Usage:

foreach (var paragraph in ExtractText("test.docx"))
{
    Console.WriteLine("READ A PARAGRAPH");
    Console.WriteLine(paragraph);
}
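
For the scenario in the question (up to 150 non-empty paragraphs per file), a minimal sketch using LINQ to XML might look like the following. `DocxParagraphs` and `FirstParagraphs` are hypothetical names, not part of any library. The sketch assumes .docx input (not the older binary .doc format) and that the body is in word/document.xml. It treats every `<w:p>` element as a paragraph, so text inside text boxes or other nested paragraphs may appear more than once. Note that `XDocument.Load` parses the whole part into memory rather than streaming it.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

public static class DocxParagraphs
{
    // WordprocessingML main namespace (the "w:" prefix in word/document.xml).
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical helper: returns up to maxCount non-empty paragraphs
    // from a .docx stream. The caller keeps ownership of the stream.
    public static List<string> FirstParagraphs(Stream docx, int maxCount = 150)
    {
        using var zip = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true);
        var entry = zip.GetEntry("word/document.xml");
        if (entry == null) return new List<string>();

        using var xml = entry.Open();
        return XDocument.Load(xml)
            .Descendants(W + "p")
            // Join all <w:t> runs that belong to the same paragraph.
            .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)).Trim())
            // Skip empty paragraphs, as the Interop loop in the question does.
            .Where(text => text.Length > 0)
            .Take(maxCount)
            .ToList();
    }

    // Convenience overload that opens the file at the given path.
    public static List<string> FirstParagraphs(string path, int maxCount = 150)
    {
        using var file = File.OpenRead(path);
        return FirstParagraphs(file, maxCount);
    }
}
```

To process a whole folder, one option is to call `DocxParagraphs.FirstParagraphs(path)` for each path returned by `Directory.EnumerateFiles(folder, "*.docx")`. Because no Word process is started, this should avoid most of the per-file overhead of Interop, though the actual speedup depends on the documents.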

c#

office-interop