Fastest method to remove Empty rows and Columns From Excel Files using Interop

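I have a lot of Excel files that contain data along with empty rows and empty columns.

I am trying to remove the empty rows and columns from Excel using Interop. I created a simple WinForms application and used the following code, which works fine.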

Dim lstFiles As New List(Of String)
lstFiles.AddRange(IO.Directory.GetFiles(m_strFolderPath, "*.xls", IO.SearchOption.AllDirectories))

Dim m_XlApp = New Excel.Application
Dim m_xlWrkbs As Excel.Workbooks = m_XlApp.Workbooks
Dim m_xlWrkb As Excel.Workbook

For Each strFile As String In lstFiles
    m_xlWrkb = m_xlWrkbs.Open(strFile)
    Dim m_XlWrkSheet As Excel.Worksheet = m_xlWrkb.Worksheets(1)
    Dim intRow As Integer = 1

    While intRow <= m_XlWrkSheet.UsedRange.Rows.Count
        If m_XlApp.WorksheetFunction.CountA(m_XlWrkSheet.Cells(intRow, 1).EntireRow) = 0 Then
            m_XlWrkSheet.Cells(intRow, 1).EntireRow.Delete(Excel.XlDeleteShiftDirection.xlShiftUp)
        Else
            intRow += 1
        End If
    End While

    Dim intCol As Integer = 1
    While intCol <= m_XlWrkSheet.UsedRange.Columns.Count
        If m_XlApp.WorksheetFunction.CountA(m_XlWrkSheet.Cells(1, intCol).EntireColumn) = 0 Then
            m_XlWrkSheet.Cells(1, intCol).EntireColumn.Delete(Excel.XlDeleteShiftDirection.xlShiftToLeft)
        Else
            intCol += 1
        End If
    End While
    ' Save, close and release each workbook before the next one is opened
    m_xlWrkb.Save()
    m_xlWrkb.Close(SaveChanges:=True)
    Marshal.ReleaseComObject(m_xlWrkb)
Next

Marshal.ReleaseComObject(m_xlWrkbs)
m_XlApp.Quit()
Marshal.ReleaseComObject(m_XlApp)

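However, cleaning large Excel files takes a long time. Are there any suggestions for optimizing this code, or another way to clean these Excel files faster? Is there a function that deletes empty rows in one step?

I have no problem if the answer uses C#.

Edit:

I uploaded a sample file (Sample File), but not all files have the same structure.

Perhaps something to consider: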

Sub usedRangeDeleteRowsCols()
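    ' Each variable needs its own "As Long"; otherwise only i is a Long and the others are Variants
    Dim LastRow As Long, LastCol As Long, i As Long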

    LastRow = Cells.Find(What:="*", SearchDirection:=xlPrevious, SearchOrder:=xlByRows).Row
    LastCol = Cells.Find(What:="*", SearchDirection:=xlPrevious, SearchOrder:=xlByColumns).Column

    For i = LastRow To 1 Step -1
        If WorksheetFunction.CountA(Range(Cells(i, 1), Cells(i, LastCol))) = 0 Then
            Cells(i, 1).EntireRow.Delete
        End If
    Next

    For i = LastCol To 1 Step -1
        If WorksheetFunction.CountA(Range(Cells(1, i), Cells(LastRow, i))) = 0 Then
            Cells(1, i).EntireColumn.Delete
        End If
    Next
End Sub

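I think there are two efficiency gains compared with the equivalent functions in the original code. First, instead of using Excel's unreliable UsedRange property, we find the last value and scan only the rows and columns within the genuinely used range.

Second, the worksheet count function again works only within the genuinely used range; for example, when searching for blank rows we look only at the range of used columns (rather than .EntireRow).

The For loops work backwards because, for example, every time a row is deleted the row addresses of the data after it change. Working backwards means the row addresses of the "data to be worked on" do not change.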
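
The backward-loop argument can be checked without Excel. The following is only a minimal sketch on an in-memory list, where an empty string stands in for an empty row; the class and method names are hypothetical and not part of the original code.

```csharp
using System;
using System.Collections.Generic;

public static class BackwardDeleteDemo
{
    // Removes every empty entry, walking from the last index to the first.
    // Removing index i only shifts the items at positions greater than i,
    // and those have already been visited, so no pending index changes.
    public static List<string> RemoveEmptyBackward(List<string> rows)
    {
        var result = new List<string>(rows);
        for (int i = result.Count - 1; i >= 0; i--)
        {
            if (string.IsNullOrEmpty(result[i]))
            {
                result.RemoveAt(i);
            }
        }
        return result;
    }
}
```

A forward loop over the same list would have to avoid advancing the index after each removal, which is exactly what the While loops in the question do.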

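You could open an ADO connection to the worksheet, get the list of fields, issue an SQL statement that includes only the known fields, and also exclude records that have no values in the known fields.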
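
As a rough illustration of the SQL part of that idea, the sketch below only builds the query text. The helper name is hypothetical, the field list is assumed to have been read from the sheet beforehand, and field names are assumed not to contain a closing bracket. Running the query (for example through an OLE DB connection to the workbook) is left out, and doing so would read the cleaned data rather than modify the file in place.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SheetQuery
{
    // Builds a SELECT that returns only the named fields and skips records
    // in which every one of those fields is NULL.
    public static string BuildNonEmptyRowsQuery(string sheetName, IReadOnlyList<string> fields)
    {
        if (fields.Count == 0)
        {
            throw new ArgumentException("At least one field is required.", nameof(fields));
        }

        string columns = string.Join(", ", fields.Select(f => "[" + f + "]"));
        string filter = string.Join(" OR ", fields.Select(f => "[" + f + "] IS NOT NULL"));

        // The "$" suffix is the usual way to address a whole worksheet as a table.
        return "SELECT " + columns + " FROM [" + sheetName + "$] WHERE " + filter;
    }
}
```

For a sheet named Sheet1 with fields Name and Qty this yields `SELECT [Name], [Qty] FROM [Sheet1$] WHERE [Name] IS NOT NULL OR [Qty] IS NOT NULL`.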

In my opinion, the most time-consuming part is probably enumerating and finding the empty rows and columns.

See: http://www.howtogeek.com/206696/how-to-quickly-and-easily-delete-blank-rows-and-columns-in-excel-2013/

Edit:

How about:

m_XlWrkSheet.Columns("A:A").SpecialCells(Excel.XlCellType.xlCellTypeBlanks).EntireRow.Delete()
m_XlWrkSheet.Rows("1:1").SpecialCells(Excel.XlCellType.xlCellTypeBlanks).EntireColumn.Delete()

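Results on the sample data look good and performance is better (tested from VBA, but the difference is huge). Note that these two lines treat a row as empty when its cell in column A is blank, and a column as empty when its cell in row 1 is blank.

Update:

Tested on a sample Excel file with 14k rows (built from the sample data): original code ~30 seconds, this version <1 s.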

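I found that looping through an Excel worksheet can take a while if the worksheet is large, so my solution tries to avoid any looping over the worksheet. To do that, I create a two-dimensional object array from the cells returned by UsedRange: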

Excel.Range targetCells = worksheet.UsedRange;
object[,] allValues = (object[,])targetCells.Cells.Value;

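This is the array I loop through to get the indexes of the empty rows and columns. I make two lists of int: one holds the row indexes to delete, the other holds the column indexes to delete.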

List<int> emptyRows = GetEmptyRows(allValues, totalRows, totalCols);
List<int> emptyCols = GetEmptyCols(allValues, totalRows, totalCols);

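These lists are sorted from high to low, which simplifies deleting rows from the bottom up and columns from right to left. Then simply loop through each list and delete the appropriate row/column.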

DeleteRows(emptyRows, worksheet);
DeleteCols(emptyCols, worksheet);

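Finally, after deleting all the empty rows and columns, I save the file under a new file name.

Hope this helps.

Edit:

Addressed the UsedRange issue: if there are empty rows at the top of the worksheet, those rows are now removed. This also removes any empty columns to the left of where the data starts. That lets the indexing work correctly even when empty rows or columns precede the data. It is done by taking the address of the first cell in UsedRange; the address has the form "$A$1:$D$4". This would allow an offset to be used if the empty rows at the top and the empty columns on the left were to be kept rather than deleted. In this case I simply delete them.

The number of rows to delete from the top can be computed from the first address, e.g. "$A$4", where "4" is the row in which the first data appears, so the top 3 rows need to be deleted. The column address has the form "A", "AB" or even "AAD", which takes some translating; thanks to How to convert a column number (eg. 127) into an excel column (eg. AA), I was able to determine how many columns on the left need to be deleted.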

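using System;
using System.Collections.Generic;
using System.Linq;
using Excel = Microsoft.Office.Interop.Excel;

class Program {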
  static void Main(string[] args) {
    Excel.Application excel = new Excel.Application();
    string originalPath = @"H:\ExcelTestFolder\Book1_Test.xls";
    Excel.Workbook workbook = excel.Workbooks.Open(originalPath);
    Excel.Worksheet worksheet = workbook.Worksheets["Sheet1"];
    Excel.Range usedRange = worksheet.UsedRange;

    RemoveEmptyTopRowsAndLeftCols(worksheet, usedRange);

    DeleteEmptyRowsCols(worksheet);

    string newPath = @"H:\ExcelTestFolder\Book1_Test_Removed.xls";
    workbook.SaveAs(newPath, Excel.XlSaveAsAccessMode.xlNoChange);

    workbook.Close();
    excel.Quit();
    System.Runtime.InteropServices.Marshal.ReleaseComObject(workbook);
    System.Runtime.InteropServices.Marshal.ReleaseComObject(excel);
    Console.WriteLine("Finished removing empty rows and columns - Press any key to exit");
    Console.ReadKey();
  }

  private static void DeleteEmptyRowsCols(Excel.Worksheet worksheet) {
    Excel.Range targetCells = worksheet.UsedRange;
    object[,] allValues = (object[,])targetCells.Cells.Value;
    int totalRows = targetCells.Rows.Count;
    int totalCols = targetCells.Columns.Count;

    List<int> emptyRows = GetEmptyRows(allValues, totalRows, totalCols);
    List<int> emptyCols = GetEmptyCols(allValues, totalRows, totalCols);

    // now we have a list of the empty rows and columns we need to delete
    DeleteRows(emptyRows, worksheet);
    DeleteCols(emptyCols, worksheet);
  }

  private static void DeleteRows(List<int> rowsToDelete, Excel.Worksheet worksheet) {
    // the rows are sorted high to low, so the remaining indexes won't shift
    foreach (int rowIndex in rowsToDelete) {
      worksheet.Rows[rowIndex].Delete();
    }
  }

  private static void DeleteCols(List<int> colsToDelete, Excel.Worksheet worksheet) {
    // the cols are sorted high to low, so the remaining indexes won't shift
    foreach (int colIndex in colsToDelete) {
      worksheet.Columns[colIndex].Delete();
    }
  }

  private static List<int> GetEmptyRows(object[,] allValues, int totalRows, int totalCols) {
    List<int> emptyRows = new List<int>();

    // the array is 1-based, so the last row is index totalRows
    for (int i = 1; i <= totalRows; i++) {
      if (IsRowEmpty(allValues, i, totalCols)) {
        emptyRows.Add(i);
      }
    }
    // sort the list from high to low
    return emptyRows.OrderByDescending(x => x).ToList();
  }

  private static List<int> GetEmptyCols(object[,] allValues, int totalRows, int totalCols) {
    List<int> emptyCols = new List<int>();

    // the array is 1-based, so the last column is index totalCols
    for (int i = 1; i <= totalCols; i++) {
      if (IsColumnEmpty(allValues, i, totalRows)) {
        emptyCols.Add(i);
      }
    }
    // sort the list from high to low
    return emptyCols.OrderByDescending(x => x).ToList();
  }

  private static bool IsColumnEmpty(object[,] allValues, int colIndex, int totalRows) {
    // include the last row; skipping it could mark a column as empty by mistake
    for (int i = 1; i <= totalRows; i++) {
      if (allValues[i, colIndex] != null) {
        return false;
      }
    }
    return true;
  }

  private static bool IsRowEmpty(object[,] allValues, int rowIndex, int totalCols) {
    // include the last column; skipping it could mark a row as empty by mistake
    for (int i = 1; i <= totalCols; i++) {
      if (allValues[rowIndex, i] != null) {
        return false;
      }
    }
    return true;
  }

  private static void RemoveEmptyTopRowsAndLeftCols(Excel.Worksheet worksheet, Excel.Range usedRange) {
    string addressString = usedRange.Address.ToString();
    int rowsToDelete = GetNumberOfTopRowsToDelete(addressString);
    DeleteTopEmptyRows(worksheet, rowsToDelete);
    int colsToDelete = GetNumberOfLeftColsToDelte(addressString);
    DeleteLeftEmptyColumns(worksheet, colsToDelete);
  }

  private static void DeleteTopEmptyRows(Excel.Worksheet worksheet, int startRow) {
    for (int i = 0; i < startRow - 1; i++) {
      worksheet.Rows[1].Delete();
    }
  }

  private static void DeleteLeftEmptyColumns(Excel.Worksheet worksheet, int colCount) {
    for (int i = 0; i < colCount - 1; i++) {
      worksheet.Columns[1].Delete();
    }
  }

  private static int GetNumberOfTopRowsToDelete(string address) {
    string[] splitArray = address.Split(':');
    string firstIndex = splitArray[0];
    splitArray = firstIndex.Split('$');
    string value = splitArray[2];
    int returnValue = -1;
    if ((int.TryParse(value, out returnValue)) && (returnValue >= 0))
      return returnValue;
    return returnValue;
  }

  private static int GetNumberOfLeftColsToDelte(string address) {
    string[] splitArray = address.Split(':');
    string firstindex = splitArray[0];
    splitArray = firstindex.Split('$');
    string value = splitArray[1];
    return ParseColHeaderToIndex(value);
  }

  private static int ParseColHeaderToIndex(string colAdress) {
    int[] digits = new int[colAdress.Length];
    for (int i = 0; i < colAdress.Length; ++i) {
      digits[i] = Convert.ToInt32(colAdress[i]) - 64;
    }
    int mul = 1; int res = 0;
    for (int pos = digits.Length - 1; pos >= 0; --pos) {
      res += digits[pos] * mul;
      mul *= 26;
    }
    return res;
  }
}

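Edit 2: For testing, I created a method that loops through the worksheet and compared it with my code that loops through the object array. It shows a significant difference.

The method that loops through the worksheet and deletes empty rows and columns: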

enum RowOrCol { Row, Column };
private static void ConventionalRemoveEmptyRowsCols(Excel.Worksheet worksheet) {
  Excel.Range usedRange = worksheet.UsedRange;
  int totalRows = usedRange.Rows.Count;
  int totalCols = usedRange.Columns.Count;

  RemoveEmpty(usedRange, RowOrCol.Row);
  RemoveEmpty(usedRange, RowOrCol.Column);
}

private static void RemoveEmpty(Excel.Range usedRange, RowOrCol rowOrCol) {
  int count;
  Excel.Range curRange;
  if (rowOrCol == RowOrCol.Column)
    count = usedRange.Columns.Count;
  else
    count = usedRange.Rows.Count;

  for (int i = count; i > 0; i--) {
    bool isEmpty = true;
    if (rowOrCol == RowOrCol.Column)
      curRange = usedRange.Columns[i];
    else
      curRange = usedRange.Rows[i];

    foreach (Excel.Range cell in curRange.Cells) {
      if (cell.Value != null) {
        isEmpty = false;
        break; // we can exit this loop since the range is not empty
      }
      else {
        // Cell value is null, continue checking
      }
    } // end loop thru each cell in this range (row or column)

    if (isEmpty) {
      curRange.Delete();
    }
  }
}

Then the Main for testing/timing the two methods.

enum RowOrCol { Row, Column };

static void Main(string[] args)
{
  Excel.Application excel = new Excel.Application();
  string originalPath = @"H:\ExcelTestFolder\Book1_Test.xls";
  Excel.Workbook workbook = excel.Workbooks.Open(originalPath);
  Excel.Worksheet worksheet = workbook.Worksheets["Sheet1"];
  Excel.Range usedRange = worksheet.UsedRange;

  // Start test for looping thru each excel worksheet
  Stopwatch sw = new Stopwatch();
  Console.WriteLine("Start stopwatch to loop thru WORKSHEET...");
  sw.Start();
  ConventionalRemoveEmptyRowsCols(worksheet);
  sw.Stop();
  // ElapsedMilliseconds is the total; Elapsed.Milliseconds is only the 0-999 component
  Console.WriteLine("It took a total of: " + sw.ElapsedMilliseconds + " milliseconds to remove empty rows and columns...");

  string newPath = @"H:\ExcelTestFolder\Book1_Test_RemovedLoopThruWorksheet.xls";
  workbook.SaveAs(newPath, Excel.XlSaveAsAccessMode.xlNoChange);
  workbook.Close();
  Console.WriteLine("");

  // Start test for looping thru object array
  workbook = excel.Workbooks.Open(originalPath);
  worksheet = workbook.Worksheets["Sheet1"];
  usedRange = worksheet.UsedRange;
  Console.WriteLine("Start stopwatch to loop thru object array...");
  sw = new Stopwatch();
  sw.Start();
  DeleteEmptyRowsCols(worksheet);
  sw.Stop();

  // display results from second test
  Console.WriteLine("It took a total of: " + sw.ElapsedMilliseconds + " milliseconds to remove empty rows and columns...");
  string newPath2 = @"H:\ExcelTestFolder\Book1_Test_RemovedLoopThruArray.xls";
  workbook.SaveAs(newPath2, Excel.XlSaveAsAccessMode.xlNoChange);
  workbook.Close();
  excel.Quit();
  System.Runtime.InteropServices.Marshal.ReleaseComObject(workbook);
  System.Runtime.InteropServices.Marshal.ReleaseComObject(excel);
  Console.WriteLine("");
  Console.WriteLine("Finished testing methods - Press any key to exit");
  Console.ReadKey();
}
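
When reading timings like these, it helps to remember that `TimeSpan.Milliseconds` is only the sub-second component of a duration, whereas `TimeSpan.TotalMilliseconds` (or `Stopwatch.ElapsedMilliseconds`) is the whole duration. The small sketch below, with hypothetical helper names, shows the difference on a fixed 2.5 second interval.

```csharp
using System;

public static class ElapsedDemo
{
    // The millisecond component of the duration, always in the range 0-999.
    public static int ComponentMilliseconds(TimeSpan duration) => duration.Milliseconds;

    // The whole duration expressed in milliseconds.
    public static double TotalMilliseconds(TimeSpan duration) => duration.TotalMilliseconds;
}
```

For `TimeSpan.FromMilliseconds(2500)` the first helper returns 500 and the second returns 2500, so any run longer than one second would be under-reported by the component value.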

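Edit 3: As the OP requested, I updated and changed the code to match the OP's code. Doing so turned up some interesting results; see below.

I changed the code to match the functions you are using, i.e. EntireRow and CountA. I found that the code below performs very poorly: in some test runs it took 800+ milliseconds to execute. However, one subtle change made a huge difference.

On the line: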

while (rowIndex <= worksheet.UsedRange.Rows.Count)

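This slows things down a lot. If you create a range variable for UsedRange and do not re-fetch it on every iteration of the while loop, it makes a huge difference. So, when I changed the while loops to: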

Excel.Range usedRange = worksheet.UsedRange;
int rowIndex = 1;

while (rowIndex <= usedRange.Rows.Count)
and
while (colIndex <= usedRange.Columns.Count)

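the performance was very close to my object array solution. I have not posted the results, since you can take the code below and switch the while loops between fetching UsedRange on every iteration and using the usedRange variable to test it yourself.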

private static void RemoveEmptyRowsCols3(Excel.Worksheet worksheet) {
  //Excel.Range usedRange = worksheet.UsedRange;     // <- using this variable makes the while loop much faster 
  int rowIndex = 1;

  // delete empty rows
  //while (rowIndex <= usedRange.Rows.Count)     // <- changing this one line makes a huge difference - not grabbing the UsedRange with each iteration...
  while (rowIndex <= worksheet.UsedRange.Rows.Count) {
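    // reach the Application through the worksheet, since no "excel" variable is in scope in this method
    if (worksheet.Application.WorksheetFunction.CountA(worksheet.Cells[rowIndex, 1].EntireRow) == 0) {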
      worksheet.Cells[rowIndex, 1].EntireRow.Delete(Excel.XlDeleteShiftDirection.xlShiftUp);
    }
    else {
      rowIndex++;
    }
  }

  // delete empty columns
  int colIndex = 1;
  // while (colIndex <= usedRange.Columns.Count) // <- change here also

  while (colIndex <= worksheet.UsedRange.Columns.Count) {
    if (worksheet.Application.WorksheetFunction.CountA(worksheet.Cells[1, colIndex].EntireColumn) == 0) {
      worksheet.Cells[1, colIndex].EntireColumn.Delete(Excel.XlDeleteShiftDirection.xlShiftToLeft);
    }
    else {
      colIndex++;
    }
  }
}

@Hadi

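Update

If the Excel file contains extra blank rows and columns after the last used row and column, you can change the DeleteCols and DeleteRows functions as follows to get better performance: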

private static void DeleteRows(List<int> rowsToDelete, Microsoft.Office.Interop.Excel.Worksheet worksheet)
{
    // the rows are sorted high to low - so index's wont shift

    List<int> NonEmptyRows = Enumerable.Range(1, rowsToDelete.Max()).ToList().Except(rowsToDelete).ToList();

    if (NonEmptyRows.Max() < rowsToDelete.Max())
    {

        // there are empty rows after the last non empty row

        Microsoft.Office.Interop.Excel.Range cell1 = worksheet.Cells[NonEmptyRows.Max() + 1,1];
        Microsoft.Office.Interop.Excel.Range cell2 = worksheet.Cells[rowsToDelete.Max(), 1];

        //Delete all empty rows after the last used row
        worksheet.Range[cell1, cell2].EntireRow.Delete(Microsoft.Office.Interop.Excel.XlDeleteShiftDirection.xlShiftUp);


    }    //else last non empty row = worksheet.Rows.Count



    foreach (int rowIndex in rowsToDelete.Where(x => x < NonEmptyRows.Max()))
    {
        worksheet.Rows[rowIndex].Delete();
    }
}

private static void DeleteCols(List<int> colsToDelete, Microsoft.Office.Interop.Excel.Worksheet worksheet)
{
    // the cols are sorted high to low - so index's wont shift

    //Get non Empty Cols
    List<int> NonEmptyCols = Enumerable.Range(1, colsToDelete.Max()).ToList().Except(colsToDelete).ToList();

    if (NonEmptyCols.Max() < colsToDelete.Max())
    {

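        // there are empty columns after the last non empty column

        Microsoft.Office.Interop.Excel.Range cell1 = worksheet.Cells[1,NonEmptyCols.Max() + 1];
        // the range must end at the last empty column, mirroring DeleteRows; ending it at
        // NonEmptyCols.Max() would include (and delete) the last column that holds data
        Microsoft.Office.Interop.Excel.Range cell2 = worksheet.Cells[1,colsToDelete.Max()];

        //Delete all empty columns after the last used column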
        worksheet.Range[cell1, cell2].EntireColumn.Delete(Microsoft.Office.Interop.Excel.XlDeleteShiftDirection.xlShiftToLeft);


    }            //else last non empty column = worksheet.Columns.Count

    foreach (int colIndex in colsToDelete.Where(x => x < NonEmptyCols.Max()))
    {
        worksheet.Columns[colIndex].Delete();
    }
}
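
The update above deletes the trailing block of empty rows or columns with a single range delete. The same idea could be extended to every contiguous block, not just the trailing one. The sketch below is an assumption-laden generalization rather than part of the original answer: it only groups the indexes into runs, and `IndexRuns` and `GroupDescending` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class IndexRuns
{
    // Collapses indexes into (First, Last) runs of consecutive numbers.
    // Runs are returned from the highest to the lowest, so deleting one run
    // never shifts a run that is still waiting to be deleted.
    public static List<(int First, int Last)> GroupDescending(IEnumerable<int> indexes)
    {
        var runs = new List<(int First, int Last)>();
        foreach (int index in indexes.Distinct().OrderByDescending(x => x))
        {
            if (runs.Count > 0 && runs[runs.Count - 1].First == index + 1)
            {
                // The index sits directly below the current run: extend it downwards.
                runs[runs.Count - 1] = (index, runs[runs.Count - 1].Last);
            }
            else
            {
                runs.Add((index, index));
            }
        }
        return runs;
    }
}
```

Each run could then be removed with one call in the style already used above, e.g. `worksheet.Range[worksheet.Cells[run.First, 1], worksheet.Cells[run.Last, 1]].EntireRow.Delete(...)`, which reduces the number of COM calls when empty rows come in blocks.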

See my answer.

As far as I know, the simplest way is to hide the non-blank cells and delete the visible ones:

var range = m_XlWrkSheet.UsedRange;
range.SpecialCells(XlCellType.xlCellTypeConstants).EntireRow.Hidden = true;
range.SpecialCells(XlCellType.xlCellTypeVisible).Delete(XlDeleteShiftDirection.xlShiftUp);
range.EntireRow.Hidden = false;

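A faster way is to not delete anything at all, but to move (cut and paste) the non-blank areas.

The fastest Interop way (there are faster, more complicated methods that do not require opening the file) is to get all the values into an array, move the values within the array, and then put the values back: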

object[,] values = m_XlWrkSheet.UsedRange.Value2 as object[,];

// some code here (the values start from values[1, 1] not values[0, 0])

m_XlWrkSheet.UsedRange.Value2 = values;
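
One possible shape for the "some code here" step is sketched below. It is a minimal sketch, not the answer author's implementation: `ArrayCompactor.Compact` is a hypothetical helper, a cell is assumed to be empty exactly when its value is null, and the range is assumed to span more than one cell (for a single cell, Value2 is not an array).

```csharp
using System;
using System.Collections.Generic;

public static class ArrayCompactor
{
    // Returns a new array with the same bounds as the input in which the
    // non-empty rows are packed to the top and the non-empty columns to the
    // left. The leftover cells stay null, i.e. blank when written to a sheet.
    public static object[,] Compact(object[,] values)
    {
        int r0 = values.GetLowerBound(0), r1 = values.GetUpperBound(0);
        int c0 = values.GetLowerBound(1), c1 = values.GetUpperBound(1);

        // Collect the rows and columns that hold at least one value.
        var keepRows = new List<int>();
        var keepCols = new List<int>();
        for (int r = r0; r <= r1; r++)
            for (int c = c0; c <= c1; c++)
                if (values[r, c] != null) { keepRows.Add(r); break; }
        for (int c = c0; c <= c1; c++)
            for (int r = r0; r <= r1; r++)
                if (values[r, c] != null) { keepCols.Add(c); break; }

        // Keep the input's lower bounds (arrays coming from Excel are 1-based).
        var result = (object[,])Array.CreateInstance(
            typeof(object),
            new[] { values.GetLength(0), values.GetLength(1) },
            new[] { r0, c0 });

        for (int i = 0; i < keepRows.Count; i++)
            for (int j = 0; j < keepCols.Count; j++)
                result[r0 + i, c0 + j] = values[keepRows[i], keepCols[j]];

        return result;
    }
}
```

The result could be written back with `m_XlWrkSheet.UsedRange.Value2 = ArrayCompactor.Compact(values);`. Keep in mind that this moves values only: formulas are replaced by their values and cell formatting stays where it was.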