Unique Count Formula for large dataset

I'm having trouble figuring out a way to enter a 1/0 in an adjacent cell to indicate whether a value is unique when working with a large dataset. I've read about multiple ways to accomplish this, but none of them seem efficient for my purposes: I'm working with an instance of Excel 2010 (so I don't have the Distinct Count feature in PivotTables, and when I tried to use PowerPivot it crashed my file due to processing limitations).

In this Stack Overflow question: Simple Pivot Table to Count Unique Values there are suggestions to use SUMPRODUCT or COUNTIF, but when working with 50,000+ rows like I am, that leads to terrible performance and a file size of roughly 35 MB instead of 3 MB. I'm wondering whether there is a better solution for a large, dynamic dataset, whether it's a formula or VBA.
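For reference, the kind of COUNTIF-based formula suggested in that question (filled down the Unique column, assuming Name is in column A and Week in column B with data starting at row 2) would look something like:

```
=IF(COUNTIFS($A$2:A2, A2, $B$2:B2, B2) = 1, 1, 0)
```

Each cell rescans the entire expanding range above it, so the total work grows quadratically with the number of rows — which is consistent with the slowdown and file bloat described above.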

An example of what I'm trying to accomplish (the Unique column is the adjacent cell):

Name   Week   Unique
John   1      1
Sally  1      1
John   1      0
Sally  2      1

I tried to write something with the same functionality as COUNTIF, but without success:

For Each Cell In ThisWorkbook.Worksheets("Overtime & Type Data").Range("Z2:Z" & DataLastRow)
    If Worksheets("Overtime & Type Data").Cells(Cell.Row, 26) <> Worksheets("Overtime & Type Data").Cells(Cell.Row - 1, 26) Then
        FirstCell = Cell.Row
    End If
    If (Worksheets("Overtime & Type Data").Range(Cells(FirstCell, 26), Cells(Cell.Row, 26)) = Worksheets("Overtime & Type Data").Range(Cells(Cell.Row, 26))) = True Then
        Cell.Value = 1
    Else
        Cell.Value = 0
    End If
Next Cell

One approach is to sort by Name and Week. You can then determine whether any row is unique by comparing it with the row directly above it.

If you need to preserve the original order, first write a column of index numbers (1, 2, 3, ...) to track it. After calculating Unique, sort by the Index column to restore the original order.

The whole process can be done manually in relatively few steps, or automated with VBA.
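As a sketch of the sorted-comparison step (assuming Name in column A, Week in column B, headers in row 1, and the Unique formula entered in C2 and filled down):

```
=IF(AND(A2=A1, B2=B1), 0, 1)
```

Each cell only looks at the single row above it, so the cost stays linear in the number of rows regardless of dataset size; once filled down, convert the column to values before re-sorting by Index.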

This code successfully ran on over 130,000 rows in under 3 seconds. Adjust the column letters to fit your dataset.

Sub tgr()

    Const colName As String = "A"
    Const colWeek As String = "B"
    Const colOutput As String = "C"

    Dim ws As Worksheet
    Dim rngData As Range
    Dim DataCell As Range
    Dim rngFound As Range
    Dim collUniques As Collection
    Dim arrResults() As Long
    Dim ResultIndex As Long
    Dim UnqCount As Long

    Set ws = ThisWorkbook.Sheets("Overtime & Type Data")
    Set rngData = ws.Range(colName & 2, ws.Cells(Rows.Count, colName).End(xlUp))
    Set collUniques = New Collection
    ReDim arrResults(1 To rngData.Cells.Count, 1 To 1)

    On Error Resume Next
    For Each DataCell In rngData.Cells
        ResultIndex = ResultIndex + 1
        collUniques.Add ws.Cells(DataCell.Row, colName).Value & ws.Cells(DataCell.Row, colWeek).Value, ws.Cells(DataCell.Row, colName).Value & ws.Cells(DataCell.Row, colWeek).Value
        If collUniques.Count > UnqCount Then
            UnqCount = collUniques.Count
            arrResults(ResultIndex, 1) = 1
        Else
            arrResults(ResultIndex, 1) = 0
        End If
    Next DataCell
    On Error GoTo 0

    ws.Cells(rngData.Row, colOutput).Resize(rngData.Cells.Count).Value = arrResults

End Sub

I'm not sure how well this will handle 50,000 values, but it gets through ~1,500 in about a second.

Sub unique()
    Dim myColl As New Collection
    Dim isDup As Boolean
    Dim myValue As String
    Dim r As Long

    On Error GoTo DuplicateValue
    For r = 1 To Sheet1.UsedRange.Rows.Count
        isDup = False
        'Combine the value of the 2 cells together
        ' and add that string to our collection
        'If it is already in the collection it errors
        myValue = Sheet1.Cells(r, 1).Value & Sheet1.Cells(r, 2).Value
        myColl.Add r, myValue
        If isDup Then
            Sheet1.Cells(r, 3).Value = "0"
        Else
            Sheet1.Cells(r, 3).Value = "1"
        End If
    Next
    On Error GoTo 0
    Exit Sub
DuplicateValue:
    'The value is already in the collection so put a 0
    isDup = True
    Resume Next
End Sub

Almost any bulk operation will beat a loop that touches worksheet cells one at a time. You can trim the time a bit further by performing all of the calculations in memory and only returning the values to the worksheet en masse once complete.

Sub is_a_dupe()
    Dim v As Long, vTMP As Variant, vUNQs As Variant, dUNQs As Object

    Debug.Print Timer
    On Error GoTo bm_Uh_Oh
    Set dUNQs = CreateObject("Scripting.Dictionary")

    With Worksheets("Sheet1")

        vTMP = .Range(.Cells(2, 1), .Cells(Rows.Count, 2).End(xlUp)).Value2
        ReDim vUNQs(1 To UBound(vTMP, 1), 1 To 1)

        For v = LBound(vTMP, 1) To UBound(vTMP, 1)
            If dUNQs.Exists(Join(Array(vTMP(v, 1), vTMP(v, 2)))) Then
                vUNQs(v, 1) = 0
            Else
                dUNQs.Add Key:=Join(Array(vTMP(v, 1), vTMP(v, 2))), _
                          Item:=vTMP(v, 2)
                vUNQs(v, 1) = 1
            End If
        Next v

        .Cells(2, 3).Resize(UBound(vUNQs, 1), 1) = vUNQs

    End With

    Debug.Print Timer

bm_Uh_Oh:
    dUNQs.RemoveAll
    Set dUNQs = Nothing
End Sub

Previous experience tells me that the variety in the data (as well as hardware, etc.) affects process timing, but on my random sample data I got these elapsed times:

 50K records ..... 0.53 seconds
130K records .... 1.32 seconds
500K records .... 4.92 seconds