大型数据集的唯一计数公式
Unique Count Formula for large dataset
我无法确定在相邻单元格中输入 1
或 0
以指示在处理大型数据集时值是否唯一的方法。我读过多种实现此目的的方法,但其中 none 对我的目的来说似乎很有效:我正在使用 Excel 2010 的实例(所以我 不 在数据透视表中具有 Distinct Count 功能,当我尝试使用 PowerPivot 时,由于处理限制它使我的文件崩溃。
在这个 Whosebug 问题中:Simple Pivot Table to Count Unique Values 有使用 SUMPRODUCT
或 COUNTIF
的建议,但是像我这样处理 50,000 多行时,这会导致糟糕的性能和一个文件大小约为 35 MB 而不是 3 MB。我想知道对于大型动态数据集是否有更好的解决方案,无论是公式还是 VBA.
我想要完成的一个例子是(Unique 列是相邻的单元格):
Name Week Unique
John 1 1
Sally 1 1
John 1 0
Sally 2 1
我试图编写与 COUNTIF
相同的功能,但没有成功:
For Each Cell In ThisWorkbook.Worksheets("Overtime & Type Data").Range("Z2:Z" & DataLastRow)
If Worksheets("Overtime & Type Data").Cells(Cell.Row, 26) <> Worksheets("Overtime & Type Data").Cells(Cell.Row - 1, 26) Then
FirstCell = Cell.Row
End If
If (Worksheets("Overtime & Type Data").Range(Cells(FirstCell, 26), Cells(Cell.Row, 26)) = Worksheets("Overtime & Type Data").Range(Cells(Cell.Row, 26))) = True Then
Cell.Value = 1
Else
Cell.Value = 0
End If
Next Cell
一种方法是按姓名和周排序。然后,您可以通过与前一行进行比较来确定任何行的唯一性。
如果您需要保留顺序,您可以先写一列索引号(1、2、3、...)来跟踪顺序。计算出Unique后,按Index排序,恢复原来的顺序。
整个过程可以通过相对较少的步骤手动完成,或者使用 VBA 自动完成。
此代码 运行 在不到 3 秒的时间内成功处理了超过 130,000 行。调整列字母以适合您的数据集。
Sub tgr()
Const colName As String = "A"
Const colWeek As String = "B"
Const colOutput As String = "C"
Dim ws As Worksheet
Dim rngData As Range
Dim DataCell As Range
Dim rngFound As Range
Dim collUniques As Collection
Dim arrResults() As Long
Dim ResultIndex As Long
Dim UnqCount As Long
Set ws = ThisWorkbook.Sheets("Overtime & Type Data")
Set rngData = ws.Range(colName & 2, ws.Cells(Rows.Count, colName).End(xlUp))
Set collUniques = New Collection
ReDim arrResults(1 To rngData.Cells.Count, 1 To 1)
On Error Resume Next
For Each DataCell In rngData.Cells
ResultIndex = ResultIndex + 1
collUniques.Add ws.Cells(DataCell.Row, colName).Value & ws.Cells(DataCell.Row, colWeek).Value, ws.Cells(DataCell.Row, colName).Value & ws.Cells(DataCell.Row, colWeek).Value
If collUniques.Count > UnqCount Then
UnqCount = collUniques.Count
arrResults(ResultIndex, 1) = 1
Else
arrResults(ResultIndex, 1) = 0
End If
Next DataCell
On Error GoTo 0
ws.Cells(rngData.Row, colOutput).Resize(rngData.Cells.Count).Value = arrResults
End Sub
我不确定这对 50000 个值的处理效果如何,但它会在大约一秒钟内达到 ~1500。
Sub unique()
Dim myColl As New Collection
Dim isDup As Boolean
Dim myValue As String
Dim r As Long
On Error GoTo DuplicateValue
For r = 1 To Sheet1.UsedRange.Rows.Count
isDup = False
'Combine the value of the 2 cells together
' and add that string to our collection
'If it is already in the collection it errors
myValue = Sheet1.Cells(r, 1).Value & Sheet1.Cells(r, 2).Value
myColl.Add r, myValue
If isDup Then
Sheet1.Cells(r, 3).Value = "0"
Else
Sheet1.Cells(r, 3).Value = "1"
End If
Next
On Error GoTo 0
Exit Sub
DuplicateValue:
'The value is already in the collection so put a 0
isDup = True
Resume Next
End Sub
几乎所有的批量操作都会击败涉及工作表单元格的循环。您可以 trim 通过在内存中执行所有计算并仅在完成时将值返回到工作表 en masse 来稍微缩短时间。
Sub is_a_dupe()
Dim v As Long, vTMP As Variant, vUNQs As Variant, dUNQs As Object
Debug.Print Timer
On Error GoTo bm_Uh_Oh
Set dUNQs = CreateObject("Scripting.Dictionary")
With Worksheets("Sheet1")
vTMP = .Range(.Cells(2, 1), .Cells(Rows.Count, 2).End(xlUp)).Value2
ReDim vUNQs(1 To UBound(vTMP, 1), 1 To 1)
For v = LBound(vTMP, 1) To UBound(vTMP, 1)
If dUNQs.Exists(Join(Array(vTMP(v, 1), vTMP(v, 2)))) Then
vUNQs(v, 1) = 0
Else
dUNQs.Add Key:=Join(Array(vTMP(v, 1), vTMP(v, 2))), _
Item:=vTMP(v, 2)
vUNQs(v, 1) = 1
End If
Next v
.Cells(2, 3).Resize(UBound(vUNQs, 1), 1) = vUNQs
End With
Debug.Print Timer
bm_Uh_Oh:
dUNQs.RemoveAll
Set dUNQs = Nothing
End Sub
以前的经验告诉我,数据的多样性(以及硬件等)会影响过程的计时,但在我的随机样本数据中,我收到了这些经过的时间。
50K records ..... 0.53 seconds
130K records .... 1.32 seconds
500K records .... 4.92 seconds
我无法确定在相邻单元格中输入 1
或 0
以指示在处理大型数据集时值是否唯一的方法。我读过多种实现此目的的方法,但其中 none 对我的目的来说似乎很有效:我正在使用 Excel 2010 的实例(所以我 不 在数据透视表中具有 Distinct Count 功能,当我尝试使用 PowerPivot 时,由于处理限制它使我的文件崩溃。
在这个 Whosebug 问题中:Simple Pivot Table to Count Unique Values 有使用 SUMPRODUCT
或 COUNTIF
的建议,但是像我这样处理 50,000 多行时,这会导致糟糕的性能和一个文件大小约为 35 MB 而不是 3 MB。我想知道对于大型动态数据集是否有更好的解决方案,无论是公式还是 VBA.
我想要完成的一个例子是(Unique 列是相邻的单元格):
Name Week Unique
John 1 1
Sally 1 1
John 1 0
Sally 2 1
我试图编写与 COUNTIF
相同的功能,但没有成功:
For Each Cell In ThisWorkbook.Worksheets("Overtime & Type Data").Range("Z2:Z" & DataLastRow)
If Worksheets("Overtime & Type Data").Cells(Cell.Row, 26) <> Worksheets("Overtime & Type Data").Cells(Cell.Row - 1, 26) Then
FirstCell = Cell.Row
End If
If (Worksheets("Overtime & Type Data").Range(Cells(FirstCell, 26), Cells(Cell.Row, 26)) = Worksheets("Overtime & Type Data").Range(Cells(Cell.Row, 26))) = True Then
Cell.Value = 1
Else
Cell.Value = 0
End If
Next Cell
一种方法是按姓名和周排序。然后,您可以通过与前一行进行比较来确定任何行的唯一性。
如果您需要保留顺序,您可以先写一列索引号(1、2、3、...)来跟踪顺序。计算出Unique后,按Index排序,恢复原来的顺序。
整个过程可以通过相对较少的步骤手动完成,或者使用 VBA 自动完成。
此代码 运行 在不到 3 秒的时间内成功处理了超过 130,000 行。调整列字母以适合您的数据集。
Sub tgr()
Const colName As String = "A"
Const colWeek As String = "B"
Const colOutput As String = "C"
Dim ws As Worksheet
Dim rngData As Range
Dim DataCell As Range
Dim rngFound As Range
Dim collUniques As Collection
Dim arrResults() As Long
Dim ResultIndex As Long
Dim UnqCount As Long
Set ws = ThisWorkbook.Sheets("Overtime & Type Data")
Set rngData = ws.Range(colName & 2, ws.Cells(Rows.Count, colName).End(xlUp))
Set collUniques = New Collection
ReDim arrResults(1 To rngData.Cells.Count, 1 To 1)
On Error Resume Next
For Each DataCell In rngData.Cells
ResultIndex = ResultIndex + 1
collUniques.Add ws.Cells(DataCell.Row, colName).Value & ws.Cells(DataCell.Row, colWeek).Value, ws.Cells(DataCell.Row, colName).Value & ws.Cells(DataCell.Row, colWeek).Value
If collUniques.Count > UnqCount Then
UnqCount = collUniques.Count
arrResults(ResultIndex, 1) = 1
Else
arrResults(ResultIndex, 1) = 0
End If
Next DataCell
On Error GoTo 0
ws.Cells(rngData.Row, colOutput).Resize(rngData.Cells.Count).Value = arrResults
End Sub
我不确定这对 50000 个值的处理效果如何,但它会在大约一秒钟内达到 ~1500。
Sub unique()
Dim myColl As New Collection
Dim isDup As Boolean
Dim myValue As String
Dim r As Long
On Error GoTo DuplicateValue
For r = 1 To Sheet1.UsedRange.Rows.Count
isDup = False
'Combine the value of the 2 cells together
' and add that string to our collection
'If it is already in the collection it errors
myValue = Sheet1.Cells(r, 1).Value & Sheet1.Cells(r, 2).Value
myColl.Add r, myValue
If isDup Then
Sheet1.Cells(r, 3).Value = "0"
Else
Sheet1.Cells(r, 3).Value = "1"
End If
Next
On Error GoTo 0
Exit Sub
DuplicateValue:
'The value is already in the collection so put a 0
isDup = True
Resume Next
End Sub
几乎所有的批量操作都会击败涉及工作表单元格的循环。您可以 trim 通过在内存中执行所有计算并仅在完成时将值返回到工作表 en masse 来稍微缩短时间。
Sub is_a_dupe()
Dim v As Long, vTMP As Variant, vUNQs As Variant, dUNQs As Object
Debug.Print Timer
On Error GoTo bm_Uh_Oh
Set dUNQs = CreateObject("Scripting.Dictionary")
With Worksheets("Sheet1")
vTMP = .Range(.Cells(2, 1), .Cells(Rows.Count, 2).End(xlUp)).Value2
ReDim vUNQs(1 To UBound(vTMP, 1), 1 To 1)
For v = LBound(vTMP, 1) To UBound(vTMP, 1)
If dUNQs.Exists(Join(Array(vTMP(v, 1), vTMP(v, 2)))) Then
vUNQs(v, 1) = 0
Else
dUNQs.Add Key:=Join(Array(vTMP(v, 1), vTMP(v, 2))), _
Item:=vTMP(v, 2)
vUNQs(v, 1) = 1
End If
Next v
.Cells(2, 3).Resize(UBound(vUNQs, 1), 1) = vUNQs
End With
Debug.Print Timer
bm_Uh_Oh:
dUNQs.RemoveAll
Set dUNQs = Nothing
End Sub
以前的经验告诉我,数据的多样性(以及硬件等)会影响过程的计时,但在我的随机样本数据中,我收到了这些经过的时间。
50K records ..... 0.53 seconds
130K records .... 1.32 seconds
500K records .... 4.92 seconds