Faster alternatives to the Characters object
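I need to extract passages of text from the contents of Excel cells in which the originator has essentially tracked changes by hand using strikethrough font. The passages can be identified by certain character patterns, but I have to ignore the strikethrough characters to see them. The strikethrough characters do not appear at regular positions within each cell, so they are essentially scattered at random through the normal-font text.
I have achieved my goal using VBA for Excel, but the solution is extremely (and impractically) slow. After searching this site and the wider web for answers, it seems that the use of the Characters object is the culprit.
So my question is: has anyone found a way of parsing such text that does not involve the Characters object?
The sub I wrote to do the parsing is too long to post here, but below is some test code that uses the Characters object in a similar way. It takes 60 seconds to parse a cell containing 3000 characters. At that rate, it would take 50 hours to process the whole spreadsheet I have been given.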
Private Sub FindLineBreakChars(TargetCell As Excel.Range)
    'Long rather than Integer: a cell can hold 32767 characters, and an
    'Integer loop counter would overflow on the final increment.
    Dim n As Long
    Dim ch As String
    Dim st As Boolean
    If TargetCell.Cells.Count <> 1 Then
        Call MsgBox("Error: more or less than one cell in range specified.")
    Else
        If IsEmpty(TargetCell.Value) Then
            Call MsgBox("Error: target cell is empty.")
        Else
            If Len(TargetCell.Value) = 0 Then
                Call MsgBox("Error: target cell contains an empty string.")
            Else
                'Parse the characters in the cell one by one.
                For n = 1 To TargetCell.Characters.Count
                    ch = TargetCell.Characters(n, 1).Text
                    st = TargetCell.Characters(n, 1).Font.Strikethrough
                    If ch = vbCr Then
                        Debug.Print "#" & n & ": Carriage Return (vbCr)" & ", strikethrough = " & st & vbCrLf
                    ElseIf ch = vbLf Then
                        Debug.Print "#" & n & ": Line Feed (vbLf)" & ", strikethrough = " & st & vbCrLf
                    End If
                Next n
            End If
        End If
    End If
End Sub
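You are right, access to Characters is slow, so your goal should be to reduce its use as much as possible.
I don't understand the details of your requirements, but the following code should give you an idea of how to speed things up. It reads the content of the cell only once, splits the text into separate lines, calculates the position of each line feed character and looks at the formatting at that position. As far as I know, there is no way to access all of the formatting at once, but access to the Characters object is now reduced to one call per line: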
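With TargetCell
    Dim lines() As String, lineNo As Long, textLen As Long
    Dim st As Variant
    'Read the cell text once and split it at the line feeds.
    lines = Split(.Value2, vbLf)
    'Position of the first line feed: it directly follows the first line.
    textLen = Len(lines(0)) + 1
    For lineNo = 1 To UBound(lines)
        'One Characters call per line break instead of one per character.
        st = .Characters(textLen, 1).Font.Strikethrough
        Debug.Print "#" & textLen & ": LineFeed (vbLf) strikethrough = " & st
        'Advance past the next line and the line feed that ends it.
        textLen = textLen + Len(lines(lineNo)) + 1
    Next lineNo
End With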
As far as I know, Excel stores line breaks in a cell using only the LineFeed character, so the code checks only for that character.
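If you want to check that assumption against your own data, a search on the plain string is enough and does not touch the Characters object at all. This is only a minimal sketch: the helper name ContainsCarriageReturn is made up for illustration, and it assumes the cell holds text or a number rather than an error value.

```vba
' Hypothetical helper, not part of the code above: reports whether the
' text of a single cell contains any carriage return characters.
Function ContainsCarriageReturn(c As Range) As Boolean
    ' InStr searches the plain string, so no Characters object is involved.
    ' CStr will raise a type mismatch if the cell holds an error value.
    ContainsCarriageReturn = (InStr(1, CStr(c.Value2), vbCr) > 0)
End Function
```

Calling it from the Immediate window, for example `? ContainsCarriageReturn(Range("A1"))`, shows whether a given cell needs the vbCr case handled as well.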
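This might meet your performance needs: it calls a function which parses the XML representation of the cell content, removes the struck-through sections, and returns the remaining text.
It will be much faster than looping over Characters.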
Sub Tester()
    Debug.Print NoStrikeThrough(Range("A1"))
End Sub
'Needs a reference to Microsoft XML, v6.0
'  in your VBA Project references
Function NoStrikeThrough(c As Range) As String
    Dim doc As New MSXML2.DOMDocument60
    Dim x As MSXML2.IXMLDOMNode, s As MSXML2.IXMLDOMNode
    'need to add some namespaces
    doc.SetProperty "SelectionNamespaces", _
        "xmlns:ss='urn:schemas-microsoft-com:office:spreadsheet' " & _
        "xmlns:ht='http://www.w3.org/TR/REC-html40'"
    doc.LoadXML c.Value(11) 'cell data as XML (11 = xlRangeValueXMLSpreadsheet)
    Set x = doc.SelectSingleNode("//ss:Data") '<< cell content
    Set s = x.SelectSingleNode("//ht:S")      '<< strikethrough
    Do While Not s Is Nothing
        Debug.Print "Struck:", s.Text
        'remove the struck section from its own parent: it may be nested
        '  inside other formatting tags rather than sit directly under x
        s.ParentNode.RemoveChild s
        Set s = x.SelectSingleNode("//ht:S")
    Loop
    NoStrikeThrough = doc.Text
End Function
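If the function returns something unexpected, it can help to look at the raw XML that Excel produces for the cell and see where the struck-through runs actually sit. The sketch below only prints that XML; Range("A1") is a placeholder, the sub name is made up, and which formatting tags appear will depend on how the cell was formatted.

```vba
' Hypothetical inspection helper: dumps the XML Spreadsheet representation
' of one cell to the Immediate window so the markup can be read by eye.
Sub ShowCellXml()
    ' xlRangeValueXMLSpreadsheet is the named constant for the value 11.
    Debug.Print Range("A1").Value(xlRangeValueXMLSpreadsheet)
End Sub
```

Struck-through text should show up inside S elements within the cell's Data element, which is what the XPath expressions in NoStrikeThrough rely on.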
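EDIT: here is another approach, which breaks the text into "blocks" and checks each block to see whether it contains any strikethrough. How much faster this is than going character by character will probably depend on the block size and on how the struck-through text is distributed within each cell.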
Function NoStrikeThrough2(c As Range) As String
    Const BLOCK As Long = 50
    Dim L As Long, i As Long, n As Long, pos As Long, x As Long, last As Long
    Dim rv As String, s As String, txt As String, v
    txt = c.Value                 'read the cell text once
    L = Len(txt)
    n = (L + BLOCK - 1) \ BLOCK   'how many blocks to check (rounded up)
    pos = 1                       'block start position
    For i = 1 To n
        'Null means the block mixes struck and unstruck characters
        v = c.Characters(pos, BLOCK).Font.Strikethrough
        If IsNull(v) Then
            'if strikethrough is "mixed" in this block - parse out
            '  character-by-character, without running into the next
            '  block or past the end of the text
            last = pos + BLOCK - 1
            If last > L Then last = L
            s = ""
            For x = pos To last
                If Not c.Characters(x, 1).Font.Strikethrough Then
                    s = s & Mid$(txt, x, 1)
                End If
            Next x
            rv = rv & s
        ElseIf v = False Then
            'no strikethrough - take the whole block from the plain string
            rv = rv & Mid$(txt, pos, BLOCK)
        End If
        pos = pos + BLOCK 'next block position.
    Next i
    NoStrikeThrough2 = rv
End Function
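Since the gain depends on the block size and on your data, the simplest way to choose a value for BLOCK is to time the function on a representative cell. A rough sketch, assuming the cell of interest is A1 (the sub name is made up for illustration):

```vba
' Hypothetical timing harness for NoStrikeThrough2.
Sub TimeNoStrikeThrough2()
    Dim t As Single, result As String
    t = Timer                                   'seconds since midnight
    result = NoStrikeThrough2(Range("A1"))
    Debug.Print "Characters kept: " & Len(result) & _
                ", seconds: " & Format(Timer - t, "0.00")
End Sub
```

Timer resets at midnight, so a run that spans midnight will report a negative time. Change BLOCK, rerun, and compare the figures.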
EDIT2: if you need to make sure that none of the line breaks are struck through before you process a cell:
Sub ClearParaStrikes(c As Range)
    Dim pos As Long
    'pos starts at 0, so the first search begins at character 1
    pos = InStr(pos + 1, c.Value, vbLf)
    Do While pos > 0
        Debug.Print "vbLf at " & pos
        c.Characters(pos, 1).Font.Strikethrough = False
        pos = InStr(pos + 1, c.Value, vbLf)
    Loop
End Sub