Extract numbers from chemical formula

Question

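Sorry if this has already been asked and answered, but I couldn't find a satisfactory answer.

I have a list of chemical formulas containing, in this order, C, H, N and O, and I would like to pull out the number after each of these letters. The problem is that not all formulas contain an N. All of them do contain C, H and O, though. The numbers can be one, two or (only in the case of H) three digits long.

So the data look like this: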

C20H37N1O5
C10H12O3
C20H19N3O4
C23H40O3
C9H13N1O3
C14H26O4
C58H100N2O9

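I would like each element's number from the list in a separate column. So for the first example it would be:

20 37 1 5

I have been trying:

=IFERROR(MID(LEFT(A2,FIND("H",A2)-1),FIND("C",A2)+1,LEN(A2)),"")

to separate out the C count. After that I get stuck, because the H count is followed by either an N or an O.

Is there an Excel formula or VBA that can do this?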

Answer 1

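With VBA this is a simple task: loop through the characters and check whether each one is numeric. With an Excel formula the solution is rather repetitive, but it is doable. For example, if you apply the following formula, C20H37NO5 returns 20375: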

=IF(ISNUMBER(1*MID(A1,1,1)),MID(A1,1,1),"")&
IF(ISNUMBER(1*MID(A1,2,1)),MID(A1,2,1),"")&
IF(ISNUMBER(1*MID(A1,3,1)),MID(A1,3,1),"")&
IF(ISNUMBER(1*MID(A1,4,1)),MID(A1,4,1),"")&
IF(ISNUMBER(1*MID(A1,5,1)),MID(A1,5,1),"")&
IF(ISNUMBER(1*MID(A1,6,1)),MID(A1,6,1),"")&
IF(ISNUMBER(1*MID(A1,7,1)),MID(A1,7,1),"")&
IF(ISNUMBER(1*MID(A1,8,1)),MID(A1,8,1),"")&
IF(ISNUMBER(1*MID(A1,9,1)),MID(A1,9,1),"")

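As written, it checks whether each of the first 9 characters is a digit. To cover more than 9 characters, just add more lines to the formula.

There is a small trick in the formula: 1*. It converts a text character to a number where possible, so the text "5" multiplied by 1 becomes the number 5, and ISNUMBER returns TRUE for it. For a letter the multiplication yields an error, so ISNUMBER returns FALSE.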
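The answer mentions the VBA route (loop through the characters and keep the numeric ones) without showing it. A minimal sketch of that idea follows; the function name DigitsOnly and its signature are assumptions made for illustration, not part of the answer.

```vba
' Hypothetical helper: returns only the digit characters of txt,
' in their original order (the VBA counterpart of the long formula above).
Public Function DigitsOnly(ByVal txt As String) As String
    Dim i As Long
    Dim ch As String

    For i = 1 To Len(txt)
        ch = Mid$(txt, i, 1)
        ' Like "[0-9]" is True for a single character in the range 0 to 9
        If ch Like "[0-9]" Then
            DigitsOnly = DigitsOnly & ch
        End If
    Next i
End Function
```

Used in a cell as =DigitsOnly(A1), this returns 20375 for C20H37NO5, the same result as the formula, and it is not limited to the first 9 characters.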

Answer 2

Use the Split function and the Like operator.

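Sub test()
    Dim vDB As Variant, vR() As Variant
    Dim s As String
    Dim vSplit As Variant
    Dim i As Long, n As Long, j As Integer

    ' Read the formulas from A2 down to the last used row of column A
    vDB = Range("a2", Range("a" & Rows.Count).End(xlUp))

    n = UBound(vDB, 1)
    ReDim vR(1 To n, 1 To 4)
    For i = 1 To n
        s = vDB(i, 1)
        ' Replace every uppercase letter with a space, e.g. "C20H37N1O5" -> " 20 37 1 5"
        For j = 1 To Len(s)
            If Mid(s, j, 1) Like "[A-Z]" Then
                s = Replace(s, Mid(s, j, 1), " ")
            End If
        Next j
        ' vSplit(0) is the empty string before the first space, so start at index 1
        vSplit = Split(s, " ")
        For j = 1 To UBound(vSplit)
            vR(i, j) = vSplit(j)
        Next j
    Next i
    ' Write the numbers to columns B:E
    Range("b2").Resize(n, 4) = vR
End Sub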

Answer 3

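I did this in VBA using regular expressions. You could probably also do it by looping through the string, as Vityata suggests, but I suspect this is slightly faster and easier to read.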

Option Explicit

Function find_associated_number(chemical_formula As Range, element As String) As Variant
  Dim regex As Object: Set regex = CreateObject("VBScript.RegExp")
  Dim pattern As String
  Dim matches As Object

  If Len(element) > 1 Or chemical_formula.CountLarge <> 1 Then
    find_associated_number = CVErr(xlErrName)
  Else
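    ' \d+ is greedy and captures the whole number after the element symbol.
    ' A trailing \D would not match a count at the end of the string (e.g. O5).
    pattern = element & "(\d+)"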
    With regex
      .pattern = pattern
      .ignorecase = True
      If .test(chemical_formula) Then
        Set matches = .Execute(chemical_formula)
        find_associated_number = matches(0).submatches(0)
      Else
        find_associated_number = CVErr(xlErrNA)
      End If
    End With
  End If
End Function

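Then use it in a worksheet formula like any other function, for example =find_associated_number($A2,"C").

Column C contains the number of carbon atoms and column D the number of nitrogen atoms. To extend it, copy the formula and change the element it searches for.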

Answer 4

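If you want a VBA solution that extracts all the numbers, my preferred approach is to use regular expressions. The following code extracts all numbers from a string: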

Sub GetMolecularFormulaNumbers()
    Dim rng As Range, c As Range
    Dim RegExp As Object
    Dim match, matches
    Dim j As Long

    Set rng = Range(Cells(1, 1), Cells(Cells(Rows.Count, 1).End(xlUp).Row, 1))
    Set RegExp = CreateObject("vbscript.regexp")
    With RegExp
        .Pattern = "\d+"
        .IgnoreCase = True
        .Global = True

        For Each c In rng
            j = 0
            Set matches = .Execute(c)
            If matches.Count > 0 Then
                For Each match In matches
                    j = j + 1
                    c.Offset(0, j) = CInt(match)
                Next match
            End If
        Next c
    End With
End Sub

Answer 5

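Using regular expressions

This is a good task for regular expressions (regex). Because VBA doesn't support regular expressions out of the box, we need to reference a Windows library first.

Add a reference to the regex library under Tools, then References, and select Microsoft VBScript Regular Expressions 5.5.

Add this function to a module: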

 Option Explicit 

 Public Function ChemRegex(ByVal ChemFormula As String, ByVal Element As String) As Long
     Dim strPattern As String
     strPattern = "([CNHO])([0-9]*)" 
                  'this pattern is limited to the elements C, N, H and O only.
     Dim regEx As New RegExp

     Dim Matches As MatchCollection, m As Match

     If strPattern <> "" Then
         With regEx
             .Global = True
             .MultiLine = True
             .IgnoreCase = False
             .Pattern = strPattern
         End With

         Set Matches = regEx.Execute(ChemFormula)
         For Each m In Matches
             If m.SubMatches(0) = Element Then
                 ChemRegex = IIf(Not m.SubMatches(1) = vbNullString, m.SubMatches(1), 1) 
                             'this IIF ensures that in CH4O the C and O are count as 1
                 Exit For
             End If
         Next m
     End If
 End Function

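Use the function in a cell formula like this:

E.g. in cell B2: =ChemRegex($A2,B$1) and copy it to the other cells.

Also recognize chemical formulas with multiple occurrences of an element, such as `CH3OH` or `CH2COOH`

Note that the code above does not handle formulas like CH3OH, in which an element occurs more than once. Only the first occurrence, H3, is counted and the last H is omitted.

If you also need to recognize formulas such as CH3OH or CH2COOH (and sum up the occurrences of each element), change the code as follows: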

If m.SubMatches(0) = Element Then
    ChemRegex = ChemRegex + IIf(Not m.SubMatches(1) = vbNullString, m.SubMatches(1), 1)
    'Exit For needs to be removed.
End If
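To see the effect of removing Exit For, the summing loop can be tried in isolation. The following is only a sketch under assumptions: the name CountElement is hypothetical, it uses late binding via CreateObject so that it runs without the library reference, and it keeps the simple C, N, H, O pattern from above.

```vba
' Hypothetical stand-alone variant of the summing loop (late-bound RegExp).
Public Function CountElement(ByVal formula As String, ByVal element As String) As Long
    Dim re As Object
    Dim m As Object

    Set re = CreateObject("VBScript.RegExp")
    re.Global = True          ' find every element token, not just the first
    re.IgnoreCase = False
    re.Pattern = "([CNHO])([0-9]*)"

    For Each m In re.Execute(formula)
        If m.SubMatches(0) = element Then
            If Len(m.SubMatches(1)) = 0 Then
                ' no number after the symbol means one atom
                CountElement = CountElement + 1
            Else
                CountElement = CountElement + CLng(m.SubMatches(1))
            End If
        End If
    Next m
End Function
```

With this sketch, =CountElement("CH3OH","H") gives 4: 3 from H3 plus 1 from the trailing H.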

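Also recognize chemical formulas with two-letter elements, such as `NaOH` or `CaCl2`

In addition to the change above for multiple occurrences of an element, use this pattern: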

strPattern = "([A-Z][a-z]?)([0-9]*)"   'https://regex101.com/r/nNv8W6/2

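Note that the symbols need to be in the correct upper/lower case: CaCl2 works, but cacl2 and CACL2 do not.

Note that this does not check whether these letter combinations are existing elements of the periodic table. So it will also recognize, for example, Xx2Zz5Q as the fictitious elements Xx = 2, Zz = 5 and Q = 1.

To accept only combinations that exist in the periodic table, use the following pattern: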

 strPattern = "([A][cglmrstu]|[B][aehikr]?|[C][adeflmnorsu]?|[D][bsy]|[E][rsu]|[F][elmr]?|[G][ade]|[H][efgos]?|[I][nr]?|[K][r]?|[L][airuv]|[M][cdgnot]|[N][abdehiop]?|[O][gs]?|[P][abdmortu]?|[R][abefghnu]|[S][bcegimnr]?|[T][abcehilms]|[U]|[V]|[W]|[X][e]|[Y][b]?|[Z][nr])([0-9]*)"
 'https://regex101.com/r/Hlzta2/3
 'This pattern includes all 118 elements up to today. 
 'If new elements are found/generated by scientist they need to be added to the pattern.

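Also recognize chemical formulas with parentheses, such as `Ca(OH)2`

For this, a second regex is needed to handle the parentheses and multiply the elements inside them.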

Public Function ChemRegex(ByVal ChemFormula As String, ByVal Element As String) As Long
    Dim regEx As New RegExp
    With regEx
        .Global = True
        .MultiLine = True
        .IgnoreCase = False
    End With
    
    'first pattern matches every element once
    regEx.Pattern = "([A][cglmrstu]|[B][aehikr]?|[C][adeflmnorsu]?|[D][bsy]|[E][rsu]|[F][elmr]?|[G][ade]|[H][efgos]?|[I][nr]?|[K][r]?|[L][airuv]|[M][cdgnot]|[N][abdehiop]?|[O][gs]?|[P][abdmortu]?|[R][abefghnu]|[S][bcegimnr]?|[T][abcehilms]|[U]|[V]|[W]|[X][e]|[Y][b]?|[Z][nr])([0-9]*)"
    
    Dim Matches As MatchCollection
    Set Matches = regEx.Execute(ChemFormula)
    
    Dim m As Match
    For Each m In Matches
        If m.SubMatches(0) = Element Then
            ChemRegex = ChemRegex + IIf(Not m.SubMatches(1) = vbNullString, m.SubMatches(1), 1)
        End If
    Next m
    
    'second pattern finds parentheses and multiplies the elements within
    regEx.Pattern = "(\((.+?)\)([0-9]+)+)+?"
    Set Matches = regEx.Execute(ChemFormula)
    For Each m In Matches
        ChemRegex = ChemRegex + ChemRegex(m.SubMatches(1), Element) * (m.SubMatches(2) - 1) '-1 because all elements were already counted once in the first pattern
    Next m
End Function

This also recognizes parentheses. Note that it does not recognize nested parentheses.


Answer 6

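This seems to work well:

The formula in B2 is below. Drag it across and down.

=IFERROR(IFERROR(--(MID($A2,SEARCH(B$1,$A2)+1,3)),IFERROR(--(MID($A2,SEARCH(B$1,$A2)+1,2)),--MID($A2,SEARCH(B$1,$A2)+1,1))),0)

Or a shorter array formula, which must be entered with Ctrl+Shift+Enter:

=MAX(IFERROR(--MID($A2,SEARCH(B$1,$A2)+1,ROW($A:$A)),0))

If you want to keep the VBA super simple, something like this works too: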

Public Function ElementCount(str As String, element As String) As Long
    Dim i As Integer
    Dim s As String

    For i = 1 To 3
        s = Mid(str, InStr(str, element) + 1, i)
        On Error Resume Next
        ElementCount = CLng(s)
        On Error GoTo 0
    Next i
End Function

Use it like this:

=ElementCount(A1,"C")
