How to insert a space between a special character and everything else

Question

I am working with some LaTeX text, and I need to clean it up so that I can split it correctly on whitespace.

So the string:

\mathrm l  >\mathrm li ^ + >\mathrm mg ^   +>\mathrm a  \beta+  \mathrm co

should become:

\mathrm l  > \mathrm li ^ + > \mathrm mg ^   + > \mathrm a  \beta +  \mathrm co

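So, in order to split it, I have to put a space on each side of every special character. I also want to keep LaTeX commands of the form \something intact.

I can use re.compile(r'[^a-zA-Z0-9 \\]') to match all the special characters, but how do I go about inserting the spaces?

I wrote something like the code below, but it does not look very efficient. (Or is it?)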

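import re

def insert_space(sentence):
    r'''
    Add a space around special characters, so "x+y +-=y \latex" becomes "x + y + - = y \latex".
    '''
    string = ''
    for ch in sentence:
        # Pad every character except letters, digits, spaces and backslashes.
        if (not ch.isalnum()) and ch not in (' ', '\\'):
            string += ' ' + ch + ' '
        else:
            string += ch
    # Collapse the runs of whitespace produced by the padding.
    return re.sub(r'\s+', ' ', string)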

Answer 1

I have not used LaTeX, so if you are sure that [a-zA-Z0-9 \\] captures everything that is not a special character, you could do it like this.

import re

def insert_space(sentence):
    # Insert a space before a special character that is not already preceded by one.
    sentence = re.sub(r'(?<! )(?![a-zA-Z0-9 \\])', ' ', sentence)
    # Insert a space after a special character that is not already followed by one.
    sentence = re.sub(r'(?<!^)(?<![a-zA-Z0-9 \\])(?! )', ' ', sentence)
    return sentence

# Raw string, so that \m and \b stay literal backslash sequences.
my_string = r'\mathrm l  >\mathrm li ^ + >\mathrm mg ^   +>\mathrm a  \beta+  \mathrm co'
print('before', my_string)
# before \mathrm l  >\mathrm li ^ + >\mathrm mg ^   +>\mathrm a  \beta+  \mathrm co
print('after', insert_space(my_string))
# after \mathrm l  > \mathrm li ^ + > \mathrm mg ^   + > \mathrm a  \beta +  \mathrm co

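The first regex is:

(?<! ) a negative lookbehind for a space.
(?![a-zA-Z0-9 \\]) a negative lookahead for the character class you specified.
Every such match is replaced with a space ' '.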
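
To see the first pass in isolation, here is a minimal sketch. The helper name pad_before and the ORDINARY constant are made up for this illustration; the pattern itself is the one from the answer, with the backslash inside the class escaped.

```python
import re

# Character class from the answer: letters, digits, space and backslash.
ORDINARY = r'[a-zA-Z0-9 \\]'

def pad_before(text):
    # Hypothetical helper: runs only the first substitution.
    # It matches an empty position whose previous character is not a space
    # and whose next character is not in ORDINARY.
    return re.sub(r'(?<! )(?!' + ORDINARY + r')', ' ', text)

print(repr(pad_before('x+y +-=y')))
# 'x +y + - =y '
```

Only the left side of each special character is padded at this stage. The end of the string also counts as "not followed by an ordinary character", which is why a trailing space shows up in the output.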

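The second regex is:

(?<!^) a negative lookbehind for the start of the string.
(?<![a-zA-Z0-9 \\]) a negative lookbehind for the character class you specified.
(?! ) a negative lookahead for a space.
Every such match is replaced with a space ' '.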
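
The second pass can be tried on its own in the same way. Again, pad_after and ORDINARY are names invented for this sketch, not part of the answer.

```python
import re

# Character class from the answer: letters, digits, space and backslash.
ORDINARY = r'[a-zA-Z0-9 \\]'

def pad_after(text):
    # Hypothetical helper: runs only the second substitution.
    # It matches an empty position that is not the start of the string,
    # whose previous character is not in ORDINARY,
    # and whose next character is not a space.
    return re.sub(r'(?<!^)(?<!' + ORDINARY + r')(?! )', ' ', text)

print(pad_after(r'x >\mathrm y'))
# x > \mathrm y
```

Here the ">" already has a space on its left, so only the gap between ">" and the backslash is filled.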

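So, in effect, it first finds every position between a special character and a neighbouring non-space character, and then inserts a space at that position.

The reason you also need to include (?<!^) is to ignore the position between the start of the string and the first character. Otherwise it would add an extra space at the beginning.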
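
The effect of the guard is easy to check with a short experiment. The sketch below compares the second pattern with and without (?<!^). It then shows one possible single-pass variant built from positive lookarounds, which is an assumption of this example rather than something taken from the answer. All names (ORDINARY, SPECIAL, insert_space_single_pass) are hypothetical.

```python
import re

ORDINARY = r'[a-zA-Z0-9 \\]'   # class from the answer
SPECIAL = r'[^a-zA-Z0-9 \\]'   # its complement, used by the variant below

# Second pattern with and without the start-of-string guard.
with_guard = re.compile(r'(?<!^)(?<!' + ORDINARY + r')(?! )')
without_guard = re.compile(r'(?<!' + ORDINARY + r')(?! )')

print(repr(with_guard.sub(' ', 'x>y')))     # 'x> y'
print(repr(without_guard.sub(' ', 'x>y')))  # ' x> y'

def insert_space_single_pass(text):
    # Hypothetical variant: positive lookarounds require a real character
    # on both sides, so neither end of the string can match.
    pattern = r'(?<=[^ ])(?=' + SPECIAL + r')|(?<=' + SPECIAL + r')(?=[^ ])'
    return re.sub(pattern, ' ', text)

print(repr(insert_space_single_pass('x+y +-=y')))
# 'x + y + - = y'
```

Without the guard, the negative lookbehind succeeds at position 0 because there is no preceding character at all, so a leading space appears. The positive-lookaround form sidesteps this at both ends of the string, at the cost of a longer pattern.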

