Python: How to remove punctuation from a text corpus without removing it from special words (e.g. c++, c#, .net)

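I have a large pandas dataset containing job descriptions. I want to tokenize it, but first I need to remove stop words and punctuation. Stop words are not a problem.

If I use a regular expression to remove punctuation, I may lose words that are very important for describing the job (e.g. c++ developer, c#, .net).

The list of such important words is very large, because it includes not only programming language names but also company names.

For example, I want punctuation to be removed like this:

Before: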

Hi! We are looking for smart, young and hard-working c++ developer. Our perfect candidate should know: - c++, c#, .NET in expert level;

After:

Hi We are looking for smart young and hard-working c++ developer Our perfect candidate should know c++ c# .NET in expert level

Can you suggest an advanced tokenizer or another way to remove the punctuation?

My solution

def clean(s: str, keep=None, remove=None) -> str:
    """Delete the characters in `remove` from `s`, except inside the words in `keep`."""
    if keep is None:
        keep = []

    if remove is None:
        remove = []

    protected = [False] * len(s)  # True where the character must be kept

    # mark every character that belongs to an occurrence of a special word
    for w in keep:
        # "+ 1" so that a special word at the very end of s is also found
        for i in range(len(s) - len(w) + 1):
            if s[i:i + len(w)] == w:
                for j in range(len(w)):
                    protected[i + j] = True

    # drop unwanted characters unless they are protected
    out = ''
    for i in range(len(s)):
        if protected[i] or s[i] not in remove:
            out += s[i]

    return out


if __name__ == "__main__":

    test = "Hi! We are looking for smart, young and hard-working c++ developer. Our perfect candidate should know:" \
           " - c++, c# in expert level;"

    Remove = ['.', ',', ':', ';', '+', '-', '!', '?', '#']
    Keep = ['c++', 'c#']

    print(clean(test, keep=Keep, remove=Remove))
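
The character-by-character marking above can also be sketched with a single regular expression: the special words go into one alternative, the unwanted characters into another, and only the second kind of match is deleted. This is a minimal sketch rather than a drop-in replacement. The name `clean_regex` is made up for illustration, and matching is case-sensitive, as in `clean`.

```python
import re


def clean_regex(text, keep=(), remove=()):
    """Remove the characters in `remove`, except inside the words in `keep`."""
    if not remove:
        return text
    # Character class with every unwanted character, escaped for safety.
    remove_cls = "[" + "".join(re.escape(c) for c in remove) + "]"
    if keep:
        # Longest words first, so a longer special word wins over a shorter prefix.
        words = sorted(keep, key=len, reverse=True)
        keep_alt = "|".join(re.escape(w) for w in words)
        pattern = re.compile(f"(?P<keep>{keep_alt})|{remove_cls}")
    else:
        pattern = re.compile(remove_cls)
    # A match of the "keep" group is a protected word and is put back unchanged;
    # any other match is a single unwanted character and is replaced by "".
    return pattern.sub(lambda m: m.groupdict().get("keep") or "", text)


if __name__ == "__main__":
    test = ("Hi! We are looking for smart, young and hard-working c++ developer. "
            "Our perfect candidate should know: - c++, c# in expert level;")
    print(clean_regex(test,
                      keep=["c++", "c#"],
                      remove=[".", ",", ":", ";", "+", "-", "!", "?", "#"]))
```

Because the regex engine tries the protected words first at each position, a `+` or `#` is only deleted when it is not part of a listed word. With a very long `keep` list the alternation grows accordingly, so it may be worth measuring on real data.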

You can use the pattern:

[!,.:;-](?= |$)

It matches any of the characters !,.:;- that is followed by a space or by the end of the string.

In Python:

import re
text = "Hi! We are looking for smart, young and hard-working c++ developer. Our perfect candidate should know: - c++, c#, .NET in expert level;"
print(re.sub(r'[!,.:;-](?= |$)', '', text))

This prints:

Hi We are looking for smart young and hard-working c++ developer Our perfect candidate should know  c++ c# .NET in expert level
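
The output above still contains a double space after "know", left behind where ": -" was removed. A possible follow-up, sketched here with a hypothetical helper named `strip_punct`, is to collapse whitespace after the substitution. The same `split()` call also gives a simple whitespace tokenization.

```python
import re

# Same pattern as above: punctuation followed by a space or the end of the string.
TRAILING_PUNCT = re.compile(r'[!,.:;-](?= |$)')


def strip_punct(text):
    """Remove trailing punctuation, then collapse runs of whitespace."""
    no_punct = TRAILING_PUNCT.sub('', text)
    # str.split() without arguments splits on any whitespace run and drops empty strings.
    return ' '.join(no_punct.split())


text = ("Hi! We are looking for smart, young and hard-working c++ developer. "
        "Our perfect candidate should know: - c++, c#, .NET in expert level;")
cleaned = strip_punct(text)
print(cleaned)
print(cleaned.split())  # whitespace tokens; "c++", "c#" and ".NET" stay intact
```

Note that this approach only removes punctuation that ends a token, so punctuation in front of a word (for example an opening quote) or directly before a closing bracket would need extra handling.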