Python: How to remove punctuation from a text corpus without removing it from special words (e.g. c++, c#, .net)
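I have a large pandas dataset containing job descriptions. I want to tokenize it, but first I need to remove stop words and punctuation. Stop words are not a problem.
If I use a regular expression to remove punctuation, I may lose words that are very important for describing the job (e.g. c++ developer, c#, .net).
The list of such important words is very large, because it includes not only programming language names but also company names.
For example, I want punctuation removed like this:
Before:
Hi! We are looking for smart, young and hard-working c++ developer. Our perfect candidate should know: - c++, c#, .NET in expert level;
After:
Hi We are looking for smart young and hard-working c++ developer Our perfect candidate should know c++ c# .NET in expert level
Can you suggest an advanced tokenizer or a method for removing punctuation?
My solution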
def clean(s: str, keep=None, remove=None):
    """Delete the characters in `remove` from `s`, except inside the special words in `keep`."""
    if keep is None:
        keep = []
    if remove is None:
        remove = []
    protected = [False] * len(s)  # True where the character must be kept
    # Mark every character that belongs to an occurrence of a special word.
    for w in keep:
        # The +1 makes sure a special word at the very end of s is also found.
        for i in range(len(s) - len(w) + 1):
            if s.startswith(w, i):
                for j in range(len(w)):
                    protected[i + j] = True
    # Drop every unprotected character that appears in the remove list.
    out = ''
    for i in range(len(s)):
        if protected[i] or s[i] not in remove:
            out += s[i]
    return out


if __name__ == "__main__":
    test = ("Hi! We are looking for smart, young and hard-working c++ developer. Our perfect candidate should know:"
            " - c++, c# in expert level;")
    remove_chars = ['.', ',', ':', ';', '+', '-', '!', '?', '#']
    keep_words = ['c++', 'c#']
    print(clean(test, keep=keep_words, remove=remove_chars))
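The same keep/remove idea can also be sketched with a single compiled regular expression instead of character-by-character loops. This is only an illustration and not part of the original post: the function name `clean_regex` and the sample inputs are assumptions, and like the loop version it protects a special word wherever it occurs, even inside a longer token.

```python
import re


def clean_regex(s, keep=(), remove=""):
    """Remove the characters in `remove` from `s`, except inside words listed in `keep`."""
    remove = "".join(remove)  # accept a string or a list of single characters
    if not remove:
        return s
    # Try longer special words first so they win over shorter overlapping ones.
    alternatives = [re.escape(w) for w in sorted(keep, key=len, reverse=True) if w]
    # The last alternative (a named group) matches one removable character.
    alternatives.append("(?P<punct>[" + re.escape(remove) + "])")
    pattern = re.compile("|".join(alternatives))
    # A removable character is dropped; a protected word is copied through unchanged.
    return pattern.sub(lambda m: "" if m.group("punct") else m.group(0), s)


print(clean_regex("c++, c# and a+b!", keep=["c++", "c#"], remove=".,:;+-!?#"))
# c++ c# and ab
```

Because the special words come before the punctuation group in the alternation, the regex engine consumes a whole protected word before it gets a chance to match one of its characters as punctuation. If the keep list should match regardless of case (".NET" vs ".net"), passing `re.IGNORECASE` to `re.compile` would be one option.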
You can use the pattern:
[!,.:;-](?= |$)
It matches any of the characters !, ,, ., :, ; and - when they are followed by a space or the end of the string.
In Python:
import re
text = "Hi! We are looking for smart, young and hard-working c++ developer. Our perfect candidate should know: - c++, c#, .NET in expert level;"
print(re.sub(r'[!,.:;-](?= |$)', r'', text))
Prints (removing the standalone "-" leaves two spaces between "know" and "c++"):
Hi We are looking for smart young and hard-working c++ developer Our perfect candidate should know  c++ c# .NET in expert level
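Removing a standalone "-" with this pattern leaves two adjacent spaces behind, so a whitespace clean-up step is usually wanted before tokenizing. A minimal sketch of that, assuming the helper name `strip_trailing_punct` (hypothetical, not from the answer) and using `\s` rather than a literal space so that tabs and newlines also count as a boundary:

```python
import re

# Punctuation that is immediately followed by whitespace or the end of the string.
TRAILING_PUNCT = re.compile(r"[!,.:;-](?=\s|$)")


def strip_trailing_punct(text):
    """Remove trailing punctuation, then collapse runs of whitespace into single spaces."""
    without_punct = TRAILING_PUNCT.sub("", text)
    return " ".join(without_punct.split())


text = ("Hi! We are looking for smart, young and hard-working c++ developer. "
        "Our perfect candidate should know: - c++, c#, .NET in expert level;")
print(strip_trailing_punct(text))
```

Since the lookahead only removes punctuation that sits directly before whitespace or the end of the string, tokens such as c++, c# and .NET pass through without needing a keep list, and so does the hyphen in hard-working.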