Keeping punctuation as its own unit in preprocessed text

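What code splits a sentence into a list of its constituent words and punctuation marks? Most text preprocessing routines tend to strip punctuation out.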

For example, if I input:

"Punctuations to be included as its own unit."

The desired output would be:

result = ['Punctuations', 'to', 'be', 'included', 'as', 'its', 'own', 'unit', '.']

Thanks a lot!

You might want to consider the Natural Language Toolkit (nltk).

Try this:

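import nltk

# word_tokenize relies on NLTK's Punkt tokenizer data. If it is missing, NLTK
# raises a LookupError that names the resource to fetch with nltk.download().
sentence = "Punctuations to be included as its own unit."
tokens = nltk.word_tokenize(sentence)
print(tokens)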

Output: ['Punctuations', 'to', 'be', 'included', 'as', 'its', 'own', 'unit', '.']

The following snippet uses a regular expression to separate the words and punctuation marks into a list.

import re
import string

# Match either a run of word characters or a single punctuation character.
# re.escape keeps characters such as ']' and '\' literal inside the character class.
pattern = r"\w+|[" + re.escape(string.punctuation) + "]"

content = "Punctuations to be included as its own unit."
words_and_punctuation = re.findall(pattern, content)
print(words_and_punctuation)

Output: ['Punctuations', 'to', 'be', 'included', 'as', 'its', 'own', 'unit', '.']
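
If the text might contain punctuation outside the ASCII set in string.punctuation (curly quotes, for instance), one alternative is to treat any character that is neither a word character nor whitespace as punctuation. This is only a rough sketch under that assumption; the helper name split_words_and_punctuation is made up for illustration and does not come from any library.

```python
import re

# Hypothetical helper, named for illustration only.
# \w+      matches a run of letters, digits, or underscores (one word).
# [^\w\s]  matches one character that is neither a word character nor whitespace.
_TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def split_words_and_punctuation(text):
    """Return the words and punctuation marks of text as separate list items."""
    return _TOKEN_PATTERN.findall(text)


print(split_words_and_punctuation("Punctuations to be included as its own unit."))
print(split_words_and_punctuation("Wait, what?!"))
```

Each punctuation mark comes out as its own item, so "?!" becomes two tokens. Note that this also splits contractions: "don't" turns into 'don', "'", 't', which may or may not be what you want.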