Getting word count in a sentence without punctuation marks NLTK python

Question

I am trying to get the word count of a sentence with nltk in Python.

This is the code I wrote:

import nltk

data = "Sample sentence, for checking. Here is an exclamation mark! Here is a question? This isn't an easy-task."

for i in nltk.sent_tokenize(data):
    print(nltk.word_tokenize(i))

This is the output:

['Sample', 'sentence', ',', 'for', 'checking', '.']
['Here', 'is', 'an', 'exclamation', 'mark', '!']
['Here', 'is', 'a', 'question', '?']
['This', 'is', "n't", 'an', 'easy-task', '.']

Is there any way to remove the punctuation, prevent isn't from being split into two tokens, and split easy-task into two words?

The result I need looks like this:

['Sample', 'sentence', 'for', 'checking']
['Here', 'is', 'an', 'exclamation', 'mark']
['Here', 'is', 'a', 'question']
['This', "isn't", 'an', 'easy', 'task']

I can handle the punctuation by treating it as stop words, for example:

import nltk

data = "Sample sentence, for checking. Here is an exclamation mark! Here is a question? This isn't an easy-task."

stopwords = [',', '.', '?', '!']

for i in nltk.sent_tokenize(data):
    for j in nltk.word_tokenize(i):
        if j not in stopwords:
            print(j, ', ', end="")
    print('\n')

Output:

Sample , sentence , for , checking , 

Here , is , an , exclamation , mark , 

Here , is , a , question , 

This , is , n't , an , easy-task ,

But this does not fix isn't and easy-task. Is there a way to do this? Thanks

Answer 1

You can use a different tokenizer to meet your requirements.

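import string

import nltk

data = "Sample sentence, for checking. Here is an exclamation mark! Here is a question? This isn't an easy-task."

# TweetTokenizer keeps contractions such as "isn't" as a single token
tokenizer = nltk.TweetTokenizer()

for i in nltk.sent_tokenize(data):
    # Keep only the tokens that are not punctuation characters
    print([x for x in tokenizer.tokenize(i) if x not in string.punctuation])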

# Output
['Sample', 'sentence', 'for', 'checking']
['Here', 'is', 'an', 'exclamation', 'mark']
['Here', 'is', 'a', 'question']
['This', "isn't", 'an', 'easy-task']
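
Note that the output above still keeps easy-task as one token, while the question asks for 'easy' and 'task'. One possible way to cover that, offered as a minimal sketch rather than a definitive solution, is to post-process the TweetTokenizer tokens: skip tokens that consist only of punctuation, split the remaining ones on the hyphen, and take len() of the result as the word count. The helper name words_without_punctuation is hypothetical, and the sentences are listed directly so the sketch runs without the Punkt data that nltk.sent_tokenize needs.

```python
import string

from nltk.tokenize import TweetTokenizer


# Hypothetical helper name, introduced only for this sketch.
def words_without_punctuation(sentence):
    """Return the words of one sentence, without punctuation tokens."""
    words = []
    for token in TweetTokenizer().tokenize(sentence):
        # Skip tokens made up entirely of punctuation, e.g. "," or "!".
        if all(ch in string.punctuation for ch in token):
            continue
        # Split hyphenated tokens ("easy-task") and drop empty pieces.
        words.extend(part for part in token.split("-") if part)
    return words


# Sentences are listed directly so the sketch does not need the Punkt data;
# in practice they would come from nltk.sent_tokenize(data).
sentences = [
    "Sample sentence, for checking.",
    "Here is an exclamation mark!",
    "Here is a question?",
    "This isn't an easy-task.",
]

for sentence in sentences:
    words = words_without_punctuation(sentence)
    # Print the word count followed by the words themselves.
    print(len(words), words)
```

With this approach the last sentence should give ['This', "isn't", 'an', 'easy', 'task'], a count of 5. Splitting on every hyphen is an assumption about what counts as a word; it would also break up tokens where the hyphen is meant to stay.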

