How to remove an entire line if it does not have a POS tag like CD?

Question

I am reading a news article and using nltk for POS tagging. I want to remove the lines that do not have a POS tag like CD (cardinal number).

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk import pos_tag

stop_words = set(stopwords.words('english'))

# Read the whole article into a single string
with open("etorg.txt") as file1:
    line = file1.read()
print(line)

# Split on whitespace and tag every token
words = line.split()
tokens = nltk.pos_tag(words)

How can I delete all the sentences that do not contain a CD tag?

Answer 1

Just use [word for word in tokens if word[1] != 'CD']
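
To see what that comprehension does, here is a small sketch on a hand-written tagged list. The (word, tag) pairs below are typed in for illustration and are not the result of actually running nltk. The comprehension drops the individual tokens tagged CD and keeps everything else, so it filters words rather than whole sentences.

```python
# Hand-written (word, tag) pairs standing in for the output of nltk.pos_tag;
# illustrative only, not actual tagger output.
tokens = [("MNC", "NNP"), ("claims", "VBZ"), ("21", "CD"),
          ("million", "CD"), ("sales", "NNS")]

# Keep every token whose tag is not CD
without_numbers = [word for word in tokens if word[1] != 'CD']
print(without_numbers)
# prints [('MNC', 'NNP'), ('claims', 'VBZ'), ('sales', 'NNS')]
```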

Edit: to get the sentences that contain no numbers, use this code:

import nltk

def has_number(sentence):
    # Return '' if any token in the sentence is tagged CD, otherwise the sentence itself
    for i in nltk.pos_tag(sentence.split()):
        if i[1] == 'CD':
            return ''
    return sentence

line = 'MNC claims 21 million sales in September. However, industry sources do not confirm this data. It is estimated that the reported sales could be in the range of fifteen to 18 million. '

# Split on '.', blank out every sentence that contains a number, then rejoin
''.join([has_number(x) for x in line.split('.')])

> ' However, industry sources do not confirm this data '
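
The code above keeps the sentences without numbers. If the goal is the reverse, as the question asks (keep only the sentences that do contain a CD tag), a minimal sketch could look like the following. The function name `keep_sentences_with_tag`, the `tagger` parameter, and the regex-based sentence split are assumptions made for illustration. By default it calls `nltk.pos_tag`, which requires the NLTK tagger data to be installed.

```python
import re

def keep_sentences_with_tag(text, tag="CD", tagger=None):
    """Return the sentences of `text` that have at least one token tagged `tag`."""
    if tagger is None:
        # Imported lazily so that a custom tagger can be passed without nltk
        import nltk
        tagger = nltk.pos_tag
    # Naive split after '.', '!' or '?' followed by whitespace (an assumption;
    # nltk.sent_tokenize is more robust for real articles)
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    kept = []
    for sentence in sentences:
        words = sentence.split()
        # Keep the sentence if any of its tokens carries the wanted tag
        if any(t == tag for _, t in tagger(words)):
            kept.append(sentence)
    return kept
```

With the `line` string from the answer, `' '.join(keep_sentences_with_tag(line))` should return the first and third sentences, provided the tagger labels "21" and "18" as CD.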

Tags: python, part-of-speech, sentence