How to extract special characters using NLTK RegexpParser Chunk for POS_tagged words in Python
I have some text, for example: 80% of $300,000 Each Human Resource/IT Department.
I need to extract $300,000
as well as the words Each Human Resource/IT Department.
I have POS-tagged the tokenized words. I am able to extract 300,000, but not the $ sign.
What I currently have:
import nltk
from nltk.tokenize import PunktSentenceTokenizer

text = '80% of $300,000 Each Human Resource/IT Department'
train_text = text
sample_text = text
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

for i in tokenized:
    words = nltk.word_tokenize(i)
    tagged = nltk.pos_tag(words)
    chunkGram = r"""chunk: {<DT>+<NN.*>+<NN.*>?|<NNP>?|<CD>+<NN>?|<NNP>?}"""
    chunkParser = nltk.RegexpParser(chunkGram)
    chunked = chunkParser.parse(tagged)
The chunked output, converted to a list - ['80 %', '300,000', 'Each Human Resource/IT Department']
What I want: ['80 %', '$300,000', 'Each Human Resource/IT Department']
I tried adding </$CD>| to the grammar:
chunkGram = r"""chunk: {</$CD>|<DT>+<NN.*>+<NN.*>?|<NNP>?|<CD>+<NN>?|<NNP>?}"""
It still doesn't work. So, I just need the $ together with the CD.
You need to add <\$>? to your grammar (nltk.pos_tag gives the dollar sign itself the Penn Treebank tag $, so the chunk pattern has to allow an optional $ tag in front of the CD):
chunkGram = r"""chunk: {<DT>+<NN.*>+<NN.*>?|<$>?<CD>+<NN>?|<NNP>?}"""
Code:
import nltk
from nltk.tokenize import PunktSentenceTokenizer

text = '80% of $300,000 Each Human Resource/IT Department'
train_text = text
sample_text = text
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

for i in tokenized:
    words = nltk.word_tokenize(i)
    tagged = nltk.pos_tag(words)
    chunkGram = r"""chunk: {<DT>+<NN.*>+<NN.*>?|<$>?<CD>+<NN>?|<NNP>?}"""
    chunkParser = nltk.RegexpParser(chunkGram)
    chunked = chunkParser.parse(tagged)
    print(chunked)
Output:
(S
  (chunk 80/CD %/NN)
  of/IN
  (chunk $/$ 300,000/CD)
  (chunk Each/DT Human/NNP Resource/IT/NNP Department/NNP))
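If you also want the chunks back as a flat list of strings (as asked in the question), a minimal sketch is to walk the subtrees labelled chunk and join their word tokens; the helper name chunks_to_list below is just illustrative, not part of NLTK:

def chunks_to_list(tree, label='chunk'):
    # Collect every subtree carrying the chunk label and join its (word, tag) leaves into a string.
    return [' '.join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees(lambda t: t.label() == label)]

print(chunks_to_list(chunked))
# ['80 %', '$ 300,000', 'Each Human Resource/IT Department']
# Note: word_tokenize splits '$' and '300,000' into separate tokens, so the joined
# string keeps a space ('$ 300,000'); strip it afterwards if you need '$300,000'.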