How to extract special characters using NLTK RegexpParser Chunk for POS_tagged words in Python
I have some text, for example: 80% of $300,000 Each Human Resource/IT Department.
I need to extract $300,000
as well as the words Each Human Resource/IT Department.
I have POS-tagged the tokenized words. I am able to extract 300,000, but not the $ sign.
What I currently have:
import nltk
from nltk.tokenize import PunktSentenceTokenizer

text = '80% of $300,000 Each Human Resource/IT Department'
train_text = text
sample_text = text
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

for i in tokenized:
    words = nltk.word_tokenize(i)
    tagged = nltk.pos_tag(words)
    chunkGram = r"""chunk: {<DT>+<NN.*>+<NN.*>?|<NNP>?|<CD>+<NN>?|<NNP>?}"""
    chunkParser = nltk.RegexpParser(chunkGram)
    chunked = chunkParser.parse(tagged)
The chunked output, converted to a list - ['80 %', '300,000', 'Each Human Resource/IT Department']
What I want: ['80 %', '$300,000', 'Each Human Resource/IT Department']
I tried adding </$CD>| to the grammar:
chunkGram = r"""chunk: {</$CD>|<DT>+<NN.*>+<NN.*>?|<NNP>?|<CD>+<NN>?|<NNP>?}"""
It still doesn't work. So, I just need the $ together with the CD.
You need to add <\$>? to your grammar (nltk.pos_tag gives the dollar sign itself the Penn Treebank tag $, so the chunk pattern has to allow an optional $ tag in front of the CD):
chunkGram = r"""chunk: {<DT>+<NN.*>+<NN.*>?|<$>?<CD>+<NN>?|<NNP>?}"""
Code:
import nltk
from nltk.tokenize import PunktSentenceTokenizer

text = '80% of $300,000 Each Human Resource/IT Department'
train_text = text
sample_text = text
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

for i in tokenized:
    words = nltk.word_tokenize(i)
    tagged = nltk.pos_tag(words)
    chunkGram = r"""chunk: {<DT>+<NN.*>+<NN.*>?|<$>?<CD>+<NN>?|<NNP>?}"""
    chunkParser = nltk.RegexpParser(chunkGram)
    chunked = chunkParser.parse(tagged)
    print(chunked)
Output:
(S
  (chunk 80/CD %/NN)
  of/IN
  (chunk $/$ 300,000/CD)
  (chunk Each/DT Human/NNP Resource/IT/NNP Department/NNP))
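If you also want the chunks back as a flat list of strings (as asked in the question), a minimal sketch is to walk the subtrees labelled chunk and join their word tokens; the helper name chunks_to_list below is just illustrative, not part of NLTK:

def chunks_to_list(tree, label='chunk'):
    # Collect every subtree carrying the chunk label and join its (word, tag) leaves into a string.
    return [' '.join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees(lambda t: t.label() == label)]

print(chunks_to_list(chunked))
# ['80 %', '$ 300,000', 'Each Human Resource/IT Department']
# Note: word_tokenize splits '$' and '300,000' into separate tokens, so the joined
# string keeps a space ('$ 300,000'); strip it afterwards if you need '$300,000'.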