Python TextBlob Package - Determines POS tag for '%' symbol but does not print it as a word

Question

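I'm banging my head against Python's TextBlob package, which I use to:

identify sentences in a paragraph
identify words in those sentences
determine POS (part-of-speech) tags for those words, etc.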

Everything went well until I found a possible issue, if I'm not mistaken. It is explained below with a sample code snippet.

from textblob import TextBlob
sample = '''This is greater than that by 5%.'''  # Sample sentence
blob = TextBlob(sample)                          # Pass it to TextBlob
Words = blob.words                               # Split the sentence into words
Tags = blob.tags                                 # Determine the POS tag for each word in the sentence

print(Tags)
[('This', 'DT'), ('is', 'VBZ'), ('greater', 'JJR'), ('than', 'IN'), ('that', 'DT'), ('by', 'IN'), ('5', 'CD'), ('%', 'NN')]

print(Words)
['This', 'is', 'greater', 'than', 'that', 'by', '5']

As shown above, blob.tags treats the '%' symbol as a separate word and assigns it a POS tag.

blob.words, however, does not return the '%' symbol at all, either on its own or attached to the preceding word.

I am creating a DataFrame from the output of both. It cannot be created because the two lists differ in length.

Here are my questions. Is this a problem in the TextBlob package? And is there any way to get '%' into the Words list?

Answer 1

Stripping punctuation during tokenization appears to be a deliberate decision by the TextBlob developers: https://github.com/sloria/TextBlob/blob/dev/textblob/blob.py#L624

They rely on NLTK's tokenizers through a wrapper that takes an include_punct parameter, but I don't see a way to pass include_punct=True through TextBlob when asking for the words.

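When I ran into a similar problem, I replaced the punctuation I cared about with a non-dictionary text constant meant to represent it, i.e. replacing '%' with 'PUNCTPERCENT' before tokenizing. That way, the information that a percent sign was there is not lost.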
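A minimal sketch of that placeholder trick, using only the standard library. The names PLACEHOLDERS, protect and restore are made up for illustration, and the whitespace split is only a stand-in for a real punctuation-stripping tokenizer such as the one behind blob.words.

```python
# Hypothetical mapping from symbols worth keeping to non-dictionary constants.
PLACEHOLDERS = {"%": "PUNCTPERCENT"}


def protect(text):
    """Replace each tracked symbol with its constant, padded with spaces."""
    for symbol, constant in PLACEHOLDERS.items():
        text = text.replace(symbol, f" {constant} ")
    return text


def restore(tokens):
    """Map the constants back to the original symbols after tokenizing."""
    reverse = {constant: symbol for symbol, constant in PLACEHOLDERS.items()}
    return [reverse.get(token, token) for token in tokens]


protected = protect("This is greater than that by 5%.")

# Stand-in tokenizer (an assumption, not TextBlob): split on whitespace
# and drop periods, so ordinary punctuation is still discarded.
tokens = [token.strip(".") for token in protected.split()]
tokens = [token for token in tokens if token]

words = restore(tokens)
print(words)
```

The constant survives tokenization because it looks like an ordinary word, so the '%' can be put back afterwards.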

EDIT: I stand corrected. When initializing TextBlob, you can set the tokenizer through the tokenizer parameter of its __init__ method: https://github.com/sloria/TextBlob/blob/dev/textblob/blob.py#L328

So you can easily pass TextBlob a tokenizer that respects punctuation.

respectful_tokenizer = YourCustomTokenizerRespectsPunctuation()  # placeholder for your own tokenizer class
blob = TextBlob('some text with %', tokenizer=respectful_tokenizer)
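For illustration, here is one way such a punctuation-respecting tokenizer might look. This is a standalone sketch: PunctuationKeepingTokenizer is a hypothetical name, and the regex is just one simple choice. To actually pass it to TextBlob it would most likely have to subclass textblob.tokenizers.BaseTokenizer, and I have not checked which TextBlob properties would then use it.

```python
import re


class PunctuationKeepingTokenizer:
    """Hypothetical tokenizer: runs of word characters and single
    punctuation marks each become a token."""

    # \w+ matches a word or number; [^\w\s] matches one punctuation mark.
    _pattern = re.compile(r"\w+|[^\w\s]")

    def tokenize(self, text):
        return self._pattern.findall(text)


tokenizer = PunctuationKeepingTokenizer()
tokens = tokenizer.tokenize("This is greater than that by 5%.")
print(tokens)
```

With this pattern the '%' and the final '.' both come out as separate tokens, so anything not wanted still has to be filtered afterwards.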

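EDIT 2: I ran into this while looking through TextBlob's source: https://github.com/sloria/TextBlob/blob/dev/textblob/blob.py#L372 Note the docstring of the words method: it says that if you want to include punctuation, you should access the tokens property instead of the words property.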

Answer 2

In the end I found that NLTK identifies the symbol correctly. The code snippet is given below for reference:

from nltk import word_tokenize
from nltk import pos_tag
Words = word_tokenize(sample)
Tags = pos_tag(Words)

print(Words)
['This', 'is', 'better', 'than', 'that', 'by', '5', '%']

print(Tags)
[('This', 'DT'), ('is', 'VBZ'), ('better', 'JJR'), ('than', 'IN'), ('that', 'DT'), ('by', 'IN'), ('5', 'CD'), ('%', 'NN')]
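Since the original goal was a data frame, one way to avoid the length mismatch entirely is to take both columns from the same tagged list. The sketch below uses the Tags output shown in the question as literal data; the column names "word" and "pos" are made up for illustration.

```python
# Tagged output copied from the question, used here as plain data.
tags = [('This', 'DT'), ('is', 'VBZ'), ('greater', 'JJR'), ('than', 'IN'),
        ('that', 'DT'), ('by', 'IN'), ('5', 'CD'), ('%', 'NN')]

# Both columns are derived from the same list, so their lengths always match.
columns = {
    "word": [word for word, _ in tags],
    "pos": [pos for _, pos in tags],
}

print(len(columns["word"]), len(columns["pos"]))
# A dict of equal-length lists like this can be passed to pandas.DataFrame.
```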


python

textblob