NLTK word_tokenize behaviour for double quotation marks is confusing
>>> import nltk
>>> nltk.__version__
'3.0.4'
>>> nltk.word_tokenize('"')
['``']
>>> nltk.word_tokenize('""')
['``', '``']
>>> nltk.word_tokenize('"A"')
['``', 'A', "''"]
See how it changes " into a double `` and ''?
What is happening here? Why is it changing the character? Is there a workaround? Later on I need to search for each token in a string.
Python 2.7.6, if that makes any difference.
TL;DR:
nltk.word_tokenize changes starting double quotes from " -> `` and ending double quotes from " -> ''.
In long:
First, nltk.word_tokenize tokenizes the way the Penn TreeBank does, and it comes from nltk.tokenize.treebank, see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L91 and https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L23
class TreebankWordTokenizer(TokenizerI):
"""
The Treebank tokenizer uses regular expressions to tokenize text as in Penn Treebank.
This is the method that is invoked by ``word_tokenize()``. It assumes that the
text has already been segmented into sentences, e.g. using ``sent_tokenize()``.
Then at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L48 we see that it comes from "Robert MacIntyre's tokenizer", i.e. https://www.cis.upenn.edu/~treebank/tokenizer.sed
Next comes a list of regex substitutions for contractions, which splits contraction words like "gonna", "wanna", etc.:
>>> from nltk import word_tokenize
>>> word_tokenize("I wanna go home")
['I', 'wan', 'na', 'go', 'home']
>>> word_tokenize("I gonna go home")
['I', 'gon', 'na', 'go', 'home']
After that we get to the punctuation part you are asking about, see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L63:
def tokenize(self, text):
    #starting quotes
    text = re.sub(r'^\"', r'``', text)
    text = re.sub(r'(``)', r' \1 ', text)
    text = re.sub(r'([ (\[{<])"', r'\1 `` ', text)
Aha, so the starting quotes change from " -> ``:
>>> import re
>>> text = '"A"'
>>> re.sub(r'^\"', r'``', text)
'``A"'
>>> re.sub(r'(``)', r' \1 ', re.sub(r'^\"', r'``', text))
' `` A"'
>>> re.sub(r'([ (\[{<])"', r'\1 `` ', re.sub(r'(``)', r' \1 ', re.sub(r'^\"', r'``', text)))
' `` A"'
>>> text_after_startquote_changes = re.sub(r'([ (\[{<])"', r'\1 `` ', re.sub(r'(``)', r' \1 ', re.sub(r'^\"', r'``', text)))
>>> text_after_startquote_changes
' `` A"'
Then we see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L85 dealing with the ending quotes:
    #ending quotes
    text = re.sub(r'"', " '' ", text)
    text = re.sub(r'(\S)(\'\')', r'\1 \2 ', text)
Applying the regexes:
>>> re.sub(r'"', " '' ", text_after_startquote_changes)
" `` A '' "
>>> re.sub(r'(\S)(\'\')', r'\1 \2 ', re.sub(r'"', " '' ", text_after_startquote_changes))
" `` A '' "
So if you want to search a list of tokens for double quotes after nltk.word_tokenize, simply search for `` and '' instead of ".
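If you really do need literal " tokens back, one possible workaround (not something NLTK does for you) is to map the quote tokens back after tokenizing, e.g.:
>>> tokens = nltk.word_tokenize('"A"')
>>> ['"' if t in ('``', "''") else t for t in tokens]  # map quote tokens back to "
['"', 'A', '"']
This assumes your input never contains literal `` or '' sequences that you want to keep distinct from real double quotes.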
It doesn't seem possible to handle quoted words easily in any of the nltk tokenizers:
- a word in single quotes gets split so that the opening quote stays attached to the word and the closing quote becomes the next token
- double quotes written as two single quotes get split differently depending on the tokenizer: either into a `` token before the word and a '' token after it, or into two separate ' tokens before and two after the word (see the examples below)
- you should search for those tokens instead; a sketch of a helper for doing that follows the examples
s = "marked with ''gonzaga'' "
from nltk.tokenize import word_tokenize
word_tokenize(s)
['marked', 'with', '``', 'gonzaga', "''"]
from sacremoses import MosesTokenizer
tokenizer = MosesTokenizer()
tokenizer.tokenize(s, escape=False)
['marked', 'with', "'", "'", 'gonzaga', "'", "'"]