nltk: word_tokenize changes quotes
I'm using Python's nltk, and I want to tokenize a sentence that contains quotation marks, but it turns " into `` and ''.
For example:
>>> from nltk import word_tokenize
>>> sentence = 'He said "hey Bill!"'
>>> word_tokenize(sentence)
['He', 'said', '``', 'hey', 'Bill', '!', "''"]
Why doesn't it keep the quotes from the original sentence, and how can I fix this?
Thanks
It actually does this by design, not by accident. From the Penn Treebank Tokenization:
double quotes (") are changed to doubled single forward- and backward- quotes (`` and '')
This was not done in previous versions, but it was updated last year. In other words, if you want to change this behaviour, you need to edit treebank.py.
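If you just want plain double quotes back in the token list, a simple workaround (rather than editing treebank.py) is to map the Treebank quote tokens back after tokenization. A minimal sketch, using the token list from the question:

```python
# Tokens as produced by word_tokenize in the question above
tokens = ['He', 'said', '``', 'hey', 'Bill', '!', "''"]

# Map the Treebank quote tokens back to plain double quotes
restored = ['"' if token in ('``', "''") else token for token in tokens]
print(restored)  # ['He', 'said', '"', 'hey', 'Bill', '!', '"']
```

Note that this loses the opening/closing distinction the Treebank convention encodes, which is fine if you only need the original characters.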
Expanding on the answer provided by Leb:
The URL for the Penn Treebank tokenization page is no longer available, but it can still be found at ftp://ftp.cis.upenn.edu/pub/treebank/public_html/tokenization.html
Copy-pasting the content here:
Treebank tokenization
Our tokenization is fairly simple:
most punctuation is split from adjoining words
double quotes (") are changed to doubled single forward- and backward- quotes (`` and '')
verb contractions and the Anglo-Saxon genitive of nouns are split into their component morphemes, and each morpheme is tagged
separately.
Examples
children's --> children 's
parents' --> parents '
won't --> wo n't
gonna --> gon na
I'm --> I 'm
This tokenization allows us to analyze each component separately, so (for example) "I" can be in the subject Noun Phrase while "'m" is
the head of the main verb phrase.
There are some subtleties for hyphens vs. dashes, ellipsis dots (...) and so on, but these often depend on the particular corpus or
application of the tagged data.
In parsed corpora, bracket-like characters are converted to special 3-letter sequences, to avoid confusion with parse brackets. Some POS
taggers, such as Adwait Ratnaparkhi's MXPOST, require this form for
their input.
In other words, these tokens in POS files: ( ) [ ] { }
become, in parsed files: -LRB- -RRB- -LSB- -RSB- -LCB- -RCB-
(The acronyms stand for (Left|Right) (Round|Square|Curly) Bracket.)
Here is a simple sed script that does a decent enough job on most corpora, once the corpus has been formatted into
one-sentence-per-line.
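The bracket conversion described above can be sketched as a plain token mapping (BRACKETS is a hypothetical helper written for illustration, not part of NLTK):

```python
# Hypothetical mapping from POS-file bracket tokens to the
# special 3-letter sequences used in parsed files
BRACKETS = {
    '(': '-LRB-', ')': '-RRB-',
    '[': '-LSB-', ']': '-RSB-',
    '{': '-LCB-', '}': '-RCB-',
}

tokens = ['(', 'see', '[', 'note', '1', ']', ')']
converted = [BRACKETS.get(t, t) for t in tokens]
print(converted)  # ['-LRB-', 'see', '-LSB-', 'note', '1', '-RSB-', '-RRB-']
```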
An example from Stanford:
https://nlp.stanford.edu/software/tokenizer.shtml
The Command-line usage section shows an example of how double quotes are changed according to the Penn Treebank tokenization rules.
https://www.nltk.org/_modules/nltk/tokenize/treebank.html
The class TreebankWordTokenizer shows how the change is implemented:
# starting quotes
(re.compile(r"^\""), r"``")
# ending quotes
(re.compile(r'"'), " '' ")
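As a minimal illustration of how these two substitutions act on a string (the real tokenizer in treebank.py applies several more starting-quote rules, e.g. for quotes preceded by spaces or brackets, so this sketch covers only the two rules quoted above):

```python
import re

text = '"hey Bill!"'

# Starting-quote rule: a double quote at the start of the string becomes ``
text = re.sub(r'^"', r'``', text)

# Ending-quote rule: any remaining double quote becomes '' (padded with spaces)
text = re.sub(r'"', " '' ", text)

print(text)  # "``hey Bill! '' " (note the padding spaces around '')
```

After these substitutions the tokenizer splits on whitespace, which is why `` and '' come out as separate tokens.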