nltk: word_tokenize changes quotes

I am using NLTK in Python, and I want to tokenize a sentence that contains quotation marks, but it turns " into `` and ''.

For example:

>>> from nltk import word_tokenize

>>> sentence = 'He said "hey Bill!"'
>>> word_tokenize(sentence)
['He', 'said', '``', 'hey', 'Bill', '!', "''"]

Why doesn't it keep the quotes from the original sentence, and how can I fix this?

Thanks

It is actually meant to do that; it is not accidental. From the Penn Treebank Tokenization:

double quotes (") are changed to doubled single forward- and backward- quotes (`` and '')

It did not do this in earlier versions, but it was updated last year. In other words, if you want to change this behavior, you need to edit treebank.py.
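If you just want plain double quotes back, a lighter-weight workaround than editing treebank.py is to post-process the token list. This is a sketch; the `restore_quotes` helper below is hypothetical, not part of NLTK:

```python
def restore_quotes(tokens):
    """Map the Penn Treebank quote tokens (`` and '') back to plain " marks."""
    return ['"' if tok in ('``', "''") else tok for tok in tokens]

# tokens as produced by nltk.word_tokenize on 'He said "hey Bill!"'
tokens = ['He', 'said', '``', 'hey', 'Bill', '!', "''"]
print(restore_quotes(tokens))  # ['He', 'said', '"', 'hey', 'Bill', '!', '"']
```

Note that this loses the open/close distinction the Treebank convention encodes; keep the `` / '' tokens if you need it.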

Extending the answer provided by Leb:

The URL for the Penn Treebank tokenization is no longer available. But it appears at ftp://ftp.cis.upenn.edu/pub/treebank/public_html/tokenization.html

Copy-pasting the content here:

Treebank tokenization

    Our tokenization is fairly simple:
  
  • most punctuation is split from adjoining words

  • double quotes (") are changed to doubled single forward- and backward- quotes (`` and '')

  • verb contractions and the Anglo-Saxon genitive of nouns are split into their component morphemes, and each morpheme is tagged
    separately.

    Examples
         children's --> children 's
         parents' --> parents '
         won't --> wo n't
         gonna --> gon na
         I'm --> I 'm
    

    This tokenization allows us to analyze each component separately, so (for example) "I" can be in the subject Noun Phrase while "'m" is the head of the main verb phrase.

  • There are some subtleties for hyphens vs. dashes, ellipsis dots (...) and so on, but these often depend on the particular corpus or application of the tagged data.

  • In parsed corpora, bracket-like characters are converted to special 3-letter sequences, to avoid confusion with parse brackets. Some POS taggers, such as Adwait Ratnaparkhi's MXPOST, require this form for their input. In other words, these tokens in POS files: ( ) [ ] { } become, in parsed files: -LRB- -RRB- -LSB- -RSB- -LCB- -RCB- (The acronyms stand for (Left|Right) (Round|Square|Curly) Bracket.)

    Here is a simple sed script that does a decent enough job on most corpora, once the corpus has been formatted into one-sentence-per-line.
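The bracket conversion described above can be sketched with a simple token mapping (illustrative code, not the sed script the original page refers to):

```python
# Penn Treebank escape sequences for bracket-like characters
BRACKETS = {
    "(": "-LRB-", ")": "-RRB-",
    "[": "-LSB-", "]": "-RSB-",
    "{": "-LCB-", "}": "-RCB-",
}

def escape_brackets(tokens):
    # replace bracket tokens, leave everything else untouched
    return [BRACKETS.get(tok, tok) for tok in tokens]

print(escape_brackets(["(", "see", "p.", "3", ")"]))
# ['-LRB-', 'see', 'p.', '3', '-RRB-']
```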

An example from Stanford:

https://nlp.stanford.edu/software/tokenizer.shtml

The Command-line usage section shows an example of how double quotes are changed according to the Penn Treebank tokenization rules.

https://www.nltk.org/_modules/nltk/tokenize/treebank.html

class TreebankWordTokenizer shows how the change is implemented:

# starting quotes
(re.compile(r"^\""), r"``")

# ending quotes
(re.compile(r'"'), " '' ")
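A minimal sketch of what those two substitutions do when applied in order (the regexes are copied from the snippet above; the driver code around them is illustrative, not NLTK's actual tokenizer loop):

```python
import re

# starting quotes: a double quote at the start of the string becomes ``
starting = (re.compile(r"^\""), r"``")
# ending quotes: any remaining double quote becomes '' (padded with spaces)
ending = (re.compile(r'"'), " '' ")

text = '"hey Bill!"'
text = starting[0].sub(starting[1], text)
text = ending[0].sub(ending[1], text)
print(text)  # the string is now: ``hey Bill! ''
```

After these substitutions the tokenizer can split on whitespace, which is why the `` and '' tokens appear in the output of word_tokenize.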