Why does the POS tagging algorithm tag "can't" as separate words?

Question

I am using the Stanford Log-linear Part-Of-Speech Tagger, and this is the example sentence I tagged:

He can't do that

After tagging, I get this result:

He_PRP ca_MD n't_RB do_VB that_DT

As you can see, can't is split into two words: ca is tagged as a modal (MD) and n't is tagged as an adverb (RB).
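To inspect the result programmatically, the word_TAG output can be split back into pairs. This is only a minimal sketch: it assumes the separator is the last underscore in each item, as in the output above, and `TaggedOutputSketch` is a hypothetical name, not part of the Stanford tagger.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the Stanford tagger API.
class TaggedOutputSketch {

    // Splits "word_TAG word_TAG ..." into {word, tag} pairs.
    // Assumes the tag follows the LAST underscore of each item.
    static List<String[]> parse(String tagged) {
        List<String[]> pairs = new ArrayList<>();
        for (String item : tagged.trim().split("\\s+")) {
            int sep = item.lastIndexOf('_');
            if (sep <= 0) {
                continue; // skip empty or malformed items
            }
            pairs.add(new String[] { item.substring(0, sep), item.substring(sep + 1) });
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String[] pair : parse("He_PRP ca_MD n't_RB do_VB that_DT")) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```

Running it on the output above lists ca with MD and n't with RB as two separate entries.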

If I use can not separately, I actually get the same result: can is MD and not is RB. So is this way of splitting expected, rather than a split like can_MD and 't_RB?

Answer 1

Note: this is not a perfect answer.
I think the problem lies in the tokenizer used by the Stanford POS Tagger, not in the tagger itself. The tokenizer (PTBTokenizer) does not handle apostrophes properly:
1- Stanford PTBTokenizer token's split delimiter.
2- Stanford coreNLP - split words ignoring apostrophe.
As they mention here (Stanford Tokenizer), PTBTokenizer will tokenize the sentence:

"Oh, no," she's saying, "our $400 blender can't handle something this hard!"

into:

......
our
$
400
blender
ca
n't
handle
something
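The ca / n't split in this list matches the Penn Treebank tokenization convention, which detaches n't as its own token. The following is a simplified sketch of that one rule, not PTBTokenizer's actual implementation: the class name `PtbContractionSketch` and both regular expressions are assumptions made for illustration, and punctuation, quotes and currency symbols are deliberately not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of Stanford CoreNLP.
class PtbContractionSketch {

    // A stem followed by the clitic "n't" (e.g. "can't" -> "ca" + "n't").
    private static final Pattern NEGATION =
            Pattern.compile("^(.+)(n't)$", Pattern.CASE_INSENSITIVE);

    // A stem followed by 's, 're, 've, 'll, 'd or 'm (e.g. "she's" -> "she" + "'s").
    private static final Pattern CLITIC =
            Pattern.compile("^(.+)('(?:s|re|ve|ll|d|m))$", Pattern.CASE_INSENSITIVE);

    // Splits on whitespace, then detaches a trailing clitic from each word.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            Matcher negation = NEGATION.matcher(word);
            Matcher clitic = CLITIC.matcher(word);
            if (negation.matches()) {
                tokens.add(negation.group(1));
                tokens.add(negation.group(2));
            } else if (clitic.matches()) {
                tokens.add(clitic.group(1));
                tokens.add(clitic.group(2));
            } else {
                tokens.add(word);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("He can't do that")); // [He, ca, n't, do, that]
    }
}
```

Because the apostrophe stays with the n, the stem that remains is ca rather than can, which is why the tagger receives ca and n't.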

Try to find a suitable tokenization method and apply it to the tagger, as shown below:

    import java.util.List;
    import edu.stanford.nlp.ling.HasWord;
    import edu.stanford.nlp.ling.Sentence;
    import edu.stanford.nlp.ling.TaggedWord;
    import edu.stanford.nlp.tagger.maxent.MaxentTagger;

    public class Test {

        public static void main(String[] args) throws Exception {
            // Path to the tagger model; adjust it to your local installation.
            String model = "F:/code/stanford-postagger-2015-04-20/models/english-left3words-distsim.tagger";
            MaxentTagger tagger = new MaxentTagger(model);

            // Pass pre-tokenized words so the tagger's own tokenizer is not used.
            List<HasWord> sent = Sentence.toWordList("He", "can", "'t", "do", "that", ".");
            //sent = Sentence.toWordList("He", "can't", "do", "that", ".");

            List<TaggedWord> taggedSent = tagger.tagSentence(sent);
            for (TaggedWord tw : taggedSent) {
                System.out.print(tw.word() + "=" + tw.tag() + " , ");
            }
        }
    }

Output:

He=PRP , can=MD , 't=VB , do=VB , that=DT , .=. ,
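If downstream code needs one entry per original word, another option is to keep the tagger's default split and glue the clitics back on afterwards. This is a rough sketch under stated assumptions: `CliticMergeSketch`, the clitic list, and the "MD+RB" style of combined tag are all made up for illustration and are not something the Stanford tagger produces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not part of the Stanford tagger API.
class CliticMergeSketch {

    // Tokens that the Penn Treebank convention detaches from the preceding word.
    private static final Set<String> CLITICS = new HashSet<>(
            Arrays.asList("n't", "'s", "'re", "'ve", "'ll", "'d", "'m"));

    // Takes parallel lists of words and tags and returns "word/TAG" strings,
    // with each clitic appended to the word before it and the tags joined by "+".
    static List<String> merge(List<String> words, List<String> tags) {
        if (words.size() != tags.size()) {
            throw new IllegalArgumentException("words and tags must have the same length");
        }
        List<String> mergedWords = new ArrayList<>();
        List<String> mergedTags = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            String word = words.get(i);
            boolean isClitic = CLITICS.contains(word.toLowerCase(Locale.ROOT));
            if (isClitic && !mergedWords.isEmpty()) {
                int last = mergedWords.size() - 1;
                mergedWords.set(last, mergedWords.get(last) + word);
                mergedTags.set(last, mergedTags.get(last) + "+" + tags.get(i));
            } else {
                mergedWords.add(word);
                mergedTags.add(tags.get(i));
            }
        }
        List<String> result = new ArrayList<>();
        for (int i = 0; i < mergedWords.size(); i++) {
            result.add(mergedWords.get(i) + "/" + mergedTags.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("He", "ca", "n't", "do", "that");
        List<String> tags = Arrays.asList("PRP", "MD", "RB", "VB", "DT");
        System.out.println(merge(words, tags)); // [He/PRP, can't/MD+RB, do/VB, that/DT]
    }
}
```

This keeps the tags from the default tokenization (ca=MD, n't=RB in the question) while still reporting can't as a single word.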

