Why does the POS tagging algorithm tag `can't` as separate words?

I am using the Stanford Log-linear Part-Of-Speech Tagger, and this is the sample sentence I tag:

He can't do that

After tagging, I get this result:

He_PRP ca_MD n't_RB do_VB that_DT

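As you can see, `can't` is split into two tokens: `ca` is tagged as a modal (MD) and `n't` is tagged as an adverb (RB). Why?

If I write `can not` as two separate words, I actually get the same result: `can` is MD and `not` is RB. So is this way of splitting expected, rather than a split such as `can_MD 't_RB`?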

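Note: this is not a perfect answer.
I think the problem originates in the tokenizer used by the Stanford POS Tagger, not in the tagger itself. The tokenizer (PTBTokenizer) does not treat a word containing an apostrophe as a single token: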
1- Stanford PTBTokenizer token's split delimiter
2- Stanford coreNLP - split words ignoring apostrophe.
As mentioned on the Stanford Tokenizer page, PTBTokenizer will tokenize the sentence:

"Oh, no," she's saying, "our $400 blender can't handle something this hard!"

into:

......
our
$
400
blender
ca
n't
handle
something
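
The `ca` / `n't` split above matches the Penn Treebank convention of separating the negative clitic `n't` from its host word, which is also why `can't` and `can not` end up with the same tags. Below is a minimal sketch of that one rule using only the Java standard library. It is not the PTBTokenizer implementation; the class name `ContractionSplitSketch` and its methods are hypothetical, and real tokenization covers many more cases (punctuation, `'s`, `'re`, currency symbols, and so on).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: this is NOT Stanford's PTBTokenizer.
class ContractionSplitSketch {

    // A word ending in "n't": group 1 is the host word, group 2 the clitic.
    private static final Pattern NEGATION =
            Pattern.compile("^(.+)(n't)$", Pattern.CASE_INSENSITIVE);

    // Splits a single word into host + "n't" if it ends in "n't".
    static List<String> split(String word) {
        List<String> tokens = new ArrayList<>();
        Matcher m = NEGATION.matcher(word);
        if (m.matches()) {
            tokens.add(m.group(1)); // e.g. "ca"
            tokens.add(m.group(2)); // e.g. "n't"
        } else {
            tokens.add(word);
        }
        return tokens;
    }

    // Splits on whitespace, then applies the contraction rule to each word.
    static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        for (String word : sentence.trim().split("\\s+")) {
            tokens.addAll(split(word));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [He, ca, n't, do, that]
        System.out.println(tokenize("He can't do that"));
    }
}
```

Under this convention the apostrophe is not the split point: the boundary falls before the `n`, so the second token is `n't` rather than `'t`.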

Try to find a suitable tokenization method and apply it to the tagger, as follows:

    import java.util.List;
    import edu.stanford.nlp.ling.HasWord;
    import edu.stanford.nlp.ling.Sentence;
    import edu.stanford.nlp.ling.TaggedWord;
    import edu.stanford.nlp.tagger.maxent.MaxentTagger;

    public class Test {

        public static void main(String[] args) throws Exception {
            // Path to the tagger model; adjust to your local installation.
            String model = "F:/code/stanford-postagger-2015-04-20/models/english-left3words-distsim.tagger";
            MaxentTagger tagger = new MaxentTagger(model);

            // Pass pre-tokenized words so the tagger does not run its own tokenizer.
            List<HasWord> sent = Sentence.toWordList("He", "can", "'t", "do", "that", ".");
            //sent = Sentence.toWordList("He", "can't", "do", "that", ".");

            List<TaggedWord> taggedSent = tagger.tagSentence(sent);
            for (TaggedWord tw : taggedSent) {
                System.out.print(tw.word() + "=" + tw.tag() + " , ");
            }
        }
    }

Output:

He=PRP , can=MD , 't=VB , do=VB , that=DT , .=. ,