CoreNLP API for N-grams?

Question

Does CoreNLP have an API for getting unigrams, bigrams, trigrams, and so on?

For example, I have the string "I have the best car ". I would like to get:

I
I have
the
the best
car

based on the string I pass in.

Answer 1

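You can use CoreNLP to tokenize, but to grab the n-grams, do it natively in whatever language you are working in. If, say, you are piping the tokens into Python, you can use list slicing and list comprehensions to split them up: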

>>> tokens
['I', 'have', 'the', 'best', 'car']
>>> unigrams = [tokens[i:i+1] for i in range(len(tokens))]
>>> bigrams = [tokens[i:i+2] for i in range(len(tokens) - 1)]
>>> trigrams = [tokens[i:i+3] for i in range(len(tokens) - 2)]
>>> unigrams
[['I'], ['have'], ['the'], ['best'], ['car']]
>>> bigrams
[['I', 'have'], ['have', 'the'], ['the', 'best'], ['best', 'car']]
>>> trigrams
[['I', 'have', 'the'], ['have', 'the', 'best'], ['the', 'best', 'car']]
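
The three comprehensions above differ only in the window size, so they can be folded into a single helper. This is a minimal sketch: the name `ngrams` is hypothetical and is not part of CoreNLP or any other library.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows of `tokens` as tuples."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # There are len(tokens) - n + 1 windows; the range is empty when n > len(tokens).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ['I', 'have', 'the', 'best', 'car']
print(ngrams(tokens, 2))
# [('I', 'have'), ('have', 'the'), ('the', 'best'), ('best', 'car')]
```

Tuples are returned instead of lists so the n-grams can be used as dictionary keys, for example when counting frequencies with `collections.Counter`.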

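CoreNLP is great for heavy-duty NLP work such as dependency parsing, coreference resolution, and POS tagging. If all you want is tokenization, it seems like overkill, like bringing a fire truck to a water-gun fight. Something like TreeTagger would serve your tokenization needs equally well.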
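
For simple English input, even a whitespace split may be enough to go from the raw string to n-grams. The sketch below swaps in Python's built-in `str.split()` as the tokenizer. That is an assumption made for illustration: it is neither CoreNLP nor TreeTagger, and it does not separate punctuation from words. `all_ngrams` is a hypothetical helper name.

```python
def all_ngrams(text, max_n=3):
    """Map each n in 1..max_n to the space-joined n-grams of `text`."""
    # Whitespace tokenization only (an assumption, not CoreNLP output);
    # leading and trailing spaces are ignored by str.split().
    tokens = text.split()
    return {
        n: [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        for n in range(1, max_n + 1)
    }


result = all_ngrams("I have the best car ")
print(result[2])
# ['I have', 'have the', 'the best', 'best car']
```

If the input contains punctuation or contractions, a real tokenizer would give different and usually better token boundaries than this split.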

Answer 2

If you are coding in Java, check out the getNgrams* functions in the StringUtils class in CoreNLP.

You can also use CollectionUtils.getNgrams (which the StringUtils class uses as well).
